Wednesday, October 18, 2017

Selective persistence of Oracle Diagnostic Logging (ODL) output

Background and Goal

In any application, logging is widely used for diagnostics and debugging. 

Logging at various "checkpoints" in the application (such as on entry with the request, on exit with the response, and in the error handler) provides a fairly reliable way to trace its execution path - which a subsequent sweep or count can then report on. When the logs are regularly analysed and reported on, anomalies can be flagged up proactively and investigated further. Some examples include a report on requests without a corresponding response, and a report on fault counts and fault codes.

For such reporting, one can either write fairly complex scripts to extract the required data, or persist specific log entries to a database where they can be queried using SQL tools or even a custom-designed user interface. This blog post shows an easy approach to achieving the latter. 

Oracle Diagnostic Logging

In Oracle Fusion Middleware, the Oracle Diagnostic Logging (ODL) framework is responsible for handling all application-level logging. The main configuration for ODL is stored in a file called logging.xml, a copy of which exists for each managed server and the Admin server (this configuration can also be managed via Fusion Middleware Control).
Briefly, ODL configuration consists of a set of pre-defined "log handlers", represented by log_handler elements in logging.xml; each handler names an implementation class (and optionally a filter) and is configured through a set of nested property elements.

Each application or sub-system is then identified by a logger element - where a logger can use one or more log handlers. Loggers form an inheritable hierarchy where sub-loggers can inherit parent logger properties but can override these as well. 

The "console-handler" and "odl-handler" are two common log handlers - sending output to the server console and to a ${SERVER_NAME}-diagnostic.log file respectively. 

Oracle Service Bus (OSB) Pipeline logs

In the Oracle Service Bus, "log" and "report" are two actions that can be used in complementary ways. 
The Log action used in OSB pipelines uses the ODL framework to write content to the configured log files. The logger used is oracle.osb.logging.pipeline; in the default setup, its output goes to the odl-handler and, in turn, to the server-specific diagnostic.log file. 

You can view the standard/default configuration of the odl-handler in logging.xml.
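For reference, the out-of-the-box odl-handler definition typically looks something like this (trimmed; exact paths and size limits vary by installation and version):

<log_handler name='odl-handler' class='oracle.core.ojdl.logging.ODLHandlerFactory'>
 <!-- default text-format diagnostic log for the server -->
 <property name='path' value='${domain.home}/servers/${weblogic.Name}/logs/${weblogic.Name}-diagnostic.log'/>
 <property name='maxFileSize' value='10485760'/>
 <property name='maxLogSize' value='104857600'/>
 <property name='encoding' value='UTF-8'/>
 <property name='useThreadName' value='true'/>
</log_handler>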

Custom Log Handler for OSB pipeline logs

I created a slight variant of the odl-handler and called it SBMessageFlowtraceHandler. This is the configuration I created for it:
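A log_handler definition along these lines captures the intent - the path and size values here are illustrative; the properties that matter for what follows are format and rotationFrequency:

<log_handler name='SBMessageFlowtraceHandler' class='oracle.core.ojdl.logging.ODLHandlerFactory'>
 <!-- emit ODL-XML instead of the default text format -->
 <property name='format' value='ODL-XML'/>
 <!-- rotate every minute so that 'rolled' files become available quickly -->
 <property name='rotationFrequency' value='1'/>
 <!-- illustrative file name pattern and location -->
 <property name='path' value='${domain.home}/servers/${weblogic.Name}/logs/${weblogic.Name}-flowtrace.xml'/>
 <property name='maxFileSize' value='5242880'/>
 <property name='encoding' value='UTF-8'/>
</log_handler>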


   

The key differences from the configuration for odl-handler are:
* The output generated is an XML structure instead of the default text
* I used a rotation frequency of 1 minute - I will come to the reason later
* I used a specific file name pattern and location 

The log handler created above was then mapped to the OSB logger below (which is also a parent of the oracle.osb.logging.pipeline logger):
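Since oracle.osb.logging is a parent of oracle.osb.logging.pipeline by ODL's dotted naming convention, the mapping takes roughly this form (the level and useParentHandlers settings shown are illustrative choices):

<logger name='oracle.osb.logging' level='NOTIFICATION:1' useParentHandlers='false'>
 <!-- route OSB pipeline Log action output to the custom handler -->
 <handler name='SBMessageFlowtraceHandler'/>
</logger>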
  

The result of the above configuration is that any log actions used in the pipelines start getting routed to my new osb_managed_server-diagnostic.xml log file. One such entry can be seen below:
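An ODL-XML entry has roughly this shape - all names, timestamps and identifiers below are illustrative, including the bracketed pipeline/route names inside the txt element:

<msg time="2017-10-18T10:15:42.123+01:00" comp_id="osb_managed_server"
     type="NOTIFICATION" level="1" host_id="osbhost01" host_addr="10.0.0.15"
     module="oracle.osb.logging.pipeline" ecid="005Lx0abcde0001,0">
 <txt>[OrderPipeline, OrderPipeline_request, RouteToOrderService, REQUEST] log summary: order request received</txt>
</msg>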



We can observe that each log entry is represented by a "msg" element. This includes the original content we logged within the "txt" element but also a lot of interesting metadata:
1. Timestamp of the message in the time attribute
2. Host name / IP address where the OSB managed server was running
3. ECID - an invaluable identifier that can be used to search for other related entries in all related OSB logs across servers
4. Within the txt element, there is the 4-tuple structure enclosed in square brackets that includes the route name and whether it was the REQUEST, RESPONSE or ERROR handler pipeline. 

To map these to design time, I have included this image of the relevant section of my pipeline:

Log extraction and Persistence

Once we have made the diagnostic logs available in a well-defined format and location, the solution is just like any other OSB project that uses a file-transport-based proxy service. 
We can use a simple file transport/poller, combined with an XSL transformation and the DbAdapter, to save this data into the database!
I have created a draft OSB project with JDev/OSB 12.2.1 that does precisely this here:

The database includes a simple table with these main columns (a sketch of a possible table definition follows the list):
1) MessageDateTime - timestamp of the original message, which we can pick from the logs as shown above
2) AuditDateTime - the timestamp when the data was inserted into the database (there is likely to be a slight delay between when the logs are written and when they are polled for persistence)
3) MessagePayload - to store the actual content of the message, which would be either the full content of the 'txt' element or the part after "log summary:". In my example, this column is of type XML (SQL Server)
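A minimal SQL Server definition matching this description could look like the sketch below (the table name and the surrogate key column are additions of my own):

-- Illustrative only: adapt names, types and keys to your own schema
CREATE TABLE OsbPipelineLog (
    LogId           BIGINT IDENTITY(1,1) PRIMARY KEY,         -- surrogate key (my addition)
    MessageDateTime DATETIME2 NOT NULL,                       -- 'time' attribute of the msg element
    AuditDateTime   DATETIME2 NOT NULL DEFAULT SYSDATETIME(), -- when the row was inserted
    MessagePayload  XML NULL                                  -- 'txt' content, or the part after "log summary:"
);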

You can of course create your own custom database schema and pick and choose the content and format of data you need inside it. You can choose to save all logged messages or filter based on a certain log level or even certain keywords in the "summary" content of the Log activity. 

Workarounds etc.

1) To prevent the SBAuditLogger from reading the same log file that the server writes to, I used the following "mask" in the file proxy: *flowtrace-*.xml

This means only 'rolled' files are read (and since I set the rotation frequency to 1 minute, the rolling is frequent). The proxy can be configured to delete or archive the files it has read, just like any other file-polling proxy service. 

2) To avoid situations where the "txt" element comes in with nested CDATA sections, I had to write a small Java utility that is called from the SBAuditLogger pipeline via a Java callout. 

References

1) Adding a log action in an OSB pipeline:
https://docs.oracle.com/cd/E23943_01/admin.1111/e15867/proxy_actions.htm#OSBAG1148

2) Configuring Logging and log files / Understanding Oracle Diagnostic Logging:
https://docs.oracle.com/middleware/12211/lcm/ASADM/logs.htm#ASADM217

3) https://docs.oracle.com/middleware/1221/osb/administer/GUID-49554F3F-38F1-42C6-A265-D7689D6BFD8B.htm#OSBAG783

Tuesday, October 17, 2017

Geographical clusters with the biggest concentration of web services

From a data set of approximately 145 million IP addresses running at least one publicly accessible web service (such as a website), I was able to determine these 20 geographic "clusters".



Note that these are not "the 20 locations with the most web services", but the data set clustered into 20 geographic locations - similar to centres of gravity. (This is why some of the clusters appear in the sea: that just happens to be a point central to other dense areas nearby, such as Southern India and South-East Asia.) It is not surprising that many of these correspond to major technology hubs in the United States, including Silicon Valley in California and the Seattle area. (I did a UK-only analysis and the data clustered around London, a location between Oxford and Cambridge, and a location between Glasgow and Edinburgh - I'm guessing somewhere around Linlithgow.) The clusters do not target well-known locations by design - it is simply that the large number of web hosts in and around these places pulls the cluster centres close to them.

Background 

It is possible for nearly any Internet-connected device to run a service accessible over the Internet, as long as it has a public IP address. (All the discussion and analysis below assumes IPv4.) 

A "service" in this context could any data or functionality exposed to "clients" using protocols running on TCP/IP - in other words, something that can be reached using an IP address and a TCP port. To actually talk to the service, you need to know its protocol - http is a widespread example. 

Historically, IP addresses have been allocated in chunks of "networks" assigned to companies or ISPs, rather than by some mathematical scheme, and each IP address has some metadata associated with it. Because the companies and ISPs that receive these allocations are physically located in some part of the world, the IP addresses allocated to them also acquire a de facto geo-location. (Large ISPs may further divide their address ranges into smaller subnets, with different geo-locations assigned to different groups of addresses.) 

To find out your own, you can search for "My IP address" and then look up its geo-location on ip2location.com, or even just search "My Location" on Google. 

The IP address data I used is available from scans.io. 
The data model they store for each IP address looks like this, and includes the latitude and longitude of each address where available:
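An abridged, illustrative record of this kind is shown below - the field names follow the general shape of the censys.io schema as I understand it, and real records carry many more fields:

{
  "ip": "203.0.113.25",
  "location": {
    "country_code": "GB",
    "city": "London",
    "latitude": 51.5074,
    "longitude": -0.1278
  },
  "autonomous_system": { "asn": 64496, "name": "EXAMPLE-ISP" },
  "p80": { "http": { "get": { "status_code": 200 } } }
}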


Techniques

Machine learning leans heavily on statistical analysis and probability, and clustering is one such technique. With the computing power ("cloud-scale" distributed computing) and large data sets available to us today, we can perform some very interesting analyses. The raw data I used is in JSON format, acquired from scans.io and stored in an S3 bucket. To process it, I used a cluster of Spark nodes; the processing code is Python, using an existing machine-learning library to cluster the data, and is executed with pyspark. 
The crux of this experiment was that 145 million potential geo-location pairs were reduced to - or rather, clustered into - a group of 20: a set we can make practical use of (even if that use is writing a blog post). 
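As a minimal sketch of this approach - assuming Spark ML's KMeans as the clustering library, with illustrative bucket and file names - the job boils down to something like:

# Minimal sketch: read the scan dump from S3, keep records that carry a
# geo-location, and cluster the (latitude, longitude) pairs with k-means.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ip-geo-clusters").getOrCreate()

records = spark.read.json("s3a://my-scan-data-bucket/ipv4-20171001.json")

# selecting the nested fields yields columns named 'latitude' and 'longitude'
coords = records.select("location.latitude", "location.longitude").dropna()

features = VectorAssembler(inputCols=["latitude", "longitude"],
                           outputCol="features").transform(coords)

model = KMeans(k=20, seed=42, featuresCol="features").fit(features)

for centre in model.clusterCenters():
    print(centre)   # each centre is a [latitude, longitude] "centre of gravity"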

Acknowledgements

Map Plotted using:
https://www.darrinward.com/lat-long/?id=59e61599d7f409.66075067
(https://www.darrinward.com/) 

References

1) Introduction to clustering: https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf
2) https://www.iana.org/numbers

Saturday, October 07, 2017

Raw results - countries list with total IP (IPv4) addresses


Background: 
http://weblog.singhpora.com/2017/10/how-many-programmers-does-it-take-to.html


Presented below is a list of countries (country codes) and the total count of live IPv4 addresses where a public-facing service (such as a website) might be hosted, as counted from the scan data of 1st October 2017.

The reason these don't add up to anywhere in the ballpark of 4 billion (the total IPv4 address space) is that the data set I used appears to cover only hosts that run some public service exposed over a TCP port (e.g. a website running on port 80 or 443).

The numbers definitely look too low, totalling only 145,430,195 - I will continue to investigate why, but they do seem to be in proportion. 
It is likely that scans.io is only able to gather data about IP addresses that were live at the time of the scan, as opposed to all allocated ones.


+-------+--------+
|country|ip_count|
+-------+--------+
|     LT|  120718|
|     DZ|  362827|
|     MM|    3494|
|     CI|   18954|
|     TC|     675|
|     AZ|   39468|
|     FI|  220723|
|     SC|   83878|
|     PM|     323|
|     UA|  768681|
|     RO|  730479|
|     ZM|    9618|
|     KI|     274|
|     SL|     474|
|     NL| 3077280|
|     LA|    5319|
|     SB|     746|
|     BW|    6165|
|     MN|    9664|
|     BS|    7838|
|     PS|   36320|
|     PL| 1539024|
|     AM|   57860|
|     RE|    6976|
|     MK|   41856|
|     MX| 9361233|
|     PF|    7506|
|     TV|      41|
|     GL|   10279|
|     EE|   74403|
|     VG|   13871|
|     SM|    2514|
|     CN|11905007|
|     AT|  403069|
|     RU| 4002435|
|     IQ|   76489|
|     NA|   13269|
|     SJ|     125|
|     CG|   13541|
|     AD|   12536|
|     LI|    6136|
|     HR|   84459|
|     SV|  134530|
|   null|  827348|
|     NP|   22618|
|     CZ|  434625|
|     VA|     409|
|     PT|  278365|
|     SO|    1158|
|     PG|    3291|
|     GG|    2601|
|     CX|     125|
|     KY|    5329|
|     GH|   11492|
|     HK| 1127634|
|     CV|    1745|
|     BN|    6363|
|     LR|     769|
|     TW| 2785149|
|     BD|   88409|
|     LB|   43745|
|     PY|   33953|
|     CL|  340123|
|     TO|     756|
|     ID|  495095|
|     LY|   18077|
|     FK|    1158|
|     AU| 1875091|
|     SA| 1098611|
|     PK|  279205|
|     CA| 3073028|
|     MW|    5162|
|     BM|    6359|
|     BL|     104|
|     UZ|   12856|
|     NE|    1597|
|     GB| 5182929|
|     MT|   20472|
|     YE|    6356|
|     BR| 3554113|
|     KZ|  400583|
|     BY|   59159|
|     NC|   18117|
|     HN|   25888|
|     GT|  115383|
|     MD|  107923|
|     DE| 6338938|
|     AW|    2612|
|     GN|    1140|
|     IO|      65|
|     ES| 1810492|
|     IR|  609566|
|     NR|     178|
|     MO|   26437|
|     BH|   24639|
|     EC|  210964|
|     VI|    1233|
|     IL|  337670|
|     TR|  751779|
|     ME|   26218|
|     VE|  660044|
|     MR|    3197|
|     ZA|  453373|
|     CR|  122065|
|     AI|     469|
|     SX|     869|
|     GU|   21634|
|     KR| 4705816|
|     TZ|   14240|
|     US|45381144|
|     RS|  128773|
|     MS|     262|
|     AL|   45857|
|     MY|  462057|
|     PN|     125|
|     IN| 2169583|
|     JM|   16720|
|     CK|     650|
|     LC|    1418|
|     GM|    1627|
|     AE| 1001729|
|     MQ|    5890|
|     CM|    9684|
|     RW|    3714|
|     TG|    1992|
|     FR| 2709666|
|     GF|    1521|
|     CH|  544074|
|     MG|    5532|
|     CC|     124|
|     TN|  293295|
|     GQ|     759|
|     NU|     136|
|     TL|     745|
|     WF|     479|
|     GR|  243484|
|     PA|  200845|
|     TD|     519|
|     GI|    5229|
|     SD|   15635|
|     AG|    4250|
|     MC|   10245|
|     DJ|     723|
|     JO|   40809|
|     BA|   59273|
|     ET|    1776|
|     SG|  734373|
|     KP|     319|
|     BF|    2820|
|     IT| 3523490|
|     CU|   13847|
|     GW|     254|
|     FO|    1282|
|     MV|    9439|
|     SE|  663630|
|     PH|  392585|
|     WS|    1259|
|     BG|  538707|
|     FJ|    3198|
|     GE|   61683|
|     SK|  128175|
|     FM|     906|
|     MH|    1745|
|     CW|   21457|
|     LV|  102735|
|     MU|   27736|
|     PE|  275323|
|     LS|    5507|
|     MZ|   12728|
|     GD|    3400|
|     DM|     646|
|     KM|     389|
|     DO|  554824|
|     QA|   34995|
|     XK|     581|
|     BZ|   12967|
|     TH| 1366956|
|     EG|  327882|
|     SH|     125|
|     BI|     771|
|     BJ|    1948|
|     MF|     429|
|     GY|    3847|
|     JP| 3299718|
|     TM|     572|
|     VC|    5377|
|     ZW|   11952|
|     SN|   12707|
|     NZ|  401608|
|     OM|   49103|
|     LK|   33816|
|     BT|    2126|
|     HU|  407222|
|     KN|    2990|
|     KE|   32116|
|     SI|  130608|
|     CY|   32025|
|     ML|    9998|
|     HT|    7375|
|     GP|    4018|
|     UG|    7357|
|     IE|  636087|
|     KW|   64836|
|     GA|    8910|
|     VU|    1473|
|     BE|  347894|
|     MA|  227130|
|     AS|     320|
|     KH|   33846|
|     NI|   53612|
|     KG|   14067|
|       |  649814|
|     TT|   32719|
|     SY|   75436|
|     NO|  368080|
|     BO|   93018|
|     ER|     257|
|     CO| 1135399|
|     IM|    7208|
|     SS|     570|
|     UY|   75799|
|     NG|   37838|
|     JE|    4069|
|     YT|     232|
|     AR| 1273489|
|     CF|     249|
|     PW|     251|
|     PR|   27204|
|     TK|     135|
|     LU|   56661|
|     SZ|    5313|
|     NF|     125|
|     VN|  880606|
|     IS|   50124|
|     MP|     529|
|     AF|   14127|
|     BB|    5340|
|     BQ|    4461|
|     SR|   23450|
|     DK|  772845|
|     CD|     458|
|     TJ|    5421|
|     AO|   17188|
|     AX|    1292|
|     ST|     335|
+-------+--------+


Total:
145,430,195

Friday, October 06, 2017

How many programmers does it take to update a Wikipedia page?

...or what it took to count the number of IPv4 addresses in every country (as of 1st October 2017). 

This Sunday, I found that the Wikipedia page on List of countries by IPv4 address allocation was using data from 2012, and I wondered what it might take to add more up-to-date information to that page. During a recent course I attended, I got to know about scans.io - a fascinating project that periodically scans ALL of the IPv4 address space and stores as much publicly visible metadata about the active addresses as possible (location, ISP, open ports, services running, operating system, and vulnerable services if any). Each daily dump of the IPv4 address space is close to a terabyte.
An individual IP address record is represented as a JSON object - part of one of the records is shown here:


There is a lot of information to be gleaned from analysing this data - some of it might have very useful applications, and some is purely to satisfy curiosity. Also, copying the raw dataset is not the only way to analyse it - censys.io might allow querying their data directly on request.
Given the volumes, this clearly falls in the realm of a Big Data problem, and any querying or analytics on it is best done using a distributed approach - so this is a perfect problem for fully cloud-based resources.

Stage 1:

Copy the latest data set to an S3 bucket.

This might sound easy, but the full data set is close to 1 TB. Ideally I would have preferred a more distributed way of transferring it, but for now an old-fashioned wget from censys.io followed by an "aws s3 cp" to S3 storage did the job. 
The wget of the compressed data set took around 24 hours, and the "aws s3 cp" of the uncompressed data took just under 48 hours (with a few hours in the middle to uncompress the downloaded lz4 file). 

For intermediate storage, I created an instance with 2TB of storage. The cost didn't seem bad if all my data transfer completed within a day or so.
https://aws.amazon.com/ebs/pricing/

Test run:
wget --user=jvsingh --ask-password https://censys.io/data/ipv4/historical

The actual command to get that ~221G file (compressed version):
nohup wget --user=jvsingh --password=***** https://scans.io/zsearch/r5vhnlm9vqxh5z1e-20170930.json.lz4 &

(I used nohup as I knew it was going to take hours and didn't want to keep my SSH terminal open just for this)

For the second stage of uploading the uncompressed file to my S3 bucket, a more elegant and faster way might have been a distributed multipart upload, but looking at the upfront setup required, I decided against it for this particular test. 

Stage 2:

AWS setup - I already had an AWS account with an SSH key pair for the region I selected (the cheapest in terms of both instance and S3 storage costs; to avoid cross-region data transfer costs and possible network latency, I used the same region for both my S3 bucket and Spark instances). 
Additionally, to allow command-line tools (such as flintrock) to connect to and operate the AWS account, I had to install and set up the local AWS command line interface, which requires a pair of credentials generated through AWS IAM. 
I had also previously created an S3 bucket to hold the 1 TB data file. This allows multiple Spark instances to access the data, which would otherwise not be possible - or would be too complex to set up - with general-purpose disk-like storage (it might be possible with the Hadoop distributed file system, but using S3 here definitely saved me a lot of extra configuration).

Stage 3:

Download and install flintrock and edit its YAML configuration (here's their template) to set up the Spark cluster. This is convenient because I intended to run on AWS, which is very easy to set up with flintrock. (I used an Amazon Linux AMI - the rest of the setup is self-explanatory in the template.)
I started with an initial cluster of 3 worker nodes. 

One can configure a spark cluster without flintrock as well - I found a set of steps here. Flintrock made things a lot easier. 

Stage 4:


  • Login to the spark master instance
  • Submit the spark job using spark-submit 

nohup ~/spark/bin/spark-submit --master spark://0.0.0.0:7077 --executor-memory 6G --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 --conf "spark.driver.maxResultSize=2g" sparkjob.py > main_submittedjob.out &
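The core of a counting job like sparkjob.py boils down to something like the sketch below - the bucket, file and field names are illustrative, and the output is the country/ip_count table published in the later post:

# Rough sketch of the counting job: group the scan records by country code
# and count the live IP addresses in each group.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ipv4-country-counts").getOrCreate()

records = spark.read.json("s3a://my-scan-data-bucket/ipv4-20171001.json")

counts = (records
          .groupBy(col("location.country_code").alias("country"))
          .count()
          .withColumnRenamed("count", "ip_count"))

counts.show(300, truncate=False)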

I first executed a dry run on a smaller 1 GB dataset to make sure everything was ready and working. A snippet of results from the dry run is shown here (I used country_code instead of country name to be safe - these can always be translated and sorted later; at this point I was eager to get the main counts):


  • Gradually increase the number of worker instances to see the data analysis speed up as the work gets distributed evenly across the newly joined instances.
"flintrock add-slaves" does this seamlessly for the most part (it installed Spark and the other libraries). 
I did have to manually log in to each new instance and run 
spark/sbin/start-slave.sh spark://master_host:7077  
to ensure they got added to the cluster.

After this, I could sit back and watch with satisfaction as the jobs (rather, the individual tasks) were evenly redistributed across the new nodes. 
  • Watch progress on the spark master console and wait for the final results to appear!


Shown below is the job stages console, 30 minutes in:

Coming up: The actual results

(if nothing breaks down till then!)
I posted my initial results here - sorry to report, the counts don't quite add up. Will investigate why in due course.
http://weblog.singhpora.com/2017/10/raw-results-countries-list-with-total.html

Credits: 

1) Paul Fremantle (WSO2 co-founder) - for the tools and techniques he taught on his Cloud and Big Data course at Oxford
2) scans.io for the idea of scanning the whole of IPv4 address space, the initiative and execution