Friday, October 06, 2017

How many programmers does it take to update a Wikipedia page?

......or what it took to count the number of IPv4 addresses in every country (as of 1st October 2017). 

This Sunday, I found that the Wikipedia page on List of countries by IPv4 address allocation was using data from 2012. I wondered what it might take to add more up to date information on that page. During a recent course I attended, I got to know about - a fascinating project that involves periodically scanning ALL of the IPv4 address space and storing as much of publicly visible metadata about the active addresses as possible (location, ISP, open ports, services running, operating system, vulnerable services running if any). Each daily dump of the IPv4 address space is close to a terabyte.
An individual IP address record is represented as a JSon object - part of one of the records is shown here:

There is a lot of information to be gleaned from analysing this data - some might have very useful applications and some purely to satisfy curiousity. Also, copying the raw dataset is not the only way to analyse this - might allow querying their data directly on request.
 Given the volumes, this clearly falls in the realm of a Big Data problem and any querying or analytics on this is best achieved using a distributed approach - so this is a perfect problem to leverage fully cloud based resources.

Stage 1:

Copy the latest data set to an S3 bucket.

This might sound easy but the full data is close to 1TB. Ideally I would have preferred a more distributed way of transferring this data. But for now, old fashioned wget from the and then an "aws s3 cp" to S3 storage did the job. 
wget of the compressed data set took around 24 hours and "aws s3 cp" of the uncompressed data took just under 48 hours (with a few hours in the middle that it took to uncompress the downloaded lz4 file). 

For intermediate storage, I created an instance with 2TB of storage. The cost didn't seem bad if all my data transfer completed within a day or so.

Test run:
wget --user=jvsingh --ask-password

The actual command to get that ~221G file (compressed version):
nohup wget --user=jvsingh --password=***** &

(used nohup as I know it was going to take hours so didn't want to keep my ssh terminal open just for this)

For the second stage of uploading the uncompressed file to my S3 bucket, it seems an elegant and faster way might have been to use a multipart upload using a distributed approach. But looking at the upfront setup required for it, I decided against it for this particular test. 

Stage 2:

AWS Setup - I already had an aws account with an SSL key-pair for the region I selected (the cheapest in terms of instance costs and also costs of S3 storage - to avoid intra region data transfer costs and possible network latency, I used the same region for both my S3 bucket and spark instances). 
Additionally, to allow command line tools (such as flintrock) to connect and operate the AWS account, I had to install and set up the local aws command line interface - which requires a pair of credentials generated through AWS - IAM 
I had also previously created an S3 bucket to hold the 1 TB data file. This would allow multiple spark instances to access the data, which otherwise won't be possible or too complex to set up with general purpose disk-like storage (might be possible with hadoop distributed file store but using S3 here definitely saved me from a lot of extra configuration)

Stage 3:

Download and install flintrock, configure its yaml configuration (here's their template) to set up the spark cluster. This is convenient as I intended to do this on AWS, which is very easy to set up with flintrock. (I used an Amazon Linux AMI - the rest of the setup is self-explanatory in the template)
I start an initial cluster with 3 worker nodes. 

One can configure a spark cluster without flintrock as well - I found a set of steps here. Flintrock made things a lot easier. 

Step 4:

  • Login to the spark master instance
  • Submit the spark job using spark-submit 

nohup ~/spark/bin/spark-submit --master spark:// --executor-memory 6G --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 --conf "spark.driver.maxResultSize=2g" > main_submittedjob.out &

I first executed a dry run on a smaller 1GB dataset to make sure everything was ready and working. A snippet of results from the dry run is shown here (I used country_code instead of country name to be safe - these can always be translated and sorted later - at this point I am eager to get the main counts):

  • Gradually increase the number of worker instances to see the data analysis speeding up as the work gets distributed evenly on the newly joined instances.
"flintrock add-slaves" does this seamlessly for most part (it installed spark and other libraries) 
I did have to manually log in to each new instance and use the command 
spark/sbin/ spark://master_host:7077  
to ensure they got added to the cluster

After this, I could sit back and watch with satisfaction the jobs (rather individual tasks) getting evenly redistributed on the new nodes. 
  • Watch progress on the spark master console and wait for the final results to appear!

Shown below, the job stages console, 30 minutes in -

Coming up: The actual results

(if nothing breaks down till then!)
I posted my initial results here - sorry to report, the counts don't quite add up. Will investigate why in due course.


1) Paul Fremantle (WSO2 co-founder)- for the tools and techniques he taught on his Cloud and Big Data course at Oxford
2) for the idea of scanning the whole of IPv4 address space, the initiative and execution

No comments: