Tuesday, October 17, 2017

Geographical clusters with the biggest concentration of web services

From a data set of approximately 145 million IP addresses running at least one publicly accessible web service (such as a website), I was able to determine these 20 geographic "clusters".



Note that these are not the "18 locations with the largest number of web services" but the data set clustered into 18 geo-locations - similar to centres of gravity. (This is why some of the clusters appear in the sea: each is simply a location central to the concentrations of hosts close by - such as those around Southern India and South-East Asia.) It is not surprising that many of them correspond to major technology hubs in the United States - including the Silicon Valley area of California, and Seattle. (I did a UK-only analysis and the data clustered around London, a location between Oxford and Cambridge, and a location between Glasgow and Edinburgh - I'm guessing somewhere around Linlithgow.) The clustering itself knows nothing about well-known places: the clusters correspond to them only because the large number of web hosts at these locations skews the cluster centres towards them.

Background 

It is possible for nearly any Internet-connected device to run a service accessible over the Internet, as long as it has a public IP address. (All my discussion and analysis below assumes IPv4.)

A "service" in this context could be any data or functionality exposed to "clients" using protocols running on TCP/IP - in other words, something that can be reached using an IP address and a TCP port. To actually talk to the service, you need to know its protocol - HTTP is a widespread example.
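To make "reachable using an IP address and a TCP port" concrete, here is a minimal Python sketch. The function names is_port_open and http_head_request are my own for illustration; the first only checks that something accepts a TCP connection, and the second shows what speaking the protocol (HTTP in this case) would then involve.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection performs the TCP handshake; success means
        # something is listening on that address and port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def http_head_request(host: str) -> bytes:
    """Build a minimal HTTP/1.1 HEAD request for the given host name."""
    return (
        f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
```

An open port only tells you that a service exists; sending the bytes from http_head_request over the connection is what tests whether that service actually speaks HTTP.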

Historically, IP addresses have been allocated in chunks of "networks" assigned to companies or ISPs rather than according to some mathematical scheme. Each IP address has some metadata associated with it. Because the companies and ISPs that receive these allocations are physically located in some part of the world, the IP addresses allocated to them also get a de facto geo-location "allocated" to them. (It is possible that large ISPs divide their IP address ranges into smaller sub-nets, with different geo-locations assigned to groups of IP addresses.)
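As a rough illustration of how a block allocation turns into a location for every address inside it, here is a small sketch using Python's standard ipaddress module. The ALLOCATIONS table, the holder names and the coordinates are all made up for the example, and the network blocks are the reserved documentation ranges rather than real allocations.

```python
import ipaddress

# Hypothetical allocation table: network block -> (holder, approximate lat/lon).
# The blocks are reserved documentation ranges (RFC 5737), not real allocations.
ALLOCATIONS = {
    ipaddress.ip_network("192.0.2.0/24"): ("Example ISP A", (51.5, -0.1)),
    # One larger block split into two sub-nets with different locations.
    ipaddress.ip_network("198.51.100.0/25"): ("Example ISP B, subnet 1", (37.4, -122.1)),
    ipaddress.ip_network("198.51.100.128/25"): ("Example ISP B, subnet 2", (47.6, -122.3)),
}


def locate(ip: str):
    """Return (holder, (lat, lon)) for the block containing ip, or None."""
    address = ipaddress.ip_address(ip)
    for network, metadata in ALLOCATIONS.items():
        if address in network:
            return metadata
    return None
```

Every address in a block inherits the block's location, which is why IP geo-location is only as precise as the allocation records behind it.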

To find out yours, search for "My IP address" and look up its geo-location information on ip2location.com, or simply search for "My Location" on Google.

The IP address data I used is available from scans.io.
The data model that they store for each IP address looks like this and includes the latitude and longitude of each IP address where available:
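As a hedged sketch of how the coordinates might be pulled out of such records: the code below assumes one JSON object per line and a nested "location" object with "latitude" and "longitude" keys. Those field names are my assumption for illustration and may not match the real schema exactly. Records without coordinates are skipped, since the location is only present "where available".

```python
import json


def extract_coordinates(lines):
    """Yield (latitude, longitude) for each record that has both values."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        # "location", "latitude" and "longitude" are assumed field names.
        location = record.get("location") or {}
        lat = location.get("latitude")
        lon = location.get("longitude")
        if lat is not None and lon is not None:
            yield float(lat), float(lon)
```

For example, list(extract_coordinates(open("records.json"))) would give the list of coordinate pairs for a (hypothetical) local file named records.json.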


Techniques

Machine learning is heavily based on statistical analysis and probability techniques, and clustering is one of them. With the large computing power ("cloud scale" distributed computing) and large data sets available to us today, we are able to perform some very interesting analyses. The raw data I used is in JSON format, acquired from scans.io and stored in an S3 bucket. To process the data, I used a cluster of Spark nodes. The processing code is written in Python, uses an existing machine learning library to cluster the data, and is executed with PySpark.
The crux of this experiment was that 145 million potential geo-location pairs were reduced to, or clustered into, a group of 20 - a set we can make practical use of (even if that is only in writing a blog post).
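To show the idea of reducing many coordinate pairs to a handful of "centres of gravity", here is a minimal plain-Python k-means sketch. It is an illustration of one common clustering algorithm, not the Spark code that produced the results above. It also makes two simplifying assumptions: it treats latitude and longitude as flat x/y coordinates (ignoring the curvature of the Earth and the wrap-around at 180 degrees), and it picks its starting centroids deterministically from the input.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster (lat, lon) pairs into k groups and return the k centroids."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Deterministic start: pick k points spread evenly through the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = []
        for i, group in enumerate(groups):
            if group:
                lat = sum(p[0] for p in group) / len(group)
                lon = sum(p[1] for p in group) / len(group)
                new_centroids.append((lat, lon))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids
```

Because each centroid is just the mean of the points assigned to it, it can land somewhere with no hosts at all, such as a point in the sea between two dense regions.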

Acknowledgements

Map plotted using:
https://www.darrinward.com/lat-long/?id=59e61599d7f409.66075067
(https://www.darrinward.com/) 

References

1) Introduction to clustering: https://web.stanford.edu/class/cs345a/slides/12-clustering.pdf
2) https://www.iana.org/numbers
