Making a scalable IP 2 Geolocation service
As many of us know, being able to geolocate an IP address can be a tremendous help for any business that offers localized services on the net. Functionally, such a service is straightforward: the caller has an IP and needs a latitude/longitude pair (and maybe some extra data, like the closest city, population density, weather code, etc.).
Such services are readily available on the web today, free of charge, but with a limited number of calls per time unit. So, if your application needs this at a larger scale (e.g. 100k calls/day), you might need to build it on your own. In this post I will discuss some ways to build such a system.
Basic functional characteristics:
- it involves mostly reads (99.9999% reads & 0.0001% writes)
  - usually the IP 2 Geo database gets updated once per month
  - having stale data is no problem (for short periods of time)
- it has to scale really big & perform really well
  - every potential web visitor will hit this service at least once => this service is one of your first possible bottlenecks
  - this hit will probably happen at the first page load => it has to be fast (so that it doesn’t affect the page load time)
- it should store more than just a latitude/longitude pair
  - it would be nice to have more information stored for every IP: city, region, country, weather code, connection speed, etc.
  - this should be stored in a document structure that is easy to read and maintain (e.g. JSON)
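As a rough illustration of that last point, a per-IP entry could be a small JSON document along these lines. This is only a sketch: the field names and every value below are made up for the example and do not come from any real IP 2 Geo database.

```python
import json

# Hypothetical record for one IP. Field names and values are
# illustrative placeholders, not real geolocation data.
record = {
    "ip": "110.111.112.113",
    "latitude": 0.0,
    "longitude": 0.0,
    "city": "Example City",
    "region": "Example Region",
    "country": "XX",
    "weather_code": "EXAMPLE",
    "connection_speed": "broadband",
}


def encode_record(rec: dict) -> str:
    """Serialize a record to compact JSON (no extra whitespace)."""
    return json.dumps(rec, separators=(",", ":"), sort_keys=True)


def decode_record(raw: str) -> dict:
    """Parse a JSON document back into a dict."""
    return json.loads(raw)


encoded = encode_record(record)
print(encoded)
print(len(encoded.encode("utf-8")), "bytes")
```

Even with a handful of extra fields, an entry like this stays small and remains readable with any text editor.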
A. Storing the information on the filesystem
This would clearly be the simplest and most straightforward solution. The IP address can be converted into a directory structure just by replacing the dots with slashes: the IP 110.111.112.113 would be transformed into the filesystem path /110/111/112/113.txt. Storing things this way would allow us to use the basic filesystem tools to add/change/remove entries, and we could also use classic tools (FTP, rsync, etc.) to replicate changes across multiple servers. As such, almost no “fancy” technology or administrative skills are needed to build and maintain such a system. There would also be little “application level overhead”, because we would work directly with OS-level system calls (i.e. the POSIX filesystem calls).
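To make the mapping concrete, here is a minimal sketch in Python. The function name `ip_to_path` and the `root` parameter are my own assumptions for illustration; the only thing taken from the text above is the dots-to-slashes rule.

```python
import ipaddress
from pathlib import PurePosixPath


def ip_to_path(ip: str, root: str = "/") -> PurePosixPath:
    """Map an IPv4 address to a nested file path, one directory per octet.

    `root` is a hypothetical base directory; the post itself uses "/".
    """
    # Validate first so that malformed input never turns into a path.
    addr = ipaddress.IPv4Address(ip)
    octets = str(addr).split(".")
    # The first three octets become directories, the last one the file name.
    return PurePosixPath(root, *octets[:3], octets[3] + ".txt")


print(ip_to_path("110.111.112.113"))  # /110/111/112/113.txt
```

Validating the address before building the path also keeps things like `../` out of the lookup, which matters if the IP ever comes from user input.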
The problem:
There are 2^32 IP addresses out there (assuming you want to resolve good old IPv4 addresses), which means you would need to have about 4.3 billion files on your filesystem. And this just doesn’t work! Most filesystems, even advanced ones (e.g. ext3, xfs), perform terribly with anything above 100 million files, and we need more than 40x that number. Also, if you encounter filesystem corruption (e.g. the server loses power), the filesystem recovery process will take hours and hours to finish.
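The numbers are easy to check. The snippet below is just back-of-the-envelope arithmetic; the 100 million figure is the rough comfort limit quoted above, not a hard limit of any particular filesystem.

```python
import math

total_ipv4 = 2 ** 32              # one file per IPv4 address
comfortable_limit = 100_000_000   # rough figure quoted in the text

ratio = total_ipv4 / comfortable_limit

print(total_ipv4)                    # 4294967296
print(round(ratio, 1))               # 42.9
print(round(math.log10(ratio), 2))   # 1.63 (orders of magnitude)
```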
B. Let’s make our solution someone else’s problem
Storing the IP entries on a CDN
This would be quite an elegant solution, especially for high-volume traffic. Every IP entry would be stored as a 500-byte file, and the name of the file would be the IP address itself. As such, an IP address like 110.111.112.113 would be translated into a CDN file request like: http://mycdn.mydomain.com/ip/110-111-112-113. At the origin we would have a container that stores those entries with a TTL of 1 month. Every month we could run a script that refreshes some of those entries at the origin level, and that’s it. The price would also be extremely low on a CDN with a bandwidth-based pricing model: if every geo request takes around 1 KB of bandwidth (including the headers and everything), then 1 million requests would consume about 1 GB of traffic.
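A small sketch of both the URL mapping and the traffic estimate follows. The base URL is the example one from the text; the function names are hypothetical, and 1 KB is taken as 1000 bytes to keep the arithmetic round.

```python
import ipaddress

CDN_BASE = "http://mycdn.mydomain.com/ip/"  # example host from the text


def ip_to_cdn_url(ip: str, base: str = CDN_BASE) -> str:
    """Build the CDN object URL for an IPv4 address (dots become dashes)."""
    addr = ipaddress.IPv4Address(ip)  # rejects malformed input
    return base + str(addr).replace(".", "-")


def traffic_gb(requests: int, bytes_per_request: int = 1000) -> float:
    """Estimate traffic in GB, assuming a fixed cost per request."""
    return requests * bytes_per_request / 1e9


print(ip_to_cdn_url("110.111.112.113"))
print(traffic_gb(1_000_000))  # 1.0
```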
The problem:
No CDN will actually allow you to store more than 10 million files (for example, Rackspace Cloud Files limits you to 5 million files). And we are almost 3 orders of magnitude above that.
C. Use a noSQL database
I really liked the first solution. Basically, I tried to use the filesystem as a huge HashMap, where the key was the path. All I needed was basic matching between a key and a value; I don’t need a more advanced query system. As such, the perfect replacement for the first solution would be a very simple document-based noSQL solution. Why noSQL? Because the data is not relational in any way (i.e. no joins), and as such it scales & performs pretty well.
In this case my tool of choice would be Redis. The reason? It sits right on the borderline between a cache system and a persistence engine. It doesn’t compromise at all on speed (as some more advanced noSQL solutions, e.g. MongoDB or Solr, have to) because it offers just a minimal set of features.
In the end, I actually ended up with the initial solution, just implemented at a different level.
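Here is a minimal sketch of what the key/value lookup could look like. To keep it runnable without a server, it uses a tiny in-memory stand-in that only mimics the `set` and `get` calls of a Redis client. The `geo:` key prefix, the function names and the sample record are all assumptions of mine, not something a real setup has to follow.

```python
import ipaddress
import json


class InMemoryClient:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        # Like Redis GET, return None when the key is missing.
        return self._data.get(key)


def make_key(ip: str) -> str:
    """Use the 32-bit integer form of the IP as a compact key."""
    return f"geo:{int(ipaddress.IPv4Address(ip))}"


def store_entry(client, ip: str, record: dict) -> None:
    """Store one JSON document per IP."""
    client.set(make_key(ip), json.dumps(record))


def lookup(client, ip: str):
    """Return the record for an IP, or None if there is no entry."""
    raw = client.get(make_key(ip))
    return None if raw is None else json.loads(raw)


client = InMemoryClient()
store_entry(client, "110.111.112.113", {"city": "Example City", "country": "XX"})
print(lookup(client, "110.111.112.113"))
print(lookup(client, "110.111.112.114"))  # None
```

With the redis-py package, a `redis.Redis` instance exposes `set` and `get` in the same shape, so it should be possible to pass one in place of `InMemoryClient`.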
