Advanced exact matching with Elastic Search
Exact matching is a basic feature of many search apps. One simple application is matching a location name to a document, for example retrieving the document associated with the name “San Francisco”.
To be able to do this we need two things:
- configure the index schema so that the “name” field is never tokenized: “San Francisco” and “San Mateo” will be indexed as two terms (“San Francisco”, “San Mateo”) and not three (“San”, “Mateo”, “Francisco”)
- make sure that the query does exact matching: a search for “San” should not match “San Mateo” or “San Francisco”, only “San”
1). Tokenization
Tokenization is one of the most important things to control in ElasticSearch/Lucene.
Whenever we need to match proper names (countries, cities, companies, etc.) we need exact
matching: the query “San Francisco” should match only “San Francisco”, not “San Mateo”.
To get this, we have to make sure that specific fields are not tokenized at indexing time and that the whole text
gets treated as one big term (or token). The effect is that the text “San Mateo” is indexed as the
single term “San Mateo” instead of two terms, “San” and “Mateo”.
One of the easiest ways to do this is to use a pattern tokenizer whose pattern never matches anything.
The pattern given to the pattern tokenizer describes the token separators, so we configure it with a pattern that
matches nothing (the pattern $^ never matches inside ordinary non-empty text). Since the tokenizer never finds a separator, it treats the whole text as one token.
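To see why a never-matching separator pattern leaves the text whole, here is a minimal Python sketch. It uses Python’s re module as a stand-in for the Java regular expressions behind the pattern tokenizer, which is an assumption made for illustration, and the function name tokenize is hypothetical.

```python
import re

def tokenize(text, separator_pattern):
    """Split text on the separator pattern, dropping empty tokens."""
    return [tok for tok in re.split(separator_pattern, text) if tok]

# A whitespace separator breaks the name into two terms.
print(tokenize("San Mateo", r"\s+"))   # ['San', 'Mateo']

# "$^" asks for end-of-text followed by start-of-text, which cannot
# happen inside non-empty text, so no separator is ever found.
print(tokenize("San Mateo", r"$^"))    # ['San Mateo']
```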
The index settings :
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analizer": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer",
          "filter": []
        }
      },
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "$^"
        }
      },
      "filter": {
      }
    }
  }
}
The document schema :
{
  "location": {
    "properties": {
      "name": {
        "type": "string", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "my_analizer"
      }
    }
  }
}
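Since the analyzer name has to be spelled identically in the settings and in the mapping, it can help to generate both from one place. The sketch below builds them as Python dictionaries and prints the JSON. Combining them under "settings" and "mappings" keys in a single create-index body is an assumption about how you would send them, and build_index_body is a hypothetical helper name.

```python
import json

ANALYZER = "my_analizer"  # spelled exactly as in the settings above

def build_index_body():
    """Return settings and mapping combined into one create-index body."""
    settings = {
        "analysis": {
            "analyzer": {
                ANALYZER: {
                    "type": "custom",
                    "tokenizer": "my_pattern_tokenizer",
                    "filter": [],
                }
            },
            "tokenizer": {
                "my_pattern_tokenizer": {"type": "pattern", "pattern": "$^"}
            },
        }
    }
    mappings = {
        "location": {
            "properties": {
                "name": {
                    "type": "string",
                    "store": "yes",
                    "analyzer": ANALYZER,
                    "search_analyzer": ANALYZER,
                }
            }
        }
    }
    return {"settings": settings, "mappings": mappings}

print(json.dumps(build_index_body(), indent=2))
```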
2). Exact matching
There are two ways to do exact matching :
- using an ES term query
- using an ES string query
If we are concerned only with exact matching, then the only difference between them is that the term query does not run the query text through an analyzer, so custom analyzer filters are not applied; the string query does apply them. This might prove useful if we need ES to apply some text transformations to the query before execution. Three filters come to mind:
- lowercase: transforms the query into lowercase
- ascii folding: transforms non-ASCII characters into their ASCII equivalents
- trim: removes leading/trailing whitespace
I believe it’s wiser to have those transformations done on the server side (ElasticSearch side) as this allows us to get this basically for free. It doesn’t matter from where we make those calls to ElasticSearch, our queries are always processed the same way. If we use the TermQuery we would need to apply all those transformations on the client side. If we use multiple programming languages to send queries then this could prove rather difficult to maintain.
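To give an idea of what the client-side alternative would involve, here is a rough Python sketch of the three transformations. It is only an approximation: stripping combining marks after Unicode decomposition handles accented letters but not every character that the ES ascii folding filter covers, and normalize_query is a hypothetical name.

```python
import unicodedata

def normalize_query(text):
    """Trim, lowercase and strip accents from a query string."""
    trimmed = text.strip()
    lowered = trimmed.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_query("  São Paulo "))  # sao paulo
```

Every client language that sends term queries would need its own equivalent of this function, which is the maintenance cost described above.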
The main challenge in using a string query for matching is that the query text is tokenized twice. The reason is that the Elastic Search query string is internally translated into a Lucene query string, which accepts a minimal DSL. As such your query can contain logical operators like AND, OR, NOT, etc., which is very useful for a basic site search box.
The effect of this is that the custom tokenizers configured in the index document mapping run only after the DSL tokenizer has finished. To give an example:
- we have configured our custom analyzer so that there is no tokenization
- we search with a query like : “San Mateo”
- first ES/Lucene executes the DSL tokenization and our expression is tokenized as two expressions : “San” & “Mateo”
- secondly it will apply our tokenizer on each separate expression : one for “San” and one for “Mateo”
- finally it will generate two TermQueries, one for “San” and one for “Mateo”, and combine them in a boolean query with an OR: “San OR Mateo”
- we get all documents that match either “San” or “Mateo” (probably none as there is no place in the world that has either the name “San” or “Mateo”)
So the DSL tokenization can give us the impression that the custom analyzers we have set have no effect. But they do. It’s just that we expect them to be the only ones that run, when in reality they are the second link in a chain that starts with the DSL tokenization.
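The chain can be mimicked with a toy model in Python. This is not Lucene’s query parser: the first stage here simply splits on whitespace, and all function names are hypothetical, but it shows how an analyzer that keeps its input whole still ends up producing two terms.

```python
def dsl_split(query):
    """Stage 1 (toy): the query-string syntax splits on whitespace."""
    return query.split()

def keyword_analyzer(expression):
    """Stage 2: our no-tokenization analyzer keeps each expression whole."""
    return [expression]

def parse_query_string(query):
    """Run both stages in order and collect the resulting terms."""
    terms = []
    for expression in dsl_split(query):
        terms.extend(keyword_analyzer(expression))
    return terms

print(keyword_analyzer("San Mateo"))     # ['San Mateo']      (analyzer alone)
print(parse_query_string("San Mateo"))   # ['San', 'Mateo']   (full chain)
```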
The solution to this is rather simple. We have to put extra quotes inside our query to tell the DSL tokenizer that it should not split “San Mateo” into two tokens but leave it as one expression. So, change the query from "San Mateo" to "\"San Mateo\"" and things will work out as expected.
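A small helper can add the inner quotes so that callers do not have to remember them. The sketch below builds a query_string search body as a Python dictionary; the helper names are hypothetical, and escaping embedded quotes and backslashes is an extra precaution that is not discussed above.

```python
import json

def quote_phrase(text):
    """Wrap text in double quotes, escaping backslashes and quotes inside it."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def exact_match_body(field, text):
    """Build a query_string search body that treats text as one expression."""
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": quote_phrase(text),
            }
        }
    }

print(json.dumps(exact_match_body("name", "San Mateo")))
```

In the printed JSON the query value appears as "\"San Mateo\"", which is the quoted form described above.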
The use of aliases in ElasticSearch
Assuming that you want to be able to change an ES index in more advanced ways (e.g. adding custom analyzers),
you will have to be able to recreate the index without disturbing your production system.
An efficient way to do this with ES alone is to make use of index aliases. An alias is basically just a smart forwarder
that maps a name to one or more indexes. See more on aliases here: http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html
Creating the index & the alias
You want to create an index with the name “bars” and then use/change it without any production “glitch”.
What you should do:
- create an index with the name “bars_v1”
- create an alias with the name “bars” that points to “bars_v1”
- put all your data & percolated queries in “bars_v1”
- make sure that everyone that uses the index calls the alias “bars” instead of the actual index “bars_v1”
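As a rough illustration of the second step, this Python sketch builds the JSON body for the aliases endpoint (sent with a POST to /_aliases). Only the body is constructed here; the HTTP call is left out, and the helper name is hypothetical.

```python
import json

def add_alias_actions(alias, index):
    """Body for the aliases endpoint that points alias at index."""
    return {"actions": [{"add": {"index": index, "alias": alias}}]}

body = add_alias_actions("bars", "bars_v1")
print(json.dumps(body))
```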
Note: adding documents and adding percolated queries behave differently when using aliases
- when you put just documents in the “bars_v1” index you can use “bars_v1” (the actual index name) or “bars” (the alias); both work the same (your operation will have the same effect)
- when you want to add percolated queries to “bars_v1”, make sure you use the actual index name (“bars_v1”) and not the alias “bars”; if you use the alias instead of the actual index name, the percolated queries will not work
Major changes to the index without any downtime
You want to make major changes to your index (“bars_v1”) but you cannot afford to take the index offline,
so you should work on another index and then just redirect the alias when you are done.
What you should do:
- create an index with the name “bars_v2”
- put all your data & percolated queries in “bars_v2”
- when you are done, change the alias “bars” to point to the “bars_v2” index (this can be done atomically)
- delete the index “bars_v1”
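The atomic switch works by putting the removal of the old mapping and the addition of the new one into the same aliases request. A minimal Python sketch of that request body follows; the helper name is hypothetical and the HTTP call itself is omitted.

```python
import json

def swap_alias_actions(alias, old_index, new_index):
    """Remove alias from old_index and add it to new_index in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

print(json.dumps(swap_alias_actions("bars", "bars_v1", "bars_v2"), indent=2))
```

Because both actions travel in one body, readers of “bars” never see a moment where the alias points to no index.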
Making a scalable IP 2 Geolocation service
As many of us know, being able to geolocate an IP can be of tremendous help for any business that offers localized services on the net. Functionally such a service is straightforward: the caller has an IP and needs a latitude/longitude pair (and maybe some extra data like the closest city, population density, weather code, etc.).
Such services are readily available on the web today, free of charge, but with a limited number of calls per time unit. So, if your application needs this at a larger scale (e.g. 100k calls/day), then you might need to do it on your own. In this post I will discuss some ways to build such a system.
Basic functional characteristics :
- it involves mostly reads (99.9999% reads & 0.0001% writes)
- usually the IP 2 Geo database gets updated once per month
- having stale data is no problem (for short periods of time)
- it has to scale really big & perform really well
- every potential web visitor will hit this service at least once => this service is one of your first possible bottlenecks
- this hit will probably happen at the first page load => it has to be fast (so that it doesn’t affect the page load time)
- it should store more than just a latitude/longitude pair
- it would be nice to have more information stored for every IP : city, region, country, weather code, connection speed, etc.
- this should be stored in a document structure that is easy to read and maintain (ie. JSON, etc.)
Storing the information on the filesystem
This would clearly be the most simple and straightforward solution. The IP address can be converted into a directory structure just
by replacing the dots with slashes: the IP 110.111.112.113 would be transformed into the filesystem path /110/111/112/113.txt.
Storing things this way would allow us to use the basic filesystem tools to add/change/remove entries, and we could also use classic
tools like FTP, rsync, etc. to replicate changes across multiple servers. As such, almost no “fancy” technology or administrative
skills are needed to build & maintain such a system. There would also be little “application level overhead” because we would work directly
with OS level system calls (i.e. the POSIX filesystem calls).
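The mapping from IP to path is a one-liner. Here is a small Python sketch; the validation through the standard ipaddress module is an addition for safety, and the function name is hypothetical.

```python
import ipaddress

def ip_to_path(ip):
    """Map a dotted IPv4 address to a path such as /110/111/112/113.txt."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    return "/" + str(address).replace(".", "/") + ".txt"

print(ip_to_path("110.111.112.113"))  # /110/111/112/113.txt
```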
The problem :
There are 2^32 IP addresses out there (assuming you want to resolve good old IPv4 addresses), which means you would need
to have about 4.3 billion files on your filesystem. And this just doesn’t work! Most advanced filesystems (e.g. ext3, xfs) perform terribly for anything that
goes above 100 million files, and we are almost 2 orders of magnitude above that (we need > 40x that number). Also, if you encounter filesystem corruption (e.g. servers lose power), the filesystem recovery process will take hours and hours to finish.
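The numbers can be checked quickly in Python. The 100 million figure is the threshold quoted above, not something measured here.

```python
import math

total_ipv4 = 2 ** 32                 # one file per IPv4 address
comfortable_limit = 100_000_000      # file count quoted above

ratio = total_ipv4 / comfortable_limit
print(total_ipv4)                    # 4294967296
print(round(ratio, 1))               # 42.9
print(round(math.log10(ratio), 2))   # 1.63
```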
B. Let’s make our solution someone else’s problem
Storing the IP entries on a CDN
This would be quite an elegant solution, especially for high volume traffic. Every IP entry would be stored as a 500-byte file and the name of the file would be
the IP address itself. As such, an IP address like 110.111.112.113 would be translated into a CDN file request like: http://mycdn.mydomain.com/ip/110-111-112-113. At the origin we would have a container that stores those entries with a TTL of 1 month. Every month we could run a script that refreshes some of those entries at the origin level and that’s it. The price would also be extremely low on a CDN with a bandwidth based pricing model. If every geo request takes around 1 KB of bandwidth (including the headers and everything) then you would consume 1 GB of traffic per 1 million requests.
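Both the URL scheme and the bandwidth estimate are easy to sketch in Python. The base URL is the example one from the text, the function names are hypothetical, and the estimate uses decimal units (1 GB = 1,000,000 KB).

```python
CDN_BASE = "http://mycdn.mydomain.com/ip/"

def ip_to_cdn_url(ip):
    """Turn 110.111.112.113 into the CDN file URL for that entry."""
    return CDN_BASE + ip.replace(".", "-")

def traffic_gb(requests, kb_per_request=1):
    """Traffic in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return requests * kb_per_request / 1_000_000

print(ip_to_cdn_url("110.111.112.113"))
print(traffic_gb(1_000_000))   # 1.0
```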
The problem :
No CDN will actually allow you to store more than 10 million files (for example Rackspace Cloudfiles limits you to 5 million files). And we are almost 3 orders of magnitude above that.
C. Use a noSQL database
I really liked the first solution. Basically I tried to use the filesystem as a huge HashMap, where the key was the path. All I needed here was basic matching between a key and a value; I don’t need any more advanced query system. As such, the perfect replacement for the first solution would be a very simple document based noSQL solution. Why noSQL? Because the data is not relational in any way (i.e. no joins), so a simple key/value store scales & performs pretty well.
In this case my tool of choice would be Redis. The reason? It sits right on the border between a cache system and a persistence engine. It doesn’t compromise at all on speed (as some other more advanced noSQL solutions, e.g. MongoDB, Solr, etc., have to do) because it offers just a minimal set of features.
In the end, I actually ended up with the initial solution. But implemented at a different level.
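To make the “huge HashMap” idea concrete, here is a minimal Python sketch of the lookup. A plain dictionary stands in for Redis so that the example runs on its own; the key scheme, the class name and the sample coordinates are all made up for illustration.

```python
import json

class GeoStore:
    """Key/value lookup from IP to a JSON document.

    A plain dict stands in for Redis here; with a real client the two
    methods would wrap its SET and GET commands instead.
    """

    def __init__(self):
        self._data = {}

    @staticmethod
    def key(ip):
        return "ip:" + ip   # hypothetical key scheme

    def put(self, ip, document):
        self._data[self.key(ip)] = json.dumps(document)

    def get(self, ip):
        raw = self._data.get(self.key(ip))
        return json.loads(raw) if raw is not None else None

store = GeoStore()
# Sample values below are invented, not real geolocation data.
store.put("110.111.112.113", {"lat": 37.77, "lon": -122.42, "city": "San Francisco"})
print(store.get("110.111.112.113")["city"])   # San Francisco
```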
