Before We Start
As some of my blog readers may know, my previous startup project was an online location based search service. Thing does not go well and I am going to open source some of the related work.
In this post, I will show you how to use Lucene to do spatial search as well as how it works internally.
Spatial Search and Geo Tagging
Geo-tagging is a popular way to do spatial search. Instead of using the actual location to tag a POI, the geo-tagging uses geotag to group items within a predefined range as a whole. When performing a search, it is simply pull out all items with the same tag.
- There are few popular ways to do geo-tagging, including
GeoHash is way to encode latitude and longitude using base32 code. There is a fundamental flaw that makes it hard to do spatial search. The problem is the geo hash generated is not continuous. The neighbor of a particular block may not share the same prefix with its nearest eight neighbors. When using geohash to do spatial search, we need to calculate the hash value of nearby neighbors by ourselves.
1 2 3 4 5 6 7
Quad-Tree addresses the issue in geohash and adapt a better way to do geo-tagging. This example gives us a great example on how Quad-Tree does geo-tagging.
- The Quad-Tree has the following advantage when doing spatial search
- it can be really fast if you stores data as a tri. Such as, Lucene, internally, it uses indexed tri to store NumericField.
- the neighbors of a block share the longest common prefix.
Our Custom Approach
Lucene's contrib-spatial uses geo-tagging to implements spatial search and only support searching for features near a point.
A pair of latitude and longitude gives us a quickstart for your location based application. However, not every single feature in a LBS application can be described as a point. Let us take school district as an example. The School A may be closer to your house than School B is, but your house is belong to school district for school B. Another example is bike trails, bike trail is a line not a single point.
A full feature spatial search must support complicated geometric operations. This is why we create our own Lucene spatial search implementation.
In the GIS area, Geometry are used to describe shape of features. The common geometries are:
|Point||POINT (30 10)|
|LineString||LINESTRING (30 10, 10 30, 40 40)|
|Polygon||POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))|
|POLYGON ((35 10, 10 20, 15 40, 45 45, 35 10), (20 30, 35 35, 30 20, 20 30))||
Multipart geometries (2D)
|MultiPoint||MULTIPOINT ((10 40), (40 30), (20 20), (30 10))|
|MultiLineString||MULTILINESTRING ((10 10, 20 20, 10 40) (40 40, 30 30, 40 20, 30 10))|
|MultiPolygon||MULTIPOLYGON (((30 20, 10 40, 45 40, 30 20)), ((15 5, 40 10, 10 20, 5 10, 15 5)))|
How to Calculate QuadTree value for Geometries
When hashing a Geometry primitive, we will calculate the minimum bounding rectangle(MBR) of this geometry. For example, the minumum bouding box for is the grid made up of (1, 1), (1, 4), (4, 4), (4, 1).
We will calculate the quadtree hashcode for the MBR, the quadtree hashcode for the MBR is the longest common prefix of quadtree values for the four corners.
- When indexing this geometry with Lucene, we will store these fields in lucene
- geometry: POLYGON ((3 1, 1 2, 2 4, 4 4, 3 1))
- geometry__quadtree: QuadTree value of the polygon.
The number of digits we used in the quad-tree implementation is 17 digits. This allows us to lay items on an 600 x 600 meter block. This number comes from 40075000 / 2 ^16 (Earth Circumference / level of details). The 17 digits number will be stored as a long value. The digits we used are 1(upper left), 2(upper right), 3(lower left), 4(lower right), and 0 (whole block in the given detail).
When encoding a coordinate, we will stores the quadtree value as a NumericField in Lucene. Internally, Lucene will use indexed trie to store NumericField. The "Indexed Trie" uses buckets to group terms. We will use the default precisionStep(4) for now.
During the index phase, we will store the geometry in its text form and in quadtree value. Each of them has their own use during search phase.
1 2 3 4 5 6 7 8 9 10
- The search phase has two parts
- filter: fetches all possible features that may contains the geometry X by using Lucene's range query against the quadtree field.
- select: for each possible results, run geometric operation to see if result Y contains geometry X.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Open Source Projects
Shapely is Scala binding for JTS Topology Suite. The goal of Shapely is to provide an easy to use factory to create JTS geometry class instances. It allows us to create various geometry shape with easy to use factory methods.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
Lucene Spatial is the geo-spatial module for lucene built on the top of shapely. The lucene-spatial modules uses a SpatialContext instance to encode location fields and to create location queries.
1 2 3 4 5 6 7 8 9 10 11