Spatial Search with Lucene
Before We Start
As some of my blog readers may know, my previous startup project was an online location based search service. Thing does not go well and I am going to open source some of the related work.
In this post, I will show you how to use Lucene to do spatial search as well as how it works internally.
Spatial Search
Spatial Search and Geo Tagging
Geotagging is a popular way to do spatial search. Instead of using the actual location to tag a POI, the geotagging uses geotag to group items within a predefined range as a whole. When performing a search, it is simply pull out all items with the same tag.
 There are few popular ways to do geotagging, including
 GeoHash
 QuadTree
GeoHash
GeoHash is way to encode latitude and longitude using base32 code. There is a fundamental flaw that makes it hard to do spatial search. The problem is the geo hash generated is not continuous. The neighbor of a particular block may not share the same prefix with its nearest eight neighbors. When using geohash to do spatial search, we need to calculate the hash value of nearby neighbors by ourselves.
1 2 3 4 5 6 7 

QuadTree
QuadTree addresses the issue in geohash and adapt a better way to do geotagging. This example gives us a great example on how QuadTree does geotagging.
 The QuadTree has the following advantage when doing spatial search
 it can be really fast if you stores data as a tri. Such as, Lucene, internally, it uses indexed tri to store NumericField.
 the neighbors of a block share the longest common prefix.
For more extensive information regarding geotagging, you should check here, here and here.
Our Custom Approach
Lucene's contribspatial uses geotagging to implements spatial search and only support searching for features near a point.
A pair of latitude and longitude gives us a quickstart for your location based application. However, not every single feature in a LBS application can be described as a point. Let us take school district as an example. The School A may be closer to your house than School B is, but your house is belong to school district for school B. Another example is bike trails, bike trail is a line not a single point.
A full feature spatial search must support complicated geometric operations. This is why we create our own Lucene spatial search implementation.
In the GIS area, Geometry are used to describe shape of features. The common geometries are:
Geometry primitives(2D)
Type  Text Format  Example 

Point  POINT (30 10)  
LineString  LINESTRING (30 10, 10 30, 40 40)  
Polygon  POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))  
POLYGON ((35 10, 10 20, 15 40, 45 45, 35 10), (20 30, 35 35, 30 20, 20 30)) 
Multipart geometries (2D)
Type  Text Format  Example 

MultiPoint  MULTIPOINT ((10 40), (40 30), (20 20), (30 10))  
MultiLineString  MULTILINESTRING ((10 10, 20 20, 10 40) (40 40, 30 30, 40 20, 30 10))  
MultiPolygon  MULTIPOLYGON (((30 20, 10 40, 45 40, 30 20)), ((15 5, 40 10, 10 20, 5 10, 15 5))) 
How to Calculate QuadTree value for Geometries
When hashing a Geometry primitive, we will calculate the minimum bounding rectangle(MBR) of this geometry. For example, the minumum bouding box for is the grid made up of (1, 1), (1, 4), (4, 4), (4, 1).
We will calculate the quadtree hashcode for the MBR, the quadtree hashcode for the MBR is the longest common prefix of quadtree values for the four corners.
 When indexing this geometry with Lucene, we will store these fields in lucene
 geometry: POLYGON ((3 1, 1 2, 2 4, 4 4, 3 1))
 geometry__quadtree: QuadTree value of the polygon.
Implementation Detail
The number of digits we used in the quadtree implementation is 17 digits. This allows us to lay items on an 600 x 600 meter block. This number comes from 40075000 / 2 ^16 (Earth Circumference / level of details). The 17 digits number will be stored as a long value. The digits we used are 1(upper left), 2(upper right), 3(lower left), 4(lower right), and 0 (whole block in the given detail).
When encoding a coordinate, we will stores the quadtree value as a NumericField in Lucene. Internally, Lucene will use indexed trie to store NumericField. The "Indexed Trie" uses buckets to group terms. We will use the default precisionStep(4) for now.
Index Phase
During the index phase, we will store the geometry in its text form and in quadtree value. Each of them has their own use during search phase.
1 2 3 4 5 6 7 8 9 10 

Search Phase
 The search phase has two parts
 filter: fetches all possible features that may contains the geometry X by using Lucene's range query against the quadtree field.
 select: for each possible results, run geometric operation to see if result Y contains geometry X.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 

Open Source Projects
Scala Shapely
Shapely is Scala binding for JTS Topology Suite. The goal of Shapely is to provide an easy to use factory to create JTS geometry class instances. It allows us to create various geometry shape with easy to use factory methods.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 

Lucene Spatial
Lucene Spatial is the geospatial module for lucene built on the top of shapely. The lucenespatial modules uses a SpatialContext instance to encode location fields and to create location queries.
1 2 3 4 5 6 7 8 9 10 11 
