Before We Start ===================================== As some of my blog readers may know, my previous startup project was an online location based search service. Thing does not go well and I am going to open source some of the related work. In this post, I will show you how to use Lucene to do spatial search as well as how it works internally. Spatial Search ===================================== Spatial Search and Geo Tagging --------------------------------------- Geo-tagging is a popular way to do spatial search. Instead of using the actual location to tag a POI, the geo-tagging uses geotag to group items within a predefined range as a whole. When performing a search, it is simply pull out all items with the same tag. There are few popular ways to do geo-tagging, including - GeoHash - QuadTree GeoHash ---------- GeoHash_ is way to encode latitude and longitude using base32 code. There is a fundamental flaw that makes it hard to do spatial search. The problem is the geo hash generated is not continuous. The neighbor of a particular block may not share the same prefix with its nearest eight neighbors. When using geohash to do spatial search, we need to calculate the hash value of nearby neighbors by ourselves. .. code-block:: none +-------+-------+-------+ | ezefr | ezs48 | ezs49 | +-------+-------+-------+ | ezefr | ezs42 | ezs43 | +-------+-------+-------+ | ezefp | ezs40 | ezs41 | +-------+-------+-------+ .. _Geohash: http://en.wikipedia.org/wiki/Geohash Quad-Tree ---------------------- Quad-Tree addresses the issue in geohash and adapt a better way to do geo-tagging. This example gives us a great example on how Quad-Tree does geo-tagging. .. image:: http://i.msdn.microsoft.com/dynimg/IC96238.jpg The Quad-Tree has the following advantage when doing spatial search - it can be really fast if you stores data as a tri. Such as, Lucene, internally, it uses indexed tri to store NumericField. - the neighbors of a block share the longest common prefix. For more extensive information regarding geo-tagging, you should check `here `__, `here `__ and `here `__. Our Custom Approach ========================================================== Lucene's `contrib-spatial`_ uses geo-tagging to implements spatial search and only support searching for features near a point. A pair of latitude and longitude gives us a quickstart for your location based application. However, not every single feature in a LBS application can be described as a point. Let us take school district as an example. The School A may be closer to your house than School B is, but your house is belong to school district for school B. Another example is bike trails, bike trail is a line not a single point. A full feature spatial search must support complicated geometric operations. This is why we create our own Lucene spatial search implementation. In the GIS area, Geometry are used to describe shape of features. The common geometries are: .. _contrib-spatial: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/api/contrib-spatial/index.html Geometry primitives(2D) --------------------------------------------- +------------+-----------------------------------------------+-------------------------+ | Type | Text Format | Example | +============+===============================================+=========================+ | Point | POINT (30 10) | |PointExample| | +------------+-----------------------------------------------+-------------------------+ | LineString | LINESTRING (30 10, 10 30, 40 40) | |LineStringExample| | +------------+-----------------------------------------------+-------------------------+ | Polygon | POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10)) | |PolygonExample| | | +-----------------------------------------------+-------------------------+ | | POLYGON ((35 10, 10 20, 15 40, 45 45, 35 10), | |PolygonExample2| | | | (20 30, 35 35, 30 20, 20 30))| | | +------------+-----------------------------------------------+-------------------------+ Multipart geometries (2D) --------------------------------------------- +-----------------+-------------------------------------------------+--------------------------+ | Type | Text Format | Example | +=================+=================================================+==========================+ | MultiPoint | MULTIPOINT ((10 40), (40 30), (20 20), (30 10)) | |MultiPointExample| | +-----------------+-------------------------------------------------+--------------------------+ | MultiLineString | MULTILINESTRING ((10 10, 20 20, 10 40) | |MultiLineStringExample| | | | (40 40, 30 30, 40 20, 30 10)) | | +-----------------+-------------------------------------------------+--------------------------+ | MultiPolygon | MULTIPOLYGON (((30 20, 10 40, 45 40, 30 20)), | |MultiPolygonExample| | | | ((15 5, 40 10, 10 20, 5 10, 15 5))) | | +-----------------+-------------------------------------------------+--------------------------+ .. |PointExample| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/SFA_Point.svg/51px-SFA_Point.svg.png .. |LineStringExample| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/SFA_LineString.svg/51px-SFA_LineString.svg.png .. |PolygonExample| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/SFA_Polygon.svg/51px-SFA_Polygon.svg.png .. |PolygonExample2| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/5/55/SFA_Polygon_with_hole.svg/51px-SFA_Polygon_with_hole.svg.png .. |MultiPointExample| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/SFA_MultiPoint.svg/51px-SFA_MultiPoint.svg.png .. |MultiLineStringExample| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/8/86/SFA_MultiLineString.svg/51px-SFA_MultiLineString.svg.png .. |MultiPolygonExample| image:: http://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/SFA_MultiPolygon.svg/51px-SFA_MultiPolygon.svg.png How to Calculate QuadTree value for Geometries --------------------------------------------------- When hashing a ``Geometry primitive``, we will calculate the minimum bounding rectangle(MBR) of this geometry. For example, the minumum bouding box for |PolygonExample| is the grid made up of (1, 1), (1, 4), (4, 4), (4, 1). We will calculate the quadtree hashcode for the MBR, the quadtree hashcode for the MBR is the longest common prefix of quadtree values for the four corners. When indexing this geometry with Lucene, we will store these fields in lucene - geometry: POLYGON ((3 1, 1 2, 2 4, 4 4, 3 1)) - geometry__quadtree: QuadTree value of the polygon. Implementation Detail ====================================================================== The number of digits we used in the quad-tree implementation is 17 digits. This allows us to lay items on an 600 x 600 meter block. This number comes from 40075000 / 2 ^16 (Earth Circumference / level of details). The 17 digits number will be stored as a long value. The digits we used are 1(upper left), 2(upper right), 3(lower left), 4(lower right), and 0 (whole block in the given detail). When encoding a coordinate, we will stores the quadtree value as a NumericField in Lucene. Internally, Lucene will use indexed trie to store NumericField. The "Indexed Trie" uses buckets to group terms. We will use the default precisionStep(4) for now. Index Phase -------------------------------- During the index phase, we will store the geometry in its text form and in quadtree value. Each of them has their own use during search phase. .. code-block:: scala val point = Point(24.123, 18.921) val code = point.quadtree val raw = new Field("location", point.wkt, Field.Store.YES, Field.Index.NOT_ANALYZED) val field = new NumericField("location" + "__quadtree") field.setLongValue(code) doc.add(raw) doc.add(field) Search Phase -------------------------------- The search phase has two parts - filter: fetches all possible features that may contains the geometry X by using Lucene's range query against the quadtree field. - select: for each possible results, run geometric operation to see if result Y contains geometry X. .. code-block:: scala val field = "location" // create a point. val point = context.makeShape(Point(24, 18)) // turn point into search range. val circle = point.buffer(Distance("500m")) // create a query container for complex queries. val query = new BooleanQuery() // filtering: search for items in this range query.add(QuadTreeRangeQueryFactory.buildLocalQuery(field, circle.quadtree)) // selecting and scoring: the value source will return the distance between // the point and each feature. query.add(new ValueSourceQuery(context.makeValueSource("location", point)); val founds = indexSearcher.search(query) Open Source Projects ======================================= Scala Shapely --------------------------------------- Shapely_ is Scala binding for JTS Topology Suite. The goal of Shapely is to provide an easy to use factory to create JTS geometry class instances. It allows us to create various geometry shape with easy to use factory methods. .. code-block:: scala // create a point Point(30, 10) // create a line string LineString((30, 10), (10, 30), (40, 40)) // create a polygon Polygon((30, 10), (10, 20), (20, 40), (40, 40), (30, 10)) // create a polygon with a hole in it. Polygon( Seq((35, 10), (10, 20), (15, 40), (45, 45), (35, 10)), Seq((20, 30), (35, 35), (30, 20), (20, 30))) // create a multi-point MultiPoint((10, 40), (40, 30), (20, 20), (30, 10)) // create a multi line-string. MultiLineString( Seq((10, 10), (20, 20), (10, 40)), Seq((40, 40), (30, 30), (40, 20), (30, 10)) ) // create a multi-polygon. MultiPolygon( Seq( Seq((30, 20), (10, 40), (45, 40), (30, 20)) ), Seq( Seq((15, 5), (40, 10), (10, 20), (5, 10), (15, 5)) ) ) ) .. _Shapely: https://bitbucket.org/bluetang/common-shapely/ Lucene Spatial ------------------------------------------- `Lucene Spatial`_ is the geo-spatial module for lucene built on the top of shapely. The lucene-spatial modules uses a SpatialContext instance to encode location fields and to create location queries. .. code-block:: scala // create a context. val context = new SpatialContext // create a shape from the representation of a geometry. val shape = context.makeShape("Point(0 10)") // create lucene fieldables for a location field. val fields = context.makeFieldables("field", shape).toList // create a location query to look for features within 5KM radius. val query = context.makeQuery("field", shape, new Distance(5, Distance.Unit.KM)) .. _Lucene Spatial: https://bitbucket.org/bluetang/lucene-spatial/