ScaDS Logo

CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

OSTMap - Open Source Tweet Map - Geotemporal Index

Beitragsseiten

Geotemporal Index

How to map 2d coordinates into 1d key space?

In order to support fast geotemporal queries we use space filling curves. There are many known, for example z curves or geohashes, which we use. In short geohashes are a “hierarchical spatial data structure which subdivides space into buckets of grid shape” (Wikipedia). The length of a given geohash defines its precision. For example: a geohash with length 2 like “d2” covers an area of about 1250km x 625km and a geohash of length 7 like “d2669pz” an area of about 153m x 153m. Locations which are near to each other usually have only small differences in their hashes. Sadly this didn’t hold for all locations. There are “border” situations where hashes differ strongly from each other even when the corresponding locations are next to each other. But this should be no big problem. The calculation of geohashes for given latitude, longitude coordinates is fast with about 3 million GeoHash.encodeHash calls per second on an I7, single thread using the java Geohash library from https://github.com/davidmoten/geo/. The image shows a map covered with (shortened) geohashes and the order they appear as row keys in Apache Accumulo.

As one can see, regions lying near to each other on the map will be saved near to each other in Apache Accumulo. Since Apache Accumulo is built to support so called “range scans” which means: reading a sequence of data with a given start row until a given end row, it is easy to support queries on data partitioned by geohashes. The Apache Accumulo table holding the geo index would look like this:

How to query a bounding box?

When querying a bounding box we start by calculating the coverage of the box. The result will be a set of geohashes. Those will be used to calculate scan ranges in Apache Accumulo. After the scans are made all data which is not in the bounding box will be dropped by server side iterators and the resulting set of data is returned. The process is visualized in the following image.

To perform efficient filtering the server side iterators use the lat/lon information given in the column qualifier:

Add the time dimension!

To further increase the query performance we added the time dimension to our table design. We introduced so called “day buckets” in front of our row key. This two byte wide value holds the number of days since first of january 1970. The result are many geospatial indices, one for each day. This enables us to to further shrinken the search space while querying OSTMap.

To query a bounding box in this 3-dimensional space one has generate one 2-dimensional geospatial query as shown above for each involved day.

How to deal with hotspots?

This suggestion lacks a method of hotspot mitigation. As shown in the following figure. If there are many data points within one single geohash + time bucket all of these data points are handled by the same node in the cluster, which can lead to high workload for some nodes relative to others.

This issue is easily fixed within the distributed approach by adding a “random” value from 0 to 255 as prefix byte to each row key and a table presplit with 256 splits, one for each value of this so called spreading byte. This leads to a uniform distribution among all involved cluster nodes. From this point on each query will leverage the whole cluster computation power.

The final design of the geotemporal index table in Apache Accumulo can be seen in the following figure.