The world’s leading publication for data science, AI, and ML professionals.

Geospatial Index 101

How to manage geospatial data efficiently

Introduction

Location data has become more available nowadays due to the advancement of mobile devices, IoT, and earth observation technologies. The abundance of data challenges those who develop location-oriented systems and applications. This article will give a brief introduction to the geospatial index, the most commonly used technique for managing Geospatial Data.

What is Indexing?

Indexing is a data management approach for optimizing data retrieval and searching. The benefit of using an index is to make locating a specific piece of data possible without scanning through all the existing ones. A real-life example of indexing that everyone is familiar with is libraries.

In a library, books are grouped and ordered by some criteria such as titles. All the books starting with ‘A’ are on shelf 1 whereas those starting with ‘B’ are on shelf 2, and so on. When people visit a library, they just go to the shelf that corresponds to the book title they’re finding instead of going through all of the books available which would take a huge amount of time.

Photo by Ryunosuke Kikuno on Unsplash
Photo by Ryunosuke Kikuno on Unsplash

Indexing is crucial for all database technologies available nowadays. Having indexes allows databases to speed up the query because the number of data to process is significantly reduced similar to libraries helping users find the books they want.

Geospatial Data

Geospatial data is data recorded with its geographical location. Think of it as an ordinary database record with an additional column(s) specifying where on earth this data is associated to. Here is an example of geospatial data:

Geospatial data examples(Image by author).
Geospatial data examples(Image by author).

All records on the table look like normal structured data except the location which looks like a mathematics function. This data type is called geometry which __ is used to store geographic data. Below are 4 geometry objects that are most commonly found:

  • Point: A single position on earth
  • Line: A set of points that connect
  • Polygon: A set of points representing a boundary of the specific area
  • Multi-Polygon: A collection of polygons that share the same properties

(Note: There are 2 types of geospatial data: vector and raster. Point, Line, and Polygon are vector data which is a representation of location or area while raster data is an image of a specific scene that also contains its location information such as satellite and drone images. This article only focuses on vector data.)

Geospatial Indexing

Unlike primitive data types(Integer, e.g.) where _traditional indexing algorithms_ can be applied, indexing geospatial data is much trickier since it has 2 dimensions: x and y, let alone the complexity coming from a collection of Points such as Line and Polygon. Therefore, it requires a different indexing methodology.

One possible approach is to divide the entire space into a set of subspaces as follows. Each subspace has a name assigned to it so that we can refer to it correctly later.

Image by the author.
Image by the author.

Suppose you have 12 records of data that look geographically like below:

Image by the author.
Image by the author.

By combining the data with the predefined zones, this is what you get.

Image by the author.
Image by the author.

Now, if you group the records by the zone they belong, the result is in the table below:

Image by author.
Image by author.

Suppose you want to retrieve all the records that are, for instance, associated with Australia. The first is to find all the cells overlapping with the country boundary. In this case, L and P.

Image by the author.
Image by the author.

Then, looking up all the records located in both cells gives _X_₄, _X_₅, and _X_₁₂.

Image by the author.
Image by the author.

Finally, scanning through the remaining records, we found that only _X_₄ and _X_₅ meet the search criteria.

Image by the author.
Image by the author.

Notice that even though there are 12 records in total, we only scan 3 of them. The number of reduced operations in this example may sound small but, in reality, with a vast amount of data, it can make a huge difference.

Geohash

The example illustrated above is a simplified explanation of Geohash. Geohash is an algorithm that encodes any geo location to a text by splitting the world into an equal-sized unit called cell(or grid) __ where each has a specific _has_h assigned to it(but, instead of an alphabet, it uses a base-32 number). What’s not mentioned in the previous example about geohash is its _hierarchical structur_e.

Suppose we divide the entire space into grids as in the previous example.

Image by the author.
Image by the author.

Instead of stopping here, we go further by splitting each cell into even smaller subspaces. For example, in the below image, cell A is split into 16 smaller cells.

Image by the author.
Image by the author.

Then, we go one level deeper by repeating it on one child of A, say, AF.

Image by the author.
Image by the author.

The more you apply this process, the smaller the area each cell represents and the longer the hash string is. The length of the name indicates the precision level of a hash.

One interesting property of geohash is that the name assigned to each cell has its parent name as a prefix. For instance, cells AFA to AFP are all children of AF.

The hierarchical structure of Geohash(Image by author).
The hierarchical structure of Geohash(Image by author).

The prefix property is very handy when you want to validate whether one area is a subarea of another. It can be seen as an approximated version of the ST_Contains (spatial contain) function in a relational database with a much cheaper computational cost because it only deals with text, not Point, Line, or Polygon.

You may also notice that the name of each cell sort alphabetically from left to right. This mechanism allows us to use cell names to estimate the geographic distance between them. For example, suppose there are 3 hashes: AA, AB, and AK. Since AA is closer to AB than AK alphabetically (the letter ‘A’ is closer to ‘B’ than ‘K’), AA is closer to AB than AK geographically too. Although this property has an exception where the new row is started(such as AE is closer to AA than AC), the fact that all cells are equally spaced makes this edge-case manageable.

Note that Geohash is only one of the geospatial indexes available. Other index systems, such as Quadtree and H3, even though based on the same foundations, apply different partitioning strategies, so they have different strengths and weaknesses.

Geospatial Index Usage

Other than being an index, a geospatial index can also be applied to various use cases.

Data Reduction

Geospatial data is very different from primitive data like Integer, Float, and String therefore relational databases need a special data type for them: Geography. The geography data type, however, is only a string in the format of _Well-Known Text_(WKT) that looks like the following:

POINT(103.14411713327522 15.553626958655387)
LINESTRING(103.14261509622688 15.5550326423843,103.14400984491462 15.552738065460904,103.14585520471687 15.552944785150965,103.14594103540534 15.554040396042701)
POLYGON((103.1434304877674 15.553296208147678,103.14435316766853 15.5547639094724,103.14585520471687 15.553482255373645,103.1448896094715 15.552138577185888,103.1434304877674 15.553296208147678))

Notice that the text is very long due to the decimal precision of the coordinate reference system. On numerous occasions, nevertheless, this precision is unnecessary. For instance, in a ride-sharing application where millions of drivers pick up and drop off millions of passengers a day, it would be impossible(and absurd) to analyze all the points these activities occur. Aggregating the data per zone is much more sensible.

Using an index to aggregate geospatial data(Image by author).
Using an index to aggregate geospatial data(Image by author).

A geohash can be an identifier of each zone that can be used to not only reference the area but also infer its spatial properties. Moreover, since there are finite numbers of cells, using geohash can keep the amount of data in the system constant and more manageable.

Data Partitioning

The vast amount of data available nowadays makes the trend of using distributed data stores grows rapidly. Unlike traditional databases, distributed databases split data into chunks and spread them across nodes in the cluster. Designing a data partitioning strategy is necessary (or at least recommended) for the system owners to utilize distributed data stores effectively.

In developing geographic-oriented systems and applications, partitioning data by its geographical location is very sensible. The data associated with a similar neighborhood should also be in the same chunk and, therefore, stay in the same node in the database cluster. This way, when users retrieve the data, it will come from a minimal number of nodes and files which saves the computational cost significantly.

Geospatial indexes provide a convenient way to split your data geographically. As you can see from the above examples, the closer the coordinates are, the higher chance they will be in the same geohash(or any other index system) cell. Therefore, you can use geohash as a partition key for the distributed database of your choice.

Using geospatial index for data partitioning(Image by author).
Using geospatial index for data partitioning(Image by author).

Conclusion

Managing geolocation data is critical for developing location-based applications. A geospatial index provides a systematic and elegant approach to index geospatial data. Therefore, it’s a key component that any robust geographic software system must have.


Related Articles