System Design: Quadtrees & GeoHash

Efficient geodata management for optimized search in real-world applications

Vyacheslav Efimov
Towards Data Science

--

Introduction

Google Maps and Uber are only some examples of the most popular applications working with geographical data. Storing information about millions of places in the world obliges them to efficiently store and operate on geographical positions, including distance calculation and search of the nearest neighbours.

All modern geographical applications use 2D locations of objects represented by longitude and latitude. While it might seem naive to store the geodata in the form of coordinate pairs, there are some pitfalls in this approach.

In this article, we will discuss the underlying issues of the naive approach and talk about another modern format used to accelerate data manipulation in large systems.

Note. In this article, we will represent the world as a large flat 2D rectangle instead of a 3D ellipse. Longitude and latitude will be represented by X and Y coordinates respectively. This simplification will make the explanation process easier without omitting the main details.

Problem

Let us imagine a database storing 2D coordinates of all application objects. A user logs in to the application and wants to find the nearest restaurants.

Map representing the user (node u) and other objects located in the neighbourhood. The objective is to find all the nearest nodes located within the distance d from the user.

If coordinates are simply stored in the database, then the only way to answer this type of query is to linearly iterate through all of the possible objects and filter the closest ones. Obviously, this is not a scalable approach and search would be extremely slow in the real application.

Linear search includes calculating distances to all nodes and filtering the closest ones

SQL databases allow the creation of an index — a data structure built on a certain column of a table accelerating the search process by keys in that column.

Another approach includes creating an index on one of the coordinate columns. When a user performs a query, the database can in O(1) time retrieve the position of the row in the table corresponding to the current position of the user.

Thanks to the constructed index, the database can also rapidly find the rows with the nearest coordinate value. Then it is possible to take a set of such rows and then filter those whose total Euclidean distance from the user position is less than a certain search radius.

Building an index on a column containing Y coordinates of nodes. As a consequence, it becomes very rapid to find a set of nodes whose Y coordinates are the nearest to a given node. However, the search process does not take into consideration any information about X coordinates, which is why the search results must be then filtered.

While the described approach is better than the previous one, it requires time to filter rows with the closest distances. Additionally, there can be cases when initially selected rows with the closest coordinates are not actually the closest ones to the user position.

A single table cannot have two indexes simultaneously. That is why for solving this problem, both coordinates should be represented as a single combined value while preserving information about distances. This objective is exactly achieved by quadtrees which are discussed in the next section.

Quadtree

A quadtree is a tree data structure used for recursive partitioning of a 2D space into four quadrants. Depending on the tree structure, every parent node can have from 0 to 4 children.

Map representation in the quadtree format. The more levels are used, the higher the precision is.

As shown in the picture above, every square on a current level is divided by four equal subsquares in the next level. As a result, encoding a single square on level i requires 2 * i bits.

Quadtree visualisation

If a geographical map is divided in this way, then we can encode all of its subparts with a custom number of bits. The more levels are used in the quadtree, the better the precision is.

Properties

Quadtrees are particularly used in geo applications for several advantages:

  • Due to its structure, quadtrees allow rapid tree traversal.
  • The larger the common prefix of two strings used to encode a pair of points on the map, the closer they are. However, this does not work the other way around in the edge case: a pair of points can be very close to each other but have a small common prefix. Though edge cases occur, they are not that often: they only happen when two small quadrants are located on opposite sides of a border with another much larger quadrant.
Edge case example: smaller quadrants on different sides of the border have only 1 common character in their prefixes
  • If a quadrant is represented by a string s₁s₂…sᵢ, then all of the subquadrants it contains, are represented by strings x such as s₁s₂…sᵢ < x < s₁s₂…sⱼ, where sⱼ is the next character after sᵢ in the lexicographical order.
Lexicographical order in quadtrees helps to rapidly identify all the subregions contained inside a larger region

Advantages

The main advantage of quadtrees is that every position on a map is represented by a unique string identifier which can be stored in a database as a single column making it possible to construct an index on quadtree strings. Therefore, given any string representing a region on the map, it becomes very fast:

  • to go up to higher levels or to move to lower levels of the region;
  • to access all the subregions of the region;
  • to find up to all of the 8 adjacent regions on the same level (except for edge cases).

GeoHash

In most real geographical applications, the GeoHash format is used which is a slight modification over the quadtree format:

  • instead of squares, geographical regions are divided by rectangles;
  • regions are divided into more than four parts;
  • every object on the map is encoded by a string in the “base 32” format consisting of digits 0–9 and lowercase letters except “a”, “i”, “l” and “o”.

Despite these slight modifications, GeoHash preserves the important advantages that were described in the section above for quadtrees.

The table below shows the correspondance between every GeoHash level and rectangle sizes. For the large majority of cases, the levels 9 and 10 are already sufficient to give a very precise approximation on the map.

GeoHash correspondence between every encoding level and size of rectangles

Finding the nearest objects on the map

If we have on object on the map, we can find its nearest objects withing a certain distance d by using the following algorithm:

  1. Converting the object to the GeoHash string s.
In this example, we would like to find all the objects located within d = 500 m from the blue node

2. Finding the first smallest GeoHash level i whose size is greater than the required distance d.

Level 6 is the first level whose width and height are greater than the search radius d

3. Take the first i characters of the string s (to represent the rectangle containing the initial object on level k).

4. Find 8 adjacent regions around the string s[0 … i - 1].

5. Find all the objects in the initial and adjacent regions and filter those objects whose distance to the initial object is less than d.

For the search process, all the objects inside the rectangle 97sy3k and its 8 adjacent rectangles must be considered. All the candidate objects are then linearly filtered to find those that satisfy the distance condition.

Conclusion

Fast navigation is a crucial aspect of geoapplications that use data about millions users and places. The key method for achiveing it includes the creation of a single index identifier that can implicitly represent both latitude and longitude.

By inheriting the most important properties of quadtrees, GeoHash server as a great example of such a method that indeed achieves great performance in practice. The only weak side of it is the presence of edge cases when both objects are located on different sides of a large border separating them. Though they might negatively affect the search efficiency, edge cases do not appear that often in practice meaning that GeoHash is still a top choice for geoapplications.

In case if you are familiar with machine learning and would like to learn more about optimized ways to perform scalable similarity search on embeddings, I recommend you go through my other series of articles on it:

Similarity Search

7 stories

Resources

All images unless otherwise noted are by the author.

--

--

BSc in Software Engineering. Passionate machine learning engineer. Writer for Towards Data Science.