
A couple of years ago, I was working on a side project. I wanted to create a web app that recommends local gems, such as cafés, book stores, or hidden bars. The idea was to display all such points of interest within the user’s reach on a map. With hundreds of thousands of points in my dataset, I had to be clever about filtering the data points that are in a given range to the user. The naive approach is to calculate the distance between the user and each point of interest, and discard all points outside the specified range. Especially for big datasets like mine, this approach often leads to long processing times.
Surely, there had to be a better way as response time is important in interactive applications. This is when I came across the data structure R-tree. These trees are used for fast spatial access and search. Using an R-tree, I was able to quickly isolate points of interest close to the user’s location and display them on a map. This gave my web app a massive boost in response time – with just four additional lines of code!
In this article, I explain what R-trees are and how they work. This is illustrated with an example of street trees in New York City in the first two sections. The third section demonstrates how this data structure can be used in Python to speed up your geospatial data processing routines, too.
Learning R-trees by analyzing trees in New York City
Assume we were asked to analyze if there is a correlation in the neighborhoods of New York City and the health of its trees. The NYC Open Data Portal offers a street tree census dataset which includes species, diameter, perception of health and the geographic location of each tree.
To start with, we want to count the number of street trees in the Upper East Side. The pseudocode snippet below iterates through the dataset trees
and checks if a tree falls within the upper_east_side
boundary:
total_count, tree_count = 0, 0
for tree in trees:
total_count += 1
if upper_east_side.contains(tree):
tree_count += 1
print_results(num_tests=total_count, num_trees=tree_count)
>>> Total number of trees tested: 683,788
>>> Number of trees in Upper East Side: 8,807
We discovered that there are about 9k trees in the Upper East Side. However, we had to test a total of 684k trees to get there. The animation below visualizes that we test trees which are miles away from our target neighborhood and hence could be easily disregarded. But how can we exclude far away trees from expensive computations to achieve a significant performance gain?
One piece of information we virtually get for free is the bounding box of a polygon (it can be determined by the minimum and maximum values of its nodes). In addition, testing if a point falls within a rectangle is trivial and only requires four comparison operations (the point has to be greater than or equal to the bottom-left corner, and smaller than or equal to the top-right corner). For now, assume bounding_box
is a dataset which contains all trees in a tight rectangle around the Upper East Side (in the next section we learn how such a rectangle can be easily obtained). Taking this into consideration yields:
total_count, tree_count = 0, 0
for tree in bounding_box:
total_count += 1
if upper_east_side.contains(tree):
tree_count += 1
print_results(num_tests=total_count, num_trees=tree_count)
>>> Total number of trees tested: 10,768
>>> Number of trees in Upper East Side: 8,807
The right-hand side of the animation demonstrates that we now only test for potential candidates. Those are trees that are in immediate proximity to the polygon, i.e. points which fall within its bounding box. By disregarding far away trees we were able to reduce the number of tests from 684k to 11k – a factor of 60! In the next section we will see that R-trees make use of exactly this idea.


A data structure for spatial searches: the R-tree
R-trees are tree-based Data Structures for creating spatial indexes in an efficient manner. An R-tree is often used for fast spatial queries or to accelerate nearest neighbor searches [1]. A common use case might be to store spatial information of points of interest (e.g. restaurants, gas stations, streets, etc.) With the help of R-trees one can quickly retrieve all points of interest within a certain distance to a location. In return, these results can be displayed on a map or in a navigation system.
The basic idea of an R-tree is simple: leaf nodes of the tree hold spatial data, whereas a branching node corresponds to the minimum bounding box that contains all of its children. With this structure the R-tree partitions the space into rectangles which become more granular as the tree grows. This is illustrated in the example below.

An R-tree is queried for a rectangle, i.e. we want to retrieve all data that is contained in this search window. Remember that each non-leaf node corresponds to a bounding box that contains all of its children. To fulfill a search query we simply travel along the branches of the tree and follow the paths that intersect with the given rectangle until we reach the leaf nodes. These leaf nodes, and hence our data points, are contained in the search rectangle and fulfill the query. The animation below demonstrates that we can greatly reduce the number of search operations by disregarding entire branches which do not fit the search criteria.

R-trees in Python
The Python package Rtree
provides an implementation of the R-tree data structure and comes with a number of handy features, such as nearest neighbor searches, intersection searches or multi-dimensional indexes.
Rtree: Spatial indexing for Python – Rtree 0.9.4 documentation
We can conveniently install the package with Python’s package manager pip: pip install Rtree
.
Basics
Before we handle geometries like points or polygons, we cover the basic usage of the Rtree
package.
The index
module helps us to construct a spatial index. This index is built up automatically by inserting bounding boxes of our objects. The bounding boxes are defined by specifying their left, bottom, right and top coordinates. Note that we insert a bounding box together with an identifier (in the above example 0
and 1
). The ID will help us to identify the bounding box when performing queries:
The index is queried for a given rectangle, again specified by its left, bottom, right and top coordinates. The result of the intersection
method are the IDs of the objects that are contained within the search window (examples 1–3). The result is empty in the case that the search window is beyond the bounds of data we have in the index (example 4). Similarly, we use the nearest
method to find the k-nearest objects to a given search window:
Working with points, lines and polygons
In the previous section, we saw how an index is constructed by inserting bounding boxes of objects. We now want to continue by using points, lines and polygons for these objects. The package Shapely provides an easy way of working with these kind of geometries in Python:
Above, we first create a point, a line, and a polygon. Next, the bounding boxes of these objects are inserted into an index using IDs 0
, 1
, and 2
. We now query the index for different search windows:
The illustration below shows the geometries and search windows:

Searching all trees in the Upper East Side
We finally have everything needed to extract all trees within the Upper East Side! We will go through a code snippet below, however the full version can be found here.

First, we load all required geometries using the GeoPandas package:
Next, we create an R-tree index containing all trees in New York City:
Now, we generate a list of potential candidates, i.e. all trees that are within the bounding box of the Upper East Side:
Finally, we iterate through all potential candidates to extract the ones that are fully within the Upper East Side:
Conclusion
In this article, we learned how R-trees organize geographic information by partitioning the underlaying space into rectangles. This structure makes R-trees extremely fast for spatial lookups. In our New York City street tree example, utilizing an R-tree reduced the number of operations by a factor of 60. We also saw how to work with R-trees in Python. The speed-up in our example was achieved with just four lines of code: initializing the index (1 line), constructing the index (2 lines), and using the intersection
function to find nearby candidates (1 line).
So why are R-trees not used everywhere? While we gain time by reducing the number of search operations, we lose time by constructing the index. For the latter we literally have to iterate through the entire dataset. This makes R-trees not suitable for applications requiring only a small number of searches or applications where the index changes often (because of tree rebalancing).
R-trees have come a long way since their invention by Antonin Guttman in 1984. Nowadays, they are found in all sorts of applications, for example in computer graphics [2], video games [3], traffic control systems [4], and most prominently in databases for spatial data management [5]. And perhaps in your next Geospatial Data analysis, too!
References
[1] A. Guttman, R-Trees: A Dynamic Index Structure for Spatial Searching (1984), Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, p. 47–57
[2] D. Feldmann, Accelerated Ray Tracing using R-Trees (2015), Proceedings of the 10th International Conference on Computer Graphics Theory and Applications, p. 247–257
[3] A. Kinziabulatov, Optimizing R-tree inserts in Unity: a Bomberman-like example (2023), Medium
[4] Y. Manolopoulos, A. Nanopoulos, A. Papadopoulos and Y. Theodoridis, R-Trees: Theory and Applications (2006), Springer
[5] S. Bressan, J. Küng and R. Wagner, Database and Expert Systems Applications (2006), Springer
Datasets
New York City Department of Parks & Recreation, 2015 Street Tree Census – Tree Data (2016), NYC Open Data
New York City Department of City Planning, 2010 Neighborhood Tabulation Areas (NTAs) (2013), NYC Open Data