Trajectory Queries Using Space Partitioning

How can we quickly find overlapping trajectories?

João Paulo Figueira
Towards Data Science

--

Photo by Jens Lelie on Unsplash

While traveling through space, an object describes a trajectory. We can think about a trajectory as a function of time that outputs positions in space. Conceptually, trajectories are continuous functions, although we pragmatically use their discrete versions. A discrete trajectory is a time-ordered collection of points in space where we implicitly assume a linear interpolation between each point. This representation makes storing discrete trajectories in databases very easy and intuitive.

This article presents a method to efficiently query trajectories from relational databases using a space partitioning mechanism. Here, we go about this using square partitions and the Extended Vehicle Energy Dataset¹ (EVED) I used in a previous article.

The EVED contains a collection of over thirty-two thousand such trajectories in Ann Arbor, Michigan, collected during one full year from November 2017 to November 2018. The EVED paper authors enhanced the dataset with several extra features, namely the map-matched GPS coordinates. Map matching is a process that infers which roads were traveled by a vehicle using its noisy GPS data stream. This process requires access to a digital map of the area, and for this dataset, the paper authors used the OpenStreetMap public data and the Valhalla map-matching engine.

The map-matching process “snaps” each sampled GPS location into the most likely street or road. This process uses sophisticated probabilistic algorithms that infer the most likely map route from the sequence of sampled locations. You can see evidence for this in the source dataset, where each sample contains two locations, the noisy and the map-matched ones. In this article, we will only use the second set.

Figure 1 below illustrates how the map-matching process snaps the sampled GPS location to the most likely road network edge.

Figure 1 — The map-matching process snaps the noisy GPS measurement to the most likely road network edge. Here you can see a depiction of the process where the circles denoted with “n” represent the road network nodes, and the arrows named “e” are the directed edges. The sampled GPS location in green matches another one along the arc and gets recorded in the database. There is no information on the matched edges, though. (Image source: Author)

Note that the dataset only contains the matched points; it was not extended to include the matched edges. The absence of edge information makes the querying process harder but interesting.

Querying a Trajectory

What does it mean to query a trajectory? In the context of this article, querying a trajectory means using an input trajectory to query a database of trajectory data. We start with a query trajectory, possibly acquired from a map provider, and then use it to retrieve specific information from the database. In this case, we will retrieve the full trajectories or the intersecting trajectory segments.

Figure 2 — A trajectory is a collection of trajectory segments, represented above as arrows. The arrow vertices are the sampled GPS locations. The trajectory segments need not necessarily coincide with the existing road network. (Image source: Author)

A map-matched trajectory ensures that its nodes live in the road network edges but do not have to coincide with the road network nodes.

We can implement travel time estimations with the query results, for instance. If the queried trajectories contain energy consumption information, then we can use those to predict the consumption for a non-observed trajectory. You can think of plenty more use cases for trajectory queries.

Problem

The problem I want to solve is to query the dataset for the trajectory segments or full trajectories that match an arbitrary query trajectory over the covered region. A trajectory segment is a geodesic that connects two sequential locations on a given trajectory. A trajectory is a collection of consecutive trajectory segments, as shown in Figure 2. The trajectory query should retrieve the trajectory segments that best match the input and exclude the ones with the reverse direction.

Solution

My solution here uses quadkeys to segment the geographic space for faster querying and builds on the work developed in a previous article where I used a similar technique. Please refer to that material as an introduction to the present discussion.

The solution I devised requires expanding the information in the database with the trajectories discretized as quadkey lines. We later use these quadkey indices to match the quadkey-discretized trajectory query.

Let us see how the solution works.

Data Preparation

If you followed the process from the previous article, we created a supporting SQLite database containing a single table with all the sampled signals. We need to create three new support tables in the existing database to support faster trajectory indexing.

The first table models trajectories, which is quite easy to do. The original data already contains information on trajectories uniquely identified by vehicle and trip identifiers. The code in Figure 3 shows the simple table structure.

Figure 3 — The SQL code above creates the table to contain all recorded trajectories. (Image source: Author)

Figure 4 below shows how to populate the trajectories table from the data in the existing signals table.

Figure 4 — The code above shows how to populate the trajectories table from the signals. (Image source: Author)

We create a link for each pair of distinct locations in a trajectory (a trajectory segment). Each link contains information on the identifiers for the signal pair and the bearing we calculated in the previous article. Figure 5 below shows the table structure.

Figure 5— The link table contains information on the identifiers for the containing trajectory and the signal endpoints. The bearing is the same as the final signal’s, copied here for performance reasons, as it avoids an extra join. (Image source: Author)

Finally, we create the table containing individual quadkeys for the link endpoints' lines. Figure 6 below shows the SQL code to create that table.

Figure 6 — Each row in the table above contains a single integer quadkey code corresponding to a line pixel. When querying for trajectory links, we use this quadkey code as a match target. (Image source: Author)

Note that we will create an index on the quadkey code column to enable the trajectory queries. This indexed column will match the query trajectory to the trajectory links and possibly the trajectories themselves.

The final step of the data preparation process is also the lengthiest, as it entails enumerating all the trajectories, determining all the links, and finally drawing lines between their endpoints. Figure 7 below shows the code that populates these two tables.

Figure 7— The function above populates the links table and the corresponding “pixel” lines. Each “pixel” is a quadkey code indexed for later querying. (Image source: Author)

The function starts by loading all the database trajectories into a tuple list containing the trajectory, vehicle, and trip identifiers. Figure 8 below shows the simple code that loads all trajectories.

Figure 8 — The code above loads all trajectories from the database. (Image source: Author)

Next, we iterate the trajectory list and process each item, starting by loading the trajectory’s points (line 7 in Figure 7). Here we need to address the issue of contiguous repeating locations in the database. This situation probably occurs due to the different sampling periods of the generated signals between GPS location changes. Figure 9 below shows how to load the trajectory’s points under these circumstances.

Figure 9 — The code above loads all the points in a trajectory, making sure there are no repeated locations. (Image source: Author)

Now that we have all the unique trajectory points, we iterate each consecutive pair (Figure 7, line 10) and create the corresponding link. Remember that a link connects two consecutive vehicle locations in the database. The code in Figure 10 below shows how to insert a new link and retrieve the newly generated identifier.

Figure 10 — We reuse the previously calculated bearing value when inserting a new link. After inserting the latest data, the function above returns the newly generated identity value for posterior use. (Image source: Author)

After inserting the link, we can draw the quadkey line between the endpoints using the same approach as in the previous article. Using the analogy between a quadkey and a bitmap pixel, we can easily draw a line that connects any two points (Figure 7, lines 16 to 18). The generated line is a simple list of quadkey codes that we insert into the database. The code in Figure 11 below illustrates the process.

Figure 11 — The code above bulk inserts quadkey codes into the database. (Image source: Author)

Once done, we need to create some supporting indexes, and we are ready to start querying the database for trajectories. Please refer to the full source code in the GitHub repository file.

Approaches to Querying a Trajectory

Querying involves discretizing the query trajectory into a set of quadkey codes and then matching them to the previously discretized link quadkeys. We will approach this issue from two different perspectives. The first perspective involves using an arbitrary trajectory as the query, while the second uses an existing trajectory in the database. For instance, this last approach might drive a trajectory clustering algorithm.

Please follow this discussion using the published Jupyter notebook and the supporting library code.

Querying Using an Arbitrary Trajectory

We start with the first approach and define an arbitrary query using two random addresses. Figure 12 below shows the query trajectory we will be using.

Figure 12 — The picture above depicts the random query trajectory calculated from geocoded addresses. The darker blue dots represent the road network nodes. (Image source: Author using Folium and OpenStreetMap imagery)

We generate the above route using the simple code snippet from Figure 13. First, we instantiate an object that helps us query an arbitrary trajectory defined over an existing road network. Here, we use the definitions for Ann Arbor, Michigan. Next, we generate the query route using two random addresses from the same city. The generated route consists of a list containing the corresponding road network nodes. Note that these locations do not necessarily match the ones in the database.

Figure 13 — The code above loads Ann Arbor’s road network and generates a route from two arbitrary addresses. Note that the graph route object retains a copy of the function return for future use. (Image source: Author)

As I previously stated, querying an arbitrary trajectory involves discretizing it into quadkeys for a posterior database search. The function listed in Figure 14 shows how to go about this. The function generates a quadkey line along with the corresponding bearing for each pair of road network nodes. We will use this list to query the trajectory database by performing an exact quadkey match and an approximate bearing comparison.

Figure 14 — The function above converts the generated route to a list of quadkey and bearing pairs. (Image source: Author)

The next step is to compare the list of generated route quadkeys and bearings to the trajectory records in the database. The function listed in Figure 15 below does exactly that and returns the link and corresponding trajectory identifiers. The function also returns signal identifiers for the corresponding range for convenience and later use.

Figure 15 — For each query trajectory quadkey and bearing pairs, we issue a query to the links table, returning the matching links and the trajectory identifiers. (Image source: Author)

Using the overlapping link identifiers, we can complement the map in Figure 12 with the matching trajectory segments from the database (see Figure 16). Note that we only filtered these on the bearing but could also have extended these queries to the time of day or day of the week, such as in the previous article.

Figure 16 — The picture above shows in red all the trajectory links in the database that match the query trajectory. We can use these links’ properties to infer the query trajectory’s properties. (Image source: Author using Folium and OpenStreetMap imagery)

We can now extend the exercise and display the full trajectories that best match the query trajectory. To qualify the match quality, I used the Jaccard Similarity metric to measure how similar the quadkey sets are. Figure 17 below shows how easy it is to implement this metric using Python sets.

Figure 17 — The code above shows a simple implementation of the Jaccard Similarity. (Image source: Author)

To calculate this similarity, we must load all the matched trajectories’ quadkeys and compare them to the query using the Jaccard Similarity. The code below in Figure 18 depicts the process.

Figure 18 — To calculate the trajectory similarity to the query, we must first load the quadkeys from the database and compare both sets, as depicted above. The function returns a list of tuples containing the trajectory identifier and the Jaccard Similarity to the query trajectory. (Image source: Author)

Armed with this information, we can now filter the trajectories with the lowest similarities. In this case, we keep the top 5%.

Figure 19 — The code above uses the results from the previous function to select the top 5 percent of best-matching trajectories. (Image source: Author)

The resulting map in Figure 20 below shows the top 5% of matching trajectories.

Figure 20 — The map above shows the top 5% of best trajectory matches. (Image source: Author using Folium and OpenStreetMap imagery)

Querying Using a Database Trajectory

The second query approach involves using an existing database trajectory as the query. Using this approach, we do not have to resort to a road network and use the database as the only source of information. To make matters even better, we only need one workhorse function to retrieve the matching links. Figure 21 below shows the SQL query that allows us to retrieve links and trajectories that match the query trajectory. Note that this function returns the links corresponding to the query trajectory by design. As you will see, the consuming functions allow filtering of the query trajectory data from the response.

Figure 21 — The somewhat complex database query depicted above retrieves all links and respective trajectories that match the given query trajectory. (Image source: Author)

Once we get the result from the function above, we can determine the best matching trajectories. This process involves calculating the Jaccard Similarity between the trajectory query quadkey set and all the other matched trajectories’ quadkey sets. The function depicted in Figure 22 below does exactly that and extracts the top best-matching trajectories using a percentile ranking.

Figure 22 — The code above retrieves the top matching trajectories. The function allows for a top percent cutoff point as well as the exclusion of the query trajectory. Note how the process uses the Jaccard Similarity function to rank the selected results. (Image source: Author)

Figure 23 below displays a map with the query trajectory and top 5% best-matching trajectories.

Figure 23 — The map above shows the query trajectory in blue and the top 5% matching trajectories in red. Note that the resulting trajectories display in full. (Image source: Author using Folium and OpenStreetMap imagery)

We can alternatively display the matching links only, which we can query using the function in Figure 24 below.

Figure 24 — To get all the matching links for a query trajectory, we only need to extract the link identifiers from the function in Figure 21. (Image source: Author)

The map in Figure 25 below shows all the matching trajectory segments for the same query trajectory as in Figure 23.

Figure 25 — The map above shows the matching trajectory segments for the same query trajectory as in Figure 23. (Image source: Author using Folium and OpenStreetMap imagery)

Conclusion

Querying trajectories from databases is a relatively complex process that usually involves specialized indexing and supporting logic. This article discussed a method for querying trajectories from a database through space discretization using quadkeys. The square quadkey partitioning simplifies spatial querying by reducing the trajectories to sets of integers. This method involves discretizing trajectories to quadkey lines and the subsequent match using the trajectory segment bearings. There is no need for specialized geospatial indices, so simple database engines are sufficient.

Related Articles

References

Bing Maps Tile System — Bing Maps | Microsoft Learn

GitHub Repository

gsoh/VED: VED (Vehicle Energy Dataset): A Large-scale Dataset for Vehicle Energy Consumption Research. (IEEE Transactions on Intelligent Transportation Systems, 2020) (github.com)

Extended vehicle energy dataset (eVED): an enhanced large-scale dataset for deep learning on vehicle trip energy consumption

zhangsl2013/eVED: An extended version of the VED dataset, which is a large-scale dataset for vehicle energy analysis. (github.com)

Notes

  1. The dataset is licensed under the Apache 2.0 License (see the VED and EVED GitHub repositories). Note that this also applies to derivative work, such as the database built on this article.

João Paulo Figueira works as a Data Scientist at tb.lx by Daimler Trucks and Buses in Lisbon, Portugal.

--

--