Geographic Clustering with HDBSCAN

How to explore geographic data with HDBSCAN, H3, graph theory, and OSM.

João Paulo Figueira
Towards Data Science


Photo by Martin Sanchez on Unsplash

Your smartphone knows when you are at home or the office. At least, mine does, and it can even tell me when to leave to get to one of my common destinations on time. We all accept that our smart devices collect information about our preferences and send it over to the cloud for processing. The results come back as recommendations for shopping, food, mating, and when to leave the office and head home.

What is the magic behind inferring a usual location? How do the cloud computers go about finding the “places” where we live our lives? The answer involves a collection of timestamped geographical areas and clustering.

In this article, I will illustrate the process of geographic clustering using the HDBSCAN [1] algorithm and the Vehicle Energy Dataset. Please refer to my previous post named “The Medium-Sized Dataset,” where I present the dataset and the details on how to handle it with an SQLite database. The present article shares the same GitHub repository and builds upon it to provide more features to the geographic data analysis.

The clustering approach draws from another article named “Mapping the UK’s Traffic Accident Hotspots,” where I used DBSCAN to help create geofences around the most frequently reported traffic accident areas in the UK. Here, I use an improved version of the famous density-based clustering algorithm to handle the complex clusters that naturally show up in urban environments, along with a very simple-minded but effective way of naming them using publicly available data.

The Vehicle Energy Dataset

In 2019 G. S. Oh, David J. Leblanc, and Huei Peng published a paper on the Vehicle Energy Dataset (VED) containing one year’s worth of vehicle trajectory and energy consumption data. These data were collected from November 2017 to November 2018 in Ann Arbor and refer to a sample of 383 vehicles of diverse types and power sources. Here, we will only use the geographic data present in the dataset and defer the analysis of energy dynamics to a future article.

The paper discusses an interesting approach to personal data de-identification and the study of use-cases related to fuel economy and energy efficiency. Most importantly, the survey collected over 374,000 miles of GPS data that we can use for this article’s geographic clustering purposes.

The study’s data collection process ensured driver anonymization through a relatively simple three-step approach. This process became quite relevant to this article because it produced, as a by-product, a critical piece of information: individualized vehicle trajectories. To anonymize the driver information, the study authors applied Random Fogging, Geo-fencing, and Major Intersection Bounding techniques. Random fogging removes observed locations near the start and the end of the trip, while geo-fencing clips observations outside a bounding box defined around the city’s boundaries. The authors also clipped trips around the first and last intersections. Besides driver anonymization, these procedures also have the benefit of producing individual trajectories that are readily usable.

I finished the previous article with an illustration of how to extract and display such trajectories. Here, I will use all the trajectories’ endpoints to identify places of interest for future analysis. Unfortunately, these will refer to the most used intersections in Ann Arbor, not the drivers’ final destinations themselves, as those are not present in the data. But this is enough to illustrate the process of building the geographic clusters defined by all the trajectories’ endpoints.

Before we start, I invite you to clone the GitHub repository of this article and run the first four numbered notebooks. Each of the following sections details one of the next three Jupyter notebooks, leading you to the endpoint: cluster naming.

Clustering with HDBSCAN

Follow this section in the notebook 5-clustering-hdbscan.ipynb

Running the four above-mentioned notebooks will get you ready to start the geographic clustering process. The trip endpoint information we seek lives in the move table of our SQLite database, and we can read both trip start and end locations with a single query.
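The snippet below is a minimal sketch of that query. The column names (move_id, lat_ini, lon_ini, lat_end, lon_end) are assumptions for illustration and may differ from the actual move table schema in the repository.

import sqlite3
import pandas as pd

# Read every trip's start and end location in one query.
# Note: the column names below are assumed, not necessarily the repository's actual schema.
conn = sqlite3.connect("ved.db")
endpoints = pd.read_sql_query(
    "select move_id, lat_ini, lon_ini, lat_end, lon_end from move", conn)
conn.close()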

Once the data is available, we can rearrange it into an array with only two columns, one for latitude and the other for longitude.
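Assuming the columns sketched above, the rearrangement boils down to stacking the start and end coordinates into a single NumPy array.

import numpy as np

# Stack start and end locations into one (N, 2) array of [latitude, longitude] rows.
starts = endpoints[["lat_ini", "lon_ini"]].to_numpy()
ends = endpoints[["lat_end", "lon_end"]].to_numpy()
locations = np.vstack((starts, ends))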

We can now feed the location information to the HDBSCAN algorithm, the output of which is an array of cluster identifiers. The following function handles all the work.
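Here is a minimal sketch of such a function, assuming the utm and hdbscan packages; the min_cluster_size value is illustrative, not necessarily the one used in the notebook.

import numpy as np
import utm
import hdbscan

def fit_utm_clusterer(locations, min_cluster_size=50):
    # Project (latitude, longitude) degrees to UTM meters before clustering.
    utm_locations = np.array([utm.from_latlon(lat, lon)[:2]
                              for lat, lon in locations])
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    clusterer.fit(utm_locations)
    return clusterer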

Note the requirement for converting the geographic coordinates from degrees to meters. Now we can call this function on the endpoint coordinates and collect the cluster indices in the same order.

clusterer = fit_utm_clusterer(locations)

The resulting array has the same size along the first dimension as the location array, and the calculated cluster identifiers sit at the corresponding indices, making querying very easy. The following function displays a cluster on a map, given its id. Here, points define the area of interest, not a geofence.
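A simplified sketch of such a display function, using folium, could look like the one below; the styling choices are illustrative rather than the notebook’s exact ones.

import folium

def show_cluster(cluster_id, locations, clusterer):
    # Select the points assigned to this cluster and center the map on their mean location.
    points = locations[clusterer.labels_ == cluster_id]
    cluster_map = folium.Map(location=points.mean(axis=0).tolist(), zoom_start=16)
    for lat, lon in points:
        folium.CircleMarker(location=[lat, lon], radius=3, fill=True).add_to(cluster_map)
    return cluster_map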

You can see this function at work in notebook number five, coupled with an interactive widget to ease cluster selection. Using this setup, you can quickly inspect the quality of the generated clusters and possibly adjust some of the HDBSCAN algorithm parameters.

Handling Outliers

One of the issues that will no doubt stand out is the poor quality of some clusters. Some of the clustered points show up very far away from the principal agglomeration, and these should be handled by marking them as noise. The next picture shows a blatant example of such a phenomenon.

Cluster number 2 displays a distinct set of outlying points to the northeast. The outlier score for each point is reflected in its color, with blue points having a low score and red points a high score.

Fortunately, the HDBSCAN algorithm provides us with a means to handle such points, the outlier score. The outlier score ranges from zero to one, where zero means that the algorithm is pretty sure that the location belongs to the cluster, while a value close to one says the opposite. In the picture above, the calculated outlier score is converted into a color swath ranging from blue, representing zero, to red, meaning one. Looking at the map, it is clear the red points are noise and do not belong to the central cluster.

Still, we must come up with a threshold to distinguish outliers from regular points. My approach was quite visual: I browsed the whole set of cluster points, looking for a reasonable threshold value. Points with outlier scores above that threshold are converted into noise, while all the others are retained. It turned out that red locations would always be in the oddest of places, and red means an outlier score above 0.8.

The selection of such a value seemed arbitrary, so I decided to provide it with some numerical support. I was trying to determine whether or not such a threshold would remove too many points. A brief look at the whole dataset outlier score distribution reassured me.

By selecting an outlier score threshold of 0.8, we retain over 97.85% of the cluster points.

Numerically, this calculation is quite simple and yields a relatively high value of 97.85%.

# fraction of points whose outlier score falls below the 0.8 threshold
scores = clusterer.outlier_scores_
scores[scores < 0.8].shape[0] / scores.shape[0]

What if you want to be a bit more conservative in your clustering quality and retain a bit less, say 95% of the points? This calculation is quite easy to perform, as well.

# outlier score threshold that retains 95% of the cluster points
pd.Series(scores).quantile(0.95)

It turns out that if you keep just 95% of your cluster points, you can set the outlier threshold at roughly 0.727. Let’s keep the 0.8 value and filter out the outlying locations.

filtered_clusters = np.copy(clusterer.labels_)
filtered_clusters[scores >= 0.8] = -1

As you can see, filtering is a simple matter of marking all outliers as noise. We can now save the results to the database. For each cluster point, we store the corresponding cluster id (which may be -1 for outliers), the geographic location, and the level 12 H3 index. As we will see below, this will cause us a bit of trouble that will be fun to solve.
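As a sketch of what that storage step might look like (the cluster_point table and its columns are assumptions, not the repository’s actual schema):

import sqlite3
import h3

conn = sqlite3.connect("ved.db")
rows = [(int(label), float(lat), float(lon), h3.geo_to_h3(lat, lon, 12))
        for (lat, lon), label in zip(locations, filtered_clusters)]
# The cluster_point table below is hypothetical.
conn.executemany("insert into cluster_point "
                 "(cluster_id, latitude, longitude, h3) values (?, ?, ?, ?)", rows)
conn.commit()
conn.close()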

Clusters already carry a lot of value because they allow us to realize where, geographically, trips start and end. Nevertheless, we can go a little further and devise a way to draw a geofence around the cluster points. This geofence would work as a discrimination function to test whether a location belongs to a cluster or not.

Cluster Geofencing

Follow this section in the notebook 6-cluster-geofencing.ipynb

To address the geofencing issue, I will resort to H3 again. As I showed in a previous article, we can use H3 to depict clusters graphically. For each H3 index, we can query the corresponding hexagon vertices in terms of their geographic coordinates. The following function converts an H3 hexagon index into an array of the corresponding vertices’ coordinates. Note how the first coordinate is duplicated at the end to produce a closed polygon.
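A minimal version of this conversion, using the h3-py version 3 API, might read as follows.

import numpy as np
import h3

def h3_to_polygon(hex_index):
    # Get the hexagon's vertices and close the polygon by repeating the first vertex.
    boundary = list(h3.h3_to_geo_boundary(hex_index))
    boundary.append(boundary[0])
    return np.array(boundary)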

By pasting together all the hexagons of a cluster, we can build a hive-like cluster representation.

By pasting all hexagons together, we create a hive-like representation of the cluster. Nevertheless, we want to keep the outline only.

The picture above helps us understand how a hash-based cluster inclusion detection algorithm would work. Each of those hexagons corresponds to an H3 index of level 12. To test whether a random point belongs to the cluster, we only have to convert its coordinates into an H3 index and compare it to the cluster’s list. This computation is very fast and lends itself nicely to implementation with a database index or an in-memory hash table.
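In code, the inclusion test reduces to a set lookup. In the sketch below, cluster_hexes stands for the cluster’s set of level 12 H3 indexes, assumed to have been loaded from the database.

import h3

def in_cluster(lat, lon, cluster_hexes):
    # A point belongs to the cluster if its level 12 H3 index is in the cluster's set.
    return h3.geo_to_h3(lat, lon, 12) in cluster_hexes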

We will keep this hash-based search feature but will improve on the representation by retaining just the overall outline of the polygon. We do so by merging all the polygons with the help of the Shapely package.

from shapely.ops import cascaded_union

merged = cascaded_union(polygons)

We feed the list of hexagonal polygons to the cascaded union function. It returns either a single Polygon object, for the simple cases as above, or a MultiPolygon object when the placement of the hexagons produces scattered “islands.” Here is the result we want.

The same cluster as above but now with all the hexagons merged into a single shape.

But not all clusters will coalesce into a single shape. Due to the point dispersion and the fixed H3 level used for representation, we may see some disconnected areas, like the one below.

These islands show up due to the point dispersion and the fixed H3 level used to represent the hexagons.

The solution to this problem seems quite simple: we have to place another hexagon somewhere to connect both components, which reminded me of connected components in a graph. Maybe we can use some graph theory to solve this.

We can conceive a graph representing our cluster hive by considering the hexagons as the nodes and the neighboring relationship as the edges. If two hexagons are neighbors, there will be an edge connecting them. The function below does precisely this.
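A sketch of that function, using NetworkX and the h3-py version 3 neighbor test, might look like this.

import networkx as nx
import h3

def hexagons_to_graph(hexagons):
    # One node per hexagon, one edge per pair of adjacent hexagons.
    g = nx.Graph()
    g.add_nodes_from(hexagons)
    for i in range(len(hexagons) - 1):
        for j in range(i + 1, len(hexagons)):
            if h3.h3_indexes_are_neighbors(hexagons[i], hexagons[j]):
                g.add_edge(hexagons[i], hexagons[j])
    return g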

The function receives a list of hexagon indexes as its input parameter and scans all possible pairs for immediate vicinity. The resulting neighborhood graph is undirected, so the function only needs to test half of all possible pairs.

Once generated, we can then query the graph to know how many connected components it has. Visually, the result is as depicted in the following image.

The above cluster represented as a graph. Note the two obvious components, and how the nodes in the larger component connect. Each edge is a neighbor relationship. To link the lonely node to the larger component, we will have to add at least another node (hexagon).

The NetworkX package confirms our intuition with a simple query.

nx.number_connected_components(g)

This query returns the value two, and we can even query what the components are.

list(nx.connected_components(g))

The query yields a list with a set of nodes for each connected component. In this case, the result is as follows.

[{'8c274994cbb45ff'},
{'8c274994c85a3ff',
'8c274994c85a7ff',
'8c274994c85b1ff',
'8c274994c85b3ff',
'8c274994c85b5ff',
'8c274994c85b7ff',
'8c274994c85bdff',
'8c274994cbb49ff',
'8c274994cbb4bff',
'8c274994cbb51ff',
'8c274994cbb59ff',
'8c274994cbb5bff',
'8c274994cbb5dff'}]

Now, we can iterate over the two components and try to find the shortest bridge between them. Fortunately, H3 comes to our rescue with a function that “draws lines” between two hexagons. The function below uses this service to find one of the shortest possible such lines between any two hexagons from two different connected components.
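A hedged sketch of that search, built on h3.h3_line from the h3-py version 3 API, could be:

import h3

def find_bridge(components):
    # Enumerate every line between the first two components and keep the shortest one.
    comp0, comp1 = list(components[0]), list(components[1])
    best = None
    for h0 in comp0:
        for h1 in comp1:
            bridge = h3.h3_line(h0, h1)
            if best is None or len(bridge) < len(best):
                best = bridge
            if len(best) == 3:
                # Two endpoints plus one hexagon in between is the shortest possible bridge.
                return best
    return best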

This function iterates over the first two connected components taken from the input list and enumerates all possible “bridges” between them. The selected bridge is either the shortest of the enumeration or the first with only three elements.

Once we have our bridge, we can build it between the two components, generate a new graph, and iterate: compute the new list of connected components and, in case there is still more than one, rerun the whole thing.

Here’s the whole process.
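What follows is a simplified sketch of that loop, reusing the helpers sketched above; insert_hexagon is a hypothetical stand-in for the notebook’s database insert, and hexagons holds the cluster’s level 12 H3 indexes, assumed to have been loaded beforehand.

# hexagons: the cluster's level 12 H3 indexes, assumed loaded from the database.
g = hexagons_to_graph(hexagons)
components = list(nx.connected_components(g))

while len(components) > 1:
    bridge = find_bridge(components)
    inner = bridge[1:-1]  # the bridge endpoints already belong to the components
    for hex_index in inner:
        insert_hexagon(cluster_id, hex_index)  # hypothetical database helper
    hexagons.extend(inner)
    g = hexagons_to_graph(hexagons)
    components = list(nx.connected_components(g))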

Note that we are only keeping the inner hexagons from the bridge, as the outer ones already exist in the connected components. The code above also inserts the new cluster points in the database, as it progresses.

We are now sure that all our clusters are connected and stored in the database. So what about naming them?

Cluster Naming

Follow this section in the notebook 7-cluster-names.ipynb

The final challenge in this article is to derive the cluster names without using a reverse-geocoding service. We will use the OSM Overpass API to query geographic data and use it to assign names. Each query requires a geographic bounding box, so we start by computing one for each cluster from the points stored in the database.

Querying Overpass is now possible, but we need to make additional provisions to avoid burdening the public API, so we cache all the returned data. This strategy also allows us to avoid issuing repeated calls in case of error. To that end, we create a dedicated directory to store all the collected data as JSON-formatted files. Note that if you decide to go back and tune the cluster definitions, you will have to erase all the cache files, as the cluster numbers will probably change.

This function retrieves the OSM data associated with a cluster while abstracting away the file cache.
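The sketch below illustrates the idea with the requests package. The Overpass query, which pulls named highways inside the cluster’s bounding box, and the cache location are assumptions, not necessarily what the notebook uses.

import json
import os
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
CACHE_DIR = "overpass_cache"  # assumed cache directory

def get_cluster_osm(cluster_id, bounding_box):
    # Return cached data when available; otherwise query Overpass and cache the result.
    cache_file = os.path.join(CACHE_DIR, "cluster_{0}.json".format(cluster_id))
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    south, west, north, east = bounding_box
    query = ('[out:json];'
             'way["highway"]["name"]({0},{1},{2},{3});'
             'out geom;').format(south, west, north, east)
    response = requests.get(OVERPASS_URL, params={"data": query})
    response.raise_for_status()
    data = response.json()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_file, "w") as f:
        json.dump(data, f)
    return data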

We can now iterate over the cluster collection and calculate the associated names.
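Below is a much simplified stand-in for that step: instead of building a full OSM network object as the notebook does, it votes directly on the raw Overpass geometry. The value of K and the helper names (cluster_points_by_id, bounding_boxes) are assumptions for illustration.

from collections import Counter
import numpy as np
from scipy.spatial import cKDTree

def name_cluster(cluster_points, osm_data, k=5):
    # Collect the named way vertices returned by Overpass.
    names, coords = [], []
    for way in osm_data["elements"]:
        street = way.get("tags", {}).get("name")
        if not street:
            continue
        for node in way.get("geometry", []):
            names.append(street)
            coords.append((node["lat"], node["lon"]))
    # Vote on the street names of the k nearest vertices to each cluster point.
    tree = cKDTree(np.array(coords))
    votes = Counter()
    for point in cluster_points:
        _, idx = tree.query(point, k=k)
        votes.update(names[i] for i in np.atleast_1d(idx))
    return " / ".join(name for name, _ in votes.most_common(2))

cluster_names = {}
# cluster_points_by_id and bounding_boxes are hypothetical containers loaded from the database.
for cluster_id, points in cluster_points_by_id.items():
    osm_data = get_cluster_osm(cluster_id, bounding_boxes[cluster_id])
    cluster_names[cluster_id] = name_cluster(points, osm_data)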

There are two notable aspects of the notebook’s code. First, it builds each OSM network object using an auxiliary function that converts the JSON-encoded data. Second, the cluster name is a function of the set of points that defines the cluster itself: the naming function calculates the K nearest locations to each of the input points and uses a simple voting mechanism to retrieve the two most common street names. The final cluster name is a mere concatenation of these two, as we are naming street intersections.

A named cluster at the intersection of Packard Road and Carpenter Road.

Conclusion

In this article, I have shown a few techniques to handle geographic clusters, starting with the clustering process itself, followed by the geofencing process, and finishing with a lightweight geocoder.

There is a lot more to explore with these data, and I intend to do so in future articles. One of the most exciting patterns to uncover will surely be the routes vehicles take when traveling between two clusters. Stay tuned.

References

[1] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density-based clustering. Journal of Open Source Software, The Open Journal, volume 2, number 11, 2017

[2] Ester, M., Kriegel, H.P., Sander, J., Xu, X. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD, 226–231

[3] Vehicle Energy Dataset (VED), A Large-scale Dataset for Vehicle Energy Consumption Research

Related Articles

The Medium-Sized Dataset

Mapping the UK’s Traffic Accident Hotspots

Geospatial Indexing with Uber’s H3

YouTube

Lightning Talk: Clustering with HDBSCAN
