Clustering algorithms are among the most widely used solutions in the data science world, with the most popular ones grouped into distance-based and density-based approaches. Although often overlooked, density-based clustering methods are interesting alternatives to the ubiquitous k-means and hierarchical approaches.
Some of the most famous density-based clustering techniques include DBScan (Density-Based Spatial Clustering of Applications with Noise) and Mean-Shift, two algorithms that group observations based on how densely the data points are packed together.
In this blog post, we’ll explore DBScan, a clustering algorithm that is particularly useful when your data has some of the following characteristics:
- Clusters have an irregular shape, for example a non-spherical one.
- Compared with other methods, DBScan doesn’t make any prior assumption about the underlying distribution of the data.
- Your dataset contains some relevant outliers that shouldn’t influence how the clusters’ centroids are mapped.
If these three sentences were confusing to you, don’t worry! In this post, we’re going to see a step-by-step implementation of the DBScan method while discussing the topics above. We’ll also check the famous sklearn Python implementation!
Also, if you would like to drop by other posts of my Unsupervised Learning series, you can check:
Let’s dive in and understand how DBScan works!
Fitting Distance-Based Clustering Solutions
In this step-by-step playbook, we’ll use a toy dataset with information about customers. To make things easier to grasp, we’ll cluster on just two variables.
Let’s imagine that we run a shop and have demographic information about our customers. We would like to run campaigns based on their annual income and age, and we only want to run 3 or 4 distinct marketing campaigns.
First, we start by plotting these two variables in a scatter plot:
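If you want to follow along, here’s a minimal sketch of how such a plot could be produced. The customers DataFrame is a stand-in for the toy dataset; its column names and values are made up for illustration:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data -- the exact values are made up for illustration
customers = pd.DataFrame({
    'Age': [22, 25, 27, 30, 34, 41, 45, 47, 50, 52, 58, 63, 70],
    'Annual Income': [35, 40, 38, 95, 100, 98, 97, 99, 30, 92, 90, 25, 88],
})

plt.scatter(customers['Age'], customers['Annual Income'])
plt.xlabel('Age')
plt.ylabel('Annual Income')
plt.show()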
For now, imagine that it is reasonable that someone makes $120 per year (we’re just using this scale to keep Age and Income in similar ranges, as that will make the explanation more intuitive).
If you look at the data, how many different groups of customers do you see here? 2? 3? 4? This is an open question that we can’t really answer, because there is no label or correct answer.
We can argue that we have three different groups, by the way that the customers are scattered throughout the plot:
- One scattered cluster on the left side of the plot.
- A group of 5 to 6 customers on the center.
- A third group of customers on the right side of the plot.
But, this is just a guess. Maybe k-means can help us! Let’s fit a k-means algorithm with 3 clusters:
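A minimal sketch of that fit with scikit-learn, assuming the customers DataFrame from above:

from sklearn.cluster import KMeans

# Fit k-means with 3 centroids and store the assigned cluster labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers['cluster_kmeans'] = kmeans.fit_predict(customers[['Age', 'Annual Income']])

For the 4-centroid solution discussed next, just swap n_clusters=3 for n_clusters=4.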
Although an interesting solution, it seems that our clustering is not picking up some patterns and is being pulled away by outliers. What’s the result if we fit a solution with 4 centroids?
Again, the k-means solution is being pulled away by outliers. Will Hierarchical Clustering help? Let’s see:
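Here’s a sketch of that fit using scikit-learn’s AgglomerativeClustering, which defaults to the Ward linkage:

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that least increases total variance
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
customers['cluster_hier'] = hierarchical.fit_predict(customers[['Age', 'Annual Income']])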
Using the Ward method, our result is similar to K-means clustering (with 3 centroids). Let’s change the linkage of the Hierarchical Clustering solution and see the results:
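Changing the linkage is a one-line tweak; single linkage is just one possible alternative here:

# 'single' linkage merges clusters based on their two closest points
hierarchical_single = AgglomerativeClustering(n_clusters=3, linkage='single')
customers['cluster_hier_single'] = hierarchical_single.fit_predict(customers[['Age', 'Annual Income']])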
Uh-oh, even worse! Changing the linkage method was also a bad option.
The crux of this exercise is that certain datasets are simply not a fit for distance-based clustering. People often rush to use these types of clustering models because they are very interpretable, but these solutions work best when the clusters are spherical and well defined.
Demystifying DBScan
Now that we’ve seen how distance-based clusters may not be a fit for certain types of data, let’s explore why DBScan is different.
DBScan starts by selecting a random point in our dataset:
After picking the data point, we draw a circle around this data point with a certain radius.
The radius of the circle built around the data point is a hyperparameter of the DBScan algorithm.
For our example, we’ll draw a circle with a radius of 9 units:
How many data points does the radius touch? In this case, our radius touches 9 data points (including the chosen data point itself). This number is related to another hyperparameter of the DBScan algorithm: min_samples.
If the radius touches a number of data points greater than or equal to min_samples, we consider the chosen data point a core point!
min_samples is another hyperparameter of the DBScan algorithm
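To make the core-point test concrete, here’s a small NumPy sketch. X is assumed to be the two-column array of Age and Annual Income values:

import numpy as np

def is_core_point(X, index, eps, min_samples):
    # Euclidean distance from the chosen point to every point in X
    distances = np.linalg.norm(X - X[index], axis=1)
    # Count neighbours inside the radius (the point itself is included)
    within_radius = (distances <= eps).sum()
    return within_radius >= min_samples

# Example call: is_core_point(X, 0, eps=9, min_samples=9)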
Being considered a core point gives significant power to the data point:
- The data point belongs to a cluster.
- The data point is able to call other data points from within the radius into the cluster.
After testing if the data point is a core point, we move to another point within the boundary:
We’ve just moved to the data point right above our first one. The next question is, naturally: is this also a core point?
You know how to do this test! Let’s start by drawing a boundary around the new data point:
Note: This circle has the exact same radius as the one we’ve seen before.
So... is this a core point? No! This data point’s radius only touches 8 data points, fewer than the min_samples hyperparameter we’ve defined.
We call this a satellite data point. Although it belongs to the cluster (because it has been "summoned" by another data point), it will not have the right to call other data points within its boundary to the cluster. This is the main difference between core and satellite points.
So far, we’ve met two types of data points in the DBScan solution:
- Core points: belong to a cluster and have the power to bring other points into it.
- Satellite points: belong to a cluster but don’t have the power to bring other data points into it.
We consider that a cluster is complete when its expansion stops at satellite points, which form the cluster’s frontier.
In our solution, all the data points below are considered satellite points (given our hyperparameters), and everything inside the frontier they form belongs to the cluster:
But there’s still a group of data points that we haven’t spoken of. What about data points that are never called by other data points? They are definitely not core points, and they can’t be satellite points either, as no core point’s radius touches them.
Those are called outliers! This is a useful feature of the DBScan algorithm: identifying outliers is built into the solution.
For example, take this data point right here:
We can draw a radius around this data point as well:
Notice that this data point is completely isolated from the other data points. Additionally, it does not have enough data points in its boundary to be considered a core data point.
This makes it an outlier. A data point can be an outlier in DBScan for two main reasons:
- It doesn’t have enough data points in its boundary to be a core point, and it isn’t a satellite point.
- Only the boundaries of satellite points reach it. Remember that satellite points aren’t able to bring data points into their own cluster; only core points have that responsibility.
Data points in the DBScan algorithm are categorized into Core, Satellite and Outlier
After cataloging all data points as either Core, Satellite or Outlier, the algorithm is complete!
What clusters will a DBScan solution find in our Age vs. Annual Income data? Let’s fit a DBScan solution with the following hyperparameters (do you still remember them?):
- eps = 9
- min_samples = 5
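Here’s how that fit could look with scikit-learn, again assuming the toy customers DataFrame from earlier:

from sklearn.cluster import DBSCAN

# fit_predict labels outliers with -1
dbscan = DBSCAN(eps=9, min_samples=5)
customers['cluster_dbscan'] = dbscan.fit_predict(customers[['Age', 'Annual Income']])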
The purple points are not a "cluster". They are considered outliers by the solution.
This solution has one upside and one downside:
- Our 3 clusters have been created according to the density of points, which may be a good solution.
- Unfortunately, these hyperparameters are also too strict in this solution, producing too many outliers.
DBScan is sensitive to changes in its hyperparameters
One common trick is to do a DBScan clustering that produces a lot of outliers and then run a distance-minimizing algorithm afterwards to allocate the leftover points (for example, assigning each one to the cluster with the minimum Euclidean distance to its centroid).
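A minimal sketch of that trick, assuming the labels produced by the DBScan fit above (where outliers are marked with -1):

import numpy as np

X = customers[['Age', 'Annual Income']].to_numpy()
labels = customers['cluster_dbscan'].to_numpy()

# Compute the centroid of each real cluster (label -1 means outlier)
cluster_ids = [c for c in np.unique(labels) if c != -1]
centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])

# Reassign each outlier to the cluster with the closest centroid
for i in np.where(labels == -1)[0]:
    distances = np.linalg.norm(centroids - X[i], axis=1)
    labels[i] = cluster_ids[int(distances.argmin())]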
Another alternative is to tweak our hyperparameters. Here, we can raise eps or lower min_samples, for example:
- eps = 12
- min_samples = 4
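Refitting with these looser hyperparameters is a one-line change:

customers['cluster_dbscan_v2'] = DBSCAN(eps=12, min_samples=4).fit_predict(customers[['Age', 'Annual Income']])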
This result is a bit different, and fewer points are considered outliers. Although there’s no single correct solution, analyzing the clusters with different hyperparameters will give you different solutions, as this algorithm is quite sensitive to the eps and min_samples parameters.
Now that we’ve acquired some new knowledge about DBScan, let’s apply it to a practical use case!
Practical Use Case – Taxi Stations in the US
The dataset we will be using for this practical case is based on the Taxi Trip Duration dataset available on Kaggle and made available on Google BigQuery DataSets. For simplicity, we will just use a sample of the dataset to be able to plot it in a map.
The dataset contains a lot of information, but what interests us are pickup_longitude and pickup_latitude. These two columns contain the GPS coordinates where the passenger joined the ride.
Our goal is to let the taxi company know where they should build their taxi stations. Knowing that it would be interesting to have cars readily available for passengers, their goal is to build stations in the most appropriate locations, trying to maximize the number of potential rides they can serve from the stations.
To solve this problem, let’s start by loading the data using pandas:
import pandas as pd
taxi_data = pd.read_csv('/content/data/taxi_data_sample.csv')
We have some data points whose GPS coordinates are wrongly stated. We need to remove those outliers:
taxi_data = taxi_data.loc[taxi_data.pickup_longitude > -100]
After removing these rows, let’s plot our pickup coordinates on a 2-D map. For the following code to work, you need to install and load cartopy and contextily. Note that cartopy may have some issues when you install it in local environments (for example, using conda), so if you are having trouble, install the libraries in a Google Colab environment using:
!pip install cartopy
!apt-get -V -y -qq install python-cartopy python3-cartopy
!pip install contextily
After installing the dependencies, we can plot our data points on New York using:
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx
import cartopy.crs as ccrs
def loadNewYorkMap():
    # Load a GeoDataFrame with New York City boundaries
    ny_shapefile = gpd.read_file(gpd.datasets.get_path('nybb'))
    nyc = ny_shapefile[ny_shapefile['BoroName'] == 'Manhattan']
    # Set up the map projection and add the New York City boundaries
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.add_geometries(nyc['geometry'], crs=ccrs.PlateCarree(), edgecolor='black', facecolor='none')
    return ax

ax = loadNewYorkMap()
plt.scatter(
    taxi_data.pickup_longitude,
    taxi_data.pickup_latitude,
    transform=ccrs.PlateCarree(),
    color='red',
    marker='o'
)
# Add a basemap using web tiles from OpenStreetMap
ctx.add_basemap(ax, crs=ccrs.PlateCarree(), source=ctx.providers.OpenStreetMap.Mapnik)
ax.set_xticks([])
ax.set_yticks([])
# Show the plot
plt.show()
Before fitting our algorithm, we know that DBScan is based on distances. What should we do? Standardize! Let’s standardize latitude and longitude to the same scale. Although the lack of standardization would not cause too many issues for this data, it’s still a good practice, particularly if you want to apply this use case to other datasets:
from sklearn.preprocessing import StandardScaler

scaled_taxi = StandardScaler().fit_transform(taxi_data[['pickup_latitude', 'pickup_longitude']])
Imagining we would like each of our taxi stations to serve at least 100 trips, how does this fit with our DBScan parameters? We can tweak the min_samples parameter! Let’s fit our first clustering solution:
from sklearn.cluster import DBSCAN
taxi_data['cluster_dbscan'] = DBSCAN(eps=0.25, min_samples=100).fit_predict(scaled_taxi)
We’ve chosen eps=0.25. Let’s see our solution on a map! … but which coordinates should we plot? One interesting solution is to calculate the average value of the coordinates for each cluster:
centroids_dbscan = taxi_data.groupby('cluster_dbscan')[['pickup_latitude', 'pickup_longitude']].mean()
Also, DBScan represents outliers with the value -1. Let’s exclude these values, as they are not considered a "cluster":
# Remove outliers
centroids_dbscan = centroids_dbscan.loc[0:]
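To put the stations on the map, one option is to reuse the loadNewYorkMap helper and the imports from the earlier plotting snippet; something along these lines should work:

ax = loadNewYorkMap()
# Mark each proposed station at its cluster's average pickup location
plt.scatter(
    centroids_dbscan.pickup_longitude,
    centroids_dbscan.pickup_latitude,
    transform=ccrs.PlateCarree(),
    color='blue',
    marker='X',
    s=200
)
ctx.add_basemap(ax, crs=ccrs.PlateCarree(), source=ctx.providers.OpenStreetMap.Mapnik)
plt.show()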
Very cool! We would set up 3 stations:
- One in Manhattan.
- One at JFK Airport.
- One at LaGuardia Airport.
Want to check which trips are "theoretically" served by which station?
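One way to check is to color each pickup by its assigned cluster; here's a sketch reusing the same map helper:

ax = loadNewYorkMap()
# Color each pickup by its cluster label; outliers (-1) get their own color
plt.scatter(
    taxi_data.pickup_longitude,
    taxi_data.pickup_latitude,
    c=taxi_data.cluster_dbscan,
    transform=ccrs.PlateCarree(),
    marker='o',
    s=5
)
ctx.add_basemap(ax, crs=ccrs.PlateCarree(), source=ctx.providers.OpenStreetMap.Mapnik)
plt.show()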
Cool! Purple points are considered outliers of the DBScan solution. It seems that we are being a bit strict in our definition of clusters. Remember how to adjust this? Lowering the eps or min_samples arguments! Let’s try both:
taxi_data['cluster_dbscan_v2'] = DBSCAN(eps=0.1, min_samples=50).fit_predict(scaled_taxi)
Checking the centroids of our new solution and plotting them:
centroids_solution_2 = taxi_data.groupby('cluster_dbscan_v2')[['pickup_latitude', 'pickup_longitude']].mean()
# Remove outliers
centroids_solution_2 = centroids_solution_2.loc[0:]
Notice how we would set up more stations with these new parameters. By adjusting eps and min_samples, we were able to create smaller clusters in our data.
Keep adjusting the eps and min_samples parameters and see for yourself the impact on the resulting clusters!
Conclusion
DBScan is a widely used algorithm that can be very cool to use in certain scenarios, particularly when your clusters are not spherical and contain irregular patterns.
DBScan has been used in applications as diverse as gene expression clustering, customer analytics, and image segmentation. Although k-means and hierarchical clustering are more famous, density-based clustering methods are very cool options to consider for your Unsupervised Learning projects.
In a nutshell, here’s what you should remember about DBScan:
- It is able to identify outliers.
- It divides points into core, satellite and outlier points.
- It uses a radius (eps) around the data point to create the density cluster.
- It uses a minimum number of samples (the min_samples argument) to catalog each data point according to its role (core, satellite, or outlier).
If you enjoyed this post, make sure to read my other posts in the Unsupervised Learning series:
Feel free to check the YouTube video where I explain DBScan on my channel, The Data Journey.
If you would like to drop by my Python courses, feel free to join my Complete Python Bootcamp for Beginners. My Python courses are suitable for beginners and mid-level developers, and I would love to have you in my class!
[The dataset used in this post is under Public Domain License]