A Practical Guide to the DBSCAN Method

A comprehensive guide to the popular clustering method DBSCAN

Amit Shreiber
Towards Data Science


When I was working on my first data science task and I wanted to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) for clustering, many times I searched for answers to questions such as:

  • How to choose the DBSCAN parameters?
  • How does DBSCAN work?
  • How does DBSCAN choose which point will belong to which cluster?
  • Why use DBSCAN and not K-means?

Of course, it is not difficult to find answers to all these questions and many more. However, it might be hard to find one article that summarizes all these and many other basic questions. In addition, you may be surprised, but some important questions don’t have such accessible answers.

My goal was to write a guide that summarizes the DBSCAN method and answers all these questions and more. I find the DBSCAN algorithm less intuitive than other popular clustering methods like K-means and hierarchical clustering, so I'm going to use a lot of examples, and I guarantee that by the end of the article you'll understand the method.

Motivation for density-based clustering

Two popular types of clustering methods are: partitioning and hierarchical methods.

A partitioning method partitions the dataset into k groups (clusters), where k is the main input of the method. The iterative partitioning process allocates each point or object (from now on I will refer to it as a point) in the dataset to the group it belongs to. After the points are allocated to their groups, the mean of each group (its centroid) is calculated by averaging all the points in the group. The best-known partitioning method is K-means. Partitioning methods have some significant drawbacks: you need to know beforehand into how many groups you want to split the dataset (the k value). Another important drawback is that K-means does not perform well at finding non-convex/non-spherical clusters (figure 1). In addition, K-means is sensitive to noisy data.

Figure 1 - clustering shapes that K-means does not identify. Retrieved from Introduction to Data Mining (Tan, Steinbach, Kumar, 2004)

A hierarchical method creates a hierarchical visual representation of the data using a special tree (a dendrogram). An agglomerative hierarchical clustering method starts with each point in its own cluster, and in each step it merges similar or close clusters into one cluster. In the final step, all the points are in the same cluster. In contrast to K-means, you don't need to decide the number of clusters in advance, but hierarchical clustering also has some serious drawbacks: it isn't suitable for big datasets, it has high computational complexity, and you need to choose the metric for merging clusters (the linkage), which affects the clustering result. Furthermore, each linkage metric has its own drawbacks: sensitivity to noise, difficulty in dealing with severe differences in the density of clusters, and sensitivity to the shape and size of clusters.

As we can see, the main disadvantages of partitioning and hierarchical methods are poor handling of noise and poor results when finding clusters of non-spherical shape.

The DBSCAN clustering method is able to find clusters of arbitrary shape and to handle noise.

Figure 2 - clusters of arbitrary shapes, such as the "S" shape and oval clusters. Retrieved from Data Mining: Concepts and Techniques (Han, Kamber, Pei, 2011).
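
To see this contrast in practice, here is a small scikit-learn sketch. The dataset and the eps/min_samples values are illustrative choices of mine, not part of the original article:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-moons: a classic non-convex clustering problem.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-means assumes roughly spherical clusters and tends to cut each moon in half.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the dense, curved shape of each moon (values picked by eye, not tuned).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
```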

DBSCAN intuition

Let's think about a very big city with a lot of residents and tourists. Imagine that there is a small village just a short drive away. If we took an alien to both places, even though they are very close to each other, it could easily tell that they are completely different places. Yes, the view, area, buildings, and many other aspects are completely different. But there is one aspect that is relevant for our case: the density of the places. The city is crowded, with a lot of locals and tourists, whereas the village is small with significantly fewer people.

Distinguishing groups of points by their density is the main idea of DBSCAN.

DBSCAN Concepts

DBSCAN groups points with dense neighborhoods together into clusters.

A point is considered crowded if it has many other neighboring points near it. DBSCAN finds these crowded points and places them and their neighbors in a cluster.

DBSCAN has two main parameters:

  • ε (or eps or epsilon) — defines the size and borders of each neighborhood. ε (which must be greater than 0) is a radius: the neighborhood of point x, called the ε-neighborhood of x, is the circle/ball with radius ε around point x.

Some books and articles describe the ε-neighborhood of x formally as the set of all points within distance ε of x:

Nε(x) = {y | dist(x, y) ≤ ε}

Formula adapted from Data Mining and Analysis: Fundamental Concepts and Algorithms (Zaki & Meira, 2014)
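
As a minimal illustration of this definition, here is a short NumPy sketch. The function name and the use of Euclidean distance are my own assumptions; DBSCAN works with any distance function:

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Return the indices of all points within distance eps of point i.

    Note: point i itself is always included, since dist(x, x) = 0 <= eps.
    """
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance from point i to every point
    return np.where(dists <= eps)[0]
```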

In figure 3, we can see how different sizes of ε define the ε-neighborhood of x.

Figure 3 — we can see how the ε size affects the size of the ε-neighborhood of x.

Now, you are probably asking yourself why this is called density-based clustering. Would a very dense neighborhood and a sparse one be considered two different neighborhoods (figure 4) and later be separated into two different clusters?

Figure 4 — A dense neighborhood and a sparse one.

Point x and its neighbors would be in one neighborhood, and y and its few neighbors would be in another. In the end, however, point x and its neighbors would probably form one cluster, whereas point y and its neighbors would be considered outliers or noise. This is because the ε-neighborhood of y isn't dense enough: the more points a neighborhood contains, the denser it is.

How can we decide whether a neighborhood is dense enough? For that, DBSCAN uses the MinPts parameter.

MinPts — the density threshold. If a neighborhood includes at least MinPts points, it is considered a dense region. Put differently, a point is considered dense if there are at least MinPts points in its ε-neighborhood. These dense points are called core points.

Let's look at figure 4 again: if the MinPts parameter is 3, point x will be a core point because its ε-neighborhood contains 9 points, which is more than 3. Point y won't be a core point because its ε-neighborhood contains only two points.

*The number of points in the ε-neighborhood of point x includes x itself.

A border point has an ε-neighborhood that contains fewer than MinPts points (so it's not a core point), but it belongs to the ε-neighborhood of a core point.

If a point isn’t a core point and isn’t a border point, it’s a noise point or an outlier.

In figure 5 we can see that point x is a core point, because it has more than 11 points in its ε-neighborhood. Point y isn't a core point because it has fewer than 11 points in its ε-neighborhood, but since it belongs to the ε-neighborhood of point x, and point x is a core point, point y is a border point. We can easily see that point z isn't a core point. It belongs only to the ε-neighborhood of point y, and point y isn't a core point, so z isn't a border point either; therefore, point z is a noise point.

Figure 5 - the three types of points: x is a core point, y is a border point, and z is a noise point.
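
Putting these three definitions together, here is a small sketch that labels every point as a core, border, or noise point for given ε and MinPts values. Brute-force pairwise distances and the Euclidean metric are simplifying assumptions of mine:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' for the given eps and MinPts."""
    n = len(X)
    # Pairwise Euclidean distances; each eps-neighborhood includes the point itself.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")              # dense eps-neighborhood
        elif is_core[neighborhoods[i]].any():
            labels.append("border")            # lies in some core point's eps-neighborhood
        else:
            labels.append("noise")             # neither core nor border
    return labels
```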

Now that we know that core points and border points will end up in a cluster, how does DBSCAN know which point goes to which cluster? To answer that, we need a few definitions:

Directly density reachable — point y is directly density reachable from point x if:

  • Point y belongs to the ε-neighborhood of point x
  • Point x is a core point.

In figure 5, point y is directly density reachable from point x. Notice that point x isn’t directly density reachable from point y, because point y isn’t a core point.

Density reachable — point y is density reachable from point x if there is a path of points between point x and point y, where each point in the path is directly density reachable from the previous point. This means that all the points on the path are core points, with the possible exception of point y. We can see an example in figure 6.

According to Wikipedia:

A point q is reachable from a core point p if there is a path P₁, P₂, …, Pₙ with P₁ = p and Pₙ = q, where all the points on the path are core points, with the possible exception of point q.

Figure 6 — point y is density reachable from point x. Point y is a border point and points: x, z, and p are core points.

Density-connected — a point x is density-connected to a point y if there is a point o from which both x and y are density-reachable. This is how DBSCAN connects core points and their neighbors in a dense region. We can see an example in figure 7.

Figure 7 — point x and point y are density-reachable from point o. Therefore, point x is density-connected to a point y. Figure retrieved from Knowledge Discovery in Databases(Seidl, 2016) and edited by me.

Density-based cluster: a cluster C is a non-empty set of points such that:

  • If point x is in C and y is density-reachable from x, then y is also in C
  • All the points in C are density-connected to each other

In figure 6, points x, y, z, and p will be in the same cluster.

The DBSCAN algorithm

Input:

  • D — a dataset with n points
  • MinPts — the neighborhood density threshold
  • ε — the neighborhood radius

Method:

1) We mark all the points in the data as unvisited.

2) We choose a random unvisited point to visit, and mark it as visited. Let’s refer to it as ‘p’.

3) We check if the ε-neighborhood of p has at least MinPts points.

If ‘p’ doesn’t have enough points in its ε-neighborhood (figure 8), we mark it as noise and proceed to step 4.

*Notice that, in a later iteration, in step 3.d, this point might be included in a cluster.

Figure 8 — point ‘p’ will be considered as noise because its ε-neighborhood contains fewer than MinPts points.

If point ‘p’ has enough points in its ε-neighborhood (figure 9) we proceed to step 3.a

Figure 9 — ’p’ has more than MinPts in its ε-neighborhood

3.a) We will create a new cluster C and add ‘p’ to the cluster.

3.b) We will give the ε-neighborhood of p a new name — N (figure 10). N will include the points: {a, b, c, d, e, f, g, h, i}

Figure 10 - we give the ε-neighborhood of p a new name, N. Figures 8–10 retrieved April 19, 2020, from THE IRISH TIMES, edited by me

3.c) We will follow the next steps for each point in N. The first point is ‘a’. If ‘a’ is unvisited, we will do steps 3.c.1 and 3.c.2; otherwise, we will proceed to step 3.d.

3.c.1) We will mark point ‘a’ as visited.

3.c.2) We will check if point ‘a’ has at least MinPts points in its ε-neighborhood. If so, we will add these points to N. For example, if the ε-neighborhood of ‘a’ includes the points {j, k, l} and MinPts is 3, the new N would include the points {a, b, c, d, e, f, g, h, i, j, k, l}. But if the ε-neighborhood of ‘a’ included just the point {j}, N would stay the same.

3.d) If point ‘a’ doesn’t belong to any cluster, we will add it to cluster C; now cluster C will include points ‘a’ and ‘p’.

3.e) We will move to the next point in N (point ‘b’) and go back to step 3.c. We repeat steps 3.c–3.e until we have checked all the points in N. Remember that N might grow at any time (step 3.c.2).

4) We’re done with ‘p’. We will go back to step 2, visit the next unvisited point, and repeat these steps. We will finish when all the points in the dataset have been visited.
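
To make these steps concrete, here is a minimal, unoptimized Python sketch of the procedure described above. It uses brute-force neighborhood search with Euclidean distance, and the function and variable names are my own, not part of the original pseudocode:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """A minimal DBSCAN sketch following the steps above.

    Returns one label per point: 0, 1, 2, ... for clusters and -1 for noise.
    """
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned to any cluster
    visited = np.zeros(n, dtype=bool)

    def region_query(i):
        # Step 3: find the eps-neighborhood of point i (it includes i itself).
        dists = np.linalg.norm(X - X[i], axis=1)
        return list(np.where(dists <= eps)[0])

    cluster_id = -1
    for p in range(n):                         # step 2: visit the next unvisited point
        if visited[p]:
            continue
        visited[p] = True
        neighborhood = region_query(p)
        if len(neighborhood) < min_pts:        # step 3: p is not a core point -> noise (for now)
            continue

        cluster_id += 1                        # step 3.a: create a new cluster C
        labels[p] = cluster_id
        N = neighborhood                       # step 3.b: call the eps-neighborhood of p "N"
        i = 0
        while i < len(N):                      # steps 3.c-3.e: scan N, which may keep growing
            q = N[i]
            if not visited[q]:
                visited[q] = True              # step 3.c.1
                q_neighborhood = region_query(q)
                if len(q_neighborhood) >= min_pts:        # step 3.c.2: q is a core point too
                    N += [r for r in q_neighborhood if r not in N]
            if labels[q] == -1:                # step 3.d: q doesn't belong to any cluster yet
                labels[q] = cluster_id
            i += 1
    return labels
```

This sketch is quadratic and meant only for understanding the steps; in practice, a library implementation such as sklearn.cluster.DBSCAN is the better choice.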

An illustration of the algorithm:

Figure 11. Retrieved April 19, 2020, from Towards Data Science (Seif, 2018)

I recommend checking out this DBSCAN toy example (Schubert, 2018).

How to choose MinPts and ε?

This is the million-dollar question. First, I will define some terms:

Function k-distance(p): the distance from point p to its k-th nearest neighbor (the user chooses the k value). For example, if the user chooses k to be 4 (figure 12), the 4-distance of point p is the distance from p to its 4th nearest neighbor.

Figure 12- k-distance. Retrieved from Knowledge Discovery in Databases (Seidl, 2016)

K-distance plot: the k-distances of all points, sorted in decreasing order. Also called the sorted k-dist graph.

Parameters Estimation

If ε has a small value, many points may be considered outliers because they won't be core points or border points (the ε-neighborhoods will be very small). A large value for ε may cause a huge number of points to end up in the same cluster.

As said before, one of the main advantages of DBSCAN is that it detects noise. According to research published in 2017 by Schubert, Sander, et al., the desirable amount of noise will usually be between 1% and 30%. Another insight from that research is that if one of the clusters contains a large share (20%–50%) of the dataset's points, it indicates that you should choose a smaller value for ε or try another clustering method.

In general, ε should be chosen as small as possible.

According to the originators of the DBSCAN algorithm (Ester, Kriegel, Sander and Xu, 1996), we can use this heuristic to find ε and MinPts:

For a given k, we build the sorted k-dist graph. The threshold point is the first point in the first “valley” of the sorted k-dist graph, and the k-dist value of that threshold point will be the ε value. The paper indicates that the k-dist graphs for k > 4 don’t differ significantly from the 4-dist graph, but they need considerably more computation. Therefore, the authors eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data). The 4-dist value of the threshold point is then used as the ε value for DBSCAN.

Figure 13 — K dist graph(for k=4) (Ester, Kriegel, Sander and Xu, 1996)

If you don’t want the MinPts value to be 4, you can set MinPts = k + 1. A heuristic for choosing k is to set k = 2 ∗ dimensions − 1 (Sander, Ester et al., 1998).

Another heuristic for choosing the MinPts value is to take the average ε-neighborhood size:

MinPts = (1/n) · Σᵢ₌₁ⁿ Pᵢ

where Pᵢ is the number of points in the ε-neighborhood of point i, and n is the number of points in the dataset. For each different value of ε we get a corresponding MinPts value (Sawant, 2014).
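
A minimal sketch of this heuristic, assuming the average-neighborhood-size formula above and Euclidean distances (both assumptions of mine):

```python
import numpy as np

def minpts_heuristic(X, eps):
    """Estimate MinPts as the average number of points in an eps-neighborhood."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    p = (dists <= eps).sum(axis=1)   # P_i: size of the eps-neighborhood of point i (includes i)
    return int(round(p.mean()))      # MinPts = (1/n) * sum of P_i
```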

In the next example (figure 14) we:

  1. Choose k = 3.
  2. Calculate the 3rd-nearest-neighbor distance of each point. For example, for point n this distance is a bit higher than 5.
  3. Sort the distances (the bar plot in figure 14 shows the distances after sorting).
  4. Choose ε to be the value at the first knee, so ε ≈ 2.5.

Figure 14 - choosing parameters using a k-dist graph. Retrieved from Knowledge Discovery in Databases by Erich Schubert, 2018. Edited by me
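
In code, the sorted k-dist graph is easy to produce with scikit-learn's nearest-neighbor search. This sketch assumes your data is already in a NumPy array X; the variable names and plotting choices are mine:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 3  # as in the example above; MinPts would then be k + 1 = 4

# n_neighbors = k + 1 because each point's nearest neighbor is the point itself (distance 0).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# k-distance of every point, sorted in decreasing order (the sorted k-dist graph).
k_dist = np.sort(distances[:, -1])[::-1]

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"{k}-distance")
plt.show()  # pick eps at the first "knee"/"valley" of the curve
```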

Computational Complexity

As we said before, DBSCAN checks the ε-neighborhood of each point in the data. Without an index, finding the ε-neighborhood of a single point takes O(n), so checking all n points takes O(n²).

Thus, the overall complexity of DBSCAN is O(n²) in the worst case.

In some cases, for example when a spatial index such as a k-d tree or R*-tree is used on low-dimensional data, the complexity can be reduced to O(n log n).
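
In practice you usually rely on a library implementation, which uses a spatial index when it helps. A minimal scikit-learn sketch, where eps and min_samples are placeholder values and X is assumed to be a NumPy array of points:

```python
from sklearn.cluster import DBSCAN

# algorithm="auto" lets scikit-learn pick a k-d tree / ball tree when the data allows it.
labels = DBSCAN(eps=2.5, min_samples=4, algorithm="auto").fit_predict(X)
# Cluster labels are 0, 1, 2, ...; points labeled -1 are noise.
```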

DBSCAN PROS

  • Identifies arbitrarily shaped clusters
  • Doesn’t require knowing the number of clusters in the data beforehand (as opposed to K-means)
  • Handles noise

DBSCAN CONS

  • Datasets with varying densities are problematic
  • Input parameters (ε and MinPts) may be difficult to determine
  • Computational complexity — when the dimensionality is high, it takes O(n²)

Summary

DBSCAN is a density-based clustering method that discovers clusters of nonspherical shape.

Its main parameters are ε and MinPts. ε is the radius of a neighborhood (a group of points that are close to each other). If a neighborhood includes at least MinPts points, it is considered a dense region and will be part of a cluster.

There are many ways to choose the parameters. A popular heuristic is to choose a k value and build a sorted k-dist graph. The threshold point is the first point in the first “valley” of the sorted k-dist graph, and the k-dist value of that threshold point will be the ε value. The MinPts value will be k + 1. You can use 4 as a default value for k and MinPts.

DBSCAN’s main advantages are that you don’t need to know the number of clusters beforehand, it identifies arbitrarily shaped clusters, and it handles noise. Its main drawbacks are high computational complexity, the need to choose good values for ε and MinPts, and difficulty handling datasets with varying densities.

References

  1. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226–231).
  2. Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques (3rd ed.). Morgan Kaufmann.
  3. DBSCAN. (2020). Retrieved 8 April 2020, from https://en.wikipedia.org/wiki/DBSCAN
  4. Prado, K. (2017). How DBSCAN works and why should we use it? Retrieved 11 March 2020, from https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80
  5. Sawant, K. (2014). Adaptive methods for determining DBSCAN parameters. International Journal of Innovative Science, Engineering & Technology, 1(4), 329–334.
  6. Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2), 169–194.
  7. Seidl, T. (2016). Knowledge Discovery in Databases — Chapter 4: Clustering. Presentation, Ludwig-Maximilians-Universität München.
  8. Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 1–21.
  9. Rafsanjani, M. K., Varzaneh, Z. A., & Chukanlo, N. E. A survey of hierarchical clustering algorithms. The Journal of Mathematics and Computer Science, 229–240.
  10. Zaki, M. J., & Meira, W. (2014). Data mining and analysis: Fundamental concepts and algorithms. Cambridge University Press.
