
From Data to Clusters; When is Your Clustering Good Enough?

Sensible clusters and hidden gems can be found with clustering approaches, but you need the right clustering method and the right cluster evaluation method. Learn how to find them in four steps.

Photo by Shubham Dhage on Unsplash

With unsupervised cluster analysis, we can group observations with similar patterns and reveal (hidden) trends in the data. Cluster evaluation methods help to determine the clustering tendency, the quality of the clusters, and the optimal number of clusters. In this blog, we will dive into cluster evaluation methods, learn how to interpret them, and select the appropriate clustering method for your use case. We will start with the fundamentals of clustering and the evaluation methods used to assess the quality of clusters, including popular techniques such as the Silhouette score, the Davies-Bouldin index, and the derivative method. Using toy example data sets, we will investigate the strengths and limitations of each evaluation method and provide practical insights on how to interpret the results. All analyses are performed with the clusteval library.


Unsupervised Clustering.

With unsupervised clustering, we aim to determine "natural" or "data-driven" groups in the data without using a priori knowledge about labels or categories. The challenge of using different unsupervised clustering methods is that they will result in different partitioning of the samples, and thus different groupings, since each method implicitly imposes a structure on the data. The question thus arises: what is a "good" clustering? Figure 1A depicts a set of samples in a two-dimensional space. How many clusters do you see? I would state that there are two clusters without using any label information. Why? Because of the small distances between the dots within each group, and the relatively larger "gap" between the two groups of dots.

Figure 1. Image by the author

With this in mind, we can convert our intuition of "clusters" into a mathematical statement: the variance of samples within a cluster should be small (within variance σW, red and blue), and at the same time, the variance between the clusters should be large (between variance σB), as depicted in Figure 1B. The distance between samples (or their intrinsic relationships) can be measured with a distance metric (e.g., the Euclidean distance) and stored in a so-called dissimilarity matrix. The distance between groups of samples can then be computed using a linkage type (for hierarchical clustering).
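
For intuition, this statement can be written down in a few lines of numpy. This is a minimal sketch with made-up two-dimensional blobs; the variable names sigma_w and sigma_b are only illustrative and not part of any library.

import numpy as np

# Two made-up two-dimensional "blobs" to illustrate the intuition of Figure 1.
rng = np.random.default_rng(1)
blue = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
red = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))

# Within variance: mean squared distance of samples to their own cluster center.
sigma_w = np.mean([((blue - blue.mean(axis=0)) ** 2).sum(axis=1).mean(),
                   ((red - red.mean(axis=0)) ** 2).sum(axis=1).mean()])

# Between variance: squared distance between the two cluster centers.
sigma_b = ((blue.mean(axis=0) - red.mean(axis=0)) ** 2).sum()

print(f'within variance: {sigma_w:.2f}, between variance: {sigma_b:.2f}')
# A "good" clustering keeps the within variance small relative to the between variance.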

Distance metrics.

The most well-known distance metric is the Euclidean distance. Although it is set as the default metric in many methods, it is not always the best choice. A schematic overview of various distance metrics is depicted in Figure 2.

Understand the mathematical properties of the distance metric so that it fits the data and aligns with the research question.

Figure 2: Schematic overview of the most popular distance metrics. Image by the author.
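
To see that the choice of metric matters, consider a minimal scipy sketch (the three samples are made up) that computes the same dissimilarity matrix with three different metrics:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up samples with two features each.
X = np.array([[1.0, 0.0],
              [0.0, 3.0],
              [2.0, 2.0]])

# Same data, three different metrics, three different dissimilarity matrices.
for metric in ['euclidean', 'cityblock', 'cosine']:
    D = squareform(pdist(X, metric=metric))
    print(metric)
    print(np.round(D, 2))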

Linkage types.

The process of hierarchical clustering involves iteratively grouping samples into larger clusters. In this process, the distance between two sub-clusters needs to be computed, for which the different linkage types describe how the clusters are connected (Figure 3).

Figure 3: Linkage types. Image by the author.

Briefly, the single linkage between two clusters is the proximity between their two closest samples. It produces long chains and is therefore ideal for outlier detection or snake-like clusters. The complete linkage between two clusters is the proximity between their two most distant samples. Intuitively, the two most distant samples cannot be much more dissimilar than other quite dissimilar pairs. It forces clusters to be spherical, often with "compact" contours at their borders, although they are not necessarily compact inside. The average linkage between two clusters is the arithmetic mean of all proximities between the samples of one cluster and the samples of the other. The centroid linkage is the proximity between the geometric centroids of the clusters. In other words, centroid linkage connects clusters based on the center of mass of the data points, while average linkage takes all pairwise distances between the samples in the two clusters into account. Choose the metric and linkage type carefully because they directly affect the final clustering results.
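
As a hedged sketch of how these linkage types behave in practice, the snippet below links the same dissimilarity matrix with four different linkage types using scipy's hierarchical clustering; the random data is only illustrative.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: 100 random two-dimensional samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Condensed dissimilarity matrix with the Euclidean metric.
D = pdist(X, metric='euclidean')

# The same distances, four linkage types, potentially four different groupings.
for method in ['single', 'complete', 'average', 'centroid']:
    Z = linkage(D, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into two clusters
    print(method, 'cluster sizes:', np.bincount(labels)[1:])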


From Data to Clusters in 4 Steps.

To determine data-driven clusters that may hold new or unknown information, we need to carefully perform four consecutive steps to go from the input data set to sensible clusters. Libraries such as distfit and clusteval can help in this process.

Coming up with a sensible grouping of samples requires more than blindly running a clustering algorithm.

Step 1: Investigate the underlying distribution of the data.

Investigating the underlying distribution of the data is an important step because clustering algorithms rely on the statistical properties of the data to identify patterns and group similar data points together. By understanding the distribution of the data, such as its mean, variance, skewness, and kurtosis, we can make informed decisions about which clustering algorithm to use and how to set its parameters to achieve optimal results. Furthermore, investigating the data distribution can provide insights into the appropriate normalization or scaling techniques to apply before clustering. While supervised approaches, such as tree-based models, can handle mixed data sets, clustering algorithms are designed to work with homogeneous data. This means that all variables should have similar types or units of measurement. Normalization or scaling is thus an important step, as clustering algorithms group data points based on their similarity using a distance metric. More details about investigating the underlying data distribution can be read here [1]:

How to Find the Best Theoretical Distribution for Your Data
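
As a minimal sketch of this step (using scipy and scikit-learn here rather than the distfit library itself), the basic statistics can be inspected and a scaling step applied; the generated data is hypothetical.

import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler

# Hypothetical data set: 1000 samples with two continuous features on different scales.
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(10, 2, 1000), rng.exponential(3, 1000)])

# Inspect the statistical properties per feature.
print('mean    :', X.mean(axis=0))
print('variance:', X.var(axis=0))
print('skewness:', skew(X, axis=0))
print('kurtosis:', kurtosis(X, axis=0))

# Scale the features so that a distance metric is not dominated by one unit of measurement.
X_scaled = StandardScaler().fit_transform(X)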

Step 2: Make an educated guess of the cluster density and the expected cluster sizes.

Setting expectations about cluster densities, shapes, and the number of clusters will help to select the appropriate clustering algorithm and parameter settings to achieve the desired outcome. In addition, setting expectations provides more confidence and validity when interpreting and communicating the clustering results to stakeholders in a meaningful way. For example, if the aim was to identify rare anomalies in a data set, and the clustering results produce a small cluster with very low density, it could potentially indicate the presence of such anomalies. However, it is not always possible to set expectations about the number of clusters or the density. In that case, we need to select the clustering method(s) solely based on the mathematical properties that match the statistical properties of the data and the research aim.

Step 3: Select the Clustering Method.

The selection of a clustering method depends on the previous steps, but we should also consider factors such as scalability, robustness, and ease of use. For example, in a production setting, we may need different properties than for experimental use cases. There are several popular clustering methods, such as K-means, hierarchical, and density-based clustering algorithms, each with its own assumptions, advantages, and limitations (see the summary below). After selecting the clustering method, we can start clustering the data and evaluating the performance.

K-means assumes that clusters are spherical, equally sized, and have similar densities. It requires specifying the number of clusters (k) in advance. Note that the optimal number of clusters is detected automatically by the clusteval library.

Hierarchical clustering builds a tree-like structure of clusters by recursively merging clusters based on distance or similarity metrics. It is agglomerative (bottom-up) and does not require specifying the number of clusters in advance.

Density-based clustering algorithms, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), group together data points that are densely packed, with lower densities in between clusters. They do not assume any specific shape or size of clusters and can identify clusters of arbitrary shapes and sizes. Density-based clustering is particularly useful for identifying clusters of varying densities and detecting outliers as noise points. However, it requires tuning of hyperparameters, such as the density threshold, and is sensitive to the choice of these parameters.
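
To make the differences concrete, the sketch below runs the three families on the same snake-like toy data with scikit-learn; the parameter values are illustrative and not tuned.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two interleaving half-circles: a typical "snake-like" example.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-means assumes spherical clusters and needs the number of clusters (k) in advance.
labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agglomerative clustering with single linkage can follow the chain-like shape.
labels_agg = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)

# DBSCAN groups densely packed points and marks sparse points as noise (label -1).
labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)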

Different clustering methods can result in different partitioning of the samples and thus different groupings, since each method implicitly imposes a structure on the data.

Step 4: Cluster Evaluation.

Cluster evaluation assesses the clustering tendency, the quality of the clusters, and the optimal number of clusters. There are various cluster evaluation methods, of which the most popular are incorporated in the clusteval library: the Silhouette score, the Davies-Bouldin index, and the derivative (or Elbow) method. The difficulty in using such techniques is that the clustering step and its evaluation are often intertwined; the clusteval library handles this internally. In the next section, I will dive into the various methods and test their performance.

Cluster evaluation should be performed during the clustering process to assign scores to each clustering and enable meaningful comparisons between the results.

The Next Step: From Clusters to Insights.

The step after finding the optimal number of clusters is to determine the driving features behind the clusters. A great way to do this is by using enrichment analysis with methods such as HNET [2], where the hypergeometric test and the Mann-Whitney U test are used to test for significant associations between the cluster labels and the features. More details can be found here:

Explore and understand your data with a network of significant associations.
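
As a hedged sketch of the underlying idea (not the HNET implementation itself), a categorical feature can be tested per cluster with the hypergeometric test and a continuous feature with the Mann-Whitney U test; all inputs below are made up.

import numpy as np
from scipy.stats import hypergeom, mannwhitneyu

# Made-up inputs: cluster labels, one binary categorical feature, one continuous feature.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, 300)       # three clusters
category = rng.integers(0, 2, 300)     # binary categorical feature
continuous = rng.normal(size=300)      # continuous feature

for k in np.unique(labels):
    in_cluster = labels == k
    # Hypergeometric test: is the category over-represented inside cluster k?
    M, n, N = len(labels), int(category.sum()), int(in_cluster.sum())
    x = int(category[in_cluster].sum())
    p_hyper = hypergeom.sf(x - 1, M, n, N)
    # Mann-Whitney U test: does the continuous feature differ inside vs outside cluster k?
    p_mwu = mannwhitneyu(continuous[in_cluster], continuous[~in_cluster]).pvalue
    print(f'cluster {k}: hypergeometric p={p_hyper:.3f}, Mann-Whitney U p={p_mwu:.3f}')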


The Clusteval Library.

A few words about the clusteval library that is used for all the analyses. The clusteval library is designed to tackle the challenges of evaluating clusters and creating meaningful plots. In a previous blog [3], I demonstrated its strength in the clustering of images. The clusteval library contains the most popular evaluation methods, i.e., the Silhouette score, the Davies-Bouldin index, and the derivative (or Elbow) method, to evaluate the clustering of Agglomerative, K-means, DBSCAN, and HDBSCAN. It can handle all data types, such as continuous, categorical, and discrete data sets. Just make sure that (1) the data is correctly normalized, (2) all variables have similar types or units of measurement, and (3) you choose the appropriate distance metric and evaluation method.

Clusteval functionalities

  • It contains the most-wanted cluster evaluation pipelines for the Silhouette scores, Davies-Bouldin index, and the Derivative method to evaluate the clustering of Agglomerative, K-means, DBSCAN, and HDBSCAN.
  • It contains functionalities to create meaningful plots; the optimal number of clusters, a Silhouette plot with its coefficients, scatterplots, and dendrograms.
  • It contains functionality to determine the driving features behind the clusters.
  • It can handle all data types, such as continuous, categorical, and discrete data sets. Just make sure that all variables have similar types or units of measurement.
  • Open-source
  • Documentation page

The Choice of a Clustering Method Matters.

It may not surprise you that different clustering techniques can result in different groupings, as they respond differently to varying densities and group sizes. The detection of the correct number of clusters can therefore be challenging, especially when the data is high-dimensional and visual inspection is not possible. With cluster evaluation methods, we can investigate and backtest to determine the optimal number of clusters and describe the clustering tendency. To demonstrate how well the number of clusters can be determined, I created seven toy example data sets (inspired by scikit-learn) with varying densities, shapes, and sample sizes, i.e., snake-like clusters, clusters with different densities and sizes, and uniformly distributed samples (Figure 4). Each toy example contains 1000 samples and two features. As such, it gives an intuition for working in a low-dimensional space, such as after a feature extraction step (e.g., PCA, t-SNE, or UMAP).

Figure 4. Seven toy example data sets with different densities, shapes, and sizes. Samples are colored on their ground truth labels. Image by the author.
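
Comparable toy data sets can be generated with scikit-learn's generators; the sketch below is only an approximation of the figure, and the exact parameters used there are assumptions.

import numpy as np
from sklearn.datasets import make_blobs, make_moons, make_circles

n = 1000
rng = np.random.default_rng(1)

# Blobs with equal densities, anisotropic blobs, blobs with varying densities,
# snake-like clusters, and uniformly distributed samples (no cluster structure).
X_blobs, y_blobs = make_blobs(n_samples=n, centers=4, random_state=1)
X_aniso = X_blobs @ np.array([[0.6, -0.6], [-0.4, 0.8]])
X_varied, y_varied = make_blobs(n_samples=n, cluster_std=[1.0, 2.5, 0.5], random_state=170)
X_moons, y_moons = make_moons(n_samples=n, noise=0.05, random_state=1)
X_circles, y_circles = make_circles(n_samples=n, factor=0.5, noise=0.05, random_state=1)
X_uniform = rng.uniform(size=(n, 2))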

For these seven toy examples, we will investigate the clustering using the evaluation methods that are available in clusteval; the Silhouette score, the Davies-Bouldin index, and the derivative (or Elbow) method. The clustering itself is performed using K-means, Hierarchical clustering (single, complete, and ward distance), and Density-based clustering algorithms (DBSCAN).




Cluster Evaluation using Silhouette scores.

The Silhouette score method is a measure of how similar a sample is to its own cluster (cohesion) compared to other clusters (separation). The scores range between [-1, 1], where a high value indicates that the sample is well-matched to its own cluster and poorly matched to neighboring clusters. It is a sample-wise approach, which means that a Silhouette score is computed for each sample; if most samples have a high value, the clustering configuration is considered appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters. The Silhouette score can be used with any distance metric, which makes it an attractive and versatile method.
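
For intuition on the sample-wise scores outside of clusteval, a minimal scikit-learn sketch (illustrative blobs, untuned K-means) computes the mean Silhouette coefficient for a range of cluster numbers.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Illustrative blobs with four well-separated clusters.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=1)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # One coefficient per sample; the mean summarizes the clustering as a whole.
    coeffs = silhouette_samples(X, labels)
    print(f'k={k}: mean silhouette={coeffs.mean():.3f}')
# The mean score is expected to peak at the true number of clusters (here 4).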

Let's see how well it can group samples with similar densities and sizes for an easy data set, namely the blobs data set. In the code section below, we initialize the DBSCAN method with the Silhouette evaluation method. Note that we could have used any other clustering method, as demonstrated in the next section. For DBSCAN, a search across the epsilon parameter is performed, and the number of clusters with the highest Silhouette score is retrieved. DBSCAN with the Silhouette score correctly detects the 4 clusters, and the sample-wise Silhouette coefficients (Figure 5, middle) look stable (high scores with a uniform shape for each cluster). This is expected, as the samples within each cluster are far from the other clusters and close to the center of their own cluster.

# Import library
from clusteval import clusteval

# Initialize
cl = clusteval(cluster='dbscan', evaluate='silhouette', max_clust=10)
# Import example dataset
X, y = cl.import_example(data='blobs', params={'random_state':1})

# find optimal number of clusters
results = cl.fit(X)

# Make plot
cl.plot()
# Show scatterplot with silhouette scores
cl.scatter()

# [clusteval] >INFO> Fit with method=[dbscan], metric=[euclidean], linkage=[ward]
# [clusteval] >INFO> Gridsearch across epsilon..
# [clusteval] >INFO> Evaluate using silhouette..
# [clusteval] >INFO: 100%|██████████| 245/245 [00:07<00:00, 33.88it/s]
# [clusteval] >INFO> Compute dendrogram threshold.
# [clusteval] >INFO> Optimal number clusters detected: [4].
# [clusteval] >INFO> Fin.
# [clusteval] >INFO> Estimated number of n_clusters: 4, average silhouette_score=0.812
Figure 5. Left: Optimizing the Epsilon parameter for the detection of the optimal number of clusters. Middle: Silhouette scores per sample. Right: Determined cluster labels. Image by the author.

This was just one data set and one clustering approach (DBSCAN) that we evaluated. Let's iterate across the seven toy example data sets and five clustering methods and determine the optimal number of clusters using the Silhouette method. The results are shown in Figure 6, where the top part shows the scatter plots colored on the optimal number of clusters, and the bottom part shows the Silhouette scores across a varying number of clusters.

The conclusion from Figure 6 is that if we choose the appropriate clustering method for the data set, we can detect the correct number of clusters very well. This may feel like a trivial remark, but choosing the appropriate clustering method can be challenging, as you need to carefully perform steps 1 to 4 described in the previous section.

In case we cannot visually inspect the results because of high dimensionality, we can still get an intuition of the sample distribution and the formation of clusters. If the groups are clearly separated, such as snake-like clusters and blobs, we can expect a strong peak in the Silhouette scores for a specific number of clusters with the right clustering method. However, when the data becomes noisier, such as for the globular data set, multiple peaks can start to appear and the slope of the Silhouette score can gradually increase without clear peaks. This usually hints that the clusters are not strongly formed. With the parameters min_clust and/or max_clust in clusteval, we can bound the search space and determine relatively high peaks; a short sketch follows below. This manual optimization step of inspecting and bounding is what we call backtesting. An interesting observation in Figure 6 is that the scores of the DBSCAN clustering method become more turbulent when the data becomes noisy (see Figure 6, second column). The reason is that many outliers are detected and start forming smaller clusters, which later on can again become part of larger clusters.
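
A minimal sketch of such a bounded backtesting run with clusteval; the bounds 2 and 8 are illustrative, and X is assumed to be the data set loaded earlier.

from clusteval import clusteval

# Bound the search space to a plausible range and re-evaluate.
cl = clusteval(cluster='agglomerative', evaluate='silhouette', min_clust=2, max_clust=8)
results = cl.fit(X)

# Inspect the scores per number of clusters before settling on a final result.
cl.plot()
cl.scatter()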

Figure 6. Detection of the optimal number of clusters using the Silhouette score method for five clustering methods and seven toy example data sets. Image by the author.

Cluster Evaluation using the Davies-Bouldin index.

The Davies-Bouldin index (DBindex) can be intuitively described as a measure of the ratio between within-cluster distances and between-cluster distances. The score has a lower bound of 0, where lower values indicate tighter clusters and better separation between clusters. Lower scores are thus considered better. In contrast to the Silhouette score method, the Davies-Bouldin index is not a sample-wise measure, which makes it faster to compute. A disadvantage of the DBindex is that it frequently overshoots, as lower values become more common with a growing number of clusters. This can be clearly seen in the gradually decreasing slope of the scores without a distinct low point.
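
This behaviour can be inspected with scikit-learn's implementation of the index; the sketch below (illustrative blobs, untuned K-means) prints the score for a range of cluster numbers.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Illustrative blobs with four well-separated clusters.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=1)

for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Lower is better; inspect the full curve, since the minimum can drift toward larger k.
    print(f'k={k}: Davies-Bouldin index={davies_bouldin_score(X, labels):.3f}')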

Let's see how well we can cluster samples with similar densities and sizes to get some intuition of the performance of the Davies-Bouldin index. In the code section below, we initialize the Agglomerative clustering method with the Davies-Bouldin index evaluation method. When we limit the search space in clusteval using max_clust=10, it correctly detects the 4 clusters (Figure 7, left panel). However, if we do not limit the search space, the gradually decreasing slope of the scores leads to the detection of the wrong number of clusters (Figure 7, right panel).

# Import
from clusteval import clusteval

# Initialize
cl = clusteval(cluster='agglomerative', evaluate='dbindex', max_clust=10)
# Import example dataset
X, y = cl.import_example(data='blobs', params={'random_state':1})

# find optimal number of clusters
results = cl.fit(X)

# Make plot
cl.plot(figsize=(12, 7))
# Show scatterplot with the detected cluster labels
cl.scatter()

# [clusteval] >INFO> Fit with method=[agglomerative], metric=[euclidean], linkage=[ward]
# [clusteval] >INFO> Evaluate using dbindex.
# [clusteval] >INFO: 100%|██████████| 18/18 [00:00<00:00, 120.32it/s]
# [clusteval] >INFO> Compute dendrogram threshold.
# [clusteval] >INFO> Optimal number clusters detected: [4].
# [clusteval] >INFO> Fin.
Figure 7. Davies-Bouldin index scores for varying numbers of clusters. Left panel: With a maximum limit of 10 clusters. Right panel: With a maximum limit of 20 clusters. Image by the author.

For the Davies-Bouldin index method, we also iterate across the seven toy example data sets, but I only include the agglomerative clustering methods with single, complete, and ward linkage. Note that DBSCAN is not supported in combination with the Davies-Bouldin index. In all examples, the search space is limited (max_clust=10) to prevent overshooting (Figure 8). Even though we limited the search space, there is still a gradually decreasing slope that leads to an incorrect number of clusters in many of the example data sets. This is especially seen in the snake-like and anisotropic clusters (top three rows). A solution is to further lower the maximum number of clusters (max_clust). For the Blobs, Globular, and Density toy examples, the correct cluster tendency is detected, as can be seen from the scatter plots, but the optimal number of clusters frequently overshoots. Here again, if we further limit the maximum number of clusters to 6 or 7, the results improve.

Figure 8. Detection of the optimal number of clusters using the Davies-Bouldin index method for the seven toy example data sets. Image by the author.

Cluster Evaluation using the derivative method.

The derivative method compares the height of each cluster merge to the average height and normalizes it by the standard deviation computed over the depth of previous levels. The derivative method then returns the cluster labels for the optimal cut-off based on the chosen hierarchical clustering method (red vertical line). In other words, a strong peak in the orange line appears when there is a sharp "elbow" in the blue line. Let's see how well the derivative method can cluster samples with similar densities and sizes. In the code section below, we initialize the Agglomerative clustering method with the derivative evaluation method. In comparison to the Davies-Bouldin index, we do not need to limit the search space (Figure 9) and can easily detect the 4 clusters.

Figure 9. Derivative method scores for varying numbers of clusters. Image by the author.
# Import library
from clusteval import clusteval

# Initialize
cl = clusteval(cluster='agglomerative', evaluate='derivative', max_clust=20)
# Import example dataset
X, y = cl.import_example(data='blobs', params={'random_state':1})

# find optimal number of clusters
results = cl.fit(X)

# Make plot
cl.plot(figsize=(12, 8))
# Show scatterplot with the detected cluster labels
cl.scatter()
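
For intuition on the elbow idea, the sketch below inspects the merge heights of the linkage matrix directly. Note that this is a simplified elbow-style stand-in using the second derivative of the merge heights, not clusteval's exact derivative implementation; X is assumed to be the blobs data set imported above.

import numpy as np
from scipy.cluster.hierarchy import linkage

# X as imported in the code block above (the blobs toy data set).
Z = linkage(X, method='ward')

# Heights of the last ten merges; a sharp jump ("elbow") marks a natural cut.
heights = Z[-10:, 2]
acceleration = np.diff(heights, 2)  # discrete second derivative of the merge heights

# Reverse so that index 0 corresponds to cutting the tree into two clusters.
n_clusters = int(acceleration[::-1].argmax()) + 2
print('suggested number of clusters:', n_clusters)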

Let's iterate across the seven toy example data sets and determine the optimal number of clusters using the derivative method (Figure 10). It can be seen that the derivative approach frequently leads to an incorrect number of clusters. If we look closer at the derivative line (orange), it shows only a small peak across many data sets, which points toward a weak elbow and thus likely noisy data. This hints that the estimated number of clusters may be unreliable.

Figure 10. Detection of the optimal number of clusters using the Derivative method for seven toy example data sets. Image by the author.

Be aware that cluster evaluation approaches can easily be fooled, as scores can gradually improve with an increasing number of clusters.


Final words.

I demonstrated how different clustering methods respond to different data sets and that it takes multiple steps to determine the optimal number of clusters. Different clustering methods will also yield different groupings, since they implicitly impose a structure on the data and thus on the partitioning of the samples. The clusteval package can help in cluster investigation and backtesting, and the results can be explored with various plotting functionalities.

In general, K-means clustering works pretty well, especially for spherical, balanced, and globular clusters. However, it is not well suited for detecting clusters with different densities or snake-like clusters. DBSCAN, on the other hand, is a great all-rounder that detects the correct number of clusters in most scenarios. Just be aware that the clustering tendency can be correct while the exact number of detected clusters is not. This is especially the case for globular clusters, as many (groups of) samples can be flagged as outliers. A disadvantage of DBSCAN is that it can be computationally intensive, especially when optimizing the epsilon parameter. Instead, use agglomerative clustering with single linkage for snake-like clusters or for the detection of outliers. Alternatively, ward or complete linkage is suited for noisy, globular clusters, and K-means for spherical, balanced, and globular clusters.

For the cluster evaluation methods, the Silhouette score has many advantages. It is a sample-wise measure that can be used with any distance metric, making it an attractive and versatile method. The Davies-Bouldin index and the derivative method, on the other hand, need a more iterative process of backtesting and investigation, as they often overshoot the correct number of clusters. The Davies-Bouldin index is described as being more robust when dealing with data sets with overlapping clusters and is computationally more efficient, as it does not require computing the pairwise distances between all data points.

To summarize, the choice of cluster evaluation methods depends on the specific characteristics of your dataset, the goals of your analysis, and the requirements of your application. You can always perform a backtesting approach using multiple methods and compare their results to make an informed decision based on the specific context of your problem.

Be Safe. Stay Frosty.

Cheers E.


If you found this article helpful, use my referral link to continue learning without limits and sign up for a Medium membership. Plus, follow me to stay up-to-date with my latest content!



References

  1. E. Taskesen, How to Find the Best Theoretical Distribution for Your Data, Feb. 2023, Towards Data Science, Medium.
  2. E. Taskesen, Explore and understand your data with a network of significant associations, Aug. 2021, Towards Data Science, Medium.
  3. E. Taskesen, A step-by-step guide for clustering images, Dec. 2021, Towards Data Science, Medium.
