How To Tune HDBSCAN

A Quick Example of How to Tune Density Based Clustering from the Trenches

Charles Frenzel
Towards Data Science


Clustering is a very hard problem because there is never truly a ‘right’ answer when labels do not exist.

Photo by Tengyart via Unsplash

This is compounded by the fact that every technique comes with its own set of assumptions. If a technique is applied to data that violates those assumptions, the results will be incorrect (dead wrong).

In this blog post, we will delve a bit into why clustering gets complicated and then take a deep dive into how to properly tune density-based clusters in HDBSCAN, making use of the Amazon DenseClus library along the way.

Background: Clustering is Complicated 😬

There is No Free Lunch for clustering algorithms: while one algorithm might fit a certain dataset well, there are no guarantees that it will work on a different dataset in the exact same manner. Likewise, clustering is “strongly dependent on contexts, aims and decisions of the researcher,” which adds fuel to the argument that there is no such thing as a “universally optimal method that will just produce natural clusters,” as noted by Hennig in What Are the True Clusters? (Hennig 2015).

For example, commonly used techniques such as KMeans assume that the data is numerical and that clusters are spherical. Those assumptions do not fare well when the data is high-dimensional and includes categorical values.

Clustering data that violates a technique's assumptions creates a conundrum for the practitioner in two ways:

  1. How to formalize a specific featurization scheme?
  2. What clustering technique to choose?

Both of these must be chosen so that no assumptions are violated. In practice, this can lead to a process of elimination whereby the featurization scheme and algorithm that do not violate any assumptions are the only choices left standing.

Be Wary of Your Metric 📈

When no labels are available, it is common to pick an objective metric such as Silhouette Score to evaluate clusters and decide on the final clustering result. Silhouette Score measures cluster cohesion and separation with an index between -1 and 1. However, it does NOT account for noise in the index calculation, and it relies on distances, which are not applicable to a density-based technique. Not accounting for noise in the objective metric violates an inherent assumption of density-based clustering.

This means that Silhouette Score and similar indexes are inappropriate for measuring density-based techniques!!! (My own emphasis added because I have seen multiple blogs on here doing exactly that, and it is dangerous.)
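To make that concrete, here is a minimal sketch (on a toy two-moons dataset, not the churn data used later) showing how easily the problem sneaks in: scikit-learn's silhouette_score happily treats HDBSCAN's noise label of -1 as if it were just another cluster unless you filter those points out yourself.

# Minimal sketch: Silhouette Score silently counts HDBSCAN's noise label (-1)
# as its own cluster unless you drop those points before scoring.
import hdbscan
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X)

mask = labels >= 0  # keep only points assigned to a real cluster
print("silhouette with noise included:", silhouette_score(X, labels))
print("silhouette with noise removed: ", silhouette_score(X[mask], labels[mask]))

Even after removing noise, the distance-based score will under-rate perfectly good non-convex clusters like the two moons, which is the second reason it is the wrong yardstick here.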

Density Based Clustering Validation to the Rescue 🌈

Density Based Clustering Validation, or DBCV, works for density-based clustering algorithms precisely because it takes noise into account and captures the shape property of clusters via densities rather than distances (see the original paper).

As the paper explains, the final DBCV result is a weighted sum of the “Validity Index” values of the clusters. This produces a score between -1 and 1, with larger values indicating a better clustering solution.

Source: Density-Based Clustering Validation, Moulavi et al. 2014

An in-depth discussion is out of scope here, but please see the original paper for more details.
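If you want to see the metric in action before we get to the real data, here is a minimal sketch (again on a toy two-moons dataset) of the two DBCV-flavoured scores the hdbscan library exposes: the full validity_index function and the faster relative_validity_ approximation that the rest of this post reports.

# Minimal sketch: two ways to get a DBCV-style score from the hdbscan library.
import numpy as np
import hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)

# gen_min_span_tree=True is needed for the fast relative_validity_ attribute
clusterer = hdbscan.HDBSCAN(min_cluster_size=25, gen_min_span_tree=True).fit(X)

# Full DBCV computed from the raw data and the labels (noise is handled for us)
print("DBCV:", hdbscan.validity.validity_index(X.astype(np.float64), clusterer.labels_))

# Cheaper approximation derived from the fitted minimum spanning tree
print("relative validity:", clusterer.relative_validity_)

The two numbers will not match exactly; relative_validity_ trades some accuracy for speed, which is exactly what makes it handy inside a hyperparameter search.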

Note that DBCV does have drawbacks. Like all other metrics and techniques, it is not immune to the problems of complication and measurement in clustering noted earlier.

However, short of having ground-truth labels, it provides an objective criterion for judging how well separated the clusters from a density-based technique are.

A Real Example 🚀

Enough of that, let’s dive into a real example.

The notebook is available within the Amazon DenseClus library.

Note: as of November 2023, the notebook has been updated along with the library. For this blog post, I am pinning the version it originally ran on.

In this example, you will use a synthetic churn dataset for an imaginary telecommunications company, with the outcome Churn? flagged as either True (churned) or False (did not churn). Features include customer details such as plan and usage information. The churn dataset is publicly available and mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.

pip install amazon-denseclus==0.0.8

The data includes both numeric and categorical features, and you will use DenseClus to transform it into a lower-dimensional, dense space to form clusters on. For more on DenseClus, see here. All of the needed transformations are taken care of under the hood. You just get to call fit.

# This runs in about a minute or two
from denseclus import DenseClus

import logging  # to further silence deprecation warnings

logging.captureWarnings(True)

# SEED is any fixed integer for reproducibility, and df is the churn
# DataFrame (features only) loaded earlier in the notebook
clf = DenseClus(
    random_state=SEED,
    umap_combine_method="intersection_union_mapper",
)

clf.fit(df)

Under the hood, among other steps, DenseClus uses HDBSCAN to cluster the data.

Let’s look at how the data got split.

import pandas as pd

embedding = clf.mapper_.embedding_
labels = clf.score()
clustered = (labels >= 0)

cnts = pd.DataFrame(labels)[0].value_counts()
cnts = cnts.reset_index()
cnts.columns = ['cluster', 'count']
print(cnts.sort_values(['cluster']))
   cluster  count
4       -1      9
3        0   1234
0        1   1265
1        2   1253
2        3   1239

Upon examination, there are exactly 4 almost evenly distributed clusters, with the -1 label representing the noise found in the data.

In addition to simply looking at their spread, another way to evaluate clusters is to visualize them.

import seaborn as sns

_ = sns.jointplot(
    x=embedding[clustered, 0],
    y=embedding[clustered, 1],
    hue=labels[clustered],
    kind="kde",
)
KDE joint plot of the first two embedding dimensions, colored by cluster (image by author)

As you can see, there are 4 distinct islands formed within this slice of the data. Clusters have formed around these densities, which is exactly the behavior we expect from DenseClus.

You can further confirm the outcome by plotting the tree along which the densities were split.

This is a graphical view of the counts we saw, with more information. For example, you can see that a two-cluster solution is also possible, as two densities form the base split for the clusters.

import numpy as np

_ = clf.hdbscan_.condensed_tree_.plot(
    select_clusters=True,
    # one color per cluster, excluding the noise label
    selection_palette=sns.color_palette("deep", np.unique(labels[clustered]).shape[0]),
)
Condensed tree plot with the four selected clusters highlighted (image by author)

Lastly, let’s confirm that the majority of data points are covered by our clusters (hint: only 9 aren’t) and check the DBCV score.

coverage = np.sum(clustered) / embedding.shape[0]

print(f"Coverage {coverage}")
print(f"DBCV score {clf.hdbscan_.relative_validity_}")
Coverage 0.9982
DBCV score 0.2811143727637039

The DBCV score comes out to 0.28 on a scale of -1 to 1.

That’s not great, but it could be worse. Let’s optimize the score to find the best HDBSCAN hyperparameters to pass in.

Hyperparameter Tuning 🦾

The two primary hyperparameters to look at to further improve results are min_samples and min_cluster_size, as noted in the HDBSCAN documentation.

You will run multiple combinations of these to find a result that generates a high DBCV score.

In addition to these hyperparameters, you will also look at the cluster selection method: extracting clusters by Excess of Mass (eom) versus splitting clusters along the tree's leaves (leaf) (for details, see “hdbscan: Hierarchical density based clustering”, McInnes et al. 2017).

As HDBSCAN’s documentation notes, whereas the eom method only extracts the most stable, condensed clusters from the tree, the leaf method also selects clusters from the leaf nodes at the bottom of the tree.

This results in smaller, more homogeneous clusters that are more likely to be fine grained.
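Before launching the full search, a quick sketch like the one below (reusing the embedding extracted earlier, with an illustrative min_cluster_size of 100 rather than a recommended value) can show how differently the two selection methods carve up the same data.

# Quick sketch: compare eom and leaf selection on the same embedding.
# min_cluster_size=100 is just an illustrative value, not a recommendation.
import numpy as np
import hdbscan

for method in ("eom", "leaf"):
    candidate = hdbscan.HDBSCAN(
        min_cluster_size=100,
        cluster_selection_method=method,
        gen_min_span_tree=True,
    ).fit(embedding)
    labels_m = candidate.labels_
    n_clusters = int(labels_m.max()) + 1
    n_noise = int(np.sum(labels_m == -1))
    print(f"{method}: {n_clusters} clusters, {n_noise} noise points, "
          f"DBCV approx {candidate.relative_validity_:.3f}")

Expect leaf to return more, smaller clusters; the randomized search below lets DBCV decide whether that extra granularity is actually worth it.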

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
import hdbscan


logging.captureWarnings(True)
hdb = hdbscan.HDBSCAN(gen_min_span_tree=True).fit(embedding)

# specify parameters and distributions to sample from
param_dist = {
    'min_samples': [10, 30, 50, 60, 100],
    'min_cluster_size': [100, 200, 300, 400, 500, 600],
    'cluster_selection_method': ['eom', 'leaf'],
    'metric': ['euclidean', 'manhattan'],
}

# score each candidate clustering with DBCV (higher is better)
validity_scorer = make_scorer(hdbscan.validity.validity_index, greater_is_better=True)


n_iter_search = 20
random_search = RandomizedSearchCV(
    hdb,
    param_distributions=param_dist,
    n_iter=n_iter_search,
    scoring=validity_scorer,
    random_state=SEED,
)

random_search.fit(embedding)


print(f"Best Parameters {random_search.best_params_}")
print(f"DBCV score :{random_search.best_estimator_.relative_validity_}")
Best Parameters {'min_samples': 100, 'min_cluster_size': 300, 'metric': 'manhattan', 'cluster_selection_method': 'eom'}
DBCV score :0.48886415007392386

The DBCV score has now risen from 0.28 to 0.488.

DenseClus defaults min_samples to 15 and min_cluster_size to 100. The Random Search settled on larger, more restrictive clusters, which results in higher density and a higher score :) City-block (Manhattan) distance appears to aid the increase too.

In practice, we would want a score over 0.45 to be confident that clusters are well separated, and this score clears that bar.
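If you want to carry those tuned values back into DenseClus itself rather than keep working with the standalone HDBSCAN estimator, a rough sketch could look like the one below. It assumes the pinned 0.0.8 constructor accepts min_samples, min_cluster_size, and cluster_selection_method directly (newer releases group such options differently, and the distance metric may not be exposed at all), so check the signature of your installed version first.

# Rough sketch: refit DenseClus with the tuned HDBSCAN hyperparameters.
# Assumes these keyword arguments exist on the pinned 0.0.8 constructor;
# the tuned 'metric' is left out because it may not be configurable here.
from denseclus import DenseClus

best = random_search.best_params_

clf_tuned = DenseClus(
    random_state=SEED,
    umap_combine_method="intersection_union_mapper",
    min_samples=best["min_samples"],
    min_cluster_size=best["min_cluster_size"],
    cluster_selection_method=best["cluster_selection_method"],
)
clf_tuned.fit(df)

print(f"DBCV score: {clf_tuned.hdbscan_.relative_validity_}")

With the same random_state, the UMAP step should reproduce a comparable embedding, but treat the resulting score as a sanity check rather than an exact replay of the search.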

Let’s confirm this by looking at how clusters were split and visualizing the results again.

# evaluate the clusters
labels = random_search.best_estimator_.labels_
clustered = (labels >= 0)

coverage = np.sum(clustered) / embedding.shape[0]
total_clusters = np.max(labels) + 1
cluster_sizes = np.bincount(labels[clustered]).tolist()

print(f"Percent of data retained: {coverage}")
print(f"Total Clusters found: {total_clusters}")
print(f"Cluster splits: {cluster_sizes}")


_ = sns.jointplot(
    x=embedding[clustered, 0],
    y=embedding[clustered, 1],
    hue=labels[clustered],
    kind="kde",
)
Percent of data retained: 1.0
Total Clusters found: 3
Cluster splits: [2501, 1236, 1263]
KDE joint plot of the tuned clusters on the first two embedding dimensions (image by author)

Interestingly enough, no noise was found. Two of the clusters are essentially the same as before, while the third is almost the size of the other two combined.

Visualizing the data on the same slice gives us a clue as to what happened here. The clusters numbered 3 and 2 from our previous run are now combined.

Shifting to a different dimensional slice can sometimes help explain what happened, and the plot below shows a better view.

_ = sns.jointplot(
    x=embedding[clustered, 1],
    y=embedding[clustered, 2],
    hue=labels[clustered],
    kind="kde",
)
KDE joint plot of embedding dimensions 1 and 2, colored by cluster (image by author)

Wrap-up 🥂

I hope you enjoyed a closer look at how to tune hyperparameters for HDBSCAN!!!

In this post, you looked at why clustering and clustering metrics can get complicated, learned about DBCV as an objective metric, and then applied it using Amazon DenseClus and HDBSCAN.

We’ve only scratched the surface here. To dive deeper, you could look at the following:

  • What other type of optimization frameworks can you use in place of Random Search?
  • What other type of hyperparameters are possible to use for tuning?
  • What other measures are possible here for further cluster validation?
  • Can any other underlying hyperparameters in DenseClus be tweaked to achieve a higher score?

Reference

“Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”, Rousseeuw 1987

“Density-Based Clustering Validation”, Moulavi et al. 2014

“hdbscan: Hierarchical density based clustering”, McInnes et al. 2017
