
Precision Clustering Made Simple: kscorer’s Guide to Auto-Selecting Optimal K-means Clusters

kscorer streamlines clustering and provides a practical approach to data analysis through advanced scoring and parallelization

Made by DALL-E-2 according to the author’s description

Unsupervised machine learning, particularly Clustering, is a challenging task in data science. It is crucial to a wide range of practical business analytics projects. Clustering can operate on its own, but it is also a valuable component in complex data processing pipelines that enhance the efficiency of other algorithms. For instance, clustering plays a crucial role when developing a recommender system.

Scikit-Learn offers a range of proven clustering algorithms. Nonetheless, most of them are parametric and require you to set the number of clusters in advance, which is one of the most significant challenges in clustering.

Commonly, an iterative method is used to decide on the optimal number of clusters when working with data: you carry out clustering multiple times, each time with a different number of clusters, and evaluate the corresponding result. While this technique is useful, it has limitations.

The yellowbrick package is a commonly employed tool that makes it easy to identify the optimal number of clusters. However, it has some drawbacks, notably the possibility of conflicting outcomes when evaluating multiple metrics and the difficulty of pinpointing an elbow on the diagram.

In addition, the dataset size poses another problem, regardless of the package used. When working with large datasets, resource consumption may impede your ability to iterate efficiently over a wide range of cluster counts. If this is the case, consider techniques such as MiniBatchKMeans, which clusters the data incrementally on small random batches and scales well to large datasets.
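For example, here is a minimal sketch of mini-batch clustering with scikit-learn (the data and every parameter value are purely illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Illustrative data: 100,000 points in 20 dimensions, a stand-in for a large dataset.
X = np.random.RandomState(42).rand(100_000, 20)

# MiniBatchKMeans updates centroids on small random batches instead of the
# full dataset, which keeps memory use and runtime manageable.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(X)
```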

But advanced optimization of your clustering routine may require lesser-known techniques, described below. You will also get to know the kscorer package, which streamlines these techniques, offering a more robust and efficient approach to determining the optimal number of clusters.

In a nutshell, those techniques are:

  • Dimensionality Reduction. It may be beneficial to apply Principal Component Analysis (PCA) to the data before clustering. This reduces noise and redundancy in the data and leads to a more reliable clustering process.
  • Cosine Similarity. There is a straightforward way to use (approximate) cosine distances in K-means: apply Euclidean normalisation to the data. This way, you do not need to pre-calculate a distance matrix, as you would when performing agglomerative clustering.
  • Many-Metrics-At-Hand. To find the optimal number of clusters, one should rely on a multi-metric assessment instead of depending on a single metric.
  • Data Sampling. To address resource consumption issues and improve clustering results, one can draw random samples from the data to perform clustering and assess metrics. Averaging scores from multiple iterations reduces the effect of randomness and produces more consistent results.

This workflow is illustrated below.

Image by the author

Luckily, there is no need to build this entire pipeline from scratch as an implementation is currently available in the kscorer package.

Now, let’s delve a bit deeper

I once heard a data scientist state at a conference talk: "Basically, you can do what you want, as long as you know what you are doing." © Alex2006

It is recommended to scale your data before clustering to ensure all features are on an equal footing and none dominate due to their magnitude. Standardisation (centred around the mean and scaled by the standard deviation) or Min-Max scaling (scaling values to a specified range) are common techniques used for scaling.
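As a minimal illustration with scikit-learn (the feature matrix below is synthetic and purely for demonstration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative features on wildly different scales.
X = np.random.RandomState(42).normal(loc=0.0, scale=[1.0, 100.0], size=(1_000, 2))

# Standardisation: zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: squeeze each feature into the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)
```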

It’s worth noting that the significance of feature scaling, perfectly illustrated here, isn’t restricted to KNeighbors models but applies to a wide range of data science methodologies. Standardising features through z-score normalisation ensures that all features are on the same scale, preventing any single feature from dominating the model fit due to its magnitude. This scaling step can significantly impact the model’s performance, resulting in a different fit compared to using unscaled data.

Furthermore, there is a fundamental link between K-means clustering and PCA, explored in Ding and He’s paper "K-means Clustering via Principal Component Analysis". Although initially serving distinct purposes, these techniques ultimately aim to efficiently represent data whilst minimising reconstruction errors. PCA aims to represent data vectors as a combination of a reduced number of eigenvectors. In contrast, K-means clustering aims to represent data vectors as a combination of cluster centroid vectors. Both approaches strive to minimise the mean-squared reconstruction error.

After applying PCA, we’ll scale our data again to avoid computational issues that might otherwise arise (some values may be close to zero while others are quite large). This makes perfect sense because, after PCA, we have already lost track of our initial features, so there is no interpretation of the data to preserve.
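A brief sketch of that step (the retained-variance threshold is an arbitrary choice here, and X_std is assumed to be the standardised feature matrix from the previous snippet):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project the standardised data onto enough principal components to
# retain roughly 90% of the variance (threshold chosen for illustration).
X_pca = PCA(n_components=0.9, random_state=42).fit_transform(X_std)

# Rescale the components so that none dominates numerically downstream.
X_pca = StandardScaler().fit_transform(X_pca)
```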

Another interesting correlation that may not be commonly known is between Cosine Similarity and Euclidean Distance. Understanding the relationship between these measures is crucial when they are used interchangeably, albeit indirectly. This knowledge has a practical application in transforming the traditional K-means Clustering Algorithm into the Spherical K-means Clustering Algorithm, where cosine similarity is a vital metric for clustering data. As mentioned previously, we can "establish" the connection between cosine similarity and Euclidean distance by applying Euclidean normalisation to the data.
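In practice this boils down to L2-normalising every row before running K-means; a minimal sketch (the number of clusters is arbitrary here, and X_pca is assumed to come from the previous snippet):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# After L2 normalisation, the squared Euclidean distance between two rows
# equals 2 * (1 - cosine similarity), so K-means effectively groups points
# by cosine distance, i.e. it behaves like spherical K-means.
X_norm = normalize(X_pca, norm="l2")
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_norm)
```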

In the absence of ground-truth cluster labels, the evaluation of clustering models must rely on intrinsic measures, and the kscorer package offers a comprehensive set of indicators to assess the quality of clustering. These indicators offer valuable insight into the degree of separation among the recognised clusters:

  • Silhouette Coefficient. It quantifies cluster separation by calculating, for each data point, the difference between the mean distance to the nearest cluster the point does not belong to and the mean intra-cluster distance, then normalising by the larger of the two. Higher values indicate superior cluster separation.
  • Calinski-Harabasz Index. It calculates the ratio of between-cluster scattering to within-cluster scattering. A higher Calinski-Harabasz score indicates better clustering performance and more well-defined clusters.
  • Davies-Bouldin Index. It measures the average similarity between each cluster and the cluster most similar to it, where similarity is the ratio of within-cluster dispersion to between-cluster separation. A lower value indicates superior clustering performance and more distinct clusters.
  • Dunn Index. It assesses cluster quality by comparing intercluster distance (the smallest distance between any two cluster centroids) to intracluster distance (the largest distance between any two points within a cluster). A higher Dunn Index indicates more well-defined clusters.

A Python calculation of the Dunn Index, along the lines of what the package uses, is outlined as follows (a sketch, not necessarily the exact implementation):
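```python
import numpy as np
from sklearn.metrics import pairwise_distances

def dunn_index(X, labels, centroids):
    # A sketch of the Dunn Index; kscorer's own implementation may differ in detail.
    # Inter-cluster separation: smallest distance between any two centroids.
    centroid_dist = pairwise_distances(centroids)
    np.fill_diagonal(centroid_dist, np.inf)
    min_separation = centroid_dist.min()
    # Intra-cluster compactness: largest distance between any two points
    # that belong to the same cluster.
    max_diameter = max(
        pairwise_distances(X[labels == k]).max()
        for k in np.unique(labels)
    )
    return min_separation / max_diameter
```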

  • Bayesian Information Criterion (BIC). BIC serves as an additional and, to some extent, independent metric. Although K-means does not offer a direct probabilistic model, BIC can help estimate the data’s distribution after a K-means model has been applied, providing a more comprehensive assessment of cluster quality.
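One way to obtain such an estimate, sketched here by fitting a Gaussian mixture initialised at the K-means centroids (an assumption for illustration; the package’s internal formula may differ), is:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def kmeans_bic(X, n_clusters, random_state=42):
    # Fit K-means first, then approximate the data's distribution with a
    # Gaussian mixture whose means start at the K-means centroids, and read
    # off that mixture's BIC (lower raw BIC = better fit, before rescaling).
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    gmm = GaussianMixture(
        n_components=n_clusters,
        means_init=km.cluster_centers_,
        random_state=random_state,
    ).fit(X)
    return gmm.bic(X)
```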

All metrics are standardised, guaranteeing that higher scores consistently indicate well-defined clusters. This thorough evaluation is crucial in identifying the optimal number of clusters in a dataset.

To overcome memory limitations and execute data preprocessing and scoring operations expediently for K-means clustering, the kscorer package utilises N random data samples. This approach ensures seamless execution and adapts to datasets of different sizes and structures. Similar to cross-validation techniques, it maintains robust results, even though each iteration focuses on a limited subset of the data.
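Conceptually, the sampling loop looks something like the sketch below (the sample size, the number of repeats and the single metric shown are illustrative choices; kscorer averages several metrics in the same fashion):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def sampled_score(X, n_clusters, n_repeats=5, sample_size=10_000, seed=42):
    # Score a candidate number of clusters on several random subsamples
    # and average the results to dampen sampling noise.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        labels = MiniBatchKMeans(n_clusters=n_clusters, n_init=3,
                                 random_state=seed).fit_predict(X[idx])
        scores.append(silhouette_score(X[idx], labels))
    return float(np.mean(scores))
```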

Hands-on with kscorer

So, we have some data for clustering. Please note that we pretend not to know the exact number of clusters in this scenario.
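For illustration, a synthetic stand-in dataset will do, with its true number of groups kept out of sight (the data used in the original experiment may, of course, differ):

```python
from sklearn.datasets import make_blobs

# Synthetic stand-in data: pretend we don't know how many blobs were generated.
X, y_true = make_blobs(n_samples=30_000, n_features=15, centers=7,
                       cluster_std=2.5, random_state=42)
```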

Moving forward, we will split our dataset into train and test sets and fit a model to detect the optimal number of clusters. The model will automatically search for the optimal number of clusters between 3 and 15. This can be effortlessly achieved as follows:
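In kscorer this is a single fit call over the desired range of k. Since the package’s exact API isn’t reproduced in this excerpt, the sketch below shows a rough scikit-learn equivalent of what gets automated under the hood, reusing the sampled_score helper from the earlier sketch and averaging only one metric for brevity:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer, normalize
from sklearn.decomposition import PCA

X_train, X_test, y_train, y_test = train_test_split(X, y_true, test_size=0.25,
                                                    random_state=42)

# Preprocess as discussed: standardise -> PCA -> rescale -> L2-normalise.
prep = make_pipeline(StandardScaler(),
                     PCA(n_components=0.9, random_state=42),
                     StandardScaler(),
                     FunctionTransformer(normalize))
X_prep = prep.fit_transform(X_train)

# Evaluate every candidate number of clusters between 3 and 15.
scores = {k: sampled_score(X_prep, k) for k in range(3, 16)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```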

After completing the fitting process, we can review the scaled scores for all the metrics applied. This will help us determine the best number of clusters for our available data. When checking the plot, you will notice that some cluster counts are highlighted with corresponding scores. These labelled points correspond to the local maxima of the average score across all metrics and thus represent the best candidates for the optimal number of clusters.

Now, we can evaluate how well our new cluster labels match the true labels. Bear in mind that this option is usually not available in practical business scenarios 😉
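Continuing the sketch above with the stand-in data, that check could look like this (the metrics shown are one reasonable choice):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Refit on the preprocessed training data with the selected number of clusters.
final_km = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_prep)
cluster_labels = final_km.labels_

# y_train holds the ground-truth labels we normally would not have access to.
print(adjusted_rand_score(y_train, cluster_labels))
print(adjusted_mutual_info_score(y_train, cluster_labels))
```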

Unusually for clustering, you could also try to assign cluster labels to data that hasn’t been seen before, though this isn’t a typical clustering task. A different and often more useful strategy is to build a classifier using the cluster labels as targets. This makes it easier to assign cluster labels to new data.
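One way to set this up, continuing the sketches above (the choice of classifier is arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier

# Train a classifier to reproduce the cluster assignments, then use it to
# label unseen data passed through the same preprocessing pipeline.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_prep, cluster_labels)

X_test_prep = prep.transform(X_test)
test_cluster_labels = clf.predict(X_test_prep)
```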

And, finally, a fresh interactive perspective on our data.
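For instance, a quick interactive 3-D scatter of the first three principal components with Plotly Express (purely illustrative; the original article’s visualisation may differ):

```python
import pandas as pd
import plotly.express as px

# First three components of the preprocessed data, coloured by cluster label.
df = pd.DataFrame(X_prep[:, :3], columns=["pc1", "pc2", "pc3"])
df["cluster"] = cluster_labels.astype(str)
px.scatter_3d(df, x="pc1", y="pc2", z="pc3", color="cluster", opacity=0.7).show()
```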

So, that is how we’ve delved into K-means clustering using the kscorer package, which streamlines the process of finding the optimal number of clusters. Thanks to its multi-metric scoring and parallel processing, it has proved to be a practical tool for data analysis.

