
Cheat sheet for implementing 7 methods for selecting the optimal number of clusters in Python

Select the optimal number of clusters based on multiple clustering validation metrics like Gap Statistic, Silhouette Coefficient…

Photo by Mehrshad Rajabi on Unsplash

Segmentation provides a data-driven angle for examining meaningful segments that executives can use to take targeted actions and improve business outcomes. Many executives run the risk of making decisions based on overgeneralizations because they use a one-size-fits-all approach to assess their business ecosystem. Segmentation, however, improves decision-making by providing multiple meaningful lenses to break apart data and take action.

One of the most perplexing issues we face while trying to segment customers or products is choosing the ideal number of segments. This is a key parameter for multiple clustering algorithms like K-means, agglomerative clustering and GMM clustering. Unless our data has just 2 or 3 dimensions, it is not possible to visually understand the clusters present in the data, and in most practical applications we will have more than 3 dimensions. This blog will help readers understand and quickly implement the most popular techniques for selecting the optimal number of clusters:

  1. Gap Statistic
  2. Elbow Method
  3. Silhouette Coefficient
  4. Calinski-Harabasz Index
  5. Davies-Bouldin Index
  6. Dendrogram
  7. Bayesian information criterion (BIC)

For this exercise, we will be working with clickstream data from an online store offering clothing for pregnant women. It has data from April 2008 to August 2008 and includes variables like product category, location of the photo on the webpage, country of origin of the IP address and product price in US dollars. Before selecting the optimal number of clusters, we will need to prepare the data for segmentation.

I encourage you to check out the article below for an in-depth explanation of different steps for preparing data for segmentation before proceeding further:

One Hot Encoding, Standardization, PCA: Data preparation for segmentation in python

Gap Statistic

The gap statistic was developed by Stanford researchers Tibshirani, Walther and Hastie in their 2001 paper. The idea behind their approach was to find a way to compare cluster compactness with a null reference distribution of the data, i.e. a distribution with no obvious clustering. Their estimate for the optimal number of clusters is the value for which cluster compactness on the original data falls the farthest below this reference curve. This information is contained in the following formula for the gap statistic:

Gap(k) = E*[log(Wk)] – log(Wk)

where E* denotes the expectation under the reference distribution and Wk is a measure of the compactness of our clustering based on the Within-Cluster Sum of Squared Errors (WSS):

Wk = Σ (over clusters r) Σ (over points x in cluster r) ||x – cr||²

The Within-Cluster Sum of Squared Errors is what the inertia_ attribute of scikit-learn's KMeans returns, and it is calculated as follows:

  • The squared distance of each point from the centre of its cluster is computed (the squared errors)
  • The WSS score is the sum of these squared errors over all the points, as the short check below verifies
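
As a quick sanity check, the WSS returned by inertia_ can be reproduced by hand. Below is a minimal sketch, assuming cluster_df is the prepared data and an arbitrary choice of 5 clusters:

# Reproducing KMeans inertia_ by hand (sanity check)
import numpy as np
from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, random_state=42).fit(cluster_df)
centers = km.cluster_centers_[km.labels_]              # centre assigned to each point
wss = ((np.asarray(cluster_df) - centers) ** 2).sum()  # sum of squared errors over all points
print(wss, km.inertia_)                                # the two values should match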

Calculating the gap statistic in Python for K-means clustering involves the following steps:

  • Cluster the observed data for various numbers of clusters and compute the compactness of each clustering
  • Generate reference datasets and cluster each of them with the same numbers of clusters. The reference datasets are created from a "continuous uniform" distribution using the random_sample function.
  • Calculate the average compactness of the clusterings on the reference datasets
  • Calculate the gap statistic as the difference between the (log) compactness on the reference data and on the original data
# Gap Statistic for K-means
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates the optimal K for KMeans using the Gap Statistic.
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of random reference datasets to create
        maxClusters: maximum number of clusters to test for
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros(len(range(1, maxClusters)))
    results = []
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for the reference dispersion results
        refDisps = np.zeros(nrefs)
        # For each reference, generate a random sample, run KMeans and record the dispersion
        for i in range(nrefs):
            # Create a new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit KMeans to it
            km = KMeans(n_clusters=k)
            km.fit(randomReference)
            refDisps[i] = km.inertia_
        # Fit KMeans to the original data and record its dispersion
        km = KMeans(n_clusters=k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate the gap statistic for this k
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        gaps[gap_index] = gap
        results.append({'clusterCount': k, 'gap': gap})
    resultsdf = pd.DataFrame(results)
    return (gaps.argmax() + 1, resultsdf)
score_g, df = optimalK(cluster_df, nrefs=5, maxClusters=30)
plt.plot(df['clusterCount'], df['gap'], linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('Gap Statistic');
plt.title('Gap Statistic vs. K');
Fig 1: Gap Statistics for various values of clusters (Image by author)

As seen in Figure 1, the gap statistic is maximized at 29 clusters and hence, we can choose 29 clusters for our K-means.

Elbow Method

It is the most popular method for determining the optimal number of clusters. The method is based on calculating the Within-Cluster Sum of Squared Errors (WSS) for different numbers of clusters (k) and selecting the k at which the change in WSS first starts to diminish.

The idea behind the elbow method is that the explained variation changes rapidly for a small number of clusters and then slows down, leading to an elbow in the curve. The elbow point is the number of clusters we can use for our clustering algorithm. Further details on this method can be found in this paper by Chunhui Yuan and Haitao Yang.
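
Under the hood, this simply means plotting the WSS (KMeans inertia) against k. Here is a minimal sketch with scikit-learn alone, assuming cluster_df is the prepared data:

# Elbow curve by hand: WSS (inertia) for k = 2..29
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
ks = range(2, 30)
wss = [KMeans(n_clusters=k, random_state=42).fit(cluster_df).inertia_ for k in ks]
plt.plot(ks, wss, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('WSS');
plt.title('WSS vs. K');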

We will be using the YellowBrick library, which can implement the elbow method with a few lines of code. It is a wrapper around Scikit-Learn and has some cool machine learning visualizations!

# Elbow Method for K means
# Import the KElbowVisualizer
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
# k is the range of numbers of clusters to try
visualizer = KElbowVisualizer(model, k=(2,30), timings=True)
visualizer.fit(cluster_df)        # Fit data to visualizer
visualizer.show()        # Finalize and render figure
Fig 2: Elbow Method Results (Image by author)

The KElbowVisualizer function fits the KMeans model for a range of cluster values from 2 to 30. As shown in Figure 2, the elbow point is reached at 8 clusters and is highlighted by the function itself. The green line also shows how much time was needed to fit the model for each number of clusters.

Silhouette Coefficient

The Silhouette Coefficient for a point i is defined as follows:

S(i) = (b(i) – a(i)) / max(a(i), b(i))

where b(i) is the smallest average distance of point i to all points in any other cluster and a(i) is the average distance of i from all other points in its own cluster. For example, if we have only 3 clusters A, B and C and i belongs to cluster C, then b(i) is calculated by measuring the average distance of i from every point in cluster A and the average distance of i from every point in cluster B, and taking the smaller of the two values. The Silhouette Coefficient for the dataset is the average of the Silhouette Coefficients of the individual points.

The Silhouette Coefficient tells us if individual points are correctly assigned to their clusters. We can use the following thumb rules while using Silhouette Coefficient:

  1. S(i) close to 0 means that the point is between two clusters
  2. If it is closer to -1, then we would be better off assigning it to the nearest other cluster
  3. If S(i) is close to 1, then the point belongs to the ‘correct’ cluster
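
The per-point scores and the dataset average are also available directly in scikit-learn via silhouette_samples and silhouette_score. Below is a minimal sketch, assuming cluster_df is the prepared data and an arbitrary choice of 4 clusters:

# Silhouette Coefficient with scikit-learn directly
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
labels = KMeans(n_clusters=4, random_state=42).fit_predict(cluster_df)
print(silhouette_score(cluster_df, labels))    # average S(i) over all points
print(silhouette_samples(cluster_df, labels))  # one S(i) per point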

For more details on this method, please refer to this 2018 paper by N. Kaoungku, K. Suksut, R. Chanklan and K. Kerdprasop. We will be using the KElbowVisualizer function to implement the Silhouette Coefficient for the K-means clustering algorithm:

# Silhouette Score for K means
# Import the KElbowVisualizer
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
# k is the range of numbers of clusters to try
visualizer = KElbowVisualizer(model, k=(2,30), metric='silhouette', timings=True)
visualizer.fit(cluster_df)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Fig 3: Silhouette Score Results (Image by author)

The optimal number of clusters based on Silhouette Score is 4.

Calinski-Harabasz Index

The Calinski-Harabasz Index is based on the idea that clusters that are (1) themselves very compact and (2) well-spaced from each other are good clusters. The index is the ratio of the between-cluster dispersion (based on the distances between the cluster centres) to the within-cluster dispersion (based on the distances of individual objects to their cluster centre), each normalized by its degrees of freedom. The higher the Calinski-Harabasz Index value, the better the clustering model. The formula for the Calinski-Harabasz Index is defined as:

CH = [BCSM / (k – 1)] / [WCSM / (n – k)]

where k is the number of clusters, n is the number of records in data, BCSM (between cluster scatter matrix) calculates separation between clusters and WCSM (within cluster scatter matrix) calculates compactness within clusters.
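
If you prefer to compute the metric directly, scikit-learn exposes it as calinski_harabasz_score. Here is a minimal sketch, assuming cluster_df is the prepared data and an arbitrary choice of 4 clusters:

# Calinski-Harabasz Index with scikit-learn directly
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
labels = KMeans(n_clusters=4, random_state=42).fit_predict(cluster_df)
print(calinski_harabasz_score(cluster_df, labels))  # higher is better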

The KElbowVisualizer function is able to calculate the Calinski-Harabasz Index as well:

# Calinski Harabasz Score for K means
# Import the KElbowVisualizer
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
# k is the range of numbers of clusters to try
visualizer = KElbowVisualizer(model, k=(2,30), metric='calinski_harabasz', timings=True)
visualizer.fit(cluster_df)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Fig 4: Calinski Harabasz Index (Image by author)

As seen in Figure 4, the Calinski-Harabasz Index is maximized when the number of clusters is 2 for the K-means clustering algorithm. To dig deeper into this index, refer to this paper by Xu Wang and Yusheng Xu.

Davies-Bouldin Index

The Davies-Bouldin (DB) Index is defined as:

DB = (1/n) Σi max(j ≠ i) [ (σi + σj) / d(ci, cj) ]

where n is the number of clusters, σi is the average distance of all points in cluster i from the cluster centre ci, and d(ci, cj) is the distance between the centres of clusters i and j.

Like the silhouette coefficient and the Calinski-Harabasz index, the DB index captures both the separation and compactness of the clusters. This is because the measure's 'max' statement repeatedly selects the cases where points are, on average, farthest from their own centre and where the centres are closest together. But unlike the silhouette coefficient and the Calinski-Harabasz index, a lower DB index indicates better clustering.

# Davies Bouldin score for K means
from sklearn.metrics import davies_bouldin_score

def get_kmeans_score(data, center):
    '''
    Returns the Davies-Bouldin score for a KMeans model fit to the data.
    INPUT:
        data - the dataset you want to fit KMeans to
        center - the number of centers you want (the k value)
    OUTPUT:
        score - the Davies-Bouldin score for the KMeans model fit to the data
    '''
    # Instantiate KMeans and get the cluster label of each point
    kmeans = KMeans(n_clusters=center)
    labels = kmeans.fit_predict(data)

    # Calculate the Davies-Bouldin score
    score = davies_bouldin_score(data, labels)
    return score
scores = []
centers = list(range(2,30))
for center in centers:
    scores.append(get_kmeans_score(cluster_df, center))

plt.plot(centers, scores, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('Davies Bouldin score');
plt.title('Davies Bouldin score vs. K');
Fig 5: Davies Bouldin score (Image by author)

As seen in Figure 5, the Davies-Bouldin score is minimized at 4 clusters, which can be considered for the K-means algorithm. Further details on the DB score can be found in this paper by Slobodan Petrovic.

Dendrogram

This technique is specific to the agglomerative hierarchical method of clustering, which starts by considering each point as a separate cluster and then joins points into clusters in a hierarchical fashion based on their distances. We will focus on the details of this method in a separate blog. To get the optimal number of clusters for hierarchical clustering, we make use of a dendrogram, which is a tree-like chart that shows the sequences of merges or splits of clusters.

If two clusters are merged, the dendrogram joins them in a graph and the height of the join is the distance between those clusters. We will plot the graph using the dendrogram function from the SciPy library.

# Dendrogram for Hierarchical Clustering
import scipy.cluster.hierarchy as shc
from matplotlib import pyplot
pyplot.figure(figsize=(10, 7))  
pyplot.title("Dendrograms")  
dend = shc.dendrogram(shc.linkage(cluster_df, method='ward'))
Fig 6: Dendrogram (Image by author)

As shown in Figure 6, we can choose the optimal number of clusters based on the hierarchical structure of the dendrogram. As suggested by the other cluster validation metrics, 4 clusters can be considered for agglomerative hierarchical clustering as well.
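
Once we settle on 4 clusters, the corresponding cluster labels can be extracted from the same linkage matrix with SciPy's fcluster function. A minimal sketch, assuming cluster_df is the prepared data:

# Cut the dendrogram at 4 clusters to get cluster labels
import scipy.cluster.hierarchy as shc
Z = shc.linkage(cluster_df, method='ward')
labels = shc.fcluster(Z, t=4, criterion='maxclust')   # label (1..4) for each record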

Bayesian information criterion

The Bayesian information criterion (BIC) score is a method for scoring a model that is estimated using the maximum likelihood framework. The BIC statistic is calculated as follows:

BIC = (k*ln(n)) – (2ln(L))

where L is the maximized value of the likelihood function of the model, k is the number of parameters and n is the number of records.

The lower the BIC score, the better the model. We can use the BIC score for the Gaussian Mixture Model (GMM) approach to clustering. We will discuss the details of this model in a separate blog, but the key thing to note here is that the model requires us to select both the number of clusters and the type of covariance. We try out various combinations of these parameters and select the model with the lowest BIC score.

# BIC for GMM
from sklearn.mixture import GaussianMixture
n_components = range(1, 30)
covariance_type = ['spherical', 'tied', 'diag', 'full']
score = []
for cov in covariance_type:
    for n_comp in n_components:
        gmm = GaussianMixture(n_components=n_comp, covariance_type=cov)
        gmm.fit(cluster_df)
        score.append((cov, n_comp, gmm.bic(cluster_df)))
score
BIC score (Image by author)
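
Since each entry of score is a (covariance_type, n_components, BIC) tuple, the combination with the lowest BIC can be read off directly. A minimal sketch:

# Pick the (covariance type, number of components) pair with the lowest BIC
best_cov, best_n, best_bic = min(score, key=lambda s: s[2])
print(best_cov, best_n, best_bic)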

Conclusion

In this post, we learnt about 7 different methods to select the optimal number of clusters for various clustering algorithms.

Specifically, we learned:

  • How various metrics for selecting the optimal number of clusters are calculated
  • The thumb rules for selecting the optimal number of clusters using these metrics
  • How to implement the cluster validation methods in Python
  • How to interpret results of these methods

Finally, given the multiple metrics we have for selecting the optimal number of clusters, we can take the average/median/mode of their suggestions as the optimal number of clusters.
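
For example, collecting the cluster counts suggested by the methods above and taking their median or mode is one quick way to settle on a single value. A sketch using the numbers reported in this post (your data will give different suggestions; the BIC result is omitted here):

# Combining the suggestions from the different methods
from statistics import median, mode
suggestions = [29, 8, 4, 2, 4, 4]   # gap statistic, elbow, silhouette, Calinski-Harabasz, Davies-Bouldin, dendrogram
print(median(suggestions), mode(suggestions))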

Do you have any questions or suggestions about this blog? Please feel free to drop in a note.

Thank you for reading!

If you, like me, are passionate about AI, Data Science, or Economics, please feel free to add/follow me on LinkedIn, Github and Medium.

References

  1. Tibshirani R, Walther G and Hastie T, Estimating the number of clusters in a data set via the gap statistic, J. R. Statist. Soc. B (2001) 63, Part 2, pp. 411–423
  2. Yuan C and Yang H, Research on K-Value Selection Method of K-Means Clustering Algorithm, MDPI, 18 June 2019
  3. Kaoungku, N. & Suksut, K. & Chanklan, R. & Kerdprasop, K. & Kerdprasop, Nittaya. (2018). The silhouette width criterion for clustering and association mining to select image features. International Journal of Machine Learning and Computing. 8. 69–73. 10.18178/ijmlc.2018.8.1.665.
  4. Wang X & Xu Y, An improved index for clustering validation based on Silhouette index and Calinski-Harabasz index, 2019 IOP Conf. Ser.: Mater. Sci. Eng. 569 052024
  5. Petrovic S, A Comparison Between the Silhouette Index and the Davies-Bouldin Index in Labelling IDS Clusters
