How Many Clusters?

Methods for choosing the right number of clusters

Satoru Hayasaka
Towards Data Science


Introduction

Clustering is an unsupervised machine learning method that can identify groups of similar data points, known as clusters, from the data itself. For some clustering algorithms, such as K-means, one needs to know how many clusters there are beforehand. If the number of clusters is incorrectly specified, the results are not very informative (see Figure 1).

Figure 1: Clustering with different numbers of clusters, k=4, 6, and 8, on simulated data with 6 clusters. Image by author.
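
The figures in this article were generated with KNIME. Purely as an illustration of the problem, the following sketch (using scikit-learn, which is not part of the original workflow) simulates data with 6 clusters and runs K-means with k=4, 6, and 8, as in Figure 1:

# Minimal sketch (not the original KNIME workflow): simulate 6 clusters
# and run K-means with a too-small, correct, and too-large k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.0, random_state=0)

for k in (4, 6, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: cluster sizes = {sorted((labels == c).sum() for c in range(k))}")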

Unfortunately, in many instances we do not know how many clusters there are in our data. Indeed, figuring out how many clusters there are may be the reason why we want to perform clustering in the first place. Certainly, domain knowledge of the data set may help determine the number of clusters. However, that assumes you know the target classes (or at least how many classes there are), which is not the case in unsupervised learning. We need a method that informs us about the number of clusters without relying on a target variable.

One possible solution for determining the correct number of clusters is a brute-force approach: we apply a clustering algorithm with different numbers of clusters and find the magic number that optimizes the quality of the clustering results. In this article, we first introduce two popular metrics for assessing cluster quality. We then cover three approaches to finding the optimal number of clusters:

  • The elbow method
  • The optimization of the silhouette coefficient
  • The gap statistic

Quality of Clustering Outcome

Before getting into the different methods for determining the optimal number of clusters, let us see how we can quantitatively assess the quality of a clustering outcome. Imagine the following scenario: the same data set is clustered into three clusters, with two different outcomes (see Figure 2). As you can see, the clusters are well defined on the left, whereas the clusters are poorly identified on the right.

Figure 2: Examples of well-defined clusters (left) and poorly-defined clusters (right) based on the same data set. The arrows indicate the distance between the data points and their cluster centers. Image by author.

Why is that? Remember that the goal of clustering is to group data points into clusters so that (1) points within a cluster are as similar as possible, and (2) points belonging to different clusters are as distinct as possible. This means that, in ideal clustering, the within-cluster variation is small whereas the between-cluster variation is large. Consequently, a good clustering quality metric should be able to summarize (1) and/or (2) quantitatively.

One such quality metric is inertia. This is calculated as the sum of squared distances between data points and the centers of the clusters they belong to. Inertia quantifies the within-cluster variation.
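
As a concrete illustration (using scikit-learn, which is not part of the article's KNIME workflow), the inertia can be computed directly from its definition and compared with the value reported by K-means:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, random_state=0)
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)

# Inertia by hand: sum of squared distances from each point to its own cluster center
centers = km.cluster_centers_[km.labels_]   # the center assigned to each point
inertia_manual = np.sum((X - centers) ** 2)

print(inertia_manual, km.inertia_)          # the two values agree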

Another popular metric is the silhouette coefficient, which attempts to summarize both the within-cluster and the between-cluster variation. At each data point, we calculate the distance to the center of the cluster to which the data point belongs (referred to as a), as well as the distance to the second-best cluster center (referred to as b). Here, the second-best cluster is the closest cluster that is not the data point's own cluster. Then, based on these two distances a and b, the silhouette s of that data point is calculated as s = (b − a) / max(a, b).

Under ideal clustering, the distance a is very small compared to the distance b, resulting in s being close to 1 (see Figure 3, left). If clustering is somewhat suboptimal, then the distances a and b may not differ dramatically (see Figure 3, center). In that case s is close to 0. If clustering is even worse, then the distance a may actually be larger than the distance b (see Figure 3, right). In such a scenario, s becomes negative, close to -1.

Figure 3: Scenarios where clustering is optimal (left), suboptimal (center), and even worse (right). The stars indicate cluster centers. Image by author.

Once s has been calculated at all data points, their average determines the silhouette coefficient. A silhouette coefficient can be calculated for each cluster separately, or over all data points. A silhouette coefficient close to 1 indicates that a clustering algorithm is able to partition the data into well-separated clusters.
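
For illustration (again with scikit-learn rather than the article's KNIME workflow), the per-point silhouette values and their average can be computed as below. Note that scikit-learn follows the common variant in which a and b are mean distances to the points of a cluster rather than distances to cluster centers, so the values may differ slightly from the center-based description above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=600, centers=6, random_state=0)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)                  # one silhouette value per data point
print(silhouette_score(X, labels))                 # overall coefficient = mean of s
print([s[labels == c].mean() for c in range(6)])   # per-cluster averages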

Elbow Method

The inertia is a decreasing function of the number of clusters k. However, its rate of decrease differs depending on whether k is below or above the optimal number of clusters K. For k<K, the inertia decreases rapidly, whereas for k>K the decrease is slow. Thus, by plotting the inertia over a range of k, one can determine where the curve bends, or elbows, at K. Figure 4 shows the inertia plot for our example from Figure 1. We can clearly see a bend, or elbow, at k=6.

This method, however, is somewhat subjective, as different people may identify the elbow at different locations. In our example in Figure 4, some may argue that k=4 is the elbow. Moreover, the elbow may not always be apparent, as we shall see later.

Figure 4: The plot of the inertia for different k, for the data set presented in Figure 1. Image by author.
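
As a minimal sketch of the elbow method outside KNIME (again using scikit-learn, purely for illustration), one can record the inertia over a range of k and plot it:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, random_state=0)

ks = range(2, 16)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()   # look for the bend ("elbow") in the curve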

A use case of the elbow method can be seen in a natural language processing problem: determining the optimal number of topics in a social network using KNIME Analytics Platform (see the blog post Topic Extraction: Optimizing the Number of Topics with the Elbow Method). Since there is no KNIME node that calculates the inertia, a Java Snippet node was used for that purpose. Here is the code used to calculate the inertia.

// Initializing the sum of squares
out_sum_squares = 0.0;

/*
 * The first half of the columns belongs to the features of the origin.
 * The second half of the columns belongs to the features of the terminus.
 * The two groups of columns have to be in the same order.
 */
int col_count = getColumnCount();
int no_dimensions = col_count / 2;

// Loop over the feature columns
for (int i = 0; i < no_dimensions; i++) {
    /*
     * Check that feature i of the origin and feature i of the terminus
     * (i.e., column i + no_dimensions) are not missing, are numeric,
     * and have matching column names
     */
    if (!isMissing(i) && isType(i, tDouble)
            && !isMissing(i + no_dimensions)
            && isType(i + no_dimensions, tDouble)
            && getColumnName(i + no_dimensions).contains(getColumnName(i))) {
        // Calculating the squared difference and adding it to the sum
        out_sum_squares += Math.pow(getCell(i, tDouble)
                - getCell(i + no_dimensions, tDouble), 2);
    }
}
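
Note that the snippet above is executed once per row, so out_sum_squares holds the squared distance between a single data point (the origin) and its cluster center (the terminus). The inertia is then obtained by summing this value over all rows downstream, for example with a GroupBy node (assuming that is how the rest of the workflow aggregates it).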

Silhouette Method

The silhouette coefficient may provide a more objective means to determine the optimal number of clusters. This is done by simply calculating the silhouette coefficient over a range of k and identifying the peak as the optimal K. The KNIME component Optimized K-Means (Silhouette Coefficient) does exactly that: it performs K-Means clustering over a range of k, finds the K that produces the largest silhouette coefficient, and assigns data points to clusters based on that optimal K. Figure 5 shows the silhouette coefficient plot for our example data from Figure 1. As can be seen, the silhouette coefficient peaks at k=6, which is therefore determined as the optimal K.

Figure 5: The plot of the silhouette coefficient for different k, for the data set presented in Figure 1. Image by author.
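
The KNIME component encapsulates this logic; as a rough equivalent outside KNIME, a minimal scikit-learn sketch (an illustration, not the component itself) could look like this:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=6, random_state=0)

scores = {}
for k in range(2, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # k with the largest silhouette coefficient
print(best_k, scores[best_k])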

Gap Statistic

To introduce the gap statistic, let's consider clustering a random data set with no cluster organization whatsoever. Say a random data set is clustered into k clusters, and the inertia is calculated from the resulting clustering (see Figure 6). Despite the lack of underlying cluster organization, the clustered random data produces steadily decreasing inertiae (plural of inertia) as k increases. This is because the more cluster centers there are, the smaller the distances between data points and their cluster centers become, producing decaying inertiae. In contrast, as we have already seen in Figure 4, the rate of decrease in inertia differs depending on whether k is below or above the optimal number of clusters K in a data set with cluster organization. When the inertiae for the observed and random data are plotted together, the difference becomes apparent (see Figure 7). The gap statistic is calculated by comparing the inertiae from a (hopefully) clustered data set and a corresponding random data set covering the same ranges in the data space (Tibshirani et al., 2001).

Figure 6: Uniformly distributed random data clustered into k=4 (left), 6 (center), and 15 (right) clusters. Image by author.
Figure 7: How the inertia decreases for the original data (from Figure 1) vs. the random data over a range of k. Image by author.

In the actual calculation of the gap statistic, a number of random samples are generated, clustered over a range of k, and the resulting inertiae are recorded. This yields a collection of inertiae for the random case. The original data set is also clustered over the same range of k, resulting in another series of inertiae. The gap statistic at k clusters is calculated as

Gap(k) = (1/B) Σᵢ log Wk(i) − log Wk

where Wk(i) is the inertia from the i-th random sample (i=1,2,…,B) with k clusters, and Wk is the inertia from the original data with k clusters. We also calculate its standard deviation as

s(k) = sd(k) · √(1 + 1/B)

where sd(k) is the standard deviation of log Wk(i) across the B random samples. Then we find the optimal K as the smallest k that satisfies the condition

Gap(k) ≥ Gap(k+1) − s(k+1)
The calculation of the gap statistic involves a simulation. We call functions in R, through R scripting within a KNIME workflow, to calculate the gap statistic. In particular, the clusGap() function is called to calculate the gap statistic at different k, and maxSE() returns the optimal K satisfying the condition described above. Figure 8 shows the gap statistic plot for our example data set from Figure 1, based on B=100 iterations at each k. The red line indicates the optimal K that satisfies the condition above.

Figure 8: A plot of gap statistics as well as their standard deviations, based on B=100 iterations. The optimal k=6, satisfying the condition, is indicated by the red line. Image by author.

It should be noted that the optimal K determined by the gap statistic method may not be consistent. For example, when the gap statistic method is applied to our toy data many times, the resulting optimal K may be different (see Figure 9).

Figure 9: Examples of gap statistics plots. The optimum k may not be consistent, depending on the simulation outcome. Image by author.

Example: MNIST Handwritten Digits Data

Now let us examine the three methods described above on a real data set with cluster organization. The MNIST data set consists of gray-scale images of handwritten digits from 0 to 9. In this example, we use n=1797 images with 8x8 pixels. Figure 10 shows some examples from the data set. The three methods described above are used to determine the optimal number of clusters. Since there are 10 different digits in this data set, it is reasonable to assume that there are 10 clusters, each corresponding to one of the digits. However, there may be multiple ways people write some of the digits, so in reality the number of clusters isn't necessarily 10. A 2D scatter plot of the data (projected to a 2D space by t-SNE, see Figure 11) shows that some clusters may be well separated from the others, while other clusters may be touching or overlapping. The workflow for this example can be found on the KNIME Hub at https://kni.me/w/ACjm92qCUGbXsCX6.
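
Incidentally, the data described here matches the small digits data set bundled with scikit-learn, so a comparable setup (not the article's KNIME workflow, which is linked above) can be sketched as follows:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                 # 1797 images, 8x8 pixels each
X, y = digits.data, digits.target      # X has shape (1797, 64)

X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)  # 2D projection for plotting
print(X_2d.shape)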

Figure 10: Examples of handwritten digits data, down-sampled at 8x8 pixels. Image by author.
Figure 11: A scatter plot of the digits data projected to a 2D space with t-SNE (t-distributed stochastic neighbor embedding). Image by author.

The result from the elbow method is inconclusive, as there is no apparent elbow in the plot (Figure 12, left). There are some subtle bends (e.g., at k=9, 12, 20, and 24, to name a few), and any of these may be chosen as the number of clusters.

Figure 12: The elbow plot (left) and the silhouette coefficient plot (right) generated from the digits data. Image by author.
Figure 13: The gap statistics plot generated from the digits data, based on B=100 iterations. The optimal k=12 is indicated by the red line. Image by author.

The silhouette coefficient has a peak at k=12 (Figure 12, right). According to the gap statistic method, k=12 is also determined as the optimal number of clusters (Figure 13). We can visually compare K-Means clusters with k=9 (a candidate elbow from the elbow plot) and k=12 (optimal according to the silhouette and gap statistic methods) (see Figure 14).

Figure 14: K-Means clusters found in the digits data with k=9 and k=12, projected to a 2D space with t-SNE. Image by author.

Conclusion

We showcased three different methods for selecting the optimal number of clusters, namely the elbow method, the silhouette coefficient, and the gap statistic. While the interpretation of an elbow plot is rather subjective, both the silhouette coefficient and the gap statistic determine the number of clusters precisely. However, the gap statistic involves a simulation, and it may not always produce the same solution.

Like many machine learning methods, the methods described here do not work well in all scenarios. Since these methods quantify the distance between cluster centers and data points, they are suitable for finding convex clusters, such as clusters found in K-Means clustering.

Reference

  • Robert Tibshirani, Guenther Walther, Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63: 411–423 (2001).
