How Many Clusters?
Methods for choosing the right number of clusters
Introduction
Clustering is an unsupervised machine learning method that can identify groups of similar data points, known as clusters, from the data itself. For some clustering algorithms, such as K-means, one needs to know how many clusters there are beforehand. If the number of clusters is incorrectly specified, the results are not very informative (see Figure 1).
Unfortunately, in many instances we do not know how many clusters there are in our data. Indeed, figuring out how many clusters there are may be the very reason we want to perform clustering in the first place. Domain knowledge of the data set may sometimes suggest the number of clusters, but this effectively assumes that you already know the target classes (or at least how many there are), and this is not true in unsupervised learning. We need a method that informs us about the number of clusters without relying on a target variable.
One possible solution for determining the correct number of clusters is a brute-force approach: we apply a clustering algorithm with different numbers of clusters and find the magic number that optimizes the quality of the clustering results. In this article, we first introduce two popular metrics for assessing cluster quality. We then cover three approaches to finding the optimal number of clusters:
- The elbow method
- The optimization of the silhouette coefficient
- The gap statistic
Quality of Clustering Outcome
Before getting into the different methods for determining the optimal number of clusters, let us see how we can quantitatively assess the quality of a clustering outcome. Imagine the following scenario: the same data set is clustered into three clusters in two different ways (see Figure 2). As you can see, the clusters are well defined on the left, whereas they are poorly identified on the right.
Why is that? Remember that the goal of clustering is to group data points into clusters so that (1) points within a cluster are as similar as possible, and (2) points belonging to different clusters are as distinct as possible. This means that, in ideal clustering, the within-cluster variation is small whereas the between-cluster variation is large. Consequently, a good clustering quality metric should be able to summarize (1) and/or (2) quantitatively.
One such quality metric is inertia. This is calculated as the sum of squared distances between data points and the centers of the clusters they belong to. Inertia quantifies the within-cluster variation.
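As a quick illustration of this definition, here is a minimal sketch in Python with NumPy; the toy data points and cluster assignments are hypothetical, chosen only to make the calculation easy to follow:

```python
import numpy as np

# Hypothetical toy data: two obvious groups in 2D
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])  # cluster assignment of each point
centers = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])

# Inertia: sum of squared distances from each point to its own cluster center
inertia = ((points - centers[labels]) ** 2).sum()
print(inertia)
```

Each point contributes the squared distance to the center of its own cluster, so tighter clusters yield a smaller inertia.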
Another popular metric is the silhouette coefficient, which attempts to summarize both within-cluster and between-cluster variation. For each data point, we calculate the distance to the center of the cluster the point belongs to (referred to as a), as well as the distance to the second-best cluster center (referred to as b). Here, the second-best cluster is the closest cluster that is not the data point's own cluster. Then, based on these two distances a and b, the silhouette s of that data point is calculated as s=(b-a)/max(a,b). (Strictly speaking, the standard silhouette definition uses average point-to-point distances rather than distances to cluster centers; the centroid-based version described here is a common, cheaper approximation.)
Under ideal clustering, the distance a is very small compared to the distance b, resulting in s being close to 1 (see Figure 3, left). If clustering is somewhat suboptimal, then the distances a and b may not differ dramatically (see Figure 3, center). In that case s is close to 0. If clustering is even worse, then the distance a may actually be larger than the distance b (see Figure 3, right). In such a scenario, s becomes negative, close to -1.
Once s has been calculated for all data points, averaging them yields the silhouette coefficient. The silhouette coefficient can be calculated for each cluster separately, or over all data points. A silhouette coefficient close to 1 indicates that the clustering algorithm was able to partition the data into well-separated clusters.
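The centroid-based silhouette described above can be sketched in a few lines of Python with NumPy; the toy data here are hypothetical, with two well-separated groups so the coefficient comes out close to 1:

```python
import numpy as np

def silhouette_centroid(points, labels, centers):
    """Mean silhouette using the centroid-based simplification described above:
    a = distance to the own cluster center, b = distance to the second-best
    (nearest other) cluster center, s = (b - a) / max(a, b)."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    a = d[np.arange(len(points)), labels]             # distance to own center
    d_other = d.copy()
    d_other[np.arange(len(points)), labels] = np.inf  # mask out the own cluster
    b = d_other.min(axis=1)                           # distance to second-best center
    s = (b - a) / np.maximum(a, b)
    return s.mean()

# Two well-separated toy clusters -> coefficient close to 1
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])
print(silhouette_centroid(points, labels, centers))
```

Moving the two groups closer together shrinks b relative to a and drives the coefficient toward 0.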
Elbow Method
The inertia is a decreasing function of the number of clusters k. However, its rate of decrease differs below and above the optimal number of clusters K: for k&lt;K, the inertia decreases rapidly, whereas for k&gt;K it decreases only slowly. Thus, by plotting the inertia over a range of k, one can determine where the curve bends, or "elbows", at K. Figure 4 shows the inertia plot for our example from Figure 1. We can clearly see a bend, or elbow, at k=6.
This method, however, is somewhat subjective, as different people may identify the elbow at different locations. In our example in Figure 4, some may argue that k=4 is the elbow. Moreover, the elbow may not be always apparent, as we shall see later.
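To make the idea concrete, here is a minimal, illustrative sketch in Python with NumPy: a bare-bones Lloyd's K-Means (not the implementation used in the article) is run over a range of k on hypothetical three-blob data, and the inertia is printed for each k:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Bare-bones Lloyd's K-Means, for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k randomly chosen data points
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute the centers
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return ((points - centers[labels]) ** 2).sum()  # inertia

# Hypothetical data: three well-separated Gaussian blobs
rng = np.random.default_rng(1)
blobs = np.concatenate([rng.normal(loc, 0.3, size=(50, 2))
                        for loc in ([0, 0], [5, 5], [10, 0])])

inertias = {k: kmeans(blobs, k) for k in range(1, 7)}
for k, w in inertias.items():
    print(k, round(w, 1))
```

On data like this, the printed inertia values should drop steeply up to k=3 and only marginally beyond it, producing the elbow.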
A use case of the elbow method is determining the optimal number of topics in a natural language processing problem on social network data with KNIME Analytics Platform (see the blog post Topic Extraction: Optimizing the Number of Topics with the Elbow Method). Since there is no KNIME node that calculates the inertia, a Java Snippet node was used for this purpose in that example. Here is the code snippet used to calculate the inertia.
// Initializing the sum of squares
out_sum_squares = 0.0;

/*
 * The first half of the columns belongs to the features of the origin.
 * The second half of the columns belongs to the features of the terminus.
 * The two groups of columns have to be in the same order.
 */
int col_count = getColumnCount();
int no_dimensions = col_count / 2;

// Loop over the feature columns
for (int i = 0; i < no_dimensions; i++) {
    /*
     * Check that feature i of the origin and feature i of the terminus
     * (i.e., column i + no_dimensions) are not missing, are of type
     * double, and have matching column names.
     */
    if (!isMissing(i) && isType(i, tDouble)
            && !isMissing(i + no_dimensions)
            && isType(i + no_dimensions, tDouble)
            && getColumnName(i + no_dimensions).contains(getColumnName(i))) {
        // Calculating the squared distance and adding it to the sum
        out_sum_squares += Math.pow(getCell(i, tDouble)
                - getCell(i + no_dimensions, tDouble), 2);
    }
}
Silhouette Method
The silhouette coefficient may provide a more objective means of determining the optimal number of clusters: simply calculate the silhouette coefficient over a range of k and identify the peak as the optimal K. The KNIME component Optimized K-Means (Silhouette Coefficient) does exactly that. It performs K-Means clustering over a range of k, finds the optimal K that produces the largest silhouette coefficient, and assigns the data points to clusters based on this optimized K. Figure 5 shows a silhouette coefficient plot for our example data from Figure 1. As can be seen, the silhouette coefficient peaks at k=6, which is therefore chosen as the optimal K.
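The same scan can be sketched outside KNIME as well. The following illustrative Python snippet (assuming scikit-learn is available) clusters synthetic four-blob data for each k in a range and picks the k with the highest silhouette score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated blobs
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [5, 5], [10, 0], [0, 10]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette is undefined for k=1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Note that the scan has to start at k=2, since the silhouette requires at least one "other" cluster to compute the distance b.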
Gap Statistic
To introduce the gap statistic, let's first consider clustering a random data set with no cluster organization whatsoever. Say this random data set is clustered into k clusters, and the inertia is calculated on the resulting clustering (see Figure 6). Despite the lack of any underlying cluster organization, the clustered random data produces a steadily decreasing inertia as k increases. This is because the more cluster centers there are, the smaller the distances between the data points and their cluster centers become. In contrast, as we have already seen in Figure 4, the rate of decrease in inertia changes depending on whether k is below or above the optimal number of clusters K in a data set with cluster organization. When the inertia curves for the observed and random data are plotted together, the difference becomes apparent (see Figure 7). The gap statistic is calculated by comparing the inertia from a (hopefully) clustered data set with that from a corresponding random data set covering the same range of the data space (Tibshirani et al., 2001).
In the actual calculation of the gap statistic, B random reference samples are generated and clustered over a range of k, and the resulting inertia values are recorded. The original data set is also clustered over the same range of k, yielding its own series of inertia values. The gap statistic at k clusters is calculated as

Gap(k) = (1/B) Σ_i log Wk(i) − log Wk,

where Wk(i) is the inertia from the i-th random sample (i=1,2,…,B) with k clusters, and Wk is the inertia from the original data with k clusters. We also calculate its standard deviation,

s(k) = sd(k) · sqrt(1 + 1/B),  where  sd(k)² = (1/B) Σ_i (log Wk(i) − w̄)²  and  w̄ = (1/B) Σ_i log Wk(i).

Then we find the optimal K as the smallest k that satisfies the condition

Gap(k) ≥ Gap(k+1) − s(k+1).
The calculation of the gap statistic involves a simulation. We call functions in R to calculate the gap statistic via R scripting within a KNIME workflow. In particular, the clusGap() function is called to calculate the gap statistic at different values of k, and the maxSE() function returns the optimal K satisfying the condition described above. Figure 8 shows the gap statistic plot for our example data set from Figure 1, based on B=100 random samples at each k. The red line marks the optimal K that satisfies the condition.
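The article uses R's clusGap() and maxSE() for this step. Purely as an illustrative sketch of the same recipe (not the workflow's implementation), the gap statistic can also be written in Python, assuming scikit-learn is available; the three-blob data and the uniform reference samples over the bounding box are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def log_inertia(X, k, seed=0):
    """Log of the K-Means inertia of X clustered into k clusters."""
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)

def gap_statistic(X, k_max=6, B=10, seed=0):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1), per Tibshirani et al. (2001)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the data
    gap, s = {}, {}
    for k in range(1, k_max + 1):
        # Log-inertia of B uniform reference samples over the bounding box
        ref = np.array([log_inertia(rng.uniform(lo, hi, size=X.shape), k)
                        for _ in range(B)])
        gap[k] = ref.mean() - log_inertia(X, k)
        s[k] = ref.std() * np.sqrt(1 + 1 / B)
    for k in range(1, k_max):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k
    return k_max

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=0)
print(gap_statistic(X))
```

Because the reference samples are random, repeated runs with different seeds can return different answers, which is exactly the inconsistency discussed next.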
It should be noted that the optimal K determined by the gap statistic method may not be consistent. For example, when the gap statistic method is applied to our toy data many times, the resulting optimal K may be different (see Figure 9).
Example: MNIST Handwritten Digits Data
Now let us examine the three methods described above on a real data set with cluster organization. The MNIST data set consists of gray-scale images of handwritten digits from 0 to 9. In this example, we use n=1797 images of 8x8 pixels. Figure 10 shows some examples from the data set. The three methods described above are used to determine the optimal number of clusters. Since there are 10 different digits in this data set, it is reasonable to assume that there are 10 clusters, each corresponding to one digit. However, people may write some of the digits in multiple ways, so in reality the number of clusters is not necessarily 10. A 2D scatter plot of the data (projected into a 2D space by t-SNE, see Figure 11) shows that some clusters are well separated from the others, while others are touching or overlapping. The workflow for this example can be found on the KNIME Hub at https://kni.me/w/ACjm92qCUGbXsCX6.
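For readers who want to reproduce this setup outside KNIME: the dimensions above (n=1797 images of 8x8 pixels) match the handwritten-digits data set bundled with scikit-learn, so the data can be loaded as follows (the exact source used by the workflow is an assumption; only the shapes are taken from the article):

```python
from sklearn.datasets import load_digits

digits = load_digits()  # 1797 images of 8x8 = 64 gray-scale pixels
X, y = digits.data, digits.target
print(X.shape, y.shape)
```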
The result from the elbow method is inconclusive, as there is no apparent elbow in the plot (Figure 12, left). There are some subtle bends in the plot (e.g., at k=9, 12, 20, and 24, to name a few), and any of these could be chosen as the number of clusters.
The silhouette coefficient has a peak at k=12 (Figure 12, right). According to the gap statistic method, k=12 is also determined as the optimal number of clusters (Figure 13). We can visually compare K-Means clusters with k=9 (optimal according to the elbow method) and k=12 (optimal according to the silhouette and gap statistic methods) (see Figure 14).
Conclusion
We showcased three different methods for selecting the optimal number of clusters, namely the elbow method, the silhouette coefficient, and the gap statistic. While the interpretation of an elbow plot is rather subjective, both the silhouette coefficient and the gap statistic method determine the number of clusters objectively. However, the gap statistic involves a simulation, and it may not always produce the same solution.
Like many machine learning methods, the methods described here do not work well in all scenarios. Since they quantify distances between cluster centers and data points, they are suited to finding convex clusters, such as the clusters found by K-Means.
Reference
- Robert Tibshirani, Guenther Walther, Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63: 411–423 (2001).