K-Means Clustering and the Gap-Statistics

Closing Knowledge Gaps with Gap-Statistics

Tim Loehr
Towards Data Science

--

Photo by Suserl just me on Freeimages and edited by me

There is a lot of code going on under the hood. That's why I link my GitHub repository at the end of this post and show only a small part of the K-Means code here.

Introduction

Clustering is an important technique in pattern analysis for identifying distinct groups in data. Since data usually has more than three dimensions, we often apply dimensionality reduction methods such as PCA or Laplacian Eigenmaps before clustering. The data is then available in 2D or 3D, which lets us visualize the found clusters nicely.
Even though this is a common workflow, it is not always followed.

Data is also often unlabeled, meaning there is no clear definition of what you want to find in it. That is why clustering is also a good data exploration technique, even without dimensionality reduction beforehand.

Common clustering algorithms are K-Means and the Mean Shift algorithm. In this post, I will focus on the K-Means algorithm, because it is the easiest and most straightforward clustering technique. We will furthermore assume that the data either comes directly with two features (so 2D) or has already been reduced to 2D by someone else and handed to us. Therefore, we can dive straight into applying K-Means.

The K-Means algorithm requires one hyperparameter, namely the number of clusters K you want to find. But if we want to find clusters, how can we know how many clusters we are going to need?

Example: If we want to find personas for our marketing team in the data, we could assume that we want to find three or four clusters of people.

In this case, the number of K would be fixed. But what if not?

The choice of hyperparameters is called Model Selection. In the case of K-Means, this is only the number of clusters K, as I already said. In this blog post, I will focus on how we can statistically find the optimal value for K by applying Tibshirani's Gap-Statistics to K-Means.

Gap-Statistics was introduced by Robert Tibshirani in the year 2000 at Stanford.

Goals

I want to answer these three questions with this post:

  • 1) How can we find the optimum value for K in K-Means?
  • 2) How can we measure the performance of the Gap-Statistics?
  • 3) Under which conditions are the Gap-Statistics likely to fail?

K-Means — A very short introduction

K-Means iterates over three steps. But first, you need to pre-define the number of clusters K. The cluster centers are often called centroids.

  • 1) (Re-)assign each data point to its nearest centroid by calculating the Euclidean distance from every point to every centroid.
  • 2) Calculate the mean of the data points assigned to each centroid and move the centroid to the center of its assigned points.
  • 3) Go back to 1) until the convergence criterion is fulfilled. In my case, I calculate the within-cluster distance from all points to their re-assigned centroid mean. If, after an iteration, all centroids together moved less than 0.01, so that basically nothing changes anymore, the convergence criterion is fulfilled.
Code 1) K-Means from scratch written by me

In Code 1 above, line 9 computes, with a list comprehension, the Euclidean distance from each data point to each of the cluster centers (K = 3, as shown in line 3). Numpy.argmin then picks the closest centroid for each point and assigns the point to it. Line 13 implements the third step and sums up the total within-cluster distance. If the total change is less than 0.01, the while loop breaks.
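For readers who want to follow along without opening the repository, here is a minimal from-scratch sketch of these three steps. The make_blobs parameters, variable names, random seed, and line positions are illustrative assumptions and will not exactly match Code 1.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Toy data similar to Figure 1: 300 two-dimensional points in 3 blobs
# (the exact make_blobs parameters are assumptions)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
K = 3

# Initialize each centroid at the location of a randomly chosen sample point,
# so every cluster starts non-empty
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=K, replace=False)]

while True:
    # Step 1: assign every point to its nearest centroid (Euclidean distance)
    dists = np.array([np.linalg.norm(X - c, axis=1) for c in centroids])
    labels = np.argmin(dists, axis=0)

    # Step 2: move every centroid to the mean of its assigned points
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])

    # Step 3: stop once all centroids together moved less than 0.01
    shift = np.linalg.norm(new_centroids - centroids, axis=1).sum()
    centroids = new_centroids
    if shift < 0.01:
        break
```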

Figure 1) K=3 gif | made by me

We can see how the assignment of the points changes for each while loop. It takes five loops in total until convergence.

Notice: The initialization of the centroids plays a HUGE role in how K-Means will perform. Imagine two of the centroids start in the far bottom-left position: every data point would be assigned to the remaining centroid and the clustering would be useless.

Notice: K-Means is only locally optimal and there is no guarantee of reaching the global minimum.

Further investigation

Figure 1 shows 3 centroids for 300 data points. I have not shown the code for the data yet; it was simply generated with sklearn's make_blobs.

Of course, the within-cluster distance cannot decrease any further with three globally optimal centroids. But what happens when we add more and more values for K? The within-cluster distance shrinks almost monotonically. Imagine splitting one of the three clusters of Figure 1 into two with an additional centroid: the within-cluster distance would shrink, because all points are now closer to some centroid, even though this would clearly be a wrong cluster split.
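To see this effect directly, here is a small sketch using sklearn's KMeans. Note that KMeans.inertia_ is the within-cluster sum of squared distances rather than the plain distance used in this post, and the make_blobs parameters are assumptions, but the monotone shrinking behaviour is the same.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(1, 12):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # the within-cluster dispersion keeps shrinking as K grows
    print(k, round(km.inertia_, 2))
```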

Figure 2) K=300 | made by me

When we choose K equal to the number of data points, the total within-cluster distance becomes 0. From a minimization problem's point of view this is ideal, but not for us.

Figure 3) K=11 | made by me

When we choose K equal to eleven, we can see that we will never reach a total within-cluster distance of less than 3.

Because the plots in Figures 2 and 3 start at a value of 1, the actual within-cluster distance in Figure 2 is zero and in Figure 3 it is two.

So how can we prevent K from being chosen equal to the number of data points?

How can we enforce via the initialization that our clusters will have K non-empty components?

  • 1) Initialize each cluster center at the location of a sample point, so that every centroid owns at least one point from the start
  • 2) Use a countermeasure against local minima

1) How can we find the optimum value for K in K-Means?

The answer is of course the Gap-Statistics, but what is it?

It was invented by Robert Tibshirani 20 years ago. The basic minimization problem looks like the following:

Formula 1) The formal rule for estimating K* by HTF The Elements of Statistical Learning on page 519 | 14.3.11 Practical Issues
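Spelled out in the notation used in the rest of this post, the rule selects the smallest K whose gap value is not beaten by the next one, following [1] and [2]:

```latex
K^{*} = \min\left\{ K \;\middle|\; G(K) \ge G(K+1) - S'(K+1) \right\}
```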

Looks weird? No problem! K* is the optimal value of K that we want to find. We need to define two quantities, W_data and W_uniform. W_data is the total within-cluster distance of the 300 data points from Figure 1 above, and W_uniform is basically a simulated and averaged within-cluster distance of uniformly drawn reference data, as explained in the Further investigation part. W_uniform basically looks like Figure 3.

Formula 2) Cutting down the Gap Statistics Formula
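In this post's notation, with B = 20 uniform reference simulations, the gap value and the quantities in its "where" clause can be written as follows (this follows the definitions in [2]):

```latex
\begin{aligned}
G(K) &= \frac{1}{B}\sum_{b=1}^{B}\log W_{\text{uniform}}^{(b)}(K) \;-\; \log W_{\text{data}}(K),\\[4pt]
S'(K) &= S(K)\,\sqrt{1 + \tfrac{1}{B}}, \qquad
S(K) = \operatorname{sd}_{b}\!\left[\log W_{\text{uniform}}^{(b)}(K)\right].
\end{aligned}
```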

We can see in the "where" clause above that S'(K) equals S(K) multiplied by a small constant factor: with B = 20 reference simulations, the factor is sqrt(1 + 1/B) ≈ 1.024695, so S' is always slightly larger than S.

S(K) itself is the standard deviation of the log within-cluster distances of the uniformly drawn reference datasets at each K, taken over, for example, 20 runs. It could be more, but Tibshirani wrote that 20 simulations are a sufficient number.

What does this mean now?

Alright. As we already pointed out, we need a way to prevent the optimal K from being chosen as the number of data points. For this reason, the simulation average W_uniform and the standard deviation S'(K+1) are introduced.

By checking the condition of Formula 1 for each value of K, we take the K at which introducing a new centroid makes the biggest jump towards a reduced total within-cluster distance.

If you take a close look at Figure 3, you can see that the jumps in distance reduction are especially large for the first few centroids. This makes intuitive sense: looking back at Figure 1, it is clear that the more centroids you introduce, the smaller the changes in the within-cluster distance become.

So the basic idea of the Gap-Statistics is to choose the K at which the biggest jump in within-cluster distance occurred, relative to the overall behavior of uniformly drawn samples. It could be the case that only a very slight reduction in within-cluster distance occurred. For this reason, S'(K+1) acts as a threshold to sort out tiny changes and to remove the sampling noise from the data. Only if the change is so big that the threshold S'(K+1) no longer plays a role will that value of K be selected as optimal.
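To make this concrete, here is a compact sketch of the whole Gap-Statistics computation, using sklearn's KMeans for the clustering step and this post's definition of the total within-cluster distance (sum of point-to-centroid distances). The number of reference simulations, the range of K, and the seeds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def within_cluster_distance(X, labels, centers):
    """Total within-cluster distance: sum of distances from each point to its centroid."""
    return sum(np.linalg.norm(X[labels == k] - c, axis=1).sum()
               for k, c in enumerate(centers))

def gap_statistic(X, k_max=10, n_refs=20, seed=0):
    """Return G(K) and S'(K) for K = 1..k_max (sketch of Tibshirani's procedure)."""
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, s_primes = [], []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        log_w_data = np.log(within_cluster_distance(X, km.labels_, km.cluster_centers_))

        # B reference datasets drawn uniformly over the bounding box of the data
        log_w_refs = []
        for _ in range(n_refs):
            X_ref = rng.uniform(mins, maxs, size=X.shape)
            km_ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_ref)
            log_w_refs.append(np.log(within_cluster_distance(
                X_ref, km_ref.labels_, km_ref.cluster_centers_)))
        log_w_refs = np.array(log_w_refs)

        gaps.append(log_w_refs.mean() - log_w_data)                   # G(K)
        s_primes.append(log_w_refs.std() * np.sqrt(1 + 1 / n_refs))   # S'(K)
    return np.array(gaps), np.array(s_primes)

# Select the smallest K with G(K) >= G(K+1) - S'(K+1)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
G, S = gap_statistic(X)
K_star = next((k + 1 for k in range(len(G) - 1) if G[k] >= G[k + 1] - S[k + 1]), len(G))
print("Selected K* =", K_star)
```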

Let me visualize it for you:

Figure 4) Gap Statistics | made by me

Notice: The within-cluster distance curves have been normalized by the maximum cluster distance.

I chose the data in the upper left image of Figure 4 on purpose in such a way that the lower two clusters are very close to each other. It is almost impossible for K-Means to recognize that these are different clusters.

In the lower left image, we can see the Gap-Statistics. The optimal value K=3 is chosen, because we select the first peak before the value shrinks again. The red line is calculated by subtracting the log of W_data (blue) from the log of W_uniform (green) shown in the lower right plot.

One step back: the lower right image shows the W_data (blue) and W_uniform (green) curves. Looking at G(K) in Formula 2, we see that we need to subtract the log of W_data from the log of W_uniform.

The total within-cluster distance shrinks for W_data just as it does for the simulated W_uniform. But at K=4, the Gap-Statistics detects that the change in total distance for W_data does not behave like the simulated one, meaning it did not decrease as much as expected.

Even when the standard deviation S'(K+1) is subtracted, the Gap-Statistics still selects the optimal value K=3.

Notice: if we pretend that the gap at K=4 were not present, the next point that lies below its predecessor is at K=6. But since I marked the standard deviation with the thick vertical red lines, we can see that after subtracting the standard deviation, the change in distance is too small for K=6 to be selected. The next K that would be selected is K=7, because the next big gap occurs at K=8.

2) How can we measure the performance of the Gap-Statistics?

As I already mentioned, the initialization of the centroids highly affects the optimization of K-Means. We also need to take different data distributions into consideration. There are two basic cases to keep in mind when calculating the second step of the K-Means algorithm:

  • 1) The overall cluster distances are large -> we take the average
  • 2) The overall cluster distances are small -> we take the minimum

If we take these two cases into consideration before calculating the Gap-Statistics, it becomes much more robust.

In my example, the overall cluster distances were small, so I calculated the average, as can be seen in line 12 of Code 1, where I use Numpy.mean.

3) Under which conditions are the Gap-Statistics likely to fail?

There are three scenarios that could occur:

  • Underestimation of clusters
  • Overestimation of clusters
  • In general

Underestimation

If two or three clusters are very close together and the other clusters are far apart, the Gap-Statistics tends to underestimate the number of clusters. This is exactly what happened in the example I showed you previously: two of the clusters were too close together, so the Gap-Statistics underestimated K*.

Overestimation

If all clusters are close together, it tends to overestimate rather than underestimate. I cannot really explain why, but this is what happened when I computed it a couple of times.

Figure 5) Gap Statistics overestimated | made by me

In general

Both underestimation and overestimation depend mostly on the randomly initialized centroids. When some of them end up without assigned points due to bad luck in the initialization, the resulting break in the within-cluster distance forces the Gap-Statistics to select the optimal number of clusters earlier.

Conclusion

Even though the Gap-Statistics is a good approach for finding a suitable K, it is still not perfect. For example, we needed to introduce a new hyperparameter, namely the range of K values over which W_uniform is simulated, and we cannot be sure what the ideal value for it is. Furthermore, the random initialization of the centroids can lead to an over- or underestimation of K*.

Knowing all of these aspects of the Gap-Statistics, the best approach is to apply it and run the Gap-Statistics plot a couple of times. Taking the average over these runs gives a more reliable evaluation criterion.
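Building on the gap_statistic sketch from above (and reusing its X), averaging over a few runs could look like this; the number of runs and the seeds are assumptions.

```python
import numpy as np

# Average G(K) and S'(K) over several runs with different random seeds
runs = [gap_statistic(X, k_max=10, n_refs=20, seed=s) for s in range(5)]
G_mean = np.mean([G for G, _ in runs], axis=0)
S_mean = np.mean([S for _, S in runs], axis=0)

K_star = next((k + 1 for k in range(len(G_mean) - 1)
               if G_mean[k] >= G_mean[k + 1] - S_mean[k + 1]), len(G_mean))
print("Averaged selection: K* =", K_star)
```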

Hopefully you liked this post and it gave you some insights into Gap-Statistics.

Code

You can find the entire code for this project on my GitHub page here. You can write me an email if you have questions regarding the code; you can find my email address on my website.

Package

I recently built a package that performs the Gap-Statistics on various clustering algorithms, but it is mainly designed for KMeans. Check it out here:

You can download it with pip: “pip install gapstatistics”

References

[1] Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009), Springer
[2] Robert Tibshirani, Guenther Walther and Trevor Hastie, Estimating the number of clusters in a data set via the gap statistic (2000)

This blog post is based on knowledge gained in the course Pattern Analysis at Friedrich-Alexander-Universität Erlangen-Nürnberg. I used some parts of the lecture by Dr. Christian Riess to illustrate the examples in this post, so all rights go to Dr. Christian Riess.
