
Clustering is very powerful because it does not require labels. Getting labeled data is often expensive and time-consuming. Clustering is typically used to find patterns in data, and the discovered patterns are then used to improve a product. One famous example is customer clustering, in which groups of similar customers are identified. If customers of one group buy certain products, the other customers of this group might also like them, so targeted advertising can be applied to increase sales. Another famous example is clustering network activities into fraudulent and non-fraudulent actions.
There are a lot of different clustering algorithms. One very simple and still powerful clustering algorithm is the K-Means algorithm. K-Means requires the user to define the number of clusters in advance, and possibly also the initialization strategy for the cluster centers. But how can these parameters be found? For the clustering of network activities, two clusters could be used: one for the fraudulent and one for the non-fraudulent activities. But how many clusters should be used for the customer clustering? In supervised learning, one could try different hyperparameters and numbers of clusters and directly calculate an error metric like the accuracy; the set of hyperparameters and number of clusters leading to the highest accuracy would then be used for the final model. This is not possible in unsupervised learning, because of the lack of ground truth values. So what can be done in unsupervised learning to evaluate and compare different hyperparameters and numbers of clusters?
One possibility is to calculate the within-cluster sum of squares, also called Inertia.
In this short article, the Inertia value is introduced. A K-Means clustering algorithm is then trained on a small data set using Scikit-Learn. The optimal number of clusters is found by applying the Elbow Method to the computed Inertia curve. And last but not least, this article shows how to find optimal hyperparameters using the Inertia value. The code is written in Python in a Jupyter Notebook and can be found on my GitHub page.
Inertia
The Inertia, or within-cluster sum of squares, gives an indication of how coherent the different clusters are. Equation 1 shows the formula for computing the Inertia value.

$$\text{Inertia} = \sum_{i=1}^{N} \lVert x_i - C_{k(i)} \rVert^2 \quad (1)$$

N is the number of samples within the data set, x_i is the i-th sample, and C_{k(i)} is the center of the cluster that x_i is assigned to. So the Inertia simply computes the squared distance of each sample to its cluster center and sums these distances up, over all clusters and all samples within the data set. The smaller the Inertia value, the more coherent the different clusters are. If as many clusters were used as there are samples in the data set, the Inertia value would be zero, because every sample would be its own cluster center. So how can the optimal number of clusters be found using the Inertia value?
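To make the formula concrete, here is a minimal sketch (with a tiny, made-up data set) that computes the Inertia by hand and compares it with the value Scikit-Learn stores in the inertia_ attribute of a fitted model:

```python
import numpy as np
from sklearn.cluster import KMeans

# A tiny, made-up 2-D data set, just to illustrate the formula
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distance of every sample to its assigned center
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
inertia_by_hand = np.sum((X - assigned_centers) ** 2)

print(inertia_by_hand)  # computed with Equation 1
print(kmeans.inertia_)  # the same value stored by Scikit-Learn
```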
For this, the so-called Elbow Method can be used. Let's take a look at this method using an example.
Elbow Method for Optimal Number of Clusters
It's always better to learn something by directly applying it to an example. For this purpose, a data set with two-dimensional features is created using Scikit-Learn's make_blobs function. Figure 1 shows the code for creating this data set, while Figure 2 shows the plot of this data set.
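The data creation could look roughly like the following sketch; the sample count and random seed are illustrative assumptions, not necessarily the values used in Figure 1:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create a two-dimensional toy data set with three blobs
# (sample count and seed are illustrative choices)
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=10)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```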

There are three clusters in this data set, so the optimal number of clusters for K-Means should be three. But let’s assume we don’t know that yet.
Let's now train one K-Means model for each number of clusters and store the Inertia value of each trained model. Afterwards, the Inertia curve can be plotted in order to use the Elbow Method for finding the optimal number of clusters. The code for these steps can be found in Figure 3, while Figure 4 shows the resulting Inertia curve.
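A sketch of this loop, reusing the toy data set from above, could look as follows; the tested range of cluster counts is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same toy data set as above
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Train one K-Means model per candidate cluster count and store its Inertia
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

# Plot the Inertia curve and mark the elbow at k=3 with a red x
plt.plot(ks, inertias, "o-")
plt.plot(3, inertias[2], "rx", markersize=12)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
```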

The red x-marker marks the elbow point. The elbow point gives the optimal number of clusters, which is three here. This makes total sense, because the data set was created such that there are three different clusters. When more clusters are added, the Inertia value keeps decreasing, but the information contained in each cluster decreases as well. Having too many clusters leads to worse performance and a suboptimal clustering result. Let's assume you cluster your customers and end up with many small clusters. When a customer of a small cluster buys something, you can only address a few other potential buyers of that product. But with a large coherent cluster, you could directly address many more potential buyers.
So for this example, the optimal number of clusters is three. Figure 5 shows the visualization of the three different clusters.
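The final clustering could be visualized with a sketch like the following, again reusing the toy data set from above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same toy data set as above
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit the final model with the optimal number of clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Color each sample by its assigned cluster and mark the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```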

Using Inertia Value for Finding Optimal Hyperparameters
The Inertia value can also be used for finding better hyperparameters for the unsupervised K-Means algorithm. One potential hyperparameter is the initialization method. In Scikit-Learn, there are two different options: one is called k-means++ and one is called random. But which initialization should be used? In order to answer this question, one can train one K-Means model for each initialization strategy and compare their Inertia values. The strategy leading to the smaller Inertia value can then be used as the optimal strategy. Figure 6 shows the code for this evaluation and Figure 7 shows a data frame with the results.
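Such a comparison could be sketched as follows; the data set and the fixed number of clusters are assumptions carried over from the example above, and the resulting numbers depend on the data and the random seed:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same toy data set as above
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Train one model per initialization strategy and collect the Inertia values
results = []
for init in ["k-means++", "random"]:
    model = KMeans(n_clusters=3, init=init, n_init=10, random_state=42).fit(X)
    results.append({"init": init, "inertia": model.inertia_})

print(pd.DataFrame(results))
```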

As one can see, the random initialization achieves a slightly smaller Inertia value and would here be used as the optimal initialization strategy.
Conclusion
Clustering algorithms are very powerful for finding patterns in data. They often require only a few hyperparameters, like the number of clusters or the initialization strategy for the cluster centers. Finding the optimal values is not as straightforward as in supervised learning, due to the lack of ground truth values. In order to still be able to find good values, the Inertia value and the Elbow Method can be used.