K-Means Clustering

A quick & easy tutorial on an essential Machine Learning tool

Andrew Cole
Towards Data Science


Data is, if nothing else, ambiguous. It contains infinite troves of information, and unlocking that information requires equally flexible methods. But, as with anything, there must be a starting place. Classification is one of the most effective machine learning approaches: it seeks to assign data to a particular group based on its existing characteristics. However, we don't always know what those characteristics are, and therein lies the problem. Without even knowing what we are looking at in our datasets, we're just stuck staring at the data while it stares back at us.

Fortunately, there are ML methods which can help with this ambiguity. Machine learning falls into two broad categories:

  • Supervised Learning: Algorithms “learn” from a training dataset which already contains labeled data. The training dataset contains input and output data, which is how the model validates its own performance. For example, we have a hospital patient who we know exhibits symptoms X, Y, and Z, and who has Disease 1. The input data represents the three symptoms while the output represents that the patient does indeed have Disease 1. Because both the input and output are known, we can utilize supervised machine learning.
  • Unsupervised Learning: Algorithms infer patterns within data whose outcomes are unknown. You cannot really classify the data into an outcome because you don’t even know what the values for the outcome might be. For example, you have a business with 100,000 customers and their purchase history, and you want to know demographic groupings of those customers to better adjust your business model. Unsupervised learning methods are more appropriate here because they will effectively “group” those customers into similar segments based only on the available data.

K-Means Clustering — The Model

We are going to dive into a tutorial on how to create a basic unsupervised learning model, and how to evaluate its success. There are many applications of unsupervised learning, but we are going to focus on one in particular: clustering. Clustering scatters all the data points and groups them based on their characteristics. The goal is to group the points so that intra-cluster similarity is high (points within a cluster are alike), while inter-cluster similarity stays low (the clusters themselves differ from one another).

The main goal of clustering is to group data points together without knowing what those data points actually are.

With K-Means clustering, we specify k, the number of cluster centers. The general algorithm proceeds as follows:

1. The data will be grouped around a pre-determined number of center points placed throughout the dataset.

2. Each individual observation will be assigned to the cluster whose center point is “closest” (by Euclidean distance).

3. The respective cluster center points will be recomputed now that they have their surrounding observations assigned.

4. The observations are then reassigned to one of the clusters according to some “rule” (see init below).

5. The process is iterative, so if reallocating observation points achieves closer distances, it will occur. If not, the model is finished.
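
To see those five steps in code, here is a minimal from-scratch sketch of the loop. This is a toy illustration only, not scikit-learn's implementation: it assumes the simple “random” initialization and ignores edge cases such as empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: place k centers by picking k random observations ("random" init)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned observations
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: iterate until reassignment no longer moves the centers
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```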

To start the modeling process, we import the necessary libraries and generate random data:
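A minimal setup sketch, using scikit-learn's make_blobs to synthesize the data. The specific argument values here (such as n_samples=700 and cluster_std=0.8) are illustrative assumptions, not the article's exact code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a random 2-D dataset with several natural groupings
X, _ = make_blobs(n_samples=700, centers=7, cluster_std=0.8, random_state=42)

# Scatter the raw, unlabeled data to eyeball the blobs
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()
```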

Here, we can see roughly 5 to 7 distinct blobs, or clusters, of data. The K-Means algorithm will now attempt to compute a center point for each of ‘k’ clusters. We will define ‘k’ = 7. The algorithm itself is somewhat straightforward:
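A sketch of the fitting step, assuming scikit-learn's KMeans (the random_state value is an arbitrary choice for reproducibility):

```python
from sklearn.cluster import KMeans

# Fit K-Means with k = 7 and assign every observation to a cluster
kmeans = KMeans(n_clusters=7, random_state=42)
y_pred = kmeans.fit_predict(X)
```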

As with any machine learning model, there is a plethora of parameter options to help tune your model to your data (argument options in parentheses).

  • n_clusters: this is your “k” value. This parameter tells the algorithm how many clusters to group the data into, and therefore how many center points will need to be calculated. There is no definitive way to pre-determine the exact k, so iterating the model is necessary; the best-performing result metric (which we get into a bit later on) will tell you which number to use for your final model.
  • init: this is your “rule” method for initializing the cluster centers (k-means++: the default; selects initial cluster centers to pursue faster convergence; random: chooses k random observations as the initial centers; ndarray: allows you to provide your own initial center points)
  • algorithm: specifies which algorithm is used during clustering (full: full Expectation Maximization, where each iteration runs an E-step that assigns points to their nearest center and an M-step that updates cluster means from the cluster members; elkan: more efficient, but not available when using sparse data; auto: automatically picks full/elkan based on the data given). Note that newer versions of scikit-learn rename full to lloyd.
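
The plotting code could look something like this sketch, coloring each observation by its assigned cluster and overlaying the calculated centers from the model's cluster_centers_ attribute:

```python
# Color observations by cluster assignment and mark each computed center in black
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=10, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=80)
plt.show()
```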

The above code results in the graph below. The black dots at the very center of each of the 7 clusters represent the calculated center point of each cluster.

K-Means Clustering: Evaluation Metrics

As mentioned before, clustering is an iterative process. There is no great way to pre-determine how many clusters (k) we should use in our model, so we must run multiple models and compare the resulting metrics. There are numerous metrics available within the scikit-learn library, but we will focus on two:

  • Variance Ratio (Calinski-Harabasz Score): the ratio of between-cluster variance to within-cluster variance. A higher variance ratio score = a better-performing model.
  • Silhouette Score: for each sample, the score is (b - a) / max(a, b), averaged over all samples, where:

a = average distance between one data sample and all other points within the same cluster

b = average distance between one data sample and all the points in the next nearest cluster

Scores range from -1 to 1, and a higher score = a better model.

Note that neither metric is necessarily “better” than the other. What matters is that once you elect an evaluation metric, you stick to that same metric across all models.

We now compute both the Silhouette Score and the Calinski-Harabasz Score (Variance Ratio) for the k = 7 model.
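
Both scores are one-liners in scikit-learn; a sketch:

```python
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Both metrics take the raw data and the predicted cluster labels
print('Silhouette Score:', silhouette_score(X, y_pred))
print('Calinski-Harabasz Score:', calinski_harabasz_score(X, y_pred))
```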

Now that we have the performance metrics for the k = 7 model, we will iterate again, this time using k = 6. Once the new model is fit and its clusters predicted, we will compare metrics; whichever scores higher is the better-performing model. The code is as follows:
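
A sketch of the k = 6 iteration, reusing the same data and the same metrics so the comparison is apples-to-apples:

```python
# Re-fit with k = 6 and score it with the same metrics as before
kmeans_6 = KMeans(n_clusters=6, random_state=42)
y_pred_6 = kmeans_6.fit_predict(X)

print('Silhouette Score:', silhouette_score(X, y_pred_6))
print('Calinski-Harabasz Score:', calinski_harabasz_score(X, y_pred_6))
```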

When we compare the two resulting model plots, we cannot really differentiate between the two sets of centers, apart from the upper-right cluster of light purple samples (x = -5, y = 0.0), which is grouped differently. Therefore, we must compare evaluation metrics to determine which model scores better. Let’s use the Silhouette Score as our comparison (again, once you choose a metric you must use it for all model iterations).

Conclusion

We now have two metrics for comparison.

  • k = 7; silhouette score = 0.70
  • k = 6; silhouette score = 0.68

The silhouette score for k = 7 is higher than for k = 6, therefore k = 7 is the better-performing model.
