
K-means clustering (referred to as just k-means in this article) is a popular unsupervised Machine Learning algorithm (unsupervised means that no target variable, a.k.a. Y variable, is required to train the algorithm). When we are presented with data, especially data with lots of features, it’s helpful to bucket the observations. By sorting similar observations together into a bucket (a.k.a. cluster), we can then compare and contrast the buckets. Understanding the features that drive cross-bucket differentiation gives us critical clues about which features to focus on when creating an analysis or building a model (we may even want to use our clusters directly as a model feature).
So that’s clustering in a nutshell. Now let’s see how k-means separates our observations into meaningful clusters.
Getting Started
If you would like to see the code in its entirety, you can grab it from GitHub here.
We’re going to use the Titanic dataset again today (I already downloaded it for a previous post):
import numpy as np
import pandas as pd
import numpy.matlib
import matplotlib.pyplot as plt
import seaborn as sns
titanic = pd.read_csv('train.csv')
Since our main focus is to build k-means and explore how it works, we will just work with 2 columns from the dataset: fare (price paid for the ticket) and age of the passenger. We will also drop nulls. I sort the data by fare and then age because I will eventually pick the top k observations to be the cluster centers (centroids). Sorting ensures that I will pick k observations that are all very similar to each other as my initial centroids. This way, the starting centroids will be suboptimal and we can more clearly see how the algorithm is able to converge to much better centroids (and clusters).
cluster_data = titanic[['Fare','Age']].copy(deep=True)
cluster_data.dropna(axis=0, inplace=True)
cluster_data.sort_values(by=['Fare','Age'], inplace=True)
cluster_array = np.array(cluster_data)
The data we will cluster is stored in the array cluster_array. It looks like the following (first column is fare, second is age):
# Fare Age
[[ 0. 19. ]
[ 0. 25. ]
[ 0. 36. ]
[ 0. 38. ]
[ 0. 39. ]
[ 0. 40. ]
[ 0. 49. ]
[ 4.0125 20. ]
[ 5. 33. ]
[ 6.2375 61. ]]

The K-Means Algorithm
Before we code it up, let’s first understand conceptually how the k-means algorithm works. K-means imagines each cluster as a solar system. The star that everything (all the observations) in the cluster revolves around is known as the cluster’s centroid.

So given a set of centroids and their coordinates (where the X coordinate is fare and the Y coordinate is age), we can easily figure out what cluster each observation belongs to by calculating which centroid it is closest to (in terms of Euclidean distance).
But how do we decide where the centroids are? That’s where the k-means algorithm comes in. First, we pick a value for k, the number of centroids (this is a hyperparameter that we must tune). Let’s say we pick k to be 4. Then we can just pick 4 points at random and assign them to be our starting centroids. And using our randomly chosen starting centroids, we can create 4 clusters. Sounds kind of silly, right? What’s the point of picking random centroids and creating random clusters?
Here’s the trick: the means of our clusters become our new centroids (for each cluster, we calculate the mean fare and mean age, and that’s the coordinate of our new centroid). And as long as our randomly picked starting centroids were even slightly different from each other, the new centroids (the cluster means) will be more optimal than our initial centroids; where optimality is defined as maximizing similarity within each cluster and difference across clusters.
Once we have our new centroids, we can reassign each observation’s cluster based on which of the new centroids it is closest to. Since the centroids became a bit more optimal, our clusters should improve too (in terms of homogeneity within each cluster and separation across clusters). And now we can again calculate new centroids from the means of the coordinates of the reassigned clusters. These centroids will again have improved upon their predecessors, and we can keep rinsing and repeating this process until the algorithm has converged. Convergence is the point where we are no longer able to decrease the sum of squared deviations from the centroid (a.k.a. cluster mean) for all clusters. The sum of squared deviations from the mean is a measure of how alike the members of a cluster are to each other: the lower the value, the more similar the members and the better the cluster.
An example might help you better understand why this algorithm works. Imagine a room full of people with different heights (image below). We want to sort them into two clusters based on the only observable feature, height. So we choose two people at random (the ones in gray) and tell everyone else to stand next to whoever they are closest in height to (coin flip to break ties). Once they do so, the mean of each group (or the person whose height equals the mean) becomes the new centroid (in green). Notice how this serves to push the centroids apart: the green centroids are further apart than the gray ones. This pushing apart of the centroids (in reality we would have more than one feature and more than two centroids) is what tends to happen as we repeatedly take the mean and re-cluster (assuming we didn’t pick points at maximum and opposite extremes from each other as our initial centroids).
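To make that pushing-apart effect concrete, here’s a toy sketch of the height example with made-up numbers (heights in cm; this snippet is just an illustration and isn’t used later):

# Toy sketch of the height example (made-up numbers)
heights = np.array([150., 155., 160., 170., 180., 185., 190.])
centroids = np.array([170., 180.])  # two deliberately similar starting centroids
for step in range(5):
    # Assign each person to the nearest centroid (1-D distance is just the absolute difference)
    labels = np.argmin(np.abs(heights[:, None] - centroids[None, :]), axis=1)
    # The mean height of each group becomes the new centroid
    centroids = np.array([heights[labels == c].mean() for c in range(2)])
    print(step + 1, centroids)  # the centroids drift apart, then settle at 158.75 and 185.0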

Coding Up K-Means – Helper Functions
Before we start coding it up, let’s refresh ourselves on the steps:
- Randomly assign centroids to start things up.
- Based on those centroids (and each observation’s distance from them), assign each observation to a cluster.
- Calculate the mean coordinates of each cluster; these are our new centroids.
- Reassign clusters based on the new centroids.
- Keep repeating the previous two steps (recalculate centroids, reassign clusters) until convergence.
To make life easier, let’s define a few helper functions. First, let’s write one to calculate the Euclidean distance (a.k.a. the straight-line distance) between two points:
def calc_distance(X1, X2):
    return (sum((X1 - X2)**2))**0.5
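As a quick sanity check with a couple of made-up points (a 3-4-5 right triangle):

print(calc_distance(np.array([0, 0]), np.array([3, 4])))  # prints 5.0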
Next, we need a function that, given a set of centroids, can tell us which cluster each observation belongs to. The following function uses nested for loops (not efficient, I know) to calculate the distance between every observation and every centroid (using our calc_distance function). Then it assigns each observation to a cluster based on whichever centroid it is closest to. The output is a list of each observation’s cluster label.
# Assign observations to clusters based on the closest centroid
def assign_clusters(centroids, cluster_array):
    clusters = []
    for i in range(cluster_array.shape[0]):
        distances = []
        for centroid in centroids:
            distances.append(calc_distance(centroid, cluster_array[i]))
        cluster = [z for z, val in enumerate(distances) if val == min(distances)]
        clusters.append(cluster[0])
    return clusters
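If the nested loops bother you, the same assignment can also be sketched with numpy broadcasting (an optional alternative; the rest of this post sticks with the loop version above):

# Optional vectorized sketch: compute an (n_observations, k) distance matrix in one go
def assign_clusters_vectorized(centroids, cluster_array):
    centroids = np.array(centroids)
    diffs = cluster_array[:, None, :] - centroids[None, :, :]   # shape (n, k, 2)
    distances = np.sqrt((diffs**2).sum(axis=2))                 # shape (n, k)
    return list(distances.argmin(axis=1))                       # index of the closest centroid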
Now we need a function for the updating step where we calculate new centroids. The following function concatenates the data (the fare and age of each observation, cluster_array) with each observation’s current cluster label (clusters) into a dataframe, cluster_df. We can then filter cluster_df by cluster to get just the observations that belong to a particular cluster and calculate the mean of those observations. These calculated means are our new centroids.
# Calculate new centroids based on each cluster's mean
def calc_centroids(clusters, cluster_array):
    new_centroids = []
    cluster_df = pd.concat([pd.DataFrame(cluster_array),
                            pd.DataFrame(clusters, columns=['cluster'])],
                           axis=1)
    for c in set(cluster_df['cluster']):
        current_cluster = cluster_df[cluster_df['cluster'] == c][cluster_df.columns[:-1]]
        cluster_mean = current_cluster.mean(axis=0)
        new_centroids.append(cluster_mean)
    return new_centroids
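For what it’s worth, pandas’ groupby can express the same idea more compactly (an equivalent sketch; the loop version above is what we’ll keep using):

# Equivalent sketch: mean fare and age per cluster via groupby
def calc_centroids_groupby(clusters, cluster_array):
    cluster_df = pd.DataFrame(cluster_array)
    cluster_df['cluster'] = clusters
    # groupby returns the clusters in sorted label order
    return [row for _, row in cluster_df.groupby('cluster').mean().iterrows()]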
The returned value of calc_centroids is a list of the new centroids, where the first value of each centroid is the cluster’s mean fare and the second is its mean age:

# Example output of calc_centroids
#      Mean Fare     Mean Age
# 0    34.835711    18.394295
# 1    13.119311    32.090217
# 2   285.381483    31.166667
# 3    96.662211    36.114023
The last helper function we need is more for reporting purposes. We want to know what the variance within each cluster is, or in other words, how similar or dissimilar the observations within a cluster are to each other. So let’s build a function to calculate the sum of squared deviations from the centroid for each cluster. This function filters cluster_df by cluster, calculates the cluster mean, and then subtracts the cluster mean from each observation within the cluster. The function repmat takes a given array and replicates it; in our case, we want to copy the mean as many times as we have observations so that we can directly subtract the two arrays.
# Calculate variance within each cluster
def calc_centroid_variance(clusters, cluster_array):
    sum_squares = []
    cluster_df = pd.concat([pd.DataFrame(cluster_array),
                            pd.DataFrame(clusters, columns=['cluster'])],
                           axis=1)
    for c in set(cluster_df['cluster']):
        current_cluster = cluster_df[cluster_df['cluster'] == c][cluster_df.columns[:-1]]
        cluster_mean = current_cluster.mean(axis=0)
        mean_repmat = np.matlib.repmat(cluster_mean, current_cluster.shape[0], 1)
        sum_squares.append(np.sum(np.sum((current_cluster - mean_repmat)**2)))
    return sum_squares
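Incidentally, pandas broadcasting would let us skip the repmat step entirely, since subtracting a Series from a DataFrame aligns the Series index with the DataFrame’s columns. A tiny sketch with made-up numbers:

# Broadcasting sketch: same sum of squared deviations, no repmat needed
toy = pd.DataFrame([[0., 19.], [5., 33.]])           # two made-up observations
print(((toy - toy.mean(axis=0))**2).sum().sum())     # 110.5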
Cool, we are ready to run k-means now.
Running K-Means
Let’s go with 4 clusters still (k=4). As I stated previously, we will purposefully choose bad starting centroids so that we can see the improvements made by the algorithm (I use observations 2, 3, 4, and 5 because they produce really bad starting clusters); in reality, we don’t want to do this as it slows things down. Then we use our helper function, assign_clusters, to assign each observation to a cluster based on the centroid that it’s closest to.
Following this, we run a for loop 20 times (20 is enough for convergence in this case) where we repeatedly calculate new centroids (using calc_centroids) and new clusters (using assign_clusters) so that we can obtain optimal clusters. Recall that repeating this process of calculating cluster means (a.k.a. new centroids) and reassigning clusters based on those new centroids is how the algorithm converges to the final clusters.
k = 4
cluster_vars = []

# Deliberately pick 4 similar observations (indices 2 through 5) as poor starting centroids
centroids = [cluster_array[i+2] for i in range(k)]
clusters = assign_clusters(centroids, cluster_array)
initial_clusters = clusters
print(0, round(np.mean(calc_centroid_variance(clusters, cluster_array))))

for i in range(20):
    centroids = calc_centroids(clusters, cluster_array)
    clusters = assign_clusters(centroids, cluster_array)
    cluster_var = np.mean(calc_centroid_variance(clusters, cluster_array))
    cluster_vars.append(cluster_var)
    print(i+1, round(cluster_var))
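As a small optional tweak, instead of hard-coding 20 iterations we could stop as soon as the cluster assignments stop changing (just a sketch; the fixed 20 iterations above work fine here):

# Optional sketch: iterate until the assignments no longer change
for i in range(100):
    centroids = calc_centroids(clusters, cluster_array)
    new_clusters = assign_clusters(centroids, cluster_array)
    if new_clusters == clusters:
        print('Converged after', i + 1, 'more iterations')
        break
    clusters = new_clusters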
For each iteration in my loop, I stored the average of the clusters’ sums of squared deviations from their centroids in the list cluster_vars. This average is an approximate measure of the level of variance within each cluster (for all clusters). Because we want the cluster members to be as alike as possible, we would like this average to be as low as possible. Let’s check out how this average evolves as we iterate. It decreases drastically at the start and then levels off. By the sixth iteration of our loop, it’s more or less converged.
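If you want to reproduce the convergence plot yourself, a minimal sketch using cluster_vars (the styling of the original figure may differ):

# Sketch of the convergence plot
plt.plot(range(1, len(cluster_vars) + 1), cluster_vars)
plt.xlabel('Iteration')
plt.ylabel('Mean Within-Cluster Sum of Squares')
plt.show()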

Let’s take a look to see if our clusters got better. Recall that we started with deliberately poor centroids. Here are the clusters based on those initial centroids (each color is a cluster). It seems like we got lucky with some age-based separation, but that’s about it.

Now here are the converged clusters. These clusters seem to be more meaningfully differentiated:
- There’s the older low fare folks (green).
- The younger low fare folks (dark red).
- The ones that could afford premium tickets (blue).
- And the ones that could afford super expensive tickets (gold, ironically).
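For reference, a rough sketch of how a scatter plot like the ones above can be drawn (the colors and styling in the original figures may differ; swap in initial_clusters to see the starting clusters):

# Sketch of the cluster scatter plot
plt.scatter(cluster_array[:, 0], cluster_array[:, 1], c=clusters, cmap='viridis', s=15)
plt.xlabel('Fare')
plt.ylabel('Age')
plt.show()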

Another way we can check how we did is by seeing whether the survival probabilities of our clusters are different. Remember that k-means clustering is unsupervised, so we are not explicitly training the model to predict whether a passenger survived. Rather, we are hoping that by producing meaningfully differentiated clusters, we can find groups that also happen to behave differently in terms of the things we care about (in this case survival). And it seems like our converged clusters do a better job of predicting survival relative to our initial ones.
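Because cluster_data kept the original row index when we dropped nulls, the cluster labels line up with the original dataframe, so a check along these lines works (just a sketch):

# Sketch: compare survival rates across initial vs. converged clusters
survival = titanic.loc[cluster_data.index, ['Survived']].copy()
survival['initial_cluster'] = initial_clusters   # lists assign positionally, matching cluster_array's order
survival['final_cluster'] = clusters
print(survival.groupby('initial_cluster')['Survived'].mean())
print(survival.groupby('final_cluster')['Survived'].mean())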

Conclusion
Obviously, there is a lot more we can do, including adding more features (versus just 2), tuning k (the number of clusters), and trying to better understand the identity and key characteristics of each cluster. But that’s something for another day. I hope that by reading this post, you gained some insight into how clustering works, and how k-means uses a wonderfully simple algorithm to produce some pretty cool results. Cheers and stay safe everyone!