K-Means Clustering - An Introduction

An overview of a popular unsupervised machine learning method

Photo by Karolina Grabowska and obtained from Pexels.com.

When we are working with unlabelled datasets during the Exploratory Data Analysis phase of a project, we may be interested in separating our data into groups based on similarities. This allows us to identify patterns within the data that may not be immediately obvious to the human eye. This is achieved through the unsupervised learning process of clustering.

One of the most popular clustering methods for grouping data based on similarities is K-means clustering: an unsupervised machine learning algorithm that is relatively easy to understand and straightforward to implement in Python.

Within this article, we will be covering the basics of the K-means clustering algorithm.

What is Clustering?

Clustering is an unsupervised Machine Learning process that aims to separate an unlabelled dataset into a number of groups based on the similarity of nearby points.

Data points that have similar characteristics are placed in the same cluster, and those that have different characteristics are placed in another cluster.

Before and after clustering of data. Image by the author.

There are many different algorithms available for clustering data, including:

  • k-means
  • DBSCAN
  • Gaussian Mixture Model
  • Ward
  • Agglomerative Clustering
  • BIRCH

An example of the different methods available within the popular Scikit-Learn Python library can be viewed here.

What is K-means Clustering?

K-means clustering is a popular unsupervised algorithm that groups data into ‘k’ clusters, where k is defined by the user. The algorithm attempts to minimise the within-cluster sum of squared distances, that is, the squared distance between each data point and its cluster centre point, called a centroid.

The centroids are initialised at k random points in the data space, and every data point is assigned to the cluster of its nearest centroid. Each centroid is then adjusted to the central point of its cluster, and the surrounding points are reassigned. This continues until the centroids and cluster assignments stop changing, or until a maximum number of iterations is reached.

K-means is a hard clustering method, which means that each data point belongs to exactly one cluster rather than having partial membership in several.
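
Before stepping through the algorithm, here is a minimal sketch of running k-means with scikit-learn; the synthetic dataset and parameter values are illustrative choices, not from this article:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic dataset with 3 natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means with a user-defined k of 3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

print(kmeans.labels_[:10])       # hard cluster assignment for each point
print(kmeans.cluster_centers_)   # final centroid positions
```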

Applications of K-Means Clustering

The applications of k-means clustering are numerous, and include:

  • Image Segmentation
  • Social Network Analysis
  • Customer Segmentation
  • Classification of Documents
  • Anomaly Detection

Within the geoscience and petrophysics domains, it is used for:

  • Outlier detection with well-log measurements
  • Classification of facies from well logs and/or core analysis data

K-Means Clustering Algorithm – How it Works

Overview of How K-Means Clustering Works

The following workflow illustrates the overall process of how the k-means algorithm works. Each step is detailed in the subsequent sections.

Overview of the k-means clustering algorithm. Image by the author.

K-Means Clustering Step by Step

Let us have a closer look at each of the steps.

Step 1. Gather our data & Determine the Value for ‘k’

The first step is to gather our data together and determine how many clusters we want to split our data into. For this dataset, we are going to split it into 3 clusters.

Step 1, identify the number of clusters to group the data into. Image by the author.

There are a few ways to determine the optimum number of clusters, which are covered later in this article.

Step 2. Select k random points within the data

Next, we select ‘k’ random points from the data. In this case k = 3, so we select 3 random points. These become the initial cluster centroids.

Step 2: Select k random points from the dataset. Image by the author.

Step 3. Assign Points to Closest Seed Point

The distance between each point in the dataset and each centroid is computed using the Euclidean distance.
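
For two points p and q in n-dimensional space, the Euclidean distance is defined as:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```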

Once this has been calculated for every point, we assign each point to its closest centroid. This forms our initial clustering.

Step 3: Calculate the Euclidean distance between the points and the centroid. Then assign the points to the closest seed point (centroid). Image by the author.
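
As a rough NumPy sketch of steps 2 and 3 combined (the dataset and the choice of k = 3 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((300, 2))   # illustrative data: 300 points in 2D

# Step 2: pick k = 3 random points from the data as the initial centroids
centroids = X[rng.choice(len(X), size=3, replace=False)]

# Step 3: Euclidean distance from every point to every centroid,
# then a hard assignment of each point to its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)
```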

Step 4. Identify New Centre Points

We then calculate the average (mean) point of each cluster. These become the new centroids.

Step 4: Identify the new centroids by finding the average point within each cluster. Image by the author.
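
Continuing the NumPy sketch above, and reusing its X and labels arrays, the update step is simply a per-cluster mean:

```python
# Step 4: each new centroid is the mean of the points assigned to its cluster
centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
```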

Step 5. Assign Points to Closest Centroid

Step 5: Assign the points to the new cluster centroids. Image by the author.

We then repeat the process of assigning each point to its nearest centroid using the Euclidean distance.

Euclidean Distance Example. Image by the author.

Step 6. Identify the New Cluster Centres

The centroids are then recomputed.

Step 6: Recompute the average point of each cluster and adjust the centroids. Image by the author.

Step 7. Repeat Steps 4–6

The process of assigning the points to the nearest centroids and recomputing the mean point of each cluster is repeated until the stopping conditions below are met.

Step 6: Reassign the points to the nearest centroid. Image by the author.
Step 7: Identify new centroids and repeat this process until specified conditions have been met. Image by the author.

When Does Clustering Stop?

We repeat the clustering process until certain conditions are met:

  • The maximum number of iterations has been reached
  • The model converges, i.e. there is little to no change in the centroid positions or in the points assigned to each cluster
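
Both conditions can be expressed as a simple loop. The sketch below continues the earlier NumPy example (X and centroids as defined there); the tolerance and iteration cap are illustrative values:

```python
max_iter, tol = 300, 1e-4
for _ in range(max_iter):                     # condition 1: iteration cap
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
    # condition 2: convergence, i.e. the centroids have stopped moving
    if np.linalg.norm(new_centroids - centroids) < tol:
        break
    centroids = new_centroids
```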

Identifying the Optimum Number of Clusters

Rather than guessing at a value for ‘k’, we can use a number of techniques to select one. The two discussed here are simple, manual ways to identify the optimum number of clusters.

Elbow Plot

The elbow method is the most commonly used way to determine the optimum number of clusters because it is simple and easy to visualise.

Essentially, we run the k-means algorithm multiple times with different values of k and calculate the Within-Cluster Sum of Squared Errors (WSS), a property also known as inertia.

Once we have the results, we plot the inertia against the number of clusters and identify the point in the graph where the curve starts to "flatten out" (the "elbow"). In the example below, we could pick a value between 5 and 10 as our optimal number for k.

Bear in mind that doing this for a large number of clusters can increase computation time.

Elbow plot for selecting the optimum number of clusters. Image by the author.
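
A rough sketch of how such a plot could be produced with scikit-learn, assuming X is the feature array from the earlier sketches and an illustrative range of k values:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(1, 15)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)   # within-cluster sum of squared errors

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (WSS)")
plt.show()
```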

Silhouette Method

The silhouette method provides a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

It provides a value between -1 and +1, with a value closer to +1 being more desirable, indicating that the data point sits well within its own cluster and far away from other clusters.

If many points have negative values, then we possibly have too many or too few clusters.

Silhouette Analysis for selecting the optimum number of clusters for k-means clustering. Image by the author.
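
A minimal sketch of scoring candidate values of k with scikit-learn's silhouette_score, again assuming the X array from the earlier sketches:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# The silhouette score needs at least 2 clusters; closer to +1 is better
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```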

More details on both of these methods can be found here:

K-Mean | K Means Clustering | Methods To Find The Best Value Of K

Improving the K-Means Result

If we repeat the k-means clustering process then, depending on the parameters, we are likely to get different results. This is due to differences in the points that are selected as the initial centroids. Additionally, once the centroids have been initialised, it is difficult for them to move large distances or to move near clusters that are already relatively stable.

One way to improve the results is to repeat the k-means process several times and keep the run with the lowest total sum of variance within all of the clusters.

There are also a number of initialisation techniques available for k-means clustering, two of which appear in the scikit-learn sketch after this list:

  • Selecting random points (the approach used in the example above)
  • k-means++
  • Naive Sharding
  • Furthest Point Heuristic
  • Sorting Heuristic
  • Projection Based
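
In scikit-learn, the first two of these map onto the init parameter, while n_init implements the repeat-and-keep-the-best strategy described above; the values below are illustrative, and X is assumed from the earlier sketches:

```python
from sklearn.cluster import KMeans

# 'random' matches the simple seeding used in the walkthrough above;
# 'k-means++' spreads the initial centroids out to improve stability
km_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=42)
km_plus = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)

# n_init reruns the algorithm and keeps the run with the lowest inertia
km_plus.fit(X)
```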

Advantages and Disadvantages of K-Means Clustering

Advantages:

  • A fast, effective and efficient algorithm
  • Easy to understand
  • Easy to implement within Python
  • Can be scaled to large datasets
  • Guarantees convergence

Disadvantages:

  • Need to specify the number for k before running the algorithm
  • Results depend on how the centroids are initialised; once centroids have been initialised, they struggle to move large distances, especially when other clusters are already relatively stable
  • Sensitive to outliers and noise, since outliers can influence the initialisation and final positions of the centroids
  • Speed may become an issue with large datasets
  • Can have issues scaling as the number of dimensions increases
  • Clusters are assumed to be spherical with each cluster having a similar number of data points

Python Implementation

I previously put together an article that goes through the required steps along with code samples. That example uses well-log measurements to group the data into different clusters, which can be interpreted as different lithologies.

You can find it at the link below.

How to use Unsupervised Learning to Cluster Well Log Data using Python

Summary

In summary, the K-means algorithm is a very popular unsupervised machine learning technique that is easy to understand and implement. It is an effective solution for grouping data points together based on similarities and should be considered as an option during the exploration phase of your data analysis.


Thanks for reading. Before you go, you should definitely subscribe to my content and get my articles in your inbox. You can do that here! Alternatively, you can sign up to my newsletter to get additional content straight into your inbox for free.

Secondly, you can get the full Medium experience and support me and thousands of other writers by signing up for a membership. It only costs you $5 a month, and you have full access to all of the amazing Medium articles, as well as have the chance to make money with your writing. If you sign up using my link, you will support me directly with a portion of your fee, and it won’t cost you more. If you do so, thank you so much for your support!

