When working with unlabelled datasets during the Exploratory Data Analysis phase of a project, we may want to separate our data into groups based on similarities. This allows us to identify patterns within the data that may not be immediately obvious to the human eye. This is achieved through the unsupervised learning process of clustering.
One of the most popular clustering methods to achieve the grouping of data based on similarities is K-means clustering. It is a very commonly used unsupervised machine learning algorithm that is relatively easy to understand and easy to implement within Python.
Within this article, we will be covering the basics of the K-means clustering algorithm.
What is Clustering?
Clustering is an unsupervised Machine Learning process that aims to separate an unlabelled dataset into a number of groups based on the similarity of the data points.
Data points that have similar characteristics are placed in the same cluster, and those that have different characteristics are placed in another cluster.

There are many different algorithms available for clustering data including:
- k-means
- DBSCAN
- Gaussian Mixture Model
- Ward
- Agglomerative Clustering
- BIRCH
An example of the different methods available within the popular Scikit-Learn Python library can be viewed here.
What is K-means Clustering?
K-means clustering is a popular unsupervised algorithm that groups data into ‘k’ clusters, where k is defined by the user. The algorithm works by minimising the within-cluster sum of squared distances between each data point and its cluster centre point, called a centroid.
The centroids are initialised at k random points in the data space, and every point is assigned to the cluster of its nearest centroid. Each centroid is then moved to the central point (mean) of its cluster and the surrounding points are reassigned. This continues until either the centroids and cluster assignments stop changing, or a maximum number of iterations is reached.
K-means is a hard clustering method which means that a data point either belongs to a cluster or it doesn’t.
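As a quick sketch of this hard-clustering behaviour, the snippet below runs scikit-learn's `KMeans` on a small synthetic dataset (the three groups and their locations are purely illustrative) and shows that every point receives exactly one cluster label:

```python
import numpy as np
from sklearn.cluster import KMeans

# A small synthetic dataset with three obvious groups (illustrative values)
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# Fit K-means with k=3; n_init controls how many random restarts are tried
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

# Each point receives exactly one label (hard clustering)
print(labels.shape)        # (150,)
print(np.unique(labels))   # [0 1 2]
```

Because the assignment is hard, `labels` contains a single integer per point; there are no partial or probabilistic memberships as there would be with, say, a Gaussian Mixture Model.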
Applications of K-Means Clustering
The applications of k-means clustering are numerous.
- Image Segmentation
- Social Network Analysis
- Customer Segmentation
- Classification of Documents
- Anomaly Detection
Within the geoscience and petrophysics domains:
- Outlier detection with well-log measurements
- Classification of facies from well logs and/or core analysis data
K-Means Clustering Algorithm – How it Works
Overview of How K-Means Clustering Works
The following workflow illustrates the overall process of how the k-means algorithm works. Each step is detailed in the subsequent sections.

K-Means Clustering Step by Step
Let us have a closer look at each of the steps.
Step 1. Gather our data & Determine the Value for ‘k’
The first step is to gather our data together and determine how many clusters we want to split our data into. For this dataset, we are going to split it into 3 clusters.

There are a few ways to determine the optimum number of clusters which are covered after this section.
Step 2. Select k random points within the data
Next we select ‘k’ random points from the data. In this case k = 3, so we will select 3 random points. These are the cluster centroids.

Step 3. Assign Points to Closest Seed Point
The distance between each point in the dataset and the centroids is computed using Euclidean Distance.
Once this has been calculated for every point we then assign each point to the closest centroid. This forms our initial clustering.
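This assignment step can be sketched in a few lines of NumPy. The points and centroid positions below are made-up illustrative values; the key idea is computing a point-to-centroid distance matrix and taking the minimum along each row:

```python
import numpy as np

# Five 2-D data points and three candidate centroids (illustrative values)
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0], [8.0, 9.0]])

# Euclidean distance from every point to every centroid: shape (5, 3)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Assign each point to its closest centroid
assignments = distances.argmin(axis=1)
print(assignments)  # [0 0 1 2 0]
```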

Step 4. Identify New Centre Points
We then calculate the average (mean) point of each cluster. These become the new centroids.
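Continuing the sketch with hypothetical values, the new centroid of a cluster is simply the mean of the points assigned to it:

```python
import numpy as np

# Points currently assigned to one cluster (illustrative values)
cluster_points = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6]])

# The new centroid is the mean position of the cluster's points
new_centroid = cluster_points.mean(axis=0)
print(new_centroid)  # [1.16666667 1.46666667]
```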

Step 5. Assign Points to Closest Centroid

We then repeat the process of assigning the points to the nearest cluster using the Euclidean Distance.

Step 6. Identify the New Cluster Centres
The centroids are then recomputed.

Step 7. Repeat Steps 4–6
The process of assigning the points to the nearest centroids and recomputing the mean point of each cluster is repeated until a stopping condition is met.


When Does Clustering Stop?
We repeat the clustering process until one of the following conditions is met:
- The maximum number of iterations has been reached
- The model has converged, i.e. there is little to no change in the centroid positions or the cluster assignments
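Putting the whole loop together, here is a minimal from-scratch sketch of the algorithm described above, with both stopping conditions. The function name, tolerance, and synthetic data are my own illustrative choices, not a production implementation (for example, it does not guard against a cluster becoming empty):

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """A minimal from-scratch K-means, mirroring the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points from the data as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):  # stopping condition 1: max iterations
        # Steps 3 and 5: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4 and 6: recompute each centroid as the mean of its cluster
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)]
        )
        # Stopping condition 2: centroids have (almost) stopped moving
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you would reach for `sklearn.cluster.KMeans` instead, which adds smarter initialisation and multiple restarts, but the loop above is the core of the method.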
Identifying the Optimum Number of Clusters
Rather than guessing a value for ‘k’, we can use a number of techniques to identify one. The two discussed here are simple, manual ways to find the optimum number for ‘k’.
Elbow Plot
The elbow method is the most commonly used way to determine the optimum number of clusters due to it being simple and easy to visualise.
Essentially we are running the k-means algorithm multiple times with different numbers of k and calculating the Within-Cluster-Sum of Squared Errors (WSS). This property is also known as Inertia.
Once we have the results, we plot the inertia against the number of clusters and identify the point on the graph where the curve starts to "flatten out". In the example below we could pick a value between 5 and 10 as our optimal number for k.
Bear in mind doing this for a large number of clusters can increase computational time.
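A minimal sketch of the elbow method with scikit-learn looks like the following; the synthetic data is illustrative, and in a real workflow you would plot `inertias` (e.g. with matplotlib) rather than print it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (illustrative values)
rng = np.random.default_rng(0)
data = np.vstack(
    [rng.normal(c, 0.4, size=(40, 2)) for c in ([0, 0], [6, 0], [3, 5])]
)

# Fit K-means for a range of k values and record the inertia (WSS)
inertias = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(model.inertia_)

# Inertia always decreases as k grows; the "elbow" is where the drop flattens
print([round(i, 1) for i in inertias])
```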

Silhouette Method
The silhouette method provides a measure of how similar a data point is within its own cluster (cohesion) compared to other clusters (separation).
It provides a value between -1 and +1, with values closer to +1 being more desirable, indicating that the data point sits well within its own cluster and far away from other clusters.
If we have multiple negative values, then we possibly have too many or too few clusters.
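A quick sketch of using the silhouette score to choose k with scikit-learn might look like this (again with illustrative synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative values)
rng = np.random.default_rng(0)
data = np.vstack(
    [rng.normal(c, 0.4, size=(40, 2)) for c in ([0, 0], [6, 0], [3, 5])]
)

# Compute the mean silhouette score for a range of candidate k values
scores = {}
for k in range(2, 7):  # silhouette requires at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

# Pick the k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

Note that `silhouette_score` returns the mean score over all points; inspecting the per-point values (via `silhouette_samples`) is what reveals the negative values mentioned above.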

More details on both of these methods can be found here:
K-Mean | K Means Clustering | Methods To Find The Best Value Of K
Improving the K-Means Result
If we repeat the K-means clustering process, then depending on the parameters we are likely to get different results. This is because different points may be selected as the initial centroids. Additionally, once the centroids have been initialised it is difficult for them to move large distances or towards clusters that are already relatively stable.
One way to improve the results is by repeating the k-means process and attempting to find the lowest sum of variance within all of the clusters.
There are also a number of initialisation techniques available for k-means clustering, including:
- Selecting random points (the one used in the example)
- k-means ++
- Naive Sharding
- Furthest Point Heuristic
- Sorting Heuristic
- Projection Based
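In scikit-learn, the first two of these are available through the `init` parameter, and the repeated-runs strategy through `n_init`. A short sketch with illustrative data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three groups (illustrative values)
rng = np.random.default_rng(0)
data = np.vstack(
    [rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])]
)

# "k-means++" spreads the initial centroids apart; "random" picks random points.
# n_init repeats the whole process and keeps the run with the lowest inertia.
plus = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(data)
rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(data)

print(plus.inertia_, rand.inertia_)
```

With enough restarts both initialisations usually reach a similar result on easy data like this; the advantage of k-means++ shows up as fewer restarts needed on harder datasets.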
Advantages and Disadvantages of K-Means Clustering
Advantages:
- A fast, effective and efficient algorithm
- Easy to understand
- Easy to implement within Python
- Can be scaled to large datasets
- Guarantees convergence
Disadvantages:
- Need to specify the number for k before running the algorithm
- Dependent on how the centroids are initialised. Once initialised, centroids struggle to move large distances, particularly when other clusters are already relatively stable
- Sensitive to outliers and noise – Outliers can influence the initialisation of the centroids
- Speed may become an issue with large datasets
- Can have issues scaling as the number of dimensions increases
- Clusters are assumed to be spherical with each cluster having a similar number of data points
Python Implementation
I previously put together an article which goes through the steps required along with code samples. This example uses well-log data measurements to group the data into different clusters, which can be interpreted as different lithologies.
You can find it at the link below.
How to use Unsupervised Learning to Cluster Well Log Data using Python
Summary
In summary, the K-means algorithm is a very popular unsupervised machine learning technique that is easy to understand and implement. It is an effective solution for grouping data points together based on similarities and should be considered as an option during the exploration phase of your data analysis.
Thanks for reading. Before you go, you should definitely subscribe to my content and get my articles in your inbox. You can do that here! Alternatively, you can sign up to my newsletter to get additional content straight into your inbox for free.
Secondly, you can get the full Medium experience and support me and thousands of other writers by signing up for a membership. It only costs you $5 a month, and you have full access to all of the amazing Medium articles, as well as have the chance to make money with your writing. If you sign up using my link, you will support me directly with a portion of your fee, and it won’t cost you more. If you do so, thank you so much for your support!