
K-Means Explained

Explaining and Implementing kMeans Algorithm in Python

Image by Kelly Sikkema from Unsplash

This article outlines a conceptual understanding of the k-means algorithm and its associated Python implementation using the sklearn library. K-means is a clustering algorithm with many real-world use cases. It generates K clusters from a dataset and is applied across industries for pattern detection, medical diagnostics, stock analysis, community detection, market segmentation, image segmentation, and more. It is often used to gain intuition about the dataset you’re working with by grouping similar data points together into clusters: data points in the same group are close and similar to one another, while data points in different groups are dissimilar.

Table of Contents

  • What is k-Means Clustering
  • Algorithm
  • Choosing K
  • Elbow Method
  • Advantages
  • Disadvantages
  • Implementation
  • Summary
  • Resources

What is k-Means Clustering

K-means is an unsupervised learning algorithm, meaning that it learns patterns from untagged data. In practice, this implies that you can train a model to create clusters on any given dataset without first having to label the data.

The intuition behind the algorithm is to divide data points into K predefined clusters, where each data point belongs to exactly one cluster. Each cluster consists of data points which share similarities with one another, implying that data points in different clusters are dissimilar. Many other clustering algorithms, like the EM algorithm for Gaussian mixtures and k-medians, share the same fundamental insight as k-means: that data points inside a cluster should be close to the centre of that cluster [2].

The image on the left is a cluster of data, the image on the right is the same data with a centroid visualized (Image provided by Author)

Algorithm

  1. Choose your value of K
  2. Randomly select K data points to represent the cluster centroids
  3. Assign every other data point to its nearest cluster centroid
  4. Reposition each cluster centroid to the average of the points assigned to it
  5. Repeat steps 3 & 4 until the cluster assignments no longer change
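The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration of the algorithm, not the sklearn implementation used later in the article; the `kmeans` function and its parameters are hypothetical names chosen for this sketch.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: cluster the rows of X into k groups."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids (and hence the assignments) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, the function recovers the grouping regardless of which points were drawn as initial centroids.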

Choosing K

Since the algorithm requires the user to specify the number of clusters K rather than learning it from the data, choosing K is one of the most difficult aspects of using this algorithm. It is also hard to say whether any given value of K is incorrect. This value is often determined through extensive domain knowledge and experience with the data. When that isn’t available, the elbow method is commonly used in industry to identify a suitable value of K.

Elbow Method

The elbow method uses the sum of squared distances (SSE) between the data points and their assigned cluster centroids to choose a value of k. We pick the value of k at which the SSE begins to flatten out, forming an inflection point. When visualized, the graph looks somewhat like an elbow, hence the name of the method.

Figure 2 : Visual representation of the elbow method based on the data from Figure 1. Elbow point is at 4 (Image provided by author)

The graph above shows that k = 4 is probably a good choice for the number of clusters. In some situations the graph does not look like an elbow, which makes choosing a value of k much more difficult. The code to generate the graph above, along with the modelling component of this algorithm, is provided in the Implementation section of this article.

Advantages

  • Scales to large data
  • Always converges
  • Often (not always) faster than other common clustering methods like hierarchical clustering

Disadvantages

  • Clustering imbalanced data
  • Manual choice of K
  • Dependent on initial assumptions
  • Scaling with high dimensions

Implementation
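The code embedded in the original article is not available here, so below is a sketch of what the implementation likely looks like: it generates synthetic data with `make_blobs` (an assumption standing in for the dataset from Figure 1), plots the elbow curve from Figure 2 using sklearn’s `inertia_` attribute (the SSE), and then fits the final model at the elbow point k = 4.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters, standing in for the article's dataset
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Elbow method: fit k-means for a range of k and record the SSE (inertia_)
sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = km.inertia_

plt.plot(list(sse.keys()), list(sse.values()), marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE")
plt.title("Elbow Method")
plt.show()

# Fit the final model at the elbow point and visualize the clusters
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=model.labels_, s=10)
plt.scatter(*model.cluster_centers_.T, c="red", marker="x", s=100)
plt.show()
```

The SSE drops sharply up to k = 4 and flattens afterwards, which is the elbow behaviour described above.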

Summary

In summary, k-means is an unsupervised learning algorithm used to divide input data into a predefined number of clusters. Each cluster holds the data points most similar to one another, while points in different clusters are dissimilar. The trickiest part of this algorithm is choosing an ideal value of K, which can be done through domain intuition or by using the elbow method. The elbow method uses the SSE between data points and their assigned cluster centroids to suggest a value of K.

Resources


Here are some other articles written by me which you might like :

Markov Chain Explained

Support Vector Machine (SVM) Explained

Random Forest Explained

