K-Means Clustering — Introduction to Machine Learning Algorithms

Simplest clustering algorithm — Code & Explanation

Rohith Gandhi
Towards Data Science

--

Introduction — Unsupervised Learning

In machine learning, we are not always provided an objective to optimize, we are not always provided a target label to classify the input data points into. The kinds of problems where we are not provided with an objective or label to classify is termed as an unsupervised learning problem in the domain of AI. In an unsupervised learning problem, we try to model the latent structured information present within the data. Clustering is a type of unsupervised learning problem where we try to group similar data based on their underlying structure into cohorts/clusters. K-means algorithm is a famous clustering algorithm that is ubiquitously used. K represents the number of clusters we are going to classify our data points into.

K-Means Pseudocode

## K-Means Clustering 1. Choose the number of clusters(K) and obtain the data points 
2. Place the centroids c_1, c_2, ..... c_k randomly
3. Repeat steps 4 and 5 until convergence or until the end of a fixed number of iterations
4. for each data point x_i:
- find the nearest centroid(c_1, c_2 .. c_k)
- assign the point to that cluster
5. for each cluster j = 1..k
- new centroid = mean of all points assigned to that cluster
6. End

The simulations below would provide a better understanding of the K-means algorithm.

K — Means Algorithm

How to Choose K??

There would be some instances where we would not know the number of clusters. Then, how can we choose the value for K??? There is a way called the elbow method. In this method, you choose a different number of clusters and start plotting the within-cluster distance to the centroid. The graph looks as below.

Elbow method

From the above graph we can infer that at k=4, the graph reaches an optimum minimum value. Even though the within-cluster distance decreases after 4, we would be doing more computations. Which is just analogous to the law of diminishing returns. Therefore, we choose a value of 4 as the optimum number of clusters. The reason it is named the elbow method is that the optimum number of clusters would represent an elbow joint!

Applications of K-Means Clustering Algorithm

  • Behavioural Segmentation
  • Anomaly Detection
  • Social Network Analysis
  • Market Segmentation

There are just a few examples where clustering algorithm like K-means is applied.

Let’s Write Some Code

We will be using the Iris dataset to build our algorithm. Even though the Iris dataset has labels, we will be dropping them and use only the feature points to cluster the data. We know that there are 3 clusters(‘Iris-virginica’, ‘Iris-setosa’, ‘Iris-versicolor’). Therefore, k=3 in our case.

We load the dataset and drop the target values. We convert the feature points into a numpy array and split it into training and testing data.

We implement the pseudocode shown above and we can find that our algorithm converges after 6 iterations. We can now input a test data point and find the centroid it is closest to and assign that point to the respective cluster.

The Scikit-learn library once again saves us from writing so many lines of code by providing an abstract level object which we can just use to implement the algorithm.

Conclusion

K-means is an introductory algorithm to clustering techniques and it is the simplest of them. As you would’ve noticed, there is no objective/loss function. Hence, no partial derivates is required and that complicated math is eliminated. K-means is an easy to implement and handy algorithm.

Reference

--

--