
Clustering using k-Means with implementation

Objects around us fall into natural groups

Image Source: https://unsplash.com/@nauleyco

Clustering is a technique to find natural groups in the data.

Figure 1: Groups of Animals (Source: Author)

If we show the above picture to a kid, he can identify that there are four types of animals. He may not know the names of all of them, but he can still tell that there are four different types, and he can do this on his own, without the help of an adult. Since we don’t need an adult to supervise, clustering is an unsupervised technique.

What are the major motivations of clustering?

The three motivations can be listed as follows.

Underlying Structure: We can see that there are distinct groups; within a group the variation is small, while between groups the variation is large.

Natural Classification: We know that there are four different types or groups of animals.

Summarization: If we now think of some strategy, say for shelter or food, we don’t need to think about every individual animal, because a sheep’s food requirement is very different from that of a panda, giraffe, or rhino, but not so different from that of another sheep. This becomes very important when a company is trying to come up with a strategy for a specific customer group.

k-means is arguably the most popular clustering algorithm; it divides the objects into k groups. It has numerous applications, as we often want to find structure in data: we want to group students, customers, patients, products, movies, tweets, images, and whatnot.

Let’s now see how it works.

How k-means works

Input: the data and the value of ‘k’.

Data: Input to k-Means (Image: Author)

Step 1: Initialize ‘k’ random points from the data as the cluster centers. Let’s assume the value of k is 2 and that the 1st and the 4th observations are chosen as the centers.

Randomly Selected K (2) Points (Source: Author)

Step 2: For every point, find the distance from each of the k cluster centers. Euclidean distance can be used. The following table is used as an illustration; the C1 and C2 columns indicate the distance from centers 1 and 2 respectively.

Distance Calculation from Centroids (Source: Author)

If we want to determine the distance of the 2nd observation from centroid 1, the calculation looks like the following:

Euclidean Distance Example (Source: Author)
  • Notice that the distance of the first point from center 1 is 0 and the distance of the fourth point from center 2 is 0. The reason is simple: these points were selected as the centroids.
  • Also, for the third observation, the distance from both centers is the same.
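
As a quick code illustration of this calculation, here is a minimal sketch in NumPy. The coordinates below are made-up values, not the ones from the table above.

# A minimal sketch of the Euclidean distance calculation (made-up coordinates)
import numpy as np

point = np.array([3.0, 4.0])       # a hypothetical observation
centroid = np.array([1.0, 2.0])    # a hypothetical cluster center

# Square root of the sum of squared differences
distance = np.sqrt(np.sum((point - centroid) ** 2))
print(distance)    # 2.828...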

Step 3: Every point is assigned to the cluster center it is nearest to, and then each cluster center is recomputed as the arithmetic mean of the points assigned to it.

Points Assigned to Nearest Center (Source: Author)

The first two points are assigned to cluster 1, the third point (being equidistant from both centers) is assigned randomly, and the last four points are assigned to cluster 2.

Cluster Centers (Source: Author)

Steps 2 and 3 are repeated until convergence:

  • No change in the cluster centers from iteration to iteration
  • Very small change in the cluster assignments

Essentially, we are iteratively trying to find the k best representative points, and these are computed by taking means. Hence the name k-means.
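
To make the assign-and-update loop concrete, here is a minimal from-scratch sketch in NumPy. It assumes the data is a 2-D array X and is only meant to illustrate the idea, not to replace the scikit-learn implementation used later.

# A bare-bones sketch of the k-means loop (illustrative only; empty clusters are not handled)
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance of every point from every center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: assign each point to its nearest center
        labels = distances.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers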

What are some of the issues in k-means and how to get around them?

Some of the issues with k-means are:

  • As the first step is random initialization, it has a bearing on the quality of the clusters (some of this can be solved by k-means++, which selects the initial points incrementally, making sure that the next point is picked only when it is far away from the already selected points)
  • It tends to find clusters of circular (spherical) shape (the distance measure can be played with to a certain extent)
  • It is affected by outliers (outliers can be removed before clustering is applied; a small sketch follows this list)
  • The value of k needs to be given by the user (an approach to find it is discussed below)
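
As a small sketch of the outlier workaround, one option is to drop rows that lie more than 3 standard deviations from the mean on any attribute before clustering. It uses the standardized array df_scaled created in the “How to apply k-means” section below; the threshold of 3 is an illustrative choice.

# A minimal sketch of dropping gross outliers before clustering
# (df_scaled is the standardized data produced in the section below)
import numpy as np
from sklearn.cluster import KMeans

# Keep rows that are within 3 standard deviations on every attribute
mask = (np.abs(df_scaled) < 3).all(axis=1)
df_clean = df_scaled[mask]

labels = KMeans(n_clusters=3, random_state=0).fit_predict(df_clean)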

How to find the optimal value of k

WCSS: Within-Cluster Sum of Squares (Source: Author)

A good cluster is one where the points are close to the center. How can we measure this closeness? We can compute the Euclidean distance of all the points in a cluster from the cluster center, and then extend this to all clusters. This idea is represented in the equation above and is called the within-cluster sum of squares (WCSS). Ideally, we would want this to be small. In the extreme case, we can have one cluster for each point, resulting in a WCSS of 0. It also turns out that as we increase k, WCSS keeps decreasing; technically, it decreases monotonically.

Instead, we want to stop where the curve shows the steepest fall and forms a kind of elbow when WCSS is plotted on the y-axis against the number of clusters on the x-axis.
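
To make the definition concrete, WCSS can also be computed by hand for a fitted model and compared against scikit-learn’s inertia_ attribute. The sketch below assumes the kmeans model and the standardized array df_scaled from the next section.

# A small sketch: computing WCSS by hand and checking it against inertia_
# (assumes kmeans has been fit on the standardized data df_scaled, as in the next section)
import numpy as np

# Center of the cluster each point belongs to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = np.sum((df_scaled - assigned_centers) ** 2)

# scikit-learn exposes the same quantity as the inertia_ attribute
print(wcss_manual, kmeans.inertia_)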

How to apply k-means

Standardization: All distance-based algorithms need the data to be standardized so that no attribute gets undue importance just because its values span a larger range. The code snippet is given below.

# Importing the Standard Scaler
from sklearn.preprocessing import StandardScaler
# Initializing the Standard Scaler
scaler = StandardScaler().fit(df)
# Standardizing the data
df_scaled = scaler.transform(df)

Clustering: The code snippet is given below

# Importing the library
from sklearn.cluster import KMeans
# Initialization
kmeans = KMeans(n_clusters=3, init='random', max_iter=300, n_init=10, random_state=0)
# Applying clustering
y_kmeans = kmeans.fit_predict(df_scaled)

Some important Parameters:

  • n_clusters: the number of clusters, i.e. k
  • init: 'random' or 'k-means++' (we have already discussed how k-means++ gives a better initialization)
  • n_init: the algorithm is run with this many different random choices of the initial k points, and the best result is returned
  • max_iter: the maximum number of iterations the algorithm will run for
  • tol: this stands for tolerance and is also used as a termination criterion. The idea is interesting: if the change in the cluster centers from one iteration to the next is very small, we are close to convergence. The centers form a matrix, and the closeness of two matrices can be measured by the element-wise Euclidean distance, which is called the Frobenius norm. An illustrative initialization using these parameters is shown after this list.
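
Putting these parameters together, a typical initialization might look like the following; the values are only illustrative.

# Illustrative initialization using k-means++ and an explicit tolerance
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,       # k, the number of clusters
    init='k-means++',   # smarter initialization than 'random'
    n_init=10,          # number of different initializations tried; the best run is kept
    max_iter=300,       # upper bound on the number of iterations per run
    tol=1e-4,           # stop when the centers move less than this (Frobenius norm)
    random_state=0
)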

Determining the value of ‘k’

The code snippet is given below.

# Applying k-means for different values of k and storing the WCSS
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='random', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)

For plotting WCSS against the number of clusters:

# Importing the plotting library
import matplotlib.pyplot as plt

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')      # within-cluster sum of squares
plt.show()
Figure 2: Number of Clusters via Elbow Plot (Source: Author)
  • You can check out our full implementation here.
  • You can also look at a video tutorial that talks more about clustering here.

Conclusion:

k-means is here to stay because of its simplicity and ease of implementation. It easily makes it into the top 10 ML algorithms in any survey. It has shortcomings, but there are also ways around them to make k-means more robust.


