Contents
In this post, we’ll be going through:
- Unsupervised Learning
- Working example of K-Means
- Cost Function for K-Means
- Initialization methods for clusters
- Elbow Method (hit-and-trial method)
- Kaggle’s Credit Card Dataset to map user spending activity
So far in this series of posts on Machine Learning, we have looked at the most popular supervised algorithms. In the previous post, we discussed Decision Trees and Random Forests in great detail. This post and the next few will focus on Unsupervised Learning Algorithms: the intuition and mathematics behind them, with a solved Kaggle dataset at the end.
Unsupervised Learning
Unsupervised Learning covers learning tasks performed without supervision. Unlike supervised machine learning algorithms, there are no labels in the training data to supervise the model’s performance. Like supervised learning, though, unsupervised learning is used for both discrete and continuous data. Formally, unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses.

Unsupervised models are often divided into two families: discriminative models and generative models. A discriminative model only tells you, given an input X, what the corresponding outcome Y should be, whereas a generative model captures the joint probability of observing X and Y together.
So the difference is as follows: the discriminative model learns the mapping from inputs to labels, i.e. the conditional distribution P(Y|X), while the generative model learns the full joint distribution P(X, Y). Because a generative model describes how the data itself is distributed, once it is fit you can use it to score how likely any input is, or even to generate new samples, something a purely discriminative model cannot do.
Two common use cases of unsupervised learning are:
(i) Cluster Analysis a.k.a. Exploratory Analysis
(ii) Principal Component Analysis
Cluster analysis or clustering is the task of grouping data points in such a way that data points in a cluster are alike and are different from data points in the other clusters. Clustering has various applications such as in market segmentation, pattern recognition, word vectors, detecting cyber-attacks and many such areas.
Principal Component Analysis is used to reduce the dimensionality (the number of features) of a dataset. The trade-off is that the resulting components are combinations of the original features and are therefore harder for humans to interpret. Through Principal Component Analysis, the training time of a model can be reduced to a great extent with only a slight decline in accuracy.
Some common examples of unsupervised learning are K-means, Principal Component Analysis (PCA), Autoencoders, etc. In this post, we’ll have a detailed look at K-means algorithm.

K-Means
Let’s first get the hang of the term ‘K-Means’. In statistics, ‘mean’ refers to the average of a given set of data points, so the algorithm deals with ‘K’ averages. But what do ‘K’ averages signify? We usually have just a single average representing the entire data. Having ‘K’ averages means that we have performed the ‘mean’ operation on ‘K’ segments of the data, each of which can be treated as a separate, independent unit. So having ‘K’ averages is possible only when the data is divided into ‘K’ parts/groups/clusters. The term K-means in itself is therefore enough to say that it’s a clustering algorithm, which is why adding the word ‘clustering’ to K-means is somewhat redundant. Later we’ll see that this is exactly how the mathematics behind K-means works.
Formally, K-Means is an unsupervised learning algorithm that takes an unlabelled dataset with ‘m’ records and groups it into ‘K’ subsets/clusters, where each cluster has records having similar attributes and K < m. The algorithm however is not intelligent enough to determine the number of clusters in the data automatically and hence requires a predefined number of clusters (K) to divide the data into ‘K’ coherent groups. That being said, K-means is still one of the most widely used clustering algorithms due to its simplicity and fast computation time.
K-means algorithm works in 2 steps.
(i) Cluster assignment
(ii) Move centroid
Both of these steps are repeated until the algorithm converges.
K-Means Working Example
To better understand the 2 steps of K-means, let’s look at how K-means works through an example and the optimization objective (cost function) involved. In order to visualize things, we’ll assume that the data we’re using just has 2 features i.e. 2-dimensional data. Let us divide the data into 2 clusters, so K = 2.

Given the scatter plot of the raw data in fig. (a), two random points (cluster centroids) are first placed on the graph (fig. b). Once these points have been chosen, the proximity of every data point to each cluster centroid is calculated. There are many ways to measure proximity, and one of the easiest is the Euclidean distance. Once all the Euclidean distances are calculated, each data point is assigned to the cluster centroid nearest to it. In fig. c, data points are assigned to the red/blue cluster centroids based on this proximity. This is the cluster assignment step, where every data point is assigned to a cluster. These assignments are not yet optimal, since the initial centroids were chosen randomly. To improve them, the mean of all the data points belonging to a cluster is computed and the cluster centroid is moved to this new position, as shown in fig. d. This is the move centroid step. So far we have performed one iteration of the cluster assignment and move centroid steps and still don’t have a good division of the data into clusters; multiple iterations of these 2 steps are needed for K-means to converge. Fig. e performs the ‘cluster assignment step’ for the new centroids obtained in fig. d, and fig. f then performs the ‘move centroid step’ to give new, better cluster centroids. At this point, after 2 iterations, we seem to have reached a good clustering of the data. In practice it usually takes many more iterations to reach an optimal cluster grouping.
The entire process of K-means clustering described above can be summarized through the following pseudo-code.
Initialize cluster centroids µ(1), µ(2), µ(3), ..., µ(K) ∈ Rn randomly, where Rn is the space of n-dimensional feature vectors.
Repeat until convergence:
{
    for i = 1 to m:
        c(i) := index of the cluster centroid closest to training example x(i)
    for k = 1 to K:
        µ(k) := average (mean) of the points assigned to cluster k
}
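To make these two steps concrete, here is a minimal NumPy sketch of the loop above (the function name kmeans and the Forgy-style start are my own choices for illustration, not part of the original pseudo-code):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (m, n) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # initialize centroids Forgy-style: pick K training examples at random
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # cluster assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move centroid step: mean of the points assigned to each cluster
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids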
Cost Function
Now that we know how the steps of K-means lead to clustering, let us derive the optimization objective (cost function) from the intuition behind those steps. In one line: K-means assigns each data point to its nearest cluster centroid, then shifts every centroid to the centre of the points assigned to it, and repeats both steps until the algorithm converges. Hence, the cost function should minimize the overall distance of every cluster centroid from its corresponding data points, subject to the conditions imposed by the ‘cluster assignment step’ and the ‘move centroid step’.
Before formally defining the cost function for K-means, let’s have a look at the notation used to define the cost function.
· c(i) = cluster number (between 1 and K) to which training example x(i) is currently assigned
· µ(k) = representation of a cluster centroid (µ(k) ∈ Rn)
· µ(c(i)) = representation of cluster centroid of the cluster to which training example x(i) has been assigned.
The parameters of the cost function are c(i) and µ(k) (and hence µ(c(i))): by varying these parameters we get different clusterings, and the goal is to find the most coherent clustering, which is obtained by minimizing the cost function. The minimization objective (also called the distortion) for K-means is:

J(c(1), ..., c(m), µ(1), ..., µ(K)) = (1/m) · Σ (i = 1 to m) ‖ x(i) − µ(c(i)) ‖²

The values of c(i) and µ(k) for which J is smallest are the optimal values of these parameters. During the ‘cluster assignment step’, the cost function is minimized with respect to the parameters c(i) while keeping µ(k) fixed (each x(i) is simply assigned to its closest centroid, which is the best possible choice of c(i)), and during the ‘move centroid’ step it is minimized with respect to the parameters µ(k) while keeping c(i) fixed (setting the derivative of J with respect to µ(k) to zero gives exactly the mean of the points assigned to cluster k).
K-means therefore needs no gradient descent: each of the two steps can only decrease (or leave unchanged) the value of J, so alternating them is guaranteed to converge, although possibly to a local minimum of the cost rather than the global one.
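As a quick sanity check, this distortion can be computed directly from a set of assignments and centroids. A minimal sketch, assuming X, labels and centroids shaped as in the NumPy example above (note that scikit-learn exposes the same quantity, without the 1/m factor, as the inertia_ attribute of a fitted KMeans object):

import numpy as np

def distortion(X, labels, centroids):
    # J = (1/m) * sum over i of ||x(i) - µ(c(i))||^2
    diffs = X - centroids[labels]            # per-point difference from its assigned centroid
    return (diffs ** 2).sum(axis=1).mean()   # mean squared Euclidean distance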
Initializing Clusters
In the pseudo-code for K-means, we saw how the cluster assignment step and the move centroid step work in tandem to generate K clusters, but we still haven’t discussed the first line of the pseudo-code: how to initialize the cluster centroids randomly. Just as we randomly initialized the parameters ‘W’ and ‘b’ in linear and logistic regression, there are different ways of randomly initializing the centroids µ(k) here. Let’s discuss them one by one.
(i) Forgy Initialization: Forgy initialization is a simple and fast way of initializing the centroids. For K clusters, the Forgy method randomly chooses K observations from the training dataset and uses them as the initial means (cluster centroids). It is quite an intuitive technique, since the initial centroids are guaranteed to lie on actual training data points, and in practice it tends to spread the initial means out across the data.
(ii) Random Partition: In this method, each data point is first randomly assigned to one of the K clusters; the mean of the points that land in each random group is then taken as that cluster’s initial centroid. Random Partition is generally preferred for fuzzy k-means, while for standard k-means, Forgy initialization works well.
(iii) K-means++: The intuition behind this approach is that spreading out the K initial cluster centroids is a good thing. The first cluster centroid is chosen uniformly at random from the data points being clustered; each subsequent centroid is then chosen from the remaining data points with probability proportional to the point’s squared distance from its closest existing cluster centre. This method generally outperforms both Forgy initialization and Random Partition and is the preferred choice. More about K-means++ can be found here.
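For completeness, here is a rough sketch of the k-means++ seeding step (in practice, scikit-learn’s KMeans uses init='k-means++' by default, so you rarely need to write this yourself; the function name below is my own):

import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: chosen uniformly at random
    for _ in range(1, K):
        # squared distance of every point to its closest centroid chosen so far
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()  # probability proportional to the squared distance
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)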
The Problem of Local Optima
Although the random initializations mentioned above work well for K-means, they can still cause the algorithm to converge to a bad local optimum of the cost function (some local optima are close to the global optimum and some are not), resulting in poor cluster formation. Any algorithm with a non-convex cost function and a random initialization step runs this risk; linear and logistic regression are spared because their cost functions are convex and have a single optimum. Look at the image below to see an example of clusters formed by K-means when the algorithm gets stuck in a bad local optimum.

These bad local optima are detrimental to the model’s performance, so we need a way to avoid them. The solution is simple: for values of K between roughly 2 and 10, we can run K-means 10 to 1000 times, each time with a different random initialization, and pick the run whose parameters (c(i) and µ(k)) give the smallest value of the cost function. For large values of K, however, repeated restarts make much less of a difference: with many centroids, even a single random initialization usually lands on a reasonably good solution, so one run of K-means is typically satisfactory.
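A sketch of this restart strategy, reusing the kmeans and distortion helpers sketched earlier (the 100 restarts and K = 3 below are arbitrary choices; with scikit-learn, the n_init parameter of KMeans does the same thing for you):

import numpy as np

best_J, best_run = np.inf, None
for seed in range(100):                          # 100 different random initializations
    labels, centroids = kmeans(X, K=3, seed=seed)
    J = distortion(X, labels, centroids)
    if J < best_J:                               # keep the run with the lowest cost
        best_J, best_run = J, (labels, centroids)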
Choosing K
Since the data is unlabelled, we have no direct way to evaluate the K-means model’s performance, which makes the choice of K challenging. The most sensible thing to do in such cases is a hit-and-trial approach: choose a range of values of K (say 1 to 10) and compute the overall cost by running K-means with every value of K in that range. The value of K up to which the cost function drops steeply, and after which it decreases only slowly, can be taken as a good value of K. This is called the Elbow Method, though I simply like calling it the hit-and-trial method. Choosing the elbow point is not compulsory, however. The cost function keeps decreasing as K grows and reaches 0 when K = m (the total number of training data points), i.e. when every data point sits in its own cluster. We don’t want K = m, but we can go beyond the elbow point depending on the computation power, the amount of data and the complexity of the data at hand.

Additional Notes
So far we’ve discussed K-means in great detail, and I’d like to draw your attention back to a point made when we started. I mentioned that one of the limitations of K-means is that it doesn’t figure out the number of clusters automatically and needs an explicit value for K. This can, however, also be turned into an advantage. Consider a dataset where we need clusters for a task such as market segmentation, but the data points are scattered fairly uniformly. An algorithm that tries to detect clusters automatically has a higher chance of failing in such a scenario, whereas by supplying an explicit number of clusters to K-means we can still expect the algorithm to come up with good groupings, ones that even humans might not spot. This is the beauty of K-means.
Another important thing to note about K-means is that it separates clusters with linear boundaries. This may come as a surprise, but there is nothing to worry about: the clusters themselves (the squiggly shapes drawn by hand around a set of data points) are still very much the same as we’ve portrayed them so far; it is only the boundary between neighbouring clusters that is linear. The image below will make things clearer.
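You can also see why the boundary is linear directly from the assignment rule: a point x is assigned to µ(1) rather than µ(2) exactly when ‖x − µ(1)‖² < ‖x − µ(2)‖², which simplifies to 2(µ(2) − µ(1))·x < ‖µ(2)‖² − ‖µ(1)‖², a linear inequality in x (the boundary is the perpendicular bisector between the two centroids). A tiny numerical check with made-up centroids:

import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 2.0])
x = np.array([1.5, 3.0])

closer_to_mu1 = ((x - mu1) ** 2).sum() < ((x - mu2) ** 2).sum()
# equivalent linear test: w·x < b with w = 2(mu2 - mu1), b = ||mu2||^2 - ||mu1||^2
linear_test = (2 * (mu2 - mu1)) @ x < (mu2 ** 2).sum() - (mu1 ** 2).sum()
assert bool(closer_to_mu1) == bool(linear_test)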

With this, we have come to an end on our discussion of K-means. Now, let’s use an unlabelled Credit Card Dataset from Kaggle and use it to determine the usage pattern of credit cards for different credit card holders.
Problem Statement
We’ll be using the Credit Card Dataset on Kaggle to map user spending activity. The dataset can be found here. As is the norm, we need to pre-process the data before feeding it to the K-means algorithm. First of all, let’s read the data into a pandas dataframe.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('/kaggle/input/ccdata/CC GENERAL.csv')
print(df.shape)
df.head()

df.describe()

There are 18 features in the dataset, which makes the summary difficult to fit in a single image. But from the output of df.describe() we can see that, for all the columns/features, the standard deviation is quite high, the min and max values are far apart, and the distribution is skewed towards lower values (compare the 75% mark, the mean and the max). This means the dataset contains outliers, and they need to be dealt with: simply dropping the outliers would result in quite a lot of data loss.
Before we deal with outliers, let’s check for missing values in the data and impute them.
df.isna().sum()

Only 2 columns have null values. The missing values are a small fraction of the entire dataset (1/8950 and 313/8950) and hence can easily be imputed. We’ll impute CREDIT_LIMIT with its mean value, and since MINIMUM_PAYMENTS is a continuous variable skewed towards the lower side, we can impute it with either the mean or the median; it shouldn’t make much of a difference since the fraction of missing values is so small. We’ll go with the mean.
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(df['MINIMUM_PAYMENTS'].mean())
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(df['CREDIT_LIMIT'].mean())
Now that we’ve taken care of the missing values, let’s turn our attention back to the outliers. Instead of dropping them, we’ll convert the dataset’s values into ordinal categories: since we are interested in finding similarities through clusters, it is a good idea to group values falling in a particular range and assign them a category. Later we’ll also scale these category values so that no large value in any column dominates or skews the clustering result.
columns = ['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT','PAYMENTS', 'MINIMUM_PAYMENTS'] # All features with outlandish values
for c in columns:
    Range = c + '_RANGE'
    df[Range] = 0
    df.loc[(df[c] > 0) & (df[c] <= 500), Range] = 1
    df.loc[(df[c] > 500) & (df[c] <= 1000), Range] = 2
    df.loc[(df[c] > 1000) & (df[c] <= 3000), Range] = 3
    df.loc[(df[c] > 3000) & (df[c] <= 5000), Range] = 4
    df.loc[(df[c] > 5000) & (df[c] <= 10000), Range] = 5
    df.loc[df[c] > 10000, Range] = 6
columns=['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'PRC_FULL_PAYMENT']
for c in columns:
    Range = c + '_RANGE'
    df[Range] = 0
    df.loc[(df[c] > 0) & (df[c] <= 0.1), Range] = 1
    df.loc[(df[c] > 0.1) & (df[c] <= 0.2), Range] = 2
    df.loc[(df[c] > 0.2) & (df[c] <= 0.3), Range] = 3
    df.loc[(df[c] > 0.3) & (df[c] <= 0.4), Range] = 4
    df.loc[(df[c] > 0.4) & (df[c] <= 0.5), Range] = 5
    df.loc[(df[c] > 0.5) & (df[c] <= 0.6), Range] = 6
    df.loc[(df[c] > 0.6) & (df[c] <= 0.7), Range] = 7
    df.loc[(df[c] > 0.7) & (df[c] <= 0.8), Range] = 8
    df.loc[(df[c] > 0.8) & (df[c] <= 0.9), Range] = 9
    df.loc[(df[c] > 0.9) & (df[c] <= 1.0), Range] = 10
columns=['PURCHASES_TRX', 'CASH_ADVANCE_TRX']
for c in columns:
    Range = c + '_RANGE'
    df[Range] = 0
    df.loc[(df[c] > 0) & (df[c] <= 5), Range] = 1
    df.loc[(df[c] > 5) & (df[c] <= 10), Range] = 2
    df.loc[(df[c] > 10) & (df[c] <= 15), Range] = 3
    df.loc[(df[c] > 15) & (df[c] <= 20), Range] = 4
    df.loc[(df[c] > 20) & (df[c] <= 30), Range] = 5
    df.loc[(df[c] > 30) & (df[c] <= 50), Range] = 6
    df.loc[(df[c] > 50) & (df[c] <= 100), Range] = 7
    df.loc[df[c] > 100, Range] = 8
Since we have created binned _RANGE versions of all the existing features, we can now drop the original columns (along with CUST_ID, which is just an identifier).
df.drop(['CUST_ID', 'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT' ], axis=1, inplace=True)
X = np.asarray(df)
df.head()

We can plot the frequency distribution for all the features; you can check them out in the notebook here. The graphs show that the frequencies of the lower values are high, since most values in the data are small. This is consistent with the minimum, 1st quartile, median, 3rd quartile and maximum values we saw in the output of df.describe(). Settling on these bin boundaries took a few trials but didn’t consume much time. Now we standardize all the features so that they are on a comparable scale.
scaler = StandardScaler()
X = scaler.fit_transform(X)
Now that we have converted the data from continuous values to discrete ranges and standardized it, we’ve made sure that all the features get equal importance. We can now go ahead and apply K-means. Let’s use the Elbow Method to choose a suitable value of K.
clusters = 25
cost = []
for i in range(1, clusters):
    kmeans = KMeans(i)
    kmeans.fit(X)
    cost.append(kmeans.inertia_)
plt.plot(range(1, clusters), cost, 'ro-')

We seem to reach the elbow point at K = 6; beyond this value of K, the cost decreases much more slowly.
kmeans = KMeans(6)
kmeans.fit(X)
labels = kmeans.labels_
The output of this step is the labels array, which contains the cluster number for each record/row of the dataset. Let’s append it to the dataframe as a ‘cluster’ column.
clusters = pd.concat([df, pd.DataFrame({'cluster':labels})], axis=1)
In order to visualize the clusters created and see whether they are well defined, we need to reduce the dimensionality of the data, since it’s difficult to visualize n-dimensional data in a 2-dimensional plot. While reducing the dimensionality, however, we want to retain as much of the original data’s variance as possible. For this we use Principal Component Analysis (PCA), which achieves exactly that objective. Follow the next post for an in-depth understanding of PCA.
pca = PCA(2)
principalComponents = pca.fit_transform(X)
x, y = principalComponents[:, 0], principalComponents[:, 1]
print(principalComponents.shape)
colors = {0: 'red', 1: 'blue', 2: 'green', 3: 'yellow', 4: 'orange', 5:'purple'}

final_df = pd.DataFrame({'x': x, 'y':y, 'label':labels})
groups = final_df.groupby(labels)
Finally, we plot all the clusters as various subplots inside a single plot.
fig, ax = plt.subplots(figsize=(15, 10))
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=5, color=colors[name], mec='none')
ax.set_aspect('auto')
ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)
ax.set_title("Customers Segmentation based on their Credit Card usage behaviour.")
plt.show()

The results are decent: we have 6 well-separated clusters. The clusters shown in green could plausibly have been absorbed into the yellow or purple clusters had we chosen 5 clusters instead. Additionally, we can work out what each cluster represents by plotting the frequency distribution of each feature within a cluster and then comparing and contrasting these plots, as sketched below. The entire code for this post can be found here.
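For instance, a minimal sketch of that comparison for a single feature (here PURCHASES_RANGE, one of the binned columns created above; the bin edges passed to the histogram are an assumption for illustration):

import matplotlib.pyplot as plt

for c in sorted(clusters['cluster'].unique()):
    plt.hist(clusters.loc[clusters['cluster'] == c, 'PURCHASES_RANGE'],
             bins=range(8), alpha=0.5, label=f'cluster {c}')
plt.xlabel('PURCHASES_RANGE')
plt.legend()
plt.show()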
That’s it for this post. We had a detailed look at K-means algorithm in this post. In the next post, we’ll have an in-depth look at Principal Component Analysis, which is one of the most popular and extensively used unsupervised learning algorithms.