
Concise Guide To Unsupervised Learning With Clustering!

Detailed understanding of the concepts of unsupervised learning with the help of clustering algorithms.

Photo by William Iven on Unsplash

Machine learning tasks usually involve datasets in which each example has a set of input parameters paired with a known output. From such data, the model we build can predict results for new, similar data. This is what happens in supervised learning.

An example of supervised learning is determining whether a patient has a tumor. We have a large dataset in which each patient’s parameters are matched with the corresponding result. We can treat this as a simple classification task, with ‘1’ for tumor and ‘0’ for none.

Photo by Tran Mau Tri Tam on Unsplash

However, let’s say we have a dataset of dogs and cats with no labels telling us which image is a cat and which is a dog. Problems like this, with unlabeled datasets, can be solved with the help of unsupervised learning. In technical terms, unsupervised learning is a type of machine learning that looks for previously undetected patterns in a dataset with no pre-existing labels and with a minimum of human supervision. Clustering and association are two of the most important types of unsupervised learning algorithms. Today, we will be focusing only on clustering.


Clustering:

Source: Wikipedia

Using patterns in the data, a machine learning algorithm can find similarities and organize the data points into groups. In other words, cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

In clustering, we don’t have any predictions or labeled data. We are given a set of input data points, and using these we need to find the most similar matches and group them into clusters. The clustering algorithms have a wide range of applications that we will discuss in future sections.

Let us analyze the various clustering algorithms that are available. We will discuss the three most prevalent and popular techniques among the many existing approaches, then look at the performance metrics used for unsupervised learning, and finally discuss real-world applications.


Clustering Algorithms:

There are lots of clustering algorithms, but today we will be focusing mainly on the three most popular and important types of clustering algorithms. These clustering algorithms are –

  1. Centroid-based clustering (K-means Clustering)
  2. Connectivity-based clustering (hierarchical clustering)
  3. Density-based clustering (DBSCAN)

We will analyze each of these algorithms in detail and understand how they exactly work. We will also look at the advantages and limitations of these algorithms. So, without further ado, let us get started!

1. K-means Clustering:

Source: Wikipedia

The K-means clustering algorithm is one of the most popular methods for performing clustering analysis. In K-means clustering, the main hyperparameter is ‘K.’

‘K’ = No. of clusters.

The number of clusters determines how the data points will be grouped. In the above figure, we can assume that the value of K chosen is 3. The value of K determines the number of centroids that will be used for the process of segregation and grouping.

The "correct" value of the hyperparameter K can be determined with methods like grid search or random search. Hence, we can say that the main objective of K-means is to find the best centroids and group the clusters accordingly in a suitable manner for the particular dataset.

Steps to follow for K-means clustering –

1. Initialization: Choose the value of K and randomly pick K points from the dataset as the initial centroids.
2. Assignment: Assign each data point to its nearest centroid.
3. Update: Re-compute each centroid as the mean of the points assigned to it.
4. Repetition: Repeat step 2 and step 3 until the assignments stop changing.
5. Termination: Upon reaching convergence, terminate the algorithm.

The above image is a representation of the convergence procedure of K-means.
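As a rough sketch, fitting K-means with scikit-learn on assumed synthetic data could look like this (K = 3 here purely for illustration):

```python
# Sketch: K-means with scikit-learn on synthetic (assumed) data, K chosen as 3.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the learned centroids
print(kmeans.labels_[:10])      # cluster assignments of the first ten points
```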

Advantages:

  1. Relatively simple to implement and execute.
  2. Effective and efficient with larger datasets.
  3. Guarantees convergence at some point.

Limitations:

  1. Choosing the best hyperparameter ‘k’ can be difficult.
  2. Struggles with high-dimensional data.
  3. It is more prone to error in the presence of outliers or noise.

2. Hierarchical Clustering:

Source: Wikipedia

Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away.

The concept of hierarchical clustering is basically the grouping of similar things, either by splitting bigger clusters into smaller ones or by merging smaller clusters into bigger ones. This point can be understood better when we look at the two types of hierarchical clustering methods, which are as follows:

  • Agglomerative Clustering – This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive Clustering – This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Agglomerative clustering is usually preferred over divisive clustering, so we will look further into the agglomerative approach rather than the divisive one.

Source: Wikipedia

The above dendrogram is a representation of how hierarchical clustering, and specifically agglomerative clustering, works.

Let us understand the steps involved in the process of agglomerative clustering –

1. Compute the proximity matrix, i.e., the matrix of pairwise distances between points (and, later, between clusters).
2. Consider each point to be its own cluster.
3. Repeat the following two steps:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Continue until only a single cluster remains.
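A minimal sketch of agglomerative clustering with a dendrogram, assuming SciPy’s hierarchy module and a small synthetic dataset, might look like this:

```python
# Sketch: agglomerative (bottom-up) clustering and a dendrogram with SciPy.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)  # small toy dataset
Z = linkage(X, method="ward")   # record of the bottom-up merges (linkage matrix)
dendrogram(Z)                   # visualize the merge hierarchy
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
```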

Advantages:

  1. Easier to judge the number of clusters by looking at the dendrogram.
  2. Overall easy to implement.

Limitations:

  1. Very sensitive to outliers.
  2. Not suitable for larger datasets.
  3. It has a high time complexity, which may not be ideal for some applications.

3. DBSCAN:

Source: Wikipedia

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and has been gaining popularity. In DBSCAN, we form clusters from the areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered to be noise or border points.

DBSCAN relies on two important hyperparameters: the minimum number of points (minPts) and the radius epsilon (ε). Before we look at the steps of the algorithm, let us analyze how these hyperparameters work.

  • MinPts: The larger the data set, the larger the value of minPts should be. minPts must be at least 3.
  • Epsilon ‘ϵ’: The value for ϵ can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-th nearest neighbor. Good values of ϵ are where the plot shows a strong bend, like an elbow shape (see the sketch after the figure below).
Source: Wikipedia
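A minimal sketch of such a k-distance plot, assuming scikit-learn’s NearestNeighbors and synthetic make_moons data, could look like this:

```python
# Sketch: k-distance graph for choosing epsilon (synthetic, assumed data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
min_pts = 4
# +1 because each point is returned as its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])  # distance to the minPts-th neighbor, sorted
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to the minPts-th nearest neighbor")
plt.show()  # a good epsilon lies near the 'elbow' of this curve
```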

For the implementation of the DBSCAN algorithm, the stepwise procedure we need to follow is as described below:

1. Find the points in the ε (eps) neighborhood of every point, and identify the core points, i.e., those with more than minPts neighbors.
2. Label each point as a core point, border point, or noise point. This initial labeling is an important step for the overall functionality.
3. Remove all the noise points from your data, because points in sparse regions do not belong to any cluster.
4. For each core point ‘p’ not yet assigned to a cluster:
   a. Create a new cluster with the point ‘p’.
   b. Add all points that are density-connected to ‘p’ into this newly created cluster.
5. Assign each non-core point to a nearby cluster if it lies within the ε (eps) neighborhood of that cluster; otherwise, label it as noise.
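Putting it together, a minimal sketch of DBSCAN with scikit-learn on assumed synthetic data (eps and min_samples chosen only for illustration) might be:

```python
# Sketch: DBSCAN with scikit-learn on synthetic (assumed) non-globular data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)  # two crescent shapes
db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # illustrative hyperparameter values
labels = db.labels_                          # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Estimated clusters: {n_clusters}, noise points: {list(labels).count(-1)}")
```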

Advantages:

  1. Density-based clustering methods are highly resistant to noise and outliers.
  2. It works for clusters of arbitrary shape, unlike the two previously mentioned algorithms, which have trouble dealing with non-globular (non-convex) shapes.
  3. It does not require the number of clusters to be specified in advance, unlike K-means.

Limitations:

  1. DBSCAN cannot cluster data sets well when there are large differences in density.
  2. It is not entirely deterministic, and it is prone to errors on high-dimensional datasets.
  3. Its results are sensitive to the choice of its two hyperparameters.

Performance Metrics:

The performance metrics used in supervised learning, such as the AUC (area under the ROC curve) or the ROC (receiver operating characteristic) curve itself, do not work for unsupervised learning. Hence, to evaluate unsupervised learning, we need to compute quantities such as the intra-cluster and inter-cluster distances. Take a look at the reference diagram below.

Image By Author

Intra-cluster: The distance between two similar data points belonging to the same cluster.

Inter-cluster: The distance between two dissimilar data points belonging to different clusters.

The main objective of any good clustering algorithm is to reduce the intra-cluster distance and maximize the inter-cluster distance. One of the main performance metrics that is used for clustering is the Dunn Index parameter.

Dunn index: The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:
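D = min over i ≠ j of δ(C_i, C_j) / max over k of Δ(C_k)

Here, δ(C_i, C_j) denotes the inter-cluster distance between clusters C_i and C_j, and Δ(C_k) denotes the intra-cluster distance (the diameter) of cluster C_k.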

The numerator takes the minimum inter-cluster distance (the worst-case separation between clusters), and the denominator takes the maximum intra-cluster distance (the worst-case spread within a cluster). Since the internal criterion seeks clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable.

If this is confusing, consider the Dunn index ‘D’ as a parameter that measures the worst cases of both quantities; hence, when ‘D’ is high, the clustering is considered desirable. Other performance metric choices are available, such as the Davies–Bouldin index, but the Dunn index is preferred in many cases.
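As a rough sketch, the Dunn index could be computed as follows, assuming SciPy and using the closest pair of points across clusters as the inter-cluster distance and the cluster diameter as the intra-cluster distance (other distance definitions are also common):

```python
# Sketch: Dunn index from data points and cluster labels (one possible definition).
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Minimum inter-cluster distance: closest pair of points in different clusters.
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    # Maximum intra-cluster distance: largest cluster diameter.
    max_intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_inter / max_intra  # higher is better
```

For example, calling dunn_index(X, kmeans.labels_) on the earlier K-means sketch would score that clustering.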


Applications of Clustering:

  1. Data Mining: It is the extraction of useful data elements from an existing dataset. The clustering algorithms can be used in the selection of groups of data that are useful for the task, and the rest of the data can be neglected.
  2. Pattern Recognition: We can also make use of clustering algorithms to find evident patterns between objects. Pattern recognition is the automated recognition of patterns and regularities in data.
  3. Image Analysis: Similar types of images can be grouped together into clusters to get the desired result. An example of this is the segregation of cats and dogs, which was mentioned in the earlier sections.
  4. Information Retrieval: Information retrieval is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Clustering can be used for similar natural language processing tasks for obtaining selected repetitive patterns.
  5. Biomedical application and Bioinformatics: Clustering is extremely beneficial in the field of analysis of medical data and scans to determine patterns and matches. An example of this is the IMRT segmentation, where Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
  6. Anomaly Detection: Clustering is one of the best ways to deduce patterns and detect the presence of outliers by grouping similar groups into clusters while neglecting the outliers, i.e., the unnecessary noise information present in the dataset.
  7. Robotics: The field of robotics utilizes clustering in an interdisciplinary manner with all the disciplines mentioned above. The robots need to find patterns on their own without labeled data, and clustering can be of great help. They are also used for robotic situational awareness to track objects and detect outliers in sensor data.
  8. E-commerce: The last but certainly not the least of all the applications we will be discussing is e-commerce. Clustering is extensively used in marketing and e-commerce to identify segments of customers with similar behavior. These distinguishing patterns allow companies to determine which products to market to which customers.

There are many more applications of clustering, and I would highly recommend checking out the various other areas where clustering can be utilized.


Conclusion:

Image By Author

Clustering is an extremely important concept of machine learning that can be used for a variety of tasks. It can also be very useful in the data mining process and the initial stages of exploratory data analysis. The above diagram is a graph constructed using t-SNE after applying clustering to the data.

With the help of the three aforementioned algorithms, you should be able to solve most clustering tasks quite easily. The techniques and applications mentioned give us a brief insight into why clustering algorithms are helpful in categorizing unlabeled datasets. I hope this article was helpful in explaining these concepts.

Check out these other recent articles that you might be interested in reading!

Natural Language Processing Made Simpler with 4 Basic Regular Expression Operators!

5 Best Python Project Ideas With Full Code Snippets And Useful Links!

Artificial Intelligence Is The Key To Crack The Mysteries Of The Universe, Here’s Why!

OpenCV: Complete Beginners Guide To Master the Basics Of Computer Vision With Code!

Lost In A Dense Forest: Intuition On Sparsity In Machine Learning With Simple Code!

Thank you all for sticking on till the end. I hope you enjoyed reading the article. Wish you all a wonderful day!

