A Dive Into Unsupervised Learning

How do the Unsupervised Learning Algorithms work?

Abhijit Roy
Towards Data Science
13 min read · Nov 22, 2020


Unsupervised algorithms are regarded as self-learning algorithms that can explore a dataset and uncover previously unknown patterns in it. They are among the most widely used machine learning algorithms because they do not need a labeled dataset to operate. Unsupervised algorithms are also widely used to detect anomalies and defects in a dataset, since anomalies do not match the normal pattern of the rest of the data distribution.

Types of Unsupervised Learning

There are mostly two types of unsupervised learning algorithms:

  1. Association Algorithm
  2. Clustering Algorithm

Clustering algorithms are a type of unsupervised learning algorithm in which we group, or cluster, the data together based on the similarity of the features in our dataset.

Clustering algorithms mainly consist of the following types:

1. Partitioning Methods

The partitioning methods usually use a measure of central tendency to represent each cluster. They use a distance-based approach to create the clusters.

The most popular Partitioning method algorithms are:

  1. K-means Algorithm
  2. K-medoid Algorithm
  3. KNN Algorithm

2. Hierarchical Methods

The hierarchical algorithms are so called because they build tree-like structures to form the clusters. These algorithms also use a distance-based approach for cluster creation.

The most popular algorithms are:

  1. Agglomerative Hierarchical clustering
  2. Divisive Hierarchical clustering

3. Density-Based Methods

They create arbitrarily shaped clusters. These methods identify dense regions and low-density regions and create clusters accordingly.

The most well-known algorithm of this type is:

  1. DBSCAN Algorithm

4. Unsupervised Dimensionality Reduction Algorithms

Several algorithms based on unsupervised learning principles are used for dimensionality reduction. The most used and well-known ones are:

  1. PCA
  2. t-SNE

Association algorithms are the type that works on dependencies. They basically create mappings among data items. Say, in a supermarket, the buying pattern shows that customers who bought milk also bought bread most of the time. So, we create a dependency between milk and bread that acts as an association between the items. Such algorithms are used by supermarkets and e-commerce shops to predict items a customer may want to buy based on the products he/she has already bought or added to a cart.

The most important algorithm used in association problems is the Apriori algorithm.
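As a rough illustration, here is a minimal sketch of Apriori-based association rule mining using the mlxtend library. The toy baskets are invented for this example, and the support/confidence thresholds (and exact helper signatures) may vary across mlxtend versions.

```python
# A minimal sketch of association rule mining with the Apriori algorithm,
# using the mlxtend library. The toy transactions below are invented.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each inner list is one customer's basket (hypothetical data).
transactions = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "bread", "eggs"],
    ["milk", "eggs"],
]

# One-hot encode the baskets into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Find itemsets that appear in at least 40% of the baskets.
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)

# Derive rules such as {milk} -> {bread} above a minimum confidence.
rules = association_rules(frequent_itemsets, metric="confidence",
                          min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```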

Let’s take a dive into the algorithms.

K-means Clustering

The algorithm initially creates K clusters randomly from the N data points and, for each cluster, finds the mean of all the point values in that cluster. So, for each cluster we find a central point, or centroid, by calculating the mean of the values in the cluster. Then the algorithm calculates the sum of squared errors (SSE) for each cluster. SSE is used to measure the quality of the clusters: if a cluster has large distances between its points and its center, the SSE will be high. The interpretation is that only points in the close vicinity of a center should form a cluster.

The algorithm works on the principle that a point lying close to the center of a cluster should belong to that cluster. So, if a point x is closer to the center of cluster A than to that of cluster B, then x will belong to cluster A. Points therefore move between clusters, and as even a single point moves from one cluster to another, the centroids change and so does the SSE. We keep doing this until the SSE stops decreasing and the centroids do not change anymore. After a certain number of shifts, the optimal clusters are found and the shifting stops, as the centroids no longer change.

The initial number of clusters ‘K’ is a user parameter.
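A minimal K-means sketch with scikit-learn is shown below; the synthetic blob data and the choice of K = 4 are just assumptions for illustration.

```python
# A minimal K-means sketch with scikit-learn on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # the learned centroids
print(kmeans.inertia_)           # the SSE of the final clustering
```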

Let’s look at the visualization.

Image by author

The first image shows the distribution of points on which the clustering is to be performed. In the second, we start with random clusters and locate their centroids. Then we calculate the SSE for each cluster, i.e., the sum of squared distances of all the points in the cluster from the centroid of that cluster. The main aim is to keep decreasing the SSE as we move forward. Coming to the last, or fourth, image, we have reached the optimal clusters and now the centroids do not shift anymore. Here the SSE is at its minimum.

We find the total SSE for a stage in the clustering process by adding up the SSEs of all the individual clusters. So, if a point of cluster A is closer to the centroid of cluster B than to that of A, grouping the point in cluster B will decrease the overall SSE.

We have seen that for this type of clustering technique we need a user-defined parameter 'K', which defines the number of clusters that need to be created. Now, this is a very important parameter. To find this parameter, a number of methods are used. The most important and widely used one is the elbow method.

For smaller datasets, a common rule of thumb is k = (N/2)^(1/2), i.e., the square root of half the number of points in the distribution.

Elbow method

The elbow method works on the sum of squared errors principle. We try a range of cluster counts and build a graph of the correspondingly obtained SSEs. So, say for our case, we consider the number of clusters to be 1, 2, 3, 4, 5, 6, 7, and 8. For each number of clusters we obtain the corresponding SSE: SSE1, SSE2, SSE3, and so on. Now, if we plot the SSEs we obtain a plot as follows:

Image by author

We find an elbow at 4, so we should use 4 as the optimal number of clusters. The SSE keeps decreasing beyond that point, but only at the cost of a large number of clusters, which is not desired. This is the trade-off between the decreasing SSE and the number of created clusters.
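Here is a small sketch of the elbow method using scikit-learn, where the SSE of a fitted K-means model is exposed as `inertia_`; the synthetic data and the range of K values are assumptions for illustration.

```python
# Elbow method sketch: fit K-means for a range of K and plot the SSE.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 9)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)   # SSE of the fitted clustering

plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()
```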

Disadvantages

One of the main disadvantages of the algorithm is that it is highly susceptible to noise and outliers. As we have seen, the algorithm selects a centroid that is just the mean of all the points, not a real point of the distribution, so any outliers present in a cluster will distort the centroid and also cause the SSE to blow up.

To solve this issue with the K-means algorithm, the K-medoid algorithm was developed.

K-medoid Algorithm

Instead of considering the mean of the data points in a cluster, the K-medoids technique considers 'k' representative data points from the existing points in the dataset as the centers of the clusters. The representatives are actual data points in K-medoids, unlike in the K-means algorithm. Beyond that, the process is very similar to K-means. In this case, we calculate the SSE from the chosen representative point and not from the centroid as we did in the case of K-means.

An algorithm named Partitioning Around Medoids (PAM) is used to implement the K-medoid algorithm. Let’s see how the algorithm works.

  1. Initially, K points are chosen randomly as the representative points, one for each cluster.
  2. We assign each remaining point to the cluster whose representative point is nearest to it.
  3. We randomly select a non-representative point, say 'On', from a cluster whose current representative point is 'Oo'.
  4. We calculate both 'SSEold' and 'SSEnew', corresponding to the 'Oo' and 'On' points respectively.
  5. If we find SSEnew < SSEold, we swap the representative points and redefine the clusters.
  6. We repeat steps 2 to 5 until there is no change in SSE.

Because the algorithm works on real points rather than an imaginary mean, it handles outliers and noise much better. The disadvantage of the algorithm is its higher time complexity. The time complexity of the K-medoid algorithm is O(k(n-k)²) for one iteration, where k is the number of representative points and n is the total number of points in the distribution.

So, for the complete algorithm, if we take t iterations, the time complexity will be O(tk(n-k)²).
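Here is a simplified, PAM-style sketch in NumPy that follows the swap steps listed above; it is illustrative rather than an optimized implementation, and the synthetic data is made up.

```python
# Simplified PAM-style K-medoids: swap a medoid with a non-medoid
# whenever the swap lowers the SSE, until no swap helps.
import numpy as np

def k_medoids(X, k, max_iter=100, rng=np.random.default_rng(0)):
    n = len(X)
    # Pairwise squared Euclidean distances between all points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # SSE: each point contributes its squared distance to its nearest medoid.
        return d2[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                 # each current medoid
            for o in range(n):             # each candidate replacement
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                c = cost(candidate)
                if c < best:               # keep the swap if SSE drops
                    medoids, best, improved = candidate, c, True
        if not improved:
            break
    labels = d2[:, medoids].argmin(axis=1)
    return medoids, labels

# Example usage with synthetic data.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
medoids, labels = k_medoids(X, k=2)
print(X[medoids])   # the representative points are actual data points
```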

K-NN Algorithm

K-NN, or K-Nearest Neighbours, does not build a model. It is a simple algorithm that works entirely on a similarity metric, which is mostly a distance-based measure. It considers all the available points and computes the distances to them, so the K-NN algorithm has a huge time complexity and is only used for small datasets.
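As a small illustration of the purely distance-based lookup that K-NN relies on, here is a minimal sketch using scikit-learn's unsupervised NearestNeighbors helper; the random data and the choice of 3 neighbours are just assumptions.

```python
# Distance-based neighbour lookup with scikit-learn's NearestNeighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(20, 2)   # small synthetic dataset

nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X)
# The 3 nearest neighbours of the first point (includes the point itself).
distances, indices = nn.kneighbors(X[:1])
print(indices, distances)
```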

We have seen the partitioning methods; now let's look at the hierarchical methods.

Hierarchical methods

There are two types of hierarchical methods:

  1. Agglomerative Hierarchical clustering: In this type of hierarchical clustering, each point initially starts as its own cluster, and the nearest or most similar clusters are gradually merged into one cluster.
  2. Divisive Hierarchical Clustering: This type of hierarchical clustering is just the opposite of agglomerative clustering. Here, all the points start as one large cluster, and the clusters are gradually divided into smaller clusters based on how large the distance, or how low the similarity, is between the two clusters. We keep on dividing the clusters until all the points become individual clusters.

Now, as we have seen, distance-based similarity metrics play a major role in both cases. One thing to note is that although we use the term "distance-based similarity", it does not mean only raw distance; it covers measures derived from distances as basic units. We will look at them later.

For agglomerative clustering, we keep merging the clusters that are nearest, or have the highest similarity, into one cluster. So, if we define a cut-off or threshold score for the merging, we will get multiple clusters instead of a single one. For instance, if we set the threshold score to 0.5, the algorithm will stop merging once no two clusters can be found with a score less than 0.5, and the number of clusters present at that step gives the final number of clusters.

Similarly, for divisive clustering, we split the clusters based on the least similarity scores. So, if we define a score of 0.5, the algorithm will stop dividing or splitting when the similarity score between two clusters becomes less than or equal to 0.5. We will be left with a number of clusters, rather than splitting all the way down to individual points of the distribution.
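As a rough illustration of the cut-off idea, here is a minimal sketch using scikit-learn's AgglomerativeClustering. Note that scikit-learn expresses the cut-off as a distance threshold rather than a similarity score, and the threshold value below is only an assumption for the synthetic data.

```python
# Agglomerative clustering with a merge cut-off: setting distance_threshold
# stops merging early, so more than one cluster survives.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.5, random_state=42)

agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="average")
labels = agg.fit_predict(X)
print("clusters found:", agg.n_clusters_)
```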

Let’s see a visualization for agglomerative hierarchical clustering.

Image by author

The above diagram shows how the clustering algorithm works. Divisive algorithms work the same way, just in the reverse order. Now, the question arises of how the similarity metric is maintained and how the cutoff is applied. One of the most used methods for this purpose is the dendrogram.

Dendrograms

Dendrograms are used to represent the similarity measure, or the distance, between clusters. A dendrogram is visualized like a tree. It also displays the order in which the clusters are merged in the case of agglomerative clustering, and split in the case of divisive clustering.

Let’s build a dendrogram for our example shown above:

Image by author

The order of merging is evident from the graph. The connection length between clusters represents the dissimilarity between them, so it is basically a measure of distance. The more dissimilar two clusters are, the longer the connecting link. Now, as shown in the diagram, if we use a threshold at 0.25, the clusters (1,2,3,4,5) and (6,7,8) won't merge and we will conclusively obtain two clusters. The operations are similar in the case of divisive clustering.
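Here is a minimal sketch of building such a dendrogram with SciPy; the synthetic data and the cut distance of 1.5 are assumptions for illustration.

```python
# Build a dendrogram with SciPy: `linkage` records the merge order and
# merge distances, and `dendrogram` draws the tree.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=2, cluster_std=0.6, random_state=1)

Z = linkage(X, method="single")   # single-link (MIN) merging
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()

# Cutting the tree at a chosen distance gives the final clusters.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)
```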

Now, the question arises: how do we obtain the similarity measure for the dendrogram based on the distances?

There are several methods to define inter-cluster distance or similarity. The methods used are:

  1. MIN or single link: The distance between the nearest two points of different clusters is used as the metric. The disadvantage of this measure is that it is susceptible to noise and outliers.
  2. MAX or complete link: The distance between the farthest points of different clusters is used. This method is less susceptible to noise but tends to break up large clusters.
  3. Group average: Defines cluster proximity as the average pairwise distance of all pairs of points from different clusters.
  4. Distance between centroids: Each cluster is represented by its centroid. The distance between the centroids serves as the similarity metric.
  5. Ward’s method: This method works on the principle of the sum of squared errors. Here also, each cluster is represented by its centroid. First, the SSE of each cluster is calculated before merging, and then the SSE is calculated again after merging. The distance or score is measured on the basis of these two values; the method tries to minimize the increase in SSE. It is also not strongly affected by outliers, but it tends to create globular clusters.
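The criteria above map directly onto the `method` argument of SciPy's `linkage` function. The following short sketch (with an arbitrary cut distance of 2.0 on synthetic data) shows how the choice of criterion changes the resulting clusters.

```python
# Compare linkage criteria via SciPy's `method` argument.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=7)

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2.0, criterion="distance")
    # fcluster labels start at 1, so the max label is the cluster count.
    print(f"{method:>8}: {labels.max()} clusters at cut distance 2.0")
```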

The algorithm has O(n³) time complexity, which can be reduced to O(n² log n) with the use of heaps, and O(n²) space complexity.

That covers the hierarchical methods. Next, let's talk about density-based scanning methods.

Density-based methods- DBSCAN

Density-based scanning methods are among the most used unsupervised clustering algorithms. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. The algorithm is widely used because it works well with noisy data containing outliers and is not restricted to spherical, globular, or elliptical clusters: it can find clusters of any shape.

Image source: Wikipedia

To understand DBSCAN we must know some terms:

Epsilon: It is the radius considered around a given data point.

Minimum points: It gives the minimum number of points that have to be present inside the Epsilon radius circle around a data point.

Core point: If a data point has a number of points equal to or more than the “minimum points” inside the epsilon radius around it, that point is called a core point.

For example, if Epsilon is 5 and Minpts is 4, and inside a 5-unit radius of point A there are at least 4 points including A, then A is called a core point.

Border point: If a data point does not have the minimum number of points required to be a core point but has at least one core point inside the epsilon radius around it, it is called a border point.

Noise point: If a data point has no core points inside the epsilon radius around it, the point is called a noise point, as it is isolated. This is how DBSCAN rules out outliers.

Only the core and border points are included in the clusters.

Image source: Medium

To implement the algorithm, we create a matrix holding the distances of all the points from each other. The distance measure used is mostly the Euclidean distance. The matrix formed is symmetric, as the distance from point i to point j is the same as the distance from point j to point i. The time complexity of the algorithm is O(n²).
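A minimal DBSCAN sketch with scikit-learn is shown below; `eps` and `min_samples` correspond to the Epsilon and Minimum points parameters described above, and the specific values are only assumptions for the synthetic two-moons data.

```python
# DBSCAN on the two-moons dataset, which is not linearly separable and
# not globular, so partitioning methods struggle with it.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                 # noise points get the label -1
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", list(labels).count(-1))
```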

We have now discussed almost all the commonly used clustering algorithms. Next, let's look at the unsupervised dimensionality reduction algorithms.

Dimensionality Reduction Algorithms

PCA

PCA, or Principal Component Analysis, is very commonly used for viewing the data distribution and analyzing the main components of a distribution. We are not going to discuss it here because it is not a fully unsupervised algorithm. You can check out the concepts behind PCA here and its application here.

t-SNE

t-SNE stands for t-distributed Stochastic Neighbour Embedding. It is a method to represent higher-dimensional data in lower dimensions, mostly for the purpose of data visualization.

t-SNE works on the Kullback-Leibler (KL) divergence:

t-SNE first randomly projects the points from a higher-dimensional plane, say X, to a lower-dimensional plane, say Y.

The probability of point j being similar to point i is based on the Euclidean distance between the two points in that plane: the smaller the distance, the higher the probability.

So, for plane X, these pairwise similarities form a distribution P, and for plane Y they form a distribution Q (the standard formulas are reproduced below).

The target of the algorithm is to make P and Q as similar as possible.

So, if P == Q, then log(P/Q) = log(1) = 0 and there is no penalty,

but if P != Q, the KL divergence adds a penalty.
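For reference, the equation images from the original post are not reproduced here, but the standard t-SNE definitions can be written as follows (p_ij denotes the symmetrized version of the conditional similarities, p_ij = (p_{j|i} + p_{i|j}) / 2n):

```latex
% Similarity of point j to point i in the high-dimensional space X,
% using a Gaussian kernel centred on x_i:
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

% Similarity of points i and j in the low-dimensional map Y,
% using a Student t-distribution with one degree of freedom (heavy tails):
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The cost t-SNE minimises: the KL divergence between P and Q.
C = \mathrm{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
```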

t-SNE is preferred over PCA in some cases because, when the distances between the points of the distribution are small, t-SNE preserves the structure better than PCA.

t-SNE creates a t-distributed curve using all the points of the distribution. So, for each point, we create a t-distribution, which helps us preserve the distances of the other points from the given point. The points near a considered point lie close to the center of the curve, and the ones far away lie at the long tail ends. Thus, from the curve, we know exactly how far from a point the other points in the distribution are located. We create the curves from the original high-dimensional data and use them to arrange the points correspondingly in the lower dimension.

Now, a question may arise: why a t-distributed curve?

The original SNE algorithm was designed using the normal distribution curve, which had a crowding problem, i.e., the points lying close to the considered point crowded together at the top of the bell curve. To solve this problem, a t-distribution curve was introduced.
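A minimal t-SNE sketch with scikit-learn is shown below, projecting the 64-dimensional digits dataset down to 2 dimensions; the perplexity value is a typical default, not a tuned choice.

```python
# Project the digits dataset to 2D with t-SNE for visualization.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE projection of the digits dataset")
plt.show()
```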

This sums up the most commonly used unsupervised learning algorithms and approaches.

Conclusion

Unsupervised algorithms play a major role in machine learning problems, as it is often impossible to generate huge amounts of labeled data in real-world scenarios. The algorithms we talked about in this article serve to solve several machine learning problems, such as detecting fraudulent transactions.

I hope this helps!
