
Hierarchical clustering explained


A simplified illustration of the analysis and procedures used in hierarchical clustering

Photo by Alina Grubnyak, Unsplash

In our previous article on Gaussian Mixture Modelling (GMM), we explored a method of clustering data points based on their location in the feature vector space. In GMM, based on the distribution of data points in the system, we were able to assign every sample a probabilistic likelihood of belonging to each cluster.

But what if, instead of focusing only on the concentration of data points across the system, we wanted to quantitatively estimate the relation between every pair of samples and study how closely each data point is related to the others? To achieve this objective, in this article we will explore another method of clustering that belongs to a completely different family of cluster analysis, known as hierarchical clustering.

Dendrogram

The core concept of hierarchical clustering lies in the construction and analysis of a dendrogram. A dendrogram is a tree-like structure that describes the relationship between all the data points in the system.

Dendrogram with data points on the x-axis and cluster distance on the y-axis (Image by Author)
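As a minimal sketch of how such a plot can be produced (assuming a small, made-up 2-D dataset and the SciPy/Matplotlib libraries), the following code builds a linkage matrix and draws the corresponding dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# A small, hypothetical 2-D dataset: three loose groups of points
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(5, 2)),
    rng.normal(loc=[3, 3], scale=0.3, size=(5, 2)),
    rng.normal(loc=[0, 3], scale=0.3, size=(5, 2)),
])

# Build the linkage matrix (single linkage, Euclidean distance)
Z = linkage(X, method="single", metric="euclidean")

# Plot the dendrogram: data points on the x-axis, cluster distance on the y-axis
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("cluster distance")
plt.show()
```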

However, unlike a regular family tree, a dendrogram need not branch out at regular intervals from top to bottom, because the vertical direction (y-axis) represents the distance between clusters in some metric. As you follow a path downwards, you keep breaking clusters into smaller and smaller units until the granularity level reaches an individual data sample. Conversely, as you traverse upwards, at each level you subsume smaller clusters into larger ones until you reach the entire system. As a result, hierarchical clustering is also known as clustering of clusters.

Effect of granularity and cluster size while traversing in the dendrogram (Image by Author)

Number of clusters

In hierarchical clustering, we do not make any assumption about the number of clusters while constructing the dendrogram. Once the dendrogram has been constructed, we slice the structure horizontally. All the child branches formed below the horizontal cut represent individual clusters at the highest level in your system, and this defines the associated cluster membership for each data sample. Note that we say "at the highest level" because even after you have created the clusters, you still know the relationships within the subclusters that could still be formed, and you always have the option to increase or decrease the granularity level of the clustering.

Dendrograms, however, do not do proper justice to how the clusters will look after you place the horizontal cut. You have to mark the data points in the feature vector space with the resulting cluster indexes to visually see the effect of clustering.

A sliced dendrogram with its resulting clusters marked in its corresponding feature vector space on the right side (Images by Author)

The next question to ponder is where the horizontal cut should be placed. The location of slicing can be decided visually, or by requiring a minimum distance of y (the location of the cut on the y-axis) between your clusters.

Dendrogram slicing possibilities at cluster distance of y1, y2, y3 (Image by Author)
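Programmatically, slicing at a chosen distance can be done with SciPy's fcluster. Here is a sketch continuing from the linkage matrix Z built in the earlier snippet; the thresholds are made-up illustrative values:

```python
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a cluster distance of t: every merge that happens
# above this height is undone, and each remaining subtree becomes one flat cluster.
labels_loose = fcluster(Z, t=2.5, criterion="distance")   # fewer, larger clusters
labels_tight = fcluster(Z, t=0.5, criterion="distance")   # more, smaller clusters

print(labels_loose)  # cluster index assigned to each data point
print(labels_tight)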

Also, you are not constrained to cut the dendrogram at a constant distance throughout. Based on the application and domain knowledge of the problem you are trying to solve, the dendrogram can be cut at varying heights. For example, in the outlier detection application below, the horizontal cut varies at different places so as to separate a couple of outliers that lie adjacent to the main clusters.

Dendrogram cut at variable lengths for outlier detection application with feature vector space projection on the right side (Images by Author)
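SciPy does not expose a variable-height cut directly, so the following is only a rough sketch of one common workaround (not the exact procedure in the figure): cut the single-linkage dendrogram at a single height and then flag very small clusters as outliers. It continues with the linkage matrix Z from the earlier snippet, and the distance and size thresholds are arbitrary illustrative values:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at an arbitrary distance, then treat clusters with
# very few members as outliers.
labels = fcluster(Z, t=1.0, criterion="distance")
cluster_ids, counts = np.unique(labels, return_counts=True)
small_clusters = cluster_ids[counts <= 2]        # size threshold is arbitrary
outlier_mask = np.isin(labels, small_clusters)
print("outlier points:", np.where(outlier_mask)[0])
```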

Interpretation of dendrogram

Each level of a dendrogram carries a subtle meaning about the relationship between its data members. In a regular family chart, one may interpret the top level as the grandparents or first generation, the next level as the parents or second generation, and the final level as the children or third generation. Likewise, at every branching of the dendrogram, all the data points holding membership at that level belong to a certain class.

However, to infer this class entity, one has to go through a few individual samples at each level within the formulated cluster and find out what feature the members of the resulting cluster have in common. Also, these inferred classes need not be similar across sister branches. For example, at level 2 the cats have been clustered on big ears and calm behavior, whereas the dogs have been clustered on a different attribute, their size.

Inferred classes of data in dendrogram by the user (Image by Author)

Construction of dendrogram

Now that we have understood what a dendrogram is, let us learn how to construct one. There are two ways of doing so. The first is bottom-up, where you start from the bottom and keep merging individual data points and subclusters all the way to the top. This is known as agglomerative clustering.

The other alternative is the opposite, top-down procedure, in which you start by considering the entire system as one cluster and then keep splitting it into subclusters until you reach individual data samples. This process is known as divisive clustering. Each of these methods has separate algorithms to achieve its objective.

a) Agglomerative clustering

One of the simplest and most easily understood algorithms used to perform agglomerative clustering is single linkage. In this algorithm, we start by considering each data point as a subcluster. We define a metric to measure the distance between all pairs of subclusters and, at each step, merge the two nearest subclusters. We repeat this procedure until there is only one cluster left in the system.
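For clarity, here is a naive sketch of that merge loop written directly from the description above (not an efficient or production implementation; in practice a library routine such as scipy.cluster.hierarchy.linkage does this work):

```python
import numpy as np

def single_linkage(X):
    """Naive single-linkage agglomerative clustering.

    Returns the sequence of merges as (members_a, members_b, distance) tuples,
    which is exactly the information a dendrogram encodes.
    """
    clusters = [{i} for i in range(len(X))]   # start: every point is its own subcluster
    merges = []
    while len(clusters) > 1:
        best = (None, None, np.inf)
        # Single linkage: the distance between two subclusters is the distance
        # between their closest pair of members.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((set(clusters[a]), set(clusters[b]), d))
        clusters[a] = clusters[a] | clusters[b]   # merge the nearest pair of subclusters
        clusters.pop(b)
    return merges
```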

b) Divisive clustering

One of the algorithms used to perform divisive clustering is recursive k-means. As the name suggests, you recursively perform the k-means procedure on each intermediate cluster until you reach individual data samples, or the minimum number of data samples you desire to have in a cluster. At each step of this algorithm, you have to be mindful of how many clusters you would like to create next.

Formation of agglomerative/divisive cluster with data samples represented by black dots (Image by Author)
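A possible sketch of the divisive approach is shown below, assuming scikit-learn's KMeans, a hypothetical min_size stopping criterion, and a split into two clusters at every step for simplicity:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, indices=None, min_size=3, depth=0):
    """Recursively split the data with 2-means until every cluster is small enough."""
    if indices is None:
        indices = np.arange(len(X))
    # Stop splitting once a cluster reaches the desired minimum size
    if len(indices) <= min_size:
        print("  " * depth + "leaf cluster:", indices)
        return
    # Split the current cluster into two with k-means
    # (the number of child clusters is a choice you make at every step)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[indices])
    left = indices[km.labels_ == 0]
    right = indices[km.labels_ == 1]
    if len(left) == 0 or len(right) == 0:
        print("  " * depth + "leaf cluster:", indices)   # degenerate split; stop here
        return
    divisive_clustering(X, left, min_size=min_size, depth=depth + 1)
    divisive_clustering(X, right, min_size=min_size, depth=depth + 1)
```

Calling divisive_clustering(X) on the small dataset from the first sketch would print the hierarchy of splits top-down, from the whole system to small leaf clusters.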

In both clustering approaches, when drawing the dendrogram, we have to keep track of the distance between the two clusters being merged, and this distance determines the height of the corresponding branch on the y-axis of the dendrogram.

Variation in height of branch in dendrogram based on cluster distance (Image by Author)

Lastly, let us look into the advantages and disadvantages of hierarchical clustering.

Advantages

  • With hierarchical clustering, you can create clusters with more complex shapes than are possible with GMM, and you need not make any assumptions about how the resulting shape of your cluster should look.
Complex structured shapes formed with hierarchical clustering (Image by Author)
  • In one go, you can cluster the dataset at various levels of granularity and only then decide upon the number of clusters you desire in your dataset.
  • If you are using some type of Minkowski distance, say the Euclidean distance, to measure the distance between your clusters, then the clustering procedure becomes mathematically very easy to understand, as in the short sketch below.
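For instance, the pairwise Euclidean distances that drive every merge decision can be computed in a single SciPy call (a minimal illustration, continuing with the small dataset X from the first sketch):

```python
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances between all samples in X
D = pdist(X, metric="euclidean")
print(squareform(D))   # the same distances laid out as a readable n-by-n matrix
```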

Disadvantages

  • When you begin analyzing dendrograms and making decisions based on them, you will realize that hierarchical clustering is heavily driven by heuristics. This leads to a lot of manual intervention in the process, and consequently application- or domain-specific knowledge is required to judge whether the result makes sense.
  • Though hierarchical clustering may be mathematically simple to understand, it is computationally very heavy. In any hierarchical clustering algorithm, you have to keep recalculating the distances between data samples/subclusters, which increases the number of computations required.
  • The last disadvantage we will list is one of the biggest factors behind why hierarchical clustering is often avoided by ML engineers. In all our visualizations, we have shown dendrograms with very few data samples. As the number of data samples grows (and in practice it almost always does), visually analyzing the dendrogram and making decisions from it becomes impossible.

In this article, we have explored the concepts of hierarchical clustering. Do share your views in the comments section.

