Hierarchical Clustering and Dendrograms in R for Data Science

A guide to understanding hierarchical clustering, its applications, pros and cons, and creating dendrograms in R.

Maria Gulzar
Towards Data Science


In the early stages of data analysis, an important task is to get a high-level understanding of multi-dimensional data and to find patterns between the different variables; this is where clustering comes in. A simple way to define hierarchical clustering is:

`partitioning a large dataset into smaller groups of similar observations, which helps make sense of the data in an informative way.`

Image via @jeremythomasphoto on unsplash.com

Hierarchical Clustering can be classified into 2 types:

· Divisive (top-down): all N observations start in a single cluster, which is then split recursively into smaller clusters based on a distance metric until the desired number of clusters is reached down the hierarchy.

· Agglomerative (bottom-up): each of the N observations starts as its own cluster; the two closest clusters are merged, leaving N-1 clusters, and the same step repeats recursively until a single cluster remains, producing a dendrogram that encodes all clustering solutions in one tree.

This blog post focuses on agglomerative hierarchical clustering, its applications, and a practical example in R. By now, two questions should have arisen in your mind: 1) when we say we group the two closest nodes together, how do we define close? And 2) what merging approach do we use to group them?

To compute distance, several approaches can be used (Euclidean distance being the most common); a short R sketch after this list illustrates two of them:

· Euclidean distance: the straight-line distance between two points, i.e. Pythagoras' theorem

· Correlation-based similarity: treats two observations as similar when their continuous features are highly correlated

· Manhattan (binary) distance: the sum of absolute differences between two vectors (used where distances cannot be defined by a straight line, e.g. city maps)
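To make these concrete, here is a minimal sketch in R (the two points are made up for illustration) showing how dist() computes the first and third metrics:

p <- rbind(c(0, 0), c(3, 4))     # two invented points in 2-D space
dist(p, method = "euclidean")    # straight-line distance: 5
dist(p, method = "manhattan")    # sum of absolute differences: 7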

Let’s start with a small dataset and understand how Dendrograms are formed in RStudio:

Step 1: Generating random data

I have used the normal distribution to generate both the x and y coordinates of our dataset, and I have also numbered the data points for readability.

set.seed(12)                                # make the random draws reproducible
x <- rnorm(10, sd = 1)                      # 10 x-coordinates from a normal distribution
y <- rnorm(10, sd = 1)                      # 10 y-coordinates
plot(x, y, col = "red", pch = 19, cex = 2)  # scatter plot of the points
text(x + 0.07, y + 0.06, labels = as.character(1:10))  # label each point 1-10
(Image by author) Data plot

Step 2: Readying our plot to create a dendrogram

First, we store our x and y vectors as the columns of a data frame. Next, we scale the coordinates so that each feature has mean 0 and variance 1 (standardization). Lastly, we use the dist() function to calculate the pairwise distances between the rows of the data frame.

dF <- data.frame(x = x, y = y)  # combine the coordinates into a data frame
dF <- scale(dF)                 # standardize each column to mean 0, variance 1
distxy <- dist(dF)              # pairwise distances between rows (Euclidean by default)
(Image by author) For example, the distance from point 3 to point 2 is 2.94, while the distance from point 6 to point 4 is 0.603
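If you want to read individual distances off distxy, converting the dist object to a full matrix makes indexing straightforward; a quick sketch to verify the captioned values:

dm <- as.matrix(distxy)  # expand the dist object into a symmetric matrix
dm[3, 2]                 # distance between points 3 and 2
dm[6, 4]                 # distance between points 6 and 4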

Step 3: Call hclust()

This builds a hierarchical clustering of the data points based on a distance metric (in this case Euclidean, the default for dist()) over the set of objects in the dataset (in this case, our 10 points).

cluster <- hclust(distxy)  # agglomerative clustering of the distance matrix
Image by author
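This is also where the second question from earlier, the merging approach, gets answered: hclust() controls it through its method argument, which defaults to "complete" linkage. A brief sketch of the common alternatives:

cluster_complete <- hclust(distxy, method = "complete")  # default: farthest-pair distance between clusters
cluster_single   <- hclust(distxy, method = "single")    # nearest-pair distance
cluster_average  <- hclust(distxy, method = "average")   # mean pairwise distance
cluster_ward     <- hclust(distxy, method = "ward.D2")   # minimizes within-cluster variance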

Step 4: Create a dendrogram

Calling plot() on the hclust object draws the dendrogram; an alternate way is plot(as.dendrogram(cluster)), which yields the same result.

plot(cluster, xlab = "Data points", ylab = "Height")  # the y-axis is the merge height (distance)
Image by author

Step 5: Obtaining your desired number of clusters

Depending on the problem at hand, the number of clusters you want out of your dendrogram varies according to where you draw the cut line. Here, since the line cuts the tree at height 1, we get 4 clusters.

abline(h = 1.0, col = "blue")  # draw a horizontal cut line at height 1
Image by author
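To turn the cut into actual cluster assignments rather than just a line on the plot, cutree() can cut the tree by height or by a desired number of clusters; a quick sketch:

cutree(cluster, h = 1)  # cluster membership for each point, cutting at height 1
cutree(cluster, k = 4)  # equivalently, ask for 4 clusters directly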

Applications:

From the classification of animal and plant species, to determining the similarities between variants of a virus, to customer segmentation for marketing campaigns, dendrograms have many uses. In customer segmentation, for example, you group together people with similar traits and a similar likelihood to purchase. Once you have the groups, you can run trials on each group with different marketing copy, which helps you better target your future campaigns.
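As a hypothetical sketch of that workflow (the customers data frame and its columns are invented for illustration), the same pipeline from the example above applies directly:

set.seed(42)
customers <- data.frame(               # made-up customer traits for illustration
  age             = rnorm(50, mean = 40, sd = 12),
  annual_spend    = rnorm(50, mean = 500, sd = 150),
  visits_per_year = rnorm(50, mean = 12, sd = 4)
)
segments <- cutree(hclust(dist(scale(customers))), k = 3)  # cut into 3 segments
table(segments)  # how many customers fall into each segment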

The good and the bad:

Dendrograms are 1) an easy way to cluster data through an agglomerative approach and 2) help you understand the data more quickly. There is 3) no need to pre-define the number of clusters, and we can 4) see all the possible linkages in the dataset.

However, the biggest issue with dendrograms is 1) scalability: a large dataset with many observations (hundreds or thousands) produces a cluttered tree that is hard to interpret. They are also 2) computationally expensive, as a naive agglomerative clustering algorithm has a time complexity of O(n³).
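To see the scalability problem for yourself, here is a rough sketch timing hclust() on growing random datasets (exact timings will vary by machine):

for (n in c(100, 1000, 5000)) {
  m <- matrix(rnorm(n * 2), ncol = 2)             # n random 2-D points
  print(system.time(hclust(dist(m)))["elapsed"])  # runtime grows steeply with n
}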


Passionate about building products & data science, heart-breaking books and amateur writing. Ex-Editor-in-Chief, Scribes (medium.com/scribes)