A common question in cluster analysis is: "Now what?" You have plenty of methods to bring some order into your data chaos without needing a predefined hypothesis. You could use classical k-means, a density-based method like DBSCAN, or hierarchical clustering, which gives you an impression of the hierarchy of clusters in a dataset. However, you are usually interested not only in which of your samples cluster together, but also in why they do so. Only when you can tell which features brought samples together and which drove them apart do you get value out of your analysis. You move beyond mere description to interpretation and insight.
Yet you want to make sure that your analysis tells the story all by itself. Do not steal the show. Just provide the right figures, and the insights will be obvious. Recently, I stumbled upon a new package in R that makes producing appealing hierarchical cluster plots child's play. It also comes with great documentation, so I merely provide an appetizer. You can go and explore the nitty-gritty details all by yourself.
So first we get some data (the mtcars dataset) and run a hierarchical cluster analysis. This is what the standard plot looks like:
Now we pep it up by loading Tal Galili's awesome dendextend package and transforming the hclust object into a dendrogram. You can color the leaf labels and dendrogram lines (here by engine type), set the size of the labels by some vector (in this case by the number of cylinders) and play around with nodes.
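One detail worth keeping in mind when styling leaves by a data column: the leaves of a dendrogram are drawn in tree order, not in the row order of the dataset. Here is a minimal base R sketch (no dendextend needed) of how the two orders relate. The names `leaf_order` and `vs_by_leaf` are just illustrative choices of mine.

```r
# Build the same clustering as in this post, using base R only.
hc <- hclust(dist(mtcars), method = "complete")
dend <- as.dendrogram(hc)

# order.dendrogram() returns, for each leaf from left to right,
# the row index of that car in mtcars.
leaf_order <- order.dendrogram(dend)

# Reorder a per-car vector from dataset order into leaf order.
vs_by_leaf <- mtcars$vs[leaf_order]
names(vs_by_leaf) <- labels(dend)

# Engine type of the first few leaves, as they appear in the plot.
head(vs_by_leaf)
```

Any vector that is meant to style the leaves one by one can be reordered the same way before it is handed to a plotting function that expects leaf order.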
But most importantly, you can add colored bars that represent the feature levels of each sample. I have cut continuous variables by their quartiles to assign discrete colors, so you quickly get an impression of the feature levels of a given cluster. Thus we can quickly see that the group in the middle with the Merc 450 series shares a common weight class (wt), rear axle ratio (drat), the same number of gears and cylinders, and the same type of transmission and engine (am, vs), yet the cars differ in displacement and horsepower (disp, hp). Interestingly, the rear axle ratio and mpg combination of the Honda Civic is so far off the Merc 450 group that it clusters with Toyota and Fiat, though its other parameters are comparable to the Merc 450. I wonder where a Tesla would be grouped if the dataset were updated someday.
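To make the quartile binning concrete, here is a small base R sketch for a single variable. It is a simplified variation, not the exact code used for the figure: it uses four bins with `include.lowest = TRUE` and the base palette `heat.colors()` instead of colorspace, and the names `bins` and `bin_cols` are my own.

```r
# Quartile breaks of miles per gallon: min, Q1, median, Q3, max.
q <- quantile(mtcars$mpg)

# Assign every car to one of four quartile bins.
# include.lowest = TRUE keeps the car with the minimum mpg in the first bin.
bins <- cut(mtcars$mpg, breaks = q, include.lowest = TRUE,
            labels = c("Q1", "Q2", "Q3", "Q4"))

# One color per bin (base R palette, chosen here only for illustration).
bin_cols <- heat.colors(4)[as.integer(bins)]

# How many cars fall into each bin, and the color of the first few cars.
table(bins)
head(data.frame(car = rownames(mtcars), mpg = mtcars$mpg, bin = bins, col = bin_cols))
```

Discretizing like this trades precision for readability: four colors per bar are easy to compare across clusters at a glance.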
Here is the code to get the figures:
library(dendextend)
library(colorspace) #to translate factors to colors
library(plotrix) #color.id() translates hex codes to color names
#run cluster analysis:
hccomplete <- hclust(dist(mtcars), method ="complete")
#standard plot
plot(hccomplete)
#transform the hclust object to a dendrogram
dend <- hccomplete %>% as.dendrogram
#get colors for factors. Continuous variables are cut by their quartiles, to which the colors are assigned
#helper: one rainbow_hcl color per factor level, translated to a color name
factor_cols <- function(x) {
  f <- as.factor(x)
  unname(sapply(rainbow_hcl(nlevels(f))[as.integer(f)], color.id))
}
#helper: cut a continuous variable at its quantile breaks (the leading 0 gives the minimum its own bin) and assign heat_hcl colors
quartile_cols <- function(x) {
  sapply(cut(x, breaks = c(0, unname(quantile(x))), labels = heat_hcl(5)), color.id)
}
the_bars <- cbind(
  factor_cols(mtcars$cyl), factor_cols(mtcars$vs), factor_cols(mtcars$am),
  factor_cols(mtcars$gear), factor_cols(mtcars$carb),
  quartile_cols(mtcars$mpg), quartile_cols(mtcars$disp), quartile_cols(mtcars$hp),
  quartile_cols(mtcars$drat), quartile_cols(mtcars$wt), quartile_cols(mtcars$qsec)
)
#plot awesomely
par(mar = c(17,3,2,2))
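#dendextend assigns leaf attributes in leaf (left-to-right) order, so reorder the per-car values first
leaf_order <- order.dendrogram(dend)
engine_col <- ifelse(mtcars$vs[leaf_order] == 0, "red", "blue")
dend %>% assign_values_to_leaves_edgePar(value = engine_col, edgePar = "col") %>% # color each leaf's branch by engine type
set("labels_col", engine_col) %>% # color labels by engine type
set("labels_cex", mtcars$cyl[leaf_order]/10+0.2) %>% # label size by number of cylinders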
set("nodes_pch", 19) %>% set("nodes_cex", 0.8) %>% set("nodes_col", 3) %>% #add nodes
plot(main = "Hierarchical cluster dendrogram colored by engine (vs)\nFont size by cylinder") # plot
colored_bars(colors = the_bars, dend = dend, rowLabels = c("cyl","vs","am","gear","carb","mpg","disp","hp","drat","wt","qsec"), sort_by_labels_order = TRUE) #the_bars is in dataset row order, so sort_by_labels_order = TRUE (the default) lets colored_bars reorder the rows to match the leaves; with FALSE the bars are drawn as-is, in the order of the dataset
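If you want to double-check what the colored bars suggest, you can also look at the numbers behind the clusters. The following is a rough base R sketch, not part of the original figure code: it cuts the tree into groups and averages every feature per group. The choice of `k = 4` is arbitrary and only for illustration.

```r
# Same clustering as above, base R only.
hc <- hclust(dist(mtcars), method = "complete")

# Cut the tree into k groups; k = 4 is an arbitrary choice for this sketch.
groups <- cutree(hc, k = 4)

# Mean of every feature within each group.
cluster_means <- aggregate(mtcars, by = list(cluster = groups), FUN = mean)
print(round(cluster_means, 2))

# Which cars ended up in which group.
split(rownames(mtcars), groups)
```

Features whose group means differ strongly are the ones that drive the clusters apart. Note that with unscaled Euclidean distances, variables with large ranges such as disp and hp will dominate.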