The world’s leading publication for data science, AI, and ML professionals.

Making publication-quality figures in Python (Part IV): Violin plot and dendrogram

Drawing violin plot and dendrogram from the scratch, a step-by-step guide

python-visualization-tutorial

Photo by Myriam Jessier on Unsplash
Photo by Myriam Jessier on Unsplash

This is the fourth tutorial of my python visualization series,

  1. The tutorial I: Fig and Ax object
  2. Tutorial II: Line plot, legend, color
  3. Tutorial III: box plot, bar plot, scatter plot, histogram, heatmap, colormap
  4. Tutorial IV: violin plot, Dendrogram
  5. Tutorial V: Plots in Seaborn (cluster heatmap, pair plot, dist plot, etc)

Let’s have a short intro this time, this article just resumes from where Tutorial III left off. The reason I want to write a separate article for the Violin plot and dendrogram is that they are a little bit involved compared to the previously-covered plot types. So take a deep breath and let’s dive into these two plots.


Violin plot

The Violin plot essentially takes the exact same input as the box plot, namely, a sequence of 1D arrays. The only difference is that the violin plot can show an additional layer of information of your 1D array, which is the density, or distribution. This piece of information was smeared in a box plot (a box body has to be rectangular) but in a violin plot (how the name is developed), density information is able to be revealed because we now have a curvy "box", or a violin.

Same preparation as the last tutorial,

# load packages
import matplotlib as mpl
import matplotlib.pyplot as plt

# set global parameters
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
mpl.rcParams['font.family'] = 'Arial'
# prepare some data for drawing figures
import numpy as np

np.random.seed(42)
data1 = np.random.randn(100)
data2 = np.random.randn(100)
data3 = np.random.randn(100)

I add two lines of codes for convenience:

dataset = [data1,data2,data3]. # the sequence of arrays
positions = [1,5,7].  # where to put those violin, the x-coordinates

Now let’s start to draw violin plot:

fig,ax = plt.subplots()
vp = ax.violinplot(dataset=dataset,positions=[1,5,7])
Basic violin plot
Basic violin plot

It looks OK, it has the desirable density information that we are looking for. For an exploratory purpose, it is absolutely fine to keep with that. But I may need to make it a bit prettier:

  1. Aesthetic adjustment: face color, line color, etc
  2. Add quantiles, whiskers to the violin plot

For point (1), it is what I covered in Tutorial III, figure out the underpinning Python objects for each graph element, and look up the function to change their individual aesthetic setting via reading the documentation. This is important! Check out the box plot section for more detail.

fig,ax = plt.subplots()
vp = ax.violinplot(dataset=dataset,positions=[1,5,7])

for body in vp['bodies']:
    body.set_facecolor('red')
    body.set_edgecolor('black')
    body.set_alpha(1)
vp['cmaxes'].set_color('black')
vp['cmins'].set_color('black')
vp['cbars'].set_color('black')

Let’s see the effect:

Modify aesthetic elements
Modify aesthetic elements

For point (2), here’s where things become a bit complicated, but no worry, let’s go over it together. Basically, we need two arrays, one to store quantile information, let’s call it tmp array because it is a temporary one. Another is called whisker array, it is to store the upper and lower bound, or whisker.

tmp : [(q1,median,q3), (q1,median, q3), (q1,median,q3)]

whisker : [(upper whisker, lower whisker), (upper whisker, lower whisker), (upper whisker, lower whisker)]

You see, each tuple corresponds to a violin box. Now we need to compute these stuff, right?

tmp is easy, np.percentile() function do all the stuff for us, we just use a list comprehension to iteratively get the final tmp array.

tmp = [np.percentile(data,[25,50,75]) for data in dataset]

Then for whisker, let’s define a function, the logic is simple because we want to compute upper whisker and lower whisker, we need to first compute iqr , then upper whisker is just 1.5 IQR + q3, lower whisker is just q1-1.5 IQR.

def get_whisker(tmp,dataset):
    whisker = []
    for quantile,data in zip(tmp,dataset):
        data = np.array(data)
        q1 = quantile[0]
        median = quantile[1]
        q3 = quantile[2]
        iqr = q3 - q1
        upper = q3 + 1.5 * iqr
        upper = np.clip(upper,q3,data.max())
        lower = q1 - 1.5 * iqr
        lower = np.clip(lower,data.min(),q1)
        whisker.append((upper,lower))
    return whisker

Remember, here, we use np.clip() function to clip upper whisker if it exceeds the maximum value of the data (1D array), otherwise it will be misleading since there shouldn’t be an upper whisker if there’s no data points up here, right?

Now we have the whisker array as well.

whisker = get_whisker(tmp,dataset)

Then we add those elements to the Violin Plot:

fig,ax = plt.subplots()
vp = ax.violinplot(dataset=dataset,positions=[1,5,7])

for body in vp['bodies']:
    body.set_facecolor('red')
    body.set_edgecolor('black')
    body.set_alpha(1)
vp['cmaxes'].set_color('black')
vp['cmins'].set_color('black')
vp['cbars'].set_color('black')
ax.scatter(positions,
[quantile[1] for quantile in tmp],
marker='o',color='white',s=30,zorder=3)

We use a scatter plot to place three median points to each violin’s main body, please check out the scatter plot section and zorder section in previous posts.

Add median points to violin plot
Add median points to violin plot

Then we add 25% quantile and 75% quantile range:

ax.vlines(positions,
[quantile[0] for quantile in tmp],
[quantile[2] for quantile in tmp],
color='black',linestyle='-',lw=5)
Add 25 and 75 quantile ranges
Add 25 and 75 quantile ranges

Finally, we need to add whisker bound:

ax.vlines(positions,
[bound[0] for bound in whisker],
[bound[1] for bound in whisker],
color='black',linestyle='-',lw=2)
Add whisker bound
Add whisker bound

Due to our clip function and the black line background, this step doesn’t show any differences compared to the previous one, in order to demonstrate its effect, I exaggerate a bit, but this is just for showing the effect, you don’t need to do that, just stick with the above one.

ax.vlines(positions,
[bound[0] for bound in whisker],
[bound[1] for bound in whisker],
color='black',linestyle='-',lw=4)

I changed the line width to 4, let’s see:

Exaggerate the whisker bound range(we clip if it exceeds min or max value)
Exaggerate the whisker bound range(we clip if it exceeds min or max value)

To summarize, two tricks here, we use zorder to always keep the median point in the very front. Then we use different linewidth to emphasize the 25%-75% quantile ranges.

All the codes are available at:https://github.com/frankligy/python_visualization_tutorial

Dendrogram

To be honest, you may not encounter this in your area. But as a bioinformatics student, I dealt with clustering problems a lot, and dendrogram is very often encountered in my project and day-to-day coding.

When will we use dendrogram?

Using the famous iris dataset as an example, let’s just imagining that we have 6 flowers, each of them has different characteristics including but not limited to sepal length , sepal width , petal length etc. It can be more formally represented by a m x n matrix, where m stands for m observations n stands for n features for each observation.

Let’s prepare dummy data for that:

sample1 = np.random.randn(100)
sample2 = np.random.randn(100)
sample3 = np.random.randn(100)
sample4 = np.random.randn(100)
sample5 = np.random.randn(100)
sample6 = np.random.randn(100)

mat = np.row_stack([sample1,sample2,sample3,sample4,sample5,sample6])

mat is the m x n matrix.

We are interested to find the pairwise similarity between these 6 flowers, which two are closer to each other and which two are of most distance. We need to do hierarchical clustering and group flowers together, in the end, we will have something like this:

dendrogram
dendrogram

You can see, this is a dendrogram, it tells you flower(2) and flower(3) are very similar, and the underlying relationship is clearly shown in the above plot. So next I will show you how to come to this dendrogram step by step.

First, we need to do hierarchical clustering, click the link to know more about that. But simply put, in each step, we group the most similar two to form a new meta-sample or cluster, then this meta-sample will be treated as a single sample as well in the next round, we still intend to find the most similar two groups in the second round. It is a "bottom-up" approach to cluster different observations.

from scipy.spatial.distance import pdist,squareform
dense_distance = pdist(mat,'euclidean')
square_distance = squareform(dense_distance)

We use pdist() function in scipy to compute pairwise distance, it will return a dense_distancelist, this is very difficult to understand at first glance regarding what each item in the list represents, so we first transform it into a square_distance matrix to build up some intuition.

Square_distance matrix
Square_distance matrix

Now let’s compare to dense_distance list:

print(dense_distance)
[13.76499829 14.86214799 14.67696826 14.49691937 13.35096827 15.18247523
 14.51538309 13.39787901 13.77284523 14.35555456 14.82429126 14.66308396
 13.73263062 13.82613097 13.45116341]

So, dense_distance basically stores the pairwise distance in this order: [(0,1),(0,2),(0,3),(0,4),(0,5),(1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5)], here (0,1) means pairwise distance between 0th observation and 1st observation, recall, python is 0-based index. If you look at the upper triangle of square_distance matrix, it is just read from left-to-right.

Now we have the dense_distance list, we can do hierarchical clustering:

from scipy.cluster.hierarchy import linkage
linkage_matrix = linkage(dense_distance,method='ward',metric='euclidean')

To understand more about method and metric parameters, you can either check out the documentation or some nice blogs. But what we will get is another matrix called linkage_matrix . Let’s try to understand that:

linkage_matrix
linkage_matrix

Please follow my narrative below, so you can understand what it means,

In the 0th iteration (first row), 0th observation (first column) and 5th observation (second column) is most similar based on method and metric we defined. The distance between them is 13.35097 (third column), so we combine them together to form a new cluster/observation 6th (or no.7 observation, python is a 0-based index), because we already have 6 observations. There are 2 observations /nodes (fourth column)in the newly formed cluster 6th (or no.7 observation) .

Then in the 1st iteration (second Row), the same process goes on again.

With that understood, we just need to know, this linkage_matrix is the input to draw dendrogram , so let’s plot it.

from scipy.cluster.hierarchy import linkage,dendrogram
linkage_matrix = linkage(dense_distance,method='ward',metric='euclidean')
fig,ax = plt.subplots()
dendrogram(linkage_matrix,ax=ax)
Now we get the dendrogram
Now we get the dendrogram

I personally found the documentation for making dendrogram not that clear, so I intend to use plain English to describe this process in more detail to convey a better intuition about that.


That’s it, hoping that could be a bit of help.

If you like these tutorials, follow me on medium, thank you so much for your support. Connect me on my Twitter or LinkedIn, also please ask me questions about which kind of figure you’d like to learn and how to draw them in a succinct fashion, I will respond!

All the codes are available at https://github.com/frankligy/python_visualization_tutorial

Continuing reading

Tutorial V: Plots in Seaborn (cluster heatmap, pair plot, dist plot, etc)


Related Articles