python-visualization-tutorial

This is the last tutorial in my Python visualization series:
- Tutorial I: Fig and Ax object
- Tutorial II: Line plot, legend, color
- Tutorial III: box plot, bar plot, scatter plot, histogram, heatmap, colormap
- Tutorial IV: violin plot, dendrogram
- Tutorial V: Plots in Seaborn (cluster heatmap, pair plot, dist plot, etc)
You don’t need to read all the previous posts, as this one is somewhat separate from the other articles. I am going to show you a head-to-head comparison between the Matplotlib library and the Seaborn library in Python.
As I alluded to in Tutorial I, I purposely used the most low-level visualization library, matplotlib, to demonstrate how to gain full control over, and understand each element of, every plot type. If you’ve had a chance to read my previous articles, I hope they helped you toward this goal (if not, don’t hesitate to let me know and I can post additional tutorials). In contrast to matplotlib, Seaborn is a high-level interface that makes our life much easier by freeing us from writing a lot of boilerplate code. It is a great package, and in this post I will provide guidance about when you should use Seaborn and, more importantly, use concrete examples to walk you through the process of using it.
When should I use Seaborn?
I made a table encompassing all Seaborn supported plot types:


For all the plot types I labeled as "hard to plot in matplotlib", for instance the violin plot we covered in Tutorial IV: violin plot and dendrogram, using Seaborn is a wise choice to shorten the time needed to make the plots. Some guidance:
- For the "easy to plot in matplotlib" types, you can use either tool and there won’t be much difference in terms of time and complexity. (If you don’t know how to do that in matplotlib, check my previous posts.)
- For the "hard to plot in matplotlib" types, I recommend using Seaborn in practice, but I also suggest that you at least understand how to draw these plots from scratch.
- For all figure types, Seaborn is the better choice if multiple categories are involved, for example when you need to draw a side-by-side box plot or violin plot.
Next, I will show you how to use Seaborn in those "hard to plot in matplotlib" scenarios. One thing that is the cornerstone of every plot we will cover is that Seaborn is designed for pandas DataFrames. So, as a safe bet and good practice, please always represent your data as a pandas DataFrame.
Long-formatted and wide-formatted data frames
To understand what long-formatted and wide-formatted data frames are, I prepared two datasets that we will use in the following examples.
The first is the penguin dataset, which is a long-formatted data frame:

It will become clearer why it is called a long-formatted data frame after I show you what a wide-formatted data frame is.
I created a synthetic dataset called synthesis, which is a wide-formatted data frame:

The wide-formatted data frame synthesis contains only three columns, all with numeric values. Each column corresponds to a category, and the category names are the column names var1, var2, and var3. Let’s transform it into a long-formatted data frame using the following code:
synthesis.stack().reset_index(-1)

As you can see, we stacked this data frame, squeezing its width and forcing it into a longer format. Each category is now listed sequentially in the first column, and its correspondence to the associated value doesn’t change. In a nutshell, long-formatted and wide-formatted data frames are just two different views of a data frame that contains data from different categories.
These two concepts will help us understand the logic and the different behaviors of the Seaborn functions.
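To make the two views concrete, here is a minimal sketch on a tiny made-up table. The data frame named wide is only a stand-in for synthesis (its values and size are my assumption), and pd.melt is shown as an alternative to the stack() call used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `synthesis`: three numeric columns, one per category.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(4, 3)), columns=["var1", "var2", "var3"])

# Option 1: stack the columns as in the post, then give the columns readable names.
long_stacked = wide.stack().reset_index(-1)
long_stacked.columns = ["category", "value"]

# Option 2: pd.melt carries the same information with explicit column names.
long_melted = wide.melt(var_name="category", value_name="value")

print(wide.shape)         # 4 rows, 3 columns
print(long_melted.shape)  # 12 rows, 2 columns
```

Both results hold one row per (category, value) pair; they differ only in row order and index.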
Distribution plot
Let’s use some examples to demonstrate the different distribution plots. See the code below:
import matplotlib.pyplot as plt
import seaborn as sns

fig,ax = plt.subplots()
sns.histplot(data=penguin,               # the long-formatted data frame
             kde=True,                   # draw the kernel density estimation line as well
             stat='frequency',           # the y-axis is the frequency
             x='bill_length_mm',         # the column of the data frame we want to visualize
             hue='species',              # the categorical column
             multiple='layer',           # how to lay out the data of different categories
             kde_kws={'bw_adjust':5},    # adjust the smoothness of the kde curve
             line_kws={'linewidth':7},   # adjust the aesthetic elements of the kde curve
             palette='Set2',             # choose the color map
             ax=ax)                      # the ax object we want to draw on
The code should be self-explanatory except for two points, stat and bw_adjust, but let’s first see the effect:

stat accepts four keywords:
- count: the number of observations in each bin
- frequency: the count of each bin divided by the width of the bin
- density: normalized so that sum(density of each bin * width of each bin) = 1
- probability: normalized so that sum(probability of each bin, i.e. the bar height) = 1
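The four normalizations can be checked by hand with NumPy. This is only a sketch of the arithmetic on made-up values with hand-picked bin edges, not Seaborn's internal code.

```python
import numpy as np

# Illustrative data; the bin edges are chosen by hand so the widths differ.
values = np.array([1.0, 1.5, 2.0, 2.5, 2.7, 3.2, 4.8, 5.1])
counts, edges = np.histogram(values, bins=[1, 2, 3, 6])
widths = np.diff(edges)

count = counts                               # raw number of points per bin
frequency = counts / widths                  # count divided by bin width
density = counts / (counts.sum() * widths)   # bar areas sum to 1
probability = counts / counts.sum()          # bar heights sum to 1

print((density * widths).sum(), probability.sum())
```

With unequal bin widths the difference between density and probability becomes visible: the last bin is three units wide, so its density is a third of its probability.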
The bw_adjust argument adjusts the smoothness of the KDE curve. Here we set it to a large value, so the curve should be quite smooth. You can change it to 0.2 and see how the curve changes.
Let’s see another example, this time with wide-format data.
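To see why a larger bw_adjust gives a smoother curve, here is a small hand-rolled Gaussian KDE. It is a sketch under my own assumptions: the helper gaussian_kde_curve is hypothetical, the bandwidth is Scott's rule multiplied by bw_adjust, and the data are synthetic. It is not Seaborn's implementation.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bw_adjust=1.0):
    """Evaluate a Gaussian KDE on grid; bandwidth = Scott's rule * bw_adjust."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5) * bw_adjust
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)  # average of one Gaussian bump per sample

rng = np.random.default_rng(0)
# Two clumps of points, so an under-smoothed curve shows two bumps.
samples = np.concatenate([rng.normal(40, 1, 100), rng.normal(50, 1, 100)])
grid = np.linspace(0, 90, 1801)

wiggly = gaussian_kde_curve(samples, grid, bw_adjust=0.2)
smooth = gaussian_kde_curve(samples, grid, bw_adjust=5)
print(wiggly.max(), smooth.max())
```

Both curves enclose an area of about 1, but the heavily smoothed one spreads that area out, so its peak is much lower and the two clumps blur into one.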
fig,ax = plt.subplots()
sns.histplot(data=synthesis,                     # wide-format data
             kde=True,
             stat='density',
             common_norm=False,                  # normalize each category on its own rather than all together
             hue_order=['var2','var1','var3'],   # the order in which categories are mapped to the colormap
             multiple='layer',
             ax=ax)
First, see the effect:

Here, we set stat to density, which means the sum of all bins * bin width will be 1. However, since we now have multiple categories, does that mean the total area under all three curves is 1, or that each category computes its own density? This choice is controlled by the common_norm argument: True means the former and False means the latter.
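The difference between the two settings comes down to which total you divide by. Below is a sketch of that arithmetic with NumPy on two made-up categories of unequal size; it mirrors the idea rather than Seaborn's actual code path.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical categories of unequal size.
groups = {"var1": rng.normal(0, 1, 300), "var2": rng.normal(3, 1, 100)}
edges = np.linspace(-5, 8, 27)
widths = np.diff(edges)
n_total = sum(len(v) for v in groups.values())

areas_shared, areas_own = {}, {}
for name, values in groups.items():
    counts, _ = np.histogram(values, bins=edges)
    shared = counts / (n_total * widths)      # idea behind common_norm=True
    own = counts / (counts.sum() * widths)    # idea behind common_norm=False
    areas_shared[name] = (shared * widths).sum()
    areas_own[name] = (own * widths).sum()

print(areas_shared)  # the areas add up to 1 across categories
print(areas_own)     # each category has area 1 on its own
```

With a shared normalization, the larger category also gets the larger area, which is useful when group sizes matter; separate normalization makes the shapes easier to compare.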
The difference with wide-format data is that we don’t have a separate column serving as hue; instead, every single column is automatically recognized as a category. Because of that, you don’t even need to specify the x argument; you just pass in the whole wide-format data frame.
What if you only want to draw the KDE curve? Easy!
sns.kdeplot(data=penguin,
            x='bill_length_mm',
            hue='species',
            clip=(35,100))   # the range within which the curve will be displayed

Here, the clip argument controls the range within which the KDE curve is displayed; in other words, any part of the KDE curve outside this range will not be shown. It is useful in some cases. For instance, if you know the x variable you plot is non-negative, you may want to focus only on the positive branch, and clip is a handy tool to achieve that.
Finally, if you want to add rugs to the distribution plot, just use the rugplot function:
sns.rugplot(data=penguin,x='bill_length_mm',hue='species')

You see, as long as you understand the long and wide format of the data frame, all the Seaborn plots become very easy to understand.
Categorical plot
In a strip plot, all the data points associated with each category are displayed side by side; see the example here:
A swarm plot goes one step further: it guarantees that the dots will not overlap with each other; see here:
Let’s draw a swarm plot using the penguin dataset:
sns.swarmplot(data=penguin,x='species',y='bill_length_mm',hue='sex',dodge=True)

dodge means the different hue levels will be separated instead of being clumped together. Again, as long as you understand the philosophy of how Seaborn draws, all the plot types are interconnected with each other, right?
Let’s try another one, the violin plot. Do you remember how hard it was to draw a violin plot from scratch in Tutorial IV: Violin plot and Dendrogram? Now let’s see how easy it is in Seaborn.
sns.violinplot(data=penguin,
               x='species',
               y='bill_length_mm',
               hue='sex',
               split=True,         # draw the two hue levels on opposite sides of each violin
               bw=0.2,             # the smoothness of the violin body
               inner='quartile',   # show the quartile lines
               scale_hue=True,     # scale within each x category instead of across all violins
               scale='count')      # scale the width of each violin
Let’s see the effect:

Explanations for the arguments:
scale: it accepts three values:
- area: the default, each violin will have the same area
- count: the width of each violin will be scaled by the number of observations in that category
- width: each violin will have the same width
scale_hue: when hue is used, it controls whether the above scaling is computed within each category on the x-axis (True) or across all violins in the plot (False).
inner: it accepts several values. With quartile, it draws the quartile lines; with box, it draws a miniature box plot like what we did in the matplotlib violin plot; with point or stick, it shows every individual observation.
Finally, let’s look at one more example together, the point plot. A point plot is useful for visualizing how the point estimate and confidence interval change across categories, and the lines joining the point estimates make the influence of the explanatory variable x on the response variable y clearer.
sns.pointplot(data=penguin,x='species',y='bill_length_mm',hue='sex')
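What the point plot summarizes can be sketched by hand: a mean per category plus a bootstrap confidence interval. The snippet below uses a made-up two-species table and a hypothetical helper mean_with_bootstrap_ci; it illustrates the idea under those assumptions and is not Seaborn's own code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical stand-in for two columns of the penguin data.
df = pd.DataFrame({
    "species": np.repeat(["A", "B"], 50),
    "bill_length_mm": np.concatenate([rng.normal(39, 3, 50), rng.normal(48, 3, 50)]),
})

def mean_with_bootstrap_ci(values, n_boot=1000, level=95):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    values = np.asarray(values)
    # Resample with replacement n_boot times and take the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high

summary = {
    name: mean_with_bootstrap_ci(group["bill_length_mm"])
    for name, group in df.groupby("species")
}
for name, (mean, low, high) in summary.items():
    print(f"{name}: mean={mean:.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

Each dot in the plot corresponds to a mean, and each vertical bar to an interval like the ones printed here.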

I will leave boxenplot for your own exploration. It is an enhanced version of the box plot in the sense that it displays not only the 25th, 50th, and 75th percentiles but also some additional quantiles, so it can provide much richer information for complicated data.
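The "additional quantiles" follow a simple halving pattern, often called letter values. Here is a minimal sketch on synthetic data; the number of depths is fixed at four purely for illustration, whereas the plotting function decides how many levels to draw from the data.

```python
import numpy as np

rng = np.random.default_rng(3)
values = rng.normal(size=10_000)  # illustrative data

# Depth 1 is the median; each further depth halves the tail probability:
# 25%/75%, then 12.5%/87.5%, then 6.25%/93.75%.
letter_values = {}
for depth in range(1, 5):
    tail = 0.5 ** depth
    letter_values[depth] = np.quantile(values, [tail, 1 - tail])

for depth, (low, high) in letter_values.items():
    print(depth, round(low, 2), round(high, 2))
```

Each deeper pair of quantiles reaches further into the tails, which is why the nested boxes reveal more about the tails than a standard box plot.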
Regression plot
Here I would like to draw your attention to one thing: unlike other plots, a regression plot is very easy to draw, but I recommend that you understand what is actually going on under the hood. In other words, how are these regression lines derived from the data, and how does each argument affect the resulting plot?
sns.regplot(data=penguin,x='bill_length_mm',y='bill_depth_mm')

Simply put, you can picture this process as fitting a line that gives the expected value of y at each x location, and then estimating a confidence interval around it by bootstrapping, which is shown as the translucent band on the graph. There are also additional arguments such as logistic, lowess, logx, and robust; for most of these, an underlying statsmodels function carries out the regression. Just keep that in mind and make sure you can describe how the regression line is derived, that’s it.
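The bootstrapping step can be sketched in a few lines of NumPy. The data below are synthetic with a trend I chose myself, and the code shows only the plain linear case: refit the line on resampled points many times and take percentiles of the fitted lines. Treat it as an illustration of the idea, not the function's actual internals.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data with a known linear trend plus noise.
x = rng.uniform(30, 60, 200)
y = 25 - 0.15 * x + rng.normal(0, 1, 200)

# Ordinary least-squares line on the full data.
slope, intercept = np.polyfit(x, y, deg=1)
grid = np.linspace(x.min(), x.max(), 50)
fitted = intercept + slope * grid

# Refit on resampled (x, y) pairs to see how much the line moves.
boot_lines = np.empty((1000, grid.size))
for i in range(1000):
    idx = rng.integers(0, x.size, x.size)
    b_slope, b_intercept = np.polyfit(x[idx], y[idx], deg=1)
    boot_lines[i] = b_intercept + b_slope * grid

# The 95% band is the pointwise 2.5th to 97.5th percentile of the refitted lines.
lower, upper = np.percentile(boot_lines, [2.5, 97.5], axis=0)
print(round(slope, 3), round(intercept, 2))
```

The band is narrowest near the middle of the x range and widens toward the ends, which is the same shape you see in the translucent region of the plot.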
Heatmap
You will soon find out how convenient the heatmap is in Seaborn. Let’s see:
sns.heatmap(data=synthesis.iloc[0:5,:],
annot=True, # superimpose the text
linewidths=0.5, # add grid line
square=True,
yticklabels=False)

My favorite parameters are annot and linewidths. If you still remember how we made a heatmap in Tutorial III: heatmap, it was a very painful process; now it is just one call, super cool!!
Another handy argument is mask, which allows you to mask off any squares that you don’t need or don’t want to see. Let’s first define a mask. Remember, the mask matrix has the same dimensions as the data matrix, and each position you want to mask receives a True or 1 value.
mask = np.array([[0,0,0],
[0,0,0],
[0,1,0],
[0,0,0],
[0,0,0]])
Then let’s plot:
sns.heatmap(data=synthesis.iloc[0:5,:],annot=True,linewidths=0.5,square=True,yticklabels=False,mask=mask)
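Typing the mask out by hand gets tedious for larger matrices. Here is a small sketch of two other ways to build one with NumPy; the values array and the threshold of 10 are made up for illustration.

```python
import numpy as np

# Start from an all-False mask with the same shape as the 5 x 3 slice plotted above,
# then switch on only the cells to hide.
mask = np.zeros((5, 3), dtype=bool)
mask[2, 1] = True  # hide row 2, column 1 (zero-based)

# A mask can also come from a rule on the data itself.
values = np.arange(15).reshape(5, 3)  # hypothetical data
rule_mask = values > 10               # hide every cell above the threshold

print(mask.astype(int))
print(rule_mask.sum())
```

The first mask is equivalent to the hand-written 0/1 array above; the second hides whichever cells satisfy the condition.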

Three plots that you definitely want to try out in Seaborn
- cluster heatmap
- pair plot
- joint plot
You might argue that Seaborn brings only modest convenience for the plots above, but for these three examples it could take you several hours to draw them without Seaborn’s convenient interface.
Let’s start with the cluster heatmap:
sns.clustermap(data=synthesis)

What’s really going on under the hood is much more complicated than this one line of code: you have to do hierarchical clustering, draw a heatmap, annotate it, and then draw the dendrograms (which we covered in Tutorial IV: dendrogram). Imagine how complicated that is.
This is arguably my favorite function in Seaborn, thanks a lot to the developers!
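To give a feel for the clustering step, here is a sketch that reorders the rows and columns of a small made-up matrix with SciPy's hierarchical clustering. The data are synthetic, and the choice of average linkage with Euclidean distance is my assumption for the illustration; the drawing of the heatmap and dendrograms is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(5)
# Hypothetical data: five rows near 0 and five rows near 10, then shuffled.
data = np.vstack([rng.normal(0, 1, (5, 3)), rng.normal(10, 1, (5, 3))])
labels = np.array([0] * 5 + [1] * 5)
shuffle = rng.permutation(10)
data, labels = data[shuffle], labels[shuffle]

# Cluster the rows, then the columns, and read off the dendrogram leaf order.
row_order = leaves_list(linkage(data, method="average", metric="euclidean"))
col_order = leaves_list(linkage(data.T, method="average", metric="euclidean"))

# Reordering the matrix this way is what groups similar rows together in the heatmap.
reordered = data[row_order][:, col_order]
print(labels[row_order])
```

After reordering, the rows from the same group sit next to each other, which is what produces the block structure in a cluster heatmap.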
There’s another cool feature I want to introduce: you can add an arbitrary number of row color bars and column color bars. I will demonstrate adding two layers of row color bars.
row_cb = pd.DataFrame(data=np.random.choice(['r','g','b','m'],(100,2)),
index=np.arange(100),
columns=['hey','ha'])
I basically create a data frame in which each row corresponds to one row of the main heatmap, and the two columns contain the colors you want to assign to each sample. Then we can add the row color bars to the main heatmap.
sns.clustermap(data=synthesis,row_colors=row_cb)
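In practice the colors usually come from sample annotations rather than random draws. The sketch below maps two made-up annotation columns to colors with pandas; the sample names, labels, and color choices are all hypothetical.

```python
import pandas as pd

# Hypothetical sample annotations, indexed the same way as the heatmap rows.
annotation = pd.DataFrame(
    {"group": ["ctrl", "ctrl", "treated", "treated"],
     "batch": ["b1", "b2", "b1", "b2"]},
    index=["s1", "s2", "s3", "s4"],
)

# One label-to-color dictionary per annotation column.
palettes = {
    "group": {"ctrl": "b", "treated": "r"},
    "batch": {"b1": "g", "b2": "m"},
}

# Replace every label with its color, column by column.
row_colors_df = annotation.apply(lambda col: col.map(palettes[col.name]))
print(row_colors_df)
```

A data frame like row_colors_df has the same layout as the row_cb frame above, with an index that matches the rows of the data being clustered.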

Very nicely done!!!
Let’s move on to the next exciting plot function, pairplot.
sns.pairplot(data=penguin.iloc[:,[2,3,4,5]],dropna=True)
We use only the four numeric columns of the penguin dataset, and we drop the observations that contain NaN.
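Selecting columns by position breaks if the column order changes. Here is a sketch of selecting the numeric columns by dtype instead, on a made-up miniature of the penguin table (the values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the penguin table.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.1, 50.0],
    "bill_depth_mm": [18.7, 17.4, 13.2, 16.3],
})

# Keep only the numeric columns, then drop rows that contain missing values.
numeric = df.select_dtypes(include="number").dropna()
print(numeric)
```

The resulting frame contains only the numeric columns and complete rows, so it could be passed as the data argument in place of the positional selection.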

You can change the diagonal plot kind and non-diagonal plot type in the function, but I will leave it for you to explore. Just check out its documentation.
Last but not least, we have jointplot.
sns.jointplot(data=penguin,
x='bill_length_mm',
y='bill_depth_mm',
kind='scatter')

You can add a regression line and KDE curves to it as well:
sns.jointplot(data=penguin,
x='bill_length_mm',
y='bill_depth_mm',
kind='reg')
Just change kind to reg, and you are done!

Final remark
This brings us to the end of this tutorial. We started with tedious and boring low-level coding in matplotlib, and we also experienced the excitement Seaborn can bring. But I bet that without the suffering at the beginning, we wouldn’t be able to feel the joy at the end. Going back to my initial wish, I was trying to share how I make publication-quality figures in Python. A publication-quality figure should be (a) concise and (b) decent-looking, and you should be able to tweak any element of the graph as you want. I like the metaphor that drawing a publication-quality figure is like drawing on an entirely blank sheet of paper; that’s why I started the series by showing you how to understand your canvas and then went all the way down to each specific plot type. I hope it is useful in some way, and please don’t hesitate to ask me questions.
If you like these tutorials, follow me on Medium; thank you so much for your support. Connect with me on Twitter or LinkedIn, and please ask me about which kinds of figures you’d like to learn and how to draw them succinctly. I will respond!
All the codes are available at https://github.com/frankligy/python_visualization_tutorial