
All you need to know about Seaborn

When should I use Seaborn versus matplotlib, and how do I use it?


Photo by Donald Giannatti on Unsplash

This is the last tutorial in my Python visualization series:

  1. Tutorial I: Fig and Ax object
  2. Tutorial II: Line plot, legend, color
  3. Tutorial III: box plot, bar plot, scatter plot, histogram, heatmap, colormap
  4. Tutorial IV: violin plot, dendrogram
  5. Tutorial V: Plots in Seaborn (cluster heatmap, pair plot, dist plot, etc)

You don’t need to read all the previous posts; this one stands somewhat apart from the last four articles. I am going to show you a head-to-head comparison between the Matplotlib and Seaborn libraries in Python.

As I mentioned in Tutorial I, I purposely used the most low-level visualization library, matplotlib, to demonstrate how to gain full control over, and understand, each element of every plot type. If you’ve had a chance to read my previous articles, I hope they helped you toward this goal (if not, don’t hesitate to let me know and I can post additional tutorials). In contrast to matplotlib, Seaborn is a high-level interface that makes our life much easier by freeing us from writing a lot of boilerplate code. It is a great package, and in this post I will provide guidance about when you should use Seaborn and, more importantly, use concrete examples to walk you through how to use it.


When should I use Seaborn?

I made a table covering all the plot types Seaborn supports:

All Seaborn-supported plot types

For all the plot types I labeled as "hard to plot in matplotlib" (for instance, the violin plot we just covered in Tutorial IV: violin plot and dendrogram), Seaborn is a wise choice to shorten the time needed to make the plot. I outline some guidance below:

  1. For the "easy to plot in matplotlib" types, you can use either tool and there won’t be much difference in terms of time and complexity. (If you don’t know how to do that in matplotlib, check my previous posts.)
  2. For the "hard to plot in matplotlib" types, I recommend using Seaborn in practice, but I also suggest you at least understand how to draw these plots from scratch.
  3. For all figure types, Seaborn is the better choice if multiple categories are involved, for example, when you need to draw a side-by-side box plot or violin plot.

Next, I will show you how to use Seaborn in those "hard to plot in matplotlib" scenarios. The cornerstone for every plot we will cover is that Seaborn is designed around the pandas DataFrame. So as a safe bet and good practice, please always represent your data as a pandas DataFrame.

Long-formatted and wide-formatted data frames

To understand what long-formatted and wide-formatted data frames are, I prepared two datasets that we will use in the following examples.

The first is the penguin dataset, which is a long-formatted data frame:

Penguin, long-formatted data frame

It will become clearer why it is called a long-formatted data frame after I show you what a wide-formatted data frame is:

I created a synthetic dataset called synthesis, which is a wide-formatted data frame:

Synthesis: a wide-formatted data Frame

The wide-formatted data frame synthesis contains only three columns, all with numeric values. Each column corresponds to a category, and the category names are the column names var1, var2, and var3. Let’s transform it into a long-formatted data frame using the following code:

synthesis.stack().reset_index(-1)
Synthesis: transforming to long-formatted

As you can see, stacking the data frame squeezes its width and forces it into a longer format: the categories are now listed sequentially in the first column, and each one keeps its associated value. In a nutshell, long-formatted and wide-formatted data frames are just two different views of the same data containing multiple categories.
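
To make the two layouts concrete, here is a minimal sketch on a tiny made-up frame. The values and the names wide, long_stack, and long_melt are illustrative assumptions, not the synthesis dataset itself. It shows the stack approach next to DataFrame.melt, which produces the same long layout while naming the new columns directly.

```python
import pandas as pd

# Hypothetical 3-row wide frame: one column per category
wide = pd.DataFrame({'var1': [1.0, 2.0, 3.0],
                     'var2': [4.0, 5.0, 6.0],
                     'var3': [7.0, 8.0, 9.0]})

# Stack, then move the category level out of the index into a column
long_stack = wide.stack().reset_index(-1)
long_stack.columns = ['category', 'value']

# Alternative: melt names the two new columns directly
long_melt = wide.melt(var_name='category', value_name='value')

print(long_stack.shape)   # 9 rows, 2 columns
print(long_melt.head())
```

Either result can then be handed to Seaborn with x='value' and hue='category', just like the penguin data frame.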

These two concepts will help us understand the logic and the different behaviors of the Seaborn functions.

Distribution plot

We will use these examples to demonstrate the different distribution plots. See the code below:

import matplotlib.pyplot as plt
import seaborn as sns

fig,ax = plt.subplots()
sns.histplot(data=penguin,  # the long-formatted data frame
    kde=True,     # draw the kernel density estimate line as well
    stat='frequency',  # the y-axis is the frequency
    x='bill_length_mm',  # the column of the data frame we want to visualize
    hue='species',    # the categorical column
    multiple='layer',   # how to lay out the data of different categories
    kde_kws={'bw_adjust':5},   # adjust the smoothness of the KDE curve
    line_kws={'linewidth':7}, # adjust the aesthetics of the KDE curve
    palette='Set2',    # choose the color map
    ax=ax)   # the ax object we want to draw on

The code should be self-explanatory except for two points, stat and bw_adjust, but let’s first see the effect:

Long-format distribution plot

stat accepts four keywords here (recent Seaborn versions add a few more, such as percent):

count : the number of observations in each bin

frequency : the count in each bin divided by the width of the bin

density : normalized so that the total area of the bars is 1, i.e. sum(density of each bin * width of each bin) = 1

probability : normalized so that the bar heights sum to 1, i.e. sum(probability of each bin) = 1

The bw_adjust argument controls how smooth the KDE curve is. Here we set it to a large value, so the curve is quite smooth. You can change it to 0.2 and see how it changes.
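
If the four stat normalizations feel abstract, you can check them by hand. The sketch below uses plain NumPy on a made-up sample with hand-picked bin edges (both are assumptions for illustration). It mirrors the definitions above and is not Seaborn’s internal code.

```python
import numpy as np

# Hypothetical sample; the bin edges are chosen by hand for illustration
values = np.array([1.0, 1.5, 2.0, 2.5, 2.7, 3.1, 4.8, 5.0])
edges = np.array([1.0, 2.0, 3.0, 5.0])   # bins of width 1, 1, 2

count, _ = np.histogram(values, bins=edges)
width = np.diff(edges)

frequency = count / width                  # count per unit of x
density = count / (count.sum() * width)    # bar areas sum to 1
probability = count / count.sum()          # bar heights sum to 1

print(count, frequency, density, probability)
```

Notice that frequency and density only differ from count and probability when the bins have unequal widths, as the last bin does here.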

Let’s see another example of wide-format.

fig,ax = plt.subplots()
sns.histplot(data=synthesis,   # wide-format data
    kde=True,
    stat='density',
    common_norm=False,   # normalize each category separately instead of all together
    hue_order=['var2','var1','var3'],  # the order in which categories are mapped to colors
    multiple='layer',
    ax=ax)

First, see the effect:

wide format (synthesis dataset) : distribution plot

Here, we set stat to density, which means the sum of bin height * bin width will be 1. However, now that we have multiple categories, should the areas under all three histograms sum to 1 together, or should each category be normalized to its own area of 1? The common_norm argument settles this: True means the former and False means the latter.
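
The difference between the two settings is easy to verify numerically. This is a rough NumPy sketch with two made-up categories of different sizes (the data, the group names a and b, and the bin count are all assumptions). It imitates the two normalizations rather than calling Seaborn.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical categories with different sample sizes
groups = {'a': rng.normal(0, 1, 300), 'b': rng.normal(3, 1, 100)}

all_values = np.concatenate(list(groups.values()))
edges = np.histogram_bin_edges(all_values, bins=20)   # shared bins
width = np.diff(edges)

area_own, area_shared = {}, {}
for name, v in groups.items():
    count, _ = np.histogram(v, bins=edges)
    own = count / (v.size * width)               # like common_norm=False
    shared = count / (all_values.size * width)   # like common_norm=True
    area_own[name] = (own * width).sum()
    area_shared[name] = (shared * width).sum()

print(area_own)      # each category has area 1 (up to rounding)
print(area_shared)   # areas are 0.75 and 0.25, summing to 1
```

With a shared normalization, the larger category visibly dominates the plot; with separate normalization, the shapes are directly comparable.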

The difference with the wide format is that we don’t have a separate column serving as hue; instead, every column is automatically recognized as a category. Because of that, you don’t even need to specify the x argument; you just pass in the whole wide-format data frame.

What if you only want to draw the KDE curve? Easy!

sns.kdeplot(data=penguin,
    x='bill_length_mm',
    hue='species',
    clip=(35,100))  # the range within which the curve will be displayed
KDE plot

Here, the clip argument controls the range within which the KDE curve is displayed; in other words, any part of the KDE curve outside this range is not shown. It is useful in some cases: for instance, if you know the x variable is non-negative and you want to focus on the positive branch, clip is a handy tool to achieve that.

Finally, if you want to add rugs to the distribution plot, just use the rugplot function:

sns.rugplot(data=penguin,x='bill_length_mm',hue='species')
KDE curve plus rugs at x-axis

You see, as long as you understand the long and wide format of the data frame, all the Seaborn plots become very easy to understand.

Categorical plot

A strip plot displays all the data points associated with each category side by side.

A swarm plot goes one step further: it guarantees that those dots will not overlap with each other.

Let’s draw a swarm plot using the penguin dataset:

sns.swarmplot(data=penguin,x='species',y='bill_length_mm',hue='sex',dodge=True)
swarm plot

dodge means the different hue levels will be separated instead of being clumped together. Still, as long as you understand the philosophy of how Seaborn draws, all the plot types are interconnected, right?

Let’s try another one, the violin plot. Do you remember how hard it was to draw a violin plot from scratch in Tutorial IV: Violin plot and Dendrogram? Now let’s see how easy it is in Seaborn.

sns.violinplot(data=penguin,
    x='species',
    y='bill_length_mm',
    hue='sex',
    split=True,   # draw the two hue levels on opposite sides of each violin
    bw=0.2,    # the smoothness of the violin body
    inner='quartile',   # show the quartile lines
    scale_hue=True,    # scale within each x category instead of across all violins
    scale='count')   # scale the width of each violin by its number of observations
# Note: Seaborn 0.13+ deprecates bw, scale, and scale_hue in favor of bw_method, density_norm, and common_norm

Let’s see the effect:

violin plot

Explanations for the arguments:

scale : it accepts three values:

area : the default, each violin has the same area

count : the width of each violin is scaled by the number of observations it contains

width : each violin has the same width

scale_hue : when violins are nested by hue, it controls whether the above scaling is computed within each category on the x-axis (True) or across all violins in the plot (False).

inner : it accepts several values. With quartile, it draws the quartile lines; with box, it draws a miniature box plot like what we did in the matplotlib violin plot; with point or stick, it basically displays every observation.

Finally, let’s look at one more example together, the point plot. A point plot is useful for visualizing how the point estimate and its confidence interval change across categories, and the lines joining the point estimates make the influence of the explanatory variable x on the response variable y clearer.

sns.pointplot(data=penguin,x='species',y='bill_length_mm',hue='sex')
point plot
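
To see what a point estimate with a confidence interval means for a single category, here is a minimal NumPy sketch. It assumes the estimate is the mean and the interval is a 95% bootstrap interval (to my knowledge these match the pointplot defaults, but treat that as an assumption), and the sample itself is made up.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical measurements for a single category
sample = rng.normal(loc=45.0, scale=5.0, size=80)

# Point estimate: the mean of the observed sample
estimate = sample.mean()

# Bootstrap: resample with replacement and recompute the mean many times
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])
# 95% interval: the 2.5th and 97.5th percentiles of the bootstrap means
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={estimate:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

In the point plot, the dot corresponds to estimate and the vertical bar to the range from low to high, repeated for every category and hue level.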

I will leave boxenplot for your own exploration, but it is just an enhanced version of the box plot in the sense that it not only displays the 25th, 50th, and 75th percentiles but also shows some additional quantiles, hence it can provide much richer information for complicated data.
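
To give a feeling for what "additional quantiles" means, the sketch below computes nested quantile pairs that reach progressively further into the tails. The sample and the choice of four levels are assumptions for illustration; Seaborn decides by itself how many levels to draw.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)   # hypothetical data

# Tail fractions 1/2, 1/4, 1/8, 1/16 give the median, quartiles,
# eighths, and sixteenths: each level reaches further into the tails
levels = []
for depth in range(1, 5):
    tail = 0.5 ** depth
    lo, hi = np.quantile(sample, [tail, 1 - tail])
    levels.append((lo, hi))
    print(f"tail 1/{2 ** depth}: [{lo:.2f}, {hi:.2f}]")
```

Each pair after the first would correspond to one box in the plot, with wider boxes near the median and narrower ones toward the tails.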

Regression plot

Here I would like to draw your attention to one thing: unlike other plots, a regression plot is very easy to draw, but I recommend you understand what is actually going on under the hood. In other words, how are these regression lines derived from the data, and how does each argument affect the resulting plot?

sns.regplot(data=penguin,x='bill_length_mm',y='bill_depth_mm')
Regression plot

Simply put, by default regplot fits a linear regression of y on x, then estimates a confidence interval for the fitted line by bootstrapping and draws it as the translucent band on the graph. There are additional arguments like logistic, lowess, logx, robust, etc., and each one switches to a different underlying model (several of them rely on statsmodels) to derive the regression line. Just keep that in mind and make sure you can describe how the regression line is derived, that’s it.
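
Here is a rough sketch of how such a line and band can be derived by hand: an ordinary least-squares fit with np.polyfit plus a bootstrap over the (x, y) pairs. The data, the 500 resamples, and the 50-point grid are assumptions; this illustrates the idea and is not Seaborn’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data with a linear trend plus noise
x = rng.uniform(30, 60, 120)
y = 25 - 0.15 * x + rng.normal(0, 1.5, 120)

grid = np.linspace(x.min(), x.max(), 50)
slope, intercept = np.polyfit(x, y, deg=1)     # ordinary least squares
fit = intercept + slope * grid

# Bootstrap the (x, y) pairs and refit to get a band around the line
boot_fits = np.empty((500, grid.size))
for i in range(500):
    idx = rng.integers(0, x.size, x.size)      # resample with replacement
    s, b = np.polyfit(x[idx], y[idx], deg=1)
    boot_fits[i] = b + s * grid
lower, upper = np.percentile(boot_fits, [2.5, 97.5], axis=0)
```

Plotting fit as a line and filling between lower and upper (for example with ax.fill_between) gives a picture very similar to the regression plot.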

Heatmap

Well, you will soon find out how convenient heatmaps are in Seaborn. Let’s see:

sns.heatmap(data=synthesis.iloc[0:5,:],
    annot=True,   # superimpose the text
    linewidths=0.5,   # add grid lines
    square=True,
    yticklabels=False)
Heatmap

My favorite parameters are annot and linewidths. If you still remember how we made a heatmap in Tutorial III: heatmap, it was a very painful process; now it is just one function call, super cool!!

Another handy argument is mask, which lets you hide any squares that you don’t need or don’t want to see. Let’s first define a mask. Remember, the mask matrix has the same dimensions as the data matrix, and each position you want to hide receives a True or 1 value.

mask = np.array([[0,0,0],
         [0,0,0],
         [0,1,0],
         [0,0,0],
         [0,0,0]])

Then let’s plot:

sns.heatmap(data=synthesis.iloc[0:5,:],annot=True,linewidths=0.5,square=True,yticklabels=False,mask=mask)
Heatmap with mask
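
Writing the mask by hand gets tedious for bigger matrices. A common pattern, sketched here on a made-up correlation matrix (the data and the names corr and mask are assumptions), is to build the mask programmatically, for example to hide the redundant upper triangle of a symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))          # hypothetical 50 x 4 data matrix
corr = np.corrcoef(data, rowvar=False)   # 4 x 4 symmetric correlation matrix

# True marks the cells to hide: here the strict upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

# The mask could then be passed on, e.g. sns.heatmap(corr, mask=mask)
print(mask.astype(int))
```

Using k=1 keeps the diagonal visible; with k=0 the diagonal would be hidden as well.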

Three plots that you definitely want to try out in Seaborn

  1. cluster heatmap
  2. pair plot
  3. joint plot

If you argue that Seaborn brings only modest convenience in the plots above, consider these three examples: it may take you several hours to draw them without Seaborn’s convenient interface.

Let’s start with the cluster heatmap:

sns.clustermap(data=synthesis)
cluster heatmap

What’s really going on under the hood is much more complicated than this one-line code: you have to do hierarchical clustering, then draw a heatmap, annotate it, and draw the dendrograms (which we just covered in Tutorial IV: dendrogram). Imagine how complicated it is.
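
The clustering and reordering part of that process can be sketched with SciPy. This is only an approximation of what happens behind the one-liner: it assumes average linkage with Euclidean distance and uses a made-up matrix, and it leaves out the drawing of the heatmap and dendrograms.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
matrix = rng.normal(size=(10, 3))   # hypothetical 10 x 3 data matrix

# Cluster rows and columns separately (average linkage, Euclidean distance)
row_link = linkage(matrix, method='average', metric='euclidean')
col_link = linkage(matrix.T, method='average', metric='euclidean')

# The leaf order of each dendrogram gives the new row/column order
row_order = leaves_list(row_link)
col_order = leaves_list(col_link)
reordered = matrix[np.ix_(row_order, col_order)]
```

The reordered matrix is what ends up in the heatmap, so that similar rows and similar columns sit next to each other.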

This is arguably my favorite function in Seaborn, thanks a lot to the developers!

There’s another cool feature I will introduce, which is adding an arbitrary number of row color bars and column color bars. I will demonstrate adding two layers of row color bars.

row_cb = pd.DataFrame(data=np.random.choice(['r','g','b','m'],(100,2)),
    index=np.arange(100),
    columns=['hey','ha'])

I basically create a data frame, each row of this data frame corresponds to one row in the main heatmap, and the two columns of this data frame contain the colors you want to assign to each sample. Then we can add the row color bar to the main heatmap.

sns.clustermap(data=synthesis,row_colors=row_cb)
cluster heatmap with row color bars
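
In real projects the colors usually encode sample annotations rather than random choices. Below is a minimal sketch of that mapping; the annotation columns group and batch, their labels, and the color choices are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical sample annotations for 100 rows
meta = pd.DataFrame({'group': rng.choice(['ctrl', 'treated'], 100),
                     'batch': rng.choice(['b1', 'b2', 'b3'], 100)},
                    index=np.arange(100))

# One color lookup per annotation column
group_colors = {'ctrl': 'g', 'treated': 'r'}
batch_colors = {'b1': 'b', 'b2': 'm', 'b3': 'c'}

row_cb = pd.DataFrame({'group': meta['group'].map(group_colors),
                       'batch': meta['batch'].map(batch_colors)})
```

The resulting row_cb has the same shape as the random one above and can be passed to row_colors in the same way, as long as its index matches the index of the data.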

Very nicely done!!!

Let’s move on to the next exciting plot function, pairplot .

sns.pairplot(data=penguin.iloc[:,[2,3,4,5]],dropna=True)

Here we only use the four numeric columns of the penguin dataset, and we drop the observations that contain NaN.

pairplot
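
Selecting columns by position with iloc works, but it breaks silently if the column order changes. As an alternative sketch, you can select by dtype; the tiny frame below only imitates the shape of the penguin data and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the penguin data: labels plus measurements
df = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
                   'bill_length_mm': [39.1, 47.5, np.nan, 40.3],
                   'bill_depth_mm': [18.7, 14.2, 17.9, 18.0],
                   'sex': ['Male', 'Female', 'Female', None]})

numeric = df.select_dtypes(include='number')   # keep numeric columns only
complete = numeric.dropna()                    # drop rows containing NaN

print(complete)
```

The numeric frame could then be passed to pairplot in place of the positional selection.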

You can change the diagonal plot kind and non-diagonal plot type in the function, but I will leave it for you to explore. Just check out its documentation.

Last but not least, we have jointplot .

sns.jointplot(data=penguin,
    x='bill_length_mm',
    y='bill_depth_mm',
    kind='scatter')
Joint plot (default kind)

You can add a regression line and KDE curves onto that as well:

sns.jointplot(data=penguin,
    x='bill_length_mm',
    y='bill_depth_mm',
    kind='reg')

Just change kind to reg, and you are done!

Joint plot (with regression and KDE curve)

Final remark

This brings us to the end of this tutorial. We started with tedious and boring low-level coding in matplotlib, and we also experienced the excitement Seaborn can bring us. But I bet that without the suffering at the beginning, we wouldn’t be able to feel the joy at the end. Going back to my initial wish, I was trying to share how I make publication-quality figures in Python. The key to publication-quality figures is that they are (a) concise and (b) good-looking. Also, you should be able to tweak any element on the graph as you want. I like the metaphor that drawing a publication-quality figure is like drawing on an entirely blank sheet of paper; that’s why I started my tutorials by showing you how to understand your canvas and then went all the way down to each specific plot type. I hope it is useful in some way, and please don’t hesitate to ask me questions.

If you like these tutorials, follow me on Medium; thank you so much for your support. Connect with me on Twitter or LinkedIn, and please tell me which kinds of figures you’d like to learn and how to draw them succinctly. I will respond!

All the code is available at https://github.com/frankligy/python_visualization_tutorial

