Data Visualisation | Seaborn

Heatmaps are a great tool for creating beautiful figures and can provide us with insights on trends and allow us to easily identify potential outliers within a dataset.
Within this tutorial, we are going to look at one of the uses for a heatmap – the correlation matrix heatmap. A correlation matrix allows us to identify how well, or not so well, features within a dataset correlate with each other as well as whether that correlation is positive or negative.
There are a number of data visualisation libraries available within Python, but one of the most popular and easy to use is the Seaborn library. With just a single function call and a dataset we can create a heatmap with ease.
Dataset
The dataset we are using for this tutorial is a subset of a training dataset used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). It is released under the NLOD 2.0 licence from the Norwegian Government, details of which can be found here: Norwegian Licence for Open Government Data (NLOD) 2.0.
The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.
The objective of the competition was to predict lithology from existing labelled data using well log measurements. The full dataset consists of 118 wells from the Norwegian Sea.
Video Tutorial on Seaborn Heatmaps
I have also released a video version of this tutorial on my YouTube channel which goes into more detail with another data example.
Importing the Libraries and Data
For this tutorial, we are going to import two libraries: pandas for loading our data and seaborn for visualising it.
import pandas as pd
import seaborn as sns
Once the libraries have been imported we can begin loading our data. This is done by using pd.read_csv() and passing in the relative location of the file and its name.
We then take a subset of the data. This allows us to look at the features that we are really interested in. In this case: Caliper (CALI), Bulk Density (RHOB), Gamma Ray (GR), Neutron Porosity (NPHI), Photoelectric Factor (PEF) and Acoustic Compressional Slowness (DTC).
well_data = pd.read_csv('data/Xeek_Well_15-9-15.csv')
well_data = well_data[['CALI', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC']]
Once the data has been loaded in, we can view the first five rows of the dataset by calling well_data.head().
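If you do not have the CSV file to hand, the same subsetting and inspection steps can be tried on a small stand-in dataframe. The sketch below is only an illustration: the values are randomly generated, the DEPTH_MD column is a hypothetical extra column, and the missing PEF value is added by hand to show how gaps can be counted with isna().sum().

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the well log CSV (all values are made up)
rng = np.random.default_rng(seed=0)
n_rows = 8
raw = pd.DataFrame({
    'DEPTH_MD': np.linspace(1000.0, 1001.4, n_rows),  # hypothetical extra column
    'CALI': rng.normal(12.0, 0.5, n_rows),
    'RHOB': rng.normal(2.3, 0.1, n_rows),
    'GR': rng.normal(70.0, 15.0, n_rows),
    'NPHI': rng.normal(0.3, 0.05, n_rows),
    'PEF': rng.normal(4.0, 0.5, n_rows),
    'DTC': rng.normal(110.0, 10.0, n_rows),
})
raw.loc[2, 'PEF'] = np.nan  # simulate a missing reading

# Keep only the features we want to correlate
features = ['CALI', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC']
subset = raw[features]

print(subset.head())        # first five rows
print(subset.isna().sum())  # number of missing values per column
```

Checking for missing values at this stage is optional, but it helps when interpreting the correlations later on.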

Creating the Correlation Heatmap
Now that the data has been successfully loaded in, we can begin creating our first heatmap.
We first need to create a correlation matrix. This is very easy to do by calling the .corr() method on our dataframe.
The correlation matrix provides us with an indication of how well (or not so well) each pair of features is correlated. Each value lies between -1 and +1: values close to either endpoint indicate a strong correlation, and values close to 0 indicate a weak one.
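To get a feel for these values, here is a minimal sketch using a made-up dataframe (the column names and numbers are purely illustrative and are not part of the well dataset). By default, .corr() computes the Pearson correlation coefficient, which measures linear relationships.

```python
import pandas as pd

# Toy data chosen so the correlations are easy to reason about
demo = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'up': [2.0, 4.0, 6.0, 8.0, 10.0],      # rises exactly in step with x
    'down': [10.0, 8.0, 6.0, 4.0, 2.0],    # falls exactly as x rises
    'mixed': [1.0, -1.0, 0.0, -1.0, 1.0],  # no linear relationship with x
})

demo_corr = demo.corr()  # Pearson correlation by default
print(demo_corr.round(2))
```

In the x row, the up column is perfectly positively correlated, the down column is perfectly negatively correlated, and the mixed column has a correlation of zero.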
We can then create the seaborn heatmap by calling sns.heatmap() and passing in the correlation matrix (corr).
corr = well_data.corr()
sns.heatmap(corr)
What we get back is our first heatmap.

However, it is not very practical or visually appealing. Even though we have a colour bar, the colours selected by default make it hard to understand what is highly correlated and what is not.
Change Seaborn Heatmap Color
We can easily change the colours for our heatmap by providing a palette for the cmap argument. You can find a full range of palettes here.
As we are interested in both the high and low values centred around 0, I would highly recommend using a diverging colour scheme.
sns.heatmap(corr, cmap='RdBu')
When we run this we get back the following heatmap. Values tending towards dark red are negatively correlated, and those tending towards dark blue are positively correlated. The lighter the color, the closer the value is to 0.

If we take a look at the colour bar on the right-hand side of the plot, we can see it starts at 1 at the top and goes down to around -0.8 at the bottom.
We can control this range so that it is symmetric about 0 by using the vmin and vmax arguments and setting them to -1 and +1 respectively.
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1)
The returned heatmap now has colour values that are balanced between positive and negative values.
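If you want to confirm that the palette and limits were applied, one option is to inspect the object that seaborn draws onto the axes. The sketch below uses a small randomly generated dataframe (columns A, B and C are made up) and the non-interactive Agg backend so it runs without a display. Reading the result from ax.collections[0] is a convenience for this example rather than something the tutorial relies on.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data with one clearly negative relationship (A vs C)
rng = np.random.default_rng(seed=1)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['A', 'B', 'C'])
demo['C'] = -demo['A'] + 0.3 * demo['C']
demo_corr = demo.corr()

fig, ax = plt.subplots()
sns.heatmap(demo_corr, cmap='RdBu', vmin=-1, vmax=1, ax=ax)

# The heatmap is the first collection drawn on the axes
mesh = ax.collections[0]
print(mesh.get_cmap().name)  # name of the colour map in use
print(mesh.get_clim())       # lower and upper colour limits
```

With vmin and vmax fixed at -1 and +1, a correlation of 0 always maps to the middle of the diverging palette, regardless of the range of values in the matrix.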

Adding Numbers to a Seaborn Heatmap
If we look up a particular cell’s colour on the colour bar, we may not get an accurate reading. However, if we add the numbers to the heatmap we can instantly see the values and still retain the variation in colour.
To add numbers to our heatmap we simply add annot=True.
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True)
What we get back is a nicely annotated heatmap.

One thing to note when we do this is that the colour of the text changes automatically based on the cell colour, so there is no need to specify any font colours.
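If the default number formatting is not what you want, sns.heatmap() also accepts a fmt argument that controls how each annotation is written. The sketch below is an optional extra using a made-up three-column dataframe: it formats every value to two decimal places and reads the annotation text back from the axes.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small illustrative dataset (values are made up)
demo = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [2.1, 3.9, 6.2, 7.8, 10.1],
    'C': [5.0, 3.0, 4.0, 1.0, 2.0],
})
demo_corr = demo.corr()

fig, ax = plt.subplots()
sns.heatmap(demo_corr, cmap='RdBu', vmin=-1, vmax=1,
            annot=True, fmt='.2f', ax=ax)  # two decimal places per cell

# Each annotated cell is stored as a text object on the axes
labels = [t.get_text() for t in ax.texts]
print(labels)
```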
Customizing the Annotation Font Properties on a Seaborn Heatmap
If we want to control the font size and font weight of our annotations, we can use the annot_kws argument and pass in a dictionary. In this example, I am changing the fontsize to 11 and setting the fontweight to bold.
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True,
            annot_kws={'fontsize':11, 'fontweight':'bold'})
This makes our numbers stand out better.

Making Heatmap Cells True Squares
This last example is a minor one, but if we want each of the cells to be square instead of rectangular, we can pass in the square argument and set it to True.
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True,
            annot_kws={'fontsize':11, 'fontweight':'bold'},
            square=True)
When we execute this code, we get back a nicely proportional heatmap.
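If you find yourself repeating these settings, one option is to wrap them in a small helper. The function name plot_corr_heatmap and the toy dataframe below are my own inventions for illustration. The sketch simply bundles the arguments covered above and returns the axes so the figure can be adjusted further or saved.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_corr_heatmap(df, ax=None):
    """Hypothetical helper: draw an annotated correlation heatmap of df."""
    if ax is None:
        _, ax = plt.subplots()
    corr = df.corr()
    sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True,
                annot_kws={'fontsize': 11, 'fontweight': 'bold'},
                square=True, ax=ax)
    return ax


# Toy data standing in for the well log features (values are made up)
demo = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [2.1, 3.9, 6.2, 7.8, 10.1],
    'C': [5.0, 3.0, 4.0, 1.0, 2.0],
})
ax = plot_corr_heatmap(demo)
first_label = ax.texts[0]  # annotation for one of the cells
print(first_label.get_fontsize(), first_label.get_fontweight())
```

From here, ax.figure.savefig() could be used to write the heatmap to an image file.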

Summary
Heatmaps are a great way to visually summarise tabular data. With a simple glance at the colours, we can easily identify trends and outliers within our dataset. Creating heatmaps within Python is very easy, especially if we use the Seaborn library.
References
Bormann, Peter, Aursand, Peder, Dilib, Fahad, Manral, Surrender, & Dischington, Peter. (2020). FORCE 2020 Well well log and lithofacies dataset for machine learning competition [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4351156