Seaborn Heatmap for Visualising Data Correlations

Visualise how features correlate with each other with a simple heatmap

Photo by DDP on Unsplash

Heatmaps are a great tool for creating clear, attractive figures. They can give us insight into trends and make it easy to spot potential outliers within a dataset.

Within this tutorial, we are going to look at one of the uses for a heatmap – the correlation matrix heatmap. A correlation matrix allows us to identify how well, or not so well, features within a dataset correlate with each other as well as whether that correlation is positive or negative.

There are a number of data visualisation libraries available within Python, but one of the most popular and easy to use is the Seaborn library. With just a single function call and a dataset we can create a heatmap with ease.

Dataset

The dataset we are using for this tutorial is a subset of a training dataset used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). It is released under an NLOD 2.0 licence from the Norwegian Government, details of which can be found here: Norwegian Licence for Open Government Data (NLOD) 2.0.

The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.

The objective of the competition was to predict lithology from existing labelled data using well log measurements. The full dataset consists of 118 wells from the Norwegian Sea.

Video Tutorial on Seaborn Heatmaps

I have also released a video version of this tutorial on my YouTube channel which goes into more detail with another data example.

Importing the Libraries and Data

For this tutorial, we are going to import two libraries: pandas for loading our data and seaborn for visualising it.

import pandas as pd
import seaborn as sns

Once the libraries have been imported we can begin loading our data. This is done by using pd.read_csv() and passing in the relative location of the file and its name.

We then take a subset of the data. This allows us to look at the features that we are really interested in. In this case: Caliper (CALI), Bulk Density (RHOB), Gamma Ray (GR), Neutron Porosity (NPHI), Photoelectric Factor (PEF) and Acoustic Compressional Slowness (DTC).

well_data = pd.read_csv('data/Xeek_Well_15-9-15.csv')
well_data = well_data[['CALI', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC']]

Once the data has been loaded, we can view the first five rows of the dataset by calling well_data.head().
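As a quick sanity check after loading, it can also help to confirm the column selection and look for missing values, since well logs often contain gaps. The sketch below uses a small made-up DataFrame in place of the CSV: the values and the extra DEPTH column are illustrative assumptions, and only the six log names come from this tutorial.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the real CSV; DEPTH is a hypothetical extra column.
raw = pd.DataFrame({
    'DEPTH': [1000.0, 1000.5, 1001.0, 1001.5, 1002.0, 1002.5],
    'CALI': [8.5, 8.6, 8.7, 8.6, 8.5, 8.9],
    'RHOB': [2.30, 2.35, np.nan, 2.40, 2.42, 2.38],
    'GR': [45.0, 60.0, 75.0, 80.0, 55.0, 90.0],
    'NPHI': [0.25, 0.22, 0.20, np.nan, 0.19, 0.21],
    'PEF': [3.1, 3.3, 3.0, 3.4, 3.6, 3.2],
    'DTC': [90.0, 88.0, 85.0, 84.0, 83.0, 86.0],
})

# Keep only the six logs of interest, as in the tutorial.
columns = ['CALI', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC']
sample = raw[columns]

print(sample.head())        # first five rows
print(sample.isna().sum())  # number of missing values per column
```

Missing values matter here because pandas leaves them out when it computes correlations, so columns with many gaps contribute fewer samples.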

Creating the Correlation Heatmap

Now that the data has been successfully loaded in, we can begin creating our first heatmap.

We first need to create a correlation matrix. This is very easy to do by calling upon the .corr() method on our dataframe.

The correlation matrix provides us with an indication of how well (or not so well) each pair of features is correlated. Each value lies between -1 and +1, with stronger correlations tending towards these endpoints and weaker correlations tending towards 0.
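By default, .corr() computes the Pearson correlation coefficient. To make the numbers in the matrix less of a black box, here is a minimal sketch with two made-up columns (x and y are illustrative names, not part of the well data) that reproduces one entry by hand with NumPy:

```python
import numpy as np
import pandas as pd

# Two made-up columns, used only to illustrate the calculation.
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 1.0, 4.0, 3.0, 5.0],
})

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
x_dev = df['x'] - df['x'].mean()
y_dev = df['y'] - df['y'].mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same value appears in the off-diagonal cell of the correlation matrix.
r_pandas = df.corr().loc['x', 'y']

print(r_manual, r_pandas)
```

For these values both approaches give a coefficient of 0.8, a fairly strong positive correlation.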

We can then create the seaborn heatmap by calling sns.heatmap() and passing in the correlation matrix (corr).

corr = well_data.corr()
sns.heatmap(corr)

What we get back is our first heatmap.

Initial heatmap generated using Seaborn and a correlation matrix. Image by the author.

However, it is not very practical or visually appealing. Even though we do have a colour bar, the colours selected by default make it hard to understand what is highly correlated and what is not.

Change Seaborn Heatmap Color

We can easily change the colours for our heatmap by providing a palette for the cmap argument. The full range of available palettes is listed in the seaborn and matplotlib documentation.

As we are interested in both the high and low values centred around 0, I would highly recommend using a divergent colour scheme.

sns.heatmap(corr, cmap='RdBu')

When we run this we get back the following heatmap. Values tending towards dark red are negatively correlated, and those tending towards dark blue are positively correlated. The lighter the color, the closer the value is to 0.

Seaborn heatmap for a correlation matrix after specifying a custom colourmap. Image by the author.

If we take a look at the colour bar on the right-hand side of the plot, we can see it starts at 1 at the top and goes down to around -0.8 at the bottom.

We can control this range so that it is equal by using the vmin and vmax arguments and setting them to -1 and +1 respectively.

sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1)

The returned heatmap now has colour values that are balanced between positive and negative values.

Seaborn heatmap for a correlation matrix after specifying the vmin and vmax values for the colormap. Image by the author.
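If we want to confirm the colour range without reading it off the figure, one option is to inspect the mesh that seaborn draws. This is a sketch on random, made-up data (the column names A to D are placeholders), and it selects a non-interactive matplotlib backend so that it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Random made-up data, only to have a correlation matrix to plot.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=['A', 'B', 'C', 'D'])
corr_demo = demo.corr()

ax = sns.heatmap(corr_demo, cmap='RdBu', vmin=-1, vmax=1)

# The heatmap cells are drawn as a single mesh; its colour limits
# are the limits shown on the colour bar.
print(ax.collections[0].get_clim())
```

With symmetric limits of -1 and +1, the white midpoint of the RdBu palette sits at a correlation of 0.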

Adding Numbers to a Seaborn Heatmap

If we look up a particular cell’s colour on the colour bar, we may not get an accurate reading. However, if we add the numbers to the heatmap we can instantly see the values and still retain the variation in colour.

To add numbers to our heatmap, we simply pass in annot=True.

sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True)

What we get back is a nicely annotated heatmap.

Seaborn heatmap after adding numbers to each cell. Image by the author.

One thing to note when we do this is the colour of the text will change automatically based on the cell colour. So there is no need to specify any font colours.
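The number format of the annotations can be controlled as well. As a short sketch, assuming we want two decimal places in every cell, we can pass a format string through the fmt parameter of sns.heatmap(). The data here is random and purely illustrative:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Random made-up data, only to have a correlation matrix to annotate.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['A', 'B', 'C'])
corr_demo = demo.corr()

ax = sns.heatmap(corr_demo, cmap='RdBu', vmin=-1, vmax=1, annot=True, fmt='.2f')

# Each annotation is a matplotlib Text object attached to the axes.
labels = [t.get_text() for t in ax.texts]
print(labels)
```

A fixed number of decimals keeps the cells visually aligned, which helps when comparing values across rows.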

Customizing the Annotation Font Properties on a Seaborn Heatmap

If we want to control the font size and font weight of our annotations, we can use the annot_kws parameter and pass in a dictionary. In this example, I am changing the fontsize to 11 and setting the fontweight to bold.

sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True, 
            annot_kws={'fontsize':11, 'fontweight':'bold'})

This makes our numbers stand out better.

Seaborn heatmap after changing the annotation properties. Image by the author.

Making Heatmap Cells True Squares

This last example is a minor one, but if we want each of the cells to be square instead of rectangular, we can pass in the square argument and set it to True.

sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True,
            annot_kws={'fontsize':11, 'fontweight':'bold'},
            square=True)

When we execute this code, we get back a nicely proportional heatmap.
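To reuse these settings across datasets, the options covered so far could be wrapped in a small function. The sketch below is one possible arrangement rather than a fixed recipe: plot_corr_heatmap is a hypothetical helper name, the figure size is an arbitrary choice, and random data stands in for well_data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_corr_heatmap(df, title=None, figsize=(8, 6)):
    """Hypothetical helper: draw an annotated correlation heatmap for df."""
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(df.corr(), cmap='RdBu', vmin=-1, vmax=1, annot=True,
                annot_kws={'fontsize': 11, 'fontweight': 'bold'},
                square=True, ax=ax)
    if title is not None:
        ax.set_title(title)
    return fig, ax


# Random made-up data with the tutorial's column names, standing in for well_data.
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(100, 6)),
                    columns=['CALI', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC'])

fig, ax = plot_corr_heatmap(demo, title='Correlation matrix (synthetic data)')
```

Returning the figure and axes leaves further styling, or saving with fig.savefig(), up to the caller.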

Summary

Heatmaps are a great way to visually summarise tabular data. With a simple glance at the colours, we can easily identify trends and outliers within our dataset. Creating heatmaps within Python is very easy, especially if we use the Seaborn library.

References

Bormann, Peter, Aursand, Peder, Dilib, Fahad, Manral, Surrender, & Dischington, Peter. (2020). FORCE 2020 Well well log and lithofacies dataset for machine learning competition [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4351156