Correlation Matrix, Demystified

Correlation matrix: what it is, how it is built, and what it is used for

Giuseppe Mastrandrea
Towards Data Science


It seems like a scatter plot, doesn’t it? More on this later. Photo by rovenimages.com

In the previous articles of this mini-series on statistical indices (which was initially designed based on my experience as a teacher at Datamasters.it) we studied variance, standard deviation, covariance, and correlation. In this article, we’ll focus on a data structure outlined in the last article that, when I started studying Machine Learning, blew my mind, not because it’s a hard concept to grasp, but because it made clear to me the power of Data Science and Machine Learning.

Where did we leave off? Correlation

The data structure I’m talking about is the mighty correlation matrix. Like many other Data Science concepts, it is a linear algebra concept that is easy to understand and even easier to use. Let’s do a quick recap on correlation: it’s an index that measures the linear relationship between two random variables X and Y. It is always a number between -1 and 1 (see the short NumPy sketch right after this list), where:

  • -1 means that the 2 variables have an inverse linear relationship: when X increases, Y decreases
  • 0 means no linear correlation between X and Y
  • 1 means that the 2 variables have a linear relationship: when X increases, Y increases too.
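
Here is a minimal sketch, with made-up numbers, of how you could compute this index in Python with NumPy (np.corrcoef returns a small correlation matrix of its inputs):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly 2*x, so we expect a correlation close to +1

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal element is corr(x, y)
print(np.corrcoef(x, y)[0, 1])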

Beware! Correlation does not imply causation. Even when the correlation between X and Y is close to 1, we cannot say that a change in X causes a change in Y. For example, consider two variables: “number of ice creams sold daily over one year” and “number of sunburns over one year”. These two variables will likely have a high correlation, but a change in one of them will not affect the other: both are driven by a common cause (sunny weather). High correlation, low causation. Now, back to the correlation matrix.

Correlation matrix

A correlation matrix is a square matrix (the number of rows equals the number of columns), symmetric (the matrix equals its transpose), with all the principal-diagonal elements equal to 1, and positive semidefinite (all its eigenvalues are non-negative). While the first 3 properties are simple to understand and to visualize, it’s worth spending a couple of words on the last condition, because not all square, symmetric matrices with a principal diagonal of 1s are positive semidefinite, and thus not all matrices that satisfy the first 3 requisites are correlation matrices. For example, the following matrix:

m = [
[1, 0.6, 0.9],
[0.6, 1, 0.9],
[0.9, 0.9, 1]
]

has one negative eigenvalue. You could find it with pen and paper, but why bother when we could make someone else do the math? We can use Python and numpy to get all the eigenvalues of m:

import numpy as np

m = [
[1, 0.6, 0.9],
[0.6, 1, 0.9],
[0.9, 0.9, 1]
]
eigenvalues = np.linalg.eig(m)  # returns (eigenvalues, eigenvectors)
print(eigenvalues[0])
Out: [ 2.60766968  0.4        -0.00766968]

The np.linalg.eig function takes a matrix as input (which in many programming languages can be represented as a list of lists, an array of arrays, or a vector of vectors) and returns a tuple with two elements:

  • The first one is the list of the eigenvalues of the matrix
  • The second one is a matrix whose columns are the normalized eigenvectors of the input matrix

The eigenvalues are the element with index [0] of the returned tuple. Some techniques exist to turn a matrix that is not positive semidefinite into one that is, but we’ll not get into this topic here. You can check this URL if you want to study more about this topic.
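
If you’re curious, a very naive illustration of one such technique is eigenvalue clipping: zero out the negative eigenvalues, rebuild the matrix, and rescale it so the diagonal is 1 again. This is only a rough sketch, not a rigorous “nearest correlation matrix” algorithm:

import numpy as np

def clip_to_psd(matrix):
    # Decompose the symmetric matrix, clip negative eigenvalues to zero,
    # rebuild it, then rescale so the principal diagonal is 1 again.
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    clipped = np.clip(eigenvalues, 0, None)
    rebuilt = eigenvectors @ np.diag(clipped) @ eigenvectors.T
    d = np.sqrt(np.diag(rebuilt))
    return rebuilt / np.outer(d, d)

m = np.array([[1, 0.6, 0.9], [0.6, 1, 0.9], [0.9, 0.9, 1]])
print(np.linalg.eigvalsh(clip_to_psd(m)))  # all eigenvalues are now non-negative (up to rounding)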

Building a correlation matrix

Let’s now try to understand how a correlation matrix is made, supposing it already has all the properties written earlier.

Let’s start from a dataset, also known as a “set of random variables”: if you prefer, a set of rows and columns, where each row represents a single observation and each column is a feature.

When I started reading this book to study ML, the first complete example of a predictive model (a simple linear regression, chapter 2) was trained on a dataset of California districts’ housing data. You can download it from here. When I first read what a linear regression is and when I studied the exploratory analysis part (where correlation and correlation matrices came in), my Doors of Perception quickly opened, as someone said. Yes, with no mescaline. We computer scientists need so little to trip. By the way: each row of the dataset represents a different California district, and each row has the following features (feature is a cool name for a “random variable”, or better: a variable you can compute some statistical indices on):

  • longitude
  • latitude
  • median house age
  • total number of rooms
  • total number of bedrooms
  • population number
  • households
  • median income
  • median house value
  • ocean proximity

This book is a real must for anyone who wants to study Machine Learning, even though it is not for total beginners, and it’s better if you have a basic data science background. All the code is available here, bookmark it.

We can say that our dataset has dimension n x 10, where n is the number of rows, i.e. the number of California districts.
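
If you want to check this shape yourself, here’s a quick sketch with pandas (we’ll walk through the loading code in detail in the next sections; the path datasets/housing.csv is just where I saved the file):

import pandas as pd

housing = pd.read_csv('datasets/housing.csv')
print(housing.shape)  # (number of districts, 10 features)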

Let’s build the correlation matrix for this dataset. The variables we’re going to compute correlations on are the 10 features of the dataset. Oh, well, in this dataset there’s one feature for which correlation just doesn’t make sense: we’re talking about the ocean_proximity feature, a categorical variable. “Categorical” means that the domain of the variable is a discrete set of values, not a continuous set of numbers. In particular, for this feature the only admitted values are:

{"<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"}

So computing the correlation (an index that measures the linear relationship between two continuous random variables) with this variable doesn’t make sense. We can just exclude it from the correlation matrix. Let’s start from scratch: our dataset has 10 features, but we’re leaving one of them out of the matrix, so our correlation matrix will be an initially empty 9x9 matrix:
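
With pandas, one quick way to leave the categorical column out is to keep only the numeric columns; a minimal sketch, assuming the housing DataFrame loaded above:

numeric_features = housing.select_dtypes(include='number')  # drops ocean_proximity, which is stored as text
print(len(numeric_features.columns))  # 9 features left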

An empty 9x9 matrix. Image by the author.

Let’s now fill our matrix with the actual correlations. Let me remind you that each element of a matrix has one row index and one column index that describe its position in the matrix. We start counting the rows and the columns from 0: this means that (for example) the lowest leftmost value has position 8, 0 (row 8, column 0). The rightmost element of the fourth row has position 3, 8 (row 3, column 8). The symmetry of the matrix tells us one more interesting thing: the element with position i, j equals the element with position j, i (the element in position 3, 8 equals the element in position 8, 3). To satisfy this property we must build the matrix so that a variable assigned to a certain row is assigned to the same column, too. For example, let’s start with the longitude feature and say that we want to use it at row 0. The symmetry condition imposes that we must use the longitude feature for column 0. Then let’s do the same with latitude: row 1, column 1. housing_median_age? Row 2, column 2, and so on, until we use all the dataset features and we get this empty matrix:

Labels for the correlation matrix. Image by the author.

Let’s try to read this matrix: the element with position 0, 5 (row 0, column 5) represents the correlation between longitude and population; for the symmetry property it equals the element with position 5, 0, which represents the correlation between population and longitude. The correlation between two variables X and Y equals the correlation between Y and X. Same story for the element with position 6, 7, the element holding the correlation between households and median_income, equal to the element with index 7, 6, the correlation between median_income and households.

Now consider an element from the principal diagonal of the matrix, for example, the one with position 4, 4: it would represent the correlation of `total_bedrooms` with itself. By definition, the correlation of a variable with itself is always 1. Of course, all the principal diagonal elements have this property: all the principal diagonal elements of a correlation matrix equal 1.
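
Jumping ahead to the pandas correlation matrix we’ll build in the next section, you can verify both properties directly; a quick sketch:

# Symmetry: corr(longitude, population) equals corr(population, longitude)
print(rounded_corr_matrix.loc['longitude', 'population'] == rounded_corr_matrix.loc['population', 'longitude'])  # True

# Unit diagonal: the correlation of total_bedrooms with itself
print(rounded_corr_matrix.loc['total_bedrooms', 'total_bedrooms'])  # 1.0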

Correlation matrix in Python, pandas, and seaborn

Now: to fill a correlation matrix with the actual values we should compute the correlation for each pair of variables. Boring. The proof is left as an exercise for the reader. We could use pandas instead:

import pandas as pd

housing = pd.read_csv('datasets/housing.csv')
# Note: on recent pandas versions you may need housing.corr(numeric_only=True)
# so that the non-numeric ocean_proximity column is skipped.
rounded_corr_matrix = housing.corr().round(2)
print(rounded_corr_matrix['median_income'])

After the aliased import (that’s the as pd part), we read the CSV file we downloaded earlier with the pandas function read_csv, which takes the path of the file as input, and we store the result of the reading in a variable called housing. The data type returned by read_csv is a DataFrame, the most important data type defined in pandas, which represents a set of data (did someone say “dataset”?). We can use many methods and functions on a DataFrame, and among them we have the corr() method; as the name implies, we can use it to get a correlation matrix from a dataset! We round the correlation values to the second decimal place with round(2) just because we want to work with a more readable matrix. In the last instruction, we print the correlation values between median_income and all the other features in the form of a pandas Series. It’s a data structure that resembles a regular array (i.e. we can access its values using a numerical index), but with superpowers. Plus, we can access one particular value by specifying a second index. For example:

rounded_corr_matrix['median_income']['housing_median_age']

will hold the correlation between median_income and housing_median_age. Handy, right? We can also print all the correlation values for the median_income feature in descending order, with the instruction:

rounded_corr_matrix["median_income"].sort_values(ascending=False)

The output would be:

median_income         1.00
median_house_value    0.69
total_rooms           0.20
households            0.01
population            0.00
total_bedrooms       -0.01
longitude            -0.02
latitude             -0.08
housing_median_age   -0.12
Name: median_income, dtype: float64

So, to get the entire dataset’s correlation matrix, the corr() method does the job. If we want to improve the way we visualize it, we can use seaborn’s heatmap function.

import seaborn as sns

heatmap = sns.heatmap(rounded_corr_matrix, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)

A heatmap is a data visualization tool in which a particular phenomenon is mapped to color scales. In our case, darker colors map lower values (with black mapping the correlation value -1), while lighter colors map higher values (with white mapping the correlation value +1). Seaborn’s heatmap function takes as its first parameter the two-dimensional data structure we’re going to create the heatmap from: the correlation matrix, in our case. We pass another parameter named annot: setting it to True writes the actual correlation values in the heatmap cells, which gives a more precise idea of what’s going on.
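
By default the color scale spans only the range of values actually present in the matrix; if you prefer it anchored to the full [-1, +1] interval, you can pass vmin, vmax, and a diverging colormap, as a small variation on the call above:

heatmap = sns.heatmap(rounded_corr_matrix, annot=True, vmin=-1, vmax=1, cmap='coolwarm')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12}, pad=12)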


Seaborn heatmap for the California housing dataset. Image by the author.

The usefulness of a heatmap, as we can see, lies in the immediacy of the interpretation of the visualized data. For example, after a quick glance it is evident there’s a high correlation between total_bedrooms and total_rooms (0.93, very close to 1), between total_rooms and population, and between total_bedrooms and households. It makes sense, doesn’t it? In contrast, we have a strongly negative correlation (close to -1) between latitude and longitude (hang on for a moment and try to visualize the shape of the state of California…). We cannot really say anything for values around 0 (e.g. median_income and population).

Thanks to pandas we can take a subset of our dataset features and print the related correlation matrix. To take a subset of our correlation matrix features, all we have to do is create a list with the feature names and use it with the brackets notation on the original matrix:

features = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
subset = rounded_corr_matrix[features].loc[features]
heatmap = sns.heatmap(subset, annot=True)

Note that if we simply access rounded_corr_matrix[features] we get a 9x4 matrix containing the correlation of the 4 selected features with all the other dataset features. We then use the loc pandas attribute, which allows us to select a subset of rows of that 9x4 data structure using their names rather than their numerical indices; these names are of course the feature names. We get a 4x4 structure on which we can use our heatmap. Here’s the result:

Heatmap for a subset of the dataset. Image by the author.
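
A quick way to see the difference between the two selections mentioned above is to print their shapes:

print(rounded_corr_matrix[features].shape)               # (9, 4): the 4 selected columns, all 9 rows
print(rounded_corr_matrix[features].loc[features].shape) # (4, 4): rows restricted to the same 4 features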

Scatter Matrix — Basics

Finally, we use the pandas function scatter_matrix, which provides a much more intuitive visualization of the correlation matrix. As its name implies, this matrix is not made of numbers, but of scatter plots (2D plots in which each axis is a dataset feature).

It’s useful for visualizing linear relationships between feature pairs (the same purpose as a classic correlation matrix, but from a visual point of view).

from pandas.plotting import scatter_matrix

features = ["total_rooms", "population", "households", "median_house_value"]
scatter_matrix(housing[features], figsize=(12, 8))

The output is:

Scatter matrix of a California housing dataset subset. Image by the author.

Notice one curious thing: we have histograms on the principal diagonal. In theory, we should find in these positions the correlations between the variables and themselves, but if we drew them we’d get just lines with equation y=x (we’d have the same values on both the x-axis and the y-axis). Rather than visualizing a 45-degree line, scatter_matrix shows us the histograms of these variables, just to give a quick idea of the distributions of the features. Looking at the other plots, for certain variable pairs (e.g. population/total_rooms, or households/population) there’s a clear positive correlation, in some cases very close to 1. In contrast, all the variables present a correlation value with median_house_value (the most interesting feature, should we design a machine learning predictive model) near 0, and the plots are very “sparse”.
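
As a side note, pandas’ scatter_matrix also accepts a diagonal parameter if you prefer smoothed density estimates instead of histograms on the diagonal:

scatter_matrix(housing[features], figsize=(12, 8), diagonal='kde')  # 'kde' instead of the default 'hist'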

Uses of correlation matrices

Now that we know how to build a correlation matrix, and after exploring other forms of data visualization in Python, we can ask ourselves what the actual uses of this data structure are. Usually, a correlation matrix is used in machine learning for exploratory and preliminary analysis, to make speculations about what kind of predictive models could be effective for a given task. For example, if our model should be a regression model (i.e. a model that predicts a continuous value) capable of predicting house prices, we could use a correlation matrix on the most interesting features. In such a scenario, the most relevant feature, no doubt, would be median_house_value, so a classic approach would be drawing a heatmap or a scatter matrix of the correlation between this feature and the features most correlated with it:

features = ["median_house_value", "total_rooms", "median_income"]scatter_matrix(housing[features], figsize=(12, 8))

We would find a quite clear correlation between median_income and median_house_value (the higher the median income, the higher the median house value… as always, it makes sense). Then we could try to build, train, and optimize a simple linear regression model. We wouldn’t get a very precise model, but it’s still a starting point, isn’t it?
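
As a rough illustration of that starting point (just a sketch, with none of the preprocessing, feature engineering, and validation a real pipeline would need), a single-feature linear regression with scikit-learn could look like this:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = housing[['median_income']]      # the feature most correlated with the target
y = housing['median_house_value']   # the value we want to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R² on held-out data: a crude but honest baseline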

Bonus track — California, here we come!

Earlier in the article, we asked what the strongly negative correlation between latitude and longitude could mean. For the sake of science, let’s draw a scatter plot of these two variables:

California, here we come! Image by the author

Hey, doesn’t it look like actual California? Yes, of course! The strongly negative correlation between latitude and longitude is due to California’s geographical shape, which resembles a line with a negative angular coefficient. Isn’t that funny?

Here’s the code to generate this scatter plot with pandas:

housing[['longitude', 'latitude']].plot.scatter(x="longitude", y="latitude")

Happy studying and coding!
