
Spearman coefficient: Tool for a generalized correlation analysis

Linear relationships are not all a correlation analysis can reveal. We discuss rank-based correlation that is more generalized and…

Beyond linear relationship

It’s a common Data Science interview question.

Say y = x² + noise, with x spread symmetrically around zero. You plot y vs. x and try to determine the correlation coefficient. What will it be?

Most interviewers won’t give you credit if you say anything other than zero. The word correlation, in most circumstances, evokes the concept of linearity or a linear relationship between variables.
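A quick numerical check makes the point concrete. The snippet below is a minimal sketch under assumed settings that are not part of the original question: the range of x, the unit-variance Gaussian noise, and the random seed are all chosen purely for illustration.

```python
import numpy as np
from scipy import stats

# Assumed setup: x symmetric around zero, Gaussian noise, fixed seed
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 201)
y = x**2 + rng.normal(scale=1.0, size=x.size)

# pearsonr returns (coefficient, p-value); we only need the coefficient
r, _ = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}")  # close to zero despite the obvious dependence
```

The symmetry of x matters here. If x covered only positive values, y = x² would be monotonic over that range and the linear correlation would be strongly positive.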

Here is a visual illustration of what I mean.

Image source: Created by the author

This notion of linearity comes from the definition of the correlation coefficient most widely used in statistical analysis and data science: Pearson’s coefficient.

It is named after the famous English statistician Karl Pearson and is given by,

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where cov(X,Y) is the covariance of two random variables X and Y, and the denominator is the product of their standard deviations, σ_X and σ_Y.
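To see the definition at work, here is a small sketch that computes the coefficient directly from the covariance and the standard deviations, then compares it with SciPy's `pearsonr`. The data are synthetic and the variable names are mine, used only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, roughly linear data (assumed for illustration only)
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

# Sample covariance divided by the product of sample standard deviations
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Library result for comparison
r_scipy, _ = stats.pearsonr(x, y)
print(r_manual, r_scipy)
```

The two numbers should agree to floating-point precision. The choice of `ddof` does not matter as long as it is the same in the numerator and the denominator, because the normalizing factor cancels.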

Karl Pearson, Image source: Wikimedia, Free media to use

Pearson’s coefficient suffices for a first-order analysis, but it is sensitive to outliers and, by definition, captures only the linear part of a relationship. Therefore, there is a need to look beyond linear correlation analysis.

For such generalized correlation analysis, there are a few more tools. Among them, Spearman’s coefficient is the most straightforward to understand and calculate. In this article, we will discuss that.

Spearman’s coefficient

Spearman’s coefficient is a rank correlation (a measure of statistical dependence between the rankings of two random variables). It is named after Charles Spearman, an English psychologist, famous for factor analysis among other contributions.

So, how is this different from the regular correlation coefficient we are familiar with? Spearman’s coefficient differs in that it depends not on the raw numerical values but on their rankings.

What’s a rank?

So, what is a rank? Here is the Wikipedia definition,

"A ranking is a relationship between a set of items such that, for any two items, the first is either "ranked higher than", "ranked lower than" or "ranked equal to" the second."

Take two sets of numbers – 1,2,3 and 10,24,102. The exact numerical relationship between these sets may look odd, but they are very similar in a "one-to-one" sense if we consider their ranking only. In each set, the second number is greater than the first and less than the third. The two sets are identical from that point of view. The relative order matters, not the actual numerical values.

The Pearson (linear) correlation coefficient of these two sets is about 0.93, noticeably short of 1. Spearman’s coefficient, however, is a perfect 1.
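The two small sets above are easy to check numerically. The sketch below uses SciPy's `rankdata`, `pearsonr`, and `spearmanr`; the variable names are mine.

```python
import numpy as np
from scipy import stats

a = np.array([1, 2, 3])
b = np.array([10, 24, 102])

# Both sets receive the same ranks: 1, 2, 3
print(stats.rankdata(a), stats.rankdata(b))

# Each call returns (coefficient, p-value)
r, _ = stats.pearsonr(a, b)
rho, _ = stats.spearmanr(a, b)
print(f"Pearson = {r:.3f}, Spearman = {rho:.3f}")
```

With only three points the p-values are not meaningful, so the snippet discards them and looks at the coefficients alone.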

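It follows the same formula as the Pearson coefficient and is given by,

$$r_s = \rho_{R(X),R(Y)} = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)} \, \sigma_{R(Y)}}$$

The only difference is that the raw numerical values of X and Y are replaced by their ranks R(X) and R(Y) (i.e., their positions in sorted order: 1st, 2nd, 3rd, and so on).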
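In other words, Spearman's coefficient is the Pearson coefficient of the ranks. Here is a minimal sketch of that equivalence on made-up data (the numbers are arbitrary and include ties on purpose). It assumes tied values get the average of their ranks, which is the default behavior of `scipy.stats.rankdata`.

```python
import numpy as np
from scipy import stats

# Arbitrary illustrative data, with a tie in each array
x = np.array([3.1, 0.5, 7.2, 7.2, 1.8, 9.9])
y = np.array([2.0, 1.1, 8.5, 6.3, 1.1, 4.4])

# Step 1: replace values by their ranks (ties share the average rank)
rx = stats.rankdata(x)
ry = stats.rankdata(y)

# Step 2: apply the ordinary Pearson formula to the ranks
rho_manual, _ = stats.pearsonr(rx, ry)

# Library result for comparison
rho_scipy, _ = stats.spearmanr(x, y)
print(rho_manual, rho_scipy)
```

Both values should match to floating-point precision.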

An example with Python

Using the SciPy library, it is quite simple to evaluate the Spearman coefficient of two sets of 1-D or 2-D arrays.

Let’s show the stark difference between the two types of correlation with a canonical example,

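import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stat

# 20 evenly spaced points and a steeply decaying, strictly decreasing function of x
x = np.linspace(0, 10, 20)
y = np.log(1 + 1 / (x + 0.1))

# Both calls return (coefficient, p-value); the coefficient is the first element
pearson = stat.pearsonr(x, y)
spearman = stat.spearmanr(x, y)

If we check the coefficients (the first element of each result), they come out to about -0.65 and -1.0, respectively. Here is the visual representation,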

Image source: Created by the author

Why is Spearman’s coefficient a perfect -1.0? Because the rank order of y is the exact reverse of the rank order of x: the largest x pairs with the smallest y, and so on down the line. As long as the underlying function is strictly monotonic, the coefficient will be exactly +1 (increasing) or -1 (decreasing). The generating function here is highly nonlinear (the logarithm of a shifted reciprocal), but that does not affect Spearman’s coefficient.
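The same behavior holds for any strictly monotonic function. The following sketch tries a few transformations of my own choosing (they do not appear in the original example) and prints both coefficients for each.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 10, 20)

# Illustrative strictly monotonic functions of x (three increasing, one decreasing)
transforms = {
    "exp(x)": np.exp(x),
    "x**3": x**3,
    "sqrt(x)": np.sqrt(x),
    "-log(x)": -np.log(x),
}

results = {}
for name, y in transforms.items():
    r, _ = stats.pearsonr(x, y)
    rho, _ = stats.spearmanr(x, y)
    results[name] = (r, rho)
    print(f"{name:8s} Pearson = {r:+.3f}  Spearman = {rho:+.3f}")
```

Pearson's value changes from one function to the next because it depends on how far each curve departs from a straight line. Spearman's value only reflects whether the ordering is preserved or reversed.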

Spearman coefficient is more robust to outliers

If we introduce some random outliers to the same arrays as above, we will see that Spearman’s coefficient is more robust to those outliers as the rank order is disturbed much less than the actual numeric values. Here is a sample result with outliers.

Image source: Created by the author

Pearson’s coefficient is visibly disturbed by the outliers (from -0.65 to -0.79), but Spearman’s coefficient hardly moves (from -1.0 to -0.99).
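The random outliers behind the plot above are not reproduced here. As a simplified, deterministic variant, the sketch below plants a single hypothetical outlier: it inflates the largest y value to 100. Because that point stays the largest, no rank changes, which isolates why rank-based correlation is less sensitive to extreme values.

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 20)
y = np.log(1 + 1 / (x + 0.1))

# Hypothetical outlier (an assumption for illustration): y[0] is already the
# largest value, so making it far larger leaves the ordering untouched
y_out = y.copy()
y_out[0] = 100.0

pearson_clean, _ = stats.pearsonr(x, y)
pearson_out, _ = stats.pearsonr(x, y_out)
spearman_clean, _ = stats.spearmanr(x, y)
spearman_out, _ = stats.spearmanr(x, y_out)

print(f"Pearson:  {pearson_clean:+.2f} -> {pearson_out:+.2f}")
print(f"Spearman: {spearman_clean:+.2f} -> {spearman_out:+.2f}")
```

Outliers that do reorder the points will move Spearman's coefficient as well, though usually by less than Pearson's, since a single point can only shift by a bounded number of rank positions no matter how extreme its value is.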

Summary

In this article, we discussed Spearman’s coefficient, which is Pearson’s coefficient applied to the ranks of the values rather than to the values themselves. It is one example of a more general family of correlation measures that capture relationships other than strictly linear ones between two random variables.

Going beyond the linear Pearson coefficient is useful in many cases, such as in the presence of outliers or when rank ordering is more interesting to analyze than raw numerical values. Concepts like Spearman’s coefficient come in handy for data scientists in such scenarios.



Join Medium with my referral link – Tirthajyoti Sarkar

