A correlation measure based on Theil-Sen regression

Matthias Plaue
Towards Data Science
6 min read · Feb 25, 2020


Source: https://www.tylervigen.com/spurious-correlations

Association and correlation measures are important tools in descriptive statistics and exploratory data analysis. They provide statistical evidence for a functional relationship between variables.

One popular such measure is the Pearson correlation coefficient r(⋅ , ⋅). For samples of metric data x and y it is given as follows:

r(x, y) = s(x, y) / (s(x) s(y)),

where s(⋅ , ⋅) is the sample covariance and s(⋅) the sample standard deviation.

If we view the sequences of n observations x and y as vectors in Euclidean space ℝⁿ, the correlation coefficient can be interpreted geometrically. First, we normalize the samples by rescaling x and y as follows:

z(x) = (x - μ(x)) / (√(n - 1) s(x)),   z(y) = (y - μ(y)) / (√(n - 1) s(y)),

where μ(⋅) is the arithmetic mean (subtracted from each observation). Here we assume that we use the unbiased sample covariance, i.e., the one with factor 1/(n - 1). We then find that the normalized samples live on the unit sphere, and the correlation coefficient is their dot product:

r(x, y) = z(x) ⋅ z(y)
This high-dimensional geometric interpretation is useful for understanding some important properties of the Pearson correlation. For example, we immediately see that the measure compares the directions in which the normalized samples point: r(x, y) = 1 means that they point in the same direction, and r(x, y) = -1 means that they point in opposite directions. If r(x, y) vanishes, the vectors z(x) and z(y) are orthogonal. We could even compute an “angle of association” φ between the normalized samples via the relation r(x, y) = cos(φ).
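This geometric picture is easy to check numerically. The following is a minimal Python sketch (the helper name z and the synthetic data are my own, not from the article) that normalizes two samples as above and confirms that the normalized vectors have unit length and that their dot product reproduces NumPy's Pearson coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)  # roughly linear relationship

def z(v):
    # Center by the mean, then rescale by sqrt(n - 1) times the
    # unbiased (ddof=1) sample standard deviation.
    return (v - v.mean()) / (np.sqrt(len(v) - 1) * v.std(ddof=1))

z_x, z_y = z(x), z(y)

print(np.linalg.norm(z_x), np.linalg.norm(z_y))  # both 1: points on the unit sphere
print(z_x @ z_y)                                 # dot product ...
print(np.corrcoef(x, y)[0, 1])                   # ... equals Pearson's r
```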

Correlation and regression

Another useful interpretation of the Pearson correlation coefficient comes from simple linear regression based on the method of least squares. (Other interpretations are presented in Rodgers and Nicewander 1988.) In the last paragraph, we viewed the two samples as two vectors of dimension n. We now want to regard them as n data points of dimension two. If we use least squares to determine a straight line of best fit through these data points, treating y as the response variable and x as the explanatory variable, that line will have a slope of m(y, x) = s(x, y) / (s(x))². Conversely, if we treat x as the response variable and y as the explanatory variable, the regression line will have slope m(x, y) = s(x, y) / (s(y))². We observe that no matter which variable is treated as dependent or independent, the slope of the regression line does not change sign. This, and a little algebra, allows us to conclude:

r(x, y) = sign(s(x, y)) √(m(y, x) m(x, y))
To put it in words: the Pearson coefficient is the (signed) geometric mean of the slopes of the two straight lines of best fit that can be drawn through the data points via the method of least squares. This identity shows very explicitly that the Pearson coefficient measures the extent to which x and y are linearly related. It also shows that the Pearson coefficient is just as vulnerable to outliers as the least squares method: it is generally not considered to be a robust statistic. A synthetic collection of test data sets designed to showcase such vulnerabilities is the Anscombe quartet. Even though the four data sets present very differently as scatter plots, the regression lines based on least squares are identical:

Source: Wikipedia
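The identity relating Pearson's r to the two least-squares slopes is also easy to verify numerically. A minimal sketch (my own illustration, using np.polyfit for the two fits) could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = -0.7 * x + rng.normal(size=100)  # negatively correlated data

m_yx = np.polyfit(x, y, deg=1)[0]  # slope of y regressed on x
m_xy = np.polyfit(y, x, deg=1)[0]  # slope of x regressed on y

# Signed geometric mean of the two least-squares slopes ...
print(np.sign(m_yx) * np.sqrt(m_yx * m_xy))
# ... agrees with Pearson's r (up to floating-point rounding):
print(np.corrcoef(x, y)[0, 1])
```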

An alternative to least squares for simple linear regression is Theil-Sen estimation. This more robust method determines the slope of the regression line via the median of the slopes of all lines that can be drawn through pairs of data points:

m(y, x) = median{ (yⱼ - yᵢ) / (xⱼ - xᵢ) : 1 ≤ i < j ≤ n, xᵢ ≠ xⱼ }
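A direct, if naive O(n²), implementation of this median-of-slopes estimator might look as follows; the function name theil_sen_slope is ad hoc, and scipy.stats.theilslopes is only called as a plausibility check:

```python
import numpy as np
from scipy import stats

def theil_sen_slope(x, y):
    # Median of the slopes of all lines through pairs of data points;
    # pairs with identical x values are skipped.
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]
    ]
    return np.median(slopes)

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(size=30)
y[0] += 20.0  # a gross outlier barely moves the median slope

print(theil_sen_slope(x, y))
print(stats.theilslopes(y, x)[0])  # SciPy's estimate, for comparison
```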
Consequently, using the above relationship between the Pearson correlation coefficient and least squares regression, we can now formulate a variant that relates to Theil-Sen estimation: the measure takes the value

sign(m(y, x)) √(m(y, x) m(x, y))

if the median slopes have the same sign, and zero otherwise. The condition to set the measure to zero on sign change of the slopes might seem artificial, however:

  • The median slope generically does not change sign when exchanging x and y,
  • and when it does, the Kendall rank correlation between x and y vanishes.

This can be seen as follows. Assume that K of the individual slopes are negative, and N - K are positive (slopes that are zero or undefined because of tied values are left aside). Exchanging x and y implies taking the reciprocal of the slopes. Ordering the slopes to compute their median, say a_1 ≤ … ≤ a_K < 0 < a_(K+1) ≤ … ≤ a_N, reveals the following relationship for the ordered reciprocals:

1/a_K ≤ … ≤ 1/a_1 < 0 < 1/a_N ≤ … ≤ 1/a_(K+1)

In particular, the exchange leaves the number of negative slopes unchanged.
We can see that a sign change of the median can only occur if there are as many negative slopes as there are positive ones. However, this implies that the Kendall rank correlation τ vanishes, as can be seen by the following rewriting of that measure (in the absence of ties, concordant pairs are exactly the pairs with positive slope and discordant pairs those with negative slope):

τ(x, y) = ((N - K) - K) / N
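Putting the pieces together, a sketch of the Theil-Sen-based coefficient could look like the code below; r_theil_sen and pairwise_slopes are ad-hoc names, and the last lines check the claim that Kendall's τ is determined by the signs of the pairwise slopes:

```python
import numpy as np
from scipy import stats

def pairwise_slopes(x, y):
    # Slopes of all lines through pairs of data points (distinct x values only).
    return np.array([
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]
    ])

def r_theil_sen(x, y):
    m_yx = np.median(pairwise_slopes(x, y))  # Theil-Sen slope, y regressed on x
    m_xy = np.median(pairwise_slopes(y, x))  # Theil-Sen slope, x regressed on y
    if np.sign(m_yx) != np.sign(m_xy):
        return 0.0  # median slopes disagree in sign
    return np.sign(m_yx) * np.sqrt(m_yx * m_xy)

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 0.8 * x + rng.normal(size=40)

print(r_theil_sen(x, y))

# Kendall's tau from the signs of the pairwise slopes (no ties in this sample):
slopes = pairwise_slopes(x, y)
print((np.sum(slopes > 0) - np.sum(slopes < 0)) / len(slopes))
print(stats.kendalltau(x, y)[0])  # agrees with SciPy's value
```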
We can evaluate this Theil-Sen-based correlation coefficient on the Anscombe quartet, and compare it with Pearson’s coefficient. The following table shows that comparison, plus two other popular measures based on rank statistics, Spearman’s ρ and the already mentioned Kendall’s τ:
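The comparison can be reproduced along the following lines. This is only a sketch: it assumes seaborn's bundled "anscombe" example dataset and, for brevity, swaps in scipy.stats.theilslopes for the two median slopes instead of the hand-rolled function above:

```python
import numpy as np
import seaborn as sns
from scipy import stats

def r_theil_sen(x, y):
    # Signed geometric mean of the two Theil-Sen slopes; zero if their signs differ.
    m_yx = stats.theilslopes(y, x)[0]
    m_xy = stats.theilslopes(x, y)[0]
    if np.sign(m_yx) != np.sign(m_xy):
        return 0.0
    return np.sign(m_yx) * np.sqrt(m_yx * m_xy)

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, group in anscombe.groupby("dataset"):
    x, y = group["x"].to_numpy(), group["y"].to_numpy()
    pearson, _ = stats.pearsonr(x, y)
    spearman, _ = stats.spearmanr(x, y)
    kendall, _ = stats.kendalltau(x, y)
    print(name, round(pearson, 3), round(r_theil_sen(x, y), 3),
          round(spearman, 3), round(kendall, 3))
```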

Maybe somewhat surprisingly, for the Anscombe 4 data set, the Theil-Sen, Spearman, and Kendall correlation measures prove to be highly unstable: any noise of even the smallest amplitude added to the data points produces arbitrary values of correlation. For the rank-based Spearman and Kendall measures this can be easily explained: If the majority of x values are identical or almost identical, small changes in the data can produce arbitrary ranks.

On the other hand, the Theil-Sen estimator produces arbitrarily large values for the slope of the regression line m(y, x) and values arbitrarily close to zero for m(x, y), leading to an ill-defined product when subjected to noise.

In general, if at least one of the samples to be compared shows only small dispersion, correlation measures will become unreliable. For robust correlation measures, such dispersion should be measured by robust means. The median absolute deviation of the x values within the Anscombe 4 data set vanishes, so it comes as no surprise that the Theil-Sen estimated correlation and the rank correlation measures are ill-defined.
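Such a robust dispersion check is cheap to do in practice. Here is a minimal sketch using scipy.stats.median_abs_deviation on the x values of the Anscombe 4 data set (ten observations equal to 8 and a single 19):

```python
import numpy as np
from scipy import stats

# x values of the Anscombe 4 data set: ten observations equal to 8, one equal to 19.
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)

print(stats.median_abs_deviation(x4))  # 0.0: the robust dispersion vanishes

# Tiny noise is enough to scramble the ranks of the tied values:
rng = np.random.default_rng(4)
print(stats.rankdata(x4))                                    # the tied values share rank 5.5
print(stats.rankdata(x4 + rng.normal(scale=1e-9, size=11)))  # ranks 1..10 now assigned arbitrarily
```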

Summary

The Theil-Sen estimator for robust simple linear regression can be used to define a correlation measure in analogy to the relation of Pearson’s correlation coefficient with least squares regression. Evaluated on the Anscombe quartet, it shows characteristics similar to rank correlation measures such as Spearman’s ρ or Kendall’s τ.

References

Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

Rodgers, J. L.; Nicewander, W. A. (1988). “Thirteen Ways to Look at the Correlation Coefficient”. American Statistician. 42 (1): 59–66.

Sen, P. K. (1968). “Estimates of the regression coefficient based on Kendall’s tau”. Journal of the American Statistical Association. 63 (324): 1379–1389.

Theil, H. (1950). “A rank-invariant method of linear and polynomial regression analysis. I”. Nederl. Akad. Wetensch., Proc. 53: 386–392.


Trained mathematical physicist, working data scientist. Author of textbooks on applied math and data science.