
Cointegration vs Spurious Correlation: Understand the Difference for Accurate Analysis

Why correlation does not equal causation

Photo by Wance Paleri on Unsplash

Background

In time series analysis, it is valuable to understand whether one series influences another. For example, it is useful for commodity traders to know if an increase in commodity A leads to an increase in commodity B. Originally, this relationship was measured using linear regression; however, in the 1970s Clive Granger and Paul Newbold showed that this approach yields incorrect results, particularly for non-stationary time series. As a result, Granger went on to develop the concept of cointegration (later formalised with Robert Engle), which won Granger a Nobel Prize. In this post, I want to discuss the need for, and application of, cointegration and why it is an important concept Data Scientists should understand.

Spurious Correlation

Overview

Before we discuss cointegration, let’s discuss the need for it. Historically, statisticians and economists used linear regression to determine the relationship between different time series. However, Granger and Newbold showed that this approach is incorrect and leads to something called spurious correlation.

A spurious correlation is where two time series look correlated but in truth lack any causal relationship. It is the classic ‘correlation does not mean causation’ problem. It is dangerous because even statistical tests may well say that there is a causal relationship.

Example

An example of a spurious relationship is shown in the plots below:

Plot generated by author in Python.

Here we have two time series A(t) and B(t) plotted as a function of time (left) and plotted against each other (right). Notice from the plot on the right that there is some correlation between the series, as shown by the regression line. However, looking at the left plot, we see this correlation is spurious: B(t) consistently increases while A(t) fluctuates erratically, and the average distance between the two series is growing. Therefore, they may be correlated, but there is no evidence of a causal relationship.
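
To see how easily this arises, here is a minimal sketch (my own illustrative code, not the code behind the plots above) that generates two independent non-stationary series and measures their correlation:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)

# Two independent non-stationary series: a random walk and a noisy trend
A = np.cumsum(rng.normal(size=200))
B = 0.1 * np.arange(200) + rng.normal(size=200)

# Despite having no causal link, such series often show a "significant" correlation
r, p = pearsonr(A, B)
print(f"r = {r:.2f}, p = {p:.3g}")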

See here for some further examples of spurious correlation. My favourite is the Video Game Sales vs. Nuclear Energy Production!

Causes

There are several reasons why spurious correlation occurs:

  • Pure luck, chance, or coincidence.
  • The sample time series data does not adequately represent the population time series.
  • Both time series, A and B, are driven by a third, unobserved time series C. So C causes both A and B, making it look like A causes B, or vice versa (see the sketch below).
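
As a rough sketch of this third cause (the variable names and scales here are just illustrative assumptions), both series can inherit a strong correlation purely from a shared driver C:

import numpy as np

rng = np.random.default_rng(1)

# C: an unobserved common driver (a random walk)
C = np.cumsum(rng.normal(size=300))

# A and B each depend on C plus independent noise; there is no direct A-B link
A = C + rng.normal(size=300)
B = 0.5 * C + rng.normal(size=300)

# A and B will typically appear strongly correlated purely through C
print(np.corrcoef(A, B)[0, 1])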

What is Cointegration?

Overview

Cointegration is a technique that allows us to distinguish whether two time series have a genuine long-running relationship or just a spurious correlation. Rather than measuring whether the series move together, it focuses on determining whether the gap between them stays consistent over time.

Theory

Two time series are considered cointegrated if there exists a linear combination of them that has a lower order of integration than the two individual series. By integration, we are referring to the number of times a series must be differenced to become stationary, not integration in the calculus sense.

For example, if two series have an I(1) order of integration (not stationary), then the two time series are cointegrated if a certain linear combination exists that makes the resultant series I(0) (stationary).
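
As a quick illustrative sketch (assuming statsmodels is available), differencing an I(1) random walk once typically yields an I(0) series, which an ADF test will pick up:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=500))  # I(1): a non-stationary random walk

# ADF on the level usually fails to reject the unit-root null (high p-value)
print("level p-value:", adfuller(walk)[1])

# ADF on the first difference usually rejects it (low p-value), i.e. I(0)
print("diff p-value:", adfuller(np.diff(walk))[1])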

See here for a more thorough explanation of the order of integration.

So, if we have our two time series A(t) and B(t), they are considered to be cointegrated if there exists a β scaling coefficient that produces a stationary process:

$$u(t) = A(t) - \beta B(t) \sim I(0)$$

If this is true, then there is a chance that A(t) and B(t) truly have a long-term causal relationship.

If you want to learn more about stationarity, check out my previous post on it here:

Time-Series Stationarity Simply Explained

Example

Plotted below is an example of two cointegrated series:

Plot generated by author in Python.

Notice how the mean distance between the series stays consistent. In fact, if we multiply B(t) by 2 (β = 2), the resultant output is:

Plot generated by author in Python.

The two series completely overlay each other! Therefore, we can say that they are cointegrated.

This is a perfect toy example; in reality, no two series will overlap so exactly.
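
For reference, a pair like this can be simulated in a few lines (a minimal sketch, not the exact code behind the plots above):

import numpy as np

rng = np.random.default_rng(42)

# B(t): a random walk, i.e. an I(1) series
B = np.cumsum(rng.normal(size=200))

# A(t): a scaled copy of B(t) plus stationary noise, so A - 2B is I(0)
A = 2 * B + rng.normal(scale=0.5, size=200)

# The residual series u(t) = A(t) - 2*B(t) hovers around zero
u = A - 2 * B
print(u.mean(), u.std())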

Cointegration Test: Engle-Granger Two-Step Method

Overview & Theory

The most common test for cointegration is the Engle-Granger test. It measures whether the residuals from a linear combination of the two series are stationary.

For example, going back to the above equation, let’s say the linear combination of A(t) and B(t) leads to a stationary series, u(t):

$$u(t) = A(t) - \beta B(t)$$

The coefficient β can be computed through a linear regression fit of A(t) vs B(t). This is the standard OLS process:

$$A(t) = \beta B(t) + c + u(t)$$

Where c is the intercept term.

We can verify u(t) is indeed stationary by carrying out a statistical test. The most common test for stationarity (unit root test) is the Augmented Dickey-Fuller (ADF) test.

The hypotheses are:

$$H_0: u(t) \text{ has a unit root (non-stationary, no cointegration)}$$
$$H_1: u(t) \text{ is stationary (cointegration)}$$

Example

Let’s make this theory more concrete using a simple toy example:

Plot generated by author in Python.

First, we will regress A(t) against B(t) to find β using OLS:
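
The regression code itself is not shown here, but a minimal sketch using scipy's linregress (assuming the two series are stored in arrays A and B) looks like this:

from scipy.stats import linregress

# OLS fit of A(t) = c + beta * B(t); the slope is our beta estimate
result = linregress(B, A)
print(result)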

The output of this is:

LinregressResult(slope=1.000392464678179, intercept=0.31083202511773855, rvalue=0.9629500869656515, pvalue=4.162670194519794e-11, stderr=0.06795014678046259, intercept_stderr=0.9311636662243622)

So our estimated coefficient is β ≈ 1.0004 (the slope of the fit).

Now using β=1.0004, we can calculate the residuals from the two series and run those residuals through an ADF test to determine if they are stationary:
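
Again as a sketch (assuming statsmodels and the result object from the regression above; the exact critical values depend on sample size and test settings):

from statsmodels.tsa.stattools import adfuller

# Residuals u(t) = A(t) - (c + beta * B(t)) from the OLS fit
u = A - (result.intercept + result.slope * B)

adf_stat, p_value, _, _, crit_values, _ = adfuller(u)
print("ADF Statistic: ", adf_stat)
print("P-Value: ", p_value)
print("Critical Values:")
for level, value in crit_values.items():
    print(f" {level}: {value:.2f}")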

The output:

ADF Statistic:  -1.9502125507110546
P-Value:  0.30882039870947364
Critical Values:
 1%: -4.14
 5%: -3.15
 10%: -2.71

As our ADF statistic is greater (less negative) than even the 10% critical value, we fail to reject the null hypothesis that the residuals contain a unit root. Therefore, the two series are not cointegrated.

If you want to learn about confidence intervals, check out my previous post on them here:

Confidence Intervals Simply Explained

Other Tests

The issue with the Engle-Granger test is that it can only measure cointegration between two time series. However, tests such as the Johansen test can determine cointegration between several time series at once.
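
For completeness, here is a minimal sketch of running the Johansen test with statsmodels (illustrative only; the input is assumed to be a 2D array with one column per series):

import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

# Stack the series column-wise; add more columns to test more than two series
data = np.column_stack([A, B])

# det_order=0 adds a constant term; k_ar_diff=1 uses one lagged difference
result = coint_johansen(data, det_order=0, k_ar_diff=1)

print(result.lr1)  # trace statistics for each cointegration-rank hypothesis
print(result.cvt)  # corresponding 90%/95%/99% critical values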

See here for more information on the Johansen test.

Summary & Further Thoughts

Cointegration is a crucial tool in time series analysis, as it allows Data Scientists to distinguish genuine causal long-term relationships between series from spurious correlation. It is a particularly useful concept to understand for Data Scientists working in finance and at trading firms.

The full code used in this blog is available at my GitHub here:

Medium-Articles/Time Series/Time Series Tools/cointegration.py at main · egorhowell/Medium-Articles


Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack
