Cointegration vs Correlation

Background
In time series analysis, it is valuable to understand whether one series influences another. For example, it is useful for commodity traders to know if an increase in commodity A leads to an increase in commodity B. Originally, this relationship was measured using linear regression; however, in the 1970s Clive Granger and Paul Newbold showed that this approach yields misleading results, particularly for non-stationary time series. Granger, together with Robert Engle, later developed the concept of cointegration, which contributed to Granger winning a Nobel prize. In this post, I want to discuss the need for and application of cointegration and why it is an important concept for Data Scientists to understand.
Spurious Correlation
Overview
Before we discuss cointegration, let’s discuss the need for it. Historically, statisticians and economists used linear regression to determine the relationship between different time series. However, Granger and Newbold showed that this approach is incorrect and leads to something called spurious correlation.
A spurious correlation is where two time series appear correlated but in fact lack any causal relationship. It is the classic ‘correlation does not imply causation‘ problem. It is dangerous because even statistical tests may indicate that a causal relationship exists.
Example
An example of a spurious relationship is shown in the plots below:

Here we have two time series A(t) and B(t) plotted as a function of time (left) and plotted against each other (right). Notice from the plot on the right that there is some correlation between the series, as shown by the regression line. However, looking at the left plot, we see this correlation is spurious: B(t) consistently increases while A(t) fluctuates erratically, and the average distance between the two series keeps growing. Therefore, they may be correlated, but there is no evidence of a causal relationship.
See here for some further examples of spurious correlation. My favourite is the Video Game Sales vs. Nuclear Energy Production!
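To see how easily spurious relationships arise, here is a small sketch (my own illustration, not from the original post) that regresses two independent random walks against each other. Despite having no connection by construction, the regression will often report a “significant” p-value.

```python
import numpy as np
from scipy.stats import linregress

# Two independent random walks: by construction there is no causal link
rng = np.random.default_rng(42)
a = np.cumsum(rng.normal(size=500))
b = np.cumsum(rng.normal(size=500))

# A naive regression still tends to find a "significant" relationship,
# purely because both series drift over time
result = linregress(a, b)
print(f"R-squared: {result.rvalue ** 2:.3f}, p-value: {result.pvalue:.2e}")
```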
Causes
There are several causes for why spurious correlation occurs:
- Pure luck, chance, or coincidence.
- The sample time series data does not adequately represent the population time series.
- Both time series A and B are driven by a third, unobserved time series C. So, C causes both A and B, making it look as though A causes B or vice versa.
What is Cointegration?
Overview
Cointegration is a technique that allows us to distinguish whether two time series share a genuine long-run relationship or whether their correlation is spurious. Rather than measuring whether the series move together, it focuses on whether the spread between them stays stable over time.
Theory
Two time series are considered cointegrated if there exists a linear combination of them with a lower order of integration than the orders of the two individual series. The order of integration here refers to how many times a series must be differenced to become stationary, not integration in the calculus sense.
For example, if two series have an I(1) order of integration (not stationary), then the two time series are cointegrated if a certain linear combination exists that makes the resultant series I(0) (stationary).
See here for a more thorough explanation of the order of integration.
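As a quick illustration (my own sketch, not from the original post), the ADF test introduced later in this article can show the difference: a random walk is I(1), but its first difference is I(0).

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))  # I(1): non-stationary

# Large p-value on the level: we cannot reject the presence of a unit root
print("Level p-value:", adfuller(random_walk)[1])

# Differencing once usually gives a clearly stationary, I(0) series
print("Differenced p-value:", adfuller(np.diff(random_walk))[1])
```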
So, if we have our two time series A(t) and B(t), they are considered to be cointegrated if there exists a β scaling coefficient that produces a stationary process:

A(t) - β·B(t) ~ I(0)
If this is true, then there exists a chance that A(t) and B(t) truly have a long-term causal relationship.
If you want to learn more about stationarity, check out my previous post on it here:
Example
Plotted below is an example of two cointegrated series:

Notice how the mean distance between the series stays consistent. In fact, if we multiply B(t) by 2 (i.e. β=2), the resultant output is:

The two series completely overlay each other! Therefore, we can say that they are cointegrated.
This is a perfect toy example; in reality, no two series will perfectly overlap each other.
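A minimal sketch of a similar toy setup (assumed data, not the author's exact series) is shown below: A(t) is constructed as twice B(t) plus a little noise, so the spread A(t) - 2·B(t) is stationary.

```python
import numpy as np

rng = np.random.default_rng(1)
b = np.cumsum(rng.normal(size=300))           # a non-stationary random walk
a = 2 * b + rng.normal(scale=0.1, size=300)   # A(t) is (roughly) twice B(t)

# The spread is just the small noise term, i.e. stationary, so plotting
# a(t) against 2*b(t) would show the two series almost perfectly overlapping
spread = a - 2 * b
print("Spread mean:", round(spread.mean(), 3), "std:", round(spread.std(), 3))
```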
Cointegration Test: Engle-Granger Two-Step Method
Overview & Theory
The most common test for cointegration is the Engle-Granger test. It measures whether the residuals from a linear combination of the two series are stationary.
For example, going back to the above equation, let’s say the linear combination of A(t) and B(t) leads to a stationary series, u(t):

u(t) = A(t) - β·B(t)
The coefficient β can be computed by fitting a linear regression of A(t) on B(t). This is the standard OLS process:

A(t) = β·B(t) + c + u(t)

Where c is the intercept term and u(t) are the regression residuals.
We can verify u(t) is indeed stationary by carrying out a statistical test. The most common test for stationarity (unit root test) is the Augmented Dickey-Fuller (ADF) test.
The hypotheses are:
- Null hypothesis (H0): u(t) has a unit root, i.e. it is non-stationary (no cointegration).
- Alternative hypothesis (H1): u(t) has no unit root, i.e. it is stationary (cointegration).
Example
Let’s make this theory more concrete using a simple toy example:
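The exact data used here lives in the GitHub repo linked at the end of the post; a sketch of how two such series could be constructed (my assumption, not the author's code) is:

```python
import numpy as np

# Assumed toy data: B(t) is a random walk and A(t) tracks it plus noise
rng = np.random.default_rng(7)
n = 30
B = np.cumsum(rng.normal(size=n)) + 10
A = B + 0.3 + rng.normal(scale=1.0, size=n)
```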

First, we will regress A(t) against B(t) to find β using OLS:
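A sketch of this step using scipy.stats.linregress, which produces the LinregressResult shown below (the original script is in the linked repo), could look like:

```python
from scipy.stats import linregress

# Regress A(t) on B(t); the fitted slope is our estimate of beta
fit = linregress(B, A)
print(fit)
```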
The output of this is:
LinregressResult(slope=1.000392464678179, intercept=0.31083202511773855, rvalue=0.9629500869656515, pvalue=4.162670194519794e-11, stderr=0.06795014678046259, intercept_stderr=0.9311636662243622)

So our estimated value is β≈1.0004.
Now using β=1.0004, we can calculate the residuals from the two series and run those residuals through an ADF test to determine if they are stationary:
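Continuing the sketch above (again an assumed version, not the author's exact code), the residuals u(t) = A(t) - β·B(t) - c are passed to statsmodels' adfuller:

```python
from statsmodels.tsa.stattools import adfuller

# Residuals of the linear combination: u(t) = A(t) - beta*B(t) - c
residuals = A - fit.slope * B - fit.intercept

adf_stat, p_value, _, _, critical_values, _ = adfuller(residuals)
print("ADF Statistic:", adf_stat)
print("P-Value:", p_value)
print("Critical Values:", critical_values)
```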
The output:
ADF Statistic: -1.9502125507110546
P-Value: 0.30882039870947364
Critical Values:
1%: -4.14
5%: -3.15
10%: -2.71
As our ADF statistic is greater (less negative) than even the 10% critical value, and the p-value is well above 0.05, we fail to reject the null hypothesis that the residuals contain a unit root. In other words, there is no evidence that the two series are cointegrated.
If you want to learn about confidence intervals, check out my previous post on them here:
Other Tests
The issue with the Engle-Granger test is that it only measures cointegration between two time series. Tests such as the Johansen test can determine cointegration between several time series.
See here for more information on the Johansen test.
Summary & Further Thoughts
Cointegration is a crucial tool in time series analysis as it allows Data Scientists to distinguish between genuine causal long-run relationships and spurious correlation. It is a particularly useful concept for Data Scientists working in finance and trading firms to understand.
The full code used in this blog is available at my GitHub here:
Medium-Articles/Time Series/Time Series Tools/cointegration.py at main · egorhowell/Medium-Articles
References & Further Reading
- More comprehensive mathematical view on cointegration: https://www.uh.edu/~bsorense/coint.pdf
- Robert F. Engle and C. W. J. Granger’s original paper on cointegration: https://www.jstor.org/stable/1913236?origin=crossref
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.