
Autocorrelation For Time Series Analysis

Describing what autocorrelation is, its formula, and why it is useful in time series analysis.

Photo by Isaac Smith on Unsplash

Introduction

In time series analysis we often make inferences about the past to produce forecasts about the future. In order for this process to be successful, we must diagnose our time series thoroughly to find all its ‘nooks and crannies.’

One such diagnostic is autocorrelation. It helps us detect certain features in our series so that we can choose the most suitable forecasting model for our data.

In this short post I want to go over what autocorrelation is, why it is useful, and how to apply it to a simple dataset in Python.

What Is Autocorrelation?

Autocorrelation is just the correlation of the data with itself. So, instead of measuring the correlation between two random variables, we are measuring the correlation of a random variable with itself. Hence the name _auto_-correlation.

Correlation measures how strongly two variables are linearly related to each other. A value of 1 means the variables are perfectly positively correlated, -1 means they are perfectly negatively correlated, and 0 means there is no linear correlation.

For time series, the autocorrelation is the correlation of that time series at two different points in time; the number of time steps separating them is known as the lag. In other words, we are measuring the time series against some lagged version of itself.

Mathematically, the autocorrelation at lag k is calculated as:

$$r_k = \frac{\sum_{t=k+1}^{N} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{N} (y_t - \bar{y})^2}$$

Equation by author from LaTeX.

Where N is the length of the time series y, \bar{y} is its mean, and k is the specified lag of the time series. So, when calculating r_1 we are computing the correlation between y_t and y_{t-1}.

The autocorrelation between y_t and y_t would be 1 as they are identical.
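To make the calculation concrete, here is a minimal NumPy sketch of the standard sample autocorrelation: the sum of products of mean-centred values that are k steps apart, divided by the total sum of squared deviations. The function name `autocorrelation` and the toy series are my own choices for illustration.

```python
import numpy as np


def autocorrelation(y, k):
    """Sample autocorrelation r_k of a 1-D series y at lag k (0 <= k < len(y))."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(y)")
    deviations = y - y.mean()
    # Numerator: products of deviations that are k steps apart.
    numerator = np.sum(deviations[k:] * deviations[: n - k])
    # Denominator: total sum of squared deviations (the lag-0 numerator).
    denominator = np.sum(deviations ** 2)
    return numerator / denominator


# Toy series invented for illustration.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
r0 = autocorrelation(y, 0)  # the series against itself
r1 = autocorrelation(y, 1)  # the series against itself shifted by one step
print(r0, r1)
```

At lag 0 the numerator and denominator are the same sum, which is why the result is exactly 1.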

Why Is It Useful?

As stated above, we use autocorrelation to measure the correlation of a time series with a lagged version of itself. This computation allows us to gain some interesting insight into the characteristics of our series:

  • Seasonality: Let's say we find that the correlation at multiples of a certain lag is in general higher than at other lags. This means we have some seasonal component in our data. For example, if we have daily data and we find that the correlation at every lag that is a multiple of 7 is higher than at the others, we probably have some weekly seasonality.
  • Trend: If the correlation for recent lags is higher and slowly decreases as the lags increase, then there is some trend in our data. Therefore, we would need to carry out some differencing to render the time series stationary.

To learn more about seasonality, trend and stationarity, check out my previous articles on those topics:

Seasonality of Time Series

Time Series Decomposition

Time-Series Stationarity Simply Explained

Let’s now go through an example in Python to make this theory more concrete!

Python Example

For this walkthrough we will use the classic airline passenger volumes dataset:

Data sourced from Kaggle with a CC0 licence.

Plot generated by author in Python.

There is a clear upwards trend and yearly seasonality (data points indexed by month).

We can use the plot_acf function from the statsmodels package to plot the autocorrelation of our time series at various lags. This type of plot is known as a correlogram:

Plot generated by author in Python.

We observe the following:

  • The autocorrelation peaks at every lag that is a multiple of 12. As our data is indexed by month, we therefore have a yearly seasonality in our data.
  • The strength of the correlation generally decreases slowly as the lags increase. This points to a trend in our data, which needs to be differenced to make it stationary when modelling.

The blue region marks the confidence interval: lags whose autocorrelation falls outside it are statistically significant. Therefore, when building a forecast model for this data, the next-month forecast should probably only consider ~15 of the previous values due to their statistical significance.

The value at lag 0 is a perfect correlation of 1 because we are correlating the time series with an exact copy of itself.

Summary and Other Thoughts

In this post we have described what autocorrelation is and how we can use it to detect seasonality and trends in our time series. However, it does have other uses too. For example, we can use an autocorrelation plot of the residuals from a forecasting model to determine whether the residuals are indeed independent. If the autocorrelation of the residuals is not mostly zero, then the fitted model has not accounted for all the information and can probably be improved.

The full code script used in this article can be found at my GitHub here:

Medium-Articles/autocorrelation.py at main · egorhowell/Medium-Articles
