
Understanding Autocorrelation

And its impact on your data (with examples)

Photo by Jonathan Daniels on Unsplash

I’ve been dealing with autocorrelated data a lot lately. In finance, certain time series such as housing prices or private equity returns are notoriously autocorrelated. Properly accounting for this autocorrelation is critical to building a robust model.

First, what is autocorrelation? Autocorrelation is when past observations have an impact on current ones. For example, if you could use Apple’s returns last week to reliably predict its returns this week, then you can say that Apple’s stock price returns are autocorrelated. In math terms:

y = B0 + B1*y_lag1 + B2*y_lag2 + ... + Bn*y_lagn + error
If any of B1 to Bn is significantly nonzero, then we can say that the time series represented by y is autocorrelated.
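
A quick way to check for this in practice is pandas' Series.autocorr, which returns the correlation of a series with a lagged copy of itself. Here's a minimal sketch on a made-up returns series (for white noise like this, the lag correlations hover near zero; for a genuinely autocorrelated series they would be significantly nonzero):

import numpy as np
import pandas as pd

# Made-up example: a year of daily "returns" with no autocorrelation built in
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.02, 252))

# Correlation of the series with itself shifted back by k periods
for k in range(1, 4):
    print(f"lag {k}: {returns.autocorr(lag=k):.3f}")  # near zero for white noise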

Why Do We Care About Autocorrelation?

Two reasons come to my mind:

  1. Many things are autocorrelated. So when we attempt to study the world by building simulations (e.g. Monte Carlo simulations of the economy), we need to take that autocorrelation into account. Otherwise, our model would produce faulty results (see reason 2 for why).
  2. Autocorrelation causes volatility to be understated, especially when compounded. The compounded product, that is (1+r1)(1+r2)…, of an autocorrelated random variable can have a much wider distribution of outcomes than that of a random variable with no autocorrelation.

An Example With Inflation Data

Let’s use some actual data to study this. I downloaded some CPI data from FRED (the data repository of the Federal Reserve Bank of St. Louis). It looks like this:

CPI (Source: Federal Reserve Bank of St. Louis, graphic created by author)

Whenever analyzing time series data like CPI, you should start by taking the rate of change (to make it more mean-variance stationary). Let’s do that:

Here's what my raw CPI data looks like (stored in df):
values
date              
1950-01-01   23.51
1950-02-01   23.61
1950-03-01   23.64
1950-04-01   23.65
1950-05-01   23.77
1950-06-01   23.88
1950-07-01   24.07
1950-08-01   24.20
1950-09-01   24.34
1950-10-01   24.50
# My CPI data is stored in a dataframe called df
# The following line calculates the monthly rate of change
df_chg = (df/df.shift(1)-1).dropna()

The result looks like this (there’s some obvious seasonality in there that we will ignore today):

Monthly change in CPI (Source: Federal Reserve Bank of St. Louis, graphic created by author)

We can do a check for autocorrelation by looking at the correlation of the monthly change in CPI against its lagged values. We can use the shift method to create the lags.

# Rename the column of monthly changes and create lagged copies of it
df_chg.rename({'values': 'unlagged'}, axis=1, inplace=True)
lags = 10
for i in range(1, lags):
    df_chg['lag_'+str(i)] = df_chg['unlagged'].shift(i)

Plotting the correlations between the unlagged values of CPI change against its various lags, we see that there’s significant autocorrelation:

Lots of autocorrelation (Graphic created by author)
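
The plotting step itself isn't shown above; with the lag columns just created, a minimal sketch like this (matplotlib assumed) produces the bar chart of lag correlations:

import matplotlib.pyplot as plt

# Correlation of the unlagged CPI changes with each of their lags
corrs = df_chg.corr()['unlagged'].drop('unlagged')
corrs.plot(kind='bar')
plt.ylabel('correlation with unlagged CPI change')
plt.show()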

Ignore Autocorrelation At Your Own Risk

Let’s do a quick simulation to see what happens if we ignore autocorrelation. The monthly change in inflation has a standard deviation of 0.32%. If it were a normally distributed (and serially independent) random variable, we could annualize the 0.32% by multiplying it by the square root of 12 (because we are going from monthly to annual), which gives us an annualized standard deviation of 1.11%. That’s a pretty low standard deviation and would imply that extreme events such as hyperinflation are virtually impossible.
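
The simulation below needs two inputs that aren't computed explicitly in this article: target_vol, the monthly volatility just discussed, and betas, the lag coefficients of the autocorrelated process. Here's a minimal sketch of how they could be obtained, assuming target_vol is simply the standard deviation of the monthly CPI changes and betas comes from an ordinary least-squares regression of the series on its first two lags (the estimation method is an assumption, not something spelled out above):

import numpy as np

# Monthly standard deviation of CPI changes (roughly 0.32%)
target_vol = df_chg['unlagged'].std()
# Annualizing by sqrt(12) is only valid if the monthly changes are independent
print(f"monthly vol: {target_vol:.2%}, annualized: {target_vol*np.sqrt(12):.2%}")

# Estimate the lag coefficients of y = c + B1*y_lag1 + B2*y_lag2 + error
# via OLS, using the lag columns created earlier (estimation method assumed)
data = df_chg[['unlagged', 'lag_1', 'lag_2']].dropna()
X = np.column_stack([np.ones(len(data)), data['lag_1'], data['lag_2']])
coefs, *_ = np.linalg.lstsq(X, data['unlagged'].values, rcond=None)
betas = coefs[1:]  # betas[0] is the lag-1 coefficient, betas[1] the lag-2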

The following code simulates a year’s worth of inflation 10,000 times so we can look at the difference in outcomes between including autocorrelation and ignoring it.

import numpy as np
import pandas as pd

# c is the regression constant, a.k.a. the mean value
# we will set c to zero to simplify things
c = 0
# target_vol (monthly volatility) and betas (lag coefficients) are the
# inputs estimated above
# List to store autocorrelated simulation results
auto_correl_list = []
# List to store normally distributed simulation results
normal_list = []
# Run 10,000 scenarios
for sims in range(10000):
    # In each scenario, generate 12 months of "inflation"
    shocks = np.random.normal(0, target_vol, (12,))
    y_list = []
    # This loop takes the 12 shocks and adds autocorrelation
    # (each month depends on the previous two months plus a fresh shock)
    for i, e in enumerate(shocks):
        if i == 0:
            y = c + betas[0]*0 + betas[1]*0 + e
        elif i == 1:
            y = c + betas[0]*y_list[i-1] + betas[1]*0 + e
        else:
            y = c + betas[0]*y_list[i-1] + betas[1]*y_list[i-2] + e
        y_list.append(y)
    # Compounded 12-month change for the autocorrelated and the i.i.d. shocks
    auto_correl_list.append(((pd.Series(y_list)+1).cumprod()-1).iloc[-1])
    normal_list.append(((pd.Series(shocks)+1).cumprod()-1).iloc[-1])

Let’s take a look at the distribution of outcomes. Look at how much wider the autocorrelated version (in blue) is than the normal (in orange). The simulated standard deviation of the normal (the standard deviation of the orange histogram) is basically what we calculated earlier – 1.11%. The standard deviation of the autocorrelated version is 7.67%, almost seven times higher. Notice also that the means for both are the same (both zero) – autocorrelation impacts the variance but not the mean. This has implications for regression, which I will cover in a future article.
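
The plotting and summary step isn't shown above either; a sketch of it, assuming matplotlib, might look like this:

import matplotlib.pyplot as plt
import pandas as pd

results = pd.DataFrame({'autocorrelated': auto_correl_list,
                        'normal': normal_list})
# Standard deviation of the compounded outcomes under each assumption
print(results.std())

# Overlapping histograms of the 10,000 compounded 12-month outcomes
plt.hist(results['autocorrelated'], bins=100, alpha=0.6, label='autocorrelated')
plt.hist(results['normal'], bins=100, alpha=0.6, label='no autocorrelation')
plt.legend()
plt.xlabel('compounded 12-month change')
plt.show()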

Finally, let’s talk a bit about why this occurs. When something is autocorrelated (and the correlation coefficients are positive), it’s much more susceptible to feedback loops. Trends tend to snowball: if the last few observations were high, the next observation tends to be high as well, because it is heavily influenced by its predecessors.

We can see this snowball effect by looking at a few of the individual paths from our CPI simulation (this time simulated out to 60 months instead of just 12). In the autocorrelated version, once things get out of hand, they tend to stay that way, either going to the stratosphere or to -100%.

Feedback loops can produce extreme results (Graphic created by author)
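
The code for the longer simulation isn't included above; the paths in the figure could be generated with a sketch like this, reusing c, betas, and target_vol from before (the number of paths drawn here is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

months = 60
# Draw a handful of individual autocorrelated paths
for path_num in range(20):
    shocks = np.random.normal(0, target_vol, (months,))
    y_list = []
    for i, e in enumerate(shocks):
        lag1 = y_list[i-1] if i >= 1 else 0
        lag2 = y_list[i-2] if i >= 2 else 0
        y_list.append(c + betas[0]*lag1 + betas[1]*lag2 + e)
    # Compounded level of each path over time
    plt.plot(np.cumprod(np.array(y_list) + 1) - 1)
plt.xlabel('month')
plt.ylabel('compounded change in CPI')
plt.show()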

The normal version of our CPI simulation fails to capture this snowball effect and therefore understates the range of possible outcomes:

Assuming no autocorrelation understates volatility (Graphic created by author)

Which is closer to reality? My autocorrelated simulation definitely overstates the likely range of outcomes for inflation – it’s a simple simulation that fails to account for some of inflation’s other properties such as mean reversion. But without autocorrelation, you would look at your simulation results and assume that extreme phenomena like hyperinflation or persistent deflation are statistical impossibilities. That would be a grave mistake.



