Time Series Forecasting for Beginners

Key concepts, folk wisdom, and common mistakes

Wicaksono Wijono
Towards Data Science


Image by George Hodan on Needpix

It is unfortunate that intro to data science is often split into regression / classification / clustering. There should be a fourth major category: time series.

Time series are difficult to model. They are a completely different beast from the typical flat, tabular data. When we don’t talk about time series as a distinct problem, newcomers to data science will inadvertently apply standard regression algorithms without understanding how the model completely fails.

This article is primarily aimed at newbies, but experienced modelers might pick up a few useful tidbits. It is part of a series of articles addressing common basic mistakes, such as misunderstanding the t-test and drawing unwarranted conclusions from pairs plots.

The focus will be on ARIMA, which performs exceptionally well for how simple it is to tune, making it perfect for beginners. Although many other articles provide an introduction to time series, this article adds value by discussing 1) folk wisdom and 2) common mistakes to avoid, which are strangely absent from many guides.

We’ll work with the classic AirPassengers data to keep things simple. The goal of this article is to communicate ideas; complex data will obscure the concepts.

Autocorrelation

Autocorrelation is the most important concept in time series. It is precisely what makes modeling them so difficult.

Many models assume independence, that the errors are uncorrelated. For instance, your clicking on an ad should not affect whether or not your neighbor clicks on that same ad.

In time series, the current value depends on past values. If the temperature today is 80 F, tomorrow it is more likely for the temperature to be around 80 F rather than 40 F.

If you swap the first and tenth observations in tabular data, the data has not changed one bit. If you swap the first and tenth observations in a time series, you get a different time series. Order matters. Not accounting for autocorrelation is almost as silly as this timeless classic.

We often check autocorrelation using the autocorrelation function (ACF) plot and the partial ACF (PACF) plot. (In R's plots for monthly data, the x axis shows lags in fractions of a year, so lag 1 on the plot means the same month in the previous year.) The ACF at lag k is simply the correlation between the value at time t and the value at time t-k. For instance, the ACF at lags 1 and 2 (in months):

library(data.table)          # for shift()
passengers <- AirPassengers  # the classic AirPassengers data

cor(
  passengers,
  shift(passengers, n = 1L),   # lag 1 (one month back)
  use = 'complete.obs'
)
cor(
  passengers,
  shift(passengers, n = 2L),   # lag 2 (two months back)
  use = 'complete.obs'
)

I’m using data.table’s shift() function. PACF is the correlation that is left over after controlling for all previous lags. For instance, if you want to get the PACF at lag 3, you can fit a regression using the first and second lags, then compute correlation on the residuals:

adjustment <- lm(
  passengers ~ shift(passengers, n = 1L) + shift(passengers, n = 2L)
)
cor(
  residuals(adjustment),              # what's left after controlling for lags 1 and 2
  shift(passengers, n = 3L)[-(1:2)],  # lag 3, aligned with the residuals
  use = 'complete.obs'
)

From a frequentist standpoint, one major goal of time series modeling is to remove autocorrelation from the residuals so that the residuals are reasonably iid white noise.

You cannot choose a time series model only by looking at RMSE or whatever your favorite metric is. This is the biggest sin an analyst can commit. Always plot the time series of residuals and check the ACF/PACF of residuals. We’ll get back to this later.

Some people mistakenly believe that you can choose the best time series model based on the ACF and PACF alone. This is true only for pure AR or pure MA processes, but those are rare. These are mainly diagnostic plots.

Stationarity

Stationarity is the second key concept, generally taken to mean that the time series has constant mean and variance. Recall the AirPassengers time series:

This obviously does not have a constant mean. One common trick is to take first differences, subtracting the observation at time t-1 from the observation at time t:

This series has constant mean but not constant variance. If we believe that the time series follows exponential growth, we can take log(Y) and difference that:

The series is now much more well-behaved. Make this procedure part of your EDA when working with classic time series forecasting methods. The log transformation makes sense when the series has exponential (multiplicative) growth.
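
A minimal sketch of these transformations in base R, using the same AirPassengers series:

plot(AirPassengers)              # trending mean, variance grows over time
plot(diff(AirPassengers))        # mean looks constant, variance still grows
plot(diff(log(AirPassengers)))   # roughly constant mean and variance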

Stationarity is required for many time series models, especially the part about constant mean. Think about it this way. As long as the mean is constant, we can predict somewhere close to the mean and we’ll be roughly correct; otherwise, we have no idea where the series will end up in the future.

Seasonal Trend Decomposition

The seasonal trend decomposition is useful for visualizations but generally can’t be used to forecast. The method is very easy to understand. For weekly-seasonal data, we can:

  1. Compute the 7-day moving average (from t-3 to t+3). This is your trend.
  2. Subtract the trend from your data. Then take the average for each day of the week (Monday, Tuesday, …). This is your seasonal component.
  3. After subtracting both the trend and seasonality from your raw observations, what’s left is the residuals.

Look very carefully at the first step. The trend is a moving average, and you actually need future observations to estimate the trend.

I’ve seen some people perform the decomposition and then train a model to forecast the trend because it’s “easier”. Uh, no. First of all, that’s “cheating” because it artificially reduces RMSE. Second of all, the trend is a moving average that has absorbed future values — a massive data leak.
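
If you want to see this decomposition for yourself, base R's decompose() implements the classical moving-average version of the procedure above; a quick sketch on the monthly AirPassengers series (so the trend is a centered 12-month moving average):

# Classical decomposition: moving-average trend, per-month seasonal averages,
# and whatever is left over as the remainder
dec <- decompose(log(AirPassengers))   # log first, since the swings grow with the level
plot(dec)
head(dec$trend, 12)   # note the leading NAs: the trend needs future observations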

ARIMA

ARIMA(p, d, q) has three components:

  1. AR(p) Auto-Regressive. This uses lagged values. For instance, AR(2) uses the observations from time t-1 and t-2 to predict the value at time t. This can be viewed as the persistent, structural component of the time series.
  2. I(d) Integrated. This is the number of times you need to difference the series for it to become stationary. If differencing is applied, the AR and MA components operate on the differenced series.
  3. MA(q) Moving Average. This uses lagged values of residuals. For instance, MA(2) uses the residuals from time t-1 and t-2 to predict the value at time t. This can be viewed as the local trend of the time series.

We often write (2, 1, 2) to mean an ARIMA model with components AR(2), I(1), and MA(2). As a rule of thumb, anything over (2, 1, 2) is likely to be overfitting. (Super important!) If you’re fitting something like a (1, 1, 6) model, you might be better off adding a seasonal component (next section).

ARIMA is easy to tune as we only need to try 3 x 2 x 3 = 18 possible models and pick the best one. I(2) might happen for some scientific processes but most business data only needs I(1).

The parameters are not intuitive, but the models have certain meanings. Of note is (1, 1, 2), the damped trend model. You can think of this model as extrapolating the observed trend but with some regularization (and we know regularization works well in practice). It has been argued that (1, 1, 2) should be the default setting for ARIMA because it works well in so many scenarios. Damped trend is especially great when you have very little data, e.g. for a newly launched product.
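
As a quick sketch of what fitting the damped trend model looks like with R's forecast package (here on the logged AirPassengers series):

library(forecast)
damped <- Arima(log(AirPassengers), order = c(1, 1, 2))   # the damped trend model
plot(forecast(damped, h = 12))                            # 12-month-ahead forecast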

ARIMA is really hard to beat for short-term forecasts but fails for longer-term forecasts because a lot of time series need an MA component. Suppose you have MA(1). You can use the observed residual at time t to forecast t+1, but you don’t have the residual at t+1 to forecast t+2. If you need to make long-term forecasts, look into something like Facebook’s Prophet.

I don’t like the auto.arima() function in R. Earlier, I mentioned that time series models should not be evaluated on a single metric alone. The model with the lowest AIC or BIC can have ill-behaved residuals; we want to pick the model with the best metric out of the ones with well-behaved residuals.

Also, auto.arima() by default searches outside (2, 1, 2) so it can pick needlessly complex models — and because of the larger search space, it uses heuristics so you can fall into a local optimum.

We previously saw that taking logs combined with first-differencing yields a roughly stationary series, so let's apply the log transform and search for the parameters, excluding the last 14 (this should be 12 but I'm not rerunning the entire thing) months of data for our test set:

library(forecast)

log_passengers <- ts(log(passengers), frequency = 12)
train <- ts(
  log_passengers[1:(length(log_passengers) - 14)],
  frequency = 12
)

# Grid search over all ARIMA orders up to (2, 1, 2), ranked by AIC
placeholder <- list()
i <- 1
for(p in 0:2){
  for(d in 0:1){
    for(q in 0:2){
      model <- Arima(train, order = c(p, d, q))
      placeholder[[i]] <- c(p, d, q, AIC(model))
      i <- i + 1
    }
  }
}
performance <- do.call('rbind', placeholder)
colnames(performance) <- c('p', 'd', 'q', 'AIC')

The lowest AIC comes from (2, 1, 1). Here we check the ACF and PACF of residuals:

These are not satisfactory, particularly the spike at lag 1. The next best AIC is from (1, 1, 2):

These are also not satisfactory because of the spike at lag 1. Looking at the original time series, we obviously need a seasonal component. We have monthly data; the spike corresponds to the annual (12-month) seasonality.

SARIMA

SARIMA(p, d, q) × (P, D, Q)ₘ works similarly to ARIMA but with a Seasonal component. You can think of it as combining two models: one with lag steps of 1 (t-1, t-2, …) and one with lag steps of m (t-m, t-2m, …). But to gain a more precise understanding, you should learn the backshift notation because the d and D components have some interplay.

Just like with ARIMA, we can test all possible parameter values, keeping them within (2, 1, 2). The m depends on the granularity of your time series. For hourly data, try m = 24 for 24 hours in a day; for daily data, try m = 7 for 7 days in a week; and for monthly data, try m = 12. The biggest limitation of SARIMA is that it only allows for one seasonality term. When you have daily data, it’s reasonable to believe there’s both a weekly and annual seasonality, but sadly we can only incorporate one into the model. One workaround is to incorporate Fourier terms, which can be a pain in the SARIMA framework if the time series is not roughly stationary.

Anyway, onto finding the appropriate setting. Let’s venture into the pyramid of doom:

# Grid search over SARIMA orders, keeping both parts within (2, 1, 2)
placeholder <- list()
i <- 1
for(p in 0:2){
  for(d in 0:1){
    for(q in 0:2){
      for(P in 0:2){
        for(D in 0:1){
          for(Q in 0:2){
            model <- tryCatch({
              Arima(train,
                    order = c(p, d, q),
                    seasonal = c(P, D, Q))
            }, error = function(err){
              # fall back to conditional sum of squares if ML estimation fails
              Arima(train,
                    order = c(p, d, q),
                    seasonal = c(P, D, Q),
                    method = 'CSS')
            })
            placeholder[[i]] <- c(p, d, q, P, D, Q, AIC(model))
            i <- i + 1
          }
        }
      }
    }
  }
}
performance <- do.call('rbind', placeholder)
colnames(performance) <- c('p', 'd', 'q', 'P', 'D', 'Q', 'AIC')

This runs much slower than before because we have to consider 18 x 18 = 324 parameter combinations. We have to use a tryCatch here because sometimes the default optimization algorithm doesn't work. We end up with (2, 1, 2) × (1, 0, 1)₁₂ as the best model.

These diagnostic plots look fine, but we still need to check the residuals carefully. Here's a comparison showing the improvement from adding seasonality:

The ACF and PACF plots look fine, and the residuals reasonably look like white noise. There is one more thing we can’t forget to check: reasonableness of forecasts:

Sometimes the other diagnostics look fine but the forecast is clearly off. Here, the forecast looks reasonable.

Other people might choose different models. That’s fine as long as all the diagnostics look fine. Remember that all models are wrong, but some are useful. Rather than obsessing over the “true” model, make sure that the model is in the ballpark of being useful.

De-biasing Forecasts

This section covers a very common mistake. We took the log of the time series, so to create the forecast we only need to take the exponent, right? Not necessarily. The choice boils down to whether we want the median (minimizing MAE) or the mean (minimizing MSE). Here, forecasting the median can also be interpreted as minimizing the % error (kinda). Sometimes the business context wants to minimize MSE, and analysts forget to make the model forecast the mean.

Under the model assumptions, the residuals of the logged series are normally distributed, which means that on the original scale the forecasts follow a lognormal distribution. The mean of a lognormal distribution is exp(μ + 0.5 σ²), not exp(μ), which is its median. The σ² term can be estimated by taking the variance of the residuals.
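
A quick numerical illustration, with made-up values for μ and σ²:

mu     <- 5      # hypothetical forecast on the log scale
sigma2 <- 0.04   # hypothetical variance of the residuals
exp(mu)                  # the median on the original scale
exp(mu + 0.5 * sigma2)   # the mean on the original scale, about 2% higher here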

Instead of manually taking the log of the time series, you can keep the original time series and transform it within Arima():

Arima(..., lambda = 0, biasadj = TRUE)

The forecasts are then automatically computed as exp(μ + 0.5 σ²).

Here’s our SARIMA model with bias adjustment:

model <- Arima(
  subset(passengers, start = 1, end = 132),  # hold out the final 12 months
  order = c(2, 1, 2),
  seasonal = c(1, 0, 1),
  lambda = 0,       # model the log of the series
  biasadj = TRUE    # forecast the mean instead of the median
)
plot(forecast(model, 12, biasadj = TRUE),
     ylim = c(300, 700), xlim = c(1959, 1961))
lines(passengers)   # overlay the actuals, including the held-out test set

That forecast on the held-out test set looks really good, if you ask me.

Time Series Regression

Sometimes called (S)ARIMAX, where X stands for eXternal regressor. The procedure is simple:

  1. Fit y~x1+x2+…
  2. Fit (S)ARIMA on the residuals

That’s it.
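
Here's a minimal sketch of the two-step procedure on simulated data (the series and regressors below are made up purely for illustration). Note that forecast::Arima() can also fit the regression and the ARIMA errors jointly through its xreg argument:

library(forecast)
set.seed(123)
x1 <- cumsum(rnorm(120))   # a made-up I(1) regressor
x2 <- rnorm(120)           # a made-up stationary regressor
y  <- 5 + 0.8 * x1 + 0.3 * x2 + arima.sim(list(ar = 0.5), n = 120)

step1 <- lm(y ~ x1 + x2)                              # step 1: plain regression
step2 <- Arima(residuals(step1), order = c(1, 0, 0))  # step 2: ARIMA on the residuals
# One-step alternative (regression with ARIMA errors):
# Arima(y, order = c(1, 0, 0), xreg = cbind(x1, x2))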

The first step is actually super tricky. Time series, by their very nature, unfortunately share a common confounder: time. You can get models with good performance on the training data, but the model is nonsensical due to spurious correlations. Independent random walks often appear strongly correlated!

# Two independent random walks that will often look (spuriously) correlated
rw1 <- cumsum(rnorm(100, 0, 1))
rw2 <- cumsum(rnorm(100, 0, 1))
plot(
  rw1,
  type = 'l',
  main = paste0(
    'Correlation = ', round(cor(rw1, rw2), 2),
    ', p approx. ', round(cor.test(rw1, rw2)$p.value, 2)
  ),
  ylim = c(min(c(rw1, rw2)), max(c(rw1, rw2)))
)
lines(rw2)

Some econometrics models require trend-stationarity, i.e. each time series becomes stationary after removing the (linear) trend. This one is relatively straightforward to check, but I expect most series are not trend-stationary.

We generalize this concept to cointegration. Stated simply, if all of the time series (both Y and the X's) are I(1), then the regression in step 1 must result in I(0) residuals. Alternatively, if all the time series are I(2), then the regression in step 1 must result in I(1) or I(0) residuals. And no, you cannot cheat by differencing a series one extra time and pointing out that the result is still stationary; the order of integration is the minimum number of differences needed, so an I(1) series stays I(1).

If the time series are not cointegrated, then you cannot do time series regression. Think about what cointegration conceptually means. Pretend Y is a dog and X is their owner.

If Y and X are cointegrated, the dog is on a leash. The dog can walk anywhere within the radius of the leash, but it is attached to its owner. If we know where the owner is, we have a good idea of where the dog is.

If Y and X are not cointegrated, the dog is not on a leash. It can run away and explore the entire city. Knowing the location of the owner does not tell us the general vicinity of the dog. The bigger the difference in time, the greater the uncertainty. Maybe the dog just ran away and is still within the block in the next five minutes; but after an hour, we have no idea where it is.

In business settings, most time series are I(1). Check that the regression residuals are stationary. If not, then you need a model that’s more advanced than ARIMA (Facebook Prophet is the likely candidate).
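
A rough, Engle-Granger-style sketch of that check (assuming the tseries package for adf.test(); strictly speaking the formal test uses different critical values), reusing two independent random walks like the ones above:

library(tseries)   # assumed installed, for adf.test()
rw_a  <- cumsum(rnorm(200))
rw_b  <- cumsum(rnorm(200))
step1 <- lm(rw_a ~ rw_b)
adf.test(residuals(step1))   # a high p-value means the residuals look non-stationary,
                             # i.e. no evidence of cointegration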

Note on generic machine learning

Random forest and other tree algorithms can work if the time series is stationary. They use external regressors to capture seasonality without applying (S)ARIMA on the residuals.

However, what happens when the time series has a trend? Understanding how machine learning algorithms work under the hood is absolutely crucial.

Consider a tree with one split, where both X and Y trend upward. The split says "if X > x, predict Y = y_high; else predict Y = y_low". Once future values of X exceed x, every forecast is the same y_high, so the forecast is a flat line. Trees cannot learn trends.

One way to ameliorate this is to difference the original time series so it becomes stationary, if possible. Then you can forecast by taking the most recent observed point and adding the cumulative sum of forecasts. However, because the algorithm has no understanding of serial correlation, expect to see long streaks of under/over-prediction.
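
A tiny sketch of that reconstruction step; the forecasts of the differences below are made-up numbers standing in for whatever your model produces:

y       <- cumsum(rnorm(50, mean = 1))     # a made-up trending series
fc_diff <- c(1.1, 0.9, 1.2)                # hypothetical forecasts of diff(y)
fc_level <- tail(y, 1) + cumsum(fc_diff)   # forecasts back on the original scale
fc_level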

Note on seasonality

If your time series looks "roughly stationary" — flat with a subtle cycle — then you can add Fourier terms for seasonality. For instance, we don't expect conversion rate to have a strong trend, but it might have strong daily and weekly seasonality. You can train a logistic regression with these terms as predictors.

If you have hourly data, try adding:

  • daily seasonality using the predictors sin(2πkt/24) and cos(2πkt/24)
  • weekly seasonality using the predictors sin(2πkt/(24*7)) and cos(2πkt/(24*7))

where t is the time, and k = 1, 2, 3, … You can select k using regularization.
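
A sketch of building these features for hourly data (the object names are just illustrative); the forecast package's fourier() function can also generate such terms for you:

t <- 1:(24 * 7 * 8)   # eight weeks of hourly time indices
K <- 3                # harmonics per seasonality; in practice, pick K via regularization

fourier_features <- do.call(cbind, lapply(1:K, function(k) {
  cbind(sin(2 * pi * k * t / 24),          # daily seasonality
        cos(2 * pi * k * t / 24),
        sin(2 * pi * k * t / (24 * 7)),    # weekly seasonality
        cos(2 * pi * k * t / (24 * 7)))
}))
dim(fourier_features)   # one row per hour, 4 * K columns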

Practical Constraints

Sometimes people make the cute mistake of forgetting that, at the end of the day, the model’s purpose is forecasting. If the business requires you to forecast 7 days into the future, you better make sure that whatever regressors you use can be obtained 7 days beforehand. Make sure to lag your external regressors, if using any.

Model Validation

Cross-validation is amazing for getting a sense of out-of-sample error. Sure, CV gets the spotlight for its role in hyperparameter tuning, but many of us (cough Bayesians cough) use CV to assess how well the model generalizes, not to choose hyperparameters.

I have seen people use regular CV on time series. This is completely invalid because autocorrelation is a thing.

There are two main ways to do cross-validation on time series, each with different assumptions.

The first is leave-future-out cross-validation (LFOCV). For instance, we can train the model using t = 1:36 and forecast 12 steps into the future and record the error. Then train using t = 1:48 and forecast 12 steps into the future. Then t = 1:60. And so on.

Ideally, we forecast 1 step into the future and refit the model with t = 1:36, 1:37, 1:38, and so on. However, this is practically LOOCV and computationally very expensive. Using increments of 12 is done out of convenience, like choosing k = 5 or 10.

The second is to use sliding windows. Here we fix the length of the training data and slide, so t = 1:36, 13:48, 25:60, and so on. This post has a great visualization of the two methods.
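
As a hedged sketch, the forecast package's tsCV() covers both flavors: by default it refits on an expanding window at every forecast origin (leave-future-out), and passing a window length gives the sliding-window version. Here we reuse the SARIMA order chosen earlier (the 60-month window is an arbitrary choice):

library(forecast)
fc_fun <- function(y, h) {
  forecast(Arima(y, order = c(2, 1, 2), seasonal = c(1, 0, 1), lambda = 0), h = h)
}
# Both calls refit the model at every origin, so they take a little while
e_lfo     <- tsCV(AirPassengers, fc_fun, h = 12)               # expanding window
e_sliding <- tsCV(AirPassengers, fc_fun, h = 12, window = 60)  # sliding window
sqrt(mean(e_lfo^2, na.rm = TRUE))   # out-of-sample RMSE across origins and horizons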

Which one is better? It depends on how “sticky” the behavior of the time series is. If the behavior seems consistent, then LFOCV works better because you’ll use all your data for training the final model. If the behavior is more erratic and will vary over time, then sliding window works better because each time you make a prediction, the model will be trained on a recent subset of your data.

(Ah, I forgot to mention earlier. Unlike tabular data, more data does not necessarily mean better predictions. If your time series keeps changing behavior, using only the most recent subset is more sensible. There are dynamic AR models but those are beyond the scope of this article.)

Remember, our goal is to evaluate how well the model performs once it roams in the wild, so we want to make the condition as close as possible to when the model is deployed.

Evaluation metrics

We have not discussed evaluation metrics yet. Personally, I really dislike MAPE and MAE. Depending on your time series, optimizing either of these two introduces systematic bias — forecasts are too low.

In the case of MAPE, the error from underpredicting is at most 100%, while the error from overpredicting is unbounded.

In the case of MAE, the optimal forecast is the median, and if you have exponential growth the median can be considerably lower than the mean.

If the business cares about percentage errors, try log(predicted / actual) instead. In other words, the residuals from forecasting the logged time series. The average of this, when exponentiated, yields a “fairer” percentage error.

If the business cares about squared errors, then you should compute RMSE on the bias-adjusted forecasts.
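
A small sketch of both, assuming actual holds the held-out values and predicted holds the bias-adjusted forecasts (the numbers are made up):

actual    <- c(420, 472, 548, 559)   # hypothetical held-out values
predicted <- c(431, 465, 530, 571)   # hypothetical bias-adjusted forecasts

log_ratio <- log(predicted / actual)   # residuals on the log scale
exp(mean(log_ratio))                   # close to 1 is good; 1.02 would mean forecasts run ~2% high
sqrt(mean((actual - predicted)^2))     # RMSE on the original scale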

Going Beyond

If you’re new to time series and want to learn more, I suggest going through Forecasting: Principles and Practice, a free online textbook by Rob Hyndman, the author of R’s forecast package. It focuses more on the understanding and application rather than the math, which is perfect for a newbie.

One major disadvantage of models like ARIMA is the lack of interpretability. Coefficients on lagged values can hardly be explained. In addition, while ARIMA makes great short-term forecasts, it fails at forecasting longer periods. Facebook's Prophet package tackles these two issues by providing a seasonal-trend-decomposition-like approach, extrapolating a piecewise linear (or logistic) trend for forecasting.

If you’re looking for advanced time series methods, I suggest learning about Bayesian structural time series (bsts) models. They are state space models — think of them as Hidden Markov Models that are much more flexible and can handle continuous latent states. This class of models is very promising for several reasons:

  • It explicitly accounts for measurement error. Frequentist methods like ARIMA assume the series is measured without error, so any measurement noise gets propagated into the forecasts. If you like ARIMA that much, it can be formulated as a state space model.
  • It can be applied to causal inference. I believe Google’s CausalImpact is the most useful package for quasi-experiments. It’s great for measuring the impact of marketing campaigns.
  • It allows for dynamic regression. Many models fit a single set of coefficients to the training data and assume that these coefficients are static. When what we're modeling is constantly changing, we might want to let the coefficients change over time.
  • It incorporates regularization in a clever way. It uses the spike-and-slab prior, a point mass at 0 plus a Gaussian. It's similar to elastic net but allows posterior draws that are exactly 0.
