
How To: Forecast Time Series Using Lags

Lag columns can significantly boost your model's performance. Here's how you can use them to your advantage.

Image by author

The nature of a time series model is such that past values often affect future values. When there’s any kind of seasonality in your data (in other words, your data follows an hourly, daily, weekly, monthly or yearly cycle) this relationship is even stronger.

Capturing this relationship can be done with features like hour, day of week, month, etc., but you can also add lags, which can quickly take your model to the next level.

What is a lag?

A lag value is simply this: a value that, at some earlier point in time, preceded your current value.

Let’s say you have a time series dataset that has the following values: [5,10,15,20,25].

25, being your most recent value, is the value at time t.

20 is the value at t-1. 15 is the value at t-2, and so on, until the beginning of the dataset.

This makes intuitive sense, since the word "lag" suggests that something is "lagging behind" something else.

When we train a model using lag features, we can train it to recognize patterns with regard to how preceding values affect current and future values.
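Outside of a dedicated forecasting library, you can build lag columns yourself with pandas' `shift`. A quick sketch using the toy series above:

```python
import pandas as pd

# Toy series from the example above
df = pd.DataFrame({"y": [5, 10, 15, 20, 25]})

# shift(k) moves the column down k rows, so each row sees
# the value from k time steps earlier
df["lag_1"] = df["y"].shift(1)
df["lag_2"] = df["y"].shift(2)

print(df)
#     y  lag_1  lag_2
# 0   5    NaN    NaN
# 1  10    5.0    NaN
# 2  15   10.0    5.0
# 3  20   15.0   10.0
# 4  25   20.0   15.0
```

Note the NaNs at the start: the first rows have no preceding values to look back on, which is why lag-based models need enough history before the first usable training row.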

Why lags? (+ How to implement)

To showcase how lags can benefit your model, I’ll be walking you through an example using an hourly energy consumption dataset (CC0 1.0 license).

Here is a sample of about 4 weeks of this dataset, so you get a feel for what it looks like:

Screenshot by author

As you can see, there is some weekly seasonality as the weekends (showcased by red circles) tend to have lower usage across this 4 week slice. There’s also a clear daily seasonality as the peaks of the days tend to be between 17:00 and 19:00 (5PM to 7PM).
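You can sanity-check this kind of daily seasonality numerically by averaging usage per hour of day. Here's a sketch on a synthetic stand-in series (not the real AEP data; the peak hour of 18 is built in to mirror the 17:00–19:00 peak described above):

```python
import numpy as np
import pandas as pd

# Synthetic hourly stand-in with a daily cycle peaking at hour 18
idx = pd.date_range("2018-07-01", periods=24 * 28, freq="h")
usage = 15000 + 2000 * np.cos(2 * np.pi * (idx.hour - 18) / 24)
df = pd.DataFrame({"ds": idx, "y": usage})

# Average usage per hour of day exposes the daily cycle
hourly_profile = df.groupby(df["ds"].dt.hour)["y"].mean()
print(hourly_profile.idxmax())  # 18
```

The same groupby trick with `dt.dayofweek` would surface the weekend dip.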

When you zoom out, you can also see that usage patterns differ between months of the year (particularly summer vs winter months).

Screenshot by author

If I were to train a regular time series model, I’d focus on the following features:

  • Hour of day
  • Day of week
  • Month of year
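For reference, these calendar features are just attributes of the timestamp. mlforecast derives them for you via `date_features` (shown below), but a minimal pandas sketch looks like this:

```python
import pandas as pd

ts = pd.DataFrame({"ds": pd.to_datetime(["2018-07-27 17:00", "2018-07-28 18:00"])})

# The same calendar features listed above, pulled from the timestamp
ts["hour"] = ts["ds"].dt.hour
ts["dayofweek"] = ts["ds"].dt.dayofweek  # Monday=0 ... Sunday=6
ts["month"] = ts["ds"].dt.month

print(ts)
```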

I’m going to show you an example using the NIXTLA mlforecast library, since it not only makes time series forecasting very simple, but it’s also able to easily add lag features to your time series models.

First, I trained a regular model using only the features I listed. To start, I loaded the dataset in and prepared it for the NIXTLA library:

import pandas as pd
from mlforecast import MLForecast
from sklearn.ensemble import RandomForestRegressor

# Load in data and basic data cleaning
df = pd.read_csv('AEP_hourly.csv')
df['Datetime']=pd.to_datetime(df['Datetime'])
# Sort by date
df.set_index('Datetime',inplace=True)
df.sort_index(inplace=True)
df.reset_index(inplace=True)
# This dataset is huge with over 100,000 rows
# Get only the last 10,000 rows of hourly data (a little over a year of data)
df = df.tail(10000)

# NIXTLA requires that your date/timestamp column be named "ds"
# and your target variable be named y
df.rename(columns={'Datetime':'ds','AEP_MW':'y'},inplace=True)
# NIXTLA requires a "unique_id" column in case you are training
# more than one model using different datasets, but if you're only
# training with 1 dataset, I just create a dummy constant variable
# column with a value of 1
df['unique_id']=1

I then prepared my data for modeling by splitting into train and test sets:

# Split into train/test sets. For this problem, 
# I don't want a huge test set, since when using lags, you'll have to
# predict using predictions after the first hour forecast.
# (More on this later)
# So my test set will be only 48 hours long (48 rows)
train_size = df.shape[0] - 48

df_train = df.iloc[:train_size]
df_test = df.iloc[train_size:]

Next, I trained a Random Forest model using NIXTLA:

# NIXTLA allows you to train multiple models, so it requires
# a list as an input. For this exercise, I only trained 1 model.
models = [
    RandomForestRegressor(random_state=0)
]

# Instantiate an MLForecast object and pass in:
# - models: list of models for training
# - freq: timestamp frequency (in this case it is hourly data "H")
# - lags: list of lag features (blank for now)
# - date_features: list of time series date features like hour, month, day
fcst = MLForecast(
    models=models,
    freq='H',
    lags=[],
    date_features=['hour','month','dayofweek']
)

# Fit to train set
fcst.fit(df_train)

Lastly, I predicted on the test set (using the forecast object to forecast the next 48 hours and comparing to the actual test set values) and ran a cross-validation with 3 windows, each predicting a 24-hour chunk:

from sklearn.metrics import mean_squared_error
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import rmse

# Predict 
predictions = fcst.predict(48)

# Compare to test set. This returns a result of 689.9 RMSE
print(mean_squared_error(df_test.y.values,predictions.RandomForestRegressor.values,squared=False))

# Run cross validation with train df
cv_df = fcst.cross_validation(
    df=df_train,
    h=24,
    n_windows=3,
)

# Get CV RMSE metric
cv_rmse = evaluate(
    cv_df.drop(columns='cutoff'),
    metrics=[rmse], 
    agg_fn='mean'
)

# Prints out 1264.1
print(f"RMSE using cross-validation: {cv_rmse['RandomForestRegressor'].item():.1f}")

So this time series model – using hour, day of week, and month – had an average cross-validation RMSE of 1264.1 and a test set RMSE of 689.9.

Let’s compare this to a lag-based model.

# Pass in lags as a list argument - I'm tracking lags for 24 hours
# since the goal of our model is to forecast 24 hours at a time
fcst_lags = MLForecast(
    models=models,
    freq='H', 
    lags=range(1,25), 
    date_features=[] 
)

fcst_lags.fit(df_train)

# Forecast the next 48 hours for the test set
predictions_lags = fcst_lags.predict(48)

# RMSE test score w/ lags: 421.86
print(mean_squared_error(df_test.y.values,predictions_lags.RandomForestRegressor.values,squared=False))

# Cross validation:
cv_df_lags = fcst_lags.cross_validation(
    df=df_train,
    h=24,
    n_windows=3,
)

cv_rmse_lags = evaluate(
    cv_df_lags.drop(columns='cutoff'),
    metrics=[rmse], 
    agg_fn='mean'
)

# RMSE for CV w/ lags: 1038.7
print(f"RMSE using cross-validation: {cv_rmse_lags['RandomForestRegressor'].item():.1f}")

Let’s compare the metrics side by side:

  • Model without lags + time series features (hour, day of week, month) test RMSE: 689.9
  • Model without lags + time series features CV RMSE: 1264.1
  • Model with lags test RMSE: 421.86
  • Model with lags CV RMSE: 1038.7

Note: The CV RMSE is a lot higher than the test RMSE in both cases. This is because the CV is evaluating different data than the test set.

The CV is evaluating on the following 3 days: July 29–31, predicting 24 hours at a time and taking the average. The holdout test set is evaluating August 1–2, predicting 48 hours at a time.

To investigate this a bit further, I plotted the cross-validation predictions against the actuals, and got the RMSE for each split window (there were 3 – one per day).

It appears the model, in both cases, had a significant under-prediction for July 30th (RMSE of 1445) and a slight overprediction for July 31st (RMSE 888). This brought up the CV average.

Lag prediction model cross validation predicted vs actuals. Image by author

So it's possible that, for some reason (potentially due to variables we didn't consider here, such as weather), the CV holdout days were simply harder to predict for both models.

It’s always important to investigate what’s going on when your metrics look a bit off.

In a real ML project case, I would do a much deeper dive into why these days in particular were harder to predict, but I won’t for the purpose of this article — just noting that it’s always a good idea to do so.

If I take the averages:

Model without lags: 977.0

Model with lags: 730.28
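These averages are just the means of each model's test and CV RMSE:

```python
# Mean of (test RMSE, CV RMSE) for each model
avg_without_lags = (689.9 + 1264.1) / 2
avg_with_lags = (421.86 + 1038.7) / 2

print(round(avg_without_lags, 2), round(avg_with_lags, 2))  # 977.0 730.28
```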

Regardless of averages though, the model with lags outperforms the model without lags in both CV and holdout test set.

A word of caution

Lags can provide your model with useful information and improve its performance. They’re also fairly easy to implement, especially with a well built time series library such as NIXTLA (Note: I am not sponsored by NIXTLA).

But over-relying on lags can become a problem, especially if your goal is to forecast longer time horizons.

Basically, the deal with lags is that at some point the model no longer has actual values to use as features, so it has to rely on predictions. This introduces some error to the model. And as you make more and more predictions, the error compounds.

For example, let’s say you are only using 1 lag column and your dataset is as follows: [1,2,3,4,5]. You trained your model on this dataset and now it’s time to make the forecast for the next 5 rows.

On the first pass, we are on the next timestep past 5, let’s call it time t. The lag feature at t-1 is 5. The model predicts the next value will be 6. To predict the next value, at t+1, we will need to use our prediction: 6. This introduces uncertainty since 6 was a prediction, not a real lag value.

If the model predicts 8 at t+1, the next prediction, at t+2 must take in 8 (a prediction with an extra bit of uncertainty since it was produced by using 6 as a feature) as the next lag feature.

And so on, with each prediction increasing in uncertainty.

So you can see how this could lead to worse performance over longer horizons.
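Here is a minimal sketch of that recursive loop, with a hypothetical stand-in "model" that carries a small constant bias so you can watch the error accumulate (the +0.1 bias and the round-off are illustration choices, not anything from the article's model):

```python
# Recursive (autoregressive) forecasting with a single lag feature.
# The true pattern is +1 per step, but this stand-in model carries
# a +0.1 bias to mimic model error.
history = [1, 2, 3, 4, 5]

def model_predict(lag_1):
    return lag_1 + 1 + 0.1  # biased prediction

forecasts = []
last = history[-1]  # the last *actual* value we have
for step in range(5):
    pred = round(model_predict(last), 1)
    forecasts.append(pred)
    last = pred  # feed the prediction back in as the next lag feature

# The bias compounds: each step drifts 0.1 further from the truth
print(forecasts)  # [6.1, 7.2, 8.3, 9.4, 10.5]
```

After five steps the forecast is 0.5 above the true value of 10, even though each individual prediction was only off by 0.1 relative to its (already wrong) input.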

Conclusion

Lags can be great for forecasting shorter time horizons, like 1–48 rows. Beyond that, you need to watch your error carefully. Use prediction intervals to measure uncertainty.

It’s also important to prioritize other numerical and categorical features besides just lags. Marking whether it is a holiday, a weekend vs weekday, or the season can also improve your model, as well as including external variables like temperature.

However, everything depends on your specific dataset, so always experiment with multiple features and combinations.

Find the full source code and dataset here.

Thanks for reading

  • Connect with me on LinkedIn
  • I’m now offering 1:1 Data Science tutoring, career coaching/mentoring, writing advice, resume reviews & more on Topmate!
