
Mini Guide to Supervised Learning for Time Series Forecasting

Model training, feature engineering and error calculation techniques for Time Series Forecasting via Supervised Learning

Photo by Stephen Dawson on Unsplash

One of the first techniques I learned for Time Series Forecasting was ARIMA. But as I started building forecasting models, I came across research papers and blog posts about using supervised learning models to forecast time series data. These models offer benefits over ARIMA, especially when the forecast needs to be made at multiple granularities. In this blog post, I capture the lessons I have learned while building these models, along with some dos and don'ts. I will cover cross validation and model training, feature engineering, target variable engineering, and error calculation.

1. Data Splitting and Testing

To convert your forecasting problem into a supervised learning regression problem, you will need to restructure your data so that it has a target variable, aka y. A simple restructuring could use the values at timestamps t-5, t-4, t-3, t-2 and t-1 as the feature set, with the value at t as the target variable.

To illustrate this, I use the well-known shampoo sales time series dataset from Kaggle, which contains monthly shampoo sales over a 3-year period. I applied some preprocessing to generate a timestamp for the dataset and wrote a function to generate the feature set and target variable from a single time series.

import numpy as np
import pandas as pd

def supervised_sets(data, lookback):
    '''
    params:
    data - df with series
    lookback - number of time periods to look back
    return:
    X - feature set
    y - target variable
    '''

    X = pd.DataFrame()
    y = []

    # Slide a window of `lookback` observations over the series:
    # each window becomes one row of features, and the observation
    # that immediately follows it becomes the target
    for i in range(len(data) - lookback):

        window = data.iloc[i : lookback + i]
        X = pd.concat([X, pd.DataFrame(window[['Sales']].values.T)])
        y.append(data.iloc[lookback + i]['Sales'])

    # Name the columns t-lookback, ..., t-1 relative to the target t
    X.columns = ['t-' + str(lookback - k) for k in range(lookback)]

    return X.reset_index(drop = True), np.array(y)

data = pd.read_csv('shampoo_sales.csv')
# Months in this dataset look like '1-01'; prepending '200' gives a
# parseable timestamp such as '2001-01'
data['Timestamp'] = pd.to_datetime(data['Month'].apply(lambda x : '200' + x))
X, y = supervised_sets(data, 5)

Once your data has been restructured and is ready for a model to consume, an essential question is how to fit and test it. In a supervised learning problem without temporal structure, we can simply do a k-fold split to train and test over the entire dataset. But if you are feeding in a temporal series, you want the test dataset to contain only timestamps that occur after those in the training data. With k-fold splits, we risk the training data leaking novel forward-looking behavior while we test on observations from before that behavior started.

I’ll try to clear things up with a diagram:

Image by Author

For splits 2 and 3, the training data contains data from the future. Additionally, the mean error for split 3 will be misleading, since any change in forward-looking behavior will lead the model to predict inaccurate values for the past.

To resolve this issue, we can make use of Nested Cross Validation via Day Forward Chaining. I will use sklearn's TimeSeriesSplit to split the data.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits = 10)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

This function splits the data in such a way that we iteratively add time periods to the training dataset while using the subsequent time periods as test data. Here are a few lines of output from the code above to better explain what the function is doing:

TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10] TEST: [11 12]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12] TEST: [13 14]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] TEST: [15 16]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] TEST: [17 18]
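
To make the chaining concrete, here is a minimal sketch of training and scoring a model inside each fold. RandomForestRegressor is purely an illustrative choice; any regressor with a fit/predict interface works.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

fold_errors = []
for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit on the past, evaluate on the immediately following periods
    model = RandomForestRegressor(n_estimators = 100, random_state = 0)
    model.fit(X_train, y_train)
    fold_errors.append(mean_absolute_error(y_test, model.predict(X_test)))

print(np.mean(fold_errors))  # average MAE across the forward chains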

2. Feature Engineering for temporal features

Time Series Forecasting is based on the assumption that historical observations follow a pattern that can be used to predict future values. Therefore, feature engineering of temporal features is a critical part of building a model that performs well.

The first step in analyzing time series data is to plot a line graph of the series over time.

Image by Author

When I plotted the shampoo sales, I observed multiplicative seasonality: the change between time periods grows with time. We can also observe a weak two-month seasonality, where sales alternate between an increase and a decrease from one month to the next.

A good place to start with feature engineering is to plot the Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF). These will give you an idea of which lags are important and of the seasonality in the dataset. Seasonal decomposition is also a great way to understand the underlying temporal pattern in the data (though it requires some parameter tuning), as sketched below.
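
For reference, here is a minimal seasonal decomposition sketch using statsmodels; the multiplicative model and 12-month period are assumptions based on the monthly shampoo data.

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Decompose the series into trend, seasonal and residual components;
# model and period would need tuning for other datasets
result = seasonal_decompose(data.set_index('Timestamp')['Sales'],
                            model = 'multiplicative', period = 12)
result.plot()
plt.show()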

We will difference the series to make it stationary before plotting the correlation functions so that the statistical properties of the series are independent of time.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plot_acf(data['Sales'].diff().dropna().values)
plt.show()
plot_pacf(data['Sales'].diff().dropna().values)
plt.show()
Image by Author

Here the ACF suggests that lag 1 is significant, i.e. the previous timestamp matters. Additionally, the alternating signs of the lags support the hypothesis of an increase/decrease after each time period, although this pattern is not consistent across all lags. Including t-1 as a feature in the model would be helpful.

Image by Author

The PACF shows that the first lag is significant, followed by a few lags between 11 and 15; these lags affect the current timestamp in isolation. It would be good to include these lags as separate features, where lag 1 likely dictates the level and lags 11–15 dictate the weak seasonality. We can also add year and month indicator variables to the model alongside the values of these lags, as sketched below.
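
Here is a rough sketch of building those features on the original dataframe; the exact lag choices are illustrative, taken from the plots above.

# Add the lags suggested by the ACF/PACF as columns
for lag in [1, 11, 12, 13, 14, 15]:
    data['lag_' + str(lag)] = data['Sales'].shift(lag)

data['month'] = data['Timestamp'].dt.month
data['year'] = data['Timestamp'].dt.year

# Turn month/year into indicator (one-hot) variables
data = pd.get_dummies(data, columns = ['month', 'year'])
data = data.dropna()  # drop rows without enough lag history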

The above discussion offers a few ideas on how to kick-start your feature generation process using data analysis. Of course, trial and error yields the best results 🙂

3. Accounting for cyclicality in time-based features

We may want to include features derived from the timestamp at which an observation was recorded (date, day, month, hour, year) in the feature set. For example, we might include the month and year of the observation (if we know the level of the series increases each month), or the hour it was recorded (if we know there is strong hourly seasonality).

You do not want to use these values directly as integers, because that would mean an observation recorded in December (month = 12) is farther from an observation recorded in January (month = 1) than from one recorded in September (month = 9). We know that this is not the case. We want the model to know that January is closer to December than to September, and that hour 23 is closer to hour 0 than to hour 4. We can do this by taking the sin and cos of the temporal features to capture their cyclicality.

# Map month (1-12) onto a circle so December and January end up adjacent
data['sin_month'] = np.sin(2 * np.pi * data['Timestamp'].dt.month / 12)
data['cos_month'] = np.cos(2 * np.pi * data['Timestamp'].dt.month / 12)

4. Engineering your target variable

When I built my first supervised learning model for time series forecasting, I used a lag of order 1 as one of my features and got extremely low error metrics. On further inspection, I realized the model was just performing a random walk: since each time step was heavily dependent on the previous one, echoing the last observed value produced great metrics.

To avoid this, it is useful to define your target variable as the diff of order 1. Instead of the ML model predicting the value at timestamp t+n+1, it predicts the value at t+n+1 minus the value at t+n. This way, the model does not simply rely on the previous timestamp to predict the next one, but instead uses historical lags and exogenous regressors. Read Vegard's article for a great in-depth explanation of this topic: https://towardsdatascience.com/how-not-to-use-machine-learning-for-time-series-forecasting-avoiding-the-pitfalls-19f9d7adf424
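
As a minimal sketch of this restructuring, reusing the X and y built by supervised_sets earlier:

# Use the first difference as the target instead of the level
y_diff = np.diff(y)   # y_diff[i] = y[i + 1] - y[i]
X_diff = X.iloc[1:]   # row i + 1 of X pairs with the change y_diff[i]

# After the model predicts a change pred_diff for the next period,
# recover the level forecast by adding back the last observed value:
# forecast = y[-1] + pred_diff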

5. Error Monitoring

Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are good metrics for tracking the accuracy of supervised learning based forecasting models, but a metric I discovered last year and now use often is Mean Absolute Scaled Error (MASE). This metric accounts for the error we would get with a random walk approach, where the last timestamp's value is the forecast for the next timestamp. It compares the error from the model to the error from this naive forecast, which helps you numerically evaluate how much better your model really is.

def MASE(y_train, y_test, pred):

    # Mean absolute error of a one-step naive (random walk) forecast
    # on the training data
    naive_error = np.sum(np.abs(np.diff(y_train)))/(len(y_train)-1)
    # Mean absolute error of the model on the test data
    model_error = np.mean(np.abs(y_test - pred))

    return model_error/naive_error

You can also customize the MASE metric according to the seasonality of your dataset. For example, if you are working with a dataset that exhibits weekly seasonality, you can take a diff of order 7 instead of order 1 when calculating the naive error. You can set your own baseline model depending on your forecast frequency and dataset seasonality, as in the sketch below.
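
A sketch of what that seasonal variant might look like, with m as an assumed seasonal period:

def seasonal_MASE(y_train, y_test, pred, m = 7):

    # Baseline: a seasonal naive forecast that repeats the value from
    # m periods earlier (m = 7 suits daily data with weekly seasonality)
    naive_error = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    model_error = np.mean(np.abs(y_test - pred))

    return model_error/naive_error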

If MASE > 1, the model is performing worse than the baseline model. The closer MASE is to 0, the better the forecasting model.


If you made it this far, thanks for reading! I tried to capture some of my learnings from my journey in using supervised models for Time Series Forecasting. I would love to hear your learnings as well in the comments. I believe that we need a lot more blogs/articles sharing learnings in this area of forecasting. Let me know your thoughts 🙂

