
Time Series From Scratch – Train/Test Splits and Evaluation Metrics

Part 7 of Time Series from Scratch Series – Everything you need to know before modeling time series data. Learn how to split and evaluate…


Photo by Viktor Talashuk on Unsplash

Today you’ll learn the last theoretical bit needed for time series forecasting – train/test splits and evaluation metrics. These work differently than they do in regular machine learning. Keep reading to see how.

The article is structured as follows:

  • Train/test splits in time series
  • Evaluation metrics in time series
  • Conclusion

Train/test splits in time series

In machine learning, a train/test split divides the data randomly, since there’s no dependence between observations. That’s not the case with time series data. Here, you’ll want to use the values at the end of the dataset for testing and everything before them for training.

For example, if you had 144 records at monthly intervals (12 years), a good approach would be to keep the first 120 records (10 years) for training and the last 24 records (2 years) for testing.

Let’s see this in action. To start, you’ll import the libraries and the dataset – Airline Passengers:
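Here’s a minimal sketch of that step. It assumes pandas and matplotlib are installed and that the CSV has the usual Month and Passengers columns – the URL below points to one commonly used public mirror of the dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the Airline Passengers dataset (monthly totals, 1949-1960)
# Assumed source: a public mirror of the classic CSV
df = pd.read_csv(
    'https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv',
    index_col='Month',
    parse_dates=['Month']
)

df.head()
```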

Here’s what the dataset looks like:

Image 1 – Airline passengers dataset (image by author)

Let’s say you want to use the last two years for testing and everything else for training. You can use Python’s slicing notation to split the dataset:
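A sketch of that split, assuming the df loaded above – the last 24 monthly records (2 years) go to the test set, and the same snippet draws the chart shown below:

```python
# Last 2 years (24 monthly records) for testing, everything before that for training
train = df[:-24]
test = df[-24:]

# Plot both sets on a single chart
plt.figure(figsize=(12, 6))
plt.plot(train, label='Training data')
plt.plot(test, label='Testing data')
plt.title('Airline passengers – train and test sets')
plt.legend()
plt.show()
```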

Here are both sets, visualized:

Image 2 – Airline passengers train and test datasets (image by author)

And that’s all there is to train/test splits. Keep in mind – you should never split time series data randomly.


Evaluation metrics in time series

When evaluating time series models, you can either opt for relative model performance metrics or general regression metrics.

Relative model performance metrics

You’ll often hear two acronyms thrown around when choosing a time series model – AIC and BIC.

AIC, or Akaike Information Criterion, shows you how good a model is relative to other models. AIC penalizes complex models in favor of simple ones. For example, if two models forecast equally well but the second one has 10 more parameters, AIC will favor the first model.

The AIC value of a model is calculated with the following formula:

Image 3 – AIC formula (image by author)

Where k is the number of parameters in the model, L-hat is the maximum value of the likelihood function for the model, and ln is the natural logarithm.
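Written out, that’s AIC = 2k − 2·ln(L̂) – and the lower the value, the better the model.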

BIC, or Bayesian Information Criterion, is similar to AIC. It is an estimate of a function of the posterior probability of a model being true under a certain Bayesian setup (source). Once again, the lower the value, the better the model.

The BIC value of a model is calculated with the following formula:

Image 4 – BIC formula (image by author)

Where k is the number of parameters in the model, L-hat is the maximum value of the likelihood function for the model, n is the number of data points (sample size), and ln is the natural logarithm.
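Written out, that’s BIC = k·ln(n) − 2·ln(L̂).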

We won’t go over the Python implementation of AIC and BIC today, since that would require training multiple forecasting models. You’ll see how they work in future articles.

Remember that both AIC and BIC are relative metrics, so you can’t directly compare models trained on different datasets. Instead, among candidate models for the same dataset, choose the one with the lowest score.

General regression metrics

You can use any regression evaluation metric, such as MAE, MSE, or RMSE, to evaluate time series forecasts. We’ll go over two metrics today:

  • RMSE – Root Mean Squared Error
  • MAPE – Mean Absolute Percentage Error

RMSE tells you by how many units your model is off, on average. In our airline passengers example, RMSE tells you how many passengers you can expect the model to miss by in a typical forecast.

MAPE tells you how wrong your forecasts are in percentage terms. I like it because, in a way, it’s the equivalent of the accuracy metric in classification problems. For example, a MAPE value of 0.02 means your forecasts are 98% accurate.

The scikit-learn package doesn’t have a dedicated RMSE function, so we’ll calculate it manually by taking the square root of the MSE value.

Let’s implement both in code – we’ll declare rmse as a lambda function, make arbitrary actual and forecasted data, and calculate the error metrics:
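Here’s a sketch of that calculation. It assumes scikit-learn 0.24 or newer, which includes mean_absolute_percentage_error, and the actual and predicted arrays below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Declare RMSE as a lambda - the square root of scikit-learn's MSE
rmse = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical actual and forecasted passenger counts (in thousands)
actual = np.array([410, 415, 420, 425, 430, 435])
predicted = np.array([419, 405, 430, 414, 441, 425])

print(f'RMSE: {rmse(actual, predicted):.2f}')
print(f'MAPE: {mean_absolute_percentage_error(actual, predicted):.4f}')
```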

Here are the results:

Image 5 – RMSE and MAPE values (image by author)

In a nutshell – in an average month, the predictions are off by around 10 passenger units (the dataset is measured in thousands of passengers), or roughly 2.5%. That’s all you need to know.


Conclusion

And there you have it – the last theoretical bit you’ll need before modeling time series data. You now know how to separate the data for training and testing and evaluate both models and raw forecasts.

In the following article, you’ll build your first predictive model – with simple and exponential moving averages. These are often baselines for more refined models, but you’ll find they work extraordinarily well on some problems.

Thanks for reading.


Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.

Join Medium with my referral link – Dario Radečić


Time Series From Scratch Series

  1. Seeing the Big Picture
  2. Introduction to Time Series with Pandas
  3. White Noise and Random Walk
  4. Decomposing Time Series Data
  5. Autocorrelation and Partial Autocorrelation
  6. Stationarity Tests and Automation
