You’ve Already Made The Choice. You’re Here To Understand Why You’ve Made It. – The Oracle, The Matrix
Timeseries is one of the very few data disciplines where modelling keeps getting harder, almost every day.
For example, the abundance of data is great news for many other domains: we can train better models and expect better outcomes. But in the context of timeseries, the abundance of data often takes a different form. It usually means that the interval at which the data arrive is getting shorter. When that happens, instead of better explaining the underlying dynamics of the situation, newer and previously unknown quirks often start surfacing in the timeseries, for example fractional seasonalities or abruptly changing trends.
In many other disciplines, modelling and prediction walk hand in hand. In timeseries, they are often archenemies. Take the Efficient Market Hypothesis (EMH), for example. It is a theoretical underpinning of finance, yet it was (and still is) a strong blow to financial timeseries prediction, because it says that stock prices are mere random walks due to market efficiency. Modelling a random walk is super easy; predicting one is a futile attempt (whether the EMH actually holds is a different discussion, and lots of smart people refuse to believe it).
In this article, I will be discussing data leakage, which is prevalent in a wide range of ML tasks. Data leakage can creep into your ML lifecycle and stay silent until you push the model to production and all hell breaks loose. But what is data leakage?
Wikipedia offers an apt definition of data leakage. I took the liberty of modifying it a little: data leakage is the use of information during model training and validation which would not be available at prediction stage. When it comes to timeseries, this lack of availability can take many forms, especially when the forecasting horizon is large. It can be due to the following reasons:
- The feature is simply not available in the future. During cross-validation this unavailability might go unnoticed because we cross-validate on datasets which are themselves composed of past observations; up until production, we always have access to those features. The major impact of this kind of data leakage is that it renders the hours spent on feature creation pointless.
- For the sake of argument, let’s say the features can be made available over the forecasting horizon by forecasting/anticipating the features themselves. In that case, there are two caveats: (a) reliable forecasts are very expensive, and (b) forecasts come with lots of uncertainty/error, so brace yourself for a sharp decline in the model’s performance in production.
A common example of data leakage might be using stock movement data to predict production volume. During the cross-validation stage, the stock movement data might show an excellent SHAP (Shapley) value, making it look like an important modelling feature. But when it comes to real-time prediction, its utility becomes very little, if not absolutely none.
In what follows, I will try to standardize some practices which can be used to avoid this trap. Some of the practices detailed here are already widely accepted among ML practitioners, and I mention those for the sake of completeness.
Split first, normalize later.
This practice is well known among ML practitioners. It is also relevant for timeseries analysis, especially when techniques involving neural networks (for example: AR-Net, NeuralProphet, RNN/LSTM) are applied, because neural networks are hard, if not impossible, to train on datasets which are not normalized.
This practice comprises the following steps:
- Split the dataset into training, validation and test set
- Work out the normalization factors using the training set only. Normalize the training set
- Using the normalization factors found in the previous step, normalize the validation and test sets
Had we worked out the normalizing factors using the entire dataset, we would have introduced data leakage during model training, simply because the normalizing factors would then contain statistics from the validation and test sets.
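To make this concrete, here is a minimal sketch of the split-first-normalize-later flow. The column name, the dummy data and the split fractions are illustrative assumptions, not taken from any particular project:

```python
import numpy as np
import pandas as pd

# Dummy daily series standing in for your own feature frame (purely illustrative).
idx = pd.date_range("2020-01-01", periods=1000, freq="D")
df = pd.DataFrame({"demand": np.random.default_rng(1).normal(size=1000).cumsum()}, index=idx)

# 1) Split chronologically (no shuffling) into training, validation and test sets.
n = len(df)
train = df.iloc[: int(0.7 * n)]
val = df.iloc[int(0.7 * n) : int(0.85 * n)]
test = df.iloc[int(0.85 * n) :]

# 2) Work out the normalization factors from the training window only.
mean, std = train.mean(), train.std()

# 3) Reuse those same factors, unchanged, on the validation and test windows.
train_norm = (train - mean) / std
val_norm = (val - mean) / std
test_norm = (test - mean) / std
```

Computing `mean` and `std` on `train` alone is the whole trick; the moment they are computed on the full frame, validation and test statistics leak back into training.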
One noteworthy caveat: if the process being modelled is not stationary, then carrying the normalization factors far into the future (validation + test) can have its own detrimental effects on model performance. That, however, is a different discussion.
Dive straight into the MVP, cross-validate later!
Sounds a little counter-intuitive but let me explain.
MVP stands for Minimum Viable Product. This is basically a stripped-down version of the ML system we are designing (for example: a recommendation engine or a prediction engine) that is capable of delivering an end-to-end solution in real time. It does not necessarily need to be deployed.
A usual ML experiment lifecycle has the following steps:
- Prepare the datasets (splitting + normalization + feature creation)
- Rapidly build a series of models
- Cross-validation followed by hyperparameter tuning
- Model selection / ensembling
- Create an MVP
Keep in mind that if there is data leakage, every performance metric you are monitoring is heavily inflated, and asymmetrically so across the models under consideration. So during model selection we might end up throwing away the most robust model based on a misleading performance metric, or, while ensembling, we might end up assigning inaccurate weights to the models, thus producing poor predictions in production.
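To see how dramatic the inflation can be, here is a toy, fully synthetic illustration (the names `driver` and `lag_1`, and the data-generating process, are my own assumptions). The `driver` column plays the role of the stock-movement feature above: it sits in the historical table, so cross-validation can use it freely, but it will not be observed over a real forecasting horizon:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 500
driver = rng.normal(size=n)                        # only recorded after the fact
y = 2.0 * driver + rng.normal(scale=0.5, size=n)   # the quantity we want to forecast

df = pd.DataFrame({"lag_1": pd.Series(y).shift(1), "driver": driver, "y": y}).dropna()
cv = TimeSeriesSplit(n_splits=5)

honest = cross_val_score(LinearRegression(), df[["lag_1"]], df["y"], cv=cv, scoring="r2")
leaky = cross_val_score(LinearRegression(), df[["lag_1", "driver"]], df["y"], cv=cv, scoring="r2")

print(f"honest feature set  R^2: {honest.mean():.2f}")  # close to 0
print(f"with leaky 'driver' R^2: {leaky.mean():.2f}")   # close to 1, yet unusable in production
```

Both numbers come out of a perfectly respectable time-ordered cross-validation; only the second one is a mirage.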
To avoid all of these pitfalls, let me reshuffle and redesign the experiment lifecycle a little:
- Prepare the datasets (splitting + normalization + feature creation)
- Train a simple model. Why not complex, sophisticated models? It will be clear in the next step.
- Create an MVP. There is no way of measuring the performance of the model yet, since the MVP is predicting in real time (for example, the next 7 days); model complexity is hence irrelevant. What the MVP does do is force every feature to be computed for dates that have not been observed yet, which is exactly where leaky features reveal themselves.
- Once the features introducing data leakage have been identified, proceed to further iterations (more complex models, model selection/ensembling)
This redesigned lifecycle will not only help you identify the features which might introduce data leakage, but also give you an idea of what types of features can actually be included.
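As a hedged sketch of why the MVP step surfaces leaky features, consider building the feature matrix for the real next 7 days. The toy history and the column names below are my own illustrative assumptions; the point is that any column which cannot be filled from observed history shows up as NaN immediately, long before production:

```python
import numpy as np
import pandas as pd

# Toy observed history (illustrative only).
history = pd.DataFrame(
    {"y": np.random.default_rng(0).normal(size=90).cumsum()},
    index=pd.date_range("2024-01-01", periods=90, freq="D"),
)

# The *real* forecasting horizon: the next 7 calendar days.
horizon = pd.date_range(history.index[-1] + pd.Timedelta(days=1), periods=7, freq="D")

features = pd.DataFrame(index=horizon)
features["dayofweek"] = horizon.dayofweek                                             # always known
features["lag_7"] = history["y"].reindex(horizon - pd.Timedelta(days=7)).to_numpy()  # fully observed
features["lag_1"] = history["y"].reindex(horizon - pd.Timedelta(days=1)).to_numpy()  # observed for day 1 only

# Columns with NaNs here cannot be fed to the model in production
# without forecasting them first: these are the would-be leaks.
print(features.isna().sum())
```

In a cross-validate-first workflow, `lag_1` would have looked perfectly fine, because historical data always contains yesterday's value; the MVP exposes the problem on day one.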
Do not decouple model training and prediction/inference
For many ML tasks, model training and prediction/inference can be completely decoupled. Take the proverbial cat classifier as an example: the processes of training the classifier and running the classification can be separated entirely, because it can safely be assumed that the availability of the features (an array of pixel values) required to make a prediction is not a function of time.
This is, however, not at all the case when it comes to timeseries. In timeseries analysis, modelling and prediction are intertwined through a temporal relationship: a feature can only contribute to inference if it is a direct or indirect function of time. Hence, every decision taken during model training needs to be backed up holistically by this idea. This is more a combination of thought process and domain expertise than a practice, and naturally it takes time to perfect.
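One way to keep training and inference coupled, sketched below under my own assumptions (a 7-step horizon and purely autoregressive lag features), is to let the forecasting horizon dictate which training-time features are even legal:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

HORIZON = 7  # we must forecast 7 steps ahead, so only lags >= 7 will be observable then

def make_supervised(y: pd.Series, lags=(7, 14, 28)) -> pd.DataFrame:
    """Build a design matrix in which every lag respects the forecasting horizon."""
    assert min(lags) >= HORIZON, "a lag shorter than the horizon leaks future data"
    frame = pd.DataFrame({f"lag_{k}": y.shift(k) for k in lags})
    frame["target"] = y
    return frame.dropna()

y = pd.Series(np.random.default_rng(2).normal(size=365).cumsum())
data = make_supervised(y)
model = LinearRegression().fit(data.drop(columns="target"), data["target"])

# At inference, the same lag definitions resolve to values that are already observed:
# to forecast HORIZON steps past the last observation, lag_k maps to y[last + HORIZON - k].
last = len(y) - 1
latest = pd.DataFrame([{f"lag_{k}": y.iloc[last + HORIZON - k] for k in (7, 14, 28)}])
forecast = model.predict(latest)
```

If the horizon later grows to, say, 14 steps, the assertion fails at training time rather than silently in production, which is the kind of coupling between training and inference this section argues for.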
I, for one, found the above practices tremendously helpful throughout my timeseries analysis journey. I hope there are some takeaways for you as well. Have a nice one. Cheers!