Simulating and Modelling Feature and Label Shifts

All trained models tend to become stale. It's a well-known truth that, after some time, most models lose accuracy. This is normal and due to temporal shifts that occur in the data flow. In particular, most applications that involve modeling human activity must be monitored and continually updated. For example, changes in needs or market trends may influence the purchasing power of customers. If we cannot account for changing customer habits, our predictions become untrustworthy over time.
Given a supervised trained model, we may encounter two opposite situations that affect future performance: a shift in the feature distributions or a shift in the target distribution. A shift in the feature distributions is harmful because our model makes predictions on data it has never seen before. Conversely, a shift in the label distribution is harmful because our model was trained to approximate a different ground truth.
In this post, we run an experiment to test the ability of some models to survive data shifts over time. We consider a linear model, a decision tree, and a linear tree. We want to see how different regime changes affect the accuracy of predictions on future test data. This lets us say more about the ability of the three algorithms to handle unavoidable data shifts.
Linear trees, as introduced in my previous post, are a special case of model trees. They build a tree on the data by evaluating the best splits while fitting linear models. The fitted model results in a tree-based structure with linear models in the leaves. In other words, it computes multiple linear regressions, partitioning the dataset according to simple decision rules. Linear trees are easy to implement using the linear-tree package. It can be used as a scikit-learn BaseEstimator to wrap any linear estimator from sklearn.linear_model and build an optimal tree structure.
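As a minimal sketch (assuming the linear-tree package is installed; the toy data and the chosen base estimator and depth are illustrative, not the configuration used in this post), a linear tree classifier can be built by wrapping a linear estimator from sklearn.linear_model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from lineartree import LinearTreeClassifier

# Toy data, only to show the API.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=42)

# Wrap a linear estimator in a tree structure: the tree searches the best
# splits and fits a linear model in each leaf.
clf = LinearTreeClassifier(base_estimator=RidgeClassifier(), max_depth=5)
clf.fit(X, y)
print(clf.score(X, y))
```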
EXPERIMENT SETUP
For our experiments, we use artificial data generated for a classification task. We partition the dataset into temporal blocks, which we use to train (including hyperparameter search) and evaluate the performance of our models.
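The exact generation process isn't reproduced in the post; as an illustrative sketch (block count, block size, feature counts, and seed are assumptions), the synthetic data and the temporal blocks could be produced like this:

```python
import numpy as np
from sklearn.datasets import make_classification

# Hypothetical setup: 10 equal-sized temporal blocks of synthetic data.
n_blocks, block_size = 10, 1000
X, y = make_classification(n_samples=n_blocks * block_size,
                           n_features=10, n_informative=5,
                           random_state=42)

# Assign each sample to a temporal block (0 = oldest, 9 = most recent).
block_id = np.repeat(np.arange(n_blocks), block_size)
```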

As introduced in the section above, we use three different models: a linear model, a decision tree, and a linear tree. We approximate the passage of time by changing the starting point of the test set. We consider 4 testing blocks, on which we evaluate each model type at different dates. This enables us to track how the models behave over time.
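Continuing the sketch above (reusing X, y, and block_id), a possible evaluation loop over the three model types and the 4 testing blocks might look like the following; the train/test block split and the hyperparameters are assumptions:

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from lineartree import LinearTreeClassifier

# Train on the earliest blocks, then score on the 4 "future" testing blocks.
train = block_id < 6
models = {
    "linear": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "linear_tree": LinearTreeClassifier(base_estimator=RidgeClassifier(),
                                        max_depth=5),
}

scores = {}
for name, model in models.items():
    model.fit(X[train], y[train])
    scores[name] = [
        accuracy_score(y[block_id == b], model.predict(X[block_id == b]))
        for b in range(6, 10)  # the 4 testing blocks at increasing dates
    ]
print(scores)
```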
We run the experiment in three different versions: no shifts, feature shifts, and label shifts.
NO SHIFTS
In a ‘no shifts’ context, we assume that the future data follow the same distribution as the past. This is the most ideal and the most unrealistic scenario. It's ideal because we train and test our model on similar data, resulting in stable results. It's unrealistic because we would have to be lucky to encounter such a situation. We live in a dynamic world where everything tends to change rapidly, degrading our model's performance.
However, we start from this situation to fit the models of interest. We perform training with hyperparameter tuning on the selected temporal blocks, as sketched below. The evaluations on the test sets are summarized in the plot that follows.
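An illustrative sketch of the tuning step (continuing the snippets above; the parameter grid and CV settings are assumptions, not the values used for the results):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative hyperparameter search on the training blocks only.
search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={"max_depth": [3, 5, 7, None],
                "min_samples_leaf": [1, 10, 50]},
    cv=3, scoring="accuracy",
)
search.fit(X[train], y[train])
best_tree = search.best_estimator_
```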

As expected, the accuracies of the models are constant over time. The decision tree and the linear tree reach the same scores.
FEATURE SHIFTS
In a ‘feature shifts’ context, some feature distributions may shift over time. We simulate a double shift, occurring in two different time periods. This is done simply by adding constant values to the features to change where their distributions are centered.
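Continuing the same sketch, a double feature shift could be simulated by adding constant offsets to a couple of feature columns from two later time points onward (the columns, offsets, and block indices are assumptions):

```python
# Re-center two feature columns in the later (test) periods.
X_shift = X.copy()
X_shift[block_id >= 7, 0] += 2.0   # first shift, from block 7 onward
X_shift[block_id >= 9, 3] += 3.0   # second shift, from block 9 onward
```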

Predicting on unseen values may result in wrong predictions. This is easy to imagine with tree-based algorithms, where the data may fall in unexplored split regions. Obviously, in this situation, we should expect a general decrease in performance compared to the ‘no shifts’ situation. Our models are not trained to deal with distribution shifts, so we register accuracy drops of at least 10 points. What we observe is that the linear tree survives better than the decision tree.

LABEL SHIFTS
In a ‘label shifts’ context, the target distribution may shift over time. We simulate this kind of scenario by maintaining the same feature generation process but changing the label balance in two different time periods.
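One way to sketch this (the class weights and sample sizes are assumptions) is to keep the same make_classification configuration but change its weights parameter, which controls the class balance, in the later periods:

```python
from sklearn.datasets import make_classification

# Same feature generation process, different class balance per period.
gen = dict(n_features=10, n_informative=5, random_state=42)
X_train_p, y_train_p = make_classification(n_samples=6000,
                                           weights=[0.5, 0.5], **gen)
X_test_p1, y_test_p1 = make_classification(n_samples=2000,
                                           weights=[0.7, 0.3], **gen)  # first shift
X_test_p2, y_test_p2 = make_classification(n_samples=2000,
                                           weights=[0.9, 0.1], **gen)  # second shift
```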

In this scenario, we use an algorithm trained on a given label distribution to predict data with a different label balance. It's normal to obtain sub-optimal results (especially for tree-based algorithms) because the model doesn't expect so many occurrences of a particular class. We register performance that is constant over time but worse than in the ‘no shifts’ case. As before, the linear tree survives better than the decision tree.

In all our experiments involving shifts, we observe a good survival behavior of linear trees. They suffer a smaller decrease in performance than standard decision trees and simple linear models. Their better generalization comes from combining linear approximations with tree-based splits of the data. This makes linear trees more accurate when data fall into unexplored splitting regions, which happens especially often in the presence of temporal data shifts.
SUMMARY
In this post, we simulated some cases of temporal data shift. Linear trees proved to be good models to test and validate when we expect temporal shifts. Although they are not widely adopted, they are simple yet powerful predictive algorithms and represent valuable alternatives to classical decision trees.
Keep in touch: Linkedin