
Model Trees: Handling Data Shifts by Mixing Linear Models and Decision Trees

Simulating and Modeling Feature and Label Shifts

Photo by Conor Luddy on Unsplash

All trained models are prone to becoming stale. It’s a well-known truth that, after some time, models lose their accuracy. This is normal and is due to temporal shifts that may occur in the data flow. In particular, most applications that involve modeling human activity must be monitored and continually updated. For example, changes in needs or market trends may influence the purchasing power of customers. If we are not able to account for changing customer habits, our predictions become untrustworthy over time.

Given a supervised trained model, we may encounter two distinct situations that affect future performance: a shift in the feature distributions or a shift in the target distribution. A shift in feature distributions is harmful because our model makes predictions on data it has never seen before. A shift in the label distribution is equally bad because our model was trained to approximate a different ground truth.

In this post, we run an experiment to test the ability of some models to survive data shifts over time. We consider a linear model, a decision tree, and a linear tree. We want to see how different regime changes affect the accuracy of predictions on future test data. This lets us reason about how well the three algorithms handle unavoidable data shifts.

Linear trees, as introduced in my previous post, are a special case of model trees. They build a tree on the data, evaluating the best splits by fitting linear models. The fitted model results in a tree-based structure with linear models in the leaves. In other words, it computes multiple linear regressions, partitioning the dataset according to simple decision rules. Linear trees can be built easily with the linear-tree package. It can be used as a scikit-learn BaseEstimator that wraps every linear estimator from sklearn.linear_model and builds an optimal tree structure.
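As a quick reference, here is a minimal sketch of fitting a linear tree with the linear-tree package; the class and parameter names reflect the package at the time of writing and may differ in newer releases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from lineartree import LinearTreeClassifier

# toy classification data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# decision-tree splits with a linear model fitted in each leaf
model = LinearTreeClassifier(
    base_estimator=LogisticRegression(),  # any linear estimator from sklearn.linear_model
    max_depth=5,                          # depth of the tree structure
)
model.fit(X, y)
print(model.score(X, y))
```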

EXPERIMENT SETUP

For our experiments, we use artificial data generated in a classification context. We partition the dataset into temporal blocks that we use for training (including hyperparameter search) and for evaluating the performance of our models.
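The exact data-generation code is available in the linked repository; the sketch below shows one possible way to build such a dataset, using scikit-learn's make_classification and an artificial 'year' column marking the temporal blocks (sample sizes, column names, and years are illustrative assumptions, not the original setup).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# artificial classification data spread over 8 consecutive "years"
X, y = make_classification(n_samples=8000, n_features=10,
                           n_informative=6, random_state=33)

data = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])
data["label"] = y
# assign each row to one of the temporal blocks (years 2014-2021)
data["year"] = np.repeat(np.arange(2014, 2022), len(data) // 8)
```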

Train-test temporal splits strategy (image by the author)

As introduced in the section above, we use three different models: a linear model, a decision tree, and a linear tree. We approximate the passage of time by changing the test starting point. We consider 4 testing blocks, on which we evaluate each model type at different dates. This enables us to track how the models behave over time.
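A rough sketch of this evaluation strategy, reusing the illustrative dataset built above (block boundaries, model settings, and variable names are assumptions, not the exact experiment configuration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from lineartree import LinearTreeClassifier

# train on the earliest blocks, then score on each of the 4 later test blocks
train = data[data["year"] <= 2017]
feature_cols = [c for c in data.columns if c.startswith("feat_")]

models = {
    "linear model": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=33),
    "linear tree": LinearTreeClassifier(LogisticRegression(max_iter=1000), max_depth=5),
}

scores = {}
for name, model in models.items():
    model.fit(train[feature_cols], train["label"])
    # the same fitted model is evaluated on every future test block
    scores[name] = [
        model.score(data.loc[data["year"] == year, feature_cols],
                    data.loc[data["year"] == year, "label"])
        for year in range(2018, 2022)
    ]
```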

We run the experiment in three different versions: no shifts, feature shifts, and label shifts.

NO SHIFTS

In a ‘no shifts’ context, we assume that we will see the same data distribution in the future. This is the most ideal, and also the most unrealistic, scenario. It’s ideal because we train and test our model on similar data, resulting in stable results. It’s unrealistic because we would have to be lucky to encounter a situation like this. We live in a dynamic world where everything is prone to change rapidly, degrading our model’s performance.

However, we start from this situation to fit the models of our interest. We perform training with hyperparameter tuning on the selected temporal blocks (a minimal tuning sketch follows). The evaluations on the test sets are summarized in the plot below.
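The tuning step can be reproduced with any standard search; below is a minimal sketch using scikit-learn's GridSearchCV on the training blocks from the sketch above (the grid values are placeholders, not the ones used for the article's results).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# example: tune the decision tree on the training blocks only,
# then reuse the best estimator in the temporal evaluation loop
search = GridSearchCV(
    DecisionTreeClassifier(random_state=33),
    param_grid={"max_depth": [3, 5, 7, None],
                "min_samples_leaf": [1, 10, 50]},
    cv=3,
    scoring="accuracy",
)
search.fit(train[feature_cols], train["label"])
best_tree = search.best_estimator_
```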

Performances with no shifts (image by the author)

As expected, the accuracies of the models are constant over time. The decision tree and the linear tree reach the same scores.

FEATURE SHIFTS

In a ‘feature shifts’ context, we may encounter a shift in some feature distributions over time. We simulate a double shift in two different time periods. This is done simply by adding constant values to change where the distributions are centered.
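One simple way to reproduce such a shift on the illustrative dataset (the affected features, offsets, and years are assumptions for demonstration purposes):

```python
shifted = data.copy()

# add constant offsets to two features in the later periods,
# moving the centre of their distributions
shifted.loc[shifted["year"] >= 2019, "feat_0"] += 2.0
shifted.loc[shifted["year"] >= 2020, "feat_1"] -= 3.0
```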

Feature distributions over years (image by the author)

Predicting on unseen feature values may result in wrong predictions. This is easy to imagine with tree-based algorithms, where data may fall into the least explored split regions. Obviously, in this situation, we have to expect a general decrease in performance compared to the ‘no shifts’ situation. Our models are not trained to deal with distribution shifts, so we register accuracy drops of at least 10 points. What we observe is that the linear tree survives better than the decision tree.

Performances with feature shifts (image by the author)

LABEL SHIFTS

In a ‘label shifts’ context, we may encounter a shift in the target distribution over time. We simulate this kind of scenario by maintaining the same feature generation process but changing the label balance in two different time periods.
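A rough sketch of this simulation, assuming the class balance is changed in the later blocks by resampling the illustrative dataset (the target shares and years are arbitrary choices):

```python
import pandas as pd

def rebalance(block, positive_share, seed=33):
    """Resample a temporal block so that `positive_share` of its rows have label 1."""
    pos = block[block["label"] == 1]
    neg = block[block["label"] == 0]
    n_pos = int(len(block) * positive_share)
    return pd.concat([
        pos.sample(n_pos, replace=True, random_state=seed),
        neg.sample(len(block) - n_pos, replace=True, random_state=seed),
    ])

# same feature-generation process, but the label balance changes twice
label_shifted = pd.concat([
    data[data["year"] < 2019],
    rebalance(data[data["year"] == 2019], positive_share=0.7),
    rebalance(data[data["year"] == 2020], positive_share=0.7),
    rebalance(data[data["year"] == 2021], positive_share=0.85),
])
```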

Label distribution over years (image by the author)

In this scenario, we are using an algorithm trained on a given label distribution to predict data with a different label balance. It’s normal to reach sub-optimal results (especially with tree-based algorithms) because the model doesn’t expect so many occurrences of a particular class. We register performances that are constant over time but worse than in the ‘no shifts’ case. As before, the linear tree survives better than the decision tree.

Performances with label shifts (image by the author)

In all our experiments involving shifts, we register a good survival behavior of linear trees. They suffer smaller performance drops than standard decision trees and simple linear models. Their better generalization is the result of combining linear approximations with tree-based splits of the data. This makes linear trees more accurate when data fall into unexplored splitting regions, a situation particularly stressed by temporal data shifts.

SUMMARY

In this post, we simulated some cases of temporal data shifts. Linear trees proved to be good models to test and validate when we expect temporal shifts. Although they are not widely adopted, they are simple yet powerful predictive algorithms and represent a valuable alternative to classical decision trees.


CHECK MY GITHUB REPO

Keep in touch: LinkedIn

