Hands-on Tutorials

Combining tree-based models with a linear baseline model to improve extrapolation

Writing your own sklearn functions, part 1

Sebastian Telsemeyer
Towards Data Science
7 min read · Oct 27, 2020

This post is a short intro to combining different machine learning models for practical purposes, in order to find a good balance between their advantages and disadvantages. In our case we will ensemble a random forest, a very powerful non-linear, non-parametric, tree-based all-rounder, with a classical linear regression model, a model that is very easy to interpret and can be verified against domain knowledge.

For many problems gradient boosting or random forests are the go-to model. They often outperform many other models as they are able to learn almost any linear or non-linear relationship. Nevertheless, one of the disadvantages of tree models is that they do not handle new data very well; they often extrapolate poorly (read more on this). For practical purposes this can lead to undesired behaviour, for example when predicting time, distances or cost, as will be outlined in a moment.

We can quickly verify that the sklearn implementation of random forest models can learn the identity very well for the provided range (0 to 50), but then fails miserably for values outside the training data range:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

X = np.arange(0, 100).reshape(-1, 1)
y = np.arange(0, 100)

# train on [0, 50]
model.fit(X[:50], y[:50]);
## RandomForestRegressor()

# predict for [0, 100]
plt.ylim(0, 100);
sns.lineplot(x=X.reshape(-1,), y=model.predict(X));
plt.show()
The tree model has trouble predicting out of sample — image by author.

In contrast to tree models, linear models handle new data differently: they can easily extrapolate predictions, but of course with the drawback of learning only linear relationships. Moreover, linear models allow an easy interpretation: simply by looking at the regression coefficients we can compare the model to domain knowledge and practically verify it.

For some regression problems it therefore makes sense to combine both approaches, which we can illustrate with an example: let's assume we want to estimate travel time with respect to travel distance for different vehicles, using historic data (a dataset containing the vehicle type, distances and travel times). We might consider using a linear or a non-linear model, depending on our expectations. If we aim to estimate travel time for bikes, an increase in distance leads to a proportional change in time (if you ride a bike twice as long you can probably travel twice the distance). If we train a linear model it will be a good fit, and the coefficient expresses the proportionality between distance and time.

But a drawback of a linear model is that it could have problems learning the relationship between time and distance for other vehicle types, like cars. If the travel distance by car increases we might be able to take highways, as opposed to shorter distances, so the increase in travel time might not be linearly proportional any more. In this case a non-linear model will perform better. However, if we train a non-linear tree model on the observed travelled car distances, it will perform poorly on new data, i.e. unseen distances: if our training data only contains car trips up to 100 kilometers, an estimate for 200 kilometers will be very bad, whereas the linear model will still provide a good, extrapolated estimate.

To tackle this we can support the tree model with a linear model in the following way: if the prediction of the tree model is too far away from the prediction of our baseline linear model, we default to the linear model prediction. For our example: if the non-linear model predicts 2 hours for a 200 kilometer car ride (because it has only seen car rides up to 2 hours), but the linear model predicts 3 hours, we default to the 3 hours. If the prediction of the non-linear model is not too far away (say less than 25%) from the prediction of the linear model, we stick with it (as we expect it to perform better in general).

Writing your own combined estimator

In sklearn we could implement this in the following way, by creating our own estimator. Be advised that this first implementation does not comply with the sklearn API standards yet; we will improve this version at the end of this post to match the interface.

We simply fit two estimators and predict with both. We can then compare the two predictions and adjust our final prediction accordingly: we take the tree model prediction if it is within a certain range of the linear prediction. Alternatively to comparing both models and picking the more reasonable prediction, we could also take the average of both predictions or blend them differently. Moreover, we could optimize the parameters upper and lower using grid search (for this we will need to implement some further methods like get_params and set_params).
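
A rough, minimal sketch of such an estimator (not the author's exact implementation; the attribute names base_regressor and backup_regressor as well as the default thresholds lower=0.75 and upper=1.25 are assumptions) could look like this:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

class CombinedRegressor:
    """Random forest regressor backed up by a linear baseline model."""

    def __init__(self, lower=0.75, upper=1.25):
        self.lower = lower
        self.upper = upper
        self.base_regressor = RandomForestRegressor()
        self.backup_regressor = LinearRegression()

    def fit(self, X, y):
        # fit both estimators on the same data
        self.base_regressor.fit(X, y)
        self.backup_regressor.fit(X, y)
        return self

    def predict(self, X):
        # predict with both estimators
        y_base = self.base_regressor.predict(X)
        y_backup = self.backup_regressor.predict(X)
        # keep the tree prediction while it stays within [lower, upper] times
        # the linear baseline prediction, otherwise default to the baseline
        within_range = (y_base >= self.lower * y_backup) & (y_base <= self.upper * y_backup)
        return np.where(within_range, y_base, y_backup)

    def score(self, X, y):
        # R^2 of the combined prediction, mirroring sklearn's default for regressors
        return r2_score(y, self.predict(X))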

Let us generate some sample data using the non-linear function f(x) = sqrt(x) + rnorm(0, 3) to evaluate the model:

import pandas as pd

def f(x):
    if isinstance(x, int):
        return np.sqrt(x) + np.random.normal(0, 3)
    else:
        return np.sqrt(x) + np.random.normal(0, 3, len(x))

def generate_data(n=100, x_max=100):
    x = np.random.uniform(0, x_max, n)
    return pd.DataFrame.from_records({'x': x}), f(x)

We can now train three different models on our small sample dataset: RandomForestRegressor, LinearRegression and our custom CombinedRegressor, collected in a dictionary called models.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

np.random.seed(100)
X, y = generate_data(n=100)

models = {'tree': RandomForestRegressor(),
          'linear': LinearRegression(),
          'combined': CombinedRegressor()}
for model_name, model in models.items():
    print(f'Training {model_name} model.')
    model.fit(X, y);
    print(f'Training score: {model.score(X, y)}')
    print(f'In-sample MAE: {mean_absolute_error(y, model.predict(X))} \n')
## Training tree model.
## RandomForestRegressor()
## Training score: 0.8620816087110539
## In-sample MAE: 1.081361379476639
##
## Training linear model.
## LinearRegression()
## Training score: 0.28917115492576073
## In-sample MAE: 2.586843328406717
##
## Training combined model.
## CombinedRegressor()
## Training score: 0.35418433030406293
## In-sample MAE: 2.3593815648352554

If we look at the plots (see below) of all three model predictions against the true values, we can quickly see the initially discussed problem: the tree model cannot really extrapolate.

As expected, the random forest model performs best in terms of in-sample mean absolute error. Now if we evaluate out-of-sample (an x-value of 150, as opposed to the maximum of 100 in the training data) things look different:

np.random.seed(101)

x = [150]
X_new, y_new = pd.DataFrame({'x': x}), f(x)

for model_name, model in models.items():
    y_pred = model.predict(X_new)
    print(f'Testing {model_name} model.')
    print(f'y_new: {y_new}, y_pred: {y_pred}')
    print(f'Test MAE: {mean_absolute_error(y_new, y_pred)} \n')
## Testing tree model.
## y_new: [20.36799823], y_pred: [7.81515835]
## Test MAE: 12.552839880512696
##
## Testing linear model.
## y_new: [20.36799823], y_pred: [13.39757247]
## Test MAE: 6.970425764867624
##
## Testing combined model.
## y_new: [20.36799823], y_pred: [13.39757247]
## Test MAE: 6.970425764867624

The tree model does a bad job due to its incapability to extrapolate, the linear model does a better job, and therefore so does the ensembled model that is backed up by the linear regression. At the same time the combined model still performs well in-sample, and it performs decently (better than the tree model alone, worse than the linear regression) if we increase the range drastically:

np.random.seed(102)
X_new, y_new = generate_data(n=100, x_max=200)

for model_name, model in models.items():
    y_pred = model.predict(X_new)
    print(f'Testing {model_name} model.')
    print(f'Test MAE: {mean_absolute_error(y_new, y_pred)} \n')
## Testing tree model.
## Test MAE: 3.67092130330585
##
## Testing linear model.
## Test MAE: 2.5770058863460985
##
## Testing combined model.
## Test MAE: 2.6143623109839984

We can also plot the individual predictions on some random data (both inside and outside the training sample) and quickly see the initially discussed problem with previously unseen x-values: the tree model flattens out. The linear model continues with the trend, and the combined version has an unpleasant bump but at least follows the trend once the tree model deviates too much.
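
A rough sketch of how such a comparison plot could be produced (the seed and sample size below are arbitrary choices, not taken from the original):

np.random.seed(103)
X_plot, y_plot = generate_data(n=200, x_max=200)
order = np.argsort(X_plot['x'].values)

plt.scatter(X_plot['x'], y_plot, s=10, alpha=0.3, label='true values')
for model_name, model in models.items():
    # sort by x so the prediction lines are drawn from left to right
    plt.plot(X_plot['x'].values[order], model.predict(X_plot)[order], label=model_name)
plt.axvline(100, linestyle='--', color='grey')  # end of the training range
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()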

Extrapolation behaviour of the different models — image by author.

Sklearn compatibility

If we want to achieve full sklearn compatibility (model selection, pipelines, etc.) and also use sklearn's onboard testing utilities, we have to make some modifications to the estimator:

  • we need to add setters and getters for the parameters (we use sklearn's convention of prefixing nested parameters with the sub-estimator name and two underscores, i.e. base_regressor__some_param)
  • consistent handling of the random state

This can be achieved in the following way:
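
A rough sketch of what such a compliant version could look like (again not the author's exact code, and the same naming assumptions as above apply): BaseEstimator derives get_params and set_params, including nested names like base_regressor__some_param, from the __init__ signature, provided __init__ only stores its arguments unchanged.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class CombinedRegressor(BaseEstimator, RegressorMixin):
    """Tree model backed up by a linear baseline, closer to the sklearn API."""

    def __init__(self, base_regressor=None, backup_regressor=None,
                 lower=0.75, upper=1.25, random_state=None):
        # only store the arguments; BaseEstimator handles get_params/set_params
        self.base_regressor = base_regressor
        self.backup_regressor = backup_regressor
        self.lower = lower
        self.upper = upper
        self.random_state = random_state

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        # clone the sub-estimators so repeated fits start from a clean state,
        # and pass the random state down to the forest for reproducibility
        self.base_regressor_ = clone(
            self.base_regressor if self.base_regressor is not None
            else RandomForestRegressor(random_state=self.random_state))
        self.backup_regressor_ = clone(
            self.backup_regressor if self.backup_regressor is not None
            else LinearRegression())
        self.base_regressor_.fit(X, y)
        self.backup_regressor_.fit(X, y)
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        y_base = self.base_regressor_.predict(X)
        y_backup = self.backup_regressor_.predict(X)
        within_range = (y_base >= self.lower * y_backup) & (y_base <= self.upper * y_backup)
        return np.where(within_range, y_base, y_backup)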

We can then run hyperparameter grid search on our custom estimator:
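
For example, with a purely illustrative parameter grid (the sub-estimators are passed in explicitly so that their nested parameters can be addressed):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    estimator=CombinedRegressor(base_regressor=RandomForestRegressor(random_state=100),
                                backup_regressor=LinearRegression()),
    param_grid={
        'base_regressor__n_estimators': [10, 50, 100],  # nested parameter
        'lower': [0.5, 0.75],
        'upper': [1.25, 1.5],
    },
    scoring='neg_mean_absolute_error',
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)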

We can also check our estimator using sklearn.utils.estimator_checks utilities:
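
For instance, the following runs sklearn's full suite of estimator checks against our class (depending on the sklearn version, not every check may pass out of the box):

from sklearn.utils.estimator_checks import check_estimator

check_estimator(CombinedRegressor())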

In some cases you might not be able to satisfy a specific validation; in that case you could, for example, mock the return value of that specific check to be true:

import mock
from sklearn.utils.estimator_checks import check_estimator

with mock.patch('sklearn.utils.estimator_checks.check_estimators_data_not_an_array',
                return_value=True):
    check_estimator(CombinedRegressor())

Wrap up

This approach can be very helpful if we have one model that works like a charm on a certain set of inputs, but where we lose confidence once the inputs get more exotic. In that case we can guarantee some prediction sanity by using a more transparent (e.g. linear) approach, which we can easily check against domain knowledge and which guarantees a specific behaviour (e.g. extrapolation, proportional increments).

Instead of using a combined estimator we could also use a non-linear model with better extrapolation behaviour, like a neural network. Another alternative would be to add some artificial samples to the training dataset to increase the supported response range.

Originally published at https://blog.telsemeyer.com.
