Tree-Boosted Mixed Effects Models

GPBoost: Combining Tree-Boosting and Mixed Effects Models

Fabio Sigrist
Towards Data Science


This article shows how tree-boosting (sometimes also referred to as “gradient tree-boosting”) can be combined with mixed effects models using the GPBoost algorithm. It provides background on the methodology and on how to apply the GPBoost library in Python. We show how (i) models are trained, (ii) tuning parameters are chosen, (iii) models are interpreted, and (iv) predictions are made. Further, we compare the approach to several alternatives.

Introduction

Tree-boosting, with its well-known implementations such as XGBoost, LightGBM, and CatBoost, is widely used in applied data science. Besides state-of-the-art predictive accuracy, tree-boosting has the following advantages:

  • Automatic modeling of non-linearities, discontinuities, and complex high-order interactions
  • Robustness to outliers in and multicollinearity among predictor variables
  • Scale-invariance to monotone transformations of the predictor variables
  • Automatic handling of missing values in predictor variables

Mixed effects models are a modeling approach for clustered, grouped, longitudinal, or panel data. Among other things, they have the advantage that they allow for more efficient learning of the chosen model for the regression function (e.g. a linear model or a tree ensemble).

As outlined in Sigrist (2020), combining gradient tree-boosting with mixed effects models often performs better than (i) plain vanilla gradient boosting, (ii) standard linear mixed effects models, and (iii) alternative approaches for combining machine learning or statistical models with mixed effects models.

Modeling grouped data

Grouped data (aka clustered data, longitudinal data, panel data) arises naturally in many applications in which a variable of interest is measured repeatedly for different units. Examples include:

  • One wants to investigate the impact of some factors (e.g. learning technique, nutrition, sleep, etc.) on students’ test scores and every student does several tests. In this case, the units, i.e. the grouping variable, are the students and the variable of interest is the test score.
  • A company gathers transaction data about its customers. For every customer, there are several transactions. The units are then the customers and the variable of interest can be any attribute of the transactions such as prices.

Basically, such grouped data can be modeled using four different approaches:

  1. Ignore the grouping structure. This is rarely a good idea since important information is neglected.
  2. Model each group (i.e. each student or each customer) separately. This is also rarely a good idea as the number of measurements per group is often small relative to the number of different groups.
  3. Include the grouping variable (e.g. student or customer ID) in your model of choice and treat it as a categorical variable. While this is a viable approach, it has the following disadvantages. Often, the number of measurements per group (e.g. number of tests per student, number of transactions per customer) is relatively small and the number of different groups is large (e.g. number of students, customers, etc.). In this case, the model needs to learn many parameters (one for every group) based on relatively little data, which can make learning inefficient. Further, high-cardinality categorical variables can be problematic for trees.
  4. Model the grouping variable using so-called random effects in a mixed effects model. This is often a sensible compromise between approaches 2 and 3 above. In particular, as illustrated in the code sketch below and in Sigrist (2020), this is beneficial compared to the other approaches in the case of tree-boosting.
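
As a minimal illustration of the difference between approaches 3 and 4, the following sketch contrasts the two in code. It uses a small, hypothetical data set (the names df, x1, x2, group, and y are made up for this example); the GPBoost library itself is introduced in detail further below, and a full comparison on simulated data is given in the appendix.

import gpboost as gpb
import numpy as np
import pandas as pd

# Hypothetical toy data: two features, a group ID, and a response
rng = np.random.default_rng(0)
n, m = 1000, 100
df = pd.DataFrame({'x1': rng.uniform(size=n), 'x2': rng.uniform(size=n),
                   'group': rng.integers(0, m, size=n)})
df['y'] = np.sin(4 * df['x1']) + rng.normal(size=m)[df['group']] + 0.1 * rng.normal(size=n)
params = {'objective': 'regression_l2', 'learning_rate': 0.1, 'verbose': 0}

# Approach 3: treat the group ID as a (high-cardinality) categorical feature
data_cat = gpb.Dataset(df[['x1', 'x2', 'group']], df['y'], categorical_feature=['group'])
bst_cat = gpb.train(params=params, train_set=data_cat, num_boost_round=100)

# Approach 4: model the group ID as a grouped random effect
gp_model = gpb.GPModel(group_data=np.asarray(df['group']))
data_re = gpb.Dataset(df[['x1', 'x2']], df['y'])
bst_re = gpb.train(params=params, train_set=data_re, gp_model=gp_model, num_boost_round=100)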

Methodological background

For the GPBoost algorithm, it is assumed that the response variable y is the sum of a potentially non-linear mean function F(X), so-called random effects Zb, and an error term e:

y = F(X) + Zb + e

where

  • y is the response variable (aka label)
  • X contains the predictor variables (aka features) and F() is a potentially non-linear function. In linear mixed effects models, this is simply a linear function. In the GPBoost algorithm, this is an ensemble of trees.
  • Zb are the random effects which are assumed to follow a multivariate normal distribution
  • e is an error term
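
For a single grouping variable, Z is simply an incidence matrix that assigns to every observation the effect of its group, i.e. Zb equals b[group] in array notation. A minimal numpy sketch (with made-up toy values) illustrating this:

import numpy as np

np.random.seed(0)
n, m = 8, 3                                  # 8 observations, 3 groups
group = np.array([0, 0, 1, 1, 1, 2, 2, 0])   # group label of each observation
b = np.random.normal(size=m)                 # one random effect per group

# Z is the n x m incidence matrix with Z[i, j] = 1 if observation i belongs to group j
Z = np.zeros((n, m))
Z[np.arange(n), group] = 1.0

# For grouped random effects, Zb simply assigns each observation its group's effect
assert np.allclose(Z @ b, b[group])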

The model is trained using the GPBoost algorithm, where training means learning the (co-)variance parameters (aka hyper-parameters) of the random effects and the regression function F(X) using a tree ensemble. The random effects Zb can be estimated (or predicted, as it is often called) after the model has been learned. In brief, the GPBoost algorithm is a boosting algorithm that iteratively learns the (co-)variance parameters and adds a tree to the ensemble of trees using a gradient and/or a Newton boosting step. The main difference from existing boosting algorithms is that, first, it accounts for dependency among the data due to clustering and, second, it learns the (co-)variance parameters of the random effects. See Sigrist (2020) for more details on the methodology. In the GPBoost library, (co-)variance parameters can be learned using (accelerated) gradient descent or Fisher scoring, and trees are learned using the LightGBM library. In particular, this means that the full functionality of LightGBM is available.
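
To make this alternating structure concrete, here is a heavily simplified, conceptual sketch for the Gaussian case with a single random intercept, written with plain numpy and a scikit-learn tree. This is not the gpboost implementation (which uses LightGBM trees, Newton steps, Fisher scoring, and many refinements); it only illustrates the two updates per iteration described above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Conceptual sketch only, NOT the gpboost library's implementation
np.random.seed(0)
n, m = 500, 50
group = np.random.randint(0, m, size=n)
Z = np.zeros((n, m))
Z[np.arange(n), group] = 1.0
X = np.random.uniform(size=(n, 2))
y = np.sin(4 * X[:, 0]) + np.random.normal(size=m)[group] + 0.5 * np.random.normal(size=n)

F = np.full(n, y.mean())         # current ensemble prediction F(X)
sigma2, sigma2_1 = 1.0, 1.0      # error variance and random effect variance
lr_trees, lr_cov = 0.1, 0.001    # learning rates

for it in range(100):
    Sigma = sigma2 * np.eye(n) + sigma2_1 * (Z @ Z.T)   # covariance of y - F(X)
    Sigma_inv = np.linalg.inv(Sigma)
    alpha = Sigma_inv @ (y - F)
    # (1) one gradient descent step for the (co-)variance parameters on the
    #     negative log marginal likelihood (on the log scale to keep them positive)
    grad_s2 = 0.5 * (np.trace(Sigma_inv) - alpha @ alpha)
    grad_s21 = 0.5 * (np.trace(Sigma_inv @ Z @ Z.T) - alpha @ (Z @ Z.T) @ alpha)
    sigma2 = np.exp(np.log(sigma2) - lr_cov * sigma2 * grad_s2)
    sigma2_1 = np.exp(np.log(sigma2_1) - lr_cov * sigma2_1 * grad_s21)
    # (2) gradient boosting step: fit a regression tree to the negative gradient
    #     of the negative log marginal likelihood with respect to F
    tree = DecisionTreeRegressor(max_depth=3).fit(X, Sigma_inv @ (y - F))
    F = F + lr_trees * tree.predict(X)

print(sigma2, sigma2_1)  # estimated variances (data-generating values: 0.25 and 1)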

How to use the GPBoost library in Python

In the following, we show how combined tree-boosting and mixed effects models can be applied using the GPBoost library from Python. The complete code used in this article can be found here as a Python script. Note that there is also an equivalent R package. More information on this can be found here.

Installation

pip install gpboost -U

Simulate data

We use simulated data here. We adopt a well-known non-linear function for F(X). For simplicity, we use one grouping variable, but one could equally well use several random effects, including hierarchically nested ones, crossed ones, or random slopes. The number of samples is 5,000 and the number of different groups, or clusters, is 500. We also generate test data for evaluating predictive accuracy. For the test data, we include both known, observed groups as well as novel, unobserved groups.

import gpboost as gpb
import numpy as np
import sklearn.datasets as datasets
import time
import pandas as pd
# Simulate data
ntrain = 5000 # number of samples for training
n = 2 * ntrain # combined number of training and test data
m = 500 # number of categories / levels for grouping variable
sigma2_1 = 1 # random effect variance
sigma2 = 1 ** 2 # error variance
# Simulate non-linear mean function
np.random.seed(1)
X, F = datasets.make_friedman3(n_samples=n)
X = pd.DataFrame(X,columns=['variable_1','variable_2','variable_3','variable_4'])
F = F * 10**0.5 # with this choice, the fixed-effects regression function has the same variance as the random effects
# Simulate random effects
group_train = np.arange(ntrain) # grouping variable
for i in range(m):
    group_train[int(i * ntrain / m):int((i + 1) * ntrain / m)] = i
group_test = np.arange(ntrain) # grouping variable for test data. Some existing and some new groups
m_test = 2 * m
for i in range(m_test):
    group_test[int(i * ntrain / m_test):int((i + 1) * ntrain / m_test)] = i
group = np.concatenate((group_train,group_test))
b = np.sqrt(sigma2_1) * np.random.normal(size=m_test) # simulate random effects
Zb = b[group]
# Put everything together
xi = np.sqrt(sigma2) * np.random.normal(size=n) # simulate error term
y = F + Zb + xi # observed data
# split train and test data
y_train = y[0:ntrain]
y_test = y[ntrain:n]
X_train = X.iloc[0:ntrain,]
X_test = X.iloc[ntrain:n,]
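
As a quick, optional sanity check (not part of the original workflow), one can verify that the simulated components have the intended magnitudes:

# Optional check: the fixed effects part F, the random effects part Zb, and the
# error term xi should all have comparable variance (see the scaling of F above)
print("Var(F) = %.2f, Var(Zb) = %.2f, Var(error) = %.2f"
      % (np.var(F), np.var(Zb), np.var(xi)))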

Learning and making predictions

The following code shows how one trains a model and makes predictions. As can be seen below, the learned variance parameters are close to the true ones. Note that when making predictions, one can make separate predictions for the mean function F(X) and the random effects Zb.

# Define and train GPModel
gp_model = gpb.GPModel(group_data=group_train)
# create dataset for gpb.train function
data_train = gpb.Dataset(X_train, y_train)
# specify tree-boosting parameters as a dict
params = { 'objective': 'regression_l2', 'learning_rate': 0.1,
'max_depth': 6, 'min_data_in_leaf': 5, 'verbose': 0 }
# train model
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model, num_boost_round=31)
gp_model.summary() # estimated covariance parameters
# Covariance parameters:
#          Error_term   Group_1
# Param.      0.92534  1.016069

# Make predictions
pred = bst.predict(data=X_test, group_data_pred=group_test)
y_pred = pred['response_mean']
np.sqrt(np.mean((y_test - y_pred) ** 2)) # root mean square error (RMSE) on test data. Approx. = 1.26
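
The two model components can also be predicted separately. The following sketch assumes a recent gpboost version in which the predict function has a pred_latent argument returning the latent components; older versions use a slightly different interface, so treat the exact argument and key names as assumptions and consult the documentation.

# Separate predictions for the mean function F(X) and the random effects Zb
# (assumes a gpboost version with the 'pred_latent' argument)
pred_latent = bst.predict(data=X_test, group_data_pred=group_test, pred_latent=True)
F_pred = pred_latent['fixed_effect']          # tree-ensemble part F(X)
Zb_pred = pred_latent['random_effect_mean']   # predicted random effects
# For this Gaussian model, their sum corresponds to the response mean used above
np.sqrt(np.mean((y_test - (F_pred + Zb_pred)) ** 2))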

Parameter tuning

A careful choice of tuning parameters is important for all boosting algorithms. Arguably the most important tuning parameter is the number of boosting iterations: too large a number often results in over-fitting in regression problems, and too small a value in under-fitting. In the following, we show how the number of boosting iterations can be chosen using cross-validation. Other important tuning parameters include the learning rate, the tree depth, the minimal number of samples per leaf, and an L2 penalty on leaf values.

# Choosing number of boosting iterations using cross-validation
gp_model = gpb.GPModel(group_data=group_train)
cvbst = gpb.cv(params=params, train_set=data_train,
gp_model=gp_model, use_gp_model_for_validation=False,
num_boost_round=100, early_stopping_rounds=5,
nfold=4, verbose_eval=True, show_stdv=False, seed=1,
metric="l2")
# 'metric' was called 'metrics' in earlier versions of gpboost
best_iter = np.argmin(cvbst['l2-mean'])
print("Best number of iterations: " + str(best_iter))
# Best number of iterations: 31

Update: as of version 0.4.3, GPBoost now has a function grid_search_tune_parameters which can be used for parameter tuning. See this demo (section “Choosing tuning parameters”) for more details.
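
A sketch of how this function can be used is shown below; the argument names follow the linked demo and may differ slightly across versions.

# Sketch: (random or full) grid search over tuning parameters with gpboost
param_grid = {'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [3, 6, 10],
              'min_data_in_leaf': [5, 20, 50]}
gp_model = gpb.GPModel(group_data=group_train)
opt_params = gpb.grid_search_tune_parameters(
    param_grid=param_grid,
    params={'objective': 'regression_l2', 'verbose': 0},
    num_try_random=None,  # None = full grid search; an integer tries that many random combinations
    nfold=4, gp_model=gp_model, train_set=data_train,
    num_boost_round=200, early_stopping_rounds=5,
    seed=1, metric='l2')
print(opt_params)  # best parameter combination, number of boosting iterations, and score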

Feature importance and partial dependence plots

Feature importance plots and partial dependence plots are tools for interpreting machine learning models. These can be used as follows.

# Plotting feature importances
gpb.plot_importance(bst)
Feature importance plot

Univariate partial dependence plots. Note (June 2023): the following code runs with pdpbox version 0.2.1 or earlier; for newer versions of pdpbox, see this Python script.

from pdpbox import pdp
# Single variable plots (takes a few seconds to compute)
pdp_dist = pdp.pdp_isolate(model=bst, dataset=X_train,
model_features=X_train.columns,
feature='variable_2',
num_grid_points=50,
predict_kwds={"ignore_gp_model": True})
pdp.pdp_plot(pdp_dist, 'variable_2', plot_lines=True)

Multivariate partial dependence plots

# Two variable interaction plot
inter_rf = pdp.pdp_interact(model=bst, dataset=X_train,
model_features=X_train.columns,
features=['variable_1','variable_2'],
predict_kwds={"ignore_gp_model": True})
pdp.pdp_interact_plot(inter_rf, ['variable_1','variable_2'], x_quantile=True, plot_type='contour', plot_pdp=True)
# ignore any error message
Two dimensional partial dependence plot for visualizing interactions

SHAP values

SHAP values and dependence plots are another important tool for model interpretation. These can be created as follows. Note: shap version 0.36.0 or later is required.

import shap
shap_values = shap.TreeExplainer(bst).shap_values(X_test)
shap.summary_plot(shap_values, X_test)
shap.dependence_plot("variable_2", shap_values, X_test)
SHAP values
SHAP dependence plot for variable 2

Comparison to alternative approaches

In the following, we compare the GPBoost algorithm to several existing approaches using the above simulated data. We consider the following alternative approaches:

  • A linear mixed effects model (‘Linear_ME’) where F(X) is a linear function
  • Standard gradient tree-boosting ignoring the grouping structure (‘Boosting_Ign’)
  • Standard gradient tree-boosting including the grouping variable as a categorical variable (‘Boosting_Cat’)
  • Mixed-effects random forest (‘MERF’) (see here and Hajjem et al. (2014) for more information)

We compare the algorithms in terms of predictive accuracy measured using the root mean square error (RMSE) and computational time (clock time in seconds). The results are shown in the table below. The code for producing these results can be found below in the appendix.

Comparison of GPBoost and alternative approaches.

We see that GPBoost and MERF perform best (and almost equally well) in terms of predictive accuracy. Further, the GPBoost algorithm is approximately 1000 times faster than the MERF algorithm. The linear mixed effects model (‘Linear_ME’) and tree-boosting ignoring the grouping variable (‘Boosting_Ign’) have clearly lower predictive accuracy. Tree-boosting with the grouping variable included as a categorical variable (‘Boosting_Cat’) also shows lower predictive accuracy than GPBoost or MERF. Note that in this example, the test data contains both existing groups, which have already been observed in the training data, and novel groups not observed in the training data (50% each). If the test data consists only of existing groups, the differences between Boosting_Cat and GPBoost / MERF are larger; see the experiments in Sigrist (2020).

Note that, for simplicity, we do only one simulation run (see Sigrist (2020) for a much more detailed comparison). Except for MERF, all computations are done using the gpboost Python package version 0.7.6. Further, we use the MERF Python package version 0.3.

Conclusion

GPBoost allows for combining mixed effects models and tree-boosting. If you apply linear mixed effects models, you should investigate whether the linearity assumption is indeed appropriate. The GPBoost model allows for relaxing this assumption. It may help you to find non-linearities and interactions and to achieve higher predictive accuracy. If you are a frequent user of boosting algorithms such as XGBoost and LightGBM and you have potentially high-cardinality categorical variables, GPBoost (which extends LightGBM) can make learning more efficient and result in higher predictive accuracy.

Note that the GPBoost library also supports other types of random effects such as Gaussian processes in addition to grouped or clustered random effects. More information on GPBoost can be found in the companion article Sigrist (2020) and on GitHub.

References

Hajjem, A., Bellavance, F., & Larocque, D. (2014). Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6), 1313–1328.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., … & Liu, T. Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems (pp. 3146–3154).

Pinheiro, J., & Bates, D. (2006). Mixed-effects models in S and S-PLUS. Springer Science & Business Media.

Sigrist, F. (2020). Gaussian Process Boosting. arXiv preprint arXiv:2004.02653.

Appendix

Code for comparison of alternative approaches

results = pd.DataFrame(columns = ["RMSE","Time"],
index = ["GPBoost", "Linear_ME","Boosting_Ign","Boosting_Cat","MERF"])
# 1. GPBoost
gp_model = gpb.GPModel(group_data=group_train)
start_time = time.time() # measure time
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model, num_boost_round=best_iter)
results.loc["GPBoost","Time"] = time.time() - start_time
pred = bst.predict(data=X_test, group_data_pred=group_test)
y_pred = pred['response_mean']
results.loc["GPBoost","RMSE"] = np.sqrt(np.mean((y_test - y_pred) ** 2))

# 2. Linear mixed effects model ('Linear_ME')
gp_model = gpb.GPModel(group_data=group_train)
X_train_linear = np.column_stack((np.ones(ntrain),X_train))
X_test_linear = np.column_stack((np.ones(ntrain),X_test))
start_time = time.time() # measure time
gp_model.fit(y=y_train, X=X_train_linear) # add a column of 1's for intercept
results.loc["Linear_ME","Time"] = time.time() - start_time
y_pred = gp_model.predict(group_data_pred=group_test, X_pred=X_test_linear)
results.loc["Linear_ME","RMSE"] = np.sqrt(np.mean((y_test - y_pred['mu']) ** 2))

# 3. Gradient tree-boosting ignoring the grouping variable ('Boosting_Ign')
cvbst = gpb.cv(params=params, train_set=data_train,
num_boost_round=100, early_stopping_rounds=5,
nfold=4, verbose_eval=True, show_stdv=False, seed=1)
best_iter = np.argmin(cvbst['l2-mean'])
print("Best number of iterations: " + str(best_iter))
start_time = time.time() # measure time
bst = gpb.train(params=params, train_set=data_train, num_boost_round=best_iter)
results.loc["Boosting_Ign","Time"] = time.time() - start_time
y_pred = bst.predict(data=X_test)
results.loc["Boosting_Ign","RMSE"] = np.sqrt(np.mean((y_test - y_pred) ** 2))

# 4. Gradient tree-boosting including the grouping variable as a categorical variable ('Boosting_Cat')
X_train_cat = np.column_stack((group_train,X_train))
X_test_cat = np.column_stack((group_test,X_test))
data_train_cat = gpb.Dataset(X_train_cat, y_train, categorical_feature=[0])
cvbst = gpb.cv(params=params, train_set=data_train_cat,
num_boost_round=1000, early_stopping_rounds=5,
nfold=4, verbose_eval=True, show_stdv=False, seed=1)
best_iter = np.argmin(cvbst['l2-mean'])
print("Best number of iterations: " + str(best_iter))
start_time = time.time() # measure time
bst = gpb.train(params=params, train_set=data_train_cat, num_boost_round=best_iter)
results.loc["Boosting_Cat","Time"] = time.time() - start_time
y_pred = bst.predict(data=X_test_cat)
results.loc["Boosting_Cat","RMSE"] = np.sqrt(np.mean((y_test - y_pred) ** 2))

# 5. Mixed-effects random forest ('MERF')
from merf import MERF
rf_params={'max_depth': 6, 'n_estimators': 300}
merf_model = MERF(max_iterations=100, rf_params=rf_params)
print("Warning: the following takes a lot of time")
start_time = time.time() # measure time
merf_model.fit(pd.DataFrame(X_train), np.ones(shape=(ntrain,1)), pd.Series(group_train), y_train)
results.loc["MERF","Time"] = time.time() - start_time
y_pred = merf_model.predict(pd.DataFrame(X_test), np.ones(shape=(ntrain,1)), pd.Series(group_test))
results.loc["MERF","RMSE"] = np.sqrt(np.mean((y_test - y_pred) ** 2))

print(results.apply(pd.to_numeric).round(3))


Fabio Sigrist is Professor of Applied Statistics and Data Science at Lucerne University of Applied Sciences and Arts.