
In Part I and [Part II](https://medium.com/towards-data-science/mixed-effects-machine-learning-for-high-cardinality-categorical-variables-part-ii-gpboost-3bdd9ef74492) of this series, we showed how random effects can be used for modeling high-cardinality categorical variables in machine learning models, and we gave an introduction to the GPBoost library, which implements the GPBoost algorithm combining tree-boosting with random effects. In this article, we demonstrate how the Python and R packages of the GPBoost library can be used for longitudinal data (also known as repeated measures or panel data). You might want to read Part II of this series first, as it gives an introduction to the GPBoost library. GPBoost version 1.2.1 is used in this demo.
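If GPBoost is not yet installed, the Python package can be installed with pip install gpboost and the R package with install.packages("gpboost"). The following minimal Python sketch checks the installed version; it assumes the package exposes the usual __version__ attribute.
Python
# Quick check that GPBoost is available and recent enough for this demo
import gpboost as gpb
print(gpb.__version__)  # expected: 1.2.1 or newer (assumes the standard __version__ attribute)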
Table of contents
∘ 1 Data: description, loading, and sample split
∘ 2 Modeling options for longitudinal data in GPBoost
· · 2.1 Subject grouped random effects
· · 2.2 Fixed effects only
· · 2.3 Subject and time grouped random effects
· · 2.4 Subject random effects with temporal random slopes
· · 2.5 Subject-specific AR(1) / Gaussian process models
· · 2.6 Subject grouped random effects and a joint AR(1) model
∘ 3 Training a GPBoost model
∘ 4 Choosing tuning parameters
∘ 5 Prediction
∘ 6 Conclusion and references
1 Data: description, loading, and sample split
The data used in this demo is the wages data which was already used in Part II; it is loaded directly from its source in the code below. The data set contains a total of 28’013 samples for 4’711 persons for which data was measured over several years. Such data is called longitudinal data, or panel data, since for every subject (person ID = idcode), data was collected repeatedly over time (years = t). In other words, the samples for every level of the categorical variable idcode are repeated measurements over time. The response variable is the logarithmic real wage (ln_wage), and the data includes several predictor variables such as age, total work experience, job tenure in years, date, and others.
In the code below, we load the data and randomly partition it into 80% training data and 20% test data.
Python
import gpboost as gpb
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data = data.assign(t_sq = data['t']**2)  # Add t^2
# Partition into training and test data
n = data.shape[0]
np.random.seed(n)
permute_aux = np.random.permutation(n)
train_idx = permute_aux[0:int(0.8 * n)]
test_idx = permute_aux[int(0.8 * n):n]
data_train = data.iloc[train_idx]
data_test = data.iloc[test_idx]
# Define fixed effects predictor variables
pred_vars = [col for col in data.columns if col not in ['ln_wage', 'idcode', 't', 't_sq']]
R
library(gpboost)
library(tidyverse)
# Load data
data <- read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data[,"t_sq"] <- data[,"t"]^2 # Add t^2
data <- as.matrix(data) # Convert to matrix since the boosting part does not support data.frames
# Partition into training and test data
n <- dim(data)[1]
set.seed(n)
permute_aux <- sample.int(n, n)
train_idx <- permute_aux[1:as.integer(0.8*n)]
test_idx <- permute_aux[(as.integer(0.8*n)+1):n]
data_train <- data[train_idx,]
data_test <- data[test_idx,]
# Define fixed effects predictor variables
pred_vars <- colnames(data)[-which(colnames(data) %in% c("ln_wage", "idcode", "t", "t_sq"))]
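As a quick sanity check of the panel structure described above, one can count the number of persons and the number of repeated measurements per person. This is an optional Python sketch using the data frame loaded in the Python code above.
Python
# Optional check of the longitudinal (panel) structure
print("Number of persons:", data['idcode'].nunique())
print("Number of samples:", data.shape[0])
print(data.groupby('idcode').size().describe())  # distribution of observations per person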
2 Modeling options for longitudinal data in GPBoost
Various combinations of fixed and random effects models for longitudinal data can be handled with the GPBoost library. In the following, we demonstrate several of them; the list is not exhaustive, and other models are possible. After defining a random effects model (gp_model) and a Dataset (data_bst), training can be done with the gpb.train function. This training step is identical for all models presented below. For this reason, we do not include it here but only show how to define the corresponding models. See Section 3 for how training is done.
2.1 Subject grouped random effects
A first option is to use grouped random effects for the categorical subject variable (idcode) and fixed effects for the time variable. This corresponds to the following model:

$$y_{ij} = F(x_{ij}) + b_i + \epsilon_{ij},$$
where i is the person index (= idcode), j is the temporal index, and we assume that xij contains the time variable (or a "dummified" / one-hot encoded version of it) and other predictor variables. The idcode random effect for person number i is bi, F(xij) is the tree-ensemble fixed effects function, and εij is an independent error term. Such a model can be specified in GPBoost as shown below. Subject-specific random effects are defined by passing the idcode categorical variable to the group_data argument of the GPModel() constructor. Note that the time variable t is of very low cardinality, and we can thus use one-hot encoding or dummy variables. This is what we do below: pred_vars, which defines the fixed effects predictor variables, already contains year dummy variables. (If such dummy variables were not already in the data, they could be created as sketched after the code below.)
Python
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
R
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
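As an aside, if the year dummy variables were not already contained in the data, they could be created with one-hot encoding as in the following hypothetical Python sketch (the generated column names are illustrative and not those of the wages data set).
Python
# Hypothetical sketch: one-hot encode the low-cardinality time variable t
# (not needed here, since pred_vars already contains year dummy variables)
year_dummies = pd.get_dummies(data_train['t'], prefix='year')  # one column per year
X_train_with_dummies = pd.concat([data_train[pred_vars], year_dummies], axis=1)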
2.2 Fixed effects only
Another option is to use a fixed effects-only model in which both subjects (idcode) and time are modeled using fixed effects, and there are no random effects. This corresponds to the following model:

$$y_{ij} = F(x_{ij}) + \epsilon_{ij},$$
where, again for notational simplicity, we assume that xij contains the time variable and, in this case, also the subject variable. Such a model can be specified in GPBoost by simply including the subject (idcode) and time variables in the tree-boosting predictor variables and having no random effects model. Training then works like standard tree-boosting without a gp_model argument; see the sketch after the code below.
Python
pred_vars_FE = ['idcode'] + pred_vars
data_bst = gpb.Dataset(data=data_train[pred_vars_FE], categorical_feature=[0],
label=data_train['ln_wage'])
R
pred_vars_FE <- c("idcode", pred_vars)
data_bst <- gpb.Dataset(data = data_train[,pred_vars_FE], categorical_feature = c(1),
label = data_train[,"ln_wage"])
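Since this model has no random effects, it is trained like a plain gradient tree-boosting model. The following Python sketch uses illustrative tuning parameters (they should be tuned as in Section 4); without a gp_model argument, gpb.train performs standard boosting.
Python
# Sketch: training the fixed effects-only model (no gp_model argument)
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
          'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0}  # illustrative values
bst = gpb.train(params=params, train_set=data_bst, num_boost_round=379)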
2.3 Subject and time grouped random effects
One can also use random effects for both subjects (idcode) and time (t). This corresponds to the following model:

$$y_{ij} = F(x_{ij}) + b_i + b_j + \epsilon_{ij},$$
where bi and bj are the idcode and t random effects. Such a model can be specified in GPBoost as follows. In contrast to the single-level random effects model in Section 2.1, we now pass two variables (idcode and t) to the group_data argument of the GPModel() constructor. When making predictions with such a model, both grouping variables also need to be provided; see the sketch after the code below.
Python
gp_model = gpb.GPModel(group_data=data_train[['idcode','t']], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
R
gp_model <- GPModel(group_data = data_train[,c("idcode","t")], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
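When making predictions with this two-level model (see Section 5 for the general prediction workflow), both grouping variables are passed to group_data_pred. A brief Python sketch, assuming a model gpbst trained as in Section 3:
Python
# Sketch: predictions for the crossed random effects model (idcode and t)
pred = gpbst.predict(data=data_test[pred_vars],
                     group_data_pred=data_test[['idcode','t']],
                     pred_latent=False)
y_pred = pred['response_mean']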
2.4 Subject random effects with temporal random slopes
In the random effects models introduced so far (e.g., in Section 2.1), every subject (idcode) has a different intercept (= mean). However, besides this "shift", the evolution over time is the same for all persons. This restriction can be relaxed by including polynomial random coefficients (= random slopes). For instance, a model with random slopes for t and t^2 is given by

$$y_{ij} = F(x_{ij}) + b_{0,i} + b_{1,i}\, t_{ij} + b_{2,i}\, t_{ij}^2 + \epsilon_{ij},$$
where b(0,i) are the intercept random effects, b(1,i) and b(2,i) are the random slopes, and tij denotes the time t for sample ij. Due to the temporal random slopes, every person has a different evolution over time t in this model. Such a model can be specified in GPBoost as shown in the code below. The group_rand_coef_data argument contains the random coefficient covariate data t and t^2. Further, the indices in ind_effect_group_rand_coef indicate for which categorical variable in group_data this covariate data should be used to obtain random coefficients. Here, there is only one categorical variable (idcode), and thus ind_effect_group_rand_coef is simply (1,1), i.e., both t and t^2 belong to idcode.
Python
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
group_rand_coef_data=data_train[["t","t_sq"]],
ind_effect_group_rand_coef=[1,1])
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
R
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
group_rand_coef_data = cbind(data_train[,"t"], data_train[,"t_sq"]),
ind_effect_group_rand_coef = c(1,1))
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
2.5 Subject-specific AR(1) / Gaussian process models
A further extension of the above random slopes model is to use time series models, or Gaussian processes, instead of polynomial random slopes. This allows for modeling the temporal evolution of every subject separately in a more flexible way. For instance, one can use subject-specific AR(1) models:

$$y_{ij} = F(x_{ij}) + b_{ij} + \epsilon_{ij}, \qquad b_{ij} = \varphi\, b_{i,j-1} + \eta_{ij},$$
In this model, every person i has separate random effects bij which evolve over time according to an AR(1) model with autoregressive coefficient φ and independent innovations ηij (an AR(1) model is a discretized Gaussian process with an exponential covariance function). Such a model can be specified in GPBoost as shown in the following code. A temporal Gaussian process is defined by passing the time variable t to the gp_coords argument. We use a Gaussian process with an exponential covariance function, which is equivalent to an AR(1) model. Further, the cluster_ids argument indicates that every person (idcode) has a separate (= a priori independent) Gaussian process.
Python
gp_model = gpb.GPModel(gp_coords=data_train['t'], cov_function='exponential',
cluster_ids=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
R
gp_model <- GPModel(gp_coords = data_train[,"t"], cov_function="exponential",
cluster_ids = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
After training, the AR(1) model parameters can be obtained from the Gaussian process covariance parameters as follows.
Python
cov_pars = gp_model.get_cov_pars()
phi_hat = np.exp(-1 / cov_pars['GP_range'][0])
sigma2_hat = cov_pars['GP_var'][0] * (1. - phi_hat ** 2)
print("Estimated innovation variance and AR(1) coefficient of year effect:")
print([sigma2_hat, phi_hat])
R
cov_pars <- gp_model$get_cov_pars()
phi_hat <- exp(-1 / cov_pars["GP_range"])
sigma2_hat <- cov_pars["GP_var"] * (1 - phi_hat^2)
print("Estimated innovation variance and AR(1) coefficient:")
print(c(sigma2 = sigma2_hat, phi = phi_hat))
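The conversion above follows from the standard correspondence between a stationary AR(1) process on an equidistant time grid (with unit time steps) and a Gaussian process with an exponential covariance function. Writing the GP range as ρ and the GP marginal variance as σ²_GP,

$$\mathrm{Cov}(b_j, b_{j+h}) = \sigma_{GP}^2 \exp\left(-\frac{|h|}{\rho}\right) = \sigma_{GP}^2\, \varphi^{|h|}, \qquad \varphi = \exp(-1/\rho), \qquad \sigma^2 = \sigma_{GP}^2 (1 - \varphi^2),$$

where φ is the AR(1) coefficient and σ² the innovation variance; this is exactly what the code above computes from GP_range and GP_var.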
2.6 Subject grouped random effects and a joint AR(1) model
Another alternative is to use a single temporal AR(1) model that is shared across all subjects but include separate, temporally constant subject (idcode) random effects. This corresponds to the following model:

$$y_{ij} = F(x_{ij}) + b_i + b_j + \epsilon_{ij}, \qquad b_j = \varphi\, b_{j-1} + \eta_j,$$

where bi are the temporally constant idcode random effects and bj is an AR(1) process over time that is shared by all subjects. Such a model can be specified in GPBoost as follows.
Python
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
gp_coords=data_train['t'], cov_function='exponential')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
R
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
gp_coords = data_train[,"t"], cov_function="exponential")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
3 Training a GPBoost model
The following code shows how to train a GPBoost model. We first define a random effects model (gp_model), a Dataset (data_bst), and tuning parameters (params and nrounds = number of trees / boosting iterations), and then run the GPBoost algorithm by calling the gpb.train function. We use a subject (idcode) random effects model with temporal random slopes (see Section 2.4), as such a model often provides a good compromise between flexibility/accuracy and runtime. Note that we use tuning parameters that have been selected beforehand (see Section 4).
Python
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
group_rand_coef_data=data_train[["t","t_sq"]],
ind_effect_group_rand_coef=[1,1])
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0}
nrounds = 379
gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
num_boost_round=nrounds)
gp_model.summary() # Estimated random effects model
## =====================================================
## Covariance parameters (random effects):
## Param.
## Error_term 0.0531
## idcode 0.0432
## idcode_rand_coef_t 0.0004
## idcode_rand_coef_t_sq 0.0000
## =====================================================
R
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
group_rand_coef_data = cbind(data_train[,"t"], data_train[,"t_sq"]),
ind_effect_group_rand_coef = c(1,1))
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 10)
nrounds <- 379
gpbst <- gpb.train(data = data_bst, gp_model = gp_model,
nrounds = nrounds, params = params, verbose = 0)
summary(gp_model) # Estimated random effects model
When there are multiple random effects (as is the case here), it can be faster to use nelder_mead instead of gradient_descent (the default) as the optimizer. This can be set as follows before calling gpb.train. For this data set, though, it does not matter much.
Python
gp_model.set_optim_params(params={"optimizer_cov": "nelder_mead"})
R
gp_model$set_optim_params(params = list(optimizer_cov = "nelder_mead"))
4 Choosing tuning parameters
It is important that tuning parameters are chosen appropriately for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below, we show how tuning parameters can be chosen using the gpb.grid_search_tune_parameters (Python) / gpb.grid.search.tune.parameters (R) function. We use a deterministic grid search with the parameter combinations shown in the code below. Note that we tune the max_depth parameter and thus set num_leaves to a large value. We randomly split the training data into 80% inner training data and 20% validation data and use the mean squared error (mse) as a prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (metric = "test_neg_log_likelihood", the default), which also takes prediction uncertainty into account. _Note: depending on the data set and the grid size, this can take some time (approx. half an hour on my laptop for this data set). Instead of a deterministic grid search as below, one can also do a random grid search (see num_try_random; a sketch is given after the code below) or another approach such as Bayesian optimization to speed things up._
Python
# Partition training data into inner training data and validation data
ntrain = data_train.shape[0]
np.random.seed(ntrain)
permute_aux = np.random.permutation(ntrain)
train_tune_idx = permute_aux[0:int(0.8 * ntrain)]
valid_tune_idx = permute_aux[int(0.8 * ntrain):ntrain]
folds = [(train_tune_idx, valid_tune_idx)]
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid = {'learning_rate': [1,0.1,0.01], 'max_depth': [1,2,3,5,10],
'min_data_in_leaf': [10,100,1000], 'lambda_l2': [0,1,10]}
other_params = {'num_leaves': 2**10, 'verbose': 0}
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
group_rand_coef_data=data_train[["t","t_sq"]],
ind_effect_group_rand_coef=[1,1])
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
# Find optimal tuning parameters
opt_params = gpb.grid_search_tune_parameters(param_grid=param_grid, params=other_params,
num_try_random=None, folds=folds, seed=1000,
train_set=data_bst, gp_model=gp_model,
num_boost_round=1000, early_stopping_rounds=10,
verbose_eval=1, metric='mse')
opt_params
# {'best_params': {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10, 'lambda_l2': 10}, 'best_iter': 379, 'best_score': 0.08000593485771226}
R
# Partition training data into inner training data and validation data
ntrain <- dim(data_train)[1]
set.seed(ntrain)
valid_tune_idx <- sample.int(ntrain, as.integer(0.2*ntrain))
folds <- list(valid_tune_idx)
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid <- list("learning_rate" = c(1,0.1,0.01), "max_depth" = c(1,2,3,5,10),
"min_data_in_leaf" = c(10,100,1000), "lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
group_rand_coef_data = cbind(data_train[,"t"], data_train[,"t_sq"]),
ind_effect_group_rand_coef = c(1,1))
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
# Find optimal tuning parameters
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
num_try_random = NULL, folds = folds,
data = data_bst, gp_model = gp_model,
nrounds = 1000, early_stopping_rounds = 10,
verbose_eval = 1, metric = "mse")
opt_params
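To speed things up, the deterministic grid search above can be replaced by a random grid search over the same grid. The following Python sketch reuses the param_grid, other_params, folds, gp_model, and data_bst objects defined in the Python code above; the number of trials (30) is an arbitrary illustrative choice.
Python
# Sketch: random grid search instead of a full deterministic grid search
opt_params_random = gpb.grid_search_tune_parameters(
    param_grid=param_grid, params=other_params,
    num_try_random=30,  # evaluate 30 randomly chosen parameter combinations
    folds=folds, seed=1000, train_set=data_bst, gp_model=gp_model,
    num_boost_round=1000, early_stopping_rounds=10,
    verbose_eval=1, metric='mse')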
5 Prediction
Predictions can be obtained via the predict function. We need to provide both the test predictor variables (data) for the tree ensemble and the test subject variables and random slopes data (group_data_pred and group_rand_coef_data_pred) for the random effects model. Below, we make predictions on the left-out test data and calculate the test mean squared error (MSE). As the results below show, the prediction accuracy of this random slopes model is higher than that of the simpler subject random effects model without random slopes introduced in Section 2.1 (see Part II for results of the latter model). A sketch for additionally obtaining predictive variances follows after the code below.
Python
pred = gpbst.predict(data=data_test[pred_vars], group_data_pred=data_test['idcode'],
group_rand_coef_data_pred=data_test[["t","t_sq"]],
predict_var=False, pred_latent=False)
y_pred = pred['response_mean']
np.mean((data_test['ln_wage'] - y_pred)**2)
## 0.08117682279905969
R
pred <- predict(gpbst, data = data_test[,pred_vars], group_data_pred = data_test[,"idcode"],
group_rand_coef_data_pred = cbind(data_test[,"t"], data_test[,"t_sq"]),
predict_var = FALSE, pred_latent = FALSE)
y_pred <- pred[["response_mean"]]
mean((data_test[,"ln_wage"] - y_pred)^2)
6 Conclusion and references
We have demonstrated how the GPBoost library can be used for modeling longitudinal data. There are several more features of the GPBoost library that we did not show here. For instance, in Part II of this series, we show how to interpret the fixed effects function using machine learning interpretability techniques, and we also demonstrate how the GPBoost library can handle (generalized) linear mixed effects models (GLMMs, i.e., F() is linear instead of a tree ensemble); a minimal sketch of the latter is given below. You may also want to have a look at the Python examples and R examples.
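As a pointer to the latter, here is a minimal Python sketch of a linear mixed effects model with GPBoost; it assumes that GPModel provides a fit method accepting the response y and a fixed effects design matrix X (consult the GPBoost documentation for the exact interface).
Python
# Hedged sketch: linear mixed effects model (linear F() instead of a tree ensemble)
X = np.column_stack((np.ones(data_train.shape[0]), data_train[pred_vars]))  # add an intercept
gp_model_lin = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
gp_model_lin.fit(y=data_train['ln_wage'], X=X)  # assumed interface: fit(y, X)
gp_model_lin.summary()  # estimated coefficients and variance parameters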
- F. Sigrist. Gaussian Process Boosting. The Journal of Machine Learning Research, 23(1):10565–10610, 2022.
- F. Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1894–1905, 2023.
- https://github.com/fabsig/GPBoost