Mixed Effects Machine Learning for Longitudinal & Panel Data with GPBoost (Part III)

A demo of GPBoost in Python & R using real-world data

Fabio Sigrist
Towards Data Science

--

Illustration of longitudinal data: time series plots for different subjects (idcode) — Image by author

In Part I and Part II of this series, we showed how random effects can be used for modeling high-cardinality categorical variables in machine learning models, and we gave an introduction to the GPBoost library, which implements the GPBoost algorithm combining tree-boosting with random effects. In this article, we demonstrate how the Python and R packages of the GPBoost library can be used for longitudinal data (aka repeated measures or panel data). You might want to read Part II of this series first, as it gives an introduction to the GPBoost library. GPBoost version 1.2.1 is used in this demo.
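
If the GPBoost library is not yet installed, it can be installed as shown below; note that these commands install the current version, which may be newer than the one used in this demo.

Python

pip install gpboost

R

install.packages("gpboost")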

Table of contents

1 Data: description, loading, and sample split
2 Modeling options for longitudinal data in GPBoost
· · 2.1 Subject grouped random effects
· · 2.2 Fixed effects only
· · 2.3 Subject and time grouped random effects
· · 2.4 Subject random effects with temporal random slopes
· · 2.5 Subject-specific AR(1) / Gaussian process models
· · 2.6 Subject grouped random effects and a joint AR(1) model
3 Training a GPBoost model
4 Choosing tuning parameters
5 Prediction
6 Conclusion and references

1 Data: description, loading, and sample split

The data used in this demo is the wages data which was already used in Part II. It can be downloaded from here. The data set contains a total of 28'013 samples for 4'711 persons for whom data was measured over several years. Such data is called longitudinal data, or panel data, since for every subject (person ID = idcode), data was collected repeatedly over time (years = t). In other words, the samples for every level of the categorical variable idcode are repeated measurements over time. The response variable is the logarithmic real wage (ln_wage), and the data includes several predictor variables such as age, total work experience, job tenure in years, date, and others.

In the code below, we load the data and randomly partition the data into 80% training data and 20% test data.

Python

import gpboost as gpb
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data = data.assign(t_sq=data['t']**2)  # Add t^2
# Partition into training and test data
n = data.shape[0]
np.random.seed(n)
permute_aux = np.random.permutation(n)
train_idx = permute_aux[0:int(0.8 * n)]
test_idx = permute_aux[int(0.8 * n):n]
data_train = data.iloc[train_idx]
data_test = data.iloc[test_idx]
# Define fixed effects predictor variables
pred_vars = [col for col in data.columns if col not in ['ln_wage', 'idcode', 't', 't_sq']]

R

library(gpboost)
library(tidyverse)
# Load data
data <- read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data[,"t_sq"] <- data[,"t"]^2 # Add t^2
data <- as.matrix(data) # Convert to matrix since the boosting part does not support data.frames
# Partition into training and test data
n <- dim(data)[1]
set.seed(n)
permute_aux <- sample.int(n, n)
train_idx <- permute_aux[1:as.integer(0.8*n)]
test_idx <- permute_aux[(as.integer(0.8*n)+1):n]
data_train <- data[train_idx,]
data_test <- data[test_idx,]
# Define fixed effects predictor variables
pred_vars <- colnames(data)[-which(colnames(data) %in% c("ln_wage", "idcode", "t", "t_sq"))]

2 Modeling options for longitudinal data in GPBoost

Various combinations of fixed and random effects models for longitudinal data can be handled with the GPBoost library. In the following, we demonstrate several of them; this list is not exhaustive, and other models are possible. After defining a random effects model (gp_model) and a Dataset (data_bst), training can be done with the gpb.train function. This training step is identical for all models presented in the following. For this reason, we do not include it here but only show how to define the corresponding models. See Section 3 below for how training is done.

2.1 Subject grouped random effects

A first option is to use grouped random effects for the categorical subject variable (idcode) and fixed effects for the time variable. This corresponds to the following model:
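
y_ij = F(x_ij) + b_i + ε_ij,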

where i is the person index (= idcode), j is the temporal index, and we assume that x_ij contains the time variable (or a “dummified” / one-hot encoded version of it) and other predictor variables. The idcode random effect for person number i is b_i, F(x_ij) is the tree-ensemble fixed effects function, and ε_ij is an independent error term. Such a model can be specified in GPBoost as shown below. Subject-specific random effects are defined by passing the idcode categorical variable to the group_data argument of the GPModel() constructor. Note that the time variable t has very low cardinality, and we can thus use one-hot encoding or dummy variables for it. This is what we do below, as pred_vars, which defines the fixed effects predictor variables, already contains year dummy variables.

Python

gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])

R

gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])

2.2 Fixed effects only

Another option is to use a fixed effects-only model in which both subjects (idcode) and time are modeled using fixed effects, and there are no random effects. This corresponds to the following model:
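
y_ij = F(x_ij) + ε_ij,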

again, for notational simplicity, assuming that x_ij contains the subject (idcode) and time variables. Such a model can be specified in GPBoost by simply including the subject (idcode) and time variables in the tree-boosting predictor variables and having no random effects model.

Python

pred_vars_FE = ['idcode'] + pred_vars
data_bst = gpb.Dataset(data=data_train[pred_vars_FE], categorical_feature=[0],
                       label=data_train['ln_wage'])

R

pred_vars_FE <- c("idcode", pred_vars)
data_bst <- gpb.Dataset(data = data_train[,pred_vars_FE], categorical_feature = c(1),
                        label = data_train[,"ln_wage"])

2.3 Subject and time grouped random effects

One can also use random effects for both subjects (idcode) and time (t). This corresponds to the following model:
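
y_ij = F(x_ij) + b_i + b_j + ε_ij,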

where b_i and b_j are the idcode and t random effects, respectively. Such a model can be specified in GPBoost as follows. In contrast to the single-level random effects model in Section 2.1 above, we now pass two variables (idcode and t) to the group_data argument of the GPModel() constructor.

Python

gp_model = gpb.GPModel(group_data=data_train[['idcode','t']], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])

R

gp_model <- GPModel(group_data = data_train[,c("idcode","t")], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])

2.4 Subject random effects with temporal random slopes

In the random effects models introduced so far (e.g., in Section 2.1), every subject (idcode) has a different intercept (=mean). However, besides this “shift”, the temporal evolution over time is the same for all persons. This restriction can be relaxed by including polynomial random coefficients (= random slopes). For instance, a model with random slopes for t and t^2 is given by
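
y_ij = F(x_ij) + b_(0,i) + b_(1,i) · t_ij + b_(2,i) · t_ij^2 + ε_ij,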

where b_(0,i) are the intercept random effects, b_(1,i) and b_(2,i) are the random slopes, and t_ij denotes the time t for sample ij. Due to the temporal random slopes, every person has a different evolution over time t in this model. Such a model can be specified in GPBoost as shown in the code below. The group_rand_coef_data variable contains the random coefficient covariate data t and t^2. Further, the indices in ind_effect_group_rand_coef indicate for which categorical variable in group_data this covariate data should be used to obtain random coefficients. Here, there is only one categorical variable (idcode), and thus ind_effect_group_rand_coef is simply (1,1) (i.e., both t and t^2 belong to idcode).

Python

gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
                       group_rand_coef_data=data_train[["t","t_sq"]],
                       ind_effect_group_rand_coef=[1,1])
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])

R

gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
                    group_rand_coef_data = cbind(data_train[,"t"], data_train[,"t_sq"]),
                    ind_effect_group_rand_coef = c(1,1))
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])

2.5 Subject-specific AR(1) / Gaussian process models

A further extension of the above random slopes model is to use time series models, or Gaussian processes, instead of polynomial random slopes. This allows for separately modeling the temporal evolution of every subject in a more flexible way. For instance, one can use subject-specific AR(1) models:
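
y_ij = F(x_ij) + b_ij + ε_ij.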

In this model, every person i has separate random effects b_ij which evolve over time according to an AR(1) model (= a discretized Gaussian process with an exponential covariance function). Such a model can be specified in GPBoost as shown in the following code. A temporal Gaussian process is defined by passing the time variable t to the gp_coords argument. We use a Gaussian process with an exponential covariance function, which is equivalent to an AR(1) model. Further, the cluster_ids argument indicates that every person (idcode) has a separate (= a priori independent) Gaussian process.

Python

gp_model = gpb.GPModel(gp_coords=data_train['t'], cov_function='exponential',
                       cluster_ids=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])

R

gp_model <- GPModel(gp_coords = data_train[,"t"], cov_function = "exponential",
                    cluster_ids = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])

After training, the AR(1) model parameters can be obtained from the Gaussian process covariance parameters as follows.
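
The exponential covariance function with variance σ²_GP (GP_var) and range parameter ρ (GP_range) implies a correlation of exp(-1/ρ) between two consecutive time points since the time variable t is measured in unit steps here. This equals the AR(1) autoregressive coefficient, φ = exp(-1/ρ), and since the stationary variance of an AR(1) process is σ² / (1 - φ²), the innovation variance is σ² = σ²_GP · (1 - φ²). This conversion is what the code below does with the estimated covariance parameters.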

Python

cov_pars = gp_model.get_cov_pars()
phi_hat = np.exp(-1 / cov_pars['GP_range'][0])
sigma2_hat = cov_pars['GP_var'][0] * (1. - phi_hat ** 2)
print("Estimated innovation variance and AR(1) coefficient of year effect:")
print([sigma2_hat, phi_hat])

R

cov_pars = gp_model$get_cov_pars()
phi_hat = exp(-1 / cov_pars["GP_range"])
sigma2_hat = cov_pars["GP_var"] * (1 - phi_hat^2)
print("Estimated innovation variance and AR(1) coefficient:")
print(c(sigma2 = sigma2_hat, phi = phi_hat))

2.6 Subject grouped random effects and a joint AR(1) model

Another alternative is to use a single temporal AR(1) model that is shared across all subjects and, in addition, separate temporally constant subject (idcode) random effects. This corresponds to the following model:
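
y_ij = F(x_ij) + b_i + b_j + ε_ij,

where b_i are the temporally constant subject (idcode) random effects and the time random effects b_j jointly follow an AR(1) model that is shared across all subjects.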

Such a model can be specified in GPBoost as follows.

Python

gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
                       gp_coords=data_train['t'], cov_function='exponential')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])

R

gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
                    gp_coords = data_train[,"t"], cov_function = "exponential")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])

3 Training a GPBoost model

The following code shows how to train a GPBoost model. We first define a random effects model (gp_model), a Dataset (data_bst), and tuning parameters (params and nrounds, the number of trees / boosting iterations), and then run the GPBoost algorithm by calling the gpb.train function. We use a subject (idcode) random effects model with temporal random slopes (see Section 2.4), as such a model often provides a good compromise between flexibility/accuracy and runtime. Note that we use tuning parameters that have been selected beforehand (see Section 4 below).

Python

gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
                       group_rand_coef_data=data_train[["t","t_sq"]],
                       ind_effect_group_rand_coef=[1,1])
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
          'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0}
nrounds = 379
gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
                  num_boost_round=nrounds)
gp_model.summary() # Estimated random effects model
## =====================================================
## Covariance parameters (random effects):
##                        Param.
## Error_term             0.0531
## idcode                 0.0432
## idcode_rand_coef_t     0.0004
## idcode_rand_coef_t_sq  0.0000
## =====================================================

R

gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
                    group_rand_coef_data = cbind(data_train[,"t"], data_train[,"t_sq"]),
                    ind_effect_group_rand_coef = c(1,1))
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 10)
nrounds <- 379
gpbst <- gpb.train(data = data_bst, gp_model = gp_model,
                   nrounds = nrounds, params = params, verbose = 0)
summary(gp_model) # Estimated random effects model

When there are multiple random effects (as is the case here), it can be faster to use nelder_mead instead of gradient_descent (= default) as the optimizer. This can be set as follows before calling gpb.train. For this data set, however, it does not matter.

Python

gp_model.set_optim_params(params={"optimizer_cov": "nelder_mead"})

R

gp_model$set_optim_params(params = list(optimizer_cov = "nelder_mead"))

4 Choosing tuning parameters

It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below, we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function (gpb.grid_search_tune_parameters in Python). We use a deterministic grid search with the parameter combinations shown in the code below. Note that we tune the max_depth parameter and thus set num_leaves to a large value. We randomly split the training data into 80% inner training data and 20% validation data and use the mean squared error (mse) as the prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (metric = "test_neg_log_likelihood", the default), which also takes prediction uncertainty into account. Note: depending on the data set and the grid size, this can take some time (approx. half an hour on my laptop for this data set). Instead of a deterministic grid search as done below, one can also do a random grid search (see num_try_random) or use another approach such as Bayesian optimization to speed things up.

Python

# Partition training data into inner training data and validation data
ntrain = data_train.shape[0]
np.random.seed(ntrain)
permute_aux = np.random.permutation(ntrain)
train_tune_idx = permute_aux[0:int(0.8 * ntrain)]
valid_tune_idx = permute_aux[int(0.8 * ntrain):ntrain]
folds = [(train_tune_idx, valid_tune_idx)]
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid = {'learning_rate': [1,0.1,0.01], 'max_depth': [1,2,3,5,10],
              'min_data_in_leaf': [10,100,1000], 'lambda_l2': [0,1,10]}
other_params = {'num_leaves': 2**10, 'verbose': 0}
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian',
                       group_rand_coef_data=data_train[["t","t_sq"]],
                       ind_effect_group_rand_coef=[1,1])
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
# Find optimal tuning parameters
opt_params = gpb.grid_search_tune_parameters(param_grid=param_grid, params=other_params,
                                             num_try_random=None, folds=folds, seed=1000,
                                             train_set=data_bst, gp_model=gp_model,
                                             num_boost_round=1000, early_stopping_rounds=10,
                                             verbose_eval=1, metric='mse')
opt_params
# {'best_params': {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10, 'lambda_l2': 10}, 'best_iter': 379, 'best_score': 0.08000593485771226}

R

# Partition training data into inner training data and validation data
ntrain <- dim(data_train)[1]
set.seed(ntrain)
valid_tune_idx <- sample.int(ntrain, as.integer(0.2*ntrain))
folds <- list(valid_tune_idx)
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid <- list("learning_rate" = c(1,0.1,0.01), "max_depth" = c(1,2,3,5,10),
                   "min_data_in_leaf" = c(10,100,1000), "lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian",
                    group_rand_coef_data = cbind(data_train[,"t"], data_train[,"t_sq"]),
                    ind_effect_group_rand_coef = c(1,1))
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
# Find optimal tuning parameters
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, folds = folds,
                                              data = data_bst, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse")
opt_params

5 Prediction

Predictions can be obtained via the predict function. We need to provide both the test predictor variables for the tree ensemble (data) and the test subject variables and random slopes data for the random effects model (group_data_pred and group_rand_coef_data_pred). Below, we make predictions on the left-out test data and calculate the test mean squared error (MSE). As the results below show, the prediction accuracy of this random slopes model is higher than that of the simpler subject random effects model without random slopes introduced in Section 2.1 (see Part II for the results of the latter model).

Python

pred = gpbst.predict(data=data_test[pred_vars], group_data_pred=data_test['idcode'],
                     group_rand_coef_data_pred=data_test[["t","t_sq"]],
                     predict_var=False, pred_latent=False)
y_pred = pred['response_mean']
np.mean((data_test['ln_wage'] - y_pred)**2)
## 0.08117682279905969

R

pred <- predict(gpbst, data = data_test[,pred_vars], group_data_pred = data_test[,"idcode"],
                group_rand_coef_data_pred = cbind(data_test[,"t"], data_test[,"t_sq"]),
                predict_var = FALSE, pred_latent = FALSE)
y_pred <- pred[["response_mean"]]
mean((data_test[,"ln_wage"] - y_pred)^2)

6 Conclusion and references

We have demonstrated how to use the GPBoost library for modeling longitudinal data. There are several more features of the GPBoost library that we did not show here. For instance, in Part II of this series, we show how to interpret the fixed effects function using machine learning interpretability techniques, and we also demonstrate how the GPBoost library can handle (generalized) linear mixed effects models (GLMMs, i.e., F() is linear instead of a tree ensemble). You may also want to have a look at the Python examples and R examples.

  • F. Sigrist. Gaussian Process Boosting. The Journal of Machine Learning Research, 23(1):10565–10610, 2022.
  • F. Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1894–1905, 2023.
  • https://github.com/fabsig/GPBoost

Fabio Sigrist is Professor of Applied Statistics and Data Science at Lucerne University of Applied Sciences and Arts.