
Mixed Effects Machine Learning for High-Cardinality Categorical Variables - Part II: GPBoost…

A demo of GPBoost in Python & R using real-world data

High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set. In Part I of this series, we empirically compared different machine learning methods and found that random effects are an effective tool for handling high-cardinality categorical variables, with the GPBoost algorithm [Sigrist, 2022, 2023] achieving the highest prediction accuracy. In this article, we demonstrate how the GPBoost algorithm, which combines tree-boosting with random effects, can be applied with the Python and R packages of the GPBoost library. GPBoost version 1.2.1 is used in this demo.

Table of contents

1 Introduction
2 Data: description, loading, and sample split
3 Training a GPBoost model
4 Choosing tuning parameters
5 Prediction
6 Interpretation
7 Further modeling options
  · 7.1 Interaction between categorical variables and other predictor variables
  · 7.2 (Generalized) linear mixed effects models
8 Conclusion and references

1 Introduction

Applying a GPBoost model involves the following main steps:

  1. Define a GPModel in which one specifies the following:

    • A Random Effects model: grouped random effects via group_data and/or Gaussian processes via gp_coords
    • The likelihood (= distribution of the response variable conditional on fixed and random effects)
  2. Create a Dataset containing the response variable (label) and fixed effects predictor variables (data)
  3. Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters (R) / gpb.grid_search_tune_parameters (Python)
  4. Train the model
  5. Make predictions and/or interpret the trained model

In the following, we go through these points step-by-step.
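
Before going through these steps in detail, the following minimal sketch shows how they map to GPBoost API calls. The variables X, y, group, X_new, and group_new are placeholders for illustration; concrete code for the wages data follows in the next sections.

Python

import gpboost as gpb
# 1. Define a random effects model (grouped random effects for a categorical variable)
gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
# 2. Create a Dataset with the response variable and fixed effects predictors
data_bst = gpb.Dataset(data=X, label=y)
# 3. Choose tuning parameters (placeholder values)
params = {"learning_rate": 0.1, "max_depth": 3, "verbose": 0}
# 4. Train the model
bst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model, num_boost_round=100)
# 5. Make predictions for new data
pred = bst.predict(data=X_new, group_data_pred=group_new)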

2 Data: description, loading, and sample split

The data used in this demo is the wages data which was also analyzed in Sigrist (2022). It can be downloaded from the GitHub repository linked in the code below. The data contains 28,013 samples and one high-cardinality categorical variable (person ID = idcode) with 4,711 different levels. The response variable is the logarithmic real wage (ln_wage), and the data includes several predictor variables such as age, total work experience, job tenure in years, date, and others.

In the code below, we first load the data and randomly partition the data into 80% training data and 20% test data. The latter is left aside and will be used later for measuring prediction accuracy.

Python

import gpboost as gpb
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
# Partition into training and test data
n = data.shape[0]
np.random.seed(n)
permute_aux = np.random.permutation(n)
train_idx = permute_aux[0:int(0.8 * n)]
test_idx = permute_aux[int(0.8 * n):n]
data_train = data.iloc[train_idx]
data_test = data.iloc[test_idx]
# Define fixed effects predictor variables
pred_vars = [col for col in data.columns if col not in ['ln_wage', 'idcode', 't']]

R

library(gpboost)
library(tidyverse)
# Load data
data <- read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data <- as.matrix(data) # Convert to matrix since the boosting part does not support data.frames
# Partition into training and test data
n <- dim(data)[1]
set.seed(n)
permute_aux <- sample.int(n, n)
train_idx <- permute_aux[1:as.integer(0.8*n)]
test_idx <- permute_aux[(as.integer(0.8*n)+1):n]
data_train <- data[train_idx,]
data_test <- data[test_idx,]
# Define fixed effects predictor variables
pred_vars <- colnames(data)[-which(colnames(data) %in% c("ln_wage", "idcode", "t"))]
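
As a quick sanity check on the figures quoted above, the sample size and the number of distinct idcode levels can be inspected as follows (a small Python sketch, assuming data has been loaded as in the code above):

Python

# Number of samples and number of distinct person IDs
print(data.shape[0])              # 28013
print(data['idcode'].nunique())   # 4711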

3 Training a GPBoost model

The code below shows how to train a GPBoost model. We first define a random effects model (gp_model), a Dataset (data_bst), and tuning parameters (params and nrounds = number of boosting iterations, i.e., the number of trees), and then run the GPBoost algorithm by calling the gpb.train function. Note that we use tuning parameters that have been selected beforehand (see below).

Python

gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
          'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0}
nrounds = 391
gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
                  num_boost_round=nrounds)

R

gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 10)
nrounds <- 391
gpbst <- gpb.train(data = data_bst, gp_model = gp_model, 
                   nrounds = nrounds, params = params, verbose = 0)

For this data, there is only one high-cardinality categorical variable. If there are multiple categorical variables that should be modeled using random effects, this can be specified in the gp_model as shown in the following code. Note: computations can become slower with multiple random effects compared to only one. Also, it can be faster to use nelder_mead instead of gradient_descent (the default) as the optimizer when there are multiple random effects. This can be set as follows before calling gpb.train.

Python

gp_model = gpb.GPModel(group_data=data_train[['idcode','t']], likelihood='gaussian')
gp_model.set_optim_params(params={"optimizer_cov": "nelder_mead"})

R

gp_model <- GPModel(group_data = data_train[,c("idcode","t")], likelihood = "gaussian")
gp_model$set_optim_params(params = list(optimizer_cov = "nelder_mead"))

4 Choosing tuning parameters

It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below, we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function. We use a deterministic grid search with the parameter combinations shown in the code below. Note that we tune the max_depth parameter and thus set num_leaves to a large value. We randomly split the training data into 80% inner training data and 20% validation data and use the mean squared error (mse) as the prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (test_neg_log_likelihood = default), which also takes prediction uncertainty into account. Note: depending on the data set and the grid size, this can take some time (a few minutes on my laptop for this data set). Instead of a deterministic grid search as below, one can also do a random grid search (see num_try_random) or another approach such as Bayesian optimization to speed things up.

Python

# Partition training data into inner training data and validation data
ntrain = data_train.shape[0]
np.random.seed(ntrain)
permute_aux = np.random.permutation(ntrain)
train_tune_idx = permute_aux[0:int(0.8 * ntrain)]
valid_tune_idx = permute_aux[int(0.8 * ntrain):ntrain]
folds = [(train_tune_idx, valid_tune_idx)]
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid = {'learning_rate': [1,0.1,0.01], 'max_depth': [1,2,3,5,10],
              'min_data_in_leaf': [10,100,1000], 'lambda_l2': [0,1,10]}
other_params = {'num_leaves': 2**10, 'verbose': 0}
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
# Find optimal tuning parameters
opt_params = gpb.grid_search_tune_parameters(param_grid=param_grid, params=other_params,
                                             num_try_random=None, folds=folds, seed=1000,
                                             train_set=data_bst, gp_model=gp_model,
                                             num_boost_round=1000, early_stopping_rounds=10,
                                             verbose_eval=1, metric='mse')
opt_params
# {'best_params': {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10, 'lambda_l2': 10}, 'best_iter': 391, 'best_score': 0.08507862825877585}

R

# Partition training data into inner training data and validation data
ntrain <- dim(data_train)[1]
set.seed(ntrain)
valid_tune_idx <- sample.int(ntrain, as.integer(0.2*ntrain))
folds <- list(valid_tune_idx)
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid <- list("learning_rate" = c(1,0.1,0.01), "max_depth" = c(1,2,3,5,10),
                   "min_data_in_leaf" = c(10,100,1000), "lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
# Find optimal tuning parameters
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, folds = folds,
                                              data = data_bst, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse")
opt_params
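
As mentioned above, a random grid search can reduce the computation time. The following minimal Python sketch reuses the param_grid, other_params, folds, data_bst, and gp_model defined above; the choice of 20 randomly sampled parameter combinations is arbitrary and for illustration only:

Python

# Evaluate only 20 randomly chosen parameter combinations instead of the full grid
opt_params_random = gpb.grid_search_tune_parameters(param_grid=param_grid, params=other_params,
                                                    num_try_random=20, folds=folds, seed=1000,
                                                    train_set=data_bst, gp_model=gp_model,
                                                    num_boost_round=1000, early_stopping_rounds=10,
                                                    verbose_eval=1, metric='mse')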

5 Prediction

Predictions can be obtained via the predict function. We need to provide both test predictor variables (data) for the tree ensemble and test categorical variables (group_data_pred) for the random effects model. Below, we make predictions on the left-out test data and calculate the test mean squared error (MSE).

Python

pred = gpbst.predict(data=data_test[pred_vars], group_data_pred=data_test['idcode'], 
                     predict_var=False, pred_latent=False)
y_pred = pred['response_mean']
np.mean((data_test['ln_wage'] - y_pred)**2)
## 0.0866730945477676

R

pred <- predict(gpbst, data = data_test[,pred_vars], group_data_pred = data_test[,"idcode"], 
                predict_var = FALSE, pred_latent = FALSE)
y_pred <- pred[["response_mean"]]
mean((data_test[,"ln_wage"] - y_pred)^2)
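
Besides point predictions, predictive variances can be requested by setting predict_var=True. The following Python sketch assumes that the returned dictionary then contains a 'response_var' entry alongside 'response_mean':

Python

# Point predictions and predictive variances on the test data
pred = gpbst.predict(data=data_test[pred_vars], group_data_pred=data_test['idcode'],
                     predict_var=True, pred_latent=False)
y_pred_mean = pred['response_mean']
y_pred_var = pred['response_var']   # predictive variances (assumed key name)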

6 Interpretation

Information on the estimated random effects model can be obtained as follows.

Python

gp_model.summary()
## =====================================================
## Covariance parameters (random effects):
##             Param.
## Error_term  0.0703
## idcode      0.0448
## =====================================================

R

summary(gp_model)

The GPBoost library supports several tools for understanding the form of the fixed effects tree ensemble function:

  • SHAP values
  • SHAP & partial dependence plots (one and two-dimensional)
  • SHAP force plots
  • Interaction statistics
  • Split-based variable importances

Below we look at SHAP values and a SHAP dependence plot.

Python

import shap
shap_values = shap.TreeExplainer(gpbst).shap_values(data_train[pred_vars])
feature_names = [ a + ": " + str(b) for a,b in zip(pred_vars, np.abs(shap_values).mean(0).round(2)) ]
shap.summary_plot(shap_values, data_train[pred_vars], max_display=10, feature_names=feature_names)
shap.dependence_plot("ttl_exp", shap_values, data_train[pred_vars], alpha=0.5)

R

library(SHAPforxgboost)
library(ggplot2)
shap.plot.summary.wrap1(gpbst, X = data_train[,pred_vars], top_n=10) + 
  ggtitle("SHAP values")
shap_long <- shap.prep(gpbst, X_train = data_train[,pred_vars])
shap.plot.dependence(data_long = shap_long, x = "ttl_exp", 
                     color_feature = "occ_code_7", smooth = FALSE, size = 2) + 
  ggtitle("SHAP dependence plot for ttl_exp")

7 Further modeling options

7.1 Interaction between categorical variables and other predictor variables

In the above model, there is no interaction between the random effects and the fixed effects predictor variables, i.e., between the high-cardinality categorical variable and the other predictor variables. Such an interaction can be modeled by additionally including the categorical variable in the tree ensemble. The following code shows how this can be done. Below, we also calculate the test MSE of this model; it is slightly smaller than for the model without the interaction. Note: the tuning parameters used below have been chosen beforehand, but for the sake of brevity, we omit this code.

Python

pred_vars_int = ['idcode'] + pred_vars
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars_int], categorical_feature=[0], 
                       label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
          'lambda_l2': 1, 'num_leaves': 2**10, 'verbose': 0}
gpbst = gpb.train(params=params, train_set=data_bst,  gp_model=gp_model,
                  num_boost_round=482)
pred = gpbst.predict(data=data_test[pred_vars_int], group_data_pred=data_test['idcode'])
np.mean((data_test['ln_wage'] - pred['response_mean'])**2)
## 0.08616085739486774

R

pred_vars_int <- c("idcode", pred_vars)
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars_int], categorical_feature = c(1), 
                        label = data_train[,"ln_wage"])
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
gpbst <- gpb.train(data = data_bst, gp_model = gp_model, 
                   nrounds = 482, params = params, verbose = 0)
pred <- predict(gpbst, data = data_test[,pred_vars_int], group_data_pred = data_test[,"idcode"])
mean((data_test[,"ln_wage"] - pred[["response_mean"]])^2)

7.2 (Generalized) linear mixed effects models

The GPBoost library also supports (generalized) linear mixed effects models (GLMMs). In a GLMM, the fixed effects function F() is assumed to be a linear function instead of a tree ensemble. The code below shows how this can be done. Note that one needs to manually add a column of 1s to include an intercept. With the option params={"std_dev": True} / params = list(std_dev=TRUE), one obtains standard deviations and p-values in the output of the summary function.

Python

X = data_train[pred_vars]
X = X.assign(Intercept = 1)
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood="gaussian")
gp_model.fit(y=data_train['ln_wage'], X=X, params={"std_dev": True})
gp_model.summary()
## =====================================================
## Model summary:
##  Log-lik      AIC     BIC
## -6191.97 12493.95 12934.9
## Nb. observations: 22410
## Nb. groups: 4567 (idcode)
## -----------------------------------------------------
## Covariance parameters (random effects):
##             Param.  Std. dev.
## Error_term  0.0794     0.0008
## idcode      0.0452     0.0014
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
##              Param.  Std. dev.  z value  P(>|z|)
## grade        0.0379     0.0025  14.8659   0.0000
## age          0.0032     0.0013   2.4665   0.0136
## ttl_exp      0.0310     0.0012  25.4221   0.0000
## tenure       0.0107     0.0009  11.9256   0.0000
## not_smsa    -0.1326     0.0079 -16.8758   0.0000
## south       -0.0884     0.0072 -12.3259   0.0000
...

R

X <- cbind(Intercept = rep(1, dim(data_train)[1]), data_train[,pred_vars])
gp_model <- fitGPModel(group_data = data_train[,"idcode"], X = X, 
                       y = data_train[,"ln_wage"], likelihood = "gaussian", 
                       params = list(std_dev=TRUE))
summary(gp_model)
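
For a non-Gaussian response variable, only the likelihood needs to be changed. The following hypothetical Python sketch assumes a binary 0/1 response y_binary, which is not part of the wages data, and uses a Bernoulli likelihood with a probit link function:

Python

# Hypothetical example: GLMM for a binary response (y_binary is an assumed variable)
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood="bernoulli_probit")
gp_model.fit(y=y_binary, X=X)  # X as defined in the Python code above, including the intercept column
gp_model.summary()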

8 Conclusion and references

We have demonstrated how tree-boosting combined with random effects can be applied with the Python and R packages of the GPBoost library. There are several more features of the GPBoost library that we did not show here. You may want to have a look at the Python examples and R examples. In Part III of this series, we will show how longitudinal data (aka panel data) can be modeled with the GPBoost library.

  • F. Sigrist. Gaussian Process Boosting. The Journal of Machine Learning Research, 23(1):10565–10610, 2022.
  • F. Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1894–1905, 2023.
  • https://github.com/fabsig/GPBoost
