High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set. In Part I of this series, we did an empirical comparison of different machine learning methods and found that random effects are an effective tool for handling high-cardinality categorical variables, with the GPBoost algorithm [Sigrist, 2022, 2023] having the highest prediction accuracy. In this article, we demonstrate how the GPBoost algorithm, which combines tree-boosting with random effects, can be applied with the Python and R packages of the GPBoost library. GPBoost version 1.2.1 is used in this demo.
Table of contents
- 1 Introduction
- 2 Data: description, loading, and sample split
- 3 Training a GPBoost model
- 4 Choosing tuning parameters
- 5 Prediction
- 6 Interpretation
- 7 Further modeling options
  - 7.1 Interaction between categorical variables and other predictor variables
  - 7.2 (Generalized) linear mixed effects models
- 8 Conclusion and references
1 Introduction
Applying a GPBoost model involves the following main steps:
- Define a GPModel in which one specifies the following:
  - A random effects model: grouped random effects via group_data and/or Gaussian processes via gp_coords
  - The likelihood (= distribution of the response variable conditional on fixed and random effects)
- Create a Dataset containing the response variable (label) and fixed effects predictor variables (data)
- Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
- Train the model
- Make predictions and/or interpret the trained model
In the following, we go through these points step-by-step.
2 Data: description, loading, and sample split
The data used in this demo is the wages data which was also analyzed in Sigrist (2022). It can be downloaded from here. The data contains 28’013 samples and one high-cardinality categorical variable (person ID = idcode) with 4’711 different levels. The response variable is the logarithmic real wage (ln_wage), and the data includes several predictor variables such as age, total work experience, job tenure in years, date, and others.
In the code below, we first load the data and randomly partition the data into 80% training data and 20% test data. The latter is left aside and will be used later for measuring prediction accuracy.
Python
import gpboost as gpb
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
# Partition into training and test data
n = data.shape[0]
np.random.seed(n)
permute_aux = np.random.permutation(n)
train_idx = permute_aux[0:int(0.8 * n)]
test_idx = permute_aux[int(0.8 * n):n]
data_train = data.iloc[train_idx]
data_test = data.iloc[test_idx]
# Define fixed effects predictor variables
pred_vars = [col for col in data.columns if col not in ['ln_wage', 'idcode', 't']]
R
library(gpboost)
library(tidyverse)
# Load data
data <- read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data <- as.matrix(data) # Convert to matrix since the boosting part does not support data.frames
# Partition into training and test data
n <- dim(data)[1]
set.seed(n)
permute_aux <- sample.int(n, n)
train_idx <- permute_aux[1:as.integer(0.8*n)]
test_idx <- permute_aux[(as.integer(0.8*n)+1):n]
data_train <- data[train_idx,]
data_test <- data[test_idx,]
# Define fixed effects predictor variables
pred_vars <- colnames(data)[-which(colnames(data) %in% c("ln_wage", "idcode", "t"))]
3 Training a GPBoost model
The code below shows how to train a GPBoost model. We first define a random effects model (gp_model), a Dataset (data_bst), and tuning parameters (params and nrounds = number of trees / boosting iterations), and then run the GPBoost algorithm by calling the gpb.train function. Note that we use tuning parameters that have been selected beforehand (see below).
Python
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0}
nrounds = 391
gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
num_boost_round=nrounds)
R
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 10)
nrounds <- 391
gpbst <- gpb.train(data = data_bst, gp_model = gp_model,
nrounds = nrounds, params = params, verbose = 0)
For this data, there is only one high-cardinality categorical variable. If there are multiple categorical variables that should be modeled using random effects, this can be specified in the gp_model as shown in the following code. _Note: computations can become slower when having multiple random effects compared to only one. Also, it can be faster to use nelder_mead instead of gradient_descent (= default) as optimizer when having multiple random effects. This can be set as follows before calling gpb.train._
Python
gp_model = gpb.GPModel(group_data=data_train[['idcode','t']], likelihood='gaussian')
gp_model.set_optim_params(params={"optimizer_cov": "nelder_mead"})
R
gp_model <- GPModel(group_data = data_train[,c("idcode","t")], likelihood = "gaussian")
gp_model$set_optim_params(params = list(optimizer_cov = "nelder_mead"))
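When two grouped random effects are used as above, predictions must also supply the test categories for both grouping variables. The following is a minimal sketch in Python; reusing data_bst, params, and nrounds from above is an assumption made for illustration, since the tuning parameters would ideally be re-chosen for this model.
# Train with the two grouped random effects defined above (idcode and t)
gpbst_2re = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
                      num_boost_round=nrounds)
# group_data_pred must contain the same two grouping columns as group_data
pred_2re = gpbst_2re.predict(data=data_test[pred_vars],
                             group_data_pred=data_test[['idcode', 't']])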
4 Choosing tuning parameters
It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below, we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function. We use a deterministic grid search with the parameter combinations shown in the code below. Note that we tune the max_depth parameter and thus set num_leaves to a large value. We randomly split the training data into 80% inner training data and 20% validation data and use the mean squared error (mse) as a prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (test_neg_log_likelihood = default), which also takes prediction uncertainty into account. _Note: depending on the data set and the grid size, this can take some time (a few minutes on my laptop for this data set). Instead of a deterministic grid search as below, one can also do a random grid search (see num_try_random) or another approach such as Bayesian optimization to speed things up._
Python
# Partition training data into inner training data and validation data
ntrain = data_train.shape[0]
np.random.seed(ntrain)
permute_aux = np.random.permutation(ntrain)
train_tune_idx = permute_aux[0:int(0.8 * ntrain)]
valid_tune_idx = permute_aux[int(0.8 * ntrain):ntrain]
folds = [(train_tune_idx, valid_tune_idx)]
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid = {'learning_rate': [1,0.1,0.01], 'max_depth': [1,2,3,5,10],
'min_data_in_leaf': [10,100,1000], 'lambda_l2': [0,1,10]}
other_params = {'num_leaves': 2**10, 'verbose': 0}
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
# Find optimal tuning parameters
opt_params = gpb.grid_search_tune_parameters(param_grid=param_grid, params=other_params,
num_try_random=None, folds=folds, seed=1000,
train_set=data_bst, gp_model=gp_model,
num_boost_round=1000, early_stopping_rounds=10,
verbose_eval=1, metric='mse')
opt_params
# {'best_params': {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10, 'lambda_l2': 10}, 'best_iter': 391, 'best_score': 0.08507862825877585}
R
# Partition training data into inner training data and validation data
ntrain <- dim(data_train)[1]
set.seed(ntrain)
valid_tune_idx <- sample.int(ntrain, as.integer(0.2*ntrain))
folds <- list(valid_tune_idx)
# Specify parameter grid, gp_model, and gpb.Dataset
param_grid <- list("learning_rate" = c(1,0.1,0.01), "max_depth" = c(1,2,3,5,10),
"min_data_in_leaf" = c(10,100,1000), "lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars], label = data_train[,"ln_wage"])
# Find optimal tuning parameters
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
num_try_random = NULL, folds = folds,
data = data_bst, gp_model = gp_model,
nrounds = 1000, early_stopping_rounds = 10,
verbose_eval = 1, metric = "mse")
opt_params
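As mentioned above, a random grid search can be used instead of a full deterministic one to speed things up. Here is a minimal sketch in Python, reusing param_grid, other_params, folds, gp_model, and data_bst from above; the number of randomly sampled parameter combinations (10) is an arbitrary illustrative choice.
# Random grid search: evaluate only 10 randomly sampled parameter combinations
opt_params_random = gpb.grid_search_tune_parameters(param_grid=param_grid, params=other_params,
                                                    num_try_random=10, folds=folds, seed=1000,
                                                    train_set=data_bst, gp_model=gp_model,
                                                    num_boost_round=1000, early_stopping_rounds=10,
                                                    verbose_eval=1, metric='mse')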
5 Prediction
Predictions can be obtained via the predict function. We need to provide both test predictor variables (data) for the tree ensemble and test categorical variables (group_data_pred) for the random effects model. Below, we make predictions on the left-out test data and calculate the test mean squared error (MSE).
Python
pred = gpbst.predict(data=data_test[pred_vars], group_data_pred=data_test['idcode'],
predict_var=False, pred_latent=False)
y_pred = pred['response_mean']
np.mean((data_test['ln_wage'] - y_pred)**2)
## 0.0866730945477676
R
pred <- predict(gpbst, data = data_test[,pred_vars], group_data_pred = data_test[,"idcode"],
predict_var = FALSE, pred_latent = FALSE)
y_pred <- pred[["response_mean"]]
mean((data_test[,"ln_wage"] - y_pred)^2)
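Besides point predictions, predictive variances can be obtained by setting predict_var=True. The following is a minimal sketch in Python; that the variances are returned under the key 'response_var' is an assumption based on the structure of the prediction dictionary above.
# Predictions with predictive variances on the response scale
pred_unc = gpbst.predict(data=data_test[pred_vars], group_data_pred=data_test['idcode'],
                         predict_var=True, pred_latent=False)
y_pred_mean = pred_unc['response_mean']  # predictive means
y_pred_var = pred_unc['response_var']    # predictive variances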
6 Interpretation
Information on the estimated random effects model can be obtained as follows.
Python
gp_model.summary()
## =====================================================
## Covariance parameters (random effects):
## Param.
## Error_term 0.0703
## idcode 0.0448
## =====================================================
R
summary(gp_model)
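The estimated variance parameters can also be interpreted quantitatively. For instance, the following minimal sketch in Python, using the numbers from the summary above, computes the intraclass correlation, i.e., the share of the variation not explained by the fixed effects that is attributable to the person effect.
# Variance components taken from the summary output above
error_var = 0.0703   # residual error variance (Error_term)
idcode_var = 0.0448  # variance of the idcode random effect
# Intraclass correlation: share of residual variation due to the person effect
icc = idcode_var / (idcode_var + error_var)
print(round(icc, 2))  # approx. 0.39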
The GPBoost library supports several tools for understanding the form of the fixed effects tree ensemble function:
- SHAP values
- SHAP & partial dependence plots (one and two-dimensional)
- SHAP force plots
- Interaction statistics
- Split-based variable importances
Below, we look at SHAP values and a SHAP dependence plot; a short sketch of split-based variable importances follows after the SHAP code.
Python
import shap
shap_values = shap.TreeExplainer(gpbst).shap_values(data_train[pred_vars])
feature_names = [ a + ": " + str(b) for a,b in zip(pred_vars, np.abs(shap_values).mean(0).round(2)) ]
shap.summary_plot(shap_values, data_train[pred_vars], max_display=10, feature_names=feature_names)
shap.dependence_plot("ttl_exp", shap_values, data_train[pred_vars], alpha=0.5)
R
library(SHAPforxgboost)
library(ggplot2)
shap.plot.summary.wrap1(gpbst, X = data_train[,pred_vars], top_n=10) +
ggtitle("SHAP values")
shap_long <- shap.prep(gpbst, X_train = data_train[,pred_vars])
shap.plot.dependence(data_long = shap_long, x = "ttl_exp",
color_feature = "occ_code_7", smooth = FALSE, size = 2) +
ggtitle("SHAP dependence plot for ttl_exp")
7 Further modeling options
7.1 Interaction between categorical variables and other predictor variables
In the above model, there is no interaction between the random effects and the fixed effects predictor variables, i.e., between the high-cardinality categorical variable and the other predictor variables. Such interaction can be modeled by additionally including the categorical variable in the tree ensemble. The following code shows how this can be done. Below, we also calculate the test MSE of this model. The test MSE is slightly smaller compared to a model without such interaction. Note: the tuning parameters used below have been chosen beforehand, but for the sake of brevity, we omit this code.
Python
pred_vars_int = ['idcode'] + pred_vars
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars_int], categorical_feature=[0],
label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
'lambda_l2': 1, 'num_leaves': 2**10, 'verbose': 0}
gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
num_boost_round=482)
pred = gpbst.predict(data=data_test[pred_vars_int], group_data_pred=data_test['idcode'])
np.mean((data_test['ln_wage'] - pred['response_mean'])**2)
## 0.08616085739486774
R
pred_vars_int <- c("idcode", pred_vars)
gp_model <- GPModel(group_data = data_train[,"idcode"], likelihood = "gaussian")
data_bst <- gpb.Dataset(data = data_train[,pred_vars_int], categorical_feature = c(1),
label = data_train[,"ln_wage"])
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 1)
gpbst <- gpb.train(data = data_bst, gp_model = gp_model,
nrounds = 482, params = params, verbose = 0)
pred <- predict(gpbst, data = data_test[,pred_vars_int], group_data_pred = data_test[,"idcode"])
mean((data_test[,"ln_wage"] - pred[["response_mean"]])^2)
7.2 (Generalized) linear mixed effects models
The GPBoost library also supports (generalized) linear mixed effects models (GLMMs). In a GLMM, one assumes that the fixed effects function F() is a linear function instead of a tree ensemble. The code below shows how this can be done. Note that one needs to manually add a column of 1s to include an intercept. With the option params={"std_dev": True} in Python / params = list(std_dev=TRUE) in R, one obtains standard deviations and p-values in the summary function.
Python
X = data_train[pred_vars]
X = X.assign(Intercept = 1)
gp_model = gpb.GPModel(group_data=data_train['idcode'], likelihood="gaussian")
gp_model.fit(y=data_train['ln_wage'], X=X, params={"std_dev": True})
gp_model.summary()
## =====================================================
## Model summary:
## Log-lik AIC BIC
## -6191.97 12493.95 12934.9
## Nb. observations: 22410
## Nb. groups: 4567 (idcode)
## -----------------------------------------------------
## Covariance parameters (random effects):
## Param. Std. dev.
## Error_term 0.0794 0.0008
## idcode 0.0452 0.0014
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
## Param. Std. dev. z value P(>|z|)
## grade 0.0379 0.0025 14.8659 0.0000
## age 0.0032 0.0013 2.4665 0.0136
## ttl_exp 0.0310 0.0012 25.4221 0.0000
## tenure 0.0107 0.0009 11.9256 0.0000
## not_smsa -0.1326 0.0079 -16.8758 0.0000
## south -0.0884 0.0072 -12.3259 0.0000
...
R
X <- cbind(Intercept = rep(1,dim(data_train)[1]), data_train[,pred_vars])
gp_model <- fitGPModel(group_data = data_train[,"idcode"], X = X,
y = data_train[,"ln_wage"], likelihood = "gaussian",
params = list(std_dev=TRUE))
summary(gp_model)
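For comparison with the tree-boosting model, test predictions can also be obtained from the fitted linear mixed effects model. The following is a minimal sketch in Python, under the assumption that GPModel.predict accepts the fixed effects covariates via X_pred and returns the predictive mean under the key 'mu'.
# Test design matrix with an intercept column, matching the training design matrix
X_test = data_test[pred_vars].assign(Intercept=1)
pred_lin = gp_model.predict(group_data_pred=data_test['idcode'], X_pred=X_test)
y_pred_lin = pred_lin['mu']  # predictive means (latent and response means coincide for a Gaussian likelihood)
np.mean((data_test['ln_wage'] - y_pred_lin)**2)  # test MSE of the linear mixed effects model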
8 Conclusion and references
We have demonstrated how tree-boosting combined with random effects can be applied with the Python and R packages of the GPBoost library. There are several more features of the GPBoost library that we did not show here. You may want to have a look at the Python examples and R examples. In Part III of this series, we will show how longitudinal data (aka panel data) can be modeled with the GPBoost library.
- F. Sigrist. Gaussian Process Boosting. The Journal of Machine Learning Research, 23(1):10565–10610, 2022.
- F. Sigrist. Latent Gaussian Model Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1894–1905, 2023.
- https://github.com/fabsig/GPBoost