Mixed Effects Machine Learning with GPBoost for Grouped and Areal Spatial Econometric Data

A demo using European GDP data

Fabio Sigrist
Towards Data Science


European GDP data: spatial effect and SHAP dependence plot for predictor variable ‘edu’ — Image by author

The GPBoost algorithm extends linear mixed effects and Gaussian process models by replacing the linear fixed effects function with a non-parametric, non-linear function modeled using tree-boosting. This article shows how the GPBoost algorithm, as implemented in the GPBoost library, can be used for modeling data with a spatial and grouped structure. We demonstrate the functionality of the GPBoost library using European GDP data, which is an example of areal spatial econometric data.

Table of contents

Introduction
· · Basic workflow of GPBoost
· · Data description
· · Data loading and short visualization
Training a GPBoost model
Choosing tuning parameters
Model interpretation
· · Estimated random effects model
· · Spatial effect map
· · Understanding the fixed effects function
Extensions
· · Separate random effects for different time periods
· · Interaction between space and fixed effects predictor variables
· · Large data
· · Other spatial random effects models
· · (Generalized) linear mixed effects and Gaussian process models

Introduction

Basic workflow of GPBoost

Applying a GPBoost model (= combined tree-boosting and random effects / GP models) involves the following main steps:

  1. Define a GPModel in which you specify the following:
    - A random effects model (e.g., spatial random effects, grouped random effects, combined spatial and grouped, etc.)
    - The likelihood (= distribution of the response variable conditional on fixed and random effects)
  2. Create a gpb.Dataset with the response variable and fixed effects predictor variables
  3. Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
  4. Train the model with the gpboost / gpb.train function
  5. Interpret the trained model and/or make predictions

In this article, we demonstrate these steps using a real-world data set. We also show how to interpret fitted models, and we look at several extensions and additional features of the GPBoost library. All results are obtained using GPBoost version 1.2.1. This demo uses the R package, but the corresponding Python package provides the same functionality.
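As a compact overview, the following is a minimal, self-contained sketch of these five steps in R with simulated placeholder data (the simulated data and all parameter values are illustrative assumptions; the remainder of this article goes through the same steps with real data):

library(gpboost)
# Minimal workflow sketch with simulated placeholder data
set.seed(1)
n <- 500
X <- matrix(rnorm(2 * n), ncol = 2, dimnames = list(NULL, c("x1", "x2"))) # fixed effects predictors
group <- rep(1:25, each = n / 25)               # grouping variable for the random effects
y <- X[, 1] + rnorm(25)[group] + 0.1 * rnorm(n) # simulated response variable
gp_model <- GPModel(group_data = group, likelihood = "gaussian") # 1. define random effects model
boost_data <- gpb.Dataset(data = X, label = y)                   # 2. dataset with fixed effects predictors
# 3. tuning parameters would be chosen with gpb.grid.search.tune.parameters (see below)
bst <- gpb.train(data = boost_data, gp_model = gp_model,         # 4. train
                 nrounds = 100, params = list(learning_rate = 0.1))
pred <- predict(bst, data = X, group_data_pred = group)          # 5. predict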

Data description

The data used in this demo is European gross domestic product (GDP) data. It can be downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The data was collected by Massimo Giannini, University of Rome Tor Vergata, from Eurostat and kindly provided to Fabio Sigrist for a talk at the University of Rome Tor Vergata on June 16, 2023.

Data was collected for 242 European regions for the two years 2000 and 2021, i.e., the total number of data points is 484. The response variable is (log) GDP / capita. There are four predictor variables:

  • L: (log) share of employment (empl/pop)
  • K: (log) fixed capital / population
  • pop: (log) population
  • edu: share of tertiary education

Further, there are centroid spatial coordinates of every region (longitude and latitude), a spatial region ID for the 242 different European regions, and an additional spatial cluster ID that identifies the cluster to which a region belongs (there are two clusters).

Data loading and short visualization

We first load the data and create a map illustrating the (log) GDP / capita over space. We create two maps: one with all data and one excluding some remote islands. In the commented code below, we also show how to create a spatial plot when no shape file with spatial polygons is available.

library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)

# Load data
data <- read.csv("https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv")
FID <- data$FID
data <- as.matrix(data[,names(data)!="FID"]) # convert to matrix since the boosting part currently does not support data.frames
covars <- c("L", "K", "pop", "edu")
# Load shape file for spatial plots
cur_tempfile <- tempfile()
download.file(url = "https://raw.githubusercontent.com/fabsig/GPBoost/master/data/shape_European_regions.zip", destfile = cur_tempfile)
out_directory <- tempfile()
unzip(cur_tempfile, exdir = out_directory)
shape <- st_read(dsn = out_directory)

# Create spatial plot of GDP
data_plot <- merge(shape, data.frame(FID = FID, y = data[,"y"]), by="FID")
p1 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="GDP / capita (log)", option = "B") +
  ggtitle("GDP / capita (log)")
# Sample plot excluding islands
p2 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="GDP / capita (log)", option = "B") +
  xlim(2700000,6400000) + ylim(1500000,5200000) +
  ggtitle("GDP / capita (log) -- excluding islands")
grid.arrange(p1, p2, ncol=2)

# # Spatial plot without a shape file
# p1 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
#                                GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
#   geom_point(size=2, alpha=0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log)")
# p2 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
#                                GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
#   geom_point(size=3, alpha=0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10,28) + ylim(35,67)
# grid.arrange(p1, p2, ncol=2)

log GDP / capita for 242 European regions — Image by author

Training a GPBoost model

In the following, we use a Gaussian process model with an exponential covariance function to model spatial random effects. Additionally, we include grouped random effects for the cluster variable cl. In the GPBoost library, Gaussian process random effects are defined by the gp_coords argument and grouped random effects via the group_data argument of the GPModel constructor. The above-mentioned predictor variables are used in the fixed effects tree-ensemble function. We fit the model using the gpboost, or equivalently the gpb.train, function. Note that we use tuning parameters that are selected below using cross-validation.

gp_model <- GPModel(group_data = data[, c("cl")],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 0)
nrounds <- 37
# gp_model$set_optim_params(params = list(trace=TRUE)) # To monitor hyperparameter estimation
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
                         nrounds = nrounds, params = params,
                         verbose = 1) # same as gpb.train

Choosing tuning parameters

It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Below we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function. We use the mean squared error (mse) as prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (test_neg_log_likelihood, the default if nothing is specified), which also takes prediction uncertainty into account. Depending on the data set and the grid size, this can take some time. Instead of a deterministic grid search as below, one can also do a random grid search to speed things up (see num_try_random).

gp_model <- GPModel(group_data = data[, "cl"],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
param_grid <- list("learning_rate" = c(1, 0.1, 0.01),
                   "min_data_in_leaf" = c(10, 100, 1000),
                   "max_depth" = c(1, 2, 3, 5, 10),
                   "lambda_l2" = c(0, 1, 10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, nfold = 4,
                                              data = boost_data, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"
opt_params
# ***** New best test score (l2 = 0.0255393919591794) found for the following parameter combination: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37
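As a sketch of the random grid search variant mentioned above, one can set num_try_random to the number of randomly sampled parameter combinations to evaluate (the value 10 below, out of the 135 combinations in the full grid, is an illustrative assumption):

# Random grid search: evaluate only 10 randomly sampled parameter combinations
# out of the full grid (faster, but possibly less accurate than a full search)
set.seed(1)
opt_params_random <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                                     num_try_random = 10, nfold = 4,
                                                     data = boost_data, gp_model = gp_model,
                                                     nrounds = 1000, early_stopping_rounds = 10,
                                                     verbose_eval = 1, metric = "mse")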

Model interpretation

Estimated random effects model

Information on the estimated random effects model can be obtained by calling the summary function on the gp_model. For Gaussian processes, GP_var is the marginal variance, i.e., the amount of spatial correlation or structured spatial variation, and GP_range is the range, or scale, parameter that measures how fast correlation decays over space. For an exponential covariance function, three times this value (approx. 17 here) is the distance at which the (residual) spatial correlation is essentially zero (below 5%). As the results below show, the amount of spatial correlation is relatively small since the marginal variance of 0.0089 is small compared to the total variance of the response variable, which is approx. 0.29. Additionally, the error term and the cl grouped random effects also have small variances (0.013 and 0.022). We thus conclude that most of the variance in the response variable is explained by the fixed effects predictor variables.

summary(gp_model)
## =====================================================
## Covariance parameters (random effects):
##             Param.
## Error_term  0.0131
## Group_1     0.0221
## GP_var      0.0089
## GP_range    5.7502
## =====================================================
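As a quick check of the "three times the range" statement: for an exponential covariance function, the correlation at distance d is exp(-d / GP_range), so it falls below 5% at a distance of about -log(0.05) ≈ 3 times the range:

# Effective range for the exponential covariance function
GP_range <- 5.7502                # estimated range parameter from the summary above
-log(0.05) * GP_range             # distance where correlation drops to 5%: approx. 17.2
exp(-(3 * GP_range) / GP_range)   # correlation at three times the range: exp(-3) approx. 0.0498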

Spatial effect map

We can plot the estimated (“predicted”) spatial random effects at the training data locations by calling the predict function on the training data; see the code below. Such a plot shows the spatial effect after factoring out the effect of the fixed effects predictor variables. Note that these spatial effects take into account both the spatial Gaussian process and the grouped region cluster random effects. If one wants to obtain only the spatial random effects from the Gaussian process part, one can use the predict_training_data_random_effects function (see the commented code below). Alternatively, one can also do spatial extrapolation (=“Kriging”), but this does not make much sense for areal data.

pred <- predict(gpboost_model, group_data_pred = data[1:242, c("cl")],
                gp_coords_pred = data[1:242, c("Long", "Lat")],
                data = data[1:242, covars], predict_var = TRUE, pred_latent = TRUE)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean), by="FID")
plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Spatial effect", option = "B") +
  ggtitle("Spatial effect (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# Note: random_effect_cov contains predictive variances; take square roots for standard deviations
data_plot <- merge(shape, data.frame(FID = FID, y = sqrt(pred$random_effect_cov)), by="FID")
plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Std. dev.", option = "B") +
  ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu, plot_sd, ncol=2)

# Only spatial effects from the Gaussian process ignoring the grouped random effects
# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,1]), by="FID")
# plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
#   scale_fill_viridis(name="Spatial effect (mean)", option = "B") +
#   ggtitle("Spatial effect from Gaussian process (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# data_plot <- merge(shape, data.frame(FID = FID, y = sqrt(rand_effs[,2])), by="FID")
# plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
#   scale_fill_viridis(name="Uncertainty (std. dev.)", option = "B") +
#   ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# grid.arrange(plot_mu, plot_sd, ncol=2)

Spatial effect (after factoring out the predictor variables) — Image by author

Understanding the fixed effects function

There are several tools for understanding the form of the fixed effects function. Below we consider variable importance measures, interaction measures, and dependence plots. Specifically, we look at

  • SHAP values
  • SHAP dependence plots
  • Split-based variable importance
  • Friedman’s H-statistic
  • Partial dependence plots (one and two-dimensional)
  • SHAP force plot

As the results below show, the information obtained from SHAP values and SHAP dependence plots is the same as that from traditional variable importance measures and partial dependence plots. The most important variables are ‘K’ and ‘edu’. From the dependence plots, we conclude that there are non-linearities. For instance, the effect of K is almost flat for small and large values of K and increasing in between. Further, the effect of edu is steeper for small values of edu and flattens for larger values of edu. Friedman’s H-statistic indicates that there are some interactions. For the two variables with the largest amount of interaction, L and pop, we create a two-dimensional partial dependence plot below. Further, the SHAP force plot shows that the effect of the predictor variables differs between the years 2000 and 2021.

# SHAP values
library(SHAPforxgboost)
shap.plot.summary.wrap1(gpboost_model, X = data[,covars]) + ggtitle("SHAP values")

SHAP values — Image by author

# SHAP dependence plots
shap_long <- shap.prep(gpboost_model, X_train = data[,covars])
shap.plot.dependence(data_long = shap_long, x = covars[2],
                     color_feature = covars[4], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for K")
shap.plot.dependence(data_long = shap_long, x = covars[4],
                     color_feature = covars[2], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for edu")

SHAP dependence plots for K and edu — Image by author

# SHAP force plot
shap_contrib <- shap.values(gpboost_model, X_train = data[,covars])
plot_data <- cbind(shap_contrib$shap_score, ID = 1:dim(data)[1])
shap.plot.force_plot(plot_data, zoom_in=FALSE, id = "ID") +
  scale_fill_discrete(name="Variable") + xlab("Regions") +
  geom_vline(xintercept=242.5, linewidth=1) + ggtitle("SHAP force plot") +
  geom_text(x=100,y=-1.2,label="2000") + geom_text(x=400,y=-1.2,label="2021")

SHAP force plot — Image by author

# Split-based feature importances
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Split-based variable importances")

Variable importances — Image by author

# H-statistic for interactions (takes a few seconds)
library(flashlight)
fl <- flashlight(model = gpboost_model, data = data.frame(y = data[,"y"], data[,covars]),
                 y = "y", label = "gpb",
                 predict_fun = function(m, X) predict(m, data.matrix(X[,covars]),
                                                      gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),
                                                      group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),
                                                      pred_latent = TRUE)$fixed_effect)
plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +
  ggtitle("H interaction statistic")

H interaction statistic — Image by author

# Partial dependence plots
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 2, xlab = covars[2],
                            ylab = "gdp", main = "Partial dependence plot")
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 4, xlab = covars[4],
                            ylab = "gdp", main = "Partial dependence plot")

Partial dependence plots for K and edu — Image by author

# Two-dimensional partial dependence plot (to visualize interactions)
i <- 1; j <- 3 # i vs j
gpb.plot.part.dep.interact(gpboost_model, data[,covars], variables = c(i,j), xlab = covars[i],
                           ylab = covars[j], main = "Pairwise partial dependence plot")

Two-dimensional partial dependence plot for pop and L — Image by author

Extensions

Separate random effects for different time periods

In the above model, we used the same random effects for both years 2000 and 2021. Alternatively, one can also use independent spatial and grouped random effects for different time periods (independent under the prior; conditional on the data, there is dependence). In the GPBoost library, this can be done via the cluster_ids argument. cluster_ids needs to be a vector whose length equals the sample size, and every entry indicates the “cluster” to which an observation belongs. Different clusters then have independent spatial and grouped random effects, but the hyperparameters (e.g., marginal variance, variance components, etc.) and the fixed effects function are the same across clusters.

Below we show how to fit such a model and create two separate spatial maps. As the results show, the spatial effects are quite different for the two years 2000 and 2021. In particular, there is less (residual) spatial variation (or heterogeneity) for the year 2021 compared to 2000. This is confirmed by the standard deviations of the predicted spatial random effects, which are almost twice as large for 2000 as for 2021 (see below). A further conclusion is that in 2000, there were more regions with “low” GDP numbers (spatial effects below 0), and this is no longer the case in 2021.

gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential",
                    cluster_ids = c(rep(1,242), rep(2,242)))
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would have to be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
# Separate spatial maps for the years 2000 and 2021
pred <- predict(gpboost_model, group_data_pred = data[, c("cl")],
                gp_coords_pred = data[, c("Long", "Lat")],
                data = data[, covars], predict_var = TRUE, pred_latent = TRUE,
                cluster_ids_pred = c(rep(1,242), rep(2,242)))
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[1:242]), by="FID")
plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2000 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[243:484]), by="FID")
plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2021 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu_2000, plot_mu_2021, ncol=2)
sd(pred$random_effect_mean[1:242])   # 0.2321267
sd(pred$random_effect_mean[243:484]) # 0.1286398

Separate spatial effects for the years 2000 and 2021 — Image by author

Interaction between space and fixed effects predictor variables

In the above model, there is no interaction between the random effects and the fixed effects predictor variables. I.e., there is no interaction between the spatial coordinates and the fixed effects predictor variables. Such interaction can be modeled by additionally including the random effects input variables (= the spatial coordinates or the categorical grouping variable) in the fixed effects function. The code below shows how this can be done. As the variable importance plot below shows, the coordinates are not used in the tree-ensemble, and there is thus no such interaction detectable for this data set.

gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
covars_interact <- c("Long", "Lat", covars) # add spatial coordinates to fixed effects predictor variables
boost_data <- gpb.Dataset(data = data[, covars_interact], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would have to be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Variable importances when including coordinates in the fixed effects")

Variable importances when including coordinates in the tree-ensemble — Image by author

Large data

For large data sets, calculations with Gaussian processes are slow, and one has to use an approximation. The GPBoost library implements several of them. For instance, setting gp_approx = "vecchia" in the GPModel constructor will use a Vecchia approximation. The data set used in this article is relatively small, and we can do all calculations exactly.
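For illustration, below is a minimal sketch of such a model definition (the value num_neighbors = 20, which controls the accuracy-versus-speed trade-off of the Vecchia approximation, is an illustrative choice):

# Sketch: Gaussian process with a Vecchia approximation for large data
# (larger num_neighbors values are more accurate but slower)
gp_model <- GPModel(gp_coords = data[, c("Long", "Lat")], cov_function = "exponential",
                    likelihood = "gaussian", gp_approx = "vecchia", num_neighbors = 20)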

Other spatial random effects models

Above, we used a Gaussian process to model spatial random effects. Since the data is areal data, another option is to use models that rely on neighborhood information, such as CAR and SAR models. These models are currently not implemented in the GPBoost library (they might be added in the future; contact the author).

Another option is to use a grouped random effects model defined by the categorical region ID variable for modeling spatial effects. The code below shows how this can be done in GPBoost. However, this model essentially ignores the spatial structure.

gp_model <- GPModel(group_data = data[, c("group", "cl")], likelihood = "gaussian")

(Generalized) linear mixed effects and Gaussian process models

(Generalized) linear mixed effects and Gaussian process models can also be run in the GPBoost library. The code below shows how this can be done with the fitGPModel function. Note that one needs to manually add a column of 1’s to include an intercept.

X_incl_1 <- cbind(Intercept = rep(1, dim(data)[1]), data[, covars])
gp_model <- fitGPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                       likelihood = "gaussian", cov_function = "exponential",
                       y = data[, "y"], X = X_incl_1, params = list(std_dev = TRUE))
summary(gp_model)
## =====================================================
## Model summary:
##  Log-lik     AIC     BIC
##   195.97 -373.94 -336.30
## Nb. observations: 484
## Nb. groups: 2 (Group_1)
## -----------------------------------------------------
## Covariance parameters (random effects):
##             Param. Std. dev.
## Error_term  0.0215    0.0017
## Group_1     0.0003    0.0008
## GP_var      0.0126    0.0047
## GP_range    7.2823    3.6585
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
##            Param. Std. dev. z value P(>|z|)
## Intercept 16.2816    0.4559 35.7128  0.0000
## L          0.4243    0.0565  7.5042  0.0000
## K          0.6493    0.0212 30.6755  0.0000
## pop        0.0134    0.0097  1.3812  0.1672
## edu        0.0100    0.0009 10.9645  0.0000
## =====================================================


Fabio Sigrist is Professor of Applied Statistics and Data Science at Lucerne University of Applied Sciences and Arts.