Tree-Boosting for Spatial Data

GPBoost: Combining Tree-Boosting and Gaussian Process Models

Fabio Sigrist
Towards Data Science

--

Figure 1: comparison of true and predicted spatial field and predictor function.

Spatial data is often modeled using Gaussian process models. In many applications, there are additional predictor variables besides the spatial locations. Typically, these covariates are included in a linear regression term in the Gaussian process model (known as “universal kriging”). But you might wonder whether the linearity assumption really holds. As an alternative, you might apply a state-of-the-art machine learning technique such as (gradient) tree-boosting. However, this might not fully satisfy you either since, first, predictions are discontinuous over space and, second, the machine learning algorithm completely ignores the special structure of physical space.

This article shows how tree-boosting can be combined with Gaussian process models for modeling spatial data using the GPBoost algorithm. Background is provided on both the methodology and on how to apply the GPBoost library in R and Python. It is shown how (i) models are trained and predictions are made, (ii) parameters are tuned, and (iii) models are interpreted. Further, a short comparison to several alternative approaches is made.

Introduction

Gaussian processes are popular models for modeling spatial data due to several advantages:

  • They allow for making probabilistic predictions
  • They can be seen as reasonable prior models which incorporate Tobler’s first law of geography (“everything is related to everything else, but near things are more related than distant things.”) and they explicitly model spatial correlation among samples

However, if there are additional predictor variables, standard spatial Gaussian process models also have a potentially severe drawback: they assume a linear relationship between the predictor variables and the response variable.

Tree-boosting is a machine learning technique that often achieves superior predictive accuracy. This is reflected in statements such as “In general ‘boosted decision trees’ is regarded as the most effective off-the-shelf non-linear learning method for a wide range of application problems” (Johnson and Zhang, 2013) or “Why Does XGBoost Win “Every” Machine Learning Competition?” (Nielsen, 2016). Apart from this, tree-boosting with its well-known software implementations such as XGBoost, LightGBM, and CatBoost has several other advantages; see e.g. this blog post. When applying standard boosting algorithms to spatial data, one essentially has two options to account for the fact that the data has been observed over space: (i) ignore the spatial coordinates or (ii) treat the spatial locations as “regular” predictor variables by adding them to the set of predictor variables. It goes without saying that the first option is often a bad choice as potentially important information is lost. However, when modeling spatial data by including the locations in the predictor variables, plain tree-boosting algorithms also have a few potential drawbacks:

  • They ignore any potential (residual) spatial correlation
  • They produce discontinuous spatial predictions
  • They ignore the reasonable prior information that “near things are more related than distant things” (see above)

Methodological background

The idea of the GPBoost algorithm (Sigrist, 2020) is to combine tree-boosting with Gaussian process models in such a way that one can (i) leverage the advantages of both techniques and (ii) avoid the drawbacks mentioned above. Roughly speaking, Gaussian processes and tree-boosting are each used for modeling the parts they are good at. In more detail, it is assumed that

y = F(X) + b(s) + e

where

  • y is the response variable (aka label)
  • X contains the predictor variables and F() is a potentially non-linear function
  • b(s) is a Gaussian process and s denotes the spatial locations
  • e is an independent error term (the distributional assumptions used in this article are written out after this list)
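
In the example used in this article, b(s) has an exponential covariance function and e is independent Gaussian noise, i.e. (using the parameter names from the simulation code in the appendix):

b(s) ~ GP(0, c(s, s')), with c(s, s') = sigma2_1 * exp(-||s - s'|| / rho)
e ~ N(0, sigma2), independent of b(s)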

As in classical tree-boosting algorithms, it is assumed that the function F() is an ensemble of trees. In standard Gaussian process models, the function F() is simply assumed to be linear. Further, when applying a standard tree-boosting algorithm, one would simply include the locations s in the non-linear function: F(X,s). The GPBoost algorithm is a boosting algorithm that iteratively learns the covariance parameters (aka hyperparameters) of the Gaussian process and adds a tree to the ensemble of trees using a gradient and/or a Newton boosting step which accounts for the spatial correlation. See Sigrist (2020) for more details on the methodology. In the GPBoost library, covariance parameters can be learned using (Nesterov accelerated) gradient descent or Fisher scoring (aka natural gradient descent), and trees are learned using the LightGBM library.
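
To make this more concrete, below is a heavily simplified, self-contained sketch of the idea for the Gaussian case using a plain gradient step. It is not the GPBoost library's implementation (which, as mentioned, uses Nesterov-accelerated gradient descent or Fisher scoring and LightGBM trees); the function names (exp_cov, neg_log_lik, gpboost_sketch), the use of scikit-learn trees, and the Nelder-Mead optimizer are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist
from sklearn.tree import DecisionTreeRegressor

def exp_cov(coords, log_params):
    # Exponential covariance plus nugget: sigma2_1 * exp(-d / rho) + sigma2 * I
    sigma2_1, rho, sigma2 = np.exp(log_params)
    D = cdist(coords, coords)
    return sigma2_1 * np.exp(-D / rho) + sigma2 * np.eye(len(coords))

def neg_log_lik(log_params, y, F, coords):
    # Gaussian negative log-likelihood of the residuals y - F (up to a constant)
    Sigma = exp_cov(coords, log_params)
    r = y - F
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return 0.5 * r @ alpha + np.sum(np.log(np.diag(L)))

def gpboost_sketch(X, y, coords, n_iter=50, lr=0.1, max_depth=3):
    F = np.full(len(y), y.mean())         # initialize the ensemble with the mean
    log_params = np.log([1.0, 0.1, 1.0])  # log(sigma2_1), log(rho), log(sigma2)
    trees = []
    for _ in range(n_iter):
        # (1) update the covariance parameters given the current ensemble F
        log_params = minimize(neg_log_lik, log_params,
                              args=(y, F, coords), method="Nelder-Mead").x
        # (2) boosting step that accounts for the spatial correlation:
        #     fit a tree to the negative gradient Sigma^{-1} (y - F) and add it
        Sigma = exp_cov(coords, log_params)
        pseudo_response = np.linalg.solve(Sigma, y - F)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_response)
        F = F + lr * tree.predict(X)
        trees.append(tree)
    return trees, np.exp(log_params), F

For simplicity, this sketch re-optimizes the covariance parameters in every boosting iteration; see Sigrist (2020) for how this is done efficiently in the actual algorithm.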

Applying the GPBoost algorithm in R and Python

In the following, it is shown how the GPBoost algorithm can be applied in both R and Python. The complete code used in this article can be found here. Also, more information on the GPBoost library and the corresponding R and Python packages can be found here.

Data

We simulate data y = F(X) + b(s) + e as follows. b(s) follows a Gaussian process with an exponential covariance function. X contains two predictor variables, of which only one has a non-linear effect. This choice is made so that the learned and true functions F() can be easily visualized and compared. Further, we use a relatively small sample size of 200. For larger data sets, the GPBoost library supports Vecchia approximations to speed up computations (see the sketch after the figure below). The following figure illustrates the simulated data. The appendix of this article contains the full code for simulating the data in R and Python.

Illustration of data: mean function F(X) used for simulation, “true” latent Gaussian process b, and observed data y (bottom plots).
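
As mentioned above, the GPBoost library supports Vecchia approximations for larger data sets. The following is only a sketch of how this can be enabled when defining the GPModel in Python; the argument names gp_approx and num_neighbors are assumptions on my part and depend on the GPBoost version, so please check the library documentation.

import gpboost as gpb
# Hedged sketch: enable a Vecchia approximation for large data sets.
# The arguments 'gp_approx' and 'num_neighbors' are assumptions and may
# be named differently in other GPBoost versions.
gp_model = gpb.GPModel(gp_coords=coords, cov_function="exponential",
                       gp_approx="vecchia", num_neighbors=20)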

Training and prediction

Training is done by first defining a GPModel and then passing it to the gpboost or gpb.train function. Predictions are obtained by calling the predict function and passing the predictor variables for the tree ensemble and the prediction locations for the Gaussian process. Note that the predictions for the tree ensemble and the Gaussian process are returned separately; that is, one needs to sum them to obtain a single point prediction. The following code shows how this can be done in R and Python. Figure 1 at the beginning of this article shows the predicted means and standard deviations (=“prediction uncertainty”) of the Gaussian process as well as the predicted mean function F(X).

In R

library(gpboost)
gp_model <- GPModel(gp_coords = coords,
                    cov_function = "exponential")
# Training
bst <- gpboost(data = X, label = y, gp_model = gp_model,
               nrounds = 247, learning_rate = 0.01,
               max_depth = 3, min_data_in_leaf = 10,
               num_leaves = 2^10,
               objective = "regression_l2", verbose = 0)
summary(gp_model) # Estimated covariance parameters
# Make predictions: latent variables and response variable
pred <- predict(bst, data = X_test, gp_coords_pred = coords_test,
                predict_var = TRUE, pred_latent = TRUE)
# pred[["fixed_effect"]]: predictions from the tree ensemble
# pred[["random_effect_mean"]]: predicted means of the gp_model
# pred[["random_effect_cov"]]: predicted (co-)variances of the gp_model
pred_resp <- predict(bst, data = X_test,
                     gp_coords_pred = coords_test,
                     pred_latent = FALSE)
y_pred <- pred_resp[["response_mean"]] # predicted response mean
# Calculate mean square error
MSE_GPBoost <- mean((y_pred - y_test)^2)

In Python

import gpboost as gpb
import numpy as np
gp_model = gpb.GPModel(gp_coords=coords, cov_function="exponential")
data_train = gpb.Dataset(X, y)
params = {'objective': 'regression_l2', 'learning_rate': 0.01,
          'max_depth': 3, 'min_data_in_leaf': 10,
          'num_leaves': 2**10, 'verbose': 0}
# Training
bst = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=247)
gp_model.summary() # Estimated covariance parameters
# Make predictions: latent variables and response variable
pred = bst.predict(data=X_test, gp_coords_pred=coords_test,
                   predict_var=True, pred_latent=True)
# pred['fixed_effect']: predictions from the tree ensemble
# pred['random_effect_mean']: predicted means of the gp_model
# pred['random_effect_cov']: predicted (co-)variances of the gp_model
pred_resp = bst.predict(data=X_test, gp_coords_pred=coords_test,
                        predict_var=False, pred_latent=False)
y_pred = pred_resp['response_mean'] # predicted response mean
# Calculate mean square error
MSE_GPBoost = np.mean((y_pred - y_test)**2)
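
As noted above, the latent predictions for the tree ensemble and the Gaussian process are returned separately. A single point prediction is obtained by adding the two components, e.g., in Python with the pred dictionary from the code above:

# Point prediction of the latent mean: tree ensemble + Gaussian process
y_pred_latent = pred['fixed_effect'] + pred['random_effect_mean']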

Model interpretation

A trained model can be interpreted using various tools. Besides classical feature importance measures and partial dependence plots, SHAP values and dependence plots can be created as shown below. The figures that we obtain correctly identify the importance and the form of the effect of the first predictor variable (“Feature 0”).

In R

library("SHAPforxgboost")
shap.plot.summary.wrap1(bst, X = X)
shap_long <- shap.prep(bst, X_train = X)
shap.plot.dependence(data_long = shap_long, x = "Covariate_1",
                     color_feature = "Covariate_2", smooth = FALSE)

Note that for the R package, you need gpboost version 0.4.3 or later for creating these SHAP plots. Further, for the SHAPforxgboost package, the data matrix X needs to have column names. If this is not the case, add them prior to training the booster model using the following code:

your_colnames <- paste0("Covariate_", 1:2)
X <- matrix(as.vector(X), ncol=ncol(X),
            dimnames=list(NULL, your_colnames))

In Python

import shap
shap_values = shap.TreeExplainer(bst).shap_values(X)
shap.summary_plot(shap_values, X)
shap.dependence_plot("Feature 0", shap_values, X)
Figures: SHAP values (summary plot) and SHAP dependence plot.

Parameter tuning

Boosting with trees as base learners has several tuning parameters. Arguably the most important one is the number of boosting iterations (= number of trees). Other tuning parameters include the learning rate, the maximal tree depth, the minimal number of samples per leaf, the number of leaves, and others such as an L2 regularization on the leaf values and an L0 penalty on the number of leaves. Usually, smaller learning rates lead to more accurate models. However, it is advisable to also try larger learning rates (e.g., 1 or larger): with gradient boosting, the scale of the gradient can depend on the loss function and the data, and even a large number of boosting iterations (say 1000) might not be enough for small learning rates. This is in contrast to Newton boosting, where learning rates smaller than 0.1 are used since the natural gradient is not scale-dependent.

These tuning parameters can, for instance, be chosen using a random or deterministic grid search and k-fold cross-validation, as shown in the following. A computationally cheaper alternative to full k-fold cross-validation is to pass a validation data set to the parameter tuning function. See here (for R) and here (for Python) for more details. Note that you need GPBoost version 0.4.3 or later for using the parameter tuning functions (gpb.grid.search.tune.parameters in R and gpb.grid_search_tune_parameters in Python).
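
The following Python code sketches the validation-set alternative using early stopping during training. The argument names (valid_sets, early_stopping_rounds, use_gp_model_for_validation, set_prediction_data) follow the LightGBM-style API of recent GPBoost versions and should be treated as assumptions; X_val, y_val, and coords_val denote a hypothetical held-out validation set.

# Hedged sketch: choose the number of boosting iterations with a validation
# set and early stopping instead of cross-validation. Argument names are
# assumptions and may differ across GPBoost versions.
data_train = gpb.Dataset(X, y)
data_valid = gpb.Dataset(X_val, y_val, reference=data_train)
gp_model = gpb.GPModel(gp_coords=coords, cov_function="exponential")
gp_model.set_prediction_data(gp_coords_pred=coords_val)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=1000, valid_sets=data_valid,
                early_stopping_rounds=5, use_gp_model_for_validation=True)
print("Best number of iterations: " + str(bst.best_iteration))

The cross-validation-based grid search is shown below in R and Python.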

In R

# Create random effects model and datasets
gp_model <- GPModel(gp_coords = coords,
                    cov_function = "exponential")
dtrain <- gpb.Dataset(data = X, label = y)
# Candidate parameter grid
param_grid <- list("learning_rate" = c(1, 0.1, 0.01),
                   "min_data_in_leaf" = c(1, 10, 100),
                   "max_depth" = c(1, 3, 5, 10))
# Other parameters not contained in the grid of tuning parameters
params <- list(objective = "regression_l2", verbose = 0,
               "num_leaves" = 2^10)
# Use random grid search and cross-validation.
# Set 'num_try_random = NULL' to use deterministic grid search
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(
  param_grid = param_grid, params = params,
  num_try_random = 20, nfold = 4,
  data = dtrain, gp_model = gp_model,
  verbose_eval = 1,
  nrounds = 1000, early_stopping_rounds = 5,
  metric = "l2")
# Found the following parameters:
# ***** New best score (0.397766105827059) found for the following parameter combination: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 3, nrounds: 247

In Python

gp_model = gpb.GPModel(gp_coords=coords, cov_function="exponential")
data_train = gpb.Dataset(X, y)
# Candidate parameter grid
param_grid = {'learning_rate': [1, 0.1, 0.01],
              'min_data_in_leaf': [1, 10, 100],
              'max_depth': [1, 3, 5, 10]}
# Other parameters not contained in the grid of tuning parameters
params = {'objective': 'regression_l2', 'verbose': 0,
          'num_leaves': 2**17}
opt_params = gpb.grid_search_tune_parameters(
    param_grid=param_grid, params=params,
    num_try_random=20, nfold=4,
    gp_model=gp_model, train_set=data_train,
    verbose_eval=1,
    num_boost_round=1000, early_stopping_rounds=5,
    seed=1, metric='l2')
# 'metric' was called 'metrics' in earlier versions of gpboost
print("Best number of iterations: " + str(opt_params['best_iter']))
print("Best score: " + str(opt_params['best_score']))
print("Best parameters: " + str(opt_params['best_params']))

Comparison to alternative approaches

In the following, we compare the GPBoost algorithm to the following alternative approaches:

  • A linear Gaussian process model (‘Linear GP’) where F(X) is a linear function
  • Standard gradient tree-boosting with the spatial coordinates included in F() as regular predictor variables (‘L2Boost’)
  • A random forest with the spatial coordinates included as regular predictor variables

We choose tuning parameters using cross-validation and measure predictive accuracy using the root mean square error (RMSE) on the test data. R code for producing these results can be found in the appendix below. The results are shown in the table below.

Comparison of prediction accuracy of GPBoost and alternative approaches (RMSE on the test data):

  GPBoost:   0.628
  Linear GP: 1.143
  L2Boost:   0.745
  RF:        0.757

We see that GPBoost outperforms the alternative approaches in terms of predictive accuracy. Note that, for simplicity, we do only one simulation run; see Sigrist (2020) for a much more detailed comparison. Except for the random forest, all computations are done using the GPBoost library version 0.7.5. Further, we use the randomForest R package version 4.6-14.

Conclusions

The GPBoost library allows for combining Gaussian process models and tree-boosting. In the examples above, we have assumed that there are no interactions between the spatial locations s and the other predictor variables X. This can be easily relaxed, for instance, by also including the spatial locations s in the predictor function F(X,s) and/or by additionally including the other non-spatial predictor variables X in the Gaussian process b(X,s).
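
For instance, including the locations in the tree ensemble as well could look as follows in Python (a sketch based on the training code above; X_ext and data_train_ext are hypothetical names):

import numpy as np
# Sketch: allow for interactions between space and the other covariates by
# also passing the coordinates to the tree ensemble, i.e. F(X, s), while
# keeping the Gaussian process part b(s)
X_ext = np.column_stack((X, coords))
data_train_ext = gpb.Dataset(X_ext, y)
gp_model = gpb.GPModel(gp_coords=coords, cov_function="exponential")
bst = gpb.train(params=params, train_set=data_train_ext,
                gp_model=gp_model, num_boost_round=247)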

In summary, if you use Gaussian process models for modeling spatial data, you should investigate whether the linearity assumption is indeed appropriate. The GPBoost algorithm allows for relaxing this assumption in a flexible manner. It may help you to find non-linearities and interactions and to achieve higher predictive accuracy. If you are a frequent user of boosting algorithms and want to apply them to spatial data, GPBoost (which extends LightGBM) can make learning more efficient and result in higher predictive accuracy. In general, the GPBoost algorithm can also be applied to non-spatial data sets where tree-boosting and Gaussian process regression should be combined. Further, the GPBoost library also allows for combining tree-boosting with (grouped) random effects models.

Hopefully, you have found this article useful. More information on GPBoost can be found in the companion article (Sigrist, 2020) and on GitHub.

References

Johnson, R., & Zhang, T. (2013). Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 942-954.

Nielsen, D. (2016). Tree boosting with XGBoost: Why does XGBoost win "every" machine learning competition? (Master's thesis, NTNU).

Sigrist, F. (2020). Gaussian Process Boosting. arXiv preprint arXiv:2004.02653.

Appendix

R code for simulating data

set.seed(1)
# Simulate Gaussian process: training and test data (the latter on a grid for visualization)
cov_function <- "exponential"
sigma2_1 <- 0.35 # marginal variance of GP
rho <- 0.1 # range parameter
sigma2 <- 0.1 # error variance
n <- 200 # number of training samples
nx <- 50 # test data: number of grid points on each axis
# training locations (exclude upper right rectangle)
coords <- matrix(runif(2)/2,ncol=2, dimnames=list(NULL,paste0("s_",1:2)))
while (dim(coords)[1] < n) {
  coord_i <- runif(2)
  if (!(coord_i[1] >= 0.6 & coord_i[2] >= 0.6)) {
    coords <- rbind(coords, coord_i)
  }
}
# test locations (rectangular grid)
s_1 <- s_2 <- rep((1:nx)/nx, nx)
for (i in 1:nx) s_2[((i-1)*nx+1):(i*nx)] <- i/nx
coords_test <- cbind(s_1 = s_1, s_2 = s_2)
n_all <- nx^2 + n # total number of data points
D <- as.matrix(dist(rbind(coords_test, coords))) # distance matrix
if (cov_function == "exponential") {
  Sigma <- exp(-D/rho) + diag(1E-10, n_all)
} else if (cov_function == "gaussian") {
  Sigma <- exp(-(D/rho)^2) + diag(1E-10, n_all)
}
C <- t(chol(Sigma))
b_all <- sqrt(sigma2_1)*C%*%rnorm(n_all)
b_train <- b_all[(nx^2+1):n_all] # training data GP
# Mean function. Use two predictor variables of which only one has an effect for easy visualization
f1d <- function(x) sin(3*pi*x) + (1 + 3 * pmax(0,x-0.5)/(x-0.5)) - 3
X <- matrix(runif(2*n), ncol=2, dimnames=list(NULL,paste0("Covariate_",1:2)))
F_X_train <- f1d(X[,1]) # mean
xi_train <- sqrt(sigma2) * rnorm(n) # simulate error term
y <- F_X_train + b_train + xi_train # observed data
# test data
x <- seq(from=0,to=1,length.out=nx^2)
x[x==0.5] = 0.5 + 1E-10
X_test <- cbind(Covariate_1=x, Covariate_2=rep(0,nx^2))
F_X_test <- f1d(X_test[,1])
b_test <- b_all[1:nx^2]
xi_test <- sqrt(sigma2) * rnorm(nx^2)
y_test <- F_X_test + b_test + xi_test

Python code for simulating data

import numpy as np
np.random.seed(1)
# Simulate Gaussian process: training and test data (the latter on a grid for visualization)
sigma2_1 = 0.35 # marginal variance of GP
rho = 0.1 # range parameter
sigma2 = 0.1 # error variance
n = 200 # number of training samples
nx = 50 # test data: number of grid points on each axis
# training locations (exclude upper right rectangle)
coords = np.column_stack(
    (np.random.uniform(size=1)/2, np.random.uniform(size=1)/2))
while coords.shape[0] < n:
    coord_i = np.random.uniform(size=2)
    if not (coord_i[0] >= 0.6 and coord_i[1] >= 0.6):
        coords = np.vstack((coords, coord_i))
# test locations (rectangular grid)
s_1 = np.ones(nx * nx)
s_2 = np.ones(nx * nx)
for i in range(nx):
    for j in range(nx):
        s_1[j * nx + i] = (i + 1) / nx
        s_2[i * nx + j] = (i + 1) / nx
coords_test = np.column_stack((s_1, s_2))
n_all = nx**2 + n # total number of data points
coords_all = np.vstack((coords_test, coords))
D = np.zeros((n_all, n_all)) # distance matrix
for i in range(0, n_all):
    for j in range(i + 1, n_all):
        D[i, j] = np.linalg.norm(coords_all[i, :] - coords_all[j, :])
        D[j, i] = D[i, j]
Sigma = sigma2_1 * np.exp(-D / rho) + np.diag(np.zeros(n_all) + 1e-10)
C = np.linalg.cholesky(Sigma)
b_all = C.dot(np.random.normal(size=n_all))
b_train = b_all[(nx*nx):n_all] # training data GP
# Mean function. Use two predictor variables of which only one has an effect for easy visualization
def f1d(x):
    return np.sin(3*np.pi*x) + (1 + 3 * np.maximum(np.zeros(len(x)), x-0.5)/(x-0.5)) - 3
X = np.random.rand(n, 2)
F_X_train = f1d(X[:, 0]) # mean
xi_train = np.sqrt(sigma2) * np.random.normal(size=n) # simulate error term
y = F_X_train + b_train + xi_train # observed data
# test data
x = np.linspace(0,1,nx**2)
x[x==0.5] = 0.5 + 1e-10
X_test = np.column_stack((x,np.zeros(nx**2)))
F_X_test = f1d(X_test[:, 0])
b_test = b_all[0:(nx**2)]
xi_test = np.sqrt(sigma2) * np.random.normal(size=(nx**2))
y_test = F_X_test + b_test + xi_test

R code for comparison of alternative approaches

# 1. Linear Gaussian process model
# Add an intercept term to the model matrix
X1 <- cbind(Intercept=rep(1,n),X)
gp_model <- fitGPModel(gp_coords = coords, cov_function = "exponential",
                       y = y, X = X1)
X_test1 <- cbind(rep(1,dim(X_test)[1]),X_test)
print("Fitted linear Gaussian process model:")
summary(gp_model)
y_pred_linGP <- predict(gp_model, gp_coords_pred = coords_test,X_pred = X_test1)
MSE_lin <- mean((y_pred_linGP$mu-y_test)^2)
print(paste0("MSE of linear Gaussian process model: ", MSE_lin)) # 1.306599
# 2. Gradient tree-boosting with a squared loss
# Add the coordinates to the predictor variables
XC <- cbind(X,coords)
X_testC <- cbind(X_test,coords_test)
# Choose tuning parameters
dtrain <- gpb.Dataset(data = XC, label = y)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(
  param_grid = param_grid,
  params = params,
  num_try_random = NULL,
  nfold = 4,
  data = dtrain,
  verbose_eval = 1,
  nrounds = 1000,
  early_stopping_rounds = 5,
  eval = "l2")
# Found the following parameters:
# ***** New best score (0.450492863373992) found for the following parameter combination: learning_rate: 0.05, min_data_in_leaf: 20, max_depth: 10, nrounds: 176
bst <- gpboost(data = XC, label = y,
               nrounds = 176, learning_rate = 0.05,
               max_depth = 10, min_data_in_leaf = 20,
               objective = "regression_l2", verbose = 0)
pred_l2boost <- predict(bst, data = X_testC)
MSE_l2boost <- mean((pred_l2boost-y_test)^2)
print(paste0("MSE of plain gradient tree-boosting: ", MSE_l2boost)) # 0.555046
# 3. Random forest
library(randomForest)
colnames(XC) <- colnames(X_testC) <- paste0("V",rep(1:dim(XC)[2]))
# Choose tuning parameters
library(caret)
control <- trainControl(method="cv", number=4, search="grid")
tunegrid <- expand.grid(.mtry=c(1:4))
set.seed(1)
rf_gridsearch <- train(y~., data=data.frame(y=y,XC), method="rf", metric="RMSE", tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
# Found the following parameters: mtry = 3
set.seed(1)
rf <- randomForest(y ~ ., data = XC, ntree=1000, mtry=3)
# plot(rf) # check "convergence"
pred_rf <- predict(rf, newdata = X_testC) ## predicted labels
MSE_rf <- mean((pred_rf-y_test)^2)
print(paste0("MSE of random forest: ", MSE_rf)) # 0.5730892
# Compare root mean square errors of different methods
RMSEs <- sqrt(c(GPBoost=MSE_GPBoost, Lin_GP=MSE_lin, L2Boost=MSE_l2boost, RF=MSE_rf))
print(RMSEs)
## GPBoost Lin_GP L2Boost RF
## 0.6279240 1.1430655 0.7450141 0.7570265


Fabio Sigrist is Professor of Applied Statistics and Data Science at Lucerne University of Applied Sciences and Arts.