
Generalized Linear Mixed Effects Models in R and Python with GPBoost

An introduction and comparison with ‘lme4’ and ‘statsmodels’

Fabio Sigrist
Towards Data Science
7 min read · Jun 22, 2021


Figure 1: Comparison of gpboost, lme4, and statsmodels: estimation time and mean squared error (MSE) of coefficients and variance parameters of a single-level random effects model for varying numbers of samples

GPBoost is a recently released C++ software library that, among other things, allows for fitting generalized linear mixed effects models in R and Python. This article shows how this can be done using the corresponding R and Python gpboost packages. Further, we compare gpboost with the lme4 R package and the statsmodels Python package. In simulated experiments, we find that gpboost is considerably faster than the lme4 R package (more than 100 times faster in some cases). Disconcertingly, the statsmodels Python package often produces very inaccurate estimates.

Introduction: Generalized Linear Mixed Effects Models (GLMMs)

Generalized linear mixed effects models (GLMMs) assume that a response variable y follows a known parametric distribution p(y|mu) and that a parameter mu of this distribution (often the mean) is related to the sum of so-called fixed effects Xb and random effects Zu:

y ~ p(y|mu)

mu = f( Xb + Zu )

  • y is the response variable (aka label, dependent variable)
  • Xb are the fixed effects, X is a matrix with predictor variables (aka features, covariates), b are coefficients
  • Zu are the random effects, where u are assumed to follow a multivariate normal distribution and Z is a matrix that relates u to the samples
  • f() is a response function (the inverse of the link function) which ensures that mu = f( Xb + Zu ) lies in the proper range (for instance, for binary data, the mean must be between 0 and 1)

What distinguishes a GLMM from a generalized linear model (GLM) is the presence of the random effects Zu. Random effects can consist of, for instance, grouped (aka clustered) random effects with a potentially nested or crossed grouping structure. As such, random effects can also be seen as an approach for modeling high-cardinality categorical variables. Further, random effects can consist of Gaussian processes used, for instance, for modeling spatial data. Compared to using fixed effects only, random effects have the advantage that a model can be estimated more efficiently when, e.g., the number of groups or categories is large relative to the sample size. Linear mixed effects models (LMMs) are a special case of GLMMs in which p(y|mu) is Gaussian and f() is simply the identity.
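To make this concrete, the following minimal Python sketch (all names and values are illustrative, not part of GPBoost) simulates data from a binary GLMM with a single grouped random effect and a probit response function f:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, n_groups = 1000, 100

group = rng.integers(0, n_groups, n)  # categorical grouping variable
u = rng.normal(0, 1, n_groups)        # random effects, u ~ N(0, I)
X = rng.normal(0, 1, (n, 2))          # fixed effects covariates
b = np.array([1.0, -1.0])             # coefficients

Zu = u[group]              # Z simply maps every sample to its group's random effect
mu = norm.cdf(X @ b + Zu)  # probit response function: mu lies in (0, 1)
y = rng.binomial(1, mu)    # y ~ Bernoulli(mu)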

Using GPBoost for Modeling GLMMs in R and Python

We briefly demonstrate how the R and Python gpboost packages can be used for inference and prediction with GLMMs. For more details, we refer to the GitHub page, in particular the R and Python GLMM examples.

Installation

The gpboost R and Python packages are available on CRAN and PyPI and can be installed as follows:

Python: pip install gpboost -U

R: install.packages("gpboost", repos="https://cran.r-project.org")

Model estimation

Estimation of GLMMs is a non-trivial task because the likelihood (the quantity that should be maximized) cannot be written in closed form. The current implementation of GPBoost (version 0.6.3) is based on the Laplace approximation. Model estimation in Python and R can be done as follows:

Python

import gpboost as gpb

# Define a model with grouped random effects and a Bernoulli likelihood
gp_model = gpb.GPModel(group_data=group_data, likelihood="binary")
gp_model.fit(y=y, X=X)  # estimate variance parameters and coefficients
gp_model.summary()

R

library(gpboost)

gp_model <- fitGPModel(group_data=group_data, likelihood="binary",
                       y=y, X=X)
summary(gp_model)

where

  • group_data is a matrix or vector with categorical grouping variable(s) specifying the random effects structure. If there are multiple (crossed or nested) random effects, the corresponding grouping variables should be in the columns of group_data
  • y is a vector with response variable data
  • X is a matrix with fixed effects covariate data
  • likelihood denotes the distribution of the response variable (e.g., likelihood="binary" denotes a Bernoulli distribution with a probit link function)

After estimation, the summary() function shows the estimated variance and covariance parameters of the random effects and the fixed effects coefficients b.

Approximate p-values for fixed effects coefficients

Obtaining p-values for fixed effects coefficients in GLMMs is a somewhat murky endeavor. Since the likelihood cannot be calculated exactly in the first place, one has to rely on multiple asymptotic arguments; i.e., p-values can be (very) approximate and should be taken with a grain of salt. However, since the lme4 and statsmodels packages allow for calculating approximate standard deviations and p-values, we also show how this can be done using gpboost, relying on the same approach as lme4. In short, one has to set "std_dev": True (Python) / std_dev=TRUE (R) to calculate approximate standard deviations when fitting the model and then use approximate Wald tests as shown below.

Python

import gpboost as gpb
import numpy as np
from scipy import stats

gp_model = gpb.GPModel(group_data=group_data, likelihood="binary")
gp_model.fit(y=y, X=X, params={"std_dev": True})
coefs = gp_model.get_coef()
z_values = coefs[0] / coefs[1]  # coefficient estimates divided by standard deviations
p_values = 2 * stats.norm.cdf(-np.abs(z_values))  # two-sided Wald test
print(p_values)  # show p-values

R

gp_model <- fitGPModel(group_data=group_data, likelihood="binary",
                       y=y, X=X, params=list(std_dev=TRUE))
coefs <- gp_model$get_coef()
z_values <- coefs[1,] / coefs[2,]  # coefficient estimates divided by standard deviations
p_values <- 2 * exp(pnorm(-abs(z_values), log.p=TRUE))  # two-sided Wald test
coefs_summary <- rbind(coefs, z_values, p_values)
print(signif(coefs_summary, digits=4))  # show coefficients, z-values, and p-values

Prediction

Predictions can be obtained by calling the predict() function as shown below.

Python

pred = gp_model.predict(X_pred=X_test, group_data_pred=group_test,
                        predict_var=True, predict_response=False)
print(pred['mu'])   # predicted latent mean
print(pred['var'])  # predicted latent variance

R

pred <- predict(gp_model, X_pred=X_test, group_data_pred=group_test,
                predict_var=TRUE, predict_response=FALSE)
pred$mu   # predicted latent mean
pred$var  # predicted latent variance

where

  • group_data_pred is a matrix or vector with categorical grouping variable(s) for which predictions are made
  • X_pred is a matrix with fixed effects covariate data for which predictions are made
  • predict_var (boolean) indicates whether predictive variances should be calculated in addition to the mean
  • predict_response (boolean) indicates whether the response variable y or the latent variable Xb + Zu should be predicted; in both cases, the random effects part is also predicted. If group_data_pred contains new, unobserved categories, the corresponding random effects predictions will be 0 (see the sketch below)
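For a binary model, for instance, predict_response=True returns predicted means of y (i.e., probabilities) on the response scale, whereas predict_response=False returns the mean and, optionally, the variance of the latent Xb + Zu. A short Python sketch, assuming a fitted gp_model and test data X_test and group_test as above:

# Latent scale: mean and variance of Xb + Zu
pred_latent = gp_model.predict(X_pred=X_test, group_data_pred=group_test,
                               predict_var=True, predict_response=False)
print(pred_latent['mu'], pred_latent['var'])

# Response scale: predicted means of y (probabilities for a binary likelihood)
pred_response = gp_model.predict(X_pred=X_test, group_data_pred=group_test,
                                 predict_response=True)
print(pred_response['mu'])  # predicted probabilities P(y=1)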

Comparison to lme4 and statsmodels

In the following, we conduct a simulation study to compare gpboost (version 0.6.3) with lme4 (version 1.1-27) and statsmodels (version 0.12.2). The code to reproduce the full simulation study can be found here. We use the default options for all packages. In particular, all packages use the Laplace approximation to approximate the (marginal) likelihood. We evaluate both computational time and the accuracy of variance parameter and fixed effects coefficient estimates, measured in terms of root mean squared error (RMSE). Concerning the latter, we expect to see only minor differences since, in theory, all packages rely on the same statistical methodology and differ only in their specific software implementations.

Simulation setting

As the baseline setting, we use the following model to simulate data:

  1. n=1000 samples
  2. 10 samples per group (i.e., 100 different groups)
  3. 10 fixed effects covariates plus an intercept term
  4. A single-level grouped random effects model
  5. A binary Bernoulli likelihood with a probit link function

We investigate how the results change when varying each of these choices while holding all others fixed. In detail, we vary these choices as follows: (1.) number of samples: 100, 200, 500, 1000, 2000, (2.) number of groups: 2, 5, 10, 20, 50, 100, 200, 500, (3.) number of covariates: 1, 2, 5, 10, 20, (4.) nested and crossed random effects models, and (5.) a Poisson likelihood instead of a binary one. A variance of 1 is used for the random effects. The covariates X are sampled from a normal distribution with mean 0 and variance chosen such that the signal-to-noise ratio between the fixed and random effects is one, and the true regression coefficients are all 1 except for the intercept, which is 0; see the sketch below for one way such data can be generated. For each combination of the above-mentioned model choices, we simulate data 100 times and estimate the corresponding models using the three different software packages. See here for more details on the simulation study. All calculations are run on a laptop with a 2.9 GHz quad-core processor.
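The following Python sketch illustrates the baseline setting (an assumption on our part: with independent covariates and unit coefficients, a covariate variance of 1/p makes Var(Xb) = 1 and thus gives a signal-to-noise ratio of one relative to the random effects variance of 1):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, n_groups, p = 1000, 100, 10  # baseline: 10 samples per group, 10 covariates

b = np.ones(p)                             # true coefficients (the intercept is 0)
X = rng.normal(0, np.sqrt(1 / p), (n, p))  # Var(Xb) = p * (1/p) = 1
group = np.repeat(np.arange(n_groups), n // n_groups)  # 10 samples per group
u = rng.normal(0, 1, n_groups)             # random effects with variance 1
y = rng.binomial(1, norm.cdf(X @ b + u[group]))  # binary probit model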

Results

The results are reported in the five figures below. Note that we plot the results on a logarithmic scale since, e.g., the differences in computational time between lme4 and gpboost are very large. We observe the following. First, statsmodels gives parameter estimates with very large RMSEs, i.e., very inaccurate estimates. This is disconcerting as, theoretically, all three packages should be doing the "same thing". Further, gpboost is considerably faster than lme4, with the difference being larger the higher the dimension of the random effects and the larger the number of fixed effects covariates. For instance, for a binary single-level random effects model with 100 groups, 1000 samples, and 20 covariates, gpboost is on average approximately 600 times faster than lme4. As expected, gpboost and lme4 have almost the same RMSEs as both packages use the same methodology.

Figures (top to bottom): varying number of samples n; varying number of groups; varying number of covariates; different random effects models; Poisson likelihood and varying number of samples

Note: gpboost uses OpenMP parallelization in C++ for operations that can be parallelized. However, the main computational bottleneck is the calculation of a Cholesky factorization, and this operation cannot be parallelized. That is, the large difference in computational time between lme4 and gpboost is not a result of parallelization.

Conclusion

GPBoost is a recently released C++ software library that, among other things, allows for fitting generalized linear mixed effects models in R and Python. As shown above, gpboost is considerably faster than the lme4 R package. Disconcertingly, the statsmodels Python package often produces very inaccurate estimates. Besides the grouped random effects considered in this article, GPBoost also allows for modeling Gaussian processes for, e.g., spatial or temporal random effects, as well as combined grouped random effects and Gaussian process models. Further, GPBoost supports random coefficients such as random slopes or spatially varying coefficients. Finally, apart from LMMs and GLMMs, GPBoost also allows for learning non-linear models, without assuming any functional form for the fixed effects, using tree-boosting; see these two blog posts on combining tree-boosting with grouped random effects and Gaussian processes, or Sigrist (2020) and Sigrist (2021), for more details. A minimal sketch of this combined approach is shown below.
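As a closing illustration, a combined tree-boosting plus random effects model might be trained as follows in Python (a hedged sketch: the boosting parameter values are illustrative, and we assume group_data, X, and y as in the GLMM examples above; see the GPBoost documentation for details):

import gpboost as gpb

# Random effects model as before; the linear fixed effects part Xb is
# replaced by a tree ensemble learned with boosting
gp_model = gpb.GPModel(group_data=group_data, likelihood="binary")
data_train = gpb.Dataset(data=X, label=y)
bst = gpb.train(params={'objective': 'binary', 'learning_rate': 0.01,
                        'max_depth': 3, 'verbose': 0},
                train_set=data_train, gp_model=gp_model,
                num_boost_round=100)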


Fabio Sigrist is Professor of Applied Statistics and Data Science at Lucerne University of Applied Sciences and Arts.