When Mixed Effects (Hierarchical) Models Fail: Pooling and Uncertainty

An intuitive and visual guide on situations where partial-pooling becomes troublesome and Bayesian methods can help quantify uncertainty

Eduardo Coronado Sroka
Towards Data Science

--

In this article I provide an intuitive, visual dive into the foundations of mixed effects (hierarchical) models and the concept of “pooling,” with applied examples. If you’re interested in implementing Bayesian hierarchical models in R / Python, I’ve published step-by-step guides in subsequent articles.

The world is full of nested (grouped) structures and we often impose this structure to simplify our understanding of the world. You’ve been exposed to some of these nested structures since you were young (whether you’ve been aware of it or not). For example, in geography class countries were nested within continents or regions, and states within countries. Similarly, you can think of players within teams, customers within regions, and even departments within a university.

Given the nature of how we tend to structure our world, we commonly encounter data sets with nested structure in analytics (i.e. observations as members of groups).

Mixed effects models are simply an extension of regression models that take into account the impact group membership has on an outcome of interest.

Like any method widely used across disciplines, this one can’t settle on a common name. These models are referred to as “Mixed Models”, “Hierarchical Models”, “Multilevel Models”, “Random Effects Models”, etc. (Why people can’t agree on one name is a mystery to me. 🤷‍♂)

Before you’re tempted to think “Oh great, another article about linear models… 😒” let me tell you this: yes, this covers linear models, but it also addresses a commonly overlooked point — when models DON’T WORK and the consequences this can have.

I focus on linear models for simplicity and will cover generalized mixed models in subsequent publications.

At a glance

Here’s what I’ll cover

  1. A quick intuitive, visual dive into linear mixed effects modeling (highly recommended if you haven’t seen mixed models before, otherwise skim this part but make sure you’re familiar with their distributional forms)
  2. The concept of “pooling” and how it drives the underlying machinery of this method (with animations and an applied example in marketing)
  3. A case where “pooling” is undesired and a discussion of some limitations of Frequentist (i.e. non-Bayesian) approaches

What are Linear Mixed Models (LMM)?

The following section is based on a subset of the sleepstudy data set. The data come from a sleep deprivation study where the observations are the average reaction times, on a series of tests, per day of sleep deprivation for multiple subjects.

Simple Linear Regression

Imagine we’re trying to estimate the effect days of sleep deprivation has on the average reaction times of subjects. We can start with a simple linear model to understand this relationship as follows
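
    react_delay_i = β0 + β1*Days_i + ε_i ,   where ε_i ~ Normal(0, σ²)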

where for each observation i, the days of sleep deprivation (Days, i.e. the independent variable) affect the average reaction time/delay (react_delay, i.e. the dependent variable) through the β parameters. These β parameters are called fixed effects because they are constant for all observations (i.e. the population).

Note that, since linear regression assumes the data are normally distributed, we can also define the model in distributional form: react_delay_i ~ Normal(β0 + β1*Days_i, σ²). This reads as “observations are normally distributed with mean β0 + β1*Days and population variance σ²” (since we assume all observations i have the same variance).

Now, as seen below, sleep deprivation affects each subject differently so using a model that groups all observations together (referred to as complete pooling) doesn’t make sense.

So, how could we model differences between subjects? Well, we can use various flavors of a linear mixed effects model.

1. Varying-intercepts

We might first want to consider the case where subjects have different baselines but the rate (slope) at which their reaction times increase is the same. In this case we expand our previous model to allow the intercepts to vary by subject (subj) as follows
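
    react_delay_i = (β0 + b0,subj) + β1*Days_i + ε_i ,   where b0,subj ~ Normal(0, τ²_0)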

where β0 and β1 are fixed/population effects (constant across all observations) and b0,subj is a random effect that allows the intercept to vary by subject (i.e. to deviate from the population intercept β0). This leads to our first flavor of mixed effects model: a varying-intercept model. In distributional terms, the mean varies for each subject based on b0,subj, which is normally distributed with mean 0 and between-subject variance τ²_0. (This is why the term is considered “random.”)

The figure below provides an example of this model. We can see it gives an OK fit but doesn’t account for slope variation across subjects.

2. Varying-slopes

Now, if we thought reaction time increased differently for each subject, then we might want to let the slopes vary (by subj) and keep the intercept fixed as follows
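
    react_delay_i = β0 + (β1 + b1,subj)*Days_i + ε_i ,   where b1,subj ~ Normal(0, τ²_1)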

where β0 is a fixed (population) intercept and the slopes vary according to β1 (fixed slope) + b1,subj (random slope). This leads to our second flavor of mixed effects model, a varying-slope model. In distributional terms, the mean varies for each subject based on b1,subj*Days, where b1,subj is normally distributed with mean 0 and between-subject variance τ²_1.

The figure below provides an example of this model and we can see it actually fits the data worse than the previous one. Can we combine both approaches? Yes, we can. 🙂

3. Varying-intercepts, varying-slopes

Observing the plots above, it is reasonable to consider the situation where each subject’s baseline reaction time and rate of increase differ from the others.
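
Allowing both terms to vary by subject expands the model as follows

    react_delay_i = (β0 + b0,subj) + (β1 + b1,subj)*Days_i + ε_i ,   where (b0,subj, b1,subj) ~ Normal(0, Σ)

    Σ = | τ_11  τ_12 |
        | τ_21  τ_22 |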

We have both slope and intercept fixed effects (β0 and β1) and random effects (b0,subj and b1,subj). This leads to our last flavor of mixed effects model, a varying-intercept, varying-slope model. In distributional terms, the mean varies for each subject based on b0,subj and b1,subj*Days, but now these two subject-specific random effects are correlated. Thus, the variance component is now a matrix Σ which includes the intercept and slope variances (τ_11 and τ_22, respectively) as well as their covariances (τ_12 and τ_21).

Compared to the previous two plots, we can see that this model fits the data more tightly as it better accounts for subject-specific differences.

Note that this has just been a visual guide to one type of mixed model (linear). In order to implement these, you need to understand the question you’re answering and your data, and be able to carefully assess model performance / fit with additional methods, some of which I’ll cover in the next section.

What is “pooling”? (an applied example)

This section is a summary of a more thorough analysis on GitHub. It uses simulated data based on the following Kaggle dataset. For simplicity the example is coded in R, but you can find Python examples on GitHub as well.

Imagine your company recently revamped its website and the marketing department would like to know how the website is performing. Specifically, they’d like to know how much longer young customers spend on the website versus older ones.

To address this question you’ve collected customer “bounce rates” (bounce_time, in seconds) and ages across 8 different locations (county), where the bounce rate measures the time a customer spends on the website. However, as you can see below, you weren’t able to get the same number of observations for all locations (e.g. one has 150 observations while another has only 7).

Complete pooling (or simple linear regression)

To start, you fit a linear model, given you’re trying to understand the dependence of bounce_time on age. However, before doing so you center and scale your age variable (mean = 0, variance = 1) to improve model stability and simplify later interpretations (i.e. the intercept now corresponds to the expected bounce time for a customer of average age).
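
A minimal sketch of this step in R (assuming, for illustration, a data frame called df with columns bounce_time, age, and county):

    # Center and scale age (mean = 0, sd = 1, hence variance = 1)
    df$age_sc <- scale(df$age)[, 1]

    # Complete pooling: a single intercept and slope for all observations
    pooled_fit <- lm(bounce_time ~ age_sc, data = df)
    summary(pooled_fit)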

This model is considered a complete pooling model given it ignores variation between counties and treats all observations as part of the same group or pool. Thus, it estimates a single slope and intercept (β0=200 and β1=4.69) for all the data. However, given the model shows lack of fit (assessment not shown), you’d like to see if modeling differences at the county level improves performance.

A quick EDA shows that average bounce_time does seem to vary across locations (county) and that the within-county variability isn’t the same for all locations. Thus the independence assumption of simple linear regression isn’t met: observations from the same county are correlated.

No-pooling model (or separate linear regressions)

Instead, you might fit individual regression lines for each county using the lme4 R package. The resulting no-pooling model assumes customers share no similarities across counties and considers each county independently. As a result, we estimate 8 distinct intercepts and slopes (i.e. one per county) as shown below.
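
As a sketch, lme4’s lmList fits a separate linear regression per grouping level (reusing the hypothetical df and scaled age_sc from above):

    library(lme4)

    # No pooling: an independent intercept and slope for each county
    no_pool_fit <- lmList(bounce_time ~ age_sc | county, data = df)
    coef(no_pool_fit)  # 8 intercept/slope pairs, one per county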

This new model seems to have a better fit (not shown), performs better than the previous one, and visually appears to capture each group’s trend.

In addition, the slope estimates are much smaller than before, a behavior that is explained via Simpson’s paradox (i.e. trends estimated at the county or group level can be completely different from globally estimated trends). The animation below illustrates changes between the complete pooling and no-pooling model estimates.

However, concluding that the no-pooling model is best based on a seemingly tighter fit is premature.

As Gelman describes “whereas complete pooling ignores variation between [groups], the no-pooling analysis overstates it [and]… overfits the data within each [group]”. In other words, this model exaggerates the variation between counties and makes them look more distinct when in fact it is plausible (even expected) that customers across counties share some similarities.

Partial-pooling model (or linear mixed effects)

To capture the similarities between counties, you fit a model that falls in between the two extremes (i.e. the complete and no-pooling models). Using R’s lmer function, you fit a linear mixed effects model, again estimating 8 distinct slopes and intercepts.

However, these are estimated via partial-pooling which, briefly, combines information from the population (fixed) effects of the complete pooling model and county-specific (random) effects of the no-pooling one.
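
A sketch of that call (same hypothetical df as before), with a random intercept and a random slope per county:

    library(lme4)

    # Partial pooling: population-level (fixed) intercept and slope,
    # plus county-level (random) deviations from both
    pp_fit <- lmer(bounce_time ~ age_sc + (1 + age_sc | county), data = df)
    summary(pp_fit)
    isSingular(pp_fit)  # TRUE flags the "singular fit" warning discussed below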

Looking at the summary output you see some promising indicators. The random intercept seems to account for most of the variability between counties (≈195), while some population-level (residual) variance remains. (Having both indicates that there are both fixed and random effects in play; if the residual variance were ≈0, then fitting a no-pooling model would’ve been just as good.)

However, you notice a warning message from the function stating that the model fit is likely “singular” (bottom of output above). You might be tempted to ignore this message, seeing that the model has good fit (not shown) and performs on par to the no-pooling setup.

So, why not use this model to answer the marketing department’s question about bounce times?

I’ll tell you why, but first let’s take a quick detour into partial-pooling.

A visual dive into partial-pooling

Partial-pooling is the essence of why mixed effects models are powerful, and I’ve found that, more often than not, people describe what partial-pooling does rather than illustrate it. Partial-pooling represents a compromise between the complete and no-pooling models. Let’s explore what that means.

First, the animation on the left (below) shows in real time why partial-pooling is a “compromise.” We can see how the slope and intercept estimates change across the three models we just covered. With partial-pooling, the county-specific (no-pooling) estimates are drawn towards the population (complete pooling) estimates. But why do some estimates change more than others?

We can answer that with the set of equations on the right, which are color coded to match the animation on the left. Simply put, a partial-pooling estimate (blue) is a weighted average of the terms from the complete (red) and no-pooling models (green), where the weights are defined by variances and group sample sizes. (Note: the equations are taken from a slightly simpler example and are not the exact ones computed in the models above.)
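
For instance, in a varying-intercept model (the simpler setting those equations come from), the partially pooled estimate of county j’s intercept is roughly a precision-weighted average, in the spirit of Gelman and Hill (2007):

    α̂_j ≈ [ (n_j/σ²_y)*ȳ_j + (1/τ²)*ȳ_all ] / [ (n_j/σ²_y) + (1/τ²) ]

where ȳ_j is county j’s sample mean (the no-pooling term), ȳ_all is the overall mean (the complete pooling term), n_j is county j’s sample size, σ²_y is the within-county variance, and τ² is the between-county variance.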

People refer to groups as “carrying” more or less information based on how group sample size and/or variance enter this equation. For example, in counties with smaller sample sizes n_j (i.e. n_j → 0) we see that the left term gets down-weighted. As a result, the estimates are pulled, or “shrunk,” closer to the overall population average, which is beneficial since we generally don’t want to estimate parameters from few data points. The effect is the opposite for groups with larger sample sizes — we’re more confident about the estimates of group-level effects in these data.

In essence what we see is smaller groups “borrow information” from larger groups and get their estimates shrunk towards the population values.

When partial-pooling goes wrong

Let’s dive back into the website bounce rates example.

Given your strong (or newly developed) understanding of the problem at hand and how partial-pooling can lead to better estimates in this case, you choose the mixed effects model to answer how age affects bounce rates across counties. Partial-pooling seems to have “shrunk” estimates and confidence intervals as intended (see below). Notice something funny about the slope estimates — the confidence intervals have almost disappeared! How is that possible? What happened?

1. Vanishing confidence intervals

Remember the “singular fit” warning I previously mentioned? Well, that happened. Simply put, this warning tells us that the model is too complex for the data to support. In our case, in counties with smaller sample sizes, the random intercept accounts for most of the variability so when we try to fit a county-specific slope there isn’t enough information left to do so.

We could simplify our model and remove the random slope, but what if we believe there might be some merit to having county-specific slopes? We can include both components using a Bayesian approach (see next section).

2. Unintended Shrinkage (cautionary tale)

Before moving on it is important to note a crucial lesson about how partial-pooling/shrinkage might lead to unintended consequences. In 2016, George et al. showed that partial pooling can lead to artificially low predicted mortality rates for smaller hospitals. As seen in the figures below, the predicted mortality rates (y-axis) have been “artificially shrunk” for hospitals with smaller numbers of beds (x-axis). That is to say, the model is systematically down-weighting the observed (higher) mortality in smaller hospitals. In this scenario, partial-pooling from a standard mixed effects model has potentially nightmarish public health and legal consequences. This under-prediction of the observed mortality rate could affect patients down the line. 😱

Adapted from George et al (AMSTAT, 2016)

In their paper George et al. provide extensions to the mixed effects model that can help limit unwanted effects of partial-pooling.

Bayes to the rescue

If you’re not very familiar with Bayesian inference I recommend Aerin Kim’s amazing article.

Also, given that fitting a Bayesian model is a conversation in itself (e.g. choosing priors, diagnostics, etc.) this section will only focus on results. Future articles will cover Bayesian fits in more detail.

Bayesian mixed effects methods are powerful alternatives to the more commonly used Frequentist approaches (as above). In particular, they can account for uncertainty when estimating group-level effects and, with the help of weakly informative priors, provide stable estimates for groups with smaller sample sizes.

Let’s imagine you’ve recently learned that a Bayesian approach can help you deal with the “singular fit” problem in your model. Plus, you’ve heard from a colleague that there is a great R package (brms) that provides an easy-to-use formula builder to fit Bayesian mixed models with Stan. After selecting your priors (covered in future installments), you fit the same mixed effects model as before.
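
A sketch of the equivalent brms call (the priors below are illustrative placeholders, not the ones chosen in the full analysis):

    library(brms)

    # Same varying-intercept, varying-slope model, now fit via MCMC with Stan
    bayes_fit <- brm(
      bounce_time ~ age_sc + (1 + age_sc | county),
      data = df,
      prior = c(
        prior(normal(0, 10), class = "b"),  # weakly informative prior on the slope
        prior(cauchy(0, 5), class = "sd")   # weakly informative prior on group-level SDs
      ),
      seed = 42
    )
    summary(bayes_fit)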

This new model seems to have a good fit (not shown) and posterior predictive checks give you confidence in your estimates and associated errors. More importantly, you notice that the 95% credible intervals for the Essex, Kent, London, and Norfolk slopes include 0 (i.e. a zero slope remains plausible for these counties).

Visually, you can see that the posterior draws for each county’s trend line (left) fit the data, and that this new approach marginally improved the model’s performance (right) in terms of RMSE.

Most importantly, your estimates still utilize partial-pooling but this time without the vanishing error bounds we saw earlier. This is because you introduced “information” via your weakly informative priors. This gets to one idea at the heart of Bayesian inference, which combines information in the observed data with information from specified prior distributions (which we can often think of as baseline estimates). Posterior estimates are always theoretically possible (unlike our “singular fit” example) but with insufficient data, these estimates default to resembling our priors.

As a final remark, I’m not saying Bayesians > Frequentists, or that you should drop everything and become a Bayesian. As in any situation, the method you choose should reflect an understanding of the problem at hand. Here I fit a Bayesian model to illustrate how it can help deal with vanishing error bounds. However, I could instead have simplified the model (see the rintercept_only RMSE table above). It all depends on what questions you’re trying to answer.

Conclusions

Here are some final thoughts to consider

  1. Mixed effects models are powerful methods that help us account for complex, nested structures in our data
  2. Keep in mind that, depending on the context/problem, partial-pooling can hurt you rather than help you
  3. Consider using a Bayesian framework when appropriate (I’ll cover this in detail in the next article)

References

  1. Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/hierarchical Models. Cambridge: Cambridge University Press, 2007. Print.
  2. Gelman, Andrew. “Multilevel (hierarchical) modeling: what it can and cannot do.” Technometrics 48.3 (2006): 432–435.
  3. Gabry, Jonah, et al. “Visualization in Bayesian workflow.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 182.2 (2019): 389–402.
  4. Bürkner, Paul-Christian. brms: Bayesian Regression Models using Stan. https://paul-buerkner.github.io/brms/

If you liked the article feel free to share! Comment or tweet (@ecoronado92) if you have any questions or you see anything incorrect/questionable.

I’d like to give special thanks to Kim Roche who has diligently helped me proofread this article.

All code and notebooks used in this article can be found here
