Evaluating Bayesian Mixed Models in R/Python

Learn what is meant by “posterior predictive checks” and how to visually assess model performance

Eduardo Coronado Sroka
Towards Data Science

--

What can I say, model checking and evaluation are just one of those things you can’t (and shouldn’t) avoid in your model development process (as if this isn’t obvious enough). Yet, I think in many cases we end up chasing performance metrics — say RMSE, misclassification, etc.— rather than understanding how well our model represents the data. This can lead to problems down the line.

So, no matter what framework and metrics you’re using I’d always recommend you keep the following questions in mind:

  1. What aspects of my data and understanding of the problem aren’t being captured by my model?
  2. Which model performs best and how well might it generalize (i.e. work in the future)?

Avoid chasing performance metrics (e.g. RMSE, misclassification, etc.). Understanding how well our model represents the data and our knowledge is also crucial.

In this article, my goal is to guide you through some useful model checking and evaluation VISUAL METHODS for Bayesian models (not your typical RMSE) in both R and Python. I will build upon the example and set of models covered in my previous post, so I recommend you take a quick look before moving forward.

At a glance

Here’s what I’ll cover:

  1. How to check model fit via posterior predictive checks (PPCs)
  2. How to evaluate performance via cross-validation (LOOCV, in a Bayesian setting)
  3. An applied model selection example using these two concepts

Without further ado, let’s travel through the dunes of uncertainty and onto Bayesian lands. (Sorry not sorry for this cheesy metaphor. 😉)


Picking up where we left off

In my previous post our EDA suggested we explore three Bayesian models — a simple linear regression (base model), a random intercept model and a random intercept, random slope model — on simulated website bounce times with the overall goal of determining whether younger people spend more time on a website than older ones across locations. We also ran some visual MCMC diagnostics to check whether we could trust the samples generated by the sampling methods in brms and pymc3.

Thus, the next step in our model development process should be to evaluate each model’s fit to the data given the context, as well as to gauge their predictive performance, with the end goal of selecting the best one.

NOTE: The previous post contains only code snippets. All model fitting and MCMC diagnostics code is found on Github.

Step 3.b Posterior Predictive Checks (PPCs)

Posterior predictive checks are just a fancy way of saying model checking in Bayesian jargon. The name originates from the methods used to assess goodness-of-fit (explained below). Once you’ve fit a model and the MCMC diagnostics show no red flags, you might want to visually explore how well the model fits the data. In a non-Bayesian setting you’d probably look at residual or Normal Q-Q plots (which you can still do for Bayesian models). However, be aware that Bayesian residual plots won’t capture the uncertainty around our inferences since they are based on a single parameter draw.

Bayesian models are generative thus we can simulate values under a model and check whether these resemble those in our original data.

Bayesian models are generative in nature which allows us to simulate datasets under a model and compare these against observed ones. If the model fits well, we expect simulated values to look similar to those in our observed data. If instead our simulated data doesn’t resemble the observed data then we’ve misspecified the model and should be wary of our parameter inferences.

If simulated values don’t resemble those in the observed data, then we can say we have a “lack of fit” or “model misspecification”.

Thus, the most common technique to check model fit is to simulate datasets from the posterior predictive distribution — hence the term posterior predictive checks — which is defined as follows
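In notation (with ỹ a predicted value and θ the model parameters), the posterior predictive distribution is

p( ỹ | y ) = ∫ p( ỹ | θ ) p( θ | y ) dθ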

We are basically marginalizing (integrating) over our posterior inferences to obtain a predicted value ỹ given our observed data y. You can think about this procedure as first simulating some parameter values from your joint posterior (e.g. mean and variance) and then using those as inputs in a likelihood function (e.g. a Normal distribution) to generate predicted samples.

Thus the quality of our posterior inferences will impact how well the posterior predicted samples will represent the observed data.
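To make that procedure concrete, here’s a minimal conceptual sketch in Python. The parameter draws below are simulated stand-ins (not an actual posterior), just to illustrate the plug-into-the-likelihood step:

```python
import numpy as np

rng = np.random.default_rng(2020)
n_obs = 500  # hypothetical number of observed bounce times

# Stand-ins for MCMC draws of a Normal likelihood's mean and standard deviation
mu_draws = rng.normal(loc=250, scale=5, size=4000)
sigma_draws = np.abs(rng.normal(loc=30, scale=2, size=4000))

# For each posterior draw, plug (mu, sigma) into the likelihood and simulate a
# full replicated dataset -> one row of y_rep per posterior draw
y_rep = rng.normal(loc=mu_draws[:, None], scale=sigma_draws[:, None],
                   size=(4000, n_obs))
```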

Let’s dive into some code examples.

Comparing Posterior Densities

First, let’s explore how well the densities of simulated datasets compare to those of observed data.

In R, bayesplot provides a nice built-in function, ppc_dens_overlay, to generate these visualizations.

In Python, PyMC3 also has a built-in plot_ppc function (provided via arviz).
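Here’s a minimal sketch of that workflow with PyMC3 and arviz (rand_intercept_model and trace_ri are placeholder names for the fitted model and trace from the previous post):

```python
import arviz as az
import pymc3 as pm

with rand_intercept_model:
    # Simulate datasets from the posterior predictive distribution
    ppc = pm.sample_posterior_predictive(trace_ri, samples=500)
    # Bundle the trace and simulated datasets into an InferenceData object
    idata_ri = az.from_pymc3(trace=trace_ri, posterior_predictive=ppc)

# Overlay simulated densities (light curves) on the observed density (dark curve)
az.plot_ppc(idata_ri, num_pp_samples=100)
```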

Below we see that simulated data generated from the random intercept model fits the observed data well (i.e. has a similar pattern).

Simulated data (light color) vs observed data (dark color) densities in R (left) and Python (right)

Comparing Posterior Test Statistics

Ok so our simulated data visually seems to resemble the observed density; however, we might also want to know how comparable simulated summary statistics are to those of the observed data. From your good ol’ Stat 101 days you might recall that the typical test statistics are the mean, variance, etc. and you might be tempted to use those. Let’s see what happens when we use the mean with the samples from our base linear regression.

Great! The means of the posterior samples also seem to fit the data — so we should move on, right? Well, wait a second. Let’s look at another summary statistic: skewness.

Uhm, what? What happened?! Well, this is a cautionary tale that illustrates the importance of checking more than just the mean and the variance of the posterior predicted samples. If we had only done this, we might have incorrectly concluded that our model has no lack of fit and that it correctly represents the data while it clearly doesn’t capture its skewness.

In fact, it is recommended to look at summary statistics that are unrelated (also referred to as orthogonal) to any of the parameters you’re updating.

Test statistics related to our model parameters might be unable to capture potential issues between the data and our models.

In our case, our models assume a Gaussian distribution and we are updating mean and variance parameters. So unrelated metrics we could explore are skewness and median values — the former at a population level and the latter at the group level.

In R, we can use two bayesplot functions to generate these plots: ppc_stat and ppc_stat_grouped.

In Python, this isn’t as straightforward but can be achieved with some custom code using pandas, matplotlib, and plotly as follows.
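Here’s a simplified sketch of that idea using scipy and matplotlib (the original also pulls in pandas and plotly; ppc["y"] and bounce_time are assumed names for the simulated draws and the observed outcome):

```python
import matplotlib.pyplot as plt
from scipy.stats import skew

# Skewness of each simulated dataset (one row of ppc["y"] per posterior draw)
sim_skew = skew(ppc["y"], axis=1)
# Skewness of the observed bounce times
obs_skew = skew(bounce_time)

plt.hist(sim_skew, bins=30, alpha=0.6, label="simulated")
plt.axvline(obs_skew, color="black", lw=2, label="observed")
plt.xlabel("skewness")
plt.legend()
plt.show()
```

The county-level medians can be built the same way by reshaping the simulated draws into a long pandas DataFrame and aggregating with groupby("county").median().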

Now, we can see the random intercept model captures the behavior of the observed data. Firstly, based on skewness

Simulated (histogram) vs observed (dark line) skewness statistic in R (left) and Python (right)

and secondly based on county-level median values.

Step 3.c Marginal (Pointwise) Posterior Predictive Checks

In the previous section we covered model checking using the joint posterior predictive density (i.e. considering all the data). However, what if we’re also interested in detecting unusual observations such as outliers and influential points, or checking how well our model is specified across observations? Using the joint distribution makes it really hard to do this. Instead, we can take our model checking to a more granular level based on marginal (i.e. univariate) posterior predictive distributions. One way to obtain these marginal samples is via the leave-one-out cross-validation (LOO CV) predictive density.

“Wait, but that’s so inefficient!” I know, I know. Computing these requires us to refit the model n times (!!), but fret not dear reader, there are ways to approximate these densities without all the refitting. Smart people came up with an elegant approximation method called Pareto Smoothed Importance Sampling (PSIS) which I’ll refer to as LOO-PSIS.

Marginal predictive checks can be particularly useful for detecting outliers and influential points, or for checking overall model calibration (i.e. how well the model is specified).

Model Calibration — identifying model misspecification

In a Bayesian setting, we use some probability theory to test this — specifically, a concept called the probability integral transform (PIT), which basically uses CDF properties to convert continuous random variables into uniformly distributed ones. However, this only holds exactly if the distribution used to generate the continuous random variables is the true distribution.

What does that mean for us? In our case, the fitted model is what’s generating the continuous random variables, so if it is a good approximation of the true bounce time generative distribution, then the set of empirical CDF values we calculate for each LOO marginal density will be approximately uniform. That is, for each univariate density p( ỹ_i | y_-i ) of a left-out point we will compute a CDF value [ p_i = Pr( ỹ_i ≤ y_i | y_-i ) ] and then plot these p_i’s to see if they look uniformly distributed.

If our fitted model is a good approximation of the true data generating distribution, then the PIT converted values will approximately follow a uniform.

This might sound like “voodoo magic” to some so I’ve written a nice side tutorial you can review to get more intuition on why this is true. Feel free to play around with the code and data.

Yay! If you’re still here it means I either didn’t completely lose you with the PIT explanation or I’ve regained your trust (hopefully) with that tutorial. Ok, let’s dive into an example on how to use this diagnostic. As a heads up, I’ll refer to the empirical CDF values as LOO-PIT.

In R, we can use the loo package to generate LOO-PSIS samples and the bayesplot ppc_loo_pit_overlay function to apply the PIT to these values and compare them against samples from a Uniform[0,1].

In Python, you can use the plot_loo_pit function (provided via arviz/PyMC3) as follows.
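A minimal sketch (assuming the idata_ri object built earlier also carries the log-likelihood and observed-data groups, and that the observed variable is named "y"):

```python
import arviz as az

# Apply the PIT to the PSIS-LOO predictive CDF values and overlay
# random Uniform(0,1) samples for visual comparison
az.plot_loo_pit(idata_ri, y="y")
```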

Below we can see the LOO-PIT values (thick curve) from our model plotted against random samples from a standard uniform (thin curves). Without going into details, here’s a general way to interpret these plots:

  1. LOO-PIT values concentrated near 0 or 1 hint at an under-dispersed model (i.e. our pointwise predictive densities are too narrow and expect less variation than found in the data)
  2. LOO-PIT values concentrated near 0.5 hint at an over-dispersed model (i.e. our pointwise predictive densities are too broad and expect greater variation than found in the data)

Looking at our model there isn’t any apparent lack-of-fit since the LOO-PIT values follow plausible uniform samples. However, there seems to be a sticky point near 0.6, which hints at slight over-dispersion (i.e. some marginal densities are too broad compared to the data) and is discussed in the tutorial linked above. If instead the LOO-PIT lines were nowhere near the uniform samples, then you should interpret this as lack-of-fit and explore further modeling options.

LOO-PIT examples for random intercept model in R (left) and Python(right).

(To learn more I recommend you read pages 152–153 of Bayesian Data Analysis 3).

Identify outliers and influential points with the Pareto-k diagnostic

As in any modeling scenario, we’re interested in the effect potential influential data points or outliers could have on our model — i.e. if we were to omit influential points, would they drastically change our posterior predictive density? Looking at observations associated with these points provides hints as to whether and how we might want to modify our modeling strategy. Traditionally, we compare the full-data posterior predictive density with each LOO CV predictive density to identify these points, but this can be computationally expensive.

Luckily, the LOO-PSIS method already computes an empirical estimate of this comparison: the tail shape parameter ( k̂ ) of the generalized Pareto distribution used in the sampler. Intuitively, if the marginal predictive density of a left-out point has a large k̂, it suggests that this point is highly influential. In practice, observations with k̂ values:

  • Less than 0.7 are considered non-influential points with reliable PSIS estimates of the LOO predictive density
  • Between 0.7 and 1 are considered influential and less reliable
  • Greater than 1 are considered highly influential and unreliably estimated by the model

The Pareto-k diagnostic can be useful to identify problematic point(s) in your model (or even across models!).

Let’s work through an example to get some intuition on how to use this to identify these unusual points.

In R, you can use the loo function to compute and extract the Pareto shape estimates and visualize these with ggplot.

In Python, you can use the PyMC3 built-in function loo to generate the pointwise samples and the plot_khat function to visualize these shape estimates.
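A minimal sketch of both steps (again using the assumed idata_ri object):

```python
import arviz as az

# PSIS-LOO with pointwise results, which include the Pareto k-hat estimates
loo_ri = az.loo(idata_ri, pointwise=True)

# Scatter plot of the k-hat values, one point per observation
az.plot_khat(loo_ri)
```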

Since all observations have k̂ < 0.5, none are considered influential in the posterior, given our model is able to characterize them correctly. In other words, our model’s LOO-PSIS estimates are reliable.

Pareto-k diagnostic plots in R (left) and Python (right)

Step 4: Model Evaluation and Comparison

As always, we would like to know the predictive accuracy of our models given some out-of-sample (unknown) data. A common approach to estimate this error is to use some type of cross-validation. Aside from your typical RMSE metric, linear Bayesian models’ performance can be assessed with some additional metrics. For example, the expected log predictive density (ELPD) or the Watanabe–Akaike information criterion (WAIC), which is faster and less computationally expensive than the ELPD. Here I will focus on the ELPD criterion.

Intuitively, ELPD refers to the model’s average predictive performance over the distribution of unseen data (which, as stated above, we can approximate with cross-validation). Regardless of the cross-validation flavor we choose — K-fold or LOO — we can obtain an expected log predictive density (ELPD) value for each out-of-sample datapoint (see here for an intuitive example). In general, higher ELPD values indicate better model performance.
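More formally, following Vehtari et al. (2017), the LOO estimate of the ELPD is the sum of the log leave-one-out predictive densities:

elpd_loo = Σ_i log p( y_i | y_-i ), where p( y_i | y_-i ) = ∫ p( y_i | θ ) p( θ | y_-i ) dθ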

Although it might sound counterintuitive, in a Bayesian setting computing K-fold CV can be less efficient and more complex than LOO CV.

Now, using K-fold might seem like a no-brainer compared to LOO given it generally leads to lower variance (especially when data is correlated), and can be less computationally expensive. However, in a Bayesian setting we have ways to approximate LOO values without having to refit the model multiple times (e.g. the PSIS method discussed above).

Additionally, LOO ELPD estimates have a major advantage over K-fold ELPD values. They allow us to identify points that are hard to predict and can even be used in model comparison to identify which model best captures each out-of-sample observation.

Let’s walk through an example of how to compare the predictive performance of models using overall LOO ELPD and pointwise LOO ELPD values.

In R, we can use loo and loo_compare functions to compare the overall performance of multiple models, as well as the pointwise differences between a pair of models. (Note: loo also provides a kfold function.)

In Python, we can use the PyMC3 compare and loo functions to obtain the same overall and pointwise model comparisons.
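A minimal sketch on the Python side (idata_lr, idata_ri, and idata_rs are assumed InferenceData objects for the three fitted models):

```python
import arviz as az

model_dict = {
    "lin_reg": idata_lr,
    "bayes_rintercept": idata_ri,
    "bayes_rslope": idata_rs,
}

# Rank the models by their estimated LOO ELPD (best model listed first)
comparison = az.compare(model_dict, ic="loo")
print(comparison)
```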

Below are the outputs from the code above. Models with the best overall performance are listed first and have the smallest d_loo and largest elpd_loo. In our case, the random intercept model has the best overall performance.

Comparison of overall model performance output in R (left) and Python (right)

Now, as I mentioned above, we can obtain pointwise ELPD estimates regardless of the CV method of our choosing, which allows us to compare models at a more granular level. The values plotted below are the differences between the ELPD values of the random intercept model and those of the linear regression. Here we see the random intercept model has better overall performance (i.e. most values > 0); however, notice there are certain points where the linear regression performs better.
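On the Python side, a pointwise comparison like this one can be sketched from the pointwise LOO results (assuming loo_ri and loo_lr were computed with pointwise=True as shown earlier):

```python
import matplotlib.pyplot as plt

# Pointwise ELPD differences: positive values favor the random intercept model
elpd_diff = loo_ri.loo_i.values - loo_lr.loo_i.values

plt.scatter(range(len(elpd_diff)), elpd_diff, s=10)
plt.axhline(0, color="black", ls="--")
plt.xlabel("observation index")
plt.ylabel("ELPD pointwise difference")
plt.show()
```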

Pointwise model comparison in R (left) and Python (right)

The ELPD pointwise plots paired with the above Pareto-k diagnostic plots can help identify unusual points.

Putting it all together

Now that we’ve covered all the ins and outs of the ways Bayesian models can be evaluated, let’s apply these techniques to the three models from my previous post.

Model checking

First, we compare the posterior predictive densities of our models. Here we observe that the linear mixed models tend to better fit the observed data.

Similarly, we’d want to look at some posterior test statistics, keeping in mind that statistics related to our model parameters might not be as useful for detecting model misspecification. Below, observe how the mean fails to capture the lack-of-fit of the linear regression while the skewness coefficient clearly shows a misspecification.

We can also look at the county-specific test statistics to identify potential lack-of-fit. Below we observe that the mixed models also tend to capture the county medians (thick lines) better than the linear regression model.

Next, looking at the LOO-PIT diagnostic we see no lack-of-fit for any of our models, given all PIT values (thick lines) follow probable random uniform samples (thin lines). However, as discussed above, there seem to be some sticky points for all models.

The linear regression seems to present under-dispersion (i.e. values concentrating towards 1) while the mixed models present some over-dispersion (i.e. values concentrating towards 0.5). This provides hints that further modeling effort could focus on narrowing the univariate posterior predictive distributions in the mixed models to better capture the uncertainty.

Model Comparison

Now, looking at the predictive performance between models using the ELPD criterion, we see that the random intercept model seems to be the best model (i.e. elpd_diff = 0.0).

##                  elpd_diff se_diff
## bayes_rintercept    0.0       0.0
## bayes_rslope       -1.3       0.3
## lin_reg          -347.8      18.8

We can look at the pointwise differences between the random intercept model and the other two to see where it performs better (i.e. ELPD_rand_int_i − ELPD_other_model_i). The top diagram shows the random intercept model performs better than the linear regression for most points (ELPD pointwise diff > 0), which explains the large difference in model performance above. A similar comparison between the mixed models demonstrates their similar performance, but we can also use this analysis to identify observations that are harder to predict (e.g. those in Dorset county).

Finally, looking at the indices of those hard-to-predict observations in the Pareto-k diagnostic, we can see the random intercept model is able to better resolve these observations (i.e. smaller k̂ values).

Final Model

Congrats, you’ve reached the Bayesian lands! You should feel proud of yourself at this point. Having to tread through all that material can be daunting but hopefully you’ve gained a better understanding and some intuition about how to evaluate Bayesian models.

From the analysis above we see that the best model for this data is the random intercept one. It not only shows good predictive performance but also a good fit for our data. So as a final step, let’s look at our inferences and answer our initial question: “How does age affect bounce time?”

The table below provides a summary of some fixed effects and within-/between-group variance estimates with 95% credible intervals. Two important things to note here:

  • Given the 95% intervals don’t contain 0, we’re confident that all these estimates are non-zero
  • Related to our original question, the effect of age on bounce time is about 1.8. Since we center-scaled our variable, the effect will be positive or negative depending on whether we’re considering deviations above or below the average age.

You can achieve summary tables like these using tidybayes in R or PyMC3’s pm.summary() function in Python.
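For instance, in Python a summary table can be sketched as follows (idata_ri as before; recent arviz versions accept hdi_prob=0.95 for 95% credible intervals):

```python
import arviz as az

# Posterior summary of the fixed effects and variance components
print(az.summary(idata_ri))
```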

A very useful approach for understanding the estimated differences amongst groups is to visualize the random effects. Below we see that cheshire, cumbria, essex, and kent tend to have higher average bounce times while london and norfolk have lower ones. We can’t say that for dorset and devon since their 95% intervals contain zero and thus we can’t be certain the bounce rates in these counties deviate from the population average. One thing to note is how our Bayesian framework is able to capture uncertainty even in counties with smaller sample sizes (i.e. kent and essex).

Using tidybayes in R or PyMC3’s pm.forestplot() function in Python, you can achieve these very nice visuals.
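A sketch of the Python version ("county_offset" is a placeholder for however the county-level random effects are named in your model):

```python
import arviz as az

# Forest plot of the county-level random effects and their credible intervals
az.plot_forest(idata_ri, var_names=["county_offset"], combined=True)
```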

We observe that counties with larger sample sizes have narrower intervals around the fits (e.g. cheshire) while smaller ones have wider intervals (e.g. kent). Finally, we can check the fits against the data.

Using tidybayes in R you can achieve these very nice visuals quite easily; in Python it can be more tedious, but still possible with seaborn and pandas.

Conclusion

Evaluating Bayesian models follows many of the same goodness-of-fit testing and performance comparison steps as for any other model you’d encounter. However, there are tools and concepts that are particularly useful for assessing Bayesian models. Some are quite complex and it can feel daunting to use them if you’re unfamiliar with what their outputs mean. (There’s nothing more distressing than not being able to understand a model’s output summary.)

Luckily there are visual ways of diagnosing model fit, evaluating performance, and even interpreting results from Bayesian models.

I hope this part 2 on Bayesian mixed models has continued to build your intuition about Bayesian modeling such that it becomes a powerful method in your toolset.

References

  1. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
  2. Gabry, J., et al. (2019). Visualization in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(2), 389–402.
  3. Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.
  4. Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.

If you liked the article feel free to share! Comment or tweet (@ecoronado92) if you have any questions, or you see anything incorrect/questionable.

All R code can be found here

All Python code can be found here
