
Applied Bayesian Inference with PyMC3, Part 1

An Introduction to the Conditional World

Milky Way above Maui. Image by author

This story is for those already a little familiar with statistics and Python who are looking to take their skills to the next level. I’ll start with the philosophy and then implement the concepts directly in Python to build a more actionable guide. I’ve read about this world countless times, but until I actually applied it myself I never felt confident in my abilities, so I implore you to find a dataset you’re fascinated by and dive right in!

Prerequisites that will help: familiarity with statistics up to hypothesis testing, beginner to intermediate Python skills.

In this story, I’ll cover the Bayesian mentality, an introduction to probabilistic programming, and a coin flip example with PyMC3 & ArviZ.

Thinking Bayes

When we hear the word ‘statistics’, we are traditionally taught to think in terms of long-run frequencies. Our minds instantly jump to examples like the probability of a coin landing heads, the probability of rolling a 3 on a 6-sided die, or the probability of lightning striking. For example, if you wanted to get the probability of a coin landing heads, you would imagine a world where you flipped the coin an enormous number of times and count how many times it lands heads out of that incredibly large number of flips, or trials. This classical school of thought is called Frequentist, where probability is considered the limit of the relative frequency of an event after many trials. There are scenarios where ‘long-run frequency’ thinking does not make much logical sense, such as the probability of getting an A on a test or the probability of an election outcome. Frequentists navigate these questions by invoking alternative realities and saying that, across all of these realities, the frequency of occurrence determines the probability. Essentially, Frequentists see the parameter (what the test statistic attempts to estimate) as a deterministic, or non-random, variable.

The true probability is approached as n goes to infinity. Image by author
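To make the long-run frequency idea concrete, here is a quick simulation (a minimal sketch of my own, not from the original figure) of how the observed frequency of heads settles toward the true probability as the number of flips grows:

import numpy as np

rng = np.random.default_rng(0)
# simulate a long run of fair coin flips (1 = heads)
flips = rng.binomial(n=1, p=0.5, size=100_000)
# running relative frequency of heads after each flip
running_freq = np.cumsum(flips) / np.arange(1, len(flips) + 1)
for n in [10, 100, 1_000, 100_000]:
    print(f"After {n:>6} flips, frequency of heads: {running_freq[n - 1]:.4f}")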

In contrast to this school of thought is Bayesian, where probability expresses a degree of belief in an event. Since the majority of this piece will be spent in the Bayesian world, let me illustrate an example to introduce this way of thinking.

This example is from behavioral psychologists Daniel Kahneman and Amos Tversky, popularized in Thinking, Fast and Slow (Kahneman, 2011):

"Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality. A meek and tidy soul. He has a need for order and structure, and a passion for detail."

Is Steve more likely to be a librarian or a farmer?

They found that people overwhelmingly chose librarian (90%) over farmer (10%). Regardless of the answer, Kahneman and Tversky found it irrational that people didn’t even consider the proportion of librarians to farmers in the world when making their decision. And it’s less about people actually knowing the true proportion and more about whether they think to estimate it at all. As Grant Sanderson of 3Blue1Brown so eloquently shows, this is better illustrated graphically:

Image by author

Let’s say that, in our estimation, we believe there is roughly 1 librarian for every 20 farmers, so librarians make up about 1/21 of the combined group. Our goal is to find the probability of Steve being a librarian given the description. Say we find evidence that, out of 10 librarians and 200 farmers, 4 librarians and 20 farmers fit that description.

Image by author

So considering our belief, evidence, and goal, we can claim the probability to be:

Image by author

Now let’s say that, after doing more reading on the topic, we find that the proportion of librarians to farmers is not so low. In that case, all we would have to do is update our prior belief (essentially widen the slim blue bar) and recalculate. This is the true essence of the Bayesian philosophy: there is no notion of ‘right’ or ‘wrong’; you use your prior knowledge and new evidence to continually update your understanding of the world.

Image by author

Essentially, Bayesians see the parameter (the quantity behind the distribution that best explains the data) as a stochastic, or random, variable. Here’s how our example maps to terminology in the Bayesian world:

  • Prior: P(H) = 10/210 = 1/21. What we believe about the hypothesis before seeing the evidence.
  • Evidence (aka Marginal Likelihood): P(E) = 24/210 = 4/35. How likely the evidence E is regardless of the hypothesis.
  • Likelihood: P(E|H) = 4/10 = 2/5. Given the hypothesis, how likely is the evidence.
  • Posterior: P(H|E) = 4/24 = 1/6. Given the evidence, how likely is the hypothesis.

Putting it all together, we get:

Image by author

and as we simplify it, we arrive at:

Bayes Theorem. Image by author
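As a sanity check, here is the same arithmetic in plain Python (the variable names are mine, purely for illustration):

# counts from the example above: 10 librarians and 200 farmers,
# of whom 4 and 20 respectively fit Steve's description
librarians, farmers = 10, 200
librarians_fit, farmers_fit = 4, 20
total = librarians + farmers

prior = librarians / total                          # P(H)   = 10/210 ~ 0.048
likelihood = librarians_fit / librarians            # P(E|H) = 4/10   = 0.4
evidence = (librarians_fit + farmers_fit) / total   # P(E)   = 24/210 ~ 0.114

posterior = likelihood * prior / evidence           # P(H|E) = 4/24   ~ 0.167
print(f"P(librarian | description) = {posterior:.3f}")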

By the way, due to this "subjective" nature of arriving at insights, the Bayesian world can be confused with, and easily reduced down to, mere opinion. That is a lazy generalization: all of statistics (Frequentist and Bayesian) requires assumptions, and wherever assumptions are made to go from data to insights, you have a relatively subjective craft.

This speaks to the true value of data: as we gather more and more evidence, say as N approaches infinity, Bayesian results tend to align with Frequentist results, so it can be reasoned that statistical inference is more or less objective for large N. The issue is that we often do not have infinite resources (time, energy, data, compute, etc.), so we have to work with a smaller amount of evidence. In the Frequentist world, this yields results with greater variance and larger confidence intervals. Bayesian analysis can really shine here, as you preserve the uncertainty that highlights the instability of statistical inference on a small dataset. Note that this does not mean Bayesian methods are ‘more accurate’ than Frequentist ones on a small dataset. Both require assumptions to arrive at insights, and the smaller the dataset, the broader the assumptions that need to be made.
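Here is a rough illustration of that point (my own sketch, using the standard normal-approximation confidence interval rather than anything Bayesian): the fewer flips we observe, the wider the interval a Frequentist analysis has to report.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
for n in [15, 150, 1_500, 15_000]:
    flips = rng.binomial(n=1, p=0.5, size=n)
    p_hat = flips.mean()
    # half-width of a 95% normal-approximation confidence interval
    half_width = stats.norm.ppf(0.975) * np.sqrt(p_hat * (1 - p_hat) / n)
    print(f"n = {n:>6}: estimate = {p_hat:.3f}, 95% CI ~ +/- {half_width:.3f}")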


Probabilistic Programming

Remember, our goal is to compute the posterior, ideally on a large amount of data and in an efficient way. Thanks to past struggles and the evolution of the computational era, we now have probabilistic programming languages (PPLs), namely PyMC3. PPLs were designed to draw a clear line between model creation and inference while also making models easier to design and debug. These frameworks allow a variety of numerical methods to be applied as universal inference engines that automate the inference part of the process. With relatively few lines of code, one can create a complex probabilistic model in an efficient and sensible way. For visualizing PyMC3 results, we’ll use the ArviZ library, which is built to complement PyMC3 beautifully, as you will see.

Let’s cover some of the most notable inference engines. A lot of this material is explained incredibly well in Bayesian Analysis with Python (Martin, 2016), so I’ll pull inspiration from there:

Non-Markovian methods:

  • Grid computing
  • Quadratic approximation
  • Variational methods

Markovian methods:

  • Metropolis-Hastings
  • Hamiltonian Monte Carlo/No U-Turn Sampler

Since the Markovian methods, dubbed MCMC (Markov chain Monte Carlo) methods, are amongst the most popular today, I’m going to focus on them here. I’ve found the best way to learn the theory of what these methods are doing is by doing first and backing into the theory, so let’s start with a simple, contrived example of coin flipping.

import pandas as pd
import numpy as np
import pymc3 as pm
import arviz as az
from arviz.plots.plot_utils import xarray_var_iter
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(0)

# the number of samples
N = 15
# establishing the true parameter and generating observed data
theta_real = 0.5
observed_data = stats.bernoulli.rvs(p=theta_real, size=N)
# the number of heads
k = observed_data.sum()
print(observed_data)
print(f"There are {k} heads")
[1 0 1 1 1 1 1 0 0 0 0 1 0 1 1]
There are 9 heads

The above code runs only 15 coin flips with a true probability of heads (theta_real) of 0.5. In most real use cases this is exactly the quantity we would need to estimate. Now that we have our data, let’s cook up our model:

# fit the observed data
with pm.Model() as coin_flip:
    theta = pm.Beta('theta', alpha=1, beta=1)               # prior over the probability of heads
    y = pm.Bernoulli('y', theta, observed=observed_data)    # likelihood of the observed flips

pm.model_to_graphviz(coin_flip)
Image by author

To elaborate on the thought process here, the book Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference (Davidson-Pilon, 2015) offers a phenomenal mental-model guide. The first question you should ask yourself when constructing a Bayesian model is: "How could this data have been generated?"

1. We start by thinking, "What is the best random variable to describe this coin flipping data?" A Binomial random variable is a good candidate because it represents binary decision processes with n independent trials well. A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of such outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the Binomial distribution reduces to the Bernoulli distribution, which is exactly what we use for each observed flip. This is the third line in the code block and represents our likelihood.

2. Next, we think, "Okay, assuming coin flips are Bernoulli/Binomial-distributed, what do I need for that distribution?" Well, it has a parameter theta (θ), the probability of landing heads.

3. Do we know θ? Yes, but let’s assume we don’t. In most real-world cases we won’t, and we’ll need to estimate it. For our example we know it to be 0.5, but we can introduce a naive prior as if we don’t know θ. This is the second line in the code block and represents our prior.

4. What is a good distribution for θ? The Beta is good, as it models things that have a limited range, like 0 to 1. The Beta distribution has two parameters, alpha (α) and beta (β) that determine its shape.

5. Do we know what α and β might be? No. At this point, we could continue and assign a distribution to these parameters, but it’s better to stop once we reach a set level of ignorance (remember we only conducted 15 trials for our evidence); whereas we have a prior belief about θ ("it may gravitate to 0.5 as we approach infinity"), we don’t really have any strong beliefs about α and β. So it’s best to stop modeling here.

6. What is a good value for α and β, then? Beta distribution shapes vary based on these values, but to start we assign α and β to be 1. This is equivalent to a Uniform distribution over [0, 1]. It is a weakly informative prior, as it essentially assumes no prior knowledge of what the probability of flipping heads might be: the parameter could be any value between 0 and 1.

Probability Density Functions for the Beta distribution, Public Domain
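If you’d like to reproduce something like the figure above yourself, a few lines of SciPy and Matplotlib will do it (the particular α and β pairs below are just illustrative choices of mine):

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.linspace(0.01, 0.99, 200)
params = [(1, 1), (2, 2), (5, 2), (0.5, 0.5)]   # a few illustrative shapes

fig, ax = plt.subplots(figsize=(8, 4))
for a, b in params:
    ax.plot(x, stats.beta.pdf(x, a, b), label=f"Beta({a}, {b})")
ax.set(xlabel='theta', ylabel='density', title='Beta prior shapes')
ax.legend();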

This way of thinking, "How could this data have been generated?", is incredibly important. Essentially, we have some assumption about a process in the world and want to model that process, so we want to pick appropriate combinations of distributions to build the model. If we describe our belief about the coin’s probability of heads with a Beta distribution, then pairing a Beta prior with a Bernoulli/Binomial likelihood is an important choice due to conjugacy. If the posterior distribution is in the same probability distribution family as the prior, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. Essentially, if we use a Beta distribution as the prior and a Binomial distribution as the likelihood, we will get a Beta distribution as the posterior.

Conjugacy ensures mathematical tractability of the posterior, which is important given that a common problem in Bayesian statistics is ending up with a posterior we cannot solve analytically. Luckily for us, modern computational methods enable us to solve Bayesian problems whether we choose conjugate priors or not, but it is still important to recognize the value of crafting an appropriate model.
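As a quick sketch of what conjugacy buys us for our coin data (15 flips, 9 heads) with a Beta(1, 1) prior, the closed-form update is two lines; note that .interval() below gives an equal-tailed interval, not an HDI:

import scipy.stats as stats

alpha_prior, beta_prior = 1, 1   # the Beta(1, 1) prior from our model
N, k = 15, 9                     # 15 flips, 9 of them heads

# Beta prior + Bernoulli/Binomial likelihood -> Beta posterior
alpha_post = alpha_prior + k
beta_post = beta_prior + (N - k)
posterior = stats.beta(alpha_post, beta_post)

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"94% equal-tailed interval: {posterior.interval(0.94)}")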

Now let’s push the inference engine button and see what we get:

with coin_flip:
    step = pm.Metropolis()
    trace = pm.sample(10000, step=step, return_inferencedata=True)

PyMC3 will assign the NUTS sampler to continuous parameters like θ by default, but here we explicitly tell it to use the Metropolis (Metropolis-Hastings) inference engine. We also pick the number of samples to draw. Now let’s see how well our model learned the data.

var_names = ["theta"]
lines = list(xarray_var_iter(trace.posterior[var_names].mean(dim=("chain", "draw"))))
az.plot_trace(trace, lines=lines);
Image by author

As we can see, the mean of the posterior is close to 0.5 but definitely seems a little biased. Considering this was just 15 trials, we can imagine how the model would naturally find its way there if we ran more trials! For the plot on the left, what we’re looking for is a relatively smooth kernel density estimate (KDE). It’s not very smooth here, which may be due to a variety of reasons (choice of inference engine, number of observations in the data, number of samples, etc.). For the plot on the right, what we’re looking for is no noticeable pattern. We want it to resemble white noise, with no evident divergences; essentially, we want good mixing of our Markov chains.

Before we go further in our analysis of the posterior distribution, let’s see if we can get a better KDE plot.

with coin_flip:
    step = pm.NUTS()
    trace = pm.sample(10000, step=step, return_inferencedata=True)

var_names = ["theta"]
lines = list(xarray_var_iter(trace.posterior[var_names].mean(dim=("chain", "draw"))))
az.plot_trace(trace, lines=lines);
Image by author

This looks much better! But why? The only difference between our first model and this one is the inference engine. We first used the Metropolis-Hastings inference engine and then tried the Hamiltonian Monte Carlo/No U-Turn Sampler. This page explains their differences (conceptually and mathematically) quite brilliantly, and I’ll attempt to summarize them here:

Metropolis-Hastings enables us to obtain samples from any distribution with probability p(x) given that we can compute at least a value proportional to it. It would be conceptually similar to attempting to measure the shape of a lake floor from a boat by randomly using a long stick to measure distance from a point, and then iteratively moving in directions that resembled greater depth and only those directions. As you can imagine, over a large number of iterations one would have a very accurate depiction of the lake floor’s shape and this is similar in the Bayesian world as well (the lake floor is the posterior distribution).

Hamiltonian Monte Carlo/NUTS arose out of the realization that taking a very large amount of samples could be quite costly (time and computation). This method is essentially the same as Metropolis-Hastings except that instead of proposing random displacements of our boat we do something more clever instead; we move the boat following the curvature of the lake’s bottom. In order to decide where to move next we let a ball roll at the bottom of the lake, starting from our current position. We throw a ball and we let it roll for a short moment, and then we move the boat to where the ball is. Now we accept or reject this step using the Metropolis criteria just as we saw with the Metropolis-Hastings method. Those familiar with Gradient Descent optimized with Adam may find this approach similar.

Like all things, there is a tradeoff to using the cleverer Hamiltonian approach. Each HMC step is more expensive to compute than a Metropolis-Hastings one but the probability of accepting that step is much higher with HMC than with Metropolis. I highly recommend reading the page linked earlier to dive deeper into this concept, but for now let’s march forward.
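To make the Metropolis idea concrete, here is a bare-bones, hand-rolled sketch of the algorithm applied to our coin problem. This is purely illustrative: it is not what PyMC3 runs internally, and the step size and starting point below are arbitrary choices of mine.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

def log_post(theta, k=9, N=15, a=1, b=1):
    """Unnormalized log posterior: Beta(a, b) prior times Binomial likelihood."""
    if not 0 < theta < 1:
        return -np.inf
    return stats.beta.logpdf(theta, a, b) + stats.binom.logpmf(k, N, theta)

theta = 0.3          # arbitrary starting position of the "boat"
samples = []
for _ in range(10_000):
    proposal = theta + rng.normal(scale=0.1)   # random displacement
    # Metropolis acceptance criterion (symmetric proposal)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

# discard the first 1,000 draws as burn-in before summarizing
print(f"Posterior mean from hand-rolled Metropolis: {np.mean(samples[1000:]):.3f}")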

We can also print out a dataframe that shows key metrics on how well our posterior distribution was modeled:

az.summary(trace)
Image by author

By default, we get quite a lot of information: the mean, standard deviation, Highest Density Interval (HDI), and more. One of the quantities returned is the Monte Carlo Standard Error (MCSE) of the mean and standard deviation. This is an estimate of the error introduced by the sampling method, defined as the standard deviation of the chains divided by the square root of their effective sample size; the estimate takes into account that the samples are not truly independent of each other. Another quick quality check for the posterior is the r_hat metric. A rule of thumb is that you want this value to be close to 1.0, as that indicates convergence of the chains; values above about 1.05 are typically a point of concern and worth investigating.

What’s interesting here is that even with just 15 trials, our estimate still has a mean close to 0.5, though definitely higher than we’d like. The r_hat tells us that the model didn’t run into convergence problems, so much of the bias in our estimate comes from the fact that we had so little data.
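These diagnostics can also be pulled out individually with ArviZ if you want to inspect them programmatically; a quick sketch:

rhat = az.rhat(trace)     # convergence diagnostic, want values ~1.0
ess = az.ess(trace)       # effective sample size of the chains
mcse = az.mcse(trace)     # Monte Carlo standard error of the estimates
print(rhat['theta'].values, ess['theta'].values, mcse['theta'].values)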

To explain the Highest Density Intervals, let’s look at them visually:

az.plot_posterior(trace, kind='kde', ref_val=0.5);
Image by author

In Bayesian inference, unlike Frequentist inference, we get the entire distribution of plausible values. Whenever ArviZ computes and reports an HDI, it uses a value of 94% by default. In the Frequentist world we have p-values and confidence intervals; in the Bayesian world this is referred to as a credible interval. Here we can interpret that there is a 94% probability that the coin’s probability of landing heads lies within the plotted interval. The benefit is that we trade the relatively confusing terminology and interpretations of the Frequentist world for a sensible estimate in the Bayesian world. The downside is that we lose a more definitive action to take in exchange for a spectrum of belief. For example, in this scenario it’s quite plausible that the coin is slightly biased towards heads, as the majority of the interval lies above 0.5, but θ could also be 0.38. The Frequentist world may provide a more definitive answer, but as always: you’re consistently making assumptions to get to insights.
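The HDI can also be computed directly from the trace with ArviZ (here using the same 94% default):

# lower and upper bounds of the 94% HDI for theta
print(az.hdi(trace, hdi_prob=0.94)['theta'].values)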

Finally, we can run Posterior Predictive Checks (PPCs) to validate our model. The idea here is we compare the observed data and the predicted data to spot differences between these two sets. The main goal is to check for auto-consistency. This is done by generating data from the model using parameters from each draw from the posterior. Ideally, the generated data and the observed data should look more or less similar. Even if we do all the modeling right, though, they could look off. An investigation into why could lead to significant improvements of our model that we may not have thought of otherwise.

with coin_flip:
    ppc = pm.sample_posterior_predictive(trace, var_names=["theta", "y"])

ppc['y'].shape
(20000, 15)
az.plot_ppc(az.from_pymc3(posterior_predictive=ppc, model=coin_flip));
Image by author
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist([y.mean() for y in ppc['y']], bins=19, alpha=0.5)
ax.axvline(observed_data.mean())
ax.axvline(0.5, c='r')
ax.set(title='Posterior Predictive Check', xlabel='Observed Data', ylabel='Frequency');
Image by author

It looks like our Bayesian model was able to learn our observed data distribution quite well! This is especially notable because we were only working with 15 trials which, frankly, is very little data. Like any data project, though, a performant model is about more than good metric numbers: our knowledge of coin flips tells us that 0.6 doesn’t make much sense for a fair coin.

So now, in true Bayesian fashion, let’s gather more evidence, update our prior, and see what happens.

Updating Beliefs in Light of New Evidence

Let’s say we run the same coin flip trials 1000 times:

# the number of samples
N_updated = 1000
# establishing the true parameter and generating observed data
theta_real = 0.5
observed_data_updated = stats.bernoulli.rvs(p=theta_real, size=N_updated)
# the number of heads
k_updated = observed_data_updated.sum()
print(observed_data_updated)
print(f"There are {k_updated} heads")
[0 1 0 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 0 0 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 0 1]
There are 511 heads

As we cook up our model, we can also update our prior Beta distribution to a shape that better reflects what we believe about coin flips.

# fit the observed data
with pm.Model() as coin_flip_updated:
    theta = pm.Beta('theta', alpha=2, beta=2)
    y = pm.Bernoulli('y', theta, observed=observed_data_updated)

pm.model_to_graphviz(coin_flip_updated)
Image by author
with coin_flip_updated:
    step = pm.NUTS()
    trace_updated = pm.sample(10000, step=step, return_inferencedata=True)

var_names = ["theta"]
lines = list(xarray_var_iter(trace_updated.posterior[var_names].mean(dim=("chain", "draw"))))
az.plot_trace(trace_updated, lines=lines);
Image by author
az.summary(trace_updated)
Image by author
az.plot_posterior(trace_updated, kind='kde', ref_val=0.5);
Image by author
with coin_flip_updated:
    ppc_updated = pm.sample_posterior_predictive(trace_updated, var_names=['theta', 'y'])

az.plot_ppc(az.from_pymc3(posterior_predictive=ppc_updated, model=coin_flip_updated));
Image by author
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist([y.mean() for y in ppc_updated['y']], bins=19, alpha=0.5)
ax.axvline(observed_data_updated.mean())
ax.axvline(0.5, c='r')
ax.set(title='Posterior Predictive Check', xlabel='Observed Data', ylabel='Frequency');
Image by author

What’s amazing here is that our Bayesian model is able to go beyond our evidence (which had a mean of about 0.511) and concentrate tightly around the true probability of a fair coin flip: 0.5. Just by gathering new evidence, updating our prior beliefs, and remodeling our data, we arrive at a much better estimate of the coin-flip phenomenon!

It has been mathematically proven that if we want to extend logic to include uncertainty, we must use probabilities and probability theory. Bayes’ theorem is just a logical consequence of the rules of probability, as we have seen. Hence, another way of thinking about Bayesian statistics is as an extension of logic for dealing with uncertainty.

There are a ton more fascinating depths to this world, and in part 2 I plan on applying this to a real-world dataset! See here: Applied Bayesian Inference with Python pt. 2

References

[1] Osvaldo Martin, Bayesian Analysis with Python (2016)

[2] Cameron Davidson-Pilon, Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference (2015)

[3] Cassie Kozyrkov, Statistics: Are you Bayesian or Frequentist?

