
Bayesian Inference and Markov Chain Monte Carlo Sampling in Python

An introduction to using Bayesian Inference and MCMC sampling methods to predict the distribution of unknown parameters through an in-depth…

Image from Adobe Stock

Introduction

This article extrapolates a basic coin-flip example into a larger context in which we can examine the use and power of Bayesian Inference and Markov Chain Monte Carlo sampling to predict unknown values. There are many useful packages to employ MCMC methods, but here we will build our own MCMC from scratch in Python with the goal of understanding the process at its core.

What is Bayesian Inference?

Bayesian inference is a method in which we use Bayes’ Theorem to update our understanding of a probability or a parameter as we gather more data and evidence.

What is Markov Chain Monte Carlo sampling?

The MCMC method (as it’s commonly referred to) is a class of algorithms used to sample from a probability distribution. These algorithms employ random sampling to achieve numerical results that converge on the truth as the number of samples increases. Many different algorithms exist that can be used to create this type of sampling chain – the one we will use for this specific example is called the Metropolis-Hastings algorithm. The MH implementation allows us to draw samples from an unknown distribution provided we know a function that is proportional to the probability density of the distribution we want to sample from. Not needing to know the exact probability density, only something proportional to it, is what makes MCMC-MH especially useful. We will describe the detailed steps of the algorithm later in this example.

The Coin Factory Problem

This example will model the probability of the outcome of a coin flip but in an expanded context. Imagine that a brand new coin factory has just been constructed to manufacture a new kind of coin, and they have tasked you with determining a parameter of the coins the factory produces – the probability they land on heads/tails when flipped. We will call this the coin’s ‘bias’. Each coin will have its own bias, but given they are produced in the same process and vary only slightly, we expect the factory to produce coins not randomly, but around a ‘factory bias’. We will use both information from the specific coin and the factory bias to create a probability distribution to determine the bias a coin is most likely produced with. In Bayes’ terms, we will use the likelihood (coin data) and the prior (factory bias) to produce the posterior distribution for the coin bias.

Image from Adobe Stock

So, the factory opens up and produces its very first coin. Given it is a new kind of coin and mint, we don’t know much about what to expect for the way it behaves when flipped (pretend we don’t know how coins generally work, we are trying to predict an ‘unknown’ parameter). So, we run an experiment with this first coin by flipping it 100 times and recording the number of times it landed on heads – 57 times. We will do this with more coins as they are produced in order to update our understanding of the factory bias, but first, let us analyze just the first coin.

The Problem in a Bayesian Context

Before we get too far, we need to define the problem in the context of Bayes’ Theorem. The left side of the equation is called the posterior; generally, it is the probability of a hypothesis (H) given some evidence (E). In the numerator on the right side, we have our likelihood (the probability of seeing the evidence given our hypothesis is true), multiplied by the prior (the probability of the hypothesis). The denominator on the right side is the marginal likelihood, or evidence: the probability of observing the evidence at all. Fortunately, we will not need to use the marginal likelihood to sample the posterior.

Bayes' Theorem - image by author.

In most practical applications, computing the posterior distribution directly is not possible; this is why we employ numerical sampling techniques like MCMC.

Posterior

So what is the posterior probability we are interested in for our coin factory? It is the probability, P, that the factory produces a coin with bias, p, given our data, x number of heads – P(p|x). The rest of the theorem can be written as follows:

Bayes' Rule in the context of our coin factory - image by author.

It is important to remember here that we are predicting the posterior probability distribution, P(p|x), that a coin has a certain probability of heads, or bias (p) – the distribution and the bias are two different things. The factory bias is the probability distribution of a coin being produced with a certain bias; this is P(p), the prior.

Likelihood – The Binomial Distribution

The likelihood function here is the probability of observing x heads given a coin with bias p. For a coin toss, this function is described exactly by the binomial distribution. For a coin with bias p, the probability of observing x heads out of n flips can be written:

Binomial Distribution - image by author

It is important to note that we do not currently know the value p for a given coin. This value is the one our MCMC will randomly sample. After initializing p with a random value, and through many samples, the model will converge towards the true value of p.

Let’s begin coding this example in Python. We need to define the observed data from our first coin experiment, and a likelihood function that returns the binomial probability of that data. The scipy.stats library makes it easy to define distributions like the binomial, and it is what I use for this example.
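A minimal sketch of this setup (the variable names here are my own choices):

```python
from scipy.stats import binom

# Observed data from the first coin experiment
n = 100  # number of flips
x = 57   # number of heads observed

def likelihood(p):
    """Probability of observing x heads in n flips for a coin with bias p."""
    return binom.pmf(x, n, p)
```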

Prior

Next, we need to define our prior function, P(p). The value for p can only exist between 0 and 1, with 0 representing a coin that will never land heads and 1 representing a coin that will always land heads. Remember, the factory has only produced one coin, so we have no information about what the factory bias may yet be (again, pretend we don’t know how coins work). Because of this, we will use what is called a uniform prior. In the Bayesian Inference context, this says that we assign equal weight to the probability of p being any value between 0 and 1.

We can do this easily with scipy and the uniform distribution PDF. By default, this distribution exists only from 0 to 1, just as we need.
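A sketch of the uniform prior, again using scipy.stats:

```python
from scipy.stats import uniform

def prior(p):
    """Uniform prior over [0, 1]: every bias is equally likely a priori."""
    return uniform.pdf(p)  # defaults to loc=0, scale=1
```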

Now that we have our likelihood and prior defined, we can move on to understanding and coding the Markov Chain.

The Metropolis-Hastings MCMC

As mentioned above, these methods draw samples from a continuous random variable – in our case, p. The MCMC we will use is of the random-walk type: it randomly generates candidate samples and accepts or rejects them based on how well they fit the model.

Acceptance Ratio

The Metropolis-Hastings algorithm is fairly straightforward, but first we need to define how we will accept or reject a new sample draw. Each iteration, a new value for p between 0 and 1 is proposed; we will call this proposed value p′. We want to accept and update to this value when it fits the model better than the previous one (and occasionally even when it doesn’t, which keeps the chain exploring). This decision is made by computing an acceptance ratio, R. The acceptance ratio is the ratio of our Bayes’ Theorem evaluated at the proposed value over its value at the previous one, and is shown below.

Acceptance Ratio, R - image by author

There are a few things to note here. First, you may have noticed that this acceptance ratio does not include the marginal likelihood (evidence) part of Bayes’ Theorem, nor did we define a function for it above. This is because the evidence does not change for a new value of p, and thus cancels itself out in this ratio. This is awesome, as computing the marginal likelihood part of Bayes’ Theorem is usually extremely difficult or impossible in practice. MCMC and Bayesian Inference allow us to sample the posterior without needing to know the marginal likelihood!

Second, any ratio greater than 1 means the proposed value fits better and should always be accepted. The second part of accepting new values is comparing R to another random draw between 0 and 1 (so a worse value is accepted with probability R), which is why it is conventional to simply cap R at 1.

For simplicity, we will write a function that computes this acceptance ratio to easily implement in our sampling chain loop.
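A sketch of such a function (the likelihood and prior definitions are repeated so it runs standalone; the names are illustrative):

```python
from scipy.stats import binom, uniform

n, x = 100, 57  # flips and observed heads for the first coin

def likelihood(p):
    return binom.pmf(x, n, p)

def prior(p):
    return uniform.pdf(p)

def acceptance_ratio(p, p_new):
    """Ratio of (likelihood * prior) for the proposed value over the current one.

    The marginal likelihood cancels out, so it never appears here.
    Capped at 1 by convention.
    """
    return min(1.0, (likelihood(p_new) * prior(p_new)) / (likelihood(p) * prior(p)))
```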

Let’s explicitly define the steps of the MH algorithm so this is more clear.

The Sampling Algorithm

We have already defined our functions for the likelihood, prior, and acceptance probability. The last thing we must do before looping is to initialize p with a random value within its range, 0 to 1.

Here are the steps of our Metropolis-Hastings algorithm:

  1. Propose a new value of p randomly between 0 and 1, call it p′ (or p_new).
  2. Compute the acceptance ratio, R.
  3. Generate another uniform random number between 0 and 1, call it u.
  4. If u < R, accept the new value and set p = p′. Otherwise, keep the current value of p.
  5. Record the final value of p for this sample.
  6. Repeat steps 1 through 5 many, many times.

Pretty straightforward. Before we code this up, let’s explore a couple of other common concepts utilized in MCMC.

Burn-in

MCMCs are initialized randomly and must converge towards the correct value, and this can often take quite a lot of samples. When plotting our results and posterior distribution, it is not effective to include these early samples before the model has converged. So we implement what is called a "burn-in", in which those first, less accurate samples are excluded. Burn-in for MCMCs is typically around 2000–5000 samples when the entire chain is around 10k–20k.

Lag

Another very important consideration with MCMCs is sample independence. A new sample is often dependent on the previous one, since we sometimes reject the proposed value and keep the old one. To address this problem, we implement what is called "lag" (also known as thinning): rather than recording every sample, we record every other one, or perhaps every fifth or tenth.

The Simulation

Great, we now have everything we need to write and run our MCMC.
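A sketch of the full chain, combining the numbered steps above with burn-in and lag (the sample counts, seed, and names are illustrative choices):

```python
import numpy as np
from scipy.stats import binom, uniform

rng = np.random.default_rng(42)

n, x = 100, 57  # flips and observed heads for the first coin

def likelihood(p):
    return binom.pmf(x, n, p)

def prior(p):
    return uniform.pdf(p)

def acceptance_ratio(p, p_new):
    return min(1.0, (likelihood(p_new) * prior(p_new)) / (likelihood(p) * prior(p)))

n_samples = 10_000
burn_in = 2_000
lag = 5

p = rng.uniform(0, 1)  # initialize p randomly within its range
results = []
for i in range(n_samples):
    p_new = rng.uniform(0, 1)         # step 1: propose a new value p'
    R = acceptance_ratio(p, p_new)    # step 2: compute the acceptance ratio
    u = rng.uniform(0, 1)             # step 3: draw a uniform random number
    if u < R:                         # step 4: accept or reject
        p = p_new
    if i >= burn_in and i % lag == 0:  # step 5: record, applying burn-in and lag
        results.append(p)              # step 6: the loop repeats this many times

posterior_mean = np.mean(results)
print(posterior_mean)  # close to 0.57 for this data
```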

Visualizing the results

MCMC results are typically plotted in two ways – the posterior distribution and the traceplot. The posterior can be shown in a histogram in which we can examine the most likely value and the variance visually. A traceplot shows the value of p for each sample iteration and shows the behavior and convergence of the MCMC. Posterior distributions should never include burn-in samples, but including burn-in for traceplots can help us to examine where the model started and how well it converged. Burn-in samples are excluded from the traceplot below.

Posterior Distribution and Traceplot - image by author.

Examining the results, we can see that our posterior is normally distributed. We can also see from the traceplot that the chain sampled randomly around the convergence value – this is good. When extracting a single value from the posterior, the last sampled value of p is not necessarily representative of the whole distribution, so the posterior value is usually taken as the mean of the distribution. In this case, our mean posterior value is 0.569.

This value is nearly what we would have predicted if we took the frequency probability of our data for this coin, 57/100 heads. Our MCMC predicts this because it is the first coin, and we don’t have much knowledge about how coins should behave: with a uniform prior, the posterior is driven almost entirely by the likelihood function, i.e., the data. The next step in Bayesian Inference is to update our understanding with more data, so let’s keep the factory running and make more coins to test.

Photo by GSJJ on Unsplash

Updating Our Understanding

We wait a while, and the factory produces 500 coins. We run the same 100 flip experiments and record the posterior’s most likely bias for each coin. Let’s plot a histogram and examine the spread of biases produced to learn about the factory bias.

500 Coin Factory Bias - image by author.

Examining this data, we learn that the mean bias is 0.511 and the standard deviation is 0.049. The data appears normal, so let’s model it with a normal distribution with these parameters – this is shown in red.

This distribution contains information about which biases we expect to be more likely for coins produced at this factory. Updating our understanding like this will give us a much more accurate result for the bias of a coin than our previous, uninformed, uniform prior. This is exactly what Bayesian Inference is built for – we can simply update our prior function to represent the data of the factory bias.
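A sketch of the updated prior, modeled as a normal distribution with the parameters measured above:

```python
from scipy.stats import norm

# Factory bias estimated from the 500 coins tested above
factory_mean = 0.511
factory_std = 0.049

def prior(p):
    """Informed prior: coin biases cluster around the factory bias."""
    return norm.pdf(p, loc=factory_mean, scale=factory_std)
```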

Now that we have improved our model with more prior information, let’s produce the 501st coin and run the same experiment. In this experiment, we get 63 heads out of 100 tosses. We must build our likelihood function off this data as we did before:
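A sketch of the updated likelihood for the 501st coin’s data:

```python
from scipy.stats import binom

n, x = 100, 63  # flips and observed heads for the 501st coin

def likelihood(p):
    return binom.pmf(x, n, p)
```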

Great, now let’s see how including information about the factory bias affects the result of the MCMC and what we believe the true bias of this coin to be. I have run an MCMC with the same number of samples and parameters as above, but with the updated prior and the new coin data. Below are the posterior distribution and traceplot for the 501st coin.

Posterior and Trace from MCMC with updated Prior – image by author.

Even though the data for this coin suggests a bias around 0.63, our Bayesian Inference model suggests the actual value is much closer to 0.53. This is because our informed prior function carries weight in the model: even though we observed 63 heads for this coin, given that coins are produced with an average bias of around 0.51, we expect the bias of the 501st to be closer to the factory bias. Even if this coin had exactly a 0.5 bias, observing 63 heads out of 100 tosses is not especially unlikely, and we shouldn’t assume this data represents the exact value. The prior function informs the posterior distribution just as the likelihood does. If we were to produce thousands more coins and inform the prior distribution even further, the prior would carry even more weight in the model. This idea of updating our understanding with more information to predict unknown parameters is exactly why Bayesian Inference is useful: tuning the likelihood and prior functions with more and better data is what allows us to improve and inform our inference model.

Conclusion

When implementing Bayesian Inference and MCMC, one does not usually know what the posterior distribution will look like in advance, as we do here because we made up the data. This is a powerful tool for the inference of unknown parameters – one only needs to know what the posterior distribution is proportional to (e.g., a linear model when predicting the slope and intercept parameters, or a logistic model when predicting alpha and beta). This allows us to do things like approximate high-dimensional integrals we cannot solve analytically – believe it or not, you can do it with just randomness and statistics!

References

Besides the Wikipedia links given above, this article was very helpful in combining the context of Bayesian Inference and MCMC – Introduction to MCMC using RevBayes.

Moving forward, I recommend using packages like PyMC3 for Bayesian Inference rather than coding from scratch. This article by Will Koehrsen provides an awesome real-world example and is worth checking out:

Markov Chain Monte Carlo in Python

If you wish to dive deeper into the math and reasoning that makes Bayesian Inference and MCMC possible, I highly recommend this article – Bayesian Inference Problem, MCMC and Variational Inference.


If you’re looking for more information on my implementation or would like to give feedback (always appreciated), please shoot me a message or email at [email protected].

Thank you!

