From Scratch: Bayesian Inference, Markov Chain Monte Carlo and Metropolis-Hastings, in Python

Joseph Moukarzel
Towards Data Science
15 min read · Nov 13, 2018


Credit: Adi coco, Unsplash

Hello and welcome to the first article in the “From Scratch” series, where I explain and implement/build anything from scratch.

Why do I want to do that? Because in the current state of things, we are in possession of powerful libraries and tools that can do a lot of the work for us. Most experienced authors are well aware of the complexities of implementing such tools. As such, they make use of them to provide short, accessible and to-the-point reads to users from diverse backgrounds. In many of the articles that I read, I failed to understand how this or that algorithm is implemented in practice. What are their limitations? Why were they invented? When should they be used?

As Hilary Mason puts it:

“ When you want to use a new algorithm that you don’t deeply understand, the best approach is to implement it yourself to learn how it works, and then use a library to benefit from robust code.”

That’s why I propose to explain and implement from scratch: Bayesian Inference (somewhat briefly), Markov Chain Monte Carlo and Metropolis-Hastings, in Python.

The notebook and a PDF version can be found in my repository at: joseph94m

Prerequisites: Basic probabilities, calculus and Python.

1- Introduction

In many of my readings, I came across a technique called Markov Chain Monte Carlo, or as it’s more commonly referred to, MCMC. The description for this method stated something along the lines of: MCMC is a class of techniques for sampling from a probability distribution and can be used to estimate the distribution of parameters given a set of observations.

Back then, I did not think much of it. I thought, “oh, it’s just another sampling technique”, and I decided I’d read up on it when I’d practically need it. This need never emerged, or perhaps it did and I wrongly used something else to solve my problem.

1.1- So why the interest now?

Recently, I have seen a few discussions about MCMC and some of its implementations, specifically the Metropolis-Hastings algorithm and the PyMC3 library. “Markov Chain Monte Carlo in Python: A Complete Real-World Implementation” was the article that caught my attention the most. In this article, William Koehrsen explains how he was able to learn the approach by applying it to a real-world problem: estimating the parameters of a logistic function that represents his sleeping patterns.

The assumed model

Mr. Koehrsen uses the PyMC3 implementation of the Metropolis-Hastings algorithm to compute the distribution space of α and β, thus deriving the most likely logistic model.

1.2- So why am I talking about all that?

In this article, I propose to implement from scratch the Metropolis-Hastings algorithm to find parameter distributions for a dummy data example and then for a real-world problem.

I figured that if I get my hands dirty, I might finally be able to understand it. I will only use numpy to implement the algorithm, and matplotlib to draw pretty things. Alternatively, scipy can be used to compute the density functions (which I will talk about later), but I will also show how to implement them using numpy.

1.3- Flow of the article

In part 1, I will introduce Bayesian inference, MCMC-MH and their mathematical components. In part 2, I will explain the MH algorithm using dummy data. Finally, part 3 will provide a real world application for MCMC-MH.

2- Part 1: Bayesian inference, Markov Chain Monte Carlo, and Metropolis-Hastings

2.1- A bird’s eye view on the philosophy of probabilities

In order to talk about Bayesian inference and MCMC, I shall first explain what the Bayesian view of probability is, and situate it within its historical context.

2.1.1- Frequentist vs Bayesian thinking

There are two major interpretations to probabilities: Bayesian and Frequentist.

From a Frequentist’s perspective, probabilities represent long-term frequencies with which events occur. A frequentist can say that the probability of having tails from a coin toss is equal to 0.5 in the long run. Each new experiment can be considered as one of an infinite sequence of possible repetitions of the same experiment. The main idea is that there is no belief in a frequentist’s view of probability. The probability of event x happening out of n trials is approximately the following:

P(x) ≈ (number of trials in which x occurred) / n

and the true probability is reached when n → ∞. Frequentists will never say “I am 45% (0.45) sure that there is lasagna for lunch today”, since this does not happen in the long run. Commonly, a frequentist approach is referred to as the objective approach, since there is no expression of belief and/or prior events in it.

On the other hand, in Bayesian thinking, probabilities are treated as an expression of belief. Therefore it is perfectly reasonable for a Bayesian to say “I am 50% (0.5) sure that there is lasagna for lunch today”. By combining prior beliefs, and current events (the evidence), one can compute the posterior, i.e. the probability that there is lasagna today. The idea behind Bayesian thinking is to keep updating the beliefs as more evidence is provided. Since this approach deals with belief, it is usually referred to as the subjective view on probability.

2.1.2- Bayesian inference

In the philosophy of decision making, Bayesian inference is closely related to the Bayesian view on probability, in the sense that it manipulates priors, evidence, and likelihood to compute the posterior. Given some event B, what is the probability that event A occurs? This is answered by Bayes’ famous formula:

P(A|B) = P(B|A) P(A) / P(B)

With:

P(A) the prior (our belief about A before seeing the evidence), P(B|A) the likelihood, P(B) the evidence, and P(A|B) the posterior.

In our case, we are mostly interested in the specific formulation of Bayes’ formula:

P(θ|D) = P(D|θ) P(θ) / P(D)

That is, we would like to find the most likely distribution of θ, the parameters of the model explaining the data, D.

A supposed portrait of Thomas Bayes, an English statistician, philosopher, and theologian. Image Credit: Wikipedia Commons

Computing some of these probabilities can be tedious, especially the evidence P(D). Also, other problems can arise, such as ensuring conjugacy, which I will not dive into in this article. Luckily, some techniques, namely MCMC, allow us to sample from the posterior and draw distributions over our parameters without having to worry about computing the evidence, nor about conjugacy.

2.1.3- Markov Chain Monte Carlo

MCMC allows us to draw samples from a distribution even if we can’t compute it. It can be used to sample from the posterior distribution (what we wish to know) over parameters. It has seen much success in many applications, such as computing the distribution of parameters given a set of observations and some prior belief, and also computing high dimensional integrals in physics and in digital communications.

Bottom line: It can be used to compute the distribution over the parameters given a set of observations and a prior belief.

2.1.4- Metropolis-Hastings

MCMC is a class of methods. Metropolis-Hastings is a specific implementation of MCMC. It works well in high dimensional spaces as opposed to Gibbs sampling and rejection sampling.

This technique requires a simple distribution called the proposal distribution (which I like to call the transition model) Q(θ′|θ) to help draw samples from an intractable posterior distribution P(Θ = θ|D).

Metropolis-Hastings uses Q to randomly walk in the distribution space, accepting or rejecting jumps to new positions based on how likely the sample is. This “memoryless” random walk is the “Markov Chain” part of MCMC.

The “likelihood” of each new sample is decided by a function f. That’s why f must be proportional to the posterior we want to sample from. f is commonly chosen to be a probability density function that expresses this proportionality.

To get a new position of the parameter, we take our current one, θ, and propose a new one, θ′, that is a random sample drawn from Q(θ′|θ). Often this is a symmetric distribution. For instance, a normal distribution with mean θ and some standard deviation σ: Q(θ′|θ) = N(θ, σ)

To decide if θ′ is to be accepted or rejected, the following ratio must be computed for each new proposed θ′:

P(θ′|D) / P(θ|D)

Using Bayes’ formula this can be easily re-formulated as:

[P(D|θ′) P(θ′) / P(D)] / [P(D|θ) P(θ) / P(D)] = P(D|θ′) P(θ′) / (P(D|θ) P(θ))

The evidence P(D) simply cancels out during the division.

Which is also equivalent to:

f(D|θ′) P(θ′) / (f(D|θ) P(θ))

Where f is the proportional function mentioned previously.

The rule for acceptance can then be formulated as: if this ratio is greater than or equal to 1, accept θ′; otherwise, accept θ′ with a probability equal to the ratio (by comparing it to a uniformly drawn random number in [0, 1]).

Note: The prior components are often cancelled out if there is no preference or restrictions on the parameters.

This means that if a θ′ is more likely than the current θ, then we always accept θ′. If it is less likely than the current θ, then it may still be accepted, but only randomly, with a probability that decreases the less likely it is.

So, briefly, the Metropolis-Hastings algorithm does the following:

Metropolis-Hastings algorithm
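To make those steps concrete, here is a bare-bones sketch of that loop in Python. The names likelihood, prior and transition_model are placeholders of my own, not code from the article’s notebook; a log-space version (explained in Part 2) is what you would actually want in practice.

```python
import numpy as np

def metropolis_hastings_sketch(likelihood, prior, transition_model, theta_init, iterations):
    """Bare-bones MH loop: likelihood, prior and transition_model are
    user-supplied functions (placeholder names, not the article's notebook code)."""
    theta = theta_init
    accepted, rejected = [], []
    for _ in range(iterations):
        theta_new = transition_model(theta)                 # propose a jump from Q(theta'|theta)
        ratio = (likelihood(theta_new) * prior(theta_new)) / (
                 likelihood(theta) * prior(theta))          # posterior ratio; P(D) cancels out
        if np.random.uniform(0, 1) < ratio:                 # accept with probability min(1, ratio)
            theta = theta_new
            accepted.append(theta_new)
        else:
            rejected.append(theta_new)
    return np.array(accepted), np.array(rejected)
```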

3- Part 2: Dummy data example

3.1- Step 1: Data generation

We generate 30,000 samples from a normal distribution with mean μ = 10 and standard deviation σ = 3, but we can only observe 1,000 random samples from them.
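A minimal way to do this with numpy (the variable names and the seed are my own) could be:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, only for reproducibility
population = np.random.normal(loc=10, scale=3, size=30000)    # mu = 10, sigma = 3
observation = population[np.random.randint(0, 30000, 1000)]   # the 1,000 samples we observe
```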

3.2- Step 2: What do we want?

We would like to find a distribution for σ_observed using the 1,000 observed samples. Those of you who are avid mathematicians will say that there is a formula for computing σ directly:

σ = √( (1/N) Σ_i (d_i − μ)² )

So why do we want to sample and whatnot? Well, this is just a dummy data example; the real problem is in Part 3, where it’s hard to compute the parameters directly. Plus, here we are not trying to find a single value for σ, but rather to compute a probability distribution for σ.

3.3- Step 3: Define the PDF and the transition model

From Figure 1, we can see that the data is normally distributed. The mean can be easily computed by taking the average of the values of the 1,000 samples. By doing that, we get that μ_observed = 9.8 (although, on a side note, we could have also assumed μ to be unknown and sampled for it just like we are doing for σ. However, I want to make this starting example simple.)

3.3.1- For the transition model/ proposal distribution

I have no specific distribution in mind, so I will choose a simple one: the normal distribution!

Q(σ_new | σ_current) = N(σ_current, σ′)

Note that σ′ is unrelated to σ_new and σ_current. It simply specifies the standard deviation of the jumps in parameter space. It can be any value desired. It affects the convergence time of the algorithm and the correlation between samples, which I talk about later.
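As a sketch, with an arbitrary step size of σ′ = 0.5 (my own choice, not prescribed by the algorithm), the transition model can be written in one line:

```python
import numpy as np

# Propose a new sigma from a normal centred on the current one; 0.5 is the step size sigma-prime
transition_model = lambda sigma: np.random.normal(sigma, 0.5)
```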

3.3.2- For the PDF

Since f should be proportional to the posterior, we choose f to be the following probability density function (PDF) of the normal distribution, for each data point d_i in the data set D:

f(d_i; μ, σ) = (1 / (σ √(2π))) exp(−(d_i − μ)² / (2σ²))

Since μ is constant, we can practically consider that σ is equivalent to θ.
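A numpy version of this density, applied to every data point at once, might look like the following sketch (the function name is my own; scipy.stats.norm(mu, sigma).pdf(d) would give the same values, at the cost of speed):

```python
import numpy as np

def normal_pdf(d, mu, sigma):
    # Normal PDF evaluated at data point(s) d; works on scalars and numpy arrays alike
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((d - mu) ** 2) / (2 * sigma ** 2))
```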

3.4- Step 4: Define when we accept or reject σ_new

3.4.1- The acceptance formula

The acceptance ratio is:

ratio = [Likelihood(D|σ_new) × prior(σ_new)] / [Likelihood(D|σ_current) × prior(σ_current)]   (1)

If this ratio is larger than 1, we accept σ_new. If it is not larger than 1, then we compare it to a uniformly generated random number in the closed interval [0, 1]. If the ratio is larger than the random number, we accept σ_new, otherwise we reject it. This ensures that even if a sample is less likely than the current one, we might still want to try it. (This is a similar notion to simulated annealing.)
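A tiny helper capturing this rule could be the following sketch (my own naming):

```python
import numpy as np

def accept(ratio):
    # Always accept when the ratio exceeds 1; otherwise accept with probability equal to the ratio
    if ratio > 1:
        return True
    return np.random.uniform(0, 1) < ratio
```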

3.4.2- The likelihood

The total likelihood for a set of observations D is:

Likelihood(D|μ, σ) = ∏_i f(d_i; μ, σ)

This must be computed for both the new and the current σ in order to compute the ratio in equation (1).

3.4.3- The prior P(μ, σ)

We don’t have any preferences for the values that σ_new and σ_current can take. The only thing worth noting is that they should be positive. Why? Intuitively, the standard deviation measures dispersion. Dispersion is a distance, and distances cannot be negative.

Mathematically:

σ = √( (1/N) Σ_i (d_i − μ)² )

and the square root of a number cannot be negative, so σ is always positive. We strictly enforce this in the prior.
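The prior can then be a simple indicator function (a sketch; returning 0 drives the acceptance ratio to 0, so non-positive proposals are always rejected):

```python
def prior(sigma):
    # Flat prior over sigma, restricted to strictly positive values
    return 1 if sigma > 0 else 0
```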

3.4.4- The final acceptance form

In our case, we will take the log of both the prior and the likelihood function. Why the log? Simply because it helps with numerical stability: multiplying thousands of small values (probabilities, likelihoods, etc.) can cause an underflow, and the log is a perfect solution because it transforms multiplications into additions and turns very small positive numbers into negative numbers of manageable magnitude.

Our acceptance condition from equation (1) becomes: accept σ_new if

log(Likelihood(D|σ_new)) + log(prior(σ_new)) − log(Likelihood(D|σ_current)) − log(prior(σ_current)) > log(u)

where u is the uniform random number drawn in [0, 1].

This form can be reduced even more by taking the square root and the multiplication out of the log, but never mind that now, there’s already enough maths!

3.4.5- The implementation of that rant

The implementation is quite simple, right?!
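The full implementation is embedded as a notebook in the original post; as a stand-in, here is a condensed sketch of the whole procedure in log space, with μ held fixed and only σ sampled (variable names, step size and iteration count are my own choices, not the notebook’s exact code):

```python
import numpy as np

def log_likelihood(sigma, data, mu):
    # Sum of the log of the normal PDF over all observed data points
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                  - ((data - mu) ** 2) / (2 * sigma ** 2))

def log_prior(sigma):
    # Flat prior on sigma > 0; -inf makes non-positive proposals impossible to accept
    return 0.0 if sigma > 0 else -np.inf

def metropolis_hastings(data, mu, sigma_init, iterations, step=0.5):
    sigma = sigma_init
    accepted, rejected = [], []
    for _ in range(iterations):
        sigma_new = np.random.normal(sigma, step)            # proposal Q(sigma'|sigma)
        if log_prior(sigma_new) == -np.inf:                  # reject non-positive sigma outright
            rejected.append(sigma_new)
            continue
        log_ratio = (log_likelihood(sigma_new, data, mu) + log_prior(sigma_new)) \
                  - (log_likelihood(sigma, data, mu) + log_prior(sigma))
        if np.log(np.random.uniform(0, 1)) < log_ratio:      # log-space acceptance test
            sigma = sigma_new
            accepted.append(sigma_new)
        else:
            rejected.append(sigma_new)
    return np.array(accepted), np.array(rejected)

# Example run (iteration count is my own choice):
# accepted, rejected = metropolis_hastings(observation, mu=9.8, sigma_init=0.1, iterations=50000)
```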

3.6- Step 6: Run the algorithm with initial parameters and collect accepted and rejected samples

The algorithm accepted 8317 samples (which might be different on each new run). The last 10 samples contain the following values for σ:

[2.87920187, 3.10388928, 2.94469786, 3.04094103, 2.95522153, 3.09328088, 3.07361275, 3.08588388, 3.12881964, 3.03651136]

Let’s see how the algorithm worked its way to these values:

So, starting from an initial σ of 0.1, the algorithm converged pretty quickly to the expected value of 3. That said, it’s only sampling in a 1-D space, so it’s not very surprising.

Still, we will consider the initial 25% of the values of σ to be “burn-in”, so we drop them.
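In code, this is just a slice of the accepted samples (reusing the `accepted` array from the sketch above):

```python
# Drop the first 25% of accepted values of sigma as burn-in
burn_in = int(0.25 * len(accepted))
sigma_samples = accepted[burn_in:]
```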

3.6.2- Let’s visualise the trace of σ and the histogram of the trace

The most likely value for σ is around 3.1. This is a wee bit more than the original value of 3.0. The difference is due to our observing only 3.33% of the original population (1,000 out of 30,000).

3.6.3- Predictions: How would our model fare in predicting the original population of 30,000?

First, we average the last 75% of accepted samples of σ, and we generate 30,000 random individuals from a normal distribution with μ = 9.8 and σ = 3.05 (that average), which is actually better than the most likely value of 3.1.
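In code, the prediction step could look like this (continuing from the sketches above; the exact average will differ from run to run):

```python
import numpy as np

sigma_pred = np.mean(sigma_samples)                                   # average of the last 75% of accepted sigmas
predicted_population = np.random.normal(9.8, sigma_pred, size=30000)  # regenerate 30,000 individuals
```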

And voilà:

And now, onto the real stuff!

4- Part 3: A real-world example

Sunspots. Credit: Wikipedia Commons

A sunspot is a region on the Sun’s surface (photosphere) that is marked by a lower temperature than its environment. These reduced temperatures are caused by concentrations of magnetic field flux that inhibit convection, by an effect similar to eddy current brakes. Sunspots usually appear in pairs of opposite magnetic polarity. Their number varies according to the approximately 11-year solar cycle.

The data we will be working on is the “Monthly mean total sunspot number”, for each month from January 1749 to November 2018. This data is collected, curated and made publicly available by the World Data Center for the production, preservation and dissemination of the international sunspot number.

4.1- Let’s plot the data over the years to see what the distribution might be like

4.2- It seems like we could model this phenomenon with a gamma distribution, with a new cycle resetting every 12 years

A gamma distribution Γ is a two-parameter family of continuous probability distributions. The parameters are the shape a and the scale b. A gamma-distributed random variable X is denoted X ~ Γ(a, b), and in our case X is the count of sunspots. The two parameters a and b are the unknowns that we would like to compute distributions for.

Gamma function with different parameters. Credit: Wikipedia Commons

For example, in the first cycle, the sunspot counts start from their highest at about 300 at the end of 1749, and fall to their lowest 6 years later, during 1755. Then the number rises again to its maximum during 1761 and 1762 before falling again during 1766, and so on...

Let’s make sure by plotting a histogram of sunspot counts:

4.3- Indeed, it does seem like the frequency of counts follows a gamma distribution

The gamma distribution has as its PDF the function f such that:

f(x; a, b) = x^(a−1) e^(−x/b) / (b^a Γ(a))

where Γ is the gamma function, which generalises the factorial: Γ(a) = (a−1)! for positive integer a (not to be confused with the gamma distribution!).

Following the same procedure as in the dummy data example, we can write down the log-likelihood from this PDF (see code below). Alternatively, one could use the scipy.stats.gamma(a, scale=b).pdf(x) function to compute it. However, beware that scipy’s implementation is several orders of magnitude slower than the one I implemented.
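A numpy version of this log-likelihood might look like the following sketch (the function name is mine; math.lgamma computes log Γ(a) directly, which is more stable than the factorial form):

```python
import numpy as np
from math import lgamma

def gamma_log_likelihood(theta, data):
    a, b = theta  # shape and scale
    # log of the gamma PDF, summed over all data points:
    # log f(x) = (a-1) log x - x/b - a log b - log Gamma(a)
    # note: months with a count of 0 would make np.log(data) -inf and may need a small offset
    return np.sum((a - 1) * np.log(data) - data / b - a * np.log(b) - lgamma(a))
```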

Since a and b must be strictly positive, we enforce this in the prior:
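A matching prior, again as a sketch in log space:

```python
import numpy as np

def gamma_log_prior(theta):
    a, b = theta
    # Flat prior restricted to strictly positive shape and scale
    return 0.0 if (a > 0 and b > 0) else -np.inf
```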

Run the code and collect samples:

Starting from a = 4 and b = 10, the algorithm accepted 8561 pairs of samples; the last value for a is 0.98848982 and the last value for b is 84.99360422, which are pretty far from the initial values.

As with the dummy data example, let’s see how the algorithm worked its way to these values:

As we can see from figures 10, 11, and 12, the algorithm converges quickly to the [a=1,b=85] zone.

Tip: when the algorithm starts to heavily reject samples, that means that we have reached a zone of saturation of the likelihood. Commonly, this can be interpreted as having reached the optimal parameter space from which we can sample, i.e. there is very little reason for the algorithm to accept new values. This is marked in figures 11, and 12 where the algorithm no longer accepts any values outside of a small range.

4.3.1- We consider the initial 50% of the values of a and b to be “burn-in”, so we drop them. Let’s visualise the traces of a and b and the histograms of the traces.

4.4- Prediction time

First, we average the last 50% of accepted samples of a and b, and we generate random individuals from a Γ distribution with a_average = 0.9866200759935773 and b_average = 83.70749712447888.

And the predictions:

5- Evaluation

5.1- Evaluation of the proposal distribution

How do we specify the parameters of the proposal distribution Q? Should we move far from the current sample θ, or stay relatively close? These questions can be answered by measuring the auto-correlation between accepted samples. We don’t want distant samples to be too correlated, as we are trying to implement a Markov chain, i.e. a sample should only depend on its previous sample, and the auto-correlation plot should show a quick, exponential decrease in the correlation between sample i and samples i−1, i−2, …, i−k.

The auto-correlation is defined by computing the following function for each lag k:

r_k = Σ_{i=1..N−k} (Y_i − Ȳ)(Y_{i+k} − Ȳ) / Σ_{i=1..N} (Y_i − Ȳ)²

The lag k is basically the range ahead of sample Y_i over which we would like to measure the correlation.
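A straightforward numpy implementation of this definition (function name my own) could be:

```python
import numpy as np

def autocorrelation(trace, max_lag=100):
    """Normalised auto-correlation of a 1-D trace for lags 0..max_lag."""
    Y = np.asarray(trace, dtype=float)
    Y_centred = Y - Y.mean()
    denom = np.sum(Y_centred ** 2)
    return np.array([np.sum(Y_centred[:len(Y) - k] * Y_centred[k:]) / denom
                     for k in range(max_lag + 1)])
```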

The plots below show the auto-correlation of a and b for k going from 0 to 100. A lag of k = 0 means that we are measuring the correlation of a sample with itself, so we expect it to be equal to 1. The higher k goes, the lower that correlation ought to be.

In our case, we are lucky to have a low enough correlation. In general, we might want to set up the parameters of the proposal distribution Q automatically. A common method is to keep adjusting the proposal parameters so that more than 50% of proposals are rejected. Alternatively, one could use an enhanced version of MCMC called Hamiltonian Monte Carlo, which reduces the correlation between successive sampled states and reaches the stationary distribution more quickly.

6- Conclusion

While the abstraction behind this algorithm may seem out of reach at first, the implementation is actually pretty simple, and gives awesome results. In fact, the great thing about probabilistic programming, notably MCMC, is that you only need to write down the model and then run it. There is no need to compute the evidence, or to ensure some constraining mathematical properties.

I hope that everyone reading this article found it enjoyable and insightful. If there’s positive feedback, there will be more articles in this series “From Scratch”, where I explain and implement stuff from scratch (obviously!), so if you like it, please do suggest what you want me to talk about next!

Any questions are welcome, and I will do my best to answer them! Feedback is more than welcome as this is my first article and I wish to improve.

Acknowledgements:

Fellow medium writer, Vera Chernova, for the article: “How to embed Jupyter Notebook into Medium Posts in three steps: 1–2–3!”

Friends and fellow data scientists Mark Askew and Constantinos Ioannidis for giving feedback on this article before publishing.
