Introduction to PyMC3: A Python package for probabilistic programming

Tung T. Nguyen
Towards Data Science
6 min read · Aug 27, 2020

Introduction

We often hear something like this on weather forecast programs: the chance of rain tomorrow is 80%. What does that mean? It is often hard to give meaning to this kind of statement, especially from a frequentist perspective: there is no reasonable way to repeat the rain/no-rain experiment an infinite (or very large) number of times.

The Bayesian approach provides a solution for this type of statement. The following sentence, taken from the book Probabilistic Programming & Bayesian Methods for Hackers, perfectly summarizes one of the key ideas of the Bayesian perspective.

The Bayesian world-view interprets probability as a measure of believability in an event, that is, how confident we are in an event occurring.

In other words, in the Bayesian approach, we can never be absolutely sure about our *beliefs*, but can definitely say how confident we are about the relevant events. Furthermore, as more data is collected, we can become more confident about our beliefs.

As a scientist, I am trained to trust the data and to be critical of almost everything. Naturally, I find Bayesian inference rather intuitive.

However, Bayesian inference is often computationally and conceptually challenging. Getting things done frequently requires long and complicated mathematical computations. Even as a mathematician, I occasionally find these computations tedious, especially when I only need a quick overview of the problem I want to solve.

Luckily, my mentor Austin Rochford recently introduced me to a wonderful package called PyMC3 that allows us to do numerical Bayesian inference. In this article, I will give a quick introduction to PyMC3 through a concrete example.

A concrete example

Let’s assume that we have a coin. We flip it three times and the result is:

[0, 1, 1]

where 0 means that the coin lands tails and 1 means that the coin lands heads. Are we confident in saying that this is a fair coin? In other words, if we let θ be the probability that the coin lands heads, is the evidence strong enough to support the statement that θ = 1/2?

Well, as we do not know anything about the coin other than the result of the above experiment, it is hard to say anything for sure. From the frequentist perspective, a point estimate for θ would be

θ̂ = (number of heads)/(number of tosses) = 2/3.

While this number makes sense, the frequentist approach does not really attach a level of confidence to it. In particular, if we repeat the experiment, we are likely to get different point estimates for θ.
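To see this variability concretely, here is a quick simulation (my own sketch, not part of the original analysis): we repeat the three-toss experiment several times with a fair coin and compute the frequentist point estimate each time.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Repeat the 3-toss experiment five times with a fair coin (theta = 0.5)
# and compute the frequentist point estimate (heads / tosses) each time.
estimates = [rng.binomial(1, 0.5, size=3).mean() for _ in range(5)]
print(estimates)  # each estimate is one of 0, 1/3, 2/3, or 1
```

With only three tosses, each point estimate is necessarily a multiple of 1/3, so the estimate jumps around from batch to batch and can never equal the true value 1/2.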

This is where the Bayesian approach can offer some improvement. The idea is simple: as we do not know anything about θ, we assume that θ could be any value in [0,1]. Mathematically, our prior belief is that θ follows a Uniform(0,1) distribution. For those who need a math refresher, the pdf of Uniform(0,1) is given by

p(θ) = 1 for 0 ≤ θ ≤ 1, and p(θ) = 0 otherwise.

We can then use evidence/our observations to update our belief about the distribution of θ.

Let us formally call D the evidence (in our case, the result of our coin tosses). By Bayes' rule, the posterior distribution is computed by

p(θ|D) = p(D|θ) p(θ) / p(D),

where p(D|θ) is the likelihood function and p(θ) is the prior distribution (Uniform(0,1) in this case). There are two ways to go from here.

The explicit approach

In this particular example, we can do everything by hand. More precisely, given θ, the probability that we get two heads out of three coin tosses is given by

p(D|θ) = C(3,2) θ²(1−θ) = 3θ²(1−θ).

By assumption, p(θ) = 1. Next, we evaluate the denominator

p(D) = ∫₀¹ 3θ²(1−θ) dθ.

By some simple algebra, we can see that the above integral is equal to 1/4, and hence

p(θ|D) = 3θ²(1−θ) / (1/4) = 12θ²(1−θ).

Remark: By the same computation, we can also see that if the prior distribution of θ is a Beta distribution with parameters α, β, i.e. p(θ) is the Beta(α, β) density, and the sample consists of N tosses with k heads, then the posterior distribution of θ is Beta(α+k, β+N−k). In our case, α = β = 1, N = 3, k = 2, so the posterior is Beta(3, 2).
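We can verify this conjugacy claim numerically with scipy.stats (another sanity check I added; Beta(3, 2) comes from α+k = 3 and β+N−k = 2):

```python
import numpy as np
from scipy import stats

theta_grid = np.linspace(0, 1, 101)

# posterior derived by hand: 12 * theta^2 * (1 - theta)
by_hand = 12 * theta_grid**2 * (1 - theta_grid)

# conjugate-prior result: Beta(alpha + k, beta + N - k) = Beta(3, 2)
conjugate = stats.beta(3, 2).pdf(theta_grid)

print(np.allclose(by_hand, conjugate))  # True
```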

The numerical approach

In the explicit approach, we were able to compute the posterior distribution of θ exactly by using a conjugate prior. However, conjugate priors are often chosen for computational convenience and might not reflect reality. Furthermore, it is not always feasible to find a conjugate prior.

We can overcome this problem by using Markov chain Monte Carlo (MCMC) methods to approximate the posterior distribution. The math here is beautiful, but for the purposes of this article we will not dive into it. Instead, we will explain how to implement this method using PyMC3.
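Before turning to PyMC3, it may help to see the idea in miniature. The following is a toy random-walk Metropolis sampler for our coin example, written from scratch; it is a simplified sketch of the algorithm, not PyMC3's actual implementation.

```python
import numpy as np

def log_posterior(theta):
    # log of the unnormalized posterior: Uniform(0, 1) prior times theta^2 * (1 - theta)
    if theta <= 0 or theta >= 1:
        return -np.inf
    return 2 * np.log(theta) + np.log(1 - theta)

def metropolis(n_samples, step_size=0.2, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.5  # starting point
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + step_size * rng.normal()  # random-walk proposal
        # accept with probability min(1, posterior(proposal) / posterior(theta))
        if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples[i] = theta
    return samples

samples = metropolis(20000)[1000:]  # discard the burn-in
print(samples.mean())  # should be close to the Beta(3, 2) mean of 0.6
```

Even this crude sampler recovers the posterior mean; PyMC3 automates the proposal tuning, bookkeeping, and diagnostics that this sketch omits.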

To run our code, we import the following packages.

import pymc3 as pm
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from IPython.core.pylabtools import figsize

First, we need to specify the prior distribution for θ. In PyMC3, we can do so with the following lines of code.

with pm.Model() as model:
    theta = pm.Uniform('theta', lower=0, upper=1)

We then fit our model with the observed data. This can be done by the following lines of code.

occurrences = np.array([0, 1, 1])  # our observations

with model:
    obs = pm.Bernoulli("obs", theta, observed=occurrences)  # condition on the observations
    step = pm.Metropolis()
    trace = pm.sample(18000, step=step)

burned_trace = trace[1000:]  # discard the burn-in samples

Here, pm.Metropolis() tells PyMC3 to use the Metropolis algorithm (a special case of Metropolis-Hastings) to approximate the posterior distribution, and the first argument of pm.sample determines the number of samples drawn from it. Finally, as the chain might not have converged at the beginning, it is useful to discard the samples from the first iterations (the burn-in). That is the purpose of the last line in our code.

We can then plot the histogram of our samples obtained from the posterior distribution and compare it with the true density function.

p_true = 0.5
figsize(12.5, 4)
plt.title(r"Posterior distribution of $\theta$")
plt.vlines(p_true, 0, 2, linestyle='--', label=r"true $\theta$ (unknown)", color='red')
plt.hist(burned_trace["theta"], bins=25, histtype='stepfilled', density=True, color='#348ABD')
x = np.arange(0, 1.04, 0.04)
plt.plot(x, 12 * x * x * (1 - x), color='black')  # the true posterior density
plt.legend()
plt.show()

As we can clearly see, the numerical approximation is pretty close to the true posterior distribution.
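Since we know the posterior is Beta(3, 2) in this example, we can also compare the MCMC samples against analytic summaries computed with scipy (my own check, not part of the original article):

```python
from scipy import stats

# analytic posterior for 2 heads in 3 tosses under a Uniform(0, 1) prior
posterior = stats.beta(3, 2)

print(posterior.mean())          # 0.6, the analytic posterior mean
print(posterior.interval(0.95))  # central 95% credible interval
```

The mean of burned_trace["theta"] should land close to 0.6, and roughly 95% of the samples should fall inside this interval.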

What happens if we increase the sample size?

As we mentioned earlier, the more data we get, the more confident we are about the true value of θ. Let us test our hypothesis by a simple simulation.

We will randomly toss a coin 1000 times, use PyMC3 to approximate the posterior distribution of θ, and then plot the histogram of the samples obtained from this distribution. All of these steps can be done with the following lines of code.

N = 1000  # the number of tosses
occurrences = np.random.binomial(1, p=0.5, size=N)
k = occurrences.sum()  # the number of heads

# fit the observed data
with pm.Model() as model1:
    theta = pm.Uniform('theta', lower=0, upper=1)
    obs = pm.Bernoulli("obs", theta, observed=occurrences)
    step = pm.Metropolis()
    trace = pm.sample(18000, step=step)

burned_trace1 = trace[1000:]

# plot the posterior distribution of theta
p_true = 0.5
figsize(12.5, 4)
plt.title(r"Posterior distribution of $\theta$ for sample size $N=1000$")
plt.vlines(p_true, 0, 25, linestyle='--', label=r"true $\theta$ (unknown)", color='red')
plt.hist(burned_trace1["theta"], bins=25, histtype='stepfilled', density=True, color='#348ABD')
plt.legend()
plt.show()

Here is what we get.

As we can see, the posterior distribution is now centered around the true value of θ.

We can estimate θ by taking the mean of our samples.

burned_trace1['theta'].mean()
# 0.4997847718651745

We see that this is really close to the true answer.

Conclusion

As we can see, PyMC3 performs statistical inference tasks pretty well. Furthermore, it makes probabilistic programming rather painless.

References.

[1] https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
