Introduction to Bayesian Inference

Nailing the basics with some theory and step-by-step examples — diagnosing a disease and estimating a parameter

António Góis
Towards Data Science

--

Motivation

Imagine the following scenario: you are driving an ambulance to a hospital and have to decide between routes A and B. In order to save your patient, you need to arrive in less than 15 minutes. If we estimate that route A takes 12 minutes and route B takes 10 minutes, which would you choose? Route B seems faster, so why not?

The information provided so far consisted of point estimates of the travel times of routes A and B. Now, let’s add information about the uncertainty of each prediction: route A takes 12 min ± 1 min, while route B takes 10 min ± 6 min.

Now the prediction for route B looks significantly more uncertain, possibly risking taking longer than the 15-minute limit. Adding information about uncertainty here can make us change our decision from taking route B to taking route A.


Credit: the previous example was based on what I could remember from a tutorial by Tamara Broderick at Columbia University. Work related to this example can be found in [1].

More broadly, consider the following cases:

  • We want to estimate a quantity which does not have a fixed value — instead, it can change between different values
  • Regardless of whether the true value is fixed or not, we are interested in knowing the uncertainty of our estimate

The ambulance example was intended to illustrate the second case. For the first case, we can have a quick look at the work of Nobel Prize-winning economist Christopher Sims. I will simply quote his student Toshiaki Watanabe:

I once asked Chris why he favoured the Bayesian approach. He replied by pointing to the Lucas critique, which argues that when government and central bank policies change, so do the model parameters, so that they should be regarded not as constants but as stochastic variables.

For both cases, Bayesian inference can be used to model our variables of interest as whole distributions, instead of unique values or point estimates.

Introduction

Central to Bayesian Inference is Bayes’ Rule:

P(A|B) = P(B|A) · P(A) / P(B)

Given the definition of conditional probability, this rule can seem very obvious and almost redundant, although its use is not immediately clear:

P(A|B) = P(A, B) / P(B) (definition of conditional probability)

P(A, B) = P(A|B) · P(B) = P(B|A) · P(A), hence P(A|B) = P(B|A) · P(A) / P(B) (derivation of Bayes’ Rule)

Judea Pearl describes it this way, in The Book of Why [2]:

(…) Bayes’s rule is formally an elementary consequence of his definition of conditional probability. But epistemologically, it is far from elementary. It acts, in fact, as a normative rule for updating beliefs in response to evidence.

Indeed, this seemingly simple concept has received enormous attention. Among its practical consequences: it allows us to invert conditional probabilities [going from P(B|A) to P(A|B)], and to update our belief in the values that A can take (i.e. its distribution) as we observe more samples of B.

Before we dive into examples, let’s just give names to the components of Bayes’ Rule:

P(A|B) = P(B|A) · P(A) / P(B), i.e. posterior = likelihood × prior / evidence

Let’s briefly discuss the meaning of each component. Hopefully the examples in the following section will make these definitions much clearer.

Assume we are interested in knowing about A, but can only observe B:

  • The prior is our belief in “how likely A is to take each possible value”, before observing any values of B. In other words, we must choose a distribution over A; this is where we have the chance of incorporating expert knowledge (that we have a priori), or we can simply choose a uniform distribution if we know nothing.
  • The likelihood represents “for each possible value of A, what is the distribution over all values that B can take”. So we define a set of distributions over B, one distribution corresponding to each value of A. More specifically, each distribution P(B|A=a) is the conditional distribution of B after we assume A is equal to some value a.
  • The evidence is the overall probability of what we have observed (we observe B, which can be regarded as evidence of A). This can be computed once we have the prior and the likelihood, by summing over all possible values of A as shown below. This quantity can be nasty to compute, begging for approximate computation techniques such as Monte Carlo or Variational Inference.
P(B) = Σ_a P(B|A=a) · P(A=a) — computation of the evidence, using the law of total probability. If A is continuous, the summation is replaced by an integral.
  • The posterior is our new belief in the distribution of A [usually different from our prior belief P(A)], after taking into account the newly observed evidence B. A minimal code sketch of this update, for a discrete A, follows this list.
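To make these definitions concrete, here is a minimal sketch of one Bayesian update for a discrete variable A, in plain Python. The function name update_belief and the dictionary-based representation are illustrative choices of mine, not a standard API:

```python
def update_belief(prior, likelihood, b):
    """One Bayesian update of a discrete variable A after observing B=b.

    prior      -- dict mapping each value a to P(A=a)
    likelihood -- dict mapping each pair (b, a) to P(B=b | A=a)
    b          -- the observed value of B
    """
    # Evidence: P(B=b) = sum over a of P(B=b | A=a) * P(A=a)
    evidence = sum(likelihood[(b, a)] * prior[a] for a in prior)
    # Posterior: P(A=a | B=b) = P(B=b | A=a) * P(A=a) / P(B=b)
    return {a: likelihood[(b, a)] * prior[a] / evidence for a in prior}
```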

Additionally, we can denote the fraction likelihood/evidence as the likelihood ratio.

Note that we can consider both the prior and the likelihood function as fixed, i.e. they are assumptions chosen before we begin observing samples of B. In this case, the evidence and the posterior are consequences of our assumptions + observations.

The following two examples were very important for me to see how these concepts materialize.

Example 1 — Disease and Test

Credit: The following example was inspired by Judea Pearl’s The Book of Why [2].

Imagine we want to estimate the probability of having a disease D, given the result of our test T. If we know that 10% of the population has D, we can incorporate this in our prior (if we know details about the patient, we can use a subset of the population which better represents her):

  • P(D=1)=0.1
  • P(D=0)=0.9

Note that D can only take two values (0 or 1 — healthy or sick), but in other problems our variable could accept more discrete values, or even continuous values.

We also know, from historic data, how reliable the test is — we know how likely the test is to fail when the patient has D, and also when they don’t. This will constitute our likelihood function:

  • P(T=1|D=1)=0.75; P(T=0|D=1)=0.25
  • P(T=1|D=0)=0.2; P(T=0|D=0)=0.8

This way, we have one conditional distribution over T, for each value of D.

As you can see, we assume we know the probability of each test result given the truth about the disease — P(T|D) — but we are interested in the opposite [the probability of having the disease given the test result — P(D|T)]. This illustrates the importance of being able to invert conditional probabilities, which Bayes’ Rule allows:

P(D|T) = P(T|D) · P(D) / P(T)

Imagine we took a test which gave a positive result T=1. We now wish to compute the posterior, the updated probability that we have the disease given this new information. We can first compute the evidence:

  • P(T=1) = P(T=1|D=1)*P(D=1)+P(T=1|D=0)*P(D=0) = 0.75*0.1+0.2*0.9=0.255

To output the updated distribution over D, let’s now compute each of the values that D can assume:

  • P(D=1|T=1) = P(T=1|D=1)*P(D=1)/P(T=1) = 0.75*0.1/0.255=0.29
  • P(D=0|T=1) = P(T=1|D=0)*P(D=0)/P(T=1) = 0.2*0.9/0.255=0.71

Naturally the second step is redundant here [P(D=0|T=1) is just 1 − P(D=1|T=1)], but in other settings D may take more than 2 values. Our updated distribution says that P(D=1) increased from 10% to 29% after getting a positive test. If we were to take a second test, we could use this new distribution as input instead of the original prior P(D). If the new test is positive again (T=1), we can repeat this computation, further increasing P(D=1) to roughly 60% (the code sketch after the bullets below reproduces both updates):

  • P(T=1)=0.75*0.29+0.2*0.71=0.36
  • P(D=1|T=1) = P(T=1|D=1)*P(D=1)/P(T=1) = 0.75*0.29/0.36=0.6
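As a sanity check, the whole computation fits in a few lines of Python. This is a sketch with the numbers hard-coded from the example above; note that the text rounds the intermediate values (0.29, 0.71), so the exact second update gives ~0.61 rather than 0.6:

```python
prior = {1: 0.1, 0: 0.9}                   # P(D=1), P(D=0)
likelihood = {(1, 1): 0.75, (0, 1): 0.25,  # P(T|D=1)
              (1, 0): 0.20, (0, 0): 0.80}  # P(T|D=0)

belief = prior
for _ in range(2):                         # two positive tests (T=1) in a row
    evidence = sum(likelihood[(1, d)] * belief[d] for d in belief)
    belief = {d: likelihood[(1, d)] * belief[d] / evidence for d in belief}
    print(belief)
# First update:  P(D=1) ~ 0.294, the 29% above
# Second update: P(D=1) ~ 0.610, the roughly 60% above
```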

It’s also interesting now to observe the meaning of the previously mentioned likelihood ratio (likelihood/evidence): it is the factor by which our previous belief is scaled up or down.

Hopefully this was useful to understand the mechanics of Bayes’ Rule when updating a belief. However, to me it didn’t feel like enough to understand what happens when talking about a model’s parameters (e.g. Christopher Sims’ research on modelling the economy, or many machine learning applications).

Example 2 — Modelling a Binomial

Moving to a more abstract domain, Bayes’ Rule can be applied to estimate a model’s parameter as a distribution. For instance, Christopher Sims took advantage of Bayes’ Rule to model an economy where governments and policies change. In this setting, if we describe a model parameter with a distribution we can account for such changes more precisely than using a single value.

Credit: the following example was partly inspired by Antonio Salmerón’s lecture at the Probabilistic.AI summer school (slides, video)

We have a Binomial distribution that represents the outcome of 20 binary experiments, each with probability of success p (say, 20 flips of a biased coin). As you may guess, we are interested in estimating the parameter p from observations x (with p representing how biased the coin is). In this case we consider a single observation of x, which tells us how many of the 20 independent experiments were successful — how many times we got tails from the coin flip.

One observation of x can have any value between 0 and 20 successes. Using the definition of Binomial distribution we can build 21 different likelihood functions P(x|p), one for each value of x:

Likelihood functions of parameter p, given x=2, 4, 10 or 15. Generated with © 2020 Wolfram Mathematica

The area under each of these curves does NOT need to equal 1 — just like in the previous example, P(T=1|D=1)+P(T=1|D=0) = 0.75+0.2 ≠ 1. These curves are not probability distributions because, in P(x|p), we are fixing x and varying p. If we instead fix p (for instance p=0.3) and vary x, then the values must sum to 1 (a quick numerical check follows the plot below):

Probability function of variable x, given p=0.3. You can change the plot in this link. ©2020 Wolfram Alpha LLC
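Here is a short numerical check of this asymmetry, a sketch using scipy.stats.binom (the 1001-point grid over p is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom

n = 20
p_grid = np.linspace(0, 1, 1001)

# Fix x and vary p: a likelihood function. Its area need not be 1.
for x in (2, 4, 10, 15):
    # the mean over a uniform grid on [0, 1] approximates the integral
    area = binom.pmf(x, n, p_grid).mean()
    print(f"x={x}: area under the likelihood curve ~ {area:.3f}")  # ~0.048 (= 1/21)

# Fix p and vary x: a probability distribution. It must sum to 1.
print(binom.pmf(np.arange(n + 1), n, 0.3).sum())                   # ~1.0
```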

To provide additional intuition, let’s regard the function P(x|p) as having 2 input arguments: x and p. If we allow both to vary simultaneously, this is what it looks like:

P(x|p) for all values of x and p. In orange we see the intersection of P(x|p) with the plane p=0.3. Generated with © 2020 Wolfram Mathematica

If we fix x to an observed value, we have a likelihood function (we are slicing this 3D plot across the x-axis, obtaining one blue curve); if we fix p, we have a probability distribution (we slice it across the p-axis, obtaining a curve like the orange one). After we observe x, we fix that value and obtain the corresponding likelihood function to use with Bayes’ rule.

Ok, we have almost everything we need to update our belief in the value of p after observing x! A few notes before moving on:

  • Using a frequentist approach, we could now get one point estimate of p by simply picking the p which maximizes the likelihood function of the observed x.
  • This example probably seems more complex than the previous one. The reason is that our unobserved variable changed from discrete (Disease = healthy or sick) to continuous (p = any value between 0 and 1). For this reason, we cannot enumerate the likelihood of each possible value of p, as we did in the previous example [P(T=positive|D) for each value of D]. We can get the likelihood of any specific value of p, but not list them all, since there are infinitely many possible values of p between 0 and 1!
  • After completing an inference step, we won’t be able to enumerate the posterior probabilities of all p’s either, for the same reason [previously P(D|T=positive) for D=healthy and D=sick, but now p takes many values in the posterior P(p|x)].
  • Our observed variable remains discrete, but it increased from two possible values (test result T = 0 or 1) to twenty-one (number of observed successes x = 0, 1, 2, …, 20). This also makes the visualisations a bit harder. In other settings, x could even be continuous.
  • A random fact about the word likelihood: Latin languages use a much more precise term — verosimilhança in Portuguese — whose single meaning is “how similar something is to the truth”. In either language, the chosen term avoids the word probability, since a likelihood’s values may not sum to 1.

Finally, if we assume a uniform prior between 0–1 for p, we have everything we need to update our belief in p after observing a new sample x.

P(θ|x) = P(x|θ) · P(θ) / P(x) — the components of Bayes’ Rule, when modelling a system’s parameters θ given observations x [Antonio Salmerón]

In this particular case (Binomial likelihood with Uniform prior) the posterior is easy to compute, which allows for a quick and exact result. However, if we assume other kinds of likelihood functions and priors we may need to resort to other techniques to compute our posterior, such as variational inference.
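For the record, the exact result: a Uniform(0, 1) prior is the same as a Beta(1, 1) distribution, and the Beta family is conjugate to the Binomial likelihood, so after observing x successes in n trials the posterior is simply Beta(x+1, n−x+1). A minimal sketch with scipy (the observed value x = 15 is just an example):

```python
from scipy.stats import beta

n, x = 20, 15                       # 20 trials, 15 observed successes
posterior = beta(x + 1, n - x + 1)  # Beta(16, 6): exact posterior under a uniform prior

print(posterior.mean())             # 16/22 ~ 0.727, slightly below the MLE of 0.75
print(posterior.interval(0.95))     # a central 95% credible interval for p
```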

Some final notes on this example:

  • If we only wish to obtain a point estimate from Bayes’ Rule, we can ignore the computation of the denominator (evidence) without changing results — we can simply maximize the numerator. This is called Maximum a posteriori (MAP).
  • Besides ignoring the denominator, we may also ignore the prior when finding our point estimate. This can change the result (compared to MAP), and corresponds to the frequentist approach — Maximum Likelihood Estimation (MLE). The sketch below compares the two.
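Here is a grid-based sketch of both point estimates. The grid itself is an approximation, and with the uniform prior used here MAP and MLE coincide; a non-flat prior, e.g. scipy.stats.beta.pdf(p_grid, 2, 2), would pull the MAP towards it:

```python
import numpy as np
from scipy.stats import binom

n, x = 20, 15
p_grid = np.linspace(0.001, 0.999, 999)   # avoid the endpoints 0 and 1

likelihood = binom.pmf(x, n, p_grid)      # P(x|p) for each p on the grid
prior = np.ones_like(p_grid)              # uniform prior over p

p_mle = p_grid[np.argmax(likelihood)]          # MLE: maximize the likelihood alone
p_map = p_grid[np.argmax(likelihood * prior)]  # MAP: maximize likelihood * prior

print(p_mle, p_map)                       # both ~0.75 here, since the prior is flat
```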

In this post I tried to motivate the use of Bayesian Inference, and to clarify its basic ideas. For me, the most confusing part was understanding what exactly must be provided a priori. Besides the prior, the likelihood function must be provided beforehand, but this can mean very different things, which are interesting to compare:

  • In the disease/test example, the likelihood function is a fixed probability of the test failing for each kind of patient (sick or healthy). These probabilities were known beforehand (possibly derived from historic data).
  • In the binomial example, this previous knowledge amounts to assuming that the likelihood behaves like a Binomial of 20 independent experiments. Each of the 21 possible likelihood functions (analogous to the 2 kinds of patient) can be derived from this assumption.

[1] D. Woodard, G. Nogin, P. Koch, D. Racz, M. Goldszmidt and E. Horvitz, Predicting travel time reliability using mobile phone GPS data (2017), Transportation Research Part C: Emerging Technologies

[2] J. Pearl and D. Mackenzie, The Book of Why (2018), Basic Books, Inc.
