The world’s leading publication for data science, AI, and ML professionals.

Bayesian Statistics

Hard to believe there was once a controversy over probabilistic statistics

Getting Started

Frequentists vs Bayesians | Photo by Lucas Benjamin
Frequentists vs Bayesians | Photo by Lucas Benjamin

This article builds on my previous article about Bootstrap Resampling.

Introduction to Bayes Models

Bayesian models are a rich class of models, which can provide attractive alternatives to Frequentist models. Arguably the most well-known feature of Bayesian statistics is Bayes Theorem, more on this later. With the recent advent of greater computational power and general acceptance, Bayes methods are now widely used in areas ranging from medical research to natural language processing (NLP) to understanding to web searches.

In the early 20th century there was a big debate about the legitimacy of what is now called Bayesian, which is essentially a probabilistic way of doing some statistics – in contrast to the "Frequentist" camp that we are all intimately familiar with. To use the parlance of our times, it was the #Team Jacob vs #Team Edward debate of its time, among statisticians. The argument is now moot as almost everyone is both Bayesian and Frequentist. A good heuristic is that some problems are better handled by Frequentist methods and some with Bayesian methods.

By the end of this article I hope you can appreciate this joke: A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes she has seen a mule!

History

A limited version of Bayes theorem was proposed by the eponymous Reverend Thomas Bayes (1702–1761). Bayes’ interest was in probabilities of gambling games. He also was an avid supporter of Sir Isaac Newton’s newfangled theory of calculus with his amazingly entitled publication, "An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of The Analyst."

In 1814, Pierre-Simon Laplace published a version of Bayes theorem similar to its modern form in the "Essai philosophique sur les probabilités." Laplace applied Bayesian methods to problems in celestial mechanics. Solving these problems had great practical implications for ships in the late 18th/early 19th centuries. The geophysicist and mathematician Harold Jeffreys extensively used Bayes’ methods.

WWII saw many successful applications of Bayesian methods:

  • Andrey Kolmagorov, a Russian (née Soviet) statistician used Bayes methods to greatly improve artillery accuracy
  • Alan Turing used Bayesian models to break German U-boat codes
  • Bernard Koopman, French born American mathematician, helped the Allies ability to locate German U-boats by intercepting directional radio transmissions

The latter half of the 20th century lead to the following notable advances in computational Bayesian methods:

  • Statistical sampling using Monte Carlo methods; Stanislaw Ulam, John von Neuman; 1946, 1947
  • MCMC, or Markov Chain Monte Carlo; Metropolis et al. (1953) Journal of Chemical Physics
  • Hastings (1970), Monte Carlo sampling methods using Markov chains and their application
  • Geman and Geman (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images
  • Duane, Kennedy, Pendleton, and Roweth (1987), Hamiltonian MCMC
  • Gelfand and Smith (1990), Sampling-based approaches to calculating marginal densities.

Bayesian vs Frequentist views

With greater computational power and general acceptance, Bayes methods are now widely used in areas ranging from medical research to natural language understanding to web search. Among pragmatists, the common belief today is that some problems are better handled by frequentist methods and some with Bayesian methods.

From my first article on the Central Limit Theorem (CLT), I established that you need "enough data" for the CLT to apply. Bootstrap sampling is a convenient way of obtaining more distributions that approach your underlying distribution. These are Frequentist tools; but what happens if you have too little data to reliably apply the CLT? As a probabilistic Bayesian statistician, I already have some estimates, admittedly, with a little bit of guesswork, but there is a way of combining that guesswork in an intelligent fashion with my data. And that is what Bayesian Statistics is basically all about – you combine it and basically, that combination is a simple multiplication of the two probable probability distributions, the one that you guessed at, and the other one, that for which you have evidence.

The more data you have, the prior distribution becomes less and less important and Frequentist methods can be used. But, with less data, Bayesian’s will give you a better answer.

So how then do Bayesian and Frequentist methodologies differ? Simply stated:

  • Bayesian methodologies use prior distributions to quantify what is known about the parameters
  • Frequentists do not quantify anything about the parameters; instead using p-values and confidence intervals (CI’s) to express the unknowns about parameters

Both methods are useful, converging with more and more data – but contrasted:

Strict Frequentist approach

  • Goal is a point estimate and CI
  • Start from observations
  • Re-compute model given new observations
  • Examples: Mean estimate, t-test, ANOVA

Bayesian approach

  • Goal is posterior distribution
  • Start from prior distribution (the guesswork)
  • Update posterior (the belief/hypothesis) given new observations
  • Examples: posterior distribution of mean, highest density interval (HDI) overlap

Today no one is a strict Frequentist. Bayesian methodologies simply work, that is fact.

Bayes theorem

Bayes theorem describes the probability of an event, based on prior knowledge (our guesswork) of conditions that might be related to the event, our data. Bayes theorem tells us how to combine these two probabilities. For example, if developing a disease like Alzheimer’s is related to age, then, using Bayes’ theorem, a person’s age can be used to more reliably assess the probability that they have Alzheimer’s, or cancer, or any other age-related disease.

Bayes theorem is used to update the probability for a hypothesis; the hypothesis is this guesswork, this prior distribution thing. Prior means your prior belief, the belief that you have before you have evidence, and it is a way to update it. Because you get more information, you update your hypothesis, and so on and so forth.

For those who do not know, or who have forgotten, let us derive Bayes theorem from the rule for conditional probability:

Conditional probability
Conditional probability

Translates to "the probability of A happening given B is the case."

"The probability of B happening given A happens."

Eliminating 𝑃(𝐴∩𝐵) and doing very minor algebra:

or, rearranging terms

Which is Bayes theorem! This describes how one finds the conditional probability of event A given event B.

Applications

You have a disease test, and the probability that you will get a positive test result given that you have the disease is really, really high; in other words the test has a very high accuracy rate. The problem is that there is a probability that you will get a positive test result even if you do not have the disease. And that you can simply calculate from Bayes law. The big point is, is that these probabilities are not the same as the probability that you will get a positive result given the disease is not the same as the probability that you will have the disease given a positive result.

The test sensitivity does not match the rate of incidence of the disease
The test sensitivity does not match the rate of incidence of the disease

These are two different probability distributions. And what makes them so different is the probability of disease and the probability of a positive test result. So if the disease is rare, the probability of disease will be very, very small.

Disease testing: A = Have disease, B = Tested positive.

Example

The sensitivity of this test is quite bad despite the high accuracy of the test
The sensitivity of this test is quite bad despite the high accuracy of the test

Using Bayes theorem we have deduced that a positive test result for this disease indicates that only 10% of the time the person will actually have the disease, because the incidence rate of the disease is an order of magnitude lower than the false positive rate.

Python example: probabilities of hair and eye color

A sample population has the following probabilities of hair and eye color combinations. I will run the code below in a Jupyter Notebook to find the conditional probabilities:

# begin with the imports
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import scipy
import seaborn as sns
import itertools
%matplotlib inline
# create the dataframe
hair_eye = pd.DataFrame({
    'black': [0.11, 0.03, 0.03, 0.01],
    'blond': [0.01, 0.16, 0.02, 0.03],
    'brunette': [0.2, 0.14, 0.09, 0.05],
    'red': [0.04, 0.03, 0.02, 0.02],

}, index=['brown', 'blue', 'hazel', 'green'])
hair_eye

N.B: this is string index for eye color rather than a numeric zero-based index.

hair_eye.loc['hazel', 'red']
2% of people have hazel eyes given red hair
2% of people have hazel eyes given red hair

The figures in the table above are the conditional probabilities. Note that in this case:

𝑃(hair|eye)=𝑃(eye|hair)P(hair|eye)=P(eye|hair)

Given these conditional probabilities, it is easy to compute the marginal probabilities by summing the probabilities in the rows and columns. The marginal probability is the probability along one variable (one margin) of the distribution. The marginal distributions must sum to unity (1).

## Compute the marginal distribution of each eye color
hair_eye['marginal_eye'] = hair_eye.sum(axis=1)
hair_eye
Added marginal_eye column
Added marginal_eye column
hair_eye.sum(axis=0)
The hair color probabilities sum to unity
The hair color probabilities sum to unity
hair_eye.loc['marginal_hair'] = hair_eye.sum(axis=0)
hair_eye.describe
The eye color probabilities also sum to 1.0
The eye color probabilities also sum to 1.0

I have blue eyes. I wonder what the marginal probabilities of hair color are given that they have blue eyes?

# probability of blue eyes given any hair color divided by total probability of blue eyes
# marginal probabilities of hair color given blue eyes
#blue_black(.03/.36) + blue_brunette(.14/.36) + blue_red(.03/.36) + blue_blond(.16/.36)
blue = hair_eye.loc['blue',:]/hair_eye.loc['blue','marginal_eye']
blue
For a given person with blue eyes, it is 44% likely they have blond hair, like me
For a given person with blue eyes, it is 44% likely they have blond hair, like me

Applying Bayes theorem

There is a formulation of Bayes theorem that is convenient to use for computational problems because we do not want to sum all of the possibilities to compute P(B), like in the example above.

Here are more facts and relationships about conditional probabilities, largely arising from the fact that P(B) is so hard to calculate:

Rewriting Bayes theorem:

This is a mess, but we do not always need the complicated denominator, which is the probability of B. P(B) is our actual evidence, which is hard to come by. Remember that this is probabilistic and that we are working with smaller data samples than we are normally used to.

Rewriting we get:

Ignoring the constant 𝑘, because it is given that the evidence – P(B) – is presumed to be constant, we get:

We do not need to know the probability distribution of the actual evidence, of the data
We do not need to know the probability distribution of the actual evidence, of the data

Applying the reduced relationship Bayes theorem

We interpret the reduced relationship above as follows:

Proportional Bayes
Proportional Bayes

The proportional relationships apply to the observed data distributions, or to parameters in a model (partial slopes, intercept, error distributions, lasso constant, e.g.,). The above equations tell you how to combine your prior belief if you want to call that, which I call the prior distribution with your evidence.

An important nuance is that while we cannot speak about exact values we can talk about distributions; we can talk about the shapes of the curves of the probability distribution curves. What we want is the so-called posterior distribution. That is what we want. What we do is guess at the prior distribution: what is the probability of A? What is the probability that something is true given the data that I have? What is the probability of A given B? what is the probability that something is true given the data I have? Some people will rephrase this as what is the probability that my hypothesis A is true given the data B?

I might have very little data. Too little to do any Frequentist approaches; but I can estimate that with the little small amount of data I have, called the likelihood, I can estimate the probability that this data would be correct, given my hypothesis that I can calculate directly. Because I have the data, not a lot of data, it is too little for me to actually believe, but I have some data. And I have my hypothesis. And so I can do my calculations on my distributions.

The prior distribution is the probability of my hypothesis, which I do not know – but I can guess it. If I am really bad at guessing, I am going to say that this prior distribution is a uniform distribution, it could be a Gaussian or any other shape. Now if I guess a uniform distribution I am still doing probabilistic statistics, Bayesian statistics, but I am getting really, really close to what Frequentists would do.

In Bayesian, I will update this prior distribution as I get more data because I will say, "Hey!" at first I thought it was uniform. Now I see that this thing, this probability distribution actually has a mean somewhere in the middle. So I am now going to choose a different distribution. Or I’m going to modify my distribution to accommodate what I am seeing in my posterior. I am trying to make this prior distribution look similar in shape to my posterior distribution. And when I say similar in shape, that is actually very misleading. What I am trying to do is find a mathematical formula that will mix the probability of A and the probability of A given B similarly. Similar in type that the math, that the equation looks similar, as in the formulaic of the question and not the actual probabilities; we call that getting the conjugate prior. This is when you search for a distribution that is mathematically similar to what you believe to be the final distribution. This is formally referred to as searching for a conjugate prior. A beta distribution, because of its flexibility is often, but not always, used as the prior.

To reiterate, the hypothesis is formalized as parameters in the model, P(A), and the likelihood, P(B|A), is the probability of the data given those parameters – this is easy to do, t-tests can do this!

Creating Bayes models

Given prior assumptions about the behavior of the parameters (the prior), produce a model which tells us the probability of observing our data, to compute new probability of our parameters. Thus the steps for working with a Bayes model are:

  • Identify data relevant to the research question: e.g., measurement scales of the data
  • Define a descriptive model for the data. For instance, pick a linear model formula or a beta distribution of (2,1)
  • Specify a prior distribution of the parameters – this the formalization of the hypothesis. For example, thinking the error in the linear model is normally distributed as 𝑁(𝜃,𝜎²)
  • Choose the Bayesian inference formula to compute posterior parameter probabilities
  • Update Bayes model if more data is observed. This is key! The posterior distribution naturally updates as more data is added
  • Simulate data values from realizations of the posterior distribution of the parameters

Choosing a prior

The choice of the prior distribution is very important to Bayesian analysis. A prior should be convincing to a skeptical audience. Some heuristics to consider are:

  • Domain knowledge (SME)
  • Prior observations
  • If poor knowledge use less informative prior
  • Caution: the uniform prior is informative, you need to set the limits on the range of values

One analytically and computationally simple choice is the before mentioned conjugate prior. When a likelihood is multiplied by a conjugate prior the distribution of the posterior is the same as the likelihood. Most named distributions have conjugates:

A conjugate prior is not always used
A conjugate prior is not always used

Python Example: COVID-19 random sampling

As a health official for your city, you are interested in analyzing COVID-19 infection rates. You decide to sample 10 people at an intersection while the light is red and determine if they test positive for Covid – the test is fast and to avoid sampling bias we take the test to them. The data are binomially distributed; a person tests positive or tests negative – for the sake of simplicity, assume the test is 100% accurate.

In this notebook example:

  • Select a prior for the parameter 𝑝, the probability of having COVID-19.
  • Using data, compute the likelihood.
  • Compute the posterior and posterior distributions.
  • Try another prior distribution.
  • Add more data to our data set to updated the posterior distribution.

The likelihood of the data and the posterior distribution are binomially distributed. The binomial distribution has one parameter we need to estimate, 𝑝, the probability. We can write this formally for 𝑘 successes in 𝑁 trials:

The following code computes some basic summary statistics:

sampled = ['yes','no','yes','no','no','yes','no','no','no','yes']
positive = [1 if x is 'yes' else 0 for x in sampled]
positive
N = len(positive) # sample size
n_positive = sum(positive) # number of positive drivers
n_not = N - n_positive # number negative
print('Tested positive= %d Tested negative= %d'
 'nProbability of having COVID-19= %.1f' 
 % (n_positive, n_not, n_positive / (n_positive+ n_not)))

For the prior I will start with a uniform distribution as I do not know what to expect re Covid rates:

N = 100
p = np.linspace(.01, .99, num=N)
pp = [1./N] * N
plt.plot(p, pp, linewidth=2, color='blue')
plt.show()

Next step: compute the likelihood. The likelihood is the probability of the data given the parameter, 𝑃(𝑋|𝑝). Each observation of each test as "positive" or "not" is a Bernoulli trial, so I choose the binomial distribution.

def likelihood(p, data):
 k = sum(data)
 N = len(data)
 # Compute Binomial likelihood
 l = scipy.special.comb(N, k) * p**k * (1-p)**(N-k)
 # Normalize the likelihood to sum to unity
 return l/sum(l)
l = likelihood(p, positive)
plt.plot(p, l)
plt.title('Likelihood function')
plt.xlabel('Parameter')
plt.ylabel('Likelihood')
plt.show()
The probability our data is true given our hypothesis
The probability our data is true given our hypothesis

Now that we have a prior and a likelihood we can compute the posterior distribution of the parameter 𝑝: 𝑃(𝑝|𝑋).

N.B. The computational methods used here are simplified for the purpose of illustration. Computationally efficient code must be used!

def posterior(prior, like):
 post = prior * like # compute the product of the probabilities
 return post / sum(post) # normalize the distribution
def plot_post(prior, like, post, x):
 maxy = max(max(prior), max(like), max(post))
 plt.figure(figsize=(12, 4))
 plt.plot(x, like, label='likelihood', linewidth=12, color='black', alpha=.2)
 plt.plot(x, prior, label='prior')
 plt.plot(x, post, label='posterior', color='green')
 plt.ylim(0, maxy)
 plt.xlim(0, 1)
 plt.title('Density of prior, likelihood and posterior')
 plt.xlabel('Parameter value')
 plt.ylabel('Density')
 plt.legend()

post = posterior(pp, l)
plot_post(pp, l, post, p)
print('Maximum of the prior density = %.3f' % max(pp))
print('Maximum likelihood = %.3f' % max(l))
print('MAP = %.3f' % max(post))

With a uniform prior distribution, the posterior is just the likelihood. The key point is that the Frequentist probabilities are identical to the Bayesian posterior distribution given a uniform prior.

Trying a different prior

From the chart above, I chose the conjugate prior of the Binomial distribution which is the beta distribution. The beta distribution is defined on the interval 0 ≤ Beta(P|A, B)≤10 ≤ Beta(P|A, B)≤ 1. The beta distribution has two parameters, 𝑎 and 𝑏, which determine the shape.

plt.figure(figsize=(12, 8))
alpha = [.5, 1, 2, 3, 4]
beta = alpha[:]
x = np.linspace(.001, .999, num=100) #100 samples
for i, (a, b) in enumerate(itertools.product(alpha, beta)):
 plt.subplot(len(alpha), len(beta), i+1)
 plt.plot(x, scipy.stats.beta.pdf(x, a, b))
 plt.title('(a, b) = ({}, {})'.format(a,b))
plt.tight_layout()
Note how varied these beta distributions can become with slightly different parameters
Note how varied these beta distributions can become with slightly different parameters

I still do not know a lot about the behavior of drivers at this intersection who may or may not have COVID-19, so I pick a rather uninformative or broad beta distribution as the prior for the hypothesis. I choose a symmetric prior with a=2 and b=2:

def beta_prior(x, a, b):
 l = scipy.stats.beta.pdf(p, a, b) # compute likelihood
 return l / l.sum() # normalize and return
pp = beta_prior(p, 2, 2)
post = posterior(pp, l)
plot_post(pp, l, post, p)

Take note that the mode of the posterior is close to the mode of the likelihood, but has shifted toward the mode of the prior. This behavior of Bayesian posteriors to be shifted toward the prior is known as the shrinkage property: the tendency of the maximum likelihood point of the posterior is said to shrink toward the maximum likelihood point of the prior.

Updating the Bayesian model

Let us update the model with 10 new observations to our data set. Note that the more observations added to the model moves the posterior distribution closer to the likelihood.

N.B. With large datasets, it might require HUGE amounts of data to see convergence in behavior of the posterior and the likelihood.

new_samples = ['yes','no','no','no','no',
 'yes','no','yes','no','no'] # new data to update model, n = 20
new_positive = [1 if x is 'yes' else 0 for x in new_samples]
l = likelihood(p, positive+ new_positive)
post = posterior(pp, l)
plot_post(pp, l, post, p)
Compared to the first one, note the convergence of the two distributions as we updated with more data
Compared to the first one, note the convergence of the two distributions as we updated with more data

Credible Intervals

A credible interval is an interval on the Bayesian posterior distribution. The credible interval is sometimes called the highest density interval (HDI). For instance, the 90% credible interval encompasses 90% of the posterior distribution with the highest probability density. It can be confusing since both credible intervals of Bayesian and confidence intervals of Frequentist are abbreviated the same (CI).

Luckily, the credible interval is the Bayesian analog of the Frequentist confidence interval. However, they are conceptually different. The confidence interval is chosen on the distribution of a test statistic, whereas the credible interval is computed on the posterior distribution of the parameter. For symmetric distributions, the credible interval can be numerically the same as the confidence interval. However, in the general case, these two quantities can be quite different.

Now we plot the posterior distribution of the parameter of the binomial distribution parameter p. The 95% credible interval, or HDI, is also computed and displayed:

num_samples = 100000
lower_q, upper_q = [.025, .975]
# Caution, this code assumes a symmetric prior distribution and will not work in the general case
def plot_ci(p, post, num_samples, lower_q, upper_q):
 # Compute a large sample by resampling with replacement
 samples = np.random.choice(p, size=num_samples, replace=True, p=post)
 ci = scipy.percentile(samples, [lower_q*100, upper_q*100]) # compute the quantiles

 interval = upper_q - lower_q
 plt.title('Posterior density with %.3f credible interval' % interval)
 plt.plot(p, post, color='blue')
 plt.xlabel('Parameter value')
 plt.ylabel('Density')
 plt.axvline(x=ci[0], color='red')
 plt.axvline(x=ci[1], color='red')
 print('The %.3f credible interval is %.3f to %.3f' 
 % (interval, lower_q, upper_q))

plot_ci(p, post, num_samples, lower_q, upper_q)

Because the posterior distribution above is symmetric, like the beta prior we chose, the analysis is identical to CI’s in the frequentist approach. We can be 95% confident that around 40% of driver’s tested at this intersection during a red light will have COVID-19! This info is very useful for health agencies to allocate resources appropriately.

If you remember the joke in the third paragraph, about the Bayesian who vaguely suspected a horse and caught a fleeting glimpse of a donkey and strongly suspects seeing a mule – I hope you can now assign each clause of the joke the appropriate part Bayes theorem.

Check out my next piece on linear regression and using bootstrap with regression models.

Find me on Linkedin

Physicist cum Data Scientist- Available for new opportunity | SaaS | Sports | Start-ups | Scale-ups


Related Articles