Frequentist vs Bayesian Statistics

A Practical Introduction to Parameter Estimation with Python

Gabriel Gilling
Towards Data Science


Parameter estimation is a critical component of statistical inference, data science, and machine learning workflows. Though this topic can be complex, we offer an introduction to the process comparing two approaches with some theory and some code.


Whether it be an outcome that is yet to occur, or a reality that we cannot yet glimpse, we are obsessed with knowing the unknown. We spend immense resources in hopes of producing more accurate predictions of the future, and relish those whose forecasts consistently get it right. Over the past century, the burgeoning field of statistical inference has presented a more robust toolset for modeling these unknowable outcomes and relationships. The aim of statistical inference is to use observable data to infer the properties of a random variable (RV). The term “random” can be confusing and does not mean that a variable takes on values completely at random, but rather that it takes on different values, which are determined by an underlying probability distribution.

In this blog post, we’ll be working with one of the simplest RVs out there, the outcome of a coin flip, in order to understand a fundamental aspect of inference called parameter estimation. While a coin is a simple RV because it can take on only two values (heads or tails), you can think of other RVs such as dice rolls, which can take on 6 values in the case of a 6-sided die, or stock prices, which can theoretically take on any positive value.

In statistical inference, we want to understand the relationship between RVs in order to understand and make predictions about the world around us. The entity that governs the relationship between different RVs is called a parameter, and is often denoted with the Greek letter θ (theta). We can write that relationship mathematically in the following way:

Y = f(X; θ)

Y is our dependent (or outcome) variable and X is our independent (or predictor) variable. θ is the parameter governing the relationship between Y and X; the parameter space encompasses all of the potential values θ can take on.

Although the true value of a parameter is by definition unknown, we can work with its approximation. In this blog post, we’re going to look at two different approaches to doing so. We’ll start with the Frequentist (or Classical) approach, which uses Maximum Likelihood Estimation (MLE), and then we’ll move on to the Bayesian framework. But first, let’s briefly discuss the Binomial Distribution, a relatively simple but nonetheless invaluable distribution that every data scientist should know, and how to simulate it in Python.

The Binomial Distribution

We’ll center this discussion around the example of a coin-flip (as is mandatory of any statistics text). Specifically, we are interested in the probability that a coin will come up heads in a given flip. In this case, the outcome of our coin-flip is our RV and it can take on the value of 0 (Tails) or 1 (Heads).

We can express the outcome of a single coin-flip as a Bernoulli process, which is a fancy term which says that Y is a single outcome that has two potential values. Formally, we have:

P(Y = 1) = p,  P(Y = 0) = 1 − p

In this context, p is the probability of the coin coming up heads, and this is our parameter of interest. Our task will be to estimate p as precisely as possible.

As such, we have:

P(Y = y) = p^y · (1 − p)^(1 − y),  y ∈ {0, 1}

So, given a coin, how would we estimate p? We could just flip the coin once, but this wouldn’t be very informative. Imagine that the truth is that p=0.5 (the coin is fair and has an equal probability of flipping heads or tails), and after one flip we observe a head. If we relied on that one flip alone, we might conclude that p=1, meaning the probability of flipping heads is 100% and we’d flip heads every time, which sounds dubious.

What we’ll want to do instead is observe a lot of flips. When a Bernoulli process is repeated more than once, it is called a Binomial process. The binomial process is built on the assumption that all of the trials, or coin flips in our case, are independent: whether you flip heads or tails now has no impact on the outcomes of subsequent flips.

Suppose now that we flip our coin n times. Of those n flips, the total number of heads, Y, is a binomial random variable with parameters n and p. n denotes the number of trials, or coin flips, and p is the probability of success, or the probability of tossing heads. Finally, the Binomial distribution’s Probability Mass Function (PMF) gives the probability of observing exactly Y heads out of n trials for every value of p.

Formally, we have:

P(Y = y) = (n choose y) · p^y · (1 − p)^(n − y),  y = 0, 1, …, n
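As a quick check, the PMF can be evaluated directly with scipy.stats (the same library we use below to simulate data):

```python
import scipy.stats as stats

# Probability of exactly 5 heads in 10 flips of a fair coin (p = 0.5)
print(stats.binom.pmf(k=5, n=10, p=0.5))  # 252/1024 ≈ 0.2461
```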

Now that we have a theoretical understanding of parameter estimation and the distribution we’ll be working with, we can start coding.

Simulating some Data

Let’s now assume someone gives us a biased coin that has a 60% probability of flipping heads. We don’t know that, though: we want to estimate that probability ourselves.

We can use the scipy.stats library to draw outcomes from the Binomial distribution. Simulations are very useful because we can hard-code the “true” parameters, which then lets us compare how well different frameworks approximate them.

Let’s simulate 10000 coin flips and observe some results.

import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt

np.random.seed(42)  # set seed for reproducibility

# Set the probability of heads = 0.6
p = 0.6

# Flip the coin 10000 times and observe the results
n_trials = 10000
data = stats.bernoulli.rvs(p, size=n_trials)

# Plot the results
plt.hist(data)

Histogram of 10000 coin flips

sum(data)  # 6108 coin flips out of 10000 are heads, since heads are coded as 1s and tails as 0s

As expected, the split between heads and tails is close to 60/40, with 6108 heads out of 10000 coin flips, and that’s simply because we told the virtual coin flipper that there would be a 60% chance of a coin being heads!

Now that we have our data, let’s compare how the Frequentist and Bayesian approaches obtain the parameter of interest: p.

The Frequentist Approach

Frequentist statistics uses Maximum Likelihood Estimation (MLE). While a full treatment of MLE is outside the scope of this blog post, the core idea is in its name: it finds the parameter values that maximize the likelihood of having produced the observed data.

For the Binomial distribution, the MLE is the sample proportion of successes [1]. This simply means that under the Frequentist framework, our best estimate of the true value of p is the proportion of heads out of all coin flips: if we have 6 heads out of 10 flips, then we estimate p to be 6/10, or 60%.

The MLE is the sample proportion of successes: p̂ = y / n

After obtaining an estimate of p, the next step is to quantify the uncertainty about that estimate. Remember, we never know the parameter’s true value, so we must also quantify its range of plausible values. That range is called the confidence interval, and it is easily calculated for the binomial distribution. First, we calculate the estimate’s standard error, which is simply its standard deviation divided by √n (the square root of the sample size). Then we find the 95% confidence interval by taking the estimate plus or minus the standard error multiplied by the 95% Z-statistic, which is equal to 1.96.

To conclude, we have:

Capturing the binomial distribution’s uncertainty: SE = √(p̂(1 − p̂) / n),  95% CI = p̂ ± 1.96 · SE
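A minimal sketch of these calculations, assuming a hypothetical sample of 10 flips with 5 heads:

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # hypothetical data: 5 heads in 10 flips
n = len(flips)

p_hat = flips.mean()                          # MLE: the sample proportion of heads
se = np.sqrt(p_hat * (1 - p_hat) / n)         # standard error of the estimate
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)   # 95% confidence interval

print(p_hat, ci)  # 0.5, roughly (0.19, 0.81)
```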

Let’s look at what happens when we toss 10 coins. In the graph below, the green line denotes the “true” value of p, which we decided would be 0.6 when we simulated the data. The dashed red line shows the MLE, and the dashed blue lines show the confidence intervals —the range of plausible values that p lies in.

MLE Estimates for 10 coin flips

Because we have observed 5 heads, the MLE is 0.5 (recall that the true value of p is 0.6). The uncertainty is captured by the confidence interval, which indicates that values of p between ~ 0.2 and ~ 0.8 are plausible at the 95% confidence level. Since the width of the confidence interval is inversely proportional to the square root of the sample size, we can expect it to shrink as the number of coin flips increases: the more data we have, the more confident we are about our estimate. To illustrate that, let’s see what happens when we flip increasingly large numbers of coins.

MLE and CI Estimates as n increases

In the plots above, we take snapshots of our results after 1 flip, 10 flips, 25 flips, and so on up to 10,000 flips. These plots give us an idea of how our estimates, and our confidence in those estimates, change as we flip the coin an increasingly large number of times.

As we continue to flip the coin over and over again, we gain more and more information about that coin’s properties. As a result, we should expect our estimates to become more precise, as well as more accurate. When we’ve only flipped the coin a handful of times (say 1 to 100 times), our confidence interval is quite wide. This is because we simply haven’t seen enough information to rule out the possibility that the true probability of heads lies somewhere to the sides of our current MLE. However, as we continue to flip the coin and observe more and more evidence about our parameter of interest, we see the confidence interval starting to narrow and hug the MLE. By the time we’ve flipped the coin 10,000 times, the confidence interval hugs the MLE tightly.

Let’s think about this intuitively: as we gain more evidence, we should become increasingly confident in our estimates. Additionally, and most importantly, we should expect our estimate to get closer and closer to the truth! This is the law of large numbers: as the size of a sample increases, the sample estimate gets closer to the true value for the whole population. This is confirmed here because we set the true probability of heads to 0.6, and indeed our MLE estimate after just 1,000 flips is 0.61, and it does not waver after this (only the confidence interval narrows).
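The narrowing can be checked numerically. A sketch, re-simulating the flips from earlier with the same seed:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
data = stats.bernoulli.rvs(0.6, size=10000)  # same simulated flips as before

# MLE and half-width of the 95% CI after increasing numbers of flips
for n in [10, 100, 1000, 10000]:
    p_hat = data[:n].mean()
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    print(n, round(p_hat, 3), round(half_width, 3))
```

The half-width shrinks roughly by a factor of √10 each time the sample grows tenfold.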

The Bayesian Approach

The material in this section — especially the simulation code is heavily indebted to https://tinyheero.github.io/2017/03/08/how-to-bayesian-infer-101.html [2].

As a refresher, recall that Bayes’ Theorem estimates model parameters by establishing them as a distribution conditional on the data we have observed.

Bayes’ Theorem: P(θ | X) = P(X | θ) · P(θ) / P(X)
  • P(θ) is the prior distribution of our model parameters, which represents our opinions about the relationship between our outcome and predictor variables before we’ve seen any data.
  • P(X|θ) is the likelihood term, indicating how well each candidate value of θ fits the observed data.
  • P(X) is the marginal distribution of the data. In other words, it represents the probability of having observed the data, averaged over all the possible values of θ. When dealing with discrete distributions, P(X) can be obtained by summing P(X|θ)P(θ) over all values of θ; in the continuous case it is obtained by integrating over θ.

The Prior

The first step of the Bayesian workflow is specifying what our prior beliefs are about the outcome variable. For this example, this means encoding our beliefs about the probability that a coin flip comes up heads: that’s right — we’re putting a probability on a probability.
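A minimal sketch of this encoding (the grid and the exact probabilities here are illustrative; any set of values summing to 1 works):

```python
import numpy as np

# 11 candidate values for theta: 0.0, 0.1, ..., 1.0
theta = np.linspace(0, 1, 11)

# Hand-picked prior probabilities, peaked at theta = 0.5 and zero at the extremes
prior = np.array([0, 0.05, 0.1, 0.1, 0.15, 0.2, 0.15, 0.1, 0.1, 0.05, 0])

assert np.isclose(prior.sum(), 1)  # a valid probability distribution sums to 1
```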

The code block above creates the prior distribution. First, the np.linspace function creates 11 values between 0 and 1, at intervals of 0.1. Second, we hard-code the probability associated with each value in the prior distribution. Feel free to play around with different probability values, so long as they all sum to 1!

Prior for Theta: the probability of flipping heads

The prior distribution encapsulates our ideas about the probability of hitting heads without having seen any of the data. For the sake of simplicity, we are going to allow θ to take on only the 11 values from 0 to 1 in 0.1 increments. We’re going to assume that we do not know that the coin is biased, and nothing indicates that it would be.

Consequently, let’s build a distribution that is peaked at 0.5, with a value of 0.2: we are saying that there is a 20% chance that the coin is fair. We are also entertaining the possibility that the coin might not be fair, and that is reflected in the probabilities for θ ranging from 0.1 to 0.9. The only scenarios we believe not to be possible are those where p is equal to 0 (no chance of NEVER flipping heads) or 1 (no chance of ALWAYS flipping heads).

The Likelihood

Now that we have defined and encoded our prior beliefs, the next step of the Bayesian approach is to collect data and incorporate it into our estimates. This step is no different from what we did in the Frequentist approach: we’ll observe the exact same data and again use the likelihood function to extract information about our parameter space. The only difference is that we are no longer after just the MLE, the point estimate that we concluded with in the Frequentist approach; we need the likelihood value for every value in the parameter space.

The likelihood can be a complicated concept to grasp, but essentially it returns a number that tells us how well a certain value of p fits the data. A high likelihood value for a given p means that it “fits” the data well, and vice versa. We are basically asking: for any given p, how plausible is it that it actually generated the data we witnessed?

The equation for the binomial likelihood is its Probability Mass Function (PMF) [3].

Binomial Likelihood Function: L(p; n, y) = (n choose y) · p^y · (1 − p)^(n − y)

Let’s unpack the equation above. We are asking: given n tosses and y heads, what is the likelihood associated with each value of p? For each value of p in the parameter space, the likelihood evaluates how plausible that p value is. As an example, if we’ve observed 1 heads, then the likelihood of p = 0 must be 0: since we’ve observed at least 1 heads, the likelihood of NEVER flipping heads has to be 0. Another example, albeit an unlikely one (see what I did there?), would be witnessing 10 heads out of 10 flips. In that case, the likelihood function would gauge p = 1 to be extremely plausible and assign it the highest likelihood value.

Let’s now compute the likelihood and visualize it. We’ll keep the same first 10 data points from the previous example, so we have 5 heads out of 10 flips.
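A sketch of that computation with scipy.stats, using the same 11-value grid of candidate θ values:

```python
import numpy as np
import scipy.stats as stats

theta = np.linspace(0, 1, 11)   # candidate values for p
n, y = 10, 5                    # 10 flips, 5 of them heads

# The binomial PMF evaluated at each candidate p is the likelihood of that p
likelihood = stats.binom.pmf(y, n, theta)

print(theta[np.argmax(likelihood)])  # 0.5 gets the highest likelihood
```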

Likelihood after observing 5 heads from 10 coin flips

You’ll notice that the likelihood peaks at 0.5, with values further away from it getting increasingly small. That makes sense: the data show that 5 out of 10 flips were heads, so given the data we’ve observed, the most likely true value of θ is 0.5, and 0.5 gets the highest likelihood. Conversely, extreme values such as 0.1 and 0.9 are not corroborated by the data and are consequently given very small likelihood values.

It is important to note, however, that the likelihood is not a valid probability distribution. You can check that yourself: summing the likelihood values over all values of θ does not give 1! That’s where the normalizing constant in the denominator comes in: dividing each prior-times-likelihood value by the normalizing constant (which is a scalar) yields a valid probability distribution that we call the posterior.

The Posterior Distribution

Now that we’ve calculated our likelihood values, we obtain the numerator of Bayes’ Theorem by multiplying them with the prior values, which yields the joint distribution of θ and X. In turn, the denominator is obtained by summing over the resulting vector (marginalizing out θ).
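A sketch of the full update for 5 heads out of 10 flips, assuming an illustrative prior peaked at 0.5 (the exact prior values are hypothetical):

```python
import numpy as np
import scipy.stats as stats

theta = np.linspace(0, 1, 11)
prior = np.array([0, 0.05, 0.1, 0.1, 0.15, 0.2, 0.15, 0.1, 0.1, 0.05, 0])
likelihood = stats.binom.pmf(5, 10, theta)  # 5 heads out of 10 flips

joint = likelihood * prior      # numerator: prior times likelihood
marginal = joint.sum()          # denominator: P(X), theta marginalized out
posterior = joint / marginal    # a valid probability distribution over theta

assert np.isclose(posterior.sum(), 1)
```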

You may have heard that the denominator is what often makes Bayesian Inference computationally intractable. Later blog posts will address why that is the case, but for this scenario we have the advantage of working with a parameter space that is discrete (summation is easier than integration) and that only takes on 11 values, so calculating the denominator is entirely feasible.

In the next blog post, we’ll deal with what happens when we work with continuous distributions. For now, note that our ability to derive the marginal distribution of X is possible because we are working with one parameter, which in turn takes on very few (11) values. This makes summing over all of the values in the parameter space easy, which would not be the case if we were dealing with multiple predictors that can take on thousands of values (or infinitely many, as in the continuous case).

The posterior distribution is a compromise between the likelihood and the prior

So what is the posterior distribution? It is a reflection of our beliefs about our parameter of interest, having incorporated all of the information we have access to. It is the product of two components: our prior beliefs about θ, and the likelihood (the evidence), which reflects the information we gained from the observed data. In combining these two pieces, we obtain a probability distribution for our parameter of interest. This is a key piece separating the Bayesian and Frequentist approaches. In the Frequentist methodology, our answer is expressed in the form of a point estimate (with a confidence interval). In the Bayesian methodology, our answer is expressed in the form of a probability distribution, allowing us to assign a probability to each potential value of p.

In the case above, after having observed 10 flips and 5 heads, our prior distribution has been squeezed by the likelihood towards θ = 0.5, which increasingly seems to be the most likely answer. However, we haven’t seen enough information to rule out the possibility that θ lies somewhere to the side of 0.5, so we will continue to observe more data. As in the Frequentist discussion, let’s see how our Bayesian conclusion changes as we see more and more data.
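A sketch of that updating process, reusing the simulated flips from before and an illustrative prior peaked at 0.5 (the exact prior probabilities are hypothetical):

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
data = stats.bernoulli.rvs(0.6, size=10000)  # simulated flips, true p = 0.6

theta = np.linspace(0, 1, 11)
prior = np.array([0, 0.05, 0.1, 0.1, 0.15, 0.2, 0.15, 0.1, 0.1, 0.05, 0])

# Recompute the posterior after increasing numbers of flips
for n in [1, 10, 100, 1000]:
    y = data[:n].sum()
    likelihood = stats.binom.pmf(y, n, theta)
    posterior = likelihood * prior
    posterior = posterior / posterior.sum()
    print(n, theta[np.argmax(posterior)], posterior.max().round(3))
```

As n grows, the posterior mass piles up on θ = 0.6, the true value.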

Putting It All Together

Let’s now look at what happens as the number of coin flips increases under a Bayesian framework. The plot below overlays the posterior distributions (in blue and with values on the left Y-axis) and the likelihoods (in red and with values on the right Y-axis).

Posterior Probabilities and Likelihoods as n Increases

Let’s start off by analyzing the situation in which we’ve flipped the coin a single time and observed heads. In the Frequentist approach, we saw that in this scenario our estimate was p = 1: an obviously incorrect conclusion, and yet we had no other way to answer the question. In the Bayesian world, our answer is very different. Because we defined our prior beliefs to say that the most likely value of θ is 0.5, we are only slightly swayed by that first flip. Think about this intuitively: if you really believed that a coin was fair, and you flipped it once and it came up heads, would that be enough evidence to convince you that the coin isn’t fair? Probably not!

The Bayesian approach performs better than the Frequentist approach when there is relatively little data to work with. The viability of the Frequentist answer relies on the law of large numbers, and thus in the absence of large amounts of data, the results aren’t always reliable.

However, as we begin to observe more flips of the coin, our Bayesian answer becomes quite clear, and eventually we’re placing all of our eggs in the p = 0.6 basket. After just 100 flips, we are assigning a very small probability to values other than 0.6, and eventually the probability we assign to 0.6 converges to 1, or ~ 100%. This means that we are nearly 100% sure that our coin comes up heads 60% of the time.

In the background, the likelihood (that is, the observed data) is beginning to dominate the prior beliefs we established at the beginning. This is because the evidence is overwhelming, and so we rely more on it. As the plot above shows, as n increases, the likelihood and the posterior distributions converge (note that their scales are different, though, because the likelihood is not a probability distribution). As we observe more data, the prior’s influence gets washed out and the data speaks for itself.

Conclusion

In this blog post, we’ve looked at two different ways of estimating unknown parameters. Under the Frequentist approach, we let the data do the talking: we estimate the relationship between X and Y by building a model that fits the observed data as closely as possible. This gives us a single point estimate, the Maximum Likelihood Estimate, and uncertainty is captured by the estimate’s standard error, which is inversely proportional to the square root of the sample size. As such, a typical Frequentist endeavor is to collect as much data as possible about the parameter of interest in order to reach a more accurate estimate (in theory, at least).

Under the Bayesian approach, we begin our analysis with an idea of what the relationship between our variables is: the prior distribution. We then observe our data and update our beliefs about the parameter’s distribution: this gives us the posterior distribution. An important but subtle difference under the Bayesian framework is that we treat parameters as random variables in order to capture the uncertainty about their true values. This entails working with distributions at every stage of the modeling process.

If we recall the plots we showed for the Frequentist approach, we realize that the Bayesian and Frequentist answers agree with each other as the sample size increases, and this is as it should be! As we gain fuller information about our parameter of interest (i.e. observe more data), our answers should become more objective.

So you might wonder: what’s the point of having different approaches if they end up giving the same answers? It depends on the use case, really. We would argue that in cases where you have lots and lots of data, deploying a fully fledged Bayesian framework might be overkill. In cases where you have less data, as is often true in the social sciences for instance, the ability to work with posterior distributions can be very insightful. Additionally, in the Frequentist approach, we are entirely at the mercy of our data’s accuracy: if our data is inaccurate or skewed, then our estimate will be too. In the Bayesian approach, we can offset this effect by setting stronger priors (i.e. being more confident in our prior beliefs). We would also argue that what Bayesian analysis lacks in simplicity it makes up for in overall precision, and there are many scenarios in which presenting a posterior distribution to a client or colleague can be much more informative than a point estimate.

In this blog post, we’ve treated the prior space as relatively simple, with only 11 arbitrary values p could take on. But what if p could take on any possible value between 0 and 1? When we start working with continuous distributions for parameter spaces things get a bit dicier, and this will be the topic of the next blog post, where we’ll look at why Markov Chain Monte Carlo approximations are used — stay tuned!

We realize there’s a lot to unpack so here’s a TLDR for you:

- The Frequentist approach estimates the unknown parameter using the Maximum Likelihood Estimate (MLE): the point estimate which is the most likely true value of the parameter, given the data that we’ve observed.

- The Frequentist answer is completely formed from the observed data and is delivered in the form of a single point estimate. This puts us at the mercy of the data’s accuracy and quality.

- The Bayesian approach combines a prior probability distribution with observed data (in the form of a likelihood distribution) to obtain a posterior probability distribution.

- While the Bayesian approach also relies on data, its estimate incorporates our prior *knowledge* about the parameter of interest, and its answer is delivered in the form of the parameter’s probability distribution. We can offset our reliance on the data by setting stronger priors.

- The Frequentist and Bayesian estimates converge as our sample size grows and the statistical inference becomes more objective.

Full Code

https://github.com/gabgilling/Bayesian-Blogs/blob/main/Blog%20Post%201%20final.ipynb

Citations

[1] https://online.stat.psu.edu/stat504/lesson/1/1.5

[2] https://tinyheero.github.io/2017/03/08/how-to-bayesian-infer-101.html

[3]https://sites.warnercnr.colostate.edu/gwhite/wp-content/uploads/sites/73/2017/04/BinomialLikelihood.pdf

Acknowledgment

Thank you to our colleagues at IBM: Alexander Ivanoff, Steven Hwang, Dheeraj Aremsetty and Lindsay Sample for proofreading and comments!
