
Understanding Bayes’ Theorem

Understanding the Rationale Behind the Famous Theorem

Photo by Antoine Dautry on Unsplash

It’s one of the most famous equations in the world of statistics and probability. Even if you don’t work in a quantitative field, you’ve probably had to memorize it at some point for an exam.

P(A|B) = P(B|A) * P(A)/P(B)

But what does it mean and why does it work? Find out in today’s post where we explore Bayes’ Theorem in depth.


A Framework for Updating Our Beliefs

What’s the point of probability (and statistics) anyways? One of its most important applications is decision making under uncertainty. When you decide on an action (assuming you are a rational human being), you are betting that completing the action will leave you better off than had you not done it. But bets are inherently uncertain, so how do you decide whether to go ahead with it or not?

Implicitly or explicitly, you estimate a probability of success – and if the probability is higher than some threshold, you forge ahead.

So being able to accurately estimate this success probability is critical to making good decisions. While chance will always play a role in the outcome, if you can consistently stack the odds in your favor, then you should do very well over time.

That’s where Bayes’ Theorem comes in – it gives us a quantitative framework for updating our beliefs as the facts around us change, which in turn allows us to improve our decision making over time.


Let’s Try Out the Formula

Let’s take a look at the formula again:

P(A|B) = P(B|A) * P(A)/P(B)

  • P(A|B) – is the probability of A given that B has already happened.
  • P(B|A) – is the probability of B given that A has already happened. It looks circular and arbitrary now but we will see why it works shortly.
  • P(A) – is the unconditional probability of A occurring.
  • P(B) – is the unconditional probability of B occurring.

P(A|B) is an example of a conditional probability – one that measures probability over only certain states of the world (states where B has occurred). P(A) is an example of an unconditional probability and is measured over all states of the world.
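As a quick sanity check, the formula translates directly into a one-line Python function (a sketch – the argument names are just illustrative labels for the three probabilities):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Fair-die sanity check: A = "roll is 2", B = "roll is even".
# P(B|A) = 1, P(A) = 1/6, P(B) = 1/2, so P(A|B) = (1/6) / (1/2) = 1/3.
p = bayes(1.0, 1/6, 1/2)
```

The die example is easy to verify by hand: among the three even outcomes, exactly one is a 2.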

Let’s see Bayes’ Theorem in action with an example. Suppose that you are a recently graduated Data Science bootcamp student. You have yet to hear back from some of the companies you interviewed with and are getting nervous. So you decide to calculate the probability that a specific company will make you an offer given that it’s been 3 days and they still have not called you.

Let’s rewrite the formula in terms of our example. Here, outcome A is "receiving an offer" and outcome B is "no phone call for 3 days". So we can write our formula as:

P(Offer|NoCall) = P(NoCall|Offer) * P(Offer) / P(NoCall)

The value of P(Offer|NoCall), the probability of receiving an offer given no phone call for 3 days, is hard to estimate.

But the reverse, P(NoCall|Offer), or the probability of no phone call for 3 days given that you have an offer from the company, feels more like something we can reasonably peg a value for. From talking with friends, recruiters, and job counselors, you learn that it is somewhat unlikely, but not uncommon for a company to maintain radio silence for as long as 3 days if they are planning to make you an offer. So you estimate:

P(NoCall|Offer) = 40%

40% is not bad, seems like there’s still hope! But we’re not done yet. Now we need to estimate P(Offer), the probability of landing a job offer, period. Everyone knows that job hunting is a long and arduous process, and chances are that you will need to interview at least a few times before nabbing that offer, so you estimate:

P(Offer) = 20%

Now we just need to estimate P(NoCall), the probability of not getting a call back from the company for 3 days. There’s any number of reasons that a company might not call you for 3 days – they might have decided to pass on you, or they might still be interviewing other candidates, or the hiring manager might have caught a cold. Wow, there’s a lot of reasons that they might not have called, so for the last probability you estimate:

P(NoCall) = 90%

Now plugging it all in, we can calculate P(Offer|NoCall):

P(Offer|NoCall) = 40% * 20%/90% = 8.9%

That’s pretty low – so unfortunately we shouldn’t get our hopes up (and we should definitely keep dropping those resumes). If it all seems kind of arbitrary, don’t worry. I felt the same way when I first learned Bayes’ Theorem too. Now let’s unravel how and why we arrived at that 8.9% (bear in mind that your initial estimate of 20% was already low to start with).
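The arithmetic above is simple enough to check in a few lines of Python, using the three estimates from the example:

```python
p_nocall_given_offer = 0.40  # P(NoCall|Offer): estimated from talking with recruiters
p_offer = 0.20               # P(Offer): our prior belief of landing the offer
p_nocall = 0.90              # P(NoCall): unconditional chance of 3 days of silence

# Bayes' Theorem: P(Offer|NoCall) = P(NoCall|Offer) * P(Offer) / P(NoCall)
p_offer_given_nocall = p_nocall_given_offer * p_offer / p_nocall
print(f"P(Offer|NoCall) = {p_offer_given_nocall:.1%}")  # prints "P(Offer|NoCall) = 8.9%"
```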


The Intuition Behind the Formula

Remember how we said that Bayes’ Theorem is a framework for updating our beliefs? So where do our beliefs come in? They come in through the prior, P(A), which in our example is P(Offer) – this is our prior belief about how likely it is to receive an offer. In our example, you can think of the prior as our belief of the likelihood that you will receive an offer at the exact moment that you exit the interview room.

Now, new information has come in – 3 days have gone by and the company has yet to call you. So we use the other parts of the equation to adjust our prior for the new event that has occurred.

Let’s examine P(B|A), which is P(NoCall|Offer) in our example. When you first learn Bayes’ Theorem, it’s natural to wonder what the point of the P(B|A) term is. If I don’t know P(A|B), then how am I supposed to magically know what P(B|A) is? This reminds me of something that Charles Munger once said:

"Invert, always invert!" – Charles Munger

What he meant is that when trying to solve a challenging problem, it’s easier to turn the problem around on its head and look at it backwards – which is exactly what Bayes’ Theorem is doing. Let’s reframe Bayes’ Theorem into statistical terms to make it more interpretable (I first read about this here):

Bayes’ Theorem reframed so that it is more intuitive

To me, this is a much more intuitive way of thinking about the formula. We have a hypothesis (that we got the job), a prior, and observed some evidence (no phone call for 3 days). Now we want to know the probability that our hypothesis is true given the evidence. As we discussed above, we already have our prior of 20%.

Time to invert! We use P(Evidence|Hypothesis) to flip the problem around by asking, "What is the probability of observing this evidence in a world where our hypothesis is true?" So in our example, we want to know how likely it is to go 3 days without a phone call in a world where the company has definitely decided to make you an offer.

In my annotated image of the formula above, I call P(Evidence|Hypothesis) the scaler because that’s exactly what it does. When we multiply it against the prior, the scaler scales the prior up or down depending on whether the evidence helps or hurts our hypothesis – in our case, the scaler reduces the prior because more days going by without a phone call would be an increasingly bad sign. 3 days of radio silence is already not good (it reduces our prior by 60%), but 20 days of silence would completely destroy any hope we had of getting the job. So the more our evidence accumulates (more days without a phone call), the more the scaler reduces our prior. The scaler is the mechanism that Bayes’ Theorem utilizes to adjust our prior beliefs.
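We can see the scaler’s effect numerically. Only the 3-day value below comes from the example; the other scaler values are made-up illustrations of P(no call for d days|Offer) shrinking as the silence drags on:

```python
prior = 0.20  # P(Offer): our prior belief

# Hypothetical scaler values P(no call for d days | offer).
# Only the 3-day entry (0.40) is from the example; the rest are illustrative.
scalers = {1: 0.70, 3: 0.40, 7: 0.10, 20: 0.005}

for days, scaler in scalers.items():
    # The product prior * scaler shrinks as the scaler shrinks
    print(f"{days:2d} days of silence: prior * scaler = {prior * scaler:.4f}")
```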

EDIT: One thing I struggled with somewhat in the original version of this post is articulating why P(Evidence|Hypothesis) is easier to estimate than P(Hypothesis|Evidence). The reason for this is that P(Evidence|Hypothesis) is a much more constrained way of thinking about the world – by narrowing the scope, we simplify our problem. An easy way to see this is with the popular fire and smoke example where fire is our hypothesis and observing smoke is the evidence. P(fire|smoke) is harder to estimate because any number of things can cause smoke – exhaust from cars, a factory, someone grilling burgers over a charcoal flame. P(smoke|fire) is much easier to estimate – in a world where there is a fire, there will almost certainly be smoke.

The value of the scaler decreases as more days pass with no call – the lower the scaler, the more it reduces the prior as they are multiplied together

The last part of our formula, P(B), a.k.a. P(Evidence), is the normalizer. Like the name implies, its purpose is to normalize the product of the prior and the scaler. If we didn’t divide by the normalizer, we would have the following equation:

P(Evidence|Hypothesis) * P(Hypothesis) = P(Hypothesis and Evidence)

Notice that the product of prior and scaler is equal to a joint probability. And because one of the terms in it is P(Evidence), the joint probability would be impacted by the rarity of our evidence.

This is problematic because the joint probability is a value that considers all states of the world. But we don’t care about all states – we only care about the ones where the evidence has occurred. In other words, we are living in a world where the evidence has already occurred and the abundance or scarcity of our evidence is no longer relevant (so we don’t want it to affect our calculation at all). Dividing the product of prior and scaler by P(Evidence) changes it from a joint probability to a conditional one – a conditional probability is one that only considers the states of the world where the evidence has occurred, which is what we desire.
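A small simulation makes the joint-versus-conditional distinction concrete. One caveat: to simulate, we need a fully specified model, so below P(NoCall|no Offer) is assumed to be 0.95, which implies P(NoCall) ≈ 0.84 rather than the rough 90% estimate used in the example – the numbers are a self-consistent sketch, not the post’s exact figures:

```python
import random

random.seed(42)

p_offer = 0.20                  # P(Offer), the prior
p_nocall_given_offer = 0.40     # P(NoCall|Offer), the scaler
p_nocall_given_no_offer = 0.95  # assumed, to make the model self-consistent

n = 200_000
joint = 0   # worlds where BOTH Offer and NoCall happen
nocall = 0  # worlds where NoCall happens (regardless of offer)
for _ in range(n):
    offer = random.random() < p_offer
    p = p_nocall_given_offer if offer else p_nocall_given_no_offer
    no_call = random.random() < p
    joint += offer and no_call
    nocall += no_call

p_joint = joint / n             # P(Offer and NoCall): measured over ALL worlds
p_conditional = joint / nocall  # P(Offer|NoCall): only worlds where NoCall occurred
```

Dividing the joint count by the number of NoCall worlds (instead of all worlds) is exactly the division by the normalizer: it restricts the calculation to the states where the evidence has occurred.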

EDIT: Another way to think about why we divide the scaler by the normalizer is that they answer two different and important questions – and the ratio of them combines the information in a useful way. Let’s use an example from my new post on naive Bayes. Say we are trying to figure out whether an observed animal is a cat based on a single feature, agility. All we know is that the animal in question is agile.

  1. The scaler tells us what proportion of cats are agile – this should be pretty high, say 0.90.
  2. The normalizer tells us what proportion of animals overall are agile – this should be medium, say 0.50.
  3. The ratio 0.90/0.50 = 1.8 tells us to scale up our prior – it’s saying whatever you believed before, it’s time to revise it up because it looks like you may be dealing with a cat. The reason it says so is because we observed some evidence that the animal is agile. Then we figured out that the proportion of cats that are agile is greater than the proportion of overall animals that are agile. Given that we only know this one piece of evidence and nothing else at the moment, the reasonable thing to do is to revise up our belief that we are dealing with a cat.
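The three steps above fit in a few lines of Python. The 25% prior is a made-up starting belief for illustration; only the 0.90 and 0.50 proportions come from the example:

```python
prior_cat = 0.25   # hypothetical prior belief that the animal is a cat
scaler = 0.90      # P(agile | cat): proportion of cats that are agile
normalizer = 0.50  # P(agile): proportion of all animals that are agile

ratio = scaler / normalizer    # 1.8 -> the evidence favors "cat"
posterior = prior_cat * ratio  # belief revised up from 0.25 to 0.45
```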

Putting It All Together

Now that we know how to think about each of the formula’s individual pieces, let’s revisit our example from start to finish one last time:

  • Fresh out of the interview, we start with a prior – there is a 20% chance that you will get the job you just interviewed for.
  • As more days pass, we use the scaler to scale down our prior. For example, after 3 days have gone by, we estimate that in a world where you got the job, there is just a 40% chance that the company would have waited this long to call you. Multiplying scaler and prior we get 20% * 40% = 8%.
  • Finally, we recognize that the 8% is calculated over all states of the world. But we only care about states of the world where you have not received a phone call from the company for 3 days post interview. In order to capture only those states, we estimate the unconditional probability of not receiving a call for 3 days to be 90% – this is our normalizer. We divide our previously calculated 8% by the normalizer, 8% / 90% = 8.9%, to get our final answer. So in purely the states of the world where you have not heard back from the company for 3 days, there is an 8.9% chance that you will receive an offer.

Hope this was helpful, cheers!


More Data Science and Analytics Related Posts By Me:

What Do Data Scientists Do?

Are Data Scientists at Risk of Automation

The Binomial Distribution

Understanding PCA

The Curse Of Dimensionality

Understanding Neural Nets

