Bayesian Intuition Illustrated with Everyday Physics

Updating your beliefs with data is like pouring water into a bucket

Tyler Buffington, PhD
Towards Data Science

Introduction

Let’s play a game. A wizard has chosen a real number, and we need to guess it.

With no additional information, we have no idea what the number is. It could be anything on the real number line.

Fortunately, the wizard is willing to give us a clue. He will randomly draw a number from a normal distribution with a mean equal to his chosen number and a standard deviation equal to 10. The drawn number is 136.6.

Now we have a better idea of what the chosen number is. Before the clue, it was a distinct possibility that the number was negative. Now, it is very likely that the number is not only positive but also greater than 100. The clue updated our beliefs.

The wizard agrees to give us a second clue in the same way. This time, he draws 123.4. We now have reason to believe that the chosen number is somewhere in the ballpark of 130, the average of the two drawn numbers.

The second clue also makes us more confident. After the first clue, it was plausible that the chosen number would be above 150. This outcome is now somewhat unlikely in light of the second clue.

We can imagine that if we kept receiving clues from the wizard, we would continue to update our beliefs about the wizard’s chosen number. If we received 10,000 similar clues, we would be reasonably confident that the wizard’s number is close to the average of the 10,000 draws.

How much should each clue update our beliefs? Bayes' rule provides the answer to this question.

Updating our beliefs with Bayesian inference

Defining confidence precisely

Before digging into the Bayesian model, we need to review some terminology.

There are two additional ways of describing the spread of a distribution. The first is the variance, which is the average squared distance from the mean. The second is the precision, which is the reciprocal of the variance.

A large variance corresponds to high uncertainty, and a large precision corresponds to high confidence.

In the example from the last section, the standard deviation of the normal “clue” distribution is 10, the variance is 10²=100, and the precision is 1/100.

The normal-normal model

Now that we have defined precision, let’s get into the model. In our game, let’s call the wizard’s chosen number θ; this is the parameter we are trying to infer. For a given clue, we will refer to the drawn number as D, because it is the “data” that updates our beliefs.

We want to compute P(θ|D) — the probability distribution for the chosen number given the data provided in a clue. This can be calculated using Bayes’ rule as shown below:

P(θ|D) = P(D|θ) · P(θ) / P(D)

The above equation has four components:

  • P(θ|D), known as the posterior. This is the distribution that describes our updated beliefs after observing D.
  • P(D|θ), known as the likelihood. For a given θ, how would D be distributed? In the game, this distribution is normal with a mean of θ and a standard deviation of 10.
  • P(θ), known as the prior. This is the distribution that describes our beliefs before observing D.
  • P(D), known as the marginal likelihood. Generally, we don’t worry too much about this component. We can treat it as a constant that ensures the posterior distribution integrates to one.

If we use a normal distribution to describe our prior beliefs, computing the posterior is relatively simple. This is because the normal distribution is the conjugate prior to a normal likelihood with a known variance.

There is a fair bit of algebra involved that I won’t go into here, but the final result from Bayes’ theorem is that the posterior distribution is also normal.

In the normal-normal model, the posterior mean is the precision-weighted mean of the prior mean and the data (the drawn value in a clue). The posterior precision is the sum of the prior precision and the data’s precision.
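
Written out, with τ denoting precision (a symbol introduced here for brevity), the update rules are:

μ_posterior = (τ_prior · μ_prior + τ_data · D) / (τ_prior + τ_data)
τ_posterior = τ_prior + τ_data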

We can write a simple Python function to calculate the posterior parameters. A minimal sketch follows (the function and argument names are illustrative):
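
    def update_beliefs(prior_mean, prior_precision, data, data_precision):
        """Normal-normal conjugate update with a known data precision.

        Returns the posterior mean and precision after observing one
        data point (one clue).
        """
        # Precisions add.
        posterior_precision = prior_precision + data_precision
        # The posterior mean is the precision-weighted average of the
        # prior mean and the observed data point.
        posterior_mean = (prior_precision * prior_mean +
                          data_precision * data) / posterior_precision
        return posterior_mean, posterior_precision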

Once we update our beliefs from a given clue, the posterior distribution becomes the prior distribution for the next clue. As a result, we use Bayes’ rule iteratively to update our beliefs with each clue.

Before the first clue, we had no information about the wizard’s number. We can model these uninformed beliefs as an infinitely wide normal distribution centered at 0. This distribution has a variance of infinity, which corresponds to a precision of 0. As we previously noted, the precision of the “clue” distribution is 1/100, so each clue adds a precision of 1/100 to our beliefs. Our belief distribution therefore has a precision of 0.05 after five clues, 0.1 after ten clues, and so on.
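
Using the function sketched above, we can replay the game with the two clues received so far (the starting mean of 0 is arbitrary, since a precision of 0 gives it no weight):

    # Flat prior: infinite variance corresponds to zero precision.
    mean, precision = 0.0, 0.0
    for clue in [136.6, 123.4]:
        mean, precision = update_beliefs(mean, precision, clue, 1 / 100)
    print(mean, precision)  # 130.0, 0.02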

As our beliefs become more confident (high precision), each new clue has less ability to change our beliefs. For example, when we receive the second clue, we update our beliefs by putting 50% of the weight into the drawn value from the clue and 50% of the weight into our prior beliefs. When we receive the 10th clue, we put 90% of the weight into our prior beliefs and only 10% into the new information.

When we have confident prior beliefs, new information does little to change our minds.

The following animation shows the evolution of our beliefs as we receive randomly generated clues. Notice that at first, our beliefs noticeably change with each new clue because we are “open-minded.” After receiving many clues, we have essentially made up our minds that θ is around 120, and even clues far away from 120 do little to change our beliefs. Our beliefs become heavy and immovable once we observe overwhelming evidence.

Image by the author. A visualization of the “normal normal” model. The black lines represent draws from a normal distribution with an unknown mean of θ and a known standard deviation of 10. The blue distribution represents beliefs about the actual value of θ. Each draw is a clue that updates our beliefs about the true value of θ.

The thermodynamics of mixing water

Let’s say that I have 2 kg of water in a bucket at 30 degrees C. I then pour 1 kg of hot water at 90 degrees C into the bucket. Neglecting heat transfer with the surroundings, what will be the mass and temperature of the water in the bucket after mixing?

By conservation of mass, the final mass is simply the sum of the mass of the water initially in the bucket (2 kg) and the mass of the water poured into the bucket (1 kg). So, 3 kg.

If we assume constant specific heat capacities, the final temperature is the mass-weighted average of the temperature of the water already in the bucket and the temperature of the water I add. This gives (2 * 30 + 1 * 90)/3 = 50 degrees C.

This provides an exact mathematical analogy to the normal-normal Bayesian model with a known variance. The water initially in the bucket represents our prior beliefs and the water we pour represents new information. The masses add together just like the precisions add together in the Bayesian model. The final temperature is a mass-weighted average just like the posterior mean is a precision-weighted average. The exact mathematical analogy is specific to the assumptions we made in the Bayesian and thermodynamic models. However, I find the analogy qualitatively useful for understanding Bayesian inference in general.
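
In fact, the update function sketched earlier computes the mixing result directly if we treat masses as precisions and temperatures as means:

    # Water mixing via the same update rule: mass plays the role of
    # precision, temperature the role of the mean.
    final_temp, final_mass = update_beliefs(
        prior_mean=30.0, prior_precision=2.0,  # 2 kg at 30 degrees C in the bucket
        data=90.0, data_precision=1.0,         # 1 kg at 90 degrees C poured in
    )
    print(final_temp, final_mass)  # 50.0, 3.0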

Exploring the analogy

If there is a lot of water already in a bucket, it takes a large amount of water at a different temperature to update the temperature of the water in the bucket. If someone has very confident prior beliefs, changing their mind requires strong contrary evidence.

Using evidence to convince someone who has firmly made up their mind is like trying to warm the ocean by pouring a cup of hot coffee into it.

Conversely, if we pour water into an empty bucket, the final temperature will be the temperature of the water we pour. This is equivalent to ignoring our prior beliefs entirely and forming conclusions based on new information alone — like we did in the wizard’s game with the first clue.

If we pour water into a bucket of water at the same temperature, the new temperature will be the same as the original temperature. However, there will be more water in the bucket, making it harder to change the temperature from now on. This is equivalent to evidence that confirms our beliefs — it does not change our minds, but it does make our beliefs more immovable in the future.

Final remarks

When we face new information, we should consider how much water we already have in our bucket and how much water the information adds. As humans, we often err on the side of overconfidence. It is all too easy to form rigid conclusions based on our experiences and fail to update our beliefs even when presented with overwhelming contrary evidence. However, the solution is not to ignore prior beliefs entirely.

One of the classic examples of Bayesian statistics is the sunrise problem introduced by Pierre-Simon Laplace in the 18th century. Based on past observations, what is the probability that the sun will rise tomorrow? To someone who has only lived on Earth for one day, it might be plausible that a sunrise is a rare event. Those of us who have witnessed many sunrises can be confident that the sun will rise again tomorrow, and we should be highly skeptical of any evidence to the contrary. In this case, our priors should be nearly immovable. As Carl Sagan put it, “Extraordinary claims require extraordinary evidence.”
