Will the Sun Rise Tomorrow? Introduction to Bayesian Statistics for Machine Learning

Have you ever asked yourself what the probability is that an event will occur when it has never occurred before?

Matthew Stewart, PhD
Towards Data Science


In this article, we will delve into the mysterious world of Bayesian statistics and how some of its tenets, such as the Bernstein-von Mises Theorem and Cromwell’s rule, can be helpful in analyzing real-world machine learning problems.

“Bayesian statistics is difficult in the sense that thinking is difficult” — Don Berry

If you were looking for a deep dive into the mathematics behind Bayesian statistics, this is not the place to look (although I will post articles on this in the future). This article is primarily intended to introduce the Bayesian approach to people new to the concept.

Imagine for a moment that you are designing a nuclear power plant. You are tasked with using data to determine whether the plant is functioning correctly. This may seem like a relatively simple task until you realize that you actually don’t have any data about what a plant looks like when a nuclear meltdown occurs. How are you supposed to predict something like this?

If you are an astute machine learning specialist, you might suggest some kind of unsupervised method, such as a (restricted) Boltzmann machine, that can learn what a ‘normal’ power plant looks like and thus detect when things have gone a bit awry (this is, in fact, one way that people model the normal operating conditions of a nuclear power plant).
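To make this concrete, here is a minimal sketch of that idea using scikit-learn’s BernoulliRBM; the sensor data and the anomaly threshold are invented purely for illustration.

```python
# A minimal sketch of RBM-based anomaly detection; the 'sensor' data and
# the threshold are made up for illustration.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
normal_readings = rng.uniform(0.4, 0.6, size=(1000, 8))  # 'healthy' sensors, scaled to [0, 1]

rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=50, random_state=0)
rbm.fit(normal_readings)  # learn what 'normal' operation looks like

# Pseudo-likelihood under the learned model: unusually low scores flag anomalies.
scores = rbm.score_samples(normal_readings)
threshold = scores.mean() - 3 * scores.std()

new_reading = rng.uniform(0.0, 1.0, size=(1, 8))  # a suspicious snapshot
if rbm.score_samples(new_reading)[0] < threshold:
    print("Readings deviate from learned normal operation; investigate.")
```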

However, if we think about this problem in a more general sense, what do we do when we have few or no negative examples to compare against? This could occur for several reasons:

  • The probability of the event is so low that the event has not been observed to occur at all in (finite) sample data. (The low probability scenario)
  • Observations have occurred but there are very few. (The data sparse scenario)
  • The result of a failure would be so catastrophic that it could only occur once, for example, the destruction of the sun. (The catastrophe scenario)

Traditional statistics is not well suited for these kinds of problems, and typically a different approach is required.

An even more general question is how do we deal with extremely low (but strictly non-zero) or extremely high (close to one but strictly not one) probabilities? Let’s first look at a few rules that were developed to study a famous problem posed by the mathematician Pierre-Simon Laplace.

The Sunrise Problem

Imagine one morning you woke up and the Sun had decided to take a day off. Not only would this (most likely) ruin your day and mess up your body clock, it would also directly change how you feel about the Sun. You would automatically be more inclined to predict that the Sun will not rise the next day either. Alternatively, if the Sun was just having a bad day and returned the following morning, your expectation that it would take another day off would still be significantly higher than it was previously.

So what happened here? We changed our belief about the probability of an event based on new evidence. This is the crux of all Bayesian statistics and is formally described using an equation known as Bayes’ rule.

Bayes’ Rule

Bayes’ rule tells us that we have to start with some inherent probability of how likely an event is to happen (before the fact). We call this the prior probability. Then, as we are presented with new observations and evidence, we update our belief by weighing how well our current stance explains that evidence. This updated belief is called the posterior probability (after the fact).

Going back to our Sunrise problem, every day we observe that the Sun rises, and every time it happens we are a little more sure that it will rise again the next day. However, if one day we find that the Sun does not rise, this will drastically affect our posterior probability based on the new evidence.

This is expressed mathematically in the following form, which looks daunting at first but can be summarized simply: our updated belief is based on our initial belief and the new evidence presented to us, weighted by how likely that evidence is under our current belief (the likelihood). The likelihood asks: given the new evidence I have, how likely is it that my current belief is correct? If I believe that the probability of the Sun not rising tomorrow is a million to one, and then it happens, my belief (my model) is very likely wrong, and the posterior probability will be updated to predict that it is more likely to happen again.

Bayes’ theorem: P(belief | evidence) = P(evidence | belief) × P(belief) / P(evidence)
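Laplace’s own answer to the sunrise problem follows directly from this rule: under a uniform prior on the Sun’s chance of rising, observing n consecutive sunrises gives P(rise tomorrow) = (n + 1)/(n + 2), the so-called rule of succession. A quick sketch:

```python
# Laplace's rule of succession: with a uniform prior over the Sun's chance
# of rising, n consecutive sunrises imply P(rise tomorrow) = (n + 1)/(n + 2).
def p_sunrise_tomorrow(n_sunrises: int) -> float:
    return (n_sunrises + 1) / (n_sunrises + 2)

for n in (0, 10, 365, 365 * 80):  # from no data up to a long lifetime
    print(f"after {n:>5} sunrises: P(rise tomorrow) = {p_sunrise_tomorrow(n):.5f}")
# The probability creeps toward 1 but never reaches it, however long we watch.
```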

This is a pretty nifty idea, and it is present in many different places, especially when it comes to humans and their beliefs. For example, let’s say your friend messages you to tell you that one of your favorite celebrities has died. Initially, you might be upset and also slightly skeptical. As you go about your day, you read in the newspaper that the celebrity died, and this belief is reinforced. Perhaps you then see interviews with their mourning family on the television news, and your belief is reinforced even further. However, if you instead see the person being interviewed on television about a rumor that they had died, your belief in what your friend told you would be lowered.

This is an essential aspect of science: theories are tested through experiments and simulations, and the more people who run these experiments and verify a theory, the more robust and believable it gradually becomes. By contrast, someone who is religious may decide that they do not need empirical evidence (of the same kind, at least) to believe in something, and we call this faith.

It is interesting how something so pervasive in our everyday lives can be so fundamental to statistics and machine learning, but it is, and we will discuss why. First, however, we need to look at some problems that occur with Bayes’ theorem for very low probabilities.

Cromwell’s Rule

Oliver Cromwell was a prominent figure in British history, famously quoted in a 1650 letter to the General Assembly of the Church of Scotland as saying

“I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”

The use of this phrase led Dennis Lindley to formulate Cromwell’s rule: if one begins with a prior probability equal to zero (“I know this is not true”) or one (“I know this is true”), then no matter what evidence is presented, one’s belief cannot be moved.

This shows us the danger of an absolutist viewpoint when looking at things that can be empirically observed. If I hold a belief so strongly that I am certain I am right, nothing anyone can say or do will ever convince me otherwise. This is the height of ignorance and not something that we want to incorporate into machine learning models. If we look back at Bayes’ theorem, we can see why this is the case: if our prior probability is zero, then multiplying it by anything will still give us a posterior probability of zero.

In principle (see Cromwell’s rule), no possibility should have its probability set to zero, since nothing in the physical world should be assumed strictly impossible (though it may be) — even if contrary to all observations and current theories.
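A toy numerical sketch makes the point: below, a prior of exactly zero never moves no matter how much supporting evidence arrives, while even a minuscule non-zero prior eventually climbs toward one. The likelihoods (0.9 versus 0.1) are invented for illustration.

```python
# Cromwell's rule in action: updating a prior of exactly 0 versus a tiny
# non-zero prior with repeated evidence that favors the hypothesis.
def bayes_update(prior, p_e_given_h=0.9, p_e_given_not_h=0.1):
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))

for prior in (0.0, 1e-6):
    belief = prior
    for _ in range(10):  # ten pieces of supporting evidence
        belief = bayes_update(belief)
    print(f"prior = {prior}: posterior after 10 updates = {belief:.4f}")
# prior = 0.0 stays exactly 0; prior = 1e-6 rises to roughly 0.9997.
```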

An ideal example of where this can occur is in a neural network. When you initialize a neural network, its weights start with some initial value. If you set all of these weights to zero, the network will not be able to update itself, since every update computed by the gradient descent algorithm ends up multiplied by zero. Instead, random initialization is used (typically hidden from the user), which usually prevents problems such as these.
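Here is a minimal NumPy sketch of that failure mode for a hypothetical one-hidden-layer network with invented toy data: with all-zero weights, every gradient in the backward pass comes out exactly zero, so learning can never start.

```python
# Why all-zero initialization stalls learning (toy two-layer network).
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy inputs
y = np.array([[1.0], [0.0]])            # toy targets

W1 = np.zeros((2, 4))  # first-layer weights, all zero
W2 = np.zeros((4, 1))  # second-layer weights, all zero

h = np.tanh(X @ W1)    # hidden activations: all zeros
out = h @ W2           # outputs: all zeros
grad_out = out - y     # error signal at the output

grad_W2 = h.T @ grad_out                          # zero, because h is all zeros
grad_W1 = X.T @ ((grad_out @ W2.T) * (1 - h**2))  # zero, because W2 is all zeros

print(grad_W1, grad_W2, sep="\n")  # every gradient is exactly zero: no updates
```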

Another intriguing property of Bayes’ theorem comes when we look at what happens after an infinite number of observations, often called the Bernstein-von Mises Theorem.

Bernstein-von Mises Theorem

In simple terms, the Bernstein-von Mises theorem tells us that our posterior estimate becomes asymptotically independent of our initial (prior) belief as we obtain more data, assuming of course that the prior obeys Cromwell’s rule. This is in some ways analogous to the law of large numbers in frequentist statistics, which tells us that the mean of a sample converges to the mean of the total population as we obtain more and more data.
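A quick sketch of this with a Beta-Bernoulli model: two very different priors end up with nearly identical posteriors once enough data arrives (the simulated coin’s true bias of 0.7 is invented for the example).

```python
# Bernstein-von Mises in miniature: with enough data, the prior washes out.
import numpy as np

rng = np.random.default_rng(42)
flips = rng.random(10_000) < 0.7  # simulated coin with true bias 0.7
heads, tails = flips.sum(), (~flips).sum()

for a, b in [(1, 1), (50, 2)]:    # a flat prior vs a strongly biased prior
    post_mean = (a + heads) / (a + b + heads + tails)  # Beta posterior mean
    print(f"Beta({a}, {b}) prior -> posterior mean = {post_mean:.4f}")
# Both posterior means land within a fraction of a percent of 0.7.
```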

So what’s the big difference between Bayesian statistics and normal statistics? Why do machine learning specialists and data scientists need Bayesian statistics?

Bayesian Statistics vs Frequentist Statistics

For those of you with no idea what the terms Bayesian and frequentist mean, let me elaborate. A frequentist approach looks at data from the point of view of frequency. For example, let’s say I have a biased coin with heads on both sides. I flip the coin 10 times, and 10 times I get heads. If I take the average result of all the coin flips, I get 1, indicating that my next flip has a 100% chance of being heads and a 0% chance of being tails. This is the frequentist way of thinking.

Now take the Bayesian point of view. I start out with a prior probability, which I will choose to be 0.5 because I am assuming the coin is fair. What is different is how I update that probability. After each coin flip, I look at how likely my new observation is given my current belief (that I have a fair coin). Progressively, as I flip more heads, my probability tends toward a value of 1, but it will never be exactly 1.
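Here is a small sketch of that update using a Beta prior, the standard conjugate choice for coin flips; Beta(1, 1) plays the role of the fair-coin assumption.

```python
# Bayesian updating for the double-headed coin: the posterior mean tends
# toward 1 but never reaches it.
a, b = 1, 1        # Beta(1, 1) prior pseudo-counts: the fair-coin assumption
heads, tails = 0, 0

for flip in range(1, 11):  # ten flips, every one comes up heads
    heads += 1
    p_heads = (a + heads) / (a + b + heads + tails)  # posterior mean
    print(f"after flip {flip:>2}: P(heads) = {p_heads:.3f}")
# After 10 straight heads the estimate is 11/12 ≈ 0.917, still short of 1.
```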

The fundamental difference between the Bayesian and frequentist approach is about where the randomness is present. In the frequentist domain, the data is considered random and the parameters (e.g. mean, variance) are fixed. In the Bayesian domain, the parameters are considered random and the data is fixed.

I really want to stress one point right now.

It is not called Bayesian because you are using Bayes’ theorem (which is also commonly used from a frequentist perspective).

It is called Bayesian because the terms in the equations have a different underlying meaning. From this theoretical difference, you end up with a very meaningful practical difference: whereas before your estimator produced a single value for each parameter (the data are random, the parameters are fixed), you now have a distribution over the parameters (the parameters are random, the data are fixed), so you need to integrate over it to obtain the distribution of your data. This is one reason the mathematics behind Bayesian statistics gets a bit messier than in traditional statistics, and one must often resort to Markov chain Monte Carlo methods, sampling from distributions in order to estimate the value of otherwise intractable integrals. Other nifty techniques, such as the Law of the Unconscious Statistician (what a great name, right?), a.k.a. LOTUS, can help with the mathematics.
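For a flavor of what that sampling looks like, here is a bare-bones Metropolis-Hastings sketch (one of the simplest MCMC algorithms) drawing from a posterior we can only evaluate up to a normalizing constant; the coin-flip numbers are invented.

```python
# Random-walk Metropolis-Hastings for a coin's bias theta after observing
# 10 heads and 1 tail under a flat prior (true posterior: Beta(11, 2)).
import numpy as np

def unnorm_posterior(theta, heads=10, tails=1):
    if not 0 < theta < 1:
        return 0.0
    return theta**heads * (1 - theta)**tails  # likelihood x flat prior

rng = np.random.default_rng(0)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.1)   # random-walk proposal
    accept_prob = unnorm_posterior(proposal) / unnorm_posterior(theta)
    if rng.random() < accept_prob:          # accept, or keep the current state
        theta = proposal
    samples.append(theta)

print(np.mean(samples[5_000:]))  # ≈ 11/13 ≈ 0.846, the Beta(11, 2) mean
```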

So which methodology is better?

These methods are essentially two sides of the same coin (pun intended): they typically give you the same results, but the way they get there is slightly different. Neither is better than the other. In fact, I even have professors in my classes at Harvard who frequently argue over which is better. The general consensus is that ‘it depends on the problem’, if one can consider that a consensus. Personally, I find the Bayesian approach more intuitive, but the underlying mathematics is far more involved than in the traditional frequentist approach.

Now that you (hopefully) understand the difference, perhaps the below joke will make you chuckle.

Bayesian vs frequentist joke.

When should I use Bayesian statistics?

Bayesian statistics encompasses a specific class of models that could be used for machine learning. Typically, one draws on Bayesian models for one or more of a variety of reasons, such as:

  • Having relatively few data points
  • Having strong prior intuitions (from pre-existing observations/models) about how things work
  • Having high levels of uncertainty, or a strong need to quantify the level of uncertainty about a particular model or comparison of models
  • Wanting to claim something about the probability of the alternative hypothesis, rather than simply accepting/rejecting the null hypothesis

Looking at this list, you might think that people would want to use Bayesian methods in machine learning all of the time. However, that’s not the case, and I suspect the relative dearth of Bayesian approaches to machine learning is due to:

  • Most machine learning is done in the context of “big data”, where priors, the signature of Bayesian models, don’t actually play much of a role.
  • Sampling posterior distributions in Bayesian models is computationally expensive and slow.

That said, there is plenty of synergy between the frequentist and Bayesian approaches, especially in today’s world, where big data and predictive analytics have become so prominent. We have loads and loads of data for a variety of systems, and we can constantly make data-driven inferences about a system and keep updating them as more and more data become available. Since Bayesian statistics provides a framework for updating “knowledge”, it is, in fact, used a whole lot in machine learning.

Several machine learning techniques, such as Gaussian processes and simple linear regression, have both Bayesian and non-Bayesian versions. There are also algorithms that are purely frequentist (e.g. support vector machines, random forests), and those that are purely Bayesian (e.g. variational inference, expectation maximization). Learning when to use each of these, and why, is what makes you a real data scientist.
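As a small illustration, scikit-learn ships both flavors of linear regression; this sketch (with invented toy data) contrasts ordinary least squares with its Bayesian cousin, which returns uncertainty alongside each prediction.

```python
# Frequentist vs Bayesian linear regression on the same toy data.
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.5 * X.ravel() + rng.normal(0, 1.0, size=30)  # noisy line, slope 2.5

ols = LinearRegression().fit(X, y)   # frequentist: point estimates only
bayes = BayesianRidge().fit(X, y)    # Bayesian: a distribution over weights

mean, std = bayes.predict([[5.0]], return_std=True)  # predictive uncertainty
print(f"OLS prediction at x=5:      {ols.predict([[5.0]])[0]:.2f}")
print(f"Bayesian prediction at x=5: {mean[0]:.2f} ± {std[0]:.2f}")
```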

Are you a Bayesian or a Frequentist at heart?

Personally, I am not in one camp or the other. Sometimes I am using statistics/machine learning on a dataset with thousands of features about which I know nothing; in that case I have no prior beliefs, and Bayesian inference seems inappropriate. At other times I have a small number of features that I know quite a lot about, and I would like to incorporate that knowledge into my model; in that case Bayesian methods will give me intervals and results that I trust more.

Where should I go to learn more about Bayesian statistics?

There are several great online classes that delve deep into Bayesian statistics for machine learning. The best resource I would recommend is the class I took here at Harvard, AM207: Advanced Scientific Computing (Stochastic Optimization Methods, Monte Carlo Methods for Inference and Data Analysis). You can find all the lecture resources, notes, and even Jupyter notebooks running through the techniques here.

Here is also a great video which talks about converting between Bayesian and frequentist domains (go to around 11 minutes in the video).

If you want to become a really great data scientist, I would suggest you get a firm grip on Bayesian statistics and how it can be used to solve problems. The learning curve is steep, but mastering it is a great way to separate yourself from other data scientists. From discussions I have had with colleagues going into data science interviews, Bayesian modeling is something that comes up pretty often, so keep that in mind!

Newsletter

For updates on new blog posts and extra content, sign up for my newsletter.
