From Theory to Practice with Bayesian Neural Network, Using Python

Here’s how to incorporate uncertainty in your Neural Networks, using a few lines of code

Piero Paialunga
Towards Data Science



I have a master's degree in physics and work as an aerospace engineering researcher.

Physics and engineering are two distinct sciences that share a desire to understand nature and the ability to model it.

The approach of a physicist is more theoretical. The physicist looks at the world and tries to model it in the most accurate way possible. The reality that a physicist models is imperfect and has approximations, but once we consider these imperfections the reality becomes neat, perfect, and elegant.

The approach of an engineer is way more practical. The engineer realizes all the limits of the physicist’s models and tries to make the experience as smooth as possible in the laboratory. The engineer might make more brutal approximations (e.g. pi = 3), but those approximations are actually more efficient in real-life experiments.

This difference between the practical approach of an engineer and the elegant and theoretical approach of a physicist is summarized by this quote by Gordon Lindsay Glegg:

A scientist can discover a new star but he cannot make one. He would have to ask an engineer to do it for him.

In the everyday life of a researcher, it kind of works like this. A physicist is someone who has a theory about a particular phenomenon. An engineer is a scientist who can set up the experiment and see if the theory works.

Now practically, as I started this transition from physicist to engineer, one of the questions I got asked a lot was the following:

“Ok, your model seems to work… but how robust is it?”

This is a typical engineer's question.

When you have a physical model, given certain conditions, the model is theoretically perfect.


Nonetheless, when you perform the experiment, there is a certain degree of error, and you have to be able to estimate it properly.


In a specific case like this, how do we estimate the difference between the theoretical output and the experimental result?

Two options:

A. If the model is deterministic, you may change the initial conditions by a certain delta (e.g. apply that deterministic rule to a noisy version of the input)

B. If the model is probabilistic, for some given inputs, you extract some statistical information from the output (e.g. mean, standard deviation, uncertainty boundaries…)

Now let’s get into the language of machine learning. In this specific case:

A. If the machine learning model is deterministic, we can test its robustness by shuffling the training and validation sets.

B. If the machine learning model is probabilistic, for some given inputs, you extract some statistical information from the output (e.g. mean, standard deviation, uncertainty boundaries…)

Now, let’s say that the model we want to use is a neural network.
First question: do you need a Neural Network? If the answer is yes, then you have to use it (you don’t say). The question then becomes:

“Is your machine learning model robust?”

In its original formulation, a neural network is purely deterministic.
We can shuffle the training, validation, and test sets, but we need to consider that neural networks can take a long time to train, and if we want to run many tests (let’s say CV = 10,000), we might have to wait a while.

Another thing that we need to consider is that the neural network is optimized using an algorithm known as gradient descent. The idea is that we start from a point in the parameter space and, as the name suggests, descend in the direction indicated by the negative gradient of the loss. This would ideally take us to a global minimum (spoiler: it is never actually global).

The ideal situation for an unrealistically simple 1D loss function is the following:

Image by author: an unrealistically simple 1D loss function with a single global minimum

Now, in this situation, if we change the starting point, we still converge to the only global minimum.

A more realistic situation is something like this:

Image by author: a more realistic 1D loss function with several local minima

So, if we randomly restart the training algorithm from different starting points, we converge to different local minima.

Image by author: three different starting points converging to different local minima

So, starting from point 1 or point 3 takes us to a lower minimum than starting from point 2.

The loss function can potentially be full of local minima, so finding the true global minimum can be a hard task. Another thing we could do is restart the training from different starting points and compare the loss function values. This approach has the same problem as before: we can only do it so many times.
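To make this concrete, here is a toy sketch (the 1D loss below is invented purely for illustration) of gradient descent restarted from a few different points, comparing the final loss values:

import numpy as np

# A toy 1D loss with several local minima
def loss(w):
    return np.sin(3 * w) + 0.1 * w ** 2

def grad(w):
    return 3 * np.cos(3 * w) + 0.2 * w

def gradient_descent(w0, lr=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move against the gradient
    return w, loss(w)

# Restarting from different points lands in different local minima
for w0 in [-3.0, 0.5, 2.5]:
    w_final, loss_final = gradient_descent(w0)
    print(f"start = {w0:+.1f}  ->  w = {w_final:+.3f},  loss = {loss_final:+.3f}")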

There is a more robust, rigorous, and elegant approach that uses the same computational power of neural networks in a probabilistic way, and it is called Bayesian Neural Networks.

In this article, we will learn:

  1. The idea behind Bayesian Neural Networks
  2. The mathematical formulation behind Bayesian Neural Networks
  3. The implementation of Bayesian Neural Networks using Python (more specifically PyTorch)
  4. How to solve a regression problem using a Bayesian Neural Network

Let’s start!

1. What is a Bayesian Neural Network?

As we said earlier, the idea of a Bayesian neural network is to add a probabilistic “sense” to a typical neural network. How do we do that?

Before understanding a Bayesian neural network, we should probably review a bit of the Bayes theorem.

A very efficient way of seeing the Bayes theorem is the following:

“The Bayes theorem is the mathematical theorem that explains why if all the cars in the world are blue then my car has to be blue, but just because my car is blue it doesn’t mean that all the cars in the world are blue.”

In mathematical terms, given events “A” and "B," the probability of event "A" happening given that event "B" has happened is the following:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

And the probability of event “B” happening given that event “A” has happened is the following:

P(B \mid A) = \frac{P(A \cap B)}{P(A)}

The equation that links these two expressions is the following:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
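As a quick sanity check, with made-up probabilities (the numbers below are purely illustrative):

# Hypothetical numbers, only to verify the formula numerically
p_A, p_B = 0.3, 0.4        # P(A), P(B)
p_B_given_A = 0.5          # P(B|A)

p_A_given_B = p_B_given_A * p_A / p_B   # Bayes' theorem
print(p_A_given_B)          # 0.375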

Got it? Great. Now, let’s say that you have your neural network model. This neural network is nothing more than a set of parameters that convert a given input to the desired output.

A feed-forward neural network (the simplest deep learning structure) processes your input by multiplying the input by a matrix of parameters. Then a non-linear activation function (this is the true power of neural nets) is applied entry-wise to the result of this matrix multiplication. The result is the input of the next layer, where the same procedure is applied.
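This is easier to see with a minimal sketch (the dimensions below are made up and not tied to any particular model):

import torch

torch.manual_seed(0)

x = torch.randn(4)                          # a toy 4-dimensional input
W1, b1 = torch.randn(8, 4), torch.randn(8)  # parameters of the first layer
W2, b2 = torch.randn(1, 8), torch.randn(1)  # parameters of the second layer

h = torch.relu(W1 @ x + b1)  # matrix multiplication + entry-wise non-linearity
y = W2 @ h + b2              # the result becomes the input of the next layer
print(y)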

We will now refer to the model’s set of parameters as w. Now we can ask ourselves this tricky question.

Let’s say that I have a dataset D that is a set of pairs of input x_i and output y_i, for example, the i-th image of an animal and the i-th label (cat or dog):

D = \{ (x_i, y_i) \}_i

What is the probability of having a set of parameters, given a certain dataset D?

You probably need to read this question 3 or 4 times to grasp it, but the idea is there. If you have a certain mapping between input and output, in the extreme deterministic case, only a single set of parameters will be able to process the input and bring you the desired output. In a probabilistic view, there will be some sets of parameters that are more probable than others.

So what we are interested in is the quantity:

p(w \mid D)

Now, three things are pretty cool about it:

  1. You can still see it as a standard neural network model when you consider the average output under that distribution. For example:

\hat{y}(x) = \mathbb{E}_{p(w \mid D)}\left[ f_w(x) \right] \approx \sum_{i=1}^{N} p(w_i \mid D)\, f_{w_i}(x)

where f_w denotes the network with parameters w.

Whereas the left-hand side of the equation represents the computed average output, the right-hand side represents the average over N possible sets of parameters, with the probability distribution providing the weight for each result (see the small numerical sketch after this list).

2. While p(w|D) is obviously a mystery, p(D|w) is something we can always work with. In principle, if we used the equation above with a huge N, there would be no need for training at all: we could simply try all the possible parameter sets of a given Neural Network and weight their results using the equation above.

3. When we get p(w|D), we are not only getting one machine learning model; we are virtually getting infinitely many machine learning models. This means that we can extract uncertainty boundaries and statistical information from our predictions. The result is not only "10.23", but rather "10.23 with a possible error of 0.50".
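Here is the small numerical sketch promised above, with made-up posterior weights and predictions:

import numpy as np

# Hypothetical posterior probabilities p(w_i | D) for three parameter sets
posterior = np.array([0.6, 0.3, 0.1])
# The prediction f_{w_i}(x) that each parameter set produces for the same input x
predictions = np.array([10.1, 10.4, 10.9])

# The averaged output is the posterior-weighted mean of the individual predictions
y_hat = np.sum(posterior * predictions)
print(y_hat)  # 10.27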

I hope I’ve hyped you up. Let’s get to the next chapter.

2. Some math

I don’t want this article to be small talk, but I don’t want it to be a pain either. If you get the idea behind Bayesian Neural Networks, or if you already know the math behind them, feel free to skip this chapter. If you want a reference, a good one is the following: Hands-on Bayesian Neural Networks — A Tutorial for Deep Learning Users.

Now all this seems cool and swag, but I think that if you are a machine learning user, you have this thought in mind:

“How am I ever going to be able to optimize such a weird creature?”

The short answer is: “by maximizing the following quantity.”

\mathbb{E}_{q(w)}\left[ \log p(D \mid w) \right] - \text{KL}\left( q(w) \,\|\, p(w) \right)

But I don’t think that this is self-explanatory.

In this case, the goal of the optimization is to find the best estimate of the distribution p(w|D). We will call this estimate q, and we want a measure of the distance between the two distribution functions.

The measure that we will use is called the Kullback–Leibler divergence:

\text{KL}\left( q \,\|\, p \right) = \int q(w) \, \log \frac{q(w)}{p(w)} \, dw

Some fun facts about it:

  1. It is 0 for two distributions that are equal
  2. It diverges to infinity if, somewhere, the distribution in the denominator tends to zero while the one in the numerator is still non-zero
  3. It is non-symmetric (as the small sketch below shows)
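A tiny numerical sketch, with made-up discrete distributions, illustrates these facts:

import numpy as np

def kl(q, p):
    # Discrete Kullback-Leibler divergence KL(q || p)
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum(q * np.log(q / p))

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]

print(kl(q, q))  # 0.0: the divergence vanishes for identical distributions
print(kl(q, p))  # > 0
print(kl(p, q))  # a different value: the divergence is not symmetric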

Now, the loss function that you saw above is a surrogate for this Kullback–Leibler divergence: maximizing it is equivalent to minimizing the divergence between q and p(w|D). It is called the evidence lower bound (ELBO).

The distribution of the weights q is taken to be a normal distribution with mean μ and variance σ²:

q(w) = \mathcal{N}(w \mid \mu, \sigma^2)

So the optimization is about determining the best mu and sigma values for that distribution.

L(\mu, \sigma) = \text{KL}\left( q_{\mu,\sigma}(w) \,\|\, p(w \mid D) \right), \qquad (\mu^*, \sigma^*) = \arg\min_{\mu, \sigma} L(\mu, \sigma)

In the practical PyTorch implementation, the MSE between the mean of the distribution and the target is also added to our L(μ, σ).

3. Pyt(orch)hon implementation

The implementation of Bayesian neural networks in Python using PyTorch is straightforward thanks to a library called torchbnn.

Installing it is super easy with:

pip install torchbnn

And as we will see, we will build something that is very similar to a standard Torch neural network:

import torch.nn as nn
import torchbnn as bnn

# Each BayesLinear layer learns a distribution over its weights, starting from a N(prior_mu, prior_sigma^2) prior
model = nn.Sequential(
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1, out_features=1000),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1000, out_features=1),
)

Actually, the library also provides a utility to convert your Torch model into its Bayesian counterpart:

# transform_model replaces every nn.Conv2d in `model` with an equivalent
# bnn.BayesConv2d, copying over the listed attributes
transform_model(model, nn.Conv2d, bnn.BayesConv2d,
                args={"prior_mu": 0, "prior_sigma": 0.1,
                      "in_channels": ".in_channels", "out_channels": ".out_channels",
                      "kernel_size": ".kernel_size", "stride": ".stride",
                      "padding": ".padding", "bias": ".bias"},
                attrs={"weight_mu": ".weight"})

But let’s do a hands-on, detailed example:

4. Hands-On regression task

The first thing to do is to import some libraries:
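The exact import list is not reproduced here, but a set that covers everything used below is:

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim

import torchbnn as bnn  # Bayesian layers and the KL loss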

After that, we will make our very simple bidimensional dataset:
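The original generating functions are not reproduced here; as an illustrative stand-in, a noisy cubic does the job:

# Assumed for illustration: a cubic ground truth plus Gaussian noise
def clean_target(x):
    return x.pow(3) - x.pow(2)

def target(x):
    return clean_target(x) + 0.3 * torch.randn(x.size())

x = torch.linspace(-2, 2, 500).unsqueeze(1)  # 1D input in [-2, 2], shape (500, 1)
y = target(x)                                # noisy observations, shape (500, 1)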

So, given our 1D input x (ranging from -2 to 2), we want to find our y.

clean_target is our ground-truth generator, and target is our noisy data generator.

Now we will define our Bayesian feed-forward neural network:
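A model along the lines of the one shown earlier (one hidden Bayesian layer of 1,000 units with a ReLU in between) does the job:

model = nn.Sequential(
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1, out_features=1000),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1000, out_features=1),
)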

As we can see, it is a two-layer feed-forward neural network with Bayesian layers. This will allow us to have a probabilistic output.

Now we will define our MSE loss and the Kullback-Leibler divergence term:
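Following the torchbnn examples, the two terms can look like this (the kl_weight value is a tunable trade-off):

mse_loss = nn.MSELoss()
kl_loss = bnn.BKLLoss(reduction='mean', last_layer_only=False)
kl_weight = 0.01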

Both the losses will be used in our optimization step:
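A minimal training loop, assuming the Adam optimizer with a learning rate of 0.01, looks like this:

optimizer = optim.Adam(model.parameters(), lr=0.01)

for step in range(2000):
    pre = model(x)               # one stochastic forward pass
    mse = mse_loss(pre, y)       # data-fit term
    kl = kl_loss(model)          # KL term over the Bayesian layers
    cost = mse + kl_weight * kl  # total loss

    optimizer.zero_grad()
    cost.backward()
    optimizer.step()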

2000 epochs have been used.

Let’s define our test set:
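For instance, a dense grid over the same input range:

x_test = torch.linspace(-2, 2, 300).unsqueeze(1)  # evaluation grid, shape (300, 1)
y_test = clean_target(x_test)                     # noiseless reference curve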

Now, the result that comes out of the model class is probabilistic. This means that if we run our model 10,000 times, we will get 10,000 slightly different values. For each data point from -2 to 2, we will get the mean and standard deviation, and we will plot our confidence intervals.
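A sketch of this sampling and plotting step (the three-sigma band is a choice, not a requirement):

# Each forward pass samples new weights, so predictions differ from run to run
models_result = np.array([model(x_test).data.numpy() for _ in range(10000)])
models_result = models_result[:, :, 0].T      # shape: (n_points, n_samples)

mean_values = models_result.mean(axis=1)      # per-point mean prediction
std_values = models_result.std(axis=1)        # per-point standard deviation

x_plot = x_test.numpy().ravel()
plt.plot(x_plot, mean_values, label='mean prediction')
plt.fill_between(x_plot,
                 mean_values - 3 * std_values,
                 mean_values + 3 * std_values,
                 alpha=0.3, label='uncertainty band')
plt.scatter(x.numpy().ravel(), y.numpy().ravel(), s=5, alpha=0.3, label='data')
plt.legend()
plt.show()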

5. Wrapping it up

In this article, we saw how to build a Machine Learning model that incorporates the power of a neural network and still keeps a probabilistic approach to our predictions.

In order to do that, we can build what is called a Bayesian Neural Network.
The idea is not to optimize the loss of a single Neural Network but the loss of infinitely many Neural Networks. In other words, we are optimizing the probability distribution of our model parameters given a dataset.

We did that using a loss function that incorporates the Kullback-Leibler divergence, which is used to compute the distance between two distributions.

After optimizing our loss function, we are able to use a model that is probabilistic. This means that if we run this model twice we get two different results, and if we run it 10k times we can extract a robust statistical distribution of our results.

We implemented this using torch and a library named torchbnn. We built a simple regression task and solved it using a two-layer feed-forward neural network.

6. Conclusions

If you liked the article and you want to know more about machine learning, or you just want to ask me something, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any “maximum number of stories for the month” and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.

