Uncertainty in Deep Learning — Epistemic Uncertainty and Bayes by Backprop

Kaan Bıçakcı

Published in

Towards Data Science

13 min read · Feb 18, 2022

Knowledge is an unending adventure at the edge of uncertainty.
- Jacob Bronowski

This is the third part of the series Uncertainty In Deep Learning.

Part 1 — Brief Introduction
Part 2 — Aleatoric Uncertainty and Maximum Likelihood Estimation
Part 3 — Epistemic Uncertainty and Bayes by Backprop
Part 4 — Implementing Fully Probabilistic Bayesian CNN
Part 5 — Experiments with Bayesian CNN
Part 6 — Bayesian Inference and Transformers

Introduction

In this article we will explore how to represent epistemic uncertainty using TensorFlow Probability, and we will walk through the underlying algorithm and its theoretical background.

The article is organized as follows:

What is Epistemic Uncertainty
Problem With Normal Neural Networks
Bayesian Neural Networks
Mathematics Behind This Scheme
Variational Bayes Methods
Backpropagation in Bayesian Neural Networks
Minibatches & Re-Weighting the KL-Divergence
Weight Uncertainty Using TensorFlow Probability
Conclusion & Next Steps

What is Epistemic Uncertainty?

Epistemic uncertainty is uncertainty caused by knowledge about the world that is missing, imprecise, or perhaps wrong. It exists in the real world and is not just a subjective feeling.

If you ask me, it is the most important type of uncertainty to deal with because it is what prevents you from being certain about anything. Our reality is often a big unknown and the information we base our decisions on could be flawed, incomplete, or simply unavailable.

No one knows anything with certainty because we are born into a world of uncertainty.

For example, you may be very confident in your answer to a mathematical problem, but that doesn’t mean you’re correct. Deep neural networks have the same problem: they are known to be overconfident, partly because of their output activations.

As a more concrete example, if you randomly measure 10 people to estimate the average height of a large population, your estimate is going to be inaccurate, because you might happen to select people who are taller or shorter than average. The more people you measure, the more accurate your estimate will be. The uncertainty that comes from having too few measurements is epistemic uncertainty.

So epistemic uncertainty can be reduced with more data: if you measure more people, the estimate becomes more accurate.
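The height example can be simulated in a few lines. This is only a sketch: the population (mean 170 cm, standard deviation 10 cm) is made up for illustration, and samples are drawn with replacement for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of heights in cm (the numbers are illustrative only).
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

def spread_of_estimates(sample_size, repeats=2000):
    """Standard deviation of the sample mean across repeated random samples."""
    samples = rng.choice(population, size=(repeats, sample_size))
    return float(samples.mean(axis=1).std())

spread_10 = spread_of_estimates(10)
spread_1000 = spread_of_estimates(1000)
print(f"spread of the estimate with   10 people: {spread_10:.2f} cm")
print(f"spread of the estimate with 1000 people: {spread_1000:.2f} cm")
```

The spread of the estimate shrinks roughly with the square root of the sample size, which is the sense in which more data reduces epistemic uncertainty.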

Before we start, I will add the imports:

Problem With Normal Neural Networks

Since we use Maximum Likelihood Estimation to find the weight values that best explain the data, and our data is finite, there is usually more than one model that can explain or fit the data. The key point is that the learned weights are single, deterministic point estimates.

Let’s check whether that’s the case, starting with linear regression:

When 3 separate models are fitted, their starting points will be different because the weights are initialized randomly. The optimization process is also stochastic, so the final weight values will be different every time.

That’s epistemic uncertainty about the model’s weights: there is no single answer, only a set of reasonable weights. And we can get better estimates as the dataset gets bigger.
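To make this concrete without any deep learning library, here is a small NumPy sketch that fits the same line three times with mini-batch SGD. The data (y = 2x + 1 plus noise), the learning rate and the helper name fit_linear are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical noisy data from y = 2x + 1 (coefficients chosen only for illustration).
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=20)
y = 2.0 * x + 1.0 + data_rng.normal(scale=0.5, size=x.shape)

def fit_linear(seed, epochs=30, lr=0.05, batch_size=5):
    """Fit y = a*x + b with mini-batch SGD from a random starting point."""
    rng = np.random.default_rng(seed)
    a, b = rng.normal(size=2)                 # random initialisation
    for _ in range(epochs):
        order = rng.permutation(x.size)       # random mini-batch order
        for start in range(0, x.size, batch_size):
            idx = order[start:start + batch_size]
            err = a * x[idx] + b - y[idx]
            a -= lr * 2.0 * np.mean(err * x[idx])   # d(MSE)/da
            b -= lr * 2.0 * np.mean(err)            # d(MSE)/db
    return float(a), float(b)

fits = [fit_linear(seed) for seed in (0, 1, 2)]
for a, b in fits:
    print(f"a = {a:.3f}, b = {b:.3f}")
```

Each run lands on a slightly different pair (a, b), yet all of them describe the data about equally well.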

Let’s check for a non-linear regression problem:

As we expected, all three curves seem reasonable although the learned weights are different. The same reasoning applies in this case too.

Real-life problems and models are more complicated, so there should be even more sets of weights that explain the data well. Because the loss landscape has many local minima, it makes more sense to think in terms of sets of plausible weights than a single best one.

In a nutshell, the problem is that we only have point estimates for the weights and know nothing about the other plausible values.

But how can we incorporate epistemic uncertainty into neural networks?

Bayesian Neural Networks

Idea

Weight Uncertainty in Neural Networks [1].

When we train a neural network, we end up with point estimates for the weights. However, as we discussed, there are multiple sets of weights that explain the data reasonably well.

In order to capture epistemic uncertainty in the model weights, we simply replace each weight with a probability distribution. So, instead of learning point estimates, we learn the mean and standard deviation of these distributions via backpropagation.

Deep Dive

Since each weight is replaced by a probability distribution, a weight no longer has a single value. In order to make predictions or run a forward pass, we need to draw sample(s) from these distributions. Training then repeats three steps:

1) Sample from network’s weights
2) Determine the output value
3) Update the mean and standard deviation.
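Steps 1 and 2 can be sketched in plain NumPy for a single linear unit. The means and standard deviations below are hypothetical values standing in for learned ones, and the update in step 3 is left out on purpose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational parameters for a 1-input, 1-output linear layer.
w_mean, w_std = 2.0, 0.3   # distribution over the weight
b_mean, b_std = 1.0, 0.1   # distribution over the bias

def forward(x):
    """One stochastic forward pass: sample the weights, then compute the output."""
    w = rng.normal(w_mean, w_std)
    b = rng.normal(b_mean, b_std)
    return w * x + b

x = np.linspace(-1.0, 1.0, 5)
predictions = np.stack([forward(x) for _ in range(100)])  # shape (100, 5)

print("predictive mean:", predictions.mean(axis=0))
print("predictive std: ", predictions.std(axis=0))
```

Because every call to forward draws new weights, repeated calls on the same input give a spread of outputs, and that spread is what we read as epistemic uncertainty.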

Mathematics Behind This Scheme

The paper Weight Uncertainty in Neural Networks [1] introduces a method for this task, called Bayes by Backprop. The key idea relies on the famous Bayes’ Theorem:

If you replace the denominator, the final form of Bayes’ theorem can be written as:
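For reference, the two forms of Bayes’ theorem mentioned above can be written in the notation used later in this article (theta for the weights, x for the data). The first line is the usual statement, and the second replaces the denominator with the integral over all possible weights:

```latex
P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}

P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{\int P(x \mid \theta')\, P(\theta')\, d\theta'}
```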

Bayes’ Theorem makes it possible to combine a prior belief and a likelihood to obtain a distribution over the model parameters, called the posterior distribution.

Training Process in Principle

Select a prior distribution
Determine the likelihood
Determine the posterior distribution using the Bayes’ Theorem

However, calculating the true posterior is hard and may not be possible at all since it contains complicated integral(s).

In the end, we aim to get a distribution for the model weights.

Variational Bayes Methods

As you may have noticed, the denominator contains a complex integral, which is what makes the true posterior distribution hard to calculate. For this reason, we need to approximate the posterior distribution.

Variational Bayes methods try to approximate the true posterior distribution with a known distribution, called the variational posterior. You may think that approximating one distribution with another is risky, as the approximation can be really bad. That’s true; to mitigate this, the variational posterior has parameters that are tuned so that it is as close as possible to the true posterior distribution.

Variational posterior is parameterized by phi here. Theta represents the weights while x represents the data.

Kullback-Leibler Divergence

Intuitively, these two distributions should be close to each other. So, how do we measure this? There is a measure called the KL divergence (strictly speaking not a metric, since it is not symmetric), defined as:

The last equation actually tells us:

We transformed first integral into log(P(x)) because Q(Theta | Phi) is a probability distribution and integrates to 1.
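The steps described above can be written out as follows. This is a reconstruction from the surrounding text using Q(theta | phi) for the variational posterior, P(theta) for the prior and P(x | theta) for the likelihood; the second line substitutes Bayes’ theorem for P(theta | x).

```latex
\begin{aligned}
\mathrm{KL}\big(Q(\theta \mid \phi)\,\|\,P(\theta \mid x)\big)
  &= \int Q(\theta \mid \phi)\,\log\frac{Q(\theta \mid \phi)}{P(\theta \mid x)}\,d\theta \\
  &= \int Q(\theta \mid \phi)\,\log\frac{Q(\theta \mid \phi)\,P(x)}{P(x \mid \theta)\,P(\theta)}\,d\theta \\
  &= \int Q(\theta \mid \phi)\,\log P(x)\,d\theta
     + \int Q(\theta \mid \phi)\,\log\frac{Q(\theta \mid \phi)}{P(\theta)}\,d\theta
     - \int Q(\theta \mid \phi)\,\log P(x \mid \theta)\,d\theta \\
  &= \log P(x)
     + \mathrm{KL}\big(Q(\theta \mid \phi)\,\|\,P(\theta)\big)
     - \mathbb{E}_{Q(\theta \mid \phi)}\big[\log P(x \mid \theta)\big]
\end{aligned}
```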

Remember that we want the KL divergence to be as low as possible, so this is now an optimization problem. Since P(x) does not depend on phi, it is a constant and we can ignore it. In the end, we are left with this equation:

We will see what exactly the expectation of the NLL (negative log-likelihood) is. But first let’s wrap this up; our loss function is:

Image by author.

There is one other thing I want to mention, I will not discuss how it is derived but would like to show Evidence Lower Bound or ELBO.

Image by author.
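In the notation of this article, the loss function and the ELBO described here can be written as below. The symbol L(phi) for the loss is my own shorthand.

```latex
\mathcal{L}(\phi)
  = \mathrm{KL}\big(Q(\theta \mid \phi)\,\|\,P(\theta)\big)
  + \mathbb{E}_{Q(\theta \mid \phi)}\big[-\log P(x \mid \theta)\big]

\mathrm{ELBO}(\phi)
  = \mathbb{E}_{Q(\theta \mid \phi)}\big[\log P(x \mid \theta)\big]
  - \mathrm{KL}\big(Q(\theta \mid \phi)\,\|\,P(\theta)\big)
  = -\mathcal{L}(\phi)
```

The name comes from the fact that log P(x) = ELBO(phi) + KL(Q(theta | phi) || P(theta | x)), and since a KL divergence is never negative, the ELBO is a lower bound on the log evidence log P(x).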

As you may have noticed, the ELBO is the negative of our loss function. So minimizing the loss is equivalent to maximizing the ELBO.

At the end of the day, we want KL-Divergence between prior and variational posterior to be as low as possible while keeping the expected likelihood as high as possible.

Backpropagation in Bayesian Neural Networks

Recall our loss function:

Image by author.

So, the expectation term is an integral of the negative log-likelihood, weighted by the variational posterior density, over the parameters. The KL divergence is another integral of the same kind. Skipping a lot of math, if we write both terms as integrals, we end up with:
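Written as integrals over the weights, with the same symbols as before, the loss takes this form:

```latex
\mathcal{L}(\phi)
  = \int Q(\theta \mid \phi)\,\log\frac{Q(\theta \mid \phi)}{P(\theta)}\,d\theta
  - \int Q(\theta \mid \phi)\,\log P(x \mid \theta)\,d\theta
```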

Exactly minimising this cost naively is computationally prohibitive. Instead gradient descent and various approximations are used [1].

Taking derivatives with respect to phi involves an integral over Theta, aka the model weights. This might be very expensive in terms of computation or not even possible to do so!

One way around this is to rewrite the objective as an expectation and apply a Monte Carlo approximation in order to calculate the derivatives w.r.t. phi.

Now we have:

Image by author.
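In expectation form, and with the Monte Carlo approximation that follows from it, the objective can be written as below. Here n is the number of weight samples and theta^(i) denotes the i-th sample drawn from the variational posterior.

```latex
\mathcal{L}(\phi)
  = \mathbb{E}_{Q(\theta \mid \phi)}\big[\log Q(\theta \mid \phi) - \log P(\theta) - \log P(x \mid \theta)\big]
  \approx \frac{1}{n}\sum_{i=1}^{n}
    \Big[\log Q(\theta^{(i)} \mid \phi) - \log P(\theta^{(i)}) - \log P(x \mid \theta^{(i)})\Big],
  \qquad \theta^{(i)} \sim Q(\theta \mid \phi)
```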

This expectation is still problematic to differentiate, because the distribution it is taken over depends on phi.

Unbiased Monte Carlo Gradients

One way to tackle this problem is to use the reparameterization trick.

Proposition 1 is a generalisation of the Gaussian reparameterisation trick (Opper and Archambeau, 2009; Kingma and Welling, 2014; Rezende et al., 2014) — [1]

I will not go deeper here, but in a nutshell, reparameterization moves the dependence on phi out of the sampling distribution, so that the expectation is taken over a noise distribution that does not depend on phi.

If you assume that Q(theta | phi), the variational posterior, is a Gaussian, you end up with the formula in the above image (Derivative of an expectation [1]).
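A tiny NumPy experiment shows the Gaussian reparameterization at work. It uses a toy objective f(theta) = theta^2, chosen only because its expectation mu^2 + sigma^2 has gradients we can check by hand; the values of mu and sigma are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.8                   # hypothetical variational parameters phi
eps = rng.standard_normal(200_000)     # noise that does not depend on phi

# Reparameterise: theta = mu + sigma * eps is distributed as N(mu, sigma^2).
theta = mu + sigma * eps

# Toy objective f(theta) = theta^2, so E[f] = mu^2 + sigma^2.
df_dtheta = 2.0 * theta

# Chain rule through the deterministic transform, averaged over the noise.
grad_mu = np.mean(df_dtheta)           # d theta / d mu = 1
grad_sigma = np.mean(df_dtheta * eps)  # d theta / d sigma = eps

print(f"grad wrt mu:    {grad_mu:.3f} (exact {2 * mu:.3f})")
print(f"grad wrt sigma: {grad_sigma:.3f} (exact {2 * sigma:.3f})")
```

The randomness now lives entirely in eps, so the gradient with respect to mu and sigma can be pushed inside the average.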

So, we have now all the pieces. Let’s see the learning process as a high level overview:
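As a rough, self-contained sketch of the loop, here is Bayes by Backprop for a single weight in NumPy. Several choices are mine and not taken from the article: the toy data, a N(0, 1) prior, a known noise level, plain SGD, and a closed-form Gaussian KL term where the paper uses a Monte Carlo estimate. The softplus parameterization of the standard deviation follows [1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x + noise, with a known noise standard deviation.
noise_std = 0.5
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(scale=noise_std, size=x.shape)

def softplus(r):
    return np.log1p(np.exp(r))

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

# Variational posterior Q(w) = N(mu, softplus(rho)^2); prior P(w) = N(0, 1).
mu, rho = 0.0, -3.0
lr = 0.002

for step in range(5000):
    sigma = softplus(rho)
    eps = rng.standard_normal()      # 1) sample noise
    w = mu + sigma * eps             # 2) reparameterised weight sample

    # 3) gradient of the negative log-likelihood w.r.t. the sampled weight
    dnll_dw = np.sum((w * x - y) * x) / noise_std**2

    # 4) gradients of KL(N(mu, sigma^2) || N(0, 1)) = 0.5*(sigma^2 + mu^2 - 1) - log(sigma)
    dkl_dmu = mu
    dkl_dsigma = sigma - 1.0 / sigma

    # 5) chain rule back to mu and rho (d sigma / d rho = sigmoid(rho)), then update
    mu -= lr * (dnll_dw + dkl_dmu)
    rho -= lr * (dnll_dw * eps + dkl_dsigma) * sigmoid(rho)

# This toy model is conjugate, so the exact posterior is available for comparison.
precision = 1.0 + np.sum(x**2) / noise_std**2
exact_mean = np.sum(x * y) / noise_std**2 / precision
exact_std = precision ** -0.5
learned_std = softplus(rho)
print(f"learned: mean = {mu:.3f}, std = {learned_std:.3f}")
print(f"exact:   mean = {exact_mean:.3f}, std = {exact_std:.3f}")
```

In a real network the same steps run for every weight at once, and automatic differentiation replaces the hand-written gradients.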

Minibatches & Re-Weighting the KL-Divergence

Recall our loss function:

Image by author.

One common approach is to use mini-batches and take the average of the gradients over all the elements:

(BatchSize is abbreviated as Batch.)

The term which is on the right hand side is calculated automatically by TensorFlow. We also need to re-weight the KL term for a proper training process. This can be done by

Image by author.

where M is the total number of samples. By default, TensorFlow averages the loss over the elements of the mini-batch for every loss function or model that you train. This is why the KL divergence also needs to be re-weighted. As this derivation shows, the result is an unbiased estimate of the true ELBO objective (divided by M).
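The re-weighting can be summarized in one line. Assuming the data points are independent, so that the negative log-likelihood is a sum over the M examples, dividing the full loss by M and replacing the average over the dataset with an average over a random mini-batch of size B gives:

```latex
\frac{1}{M}\,\mathcal{L}(\phi)
  = \frac{1}{M}\,\mathrm{KL}\big(Q(\theta \mid \phi)\,\|\,P(\theta)\big)
  + \frac{1}{M}\sum_{i=1}^{M}\mathbb{E}_{Q(\theta \mid \phi)}\big[-\log P(x_i \mid \theta)\big]
  \approx \frac{1}{M}\,\mathrm{KL}\big(Q(\theta \mid \phi)\,\|\,P(\theta)\big)
  + \frac{1}{B}\sum_{i \in \text{batch}}\mathbb{E}_{Q(\theta \mid \phi)}\big[-\log P(x_i \mid \theta)\big]
```

The mini-batch average is an unbiased estimate of the dataset average, so only the KL term needs the explicit factor 1/M.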

That’s it, that’s the Bayes by Backprop method. Now it is time to implement this using TensorFlow Probability.

Weight Uncertainty Using TensorFlow Probability

DenseVariational Layer

What we have been discussing can be implemented with the help of the DenseVariational layer.

This layer has four important parameters:

make_prior_fn: Python callable taking tf.size(kernel), tf.size(bias), dtype and returns another callable which takes an input and produces a tfd.Distribution instance.

make_posterior_fn: Python callable taking tf.size(kernel), tf.size(bias), dtype and returns another callable which takes an input and produces a tfd.Distribution instance.

kl_weight: Amount by which to scale the KL divergence loss between prior and posterior.

kl_use_exact: Python bool indicating that the analytical KL divergence should be used rather than a Monte Carlo (MC) approximation.

Those explanations are taken from the original documentation. Let’s explore them with examples.

Weight Uncertainty in Regression

Let’s implement prior and posterior functions which we have described in the Bayes by backprop algorithm.

The prior and posterior functions take kernel_size and bias_size as arguments. Their sum is the total number of parameters that we want to learn. Since this prior is not trainable, we simply return a callable object, a Sequential model that outputs a Laplace prior. The prior is not trainable because we hard-coded the distribution’s location and scale.

The posterior function follows the same logic; however, we now use VariableLayer to indicate that this will be a trainable distribution. With params_size, we let TFP determine the correct shape for it, and we select a Normal distribution as the posterior.

So far so good: our prior is a Laplace distribution and our posterior is a Normal one. This is the posterior that we discussed before, i.e. the variational posterior.

Since the output is not a distribution, we can use mse as the loss function. In other words, this model can not capture aleatoric uncertainty.

The kl_weight argument is the weighting term which we discussed in the section Minibatches & Re-Weighting the KL-Divergence. By re-weighting the KL term we will get unbiased estimates. Recall the formula (ELBO Objective):

Image by author.

M corresponds to the total number of elements in the dataset. Since the KL term is multiplied by 1/M, in the layer we set kl_weight = 1 / x_100.shape[0].

We also did not set kl_use_exact = True. Depending on the choice of distributions used for both the posterior and the prior, it may be possible to analytically compute the KL divergence, and if it is and the analytical solution is registered in TFP library, then the kl_use_exact argument can be set to True. Otherwise this would raise an error. Here we calculate it by MC approximation.
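To see what an MC approximation of the KL term amounts to, here is a NumPy sketch that averages log q(w) - log p(w) over samples from the posterior. It is not TFP’s implementation; the posterior N(0, 1) and prior Laplace(0, 1) are hypothetical values, picked so that the integral can be worked out by hand as a check.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_log_prob(w, loc, scale):
    return -0.5 * ((w - loc) / scale) ** 2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)

def laplace_log_prob(w, loc, scale):
    return -np.abs(w - loc) / scale - np.log(2.0 * scale)

# Hypothetical posterior N(0, 1) and prior Laplace(0, 1).
w = rng.normal(0.0, 1.0, size=200_000)   # samples from the posterior
kl_mc = np.mean(normal_log_prob(w, 0.0, 1.0) - laplace_log_prob(w, 0.0, 1.0))

# For this particular pair: KL = -0.5*log(2*pi*e) + log(2) + sqrt(2/pi).
kl_exact = -0.5 * np.log(2.0 * np.pi * np.e) + np.log(2.0) + np.sqrt(2.0 / np.pi)
print(f"Monte Carlo KL: {kl_mc:.4f}, hand-derived KL: {kl_exact:.4f}")
```

During training a single weight sample per forward pass is typically used, so each individual estimate is noisy but correct on average.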

You may ask: in model.compile() we set our loss to be mse, and there is nothing about the KL term? That’s correct; the KL term is added internally via model.add_loss when using this layer. We can see the KL term by running:

model_100.losses # Returns list.

This will give:

[<tf.Tensor 'dense_variational/kldivergence_loss/batch_total_kl_divergence:0' shape=() dtype=float32>]

We see that we don’t need to worry about the loss function; the only thing needed was to set kl_weight.

We have completed our model and layer specifications, let’s see the summary:

Layer (type)                Output Shape              Param #   
=================================================================
 dense_variational (DenseVar  (None, 1)                4         
 iational)                                                       
                                                                 
=================================================================
Total params: 4
Trainable params: 4
Non-trainable params: 0

The output shape is as expected. However, an ordinary dense layer would have only 2 parameters, corresponding to the coefficient and y-intercept in y = ax + b.

The problem is still the same. But now we learn parameters of the distributions over the weights and biases. In this case, each of them has a mean and variance, in total 4 learnable parameters.

We have a weight and a bias term, which is why there are two values in the mean() and variance() arrays. For example, the first entry of the posterior variance is the learned variance of the weight distribution.

Now each forward pass gives a different prediction, since we are sampling from the distributions over the weights. The model can therefore be thought of as an ensemble of regressors. Running the code above we get:

Notice how shaky the red lines are. These lines are generated by the model at each forward pass. Shaky lines imply that the epistemic uncertainty is high in this case. Because we don’t have infinitely many data points, in other words our dataset is finite, we may end up having different weight values which explain the data well.

In the beginning, we stated that epistemic uncertainty could be reduced with more data. Let’s see if that’s the case.

The relationship is the same but this time we have 1000 data points instead of 100. Model configuration also stays the same. I will not repeat the fitting process as it is identical. When we sample from the new model, we will get:

One thing to notice, now these lines are closer to each other. We can compare them using subplots:

Effect of Dataset Size

We conclude that adding more data reduces epistemic uncertainty, and it can be seen from the plots. When epistemic uncertainty is less, the lines are closer to each other. This is because we have more information when there is more data.
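The same effect can be checked without training anything. The sketch below swaps the variational model for exact Bayesian linear regression on a single slope, with a Gaussian prior and a known noise level (both my assumptions), because in that setting the posterior standard deviation has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_std_of_slope(n_points, noise_std=0.5, prior_std=1.0):
    """Exact posterior std of w in y = w*x + noise under a N(0, prior_std^2) prior."""
    x = rng.uniform(-1.0, 1.0, size=n_points)
    precision = 1.0 / prior_std**2 + np.sum(x**2) / noise_std**2
    return float(precision ** -0.5)

std_100 = posterior_std_of_slope(100)
std_1000 = posterior_std_of_slope(1000)
print(f"posterior std of the slope with  100 points: {std_100:.3f}")
print(f"posterior std of the slope with 1000 points: {std_1000:.3f}")
```

With ten times more data the posterior over the slope is roughly three times narrower, which is why the sampled lines bunch together.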

Weight Uncertainty in Non-Linear Regression

Suppose we have this data:

Let’s change our prior and variational posterior a little bit.

Now we have widened our prior, as the scale is multiplied by 2. In our VariableLayer we passed 2 * n, meaning that we want to learn both the mean and the standard deviation. This was done by params_size before.

Another change is in the posterior, where we have scaled down the standard deviation. Values other than 0.003 may work better. If we don’t scale down the posterior standard deviation, there may be convergence issues.

Layer (type)                Output Shape              Param #   
=================================================================
 dense_variational (DenseVar  (None, 128)              512       
 iational)                                                       
                                                                 
 dense_variational_1 (DenseV  (None, 64)               16512     
 ariational)                                                     
                                                                 
 dense_variational_2 (DenseV  (None, 1)                130       
 ariational)                                                     
                                                                 
=================================================================
Total params: 17,154
Trainable params: 17,154
Non-trainable params: 0
_________________________________________________________________

This time the layers have non-linear activations, and the rest of the process is the same. We specify our prior and posterior functions and re-weight the KL term.

But, we need to train this model longer as its job is harder.

In order to plot the outputs, we take the feed-forward output from the model 5 times.

As we can see, we got reasonable lines. Results may be improved by tweaking the variational posterior.

Before jumping into the conclusion, I want to mention that the choice of variational posterior affects the total number of parameters of the model. If we change our posterior to a Multivariate Gaussian in this case, this is how the model summary looks:

Layer (type)                Output Shape              Param #   
=================================================================
 dense_variational_3 (DenseV  (None, 128)              33152     
 ariational)                                                     
                                                                 
 dense_variational_4 (DenseV  (None, 64)               34093152  
 ariational)                                                     
                                                                 
 dense_variational_5 (DenseV  (None, 1)                2210      
 ariational)                                                     
                                                                 
=================================================================
Total params: 34,128,514
Trainable params: 34,128,514
Non-trainable params: 0
_________________________________________________________________

That’s because now we learn:

Mean
Standard Deviation
Covariances

and this is computationally expensive in this model.
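The numbers in both summaries can be reproduced with simple arithmetic. I am assuming layer widths 1 -> 128 -> 64 -> 1 and a full-covariance posterior stored as a lower-triangular factor; both are inferred from the parameter counts rather than stated in the article.

```python
def n_weights(inputs, units):
    """Kernel plus bias entries of a dense layer."""
    return inputs * units + units

def mean_field_params(n):
    """One mean and one standard deviation per weight."""
    return 2 * n

def full_covariance_params(n):
    """n means plus the n*(n+1)/2 entries of a lower-triangular factor."""
    return n + n * (n + 1) // 2

layers = [(1, 128), (128, 64), (64, 1)]
sizes = [n_weights(i, u) for i, u in layers]
mean_field = [mean_field_params(n) for n in sizes]
full_cov = [full_covariance_params(n) for n in sizes]

print(mean_field, sum(mean_field))  # [512, 16512, 130] 17154
print(full_cov, sum(full_cov))      # [33152, 34093152, 2210] 34128514
```

The covariance entries grow quadratically with the number of weights in a layer, which is why the middle layer alone jumps from about 16 thousand to about 34 million parameters.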

Conclusion

In this article we have seen:

Problem with Normal Neural Networks
The key idea of Bayesian Neural Networks
Variational Bayes methods
Minibatches & Re-Weighting KL Divergence
DenseVariational Layer

You can find the whole notebook and code here.

Next Steps

In the upcoming articles, we will focus on writing models that include convolutional layers and on capturing both epistemic and aleatoric uncertainty.

References

[1]: Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra, Weight Uncertainty in Neural Networks, 2015
