Uncertainty in Deep Learning — Epistemic Uncertainty and Bayes by Backprop
Knowledge is an unending adventure at the edge of uncertainty.
- Jacob Bronowski
This is the third part of the series Uncertainty In Deep Learning.
- Part 1 — Brief Introduction
- Part 2 — Aleatoric Uncertainty and Maximum Likelihood Estimation
- Part 3 — Epistemic Uncertainty and Bayes by Backprop
- Part 4 — Implementing Fully Probabilistic Bayesian CNN
- Part 5 — Experiments with Bayesian CNN
- Part 6 — Bayesian Inference and Transformers
Introduction
In this article we will explore how to represent epistemic uncertainty using TensorFlow Probability, and we will also look at the underlying algorithm and its theoretical background.
The article is organized as follows:
- What is Epistemic Uncertainty
- Problem With Normal Neural Networks
- Bayesian Neural Networks
- Mathematics Behind This Scheme
- Variational Bayes Methods
- Backpropagation in Bayesian Neural Networks
- Minibatches & Re-Weighting the KL-Divergence
- Weight Uncertainty Using TensorFlow Probability
- Conclusion & Next Steps
What is Epistemic Uncertainty?
Epistemic uncertainty is the uncertainty that comes from knowledge about the world that is missing, imprecise, or perhaps wrong. It exists in the real world and is not just a subjective feeling.
If you ask me, it is the most important type of uncertainty to deal with because it is what prevents you from being certain about anything. Our reality is often a big unknown and the information we base our decisions on could be flawed, incomplete, or simply unavailable.
No one knows anything with certainty because we are born into a world of uncertainty.
For example, you may be very confident of the answer to a mathematical problem, but that doesn’t mean you’re correct. This is also a problem for deep neural networks: they are known to be overconfident due to their output activations.
As a more concrete example, if you randomly measure 10 people to find the average height of a large population, your estimate of the population average is going to be inaccurate, because you might have selected people who are taller or shorter than average. The more people you measure, the more accurate your estimate will be. The uncertainty that remains in your estimate is epistemic uncertainty.
Evidently, epistemic uncertainty can be reduced with more data: if you measure more people, the estimate becomes more accurate.
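The height example can be simulated in a few lines. This is only a sketch: the population (mean 170 cm, standard deviation 10 cm) and the helper `spread_of_estimates` are made up for illustration and are not part of the original notebook.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100,000 heights in cm (illustrative numbers only).
population = [random.gauss(170.0, 10.0) for _ in range(100_000)]

def spread_of_estimates(sample_size, repeats=200):
    """Std. dev. of the sample mean across repeated random samples."""
    estimates = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(estimates)

spread_10 = spread_of_estimates(10)
spread_1000 = spread_of_estimates(1000)
print(f"spread with   10 people: {spread_10:.2f} cm")
print(f"spread with 1000 people: {spread_1000:.2f} cm")
```

The estimates based on 1000 people scatter far less than the ones based on 10 people, which is the sense in which more data reduces epistemic uncertainty.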
Before we start, I will add the imports:
Problem With Normal Neural Networks
Since we use Maximum Likelihood Estimation to find the weight values that best explain the data, and since our data is finite, there will be more than one model that fits the data. The key point is that the weights are single, deterministic point estimates.
Let’s check whether that is the case with a linear regression:
When 3 separate models are fitted, their starting points will be different because the weights are initialized randomly. The optimization process is also stochastic, so the final weight values will be different every time.
That is epistemic uncertainty about the model’s weights, because there is no single answer; there is a set of reasonable weights. And we can get better estimates if the dataset gets bigger.
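The following is a minimal pure-Python sketch of this experiment, not the original TensorFlow code. The data (a noisy line y = 2x + 1), the helper names `make_data` and `fit_sgd`, and all hyperparameters are assumptions chosen for illustration.

```python
import random

def make_data(n=20, seed=42):
    """Hypothetical noisy line y = 2x + 1 (illustrative values)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def fit_sgd(xs, ys, seed, epochs=50, lr=0.05):
    """Fit y = a*x + b by SGD on squared error from a random start."""
    rng = random.Random(seed)
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # random initialization
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # stochastic visiting order
        for i in idx:
            err = (a * xs[i] + b) - ys[i]
            a -= lr * err * xs[i]
            b -= lr * err
    return a, b

xs, ys = make_data()
fits = [fit_sgd(xs, ys, seed) for seed in (1, 2, 3)]
for a, b in fits:
    print(f"a = {a:.3f}, b = {b:.3f}")
```

Each run ends with slightly different values of a and b, yet all three lines describe the same data reasonably well.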
Let’s check for a non-linear regression problem:
As expected, all three lines seem reasonable although the learnt weights are different. The same reasoning applies to this case too.
Real-life problems are more complicated, so there will be multiple sets of weights that explain the data well. The models are also more complicated and the loss landscape has many local minima, so it makes more sense to think in terms of sets of reasonable weights rather than a single one.
In a nutshell, the problem is that we only have point estimates for the weights and do not know the range of plausible values.
But how can we incorporate epistemic uncertainty into neural networks?
Bayesian Neural Networks
Idea
When we train a neural network, we end up with point estimates for the weights. However, as we discussed, there are multiple sets of weights which explain the data reasonably well.
In order to capture epistemic uncertainty in the model weights, we replace each weight with a probability distribution. So, instead of learning point estimates, we learn the mean and standard deviation of these distributions via backpropagation.
Deep Dive
Since each weight is replaced by a probability distribution, there is no longer a single value per weight. In order to make predictions, i.e. to run a feed-forward pass, we need to take sample(s) from these distributions. Training then repeats three steps:
- 1) Sample from network’s weights
- 2) Determine the output value
- 3) Update the mean and standard deviation.
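The first two steps can be sketched for a single-input linear model as below. The means and standard deviations are hypothetical values rather than learned ones, and step 3 (the update) is left out because it needs the loss discussed in the following sections.

```python
import random

random.seed(0)

# Hypothetical learned parameters for a one-input linear model y = w*x + b.
w_mean, w_std = 2.0, 0.3
b_mean, b_std = 1.0, 0.1

def forward(x):
    """One stochastic forward pass: sample weights, then compute the output."""
    w = random.gauss(w_mean, w_std)   # step 1: sample from the weight distributions
    b = random.gauss(b_mean, b_std)
    return w * x + b                  # step 2: determine the output value

predictions = [forward(1.5) for _ in range(5)]
print(predictions)  # five different values for the same input
```

Because the weights are re-sampled on every call, the same input produces a different output each time.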
Mathematics Behind This Scheme
The paper Weight Uncertainty in Neural Networks [1] introduces a method for this task, called Bayes by Backprop. The key idea relies on the famous Bayes’ Theorem:
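In the notation used in this article (Theta for the weights, x for the data), Bayes' theorem can be written as follows. The symbols are typeset here as \theta and x.

```latex
P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}
```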
If you replace the denominator, the final form of Bayes’ theorem can be written as:
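Writing the evidence P(x) as an integral over all possible weights gives the form below. The integration variable \theta' is only a dummy name introduced for this sketch.

```latex
P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{\int P(x \mid \theta')\, P(\theta')\, d\theta'}
```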
Bayes’ Theorem makes it possible to combine a prior belief with the likelihood to obtain a distribution over the model parameters, called the posterior distribution.
Training Process in Principle
- Select a prior distribution
- Determine the likelihood
- Determine the posterior distribution using the Bayes’ Theorem
However, calculating the true posterior is hard and may not be possible at all since it contains complicated integral(s).
In the end, we aim to get a distribution for the model weights.
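For a model with a single weight, the three steps can be carried out by brute force on a grid. This is a sketch under several assumptions that are not in the original article: toy data from y = 1.5x plus noise, a standard normal prior, and a known noise standard deviation. For real networks with many weights this approach is not feasible, which is why an approximation is needed.

```python
import math
import random

random.seed(0)

# Hypothetical data from y = 1.5 * x + noise (noise std 0.5 assumed known).
true_w, noise_std = 1.5, 0.5
xs = [random.uniform(-1.0, 1.0) for _ in range(30)]
ys = [true_w * x + random.gauss(0.0, noise_std) for x in xs]

def log_prior(w):
    """Step 1: a standard normal prior on the weight (up to a constant)."""
    return -0.5 * w * w

def log_likelihood(w):
    """Step 2: Gaussian likelihood of the data given the weight (up to a constant)."""
    return sum(-0.5 * ((y - w * x) / noise_std) ** 2 for x, y in zip(xs, ys))

# Step 3: Bayes' theorem on a grid; the sum plays the role of the integral.
grid = [-3.0 + 6.0 * i / 600 for i in range(601)]
log_post = [log_prior(w) + log_likelihood(w) for w in grid]
shift = max(log_post)                        # for numerical stability
unnormalized = [math.exp(lp - shift) for lp in log_post]
evidence = sum(unnormalized)                 # the denominator, up to a constant
posterior = [u / evidence for u in unnormalized]

post_mean = sum(w * p for w, p in zip(grid, posterior))
print(f"posterior mean of the weight: {post_mean:.3f}")
```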
Variational Bayes Methods
As you may have noticed, the denominator contains a complex integral, which makes the true posterior distribution hard to calculate. For this reason, we need to approximate the posterior distribution.
Variational Bayes methods try to approximate the true posterior distribution with a known distribution, called the variational posterior. You may think that approximating one function with another is risky, as the approximation can be really bad. That is true; to mitigate this issue, the variational posterior has some parameters, and these are tuned so that the approximation is as close as possible to the true posterior distribution.
The variational posterior Q(Theta | phi) is parameterized by phi. Theta represents the weights while x represents the data.
Kullback-Leibler Divergence
Intuitively, these two distributions should be close to each other. So, how do we measure this? There is a measure called the KL-Divergence, defined as:
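A sketch of the definition and its expansion in this article's notation is given below. The second line substitutes Bayes' theorem for the true posterior, and the third line splits the logarithm into three integrals.

```latex
\begin{aligned}
D_{KL}\big(Q(\theta \mid \phi) \,\|\, P(\theta \mid x)\big)
 &= \int Q(\theta \mid \phi) \log \frac{Q(\theta \mid \phi)}{P(\theta \mid x)} \, d\theta \\
 &= \int Q(\theta \mid \phi) \log \frac{Q(\theta \mid \phi)\, P(x)}{P(x \mid \theta)\, P(\theta)} \, d\theta \\
 &= \int Q(\theta \mid \phi) \log P(x) \, d\theta
  + \int Q(\theta \mid \phi) \log \frac{Q(\theta \mid \phi)}{P(\theta)} \, d\theta
  - \int Q(\theta \mid \phi) \log P(x \mid \theta) \, d\theta
\end{aligned}
```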
The last equation actually tells us:
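Written compactly, and assuming the three-integral expansion of the KL-Divergence, the decomposition reads as follows: a constant, plus the KL-Divergence between the variational posterior and the prior, plus the expected negative log-likelihood (NLL).

```latex
D_{KL}\big(Q(\theta \mid \phi) \,\|\, P(\theta \mid x)\big)
 = \log P(x)
 + D_{KL}\big(Q(\theta \mid \phi) \,\|\, P(\theta)\big)
 + \mathbb{E}_{Q(\theta \mid \phi)}\big[-\log P(x \mid \theta)\big]
```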
We transformed the first integral into log(P(x)) because Q(Theta | phi) is a probability distribution and integrates to 1.
Remember that we want the KL-Divergence to be as low as possible, so this is now an optimization problem. And since P(x) is constant with respect to phi, we can just ignore it. In the end, we are left with this equation:
We will see later what exactly the expectation of the NLL is. But first let’s wrap this up. Our loss function is:
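In this article's notation, the loss can be sketched as below. The name L(\phi) is introduced here only for convenience.

```latex
\mathcal{L}(\phi)
 = \mathbb{E}_{Q(\theta \mid \phi)}\big[-\log P(x \mid \theta)\big]
 + D_{KL}\big(Q(\theta \mid \phi) \,\|\, P(\theta)\big)
```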
There is one other thing I want to mention: the Evidence Lower Bound, or ELBO. I will not discuss how it is derived, but I would like to show it.
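In the same notation, the ELBO can be written as follows. It is a lower bound on the log evidence because the KL-Divergence to the true posterior is never negative.

```latex
\mathrm{ELBO}(\phi)
 = \mathbb{E}_{Q(\theta \mid \phi)}\big[\log P(x \mid \theta)\big]
 - D_{KL}\big(Q(\theta \mid \phi) \,\|\, P(\theta)\big)
 \;\le\; \log P(x)
```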
As you may have noticed, the ELBO is the negative of our loss function. So minimizing the loss is equivalent to maximizing the ELBO.
At the end of the day, we want the KL-Divergence between the prior and the variational posterior to be as low as possible while keeping the expected likelihood as high as possible.
Backpropagation in Bayesian Neural Networks
Recall our loss function:
So, the expectation term means integrating the negative log-likelihood over all possible weight values, weighted by the variational posterior. On the other hand, the KL-Divergence is another integral. Skipping a lot of math, if we write both terms as integrals, we end up with:
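With both terms written out as integrals over the weights, the loss takes the following form (a sketch in this article's notation, with L(\phi) denoting the loss):

```latex
\begin{aligned}
\mathcal{L}(\phi)
 &= -\int Q(\theta \mid \phi) \log P(x \mid \theta) \, d\theta
  + \int Q(\theta \mid \phi) \log \frac{Q(\theta \mid \phi)}{P(\theta)} \, d\theta \\
 &= \int Q(\theta \mid \phi) \big[ \log Q(\theta \mid \phi) - \log P(\theta) - \log P(x \mid \theta) \big] \, d\theta
\end{aligned}
```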
Exactly minimising this cost naively is computationally prohibitive. Instead gradient descent and various approximations are used [1].
Taking derivatives with respect to phi involves an integral over Theta, i.e. the model weights. This might be very expensive to compute, or not possible at all!
One way around this is to rewrite the loss in expectation form and apply a Monte Carlo approximation in order to calculate the derivatives w.r.t. phi.
Now we have:
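In expectation form, the loss and its Monte Carlo approximation can be sketched as follows. S, the number of sampled weight sets, is a symbol introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}(\phi)
 &= \mathbb{E}_{Q(\theta \mid \phi)}\big[ \log Q(\theta \mid \phi) - \log P(\theta) - \log P(x \mid \theta) \big] \\
 &\approx \frac{1}{S} \sum_{i=1}^{S} \big[ \log Q(\theta^{(i)} \mid \phi) - \log P(\theta^{(i)}) - \log P(x \mid \theta^{(i)}) \big],
 \qquad \theta^{(i)} \sim Q(\theta \mid \phi)
\end{aligned}
```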
However, this expectation is also problematic to differentiate, because the underlying distribution depends on phi.
Unbiased Monte Carlo Gradients
One way to tackle this problem is to use the reparameterization trick.
Proposition 1 is a generalisation of the Gaussian reparameterisation trick (Opper and Archambeau, 2009; Kingma and Welling, 2014; Rezende et al., 2014) — [1]
I will not go deeper here, but in a nutshell, with reparameterization we move the dependence on phi out of the sampling distribution, so that the expectation is eventually taken over a distribution that does not depend on phi.
If you assume that Q(theta | phi), the variational posterior, is a Gaussian, you end up with the formula in the above image (Derivative of an expectation [1]).
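The Gaussian case can be illustrated with a toy function. In the sketch below, f(theta) = theta^2 is a stand-in for the real loss, and the values of mu and sigma are hypothetical. It was chosen only because the exact gradients of its expectation are known (E[theta^2] = mu^2 + sigma^2), so the Monte Carlo estimates can be checked.

```python
import random

random.seed(0)

mu, sigma = 1.5, 0.5          # hypothetical variational parameters phi = (mu, sigma)
n_samples = 20_000

grad_mu, grad_sigma = 0.0, 0.0
for _ in range(n_samples):
    eps = random.gauss(0.0, 1.0)      # noise that does not depend on phi
    theta = mu + sigma * eps          # reparameterized sample from N(mu, sigma^2)
    df_dtheta = 2.0 * theta           # derivative of the toy f(theta) = theta^2
    grad_mu += df_dtheta * 1.0        # chain rule: d(theta)/d(mu) = 1
    grad_sigma += df_dtheta * eps     # chain rule: d(theta)/d(sigma) = eps
grad_mu /= n_samples
grad_sigma /= n_samples

# Closed form for comparison: E[theta^2] = mu^2 + sigma^2.
print(f"d/dmu:    MC {grad_mu:.3f}  vs exact {2 * mu:.3f}")
print(f"d/dsigma: MC {grad_sigma:.3f}  vs exact {2 * sigma:.3f}")
```

The expectation is taken over eps, whose distribution does not depend on mu or sigma, so the derivative can be moved inside the average.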
So, we now have all the pieces. Let’s see the learning process as a high-level overview:
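The loop below is a minimal pure-Python sketch of this process for a two-parameter linear model, written with hand-derived gradients instead of automatic differentiation. It makes several assumptions that differ from the rest of the article: toy data from y = 2x + 1, a Gaussian prior (the TFP example later uses a Laplace prior), a known noise standard deviation, full-batch updates, and one weight sample per step. The standard deviation is parameterized as sigma = log(1 + exp(rho)), as in [1].

```python
import math
import random

random.seed(0)

# Hypothetical data from y = 2x + 1 with noise std 0.5 (assumed known).
noise_std, prior_std = 0.5, 1.0
xs = [random.uniform(-1.0, 1.0) for _ in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, noise_std) for x in xs]
n = len(xs)

# Variational parameters phi = (mu, rho) for [slope, intercept]; sigma = softplus(rho).
mu = [0.0, 0.0]
rho = [0.0, 0.0]
lr, steps = 0.02, 3000

def softplus(r):
    return math.log1p(math.exp(r))

for _ in range(steps):
    # 1) Sample eps ~ N(0, 1) and build the weights theta = mu + sigma * eps.
    eps = [random.gauss(0.0, 1.0) for _ in mu]
    sigma = [softplus(r) for r in rho]
    theta = [m + s * e for m, s, e in zip(mu, sigma, eps)]

    # 2) Gradient of -log P(theta) - log P(x|theta) w.r.t. theta
    #    (Gaussian prior, Gaussian likelihood).
    resid = [theta[0] * x + theta[1] - y for x, y in zip(xs, ys)]
    grad_nll = [sum(r * x for r, x in zip(resid, xs)) / noise_std ** 2,
                sum(resid) / noise_std ** 2]
    grad_theta = [g + t / prior_std ** 2 for g, t in zip(grad_nll, theta)]

    # 3) Update mu and rho. For a Gaussian Q, the log Q term contributes
    #    nothing to the mu gradient and exactly -1/sigma to the sigma gradient.
    for k in range(2):
        g_mu = grad_theta[k]
        g_sigma = grad_theta[k] * eps[k] - 1.0 / sigma[k]
        g_rho = g_sigma / (1.0 + math.exp(-rho[k]))   # d softplus / d rho
        mu[k] -= lr * g_mu / n        # dividing by n only rescales the step size
        rho[k] -= lr * g_rho / n

sigma = [softplus(r) for r in rho]
print(f"slope:     mean {mu[0]:.2f}, std {sigma[0]:.2f}")
print(f"intercept: mean {mu[1]:.2f}, std {sigma[1]:.2f}")
```

In practice a framework computes these gradients automatically; the manual version is only meant to make the three steps visible.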
Minibatches & Re-Weighting the KL-Divergence
Recall our loss function:
One common approach is to use mini-batches and take the average of the gradients over all the elements:
The term on the right hand side is calculated automatically by TensorFlow. We also need to re-weight the KL term for a proper training process. This can be done by dividing the KL term by M, where M is the total number of samples. TensorFlow averages the loss over all elements in the mini-batch by default for every loss function or model that you train, which is why the KL-Divergence also needs to be re-weighted. With this re-weighting, we get an unbiased estimate of the true ELBO objective.
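The re-weighted mini-batch objective can be sketched as below. B (the mini-batch size) and x_i (a single data point) are symbols introduced here for illustration, while M is the total number of samples as in the text.

```latex
\frac{1}{M}\,\mathrm{ELBO}(\phi)
 \approx \frac{1}{B} \sum_{i \in \text{batch}} \mathbb{E}_{Q(\theta \mid \phi)}\big[\log P(x_i \mid \theta)\big]
 - \frac{1}{M}\, D_{KL}\big(Q(\theta \mid \phi) \,\|\, P(\theta)\big)
```

The average over a randomly drawn mini-batch has the same expected value as the average over all M data points, so the right-hand side is an unbiased estimate of the per-example ELBO.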
That’s it, that’s the Bayes by Backprop method. Now it is time to implement this using TensorFlow Probability.
Weight Uncertainty Using TensorFlow Probability
DenseVariational Layer
What we have been discussing can be implemented with the help of the DenseVariational layer.
There are four important parameters in this layer:
- make_prior_fn: Python callable taking tf.size(kernel), tf.size(bias), dtype and returning another callable which takes an input and produces a tfd.Distribution instance.
- make_posterior_fn: Python callable taking tf.size(kernel), tf.size(bias), dtype and returning another callable which takes an input and produces a tfd.Distribution instance.
- kl_weight: Amount by which to scale the KL divergence loss between prior and posterior.
- kl_use_exact: Python bool indicating that the analytical KL divergence should be used rather than a Monte Carlo (MC) approximation.
These explanations are taken from the original documentation. Let’s explore them with examples.
Weight Uncertainty in Regression
Let’s implement the prior and posterior functions which we have described in the Bayes by Backprop algorithm.
The prior and posterior functions take kernel_size and bias_size as arguments. Their sum is the total number of parameters that we want to learn. Since this prior is not trainable, we simply return a callable object, which is a Sequential model, while using a Laplace prior. The prior is not trainable because we hard-coded the distribution’s location and scale.
The posterior function follows the same logic, but now we use VariableLayer to indicate that this will be a trainable distribution. With params_size, we let TFP determine the correct shape for it, and we select a Normal distribution as the posterior.
So far so good: our prior is a Laplace distribution and our posterior is a normal one. This is the posterior we have discussed before, i.e. the variational posterior.
Since the output is not a distribution, we can use mse as the loss function. In other words, this model cannot capture aleatoric uncertainty.
The kl_weight argument is the weighting term which we discussed in the part Minibatches & Re-Weighting the KL-Divergence. By re-weighting the KL term we get unbiased estimates. Recall the formula (ELBO Objective):
M corresponds to the total number of elements in the dataset, and the KL term is divided by it. So in the layer we set kl_weight = 1 / x_100.shape[0].
We also did not set kl_use_exact = True. Depending on the choice of distributions used for the posterior and the prior, it may be possible to compute the KL divergence analytically. If it is, and the analytical solution is registered in the TFP library, the kl_use_exact argument can be set to True; otherwise setting it would raise an error. Here we calculate it by MC approximation.
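To make the difference between an analytical KL and an MC approximation concrete, here is a small pure-Python sketch that does not use TFP. The posterior parameters (location 0.5, scale 0.8) and the helper names are hypothetical. For a normal posterior and a normal prior the closed form is known and can be compared with the sampled estimate; for the Laplace prior the sketch only uses sampling.

```python
import math
import random

random.seed(0)

def log_normal(t, loc, scale):
    return -math.log(scale) - 0.5 * math.log(2 * math.pi) - 0.5 * ((t - loc) / scale) ** 2

def log_laplace(t, loc, scale):
    return -math.log(2 * scale) - abs(t - loc) / scale

def kl_monte_carlo(log_q, log_p, q_loc, q_scale, n=50_000):
    """Estimate KL(Q || P) = E_Q[log Q - log P] with samples from a normal Q."""
    total = 0.0
    for _ in range(n):
        t = random.gauss(q_loc, q_scale)
        total += log_q(t) - log_p(t)
    return total / n

q_loc, q_scale = 0.5, 0.8     # hypothetical normal posterior
log_q = lambda t: log_normal(t, q_loc, q_scale)

# Normal vs. normal: a closed form exists, so the MC estimate can be checked.
kl_exact = math.log(1.0 / q_scale) + (q_scale ** 2 + q_loc ** 2) / 2.0 - 0.5
kl_mc = kl_monte_carlo(log_q, lambda t: log_normal(t, 0.0, 1.0), q_loc, q_scale)

# Normal vs. Laplace: estimated by sampling only in this sketch.
kl_laplace_mc = kl_monte_carlo(log_q, lambda t: log_laplace(t, 0.0, 1.0), q_loc, q_scale)
print(f"normal prior:  exact {kl_exact:.4f}, MC {kl_mc:.4f}")
print(f"Laplace prior: MC {kl_laplace_mc:.4f}")
```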
You may ask: in model.compile() we set our loss to be mse, and there is nothing about the KL term? That’s correct. The KL term is added internally via add_loss when using this layer. We can actually see the KL term by running:
model_100.losses # Returns list.
This will give:
[<tf.Tensor 'dense_variational/kldivergence_loss/batch_total_kl_divergence:0' shape=() dtype=float32>]
We see that we don’t need to worry about the loss function. The only thing needed was to set kl_weight, and it is done.
We have completed our model and layer specifications, let’s see the summary:
Layer (type) Output Shape Param #
=================================================================
dense_variational (DenseVar (None, 1) 4
iational)
=================================================================
Total params: 4
Trainable params: 4
Non-trainable params: 0
The output shape is as expected. However, in a normal dense layer we would only have 2 parameters, corresponding to the coefficient and the y-intercept in the problem y = ax + b.
The problem is still the same, but now we learn the parameters of the distributions over the weight and the bias. Each of them has a mean and a variance, so there are 4 learnable parameters in total.
We have a weight and a bias term, so there are two values in each of the mean() and variance() arrays. For example, the first entry of the posterior variance corresponds to the learnt variance of the weight distribution.
Now, each forward pass will give a different prediction, as we are sampling from the distributions over the weights. In that sense, the model can be thought of as an ensemble of models. Running the code above we will get:
Notice how shaky the red lines are. Each line is generated by one forward pass of the model. Shaky lines imply that the epistemic uncertainty is high in this case: because our dataset is finite, many different weight values explain the data well.
In the beginning, we stated that epistemic uncertainty could be reduced with more data. Let’s see if that’s the case.
The relationship is the same but this time we have 1000 data points instead of 100. Model configuration also stays the same. I will not repeat the fitting process as it is identical. When we sample from the new model, we will get:
One thing to notice is that these lines are now closer to each other. We can compare them using subplots:
Effect of Dataset Size
We conclude that adding more data reduces epistemic uncertainty, and this can be seen from the plots. When epistemic uncertainty is lower, the lines are closer to each other, because more data gives us more information about the weights.
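The same effect can be checked analytically in a simplified setting. The sketch below is not the TFP model from this article: it assumes a slope-only linear model with a Gaussian prior and known Gaussian noise, where the posterior standard deviation of the slope has a closed form. The function name and the default values are made up for illustration.

```python
import math

def slope_posterior_std(n_points, noise_std=1.0, prior_std=1.0):
    """Posterior std of the slope in y = w*x + noise with a N(0, prior_std^2) prior.

    Uses the closed form for a conjugate Gaussian model; inputs are evenly
    spaced in [-1, 1] (an illustrative choice).
    """
    xs = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    precision = 1.0 / prior_std ** 2 + sum(x * x for x in xs) / noise_std ** 2
    return 1.0 / math.sqrt(precision)

std_100 = slope_posterior_std(100)
std_1000 = slope_posterior_std(1000)
print(f"posterior std with  100 points: {std_100:.3f}")
print(f"posterior std with 1000 points: {std_1000:.3f}")
```

Every additional data point adds to the precision, so the posterior over the slope can only get narrower as the dataset grows.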
Weight Uncertainty in Non-Linear Regression
Suppose we have this data:
Let’s change our prior and variational posterior a little bit.
Now we have widened our prior, as its scale is multiplied by 2. In our VariableLayer we passed 2 * n, implying that we want to learn both the mean and the standard deviation. This was done by params_size before.
Another change is in the posterior, where we have scaled down the standard deviation. Values other than 0.003 may work better. If we don’t scale down the posterior standard deviation, there may be some convergence issues.
Layer (type) Output Shape Param #
=================================================================
dense_variational (DenseVar (None, 128) 512
iational)
dense_variational_1 (DenseV (None, 64) 16512
ariational)
dense_variational_2 (DenseV (None, 1) 130
ariational)
=================================================================
Total params: 17,154
Trainable params: 17,154
Non-trainable params: 0
_________________________________________________________________
This time the layers have non-linear activations, and the rest of the process is the same. We specify our prior and posterior functions and re-weight the KL term.
However, we need to train this model for longer, as its job is harder.
In order to plot the outputs, we take the feed-forward output from the model 5 times.
As we can see, we got reasonable lines. Results may be improved by tweaking the variational posterior.
Before jumping into the conclusion, I want to mention that the choice of variational posterior affects the total number of parameters of the model. If we change our posterior into a Multivariate Gaussian in this case, this is how the model summary looks:
Layer (type) Output Shape Param #
=================================================================
dense_variational_3 (DenseV (None, 128) 33152
ariational)
dense_variational_4 (DenseV (None, 64) 34093152
ariational)
dense_variational_5 (DenseV (None, 1) 2210
ariational)
=================================================================
Total params: 34,128,514
Trainable params: 34,128,514
Non-trainable params: 0
_________________________________________________________________
That’s because now we learn:
- Mean
- Standard Deviation
- Covariances
and this is computationally expensive in this model.
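The parameter counts in both summaries can be reproduced with simple arithmetic. The sketch below assumes a single input feature, and assumes that the Multivariate Gaussian is parameterized by a mean vector plus a lower-triangular scale matrix (as in tfpl.MultivariateNormalTriL). The helper names are made up for illustration.

```python
def n_weights(inputs, units):
    """Number of kernel plus bias entries in a dense layer."""
    return inputs * units + units

def independent_normal_params(n):
    """One mean and one standard deviation per weight."""
    return 2 * n

def full_covariance_normal_params(n):
    """n means plus a lower-triangular scale matrix with n*(n+1)/2 entries."""
    return n + n * (n + 1) // 2

layers = [(1, 128), (128, 64), (64, 1)]   # (inputs, units) of the three layers above
sizes = [n_weights(i, u) for i, u in layers]

independent = [independent_normal_params(n) for n in sizes]
full = [full_covariance_normal_params(n) for n in sizes]
print(independent, sum(independent))
print(full, sum(full))
```

The middle layer has 8,256 weights, so its triangular scale matrix alone needs about 34 million entries, which is where almost all of the extra parameters come from.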
Conclusion
In this article we have seen:
- Problem with Normal Neural Networks
- The key idea of Bayesian Neural Networks
- Variational Bayes methods
- Minibatches & Re-Weighting KL Divergence
- DenseVariational Layer
You can find the whole notebook and code here.
Next Steps
In the upcoming articles, we will focus on writing models that include convolutional layers and that capture both epistemic and aleatoric uncertainty.
References
[1]: Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra, Weight Uncertainty in Neural Networks, 2015