The world’s leading publication for data science, AI, and ML professionals.

Bayesian Deep Learning & Estimating Uncertainty

Weather Data; Aleatoric & Epistemic Uncertainty

Humidity as a function of Temperature. More Details in Text. (Source: Author)

The performance of Deep Neural Networks (DNNs) relies on their ability to progressively build and extract features from large amounts of data. Though these deep models are usually adaptive, their performance depends on the data distribution. Robustness is important in applications such as computer vision tasks, e.g. autonomous driving, because outdoor environments vary naturally. At this point, we would like to know how likely the predictions are to be correct, and this can be done by incorporating uncertainty estimation within the model.

In this post, we will use publicly available real weather data and build simple linear and non-linear models that give us not only the best fit but also uncertainty estimates. What you can expect to learn/review from this post:

  1. Using the TensorFlow Probability library to get started with Bayesian Deep Learning.
  2. Two different types of uncertainty estimation: Aleatoric and Epistemic uncertainty.
  3. How certain are the network weights (Epistemic uncertainty) and what is a Variational Posterior?

The complete notebook is available at the link given in the reference. Without any delay let’s begin.


1. Loading and Pre-Processing the Data:

I am using weather data (publicly available under a CC0 license), measured over a period of 10 years, from 2006 to 2016. For simplicity, I will be using only the Temperature and Humidity values. Our goal is to find how humidity varies as a function of temperature and to build a model that not only predicts this behaviour but also gives us model (& data) uncertainty information. After selecting the required columns, our dataframe looks as below:

First few entries of our data

As we can see, measurements are taken every hour over a period of 10 years, resulting in over 95,000 data points. For further simplification, we re-sample the dataframe to decrease the frequency of the input data: instead of a 1-hour interval, I chose a 3-day interval, resulting in 1340 data points.
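The resampling step can be sketched with pandas as follows. This is a minimal illustration on synthetic data, not the original notebook code: the column names `Temperature` and `Humidity`, the generated values, and the choice of averaging within each 3-day window are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly weather table (all values are made up).
rng = np.random.default_rng(0)
index = pd.date_range("2006-01-01 00:00", "2016-12-31 23:00",
                      freq=pd.Timedelta(hours=1))
temperature = 10 + 10 * rng.standard_normal(len(index))
humidity = np.clip(0.75 - 0.01 * temperature
                   + 0.1 * rng.standard_normal(len(index)), 0.0, 1.0)
hourly = pd.DataFrame({"Temperature": temperature, "Humidity": humidity},
                      index=index)

# Average every 3-day window to reduce the sampling frequency.
resampled = hourly.resample("3D").mean()

print(len(hourly), len(resampled))
```

With hourly timestamps covering 2006 through the end of 2016, the 3-day bins reduce the table to 1340 rows, which matches the count quoted above.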

resampled data, with a frequency of 3 days, instead of an hour.

We can now plot the dependence of humidity as a function of temperature in a scatter plot as below –

Fig: 1. Humidity as a function of temperature.

We see that relative humidity varies between ~0.4 and 1, temperature varies between ~-10 and ~30 degrees Celsius, and there seems to be an inverse linear relationship. We split the data into train and test sets and standardize the temperature values. With this, we are ready for the next steps: building models and quantifying uncertainty.
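A minimal NumPy sketch of the split and standardization, assuming an 80/20 split and synthetic values in place of the real columns (both are assumptions, as are the variable names). The scaling statistics are computed on the training set only and then reused for the test set.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic temperature (Celsius) and relative humidity standing in for the real data.
temperature = rng.uniform(-10.0, 30.0, size=1340)
humidity = np.clip(0.9 - 0.01 * temperature
                   + 0.05 * rng.standard_normal(1340), 0.0, 1.0)

# Shuffle the indices and hold out 20% for testing (the ratio is an assumption).
order = rng.permutation(len(temperature))
n_test = len(temperature) // 5
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, X_test = temperature[train_idx], temperature[test_idx]
y_train, y_test = humidity[train_idx], humidity[test_idx]

# Fit the scaling on the training set only, then apply it to both sets.
mu, sigma = X_train.mean(), X_train.std()
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
```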


2. Linear Regression:

Before dealing with any uncertainties, let’s just run a simple regression. Simple deterministic regression gives us a point estimate, i.e. for an input value, we get a single predicted value. For this task, we assume that the X-data has a linear relationship with the Y-data and that the noise term is normally distributed. Our model learns by minimizing the mean squared error (MSE loss), which under this statistical modelling assumption is equivalent to maximizing the likelihood of the data. We can easily build such a regression network in TensorFlow and check the prediction as below:

Fig. 2: Linear regression model and prediction using TensorFlow & Python.
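To make the point-estimate idea concrete without a deep learning framework, here is a small NumPy sketch that fits the same kind of linear model by least squares, which minimizes the MSE. It is not the TensorFlow model from the notebook; the synthetic data and the true slope and intercept are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(500)                           # standardized temperature (synthetic)
y = 0.7 - 0.1 * x + 0.05 * rng.standard_normal(500)    # synthetic humidity

# Design matrix with a bias column; least squares minimizes the MSE.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = slope * x + intercept     # exactly one predicted value per input
mse = np.mean((y - y_pred) ** 2)
```

The fit returns one number per input and says nothing about how spread out the humidity values are around the line, which is the gap the probabilistic models below try to fill.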

3. Aleatoric & Epistemic Uncertainty:

Before we delve into probabilistic regression, let’s discuss in brief the 2 common types of uncertainties: Aleatoric & Epistemic Uncertainty. Aleatoric uncertainty captures noise inherent in the observations, resulting in uncertainty which cannot be reduced even if we have more data. Epistemic uncertainty on the other hand accounts for uncertainty in the model parameters and can be reduced if more data is obtained. Epistemic uncertainty refers to the ignorance of the decision-maker (in this case deep neural network), instead of any underlying random/stochastic process.
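The difference between the two can be illustrated numerically. The sketch below is a rough frequentist analogy, not a Bayesian treatment: it uses the residual standard deviation as a proxy for aleatoric uncertainty and the standard error of the fitted slope as a proxy for epistemic uncertainty. The function name, the synthetic data, and the noise level of 0.1 are assumptions.

```python
import numpy as np

def fit_and_summarize(n, rng, noise_std=0.1):
    """Fit y = a*x + b by least squares; return (residual std, std error of slope)."""
    x = rng.standard_normal(n)
    y = 0.7 - 0.1 * x + noise_std * rng.standard_normal(n)
    A = np.column_stack([x, np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    resid_std = np.sqrt(residuals @ residuals / (n - 2))   # estimate of the data noise
    cov = resid_std ** 2 * np.linalg.inv(A.T @ A)          # covariance of (a, b)
    return resid_std, np.sqrt(cov[0, 0])

rng = np.random.default_rng(3)
noise_small, slope_se_small = fit_and_summarize(50, rng)
noise_large, slope_se_large = fit_and_summarize(5000, rng)
```

Going from 50 to 5000 points shrinks the uncertainty about the slope considerably, while the estimated noise level stays close to 0.1 in both cases: more data reduces what we do not know about the parameters, not the noise in the observations.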

3.1. Aleatoric Uncertainty: Trainable Mean:

Modelling aleatoric uncertainty using the TensorFlow Probability library for a regression task is rather easy. Here our idea is to capture the inherent noise in the data. To get started, we train a model that returns a normal distribution (very different from deterministic linear regression and its point estimate) with a trainable mean but a standard deviation fixed at 1 (see Code Block 2). The idea is that after training, our model will be able to replicate the original data distribution. To model a normal distribution using the output from the Dense layer, I’ve used the DistributionLambda layer from TensorFlow Probability, which returns a distribution object. Because the model outputs a distribution rather than a number, we use the negative log-likelihood as the loss function instead of the MSE loss. After training, one can sample from the learned distribution and plot the mean & standard deviation as below:

Fig. 3: Generated samples from the learned distribution (left) and corresponding mean and standard deviation are plotted.

Since only the mean was trainable and we fixed the standard deviation (stddev) at 1, we see that the ±2σ lines are much too far from the mean.
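A short NumPy sketch of the negative log-likelihood helps explain this behaviour. With the stddev fixed at 1, the loss reduces to half the MSE plus a constant, so the model can only learn the mean. The helper `gaussian_nll` and the synthetic data are assumptions for illustration, not code from the notebook.

```python
import numpy as np

def gaussian_nll(y, mean, stddev):
    """Average negative log-likelihood of y under Normal(mean, stddev)."""
    var = stddev ** 2
    return np.mean(0.5 * np.log(2 * np.pi * var) + 0.5 * (y - mean) ** 2 / var)

rng = np.random.default_rng(4)
y = 0.7 + 0.1 * rng.standard_normal(1000)    # synthetic humidity with noise std 0.1
mean = np.full_like(y, 0.7)

nll_fixed = gaussian_nll(y, mean, 1.0)        # stddev fixed at 1
half_mse_plus_const = 0.5 * np.mean((y - mean) ** 2) + 0.5 * np.log(2 * np.pi)

nll_learned = gaussian_nll(y, mean, y.std())  # stddev matched to the data
```

Here `nll_fixed` equals `half_mse_plus_const`, and `nll_learned` is lower than `nll_fixed`, which is why letting the network learn the stddev gives a better description of the data.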

This is simply solved by adding the stddev as a learnable parameter in the previous model. Now instead of a Dense layer with a single unit, we will use 2 units (for mean and stddev).


3.2. Aleatoric Uncertainty: Trainable Mean & Variance:

To add a trainable variance to the previous model, we could reuse the previous code block with the DistributionLambda layer, but the TF Probability library also has an [IndependentNormal](https://www.TensorFlow.org/probability/api_docs/python/tfp/layers/IndependentNormal) layer that turns the output of the previous Dense layer directly into a normal distribution. Let’s check the model:

The event_shape argument of the IndependentNormal layer determines the number of parameters required, which is 2 for event_shape=1; that’s why the Dense layer has 2 units (one for the mean and one for the standard deviation). To specify the number of units in the Dense layer, we have used the static method params_size of the IndependentNormal layer.
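The parameter counting can be mimicked in plain NumPy. This is only a sketch of the idea, not the TFP implementation: the helper names are hypothetical, and the softplus transform used to keep the stddev positive is an assumption about how such a layer can be parametrized.

```python
import numpy as np

def params_size(event_shape):
    """Inputs needed for a diagonal normal: one mean and one scale per event dimension."""
    return 2 * int(np.prod(event_shape))

def split_params(dense_output):
    """Split Dense outputs into (mean, stddev); softplus keeps the stddev positive."""
    loc, raw_scale = np.split(dense_output, 2, axis=-1)
    scale = np.log1p(np.exp(raw_scale))   # softplus
    return loc, scale

# A batch of 2 inputs, each with params_size(1) == 2 Dense outputs.
dense_output = np.array([[0.6, -2.0], [0.8, -1.0]])
loc, scale = split_params(dense_output)
```

For event_shape=1 the Dense layer therefore needs 2 units, and each row of its output is read as one mean and one (transformed) standard deviation.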

Plotting the samples from the learned distribution and corresponding mean & stddev looks as below –

Fig. 4: Same as Fig. 3 but now our model learned both the mean and standard deviation of the original data distribution.

We are not limited to linear models: we can also add a Dense layer with a non-linear activation function to include non-linearity.

Fig. 5: Same as Fig. 4 but instead of a linear model, we assume a non-linear model to represent the data distribution.

4. Epistemic Uncertainty:

So far we tried to model the general stochasticity of the data (that is the underlying noise), by adding some distribution at the last layer of the model i.e. the final prediction is a random variable from a distribution.

Epistemic uncertainty captures the uncertainties in the model parameters of the DNN. To capture epistemic uncertainty in a neural network (NN) we put a prior distribution over its weights, for example, a Gaussian prior distribution: W∼N(0, I). The idea of representing the weights of a DNN as probability distributions rather than point estimates was proposed in the ‘Weight Uncertainty in Neural Networks’ paper. We start with some prior distribution over the weights and update it as the network sees more data to obtain the posterior distribution. In special cases, such as a linear model with a Gaussian prior and Gaussian noise, it is possible to get the exact form of the posterior. In general, it’s not possible to determine the analytical form of the posterior, and then variational methods become useful.
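As a concrete instance of a case where the posterior is available exactly, here is a NumPy sketch of Bayesian linear regression with the prior W∼N(0, I) and Gaussian noise of known standard deviation. The data, the true weights, and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, noise_std = 200, 0.1
x = rng.standard_normal(n)
X = np.column_stack([x, np.ones(n)])     # columns: input for the slope, constant for the bias
w_true = np.array([-0.1, 0.7])
y = X @ w_true + noise_std * rng.standard_normal(n)

# Prior W ~ N(0, I) with a Gaussian likelihood gives a Gaussian posterior in closed form.
prior_precision = np.eye(2)
post_cov = np.linalg.inv(prior_precision + X.T @ X / noise_std ** 2)
post_mean = post_cov @ (X.T @ y) / noise_std ** 2
```

The posterior covariance `post_cov` is much tighter than the identity prior covariance, reflecting how data reduces epistemic uncertainty. For a network with non-linear layers no such closed form exists, which is where the variational approach comes in.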

Bayesian inference for neural networks calculates the posterior distribution of the weights given the training data, P(w|D). Variational Bayes methods approximate the posterior distribution with a second function called a variational posterior. This function has a known functional form, and hence avoids the need to determine the posterior P(w|D) exactly. To avoid risks of choosing a bad variational posterior, this approximate function i.e. variational posterior is parametrized by parameters θ, that are tuned so that the function approximates the original posterior (P(w|D)) as well as possible. Variational learning finds the parameters θ of a distribution on the weights q(w|θ) that minimises the Kullback-Leibler (KL) divergence with the true Bayesian posterior on the weights.

Minimizing the KL divergence leads us to the Evidence Lower Bound (ELBO). For more on the ELBO, you can check my notebook listed in the references, as well as a detailed post on the Expectation-Maximization (EM) algorithm.
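The step from the KL divergence to the ELBO can be written out briefly. This is the standard derivation, using the same symbols as above (q(w|θ) for the variational posterior, P(w) for the prior, P(D|w) for the likelihood) and Bayes' rule P(w|D) = P(D|w) P(w) / P(D):

```latex
\begin{aligned}
\mathrm{KL}\big[q(w|\theta)\,\|\,P(w|D)\big]
  &= \mathbb{E}_{q(w|\theta)}\big[\log q(w|\theta) - \log P(w|D)\big] \\
  &= \mathbb{E}_{q(w|\theta)}\big[\log q(w|\theta) - \log P(w) - \log P(D|w)\big] + \log P(D) \\
  &= \mathrm{KL}\big[q(w|\theta)\,\|\,P(w)\big]
     - \mathbb{E}_{q(w|\theta)}\big[\log P(D|w)\big] + \log P(D)
\end{aligned}
```

Since log P(D) does not depend on θ, minimizing the KL divergence is the same as maximizing ELBO(θ) = E_q[log P(D|w)] − KL[q(w|θ) ‖ P(w)]. Because the KL divergence is never negative, the ELBO is a lower bound on the log evidence log P(D), which gives it its name.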

The way we implement a variational posterior in TensorFlow Probability is via the [DenseVariational](https://www.tensorflow.org/probability/api_docs/python/tfp/layers/DenseVariational) layer, and the documentation definition is as follows:

This layer uses variational inference to fit a "surrogate" posterior to the distribution over both the kernel matrix and the bias terms which are otherwise used in a manner similar to [tf.keras.layers.Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense).

Before we use this layer, we need to implement the fixed prior and trainable posterior distributions, which will be used within the DenseVariational layer.

Let’s see what happened in Code Block 5:

Prior: We define the prior over the weights as a multivariate normal distribution, N(0, I), with a diagonal covariance matrix. It is wrapped in a Sequential model and has no trainable parameters.

Posterior: Here the parameters are trainable (VariableLayer) and the weights follow a multivariate normal distribution with a full covariance matrix (MultivariateNormalTriL).

Model: Finally, we define a model using the DenseVariational layer and pass it the prior and posterior definitions. The KL divergence term is weighted relative to the number of training samples (typically by 1/N); you can check the notebook for the reason.

After we train this model with MSE loss, we can plot the regression lines as below:

Fig. 6: Epistemic uncertainty: here, each line represents a different random draw of the model parameters from the posterior distribution.

Every time we call this model, we will get a slightly different result, as you can see from those 5 lines. The different slopes show that our model is uncertain about the linear dependency between Temperature and Humidity. The DenseVariational layer essentially defines an ensemble of models, and in this case, we can get a mean prediction from these 5 different calls.
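What happens on each call can be imitated in NumPy: weights are drawn from a multivariate normal posterior described by a mean vector and a lower-triangular factor, and every draw gives a different regression line. The posterior values below are illustrative numbers, not learned ones, and `tril_params_size` is a hypothetical helper that counts the parameters of such a full-covariance posterior.

```python
import numpy as np

def tril_params_size(n):
    """Parameters of an n-dim normal with full covariance: n means + n(n+1)/2 factor entries."""
    return n + n * (n + 1) // 2

rng = np.random.default_rng(6)
n_weights = 2                                    # one kernel weight + one bias for a 1 -> 1 layer
post_mean = np.array([-0.1, 0.7])                # illustrative values, not learned from data
chol = np.array([[0.02, 0.0], [0.005, 0.01]])    # lower-triangular L, covariance = L @ L.T

x = np.linspace(-2.0, 2.0, 5)
lines = []
for _ in range(5):
    # Each "call" draws fresh weights w = mean + L z with z ~ N(0, I).
    slope, intercept = post_mean + chol @ rng.standard_normal(n_weights)
    lines.append(slope * x + intercept)
lines = np.array(lines)

ensemble_mean = lines.mean(axis=0)     # average prediction over the 5 draws
ensemble_spread = lines.std(axis=0)    # disagreement between draws (epistemic)
```

For a layer with one input and one output there are 2 weights, so a full-covariance posterior needs `tril_params_size(2) == 5` trainable numbers.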


5. Aleatoric + Epistemic Uncertainty:

Finally, we will create a model that combines the aleatoric and epistemic uncertainty. It is pretty easy to do: we just need to add an IndependentNormal layer to the previous model (Code Block 5) to take into account the stochastic nature of the label (Humidity) distribution. Let’s see the code block below:

Once trained, we can proceed as before and sample from the model (for aleatoric uncertainty) over different calls (for epistemic uncertainty) as below:

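for _ in range(2):
    # Each call samples new weights, so the returned distribution differs (epistemic)
    y_model = model_non_lin_al_ep(X_test_scaled)
    # Mean and ±2 stddev band of the predicted distribution (aleatoric)
    y_hat = y_model.mean()
    y_hat_m2sd = y_hat - 2 * y_model.stddev()
    y_hat_p2sd = y_hat + 2 * y_model.stddev()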

Plotting the mean and stddevs for these 2 calls looks as below:

Fig. 7: Same as plots before but here we have both Aleatoric & Epistemic uncertainties: The two different lines represent 2 different calls to the model (representing epistemic uncertainty).
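If a single number per input is wanted, the two contributions can be combined with the law of total variance: the average of the predicted variances across calls gives the aleatoric part, and the variance of the predicted means across calls gives the epistemic part. This step is not part of the original notebook; the sketch below uses a hypothetical `one_call` function in place of a trained model, with made-up parameter values.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-2.0, 2.0, 50)

def one_call(x, rng):
    """Hypothetical stand-in for one model call: returns predicted mean and stddev at x."""
    slope = -0.1 + 0.02 * rng.standard_normal()     # weights differ between calls
    intercept = 0.7 + 0.01 * rng.standard_normal()
    mean = slope * x + intercept
    stddev = np.full_like(x, 0.1)                   # predicted data noise (held constant here)
    return mean, stddev

means, stddevs = zip(*(one_call(x, rng) for _ in range(100)))
means, stddevs = np.array(means), np.array(stddevs)

aleatoric_var = np.mean(stddevs ** 2, axis=0)   # average predicted noise variance
epistemic_var = np.var(means, axis=0)           # spread of the means across calls
total_std = np.sqrt(aleatoric_var + epistemic_var)
```

In this toy setup the epistemic part grows away from x = 0, where the uncertain slope matters most, while the aleatoric part stays flat.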

We went through the basic building blocks of a Bayesian Neural Network (BNN), especially in relation to epistemic uncertainty. If we denote our dataset as X = {x1, …, xN }, Y = {y1, …, yN }, then Bayesian inference is used to compute the posterior over the weights p(W|X, Y). Here we went through a simpler example where we assumed a functional form for the posterior distribution and just needed to find its best parameters. While the epistemic uncertainty can be reduced given more data, aleatoric uncertainty is irreducible. Using real-world weather data, we developed a simple model suitable for regression tasks that goes beyond a point estimate and predicts uncertainties. I hope the concepts used here for building models that give us information about uncertainties will be useful, and that you can modify them according to your tasks.


References:

[1] ‘Aleatoric and epistemic uncertainty in machine learning’: E. Hüllermeier, W. Waegeman.

[2] ‘What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?’: A. Kendall, Y. Gal.

[3] TFP Layers: Probabilistic Regression.

[4] ‘Weight Uncertainty in Neural Networks’: C. Blundell et al.

[5] Notebook for Codes, Concepts & Maths: GitHub Link

