
Modeling uncertainty in neural networks with TensorFlow Probability

Part 4: Going fully probabilistic¹


This series is a brief introduction to modeling uncertainty with the TensorFlow Probability library. I wrote it as supplementary material to my PyData Global 2021 talk on uncertainty estimation in neural networks.

Articles in the series:

  • Part 1: An Introduction
  • Part 2: Aleatoric uncertainty
  • Part 3: Epistemic uncertainty
  • Part 4: Going fully probabilistic
Image by Jessie Zhang at Pexels

Introduction

We’ve come a long way! Today, it’s time for our final tutorial. We’re going to use all the knowledge we’ve gained and apply it to a new, more challenging dataset.

Let’s get our hands dirty!

Data

This week, we’re going to build a non-linear, heteroscedastic dataset. This problem will be more challenging to model than the one we’ve seen previously.

We’re going to use the following process to generate the data:

Our data generating process. Image by yours truly.

Let’s import libraries, build the dataset and plot it.
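Here’s a minimal sketch of what this could look like in code. The cubic mean, the noise scale, the sample size, and names like x_train and y_train are illustrative assumptions on my part, not the exact process from the figure above; the key property is that the noise standard deviation grows with x.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfpl = tfp.layers

np.random.seed(42)  # for reproducibility

# Illustrative generating process: a non-linear (cubic) mean
# with noise whose scale increases with x (heteroscedasticity).
n_samples = 1000
x_train = np.linspace(-1, 1, n_samples).astype(np.float32)
noise_scale = 0.1 + 0.4 * (x_train + 1) / 2
y_train = x_train ** 3 + np.random.normal(0.0, noise_scale).astype(np.float32)

plt.scatter(x_train, y_train, alpha=0.3, s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# Reshape to (n_samples, 1) so Keras treats each row as one example.
x_train = x_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
```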

Training data. Note that the higher the value of x the higher the variance. This is an example of heteroscedasticity. Data is also non-linear. Image by yours truly.

As you can see, the data is indeed non-linear and heteroscedastic. We’ll need to introduce some non-linearity to our model to allow a good fit to this data – a great opportunity to expand our probabilistic toolkit!

Modeling aleatoric and epistemic uncertainty

Our dataset is in place. To build a model capable of capturing both aleatoric and epistemic uncertainty, we’ll need to combine variational layers with a distribution layer. To model epistemic uncertainty, we’ll use two tfpl.DenseVariational layers and a non-linear activation function.

First, let’s define prior and posterior functions. As you might remember from the previous part of the series, we need to pass them to tfpl.DenseVariational² to make it work.

Let’s start with the prior:
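A minimal version of the prior-generating function, following the steps described just below, could look like this (the function name prior and the choice of a standard normal, i.e. zero mean and unit scale, are assumptions on my part):

```python
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    # A fixed (non-trainable) standard normal prior over all weights and biases,
    # wrapped into a Keras-compatible layer.
    return tf.keras.Sequential([
        tfpl.DistributionLambda(
            lambda t: tfd.MultivariateNormalDiag(
                loc=tf.zeros(n), scale_diag=tf.ones(n)
            )
        )
    ])
```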

A prior-generating function must take kernel size, bias size, and dtype as parameters (the same is true for the posterior). We define it the same way we did in Part 3:

  • We use a non-trainable tfd.MultivariateNormalDiag distribution object as our prior.
  • We wrap it into tfpl.DistributionLambda that turns tfd.MultivariateNormalDiag distribution object into a Keras-compatible layer.
  • We pass this layer to tf.keras.Sequential for implementation consistency with the posterior-generating function.

Similarly, we follow the process from Part 3 to define the posterior-generating function:

  • We use tf.keras.Sequential.
  • We want our posterior to be trainable, so we use tfpl.VariableLayer that generates a trainable variable to parametrize tfpl.MultivariateNormalTriL.
  • We take advantage of the convenient .params_size() method introduced in Part 2 to get the exact number of parameters necessary to parametrize tfpl.MultivariateNormalTriL, and we use this layer as our trainable posterior.
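Putting these steps together, the posterior-generating function might look roughly like this (again, the name posterior is just a convention):

```python
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        # A trainable variable holding all parameters of the posterior...
        tfpl.VariableLayer(
            tfpl.MultivariateNormalTriL.params_size(n), dtype=dtype
        ),
        # ...which parametrizes a full-covariance normal over weights and biases.
        tfpl.MultivariateNormalTriL(n)
    ])
```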

We have everything necessary to parametrize our variational layers.

All we need to do now is to stack two tfpl.DenseVariational layers, add a non-linear activation, and put a tfpl.MultivariateNormalTriL layer on top of them.

As we’ve seen in Part 2, to train a model with a distribution layer as the last layer, we need a special log-likelihood cost function.

Let’s combine all of this and define our model generating function:
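Here’s roughly what this could look like. The function name build_model, the n_train argument, and the event size of the final layer (1, since our target is one-dimensional) are assumptions; the remaining settings follow the description below.

```python
def build_model(n_train):
    return tf.keras.Sequential([
        tfpl.DenseVariational(
            units=8,
            input_shape=(1,),
            make_posterior_fn=posterior,
            make_prior_fn=prior,
            kl_weight=1 / n_train,        # normalize the KL divergence term
            kl_use_exact=False,
            activation='sigmoid'          # the non-linearity
        ),
        tfpl.DenseVariational(
            units=tfpl.MultivariateNormalTriL.params_size(1),
            make_posterior_fn=posterior,
            make_prior_fn=prior,
            kl_weight=1 / n_train,
            kl_use_exact=False
        ),
        tfpl.MultivariateNormalTriL(1)    # the output distribution
    ])
```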

Let’s unpack it. We use a regular tf.keras.Sequential wrapper. We pass two tfpl.DenseVariational layers into it and put one tfpl.MultivariateNormalTriL layer on top of them.

The first variational layer has 8 units (or neurons). First, we define the input shape as (1,), because our X is one-dimensional. Then, we pass our prior and posterior generating functions to make_prior_fn and make_posterior_fn respectively. Next, we normalize the Kullback-Leibler divergence term³, weighting it by the reciprocal of the number of training examples. We set kl_use_exact to False (though we could set it to True in our case). Finally, we specify the non-linear activation function⁴.

The second layer is parametrized similarly, with a couple of differences. Instead of hard-coding the number of units, we use the .params_size() method to make sure we get the correct number of units to parametrize our last distribution layer. Note that we also do not use a non-linearity here.

Our negative log-likelihood loss function is identical to the one we defined in Part 2.
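For reference, it simply asks the output distribution for the log-probability of the observed targets and flips the sign:

```python
def nll(y_true, y_pred):
    # y_pred is a distribution object, not a tensor of point predictions.
    return -y_pred.log_prob(y_true)
```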

We’re now ready to train the model!

Let’s do it and plot the loss:
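A sketch of this step (the optimizer, learning rate, and number of epochs here are assumptions, not necessarily the settings behind the plot below):

```python
model = build_model(n_train=len(x_train))
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=nll)

history = model.fit(x_train, y_train, epochs=500, verbose=0)

plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Negative log-likelihood')
plt.show()
```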

Loss of our aleatoric & epistemic uncertainty estimating model vs. number of epochs. Image by yours truly.

The loss looks good. We could try to optimize the model further so it converges faster, but that’s not our goal in this article.

Results

Our model is trained. That’s a very exciting moment! We’re ready to generate predictions. Our predictions will contain estimates of aleatoric and epistemic uncertainty. We expect to see a set of best fit lines and a set of confidence intervals. To make the visualization cleaner, we’ll modify the plotting code slightly to represent confidence intervals as lines rather than shaded areas. Similarly to what we’ve done in Part 3, we will compute 15 independent predictions:
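One way this could look in code (the test grid, the colors, and the two-standard-deviation interval used as an approximate 95% CI are assumptions):

```python
x_test = np.linspace(-1, 1, 200).reshape(-1, 1).astype(np.float32)

plt.scatter(x_train.flatten(), y_train.flatten(), alpha=0.2, s=10)

for _ in range(15):
    # Each forward pass samples a new set of weights from the posterior,
    # so every iteration yields a different predictive distribution.
    y_dist = model(x_test)
    y_mean = y_dist.mean().numpy().flatten()
    y_std = y_dist.stddev().numpy().flatten()

    plt.plot(x_test.flatten(), y_mean, color='red', alpha=0.6)
    # Confidence intervals drawn as lines rather than shaded areas.
    plt.plot(x_test.flatten(), y_mean + 2 * y_std, color='green', alpha=0.4)
    plt.plot(x_test.flatten(), y_mean - 2 * y_std, color='green', alpha=0.4)

plt.xlabel('x')
plt.ylabel('y')
plt.show()
```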

Training data and 15 best fit lines with their respective confidence intervals. The plot represents estimates of aleatoric (confidence intervals) and epistemic (15 sets of best fit lines and their CIs) uncertainties. Image by yours truly.

As you can see, our model did a pretty good job fitting the data! 🚀 It seems that non-linearity and heteroscedasticity are both modeled well. Confidence intervals seem reliable, covering about 95% of the points.

Summary

In this episode of the Modeling uncertainty in neural networks with TensorFlow Probability series, we’ve seen how to model aleatoric and epistemic uncertainty using a single model.

We generated a non-linear and heteroscedastic dataset and used a deeper architecture with a non-linear activation to model it. We revisited how to define the prior and posterior distributions needed to parametrize variational layers. Finally, we saw how to use a variational layer to parametrize a final probabilistic distribution layer.

This is the last article in the Modeling uncertainty in neural networks with TensorFlow Probability series. Congratulations on completing the series! 🎉

Thank you!

If you have any questions regarding the material covered in this series or suggestions for topics that you would like to read about in the future, feel free to let me know:

  • In the comments
  • Message me on LinkedIn
  • Reach out through any other channel listed on my webpage

Thank you!


Footnotes

¹ Theoretically, we could go even "more probabilistic" by using a variational process as the last layer of the model. That would be beyond the scope of this series though.

² You might also remember that we pass these functions as function objects, without calling them.

³ Please refer to Part 3 for an explanation.

⁴ We use the sigmoid function in the example. Would the model fit look different if we used ReLU? If so, why? Feel free to let me know in the comments.




