This series is a brief introduction to modeling uncertainty with the TensorFlow Probability library. I wrote it as supplementary material to my PyData Global 2021 talk on uncertainty estimation in neural networks.
Articles in the series:
- Part 1: An Introduction
- Part 2: Aleatoric uncertainty
- Part 3: Epistemic uncertainty
- Part 4: Going fully probabilistic

Introduction
We've come a long way! Today, it's time for our final tutorial. We're going to take all the knowledge we've gained and apply it to a new, more challenging dataset.
Let’s get our hands dirty!
Data
This week, we're going to build a non-linear, heteroscedastic dataset. This problem will be more challenging to model than the one we've seen previously.
We’re going to use the following process to generate the data:

Let’s import libraries, build the dataset and plot it.

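If you want to follow along, here's a minimal sketch of such a dataset. The functional form, noise scaling, sample count and seed below are illustrative assumptions rather than the exact generating process used in the article:

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfpl = tfp.layers

# A non-linear trend with input-dependent (heteroscedastic) noise.
np.random.seed(42)
n_samples = 1000
x_train = np.linspace(-1.0, 1.0, n_samples).astype(np.float32)[:, np.newaxis]
noise_scale = 0.05 + 0.2 * np.abs(x_train)                # noise grows with |x|
noise = np.random.normal(0.0, noise_scale).astype(np.float32)
y_train = (x_train ** 3 + 0.1 * x_train + noise).astype(np.float32)

plt.scatter(x_train, y_train, alpha=0.3, s=8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Non-linear, heteroscedastic data")
plt.show()
```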
As you can see, the data is indeed non-linear and heteroscedastic. We’ll need to introduce some non-linearity to our model to allow a good fit to this data – a great opportunity to expand our probabilistic toolkit!
Modeling aleatoric and epistemic uncertainty
Our dataset is in place. To build a model capable of capturing both aleatoric and epistemic uncertainty, we'll need to combine variational layers and a distribution layer. To model epistemic uncertainty, we'll use two tfpl.DenseVariational layers and a non-linear activation function.
First, let's define the prior and posterior functions. As you might remember from the previous part of the series, we need to pass them to tfpl.DenseVariational² to make it work.
Let’s start with the prior:
A prior-generating function must take kernel size, bias size and dtype as parameters (the same is true for the posterior). We define the prior-generating function the same way we did in Part 3:
- We use a non-trainable tfd.MultivariateNormalDiag distribution object as our prior.
- We wrap it in tfpl.DistributionLambda, which turns the tfd.MultivariateNormalDiag distribution object into a Keras-compatible layer.
- We pass this layer to tf.keras.Sequential for implementation consistency with the posterior-generating function.
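Putting that recipe together, a prior-generating function could look like the sketch below. The name prior and the standard-normal location and scale are my assumptions; the signature (kernel size, bias size, dtype) is the one tfpl.DenseVariational expects:

```python
def prior(kernel_size, bias_size, dtype=None):
    # dtype is part of the required signature, even though we don't use it here
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfpl.DistributionLambda(
            # Non-trainable standard multivariate normal over all weights and biases
            lambda t: tfd.MultivariateNormalDiag(loc=tf.zeros(n), scale_diag=tf.ones(n))
        )
    ])
```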
Similarly, we follow the process from Part 3 to define the posterior-generating function:
- We use tf.keras.Sequential.
- We want our posterior to be trainable, so we use tfpl.VariableLayer, which generates a trainable variable to parametrize tfpl.MultivariateNormalTriL.
- We take advantage of a cool convenience method, .params_size(), introduced in Part 2 to get the precise number of parameters necessary to parametrize tfpl.MultivariateNormalTriL, and we use this layer as our trainable posterior.
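A matching posterior-generating function could look like this (again, the name posterior is just a placeholder):

```python
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        # A trainable variable holding exactly as many parameters as
        # MultivariateNormalTriL needs for an event of size n
        tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n), dtype=dtype),
        # A full-covariance multivariate normal parametrized by that variable
        tfpl.MultivariateNormalTriL(n),
    ])
```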
We have everything necessary to parametrize our variational layers.
All we need to do now is to stack two tfpl.DenseVariational layers, add a non-linear activation and put a tfpl.MultivariateNormal layer on top of them.
As we've seen in Part 2, to train a model with a distribution layer as the last layer, we need a special negative log-likelihood cost function.
Let’s combine all of this and define our model generating function:
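A sketch of such a model-generating function is below. I use tfpl.MultivariateNormalTriL as a concrete stand-in for the final tfpl.MultivariateNormal layer, and the helper name get_model is mine:

```python
def get_model(kl_weight):
    return tf.keras.Sequential([
        # First variational layer: 8 units, non-linear (sigmoid) activation
        tfpl.DenseVariational(
            units=8,
            input_shape=(1,),
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=kl_weight,          # reciprocal of the number of training examples
            kl_use_exact=False,
            activation="sigmoid",
        ),
        # Second variational layer: outputs exactly the number of parameters
        # the final distribution layer needs
        tfpl.DenseVariational(
            units=tfpl.MultivariateNormalTriL.params_size(1),
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=kl_weight,
            kl_use_exact=False,
        ),
        # Final distribution layer modeling aleatoric uncertainty
        tfpl.MultivariateNormalTriL(1),
    ])
```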
Let's unpack it. We use a regular tf.keras.Sequential wrapper. We pass two tfpl.DenseVariational layers into it and put one tfpl.MultivariateNormal layer on top of them.
The first variational layer has 8 units (or neurons). First, we define the input shape as (1,), because our X is one-dimensional. Then, we pass our prior- and posterior-generating functions to make_prior_fn and make_posterior_fn respectively. Next, we normalize the Kullback-Leibler divergence term³, weighting it by the reciprocal of the number of training examples. We set kl_use_exact to False (although we could set it to True in our case). Finally, we specify the non-linear activation function⁴.
The second layer is parametrized similarly, with a couple of differences. Instead of hard-coding the number of units, we use the .params_size() method to make sure that we get the correct number of units to parametrize our last distribution layer. Note that we also do not use a non-linearity here.
Our negative log-likelihood loss function is identical to the one we defined in Part 2.
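For reference, it can be as simple as this (the name nll is mine):

```python
def nll(y_true, y_pred):
    # y_pred is a distribution object produced by the final tfpl layer,
    # so we can score the observed targets with log_prob directly.
    return -y_pred.log_prob(y_true)
```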
We’re now ready to train the model!
Let’s do it and plot the loss:
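A possible training setup is sketched below; the optimizer, learning rate and number of epochs are assumptions and may need tuning:

```python
model = get_model(kl_weight=1 / x_train.shape[0])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01), loss=nll)

history = model.fit(x_train, y_train, epochs=500, verbose=0)

plt.plot(history.history["loss"])
plt.xlabel("Epoch")
plt.ylabel("Negative log-likelihood")
plt.show()
```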

The loss looks good. We could try to optimize the model further so that it converges faster, but that's not our goal in this article.
Results
Our model is trained. That's a very exciting moment! We're ready to generate predictions. Our predictions will contain estimates of aleatoric and epistemic uncertainty. We expect to see a set of best-fit lines and a set of confidence intervals. To make the visualization cleaner, we'll modify the plotting code slightly to represent confidence intervals as lines rather than shaded areas. Similar to what we did in Part 3, we'll compute 15 independent predictions:
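A plotting sketch along these lines is below; the prediction grid, colors and the use of ±2 standard deviations for the roughly 95% interval are assumptions:

```python
x_plot = np.linspace(-1.0, 1.0, 200).astype(np.float32)[:, np.newaxis]

plt.scatter(x_train, y_train, alpha=0.2, s=8)
for _ in range(15):
    # Each call re-samples weights from the posterior (epistemic uncertainty),
    # and each returned distribution carries its own scale (aleatoric uncertainty).
    y_dist = model(x_plot)
    mean = y_dist.mean().numpy().squeeze()
    std = y_dist.stddev().numpy().squeeze()
    plt.plot(x_plot.squeeze(), mean, color="C1", alpha=0.6)
    plt.plot(x_plot.squeeze(), mean + 2 * std, color="C2", alpha=0.4)
    plt.plot(x_plot.squeeze(), mean - 2 * std, color="C2", alpha=0.4)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```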

As you can see, our model did a pretty good job fitting the data! 🚀 It seems that non-linearity and heteroscedasticity are both modeled well. Confidence intervals seem reliable, covering about 95% of the points.
Summary
In this episode of the Modeling uncertainty in neural networks with TensorFlow Probability series, we've seen how to model aleatoric and epistemic uncertainty using a single model.
We generated a non-linear and heteroscedastic dataset and used a deeper architecture with non-linear activation to model it. We refreshed how to define prior and posterior distributions necessary to parametrize variational layers. Finally, we’ve seen how to use a variational layer to parametrize a final probabilistic distribution layer.
This is the last article in the Modeling uncertainty in neural networks with TensorFlow Probability series. Congratulations on completing the series! 🎉
Thank you!
If you have any questions regarding the material covered in this series, or suggestions for topics you would like to read about in the future, feel free to let me know.
Footnotes
¹ Theoretically, we could go even "more probabilistic" by using a variational process as the last layer of the model. That would be beyond the scope of this series though.
² You might also remember that we pass these functions as function objects, without calling them.
³ Please refer to Part 3 for an explanation.
⁴ We use the sigmoid function in the example. Would the model fit look different if we used ReLU? If so, why? Feel free to let me know in the comments.