How to deal with Uncertainty in the era of Deep Learning

TensorFlow introduces Probabilistic Modeling in the Deep Learning Community

Dirk Elsinghorst
Towards Data Science

--

These days, hardly a day goes by without the publication of a new, outstanding machine learning application, most likely powered by some deep learning model. Beneath the surface, hundreds of thousands of companies are applying the same technology to all kinds of processes. At the latest when it supports critical decision-making, you should think about the degree of certainty that comes with every prediction. We will go through why that is, how to define uncertainty, and eventually look at some code examples so that you can apply these findings in your next project.

The Tale of the Trojan Zebra

Let’s do a thought experiment. Imagine you work for a company that organizes safaris, and you want to create a safer experience for clients on their adventures. As a machine learning engineer, your objective could be to create an image classifier that labels surrounding animals as “potentially dangerous” or “likely harmless”. Such a program would allow customers to feel safe and prompt them to take precautions when necessary.

Luckily, you installed cameras on all your vehicles months ago, so you now have a ton of images of zebras, snakes, giraffes, etc. You start training your neural network right away. First evaluations show that your model is performing quite well, detecting all kinds of snakes as potentially dangerous and zebras as harmless.

“That’s great”, you think, “but what if I use this model in practice and an animal that was underrepresented or even completely absent in the data shows up in front of my cameras?”

Well, the model is forced to decide on one of the two possible outcomes even though it actually does not have a clue. Imagine further that in this case the (almost) unknown animal is a tiger. Due to certain similarities (stripes, four legs, etc.), the model might place the photo closer to a zebra than to a predator in its internal representation and, as a result, misclassify the situation.

Photo by Smit Patel on Unsplash

Wouldn’t it be nice if the model were able to communicate its uncertainty? Since you are primarily interested in your clients’ safety, you could then calibrate your system to react defensively when in doubt.

There are many causes of prediction uncertainty, but from our modeling perspective it basically boils down to two categories. The easiest way to think of them is as data-related and model-related uncertainty.

Aleatoric uncertainty

The former is probably the more obvious one. Whenever you take multiple measurements under the same circumstances, it’s still quite unlikely that you will get exactly the same result every time. Why is that? For several reasons: if you are using a sensor, every device has its own accuracy, precision, resolution, etc. In the case of a manual lab sample, the technique used as well as personal skill and experience play their roles. Basically, every aspect of the measurement that is unknown, and hence introduces some kind of randomness, falls into this category.

Whether you are able to reduce aleatoric uncertainty depends on how much influence you have on the way your data is collected.

Epistemic uncertainty

Now that we covered the uncertainty within the data, what about the model? It’s not just a question of the model’s capability to explain the given data but of how certain it is that the data incorporates all there is to know. Another way of looking at epistemic uncertainty is as a measure of how much an (infinite) ensemble of models would agree on the outcome. If multiple models come to strongly deviating conclusions, the data obviously did not paint the whole picture.

High epistemic uncertainty can be caused for example by simple models that try to fit complex functions, too little or missing data, etc. That’s where your experience as a modeler can shine. Is an important feature missing? Do you have underrepresented situations in your training data? Did you choose the right model?

A practical approach

Since both types of uncertainty are not constant throughout all predictions, we need a way of assigning a specific uncertainty to each prediction. That’s where the new TensorFlow Probability package steps in to save the day. It provides a framework that combines probabilistic modeling with the power of our beloved deep learning models.

chart 1: example data

For demonstration purposes we will take a very simple regression problem with just one input and one output dimension. The two blobs of the training data differ in their standard deviation and between them is a space where we do not have any data. Will the model be able to consider and communicate these aspects in any meaningful way?
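To make the setup concrete, here is a minimal sketch of how such toy data could be generated. The shapes, ranges, and noise levels below are assumptions for illustration, not the exact values behind the charts.

```python
import numpy as np

# Hypothetical toy data: two blobs with different noise levels and a gap in between.
rng = np.random.default_rng(0)

x_left = rng.uniform(-3.0, -1.0, size=150)                  # first blob
y_left = 0.5 * x_left + rng.normal(0.0, 0.1, size=150)      # low noise

x_right = rng.uniform(1.0, 3.0, size=150)                   # second blob
y_right = 0.5 * x_right + rng.normal(0.0, 0.5, size=150)    # high noise

x_train = np.concatenate([x_left, x_right]).astype(np.float32)[:, None]
y_train = np.concatenate([y_left, y_right]).astype(np.float32)[:, None]
```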

We will use a two-layer neural network. The first layer uses the sigmoid and the second the identity function as its activation. So far, business as usual.

The last missing piece is the new tfp.layers.DistributionLambda layer from TensorFlow Probability. Instead of a tensor, it returns a tfp.distributions.Distribution, which you can use to perform all the kinds of operations you would expect from a distribution, like sampling, or deriving its mean or standard deviation.
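Putting these pieces together, a minimal sketch of such a model could look like the following. The hidden layer size, learning rate, and epoch count are assumptions, and `x_train`/`y_train` are the arrays from the data sketch above.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Train by maximizing the likelihood of the data under the predicted distribution.
negloglik = lambda y, p_y: -p_y.log_prob(y)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='sigmoid'),
    tf.keras.layers.Dense(2),  # two neurons: one for the mean, one for the std deviation
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x_train, y_train, epochs=500, verbose=False)

# The model now returns a distribution instead of a tensor:
x_test = np.linspace(-3.0, 3.0, 200).astype(np.float32)[:, None]
y_dist = model(x_test)
mean, stddev = y_dist.mean(), y_dist.stddev()
```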

chart 2: aleatoric uncertainty
chart 3: aleatoric uncertainty (standard deviation)

Did you notice the two neurons in the second dense layer? They are needed to learn the output distribution’s mean and standard deviation.

You can see that by predicting a distribution instead of a single value, we are getting closer to what we actually want to achieve. As mentioned earlier, the two blobs have different variances, and the model reflects that in its prediction. One could argue that for each blob the model predicts a constant variance with respect to the fitted mean curve, which does not seem to match the data, but this is due to the limited model complexity and will be resolved with increasing degrees of freedom.

But what happens in between the two blobs? That’s not what we expect from a measure of uncertainty in a space without data, right? Who says that in that space there isn’t another so far unmeasured blob somewhere far away from the fitted function? Nobody; we just don’t know. And neither does the model. Regardless, you should not be too surprised by the model’s interpolation.

After all, that’s what models do: they learn the function that best fits the known data and does not hurt this purpose where there is no data.

So, how can we compensate for this lack of information? It turns out that we already covered the solution earlier: the consideration of epistemic uncertainty.

The basic approach is to take the idea of using distributions instead of single variables one step further and apply it to all model parameters. Therefore, we take our example from before and substitute the two Keras dense layers with tfp.layers.DenseVariational layers. As a result, the model no longer learns just one set of weights and biases but sees all its parameters as random variables, with more or less variance depending on how certain it is to have found the best fit. During inference, the model samples from the parameter distributions, which leads to different predictions for each run. Hence, one has to run inference several times and calculate the overall mean and standard deviation.
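A sketch of what this could look like follows. The prior and posterior functions are modeled on the ones in the TensorFlow Probability regression tutorial; the layer sizes, training settings, and number of inference runs are assumptions, and `x_train`, `y_train`, `x_test`, and `negloglik` come from the sketches above.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Mean-field posterior: an independent trainable Normal per weight and bias.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.0))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

# Prior: a trainable Normal with fixed unit scale.
def prior_trainable(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1.0),
            reinterpreted_batch_ndims=1)),
    ])

model = tf.keras.Sequential([
    tfp.layers.DenseVariational(8, posterior_mean_field, prior_trainable,
                                kl_weight=1 / x_train.shape[0],
                                activation='sigmoid'),
    tfp.layers.DenseVariational(2, posterior_mean_field, prior_trainable,
                                kl_weight=1 / x_train.shape[0]),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x_train, y_train, epochs=1000, verbose=False)

# Each forward pass samples a new set of weights, so predictions differ per run.
predictive_dists = [model(x_test) for _ in range(100)]
```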

chart 4: aleatoric and epistemic uncertainty
chart 5: aleatoric and epistemic uncertainty (standard deviation)

How does this upgraded version of our model work with our toy data? We can see that it still reflects the different variances close to the two blobs, although both seem to have increased. This actually makes sense, since what we see is the sum of both the aleatoric and the epistemic uncertainty.
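One way to make that sum explicit, assuming the multiple forward passes from the sketch above, is the law of total variance: average the per-run (aleatoric) variances and add the variance of the per-run means (epistemic).

```python
import numpy as np

# Aggregate the stochastic forward passes from the variational model above.
means = np.stack([d.mean().numpy().squeeze() for d in predictive_dists])
stddevs = np.stack([d.stddev().numpy().squeeze() for d in predictive_dists])

overall_mean = means.mean(axis=0)
# Law of total variance: mean aleatoric variance + variance of the means (epistemic).
total_stddev = np.sqrt((stddevs ** 2).mean(axis=0) + means.var(axis=0))
```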

Furthermore, in between the two data blobs the behavior changed more drastically: the model uncertainty, represented by the standard deviation, now spikes where we do not have any data available. This is exactly what we were hoping for.

Let’s take another look at our example and think about why that actually worked. To recap, the model is trying to find the function that best fits the data. Aren’t there a lot of other functions that fit the data about as well as the one in chart 4? For example, if you shift the curve slightly along the x-axis to the left or the right, the loss won’t change significantly. Therefore, the model parameter responsible for this shift has a wide distribution: given the data, the model is quite uncertain about this parameter.

If you are interested in the complete code example, take a look at the Jupyter notebook. It’s based on the one the TensorFlow Probability team published with their blog post on the same topic. In case you want to dive even deeper, I recommend checking out the Jupyter notebook version of the book Bayesian Methods for Hackers by Cameron Davidson-Pilon. I really like the interactive manner in which the content is presented.

Conclusions

We saw the importance of considering uncertainty in our models and how to overcome the challenges that come with it by looking at a simple example using TensorFlow Probability. Although the library is still quite young, I expect it to grow up fast, as it offers an easy way to combine probabilistic modeling and deep learning by fitting seamlessly into Keras.

Hopefully it will become more and more common practice to pay attention to a model’s reliability before deriving important decisions from its predictions. Seeing arguably the most widely used deep learning library lay the foundation to do so is at least a big step in the right direction.

If you found this post helpful, let me know your thoughts and experiences in the comments, and don’t forget to follow me on Medium and LinkedIn.
