On 14 February 2016, a Google self-driving car hit a bus. On 23 March 2018, Tesla’s Autopilot was involved in a fatal crash. A big part of intelligence is not acting when one is uncertain, and this has to be integrated into our artificial intelligence (AI) algorithms to avoid such accidents. The problem with neural networks – the building blocks of the current Deep Learning AI wave, now being used in safety-critical applications – is that they have no inherent uncertainty quantification. This semi-technical article discusses the matter and reports different approaches proposed to resolve the problem.
A big part of intelligence is not acting when one is uncertain.
Discriminative vs Generative
Conventional neural networks have no inherent uncertainty quantification. This means a network doesn’t tell us how much it knows about the input or how confident it is about its output. Consider a network trained to classify dog and cat images. What happens if you feed it a zebra image? It will respond with a dog/cat prediction. At the end of the day, that is what the network was trained for, and it only has two outputs anyway. But that doesn’t have to be the case. The network should be augmented with the ability to raise its hand and say: this is something I was not trained for.
It is good to know that there are two different types of machine learning models. (1) Discriminative models map inputs to their classes by finding decision boundaries. (2) Generative models model the data distribution: they find each class’s central tendency and how the data is distributed around those centers.

Consider the figure above, which shows a dataset of two classes (orange and blue points) along with the decision boundary a neural network learned over it. As you can see, the data lies between [-6, 6] on both axes. Now consider a data point that is above the decision boundary (blue region) but very far from the training dataset, say the point (50, 4). The network will classify it as blue, with a high class score, despite it being very far from the original training set! This is an example of a discriminative machine learning model – a neural network – failing to deal with out-of-domain inputs.
On the other hand, generative models learn each class’s probability distribution, as in the Naive Bayes classifier. The figure below gives the intuition of such a model’s predictions in action. In this case, if we feed the model a far-away data point, it will output a very low class score for both classes, as the point lies far from both centers and well beyond the data’s standard deviation. We can then say the model is uncertain about the decision and that this point probably lies outside the training data domain.
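To make the contrast concrete, here is a minimal numeric sketch that uses a scikit-learn logistic regression and per-class Gaussians as stand-ins for the neural network and the generative model (the toy dataset and the point (50, 4) are chosen to mirror the figures):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two 2D classes roughly inside [-6, 6], mimicking the figures above.
X, y = make_blobs(n_samples=200, centers=[[-2.0, -2.0], [2.0, 2.0]],
                  cluster_std=1.5, random_state=0)
far_point = np.array([[50.0, 4.0]])   # far outside the training domain

# Discriminative view: only a decision boundary -> a confident score far away.
clf = LogisticRegression().fit(X, y)
print("discriminative class scores:", clf.predict_proba(far_point))

# Generative view: fit one Gaussian per class -> both likelihoods are ~0 far away.
for c in (0, 1):
    dist = multivariate_normal(X[y == c].mean(axis=0), np.cov(X[y == c].T))
    print(f"class {c} likelihood at (50, 4): {dist.pdf(far_point)[0]:.3e}")
```

The discriminative stand-in reports a near-certain class score for (50, 4), while both Gaussian likelihoods are vanishingly small – exactly the “I was not trained for this” signal we are after.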

All in all, neural networks (and convolutional ones by extension), being discriminative models, have no inherent uncertainty quantification. This nature also makes them more susceptible to adversarial inputs (inputs intentionally and systematically perturbed to deceive neural networks) than other models such as Radial Basis Function (RBF) networks. However, that is another topic, which I leave for the curious to check [1]. In the following sections, we will see what researchers have proposed to add uncertainty quantification to neural networks.
Problem Resolution
The resolutions below are drawn from two research papers [2][3] and are considered state-of-the-art methodologies. Although they are based on sound theory, I will explain them intuitively.
Monte Carlo Dropout Sampling
The simplest of all resolutions is Monte Carlo Dropout (MC-Dropout), which feeds the input N times through the same network while tweaking the network slightly each time. The tweak is done by shutting down some of the network’s nodes or neurons (a different subset each time). This is called dropout. Dropout was first introduced as a regularization trick to prevent the model from over-fitting during training [4]. In our case, however, we use it during inference to sample different parameter configurations – effectively different networks. The uncertainty in this case is the variance of the outputs. In other words, if the sampled networks output different classes, or the same class with very different class scores, the model is considered uncertain about the input.
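Here is a minimal sketch of the idea in PyTorch (the architecture, dropout rate, and N = 50 samples are illustrative choices, not the papers’ setup):

```python
import torch
import torch.nn as nn

# A toy classifier with a dropout layer.
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)

x = torch.randn(1, 10)          # a single (dummy) input
model.train()                   # train() keeps dropout ACTIVE at inference time
with torch.no_grad():
    # N = 50 stochastic forward passes through "slightly different" networks.
    samples = torch.stack([model(x).softmax(dim=-1) for _ in range(50)])

mean_prediction = samples.mean(dim=0)   # averaged class scores
uncertainty = samples.var(dim=0)        # high variance => the model is unsure
print(mean_prediction, uncertainty)
```

The only unusual part is keeping the network in train() mode at inference so that dropout stays active during the repeated forward passes.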
We can see this as analogous to shutting down a person’s different sensory organs while asking them to recognize objects (say, blindfolding their eyes one time, covering their skin another time, and so on). If they keep predicting the same object even after losing different senses, then the object is very familiar to them – familiar to the extent that they can figure it out with limited senses. On the other hand, if they predict a different object each time, then the object is challenging. The figure below is a bonus illustration.

If you are still curious about adversarial inputs, notice that a similar method has been used to detect them [5].
Ensembles
Ensembling simply means training multiple neural networks with different random initializations and different shufflings of the training dataset. The uncertainty is the variance of the models’ predictions. The authors in [2] found ensembles to be more accurate than MC-Dropout in approximating the ground-truth uncertainty; however, they are computationally expensive.
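A minimal ensemble sketch in PyTorch (the sizes and the number of members are illustrative, and the training loop is omitted):

```python
import torch
import torch.nn as nn

def make_model(seed: int) -> nn.Module:
    torch.manual_seed(seed)   # a different random initialization per member
    return nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

# In practice each member is also trained on its own shuffle of the data;
# the (untrained) members below only show how predictions are combined.
ensemble = [make_model(seed) for seed in range(5)]

x = torch.randn(1, 10)
with torch.no_grad():
    preds = torch.stack([m(x).softmax(dim=-1) for m in ensemble])

mean_prediction = preds.mean(dim=0)
uncertainty = preds.var(dim=0)   # disagreement across members = uncertainty
print(mean_prediction, uncertainty)
```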

This reminds me of the paradigm shift section of Stephen Covey’s “The 7 Habits of Highly Effective People”. He tells a story in which he showed one group of his students image (A – young lady) in the figure above, and image (B – old lady) to the other group. He then showed image (C) to both groups and asked them what was in it. The first group said it was a young lady and the other said it was an old one. This is because – like neural networks – each group has a different perspective (a different position on the parameter manifold, in the case of neural networks). And the fact that the same image had different interpretations means it is ambiguous and uncertain in nature. The figure below is a bonus illustration.

Before we continue
Now that uncertainty is clear, it is good to know that there are two different types of uncertainty. (1) Aleatoric uncertainty captures uncertainty in the data-generating process, e.g., the inherent randomness of a coin flip or measurement noise. This type of uncertainty cannot be reduced even if we collect more training data. (2) Epistemic uncertainty models the ignorance of the predictive model; it can be explained away given enough training data. The figure below explains the concept better: the orange braces correspond to aleatoric uncertainty, while the golden ones correspond to epistemic uncertainty resulting from a lack of training data.

So far, the proposed methodologies target epistemic uncertainty, as the aleatoric kind is already available through the class scores (this is different for regression, though; check reference [2] for further details).
Sampling-Free Uncertainty Quantification
The uncertainty quantification methods presented earlier both depend on sampling a set of different model parameters (different networks). The authors in [3] proposed a sampling-free method for regression using mixture density networks (MDNs).
Why MDNs
An MDN is a variant of a neural network proposed in 1994 by Christopher Bishop [6]. He proposed the variant after showing that conventional neural networks trained for regression assume a normally distributed target. In other words, if the target data is multi-modal in some regions, the network will fail to approximate it and will learn the average instead. This insight is very important and should be considered by practitioners. Consider a self-driving car dataset annotated by recording humans driving. In some similar scenarios, the humans might have behaved differently: some went left, others went right. If we train the network to predict the steering angle in such a scenario, it will learn to go straight (each training sample pulls the gradients toward itself, and the network minimizes the error by settling on the average). To see this effect, the figure below presents two toy datasets, one with a normally distributed target and the other with a bimodally distributed one.
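The same averaging effect can be seen with a few lines of arithmetic – a minimal sketch with made-up numbers:

```python
import numpy as np

# For the same situation, half the demonstrations steer left (-1) and half
# steer right (+1). The prediction that minimizes mean squared error is their
# average, i.e. "go straight", which matches neither behaviour.
targets = np.array([-1.0] * 50 + [1.0] * 50)
candidates = np.linspace(-1.5, 1.5, 301)            # candidate constant predictions
mse = [np.mean((targets - c) ** 2) for c in candidates]
print(candidates[int(np.argmin(mse))])              # ~0.0: the useless average
```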

How is an MDN different
Rather than predicting the target directly, an MDN predicts a linear combination (a mixture) of K distributions. Rather than having one output neuron, it has 3*K of them: for each distribution, the network predicts a mean, a standard deviation, and a mixture weight. Increasing K gives the network more flexibility to model multi-modal targets. The figure below illustrates an MDN.
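As a complement to the figure, here is a minimal sketch of an MDN output head in PyTorch (layer sizes and K are illustrative, not taken from the papers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    """Emits a weight, a mean, and a std deviation per component: 3*K outputs."""
    def __init__(self, in_features: int, k: int = 3):
        super().__init__()
        self.pi = nn.Linear(in_features, k)      # mixture weights
        self.mu = nn.Linear(in_features, k)      # component means
        self.sigma = nn.Linear(in_features, k)   # component std deviations

    def forward(self, h):
        pi = F.softmax(self.pi(h), dim=-1)       # weights sum to 1
        mu = self.mu(h)
        sigma = torch.exp(self.sigma(h))         # keep std deviations positive
        return pi, mu, sigma

backbone = nn.Sequential(nn.Linear(1, 32), nn.Tanh())
head = MDNHead(32, k=3)
pi, mu, sigma = head(backbone(torch.randn(4, 1)))  # 4 inputs -> 3 components each
print(pi.shape, mu.shape, sigma.shape)             # all torch.Size([4, 3])
```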

Since this article is about uncertainty quantification and not about MDNs, the reader is asked to follow up on MDNs through the references.
Back to the sampling-free thing
Using an MDN, the network no longer predicts a single target but several distributions. The authors proposed using the mean of the distributions’ variances as an aleatoric uncertainty quantifier, and the variance of the distributions’ means as an epistemic uncertainty quantifier. How confusing.

The figure above illustrates both quantities. Consider an MDN with K=3, i.e., the network predicts three distributions. The variance of means measures how far the means of the three distributions are from each other. If they are near each other, we have low epistemic uncertainty and we are good to go. Otherwise, if the means are far from each other, the network is facing a high-uncertainty situation and its decision should be inspected or delegated. The mean of variances, on the other hand, is the average spread of the three distributions. If all the distributions are widely spread, aleatoric uncertainty is high; if all are narrowly spread, aleatoric uncertainty is low. Check this site for an interactive demo.
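Concretely, here is one way to compute the two quantities from the predicted mixture parameters, weighting each component by its mixture weight (a sketch following the law-of-total-variance style split; the exact formulation in [3] may differ in detail, and the example values are made up):

```python
import torch

def mdn_uncertainties(pi, mu, sigma):
    """Split the mixture's predictive variance into the two quantities above."""
    mixture_mean = (pi * mu).sum(dim=-1, keepdim=True)
    aleatoric = (pi * sigma ** 2).sum(dim=-1)                 # mean of variances
    epistemic = (pi * (mu - mixture_mean) ** 2).sum(dim=-1)   # variance of means
    return aleatoric, epistemic

# Toy example with K = 3 components for a single input:
pi = torch.tensor([[0.3, 0.4, 0.3]])
mu = torch.tensor([[1.0, 1.1, 0.9]])      # means close together -> low epistemic
sigma = torch.tensor([[2.0, 2.0, 2.0]])   # wide components      -> high aleatoric
print(mdn_uncertainties(pi, mu, sigma))
```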
The authors showed that these two quantities respond to uncertainty arising from different factors such as noise and missing data. Note that, unlike the previous methods, this one has not been tested on large-scale problems.
Conclusion
This was a simple introduction to uncertainty quantification in deep learning. Neural networks have no inherent uncertainty quantification, which calls for augmenting them with this capability. We discussed three proposals to that end: MC-Dropout, ensembles, and MDNs. The reader is referred to the references section for further details. Please don’t hesitate to leave any feedback in the comments section below, as this will help me improve my writing.
References
[1] Goodfellow, I.J., Shlens, J., and Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[2] Gustafsson, F.K., Danelljan, M., and Schon, T.B., 2020. Evaluating scalable Bayesian deep learning methods for robust computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 318–319).
[3] Choi, S., Lee, K., Lim, S., and Oh, S., 2018, May. Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 6915–6922). IEEE.
[4] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), pp. 1929–1958.
[5] Ma, L., Zhang, F., Sun, J., Xue, M., Li, B., Juefei-Xu, F., Xie, C., Li, L., Liu, Y., Zhao, J. and Wang, Y., 2018, October. DeepMutation: mutation testing of deep learning systems. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE) (pp. 100–111). IEEE.
[6] Bishop, C.M., 1994. Mixture density networks.