Adding Uncertainty to Deep Learning

How to construct prediction intervals for deep learning models using Edward and TensorFlow

Motoki Wu
Towards Data Science
6 min read · Mar 10, 2017


The difference between statistical modeling and machine learning gets blurrier by the day. Both learn from data and predict an outcome. The main distinction seems to be the presence of uncertainty estimates. Uncertainty estimates allow hypothesis testing, though usually at the expense of scalability.

Machine Learning = Statistical Modeling - Uncertainty + Data

Ideally, we mesh the best of both worlds by adding uncertainty to machine learning. Recent developments in variational inference (VI) and deep learning (DL) make this possible (also called Bayesian deep learning). What’s nice about VI is that it scales well with data size and fits nicely with DL frameworks that allow model composition and stochastic optimization.

An added benefit of adding uncertainty to models is that it promotes model-based machine learning. In machine learning, you judge a model by the quality of its predictions. If the results are not up to par, the strategy is to “throw data at the problem” or “throw models at the problem” until they are satisfactory. In model-based (or Bayesian) machine learning, you are forced to specify the probability distributions of the data and the parameters up front. The idea is to explicitly specify the model first, and then check the results (a distribution, which is richer than a point estimate).

Bayesian Linear Regression

Here is an example of adding uncertainty to a simple linear regression model. A simple linear regression predicts labels Y from data X with weights w (ignoring the bias term for now).

Y = w * X

The goal is to find a value for the unknown parameter w by minimizing a loss function, here the squared error:

(Y - w * X)²

Let’s flip this into a probability. If you assume that Y is Gaussian distributed around w * X, minimizing the loss above is equivalent to maximizing the following data likelihood with respect to w:

p(Y | X, w)
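
To make the equivalence concrete, here is a quick numpy sketch (my illustration, not from the article): the w that minimizes the squared error is exactly the w that maximizes the Gaussian log likelihood.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100)
Y = 2.0 * X + rng.randn(100)  # true w = 2, unit-variance Gaussian noise

ws = np.linspace(0.0, 4.0, 401)
sq_loss = np.array([np.sum((Y - w * X) ** 2) for w in ws])
log_lik = -0.5 * sq_loss  # Gaussian log p(Y | X, w), up to an additive constant

print(ws[np.argmin(sq_loss)], ws[np.argmax(log_lik)])  # same w both times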

So far, this is traditional machine learning. To add uncertainty to the weight estimates and turn this into a Bayesian problem, it’s as simple as attaching a prior distribution to the original model.

p(Y | X, w) * p(w)

Notice this is equivalent to inverting the probability of the original machine learning problem via Bayes’ rule:

p(w | X, Y) = p(Y | X, w) * p(w) / CONSTANT

The probability of the weights (w) given the data is what we need for uncertainty intervals. This is the posterior distribution of weight w.

Although adding a prior is conceptually simple, the computation is often intractable; namely, the CONSTANT is a big, bad integral.
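
When w is a single number, we can still compute the CONSTANT by brute force. Here is a toy numpy sketch (mine, not the article’s) that normalizes the posterior on a grid; with thousands of weights, this integral becomes hopeless, which is where the next sections come in.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100)
Y = 2.0 * X + rng.randn(100)

ws = np.linspace(-5.0, 5.0, 1001)
# log p(Y | X, w) + log p(w), unit-variance Gaussians, up to additive constants
log_joint = np.array([-0.5 * np.sum((Y - w * X) ** 2) - 0.5 * w ** 2 for w in ws])
joint = np.exp(log_joint - log_joint.max())  # subtract the max to avoid underflow
posterior = joint / np.trapz(joint, ws)      # dividing by the CONSTANT (numerical integral)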

Monte Carlo Integration

The integral of a probability distribution is usually approximated by sampling: draw from the distribution and average to approximate the expected value (also called Monte Carlo integration). So let’s reformulate the integral problem as an expectation problem.

The CONSTANT above integrates the weights out of the joint distribution of the data and weights.

CONSTANT = ∫ p(x, w) dw

To reformulate it as an expectation, introduce another distribution q and take the expectation with respect to q.

∫ p(x, w) q(w) / q(w) dw = E[ p(x, w) / q(w) ]

We choose a distribution q that is easy to sample from. Sample a bunch of w’s from q and take the sample mean to get the expectation.

E[ p(x, w) / q(w) ] ≈ sample mean[ p(x, w) / q(w) ]

We’ll use this idea later for variational inference.
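
Here is a hedged numpy sketch (my example, not the article’s) of this estimator applied to the regression model’s CONSTANT, with a Gaussian proposal q; a real implementation would work in log space to avoid underflow.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100)
Y = 2.0 * X + rng.randn(100)

def norm_pdf(w, loc=0.0, scale=1.0):
    return np.exp(-0.5 * ((w - loc) / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

def joint(w):  # p(Y | X, w) * p(w), unit-variance Gaussians, up to a constant factor
    return np.exp(-0.5 * np.sum((Y - w * X) ** 2)) * norm_pdf(w)

w_samples = rng.normal(2.0, 0.5, size=5000)  # sample w from q = Normal(2, 0.5)
constant = np.mean([joint(w) / norm_pdf(w, 2.0, 0.5) for w in w_samples])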

Variational Inference

The idea of variational inference is to introduce a variational distribution q, with variational parameters v, and turn inference into an optimization problem. The distribution q will approximate the posterior.

q(w | v) ≈ p(w | X, Y)

Since these two distributions need to be close, a natural approach is to minimize the difference between them. It’s common to use the Kullback-Leibler divergence (KL divergence) as the difference (or variational) function.

KL[q || p] = E[ log (q / p) ]

The KL divergence can be decomposed into the (log) evidence of the data and the evidence lower bound (ELBO).

KL[q || p] = CONSTANT - ELBO

The CONSTANT (the log evidence) can be ignored since it doesn’t depend on q. Intuitively, the evidence in the denominator of the posterior p separates out as the CONSTANT, and what’s left is the ELBO. Now we only need to optimize over the ELBO.

The ELBO is just the expected log of the original model (likelihood times prior) minus the expected log of the variational distribution (worked out in detail here).

ELBO = E[ log p(Y | X, w) + log p(w) - log q(w | v) ]

To obtain the expectation over q, Monte Carlo integration is used (sample and take the mean).

In deep learning, it’s common to use stochastic optimization to estimate the weights. For each minibatch, we average the loss function to obtain a stochastic estimate of the gradient. Similarly, any DL framework with automatic differentiation can use the ELBO as the loss function. The only difference is that you sample from q; the sample average gives a good estimate of the expectation, and therefore of the gradient.
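
Putting this together, here is a minimal numpy sketch (mine, with made-up variational parameters) of the sampled ELBO for the regression model; a DL framework would differentiate this estimate with respect to qw_mu and qw_sigma.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100)
Y = 2.0 * X + rng.randn(100)

def log_norm(z, loc, scale):  # log density of a Gaussian
    return -0.5 * ((z - loc) / scale) ** 2 - np.log(scale * np.sqrt(2.0 * np.pi))

qw_mu, qw_sigma = 1.5, 0.3                        # variational parameters v
w_samples = rng.normal(qw_mu, qw_sigma, size=64)  # draws from q(w | v)

elbo = np.mean([np.sum(log_norm(Y, w * X, 1.0))   # log p(Y | X, w)
                + log_norm(w, 0.0, 1.0)           # log p(w)
                - log_norm(w, qw_mu, qw_sigma)    # log q(w | v)
                for w in w_samples])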

Code Example to Build Prediction Intervals

Let’s run a Bayesian simple linear regression on some generated data. The following also applies to DL models, since they are just linear regressions smashed together. We use a Bayesian deep learning library called Edward (built on TensorFlow) to build the model.

The entire code is available on Gist.
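
The snippets below assume a setup along these lines (x_data, y, and x are my stand-in names; the Gist has the real version):

import edward as ed
import numpy as np
import tensorflow as tf

N = 100
x_data = np.random.randn(N).astype(np.float32)
y = (2.0 * x_data + np.random.randn(N)).astype(np.float32)  # labels, true weight = 2

x = tf.placeholder(tf.float32, [N])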

The first step is to define a prior distribution for the weight.

weight = ed.models.Normal(mu=tf.zeros(1), sigma=tf.ones(1))
bias = ed.models.Normal(mu=tf.zeros(1), sigma=tf.ones(1))  # same prior for the bias used below

In deep learning, weight would be a point estimate. In Bayesian DL, the weight is a distribution and its posterior is approximated by the following variational distribution.

qw_mu = tf.Variable(tf.random_normal([1]))
qw_sigma = tf.nn.softplus(tf.Variable(tf.random_normal([1])))
qweight = ed.models.Normal(mu=qw_mu, sigma=qw_sigma)
qbias = ed.models.Normal(mu=tf.Variable(tf.random_normal([1])),
                         sigma=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))  # analogous for the bias

Notice the variational parameters qw_mu and qw_sigma are estimated, while the weight itself is not. Sampling from qweight will give us the posterior uncertainty intervals for weight (assuming the variance parameter is fixed).

We define a linear regression as the prediction model. Any DL model can be substituted.

nn_mean = weight * x + bias

The data likelihood is constructed similarly to the prior distribution (again with the variance parameter fixed).

nn = ed.models.Normal(mu=nn_mean, sigma=tf.ones(1))

The ELBO objective depends on both the data likelihood and the prior. The data likelihood is bound to the label data. Each prior distribution is bound to its variational distribution (this assumes independence between the variational distributions).

latent_vars = {weight: qweight, bias: qbias}
data = {nn: y}

Running the inference optimization is as easy as:

inference = ed.KLqp(latent_vars, data)
inference.run()

To extract the uncertainty intervals, we sample the variational distribution qweight directly. Here, the first two moments are computed from 100 samples.

mean_, var_ = ed.get_session().run(tf.nn.moments(qweight.sample(100), axes=[0]))

We can also add uncertainty intervals to predictions (also called posterior predictive intervals). This can be done by copying the model and replacing the priors with the posteriors (essentially updating the model after training on the data).

nn_post = ed.copy(nn, dict_swap={weight: qweight, bias: qbias})

Run through the graph again 100 times and we get an uncertainty interval for the prediction.

mean_, var_ = ed.get_session().run(
    tf.nn.moments(nn_post.sample(100), axes=[0]), feed_dict={x: x_data})

Now that we have the entire distribution, hypothesis testing is easy.

ed.get_session().run(tf.reduce_mean(
    tf.cast(nn_post.sample(100) > y, tf.float32)), feed_dict={x: x_data})

But wait, there’s more: Dropout Intervals!

If going fully Bayesian is too much of an investment, uncertainty intervals can be extracted from an existing neural network through its dropout layers.

There’s a strong link between regularization of machine learning models and prior distributions in Bayesian models. For example, the frequently used L2 regularization is essentially a Gaussian prior (more details here).

Dropout is a technique that zeroes out neurons at random according to a Bernoulli distribution. A dropout network can be shown to approximate a Gaussian process under a KL divergence criterion (more details here, since I’m not sure about the math).

In practice, this means that we run the model 100 times with the dropout layers turned on and look at the resulting distribution, using a model like the sketch below.
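
For concreteness, here is a hypothetical minimal dropout regressor for the snippet that follows (the article’s actual model is in the Gist); the keep_prob default of 0.5 keeps dropout active even at prediction time, and x is redefined for this section.

import numpy as np
import tensorflow as tf

N, H = 100, 32
x_data = np.random.randn(N, 1).astype(np.float32)

x = tf.placeholder(tf.float32, [N, 1])
keep_prob = tf.placeholder_with_default(0.5, shape=())  # stays at 0.5 unless overridden

W1 = tf.Variable(tf.random_normal([1, H]))
b1 = tf.Variable(tf.zeros([H]))
W2 = tf.Variable(tf.random_normal([H, 1]))
b2 = tf.Variable(tf.zeros([1]))

hidden = tf.nn.dropout(tf.nn.relu(tf.matmul(x, W1) + b1), keep_prob=keep_prob)
nn = tf.matmul(hidden, W2) + b2

sess = tf.Session()
sess.run(tf.global_variables_initializer())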

# keep the dropout during test time
mc_post = [sess.run(nn, feed_dict={x: x_data}) for _ in range(100)]

The sample mean of mc_post is the estimate. For the uncertainty interval, we simply add the following inverse precision term to the sample variance:

def _tau_inv(keep_prob, N, l2=0.005, lambda_=0.00001):
    tau = keep_prob * l2 / (2. * N * lambda_)
    return 1. / tau

This gives us principled uncertainty intervals without much investment.

np.var(mc_post) + _tau_inv(0.5, 100)

Summary

To get uncertainty intervals, you either:

  1. add a prior to the model, approximate the posterior via variational inference, and then sample from the posterior
  2. run the existing model a bunch of times with its dropout layers turned on

A hybrid solution would be to add uncertainty only to a specific parameter of interest. In natural language modeling, this could mean treating only the word embeddings as latent variables, as in the sketch below.
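
As a hedged Edward sketch of that hybrid (hypothetical, not from the article), we can put a prior on just the embedding matrix and leave the rest of the network as point estimates; V and D are assumed sizes.

import edward as ed
import tensorflow as tf

V, D = 10000, 50  # assumed vocabulary size and embedding dimension
embeddings = ed.models.Normal(mu=tf.zeros([V, D]), sigma=tf.ones([V, D]))
qembeddings = ed.models.Normal(
    mu=tf.Variable(tf.random_normal([V, D])),
    sigma=tf.nn.softplus(tf.Variable(tf.random_normal([V, D]))))
# The rest of the language model looks up rows of `embeddings` and uses
# ordinary tf.Variable weights; only {embeddings: qembeddings} is passed
# to ed.KLqp as a latent variable.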
