Bayesian Neural Net
Super Deep Learning That Knows When It’s Tricked
This is the third chapter in the series on Bayesian Deep Learning. The previous article is available here.
We already know that neural networks are arrogant. But another failing of standard neural nets is their susceptibility to being tricked. Imagine a CNN tasked with a morally questionable job like face recognition, taking its input from a security camera. What happens when the camera sends a picture of a dog, a mannequin or a child's doll? The CNN will still output classifications, tricked by anything with a resemblance to a human face. CNNs cry out for the Bayesian treatment: we don't want our work undermined by silly mistakes, and where the consequences of misclassification are high we want to know how sure the network is. How likely is it that a doll, or any real person, is a wanted criminal? We need to know.
Chapter Objectives
- Learn how to implement a Bayesian convolutional model
- Understand how we can identify bad input data without ever having seen it
- Understand how the parameter demands of Bayesian neural networks influence training
As we've discovered in earlier articles, Bayesian analysis deals in distributions rather than single values. We've seen it with normal distributions, where we get a continuous floating-point value back and the mean is the most likely return value. For a categorical, the distribution becomes discrete (piano key rather than violin string): we get a specific result such as a class or an index (or a note, if we squeeze the musical analogy). That distribution is informed by the logits from our model.
Being able to perform very well with considerably less data, while generalising better, makes Bayesian neural networks desirable. And that's before considering their other advantages. There is, however, one disadvantage to Bayesian implementations that becomes important in this chapter: they need more parameters. Considering that a whole distribution replaces every single weight value, it's perhaps surprising that only twice as many parameters are required. It's simple to see where the quite specific 'twice as many' figure comes from: the weight distributions are mostly normal distributions, and each normal distribution has just two parameters of its own (a mean and a scale).
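To see that rule of thumb in action, here's a minimal sketch (not from the notebook) comparing the parameter count of a standard dense layer with its Flipout counterpart. The layer sizes are arbitrary choices for illustration, and the counts assume TFP's default mean-field normal posterior.

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Each point-estimate weight is replaced by a normal distribution described
# by a mean and a scale, so the kernel needs two parameters per weight.
standard = tf.keras.layers.Dense(10)
bayesian = tfp.layers.DenseFlipout(10)

standard.build(input_shape=(None, 20))
bayesian.build(input_shape=(None, 20))

print(standard.count_params())  # 20*10 kernel weights + 10 biases
print(bayesian.count_params())  # roughly double: a mean and a scale per kernel weight
```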
In the deviant manner that comes naturally to the author, a strange new dataset has been created to highlight the problem. We find ourselves helping parents solve an important one: parents are always interested in measuring the height of their children. Well, no longer do they need to fret about carrying a ruler around all day. We'll create a model that estimates height from a picture. A dataset has been created consisting of 836 silhouettes of babies and toddlers, together with their heights. Rather than a classification problem, we're therefore solving a regression problem: we aim to return a single floating-point value corresponding to the height of the child silhouetted in the photograph. It's a slightly harder problem than the classification exercise from the last chapter, made harder still by the occasional presence of a spider. While the training set consists only of valid human silhouettes, after training we'll throw in spiders just for fun. And this is where it gets interesting, because we want to avoid returning height measurements for spiders, yet we aren't allowed any pictures of them in training.
Of course, spiders are usually rather smaller than children. So, to prevent simple discrimination based on a ridiculous difference in size, the spiders were upscaled to occupy a space comparable to the children. Furthermore, in the spirit of making the task quite arbitrarily difficult, the children have been randomly rescaled so the silhouettes are anywhere between half and twice their original size. Of course, rescaling doesn't make much sense when we're interested in the height of the children! But this rather elaborately contrived, senseless problem perfectly demonstrates the power of Bayesian deep learning. You'll see how well the models generalise to real-world situations with few training examples and without any examples of the corrupted data they're likely to receive.
Let's get stuck into the problem with TensorFlow Probability. The full code as well as the data is available in a Jupyter Notebook online at: https://github.com/DoctorLoop/BayesianDeepLearning. First we'll define the architecture.
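The exact code lives in the notebook linked above; the sketch below shows the general shape of the architecture described in the next few paragraphs. The filter counts, kernel sizes, pooling sizes and optimizer are illustrative assumptions, not necessarily the author's exact values.

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Start from a clean slate so nothing interferes with a fresh run.
tf.keras.backend.clear_session()

# Scale each layer's KL divergence by the number of training examples (836)
# so that each example carries only its share of the epoch's total KL.
kl_divergence_fn = (lambda q, p, _: tfp.distributions.kl_divergence(q, p) /
                    tf.cast(836, dtype=tf.float32))

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(126, 126, 1)),
    tfp.layers.Convolution2DFlipout(
        16, kernel_size=5, padding='same', activation='relu',
        kernel_divergence_fn=kl_divergence_fn),
    tf.keras.layers.MaxPooling2D(pool_size=6, strides=6),
    tfp.layers.Convolution2DFlipout(
        32, kernel_size=5, padding='same', activation='relu',
        kernel_divergence_fn=kl_divergence_fn),
    tf.keras.layers.MaxPooling2D(pool_size=6, strides=6),
    tf.keras.layers.Flatten(),
    # A single Bayesian neurone: one output, the estimated height.
    tfp.layers.DenseFlipout(1, kernel_divergence_fn=kl_divergence_fn),
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='mse', metrics=['mse'])
```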
In the first line we clear any session that might already be in memory, emptying any parameter stores and variables so there's no interference with a fresh run. Next we define a lambda function that helps us update the loss via the Kullback-Leibler (KL) divergence we discussed in the previous chapter. We then pass this lambda to each convolutional layer so the loss can be updated with the divergence between the approximate posterior distribution and our prior. Strictly speaking it isn't absolutely necessary to specify this, because the layer's default is almost the same. The difference is that while the default simply takes the KL divergence, we go one step further and divide it by the total number of training examples (836). The default implementation applies the epoch's total KL to every example, whereas we'd prefer each example to carry only its proportion of the epoch's KL. Both will train, but we see better results by scaling the loss. Experiment and see for yourself.
The actual model is defined just as it is for any other Keras Sequential. Of course, we're using a Convolution2DFlipout layer (we'll discuss it in the appendix) rather than the usual Conv2D. You might be surprised that we're only using two convolutional layers at a time when it's near enough a fashion to use hundreds. We're using two simply because the results are impressive and, for this problem, we really don't need more. We've also thrown in two maxpool layers between the neurone layers, both with quite large strides/pool sizes. If you have a problem that requires particularly sensitive, pixel-perfect measurement you might want to try removing these. The cost of doing so will be escalating hardware demands, so it's recommended to compare both.
The very last layer is a single dense (Bayesian) neurone because we’re interested in just one output. This output will be our measurement. It’s as simple as that.
Finally we compile the model with mean squared error (MSE) loss. This is slightly deceptive: although we only specify MSE, we're also adding the KL term on each layer. We defined that KL ourselves, because we're independent Bayesianists who wanted to give Keras a well-deserved rest. We'll see proof that the KL is involved when we print the loss during training: it's noticeably greater than the MSE alone, and the difference between the two is our KL.
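If you want to see those KL terms directly, something like the following sketch will do it once the model has been called on some data (x_train here is a hypothetical stand-in for the training images):

```python
# Each Flipout layer registers its (scaled) KL divergence via add_loss.
# After a forward pass, their sum is roughly the gap between the reported
# loss and the MSE metric.
_ = model(x_train[:5])            # one forward pass populates model.losses
kl_total = tf.add_n(model.losses)
print('KL contribution to the loss:', float(kl_total))
```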
Training
Let's start the training and take a look at that loss:
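A fit call consistent with the output below might look like the following (the array names and the 10% validation split are assumptions; the batch size of 5 and the 250 epochs match the numbers discussed shortly):

```python
history = model.fit(x_train, y_train,
                    batch_size=5,
                    epochs=250,
                    validation_split=0.1)
```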
[Out]:
....
Epoch 250/250
151/151 [==============================] - 1s 4ms/step - loss: 12.5878 - mse: 5.1539 - val_loss: 16.3906 - val_mse: 8.9721
There are a few things to note here. The loss is relatively high while the batch size is relatively low!
To address the loss first: we'll repeatedly find that with Bayesian models the loss value is an even worse indicator of model performance than it is for conventional models. The first reason is that we're combining at least two losses. Of course we're interested in the change in loss rather than its explicit value, but even that change isn't always clear, as we often adjust the relative influence of the two losses progressively over the training epochs. We'll discuss these considerations in later chapters. For now, just remember that it isn't unusual to see a classification model with a loss of several thousand(!) while having perfect validation metrics.
While some people may scoff at my puny batch size and assume resources are scarce, they couldn't be more wrong. With Bayesian models the batch size has a much greater influence on training than we'd expect. It's one of a number of areas of neural network theory we think we understand but that demand a review of our beliefs. We usually think of batch size as mattering mostly for training speed; some people also appreciate the reduced variance a larger batch brings. With Bayesian models, however, batch size directly influences training performance. Have a look for yourself by running the same model repeatedly with a batch size of 5 and then 50 (a sketch of such an experiment follows below). You'll notice that with a batch size of 50 the epochs are of course much quicker, but we never get loss or performance metrics as good as we do with a batch size of 5. It's not a small difference, it's enormous! This is important because it'll quickly become clear that batch size is a very influential hyperparameter for Bayesian deep learning success.
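A sketch of that experiment, assuming a build_model() helper that wraps the architecture and compile step shown earlier:

```python
# Hypothetical comparison: same architecture, two batch sizes.
results = {}
for batch_size in (5, 50):
    tf.keras.backend.clear_session()
    model = build_model()                 # assumed helper wrapping the Sequential above
    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=250,
                        validation_split=0.1,
                        verbose=0)
    results[batch_size] = history.history['val_mse'][-1]

print(results)   # expect markedly better validation MSE for the batch size of 5
```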
While at first it seems frustrating that we have another hyperparameter to optimise, we'll find we can rocket performance with a very simple change, using architectures simpler than those we've relied upon in the past (in the appendix at the bottom of this article we discuss the layers, like Flipout, that drive these changes).
Inference
Finally we get to inference. We're interested in making multiple predictions from our Bayesian master model. Each output will be slightly different, because each prediction is made with a fresh model whose weights have been sampled from the weight distributions of the Bayesian master we trained.
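A minimal version of that prediction loop might look like this (baby_image and spider_image are hypothetical names for single 126 x 126 x 1 arrays):

```python
import numpy as np

# Each predict() call samples a fresh set of weights from the learned
# posteriors, i.e. one member of the ensemble.
baby_preds = np.array([model.predict(baby_image[np.newaxis, ...])[0, 0]
                       for _ in range(1000)])
spider_preds = np.array([model.predict(spider_image[np.newaxis, ...])[0, 0]
                         for _ in range(1000)])
```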
In the above code we use a list-comprehension-style for loop to make each prediction. Wouldn't it be quicker to provide a single input array (1000 x 126 x 126 x 1) and make all the predictions at once? Indeed it would be much quicker, but it would also defeat the purpose, because it's the separate model.predict calls that sample fresh weights from the distributions of our Bayesian training. Each predict call is therefore responsible for creating a unique new model that's constrained by the distributions we created in training. If we made just one predict call with an input of 1000 images, all the predictions would be identical, because we'd be working with a single sample of weights and thereby emulating a standard model. We're more interested in exploiting the infinite bag of models our Bayesian training creates. We call this bag a model ensemble, and we take advantage of the ensemble of different models to get many different perspectives on the same data. The agreement of those perspectives is what matters most: it tells us about the quality of the data we put in.
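The density plot itself can be produced with something like the following (seaborn is an assumption; any kernel density plot will do):

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(baby_preds, color='green', label='baby silhouette')
sns.kdeplot(spider_preds, color='red', label='spider (never seen in training)')
plt.xlabel('Predicted height (pixels)')
plt.ylabel('Density')
plt.legend()
plt.show()
```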
In the above code and figure we produce a density plot of 1000 height predictions for a single baby image (green) and a single spider image (red). We can see that predictions for the baby's height are very tightly packed around 51 pixels (the mean and expected value): around 30% of predictions land at exactly this measurement (coincidentally the true value), and nearly all predictions are within a single pixel of the truth. On the other hand, while predictions for the spider also centre on a value (90 pixels), fewer than 4% of predictions sit at that expected value, and the predictions are far more widely dispersed, spread over a range from 51 pixels all the way to 134 pixels. Clearly the predictions on the spider don't agree with each other. We can therefore intuit that our Bayesian model is uncertain about predictions on invalid objects, while it is confident about predictions for the kinds of objects it saw in training. This is exactly how we want it to be.
In the next article we’ll explore how we can make simple Bayesian models better than complex standard models. We’ll also find out how other types of uncertainty can be exploited to guide training and how to optimise and compare models to find the very best.
Appendix: TensorFlow-Probability Convolutional Layers
If you’ve read the documentation or any papers recently you may have found different ways to tackle Bayesian deep learning. TensorFlow Probability implements two approaches for convolutional layers (more are available for dense layers). The first is the reparameterization layer:
tfp.layers.Convolution2DReparameterization. Reparameterization lets us calculate gradients with respect to the parameters that describe a distribution (such as its most likely value). We therefore manipulate the parameters that describe the distributions instead of the weight values in the neural network. Dealing with distribution parameters means the actual distribution can be ignored, effectively abstracted away. The parameters that describe the distribution can be thought of as stand-ins for the distribution object, in the same way that paper money stands in for real assets like gold; in both cases the stand-in is preferred because it's more convenient. In training we conveniently avoid the embarrassment of attempting backpropagation through random variables (embarrassing because it doesn't work¹).
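As a toy illustration of the trick (not TFP's internal implementation), the randomness is isolated in a fixed noise term so gradients flow through the deterministic parameters:

```python
import tensorflow as tf

mu = tf.Variable(0.0)                      # mean of the weight distribution
rho = tf.Variable(-3.0)                    # unconstrained scale; softplus keeps it positive

with tf.GradientTape() as tape:
    eps = tf.random.normal([])             # the randomness lives here, outside the parameters
    w = mu + tf.math.softplus(rho) * eps   # reparameterized sample of a single weight
    loss = tf.square(w - 1.0)              # toy loss using the sampled weight

grads = tape.gradient(loss, [mu, rho])     # well-defined gradients w.r.t. mu and rho
```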
Reparameterization is fast, but sadly it suffers from a practical need to give all the examples in a batch the same sampled weights. If weights were sampled individually for each example the memory requirements would skyrocket. Sharing weights is efficient, but it increases the variance of the gradient estimates, so training requires more epochs.
The Flipout layer, tfp.layers.Convolution2DFlipout, takes a different approach. While similar, it benefits from a special estimator for loss gradients. This Flipout estimator shakes up the weights within a mini-batch to make them more independent of each other. In turn, the shake-up reduces variance, so fewer training epochs are needed than with the reparameterization method. But there's a catch: while Flipout needs fewer epochs, it requires twice as many calculations. Luckily these calculations can be parallelised, but we'll still tend to find a model taking 25-50% longer per epoch (dependent on hardware), even though training requires fewer epochs in total.
¹ Without a reparameterized distribution we break the assumption that taking a larger sample gives us a better estimate. Many of us don't think of training in these terms, but we depend on that assumption all the time. With reparameterization we instead describe the change in the distribution's most likely value, which is deterministic, rather than trying to predict the most likely change in a sample, which we can't do precisely because the sample is random.