
Pitfalls with Dropout and BatchNorm in regression problems



Photo by Circe Denyer on PublicDomainPictures.net

Usually, when I see BatchNorm and Dropout layers in a neural network, I don’t pay them much attention. I tend to think of them as simple means to speed up training and improve generalization, with no side effects when the network is in inference mode. In this post, I will show why this notion is not always correct and how it can cause a neural network to fail completely.


Let’s make up a simple toy regression task:

Given a top-down depth image of a cylinder with variable radius and height, standing on a fixed surface, estimate the volume of the cylinder.

For our experiments, we’ll create a simple synthetic dataset. Below are some sample images.

Synthetic cylinders. Pixel color indicates height. (Image by Author)

The manual solution to the problem is of course trivial:

Measure the radius r, sample the height h somewhere on the cylinder and compute V = 𝜋·r²·h
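The post doesn’t ship the data-generation code, so here is a minimal sketch of how such a dataset could be produced. The image size, coordinate convention and function names are my own choices; the radius and height ranges match those used for training below, and the target is the analytic volume.

```python
import numpy as np

def make_cylinder_image(radius, height, size=64):
    """Render a top-down 'depth' image: pixels inside the cylinder's circular
    footprint get the cylinder height, the background stays at 0."""
    ys, xs = np.mgrid[0:size, 0:size]
    # Map pixel indices to coordinates in [-1, 1] so radii are in scene units.
    xs = (xs - size / 2) / (size / 2)
    ys = (ys - size / 2) / (size / 2)
    inside = xs**2 + ys**2 <= radius**2
    return np.where(inside, height, 0.0).astype(np.float32)

def sample_cylinder(rng, r_range=(0.3, 0.5), h_range=(5.0, 35.0)):
    r = rng.uniform(*r_range)
    h = rng.uniform(*h_range)
    return make_cylinder_image(r, h), np.pi * r**2 * h  # target: V = pi * r^2 * h

rng = np.random.default_rng(0)
image, volume = sample_cylinder(rng)
print(image.shape, round(volume, 2))
```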

Solving the task with a neural network

A simple task like this shouldn’t require an enormous neural network. Let’s try with a simple VGG11 backbone and a regression head. We include a Dropout layer just in case. What’s the harm, right?

Network architecture (Image by Author)
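The exact architecture isn’t spelled out in code here, so the following PyTorch sketch is my own approximation: torchvision’s VGG11 feature extractor followed by a small regression head, with the Dropout layer (p = 0.5, a guess) tucked in before the final projection.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg11

class CylinderVolumeNet(nn.Module):
    def __init__(self, use_dropout=True):
        super().__init__()
        # Stock VGG11 convolutional backbone, randomly initialized.
        self.backbone = vgg11(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        head = [nn.Flatten(), nn.Linear(512, 256), nn.ReLU()]
        if use_dropout:
            head.append(nn.Dropout(p=0.5))   # the "what's the harm" layer
        head.append(nn.Linear(256, 1))       # a single scalar: the predicted volume
        self.head = nn.Sequential(*head)

    def forward(self, x):
        return self.head(self.pool(self.backbone(x))).squeeze(-1)

model = CylinderVolumeNet(use_dropout=True)
print(model(torch.randn(4, 3, 64, 64)).shape)  # torch.Size([4])
```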

We train the model on cylinders with radii and heights drawn uniformly from the ranges r ∈ [0.3; 0.5] and h ∈ [5; 35], minimizing the mean squared error (MSE) with the Adam optimizer. Validation data is drawn from the same distribution.
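A minimal training loop matching this setup might look as follows. It builds on the sketches above; the batch size and learning rate are my own guesses, and the single-channel depth image is simply repeated to three channels to fit VGG.

```python
import numpy as np
import torch
import torch.nn as nn

def make_batch(rng, batch_size=32):
    """Draw a fresh synthetic batch (uses sample_cylinder from above)."""
    images, volumes = zip(*(sample_cylinder(rng) for _ in range(batch_size)))
    x = torch.from_numpy(np.stack(images))[:, None].repeat(1, 3, 1, 1)  # 1 -> 3 channels for VGG
    y = torch.tensor(volumes, dtype=torch.float32)
    return x, y

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(1000):
    model.train()
    x, y = make_batch(rng)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        model.eval()   # the switch where Dropout (and BatchNorm) change behavior
        with torch.no_grad():
            val_x, val_y = make_batch(rng)
            val_loss = loss_fn(model(val_x), val_y)
        print(f"step {step}: train {loss.item():.1f}, val {val_loss.item():.1f}")
```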

Results (with dropout)

Loss curves (with dropout) (Image by Author)

Woah, what happened here!? The training loss looks fine, but the validation loss is completely off. At first glance, this looks like a bad case of overfitting to the training data. Remember, however, that our training and validation data are randomly generated and come from the exact same distribution, so overfitting is impossible in this case.

Let’s look at some predictions and see if we can figure out what’s going on:

Predictions with Dropout (Image by Author)

Clearly, the model is biased towards underestimating the volume in validation mode. What is the cause of this? Only one thing changes when switching from training to validation mode: Dropout is switched off.

Why is dropout making the network fail?

When using dropout during training, the activations are scaled in order to preserve their mean value after the dropout layer. The variance, however, is not preserved. Passing through a subsequent non-linear layer (Linear + ReLU) translates this shift in variance into a shift in the mean of the activations going into the final linear projection layer. The final projection (essentially just a weighted sum plus a scalar bias) is trained to fit the training-time statistics, and thus fails at validation time when Dropout is switched off.
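This is easy to check with a toy experiment (not code from the original model): PyTorch’s inverted dropout keeps the mean of the activations but inflates their variance, and the mismatch disappears only in eval mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(100_000) + 3.0       # fake activations with mean ~3 and variance ~1
drop = nn.Dropout(p=0.5)

drop.train()
y_train = drop(x)                    # zero out half, scale survivors by 1/(1-p)
drop.eval()
y_eval = drop(x)                     # identity in eval mode

print(x.mean().item(), x.var().item())              # ~3.0, ~1.0
print(y_train.mean().item(), y_train.var().item())  # mean still ~3.0, variance ~11
print(y_eval.mean().item(), y_eval.var().item())    # unchanged: ~3.0, ~1.0
```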

This behavior should not be a problem for tasks where only the relative scale of the outputs matters (e.g. softmax classification). In our case, where the output represents an absolute quantity, it results in poor inference-time performance.

Let’s verify by removing the Dropout layer:

Loss curves without Dropout (Image by Author)
Predictions without Dropout (Image by Author)

Just as expected, our simple neural network is now able to solve the task.

What about Batch Normalization?

The point of BatchNorm is to normalize the activations throughout the network in order to stabilize training. During training, the normalization is done with per-batch statistics (mean and standard deviation); in prediction mode, fixed running-average statistics computed during training are used instead. It is a well-established fact that BatchNorm often accelerates training significantly.
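In PyTorch terms, the two modes of nn.BatchNorm2d behave like this (again just an illustrative toy, not the experiment’s code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=1, affine=False)   # no learnable scale/shift, just normalization
x = 5.0 * torch.randn(8, 1, 16, 16) + 10.0          # a batch far from zero-mean/unit-variance

bn.train()
y_train = bn(x)      # uses *this batch's* mean and variance -> output is ~N(0, 1)
bn.eval()
y_eval = bn(x)       # uses the running averages, which have only seen one update

print(y_train.mean().item(), y_train.std().item())  # ~0.0, ~1.0
print(y_eval.mean().item(), y_eval.std().item())    # clearly not 0/1: the running stats lag behind
```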

A feature of BatchNorm in training mode is that it changes the absolute scale of the features according to the batch statistics, which are random variables, while the relative distances between features are preserved. This is completely fine for e.g. classification and segmentation tasks, where the semantics of the image are invariant to arbitrary scaling and shifting of the channel values. Imagine, for example, that you run a batch of photos through BatchNorm. The absolute color information is lost, but the dogs will still look like dogs and the cats will still look like cats. The information in the image data is preserved, and the images can still be classified or segmented.

In other words, in a dog/cat classification problem, as long as the features from an image of a dog remain more "dog-like" than "cat-like" after Batch Normalization, we do not worry too much about absolute level of dog-likeness.
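A quick numeric sanity check of my own making: rescaling and shifting a whole batch changes nothing after Batch Normalization, because the absolute scale is exactly what gets normalized away.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=3, affine=False).train()
x = torch.randn(16, 3, 8, 8)
rescaled = 4.0 * x + 2.0                 # brighten and shift every "pixel value"

# In training mode the scale and shift are normalized away almost exactly
# (up to the small eps inside BatchNorm), so the outputs coincide.
print(torch.allclose(bn(x), bn(rescaled), atol=1e-4))  # True
```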


What about our toy example with the cylinders, then? Let’s add BatchNorm layers to the VGG11 backbone and see what happens.
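If the backbone comes from torchvision (an assumption on my part, as in the sketches above), this is as small a change as swapping in the BatchNorm variant of VGG11:

```python
from torchvision.models import vgg11_bn

model_bn = CylinderVolumeNet(use_dropout=False)      # keep Dropout out of the picture this time
model_bn.backbone = vgg11_bn(weights=None).features  # same convolutions, BatchNorm after each conv
```

Training exactly as before gives the loss curves below.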

Loss curves with BatchNorm (Image by Author)

Right… Not terrible, but not great either. The validation loss is somewhat higher than it was without BatchNorm. This is because we’re solving a problem where the absolute scale of the features does matter. If the pixel values are scaled randomly, according to batch statistics, how can the network know the height of the cylinder?

One possible fix is to make the batches large enough that the batch statistics become stable.
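With the on-the-fly batch helper from the training sketch above, that is just a matter of bumping the batch size:

```python
x, y = make_batch(rng, batch_size=256)   # larger batch -> per-batch mean/variance closer to the dataset statistics
```

Here is the result with a batch size of 256: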

Loss curves with BatchNorm (large batch) (Image by Author)

This certainly helped to stabilize things and lower the error. However, the validation error is still worse than in the example without BatchNorm.

Take-home message

So what can we learn from this example? The primary take-home message is that neural network architecture matters. We cannot simply treat a network as a black box that takes in data and magically spits out results. Nor should we expect "standard" tricks like Dropout and BatchNorm to always give better results. Their usefulness is problem-dependent!

How do we apply this knowledge in practice, then? Often, when you read papers on deep learning, you will see ablation experiments, where performance is assessed with parts of the model or elements of the training procedure stripped away. When you are developing a model to solve a new problem, try going the other way: start simple and slowly add complexity, while keeping an eye on performance as you go. If performance unexpectedly drops after adding something, try to figure out why. There may be a bug in your code, or maybe you are inadvertently throwing away important information in your data, as we did in this case.

Happy coding!


I work at the Alexandra Institute, a Danish non-profit company specializing in state-of-the-art IT solutions. Here in the Visual Computing Lab, we focus on utilizing the latest in computer vision and computer graphics research.

