Why default CNNs are broken in Keras and how to fix them

A deep dive into CNN initialization…

Nathan Hubens
Towards Data Science


Last week I was running some experiments with a VGG16 model trained on the CIFAR10 dataset. I needed to train the model from scratch, so I didn’t use the version pretrained on ImageNet.

So I launched the training for 50 epochs, went to grab a coffee, and came back to these learning curves:

The model hasn’t learned anything!

I had already seen networks converge really slowly, oscillate, overfit, or diverge, but this was the first time I saw a network do nothing at all. So I dug in a little to find out what was happening.

Experiments

This is how I created my model. It follows the original VGG16 architecture, but most of the fully-connected layers are removed so that pretty much only convolutions remain.
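A minimal sketch of such a model in tf.keras could look like the following (the build_vgg16 helper, the SGD optimizer, and the exact head are illustrative choices, not necessarily the exact code I used):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_vgg16(initializer='glorot_uniform'):
    """VGG16-style stack of 3x3 convolutions for 32x32x3 CIFAR10 images."""
    model = Sequential()
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu',
                     kernel_initializer=initializer, input_shape=(32, 32, 3)))
    # (extra conv layers, filters) for the remaining VGG16 blocks
    for n_convs, filters in [(1, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(Conv2D(filters, (3, 3), padding='same', activation='relu',
                             kernel_initializer=initializer))
        model.add(MaxPooling2D((2, 2)))
    # Most of the original fully-connected head is removed
    model.add(Flatten())
    model.add(Dense(10, activation='softmax', kernel_initializer=initializer))
    model.compile(optimizer='sgd', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_vgg16()  # default Keras initialization
```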

Let’s now try to understand what could lead to the training curves I showed you at the beginning of this post.

When something is wrong with a model’s learning, it is often a good idea to check how the gradients are behaving. We can get their mean and standard deviation at each layer with:
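One way to do this in tf.keras is to compute the gradients on a single batch with a GradientTape (x_batch and y_batch stand for a batch of CIFAR10 images and their one-hot labels; this is a sketch, not necessarily my exact code):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()

# Gradients of the loss w.r.t. every trainable weight, on a single batch
with tf.GradientTape() as tape:
    predictions = model(x_batch, training=True)
    loss = loss_fn(y_batch, predictions)
gradients = tape.gradient(loss, model.trainable_weights)

# Mean and standard deviation of the gradients, layer by layer
for weight, grad in zip(model.trainable_weights, gradients):
    grad = grad.numpy()
    print(f'{weight.name}: mean={grad.mean():.2e}, std={grad.std():.2e}')
```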

And by plotting them, we have:

Stats of gradients of VGG16 initialized with Glorot uniform

Wow… there is no gradient at all flowing in my model. Maybe we should check how the activations evolve along the layers. We can get their mean and standard deviation with:
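A simple way to do this is to build a second model that returns the output of every convolutional layer and run a batch through it (again a sketch, reusing the x_batch from above):

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv2D

# A model that outputs the activations of every convolutional layer
conv_layers = [layer for layer in model.layers if isinstance(layer, Conv2D)]
activation_model = Model(inputs=model.input,
                         outputs=[layer.output for layer in conv_layers])

# Mean and standard deviation of the activations, layer by layer
activations = activation_model.predict(x_batch)
for layer, act in zip(conv_layers, activations):
    print(f'{layer.name}: mean={act.mean():.3f}, std={act.std():.3f}')
```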

And then, if we plot them:

Stats of activations of VGG16 initialized with Glorot uniform

Here is what is happening! The activations vanish as we go deeper into the network.

To remind you, the gradient for each convolutional layer is computed by:
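In the notation of the Kaiming paper [1], this is:

$$\Delta x_l = \hat{W}_l \, \Delta y_l$$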

where Δx and Δy denote the gradients ∂L/∂x and ∂L/∂y, and Ŵ is the layer’s weight matrix rearranged for back-propagation. Gradients are computed with the backpropagation algorithm and the chain rule, meaning that we start from the last layers and backpropagate towards the early layers. But what happens if the activations of our last layers go towards 0? Exactly what we have here: the gradients are equal to 0 everywhere, nothing can backpropagate, and the network is unable to learn anything.

As my network was pretty bare (no Batch Normalization, no Dropout, no Data Augmentation, …), I guessed the problem came from a poor initialization, so I read the Kaiming paper [1]. I will briefly describe what it says.

Initialization method

Initialization has always been an important field of research in deep learning, especially with architectures and non-linearities constantly evolving. A good initialization is actually the reason we can train deep neural networks.

Here are the main takeaways of the Kaiming paper, where the authors derive the conditions an initialization must satisfy for a CNN with ReLU activation functions to be properly initialized. A little bit of math is required, but don’t worry, you should be able to grasp the main ideas. The equations below follow the paper’s derivation and notation.

Let’s consider the output of a convolutional layer l:
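$$y_l = W_l \, x_l + b_l$$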

Then, if the biases are initialized to 0, and under the assumption that the weights w and the inputs x are independent of each other, with the elements of each mutually independent and sharing the same distribution, we have:
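$$\text{Var}[y_l] = n_l \, \text{Var}[w_l \, x_l]$$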

With n_l the number of weights in layer l (i.e. n_l = k²c, where k is the kernel size and c the number of input channels). By the following property of the variance of a product of independent variables:
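$$\text{Var}[w_l \, x_l] = \text{Var}[w_l]\,\text{Var}[x_l] + \mathbb{E}[w_l]^2\,\text{Var}[x_l] + \mathbb{E}[x_l]^2\,\text{Var}[w_l]$$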

It becomes:
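$$\text{Var}[y_l] = n_l \left( \text{Var}[w_l]\,\text{Var}[x_l] + \mathbb{E}[w_l]^2\,\text{Var}[x_l] + \mathbb{E}[x_l]^2\,\text{Var}[w_l] \right)$$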

Then, if we initialize the weights w so that they have a mean of 0, it gives:
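$$\text{Var}[y_l] = n_l \, \text{Var}[w_l] \left( \text{Var}[x_l] + \mathbb{E}[x_l]^2 \right)$$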

By the König-Huygens property:
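$$\mathbb{E}[x_l^2] = \text{Var}[x_l] + \mathbb{E}[x_l]^2$$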

It finally gives:
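$$\text{Var}[y_l] = n_l \, \text{Var}[w_l] \, \mathbb{E}[x_l^2]$$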

However, since we are using a ReLU activation function, we have:
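$$\mathbb{E}[x_l^2] = \frac{1}{2}\,\text{Var}[y_{l-1}]$$

(This comes from x_l = max(0, y_{l−1}): with zero-mean weights, y_{l−1} is symmetrically distributed around 0, so only half of its second moment survives the ReLU.)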

Thus:
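$$\text{Var}[y_l] = \frac{1}{2}\, n_l \, \text{Var}[w_l] \, \text{Var}[y_{l-1}]$$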

This was the variance of the output of a single convolutional layer, but if we want to take all of the layers into account, we have to take the product over all of them, which gives:
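$$\text{Var}[y_L] = \text{Var}[y_1] \left( \prod_{l=2}^{L} \frac{1}{2}\, n_l \, \text{Var}[w_l] \right)$$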

As we have a product, it is now easy to see that if the factor contributed by each layer is not close to 1, the network can rapidly degenerate. Indeed, if it is smaller than 1, the activations will rapidly vanish towards 0, and if it is bigger than 1, their value will grow indefinitely, possibly becoming so large that your computer cannot represent it (NaN). So, in order to have a well-behaved ReLU CNN, the following condition must be carefully respected:
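$$\frac{1}{2}\, n_l \, \text{Var}[w_l] = 1 \quad \forall\, l \;\;\Longrightarrow\;\; \text{Var}[w_l] = \frac{2}{n_l}$$

This is exactly what the Kaiming (He) initialization does: it draws the weights from a zero-mean distribution with variance 2/n_l.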

The authors compared what happens when you train a deep CNN initialized with what was the standard initialization at the time (Xavier/Glorot) [2] against the same CNN initialized with their solution.

Comparison of the training of a 22-layer ReLU CNN initialized with Glorot (blue) or Kaiming (red). The one initialized with Glorot doesn’t learn anything

Does this graph seem familiar? It is exactly what I witnessed and showed you at the beginning! The network trained with Xavier/Glorot initialization doesn’t learn anything.

Now guess which one is the default initialization in Keras?

That’s right! By default in Keras, convolutional layers are initialized following a Glorot Uniform distribution:
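In tf.keras terms, leaving the initializer unspecified is equivalent to writing it out explicitly:

```python
from tensorflow.keras.layers import Conv2D

# kernel_initializer='glorot_uniform' is the default for Conv2D
Conv2D(64, (3, 3), padding='same', activation='relu',
       kernel_initializer='glorot_uniform')
```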

So what happens if we now change the initialization to the Kaiming Uniform one?

Using Kaiming Initialization

Let’s recreate our VGG16 model, but this time with the initialization changed to he_uniform.
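With the illustrative build_vgg16 helper sketched earlier, that only means changing the kernel_initializer argument of every layer:

```python
# Same architecture, but every kernel drawn from the He (Kaiming) uniform distribution
model = build_vgg16(initializer='he_uniform')
```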

Let’s now check the activations and gradients before training our model.

So now, with Kaiming initialization, our activations have a mean around 0.5 and a standard deviation of around 0.8.

We can see that now we have some gradients, which is a good thing if we want our network to learn something.

Now, if we train our new model, we get the following curves:

We probably need to add some regularization now, but hey, that’s still better than before, right?

Conclusion

In this post, we saw that initialization can be a VERY important part of your model, and one that is often overlooked. We also saw that the defaults you get in libraries, even excellent ones like Keras, are not to be taken for granted.

I hope that this blog post helped you! You probably won’t forget to correctly initialize your networks anymore. Feel free to give me feedback or ask me questions if something is not clear enough.
