
The Vanishing/Exploding Gradient Problem in Deep Neural Networks

Understanding the obstacles we face when building deep neural networks

A difficulty we face when training deep neural networks is that of vanishing or exploding gradients. For a long time, this obstacle was a major barrier to training large networks. By the end of this article, you will not only know what the vanishing and exploding gradient problems are, but also how to detect when you are facing them, along with some possible solutions for debugging your model.

Photo by Jingda Chen on Unsplash

What is the Problem?

When training a deep neural network with gradient-based learning and backpropagation, we find the partial derivatives by traversing the network from the final layer (y_hat) back to the initial layer. Using the chain rule, the derivatives of layers deeper into the network are computed through continuous matrix multiplications of the derivatives of the layers above them.

In a network of n hidden layers, n derivatives will be multiplied together. If the derivatives are large, then the gradient will increase exponentially as we propagate down the model until it eventually explodes; this is what we call the exploding gradient problem. Alternatively, if the derivatives are small, then the gradient will decrease exponentially as we propagate through the model until it eventually vanishes; this is the vanishing gradient problem.
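To make this concrete, here is a minimal numeric sketch in Python (the per-layer derivative values of 1.5 and 0.5 are illustrative assumptions, not measurements from a real network) showing how a product of 50 such factors either blows up or collapses toward zero:

```python
import numpy as np

n_layers = 50

# Each factor stands in for the derivative contributed by one layer.
exploding = np.prod(np.full(n_layers, 1.5))  # grows exponentially
vanishing = np.prod(np.full(n_layers, 0.5))  # shrinks exponentially

print(f"50 factors of 1.5: {exploding:.3e}")  # ~6.4e+08
print(f"50 factors of 0.5: {vanishing:.3e}")  # ~8.9e-16
```

A gradient scaled by 1e-16 barely moves the early layers' weights, while one scaled by 1e+08 swamps them.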

In the case of exploding gradients, the accumulation of large derivatives results in a model that is very unstable and incapable of effective learning. The large updates to the model's weights create an unstable network, and at extreme values the weights become so large that they overflow, resulting in NaN weight values which can no longer be updated. On the other hand, the accumulation of small gradients results in a model that is incapable of learning meaningful insights, since the weights and biases of the initial layers, which tend to learn the core features from the input data (X), will not be updated effectively. In the worst-case scenario the gradient will be 0, which in turn stops the network from training any further.


How to Know?

Exploding Gradients

There are a few subtle signs that you may use to determine whether your model is suffering from the problem of exploding gradients:

Photo by Yosh Ginsu on Unsplash
  • The model is not learning much on the training data, resulting in a poor loss.
  • The model will have large changes in loss on each update due to the model's instability.
  • The model's loss will be NaN during training.

When faced with these problems, there are some much more transparent signs you can check for to confirm that the cause is exploding gradients (see the monitoring sketch after this list), for instance:

  • Model weights grow exponentially and become very large when training the model.
  • The model weights become NaN in the training phase.
  • The derivatives are consistently large during training.
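As a hedged sketch of how you might check for these signs (assuming a PyTorch model on which loss.backward() has just been called; the 1e3 threshold is an arbitrary illustrative choice):

```python
import torch

def check_exploding_gradients(model, threshold=1e3):
    # Inspect each parameter's gradient after loss.backward() and flag
    # NaNs or unusually large norms; the threshold is illustrative.
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if torch.isnan(param.grad).any():
            print(f"{name}: gradient contains NaN")
        elif param.grad.norm().item() > threshold:
            print(f"{name}: gradient norm {param.grad.norm().item():.2e}")
```

Calling this between loss.backward() and optimizer.step() every few batches makes the exponential growth visible long before the loss itself turns NaN.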

Vanishing Gradients

There are also ways to detect whether your deep network is suffering from the vanishing gradient problem (a per-layer check is sketched after the list below):

Photo by David Besh on Unsplash
  • The model will improve very slowly during the training phase, and it is also possible that training stalls very early, with any further training failing to improve the model.
  • The weights closer to the output layer of the model will see larger changes, whereas the layers closer to the input layer will not change much (if at all).
  • Model weights shrink exponentially and become very small when training the model.
  • The model weights become 0 in the training phase.
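As a hedged sketch of the per-layer check (the toy sigmoid network below exists purely for illustration), you can print each parameter's gradient norm after a backward pass. In PyTorch, named_parameters() yields layers in input-to-output order, so vanishing gradients show up as the first entries being orders of magnitude smaller than the last:

```python
import torch
import torch.nn as nn

# A deliberately deep sigmoid stack, a setup prone to vanishing
# gradients because each sigmoid derivative is at most 0.25.
layers = []
for _ in range(8):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers.append(nn.Linear(32, 1))
model = nn.Sequential(*layers)

loss = model(torch.randn(16, 32)).pow(2).mean()
loss.backward()

for name, param in model.named_parameters():
    print(f"{name}: grad norm {param.grad.norm().item():.2e}")
```

If the early layers' norms sit many orders of magnitude below the later ones, the gradient is vanishing on its way back to the input.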

Solutions

There are many approaches to addressing exploding and vanishing gradients; this section lists three that you can use.

1. Reducing the Number of Layers

This solution can be used in both scenarios (exploding and vanishing gradients). However, by reducing the number of layers in our network, we give up some of our model's complexity, since having more layers makes the network more capable of representing complex mappings.

2. Gradient Clipping (Exploding Gradients)

Checking for and limiting the size of the gradients whilst our model trains is another solution. Going into the details of this technique is beyond the scope of this article, but you can read more about gradient clipping in an article by Wanshun Wong titled What is Gradient Clipping.
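Still, a minimal sketch of the idea may help (assuming a PyTorch training loop; the toy model, data, and max_norm value are illustrative choices):

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Rescale all gradients in place so their combined norm cannot
    # exceed max_norm, capping the size of any single update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

The key detail is that clipping happens after loss.backward() computes the gradients but before optimizer.step() applies them.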

3. Weight Initialization

A more careful choice of random initialization for your network tends to be a partial solution, since it does not solve the problem completely. Check out this article by James Dellinger: Weight Initialization in Neural Networks: A journey from the basics to Kaiming.
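As a short hedged sketch (assuming PyTorch; Kaiming initialization, which the linked article builds up to, is one common choice for ReLU networks):

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming (He) initialization is designed to keep the variance of
    # activations roughly constant from layer to layer under ReLU,
    # which helps gradients stay at a workable scale.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
model.apply(init_weights)  # apply() visits every submodule recursively
```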

Conclusion

In this post we learnt what exploding and vanishing gradients are, how to detect them, and some solutions to the problem. I did not go into much detail about RNN architectures, which are particularly prone to vanishing gradients; useful resources to learn more about that are linked below.

Other Resources

Chi-Feng Wang: The Vanishing Gradient Problem

Eniola Alese: The curious case of the vanishing & exploding gradient

