Oftentimes the bottleneck in a neural-network-based project isn’t the network implementation. Rather, after you’ve written all the code and tried a whole bunch of hyperparameter configurations, sometimes the network just won’t work. I’ve been there before. After some time dealing with finicky networks, I’ve collected a few methods that have helped me debug them. These methods aren’t a guarantee of any sort – even if you do everything I suggest, there’s a chance your network will still be broken. I hope, however, that these tips will, in the long run, decrease the time you spend debugging your neural networks.
Check for gradient issues:
Sometimes the gradient is the cause of the problem. There are several useful gradient-related debugging methods:
- Numerically compute the gradient for each weight. This is commonly called "gradient checking" and is useful for ensuring the gradient is being computed correctly. One way to do this is to use finite differences; a minimal sketch appears after this list. More details can be found here.
- For each weight, compare the magnitude of the gradient to the magnitude of the weight. We want to make sure the ratio of the magnitudes is reasonable. If the gradient magnitude is much smaller than the weight magnitude, the network will take forever to train. If the gradient magnitude is about the same as or larger than the weight magnitude, the network will be very unstable and probably won’t train at all.
- Check for exploding or vanishing gradients. If you see a gradient go to 0 or to NaN/infinity, you can be sure the network will not train correctly. First figure out why the gradient is exploding or vanishing; for example, the step size may be too big. Once you know the cause, there are various fixes: adding residual connections to propagate the gradient better, say, or simply using a smaller network.
- Activation functions can also cause exploding/vanishing gradients. For example, if the magnitude of the input to a sigmoid activation function is too large, the gradient will be very close to 0. Check the inputs to your activation functions over time, and make sure those inputs won’t cause the gradients to be consistently 0 or consistently huge. The monitoring sketch after this list shows one way to watch for both problems.
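To make the first bullet concrete, here is a minimal numpy sketch of gradient checking with central finite differences. It assumes you can evaluate your loss as a function of a flat weight array; `loss_fn` and the error thresholds mentioned below are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-5):
    """Approximate dL/dw one weight at a time with central differences."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        orig = w.flat[i]
        w.flat[i] = orig + eps
        loss_plus = loss_fn(w)
        w.flat[i] = orig - eps
        loss_minus = loss_fn(w)
        w.flat[i] = orig                     # restore the original weight
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def relative_error(g_numeric, g_analytic):
    """Elementwise relative error between numeric and analytic gradients."""
    denom = np.maximum(1e-8, np.abs(g_numeric) + np.abs(g_analytic))
    return np.abs(g_numeric - g_analytic) / denom
```

If the relative error is tiny (around 1e-7), your analytic gradient is almost certainly right; errors around 1e-2 or worse usually mean a bug.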
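And here is a rough PyTorch sketch of the monitoring described in the other bullets: comparing each gradient’s magnitude to its weight’s magnitude, flagging NaN/infinite values, and watching the inputs to a saturating activation. The ratio thresholds and the limit of 5.0 are rules of thumb I’m assuming for illustration, not universal constants.

```python
import torch

def audit_gradients(model, ratio_low=1e-6, ratio_high=1e-1):
    """Call after loss.backward() to flag suspicious gradients."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            print(f"{name}: gradient contains NaN/inf")
            continue
        g, w = p.grad.norm().item(), p.detach().norm().item()
        if w > 0 and g / w < ratio_low:
            print(f"{name}: gradient/weight ratio {g / w:.2e}, possibly vanishing")
        elif w > 0 and g / w > ratio_high:
            print(f"{name}: gradient/weight ratio {g / w:.2e}, possibly exploding")

def watch_activation_inputs(module, limit=5.0):
    """Forward hook: warn when a sigmoid's inputs sit in its flat regions."""
    def hook(mod, inputs, output):
        peak = inputs[0].abs().max().item()
        if peak > limit:                     # sigmoid gradient is ~0 out here
            print(f"{mod.__class__.__name__}: input magnitude {peak:.1f}, "
                  f"gradients may saturate")
    module.register_forward_hook(hook)
```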
Check training progress often:
Checking your network’s training progress frequently will save you time. For example, assume you are training a network to play the Snake game. Instead of training the network for days at a time and then checking whether it has learned anything, run the game with the current learned weights every ten minutes. After several hours, if you notice that the agent is doing the same thing every time and getting zero reward, you know something might be wrong, and you’ve saved yourself days of wasted training time.
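As a minimal sketch of this habit, here is what such a periodic check might look like inside a training loop. Everything here (`training_batches`, `train_step`, `play_episode`, `model`) is a hypothetical stand-in for your own code:

```python
import time

EVAL_EVERY_SECONDS = 600                          # check progress every ten minutes
last_eval = time.time()

for step, batch in enumerate(training_batches):   # hypothetical data source
    train_step(model, batch)                      # hypothetical update step
    if time.time() - last_eval > EVAL_EVERY_SECONDS:
        reward = play_episode(model)              # watch the agent play Snake
        print(f"step {step}: episode reward = {reward}")
        last_eval = time.time()
```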
Don’t rely on quantitative outputs:
If you only look at quantitative outputs, you could miss useful debugging information. For example, when training a network for speech translation, read the translated output to make sure it actually makes sense; don’t just check whether the evaluation metric is decreasing. As another example, when training a network for image recognition, hand-check the labels the network assigns. The reason you shouldn’t rely on quantitative outputs is twofold. First, there could be an error in your evaluation function; if you only look at the number it outputs, it could be weeks before you realize something is wrong. Second, you may find patterns of mistakes in your network’s output that wouldn’t show up quantitatively. For example, you might realize that one particular word is always mistranslated, or that the image recognition network is always wrong in the top-left quadrant. These observations in turn can help you find bugs in the data processing part of your code that would otherwise go unnoticed.
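A tiny sketch of this kind of spot check, with `model` and `eval_pairs` as hypothetical stand-ins for your own model and held-out (input, label) pairs:

```python
def show_samples(model, eval_pairs, n=5):
    """Print a handful of raw predictions for eyeballing, next to the metric."""
    for x, y in eval_pairs[:n]:
        pred = model(x)                     # hypothetical model call
        print(f"input:     {x!r}")
        print(f"expected:  {y!r}")
        print(f"predicted: {pred!r}\n")
```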
Try a small dataset:
Another way to determine whether your code has a bug or your data is just hard to learn is to fit a smaller dataset first. For example, instead of having 100,000 training examples in your dataset, trim it down to 100 examples or even a single training example. In these cases you expect the network to fit the data extremely well, especially in the case of one training example. If your network still has high error on this tiny training set, you should be almost certain that something is wrong with your network code.
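Here is a minimal PyTorch sketch of this trick. `full_dataset`, `model`, and `train_step` are hypothetical placeholders for your own dataset, model, and update step:

```python
from torch.utils.data import Subset, DataLoader

tiny = Subset(full_dataset, range(100))     # or range(1) for one example
loader = DataLoader(tiny, batch_size=100, shuffle=True)

for epoch in range(500):                    # many passes over the tiny set
    for x, y in loader:
        loss = train_step(model, x, y)      # hypothetical update step
    print(f"epoch {epoch}: loss = {loss:.4f}")
# The loss should drive toward ~0 on 100 (or 1) examples.
# If it plateaus high, suspect a bug rather than hard data.
```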
Try a less complicated network:
If your full-size network is having trouble training, try a smaller network with fewer layers. This has the added benefit of training faster. If the smaller network succeeds where the full-size network fails, that suggests the architecture of the full-size model is too complicated. If both the simple and the full-size network fail, you might have a bug in your code.
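For instance, here is a sketch of a stripped-down twin for a hypothetical fully connected classifier; the layer sizes are illustrative, not a recommendation:

```python
import torch.nn as nn

# Full-size model (hypothetical architecture).
full_model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Stripped-down version: same task, far fewer layers and units.
small_model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
```

Train both on the same data: if `small_model` learns and `full_model` doesn’t, suspect the architecture; if neither learns, suspect your code.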
If you aren’t using a framework, check against a framework:
If you’ve written the code for the neural network from scratch instead of using a machine learning framework, there’s a chance something is wrong with your implementation. Fortunately, you can check whether this is the case by coding up the same network architecture in a machine learning framework. Then put print statements in both your implementation and the framework version, and compare the outputs until you find where the difference starts occurring. That is where your error is. For example, say you have a network with ten layers and the error is in the seventh layer. Print the output of the first layer in your network and compare it to the first layer’s output in the framework implementation: they will be the same. So you move on to comparing the outputs of the second layer. Still the same. Then the third layer, and so on, until you see the differences start at the seventh layer; therefore you can infer that the seventh layer is the problem. Note that this method only works for the first iteration of training, because the second iteration and beyond will have different starting points due to the differences in the first iteration’s output.
The example above assumes the error occurs in the forward pass. The same idea works if the error occurs during backpropagation: print the gradients for the weights layer by layer, starting from the last layer, until you see a difference between the framework’s gradients and your implementation’s. A sketch of both comparisons follows.
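Here is a rough sketch of the forward-pass comparison against a PyTorch twin, on a single fixed input, for the first iteration only. `torch_model` and `my_layer_outputs` are hypothetical: the former is a framework re-implementation of your architecture (with your weights copied over so both start identically), and the latter is a list of per-layer outputs collected from your from-scratch code:

```python
import numpy as np
import torch

# Capture each framework layer's output with forward hooks.
framework_outputs = []
for layer in torch_model:                  # hypothetical nn.Sequential twin
    layer.register_forward_hook(
        lambda mod, inp, out: framework_outputs.append(out.detach().numpy())
    )

torch_model(torch.from_numpy(x).float())   # x: the same fixed input fed to both

# Walk the layers front to back; the first mismatch localizes the bug.
for i, (mine, theirs) in enumerate(zip(my_layer_outputs, framework_outputs)):
    if not np.allclose(mine, theirs, atol=1e-5):
        print(f"outputs diverge at layer {i}")
        break

# For backpropagation, apply the same loop to per-layer weight gradients
# (p.grad in the framework model), walking from the last layer backwards.
```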