The Beginner’s Guide to Gradient Descent

Chi-Feng Wang
Towards Data Science
6 min read · Jul 16, 2018


In every neural network, there are many weights and biases that connect neurons between different layers. With the correct weights and biases, the neural network can do its job well. When training the neural network, we are trying to determine the best weights and biases to improve the performance of the neural network. This process is called “learning”.

Initially, we will just give the network a set of random weights and biases. This, of course, means that the network will not be very accurate. For example, a network that distinguishes dogs from cats is not very accurate if it returns a 0.7 probability for dog and 0.3 probability for cat when given a cat image. Ideally, it ought to return 0 probability for dog and 1 for cat.

In order to tell the network how to change its weights and biases, we must first determine how inaccurate the network is. In the dogs vs. cats example, the probability it gave for cat was 0.3, while the correct value is 1. We can take the difference between the two values and square it to produce a positive number representing the network’s inability to classify cat images, like this: (0.3 − 1)² = 0.49. We can do the same with the probability of the image being a dog: (0.7 − 0)² = 0.49. Then, we add the two values up: 0.49 + 0.49 = 0.98. This is the loss of our neural network for the single photo that we passed in.

*Note: In this case of a neural network with only two outputs (i.e. cat or dog), the two squared differences are the same (here, both 0.49). But in networks with multiple outputs (such as a network that recognizes letters of the alphabet, which will have 26 outputs), each squared difference will generally be unique.
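As a quick sketch of that arithmetic, the loss for the single cat image can be computed like this (the probabilities are the hypothetical numbers from the example above):

```python
# Squared-error loss for one cat image (hypothetical network outputs).
predicted = [0.7, 0.3]   # network's output: [P(dog), P(cat)]
target    = [0.0, 1.0]   # ideal output: the image is a cat

loss = sum((p - t) ** 2 for p, t in zip(predicted, target))
print(loss)  # ≈ 0.98
```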

We can now repeat that process with multiple training images, each giving a different loss. Once we run through all our training data, we can take the average of all the losses. This average loss represents how badly the network is performing, and it is determined by the values of the weights and biases.

Therefore, the loss depends on the values of the weights and biases. We can define a loss function, which takes all the weights and biases as input and returns the average loss. In order to improve the network’s performance, we have to change the weights and biases in a way that minimizes the loss.
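A minimal sketch of such a loss function, assuming a hypothetical `predict` helper that stands in for the network’s forward pass on one image:

```python
def average_loss(weights, biases, dataset):
    """Mean squared-error loss over the whole training set."""
    total = 0.0
    for image, target in dataset:
        # `predict` is an assumed helper, not defined in this article
        output = predict(weights, biases, image)
        total += sum((p - t) ** 2 for p, t in zip(output, target))
    return total / len(dataset)
```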

Even though the loss function has many inputs, it is easier to visualize it as a function with one input (a single weight) and one output. In the diagram below, the vertical axis, L(w), represents the value of the loss, which depends on the weight. The horizontal axis, w, represents the value of the weight (in reality there are many weights, but imagining just one makes the function easier to picture).

Image 1: Loss function

For the initial random weight that we give, we are at a certain point on the loss function. To minimize the loss, we have to take a step towards the local minimum. We can do this by taking the derivative (the slope) of the function at that point, taking a step to the left if the derivative is positive and a step to the right if it is negative. In the example above, we see that the derivative at that point is positive, so we take a step to the left (reduce the value of the weight). In mathematical terms, if the original weight is wₙ, the new weight is wₙ₊₁ = wₙ − r · dL/dwₙ, where dL/dwₙ is the derivative of the loss at wₙ. Here, r is the learning rate, or the size of the step that we take. (Note the minus sign: subtracting the derivative is exactly what moves us left when the slope is positive and right when it is negative.)
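In code, a single update is one line; here `dL_dw` stands for the derivative of the loss at the current weight (a sketch, not a full training loop):

```python
def gradient_descent_step(w, dL_dw, r):
    # Step against the slope: w decreases when the derivative is
    # positive (move left) and increases when it is negative (move right).
    return w - r * dL_dw
```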

Image 2: Taking a step down

After reducing the weight and running through another batch of images, we now have a lower loss. Once again, we take the derivative and change the weights accordingly, taking another step towards the local minimum.

Image 3: Taking another step towards the local minimum
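To make the repetition concrete, here is a toy one-weight example (my own, not the article’s network): the loss L(w) = (w − 3)² has derivative 2(w − 3), so repeated steps walk the weight toward the minimum at w = 3:

```python
w = 10.0   # initial random weight
r = 0.1    # learning rate

for step in range(5):
    dL_dw = 2 * (w - 3)            # derivative of L(w) = (w - 3)**2
    w = w - r * dL_dw
    print(step, round(w, 3), round((w - 3) ** 2, 3))
# each step moves w closer to 3, and the loss shrinks
```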

Let us repeat the same steps one more time.

Image 4: Stepping too far

Oh no! This time, we’ve stepped too far. The derivative now has the opposite sign, so the next step takes us back toward the local minimum.

The smaller the learning rate (the size of the steps we take), the more accurately the network will settle into the local minimum. However, small learning rates require many steps to reach the minimum, which takes a lot of time. Larger learning rates, on the other hand, are faster but less accurate. How can we determine the best learning rate for a neural network?
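We can watch this trade-off on the same toy loss (the exact threshold where steps become “too large” depends on the curvature of the loss, so treat these numbers as illustrative only):

```python
def final_loss(r, w=10.0, steps=10):
    for _ in range(steps):
        w -= r * 2 * (w - 3)       # gradient step on L(w) = (w - 3)**2
    return (w - 3) ** 2

print(final_loss(0.01))   # tiny steps: after 10 steps, still far away
print(final_loss(0.1))    # moderate steps: close to the minimum
print(final_loss(1.1))    # oversized steps: we overshoot and the loss blows up
```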

A 2015 paper, Cyclical Learning Rates for Training Neural Networks by Leslie Smith, introduces a way to determine a good learning rate for a neural network. With this method, we first begin by taking a small step: shift the weights by a tiny amount. Then, the second time we run through a batch of images, we shift the weights by a slightly larger amount (take a larger step). With each run, we increase the learning rate, taking larger and larger steps until we cross the local minimum and end up with a greater loss. This is illustrated in the diagram below:

Image 5: Determining learning rate through taking larger and larger steps

In the beginning, the steps are small. Then, we gradually increase the degree to which we change the weights, until we step past the local minimum and jump out of the local trough. If we plot the learning rate against time, it would look something like this:

Image 6: Increasing learning rate

The curve keeps rising because the learning rate increases with every new batch of images we run through.
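A rough sketch of that procedure in code; `train_one_batch` is a hypothetical helper that trains on one batch at the given learning rate and returns the loss, and the exponential schedule and stopping rule are my assumptions rather than details from the paper:

```python
def lr_range_test(lr_min=1e-5, lr_max=1.0, num_batches=100):
    """Grow the learning rate each batch and record the resulting loss."""
    lrs, losses = [], []
    for i in range(num_batches):
        # increase the learning rate exponentially from lr_min to lr_max
        lr = lr_min * (lr_max / lr_min) ** (i / (num_batches - 1))
        loss = train_one_batch(lr)   # hypothetical: one batch at this rate
        lrs.append(lr)
        losses.append(loss)
        if loss > 4 * min(losses):   # stop once the loss clearly diverges
            break
    return lrs, losses
```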

If we plot the loss against the learning rate, it would look something like this:

Image 7: Loss as we increase learning rate every step

Initially, the loss decreases slowly, but as the learning rate increases, the loss starts to fall faster, until the learning rate becomes too big and we jump out of the local minimum, at which point the loss starts increasing rapidly again.

Smith suggested that we pick the learning rate at which the loss is decreasing the fastest. In other words, we pick the spot on the graph of loss against learning rate where the loss is still decreasing and the downward slope is steepest.

Image 8: Optimal learning rate

Note that we are not picking the learning rate with the lowest loss, since that may be the point where we have just jumped out of the local minimum.
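Given the lists of learning rates and losses recorded by a range test like the sketch above, picking that steepest point might look like this (a sketch using a numerical slope; in practice the loss curve is usually smoothed first):

```python
import numpy as np

def pick_learning_rate(lrs, losses):
    # slope of the loss with respect to log(learning rate)
    slopes = np.gradient(np.array(losses), np.log(lrs))
    # choose the steepest descent (most negative slope), not the lowest loss
    return lrs[int(np.argmin(slopes))]
```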

Of course, in the loss function graphs above, I’ve plotted only one weight to better visualize the function. In actuality, there are many weights and biases that all influence the loss of the neural network. The method of gradient descent, and of finding the learning rate, remains the same, however.
