
Full implementation of Gradient Descent – no holding back

Deep dive into the mathematics of the gradient descent algorithm.

This post derives the algorithm for implementing Gradient Descent. It is messy, ugly, and tedious! It is a fun exercise deriving it yourself, but in the end, it’s just identifying patterns. Don’t let this page make your head explode and make you give up on AI! Just knowing how to use the algorithm is most important (even then it can be argued that you can just use Keras), but why not try it yourself and have some fun! I include the derivation here just because I know I would have loved to see it when I was learning AI.

We now know that we must find the gradient of the error function with respect to each weight to minimize our error (if you have no idea what I just said, learn/review how to train a neural network here). That’s all fine, but how do we actually do this in code? You can imagine how difficult it would be to find the gradient if we have a neural network with 5, 6, or more layers, each with hundreds or thousands of neurons.

We know what to find (the gradient), now we just need to figure out how to find it.

We could just use the definition of the derivative. Go through each weight and increase it by a tiny amount, then divide the resulting change in the error by this tiny amount. This will give us ∂E/∂W for this particular weight. Now if we did this for all of our weights, we would have the gradient.
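Sketched in code, this finite-difference idea looks something like the following. It is only a rough illustration: the `network` object, its flat `weights` array, and its `error(inputs, targets)` method are hypothetical names I am assuming here, not something from this series.

```python
import numpy as np

def numerical_gradient(network, inputs, targets, eps=1e-5):
    """Approximate dE/dw for every weight by nudging each one by a tiny amount.
    `network.weights` (a flat NumPy array) and `network.error(...)` are assumed
    placeholders for whatever structure you actually use."""
    grad = np.zeros_like(network.weights)
    base_error = network.error(inputs, targets)
    for i in range(len(network.weights)):
        network.weights[i] += eps                                    # nudge one weight
        grad[i] = (network.error(inputs, targets) - base_error) / eps
        network.weights[i] -= eps                                    # undo the nudge
    return grad
```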

But to put things in perspective, a neural network I trained to play Atari games had 1,685,667 weights, and I had to update them about 10 million times. Using the definition of the derivative requires a full forward pass for every single weight on every update, which would just take way too long for a network like this.

Since neural networks are highly repetitive structures, it seems that we should be able to find an algorithm that computes the gradient for us efficiently. To create this algorithm, we will first write out, by hand, the forward and update phase of a relatively small, but deep, neural network. We will then look for patterns in this process to try to simplify it as much as possible.

This is the neural network we will be using to derive our gradient descent algorithm:

This time we are implementing the bias. Note that the above implementation is the exact same as just adding a bias onto each neuron’s weighted sum, as long as we keep our bias ‘input’ at 1. The bias is blue, and anything associated with the bias will be blue on this page. Image by author

Forward phase

First off, it is very important to remember that we draw a neural network as this computational graph in order to help us visualize what is happening. But if we are going to find our gradient descent algorithm, we will need to write out the computations being done. And we have a lot of parameters here! The huge number of parameters really makes for some ugly indexing!

Let me write out all the parameters we have in this neural network, and then let you in on my convention for indexing them.

Image by author

Xbar above is a list of the ‘inputs’ for each layer. Notice that the size of each of the lists in Xbar corresponds to the number of neurons in each layer. Xbar[0] is the initial input data with a bias, Xbar[1] is the activated weighted sum of X[0] with a concatenated bias, Xbar[2] is the activated weighted sum of X[1] with a concatenated bias, and so on.

Wbar is a list of all the weight matrices that connect each layer to the next. Each of these matrices should have shape:

(# of neurons in the next layer excluding bias, # of neurons in the current layer including bias).

For example, the first weight matrix has 3 rows since there are 3 neurons in the next layer (excluding the bias). It has 3 columns because there are 2 inputs and 1 bias.
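To make those shapes concrete, here is a minimal NumPy sketch. The layer sizes beyond the first transition (2 inputs feeding 3 neurons, as in the figure) are my own illustrative choice, and putting the bias ‘input’ of 1 at the front of each entry of Xbar is just a convention I am assuming:

```python
import numpy as np

# Illustrative layer sizes: 2 inputs, two hidden layers of 3 neurons, 1 output.
# Only the first transition (2 inputs -> 3 neurons) is pinned down by the figure above.
layer_sizes = [2, 3, 3, 1]

# Wbar[l] has shape (# of neurons in the next layer excluding bias,
#                    # of neurons in the current layer including bias)
Wbar = [np.random.randn(layer_sizes[l + 1], layer_sizes[l] + 1)
        for l in range(len(layer_sizes) - 1)]

x = np.array([0.5, -0.2])            # some example input
Xbar = [np.concatenate(([1.0], x))]  # Xbar[0]: bias 'input' of 1, then the inputs

for l, W in enumerate(Wbar):
    print(f"Wbar[{l}] shape: {W.shape}")   # (3, 3), (3, 4), (1, 4)
```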

Now, the top-left index tells which layer the particular element is in. The top-right index tells which input neuron in the layer we are talking about, and the bottom-right index tells which neuron that input is going to.

X is the input to the layer, w is the weights of the layer, z is the weighted sum before being passed through the activation function. Image by author

Now that we are on the same page as far as notation goes, here is the forward phase of the neural network, just going from left to right, top to bottom, writing out the computations:

a(z) is our activation function applied to z. Take a step back and just look at what is happening at a high level. Don’t let your head explode trying to simultaneously be conscious of every detail! Image by author

The top row of computations brings us from our first layer of inputs to our second layer of inputs. The second row brings us from our second layer of inputs to the third layer of inputs, and so on until we reach an output.
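As a rough code version of that sweep, continuing the hypothetical Wbar and Xbar from the sketch above and picking the sigmoid as the activation a (the article only commits to the sigmoid later on):

```python
import numpy as np

def a(z):
    """Activation function; the sigmoid is used here as an example."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(Wbar, xbar0):
    """Forward phase: each row of computations in the figure is one loop iteration."""
    Xbar = [xbar0]           # xbar0 is the input with the bias 'input' of 1 in front
    Zs = []                  # weighted sums, kept because the backward phase needs them
    for l, W in enumerate(Wbar):
        z = W @ Xbar[l]      # weighted sum for this layer
        Zs.append(z)
        x = a(z)             # apply the activation
        if l < len(Wbar) - 1:
            x = np.concatenate(([1.0], x))   # concatenate the next bias 'input' of 1
        Xbar.append(x)
    return Xbar, Zs
```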

Backward phase

Ok, now let me write out what we are trying to find:

Image by author

We are going to try to find enough of the third and second list of derivatives to identify a pattern. Using the chain rule, and a lot of staring at the problem, we can show that these derivatives are equal to the following:

I highlighted any pattern that I recognized; this will be useful in simplifying. The second gradient matrix is only partially complete, since we only need enough to identify a pattern. Image by author

If the above photo is slightly blurry, I suggest you download a copy of it and study it for a while. I first just compute the derivatives, then I highlight any similarities within the matrices. Pattern identification is the name of the game.

This can be tough to derive yourself. Take a look back at the photo of the forward phase frequently. If you start at the error, then work your way back to the weight you are trying to find while using the chain rule, you’ll find it’s simple but very tedious.

Now, we are trying to find a simpler way to express the above equations. If you remember your linear algebra and kind of work backward, you can see that the above equations can be simplified to:

That circle with the X inside of it means that we are multiplying the rows. The dot between matrices means we are applying the dot product. Image by author

Look how beautiful! You can notice that everything that is red is the same as everything highlighted in the row above it. This means we can start at the last layer and initialize a term, call it Ψ, to be what is in green. Then, as we move back to previous layers, we can update Ψ with what is green in that layer (whatever is in red is the current value of Ψ). To find the gradient for that layer, we just multiply Ψ by ∂Z/∂w row-wise. This is a lot prettier when you plug in values; below, I assume the mean squared error function and the sigmoid activation function.

Image by author

Now, we wrap up the update into a very simple for loop:

Image by author
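Here is a minimal sketch of that loop in code, assuming the mean squared error and sigmoid activation used above, the hypothetical forward sketch from earlier, and my convention of keeping the bias as the first entry of each Xbar list. It is meant to mirror the Ψ update, not to be a polished implementation:

```python
def a_prime(z):
    """Derivative of the sigmoid, written in terms of z."""
    s = a(z)
    return s * (1.0 - s)

def backward(Wbar, Xbar, Zs, target):
    """Backward phase: initialize psi at the output, then walk back layer by layer,
    updating psi and collecting the gradient for each weight matrix."""
    grads = [None] * len(Wbar)
    # psi at the output layer: dE/dz for mean squared error + sigmoid
    psi = (a(Zs[-1]) - target) * a_prime(Zs[-1])
    for l in range(len(Wbar) - 1, -1, -1):
        # gradient for this layer: psi multiplied row-wise by dz/dw,
        # which works out to an outer product with the layer's input
        grads[l] = np.outer(psi, Xbar[l])
        if l > 0:
            # update psi for the previous layer; drop the bias column because
            # the bias 'input' of 1 does not depend on the previous layer
            psi = (Wbar[l][:, 1:].T @ psi) * a_prime(Zs[l - 1])
    return grads

# One gradient descent step (the learning rate 0.1 and target are illustrative):
# Xbar, Zs = forward(Wbar, Xbar[0])      # using the Wbar/Xbar sketched earlier
# for l, g in enumerate(backward(Wbar, Xbar, Zs, np.array([1.0]))):
#     Wbar[l] -= 0.1 * g
```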

Summary

Ok, I know what you’re thinking. What in tarnation just happened!? All I did was identify patterns! Nothing more complicated than the chain rule was used; I just used a LOT of it. I would not blame you if you just believed my result instead of deriving it yourself. In practice, we will be using Keras for implementation anyway, but now that you know how to do it yourself, you don’t have to feel guilty doing so.

As always, we don’t truly understand this yet! We won’t understand it until we apply it. I go through a simple implementation in Python here; in that link, I also build a useful Python class that we will later use to identify handwritten digits.

Thank you for reading! If this post helped you in some way, or you have a comment or question, then please leave a response below and let me know! Also, if you noticed I made a mistake somewhere, or I could’ve explained something more clearly, then I would appreciate it if you’d let me know through a response.


This is a continuation of a series of articles that give an intuitive explanation of neural networks from the ground up. For the other articles see the links below:

Part 1: What is an artificial neural network

Part 2: How to train a neural network from scratch

Part 3: Full implementation of gradient descent

Part 4: Implementation of gradient descent (an example)

Part 5: How to classify handwritten digits in python

