Deriving the Backpropagation Equations from Scratch (Part 2)

Gaining more insight into how neural networks are trained

Thomas Kurbiel
Towards Data Science


In this short series of two posts, we will derive from scratch the three famous backpropagation equations for fully-connected (dense) layers:
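For reference, here they are, sketched in one common notation that we assume for all formulas written out in this post (superscripts denote the layer, 𝜹 the error, 𝑾 the weight matrix, 𝒃 the bias vector, 𝒂 the outputs, 𝒛 the weighted inputs, 𝑔 the activation function and ⊙ elementwise multiplication), in the order in which they are derived below:

\delta^{l-1} = \big( (W^l)^T \delta^l \big) \odot g'(z^{l-1}), \qquad
\frac{\partial \mathcal{L}}{\partial W^l} = \delta^l \, (a^{l-1})^T, \qquad
\frac{\partial \mathcal{L}}{\partial b^l} = \delta^l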

In the last post we developed an intuition about backpropagation and introduced the extended chain rule. In this post we will apply the chain rule to derive the equations above.

Backpropagating the Error

Backpropagation starts in the last layer 𝐿 and successively moves back one layer at a time. For each visited layer it computes the so-called error:
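Sketched in the assumed notation, the error of layer 𝑙 collects the partial derivatives of the loss with respect to the weighted inputs of that layer:

\delta^l_k = \frac{\partial \mathcal{L}}{\partial z^l_k}, \qquad k = 1, \dots, n_l

where n_l denotes the number of neurons in layer 𝑙.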

Now assume we have arrived at layer 𝑙. In the last post we illustrated how the loss function 𝓛 depends on the weighted inputs 𝑧 of layer 𝑙:
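Sketched in the assumed notation, this dependence reads:

\mathcal{L} = \mathcal{L}\big( z^l_1, \dots, z^l_{n_l} \big)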

We can consider the above expression as our “outer function”. We get our corresponding “inner functions” by using the fact that the weighted inputs 𝑧 depend on the outputs 𝑎 of the previous layer, which is evident from the forward propagation equation:
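Sketched in the assumed notation:

z^l_k = \sum_{i=1}^{n_{l-1}} w^l_{ki} \, a^{l-1}_i + b^l_k, \qquad k = 1, \dots, n_l

so every weighted input z^l_k is a function z^l_k\big( a^{l-1}_1, \dots, a^{l-1}_{n_{l-1}} \big) of the outputs of layer 𝑙-1.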

Inserting the “inner functions” into the “outer function” gives us the following nested function:
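Written out in the assumed notation, the nested function is:

\mathcal{L} = \mathcal{L}\Big( z^l_1\big( a^{l-1}_1, \dots, a^{l-1}_{n_{l-1}} \big), \; \dots, \; z^l_{n_l}\big( a^{l-1}_1, \dots, a^{l-1}_{n_{l-1}} \big) \Big)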

Please note that the nested function now depends on the outputs 𝑎 of the previous layer 𝑙-1. Next, we take the partial derivative using the chain rule discussed in the last post, resulting in:
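Sketched in the assumed notation, for a fixed output index 𝑖 of layer 𝑙-1:

\frac{\partial \mathcal{L}}{\partial a^{l-1}_i} = \sum_{k=1}^{n_l} \frac{\partial \mathcal{L}}{\partial z^l_k} \, \frac{\partial z^l_k}{\partial a^{l-1}_i}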

The first term in the sum is the error of layer 𝑙, a quantity already computed in the previous step of backpropagation. The second term is also easily evaluated:
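Sketched in the assumed notation, the two terms are:

\frac{\partial \mathcal{L}}{\partial z^l_k} = \delta^l_k, \qquad
\frac{\partial z^l_k}{\partial a^{l-1}_i} = w^l_{ki}

where the second identity follows from the forward propagation equation sketched above.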

We arrive at the following intermediate formula:
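Sketched in the assumed notation:

\frac{\partial \mathcal{L}}{\partial a^{l-1}_i} = \sum_{k=1}^{n_l} \delta^l_k \, w^l_{ki}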

where we dropped all arguments of 𝓛 and 𝑧 for the sake of clarity. Expressing the formula for all values of 𝑖 at once, it can compactly be written in matrix form:
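A sketch of this matrix form, which is the relation referred to as equation (𝑰) below:

\nabla_{a^{l-1}} \mathcal{L} = \big( W^l \big)^T \delta^l \qquad (𝑰)

with \delta^l = \big( \delta^l_1, \dots, \delta^l_{n_l} \big)^T the error vector of layer 𝑙 and \nabla_{a^{l-1}} \mathcal{L} = \big( \partial\mathcal{L}/\partial a^{l-1}_1, \dots, \partial\mathcal{L}/\partial a^{l-1}_{n_{l-1}} \big)^T.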

Up to now, we have backpropagated the error of layer 𝑙 through the bias vector and the weight matrix and have arrived at the output of layer 𝑙-1. To obtain the error of layer 𝑙-1, we next have to backpropagate through the activation function of layer 𝑙-1, as depicted in the figure below:

(Figure: backpropagating the error through the activation function of layer 𝑙-1. Image by author.)

In the last step we saw how the loss function depends on the outputs 𝑎 of layer 𝑙-1; this dependence is our new “outer function”. Our new “inner functions” are defined by the activation function 𝑔, which maps the weighted inputs of layer 𝑙-1 to its outputs. Plugging the “inner functions” into the “outer function” yields the following nested function:
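Sketched in the assumed notation, the three pieces of this step are:

outer function: \quad \mathcal{L} = \mathcal{L}\big( a^{l-1}_1, \dots, a^{l-1}_{n_{l-1}} \big)
inner functions: \quad a^{l-1}_k = g\big( z^{l-1}_k \big)
nested function: \quad \mathcal{L} = \mathcal{L}\Big( g\big( z^{l-1}_1 \big), \dots, g\big( z^{l-1}_{n_{l-1}} \big) \Big)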

Applying the chain rule we get:
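Sketched in the assumed notation, for a fixed index 𝑖 of layer 𝑙-1:

\frac{\partial \mathcal{L}}{\partial z^{l-1}_i} = \sum_{k=1}^{n_{l-1}} \frac{\partial \mathcal{L}}{\partial a^{l-1}_k} \, \frac{\partial a^{l-1}_k}{\partial z^{l-1}_i}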

The first term in the above sum is exactly the expression we’ve calculated in the previous step, see equation (𝑰). Since the activation function 𝑔 takes as input only a single 𝑧, we get:
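Sketched as a case distinction in the assumed notation:

\frac{\partial a^{l-1}_k}{\partial z^{l-1}_i} =
\begin{cases} g'\big( z^{l-1}_i \big) & \text{if } k = i \\ 0 & \text{if } k \neq i \end{cases}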

Thus we arrive at the final formula:
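Sketched in the assumed notation, since only the term with 𝑘 = 𝑖 survives in the sum:

\delta^{l-1}_i = \frac{\partial \mathcal{L}}{\partial z^{l-1}_i} = \frac{\partial \mathcal{L}}{\partial a^{l-1}_i} \, g'\big( z^{l-1}_i \big)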

where again we dropped all arguments of 𝓛 for the sake of clarity. Expressing the formula for all values of 𝑖 at once, in vector notation we can write:
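A sketch of this vector form, which is the relation referred to as equation (𝑰𝑰) below:

\delta^{l-1} = \nabla_{a^{l-1}} \mathcal{L} \, \odot \, g'\big( z^{l-1} \big) \qquad (𝑰𝑰)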

where ⊙ denotes elementwise multiplication and 𝑔′ is applied elementwise to the vector of weighted inputs:
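Sketched, as an assumed shorthand:

g'\big( z^{l-1} \big) = \Big( g'\big( z^{l-1}_1 \big), \dots, g'\big( z^{l-1}_{n_{l-1}} \big) \Big)^T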

Finally, by plugging equation (𝑰) into (𝑰𝑰), we arrive at our first formula:
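Sketched in the assumed notation:

\delta^{l-1} = \Big( \big( W^l \big)^T \delta^l \Big) \odot g'\big( z^{l-1} \big)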

Gradient of the Weights

To define our “outer function”, we start again in layer 𝑙 and consider the loss function to be a function of the weighted inputs 𝑧:
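As before, a sketch of this “outer function” in the assumed notation:

\mathcal{L} = \mathcal{L}\big( z^l_1, \dots, z^l_{n_l} \big)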

To define our “inner functions”, we take another look at the forward propagation equation and notice that 𝑧 is a function of the elements of the weight matrix 𝑾:
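Sketched in the assumed notation:

z^l_k = \sum_{j=1}^{n_{l-1}} w^l_{kj} \, a^{l-1}_j + b^l_k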

The resulting nested function depends on the elements of 𝑾:
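Sketched in the assumed notation, the nested function is:

\mathcal{L} = \mathcal{L}\Big( z^l_1\big( w^l_{11}, \dots, w^l_{1 n_{l-1}} \big), \; \dots, \; z^l_{n_l}\big( w^l_{n_l 1}, \dots, w^l_{n_l n_{l-1}} \big) \Big)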

We apply the chain rule:
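Sketched in the assumed notation, for a fixed weight w^l_{ij}:

\frac{\partial \mathcal{L}}{\partial w^l_{ij}} = \sum_{k=1}^{n_l} \frac{\partial \mathcal{L}}{\partial z^l_k} \, \frac{\partial z^l_k}{\partial w^l_{ij}}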

As before, the first term in the above expression is the error of layer 𝑙, and the second term can be evaluated to be:
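Sketched as a case distinction in the assumed notation:

\frac{\partial z^l_k}{\partial w^l_{ij}} =
\begin{cases} a^{l-1}_j & \text{if } k = i \\ 0 & \text{if } k \neq i \end{cases}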

as we will quickly show. To this end, we first notice that each weighted input 𝑧 depends only on a single row of the weight matrix 𝑾:
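Sketched in the assumed notation (only row 𝑘 of 𝑾 appears):

z^l_k = \sum_{j=1}^{n_{l-1}} w^l_{kj} \, a^{l-1}_j + b^l_k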

Hence, taking the derivative with respect to coefficients from other rows must yield zero:
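Sketched:

\frac{\partial z^l_k}{\partial w^l_{ij}} = 0 \qquad \text{for } k \neq i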

In contrast, when we take the derivative with respect to elements of the same row, we get:
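Sketched in the assumed notation:

\frac{\partial z^l_i}{\partial w^l_{ij}} = \frac{\partial}{\partial w^l_{ij}} \left( \sum_{j'=1}^{n_{l-1}} w^l_{ij'} \, a^{l-1}_{j'} + b^l_i \right) = a^{l-1}_j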

Thus we arrive at the final formula:
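Sketched in the assumed notation:

\frac{\partial \mathcal{L}}{\partial w^l_{ij}} = \delta^l_i \, a^{l-1}_j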

Expressing the formula in matrix form for all values of 𝑖 and 𝑗, it can compactly be written as the following familiar outer product:
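Sketched in the assumed notation:

\frac{\partial \mathcal{L}}{\partial W^l} = \delta^l \, \big( a^{l-1} \big)^T

with \delta^l the error vector of layer 𝑙 and a^{l-1} = \big( a^{l-1}_1, \dots, a^{l-1}_{n_{l-1}} \big)^T the output vector of layer 𝑙-1.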

Gradient of the Biases

All steps to derive the gradient of the biases are identical to those in the last section, except that 𝑧 is now considered a function of the elements of the bias vector 𝒃:
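Sketched in the assumed notation, again from the forward propagation equation:

z^l_k = \sum_{j=1}^{n_{l-1}} w^l_{kj} \, a^{l-1}_j + b^l_k, \qquad \text{so that } z^l_k = z^l_k\big( b^l_k \big)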

This leads us to the following nested function, whose derivative is obtained using the chain rule:
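Sketched in the assumed notation, the nested function and its derivative for a fixed bias b^l_i:

\mathcal{L} = \mathcal{L}\Big( z^l_1\big( b^l_1 \big), \dots, z^l_{n_l}\big( b^l_{n_l} \big) \Big), \qquad
\frac{\partial \mathcal{L}}{\partial b^l_i} = \sum_{k=1}^{n_l} \frac{\partial \mathcal{L}}{\partial z^l_k} \, \frac{\partial z^l_k}{\partial b^l_i}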

Exploiting the fact that each weighted input 𝑧 depends only on a single entry of the bias vector:
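Sketched as a case distinction:

\frac{\partial z^l_k}{\partial b^l_i} =
\begin{cases} 1 & \text{if } k = i \\ 0 & \text{if } k \neq i \end{cases}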

gives us the final formula:
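Sketched in the assumed notation, since again only the term with 𝑘 = 𝑖 survives:

\frac{\partial \mathcal{L}}{\partial b^l_i} = \frac{\partial \mathcal{L}}{\partial z^l_i} = \delta^l_i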

In vector notation, with 𝜹 again denoting the error vector of layer 𝑙, we can write:
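Sketched in the assumed notation:

\frac{\partial \mathcal{L}}{\partial b^l} = \delta^l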

This concludes the derivation of all three backpropagation equations.
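As a compact summary, here is a minimal NumPy sketch of one backpropagation step through a dense layer, using the three equations above. The function and variable names (backward_dense, delta, W, a_prev, z_prev, g_prime) are illustrative choices of this sketch, not code from the original posts.

import numpy as np

def backward_dense(delta, W, a_prev, z_prev, g_prime):
    """One backpropagation step through a fully-connected layer l.

    delta   : error vector of layer l,            shape (n_l,)
    W       : weight matrix of layer l,           shape (n_l, n_{l-1})
    a_prev  : outputs of layer l-1,               shape (n_{l-1},)
    z_prev  : weighted inputs of layer l-1,       shape (n_{l-1},)
    g_prime : elementwise derivative of the activation function of layer l-1
    """
    grad_W = np.outer(delta, a_prev)              # dL/dW^l = delta^l (a^{l-1})^T
    grad_b = delta.copy()                         # dL/db^l = delta^l
    delta_prev = (W.T @ delta) * g_prime(z_prev)  # delta^{l-1} = ((W^l)^T delta^l) ⊙ g'(z^{l-1})
    return grad_W, grad_b, delta_prev

For the last layer 𝐿, delta is obtained directly from the derivative of the loss with respect to the weighted inputs of that layer; for every earlier layer it is the delta_prev returned by the previous call.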
