Reverse Engineering Backpropagation

Sometimes starting with examples is a faster way to learn something than going theory-first and only then getting into detailed examples. That’s what I will attempt to do here: take an example from the official PyTorch tutorial that implements backpropagation, and reverse engineer the math, and subsequently the concept, behind it.

Here’s a snapshot of the tutorial (https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#warm-up-numpy) where numpy was used to implement a network with one hidden layer.

# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

In this example, each input is a one-dimensional array of size 1000, and the output is also one-dimensional, set at size 10. In addition to the input and output layers, which every neural network has, there is one hidden layer, which is set to size 100 in the example.

This is a simple example, and we have one batch of training data of size 64. In other words, we have 64 randomly initialized inputs of size 1000 each and corresponding outputs of size 10. The goal will be to train the network using these 64 randomly generated training examples.

Let’s look at the forward pass. The weight matrix connecting the input with the hidden layer is of size 1000 x 100 (i.e. input dimension x hidden dimension). A matrix multiplication between the 1000-sized input and this weight matrix results in a 100-sized hidden layer. We don’t really do this one input at a time, but for the entire batch, as seen here: "h = x.dot(w1)". A ReLU function is applied on the resulting 64 x 100 matrix to get rid of the negative numbers, but this step doesn’t affect the dimensions in any way. The resulting 64 x 100 matrix is multiplied with the second weight matrix of size 100 x 10 to get the output of size 10. The loss calculation also looks pretty straightforward: it’s the sum of squared errors.

Before getting into the backprop steps, let’s get a visual representation of this setup. Let’s start with the output layer.

I am not using the variable names from the code but different notations like "a" and "f", as this will help build a more general understanding of backpropagation (and will make the equations look less congested). I am also representing the last layer as "l+1", so that the previous layer can be represented using "l". This is because we need to go back one more layer to arrive at a generalized, repeatedly callable backpropagation function, and I want that layer to be indexed as "l".

The loss function is defined as:
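C = \sum \left( a^{l+1} - y \right)^2

where a^{l+1} is the output activation (y_pred in the code) and y is the target; the code sums the squared error over the whole batch as well.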

This is calculated in the below lines of code.

# Compute and print loss
loss = np.square(y_pred - y).sum()

Now our goal is to find the rate of change of this loss with respect to the weights in the network, so that we can adjust the weights accordingly and minimize the loss. Let’s start with the last layer. Our goal is to determine:
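\frac{\partial C}{\partial w^{l+1}}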

But we only know the value of loss as of now. So let’s expand this to bring that into play.
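\frac{\partial C}{\partial w^{l+1}} = \frac{\partial C}{\partial a^{l+1}} \cdot \frac{\partial a^{l+1}}{\partial w^{l+1}}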

What we did above is this: keeping the dependent variable to be differentiated (in this case "C", the loss) fixed, we recursively found the derivative with respect to its subexpressions. In other words, since we know "C" is a function of the output activation "a", we leveraged the chain rule to rewrite it as the derivative of "C" wrt "a" times the derivative of "a" wrt "w". This logic of recursively moving backwards through the subexpressions of a function, deriving the derivative of the loss wrt the inputs of the function given the derivative of the loss wrt the output of the function, is called "backpropagation" or "reverse mode automatic differentiation" (reverse accumulation AD). In general, you repeat this logic layer after layer, going backwards, starting from the last output layer through all layers of the neural network.

Getting back, we actually know the first part of the equation, as this can be derived using the power rule.
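\frac{\partial C}{\partial a^{l+1}} = 2 \left( a^{l+1} - y \right)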

This is what is derived in the example as:

grad_y_pred = 2.0 * (y_pred - y)

Let’s solve the second part:
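Since the output layer’s activation is just the dot product, a^{l+1} = a^{l} \cdot w^{l+1}, and so:

\frac{\partial a^{l+1}}{\partial w^{l+1}} = a^{l}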

Putting it all back together, the partial derivative of the loss wrt w for the last layer can be written as:
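\frac{\partial C}{\partial w^{l+1}} = a^{l} \cdot 2 \left( a^{l+1} - y \right)

(With the whole batch this becomes the matrix product of the transposed activation with the gradient, which is why the code transposes h_relu.)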

And this is what is calculated as:

grad_w2 = h_relu.T.dot(grad_y_pred)

While this is how the gradient for the weights is calculated in every layer, the output layer is simpler in that its activation is just a dot product. In the prior layer we applied a rectified linear unit (ReLU) to the product. The code snippet in the example was: h = x.dot(w1); h_relu = np.maximum(h, 0). The sigmoid function is also commonly used, and it is common to see both activations used in the same network. Let’s represent this layer visually and get the gradients for w1.

Just like in the last layer, let’s expand out the partial derivative of the loss wrt the weights.
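\frac{\partial C}{\partial w^{l}} = \frac{\partial C}{\partial a^{l}} \cdot \frac{\partial a^{l}}{\partial w^{l}}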

Let’s deal with the first part. In the output layer, the partial derivative of the loss wrt the output activation was known to us, since the loss was defined directly in terms of that activation (the predicted output) and the actual value. But for this layer, we need to derive it. Since we know the derivative of the loss wrt the output activation of the next layer, let’s use that and rewrite the first part as:
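\frac{\partial C}{\partial a^{l}} = \frac{\partial C}{\partial a^{l+1}} \cdot \frac{\partial a^{l+1}}{\partial a^{l}}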

The partial derivative of the output activation of the next layer wrt the output activation of this layer can be solved as below:
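\frac{\partial a^{l+1}}{\partial a^{l}} = w^{l+1}

(because, as before, a^{l+1} = a^{l} \cdot w^{l+1}).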

So the first part is,
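\frac{\partial C}{\partial a^{l}} = \frac{\partial C}{\partial a^{l+1}} \cdot w^{l+1}

(In batched matrix form the weight matrix gets transposed, which is the w2.T you see in the code.)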

This is what is calculated in the tutorial code as:

grad_h_relu = grad_y_pred.dot(w2.T)

Plugging it back:
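\frac{\partial C}{\partial w^{l}} = \frac{\partial C}{\partial a^{l+1}} \cdot w^{l+1} \cdot \frac{\partial a^{l}}{\partial w^{l}}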

This leaves us with the second part, which is the partial derivative of the output activation of this layer wrt the weights of this layer.

The activation function here is a two-step function. First we multiplied the weights with the previous layer’s output. Next, we applied a ReLU function. Applying the principles of backpropagation, this can be rewritten as below.
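Writing z^{l} = a^{l-1} \cdot w^{l} for the weighted input and f for the ReLU, so that a^{l} = f(z^{l}):

\frac{\partial a^{l}}{\partial w^{l}} = \frac{\partial a^{l}}{\partial z^{l}} \cdot \frac{\partial z^{l}}{\partial w^{l}}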

The first part of this is:
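\frac{\partial a^{l}}{\partial z^{l}} = f'(z^{l})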

Let’s zoom in on the ReLU function and its derivative. The derivative of ReLU happens to be a unit step function, with "0"s for all negative inputs and "1"s everywhere else.
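f'(z^{l}) = \begin{cases} 0 & \text{if } z^{l} < 0 \\ 1 & \text{otherwise} \end{cases}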

The second part of the equation can be solved as:
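\frac{\partial z^{l}}{\partial w^{l}} = a^{l-1}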

Bringing both together:
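\frac{\partial a^{l}}{\partial w^{l}} = f'(z^{l}) \cdot a^{l-1}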

Plugging it back to our goal of finding the partial derivative of loss wrt weights in this layer,
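\frac{\partial C}{\partial w^{l}} = \frac{\partial C}{\partial a^{l+1}} \cdot w^{l+1} \cdot f'(z^{l}) \cdot a^{l-1}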

There are four terms being multiplied above. As seen earlier, the product of the first two terms is what is calculated in the tutorial code as "grad_h_relu = grad_y_pred.dot(w2.T)". The third term is the unit step function, which is 0 when z is less than 0 and 1 otherwise. And the last term is the input activation, which in our example is "x", the input layer. Here again are the three lines of code that do all this:

grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

The first two lines of code represent the product of the first three terms, and the last line represents the final product with the input values. The mapping below will help match the code with the derivations.
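grad_y_pred = \partial C / \partial a^{l+1}
grad_h_relu = \partial C / \partial a^{l+1} \cdot w^{l+1} = \partial C / \partial a^{l} (the product of the first two terms)
grad_h = \partial C / \partial a^{l} \cdot f'(z^{l}) (the product of the first three terms)
grad_w1 = a^{l-1} \cdot grad_h = \partial C / \partial w^{l} (the full product of all four terms)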

As noted earlier, it is common to use the sigmoid function for activation as well; in that case we would use the derivative of the sigmoid instead of the unit step function in the calculations above. So if we were to write a standalone backprop function, it would take the derivative of the loss wrt the output activation as input and would have to calculate two values from it. The first is the derivative of the loss wrt the weights, which is used in the gradient descent step to update the weights. The second is the derivative of the loss wrt the input activation. This has to be returned so as to continue the backpropagation, as the input activation for this layer is nothing but the output activation of the previous layer. The calling code can use this returned value to call the backprop function again for the previous layer, this time passing the returned value as the input.
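To make that concrete, here is a minimal sketch of such a function for the ReLU layer in this example. The function name and signature are my own, not something from the tutorial; it assumes the pre-activation z and the input activation have been cached during the forward pass.

import numpy as np

def relu_layer_backprop(grad_a_out, z, a_in, w):
    # grad_a_out: derivative of loss wrt this layer's output activation
    # z: cached pre-activation a_in.dot(w) from the forward pass
    # a_in: cached input activation from the forward pass
    # w: this layer's weight matrix
    grad_z = grad_a_out.copy()
    grad_z[z < 0] = 0              # multiply by the unit step f'(z)
    grad_w = a_in.T.dot(grad_z)    # derivative of loss wrt the weights
    grad_a_in = grad_z.dot(w.T)    # derivative of loss wrt the input activation
    return grad_w, grad_a_in

For the hidden layer in the tutorial, this would be called as "grad_w1, grad_x = relu_layer_backprop(grad_h_relu, h, x, w1)"; the returned grad_x would be passed on to the layer before it, if there were one.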

While I spent some time trying to learn about backprop, automatic differentiation, and PyTorch through tutorials, it became evident pretty quickly that studying them in parallel, using one to explain the other, was a more efficient way to get to know these topics than trying to do so independently. Hopefully this was helpful.

