Backpropagation made easy

Backpropagation is so fundamental to machine learning, yet it seems so daunting. In fact, it is easier than it looks.

Y. Lin
Towards Data Science


It doesn't take a math genius to learn Machine Learning (ML). Basically, all you need is first-year college calculus, linear algebra, and probability theory, and you are good to go. But behind this seemingly benign first impression lies a lot of mathematical theory. For many people, the first real obstacle in learning ML is back-propagation (BP): the method we use to compute the gradients of the parameters in a neural network (NN), and a necessary step when training a model with gradient descent.

BP is a very basic step in training any NN. It involves nothing more than the chain rule and matrix multiplication. However, the way BP is introduced in many ML courses and tutorials is not satisfactory. When I was first learning BP in Coursera's Machine Learning class, I was so confused about its calculation process that I paused for several months while searching for better explanations. I managed to pass the course and finish the coding assignment, but BP still remained a messy, confusing blur in my brain.

It doesn't really hurt if you don't understand BP at all and simply regard it as a black box, because TensorFlow or PyTorch can perform it for you automatically. But recently, while reviewing my ML notes, I started to properly understand BP. My method is to set up a simple NN, write down every parameter and variable matrix/vector explicitly, and work out the gradient calculation through the chain rule for each parameter matrix/vector, step by step. At the end of the day, BP turns out to be much easier than I originally thought.

Using a two-layer NN and a single input sample as an example

My notation follows Coursera's Deep Learning Specialization Course 1. I will use a two-layer NN as an example. Two layers are just "deep enough", because the technique used to calculate the gradients in layer 1 can be repeated for any additional layers. It also helps us see how the dimensions of the parameters relate to the number of neurons in each layer: the input is a 2-dimensional vector and there are three neurons in layer 1, so w[1] is a 3*2 matrix. There are two neurons in layer 2, the output layer. If the NN is used for binary classification, the output layer has one neuron; if it is used for multi-class classification, the output layer has as many neurons as there are classes. To keep our analysis general, we consider a two-neuron output layer as the example.

A simple two-layer NN (Image by Author)

In the input layer, or layer 0, we have a 2*1 input vector x. For now, we focus on a single input sample. If we considered a batch of m samples instead, the input dimension would be 2*m, but let's stick to a single input instance here.

In the first layer, we have three neurons, and w[1] is a 3*2 matrix. Each layer also has a bias vector: b[1] is a 3*1 vector and b[2] is a 2*1 vector. In the forward pass, we have the following relationships (written both in matrix form and in vectorized form):

Matrix multiplication of layer 1 operation
Vectorized expression of layer 1 operation

Similarly, the vectorized operation of layer 2 can be written as:

Vectorized expression of layer 2 operations. Notice that a[1] acts as the input to this layer.
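To make the shapes concrete, here is a minimal NumPy sketch of this forward pass (my own illustration, not code from the course), assuming sigmoid activations and random placeholder values for the parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder parameters matching the example network: 2 inputs, 3 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
x  = rng.normal(size=(2, 1))   # input vector, 2x1
W1 = rng.normal(size=(3, 2))   # w[1], 3x2
b1 = np.zeros((3, 1))          # b[1], 3x1
W2 = rng.normal(size=(2, 3))   # w[2], 2x3
b2 = np.zeros((2, 1))          # b[2], 2x1

# Forward pass: z[l] = w[l] a[l-1] + b[l], a[l] = g(z[l]), with a[0] = x
z1 = W1 @ x + b1       # 3x1
a1 = sigmoid(z1)       # 3x1
z2 = W2 @ a1 + b2      # 2x1
a2 = sigmoid(z2)       # 2x1
```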

Just ignore the details of the activation function and the loss function for now.

For each layer, we will use a nonlinear activation function g(z), e.g., the sigmoid function or ReLU. For the sake of simplicity, we leave out the details of g(z) and simply keep writing g(z).

Relationship between a[1] and z[1]

Notice that the derivative of g(z) with respect to z is known to us once we settle on which g(z) to use.

g’(z) is known to us.
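As a quick illustration (a sketch of my own, assuming we pick the sigmoid or ReLU), g'(z) can be coded directly alongside g(z):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):        # g'(z) = g(z)(1 - g(z)) for the sigmoid
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):           # g'(z) for ReLU (we take 0 at z = 0)
    return (z > 0).astype(float)
```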

At the output of the NN, we have a loss function J() that we will try to minimize. It is usually the cross-entropy. For the sake of simplicity, we will leave out the details of J() for now. Again, it is a function of a[2] and y, so we can calculate its derivative with respect to a[2] directly.

Loss function

Going from the last layer to obtain the partial derivative dz[2].

To perform BP, we need to go from the last layer backward to the first layer. At the last layer, we can directly obtain the partial derivative da[2] from the loss function J, because J is a function of a[2]. It is also straightforward to obtain the partial derivative dz[2] from da[2] via the following equations, given that we know the derivative g'(z). Notice that the operator in the last equation (a dot in a circle) indicates element-wise multiplication.

Once we have obtained dz[2], we will try to find dw[2] and db[2].

Take w7 in w[2] as an example: this parameter is multiplied by a1, and the product is added into z4. By the chain rule, dw7 equals:

Therefore, we have the following relationship between dw[2] and dz[2]:

You can see that dw[2] is nothing but the product of the vector dz[2] and the transpose of a[1], and db[2] is dz[2] itself.
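Putting the layer-2 step into code (my own NumPy sketch with made-up values for a[1], z[2], and da[2], assuming a sigmoid activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

a1  = np.array([[0.2], [0.6], [0.9]])   # made-up layer-1 output, 3x1
z2  = np.array([[0.3], [-0.7]])         # made-up layer-2 pre-activation, 2x1
da2 = np.array([[0.5], [-1.2]])         # made-up dJ/da[2] from the loss, 2x1

dz2 = da2 * sigmoid_prime(z2)   # element-wise product, 2x1
dW2 = dz2 @ a1.T                # (2x1)(1x3) -> 2x3, same shape as w[2]
db2 = dz2                       # 2x1, same shape as b[2]
```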

We can then proceed to layer 1 and calculate da[1] and dz[1].

Take a1 in a[1] as an example. Notice that a1 is multiplied by w7 and w8, and the products are added into z4 and z5, respectively. By the chain rule, da1 is made up of these two parts as well. The details of da[1] and dz[1] are given below:

Notice that da[1] is calculated from layer 2's parameter w[2] and from dz[2].

With dz[1] we can obtain dw[1] and db[1] just like in layer 2.

We can see that the calculation of dw[1] is the product of the vector dz[1] and the transpose of x. You can also think of x as a[0], the output vector of layer "0". Therefore, we get the consistent expression dw[l] = dz[l]*a[l-1].T (.T denotes transpose, and l is an arbitrary layer index).
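The layer-1 step looks like this in code (again my own sketch with made-up values, assuming a sigmoid activation; dz2 and W2 would come from the layer-2 step above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x   = np.array([[1.0], [2.0]])                        # input, 2x1
z1  = np.array([[0.1], [0.4], [-0.2]])                # made-up layer-1 pre-activation, 3x1
W2  = np.array([[0.3, -0.1, 0.5], [0.2, 0.7, -0.4]])  # made-up w[2], 2x3
dz2 = np.array([[0.25], [-0.6]])                      # from the layer-2 step, 2x1

da1 = W2.T @ dz2                 # (3x2)(2x1) -> 3x1
dz1 = da1 * sigmoid_prime(z1)    # element-wise, 3x1
dW1 = dz1 @ x.T                  # (3x1)(1x2) -> 3x2, same shape as w[1]
db1 = dz1                        # 3x1, same shape as b[1]
```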

With the above steps, we have obtained dw[2], db[2], dw[1], and db[1]. For NNs with more than two layers, the steps can be repeated for each layer: for any layer l, we just need to obtain dz[l] (either directly from J, or from the layer l+1 processed just before it), use it to calculate dw[l] and db[l], obtain da[l-1] and dz[l-1], and proceed to layer l-1.

For an NN with L layers, BP comes down to the four major steps below:

  1. From the last layer, calculate da[L] from the loss function J; then get dz[L] from da[L]; set l = L;
  2. Calculate the gradients of the current layer's parameters with dz[l]: dw[l] = dz[l] a[l-1].T; db[l] = dz[l];
  3. Calculate the gradient of the previous layer's output with dz[l]: da[l-1] = w[l].T * dz[l], and get dz[l-1] from da[l-1];
  4. Set l = l-1 and repeat from step 2 until you reach the first layer.

In short, at each layer we want to find dz[l], and everything else follows from there.
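These four steps translate almost line by line into a generic backward pass. The sketch below is my own illustration for the single-sample case, assuming the same sigmoid activation g in every layer and that the forward-pass caches a[l] and z[l] were stored:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(W, a, z, daL):
    """W[l]: weight of layer l (index 0 unused); a[l], z[l]: forward caches with a[0] = x;
    daL: dJ/da[L] from the loss. Returns per-layer gradients dW[l], db[l]."""
    L = len(W) - 1
    dW, db = {}, {}
    da = daL
    for l in range(L, 0, -1):
        dz = da * sigmoid_prime(z[l])   # dz[l] = da[l] (element-wise) g'(z[l])
        dW[l] = dz @ a[l - 1].T         # dw[l] = dz[l] a[l-1].T
        db[l] = dz                      # db[l] = dz[l]
        da = W[l].T @ dz                # da[l-1] = w[l].T dz[l]
    return dW, db
```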

Now let's look at the details of J() and g(). Suppose g(z) is the sigmoid function and J() is the cross-entropy loss. Their forms and gradients are given below.

In the case of our NN, J is a function of a[2] and y, and we can obtain dz[2] from the following equations. Once we have the explicit expression for dz[2], we can perform BP with the steps above.
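For this combination, the product da[2] (element-wise) g'(z[2]) collapses to the familiar a[2] - y. Here is a small numerical check, a sketch of my own with made-up values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y  = np.array([[1.0], [0.0]])     # made-up label vector, 2x1
z2 = np.array([[0.8], [-0.4]])    # made-up layer-2 pre-activation, 2x1
a2 = sigmoid(z2)

J   = -np.sum(y * np.log(a2) + (1 - y) * np.log(1 - a2))   # cross-entropy loss
da2 = -(y / a2 - (1 - y) / (1 - a2))                        # dJ/da[2]
dz2 = da2 * a2 * (1 - a2)                                   # da[2] times g'(z[2]), element-wise

assert np.allclose(dz2, a2 - y)   # the expression simplifies to a[2] - y
```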

Finally, let's look at the case of multiple input instances. If the input contains m instances, then the output and intermediate matrices A[1], A[2], Z[1], and Z[2] all have m columns.

X, Y, A[1], A[2], Z[1], Z[2] have m columns

If the loss function J is the cross-entropy averaged over the m samples, then the calculations of dw and db need to include a 1/m factor, but everything else stays basically the same:
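Here is how the batch version might look in NumPy (my own sketch, assuming sigmoid activations in both layers and the averaged cross-entropy loss, so dZ[2] = A[2] - Y and every dw/db picks up the 1/m factor):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 5                                                # a made-up batch of 5 samples
rng = np.random.default_rng(0)
X  = rng.normal(size=(2, m))                         # inputs, one column per sample
Y  = rng.integers(0, 2, size=(2, m)).astype(float)   # made-up labels
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))

# Forward pass: every intermediate matrix now has m columns
Z1 = W1 @ X + b1;  A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

# Backward pass; the 1/m comes from averaging the loss over the batch
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)                   # g'(Z1) for the sigmoid
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
```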

That's it. We used a simple and straightforward two-layer NN as an example and performed BP for a single input and for multiple inputs. I hope this simple tutorial helps you understand BP better.

Reference:

[1] Coursera Deep Learning Specialization Course 1: Neural Networks and Deep Learning

[2] https://www.youtube.com/watch?v=ibJpTrp5mcE&feature=youtu.be

[3] http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/BP.pdf
