
I’ve been studying deep learning for a while now, and I have become a huge fan of modern deep learning frameworks such as PyTorch and TensorFlow. However, as I got used to such simple but powerful tools, my grasp of core concepts in deep learning such as backpropagation started to fade. I believe it’s always good to go back to the basics, so I wanted to write a detailed hands-on tutorial to clear things up.
Introduction
The basic process of deep learning is to perform operations defined by a network with learned weights. For example, the famous Convolutional Neural Network (CNN) essentially just multiplies and adds pixel intensity values according to the rules defined by the network. Then, if we want to classify whether a picture shows a dog or a cat, the operations should produce a binary result, say 1 for a dog and 0 for a cat.
When we train the network, we are simply updating the weights so that the output gets closer to the answer. In other words, with a well-trained network, we can correctly classify an image into whatever class it really belongs to. Here is where backpropagation comes in: we calculate the gradients and gradually update the weights to meet the objective. An objective function (aka loss function) is how we quantify the difference between the answer and the prediction we make. With a simple and differentiable objective function, we could easily find the global minimum; in most cases, however, it is not a trivial process.

Chain Rule
You just can’t talk about backpropagation without the chain rule. The chain rule enables you to calculate local gradients in a simple way.


Here is a simple example of backpropagation. As discussed earlier, the inputs are x, y, and z above. The circle nodes are operations, and together they form a function f. Since we need to know the effect each input variable has on the output, the partial derivatives of f with respect to x, y, and z are the gradients we want. Then, by the chain rule, we can backpropagate the gradients and obtain each local gradient as in the figure above.
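Since the original figure isn’t reproduced here, below is a minimal numeric sketch of the same idea using a made-up toy graph, f(x, y, z) = (x + y) · z; the values are chosen purely for illustration.

```python
# Toy computational graph: f(x, y, z) = (x + y) * z
# Forward pass
x, y, z = -2.0, 5.0, -4.0   # made-up example values
q = x + y                    # intermediate node
f = q * z                    # output

# Backward pass (chain rule)
df_dq = z                    # d(q*z)/dq = z
df_dz = q                    # d(q*z)/dz = q
df_dx = df_dq * 1.0          # dq/dx = 1, so df/dx = df/dq * dq/dx
df_dy = df_dq * 1.0          # dq/dy = 1

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```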

As we will be doing more vectorized calculations in the actual implementations, here is one example with the function f being the squared L2 norm of q = Wx. The gradient of the L2 norm term is just two times the input, which is 2q above. Then, the partial derivative of f with respect to W is the outer product of 2q and x transpose, and the partial derivative with respect to x is W transpose times 2q. The W transpose appears because the partial derivative of q with respect to each xᵢ is the i-th column of W.
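Here is a small NumPy sketch of that vectorized case, with q = Wx and f = ‖q‖²; the shapes and random values are made up for illustration, and a finite-difference check confirms one entry of the gradient.

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(4, 3)          # made-up shapes for illustration
x = np.random.randn(3, 1)

# Forward pass: q = Wx, f = ||q||^2
q = W @ x
f = np.sum(q ** 2)

# Backward pass
df_dq = 2 * q                      # gradient of the squared L2 norm is 2q
df_dW = df_dq @ x.T                # (4,1) @ (1,3) -> same shape as W
df_dx = W.T @ df_dq                # (3,4) @ (4,1) -> same shape as x

# Quick numerical check on one entry of W
eps = 1e-5
W_pert = W.copy()
W_pert[0, 0] += eps
f_pert = np.sum((W_pert @ x) ** 2)
print(df_dW[0, 0], (f_pert - f) / eps)   # the two numbers should be close
```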
Activation Function
In deep learning, a stack of linear operations between layers would collapse into just one big linear function if there were no activation functions. Non-linear activation functions introduce further complexity into the model. I’m going to introduce a few of the basic activation functions along with their derivatives, which we need in order to calculate gradients for our backpropagation.
Sigmoid


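The formula image is not reproduced here, so here is a minimal NumPy sketch of the sigmoid and its derivative, σ(x) = 1 / (1 + e⁻ˣ) and σ′(x) = σ(x)(1 − σ(x)):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)
```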
tanh


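Likewise, a minimal sketch of tanh and its derivative, 1 − tanh²(x):

```python
import numpy as np

def tanh(x):
    # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
    return np.tanh(x)

def tanh_grad(x):
    # d(tanh)/dx = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2
```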
ReLU (Rectified Linear Unit)


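And a minimal sketch of ReLU and its derivative, which is 1 for positive inputs and 0 elsewhere (taking 0 at x = 0 by convention):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, and 0 elsewhere
    return (x > 0).astype(x.dtype)
```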
Objective Function
An effective way of quantifying how close the predictions are to the answers is very important in training neural networks. A differentiable objective function (aka loss function) is needed in order to perform backpropagation and update all the weights affecting the output predictions. I’m going to introduce two objective functions: Mean Squared Error (MSE) and the Cross Entropy (CE) loss.
Mean Squared Error (MSE)
MSE is one of the most common loss terms nowadays and is typically used for predicting numerical values. It calculates the average squared distance between the predictions and the ground truths. The final activation layer is usually either linear or ReLU.
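A minimal NumPy sketch of MSE and its gradient with respect to the predictions (the function names are my own, not from any particular library):

```python
import numpy as np

def mse_loss(pred, target):
    # mean squared error: average of (pred - target)^2
    return np.mean((pred - target) ** 2)

def mse_grad(pred, target):
    # gradient of the mean squared error with respect to the predictions
    return 2.0 * (pred - target) / pred.size
```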
Cross Entropy (CE)
Cross entropy is common for predicting a single label out of multiple classes. It usually follows a softmax final activation, which makes the output probabilities sum to 1 and greatly simplifies the derivative of the loss term, as shown below.
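A minimal NumPy sketch of softmax followed by cross entropy for a single sample, including the well-known simplified gradient p − y with respect to the logits (the function names are my own):

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; probabilities sum to 1
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy_loss(logits, target_index):
    # CE = -log(p[target]) with p = softmax(logits)
    p = softmax(logits)
    return -np.log(p[target_index])

def cross_entropy_grad(logits, target_index):
    # for softmax + CE, the gradient w.r.t. the logits simplifies to p - y
    p = softmax(logits)
    p[target_index] -= 1.0
    return p
```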
Backpropagation

There are only three things to consider in backpropagation for a fully connected network such as the one above: the gradient passed in from the right, the local gradient calculated from the derivative of the activation function, and the gradient passed on to the left, which involves the weights and the inputs.
The first gradient comes from the loss term; with its derivative explained above, we can start passing the gradients from right to left. At every layer, we first calculate the gradient with respect to the activation. Then, the product of that gradient and the input values (z’) gives the gradient with respect to our weights, and the product of the weights (w) and that gradient becomes the next passing gradient to the left.
Repeating this simple process is all we need for successful backpropagation!
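To tie it all together, here is a minimal sketch of one training step of a tiny fully connected network; the layer sizes, the learning rate, and the choice of a sigmoid hidden layer with a linear output and MSE loss are my own assumptions for illustration.

```python
import numpy as np

np.random.seed(0)
# Made-up sizes: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = np.random.randn(4, 3) * 0.1, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.1, np.zeros((1, 1))

x = np.random.randn(3, 1)          # one input sample
t = np.array([[1.0]])              # its target value
lr = 0.1                           # made-up learning rate

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)                   # hidden activation (the z' in the text)
z2 = W2 @ a1 + b2                  # linear output layer
loss = np.mean((z2 - t) ** 2)      # MSE loss

# Backward pass: passing gradient times local gradient, layer by layer
d_z2 = 2.0 * (z2 - t)              # gradient from the MSE loss
d_W2 = d_z2 @ a1.T                 # gradient w.r.t. the weights (uses the inputs a1)
d_b2 = d_z2
d_a1 = W2.T @ d_z2                 # gradient passed to the left (uses the weights)
d_z1 = d_a1 * a1 * (1.0 - a1)      # local gradient of the sigmoid
d_W1 = d_z1 @ x.T
d_b1 = d_z1

# Gradient descent update
W2 -= lr * d_W2
b2 -= lr * d_b2
W1 -= lr * d_W1
b1 -= lr * d_b1
```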
References
[1] Fei-Fei Li, CS231n: Convolutional Neural Networks for Visual Recognition, 2017
[2] Stacey Ronaghan, Deep Learning: Which Loss and Activation Functions should I use?, 2018