
In the previous article we talked about multi-layer perceptrons (MLPs) as the first neural network model that could solve non-linear and complex problems.
For a long time it was not clear how to train these networks on a given data set. While single-layer perceptrons had a simple learning rule that was guaranteed to converge to a solution (at least for linearly separable data), this rule could not be extended to networks with more than one layer. The AI community struggled with this problem for nearly three decades (a period associated with the "AI winter"), until in 1986 Rumelhart et al. introduced the backpropagation algorithm in their groundbreaking paper [1].
In this article we will discuss the backpropagation algorithm in detail and derive its mathematical formulation step-by-step. Since this is the main algorithm used to train neural networks of all kinds (including the deep networks we have today), I believe it would be beneficial to anyone working with neural networks to know the details of this algorithm.
Although you can find descriptions of this algorithm in many textbooks and online sources, in writing this article I have tried to keep the following principles in mind:
- Use clear and consistent notation.
- Explain every step of the mathematical derivation.
- Derive the algorithm for the most general case, i.e., for networks with any number of layers and any activation or loss functions.
After deriving the backpropagation equations, a complete pseudocode for the algorithm is given and then illustrated on a numerical example.
Before reading the article, I recommend that you refresh your calculus knowledge, specifically in the area of derivatives (including partial derivatives and the chain rule of derivatives).
Now grab a cup of coffee and let’s dive in 🙂
The Three Phases of the Algorithm
The backpropagation algorithm consists of three phases:
- Forward pass. In this phase we feed the inputs through the network, make a prediction and measure its error with respect to the true label.
- Backward pass. We propagate the gradients of the error with respect to each one of the weights backward from the output layer to the input layer.
- Gradient descent step. We slightly tweak the connection weights in the network by taking a step in the opposite direction of the error gradients.
We will now go over each one of these phases in more detail.
Forward Pass
In the forward pass, we propagate the inputs layer-by-layer through the network until the output is generated. The activation of neuron i in layer l is computed using the following equation:

$$a_i^l = f(z_i^l) = f\left(\sum_j w_{ij}^l\, a_j^{l-1} + b_i^l\right)$$
where f is the activation function, zᵢˡ is the net input of neuron i in layer l, wᵢⱼˡ is the connection weight between neuron j in layer l – 1 and neuron i in layer l, and bᵢˡ is the bias of neuron i in layer l. For more details on the notations and the derivation of this equation see my previous article.
To simplify the derivation of the learning algorithm, we will treat the bias as if it were the weight w₀ of an input neuron x₀ that has a constant value of 1. This enables us to write the above equation as follows:

$$a_i^l = f\left(\sum_{j=0} w_{ij}^l\, a_j^{l-1}\right), \qquad \text{where } a_0^{l-1} \equiv 1 \text{ and } w_{i0}^l \equiv b_i^l$$
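As a concrete illustration, here is a minimal NumPy sketch of this forward computation for a single layer, with the bias folded into column 0 of the weight matrix as described above (my own illustrative code; the names and shapes are not part of the article's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_prev, W, f=sigmoid):
    """Compute the activations of layer l from the activations of layer l-1.

    a_prev -- activations of layer l-1 (without the constant 1), shape (n,)
    W      -- weight matrix of layer l, shape (m, n+1); column 0 holds the biases
    """
    a_prev = np.concatenate(([1.0], a_prev))  # prepend the constant input a_0 = 1
    z = W @ a_prev                            # net inputs z_i of layer l
    return f(z)                               # activations a_i = f(z_i)
```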
Backward Pass
In the backward pass we propagate the gradients of the error from the output layer back to the input layer.
Definition of the Error and Loss Functions
We first define the error of the network on the training set with respect to its weights. Let’s denote by w the vector that contains all the weights of the network.
Assume that we have n training samples {(xᵢ, yᵢ)}, i = 1,…,n, and the output of the network on sample i is oᵢ. Then the error of the network with respect to w is:

$$E(\mathbf{w}) = \sum_{i=1}^{n} J(y_i, o_i)$$
where J(y, o) is the loss function. The specific loss function that we use depends on the task the network is trying to accomplish:
- For regression problems, we use the squared loss function:

$$J(y, o) = \frac{1}{2}(y - o)^2$$
- For binary classification problems, we use log loss (also known as the binary cross-entropy loss):

$$J(y, o) = -\bigl[y \log o + (1 - y)\log(1 - o)\bigr]$$
- For multi-class classification problems, we use the cross-entropy loss function:

$$J(\mathbf{y}, \mathbf{o}) = -\sum_{i=1}^{k} y_i \log o_i$$
where k is the number of classes.
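For reference, here is a minimal NumPy sketch of these three loss functions for a single training example (my own illustrative code; the ½ factor in the squared loss is the common convention that makes its derivative come out as o − y, and the clipping simply avoids log(0)):

```python
import numpy as np

def squared_loss(y, o):
    """Squared loss for regression (single example)."""
    return 0.5 * (y - o) ** 2

def log_loss(y, o, eps=1e-12):
    """Binary cross-entropy for a single example with label y in {0, 1}."""
    o = np.clip(o, eps, 1 - eps)
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def cross_entropy(y, o, eps=1e-12):
    """Cross-entropy for a one-hot label vector y and a prediction vector o."""
    return -np.sum(y * np.log(np.clip(o, eps, None)))
```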
The reason why we use these specific loss functions is explained in detail in this article.
Our goal is to find the weights w that minimize E(w). Unfortunately, this function is non-convex because of the non-linear activations of the hidden neurons. This means that it may have multiple local minima:

There are various techniques that can be used to prevent gradient descent from getting stuck in a local minimum, such as momentum. These techniques will be covered in future articles.
Finding the Gradients of the Error
In order to use gradient descent, we need to compute the partial derivatives of E(w) with respect to each one of the weights in the network:

$$\frac{\partial E}{\partial w_{ij}^l}$$
To simplify the mathematical derivation, we will assume that we have only one training example and find the partial derivatives of the error with respect to that example:

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial J(y, o)}{\partial w_{ij}^l}$$
where y is the label of this example and o is the output of the network for that example. The extension to n training samples is straightforward, since the derivative of the sum of functions is just the sum of their derivatives.
The computation of the partial derivatives of the weights in the hidden layers is not trivial, since those weights do not directly affect the output (and hence the error). To address this problem, we will use the chain rule of derivatives to establish a relationship between the gradients of the error in a given layer and the gradients in the subsequent layer.
The Delta Terms
We first note that E depends on the weight wᵢⱼˡ only via the net input zᵢˡ of neuron i in layer l. Therefore, we can apply the chain rule of derivatives to the gradient of E with respect to this weight:

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial z_i^l} \cdot \frac{\partial z_i^l}{\partial w_{ij}^l}$$
The second derivative on the right side of the equation is:

$$\frac{\partial z_i^l}{\partial w_{ij}^l} = \frac{\partial}{\partial w_{ij}^l}\left(\sum_k w_{ik}^l\, a_k^{l-1}\right) = a_j^{l-1}$$
Therefore, we can write:

$$\frac{\partial E}{\partial w_{ij}^l} = \delta_i^l\, a_j^{l-1}, \qquad \text{where } \delta_i^l \equiv \frac{\partial E}{\partial z_i^l}$$

The variable δᵢˡ is called the delta term of neuron i in layer l (or delta for short).
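In vectorized form, once the deltas of a layer are known, all the weight gradients of that layer are obtained with a single outer product. A small illustrative NumPy snippet (the values are made up):

```python
import numpy as np

delta_l = np.array([0.2, -0.5])      # deltas of layer l (illustrative values)
a_prev = np.array([1.0, 0.3, 0.7])   # activations of layer l-1, including the constant 1

grad_W = np.outer(delta_l, a_prev)   # grad_W[i, j] = delta_i^l * a_j^(l-1)
```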
The Delta Rule
The delta rule establishes the relationship between the delta terms in layer l and the delta terms in layer l + 1.
To derive the delta rule, we again use the chain rule of derivatives. The loss function depends on the net input of neuron i only via the net inputs of all the neurons it is connected to in layer l + 1. Therefore we can write:

$$\delta_i^l = \frac{\partial E}{\partial z_i^l} = \sum_j \left[\frac{\partial E}{\partial z_j^{l+1}} \cdot \frac{\partial z_j^{l+1}}{\partial z_i^l}\right]$$
where the index j in the sum goes over all the neurons in layer l + 1 that neuron i in layer l is connected to.
Once again we use the chain rule to decompose the second partial derivative inside the brackets:

$$\frac{\partial z_j^{l+1}}{\partial z_i^l} = \frac{\partial z_j^{l+1}}{\partial a_i^l} \cdot \frac{\partial a_i^l}{\partial z_i^l}$$
The first partial derivative inside the brackets is just the delta of neuron j in layer l + 1, therefore we can write:

$$\delta_i^l = \sum_j \left[\delta_j^{l+1} \cdot \frac{\partial z_j^{l+1}}{\partial a_i^l} \cdot \frac{\partial a_i^l}{\partial z_i^l}\right]$$
The second partial derivative is easy to compute:

$$\frac{\partial z_j^{l+1}}{\partial a_i^l} = \frac{\partial}{\partial a_i^l}\left(\sum_k w_{jk}^{l+1}\, a_k^l\right) = w_{ji}^{l+1}$$
Therefore we get (the factor ∂aᵢˡ/∂zᵢˡ does not depend on j, so it can be taken out of the sum):

$$\delta_i^l = \frac{\partial a_i^l}{\partial z_i^l} \sum_j w_{ji}^{l+1}\, \delta_j^{l+1}$$
But aᵢˡ = f(zᵢˡ), where f is the activation function. Hence, the partial derivative outside the sum is just the derivative of the activation function, f′(x), evaluated at x = zᵢˡ. Therefore we can write:

$$\delta_i^l = f'(z_i^l) \sum_j w_{ji}^{l+1}\, \delta_j^{l+1}$$
This equation, known as the delta rule, shows the relationship between the deltas in layer l and the deltas in layer l + 1. More specifically, each delta in layer l is a linear combination of the deltas in layer l + 1, where the coefficients of the combination are the connection weights between these layers. The delta rule allows us to compute all the delta terms (and thus all the gradients of the error) recursively, starting from the deltas in the output layer and going back layer-by-layer until we reach the input layer.
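In matrix form, the delta rule for an entire layer is just a matrix-vector product followed by an element-wise multiplication with f′(zˡ). A minimal sketch (my own vectorized formulation; it assumes the weight matrix of layer l + 1 is given without its bias column, since the constant bias input receives no deltas):

```python
import numpy as np

def delta_rule(z_l, W_next, delta_next, f_prime):
    """Compute the deltas of layer l from the deltas of layer l + 1.

    z_l        -- net inputs of layer l, shape (n,)
    W_next     -- weights connecting layer l to layer l + 1 (bias column excluded), shape (m, n)
    delta_next -- deltas of layer l + 1, shape (m,)
    f_prime    -- derivative of the activation function
    """
    # delta_i^l = f'(z_i^l) * sum_j w_ji^(l+1) * delta_j^(l+1)
    return f_prime(z_l) * (W_next.T @ delta_next)
```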
The following diagram illustrates the flow of the error information:

For specific activation functions, we can derive more explicit equations for the delta rule. For example, if we use the sigmoid function then:

$$a_i^l = \sigma(z_i^l) = \frac{1}{1 + e^{-z_i^l}}$$
The derivative of the sigmoid function has a simple form:

$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$
Hence:

$$f'(z_i^l) = \sigma(z_i^l)\bigl(1 - \sigma(z_i^l)\bigr) = a_i^l\bigl(1 - a_i^l\bigr)$$
Then the delta rule for the sigmoid function gets the following form:

$$\delta_i^l = a_i^l\bigl(1 - a_i^l\bigr) \sum_j w_{ji}^{l+1}\, \delta_j^{l+1}$$
The Deltas in the Output Layer
The final piece of the puzzle consists of the delta terms in the output layer, which are the first ones we need to compute.
The deltas in the output layer depend both on the loss function and the activation function used in the output neurons:

$$\delta_i^L = \frac{\partial E}{\partial z_i^L} = \frac{\partial J}{\partial o_i} \cdot \frac{\partial o_i}{\partial z_i^L} = \frac{\partial J}{\partial o_i}\, f'(z_i^L)$$

where L denotes the output layer and f is the activation function used to compute the output oᵢ = f(zᵢᴸ).
Let’s now derive more specific delta terms for each type of learning task:
- In regression problems, the activation function we use in the output is the identity function f(x) = x, whose derivative is 1, and the loss function is the squared loss. Therefore the delta is:

$$\delta^L = \frac{\partial}{\partial o}\left[\frac{1}{2}(y - o)^2\right] \cdot 1 = o - y$$
- In binary classification problems, the activation function we use is sigmoid and the loss function is log loss, therefore we get:

$$\delta^L = \frac{\partial J}{\partial o} \cdot \sigma'(z^L) = \frac{o - y}{o(1 - o)} \cdot o(1 - o) = o - y$$
In other words, the delta is simply the difference between the network’s output and the label.
- In multiclass classification problems, we have k output neurons (where k is the number of classes) and we use softmax activation with the cross-entropy loss. Similar to the previous case, the delta term of the i-th output neuron is surprisingly simple:

$$\delta_i^L = o_i - y_i$$
where oᵢ is the i-th component of the network’s prediction and yᵢ is the i-th component of the label. The proof is somewhat longer, and you can find it in this article.
Gradient Descent
Once we finish computing all the delta terms, we can use gradient descent to update the weights. In gradient descent, we take small steps in the opposite direction of the gradient (i.e., in the direction of the steepest descent) in order to get closer to the minimum error:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} E(\mathbf{w})$$
Remember that the partial derivative of the error function with respect to each weight is:

$$\frac{\partial E}{\partial w_{ij}^l} = \delta_i^l\, a_j^{l-1}$$
Therefore, we can write the gradient descent update rule as follows:

$$w_{ij}^l \leftarrow w_{ij}^l - \alpha\, \delta_i^l\, a_j^{l-1}$$
where α is a learning rate that controls the step size (0 < α < 1). In other words, we subtract from the weight between neuron j in layer l – 1 and neuron i in layer l the delta of neuron i multiplied by the activation of neuron j (scaled by the learning rate).
Gradient descent can be applied in one of the following modes:
- Batch gradient descent – the weights are updated after we compute the error on the entire training set.
- Stochastic gradient descent (SGD) – a gradient descent step is performed after every training example. Typically converges faster than batch gradient descent but is less stable.
- Mini-batch gradient descent – a middle ground between batch gradient descent and SGD. We use small batches of random training samples (typically between 10 and 1,000 examples) for the gradient updates. This reduces the noise of SGD while remaining more efficient than full-batch updates, and it is the most common way to train neural networks. A short code sketch of this mode is given after this list.
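Here is a minimal sketch of a mini-batch training loop (my own illustrative skeleton; `compute_gradients` is a hypothetical helper that stands in for the forward and backward passes described above and is assumed to return one gradient array per weight matrix, averaged over the batch):

```python
import numpy as np

def minibatch_gradient_descent(X, y, weights, compute_gradients,
                               alpha=0.1, batch_size=32, n_epochs=10):
    """Mini-batch gradient descent skeleton."""
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)            # shuffle the training set each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # indices of the current mini-batch
            grads = compute_gradients(X[idx], y[idx], weights)
            for W, g in zip(weights, grads):
                W -= alpha * g                     # gradient descent step on each weight matrix
    return weights
```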
Backpropagation: The Complete Algorithm
We are now ready to present the entire algorithm in its full glory:

Initialize all the weights in the network to small random values. Then repeat, until convergence (or for a fixed number of epochs), for each training example (x, y):

1. Forward pass: feed x through the network (with a constant input of 1 for the biases), computing the net inputs zᵢˡ and the activations aᵢˡ = f(zᵢˡ) layer-by-layer until the output o is produced.
2. Compute the deltas in the output layer from the loss and activation functions (for the loss/activation pairs discussed above, δ = o − y).
3. Backward pass: propagate the deltas back toward the input layer using the delta rule, δᵢˡ = f′(zᵢˡ) Σⱼ wⱼᵢˡ⁺¹ δⱼˡ⁺¹.
4. Gradient descent step: update every weight with wᵢⱼˡ ← wᵢⱼˡ − α δᵢˡ aⱼˡ⁻¹.
As an exercise, try to implement this algorithm in Python (or your favorite programming language).
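If you want something to check your solution against, here is one possible NumPy sketch of the full algorithm for a network with sigmoid activations and log loss (my own illustrative implementation, not the article's pseudocode; it runs plain SGD with one example per update and folds the biases into column 0 of each weight matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(layer_sizes, seed=0):
    """One weight matrix per layer; column 0 of each matrix holds the biases."""
    rng = np.random.default_rng(seed)
    return [rng.normal(0, 0.5, size=(n_out, n_in + 1))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    """Forward pass. Returns the activations of every layer (input included)."""
    activations = [np.asarray(x, dtype=float)]
    for W in weights:
        a_prev = np.concatenate(([1.0], activations[-1]))  # prepend the constant 1
        activations.append(sigmoid(W @ a_prev))
    return activations

def backward(activations, y, weights):
    """Backward pass. Returns the delta vector of every layer except the input."""
    deltas = [activations[-1] - y]                 # output delta for sigmoid + log loss
    for W, a in zip(reversed(weights[1:]), reversed(activations[1:-1])):
        W_no_bias = W[:, 1:]                       # drop the bias column
        deltas.append(a * (1 - a) * (W_no_bias.T @ deltas[-1]))  # delta rule (sigmoid)
    return list(reversed(deltas))

def train_sgd(X, Y, layer_sizes, alpha=0.5, n_epochs=100):
    weights = init_network(layer_sizes)
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            activations = forward(x, weights)
            deltas = backward(activations, y, weights)
            for l, W in enumerate(weights):        # gradient descent step
                a_prev = np.concatenate(([1.0], activations[l]))
                W -= alpha * np.outer(deltas[l], a_prev)
    return weights
```

For example, `train_sgd(X, Y, [2, 2, 2, 1])` would train a network with two inputs, two hidden layers of two neurons each, and one output.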
Backpropagation Example
Imagine that we have a binary classification problem with two binary inputs and a single binary output. Our neural network has two hidden layers with the following weights:

The activation function in the hidden layers and in the output unit is the sigmoid function, and the learning rate is α = 0.5.
The network is presented with a training example with the inputs x₁ = 1 and x₂ = 0, and the target label is y = 1. Let's perform one iteration of the backpropagation algorithm to update the weights.
We start with forward propagation of the inputs:

The output of the network is 0.6718 while the true label is 1, hence we need to update the weights in order to increase the network’s output and make it closer to the label.
We first compute the delta at the output node. Since this is a binary classification problem we use the log loss function, and the delta at the output is o – y:

$$\delta = o - y = 0.6718 - 1 = -0.3282$$
We now propagate the deltas from the output neuron back to the input layer using the delta rule:

Note how the deltas become increasingly smaller as we go back through the layers, causing the early layers in the network to train very slowly. This phenomenon, known as vanishing gradients, was one of the main reasons why backpropagation struggled to train deep networks, and addressing it was a major motivation for the developments that led to modern deep learning.
Finally, we perform one step of gradient descent:

Let’s do another forward pass to see if the network’s output became closer to the target:

Indeed, the output has increased from 0.6718 to 0.6981!
Final Notes
All images unless otherwise noted are by the author.
Thanks for reading!
References
[1] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Nature 323.6088 (1986): 533–536.