
Courage to Learn ML: Explain Backpropagation from Mathematical Theory to Coding Practice

Transforming Backpropagation's Complex Math into Manageable and Easy-to-Learn Bites

Image created by the author using ChatGPT.

Welcome back to the latest chapter of ‘Courage to Learn ML’. In this series, I aim to demystify complex ML topics and make them engaging through a Q&A format.

This time, our learner is exploring backpropagation and has chosen to approach it through coding. He discovered a Python tutorial on Machine Learning Mastery that explains backpropagation from scratch using basic Python, without any deep learning frameworks. Finding the code a bit puzzling, he brought it to his mentor and asked for guidance to better understand both the code and the concept of backpropagation.

As always, here’s a list of the topics we’ll be exploring today:

  • Understanding backpropagation and its connection to gradient descent
  • Exploring the preference for depth over width in DNNs, and the rarity of shallow, wide networks
  • What is the chain rule?
  • Breaking down the backpropagation calculation into 3 components and examining each thoroughly. Why is it called backpropagation?
  • Understanding backpropagation through straightforward Python code
  • Gradient vanishing and common preferences in activation functions

Let’s start with the fundamental why –

What is backpropagation and how is it related to gradient descent?

Gradient descent is a key optimization method in Machine Learning. It’s not limited to training DNNs; it’s also used to train models like logistic and linear regression. The fundamental idea is that by minimizing the difference between predictions and true labels (the prediction error), our model gets closer to the underlying true model. In gradient descent, the gradient, represented by ∇𝜃𝐽(𝜃) and formed by the loss function’s partial derivatives with respect to each parameter, guides the update of parameters: 𝜃 = 𝜃 – 𝜂⋅∇𝜃𝐽(𝜃). This process is akin to dissecting a complex movement into basic moves, as in a video game.
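
To make the update rule concrete, here’s a minimal sketch of gradient descent on a toy quadratic loss (the loss J(θ) = θ² and the learning rate are illustrative assumptions of mine, not tied to any particular model):

def gradient_descent_step(theta, grad, lr=0.1):
    # θ = θ - η * ∇J(θ)
    return theta - lr * grad

# Toy loss J(θ) = θ², so ∇J(θ) = 2θ; the minimum sits at θ = 0
theta = 5.0
for _ in range(50):
    theta = gradient_descent_step(theta, grad=2 * theta)
print(theta)  # ≈ 7e-05, very close to the minimum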

However, in DNNs, which involve multiple layers and millions of parameters, calculating these partial derivatives for each parameter becomes computationally intensive. Particularly in a layered structure, it’s crucial to distinguish the error contribution of each layer to understand how the parameters at different layers should change with respect to the overall loss.

This is where Backpropagation comes in. It is a method to efficiently compute gradients for DNNs. Backpropagation assists DNNs in using gradient descent to guide their learning process, calculate the partial derivatives, and adjust each parameter more efficiently. The key to backpropagation lies in its name – it stands for ‘backward propagation of errors.’ This means the process involves sending the error (between the current prediction and the true label) backward and distributing the gradient from the output layer to the input layer and hidden layers in between. This distribution is in the reverse direction compared to the forward direction used to generate predictions.

So the hard part of training DNNs comes from having multiple layers. But before talking more about backpropagation, I’m curious why DNNs typically go deeper rather than wider. Why aren’t shallow but wide networks popular?

This question is about the preference for deep networks over shallow ones. Before jumping in, let’s define shallow networks as having only 1 or 2 hidden layers. According to the universal approximation theorem, a network with a single hidden layer that’s wide enough can theoretically approximate any continuous function. However, in practice, packing that many neurons into a wide network is not always feasible due to high computational demands.

Deep networks, with multiple layers, are generally more efficient at modeling complex functions with fewer neurons compared to shallow networks. They are particularly good at learning different levels of data representation. For example, in facial recognition using CNNs, the initial layers might learn simple patterns like edges, and deeper layers can recognize more complex features like parts of a face.

Shallow networks, though, have their own advantages. They are easier to train and don’t face some common problems of deep networks, such as vanishing or exploding gradients. They are also more straightforward to understand. But, to capture complex functions, they might need many more neurons, making them less efficient for certain tasks.

To sum up, deep networks are typically favored because they can learn complex patterns in a hierarchical, structured manner and do so efficiently with fewer neurons. But the study of shallow versus deep networks is still an active field in machine learning.

Now, let’s explore backpropagation in detail and utilize the code snippets as a resource to deepen our understanding of the concept.

If you’re new to gradients or gradient descent, or need a refresher, my previous article offers a detailed exploration of gradient descent and popular optimizers, accessible here. This section, inspired by Hung-yi Lee’s engaging lectures, blends my personal insights with his teachings. For those fluent in Chinese, I highly recommend his machine learning lectures, available on YouTube.

The initial step involves understanding the role of gradients in our process. Gradients enable us to calculate the partial derivative of the loss with respect to each parameter from different layers.

Consider a scenario where we train our model using SGD (Stochastic Gradient Descent) with a batch size of 1. This means we use a single sample at a time to train our network. Now suppose, through a certain magic, we can determine the partial derivative of the loss with respect to any weight, regardless of its depth in the network. For instance, in the network below, assume we already know the partial derivatives of the loss with respect to the weights of the first layer, such as the gradient of w1, denoted as ∂L(θ)/∂w1. Our goal is to understand what ∂L(θ)/∂w1 actually represents.

We will study backpropagation based on this network. Source: https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s

Author’s Note: A valuable technique I learned while practicing coding on Leetcode is to approach recursive functions by assuming a part of the task is already completed, then using the result of that assumed part to tackle the larger problem. Adopting this attitude of assumed knowledge can ease your mind and foster confidence as you delve into a topic. Essentially, it’s about ‘faking it till you make it’, both in problem-solving and as a life attitude.

To dissect ∂L(θ)/∂w1 into comprehensible segments, we must explore the chain rule of partial derivatives. So, as is customary, let’s begin with the question:

What is the chain rule?

The chain rule is a fundamental technique in calculus, used for computing the derivative of a composite function. To illustrate what the chain rule is, consider the following two examples:

  • If y = f(x) and x = g(t), this implies y = f(g(t)). Therefore, the derivative of y with respect to t (∂y/∂t) is calculated as the product of the derivative of y with respect to x (∂y/∂x) and the derivative of x with respect to t (∂x/∂t). So we have ∂y/∂t = ∂y/∂x * ∂x/∂t
Image created by the author.
  • If z = f(x, y), with x = q(t) and y = g(t), the derivative of z with respect to t (∂z/∂t) is the sum of two products: the derivative of z with respect to x (∂z/∂x) times the derivative of x with respect to t (∂x/∂t), plus the derivative of z with respect to y (∂z/∂y) times the derivative of y with respect to t (∂y/∂t). So we have ∂z/∂t = (∂z/∂x · ∂x/∂t) + (∂z/∂y · ∂y/∂t), as the quick numerical check below confirms.
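
As a sanity check of the second case, the chain-rule result should match a finite-difference approximation. Here’s a toy composite function I picked purely for illustration:

import math

# z = x * y with x = t**2 and y = sin(t), so z(t) = t**2 * sin(t)
def z_of_t(t):
    return (t ** 2) * math.sin(t)

t = 1.3
# Chain rule: ∂z/∂t = (∂z/∂x · ∂x/∂t) + (∂z/∂y · ∂y/∂t) = y·2t + x·cos(t)
analytic = math.sin(t) * 2 * t + (t ** 2) * math.cos(t)

# Finite-difference approximation of ∂z/∂t
eps = 1e-6
numeric = (z_of_t(t + eps) - z_of_t(t - eps)) / (2 * eps)
print(analytic, numeric)  # the two values agree closely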

To visualize the chain rule, imagine derivatives as streams of water in a landscape. Calculating the derivative of an element in a process is like tracing a water stream’s journey. When functions are nested within each other, envision it as a small stream flowing into a river, then merging into the ocean. This represents how derivatives are multiplied together as they progress from small-scale to large-scale impacts. When two streams (derivatives) originate from different sources and converge, think of them as combining into a single stream. This illustrates the summing of two derivatives in the chain rule’s context.

Imagine derivatives as streams of water in a landscape. Image created by the author using ChatGPT.

An intuitive but technically ‘incorrect’ way to grasp the chain rule is to liken it to fraction manipulation: visualize the denominator canceling out with the numerator when the fractions multiply together. While this picture isn’t mathematically rigorous, it serves as a useful aid for remembering how the terms combine.

This image shows a simple but incorrect way to understand the chain rule. Image created by the author.

How is the chain rule applied to our calculation of ∂L(θ)/∂w1?

Returning to our neural network, it’s helpful to visualize the connections within it as water streams. Here, the edges of the network symbolize the directional flow from input to output, with activation functions acting like complex gates altering the stream’s properties. To understand how the first layer’s weights (w1) affect the loss, we trace the ‘water stream’ from the loss back to w1. In the network diagram below, the red lines illustrate this stream from the output back to w1, where z′ and z″ are separate streams converging before an activation gate.

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

This stream flow analogy helps us reinterpret the partial derivative of the loss with respect to the first layer’s weights (∂L(θ)/∂w1) more intuitively.

We can then dissect ∂L(θ)/∂w1, representing the general partial derivative of loss with respect to weight, into three parts using the chain rule:

  • The partial derivative of the overall loss with respect to the current layer’s activation output a (∂L/∂a), taking into account its role as the input for subsequent layers. This reflects the portion of the loss this layer can influence or contribute to. Note that the current layer’s output is, in turn, influenced by the preceding layers.
  • The partial derivative of the activation function’s output a with respect to its input z (∂a/∂z), where a = σ(z). Since the activation function is like a gate altering the stream’s characteristics, understanding its impact on the stream becomes crucial when allocating loss to different layers.
  • The partial derivative of a neuron’s direct output z with respect to its weight w1 (∂z/∂w1). In the formula z = w1·x1 + w2·x2 + b, z is the direct outcome of w1 and represents the stream prior to encountering the activation function.
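
Putting the three parts together, the chain rule gives the full decomposition: ∂L(θ)/∂w1 = (∂L/∂a) · (∂a/∂z) · (∂z/∂w1).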

After breaking down the partial derivatives into individual components, we can tackle the calculation by addressing each part separately. Let’s begin with the simplest component by posing this question:

How do we compute ∂z/∂w1, which is the partial derivative of z with respect to the weight that directly determines it in a linear fashion?

∂z/∂w1 represents the gradient of the pre-activation output z with respect to the weight directly associated with its calculation. Since z, calculated as z = w1·x1 + w2·x2 + b, is a linear combination of inputs and weights, the derivative ∂z/∂w1 is simply the input associated with the given weight w1, which is x1. This means the gradient of z with respect to its direct weight does not depend on the other layers or on the loss function (error). It’s a direct relationship: the partial derivative equals the input that w1 multiplies in the forward pass. For a bias term b, the derivative is always 1, since the bias enters as a constant.
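
A tiny numeric check (with toy values of my own choosing) makes this concrete: nudging w1 changes z by exactly x1 per unit.

w1, w2, b = 0.5, -0.3, 0.1
x1, x2 = 2.0, 4.0

z = w1 * x1 + w2 * x2 + b

# Nudge w1 and watch z respond: ∂z/∂w1 should equal x1
eps = 1e-6
z_nudged = (w1 + eps) * x1 + w2 * x2 + b
print((z_nudged - z) / eps)  # ≈ 2.0, which is x1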

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

To clarify with examples, let’s consider two scenarios. Start with the first scenario, where z is in the first hidden layer of the network. As highlighted in the graph, with z and its inputs marked in red, we can read off the derivatives directly from the values of the related inputs.

The same principle applies to hidden layers that are not directly connected to the input data: the partial derivative of a linear output with respect to a weight is the input flowing into that weight. In the graph, for subsequent outputs like z′ and z″, the partial derivatives with respect to their associated weights, such as w3 and w4, are the outputs of the previous layer’s activation function, which in this case is a. I highlighted those relationships in blue.

In summary, computing the partial derivative of z (the output before activation) with respect to one of its weights is straightforward: it’s the input value to that weight, and the derivative with respect to the bias is always 1. One can trace back along the weight’s edge in the network graph to find the corresponding input, and this input is the value of the partial derivative of z with respect to that weight. This part of the backpropagation process requires no "backward" information from the loss and is determined during the forward pass while processing the input data.

So, the calculation of ∂z/∂w1 isn’t actually a "backward" process. It’s simply equal to the input of the neuron. How about the partial derivative of the activation output with respect to its input, ∂a/∂z?

The computation of ∂a/∂z is just as straightforward, considering that a is simply a function of z. Specifically, a = σ(z), where σ represents the activation function. For those unfamiliar with activation functions, a function like σ(z) is used to introduce non-linearity into the neural network. Using our analogy, you can think of it as a ‘water gate’ that alters the stream of computation in a non-linear fashion, which is vital for the network’s ability to capture complex, non-linear underlying patterns.

Because activation functions are what make the network’s computations non-linear, the derivative ∂a/∂z is a key component in backpropagation: it adjusts gradients as they traverse the network in reverse. Given a = σ(z), the derivative ∂a/∂z is simply σ′(z), the derivative of the activation function with respect to its input z.

Calculating σ′(z) is quite simple: plug the input z into the derivative of the activation function. This derivative is just another function, and it requires no information about the loss of the current prediction. For instance, if the activation function is the sigmoid, its derivative can be defined in terms of the function itself as σ′(z) = σ(z)(1 − σ(z)). In practice, our program can define σ′(z) before training; we then feed z into σ′(z) to get the partial derivative ∂a/∂z.
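
Here’s a small sketch (mine, not the tutorial’s) that predefines the sigmoid and its derivative, and checks the identity σ′(z) = σ(z)(1 − σ(z)) against a finite difference:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)  # σ'(z) written in terms of σ(z) itself
    return s * (1.0 - s)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_derivative(z), numeric)  # both ≈ 0.2217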

Thus, like ∂z/∂w1, ∂a/∂z doesn’t need loss information and can be computed during the forward pass. This derivative is then used in the chain rule during backpropagation to find the gradient of the loss with respect to the weights.

Considering our discussion thus far, when computing ∂L/∂w1, two of the three components require no loss information or input from later layers. Then why do we refer to the calculation process as ‘backpropagation’?

I’m glad you noticed that. Let’s recap our discussion so far:

We’ve established that the calculation of ∂z/∂w1 is straightforward: it’s simply the input associated with weight w1. Similarly, for the activation function’s output a, represented as a = σ(z), the derivative ∂a/∂z is just σ′(z), the derivative of the activation function. Both calculations are independent of the loss function and can be either predefined ahead of the training process or computed during the forward pass.

Now, the remaining part involves ∂L/∂a, which tells us how changes in the activation output a affect the overall loss L. Imagine the entire neural network as a cake factory’s production line, tasked with baking, packaging, and shipping cakes to stores. If a cake arrives damaged (akin to a poor prediction), it’s necessary to backtrack and identify which stage of the process was responsible for the damage and to what extent.

Imagine the entire neural network as a cake factory’s production line. Image created by the author using ChatGPT.

In backpropagation, we use the chain rule to decompose ∂L/∂a. In our example network, a serves as an input to two outputs, z′ and z″. So, to understand how much a contributes to the loss, we need to gather information from both of those outputs.

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

This is analogous to the cake production line, where ∂z′/∂a and ∂z′′/∂a represent the contribution of a to each output, and we must trace the impact back from the loss through these channels.

Given that a feeds into z′ and z″ through weights w3 and w4 (so ∂z′/∂a = w3 and ∂z″/∂a = w4), the chain rule gives

∂L/∂a = w3 · (∂L/∂z′) + w4 · (∂L/∂z″)

We combine this with the activation function derivative ∂a/∂z = σ′(z) to get

∂L/∂z = σ′(z) · [w3 · (∂L/∂z′) + w4 · (∂L/∂z″)]


This formula shows that ∂L/∂z is influenced by the gradients from the next layer, showcasing the ‘backward’ aspect of the computation. This is key to understanding why the process is named backpropagation.

To update weights throughout the neural network, we calculate the partial derivative of the loss with respect to each weight, ∂L/∂w. A forward calculation (from input layer to output layer) would be inefficient, repeatedly recomputing ∂L/∂z for every layer starting from the input. Using the formula above, a more efficient approach is to compute ∂L/∂z starting from the output layer and moving backwards to the input layer, storing each result and reusing the current layer’s ∂L/∂z for the preceding layers. Note that to find ∂L/∂z for the (n−1)th layer, we only need the derivatives already computed for the nth layer. This allows us to calculate each layer’s derivative just once, in a single, elegant backward pass.
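
To make the recursion concrete, here’s a minimal sketch (my own illustration, not the tutorial’s code) of how one layer’s ∂L/∂z values are computed from the layer after it, following ∂L/∂z = σ′(z) · (sum of weight × next-layer delta):

def layer_deltas(sigma_prime_z, weights_next, deltas_next):
    # sigma_prime_z[j]:   σ'(z) for neuron j of this layer
    # weights_next[k][j]: weight from neuron j here to neuron k of the next layer
    # deltas_next[k]:     ∂L/∂z already computed for the next layer
    deltas = []
    for j in range(len(sigma_prime_z)):
        # ∂L/∂a: sum the error flowing back through every outgoing edge
        dL_da = sum(w[j] * d for w, d in zip(weights_next, deltas_next))
        deltas.append(sigma_prime_z[j] * dL_da)  # ∂L/∂z = σ'(z) * ∂L/∂a
    return deltas

Calling this once per layer, from the output layer down to the first, visits each layer exactly once: the efficiency gain described above.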

So, we recursively compute ∂L/∂z for each layer, moving backwards. Then how do we calculate the partial derivative of the loss with respect to the output layer’s z, which serves as the starting value of ∂L/∂z for the whole process?

Let’s consider the ∂L/∂z calculation for the output layer. Here, z is the input to the output layer’s activation, which transforms it into the prediction y-hat via σ(z). The prediction y-hat is then used to calculate the loss, expressed as L(y, y-hat), where L is the loss function. Applying the chain rule splits the calculation of ∂L/∂z for the last layer into two parts:

  • The derivative of the loss function with respect to the network’s prediction, which depends on the type of loss function. For instance, for regression tasks we often use MSE (Mean Squared Error), while for multi-class classification we’d choose cross entropy (CE). For a single prediction with squared error L = ½(y-hat − y)², the partial derivative of the loss with respect to the prediction is ∂L/∂y-hat = (y-hat − y).
  • The derivative of the prediction (y-hat) with respect to the last layer’s pre-activation input z, which, similar to ∂a/∂z, is the derivative of the last layer’s activation function, σ′(z). A small worked example follows this list.
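
Putting the two parts together, and assuming the squared-error loss L = ½(y-hat − y)² on a single prediction (my choice for illustration), the starting delta of the whole backward pass is simply the product below; this matches the output-layer delta computed in the tutorial code we’ll examine shortly.

def output_delta(y, y_hat, sigma_prime_z):
    # ∂L/∂y_hat = (y_hat - y) for L = 0.5 * (y_hat - y)**2
    # ∂y_hat/∂z = σ'(z), the output activation's derivative
    return (y_hat - y) * sigma_prime_z

print(output_delta(y=1.0, y_hat=0.8, sigma_prime_z=0.16))  # -0.032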

Our discussion so far has assumed backpropagation with SGD (Stochastic Gradient Descent) and a batch size of 1. What if we use a larger batch size for training? Would that alter the calculations?

You are correct that the calculations with a larger batch size differ slightly, but the fundamental backpropagation calculations we discussed remain valid. The change is primarily an additional averaging step.

When the batch size is greater than 1, the loss during the forward pass is the average loss across all samples in the batch. So the partial derivative of the loss with respect to a weight, ∂L/∂w1, is the average of the derivatives of the individual losses (∂lᵢ/∂w1) over all data points in the batch.

Mathematically, for batch size n, the gradient of the loss L with respect to weight w1 is calculated as

∂L/∂w1 = (1/n) · Σᵢ₌₁ⁿ ∂lᵢ/∂w1

Here lᵢ represents the loss for the ith sample within the batch. Note that we use the average of the gradients: averaging ensures the weight update reflects the average direction in which the loss decreases across all samples in the batch.
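
In code, this is just a mean over the per-sample gradients (a toy sketch with made-up numbers):

def batch_gradient(per_sample_grads):
    # average ∂l_i/∂w1 over the n samples in the batch
    return sum(per_sample_grads) / len(per_sample_grads)

print(batch_gradient([0.2, -0.1, 0.4, 0.3]))  # 0.2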

Alright, let’s apply this knowledge to understand the code you’ve got; working through an implementation is an excellent way to grasp complex concepts. As usual, let’s shape this into a question:

Based on our discussion, how do we combine these elements to construct the backpropagation calculation? And how do we use our insights to decode this code snippet?

Let’s walk through the calculation step by step. All the code shown here is from the Python tutorial on Machine Learning Mastery.

Define the activation function and its derivative.

The code uses the sigmoid as the activation function. To calculate the derivative of the activation output with respect to its input, the code defines the sigmoid as transfer and a transfer_derivative function for its derivative. Note that transfer_derivative takes the neuron’s output a = σ(z) and returns a·(1 − a), which equals σ′(z).

from math import exp

# Transfer neuron activation (the sigmoid function)
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

# Calculate the derivative of a neuron's output:
# given the output a = σ(z), return σ'(z) = a * (1 - a)
def transfer_derivative(output):
    return output * (1.0 - output)

Calculate ∂L/∂a.

The backward_propagate_error function loops backwards from the output layer to the input layer. Within the function, the errors list holds ∂L/∂a for each neuron, and the calculation differs between the output layer and the hidden layers.

  • For the output layer, the error is simply the difference between the output and the expected value, computed as neuron['output'] - expected[j], i.e. (y_hat − y). This part of the code can be confusing, because the loss function is never written out explicitly; typically we’d use MSE (mean squared error) or CE (cross entropy). In fact, (y_hat − y) is exactly the derivative of the squared-error loss L = ½(y_hat − y)² with respect to the prediction, but that partial derivative is not obvious from the code alone.
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        # calculate the error for each layer
        if i != len(network)-1:
            ...
        else:
            # ∂L/∂a of the output layer
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(neuron['output'] - expected[j])
  • For hidden layers, the error is calculated from the weights and deltas (each delta representing ∂L/∂z) of the following layer.
# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        # calculate the error for each layer
        if i != len(network)-1:
            # ∂L/∂a of the hidden layer
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            ...

Calculate ∂L/∂z.

Here, we compute ∂L/∂z as ∂L/∂a (represented as errors in the code) multiplied by σ′(z), and store the result as neuron['delta'].

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        ...
        # ∂L/∂z: multiply ∂L/∂a by the activation derivative σ'(z)
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

Calculate ∂L/∂w to adjust weights.

Finally, we update the weights at the corresponding layer using ∂L/∂w = ∂L/∂z · ∂z/∂w. The weights are adjusted based on the delta (representing ∂L/∂z) and the inputs, while for the bias term we only need the delta, since the bias’s input is the constant 1.

# Update network weights with error
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                # ∂L/∂w = delta * input; step against the gradient
                neuron['weights'][j] -= l_rate * neuron['delta'] * inputs[j]
            # for the bias term the input is 1, so ∂L/∂b = delta
            neuron['weights'][-1] -= l_rate * neuron['delta']

Note that the code is based on SGD with a batch size of 1, so it doesn’t include the calculations involving averaging losses or partial derivatives over multiple samples.

You know, backpropagation seems kind of like the reverse of the forward pass, doesn’t it? Is that a fair way to look at it?

The backpropagation calculation is indeed akin to a reversed version of the forward pass, largely because:

  • The derivative of the activation function mirrors the role the activation function plays in forward propagation.
  • The ∂L/∂z calculation resembles calculating z in the forward pass: both sum up products of outputs and weights. The difference is that backpropagation uses the values coming from the next layer rather than the previous one.

However, they are still quite different:

  • Different starting points. The forward pass begins with the input data X, while backpropagation starts from the loss, making the choice of loss function pivotal.
  • Different purposes. Forward propagation aims to generate predictions from the given data, while the goal of backpropagation is to train the model by adjusting its parameters based on the comparison between predictions and actual values.
  • Calculation dependency. Backpropagation needs the results of the forward pass and uses the chain rule to integrate elements from the forward pass with the partial derivatives of the loss from the following layers. So backpropagation is inseparable from the forward propagation process.

Why is it important for us to understand the intricate details of backpropagation’s calculations?

Understanding backpropagation is crucial, even when using deep learning frameworks like PyTorch and TensorFlow. A thorough understanding of its calculation provides an intuitive grasp of various challenges and training tricks in deep learning, without needing to memorize them. One key insight it yields is the following:

How does backpropagation inform the choice of activation functions?

Recall our discussion on calculating ∂a/∂z. When implementing backpropagation in code, we define both the activation function and its derivative up front for efficiency, which explains why tutorials often list derivatives alongside activation functions. These derivatives are as essential as the activation functions themselves and are typically predefined before modeling.

Activation Functions and their Derivatives. Source: https://dwaithe.github.io/images/activationFunctions.png

Consider the computation of ∂L/∂z during backpropagation. This calculation relies on the outputs from subsequent layers, and it follows a pattern: the gradient for earlier layers is a product of the activation function derivatives at each successive layer. For instance, if our example network had one more layer after z′ and z″, with weights w5 and w6, this calculation would become an even longer chain of multiplied derivatives.

If we choose sigmoid or tanh as our activation function, we may run into the vanishing gradient problem, because the derivatives of these functions have very small ranges: the sigmoid’s derivative lies between 0 and 0.25, while tanh’s derivative lies between 0 and 1. As a result, when multiplied across layers, the products diminish progressively, leading to ever smaller gradients. This shrinking gradient magnitude means the earlier layers receive very small updates, preventing them from learning and adjusting their parameters effectively.
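
A toy calculation (numbers of my own choosing) shows how quickly this multiplication shrinks the signal. Even in the best case, where every sigmoid derivative takes its maximum value of 0.25, almost nothing reaches a layer 20 connections away:

grad = 1.0
for _ in range(20):
    grad *= 0.25  # the best case for a sigmoid derivative
print(grad)  # ≈ 9.1e-13: the earliest layers receive almost no signal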

Source: https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Vanishing-and-Exploding-Gradients-in-Neural-Network-Models-Debugging-Monitoring-and-Fixing-Practical-Guide_7.png?resize=636%2C497&ssl=1

One solution is to use an activation function whose derivative has a broader range. ReLU’s derivative, for example, is either 1 or 0, which effectively addresses the vanishing gradient issue. However, ReLU has its own drawback: when a neuron’s input is negative, its derivative is 0, so the product of derivatives becomes 0. The neuron then receives no updates, becomes inactive, and stops contributing to learning. This problem is known as the "dying ReLU".

In summary, a thorough understanding of backpropagation can be very beneficial. It’s a cornerstone in developing effective neural network models and troubleshooting training problems.

(Unless otherwise noted, all images are by the author)


If you’re enjoying this series, remember that your interactions – claps, comments, and follows – do more than just support; they’re the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:


If you liked the article, you can find me on LinkedIn.

Reference:

https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

