
Back-propagation Demystified [Part 1]

Read to learn about the back-propagation algorithm in the context of Deep Learning

Photo by ThisisEngineering RAEng on Unsplash

If you have been reading about Deep Learning, you have almost certainly come across the term back-propagation at least once. In this article, I explain back-propagation and computational graphs.

I would recommend reading about gradient descent optimization here before proceeding with this article. Now, let’s dive right into the explanation.

Introduction

We will first look at a few key steps involved in network training and terms like forward propagation.

Deep learning network training steps

The overall process of training a deep learning network may be summarized in the following steps.

  1. Data exploration and analysis [This is a huge step involving a lot of sub-steps]
  2. Choice of an architecture based on the information obtained in step 1, and building it using a framework such as Chainer, PyTorch, or TensorFlow.
  3. Choice of an appropriate cost function, such as binary cross-entropy or mean squared error, depending on the task at hand.
  4. Choice of an optimizer such as Stochastic Gradient Descent, Adam, or Adadelta.
  5. Training the model to minimize the selected loss function w.r.t. the network parameters using the training examples (a minimal sketch of steps 2–5 follows this list).
  6. Examining and analyzing network performance.
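To make these steps concrete, here is a minimal PyTorch sketch of steps 2–5. The architecture, loss function, optimizer settings, and random toy data below are placeholders chosen purely for illustration, not a recommendation for any particular task.

```python
import torch
import torch.nn as nn

# Toy data: 64 examples with 10 features each and binary labels (placeholder dataset).
X = torch.randn(64, 10)
y = torch.randint(0, 2, (64, 1)).float()

# Step 2: a small architecture built with a framework (here PyTorch).
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

# Step 3: a cost function suited to the task (binary cross-entropy on logits).
loss_fn = nn.BCEWithLogitsLoss()

# Step 4: an optimizer (SGD with an arbitrary learning rate).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step 5: training loop -- minimize the loss w.r.t. the network parameters.
for epoch in range(100):
    optimizer.zero_grad()        # reset accumulated gradients
    y_hat = model(X)             # forward propagation
    loss = loss_fn(y_hat, y)     # scalar cost J(theta)
    loss.backward()              # back-propagation computes the gradients
    optimizer.step()             # parameter update using those gradients
```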

Forward Propagation

In neural networks, forward propagation refers to the flow of data from the input to the output of the network. After forward propagation for an input x, you obtain a prediction ŷ. Using this predicted value, the scalar cost J(θ) is computed over the training examples.
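As a small sketch (with a hypothetical single-hidden-layer network and random data standing in for a real model and dataset), forward propagation and the scalar cost might look like this:

```python
import numpy as np

# Placeholder single-hidden-layer network: parameters theta = (W1, b1, W2, b2).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

x = rng.normal(size=(5, 10))      # 5 training examples with 10 features each
y = rng.normal(size=(5, 1))       # their targets

# Forward propagation: data flows from the input towards the output.
h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU activation
y_hat = h @ W2 + b2               # network output (the prediction y-hat)

# Scalar cost J(theta), here the mean squared error over the examples.
J = np.mean((y_hat - y) ** 2)
print(J)
```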

Back-propagation Algorithm

The back-propagation algorithm comes into play during training (step 5), where it computes the gradients required by the optimizer chosen in step 4. Computing an analytical expression for the gradient is straightforward; however, numerically evaluating such an expression naively is computationally expensive. The back-propagation algorithm performs this calculation with a relatively simple and inexpensive procedure.

People often get confused and think back-propagation is the whole learning algorithm for neural networks. In reality, it is only the procedure used to calculate the gradients during the optimization stage.

The back-propagation algorithm leverages the chain rule of differential calculus, which computes the error gradients in terms of summations of local-gradient products over the various paths from a node to the output. Although this summation has an exponential number of components (paths), one can compute it efficiently using dynamic programming. The back-propagation algorithm is a direct application of dynamic programming. It contains two main phases, referred to as the forward and backward phases, respectively. The forward phase is required to compute the output values and the local derivatives at various nodes, and the backward phase is required to accumulate the products of these local values over all paths from the node to the output [1].
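As a toy illustration of the two phases (using a made-up single-variable function rather than a network), consider f(x) = sin(x²):

```python
import math

def backprop_sin_square(x):
    """Back-propagation through f(x) = sin(x**2), written as two explicit phases."""
    # Forward phase: compute the node values and the local derivatives at each node.
    u = x * x                # intermediate node u = x^2
    f = math.sin(u)          # output node f = sin(u)
    du_dx = 2 * x            # local derivative of u w.r.t. x
    df_du = math.cos(u)      # local derivative of f w.r.t. u

    # Backward phase: accumulate the product of local derivatives from output to input.
    df_dx = df_du * du_dx    # chain rule: df/dx = df/du * du/dx
    return f, df_dx

print(backprop_sin_square(1.5))  # value and gradient at x = 1.5
```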

Next, let us look at computational graphs, which are a means of representing equations in the form of a graph data structure. All the deep learning frameworks (e.g. PyTorch, TensorFlow) use them internally for implementing back-propagation.

Computational Graphs

Computational graphs are a method of representing mathematical expressions. In the case of deep learning models, this is like a descriptive language giving the functional description of the required computation [1].

These can be instantiated for two types of computations:

  • Forward computation
  • Backward computation

A few essential terms of a computational graph are explained below.

  • Node: A node in a graph is used to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.
  • Edge: An edge represents a function argument and also a data dependency. Edges are just like pointers to nodes.
  • Operation: An operation is a simple function of one or more variables. There is a fixed set of allowable operations. Functions more complicated than these operations in this set may be described by composing many operations together.

A node knows how to compute its own value, as well as the derivative of its value w.r.t. each argument (edge) times the derivative of an arbitrary input.

In other words, the nodes represent units of computation, and the edges represent the data consumed or produced by a computation.

One of the most important advantages of using computational graphs is that the edges make the dependencies between operations explicit. This makes it easier to identify operations that can be executed in parallel.
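Here is a minimal sketch of these ideas in plain Python (the Node class and the expression f = (x + y) * z are made up for illustration, and the backward pass is omitted; real frameworks are far more elaborate):

```python
class Node:
    """A node stores a variable and, if it is not an input, the operation that produces it."""

    def __init__(self, value=None, op=None, inputs=()):
        self.value = value    # the variable held at this node (a scalar here)
        self.op = op          # the operation name, or None for input/leaf nodes
        self.inputs = inputs  # edges: the argument nodes this node depends on

    def forward(self):
        # Bottom-up data flow: evaluate the arguments first, then apply the operation.
        if self.op is None:
            return self.value
        args = [n.forward() for n in self.inputs]
        if self.op == "add":
            self.value = args[0] + args[1]
        elif self.op == "mul":
            self.value = args[0] * args[1]
        return self.value

# Hypothetical expression f = (x + y) * z built as a graph of nodes and edges.
x, y, z = Node(2.0), Node(3.0), Node(4.0)
s = Node(op="add", inputs=(x, y))
f = Node(op="mul", inputs=(s, z))
print(f.forward())  # 20.0
```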

Popular deep learning frameworks such as PyTorch and TensorFlow also rely on the creation of these computational graphs to implement back-propagation for the defined networks [2]. They perform the gradient calculation needed by the optimization algorithms automatically for you. An example of a computational graph is shown below.

Source: From the author

This graph can be used to compute the value of the expression using a bottom-up data flow. An example is shown below.

Source: From the author

Chain Rule of Calculus

In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions. If a variable z depends on the variable y, which itself depends on the variable x, then z depends on x as well, via the intermediate variable y. In Leibniz’s notation, the chain rule is as follows.

dz/dx = (dz/dy) · (dy/dx)

Source: https://en.wikipedia.org/wiki/Chain_rule
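For a concrete (made-up) example, take y = x² and z = exp(y), so that dz/dx = exp(y) · 2x. The sketch below checks this chain-rule derivative against a finite-difference estimate:

```python
import math

# Hypothetical composition: y = g(x) = x**2 and z = f(y) = exp(y).
x = 0.7
y = x ** 2

# Chain rule: dz/dx = dz/dy * dy/dx = exp(y) * 2x.
analytic = math.exp(y) * 2 * x

# Finite-difference check of the same derivative.
eps = 1e-6
numeric = (math.exp((x + eps) ** 2) - math.exp((x - eps) ** 2)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places
```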

The following figure illustrates the chain rule in computational graphs [1].

Source: From the author

The partial derivative of any node a w.r.t. any node b in the graph is equal to the sum, over all paths from node b to node a, of the products of the partial derivatives along the edges of each path. This is what is shown in the figure above, where both paths from the input node to the output node are included in the derivative calculation.
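Here is a tiny numeric sketch of this path aggregation, using a made-up graph with two paths from the input w to the output o (p = 2w and q = w², with o = p · q):

```python
w = 1.5

# Forward values for the made-up two-path graph: p = 2w, q = w**2, o = p * q.
p, q = 2 * w, w ** 2
o = p * q

# Local (edge) derivatives.
dp_dw, dq_dw = 2.0, 2 * w
do_dp, do_dq = q, p

# Aggregate over both paths from w to o: the sum of products of edge derivatives.
do_dw = do_dp * dp_dw + do_dq * dq_dw

print(do_dw, 6 * w ** 2)  # matches the closed form d(2w**3)/dw = 6w**2
```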

Back-propagation applied to our example graph is shown below. Starting from the output node, partial derivatives are calculated for all the lower nodes.

Source: From the author

As you can see, once the backward graph is built, calculating derivatives is straightforward, and this process is heavily optimized in the deep learning frameworks.
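For instance, here is a minimal PyTorch sketch (with an arbitrary toy expression) of this automatic gradient calculation:

```python
import torch

# requires_grad tells PyTorch to track these leaves in the computational graph.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Forward pass: the graph of operations is recorded as the expression is evaluated.
f = (x + y) * y

# Backward pass: traverse the recorded graph and accumulate the gradients.
f.backward()

print(x.grad)  # df/dx = y        -> 3.0
print(y.grad)  # df/dy = x + 2*y  -> 8.0
```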

Conclusion

To summarize, calculating the gradients in a neural network requires aggregating over an exponentially growing number of paths, which seems intractable at first sight. Back-propagation performs this calculation efficiently using what is essentially a dynamic programming technique.

That takes care of the explanation of the back-propagation algorithm. In the next post, I’ll be covering computational graphs w.r.t. the PyTorch and TensorFlow frameworks.

Thank you for reading the article! I hope you enjoyed reading and found it interesting.

References

[1] Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, September 2018. http://www.charuaggarwal.net/neural.htm

[2] https://www.tutorialspoint.com/python_deep_learning/python_deep_learning_computational_graphs.htm

[3] https://www.tensorflow.org/guide/intro_to_graphs

[4] https://kth.instructure.com/files/1864796/download?download_frd=1

[5] https://www.easy-tensorflow.com/tf-tutorials/basics/graph-and-session

