
Content:
- Introduction
- Combination of functions
- A simple Neural Network
- Forward pass
- Setting up the simple neural network in PyTorch
- Backpropagation
- Comparison with PyTorch results
- Conclusion
- References
Introduction:
The neural network is one of the most widely used machine learning algorithms. Its successful application in fields such as image classification and time series forecasting has paved the way for its adoption in business and research. A clear understanding of the algorithm comes in handy when diagnosing issues and when learning more advanced deep learning methods. The goal of this article is to explain the workings of a neural network. We will step through the algorithm, show how to set up a simple neural network in PyTorch, and compare the results of our hand calculations with the output from PyTorch.
1.0 Combination of functions:
Let’s start by considering the following two arbitrary linear functions:

The coefficients -1.75, -0.1, 0.172, and 0.15 have been arbitrarily chosen for illustrative purposes. Next, we define two new functions a₁ and a₂ that are functions of z₁ and z₂ respectively:

The function f(x) = 1 / (1 + e⁻ˣ) used above is called the sigmoid function. Its graph is an S-shaped curve. The sigmoid plays a special role in a neural network, which we will discuss in more detail in a subsequent section. For now, we simply apply it to construct the functions a₁ and a₂.
Finally, we define another function that is a linear combination of the functions a₁ and a₂:

Once again, the coefficients 0.25, 0.5, and 0.2 are arbitrarily chosen. Figure 1 shows a plot of the three functions a₁, a₂, and z₃.
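For readers who want to reproduce such a plot, a minimal Python sketch is shown below; the pairing of the coefficients with x, a₁, and a₂ is an assumption made here purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-10, 10, 200)

# Two arbitrary linear functions (coefficient pairing assumed for illustration)
z1 = -0.1 * x - 1.75
z2 = 0.172 * x + 0.15

# Apply the sigmoid to each linear function
a1 = sigmoid(z1)
a2 = sigmoid(z2)

# Linear combination of a1 and a2 (coefficients again assumed)
z3 = 0.25 * a1 + 0.5 * a2 + 0.2

plt.plot(x, a1, label="a1")
plt.plot(x, a2, label="a2")
plt.plot(x, z3, label="z3")
plt.legend()
plt.show()
```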

We can see from Figure 1 that the linear combination of a₁ and a₂ produces a more complex-looking curve. In other words, by linearly combining curves we can create functions that capture more complex variations. We can extend this idea by applying the sigmoid function to z₃ and linearly combining the result with another similar function to represent an even more complex function. In theory, by combining enough such functions we can represent extremely complex variations in values. The coefficients in the above equations were selected arbitrarily. What if we could adjust the coefficients to change the shape of the resulting function? That would allow us to fit the final function to a very complex dataset. This is the basic idea behind a neural network: it provides a framework for combining simpler functions to construct a complex function capable of representing complicated variations in data. Let us now examine this framework.
2.0 A simple neural network:
Figure 2 is a schematic representation of a simple neural network. We will use this simple network for all the subsequent discussions in this article. The network takes a single value (x) as input and produces a single value y as output. There are four additional nodes labeled 1 through 4 in the network.

The input node feeds node 1 and node 2. Node 1 and node 2 each feed node 3 and node 4. Finally, node 3 and node 4 feed the output node. w₁ through w₈ are the weights of the network, and b₁ through b₅ are the biases. The weights and biases are used to create linear combinations of the values at the nodes, which are then fed to the nodes in the next layer. For example, the input x combined with weight w₁ and bias b₁ is the input for node 1. Similarly, the input x combined with weight w₂ and bias b₂ is the input for node 2. AF at the nodes stands for the activation function. The sigmoid function presented in the previous section is one such activation function. We will discuss more activation functions shortly. For now, let us follow the flow of information through the network. The outputs produced by the activation functions at node 1 and node 2 are linearly combined with weights w₃ and w₅ respectively and bias b₃. This linear combination is the input for node 3. Similarly, the outputs at node 1 and node 2 are combined with weights w₆ and w₄ respectively and bias b₄ to feed node 4. Finally, the outputs from the activation functions at node 3 and node 4 are linearly combined with weights w₇ and w₈ respectively, and bias b₅, to produce the network output yhat.
This flow of information from the input to the output is also called the forward pass. Before we work out the details of the forward pass for our simple network, let's look at some of the choices for activation functions.

Table 1 shows three common activation functions. The plots of each activation function and its derivative are also shown. While the sigmoid and the tanh are smooth functions, the ReLU has a kink at x = 0. The choice of activation function depends on the problem we are trying to solve. In some applications of neural networks it is desirable to have a continuous derivative of the activation function; for those, functions with continuous derivatives are a good choice. The tanh and sigmoid activation functions have larger derivatives in the vicinity of the origin, so if we are operating in this region they produce larger gradients, leading to faster convergence. Away from the origin, however, the tanh and sigmoid functions have very small derivative values, which leads to very small changes in the solution. We will discuss the computation of gradients in a subsequent section. There are many other activation functions that we will not discuss in this article. Since the ReLU is a simple function, we will use it as the activation function for our simple neural network. We are now ready to perform a forward pass.
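As a quick reference, the three activation functions and their derivatives from Table 1 can be written in a few lines of NumPy (a minimal sketch, with function names chosen here for convenience):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    # derivative is 0 for z < 0 and 1 otherwise (the kink at z = 0)
    return np.where(z < 0.0, 0.0, 1.0)
```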
3.0 Forward pass:
Figure 3 shows the calculation for the forward pass for our simple neural network.

z₁ and z₂ are obtained by linearly combining the input x with w₁ and b₁, and with w₂ and b₂, respectively. a₁ and a₂ are the outputs of applying the ReLU activation function to z₁ and z₂ respectively. z₃ and z₄ are obtained by linearly combining a₁ and a₂ from the previous layer with w₃, w₅, b₃, and with w₄, w₆, b₄, respectively. Finally, the output yhat is obtained by combining a₃ and a₄ from the previous layer with w₇, w₈, and b₅. In practice, z₁, z₂, z₃, and z₄ are obtained through a matrix-vector multiplication, as shown in Figure 4.

Here we have folded the bias term into the matrix. In general, for a layer of r nodes feeding a layer of s nodes, as shown in Figure 5, the matrix-vector product is an (s × (r+1)) matrix multiplied by an ((r+1) × 1) vector.
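A minimal NumPy sketch of this matrix-vector form of the forward pass is shown below; the weight and input values are placeholders chosen here for illustration, not the values used later in the article:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = 0.5                                # single input value (placeholder)

# Layer feeding nodes 1 and 2: r = 1 node feeds s = 2 nodes -> (2 x 2) times (2 x 1)
W1 = np.array([[0.3, 0.1],             # row = [w1, b1]
               [-0.2, 0.4]])           # row = [w2, b2]
z12 = W1 @ np.array([x, 1.0])          # appending 1 folds the bias into the matrix
a12 = relu(z12)                        # [a1, a2]

# Layer feeding nodes 3 and 4: r = 2 nodes feed s = 2 nodes -> (2 x 3) times (3 x 1)
W2 = np.array([[0.5, -0.3, 0.2],       # weights and bias feeding node 3
               [0.1, 0.6, -0.1]])      # weights and bias feeding node 4
z34 = W2 @ np.append(a12, 1.0)         # [z3, z4]
a34 = relu(z34)                        # [a3, a4]

# Output layer: r = 2 nodes feed s = 1 node -> (1 x 3) times (3 x 1)
W3 = np.array([[0.7, -0.4, 0.05]])     # [w7, w8, b5]
yhat = (W3 @ np.append(a34, 1.0))[0]
```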

The final step in the forward pass is to compute the loss. Since we have a single data point in our example, the loss L is the square of the difference between the output value yhat and the known value y. In general, for a regression problem, the loss is the average of the squared differences between the network output and the known value over all data points; this is called the mean squared error. This completes the first of the two important steps for a neural network. Before discussing the next step, we describe how to set up our simple network in PyTorch.
4.0 Setting up the simple neural network in PyTorch:
Our aim here is to show the basics of setting up a neural network in PyTorch using our simple network as an example. It is assumed that the reader has PyTorch installed on their machine. We will use the torch.nn module to set up our network. We start by importing the nn module as follows:
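A minimal version of the import (torch itself is imported as well, since we will need it to create tensors):

```python
import torch
import torch.nn as nn
```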

To set up our simple network we will use the sequential container in the nn module. The three layers in our network are specified in the same order as shown in Figure 3 above. Here is the complete specification of our simple network:
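A minimal sketch of that specification (the variable name seq_model is chosen here and reused in the later examples):

```python
# 1 input -> 2 nodes -> 2 nodes -> 1 output, with ReLU between the linear layers
seq_model = nn.Sequential(
    nn.Linear(1, 2),   # input layer: fed by 1 node, has 2 nodes
    nn.ReLU(),
    nn.Linear(2, 2),   # hidden layer: fed by 2 nodes, has 2 nodes
    nn.ReLU(),
    nn.Linear(2, 1),   # output layer: single output node
)
```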

The nn.Linear class is used to apply a linear combination of weights and biases. It takes two arguments: the first specifies the number of nodes that feed the layer, and the second specifies the number of nodes in the layer. For example, the (1, 2) specification of the input layer means that it is fed by a single input node and has two nodes. The hidden layer is fed by the two nodes of the input layer and has two nodes. Note that the number of output nodes of the previous layer must match the number of input nodes of the current layer. The (2, 1) specification of the output layer tells PyTorch that we have a single output node. The activation function is specified in between the layers; as discussed earlier, we use the ReLU function. Using this simple recipe, we can construct as deep and as wide a network as is appropriate for the task at hand. The output of the network is obtained by supplying the input value as follows:
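A minimal sketch (the value stored in t_u1 is a placeholder chosen here):

```python
t_u1 = torch.tensor([0.5])     # single input value x (placeholder)
yhat = seq_model(t_u1)         # forward pass through the network
```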

t_u1 is the single x value in our case. To compute the loss, we first define the loss function. The inputs to the loss function are the output from the neural network and the known value.
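A minimal sketch, assuming the mean squared error loss discussed above and a placeholder value for t_c1:

```python
loss_fn = nn.MSELoss()          # mean squared error
t_c1 = torch.tensor([1.0])      # known output value y (placeholder)
loss = loss_fn(yhat, t_c1)
```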

t_c1 is the y value in our case. This completes the setup for the forward pass in PyTorch. Next, we discuss the second important step for a neural network: backpropagation.
5.0 Backpropagation:
The weights and biases of a neural network are the unknowns in our model. We wish to determine the values of the weights and biases that achieve the best fit for our dataset. The best fit is achieved when the losses (i.e., errors) are minimized. Note that the loss L (see Figure 3) is a function of the unknown weights and biases. Imagine a multi-dimensional space whose axes are the weights and biases. The loss function is a surface in this space. At the start of the minimization process, the neural network is seeded with random weights and biases, i.e., we start at a random point on the loss surface. To reach the lowest point on the surface, we take steps along the direction of the steepest downward slope. This is what the gradient descent algorithm achieves during each training epoch or iteration. At the nth iteration, the weights and biases are updated as follows:
wᵢⁿ⁺¹ = wᵢⁿ − η (∂L/∂wᵢ),  i = 1, 2, …, m
Here m is the total number of weights and biases in the network. Note that we are using wᵢ to represent both weights and biases. The learning rate η determines the size of each step. The partial derivatives of the loss with respect to each of the weights and biases are computed in the backpropagation step.
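In code, the update rule amounts to the following hand-rolled step (a sketch using the seq_model and learning rate from the other sections; it assumes the gradients have already been computed by backpropagation):

```python
learning_rate = 0.01                      # the learning rate eta

with torch.no_grad():                     # update parameters outside of autograd
    for p in seq_model.parameters():      # p runs over all weights and biases
        p -= learning_rate * p.grad       # w_i <- w_i - eta * dL/dw_i
```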

The process starts at the output node and systematically progresses backward through the layers all the way to the input layer and hence the name backpropagation. The chain rule for computing derivatives is used at each step. We now compute these partial derivatives for our simple neural network.

We first start with the partial derivative of the loss L wrt the output yhat (refer to Figure 6):

∂L/∂yhat = 2 (yhat − y)
We use this in the computation of the partial derivative of the loss wrt w₇:

∂L/∂w₇ = (∂L/∂yhat) · (∂yhat/∂w₇) = 2 (yhat − y) · a₃
Here we have used the equation for yhat from Figure 6 to compute the partial derivative of yhat wrt w₇. The partial derivatives wrt w₈ and b₅ are computed similarly.
∂L/∂w₈ = 2 (yhat − y) · a₄

∂L/∂b₅ = 2 (yhat − y)
Now we step back to the previous layer. Once again the chain rule is used to compute the derivatives. Refer to Figure 7 for the partial derivatives wrt w₃, w₅, and b₃:
∂L/∂w₃ = 2 (yhat − y) · w₇ · AF′(z₃) · a₁

∂L/∂w₅ = 2 (yhat − y) · w₇ · AF′(z₃) · a₂

∂L/∂b₃ = 2 (yhat − y) · w₇ · AF′(z₃)

Here AF′ denotes the derivative of the activation function (the ReLU in our case).
Refer to Figure 8 for the partial derivatives wrt w₄, w₆, and b₄:


For the next set of partial derivatives, wrt w₁ and b₁, refer to Figure 9. We first rewrite the output as:


Similarly, refer to Figure 10 for the partial derivatives wrt w₂ and b₂:

PyTorch performs all these computations via a computational graph. The gradient of the loss wrt weights and biases is computed as follows in PyTorch:
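A minimal sketch of these calls (the choice of plain SGD for the optimizer optL is an assumption made here):

```python
import torch.optim as optim

optL = optim.SGD(seq_model.parameters(), lr=0.01)   # optimizer over all weights and biases

optL.zero_grad()    # reset all gradient terms to zero
loss.backward()     # backpropagation: compute dL/dw and dL/db for every parameter
```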

First, we zero out all the gradient terms. optL is the optimizer. Calling .backward() on the loss triggers the computation of the gradients in PyTorch.
Now that we have derived the formulas for the forward pass and backpropagation for our simple neural network, let's compare the output of our calculations with the output from PyTorch.
6.0 Comparison with PyTorch results:
One complete epoch consists of the forward pass, the backpropagation, and the weight/bias update. We will use Excel to perform the calculations for one complete epoch using our derived formulas. We will compare the results of the forward pass first, followed by a comparison of the results of backpropagation. Finally, we will use the gradients from the backpropagation to update the weights and biases and compare them with the PyTorch output. In practice, we rarely look at the weights or the gradients during training; here we perform two iterations in PyTorch and output this information for comparison. But first, we need to extract the initial random weights and biases from PyTorch. We will need these weights and biases to perform our calculations. This is done layer by layer as follows:
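One way to do this (the layer indices follow the nn.Sequential sketch above, where the even-numbered entries are the linear layers):

```python
for idx in (0, 2, 4):                       # even layers are the nn.Linear layers
    layer = seq_model[idx]
    print(idx, layer.weight.data, layer.bias.data)
```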

Note that we are extracting the weights and biases for the even layers since the odd layers in our neural network are the activation functions. The extracted initial weights and biases are transferred to the appropriately labeled cells in Excel.
Figure 11 shows the comparison of our forward pass calculation with the output from PyTorch for epoch 0. The output from PyTorch is shown at the top right of the figure, while the calculations in Excel are shown at the bottom left. The output value and the loss value are each circled in a matching color.

Next, we compute the gradient terms. Just like the weights, the gradients for any training epoch can be extracted layer by layer in PyTorch as follows:
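For example, reusing the layer indices from above:

```python
for idx in (0, 2, 4):                       # even layers are the nn.Linear layers
    layer = seq_model[idx]
    print(idx, layer.weight.grad, layer.bias.grad)
```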

Figure 12 shows the comparison of our backpropagation calculations in Excel with the output from PyTorch. The different terms of the gradient of the loss wrt the weights and biases are labeled appropriately. Note that we have used the derivative of the ReLU from Table 1 in our Excel calculations (the derivative of the ReLU is zero for x < 0 and one otherwise). The output from PyTorch is shown at the top right of the figure, while the calculations in Excel are shown at the bottom left. All but three gradient terms are zero; the gradients of the loss wrt w₈, b₄, and b₅ are the three non-zero components, and they are circled in matching colors.

We are now ready to update the weights at the end of our first training epoch. In PyTorch, this is done by invoking optL.step(). For our calculations, we will use the weight update equation given at the start of Section 5.
Figure 13 shows the comparison of the updated weights at the start of epoch 1. The output from PyTorch is shown at the top right of the figure, while the calculations in Excel are shown at the bottom left. Note that only one weight (w₈) and two biases (b₄ and b₅) change, since only these three gradient terms are non-zero. The learning rate used in our example is 0.01.

Interested readers can find the PyTorch notebook and the spreadsheet (Google Sheets) linked below.
7.0 Conclusion:
In this article, we examined how a neural network is set up and how the forward pass and backpropagation calculations are performed. We used a simple neural network to derive the values at each node during the forward pass. Using the chain rule, we derived the terms of the gradient of the loss function wrt the weights and biases. We then used Excel to perform the forward pass, backpropagation, and weight update computations and compared the results with the PyTorch output. While the neural network used in this article is very small, the underlying concepts extend to any general neural network.
References:
1.0 PyTorch documentation: https://pytorch.org/docs/stable/index.html.
2.0 Deep Learning with PyTorch, Eli Stevens, Luca Antiga, and Thomas Viehmann, Manning Publications, July 2020, ISBN 9781617295263.