A friendly guide to the mathematical intuition behind vanilla Neural Networks.

Introduction
Understanding the mathematical operations behind Neural Networks (NNs) is important for a data scientist’s ability to design efficient Deep Learning models. In this article, the high-level calculus of a fully connected NN is demonstrated, with a focus on the backward propagation step. The article is oriented to people with basic knowledge of NNs who seek to dive deeper into their structure.
Background
The objective of the training process is to find the weights (W) and biases (b) that minimize the error. This minimization is achieved through the gradient descent algorithm. To begin with, the weights are randomly initialized, and then an iterative process of subtle weight updates is performed until convergence.
Each iteration begins with a forward pass, which outputs the current prediction and evaluates the model’s error with a cost function J (Equation 6). Next, a backward pass is conducted to compute the gradients of the weights.

To find the best parameters that minimize the error, we use the gradient descent algorithm. The goal is to find the minimum point of the cost function (J), where the gradient is close to zero (Figure 2). The algorithm iteratively moves, step by step, in the direction of steepest descent, toward the minimum point. The step size is called the ‘learning rate‘: a scalar that determines how much the weights change in each iteration, and therefore how quickly the NN converges. The learning rate is a hyperparameter that has to be tuned; smaller learning rates require more training time, whereas larger learning rates result in rapid training but might suffer from compromised performance and instability.
To update the weights, the gradients are multiplied by the learning rate (alpha), and the new weights are calculated by the following formula:

$$W_{new} = W - \alpha \frac{\partial J}{\partial W} \tag{1}$$
As the model iterates, the gradient gradually converges toward zero, where the error is most likely the lowest (alternatively, the model might converge to a local optimum and exhibit sub-optimal performance).
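To make the update rule concrete, here is a minimal sketch of gradient descent on a toy one-dimensional cost, J(w) = (w − 3)², whose gradient is 2(w − 3). The cost, starting point, and learning rate here are illustrative choices, not part of the derivation in this article:

```python
# Toy 1-D example: J(w) = (w - 3)^2, so dJ/dw = 2(w - 3).
w = 10.0      # arbitrary initialization
alpha = 0.1   # learning rate
for _ in range(100):
    grad = 2 * (w - 3.0)   # gradient of the cost at the current w
    w = w - alpha * grad   # Equation 1: step against the gradient
print(w)  # approaches 3.0, where the gradient is ~0 and the cost is minimal
```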
Describing the Net
This article will follow the structure of a two-layer Neural Network, where X (also named A[0]) is the input vector, A[1] is a hidden layer, and Y-hat is the output layer. The basic architecture of the net is illustrated in the picture below (the number of neurons in each layer is irrelevant to the equations, since we are using the vectorized forms):

Forward pass
To understand how backward propagation is calculated, we first need to review the forward propagation. Our net starts with a vectorized linear equation, where the layer number is indicated in square brackets:

$$Z^{[1]} = W^{[1]} X + b^{[1]} \tag{2}$$
Next, a non-linear activation function (A) is added. This activation breaks the linearity of the net and enables it to adjust to complex patterns in the data. Several different activation functions can be used (e.g. sigmoid, ReLU, tanh), and here we will use the sigmoid activation for our NN:

$$A^{[1]} = \sigma\!\left(Z^{[1]}\right) = \frac{1}{1 + e^{-Z^{[1]}}} \tag{3}$$
So far we have computed the first layer. The second layer, like the first, is composed of a linear equation (Z[2]), followed by a sigmoid activation (A[2]). Since this is the last layer in our net, the activation result (A[2]) is the model’s prediction (Y-hat):

$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \tag{4}$$

$$\hat{Y} = A^{[2]} = \sigma\!\left(Z^{[2]}\right) \tag{5}$$
Finally, to evaluate and minimize the error, we define a cost function (for further reading on activations and cost functions, see reference [1]). Here we are using the ‘Mean Squared Error’ (MSE) function. For simplicity, we will use the Stochastic Gradient Descent (SGD) method [2], meaning that only one sample is processed in each iteration, so the cost reduces to the squared error of that single sample:

$$J = \left(\hat{Y} - Y\right)^2 \tag{6}$$
We can summarize the computational graph of the forward pass:

$$X = A^{[0]} \;\rightarrow\; Z^{[1]} \;\rightarrow\; A^{[1]} \;\rightarrow\; Z^{[2]} \;\rightarrow\; A^{[2]} = \hat{Y} \;\rightarrow\; J$$
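For readers who prefer code, the forward pass can be sketched in a few lines of NumPy. The layer sizes (2 inputs, 3 hidden neurons, 1 output), the random initialization, and the target value are assumptions chosen for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 1))                         # input vector, A[0]
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))  # layer 1 parameters
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))  # layer 2 parameters
Y = np.array([[1.0]])                               # the single target sample

Z1 = W1 @ X + b1            # Equation 2: linear step, layer 1
A1 = sigmoid(Z1)            # Equation 3: sigmoid activation, layer 1
Z2 = W2 @ A1 + b2           # Equation 4: linear step, layer 2
A2 = sigmoid(Z2)            # Equation 5: the prediction Y-hat
J = np.sum((A2 - Y) ** 2)   # Equation 6: squared error for one sample
```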
Backward pass
In order to update the weights and biases after each iteration, we need to compute the gradients. In our two-layer net there are four parameters to be updated: W[2], b[2], W[1] and b[1], and therefore four gradients to be computed:

$$\frac{\partial J}{\partial W^{[2]}}, \quad \frac{\partial J}{\partial b^{[2]}}, \quad \frac{\partial J}{\partial W^{[1]}}, \quad \frac{\partial J}{\partial b^{[1]}}$$
But how are we going to find the derivatives of these composite functions? According to the ‘chain rule‘, we construct the product of the derivatives along all paths connecting the variables. Let’s follow the gradient of the second-layer weights (W[2]):
![Figure5. W[2] gradient | Image by Author](https://towardsdatascience.com/wp-content/uploads/2021/05/1fHBovXUdLMNPkslyteGMag.png)
From the diagram above we can clearly see that the change in the cost J with respect to W[2] is:
![Equation 7. Gradient of the cost J with respect to the weights of layer two W[2]](https://towardsdatascience.com/wp-content/uploads/2021/05/19kum11XITdmLPg1VpsTF-g.png)
To resolve this, we’ll start by computing the partial derivative of the cost J with respect to A[2], which is also the prediction Y-hat. The original cost function is shown on the left, and its derivative on the right:
![Equation 8. MSE cost function (left) and its partial derivative with respect to A[2] (right).](https://towardsdatascience.com/wp-content/uploads/2021/05/1xsaDHR3djze2tPXjzVSLxA.png)
The partial derivative of the sigmoid activation A[2] with respect to Z[2] is the following (the mathematical development of the sigmoid derivative is described in ref [3]):
![Equation 9. Sigmoid activation on the second layer (left) and its partial derivative with respect to Z[2] (right).](https://towardsdatascience.com/wp-content/uploads/2021/05/1MB0_hh-Wafpi7HF8wEEnzw.png)
The partial derivative of Z[2] with respect to the weights W[2]:
![Equation 10. Straight line equation on the second layer (left), and its partial derivative with respect to W[2] (right).](https://towardsdatascience.com/wp-content/uploads/2021/05/199Yfs8rflHDp43xVgNZghw.png)
Let’s chain everything together to compute the W[2] gradient:
![Equation 11. Gradient of the cost J with respect to the weights of layer two (W[2])](https://towardsdatascience.com/wp-content/uploads/2021/05/1OxOr1HD4zTGlLcDZNgKI0A.png)
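Continuing the NumPy sketch from the forward pass, the factors of Equation 11 translate directly into code (the variable names are mine, not from the article’s figures):

```python
dJ_dA2 = 2 * (A2 - Y)        # Equation 8: derivative of the cost w.r.t. A[2]
dA2_dZ2 = A2 * (1 - A2)      # Equation 9: sigmoid derivative w.r.t. Z[2]
dJ_dZ2 = dJ_dA2 * dA2_dZ2    # the first two factors of the chain
dJ_dW2 = dJ_dZ2 @ A1.T       # Equation 11: chained with dZ2/dW2 = A[1]
```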
Great! Next, we will compute the b[2] gradient in a similar way. Let’s follow the gradient:
![Figure 6. Gradient of the cost J with respect to the bias of layer two b[2] | Image by Author](https://towardsdatascience.com/wp-content/uploads/2021/05/1ty2OeNbDr8yfX_Gn-iUN9A.png)
![Equation 12. Gradient of the cost J with respect to the bias of layer two (b[2])](https://towardsdatascience.com/wp-content/uploads/2021/05/1a_rX0KfQG0wmeFoejJKCKQ.png)
The first two parts of the b[2] gradient were already calculated above (the partial derivative of the cost J with respect to Z[2]), and the last part is equal to 1:

$$\frac{\partial Z^{[2]}}{\partial b^{[2]}} = 1 \tag{13}$$
So the overall b[2] gradient is:
![Equation 14. The gradient of the cost J with respect to the bias of layer two (b[2])](https://towardsdatascience.com/wp-content/uploads/2021/05/1AkiueU2ADTib5NPAJZJZag.png)
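In the sketch this is a one-liner, since multiplying by the 1 from Equation 13 changes nothing:

```python
dJ_db2 = dJ_dZ2              # Equation 14: dZ2/db2 = 1
```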
At last, we have finished computing the gradients for the second layer. The gradients for the first layer are a bit longer, but we have already computed parts of them. Let’s follow the gradient for the W[1] update:
![Figure 7. W[1] gradient | Image by Author](https://towardsdatascience.com/wp-content/uploads/2021/05/1sqsFvarf5uXjydd2ixyEDQ.png)
The first two parts of the gradient were previously calculated for layer 2. The partial derivative of Z[2] with respect to A[1] is W[2]:

$$\frac{\partial Z^{[2]}}{\partial A^{[1]}} = W^{[2]} \tag{15}$$
The last two parts are calculated in the same manner as in layer 2. Taking it all together, we get:
![Equation 16. The gradient of the cost J with respect to the weights of the first layer W[1]](https://towardsdatascience.com/wp-content/uploads/2021/05/1xprorNZvirLsg-13_l0zcQ.png)
And if we follow the b[1] gradient:
![Figure 8. b[1] gradient | Image by Author](https://towardsdatascience.com/wp-content/uploads/2021/05/1jHA6VvCU7A7YKDggqNGfGQ.png)
We’ll get:
![Equation 17. The gradient of the cost J with respect to the bias of the first layer b[1]](https://towardsdatascience.com/wp-content/uploads/2021/05/1R8YhHvIoDf7v_wEurT9ZAQ.png)
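The first-layer gradients complete the backward pass of the NumPy sketch; they reuse dJ_dZ2 from the second layer, exactly as the chain rule prescribes:

```python
dJ_dA1 = W2.T @ dJ_dZ2            # Equation 15: dZ2/dA1 = W[2]
dJ_dZ1 = dJ_dA1 * A1 * (1 - A1)   # sigmoid derivative on layer 1
dJ_dW1 = dJ_dZ1 @ X.T             # Equation 16: the W[1] gradient
dJ_db1 = dJ_dZ1                   # Equation 17: the b[1] gradient
```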
With that, we have finished computing all the gradients of the weights and biases for one iteration of our NN.
Weight update
Once the gradients are computed, we can update the parameters of the model and iterate all over again until the model converges. Note that alpha is the learning rate, a hyperparameter that determines the convergence rate:

$$W^{[l]} := W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \qquad b^{[l]} := b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}, \qquad l = 1, 2$$
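Closing the loop on the sketch, the update step applies the same rule to all four parameters (the value of alpha is an arbitrary choice here):

```python
alpha = 0.1  # learning rate, an assumed value
W2 -= alpha * dJ_dW2
b2 -= alpha * dJ_db2
W1 -= alpha * dJ_dW1
b1 -= alpha * dJ_db1
# the forward pass, backward pass and update then repeat until J converges
```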
Final Thoughts
I hope this article helped you gain a deeper understanding of the mathematics behind Neural Networks. I’ve explained the workings of a small network, but these basic concepts can be generalized and applied to deeper neural networks.
Thank you for reading!
References
[1] Activation and cost functions: https://medium.com/@zeeshanmulla/cost-activation-loss-function-neural-network-deep-learning-what-are-these-91167825a4de
[2] Stochastic gradient descent: https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31
[3] Sigmoid derivative: https://becominghuman.ai/what-is-derivative-of-sigmoid-function-56525895f0eb