The aim of this article is to provide a mathematical understanding of the learning process of neural networks by developing a framework to analyze how changes in the weights and biases affect the cost function. Once this is understood, we will be able to see how making very small changes to these variables, using optimization algorithms like Adam, can reduce the cost function.

A neural network, or more precisely an artificial neural network, is a series of interconnected nodes that mimics the animal nervous system, giving it information-processing capabilities loosely resembling those of the brain. The connections between these nodes are specified by weights that are learned during the training process. The basic idea behind these networks goes back to Alan Turing’s 1948 paper titled Intelligent Machinery, where he refers to them as B-type unorganised machines, i.e. machines that start off fairly unorganized in their initial composition but are then able to learn how to perform specific tasks. Today these neural networks form an indispensable component of Deep Learning algorithms, and this article is aimed at elucidating how exactly they learn.
Let’s begin with a rather simplified neural network: one made of L layers but with just a single neuron in each of those L layers, as shown below.

Also, for now, let’s limit our analysis to the connection between the last and the second-last layer, both of which are shown below along with the associated weight (ω) and bias (b).

The activations of the neurons in layers L and L-1 are a^(L) and a^(L−1) respectively. If the expected output of the final layer is y, the error or cost function becomes C₀ = (a^(L) − y)². This is the squared error for a single training example (averaging it over the training set gives the familiar mean squared error), possibly the simplest and most widely used of the available cost functions. The subscript in the cost function denotes that this is the cost for the first training image/data point; the entire training process will entail training on all N images in the training data set. The activation of layer L here can be expressed as:
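$$
a^{(L)} = \sigma\!\left(\omega^{(L)} a^{(L-1)} + b^{(L)}\right)
$$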

where b^(L) is the bias of layer L and σ is the activation function. Learning involves the use of a suitable optimization algorithm (usually some variant of gradient descent) to arrive at the optimum values of ω^(L) and b^(L) that minimize C₀. We do so by examining how sensitive the cost function is to small perturbations in the bias and weight, i.e. by computing the following partial derivatives using the chain rule:
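Writing $z^{(L)} = \omega^{(L)} a^{(L-1)} + b^{(L)}$ for the weighted input of the last neuron (a shorthand introduced here so the chain rule stays compact), and numbering the two expansions so they can be referred to below:

$$
\frac{\partial C_0}{\partial \omega^{(L)}} = \frac{\partial z^{(L)}}{\partial \omega^{(L)}}\,\frac{\partial a^{(L)}}{\partial z^{(L)}}\,\frac{\partial C_0}{\partial a^{(L)}} \tag{1}
$$

$$
\frac{\partial C_0}{\partial b^{(L)}} = \frac{\partial z^{(L)}}{\partial b^{(L)}}\,\frac{\partial a^{(L)}}{\partial z^{(L)}}\,\frac{\partial C_0}{\partial a^{(L)}} \tag{2}
$$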

The partial derivatives on the right-hand side of the above equations can be computed as follows:
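With σ′ denoting the derivative of the activation function:

$$
\frac{\partial z^{(L)}}{\partial \omega^{(L)}} = a^{(L-1)}, \qquad
\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1, \qquad
\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'\!\left(z^{(L)}\right), \qquad
\frac{\partial C_0}{\partial a^{(L)}} = 2\left(a^{(L)} - y\right)
$$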

Substituting the partial derivatives derived above into equations (1) and (2):
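$$
\frac{\partial C_0}{\partial \omega^{(L)}} = a^{(L-1)}\,\sigma'\!\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right),
\qquad
\frac{\partial C_0}{\partial b^{(L)}} = \sigma'\!\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right)
$$

Note that both gradients share the common factor $\sigma'\!\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right)$, which is simply the sensitivity of the cost to the weighted input $z^{(L)}$.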

These two partial derivatives help quantify how changes in the weight and bias influence the cost.
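For example, plain gradient descent (the simplest variant of the optimizers mentioned at the start) would repeatedly nudge each parameter against its gradient, with η denoting a small learning rate (a symbol introduced here just for this update rule):

$$
\omega^{(L)} \leftarrow \omega^{(L)} - \eta\,\frac{\partial C_0}{\partial \omega^{(L)}},
\qquad
b^{(L)} \leftarrow b^{(L)} - \eta\,\frac{\partial C_0}{\partial b^{(L)}}
$$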
Now let us add another layer to our neural network as shown below.

In addition to the two partial derivatives we just calculated, to fully investigate the effects of changes in the weights and biases on the cost function, for this network we must also compute the following two partial derivatives for the second-last layer:
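With $z^{(L-1)} = \omega^{(L-1)} a^{(L-2)} + b^{(L-1)}$ as the weighted input of layer L-1, the chain rule now also has to pass through layer L:

$$
\frac{\partial C_0}{\partial \omega^{(L-1)}} = \frac{\partial z^{(L-1)}}{\partial \omega^{(L-1)}}\,\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\,\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\,\frac{\partial a^{(L)}}{\partial z^{(L)}}\,\frac{\partial C_0}{\partial a^{(L)}}
$$

$$
\frac{\partial C_0}{\partial b^{(L-1)}} = \frac{\partial z^{(L-1)}}{\partial b^{(L-1)}}\,\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\,\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\,\frac{\partial a^{(L)}}{\partial z^{(L)}}\,\frac{\partial C_0}{\partial a^{(L)}}
$$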

The partial derivatives needed for the above two equations can be calculated as follows:
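The first three factors in each product are new; the last two were already computed for layer L:

$$
\frac{\partial z^{(L-1)}}{\partial \omega^{(L-1)}} = a^{(L-2)}, \qquad
\frac{\partial z^{(L-1)}}{\partial b^{(L-1)}} = 1, \qquad
\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} = \sigma'\!\left(z^{(L-1)}\right), \qquad
\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \omega^{(L)}
$$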

Substituting these values, we get the following expressions for the rate of change of the cost function with respect to the weight and bias associated with layer L-1:
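$$
\frac{\partial C_0}{\partial \omega^{(L-1)}} = a^{(L-2)}\,\sigma'\!\left(z^{(L-1)}\right)\,\omega^{(L)}\,\sigma'\!\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right)
$$

$$
\frac{\partial C_0}{\partial b^{(L-1)}} = \sigma'\!\left(z^{(L-1)}\right)\,\omega^{(L)}\,\sigma'\!\left(z^{(L)}\right)\,2\left(a^{(L)} - y\right)
$$

Notice that the weight $\omega^{(L)}$ is what carries the error signal back from layer L to layer L-1; this is the sense in which the error "propagates backwards".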

This should give an idea of how backpropagation works. First we determined the partial derivatives of the cost with respect to the weight and bias in layer L as a function of the activations of layers L and L-1. We then derived the derivatives of this same cost function with respect to the weight and bias of layer L-1 as a function of the activations of layers L, L-1, and L-2. These activations were in turn expressed as functions of the respective weights and biases. We can follow the same process right back to the first neuron of the full network to determine how all the weights and biases affect the cost function.
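To make the chaining concrete, here is a minimal Python sketch for this single-neuron-per-layer network, assuming a sigmoid activation and the squared-error cost used above; the function and variable names (backprop_chain, grad_w, and so on) are my own and not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_chain(weights, biases, x, y):
    """Gradients of C0 = (a_L - y)^2 for a chain of single-neuron layers."""
    # Forward pass: store each weighted input z and activation a.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Backward pass: start with dC0/da^(L) = 2 (a^(L) - y) and walk back layer by layer.
    grad_w, grad_b = [0.0] * len(weights), [0.0] * len(biases)
    dC_da = 2.0 * (activations[-1] - y)
    for l in range(len(weights) - 1, -1, -1):
        dC_dz = dC_da * sigmoid_prime(zs[l])   # dC0/dz^(l)
        grad_w[l] = dC_dz * activations[l]     # dC0/dw^(l) = a^(l-1) * dC0/dz^(l)
        grad_b[l] = dC_dz                      # dC0/db^(l) = dC0/dz^(l)
        dC_da = dC_dz * weights[l]             # pass the error signal to the previous layer
    return grad_w, grad_b

# Example: a three-layer chain trained on a single data point (x, y).
w, b = [0.5, -0.3, 0.8], [0.1, 0.0, -0.2]
print(backprop_chain(w, b, x=1.0, y=0.0))
```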

In our analysis so far we are yet to address the case where there is more than one neuron in a single layer. It turns out that adding more neurons to a layer just requires a few extra indices to keep track of. Shown below is a neural network of two layers but with many neurons in each layer. Here we introduce two new indices to keep track of the neurons in each layer: k for layer L-1 and j for layer L.

The cost function in this case would be the sum of the costs for each neuron in the output layer L:
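Writing $n_L$ for the number of neurons in the output layer and $y_j$ for the desired output of neuron j (notation introduced here for the formula):

$$
C_0 = \sum_{j=0}^{n_L - 1} \left(a_j^{(L)} - y_j\right)^2
$$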

The activation of each neuron in layer L will be the weighted sum of the activations of all neurons in layer L-1, plus a bias, passed through the activation function:
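With $\omega_{jk}^{(L)}$ denoting the weight connecting neuron k in layer L-1 to neuron j in layer L, and $z_j^{(L)}$ again standing for the weighted input:

$$
z_j^{(L)} = \sum_{k} \omega_{jk}^{(L)}\, a_k^{(L-1)} + b_j^{(L)},
\qquad
a_j^{(L)} = \sigma\!\left(z_j^{(L)}\right)
$$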

The partial derivatives of the cost with respect to the weights and biases of layer L can be written as before:
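These mirror equations (1) and (2), only now carrying the indices j and k:

$$
\frac{\partial C_0}{\partial \omega_{jk}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial \omega_{jk}^{(L)}}\,\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}\,\frac{\partial C_0}{\partial a_j^{(L)}},
\qquad
\frac{\partial C_0}{\partial b_j^{(L)}} = \frac{\partial z_j^{(L)}}{\partial b_j^{(L)}}\,\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}\,\frac{\partial C_0}{\partial a_j^{(L)}}
$$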

The partial derivatives needed for the above equations can be calculated as before:
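The only real change is that the cost now depends on neuron j's own target $y_j$:

$$
\frac{\partial z_j^{(L)}}{\partial \omega_{jk}^{(L)}} = a_k^{(L-1)}, \qquad
\frac{\partial z_j^{(L)}}{\partial b_j^{(L)}} = 1, \qquad
\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} = \sigma'\!\left(z_j^{(L)}\right), \qquad
\frac{\partial C_0}{\partial a_j^{(L)}} = 2\left(a_j^{(L)} - y_j\right)
$$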

Substituting these in equations (1) and (2):
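$$
\frac{\partial C_0}{\partial \omega_{jk}^{(L)}} = a_k^{(L-1)}\,\sigma'\!\left(z_j^{(L)}\right)\,2\left(a_j^{(L)} - y_j\right),
\qquad
\frac{\partial C_0}{\partial b_j^{(L)}} = \sigma'\!\left(z_j^{(L)}\right)\,2\left(a_j^{(L)} - y_j\right)
$$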

The equations above are for a neural network with two layers but many neurons per layer. We can similarly calculate the partial derivatives for a neural network with many layers and many neurons per layer. I am not including these equations as they essentially involve the same logic as that discussed above.
