The animation in Fig 1 above shows the training of a neural network of four neurons using the backpropagation algorithm. The reference function is a plot shaped like an inverted V with a bend on each side; it is defined over the range [0, 4] and has four slopes. Fig 2a (left) shows the reference function and the neural network with its initial weights. Fig 2a (right) shows the reference function and the neural network that converged to the reference function after 7200 backpropagation iterations.
The purpose of this article is to help the reader understand neural networks and the backpropagation algorithm used to train them. Training neural networks is computationally very expensive, and the backpropagation algorithm can at times give a solution that is locally, but not globally, optimal. A thorough understanding of the basics of this algorithm is therefore useful for anyone deploying neural networks to solve their problems.
This article has three sections which cover the following topics: (a) a summary of the basic concepts in supervised learning and the mathematical equations involved in the forward computation and backpropagation of neural networks; (b) a translation of these equations into statements that describe the weight updates as visual motions, namely an increase or a decrease in a slope, or a movement of the origin of a slope; (c) a set of examples using a variety of neural networks that show the operation of the backpropagation algorithm. A Python program is used to compute and generate the plots and animations that visually display the progress of the computation, and each example closes with a note explaining how the resulting plot ties in with the rules stated in (b). I hope the visualizations and the explanations will give the reader a better understanding of neural networks and backpropagation.
Supervised Learning: Supervised learning uses a special class of functions called neural networks to model the relationship between the input and output values in a training data set. The training data set consists of a sequence of input and output values. Neural networks have weight parameters whose values are initially guessed. The training process uses the data set to compare the output value predicted by the model against the actual output value and modifies the weights so that the error between the predicted and actual values is reduced. Once this training process is completed, the neural network is used to predict output values for new inputs.
Neuron: A neuron is a function that maps an n-input vector to a single value. It is a cascade of a linear and a non-linear operation, as shown below.
N(w, x) = nl(w0*x0 + w1*x1 + ... + w(n-1)*x(n-1) + b)
where w = {w0, w1, ..., w(n-1)} is the weight vector of size n,
x = {x0, x1, ..., x(n-1)} is the input vector of size n,
b is a bias,
and nl is a non-linear function.
The linear operation is a dot product of the input vector with a weight vector of the same size, plus a scalar bias value. The non-linear function is a function of a single variable. A typical non-linear function is relu, which is defined as relu(x) = x for x > 0 and 0 otherwise. In this article, we use an input vector of size 1 and choose relu as the non-linear function. Thus the equation for a neuron is
N(x) = relu(w*x+b) where w is the weight and b the bias.
If w > 0 then N(x) > 0 when x > -b/w. We will refer to -b/w as the origin of the neuron.
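As a concrete illustration, here is a minimal Python sketch of such a single-input relu neuron (the function names and example values are my own, chosen only for illustration, not taken from the program that generated the figures):

```python
def relu(x):
    # relu(x) = x for x > 0, else 0
    return x if x > 0 else 0.0

def neuron(x, w, b):
    # Single-input neuron: N(x) = relu(w*x + b)
    return relu(w * x + b)

# With w = 2 and b = -2 the origin -b/w is at x = 1:
# the output is 0 for x <= 1 and rises with slope 2 for x > 1.
print(neuron(0.5, w=2.0, b=-2.0))  # 0.0
print(neuron(3.0, w=2.0, b=-2.0))  # 4.0
```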
Neural Network: A neural network is a computational graph composed of neurons. A neural network with n single-input neurons in the input layer and one n-input neuron in the output layer is specified by the equation
NN(x) = w0*relu(iw0*x + b0) + w1*relu(iw1*x + b1) + ... + w(n-1)*relu(iw(n-1)*x + b(n-1)),
where iwi and bi are the weight and bias of input-layer neuron ni, and wi is the corresponding weight of the output neuron.
In the single-input case, we can set the input weights iwi to 1 (to remove redundancy and improve the efficiency of the backpropagation search). Thus the neural network equation is
NN(x) = w0*relu(x+b0) + w1*relu(x+b1) + ... + w(n-1)*relu(x+b(n-1))
Two key observations are useful: (a) neuron ni (wi*relu(x+bi)) contributes to the sum only when x > -bi, i.e., a neuron is active only when x > -bi; the point -bi is referred to as the origin of the neuron. (b) The slope of NN(x) increases by wi when neuron ni is active. An important consequence is that a neural network of n neurons (with relu non-linearities) represents a continuous graph with n slopes. Each neuron contributes to the slope of the neural network when x > -bi, where bi is the bias of the neuron.
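This equation translates directly into code. The following sketch (illustrative helpers of my own, reusing relu from the previous snippet) evaluates NN(x) and the local slope as the sum of the weights of the active neurons:

```python
def nn(x, weights, biases):
    # NN(x) = w0*relu(x + b0) + ... + w(n-1)*relu(x + b(n-1))
    # Neuron i contributes only when x > -b_i, i.e. when it is active.
    return sum(w * relu(x + b) for w, b in zip(weights, biases))

def slope_at(x, weights, biases):
    # Local slope of NN at x: the sum of the weights of the active neurons.
    return sum(w for w, b in zip(weights, biases) if x + b > 0)
```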
Example 1 (A neural network with four slopes): Fig 3a below shows a single-input, single-output neural network. The input layer has 4 neurons, each operating on a vector of size 1. The weights and biases of these neurons are {1, 1, 1, 1} and {0, -1, -2, -3} respectively. The modeled function has a single output, so the output layer has 1 neuron with four inputs. The output layer has no non-linearity and its weights are {2, -1, -2, -1}. Fig 3b shows the plot represented by this neural network: a continuous function with four slopes {2, 1, -1, -2}, where the slope changes at the locations {0, 1, 2, 3}.
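Using the helpers sketched above, Example 1 can be written out by treating the output-layer weights as wi and the input-layer biases as bi (a hedged reconstruction of the network in Fig 3a, not the code that generated the figures):

```python
weights = [2.0, -1.0, -2.0, -1.0]  # output-layer weights
biases  = [0.0, -1.0, -2.0, -3.0]  # input-layer biases; origins at 0, 1, 2, 3

for x in [0.5, 1.5, 2.5, 3.5]:
    print(x, nn(x, weights, biases), slope_at(x, weights, biases))
# The printed slopes are 2, 1, -1, -2, matching the four segments of Fig 3b.
```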
Backpropagation: Backpropagation is the algorithm used to reduce the error between the training data and the values predicted by the neural network. This algorithm modifies the initially guessed biases and weights iteratively until the error between the training data and the value predicted by the neural network is within acceptable limits. The difference delta(x) and the error e(x) between the reference function ref_f(x) and the neural network are given by
delta(x) = ref_f(x) - (Σ wi*relu(x+bi) + b)
e(x) = delta(x)**2
where 0 ≤ i < n, n is the number of neurons in the input layer, and b is the bias of the output neuron.
The change in the error for small changes in the weights and biases is expressed in terms of the partial derivatives of the error with respect to the weights and biases:
Δe ≈ Σ (∂e/∂wi*Δwi + ∂e/∂bi*Δbi) + ∂e/∂b*Δb, where
∂e/∂wi = -2 * delta(x) * relu(x+bi)
∂e/∂bi = -2 * delta(x) * ((x+bi > 0) ? wi : 0), and
∂e/∂b = -2 * delta(x)
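These partial derivatives can be sanity-checked with a finite-difference approximation. The snippet below (an illustrative check with made-up values, reusing relu from the earlier sketch, and not part of the original program) compares the analytic ∂e/∂w0 with a numerical estimate:

```python
def error(x, ref_y, weights, biases, b):
    # e(x) = (ref_f(x) - (sum_i w_i*relu(x + b_i) + b))**2
    pred = sum(w * relu(x + bi) for w, bi in zip(weights, biases)) + b
    return (ref_y - pred) ** 2

x, ref_y, h = 1.5, 2.0, 1e-6
weights, biases, b = [0.5, -0.3], [0.0, -1.0], 0.1

pred = sum(w * relu(x + bi) for w, bi in zip(weights, biases)) + b
delta = ref_y - pred
analytic = -2 * delta * relu(x + biases[0])   # ∂e/∂w0 from the formula above

w_plus  = [weights[0] + h, weights[1]]
w_minus = [weights[0] - h, weights[1]]
numeric = (error(x, ref_y, w_plus, biases, b)
           - error(x, ref_y, w_minus, biases, b)) / (2 * h)
print(analytic, numeric)  # both are approximately -3.9
```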
The equations for the weight updates are
wi = wi - lr*∂e/∂wi = wi + 2*lr*delta(x) * relu(x+bi) --- equation 1
bi = bi - lr*∂e/∂bi = bi + 2*lr*delta(x) * ((x+bi > 0) ? wi : 0) ---- equation 2
b = b - lr*∂e/∂b = b + 2*lr*delta(x), where lr is the learning rate ----- equation 3
During each iteration, in the forward path we compute the output NN(x) = Σ wi*relu(x+bi) + b and the error function e(x), and in the backward path we update the weights and biases.
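A minimal sketch of one such iteration, implementing equations 1 to 3 for a single training sample x, is shown below (the function name, the in-place list updates, and the default learning rate are my own assumptions):

```python
def backprop_step(x, ref_y, weights, biases, out_b=0.0, lr=0.01):
    # Forward path: NN(x) and the difference delta(x)
    pred = sum(w * relu(x + b) for w, b in zip(weights, biases)) + out_b
    delta = ref_y - pred
    # Backward path: only the active neurons (x + b_i > 0) are updated
    for i in range(len(weights)):
        if x + biases[i] > 0:
            dw = 2 * lr * delta * relu(x + biases[i])  # equation 1
            db = 2 * lr * delta * weights[i]           # equation 2
            weights[i] += dw
            biases[i] += db
    out_b += 2 * lr * delta                            # equation 3
    return weights, biases, out_b, delta
```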
The weight update rules are:
If NN(x) < ref_f(x), i.e., delta(x) > 0 (the neural network output is less than the reference value), then:
For each active neuron (x > -bi):
the weight increases (the slope increases);
if wi > 0 the bias increases (the neuron (slope) moves left) -- rule 1
if wi < 0 the bias decreases (the neuron (slope) moves right) -- rule 2
Here the magnitude of the weight increase is proportional to delta(x) and to the distance x + bi from the origin -bi.
The magnitude of the change in bi is proportional to delta(x) and to the value of the weight wi.
If NN(x) > ref_f(x), i.e., delta(x) < 0 (the neural network output is greater than the reference value), then:
For each active neuron (x > -bi):
the weight decreases (the slope decreases);
if wi > 0 the bias decreases (the neuron (slope) moves right) -- rule 3
if wi < 0 the bias increases (the neuron (slope) moves left) -- rule 4
Here the magnitude of the weight decrease is proportional to delta(x) and to the distance x + bi from the origin -bi.
The magnitude of the change in bi is proportional to delta(x) and to the value of the weight wi.
In the next section, I present a few examples to illustrate the operation of the backpropagation algorithm. A random number generator generates 400 numbers in the range [0, 4], and a reference function generates the corresponding outputs. The reference functions used in the examples are neural networks with fixed biases and weights. A training neural network with n neurons is chosen, and its initial biases and weights are guessed. The backpropagation algorithm updates the weights and biases so that the error between the training neural network and the reference function is reduced. For training, the data set of 400 numbers is used one or more times, until the error is reduced to within the desired limits or an upper limit on runtime is reached. The time taken for training depends on the initial values of the weights and biases.
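The training procedure just described can be sketched as a simple loop over the data set (a simplified version built on the earlier snippets; the actual program additionally records the trajectories and produces the plots and animations):

```python
import random

def train(ref_f, weights, biases, lr=0.01, epochs=20, n_samples=400):
    # 400 random inputs in [0, 4], reused for several passes (epochs)
    data = [random.uniform(0.0, 4.0) for _ in range(n_samples)]
    out_b = 0.0
    for _ in range(epochs):
        for x in data:
            weights, biases, out_b, _ = backprop_step(x, ref_f(x), weights, biases, out_b, lr)
    return weights, biases
```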
The examples in this section are accompanied by plots and animations. Each example has (i) a plot comparing the neural network with its initial biases and weights to the reference function, (ii) a plot comparing the neural network with its final biases and weights to the reference function, (iii) a plot showing the bias of each neuron as backpropagation progresses, (iv) a plot showing the weight of each neuron as backpropagation progresses, (v) an animation showing the neural network and the reference function as backpropagation progresses, and (vi) optionally, an animation of the plots of the individual neurons as backpropagation updates the biases and weights.
Example 2: Let us use a single-neuron reference function specified by the equation ref_fn1(x) = 2*relu(x-1.0), where x is defined over the interval [0, 4]. For training, we will use a neural network of one neuron with an initial bias of 0 and an initial weight (slope) of -0.6. A bias of 0 is chosen so that the neuron is active over the entire data set defined on the interval [0, 4]. The equations for the neural network and the reference function are:
neural(x) = wt*relu(x+b) where wt = -0.6 and b = 0.
ref_fn1(x) = 2* relu(x-1).
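Using the training loop sketched earlier, the setup for this example would look roughly as follows (the learning rate and number of passes actually used for Fig 4 are not shown here, so the iteration count may differ):

```python
ref_fn1 = lambda x: 2.0 * relu(x - 1.0)   # reference: slope 2, origin at x = 1
weights, biases = [-0.6], [0.0]           # training neuron: slope -0.6, origin at 0

weights, biases = train(ref_fn1, weights, biases)
print(weights, biases)  # expected to end up near [2.0] and [-1.0]
```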
This example illustrates the convergence of a single-neuron neural network to a reference function. Here the backpropagation algorithm moves the neuron to the right from 0 to 1 and increases its slope from -0.6 to 2. Let us see how it does this.
Fig 4a below is an animation showing the convergence of the neural network to the reference function in 800 iterations. Fig 4b shows the change in the bias and weight of the neuron as the backpropagation progresses. Fig 4c shows the initial and final plots of the neural network along with the reference function.
The convergence shown in the animation in Fig 4a above has three phases:
Phase 1: Initially the neural network output is smaller than the reference function (as seen in Fig 4c (left)). So the slope increases from -0.6 to 0 in 4 iterations and the neuron moves right (applying rule 2 for negative slope and positive delta). The slope continues to increase from 0 to 1 over the next 12 iterations, but the neuron now moves left since the slope is positive (applying rule 1 for positive slope and positive delta).
Phase 2: In this phase the slope increases and the neuron moves left (rule 1) when delta is positive, and the slope decreases and the neuron moves right (rule 3) when delta is negative. However, the movement is such that overall the slope increases from 1.0 to 1.84 (Fig 4b (right)) and overall the neuron moves right from 0 to 0.9 (Fig 4a (left)) in 300 iterations.
Phase 3: In this phase the slope continues to rise from 1.84 to 2.0 and the neuron moves right from 0.9 to 1.0 in the remaining 500 iterations. The movement of the slope and the neuron is slower in this phase because the error is smaller in each iteration.
Conclusion: This example shows the ability of the backpropagation algorithm to update the bias and weight of the single-neuron neural network so that it converges to the reference function.
In the next example, we use a two neuron neural network as a reference function.
Example 3: The reference function is specified by the equation ref_fn2(x) = 2*relu(x) - 4*relu(x-2), defined over the range [0, 4]. For training, we use a neural network with 2 neurons in the input layer. The initial biases are {0, -0.2} and the initial weights are {0.0, 0.08}. The biases are chosen so that the origins of the neurons are close to 0; this ensures that both neurons are active over most of the data set defined on the input range [0, 4]. The initial bias of each neuron is different. The update of a weight depends on the value of (x+bi), where bi is the bias of the ith neuron, so by choosing different biases the trajectories of the weight and bias updates for each neuron will be different. The impact of this difference is seen in this example. The neural network function with the initial biases and weights is
neural2(x) = wt0*relu(x + b0) + wt1*relu(x + b1), where b0 = 0, b1 = -0.2, wt0 = 0, wt1 = 0.08
Fig 5a (left) shows the trajectory of the biases and Fig 5a (right) the weights of the neurons as they are updated by the backpropagation algorithm. The bias of neuron2 increases only until 0.14 and then starts moving toward -2.16, whereas the bias of neuron1 increases until 1.25 before it comes back to 0, as seen in Fig 5a and the animation in Fig 5b below.
The reasons for the different trajectories of the two neurons are:
Reason 1: A difference in bias causes a difference in weight updates. The initial bias of neuron1 is 0 and that of neuron2 is -0.2, so the origin of neuron1 is 0 and that of neuron2 is 0.2. The weight updates (from equation 1) are proportional to delta(x)*x for neuron1 and delta(x)*(x - 0.2) for neuron2. Therefore when x > 0.2 the weights of both neurons increase, with the weight of neuron1 increasing by a larger amount.
Reason 2: A difference in weights causes a difference in bias updates. The update of a bias (from equation 2) is proportional to delta(x) and to the weight of the neuron. Since the weight of neuron1 is larger, it moves further than neuron2. Also, when delta is negative and the weights decrease, the smaller weight of neuron2 becomes negative (while the weight of neuron1 is still positive). This causes neuron2 to move to the right while neuron1 still moves to the left. These are two reasons for the increase in the separation of the origins of neuron1 and neuron2.
Reason 3: A difference in bias and weight updates occurs when one neuron is active and the other is inactive. As neuron2 moves further right and its origin crosses 0, it is not active (and its weight and bias are not updated) when x < -b1. This is why the slope of neuron1 increases while the slope of neuron2 decreases, as seen in Fig 5a (right) above. These are the three reasons for the difference in the bias and weight trajectories of the two neurons.
Fig 5c below shows the animation of the neural network as it converges to the reference function. Fig 5d (left) shows the plot of the neural network with the initial biases and weights, and Fig 5d (right) shows the plot after 3601 iterations of backpropagation with the final biases and weights. The animation has three phases. In the first phase both neurons move left, with neuron1 moving further left and with a higher slope. In the second phase neuron2 starts moving right while neuron1 keeps moving left. In the last phase both neurons move to their final origins of 0 and 2, and the slopes converge to their final values of 2 and -4.
Conclusion: Thus we see the backpropagation algorithm updating the two neurons such that they follow different trajectories. This allows the neural network of two neurons to converge to the reference function in 3601 iterations.
Let us next consider an example with a reference function of four neurons.
Example 4: The reference function is specified by the equation ref_fn(x) = 2*relu(x) - relu(x-1) - 2*relu(x-2) - relu(x-3), defined over the range 0 ≤ x ≤ 4. For training we will use a four-neuron neural network. The initial biases are {-0.002, -0.5, -1.0, -1.5} and the initial weights are {0.001, 0.04, 0.07, 0.1}. Similar to the previous example, we have chosen a different bias for each neuron.
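For reference, the corresponding setup for this example, again only a sketch using the earlier helper functions (the reported behaviour below is taken from the original program's run, not from this sketch):

```python
def ref_fn(x):
    # 2*relu(x) - relu(x-1) - 2*relu(x-2) - relu(x-3)
    return 2*relu(x) - relu(x - 1) - 2*relu(x - 2) - relu(x - 3)

weights = [0.001, 0.04, 0.07, 0.1]    # initial weights of the four neurons
biases  = [-0.002, -0.5, -1.0, -1.5]  # initial biases of the four neurons
weights, biases = train(ref_fn, weights, biases)
# As reported in the article, this initialization leads to a local optimum,
# discussed below.
```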
Fig 6a (left) shows the neural network plot with the initial biases and weights and Fig 6a (right) with the final biases and weights. However, even after 20000 iterations this network does not converge to the reference function.
To understand why it does not converge, let us look at the trajectories of the biases and weights of the four neurons shown in Fig 6b and at the animation showing the movements of the four neurons in Fig 6c.
The global optimum requires that the four neurons move to the locations {0, 1, 2, 3} with slopes {2, -1, -2, -1}. Instead, as seen in Fig 6b, the four neurons {purple, green, blue, orange} moved to the locations {0, 0, 1.22, 2.22} with slopes {1.42, 0.5, -1.33, -2.1}. Thus neuron1 and neuron2 both converged to location 0. This results in the coalescing of two neurons into one neuron at location 0, with a combined slope of 1.42 + 0.5 = 1.92.
From Fig 6b we can see that the slopes of neuron1 and neuron2 are positive and rising. When delta(x) > 0 and x > -b2, both neuron1 and neuron2 move left. When delta(x) < 0 and x < -b2 (i.e., neuron2 is inactive), neuron1 moves to the right while neuron2 is not updated, since it is not active when x < -b2. Overall neuron1 moves right, as its movements to the right are larger, while neuron2 moves left. This is the reason for the coalescing of these two neurons. Thus we effectively have a neural network of three neurons with biases {0, -1.22, -2.22} and weights {1.92, -1.33, -2.1}. This is the reason for reaching a local optimum rather than the global optimum.
There are two solutions to this problem.
Solution 1: One solution is to change the initial values of the biases so that the neurons are spread out further to the right. Let us choose the initial biases {0, -0.75, -1.5, -2.24} and weights {0, 0.02, 0.05, 0.08}. Fig 7a shows that this results in convergence in 19200 backpropagation iterations. The trajectories of the biases and the weights are shown in Fig 7b. We see that three neurons move right and have negative slopes. Since neuron2 moves right and has a negative slope, it avoids the problem of coalescing with neuron1. The final biases are {0, -1.0, -1.97, -2.93} and the final weights are {1.96, -0.99, -1.82, -1.15}. Fig 7c is an animation showing the convergence of the neural network to the reference function in 19200 iterations.
The second solution is to use a neural network with five neurons.
Solution 2: We use a neural network of five neurons with initial biases {0, -0.4, -0.8, -1.2, -1.6} and weights {-0.08, -0.06, -0.02, 0.02, 0.06} for training. Fig 8a (left) is a plot comparing the reference function to the neural network with the initial biases and weights. Fig 8a (right) shows the plot of the neural network with the final biases and weights, which converged to the reference function after 20000 iterations.
Fig 8b shows the trajectories of the biases and weights of the five neurons. We see that neuron1 and neuron2 still coalesce, both with a bias of 0. The biases of the other three neurons converge to {-1, -2, -2.89}, and the weights converge to {1.24, 0.72, -1.05, -1.67, -1.25}. Thus the convergence is much better than with the previous network of four neurons.
The next solution combines Solution 1 and Solution 2: we use five neurons and shift their initial biases to the right as in Solution 1.
Solution 3: We use a neural network with five neurons, with initial biases {0, -0.6, -1.2, -1.8, -2.4} and initial weights {-0.06, -0.04, -0.02, 0.02, 0.06}. Fig 9a (left) shows the plot comparing the five-neuron neural network with the initial biases and weights to the reference function, and Fig 9a (right) shows the plot comparing the neural network with the final biases and weights after 12,400 iterations. Fig 9b shows the trajectories of the biases and weights of the five-neuron neural network, and Fig 9c is an animation showing the convergence of the neural network to the reference function in 12,400 iterations. The final biases are {0.022, -0.04, -1.08, -1.99, -2.89} and the final weights are {1.43, 0.54, -1.1, -1.69, -1.16}.
Conclusion: Supervised training of neural networks is time consuming. The backpropagation algorithm moves the neurons and adjusts the weights using steepest descent. It is important to assign initial biases to values within the input data range, and it is best to assign a different bias to each neuron. Spreading out the biases helps avoid the coalescing of neurons.