How Neural Networks “Learn”
In my first story, I explained how a neural network processes your input. Before a neural network can make predictions as in the previous post, it must first go through a preparatory phase. This phase determines the weight and bias values that the network uses when processing your input.
The life cycle of a neural network, and of machine learning algorithms in general, has 2 phases: the training phase and the prediction phase. The weight and bias values are found in the training phase. The prediction phase is where the neural network processes our input to produce predictions, as described in the previous post. This time, I will discuss how neural networks obtain the right weights and biases, a.k.a. “learn”, during the training phase so that they can make accurate predictions (read: regression or classification).
So, how do neural networks get optimal weight and bias values? The answer is through an error gradient. When correcting the current weights and biases (which are initially generated randomly), we want to know two things. First, are the current values too large or too small with respect to their optimal values (do we need to decrease or increase them)? Second, how far do they deviate from their optimal values (by how much do we need to decrease or increase them)? The gradient we are looking for is the derivative of the error with respect to the weights and biases.
Why is that? Because we want to know how our current weights and biases affect the network's error, which lets us answer the 2 questions in the paragraph above (decrease or increase, and by how much). We get the gradient through a well-known algorithm called backpropagation. We then use that gradient to improve the weights and biases through an optimization algorithm. One example is gradient descent, one of the simplest and most widely used optimization algorithms. It simply subtracts the gradient, multiplied by a constant called the learning rate, from the current weights and biases. What the learning rate is, along with more details, is discussed later in this post.
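To make the "decrease or increase" idea concrete, here is a minimal sketch on a toy one-parameter error function. The function `error`, its minimum at 3, and the learning rate 0.1 are all hypothetical choices for illustration, not values from this post's network.

```python
def error(w):
    # Hypothetical toy error with its minimum at w = 3 (not the network's error).
    return (w - 3.0) ** 2


def error_gradient(w):
    # Derivative of the toy error with respect to w.
    return 2.0 * (w - 3.0)


def step(w, learning_rate):
    # One gradient descent step: subtract the gradient times the learning rate.
    return w - learning_rate * error_gradient(w)


# w = 5 is too large: the gradient is positive, so the step decreases w.
too_large = step(5.0, learning_rate=0.1)
# w = 1 is too small: the gradient is negative, so the step increases w.
too_small = step(1.0, learning_rate=0.1)

print(too_large, too_small)
```

The sign of the gradient gives the direction of the correction, and its magnitude (scaled by the learning rate) gives the size of the step.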
Suppose we have a neural network as below.
Suppose we have an input vector, bias vector, weight matrix, and truth value as below
To make it unambiguous, the order of the weight values is
Let’s do the forward pass. The process is the same as in the previous post. The activation function that we use for all neurons in this demonstration is the sigmoid function.
Here we round the output of the sigmoid function to 4 decimal places. In actual calculations, such rounding would noticeably reduce the accuracy of the neural network, because the number of decimal places kept matters a great deal. We round here only to simplify the calculations and to keep the write-up short.
Before we proceed to the next layer, please note that it is the last layer, which means it is the output layer. In this layer, we apply only a pure linear operation.
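The forward pass described above can be sketched in NumPy as follows. All numbers here are hypothetical placeholders, not the values from the figures. The layout of the weight matrices (one row per neuron in the receiving layer) is also an assumption. The layer sizes (3 inputs, 2 hidden neurons, 2 output neurons) follow the description later in this post.

```python
import numpy as np


def sigmoid(p):
    # Sigmoid activation used by the hidden neurons.
    return 1.0 / (1.0 + np.exp(-p))


# Hypothetical values, NOT the ones used in this post's figures.
x = np.array([1.0, 0.5, -1.0])        # input vector (3 inputs)
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])      # assumed shape: (2 hidden, 3 inputs)
b1 = np.array([0.1, -0.1])            # hidden biases
W2 = np.array([[0.7, -0.8],
               [0.9, 0.1]])           # assumed shape: (2 outputs, 2 hidden)
b2 = np.array([0.2, -0.3])            # output biases

P = W1 @ x + b1      # pure linear operation in the hidden layer
h = sigmoid(P)       # hidden layer output after the sigmoid
O = W2 @ h + b2      # output layer: pure linear only, no activation

print(np.round(h, 4), np.round(O, 4))
```

Rounding is applied only when printing here, so the stored values keep full floating-point precision.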
It’s time to calculate the error. In this case, we use the Mean Squared Error (MSE) to calculate the error at the output neurons. The MSE equation is as follows
In our case, N = 1 because we have only 1 data sample, so the equation reduces to
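The two equations referred to here are assumed to take the standard MSE form, with T the truth value and O the network output. The sample index n is notation added for this sketch. The first line is the general definition and the second is the N = 1 case:

```latex
\begin{align*}
E &= \frac{1}{N} \sum_{n=1}^{N} \left( T_n - O_n \right)^2 \\
E &= \left( T - O \right)^2 \qquad \text{(for } N = 1 \text{)}
\end{align*}
```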
Let’s calculate the error of each neuron in the output layer based on the truth value (T) that we defined earlier.
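As a small numerical sketch of this step, assuming the N = 1 squared-error form and using hypothetical output and truth values (not the ones from the figures):

```python
import numpy as np

# Hypothetical values for illustration only.
O = np.array([0.16, 0.19875])   # outputs of the 2 output neurons
T = np.array([1.0, 0.0])        # truth values

# Squared error of each output neuron (the N = 1 case of MSE).
E_per_neuron = (T - O) ** 2
# Summing over the output neurons gives one number to minimize;
# using the sum as the total error is an assumption of this sketch.
E_total = E_per_neuron.sum()

print(E_per_neuron, E_total)
```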
So that’s our current error in the output layer. Now it is time to minimize the error. We look for the gradient of the error with respect to the weights and biases between each pair of layers via backpropagation, a.k.a. the backward pass, and then apply gradient descent. Backpropagation is essentially just the chain rule; how it works is discussed shortly. For now, let’s find the derivatives of all the equations we used in the forward pass.
1. Derivative of E with respect to O
2. Derivative of the sigmoid function (h) with respect to P (the output of the pure linear operation)
where h is
3. Derivatives of the pure linear function with respect to the weight (W), the bias (b), and the input (h).
where purelin is
where l is a number from 1 to M.
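For reference, the three derivatives listed above can be written out as below. This is a reconstruction under the assumption that E is the N = 1 squared error and that the pure linear function sums over M inputs; the exact notation in the original figures may differ.

```latex
\begin{align*}
% 1. Squared error of one output neuron, E = (T - O)^2
\frac{\partial E}{\partial O} &= -2\,(T - O) \\[4pt]
% 2. Sigmoid, h = 1 / (1 + e^{-P})
h = \frac{1}{1 + e^{-P}}, \qquad
\frac{\partial h}{\partial P} &= h\,(1 - h) \\[4pt]
% 3. Pure linear function of M inputs h_1 ... h_M
\mathrm{purelin} = \sum_{l=1}^{M} W_l\, h_l + b, \qquad
\frac{\partial\, \mathrm{purelin}}{\partial W_l} &= h_l, \quad
\frac{\partial\, \mathrm{purelin}}{\partial b} = 1, \quad
\frac{\partial\, \mathrm{purelin}}{\partial h_l} = W_l
\end{align*}
```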
That’s all we need; it’s time to apply backpropagation. We first look for the gradient with respect to the weights and biases between the hidden layer and the output layer. To find the gradients, we use the chain rule.
And by applying these, we get
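A minimal NumPy sketch of this step, using hypothetical values and assuming that the weight matrix between the hidden and output layers (called `W2` here) has one row per output neuron:

```python
import numpy as np

# Hypothetical values, not the ones from the figures.
h = np.array([0.5, 0.4875])      # hidden layer outputs
W2 = np.array([[0.7, -0.8],
               [0.9, 0.1]])      # assumed shape: (2 outputs, 2 hidden)
b2 = np.array([0.2, -0.3])
T = np.array([1.0, 0.0])         # truth values

O = W2 @ h + b2                  # pure linear output layer

# Chain rule: dE/dW2 = dE/dO * dO/dW2 and dE/db2 = dE/dO * dO/db2.
dE_dO = -2.0 * (T - O)           # derivative of (T - O)^2 w.r.t. O
dE_dW2 = np.outer(dE_dO, h)      # dO_i/dW2[i, j] = h[j]
dE_db2 = dE_dO * 1.0             # dO_i/db2[i] = 1

print(dE_dW2)
print(dE_db2)
```

Each entry `dE_dW2[i, j]` is the gradient for the weight connecting hidden neuron `j` to output neuron `i`.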
So that’s our gradient for the layer between the hidden layer and the output layer. Now, on to the next layer. Here comes the real challenge (which is not so challenging)! But don’t worry, after this everything will be clear and easy :).
The chain rule in backpropagation is all about the paths between neurons. Let’s collect the information!
- There are 2 neurons in the hidden layer, and each neuron is connected to 3 weights and 1 bias on its left side (between the input layer and the hidden layer).
- On the right side, each neuron in the hidden layer is connected to 2 neurons in the output layer.
These pieces of information are very important for finding the gradient of W1. From these, the gradients we want to find are
where
The contributions of all possible paths from the weight in question to the output layer are added together. That is why there is a sum of 2 terms in the equation above. Now, let’s compute the actual gradient of W1.
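In index notation, the chain rule for W1 can be sketched as below. The indices are my own assumption for illustration (j for a hidden neuron, k for an input, i for an output neuron, and W2 with one row per output neuron), and the total error is taken as the sum of the errors of the 2 output neurons.

```latex
\begin{align*}
\frac{\partial E}{\partial {W1}_{jk}}
  &= \frac{\partial E}{\partial h_j}\,
     \frac{\partial h_j}{\partial P_j}\,
     \frac{\partial P_j}{\partial {W1}_{jk}}
   = \frac{\partial E}{\partial h_j}\; h_j\,(1 - h_j)\; x_k \\[4pt]
\frac{\partial E}{\partial h_j}
  &= \sum_{i=1}^{2} \frac{\partial E_i}{\partial O_i}\,
     \frac{\partial O_i}{\partial h_j}
   = \sum_{i=1}^{2} -2\,(T_i - O_i)\; {W2}_{ij}
\end{align*}
```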
Substituting the partial derivative of E with respect to h, we get
Now for the biases, a.k.a. b1
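The gradients for W1 and b1 can be sketched in NumPy as follows. The values are hypothetical, and the weight layout (one row per neuron in the receiving layer) is an assumption. The total error is taken as the sum of the squared errors of the 2 output neurons.

```python
import numpy as np


def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))


# Hypothetical values, not the ones from the figures.
x = np.array([1.0, 0.5, -1.0])
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])     # assumed shape: (2 hidden, 3 inputs)
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.7, -0.8],
               [0.9, 0.1]])          # assumed shape: (2 outputs, 2 hidden)
b2 = np.array([0.2, -0.3])
T = np.array([1.0, 0.0])


def forward(W1_, b1_):
    # Forward pass: sigmoid hidden layer, pure linear output layer.
    h_ = sigmoid(W1_ @ x + b1_)
    O_ = W2 @ h_ + b2
    return h_, O_


h, O = forward(W1, b1)

dE_dO = -2.0 * (T - O)               # derivative of the squared error
dE_dh = W2.T @ dE_dO                 # sum over the paths to both output neurons
dE_dP = dE_dh * h * (1.0 - h)        # through the sigmoid derivative
dE_dW1 = np.outer(dE_dP, x)          # dP_j/dW1[j, k] = x[k]
dE_db1 = dE_dP * 1.0                 # dP_j/db1[j] = 1

print(dE_dW1)
print(dE_db1)
```

The product `W2.T @ dE_dO` is where the "sum of 2 terms" appears: each hidden neuron collects the error signal from both output neurons it feeds.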
And that’s the end of the role of the backpropagation algorithm. Now, on to the optimization algorithm, which is about how to use the gradient we have obtained to correct the existing weights and biases. The optimization algorithm we choose is gradient descent. Gradient descent corrects the weights and biases with the equation below.
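Based on the description of gradient descent earlier in this post (subtract the gradient multiplied by the learning rate), the update rule can be written as follows; the same form is assumed to apply to the biases:

```latex
\begin{align*}
W' &= W - a\,\frac{\partial E}{\partial W} \\
b' &= b - a\,\frac{\partial E}{\partial b}
\end{align*}
```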
where W’ is the new weight, W is the current weight, a is the learning constant (the learning rate), and the gradient is the one we obtained from backpropagation. The learning constant is crucial: if it is too big, training will not converge, and if it is too small, more iterations are needed, which makes the training phase more time-consuming. Suppose we have a learning constant equal to 0.02.
And so on. This process is repeated (with the same input) until the required number of iterations or the target error has been reached.
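Putting the pieces together, the repeated process might look like the sketch below. It is not the source code promised at the end of this post, just a minimal illustration. The function name `train`, the stopping parameters `max_iterations` and `target_error`, the weight layout, and all numeric values except the learning constant 0.02 are assumptions.

```python
import numpy as np


def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))


def train(x, T, W1, b1, W2, b2, learning_rate=0.02,
          max_iterations=5000, target_error=1e-6):
    """Repeat forward pass, backward pass and update on a single sample."""
    W1, b1, W2, b2 = W1.copy(), b1.copy(), W2.copy(), b2.copy()
    errors = []
    for _ in range(max_iterations):
        # Forward pass: sigmoid hidden layer, pure linear output layer.
        h = sigmoid(W1 @ x + b1)
        O = W2 @ h + b2
        E = np.sum((T - O) ** 2)
        errors.append(E)
        if E <= target_error:
            break  # target error reached
        # Backward pass (backpropagation via the chain rule).
        dE_dO = -2.0 * (T - O)
        dE_dP = (W2.T @ dE_dO) * h * (1.0 - h)
        # Gradient descent: subtract learning_rate * gradient.
        W2 -= learning_rate * np.outer(dE_dO, h)
        b2 -= learning_rate * dE_dO
        W1 -= learning_rate * np.outer(dE_dP, x)
        b1 -= learning_rate * dE_dP
    return (W1, b1, W2, b2), errors


# Hypothetical starting values, not the ones from the figures.
x = np.array([1.0, 0.5, -1.0])
T = np.array([1.0, 0.0])
W1 = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.7, -0.8], [0.9, 0.1]])
b2 = np.array([0.2, -0.3])

params, errors = train(x, T, W1, b1, W2, b2)
print(len(errors), errors[0], errors[-1])
```

The loop stops at whichever comes first: the iteration limit or the target error.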
So this is how neural networks “learn” in general. If I have more free time (and I am in a good mood, of course), I will share the source code of a multi-layer perceptron (another name for the “ordinary neural network” that is our focus here) in Python using NumPy. See you.