How Neural Networks “Learn”

Muhammad Ryan
Towards Data Science
8 min read · Oct 12, 2018


[Figure: an illustration of gradient descent. Source: https://stats385.github.io/assets/img/grad_descent.png]

In my first story, I explained how a neural network processes your input. Before a neural network can make predictions as in the previous post, it must pass through a training phase. This phase determines the weight and bias values the neural network uses when processing your input.

There are 2 phases in the life cycle of a neural network, and of machine learning algorithms in general: the training phase and the prediction phase. Finding the weight and bias values happens in the training phase, while processing our input to produce predictions, as in the previous post, happens in the prediction phase. This time, I will discuss how neural networks arrive at the correct weights and biases, a.k.a. “learn” to make accurate predictions (read: regression or classification), during the training phase.

So, how do neural networks reach optimal weight and bias values? The answer is through an error gradient. When correcting the current weights and biases (which are initially generated randomly), we want to know two things: are the current values too large or too small relative to their optimal values (do we need to decrease or increase them?), and by how much do they deviate (how much do we need to decrease or increase them?). The gradient we are looking for is the derivative of the error with respect to the weights and biases.

∂E/∂W and ∂E/∂b

where E is the error, W is a weight, and b is a bias

Why is that? Because we want to know how our current weights and biases affect the neural network’s error, which is exactly what answers the 2 questions in the paragraph above (decrease or increase, and by how much). We obtain the gradient through a well-known algorithm called backpropagation, and we use the gradient obtained through backpropagation to improve the weights and biases through an optimization algorithm. One example, and the simplest and most frequently used, is gradient descent: it simply subtracts from the current weights and biases the obtained gradient multiplied by a learning-rate constant. What the learning rate is, and more details, we will discuss shortly in this post.
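As a quick sketch of that update rule (the weight and gradient below are hypothetical values, just to show the mechanics):

```python
# Gradient descent update: new_value = current_value - learning_rate * gradient.
# Both the weight and the gradient here are hypothetical, for illustration only.
learning_rate = 0.02
weight = 0.5        # current weight
gradient = -0.1     # dE/dW, as obtained from backpropagation

new_weight = weight - learning_rate * gradient
print(new_weight)   # 0.502: a negative gradient pushes the weight up
```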

Suppose we have a neural network as below.

Our neural network has a structure 3–2–2

Suppose we have an input vector, bias vector, weight matrix, and truth value as below

To make it unambiguous, the order of the weight values is

Let’s do the forward pass. The process is the same as in the previous post. The activation function that we use for all neurons in this demonstration is the sigmoid function.

Here we round the output of the sigmoid function to 4 decimal places. In actual calculations, such rounding would greatly reduce the accuracy of the neural network; the number of decimals is crucial to its accuracy. We round here only to simplify the calculations and keep the write-up from getting too long.

Before we proceed, note that the next layer is the last one, i.e. the output layer. In this layer, we perform only a pure linear operation.
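Putting the forward pass into code for a 3–2–2 network (the input, weights, and biases below are illustrative placeholders, not the actual values from the figures above):

```python
import numpy as np

# Forward pass for a 3-2-2 network: sigmoid hidden layer, pure-linear output.
# All numeric values are hypothetical placeholders, not the article's figures.
def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

x  = np.array([1.0, 4.0, 5.0])      # input vector (3,)
W1 = np.array([[0.1, 0.3, 0.5],
               [0.2, 0.4, 0.6]])    # hidden-layer weights (2, 3)
b1 = np.array([0.5, 0.5])           # hidden-layer biases (2,)
W2 = np.array([[0.7, 0.9],
               [0.8, 0.1]])         # output-layer weights (2, 2)
b2 = np.array([0.5, 0.5])           # output-layer biases (2,)

p = W1 @ x + b1                     # pure-linear part of the hidden layer
h = sigmoid(p)                      # hidden activations, each between 0 and 1
o = W2 @ h + b2                     # output layer: pure linear only
print(o)
```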

It’s time to calculate the error. In this case, we use Mean Squared Error (MSE) to calculate the error at the output neurons. The MSE equation is as follows

In our case, N = 1 because we have just 1 data point, so the equation reduces to

Let’s calculate the error of the neurons in the output layer based on the truth value (T) that we defined earlier.
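As a sketch, the N = 1 error computation per output neuron looks like this (the truth and output vectors here are hypothetical, not the values from the figures):

```python
import numpy as np

# Squared error per output neuron. With a single training sample (N = 1),
# the mean over samples disappears. T and O are hypothetical values.
T = np.array([1.0, 0.0])   # truth vector
O = np.array([0.8, 0.3])   # network output

E = (T - O) ** 2           # per-neuron squared error
print(E)                   # [0.04 0.09]
```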

So that’s our current error in the output layer. Now it’s time to minimize the error by finding the error gradient with respect to every weight and bias between the layers via backpropagation, a.k.a. the backward pass, and applying gradient descent afterward. Backpropagation is simply the chain rule; how it works will be discussed shortly. For now, let’s find the derivatives of all the equations we used in the forward pass.

  1. Derivative of E with respect to O. With N = 1, E = (T − O)², so dE/dO = −2(T − O).

  2. Derivative of the sigmoid function (h) with respect to P (the output of the pure linear operation): dh/dP = h(P)(1 − h(P))

where h is the sigmoid function, h(P) = 1 / (1 + e^(−P))

  3. Derivative of the pure linear operation with respect to the weights (W), the bias (b), and the inputs (h): ∂P/∂W_l = h_l, ∂P/∂b = 1, ∂P/∂h_l = W_l

where purelin is P = Σ (l = 1 to M) W_l · h_l + b

where l is a number from 1 to M and M is the number of inputs to the neuron.
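Numerically, the three derivative building blocks look like this (assuming E = (T − O)² per output neuron, per the N = 1 reduction above; all input values are hypothetical):

```python
import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

# 1. dE/dO for E = (T - O)^2  ->  -2 * (T - O)   (hypothetical T and O)
T, O = 1.0, 0.8
dE_dO = -2.0 * (T - O)

# 2. dh/dP for the sigmoid  ->  sigmoid(P) * (1 - sigmoid(P))
P = 0.5
dh_dP = sigmoid(P) * (1.0 - sigmoid(P))

# 3. purelin: P = sum_l W_l * h_l + b, so dP/dW_l = h_l, dP/db = 1, dP/dh_l = W_l
h_l, W_l = 0.9, 0.4
dP_dW, dP_db, dP_dh = h_l, 1.0, W_l

print(dE_dO, dh_dP)
```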

That’s all we need; it’s time to apply backpropagation. We first look for the gradients of the weights and biases between the hidden layer and the output layer. To find the gradients, we use the chain rule.

Applying these, we get
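As a numerical sketch of these chain-rule products for the output-layer weights and biases (since the output layer is pure linear, ∂O/∂W2 is just the hidden activation h; all values are hypothetical):

```python
import numpy as np

# Output-layer gradients by the chain rule, assuming E = (T - O)^2:
#   dE/dW2 = dE/dO * dO/dW2 = -2*(T - O) * h
#   dE/db2 = dE/dO * dO/db2 = -2*(T - O) * 1
# All numbers are illustrative placeholders, not the article's figures.
T = np.array([1.0, 0.0])
O = np.array([2.0, 1.4])       # output of the pure-linear layer
h = np.array([0.98, 0.99])     # hidden activations

dE_dO  = -2.0 * (T - O)        # shape (2,)
dE_dW2 = np.outer(dE_dO, h)    # shape (2, 2): one row per output neuron
dE_db2 = dE_dO                 # bias gradients
print(dE_dW2)
```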

So those are our gradients for the layer between the hidden layer and the output layer. Now, on to the next layer. Here comes the real challenge (not so challenging, really)! But don’t worry, after this everything will be clear and easy :).

The chain rule in backpropagation is all about the paths between neurons. Let’s collect the information!

  1. There are 2 neurons in the hidden layer, and each neuron is connected to 3 weights and 1 bias on the left side (between the input layer and the hidden layer).
  2. On the right side, each neuron in the hidden layer is connected to the 2 neurons in the output layer.

These pieces of information are very important for finding the gradient of W1. From them, the gradients we want to find are

where

All possible paths from the weight we care about to the output layer are added together; that is why there is a sum of 2 terms in the equation above. Now, let’s compute the actual gradient of W1.

Substituting the partial derivative of E with respect to h, we get

Now for the biases, a.k.a. b1
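The whole hidden-layer step can be sketched as follows (the sum over the 2 paths becomes a matrix-vector product with W2 transposed; all numeric values are hypothetical placeholders):

```python
import numpy as np

# Hidden-layer gradients: every hidden neuron reaches the error through both
# output neurons, so the chain rule sums over the two paths:
#   dE/dh_j   = sum_k dE/dO_k * W2[k, j]
#   dE/dW1_ji = dE/dh_j * h_j * (1 - h_j) * x_i   (sigmoid derivative folded in)
#   dE/db1_j  = dE/dh_j * h_j * (1 - h_j)
x  = np.array([1.0, 4.0, 5.0])         # input vector
h  = np.array([0.98, 0.99])            # hidden activations
W2 = np.array([[0.7, 0.9],
               [0.8, 0.1]])            # output-layer weights
dE_dO = np.array([2.0, 2.8])           # gradient from the output layer

dE_dh  = W2.T @ dE_dO                  # sum over both paths, shape (2,)
delta1 = dE_dh * h * (1.0 - h)         # apply the sigmoid derivative
dE_dW1 = np.outer(delta1, x)           # shape (2, 3)
dE_db1 = delta1                        # bias gradients
print(dE_dW1)
```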

And that’s the end of the backpropagation algorithm’s role. Now, on to the optimization algorithm, which is about how to use the gradient we have obtained to correct the existing weights and biases. The optimization algorithm we choose is gradient descent, which corrects the weights and biases with the equation below.

where W’ is the new weight, W is the current weight, a is the learning constant, and the gradient is the one obtained from backpropagation. The learning constant is crucial: if it is too big, the result will not converge, and if it is too small, more iterations are needed, which means the training phase will be more time-consuming. Suppose our learning constant equals 0.02.

And so on; this process is repeated (with the same input entered each time) until the required number of iterations or the target error has been reached.
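Putting everything together, the repeated forward pass / backward pass / update cycle can be sketched as a minimal training loop (random initial values, not the figures’; the architecture is the 3–2–2 one from above):

```python
import numpy as np

# Minimal end-to-end training loop for the 3-2-2 network: forward pass,
# backpropagation, gradient descent, repeated for a fixed number of
# iterations. Input, truth, and initial weights are hypothetical.
rng = np.random.default_rng(0)

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

x = np.array([1.0, 4.0, 5.0])          # input vector
T = np.array([0.1, 0.9])               # truth value
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)
lr = 0.02                              # learning constant

for _ in range(2000):
    # forward pass
    h = sigmoid(W1 @ x + b1)
    O = W2 @ h + b2
    E = np.sum((T - O) ** 2)
    # backward pass (backpropagation)
    dE_dO = -2.0 * (T - O)
    dE_dW2, dE_db2 = np.outer(dE_dO, h), dE_dO
    delta1 = (W2.T @ dE_dO) * h * (1.0 - h)
    dE_dW1, dE_db1 = np.outer(delta1, x), delta1
    # gradient descent update
    W2 -= lr * dE_dW2; b2 -= lr * dE_db2
    W1 -= lr * dE_dW1; b1 -= lr * dE_db1

print(E)   # the error shrinks toward 0 as training proceeds
```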

So this is how neural networks “learn” in general. If I have more free time (and a good mood, of course), I will share the source code of a multi-layer perceptron (another name for the “ordinary neural network” that is our focus here) in Python using numpy. See you.

Another neural network series by me:

  1. How Neural Network Process Your Input (Trained Neural Network)
  2. How Neural Network “Learn”
  3. A Simple Way to Know How Important Your Input is in Neural Network
