Image by Gerd Altmann from Pixabay.

Learn How Neural Networks Learn

Anna Shi
Towards Data Science


Yesterday, I attempted to clean my perpetually growing inbox. A terrifying task indeed. Normally, answering my emails takes forever because I’d have to think of how to compose them juuuust right. Working from my Gmail browser tab though, I didn’t have to. The perfect answers were simply offered to me.

Screenshot from my Gmail.

Is it just me, or does seeing these pre-formed answers ready to go throw you off? They’re perfectly appropriate responses, if impersonal. But how does Google know?

Well, it’s thanks to a little thing called AI. You may have heard of it.

Artificial intelligence is a computer’s ability to learn to do things we’d typically expect only humans to be able to do. That includes creating new artistic movements, generating original recipes (tastiness undetermined), and formulating a theory of everything with an answer of 42.

How does AI do this? How does it know what’s right and not right? Let me tell you the not-so-secret-secret.

Machines learn the same way you and I do.

In fact, machine learning algorithms are modeled after our brains, hence the name neural network.

Neural network diagram by Facundo Bre.

Every time you speak, think, or even feel, external stimuli fire off your neurons, triggering a chain of signals along your nervous system.

Similarly, inputs fire off certain responses in the nodes of a neural network. Those responses are passed through the network to generate the final outputs, or predictions.

Still not convinced? Let me explain with a story.

Imagine me in kindergarten. My parents want me to be smarter than all the other five-year-olds and start teaching me addition. They give me a problem to solve, 1+1. I have no idea what to do, so I write 1+1=11. My parents are already disappointed I’m not a math prodigy.

Addition is hard, okay? Five-year-old me is trying her best.

They tell me that’s wrong, but for some reason they tell me to keep guessing instead of teaching me. At least I get a hint: 11’s too high.

Hmmm. I know how to count down by 1s, but I think I can impress them by counting down by 2s. So I begin guessing. 9? They shake their heads and exchange meaningful glances.

I keep going. 7? 5? Finally, they start to look excited. I’m getting close. I decide to start counting down by 1s to make sure I don’t miss the right answer. 4? 3? 2? It’s 2! My parents shower me with hugs and kisses.

This is the basic idea behind gradient descent, the process used to train a neural network. The model’s first guess has no basis and will almost certainly be wrong. By comparing that wrong answer to the correct answer, though, the model can adjust itself to start outputting values that are closer to the correct one. In this case, I started by changing my answers in steps of 2, which we can call the learning rate. As I got closer to the answer, though, my learning rate shrank to 1 so that I wouldn’t miss the answer.
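To make that concrete, here’s a minimal Python sketch of the same idea. The numbers, the squared-error loss, and the decay schedule are my own toy choices for illustration, not anything from the article:

```python
# Toy gradient descent: start from a baseless guess at 1 + 1 and step toward
# the correct answer, shrinking the learning rate as we get closer.
target = 2.0          # the correct answer
guess = 11.0          # my first, baseless guess
learning_rate = 0.4   # how big a step to take each time

for step in range(15):
    error = (guess - target) ** 2        # squared error: how wrong the guess is
    gradient = 2 * (guess - target)      # derivative of the error w.r.t. the guess
    guess -= learning_rate * gradient    # move the guess against the gradient
    learning_rate = max(0.05, learning_rate * 0.9)  # take smaller steps over time
    print(f"step {step}: guess = {guess:.2f}, error = {error:.2f}")
```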

Gradient descent diagram by Sebastian Raschka.

There’s more to the story, though.

I feel like a genius until my sister steps in. She’s jealous that my parents are giving me all the attention and decides to mess with me. Unfortunately, I still trust her wholeheartedly to have my best interests at heart, as I have yet to understand the nature of sibling rivalry.

“Wow, smartypants, so you can do addition now! Can I show you another possible answer for 1+1?” Flattered by the compliment, I graciously accept.

“Watch this. If you put the equal sign at the top and bottom of the equation, then you’ll get a window! See?”

I’m floored by this discovery. When I get to school, my teacher asks the class what’s 1+1. I confidently raise my hand and declare that it’s window.

What was the problem? Well, I trusted my sister more than my parents to give me the right answer. I assigned more weight to my sister’s explanation than to my parents’ answer. Next time, I’ll know to listen more to my parents and less to my sister. This process of adjusting weights is called backpropagation.

When a neural network is first set up, the weights and biases in each layer are randomized (we can think of a bias as a weight). The prediction is wildly inaccurate as a result. Backpropagation updates the weights and biases of each of the nodes until the model can make consistently accurate predictions.
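To picture that random starting point, here’s a tiny NumPy sketch. The layer sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A hypothetical 2-layer network: 3 inputs -> 4 hidden nodes -> 1 output.
# Everything starts random, so the first predictions are essentially guesses.
W1 = rng.normal(size=(4, 3))   # weights from the input layer to the hidden layer
b1 = rng.normal(size=(4,))     # hidden-layer biases (just another weight for an input of 1)
W2 = rng.normal(size=(1, 4))   # weights from the hidden layer to the output
b2 = rng.normal(size=(1,))     # output bias
```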

How Backpropagation Works

Here were the steps in the context of my story:

  1. My parents and my sister both provided me with answers. Keeping them both in mind, I chose to go with an answer of window.
  2. I compared my answer to my teacher’s correct answer.
  3. Window was completely wrong. The answer was 2.
  4. I looked back to my sources to figure out where I went wrong. My parents are clearly smarter than my sister.
  5. I decided to start trusting my parents more than my sister from now on.
  6. I used my new methodology to figure out the answer to 2+2.

Here’s the rundown in machine learning terms:

  1. Perform a feedforward operation.
  2. Compare the model’s output with the desired output.
  3. Calculate the error with the error function.
  4. Run the feedforward operation backwards.
  5. Update the weights.
  6. Rinse and repeat.

Perform a feedforward operation

The feedforward operation is the process of generating an output from the given inputs in a neural network.

I won’t go in-depth on this, but I’ve made a quick sequence to follow.

Receive inputs: If this is the first hidden layer, the inputs will come directly from the data. Otherwise, the inputs will be the outputs generated from the previous layer.

Calculate prediction: The prediction depends on the weight for each input and the bias. The bias can be considered its own weight for an input of 1. The formula for the prediction:

prediction = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Apply sigmoid function: This just turns the prediction into a number between 0 and 1: σ(x) = 1 / (1 + e^(−x)).

Generate output: This output will be sent to the nodes in the next layer. This sequence will repeat in every node of every layer until we reach our final prediction.

Note: the bias can also be represented as the weight of a unit input. That means we’ll add an input of 1 to the input layer and every hidden layer. We’ll use this convention in the notation that follows.

We can also write this in matrix form. The output for the first node in the hidden layer will be calculated as follows:

h₁ = σ(W₁ · x)

where W₁ is the vector of weights feeding into that node and x is the vector of inputs (including the unit input that stands in for the bias).

Note that the weights in that equation are only those from the input layer, so they should have a superscript of (1). If we imagine this in the grand scheme of the feedforward operation, then we need to consider the different collections of weights in different layers. So it might be more accurate to say that the output for that first node would be this:

h₁ = σ(W₁^(1) · x)

The final prediction will be calculated as follows:

ŷ = σ(W^(2) · σ(W^(1) · x))

If it helps, read it from back to front. First, we calculate the output for each of the nodes in the hidden layer. Then, we calculate the final prediction for our model.
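If it’s easier to read as code, here’s a minimal NumPy sketch of that feedforward pass. The layer sizes and input values are made up, and the bias is folded in as a unit input, as noted above:

```python
import numpy as np

def sigmoid(z):
    # Squash any number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, W2):
    # W1: weights from the input layer to the hidden layer (bias folded in).
    # W2: weights from the hidden layer to the output layer (bias folded in).
    x = np.append(x, 1.0)            # unit input so the bias acts like any other weight
    hidden = sigmoid(W1 @ x)         # outputs of every node in the hidden layer
    hidden = np.append(hidden, 1.0)  # unit input again for the output layer's bias
    return sigmoid(W2 @ hidden)      # final prediction: y_hat = sigmoid(W2 * sigmoid(W1 * x))

# Toy example: 2 inputs -> 3 hidden nodes -> 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # 3 hidden nodes, each seeing 2 inputs + 1 bias input
W2 = rng.normal(size=(1, 4))   # 1 output, seeing 3 hidden outputs + 1 bias input
print(feedforward(np.array([0.5, -1.2]), W1, W2))
```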

Compare the model’s output with the desired output

Now that we have our final prediction, we can compare it with the desired output — that is, the correct answer. It’ll be wrong, but it will give us insights as to how we’d want the model’s boundary to move.

  • If the point is misclassified, the boundary should move closer to it.
  • If the point is classified correctly, the boundary should move farther away.

On the left, the red point is classified incorrectly as a green point, so the boundary should move closer to minimize the error. On the right, the red point is classified correctly, so the boundary should move farther away.


Calculate the error with the error function

The next few steps go hand in hand, so there might be some overlap in the explanation.

We have the final prediction from our output layer. We know in which direction we want to push the boundary. We just need to know how much the boundary should move.

We’ll start by considering only one answer as “correct” and calculating its error with the error function. Remember the equation for the error function?

The negative of the gradient of this error function will determine the amount by which the boundary has to move, either closer to or farther from the point. The gradient is formed by the partial derivatives of the error function with respect to every weight.
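As a concrete (assumed) example: if the error function is the binary cross-entropy and the output node is a sigmoid, the partial derivative of the error with respect to each output-layer weight works out to (ŷ − y) times that weight’s input. A small sketch:

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Binary cross-entropy error for one example. This is my assumption for the
    # error function; the article's diagrams may define it slightly differently.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def output_weight_gradient(y, y_hat, node_input):
    # With a sigmoid output and cross-entropy error, dE/dw simplifies to
    # (y_hat - y) * (the input flowing into that weight).
    return (y_hat - y) * node_input
```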

From there, we can give our suggestions for updating the weights on the hidden layer’s nodes. Remember that each node’s output is the sigmoid of the weighted sum of the outputs from the previous layer.

To update each weight, we subtract the product of the learning rate α and the gradient from the current weight value (equivalently, we add α times the negative gradient). It looks like this:

wᵢ′ = wᵢ − α ∂E/∂wᵢ

Run the feedforward operation backwards

Well, that’s how the process would work if we only had one output value to consider.

What actually happens, though, is that each output from the final layer tells each node in the hidden layer to change its weights in order to minimize its error. Each node in the hidden layer considers all of these suggestions from the outputs and averages them. This gives the negative gradient of the error function for each weight in each layer. Written with the chain rule, the gradient for a weight feeding a hidden node looks like this:

∂E/∂w^(1) = ∂E/∂ŷ · ∂ŷ/∂h · ∂h/∂w^(1)

This process repeats for each layer, updating each of the weights. It’s like a feedforward operation, but backwards. We’ll repeat it until we’ve adjusted all the weights from the hidden layers back to the input layer, using the same update rule in every layer:

w′ = w − α ∂E/∂w
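Here’s a sketch of one full backward pass for the same toy two-layer network as the feedforward sketch above, again assuming a sigmoid output and cross-entropy error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, alpha=0.1):
    # Forward pass (same shapes and bias convention as the feedforward sketch).
    x = np.append(x, 1.0)
    hidden = sigmoid(W1 @ x)
    hidden_b = np.append(hidden, 1.0)
    y_hat = sigmoid(W2 @ hidden_b)

    # Output layer: with sigmoid + cross-entropy, the error signal is (y_hat - y).
    delta_out = y_hat - y
    grad_W2 = np.outer(delta_out, hidden_b)        # dE/dW2

    # Hidden layer: send the error signal backwards through W2 (dropping the bias
    # column) and through the sigmoid's derivative, hidden * (1 - hidden).
    delta_hidden = (W2[:, :-1].T @ delta_out) * hidden * (1 - hidden)
    grad_W1 = np.outer(delta_hidden, x)            # dE/dW1

    # Update every weight: step against the gradient, scaled by the learning rate.
    W2 = W2 - alpha * grad_W2
    W1 = W1 - alpha * grad_W1
    return W1, W2
```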

Rinse and repeat

And now we do it again! We’ve updated the weights, but not by a lot. We’ll have to keep running the operation, checking our prediction, and updating our weights until we’ve minimized the error of our function. Then, we’re done. Woohoo!
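Putting it together with the backprop_step sketch from the previous section, the whole “rinse and repeat” loop might look like this (the dataset here is a made-up toy, learning the OR function):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: learn the OR function on two binary inputs.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0.0, 1.0, 1.0, 1.0])

W1 = rng.normal(size=(3, 3))   # 2 inputs + bias -> 3 hidden nodes
W2 = rng.normal(size=(1, 4))   # 3 hidden outputs + bias -> 1 output

for epoch in range(2000):
    for x, y in zip(inputs, labels):
        # Feed forward, compare, push the error backwards, update the weights...
        W1, W2 = backprop_step(x, y, W1, W2, alpha=0.5)
        # ...and repeat until the predictions are consistently close to the labels.
```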

Maybe you’re wondering what the “rinse” is. Well, there are two interpretations:

  1. Cry tears of joy for having finally gotten through this.
  2. Cry tears of disappointment for still being confused.

If you didn’t get it all the first time, don’t worry! Reread the steps and check out the videos and articles I’ve linked down below. You’ve got this :)

My take on “Machine Learning.” Yes, I am aware that computers can draw better than me.

Further reading/watching

Intro to Deep Learning with PyTorch | Udacity Free Courses

What is backpropagation really doing? | Deep learning, chapter 3

The Maths behind Back Propagation
