Welcome to Part 2 of the Deep Learning Illustrated series. In the previous article (definitely read that first!), we covered how a neural network works and how a trained neural network makes predictions. We also learned that the neural network arrives at optimal weight and bias values during the training process.
Deep Learning Illustrated, Part 1: How Does a Neural Network Work?
In this article, we’ll delve into the training process and explore exactly how a neural network learns.
📣 If you haven’t read my previous articles, I highly recommend you start with my series of articles covering the basics of machine learning, specifically the one on Gradient Descent because you’ll find that a lot of the material covered there is relevant here.
Let’s say we want to create a neural network that predicts the daily revenue of ice cream sales using the features temperature and day of the week.
This is the training dataset we’re using:

To build a neural network, as we learned in the previous article, we need to first decide on its architecture. This includes determining the number of hidden layers, the number of neurons in each layer, and the activation function of each neuron.
Let’s say we decided our architecture is: 1 hidden layer with 2 neurons, and 1 output neuron, all using the rectifier (ReLU) activation function.

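Quick aside: the rectifier (usually just called ReLU) is a one-liner. A minimal sketch in Python:

```python
def relu(z):
    # Rectifier (ReLU): pass positive values through, zero out negatives
    return max(0.0, z)
```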
Terminology segue: In the previous article, we learned about using subscripts to differentiate between different weights. We’re sticking with the same convention here, and in addition, we’ll use superscripts to indicate the layer to which the bias and weights belong. So, above, we can see that the weights going into the first layer of neurons and the bias terms in that layer all have a superscript of 1.
Another thing you’ll notice is that our predicted output is denoted as r_hat. We learned that the hat symbol indicates it’s the predicted value, and since we’re predicting revenue here, we’re using r.
Once we’ve nailed down the architecture, it’s time to train the model by feeding it some data. During this training process, the neural network will learn the optimal values of the weight and bias terms. Let’s say that after training the model using the training data above, it produces the following optimal values:

This article will focus on exactly how we arrived at these optimal values.
Let’s start with a simple scenario. Suppose we have all the optimal values except the bias term for the output layer neuron.

Since we don’t know the exact value of the bias, we begin by making an initial guess and setting the value to 0. Typically, bias values are initialized to 0 at the start.

Now, we need to input all the ice cream store features to make revenue predictions (aka forward propagation, as we learned in the previous article), assuming that the last bias term is 0. Let’s pass the 10 rows of our training data into the neural network…

…to get the following predictions:

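For concreteness, here’s a minimal NumPy sketch of what this forward pass computes. The weight and bias numbers below are made-up placeholders, not the values from our figures:

```python
import numpy as np

def relu(z):
    # vectorized rectifier: element-wise max(0, z)
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)      # hidden layer: 2 neurons, each sees both features
    return relu(W2 @ h + b2)   # output neuron combines them into r_hat

# hypothetical placeholder values, NOT the trained weights from the figures
W1 = np.array([[0.5, 0.1],
               [0.3, 0.2]])
b1 = np.array([0.1, 0.0])
W2 = np.array([0.4, 0.6])
b2 = 0.0                        # the bias we're currently guessing

x = np.array([30.0, 3.0])       # one row: temperature = 30, day of week = 3
print(forward(x, W1, b1, W2, b2))
```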
Now that we have the predictions when the last bias term is equal to 0, we can compare them to the actual revenue. In the previous article, we learned that we can measure the accuracy of our predictions using a cost function, specifically the Mean Squared Error (MSE) for our use case.

Calculating the MSE of this model with a bias of 0:

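A small helper makes this concrete; this is all the MSE does:

```python
import numpy as np

def mse(actual, predicted):
    # Mean Squared Error: average of the squared prediction errors
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean((actual - predicted) ** 2)

print(mse([2.0, 3.5], [1.8, 3.9]))   # mean of 0.2**2 and 0.4**2 = 0.1
```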
We also know that the ultimate objective of any model is to reduce the MSE. Therefore, the goal now is to find an optimal bias value that minimizes this MSE.
One way to compare the MSE values at different bias values is by brute forcing it and trying different values for the last bias term. For example, let’s make a second guess for the bias term that is slightly higher than the last value of 0. Let’s try bias = 0.1 next.

We pass in the training data to the new model with bias = 0.1…

…which results in these predictions…

…which we then use to calculate MSE:

As we can see, the MSE of this model (0.03791) is slightly better than the previous MSE when the bias was set to 0 (0.08651).
To visualize this more clearly, let’s plot these values on a graph.

We can continue using this brute-force method by guessing values. Let’s say we also guessed 4 more values: bias = 0.2, 0.3, 0.4, and 0.5. We repeat the same process as above to generate an MSE chart that looks like this:

We notice that at bias = 0.3, the MSE is at its lowest. And at bias = 0.4 the MSE starts to increase again. This tells us that we minimized the MSE at bias = 0.3.
Fortunately, we were able to determine this after a few educated guesses and then confirm it through additional attempts. However, what if the optimal bias value was 100? In that case, we would need to make 1,000 (100 × 10) guesses of size 0.1 to reach it. Therefore, this approach is not very efficient for finding optimal bias values. Additionally, how can we be certain that the bias with the lowest MSE value is exactly 0.3? What if it’s 0.2998 or 0.301? It would be difficult to make precise guesses like that using this brute-force technique.
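For reference, here’s what this brute-force search looks like in code, reusing the `forward` and `mse` sketches from above (`X_train` and `y_train` stand in for our 10 training rows):

```python
# Try evenly spaced guesses for the last bias and keep the best one
best_bias, best_mse = None, float("inf")
for i in range(6):                       # guesses 0.0, 0.1, ..., 0.5
    bias_guess = 0.1 * i
    preds = [forward(x, W1, b1, W2, bias_guess) for x in X_train]
    error = mse(y_train, preds)
    if error < best_mse:
        best_bias, best_mse = bias_guess, error

# The resolution is limited by the 0.1 step: finding something like
# 0.2998 would require a far finer, far more expensive grid.
```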
Gradient Descent
Luckily, we have a waaaay more efficient way to determine the optimal bias value. We will utilize a concept called Gradient Descent. And yay for us – gradient descent was already covered (with beautiful illustrations if I can say so myself) in a previous article. So definitely read that before continuing.
To quickly summarize, by using gradient descent and leveraging derivatives, we can efficiently reach the lowest point of any convex curve (essentially a U-shaped curve). This is ideal in our current situation because the MSE graph above resembles a U-shaped curve, and we need to find the valley where the MSE is minimized. Gradient descent guides us by indicating the size and direction of each step needed to reach the bottom of the curve as quickly as possible.
Now let’s restart the process of finding the optimal bias using the steps laid out in gradient descent.
Step 1: Start with a random initial value for the bias
We can start with bias = 0 for instance:

Step 2: Calculate the step size
Next, we need to determine the direction and how big of a step we should take. This can be achieved by calculating the step size, which is the result of multiplying a constant value known as the learning rate by the gradient of the MSE at the bias value. In this case, the bias value is 0 for this iteration.

Note: The learning rate is a constant used to control the step size. Typically, it falls between 0 and 1.
Let’s examine the derivative value more closely here. We know that the MSE is a function of r_hat, as shown in the formula:

And we also know that r_hat is determined by the ReLU function in the last neuron, as we can obtain r_hat only by applying the activation function:

And we know that the ReLU function in the last neuron includes the bias term.
Now, if we want to calculate the derivative of MSE with respect to the bias, we will use something called the chain rule, a fundamental part of calculus, which ties together the above 3 key pieces of information.

We need to use the chain rule because the terms depend on one another, but only indirectly. It’s called the chain rule because they are all linked in a chain-like structure. We can almost think of the numerators and denominators canceling each other out.
This is how we calculate the derivative of MSE with respect to the bias. We calculate this derivative at the current bias value (0).
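In code, the three links of the chain are easy to see. Here’s a sketch of this derivative, reusing the names from the forward-pass sketch above (recall that the derivative of ReLU is 1 for positive inputs and 0 otherwise):

```python
def dmse_db2(X, y, W1, b1, W2, b2):
    # dMSE/db = (dMSE/dr_hat) * (dr_hat/dz) * (dz/db), summed over all rows
    n, grad = len(X), 0.0
    for x, r in zip(X, y):
        h = relu(W1 @ x + b1)
        z = W2 @ h + b2                       # output neuron's pre-activation
        r_hat = relu(z)
        dmse_drhat = (2.0 / n) * (r_hat - r)  # from MSE = (1/n) * sum((r - r_hat)**2)
        drhat_dz = 1.0 if z > 0 else 0.0      # ReLU's derivative
        dz_db = 1.0                           # the bias is added directly into z
        grad += dmse_drhat * drhat_dz * dz_db
    return grad
```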
Step 3: Update the bias value by using the above step size

This will provide us with a new bias value that will hopefully bring us closer to our optimal bias value.

Step 4: Repeat Steps 2–3 until we reach our optimal value
We will continue to repeat this process of taking steps…



…making tiny leaps, with steps shrinking as we inch closer to the bottom…



…until finally…

…we reach the optimal value!
NOTE: We achieve the optimal value when the step size is close to 0 or when we reach a maximum number of steps that we set in the algorithm.
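Putting Steps 1 through 4 together, the whole loop fits in a few lines. The learning rate and the iteration cap below are hypothetical choices:

```python
learning_rate = 0.1             # hypothetical constant between 0 and 1
b2 = 0.0                        # Step 1: initial guess for the bias

for iteration in range(1000):   # Step 4: cap on the number of steps
    gradient = dmse_db2(X_train, y_train, W1, b1, W2, b2)
    step_size = learning_rate * gradient    # Step 2
    b2 = b2 - step_size                     # Step 3: walk downhill
    if abs(step_size) < 1e-6:               # a near-zero step means we've arrived
        break
```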
Perfect. This is how we can find the bias term, assuming that the optimal values for the other variables are already known.
Terminology segue: This process of working backwards to determine the value of this bias term is called backpropagation. In our previous article, we focused on forward propagation, which involves passing inputs forward to obtain an output. It’s called forward propagation because we are literally propagating the inputs forward. Meanwhile, this process is called backpropagation because we move backwards through the network to update the bias values.
Now, let’s go one step further and consider a scenario where we know all the optimal values except for the bias term and the weight of the second input going into the last neuron.

Again, we need to find optimal values for these two terms so that the MSE is minimized. Let’s plot the MSE for different values of the weight and bias. This plot is similar to the one shown above, but in 3 dimensions.

Similar to the previous MSE curve, we need to find the point that minimizes the MSE. This point, known as the valley point, will provide us with the optimal values for the bias and weight terms. Once again, we can use gradient descent to reach this minimum point. The process is essentially the same here as well.
Step 1: Randomly initialize values of the weight and bias
Step 2: Calculate the step size using partial derivatives

This is where a slight deviation occurs. Instead of calculating a single derivative of the MSE, we calculate partial derivatives, one for each term, and update the values simultaneously. By "simultaneously," we mean that both partial derivatives are calculated at the current weight and bias values (see the sketch after Step 4 below), and we again use the chain rule:

Step 3: Simultaneously update the weight and bias terms

Step 4: Repeat Steps 2–3 until we converge at the optimal values
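Here’s a sketch of what "simultaneously" means in code. Both partial derivatives are evaluated at the current weight and bias before either value changes (`dmse_dw` and `dmse_db` are hypothetical helpers, built just like the `dmse_db2` sketch earlier):

```python
w, b = 0.0, 0.0                            # Step 1: initial values

for iteration in range(1000):              # Step 4
    # Step 2: evaluate BOTH partial derivatives at the CURRENT point...
    grad_w = dmse_dw(X_train, y_train, w, b)
    grad_b = dmse_db(X_train, y_train, w, b)
    # Step 3: ...then update together, so neither update sees a half-new point
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
```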
And we can get crazy with this. Now what if we want to optimize for all 9 values in the neural network?

Then, we’ll have 9 partial derivatives to calculate and 9 values to update simultaneously in order to reach the minimum. (can’t even attempt to draw out this MSE function because, as you can or rather can’t imagine, that’ll be one insane-looking graph)
Even though the math becomes more complicated to do by hand as the number of simultaneous equations increases, the concept remains the same. We are trying to gradually move towards the bottom of the valley, relying on gradient descent to guide us.
By applying the equations and optimization procedures discussed above, the neural network uncovers hidden patterns in the data. This allows us to find these deep patterns without any human intervention.
Okay, to recap, we now understand that we always want to minimize the cost function (MSE in the above case) and how to obtain optimal values for our weight and bias terms that minimize MSE using gradient descent.
We learned that by using gradient descent, we can easily traverse a convex curve to reach the bottom. And luckily for our case study, we had a beautiful-looking convex curve. However, sometimes we may encounter a cost function that does not produce a perfect convex curve, but instead produces something that looks like this:

If we use gradient descent, we may mistakenly identify one of the many local minima (points that are lower than their immediate surroundings but not the lowest overall) as the minimum point, instead of the global minimum (the actual lowest point).

Another issue with gradient descent is that as the number of data points in our dataset or the number of terms increases, the time it takes to perform gradient descent also increases. This is because each iteration involves more calculations.
In our small example, we have 10 data points (which is very unrealistic; usually we have hundreds of thousands of data points) and we are trying to optimize 9 parameters (this number can also be very high depending on how complex the architecture is). Currently, for each iteration of gradient descent, we use 10 data points to calculate the partial derivatives and update 9 parameter values.
This is essentially what Gradient Descent is doing:

In each iteration, we perform roughly 90 (9 × 10) small calculations to compute the derivative of the MSE for each individual data point. Typically, we perform 1,000 iterations like this, resulting in a total of 90,000 (90 × 1,000) calculations.
However, what if we have 100,000 data points instead of just 10? In that case, we would need to calculate the MSE for all 100,000 data points and take the derivative of 900,000 (9 × 100,000) terms. Normally, we would perform around 1,000 steps of gradient descent to reach our optimal values, resulting in a staggering 900,000,000 (900,000 × 1,000) calculations. Additionally, our data can become much more complex, with data points in the millions and a larger number of parameters to optimize. This can quickly become very challenging.
To avoid this issue, we can utilize alternate optimization algorithms that are faster and more powerful.
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is similar to gradient descent, with a teeny difference. In gradient descent, we update our values after calculating the MSE for the entire training dataset, all 10 rows in our case. However, in SGD, we calculate the MSE using only one data point from the dataset.
The algorithm randomly selects a single data point and uses it to update the parameter values, instead of using the entire dataset.

This makes it a much lighter algorithm, and therefore faster than its all-encompassing counterpart.
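A sketch of a single-point update, reusing our earlier (hypothetical) gradient helper:

```python
import random

for iteration in range(10000):
    i = random.randrange(len(X_train))        # pick ONE random row
    grad = dmse_db2([X_train[i]], [y_train[i]], W1, b1, W2, b2)
    b2 = b2 - learning_rate * grad            # update from that single row
```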
Mini-batch Gradient Descent
This approach is a combination of vanilla and stochastic gradient descent. Instead of updating values based on just one data point or the entire dataset, we process a batch of data points per iteration. We can choose the batch size to be 5, 10, 100, 256, etc.
For example, if our batch size is 4, we calculate the MSE and the partial derivatives based on 4 rows of data.

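And a sketch of the mini-batch version, again reusing the earlier hypothetical helper:

```python
import random

batch_size = 4
for iteration in range(1000):
    rows = random.sample(range(len(X_train)), batch_size)   # 4 random rows
    X_batch = [X_train[i] for i in rows]
    y_batch = [y_train[i] for i in rows]
    grad = dmse_db2(X_batch, y_batch, W1, b1, W2, b2)
    b2 = b2 - learning_rate * grad
```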
When dealing with gradient descent, apart from the big data problem and local minimum issues, we can also encounter another issue. Remember the learning rate? We didn’t delve too deeply into it, but we discussed that it’s a constant term set at the beginning of the model-building process. The choice of learning rate greatly affects the performance of gradient descent. If we set the learning rate too low, it can take impractically long to converge to the optimal values, and if we set it too high, we may overshoot and diverge from the optimal value. In reality, there is a happy medium. Now, the question is, how do we find this learning rate?
Option #1: Try lots of different learning rates and see what works well
(much smarter) Option #2: Design an adaptive learning rate that "adapts" to our neural network and our MSE landscape.
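To see why the choice matters, here’s a toy run of gradient descent on a simple parabola (not our ice cream network) with three fixed learning rates:

```python
# Toy example: minimize f(b) = (b - 3)**2, whose true minimum is at b = 3
def grad_f(b):
    return 2 * (b - 3)

for lr in (0.001, 0.5, 1.1):      # too small, sensible, too big
    b = 0.0
    for _ in range(50):
        b = b - lr * grad_f(b)
    print(f"lr={lr}: b = {b:.4f}")

# lr=0.001 barely moves toward 3, lr=0.5 lands right on 3, lr=1.1 blows up
```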
Designing such an adaptive learning rate is precisely what other optimization algorithms aim to accomplish. However, discussing them in detail would require a whole other article. If you’re interested, you can refer to this article that dives into some popular ones.
That wraps up our foray into Neural Networks. These two articles should provide a solid foundation as we journey further into the world of Deep Learning. And to see how we put these concepts into action, read this article that builds the above ice cream revenue neural network in TensorFlow (and PyTorch as well!).
Part 3 on Convolutional Neural Networks is live now!
Deep Learning Illustrated, Part 3: Convolutional Neural Networks
Unless indicated, all images are by the author.
As always, feel free to connect with me on LinkedIn if you have any questions/comments!