
Background
In my last post, we discussed how you can improve the performance of neural networks through hyperparameter tuning:
This is a process whereby hyperparameters, such as the learning rate and the number of hidden layers, are "tuned" to find the values that are optimal for our network and boost its performance.
Unfortunately, this tuning process for large deep neural networks (_deep learning_) is painstakingly slow. One way to improve upon it is to use faster optimisers than the traditional "vanilla" gradient descent method. In this post, we will dive into the most popular optimisers and variants of gradient descent that can enhance both training speed and convergence, and compare them in PyTorch!
Recap: Gradient Descent
Before diving in, let’s quickly brush up on our knowledge of gradient descent and the theory behind it.
The goal of gradient descent is to update the parameters of the model by subtracting the gradient (partial derivative) of the loss function with respect to those parameters. A learning rate, α, regulates this process to ensure the parameters are updated on a reasonable scale and don't overshoot or undershoot the optimal value.
θ = θ − α∇J(θ)
- θ are the parameters of the model.
- J(θ) is the loss function.
- ∇J(θ) is the gradient of the loss function. ∇ is the gradient operator, also known as _nabla_.
- α is the learning rate.
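To make the update rule concrete, here is a minimal sketch of a single gradient descent step in PyTorch. The J(θ) = θ² loss, starting value of 3.0, and learning rate of 0.1 are purely illustrative assumptions:
import torch

theta = torch.tensor([3.0], requires_grad=True)  # the model parameter θ
alpha = 0.1                                      # the learning rate α

loss = theta.pow(2)      # J(θ) = θ²
loss.backward()          # computes ∇J(θ) and stores it in theta.grad

with torch.no_grad():
    theta -= alpha * theta.grad  # θ = θ − α∇J(θ)
theta.grad.zero_()               # reset the gradient before the next iteration
In practice, you would wrap this in a loop and let torch.optim handle the update for you, which is exactly what the optimisers below do.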
I wrote a previous article on gradient descent and how it works if you want to familiarise yourself a bit more with it:
Momentum
Gradient descent is often visualised as a ball rolling down a hill. When it reaches the bottom of the valley, it has converged to the minimum, which is the optimal value. A ball rolling downhill consistently gathers momentum; regular gradient descent, however, works on an individual iteration basis and has no knowledge of the previous updates.
Including momentum in the gradient descent update gives the algorithm information about the previous gradients it has computed.
Mathematically, what we have is:
v_t = βv_{t−1} − α∇J(θ)
θ = θ + v_t
Where:
- v_t is the current velocity.
- β is the momentum coefficient, a value between 0 and 1. This is also sometimes interpreted as the ‘friction.’ You would need to find the best value of β, but often 0.9 is a good baseline.
- t is the current time step or iteration number.
- v_{t−1} is the velocity from the previous step (the last calculated value).
The rest of the terms mean the same as declared earlier for vanilla gradient descent!
Notice that we are utilising information from the previous gradients to ‘accelerate’ the current gradient in the direction of the previous ones. This increases the speed of convergence and dampens any oscillations that may occur with vanilla gradient descent.
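As a rough sketch of the idea (PyTorch's built-in implementation differs slightly in where it applies the learning rate, and the θ² loss and hyperparameter values are illustrative assumptions):
import torch

theta = torch.tensor([3.0], requires_grad=True)
alpha, beta = 0.1, 0.9              # learning rate and momentum coefficient
velocity = torch.zeros_like(theta)  # v_0 = 0

for _ in range(50):
    loss = theta.pow(2)             # J(θ) = θ²
    loss.backward()
    with torch.no_grad():
        velocity = beta * velocity - alpha * theta.grad  # v_t = βv_{t−1} − α∇J(θ)
        theta += velocity                                # θ = θ + v_t
    theta.grad.zero_()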
Momentum is easily implemented in PyTorch as well:
optimizer = torch.optim.SGD([theta], lr=learning_rate, momentum=momentum)
Nesterov Accelerated Gradient
_Nesterov accelerated gradient (NAG)_, or Nesterov momentum, is a slight modification to the momentum algorithm that often leads to better convergence.
NAG measures the gradient of the loss function slightly ahead of θ, in the direction of the momentum. This improves convergence because the momentum will generally already be pointing towards the optimal point, so taking a slight step ahead each time leads the algorithm to converge more quickly.
v_t = βv_{t−1} − α∇J(θ + βv_{t−1})
θ = θ + v_t
Where **∇J(θ + βv_{t−1}) is the gradient of the loss function evaluated at a point slightly ahead of the current θ**.
All the terms in the above equation are the same ones as for the previous optimisers, so I won’t list them out all again!
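As a rough sketch of the look-ahead idea (again with an illustrative θ² loss and hand-picked hyperparameters; PyTorch's nesterov=True option uses a slightly different but closely related formulation internally):
import torch

theta = torch.tensor([3.0])         # the gradient is taken at the look-ahead point instead
alpha, beta = 0.1, 0.9
velocity = torch.zeros_like(theta)

for _ in range(50):
    lookahead = (theta + beta * velocity).requires_grad_()  # θ + βv_{t−1}
    loss = lookahead.pow(2)                                 # J evaluated slightly ahead of θ
    loss.backward()
    with torch.no_grad():
        velocity = beta * velocity - alpha * lookahead.grad  # v_t = βv_{t−1} − α∇J(θ + βv_{t−1})
        theta += velocity                                    # θ = θ + v_t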
The Nesterov accelerated gradient is also easily implemented in PyTorch:
optimizer = torch.optim.SGD([theta], lr=learning_rate, momentum=momentum, nesterov=True)
AdaGrad
Adaptive Gradient Algorithm (Adagrad) is a gradient descent algorithm that uses an adaptive learning rate, which gets smaller for parameters that are updated more frequently. In other words, it decays the learning rate a lot more for steep gradients than for shallow ones.
G_t = G_{t−1} + ∇J(θ) ⊙ ∇J(θ)
θ = θ − (α / √(G_t + ϵ)) ⊙ ∇J(θ)
Here:
- G is a diagonal matrix that accumulates the squares of all the gradients up to the time step for each parameter.
- ϵ is a tiny smoothing term to avoid division by zero problems when G is very small.
- ⊙ denotes element-wise multiplication. This is the _Hadamard product._
The rest of the terms in the above equation are the same ones as for the previous optimisers, so I won’t list them out all again!
An example of element-wise multiplication for matrices, assuming A and B are both 2×2:
A ⊙ B = [[a₁₁b₁₁, a₁₂b₁₂], [a₂₁b₂₁, a₂₂b₂₂]]; for example, [[1, 2], [3, 4]] ⊙ [[5, 6], [7, 8]] = [[5, 12], [21, 32]].
As we can see, the larger the value of G, the smaller the update to the parameter will be. G is essentially a running sum of the squared gradients, which ensures learning slows down and doesn't overshoot the optimum.
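A rough sketch of the accumulation step (PyTorch's implementation differs in small numerical details, and the values below are illustrative):
import torch

theta = torch.tensor([3.0], requires_grad=True)
alpha, eps = 0.1, 1e-8
G = torch.zeros_like(theta)  # running sum of squared gradients

for _ in range(50):
    loss = theta.pow(2)
    loss.backward()
    with torch.no_grad():
        G += theta.grad * theta.grad                       # G_t = G_{t−1} + ∇J(θ) ⊙ ∇J(θ)
        theta -= alpha / torch.sqrt(G + eps) * theta.grad  # θ = θ − α/√(G_t + ϵ) ⊙ ∇J(θ)
    theta.grad.zero_()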
One problem with Adagrad is that it sometimes decays the learning rate so much that neural networks stop learning too early and plateau. Therefore, it's not generally recommended for training neural nets, although it is still available in PyTorch:
optimizer = torch.optim.Adagrad([theta], lr=learning_rate)
RMSProp
RMSProp (Root Mean Squared Propagation) fixes the issue of Adagrad stopping training too early by taking into account only recent gradients. It does this by introducing another hyperparameter, β, that exponentially decays the older values accumulated inside the diagonal matrix G:
G_t = βG_{t−1} + (1 − β) ∇J(θ) ⊙ ∇J(θ)
θ = θ − (α / √(G_t + ϵ)) ⊙ ∇J(θ)
All the terms in the above equation are the same ones as for the previous optimisers, so I won’t list them out all again!
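The only change from the Adagrad sketch above is that G becomes an exponentially decaying average rather than a running sum (values again illustrative, and PyTorch's implementation differs in small numerical details):
import torch

theta = torch.tensor([3.0], requires_grad=True)
alpha, beta, eps = 0.01, 0.9, 1e-8
G = torch.zeros_like(theta)

for _ in range(50):
    loss = theta.pow(2)
    loss.backward()
    with torch.no_grad():
        G = beta * G + (1 - beta) * theta.grad * theta.grad  # decaying average of squared gradients
        theta -= alpha / torch.sqrt(G + eps) * theta.grad
    theta.grad.zero_()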
Like the other optimisers, RMSProp is simple to implement in PyTorch (note that PyTorch calls the decay rate β "alpha", and it defaults to 0.99):
optimizer = torch.optim.RMSprop([theta], lr=learning_rate, alpha=alpha, eps=eps)
Adam
The final optimiser we will look at is Adaptive Moment Estimation, better known as Adam. This algorithm is a combination of momentum and RMSProp, so it's kind of the best of both worlds. It does, however, have a few more steps:
v_t = β₁v_{t−1} − (1 − β₁)∇J(θ)
G_t = β₂G_{t−1} + (1 − β₂) ∇J(θ) ⊙ ∇J(θ)
v̂_t = v_t / (1 − β₁^t)
Ĝ_t = G_t / (1 − β₂^t)
θ = θ + (α / √(Ĝ_t + ϵ)) ⊙ v̂_t
The first two and the last step are pretty much the momentum and RMSProp algorithms we showed earlier. Steps three and four correct the bias of v_t and G_t, as both are initialised to 0 at the start.
Adam is an adaptive learning rate algorithm like RMSProp, so you don't necessarily need to tune the learning rate when using this optimiser.
The rest of the terms in the above equation are the same ones as for the previous optimisers, so I won’t list them out all again!
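Putting the five steps together in a rough sketch (the β₁, β₂ and ϵ values are the commonly used defaults, the rest are illustrative, and PyTorch's built-in Adam is formulated slightly differently but behaves equivalently up to small numerical details):
import torch

theta = torch.tensor([3.0], requires_grad=True)
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
v = torch.zeros_like(theta)  # first moment (momentum term)
G = torch.zeros_like(theta)  # second moment (squared gradients)

for t in range(1, 51):
    loss = theta.pow(2)
    loss.backward()
    with torch.no_grad():
        v = beta1 * v - (1 - beta1) * theta.grad               # step 1: momentum
        G = beta2 * G + (1 - beta2) * theta.grad * theta.grad  # step 2: RMSProp-style average
        v_hat = v / (1 - beta1 ** t)                           # step 3: bias-correct v_t
        G_hat = G / (1 - beta2 ** t)                           # step 4: bias-correct G_t
        theta += alpha * v_hat / torch.sqrt(G_hat + eps)       # step 5: parameter update
    theta.grad.zero_()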
Here’s how you apply Adam in PyTorch:
optimizer = torch.optim.Adam([theta], lr=learning_rate)
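If you do want to set the hyperparameters explicitly, β₁ and β₂ are passed to PyTorch as the betas tuple; the values below are simply PyTorch's defaults:
optimizer = torch.optim.Adam([theta], lr=learning_rate, betas=(0.9, 0.999), eps=1e-8)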
Other Optimisers
There are many other gradient descent optimisers out there, and the ones we have considered here rely only on first-order derivatives, known as _Jacobians_. Second-order methods, which use the matrix of second derivatives known as the _Hessian_, also exist, but computing the Hessian scales as O(n²) in the number of parameters n, whereas the first-order methods scale as only O(n).
In practice, deep neural networks have tens of thousands to millions of parameters, so Hessian-based gradient descent methods are rarely used. The gold standard is really Adam or Nesterov momentum for most cases.
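To get a feel for the cost, PyTorch can compute a full Hessian for small functions via torch.autograd.functional.hessian; the toy loss below is just an illustration, and the resulting matrix already has n × n entries for n parameters:
import torch
from torch.autograd.functional import hessian

def loss(theta):
    return (theta ** 2).sum()  # a toy loss, J(θ) = Σθ²

theta = torch.tensor([3.0, -1.0, 2.0])
H = hessian(loss, theta)       # a 3×3 matrix; its size grows as O(n²) with the number of parameters
print(H)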
There is also batch, mini-batch, and stochastic gradient descent, which affect how much data is used per update and hence the speed of training. I have written about these algorithms in my previous article linked here.
Some other commonly used optimisers include AdaDelta, AdaMax, Nadam, and AdamW.
A comprehensive list can be found here.
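Most of these are drop-in replacements in PyTorch. For example, AdamW (Adam with decoupled weight decay) can be swapped in with a single line, where the weight_decay value is just an illustrative choice:
optimizer = torch.optim.AdamW([theta], lr=learning_rate, weight_decay=0.01)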
Performance Comparison
The code below compares the different optimisers we discussed above on the J(θ) = θ² loss function, whose minimum is at θ = 0:
import torch
import plotly.graph_objects as go

# Function to perform optimisation and log theta
def optimize(optimizer_class, theta_init, lr, iterations, **kwargs):
    theta_values = []
    theta = torch.tensor([theta_init], requires_grad=True)
    optimizer = optimizer_class([theta], lr=lr, **kwargs)
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = theta.pow(2)
        loss.backward()
        optimizer.step()
        theta_values.append(theta.item())
    return theta_values

# Initial values
theta_init = 3.0
learning_rate = 0.01
iterations = 1000

# Optimiser configurations
optim_configs = {
    "Momentum": {"optimizer_class": torch.optim.SGD, "lr": learning_rate, "momentum": 0.9},
    "Nesterov": {"optimizer_class": torch.optim.SGD, "lr": learning_rate, "momentum": 0.9, "nesterov": True},
    "Adagrad": {"optimizer_class": torch.optim.Adagrad, "lr": learning_rate},
    "Adam": {"optimizer_class": torch.optim.Adam, "lr": learning_rate},
    "RMSprop": {"optimizer_class": torch.optim.RMSprop, "lr": learning_rate}
}

# Run optimisation for each optimiser and collect theta values
results = {}
for name, config in optim_configs.items():
    results[name] = optimize(**config, theta_init=theta_init, iterations=iterations)

# Plot the results
fig = go.Figure()
for optimiser, theta_values in results.items():
    fig.add_trace(go.Scatter(x=list(range(iterations)), y=theta_values, mode='lines', name=optimiser))

fig.update_layout(title="Optimiser Performance Comparison",
                  xaxis_title="Iteration Number",
                  yaxis_title="Value of Theta",
                  legend_title="Optimisers",
                  template="plotly_white",
                  width=900,
                  height=600,
                  font=dict(size=18),
                  xaxis=dict(tickfont=dict(size=16)),
                  yaxis=dict(tickfont=dict(size=16)),
                  title_font_size=24)

fig.show()
[Plot: value of θ against iteration number for each of the optimisers above]
This plot is quite interesting; some key things to point out:
- Both Momentum and Nesterov overshoot the optimal value of θ.
- Adagrad is very slow. This is in line with what we discussed before: the learning rate decays rapidly, learning plateaus, and training effectively stops too early.
- Adam and RMSProp seem to be the best, with RMSProp reaching the optimal value more quickly.
Of course, this is only a simple example, and in real-life problems the best optimiser will likely be different. So, it is often well worth trying a variety of different ones and picking the best-performing one.
This code is available at my GitHub here:
Medium-Articles/Neural Networks/optimisers.py at main · egorhowell/Medium-Articles
Summary & Further Thoughts
In this post, we have seen several methods to speed up and improve the performance of vanilla gradient descent. These methods come in two flavours: momentum-based, which use information from previous gradients, and adaptive, which change the learning rate based on the computed gradients. In the literature, the Adam optimiser is often the one recommended and used the most in research. However, it is always worth trying different optimisers to determine which one suits your model the most.
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.
Connect With Me!
References & Further Reading
- Andrej Karpathy Neural Network Course
- PyTorch site
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, O'Reilly Media, September 2019. ISBN: 9781492032649.
- Great blog on optimising neural networks
Some of my other blogs on neural networks that might be of interest: