A dive into Optimizers

Pros and cons of each optimizer, so you know when to use which kind

Namrata Kapoor
Towards Data Science


In machine and deep learning, the main purpose of an optimizer is to reduce the cost/loss by updating weights, biases, and learning rates, and thereby improve model performance.


Many people already train neural networks and use optimization without realizing that what they are doing is called optimization.

Knowing about optimization techniques helps you train models better and improve their performance. You should be familiar with the advantages and disadvantages of each type of optimizer so that your models reach optimal performance.

Optimization algorithms have the following goals:

  • To reach the global minimum. This is feasible if the function is convex, meaning it has only one minimum, which is both local and global.
  • When the objective function is not convex, as in most deep learning problems, the goal is to find the lowest possible value of the objective function within its neighborhood.

There are various kinds of optimizers, and each has its own pros and cons. Let's start with the most basic optimizer, called Batch Gradient Descent.

Batch Gradient Descent uses the entire training set to calculate the gradient of the cost function for every update.
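
To make this concrete, here is a minimal NumPy sketch of batch gradient descent on a toy linear-regression problem; the helper function, toy dataset, learning rate, and epoch count are illustrative choices, not prescribed values.

import numpy as np

def batch_gradient_descent(X, y, lr=0.5, epochs=500):
    # One update per full pass over the *entire* training set.
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(y)
    for _ in range(epochs):
        error = X @ w + b - y            # predictions minus targets, for all samples at once
        w -= lr * (X.T @ error) / n      # gradient averaged over the whole dataset
        b -= lr * error.mean()
    return w, b

# Toy usage: recover y = 2x + 1
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 2 * X.ravel() + 1
w, b = batch_gradient_descent(X, y)
print(w, b)  # approximately [2.] and 1.0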

Advantages

Since the whole dataset is used for a single update, it moves toward the global minimum without noise; however, this makes it suitable for small datasets only.

Disadvantages

Because this method calculates the gradient for the entire data set in one go, the calculation is very slow.

It also requires a high system configuration (memory and compute) to process the whole dataset at once.

Batch gradient descent can converge to a global minimum for convex functions and to a local minimum for non-convex functions.

Stochastic Gradient Descent

Whereas Batch Gradient Descent calculates the gradient with all the data at one time, Stochastic Gradient Descent performs an update for each data sample, i.e. a single data point is fed into the system and the weights are updated.
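
A minimal sketch of this per-sample update on an assumed toy dataset (the learning rate and epoch count are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)
y = 2 * X + 1 + rng.normal(scale=0.05, size=X.shape)   # noisy y = 2x + 1

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(100):
    for i in rng.permutation(len(X)):      # shuffle, then take one sample at a time
        error = (w * X[i] + b) - y[i]
        w -= lr * error * X[i]             # gradient computed from a single sample only
        b -= lr * error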

Large datasets often contain similar samples, so when BGD calculates the gradient there is redundancy. SGD updates on one sample at a time, so there is no redundancy, it is faster, and new samples can be added as and when more data arrives.

Disadvantages: As SGD updates more frequently, the cost function oscillates severely. BGD converges smoothly to a local minimum because the whole dataset is considered at once, while SGD oscillates because the weights are updated on every sample, although this can also help it jump to a better local minimum.

Non-convex function for SGD- Image by Author

When we decrease the learning rate slowly, the convergence behavior of SGD and BGD is the same.

For non-convex functions, it is also important to avoid getting trapped at a local minimum or a saddle point. Around a saddle point the error is nearly the same in all directions and the gradients in all dimensions are close to 0, so SGD is easily trapped there.

One disadvantage of the SGD method is that its update direction depends entirely on the current data, so its update is very unstable. A simple way to solve this problem is to introduce momentum.

Mini Batch Gradient Descent

To overcome the problems of SGD and BGD, MBGD uses a small batch of samples for each update. This reduces the variance of the parameter updates, which in turn makes convergence more stable, and it can take full advantage of the highly optimized matrix operations in deep learning libraries for efficient gradient calculations.

The difference from SGD is that each update does not act on a single sample but on a batch containing a number of samples.
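
A minimal sketch of the mini-batch version on an assumed toy dataset, with an illustrative batch size of 32:

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = 2 * X.ravel() + 1
w, b, lr, batch_size = np.zeros(1), 0.0, 0.5, 32

for epoch in range(200):
    idx = rng.permutation(len(y))                       # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        error = X[batch] @ w + b - y[batch]
        w -= lr * (X[batch].T @ error) / len(batch)     # gradient averaged over the mini-batch
        b -= lr * error.mean()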

Disadvantages

Mini-batch gradient descent does not guarantee good convergence; it all depends on the mini-batches that are drawn.

  • If the learning rate is too small, convergence will be slow. If the learning rate is too large, the loss function will oscillate around and overshoot the minimum.

Mini Batch Stochastic Gradient Descent with Momentum

Exponential Weighted Average for Momentum- Image by Author

If we plot the Vdw equation on a graph, we see smoother curves, because it removes noise from stochastic mini-batch gradient descent.

In the formula above, the value of beta determines how much weight is given to the data arriving at a particular time.

Hence the learning incorporates previous data points, retaining their direction and thus removing noise.

Researchers typically take beta in the range of about 0.9 to 0.95.

Exponential Weighted average at a time instance for momentum- Image by Author

Momentum simulates the inertia of a moving object: a percentage of the previous update direction is retained, while the current gradient is used to fine-tune the final update direction. This increases stability to a certain extent, so learning is faster, and it also gives the optimizer some ability to escape local optima.

Path to global Minima for different Optimizers- Image by Author
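
To make the momentum update concrete, here is a minimal sketch applied to the simple quadratic loss f(w) = w², whose gradient is 2w; the helper function, learning rate, and beta value are illustrative choices:

def momentum_update(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad   # exponentially weighted average (Vdw in the formula above)
    w = w - lr * v                     # step in the smoothed direction
    return w, v

# Toy usage on f(w) = w**2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(500):
    w, v = momentum_update(w, grad=2 * w, v=v)
print(w)  # close to the minimum at 0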

Adagrad

Adagrad is a gradient-based optimization algorithm that adapts the learning rate of each parameter. It uses low learning rates for parameters associated with frequently occurring features and high learning rates for parameters associated with infrequent features, which can produce a good model.

It is best suited to sparse data, i.e. data in which many features are not used very frequently.

Using the same update rate for all parameters may not serve the purpose: some parameters may have reached the stage where only fine-tuning is needed, while others still need large adjustments because of the small number of samples associated with them.

The Adagrad formula is as follows:

Adagrad Formula for Updated Learning Rate- Image by Author
Value of Alpha in Adagrad- Image by Author

Adagrad is an algorithm that adaptively assigns different learning rates to different parameters. The problem is that, for each parameter, as the total distance it has been updated increases, its learning rate shrinks.

Learning Rate and Gradient movement for BGD and Adagrad- Image by Author

Adagrad eliminates the need to manually tune the learning rate.

Because the learning rate keeps shrinking drastically, in deep neural networks the weight updates eventually become so small that the model stops moving towards the global minimum.
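
A minimal sketch of the Adagrad update on a toy problem illustrates both the per-parameter learning rate and how the effective step shrinks as the cache of squared gradients grows; the helper function and hyperparameters are illustrative choices:

import numpy as np

def adagrad_update(w, grad, cache, lr=0.5, eps=1e-8):
    cache = cache + grad ** 2                      # alpha: per-parameter sum of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)     # each parameter gets its own effective rate
    return w, cache

# Toy usage on f(w) = sum(w**2); the two parameters have very different gradient scales,
# so Adagrad gives each one its own effective learning rate.
w = np.array([5.0, 0.05])
cache = np.zeros_like(w)
for _ in range(1000):
    w, cache = adagrad_update(w, grad=2 * w, cache=cache)
print(w)  # both parameters move towards 0, but ever more slowly as the cache grows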

There are three main problems associated with the Adagrad algorithm:

  1. The learning rate keeps decreasing.
  2. The learning rate in the late stages of training is very small.
  3. It still requires manually setting a global initial learning rate.

Adadelta

Adadelta is an extension of Adagrad that tries to reduce Adagrad's aggressive, monotonically decreasing learning rate:

It does this by restricting the window of accumulated past gradients to some fixed size w. The running average at time t then depends on the previous average and the current gradient.

In Adadelta we do not need to set a default learning rate, as the update is scaled by the ratio of the running average of previous parameter updates to the running average of the gradients.
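
A minimal sketch of the Adadelta update in this common formulation, keeping a decaying average of squared gradients and of squared updates; rho and epsilon are illustrative choices, and note that no global learning rate appears:

def adadelta_update(w, grad, eg2, edx2, rho=0.95, eps=1e-6):
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                    # running average of squared gradients
    dx = -((edx2 + eps) ** 0.5 / (eg2 + eps) ** 0.5) * grad    # step scaled by the ratio of RMS values
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                    # running average of squared updates
    return w + dx, eg2, edx2

# Toy usage on f(w) = w**2, whose gradient is 2w
w, eg2, edx2 = 5.0, 0.0, 0.0
for _ in range(2000):
    w, eg2, edx2 = adadelta_update(w, grad=2 * w, eg2=eg2, edx2=edx2)
print(w)  # w has moved from 5.0 towards the minimum at 0 with no hand-set learning rate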

RMSProp

RMSProp stands for Root Mean Square Propagation; it is an adaptive learning rate optimization algorithm proposed by Geoff Hinton.

RMSProp tries to resolve Adagrad's radically diminishing learning rates by using a moving average of the squared gradient. It uses the magnitude of recent gradients to normalize the current gradient, as shown in the equation below.

While Adagrad accumulates all previous squared gradients, RMSprop keeps only a decaying average of them, which eliminates Adagrad's problem of a rapidly dropping learning rate.

RMSProp formula for updating the learning rate- Image by Author

In RMSProp learning rate gets adjusted automatically and it chooses a different learning rate for each parameter.

This algorithm divides the learning rate by the average of the exponential decay of squared gradients.
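
A minimal sketch of the RMSProp update; the learning rate, decay factor beta, and epsilon are illustrative choices:

def rmsprop_update(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2        # decaying average of squared gradients
    w = w - lr * grad / (s ** 0.5 + eps)         # step normalized by the RMS of recent gradients
    return w, s

# Toy usage on f(w) = w**2, whose gradient is 2w
w, s = 5.0, 0.0
for _ in range(2000):
    w, s = rmsprop_update(w, grad=2 * w, s=s)
print(w)  # close to the minimum at 0; the effective step no longer shrinks to nothing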

Adam

Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. It remembers and stores an exponentially decaying average of past squared gradients like Adadelta and RMSprop.

Adam also keeps an exponentially decaying average of past gradients, similar to momentum.

Adam is a combination of Momentum and RMSprop: Momentum brings a smoothing effect and reduces noise, while RMSProp adapts the learning rate properly, free from the disadvantages of Adagrad.

Adam's formula using momentum and RMSProp- Image by Author

Steps:

Vdw, Vdb, Sdw, and Sdb are all set to 0 initially.

On iteration t, compute dL/dw and dL/db. Adam combines the smoothing effect of momentum with RMSProp's well-behaved learning rate updates.

Adam uses an exponential moving average of the gradients to scale the learning rate, instead of the cumulative sum of squared gradients used by Adagrad.
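
A minimal sketch of the Adam update, combining the momentum term (Vdw) with the RMSProp term (Sdw) and applying the usual bias correction; the helper function and hyperparameter values are illustrative (beta1 and beta2 follow the commonly used defaults):

def adam_update(w, grad, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad             # momentum: moving average of gradients (Vdw)
    s = beta2 * s + (1 - beta2) * grad ** 2        # RMSProp: moving average of squared gradients (Sdw)
    v_hat = v / (1 - beta1 ** t)                   # bias correction for the early iterations
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (s_hat ** 0.5 + eps)
    return w, v, s

# Toy usage on f(w) = w**2, whose gradient is 2w
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, v, s = adam_update(w, grad=2 * w, v=v, s=s, t=t)
print(w)  # close to the minimum at 0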

Adam is computationally efficient and has a very small memory requirement.

Adam is widely regarded as one of the best optimizers currently available.

Things to consider before selecting an optimizer

  • RMSprop, Adadelta, and Adam have similar effects in many cases.
  • If the data is sparse, use adaptive methods such as Adagrad, Adadelta, RMSprop, or Adam.
  • Adam adds momentum and bias correction on top of RMSprop.
  • As the gradients become sparse, Adam performs better than RMSprop.
  • If we need faster convergence, or are training deeper and more complex neural networks, an adaptive algorithm is needed.

If SGD is used without momentum, it can reach a minimum, but it takes longer than the other algorithms and may get trapped at a saddle point.

I hope the concept of optimizers is clear by now; it is the beauty of mathematics, and of playing around with equations, that researchers have spent so much time on. Nowadays a mini-batch setting is used with all of these optimizers. Mini-batch gradient descent solved the performance problem and had less noise, while momentum reduced that noise further to bring in a smoothing effect. Adagrad's problem of a diminishing learning rate in deeper neural networks was solved by RMSProp, and Adam, by combining momentum and RMSProp, has become the best optimizer available right now.

Thanks for reading and happy modelling!

Originally published at https://www.numpyninja.com on November 19, 2020.
