Understanding Optimization Algorithms in Machine Learning

Supriya Secherla
Towards Data Science
Jun 18, 2021


Mathematics behind two important optimization techniques in machine learning

Table of Contents:

  1. INTRODUCTION
  2. MAXIMA AND MINIMA
  3. GRADIENT DESCENT
  4. LEARNING RATE
  5. GRADIENT DESCENT IN LOGISTIC REGRESSION
  6. STOCHASTIC GRADIENT DESCENT
  7. CONCLUSION
  8. REFERENCES

1. INTRODUCTION

Optimization is the process of training a model iteratively so as to minimize (or maximize) an objective function. It is one of the most important steps in Machine Learning for getting better results.

Why do we optimize our machine learning models? In each iteration we update the parameters, compare the results, and repeat until we reach the optimum, which gives us an accurate model with a low error rate. There are different ways in which we can optimize a model. In this article, let’s discuss two important optimization algorithms, Gradient Descent and Stochastic Gradient Descent: how they are used in Machine Learning models, and the mathematics behind them.

2. MAXIMA AND MINIMA

Maxima are the largest and minima are the smallest values of a function within a given range. We represent them as below:

Minima and Maxima (Image by Author)

Global Maxima and Minima: the maximum value and the minimum value, respectively, of the function over its entire domain.

Local Maxima and Minima: the maximum value and the minimum value, respectively, of the function within a given range (a neighbourhood of a point).

There can be only one global maximum value and one global minimum value, but there can be more than one local maximum and minimum.
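
For example, take Y = X³ - 3X on the interval [-1.5, 3]: the slope 3X² - 3 is zero at X = -1 and X = 1, giving a local maximum of 2 at X = -1 and the global minimum of -2 at X = 1, while the global maximum of 18 is reached at the endpoint X = 3.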

3. GRADIENT DESCENT

Gradient Descent is an optimization algorithm that finds a local minimum of a differentiable function; in other words, it is a minimization algorithm.

Let’s see the geometric intuition of Gradient Descent:

Slope of Y=X² (Image by Author)

Let’s take as an example the graph of the parabola Y = X².

Here, the minimum is at the origin (0, 0). The slope at a point is tan θ, where θ is the angle the tangent makes with the positive X-axis. On the right side the slope is positive, since 0 < θ < 90° and tan θ is positive; on the left side the slope is negative, since 90° < θ < 180° and tan θ is negative.

Slope of points as moved towards minima (Image by Author)

One important observation from the graph is that the slope changes its sign at the minimum: it is negative to the left of the minimum and positive to the right. As we move closer to the minimum, the magnitude of the slope decreases.

So, how does the Gradient Descent Algorithm work?

Objective: calculate X*, the local minimum of the function Y = X².

  • Pick an initial point X₀ at random.
  • Calculate X₁ = X₀ - r[df/dx] at X₀. Here r is the learning rate (we’ll discuss it in the Learning Rate section); for now, let us take r = 1. df/dx is nothing but the gradient.
  • Calculate X₂ = X₁ - r[df/dx] at X₁.
  • Continue in the same way for all the points X₁, X₂, X₃, …, Xᵢ₋₁, Xᵢ.
  • General formula for the update: Xᵢ = Xᵢ₋₁ - r[df/dx] at Xᵢ₋₁.
  • When (Xᵢ - Xᵢ₋₁) is small, i.e., when Xᵢ₋₁ and Xᵢ converge, we stop the iteration and declare X* = Xᵢ. A minimal code sketch of this loop is given below.
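
Below is a minimal Python sketch of this loop for Y = X², where df/dx = 2X. The starting point, the learning rate r = 0.1 (chosen here so that the loop settles; the r = 1 case from above is revisited in the next section), and the stopping tolerance are all illustrative choices.

```python
# Gradient descent on Y = X^2, following X_i = X_(i-1) - r * [df/dx] at X_(i-1).

def gradient_descent(x0, r=0.1, tol=1e-6, max_iters=10_000):
    x_prev = x0
    for _ in range(max_iters):
        grad = 2 * x_prev               # df/dx of Y = X^2 at the current point
        x_next = x_prev - r * grad      # X_i = X_(i-1) - r * [df/dx]
        if abs(x_next - x_prev) < tol:  # X_(i-1) and X_i have converged
            return x_next
        x_prev = x_next
    return x_prev

x_star = gradient_descent(x0=5.0)
print(x_star)  # very close to 0, the minimum of Y = X^2
```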

4. LEARNING RATE

Learning Rate is a hyperparameter, or tuning parameter, that determines the step size at each iteration while moving towards the minimum of the function. For example, if r = 0.1 in the initial step, it can be reduced to r = 0.01 in the next step, and so on, decreasing exponentially as we iterate further. Such learning-rate schedules are used especially in deep learning.

What happens if we keep the value of r constant?

Oscillation Problem (Image by Author)

In the above example, we took r = 1. As we calculate the points Xᵢ, Xᵢ₊₁, Xᵢ₊₂, … to find the local minimum X*, we can see that the iterates oscillate between X = -0.5 and X = 0.5.

When we keep r constant, we end up with this oscillation problem, so we have to reduce the value of r as the iteration count increases.
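
A small sketch of the same update, first with a constant r = 1 and then with an r that shrinks at every iteration, makes the problem visible; the decay schedule below is just one illustrative choice.

```python
# Iterating X_(i+1) = X_i - r_i * 2 * X_i for Y = X^2 (the gradient is 2X).

def iterate(x0, lr_schedule, steps=8):
    x, points = x0, [x0]
    for i in range(steps):
        x = x - lr_schedule(i) * 2 * x
        points.append(x)
    return points

# Constant r = 1: the iterates bounce between 0.5 and -0.5 and never converge.
print(iterate(0.5, lambda i: 1.0))

# r reduced with every iteration: the iterates approach the minimum at X = 0.
print(iterate(0.5, lambda i: 1.0 / (i + 3)))
```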

Important Note: hyperparameters such as the learning rate affect the bias-variance tradeoff. When the r value is too low, the model could overfit and show high variance; when it is too high, the model could underfit and show high bias. We can find a suitable r value with the cross-validation technique: plot the training loss obtained with each of several learning rates and choose the one with the minimum loss.
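
As an illustration of that tuning procedure, assuming the simple Y = X² example again with a fixed budget of iterations, we can compare the loss reached with each candidate learning rate and keep the best one; in practice the comparison would be made on held-out data.

```python
# Try several candidate learning rates and keep the one with the lowest loss.
# The candidate values, iteration budget, and starting point are illustrative.

def loss_after_training(r, x0=5.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - r * 2 * x   # gradient step on Y = X^2
    return x ** 2           # loss reached with this learning rate

candidates = [1.0, 0.5, 0.1, 0.01, 0.001]
losses = {r: loss_after_training(r) for r in candidates}
best_r = min(losses, key=losses.get)
print(losses)
print("best r:", best_r)
```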

5. GRADIENT DESCENT IN LOGISTIC REGRESSION

The formula for the optimal plane in logistic regression, after applying the sigmoid function, is:

Optimal Plane — Logistic Regression (Image by Author)

Apply Gradient Descent Algorithm on Logistic Regression:

Gradient Descent in Logistic Regression (Image by Author)

We’ll calculate W₀, W₁, W₂, …, Wᵢ₋₁, Wᵢ to find W*. When (Wᵢ₋₁ - Wᵢ) is small, i.e., when Wᵢ₋₁ and Wᵢ converge, we declare W* = Wᵢ.
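
A minimal sketch of this weight-update loop, assuming the standard log-loss for logistic regression with labels in {0, 1}; the dataset, learning rate, and tolerance below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, r=0.1, max_iters=10_000, tol=1e-6):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        # Gradient of the log-loss computed over all n data points.
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w_next = w - r * grad                 # W_i = W_(i-1) - r * gradient
        if np.linalg.norm(w_next - w) < tol:  # W_(i-1) and W_i have converged
            return w_next
        w = w_next
    return w

# Tiny illustrative dataset: a bias column plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic_gd(X, y))
```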

The disadvantage of Gradient Descent:

When n (the number of data points) is large, the time taken for k iterations to compute the optimal weight vector becomes very large.

Time Complexity: O(kn²)

This problem is solved by Stochastic Gradient Descent, discussed in the next section.

6. STOCHASTIC GRADIENT DESCENT (SGD)

In SGD, we do not use all the data points; instead, we use a randomly selected sample of them to calculate the local minimum of the function. “Stochastic” basically means probabilistic, so we select the points randomly from the population.

  • SGD in Logistic Regression
Stochastic Gradient Descent in Logistic Regression (Image by Author)

Here, m is the size of the sample of data points selected randomly from the population of n points.

Time Complexity: O(km²). Since m is significantly smaller than n, it takes less time to compute when compared to Gradient Descent.
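
For comparison, here is the same logistic-regression update computed on a random sample of m points per iteration (a mini-batch style of SGD); m, the number of iterations, and the data are again illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, m=2, r=0.1, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=m, replace=False)  # random sample of m of the n points
        Xs, ys = X[idx], y[idx]
        grad = Xs.T @ (sigmoid(Xs @ w) - ys) / m    # gradient estimated on the sample only
        w = w - r * grad
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic_sgd(X, y))
```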

7. CONCLUSION

In this article, we discussed optimization algorithms such as Gradient Descent and Stochastic Gradient Descent and their application in Logistic Regression. SGD is one of the most important optimization algorithms in Machine Learning; it is widely used in Logistic Regression and Linear Regression, and it is extended in Deep Learning by optimizers such as Adam and Adagrad.

8. REFERENCES

[1] Maxima and Minima: https://en.wikipedia.org/wiki/Maxima_and_minima

[2] Gradient Descent: https://en.wikipedia.org/wiki/Gradient_descent
