
Table of Contents
· What is Optimization?
· Gradient Descent (the Easy Way)
· Armijo Line Search
· Gradient Descent (the Hard Way)
· Conclusion
What is Optimization?
If you’ve been studying Machine Learning long enough, you’ve probably heard terms such as SGD or Adam. They are two of many optimization algorithms. Optimization algorithms lie at the heart of machine learning: they are responsible for the intricate work that lets machine learning models learn from data. It turns out that optimization has been around for a long time, even outside of the machine learning realm.
People optimize. Investors seek to create portfolios that avoid excessive risk while achieving a high rate of return. Manufacturers aim for maximum efficiency in the design and operation of their production processes. Engineers adjust parameters to optimize the performance of their designs.
This first paragraph of the book Numerical Optimization by Jorge Nocedal and Stephen Wright already explains a lot. It continues to state that even nature optimizes. Physical systems tend to a state of minimum energy. The molecules in an isolated chemical system react with each other until the total potential energy of their electrons is minimized. Rays of light follow paths that minimize their travel time.
To make sense of what optimization is, first of all, we must identify the objective, which could be the rate of return, energy, travel time, etc. The objective depends on certain characteristics of the system called variables. Our goal is to find values of the variables that optimize the objective. Often the variables are constrained in some way.
Mathematically speaking, optimization is the process of maximizing or minimizing an objective function f(x) by searching for the appropriate variables x subject to some constraints cᵢ, which could be written compactly as follows.
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad \begin{cases} c_i(x) = 0, & i \in \mathcal{E} \\ c_i(x) \ge 0, & i \in \mathcal{I} \end{cases}$$
where ℰ and ℐ are sets of indices for equality and inequality constraints, respectively.
This mathematical statement may look daunting at first sight, perhaps because it is a general description of optimization from which not much can be inferred. But don’t worry: by the time we get to the code, everything will be clear.
Gradient Descent (the Easy Way)
Now, how do we solve the minimization problem? Thanks to calculus, we have a tool called the gradient. Imagine a ball at the top of a hill. We know that the hill has different slopes/gradients at different points. Due to gravity, the ball will roll down following the curve of the hill. Which way does it go? In the direction of steepest descent. After some time, the ball will reach a local minimum where the ground is relatively flat compared to its surroundings.
This is the nature of gradient descent. We could quantize the smooth path of the ball into tiny steps. At the k-th step, we have two quantities: the step length αₖ and the direction pₖ. To see gradient descent in action, let’s first import some libraries.
For starters, we will define a simple objective function f(x) = x² − 2x − 3 where x is a real number. Since gradient descent uses the gradient, we will define the gradient of f as well, which is just the first derivative of f, that is, ∇f(x) = 2x − 2.
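The code for these two definitions is not shown in this version of the story; a minimal sketch (function names are illustrative) might look like:

```python
def f(x):
    # objective function: f(x) = x^2 - 2x - 3
    return x**2 - 2*x - 3

def grad(x):
    # gradient (first derivative) of f: f'(x) = 2x - 2
    return 2*x - 2
```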
Next, we define Python functions for plotting the objective function and the learning path during the optimization process. What we mean by learning path is just points x after each descent step.
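The plotting code is likewise omitted here; a sketch of such a helper (the name plot_objective and its signature are assumptions for illustration) could be:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

def plot_objective(f, path=None, x_range=(-5, 7)):
    """Plot f over x_range and, optionally, the learning path of x values."""
    xs = np.linspace(*x_range, 200)
    fig, ax = plt.subplots()
    ax.plot(xs, f(xs), label="f(x)")
    if path is not None:
        path = np.asarray(path)
        # mark each iterate x_k on the curve of f
        ax.plot(path, f(path), "ro--", label="learning path")
    ax.set_xlabel("x")
    ax.set_ylabel("f(x)")
    ax.legend()
    return fig, ax
```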
From the plot below, we can easily see that f has a minimum value at x = 1 (where f(x) = −4). Let’s say we start at x = −4 (indicated by a red dot below); we will see whether gradient descent can locate the local minimum x = 1.

Define a simple gradient descent algorithm as follows. For every point xₖ at the beginning of step k, we keep the step length αₖ constant and set the direction pₖ to the negative of the gradient (the steepest descent direction at xₖ). We take steps using the formula
$$x_{k+1} = x_k + \alpha_k p_k, \qquad p_k = -\nabla f(x_k)$$
while the gradient is still above a certain tolerance value (1 × 10⁻⁵ in our case) and the number of steps is still below a certain maximum value (1000 in our case).
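A minimal sketch of this constant-step gradient descent (the original code is not reproduced here; names are illustrative):

```python
def gradient_descent(grad, x0, alpha, tol=1e-5, max_iter=1000):
    """Constant-step gradient descent: returns the final x and a convergence flag."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:       # gradient close enough to zero: converged
            return x, True
        x = x - alpha * g      # step along p_k = -grad(x_k) with length alpha
    return x, False            # max_iter reached without convergence
```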
Beginning at x = -4, we run the gradient descent algorithm on f with different scenarios:
- αₖ = 0.1
- αₖ = 0.9
- αₖ = 1 × 10⁻⁴
- αₖ = 1.01
Scenario 1: αₖ = 0.1
Solution found:
y = -4.0000
x = 1.0000

Scenario 2: αₖ = 0.9
Solution found:
y = -4.0000
x = 1.0000

Scenario 3: αₖ = 1 × 10⁻⁴
Gradient descent does not converge.

Scenario 4: αₖ = 1.01
Gradient descent does not converge.

Here’s what we got:
- The first scenario converges like a charm. Even though the step length is constant, the gradient (and hence the size of each step) shrinks toward zero, which results in convergence.
- The second scenario also converges, even though the learning path oscillates around the solution due to the large step length.
- The third scenario moves towards the solution; however, the step length is so small that the number of iterations maxes out. Increasing max_iter would solve the issue, although it would take much longer to arrive at the solution.
- The fourth scenario diverges due to the large step length. Here, we set max_iter = 8 to make the visualization more pleasing.
All in all, the solution x = 1 can be reached by gradient descent with the right step length.
You might be wondering: why don’t we use the exact analytical solution, that is, take the derivative of f and solve for x such that the derivative is zero? For our previous example, we would find that the x that minimizes f satisfies ∇f(x) = 2x − 2 = 0, that is, x = 1.
Yes, this is one way to go. But it is not a viable technique when the derivative of f is very hard to calculate or the equation ∇f(x) = 0 is impossible to solve analytically.
Armijo Line Search
Now, what could we improve in our gradient descent algorithm? We saw previously that the step length α is kept constant throughout the steps, and that the wrong choice of α may cause the steps to diverge. Could we search for the optimal α for each step direction?
Enter line search.
In the line search strategy, the algorithm chooses a direction pₖ (could be as simple as the steepest descent -∇f(x)) and searches along this direction from the current iterate xₖ for a new iterate with a lower function value. The distance to move along pₖ can be found by approximately solving the following one-dimensional minimization problem to find a step length α:
$$\min_{\alpha > 0} f(x_k + \alpha p_k)$$
As mentioned before, by solving this exactly, we would derive the maximum benefit from the direction pₖ, but an exact minimization may be expensive and is usually unnecessary. Instead, the line search algorithm generates a limited number of trial step lengths until it finds one that loosely approximates the minimum of f(xₖ + α pₖ). At the new point xₖ₊₁ = xₖ + α pₖ, a new search direction and step length are computed, and the process is repeated.
A popular inexact line search condition stipulates that αₖ should, first of all, give a sufficient decrease in the objective function f, as measured by the so-called Armijo Condition:
$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k$$
for some constant c₁ ∈ (0, 1). In other words, the reduction in f should be proportional to both the step length αₖ and the directional derivative ∇fₖᵀpₖ. At iterate xₖ, we start with some initial αₖ, and while the Armijo Condition is not satisfied, we simply shrink αₖ by some shrinkage factor ρ. The shrinkage process terminates at some point, since for a sufficiently small αₖ the Armijo Condition is always satisfied.
Below is the code for the Armijo Line Search. Note that we use ρ = 0.5 and c₁ = 1 × 10⁻⁴.
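The original code block is not preserved in this version of the story; a sketch consistent with the description above (ρ = 0.5, c₁ = 1 × 10⁻⁴; the function name is illustrative) might be:

```python
import numpy as np

def armijo_line_search(f, grad, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Backtracking line search: shrink alpha until the Armijo condition holds."""
    alpha = alpha0
    # Armijo condition: f(x + alpha*p) <= f(x) + c1 * alpha * grad(x)^T p
    while f(x + alpha * p) > f(x) + c1 * alpha * np.dot(grad(x), p):
        alpha *= rho  # shrink the trial step length by the factor rho
    return alpha
```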
Now we are ready to incorporate Armijo Line Search into the Gradient Descent Algorithm.
Gradient Descent (the Hard Way)
First of all, let’s define a harder objective function to solve: the Griewank Function.
$$f(x) = 1 + \sum_{i=1}^{n} \frac{x_i^2}{4000} - \prod_{i=1}^{n} \cos\!\left(\frac{x_i}{\sqrt{i}}\right)$$
We will be using the two-dimensional version of the Griewank Function (n = 2). To get a sense of its shape, we can plot it as follows.

We could see that the function has many local optima. We should expect the gradient descent algorithm to be trapped in one of these local minima. We could also notice that the global minimum is at x = [0, 0] where f(x) = 0.
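A sketch of the two-dimensional Griewank Function and its gradient (the original code is not shown; names are illustrative):

```python
import numpy as np

def griewank(x):
    # f(x) = 1 + sum_i x_i^2 / 4000 - prod_i cos(x_i / sqrt(i)), i = 1..n
    i = np.arange(1, len(x) + 1)
    return 1 + np.sum(x**2) / 4000 - np.prod(np.cos(x / np.sqrt(i)))

def griewank_grad(x):
    # df/dx_i = x_i/2000 + sin(x_i/sqrt(i))/sqrt(i) * prod_{j != i} cos(x_j/sqrt(j))
    i = np.arange(1, len(x) + 1)
    cos_terms = np.cos(x / np.sqrt(i))
    # product of all cosine terms except the k-th one, for each coordinate k
    prod_others = np.array([np.prod(np.delete(cos_terms, k)) for k in range(len(x))])
    return x / 2000 + np.sin(x / np.sqrt(i)) / np.sqrt(i) * prod_others
```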
Finally, let’s build the gradient descent algorithm one last time. As stated before, we use pₖ = –∇f(xₖ) as the direction and αₖ from Armijo Line Search as the step length. Also, steps will be taken until one of the stopping criteria below is satisfied:
- The norm of the gradient of the objective function is close enough to zero, that is, ‖∇f(xₖ)‖ < 1 × 10⁻⁵
- The number of steps taken reaches 1000
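Putting the pieces together, a sketch of gradient descent with Armijo line search (names are illustrative; tested below on a simple quadratic rather than the Griewank Function) might be:

```python
import numpy as np

def armijo_line_search(f, grad, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    # shrink alpha until the Armijo sufficient-decrease condition holds
    alpha = alpha0
    while f(x + alpha * p) > f(x) + c1 * alpha * np.dot(grad(x), p):
        alpha *= rho
    return alpha

def gradient_descent_armijo(f, grad, x0, tol=1e-5, max_iter=1000):
    """Steepest descent with step lengths chosen by Armijo line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion 1: gradient near zero
            break
        p = -g                        # steepest-descent direction
        x = x + armijo_line_search(f, grad, x, p) * p
    return x
```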
We then make a Python function for creating two plots:
- Learning path of x along with the contour plot of f(x)
- The value of f(x) per step taken
For an initial value x₀, we run the gradient descent algorithm on f with different scenarios:
- x₀ = [0, 3]
- x₀ = [1, 2]
- x₀ = [2, 1]
- x₀ = [1, 3]
- x₀ = [2, 2]
- x₀ = [3, 1]
Scenario 1: x₀ = [0, 3]
Initial condition: y = 1.5254, x = [0 3]
Iteration: 1 y = 1.1245, x = [0.0000 2.3959], gradient = 0.7029
Iteration: 2 y = 0.6356, x = [0.0000 1.6929], gradient = 0.6591
Iteration: 3 y = 0.2558, x = [0.0000 1.0338], gradient = 0.4726
Iteration: 4 y = 0.0778, x = [0.0000 0.5612], gradient = 0.2736
Iteration: 5 y = 0.0206, x = [0.0000 0.2876], gradient = 0.1430
Iteration: 6 y = 0.0052, x = [0.0000 0.1447], gradient = 0.0723
Iteration: 7 y = 0.0013, x = [0.0000 0.0724], gradient = 0.0362
Iteration: 8 y = 0.0003, x = [0.0000 0.0362], gradient = 0.0181
Iteration: 9 y = 0.0001, x = [0.0000 0.0181], gradient = 0.0090
Iteration: 10 y = 0.0000, x = [0.0000 0.0090], gradient = 0.0045
Iteration: 11 y = 0.0000, x = [0.0000 0.0045], gradient = 0.0023
Iteration: 12 y = 0.0000, x = [0.0000 0.0023], gradient = 0.0011
Iteration: 13 y = 0.0000, x = [0.0000 0.0011], gradient = 0.0006
Iteration: 14 y = 0.0000, x = [0.0000 0.0006], gradient = 0.0003
Iteration: 15 y = 0.0000, x = [0.0000 0.0003], gradient = 0.0001
Iteration: 16 y = 0.0000, x = [0.0000 0.0001], gradient = 0.0001
Iteration: 17 y = 0.0000, x = [0.0000 0.0001], gradient = 0.0000
Iteration: 18 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Iteration: 19 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Solution: y = 0.0000, x = [0.0000 0.0000]

Scenario 2: x₀ = [1, 2]
Initial condition: y = 0.9170, x = [1 2]
Iteration: 1 y = 0.7349, x = [0.8683 1.6216], gradient = 0.5225
Iteration: 2 y = 0.4401, x = [0.5538 1.2044], gradient = 0.5705
Iteration: 3 y = 0.1564, x = [0.2071 0.7513], gradient = 0.3932
Iteration: 4 y = 0.0403, x = [0.0297 0.4004], gradient = 0.1997
Iteration: 5 y = 0.0103, x = [0.0012 0.2027], gradient = 0.1011
Iteration: 6 y = 0.0026, x = [0.0000 0.1016], gradient = 0.0508
Iteration: 7 y = 0.0006, x = [0.0000 0.0508], gradient = 0.0254
Iteration: 8 y = 0.0002, x = [0.0000 0.0254], gradient = 0.0127
Iteration: 9 y = 0.0000, x = [0.0000 0.0127], gradient = 0.0063
Iteration: 10 y = 0.0000, x = [0.0000 0.0063], gradient = 0.0032
Iteration: 11 y = 0.0000, x = [0.0000 0.0032], gradient = 0.0016
Iteration: 12 y = 0.0000, x = [0.0000 0.0016], gradient = 0.0008
Iteration: 13 y = 0.0000, x = [0.0000 0.0008], gradient = 0.0004
Iteration: 14 y = 0.0000, x = [0.0000 0.0004], gradient = 0.0002
Iteration: 15 y = 0.0000, x = [0.0000 0.0002], gradient = 0.0001
Iteration: 16 y = 0.0000, x = [0.0000 0.0001], gradient = 0.0000
Iteration: 17 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Iteration: 18 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Iteration: 19 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Solution: y = 0.0000, x = [0.0000 0.0000]

Scenario 3: x₀ = [2, 1]
Initial condition: y = 1.3176, x = [2 1]
Iteration: 1 y = 0.8276, x = [1.3077 1.1907], gradient = 0.6583
Iteration: 2 y = 0.4212, x = [0.6639 1.0529], gradient = 0.5903
Iteration: 3 y = 0.1315, x = [0.2104 0.6750], gradient = 0.3682
Iteration: 4 y = 0.0320, x = [0.0248 0.3570], gradient = 0.1784
Iteration: 5 y = 0.0081, x = [0.0008 0.1803], gradient = 0.0900
Iteration: 6 y = 0.0020, x = [0.0000 0.0903], gradient = 0.0452
Iteration: 7 y = 0.0005, x = [0.0000 0.0451], gradient = 0.0226
Iteration: 8 y = 0.0001, x = [0.0000 0.0225], gradient = 0.0113
Iteration: 9 y = 0.0000, x = [0.0000 0.0113], gradient = 0.0056
Iteration: 10 y = 0.0000, x = [0.0000 0.0056], gradient = 0.0028
Iteration: 11 y = 0.0000, x = [0.0000 0.0028], gradient = 0.0014
Iteration: 12 y = 0.0000, x = [0.0000 0.0014], gradient = 0.0007
Iteration: 13 y = 0.0000, x = [0.0000 0.0007], gradient = 0.0004
Iteration: 14 y = 0.0000, x = [0.0000 0.0004], gradient = 0.0002
Iteration: 15 y = 0.0000, x = [0.0000 0.0002], gradient = 0.0001
Iteration: 16 y = 0.0000, x = [0.0000 0.0001], gradient = 0.0000
Iteration: 17 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Iteration: 18 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Iteration: 19 y = 0.0000, x = [0.0000 0.0000], gradient = 0.0000
Solution: y = 0.0000, x = [0.0000 0.0000]

Scenario 4: x₀ = [1, 3]
Initial condition: y = 1.2852, x = [1 3]
Iteration: 1 y = 1.0433, x = [1.4397 2.6729], gradient = 0.3230
Iteration: 2 y = 0.9572, x = [1.7501 2.5838], gradient = 0.2763
Iteration: 3 y = 0.8638, x = [1.9986 2.7045], gradient = 0.4098
Iteration: 4 y = 0.6623, x = [2.3024 2.9796], gradient = 0.5544
Iteration: 5 y = 0.3483, x = [2.6813 3.3842], gradient = 0.5380
Iteration: 6 y = 0.1116, x = [3.0054 3.8137], gradient = 0.3231
Iteration: 7 y = 0.0338, x = [3.1265 4.1133], gradient = 0.1618
Iteration: 8 y = 0.0141, x = [3.1396 4.2745], gradient = 0.0818
Iteration: 9 y = 0.0091, x = [3.1400 4.3564], gradient = 0.0411
Iteration: 10 y = 0.0078, x = [3.1400 4.3974], gradient = 0.0205
Iteration: 11 y = 0.0075, x = [3.1400 4.4179], gradient = 0.0103
Iteration: 12 y = 0.0074, x = [3.1400 4.4282], gradient = 0.0051
Iteration: 13 y = 0.0074, x = [3.1400 4.4333], gradient = 0.0026
Iteration: 14 y = 0.0074, x = [3.1400 4.4359], gradient = 0.0013
Iteration: 15 y = 0.0074, x = [3.1400 4.4372], gradient = 0.0006
Iteration: 16 y = 0.0074, x = [3.1400 4.4378], gradient = 0.0003
Iteration: 17 y = 0.0074, x = [3.1400 4.4381], gradient = 0.0002
Iteration: 18 y = 0.0074, x = [3.1400 4.4383], gradient = 0.0001
Iteration: 19 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 20 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 21 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Solution: y = 0.0074, x = [3.1400 4.4384]

Scenario 5: x₀ = [2, 2]
Initial condition: y = 1.0669, x = [2 2]
Iteration: 1 y = 0.9886, x = [1.8572 2.2897], gradient = 0.2035
Iteration: 2 y = 0.9414, x = [1.9025 2.488 ], gradient = 0.2858
Iteration: 3 y = 0.8372, x = [2.0788 2.713 ], gradient = 0.4378
Iteration: 4 y = 0.6117, x = [2.3753 3.035 ], gradient = 0.5682
Iteration: 5 y = 0.2941, x = [2.7514 3.461 ], gradient = 0.5082
Iteration: 6 y = 0.0894, x = [3.0423 3.8777], gradient = 0.2863
Iteration: 7 y = 0.0282, x = [3.1321 4.1495], gradient = 0.1438
Iteration: 8 y = 0.0127, x = [3.1398 4.2931], gradient = 0.0726
Iteration: 9 y = 0.0087, x = [3.1400 4.3657], gradient = 0.0364
Iteration: 10 y = 0.0077, x = [3.1400 4.4021], gradient = 0.0182
Iteration: 11 y = 0.0075, x = [3.1400 4.4203], gradient = 0.0091
Iteration: 12 y = 0.0074, x = [3.1400 4.4294], gradient = 0.0045
Iteration: 13 y = 0.0074, x = [3.1400 4.4339], gradient = 0.0023
Iteration: 14 y = 0.0074, x = [3.1400 4.4362], gradient = 0.0011
Iteration: 15 y = 0.0074, x = [3.1400 4.4373], gradient = 0.0006
Iteration: 16 y = 0.0074, x = [3.1400 4.4379], gradient = 0.0003
Iteration: 17 y = 0.0074, x = [3.1400 4.4382], gradient = 0.0001
Iteration: 18 y = 0.0074, x = [3.1400 4.4383], gradient = 0.0001
Iteration: 19 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 20 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 21 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Solution: y = 0.0074, x = [3.1400 4.4384]

Scenario 6: x₀ = [3, 1]
Initial condition: y = 1.7551, x = [3 1]
Iteration: 1 y = 1.5028, x = [2.8912 1.4543], gradient = 0.6001
Iteration: 2 y = 1.1216, x = [2.7619 2.0402], gradient = 0.6522
Iteration: 3 y = 0.7074, x = [2.7131 2.6906], gradient = 0.6214
Iteration: 4 y = 0.3449, x = [2.8471 3.2973], gradient = 0.5273
Iteration: 5 y = 0.1160, x = [3.0458 3.7858], gradient = 0.3246
Iteration: 6 y = 0.0361, x = [3.1298 4.0993], gradient = 0.1683
Iteration: 7 y = 0.0147, x = [3.1397 4.2673], gradient = 0.0854
Iteration: 8 y = 0.0092, x = [3.1400 4.3528], gradient = 0.0429
Iteration: 9 y = 0.0079, x = [3.1400 4.3956], gradient = 0.0214
Iteration: 10 y = 0.0075, x = [3.1400 4.4170], gradient = 0.0107
Iteration: 11 y = 0.0074, x = [3.1400 4.4278], gradient = 0.0053
Iteration: 12 y = 0.0074, x = [3.1400 4.4331], gradient = 0.0027
Iteration: 13 y = 0.0074, x = [3.1400 4.4358], gradient = 0.0013
Iteration: 14 y = 0.0074, x = [3.1400 4.4371], gradient = 0.0007
Iteration: 15 y = 0.0074, x = [3.1400 4.4378], gradient = 0.0003
Iteration: 16 y = 0.0074, x = [3.1400 4.4381], gradient = 0.0002
Iteration: 17 y = 0.0074, x = [3.1400 4.4383], gradient = 0.0001
Iteration: 18 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 19 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 20 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Iteration: 21 y = 0.0074, x = [3.1400 4.4384], gradient = 0.0000
Solution: y = 0.0074, x = [3.1400 4.4384]

In the first three scenarios, the algorithm converges to the global minimum where f(x) = 0, although the first scenario seems to waste many steps, since the first coordinate of x₀ is already 0 in that case.
In the last three scenarios, the algorithm is trapped in a local minimum where f(x) = 0.0074, since x₀ is too far from the coordinate [0, 0]. This is no surprise: the line search method seeks the minimum of f by heading in a direction of decreasing function value until the norm of the gradient approaches zero. Once it reaches a coordinate where the gradient is very close to zero, that point is taken as the minimum, regardless of whether it is local or global.
If we apply this version of gradient descent on the original objective function f(x) = x² − 2x − 3, we get the following result.
Initial condition: y = 21.0000, x = -4
Iteration: 1 y = -4.0000, x = 1.0, gradient = 0.0000
Solution: y = -4.0000, x = 1.0

The solution is found in one step!
What do you think about this? Is it because the new version of gradient descent is overkill for this function? Or is this just a coincidence, and changing parameters such as ρ would produce no better result than the original simple gradient descent? Let me know your thoughts in the response section below!
Conclusion
In this article, we examined how the Gradient Descent algorithm works in optimization problems, ranging from a simple high-school textbook problem to a harder function with many local minima. The easy implementation assumes a constant learning rate, whereas the harder one searches for the learning rate using Armijo line search. The algorithm itself is the most basic of its kind and has many variations and implementations depending on the nature of the objective function. Like many other optimization algorithms, Gradient Descent may be trapped in a local minimum.

