How your model is optimized | Know your Optimization

The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

Richard Michael
Towards Data Science


Many of the models that the average Data Scientist or ML engineer uses daily rely on numerical optimization methods under the hood. Studying how different functions are optimized, and how well, helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now you, as a Data Scientist, have to find the optimal solution to that problem. For example, you look at an energy function and want to find the absolute, global minimum so that your tool works or your protein is stable. Maybe you have modeled user data and you want to find the ideal customer given all your input features (hopefully continuous ones).
Here is a comprehensive summary of how an optimizer works under the hood.

Benchmark Functions — the curved, the bent and the asymmetric

Before we meet our optimizer, we have to get acquainted with the functions that we intend to optimize. My three personal favorites are:

  • Mr. Nice-Guy is the ellipsoid function. It is overall nicely behaved, convex, with a single global minimum. A common form (the constant \alpha, e.g. 10^6, controls the conditioning) is f(x) = \sum_{i=1}^{n} \alpha^{\frac{i-1}{n-1}} x_i^2.
  • The Rosenbrock, also known as the banana function. It is not as nice as our ellipsoid; because of its valley of death it can be a challenge to optimization algorithms. It is defined as f(x) = \sum_{i=1}^{n-1} \left[ 100\,(x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right].
  • The twisted one is our attractive-sector function. It is the most unusual of the selection, as it is non-convex and asymmetric around the minimum. Roughly, it has the form f(x) = \sum_{i=1}^{n} (s_i x_i)^2, where the scaling s_i is large (e.g. 100) for coordinates in one sector around the minimum and 1 in the other.
Fig. 1 a) Ellipsoid function in R². Lower function values are depicted in blue with global minimum at (0, 0).
Fig. 1 b) Rosenbrock function in R². Lower function values are depicted in blue with global minimum at (1, 1) which lies in a valley.
Fig. 1 c) Attractive sector function in R². Lower function values are depicted in blue with global minimum around (0,0).
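If you want to play with these yourself, here is a minimal sketch of the three benchmarks, assuming the common textbook/BBOB forms given above; the conditioning constant cond and the sector scaling scale are assumptions you can tune.

import numpy as np

def ellipsoid(x, cond=1e6):
    """Convex ellipsoid; `cond` controls the conditioning (assumed value)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    weights = cond ** (np.arange(n) / max(n - 1, 1))
    return np.sum(weights * x**2)

def rosenbrock(x):
    """Generalized Rosenbrock (banana) function, minimum at (1, ..., 1)."""
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def attractive_sector(x, scale=100.0):
    """Asymmetric and non-convex around the minimum at the origin:
    coordinates in the 'wrong' sector are penalized `scale` times harder."""
    x = np.asarray(x, dtype=float)
    s = np.where(x > 0, scale, 1.0)  # assumed sector scaling
    return np.sum((s * x)**2)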

This is a manageable selection of functions, and there exists a multitude of available benchmark functions to run your optimizer across, from the depths of the Styblinski-Tang function to the beautiful peaks of Rastrigin. Some of them come ready-made and implemented, as in [1].

As you might have noticed, the functions are defined on R^n and we provide an input vector x ∈ R^n. We will only be looking at two dimensions (R^2) and three dimensions (R^3) to make the visualization easier.

Now that we have set up a playground, we can load up our tool of choice: scipy with its optimization library. Our goal is to capture how the optimizer finds the minimum on the benchmark function given different starting points. We just pick and choose optimizers, like the conjugate-gradient line-search method "CG" or the more intricate quasi-Newton method "BFGS". It goes like this:

from scipy.optimize import minimize
# import benchmark function with its derivatives
from scipy.optimize import rosen
from scipy.optimize import rosen_der
from scipy.optimize import rosen_hess
import numpy as np

def banana(x, y):
    """
    a nice 2D way to look at Rosenbrock
    """
    return (1 - x)**2 + (100 * (y - (x**2))**2)

# we select some interesting optimizers:
OPTIMIZERS = ['CG', 'BFGS', 'Newton-CG', 'dogleg']
FUNCTIONS = {'rosen': [rosen, rosen_der, rosen_hess, banana]
             # feel free to add or implement ellipsoid or other
             # benchmark functions here with their derivatives
             }
start_iter = 1
max_iter = 50
step = 1
x_min, x_max = -5, 5
y_min, y_max = -5, 5
# draw random starting points from the search box defined above
start_values = np.random.uniform(low=x_min, high=x_max, size=(10, 2))

def optimize_funct(funct, x0, derivative, hess, optimizer, iteration):
    """
    Just a wrapper around the scipy minimize function
    """
    fitted_min = minimize(funct, x0=x0, method=optimizer,
                          jac=derivative, hess=hess,
                          options={'maxiter': iteration})
    return fitted_min
# run actual optimization
fitted_optimizers = {}
for opt in OPTIMIZERS:
    fitted_optimizers[opt] = {}
    for name, functs in FUNCTIONS.items():
        fitted_optimizers[opt][name] = []
        f = functs[0]
        fd = functs[1]
        fdd = functs[2]
        f_2d = functs[3]
        for vals in start_values:
            computed_values = []
            x, y = vals
            z = f_2d(x, y)
            # include start values before optimization
            computed_values.append(np.array([x, y, z]))
            for i in range(start_iter, max_iter, step):
                out = optimize_funct(funct=f, x0=vals, derivative=fd, hess=fdd,
                                     optimizer=opt, iteration=i)
                # only save the output values (stored under x)
                x, y = out.x
                z = f_2d(x, y)
                computed_values.append(np.array([x, y, z]))
            fitted_optimizers[opt][name].append(np.array(computed_values))

Not every optimizer is created equal. Some require only the first derivative of our objective function, the Jacobian (for a scalar function, its gradient), and some also require the second derivative, the Hessian.
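As a quick sanity check, the derivatives scipy ships for Rosenbrock behave exactly as you would expect at the known minimum (1, 1): the gradient vanishes and the Hessian is positive definite.

import numpy as np
from scipy.optimize import rosen_der, rosen_hess

x_star = np.array([1.0, 1.0])       # known global minimum of Rosenbrock
print(rosen_der(x_star))            # gradient: [0. 0.]
print(np.linalg.eigvalsh(rosen_hess(x_star)) > 0)  # Hessian eigenvalues all positive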
One thing to keep in mind is that we start at different positions and let the optimizer roam freely; we are not doing constrained optimization just yet.

Fig. 2 — Contour plot of the ellipsoid function from different starting points using scipy's "CG" line-search optimizer. Each step is a "+". The optimizer finds the global minimum after a few steps.
Fig. 3 — Contour plot of the Rosenbrock (banana) function. The line search first walks in the right direction (see black "+") but then searches for the global minimum to little avail: the valley it traverses is already low in function value, so the gradient offers little information on where to look next.
Fig. 4 — Contour plot of the attractive-sector function under BFGS optimization. Given three starting points (three colors), each step is depicted by a black cross. The minimum is discovered quickly, and we see how a line-search method (CG) behaves differently from BFGS, which belongs to the family of quasi-Newton methods.

Personally, it helped me a lot to visualize how the optimization procedures move through the different function spaces. In the figures above you can see how CG and BFGS perform. CG is scipy's nonlinear conjugate-gradient method: a plain and simple, classical line-search procedure. BFGS, on the other hand, uses approximated higher-order information. BFGS therefore classifies as a quasi-Newton method.

A quasi-Newton method approximates higher-order information, like the Hessian matrix (second-order derivatives), and improves this approximation with each step taken. [2]
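Concretely, BFGS maintains an approximation H of the inverse Hessian and refreshes it after every step with a rank-two update built from the last step s = x_{k+1} - x_k and the change in gradient y = ∇f(x_{k+1}) - ∇f(x_k); a minimal sketch of that standard update (see [3]):

import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H.

    s: step taken, x_new - x_old
    y: change in gradient, grad_new - grad_old
    """
    rho = 1.0 / (y @ s)              # requires the curvature condition y.T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)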

Line-Searching | Backtrack Everything.

One family of methods simply follows a line given by the underlying function (see Fig. 2 and Fig. 3). We keep some information through backtracking, but that is about it. In order to optimize, we utilize first-derivative information of the function. An intuitive formulation of line-search optimization with backtracking is:

  1. Compute the gradient at your current point
  2. Compute a step from the gradient and the step size
  3. Take the step in the descent direction
  4. If the step did not decrease the function value enough, shrink the step size by a previously defined factor, e.g. α
  5. Repeat until the minimum is found or the difference in value between your steps is very (very) small, e.g. 0.00000001
Gradients give us information on how the slope of a function behaves. We take the gradient as the vector of first-order (partial) derivatives of the function: \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^\top.
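Putting the five steps above together, here is a minimal sketch of steepest descent with Armijo backtracking; the shrink factor alpha, the constant c, and the tolerance are assumptions with typical textbook values:

import numpy as np
from scipy.optimize import rosen, rosen_der

def backtracking_descent(f, grad, x0, alpha=0.5, c=1e-4, tol=1e-8, max_iter=10_000):
    """Steepest descent with an Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                          # 1. gradient at the current point
        if np.linalg.norm(g) < tol:          # 5. stop when the slope is (very) small
            break
        t = 1.0                              # 2. start with a full step
        while f(x - t * g) > f(x) - c * t * g @ g:
            t *= alpha                       # 4. shrink the step by factor alpha
        x = x - t * g                        # 3. step in the descent direction
    return x

# example: descend the Rosenbrock function
print(backtracking_descent(rosen, rosen_der, np.array([-1.5, 2.0])))

Run on Rosenbrock, this reproduces the slow valley-crawling behavior from Fig. 3.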

Trust-Region Optimizer | Thinking in Circles

Instead of just looking straight ahead and straight back, we can also look around us in a full 360 degrees. We use our function information, the gradient and the Hessian, to make an optimal step in the right direction. In order for this to work we consider the quadratic model function

m_k(p) = f(x_k) + g_k^\top p + \frac{1}{2}\, p^\top B_k\, p, \quad \text{subject to } \|p\| \le \Delta_k,

where g_k is the gradient of f at x_k, B_k the Hessian, and \Delta_k the radius of the trust region. When we do the optimization, we solve for the step p and adjust the radius around us appropriately. This looks like:

Fig. 5 — Trust-region optimization on the contour of the attractive-sector function. Each step is depicted as a "+", and the circle around each step marks the trust region from which the next step is chosen.

An intuitive formulation of the procedure can look like this (see the Appendix for a formal representation):

  1. Compute the Hessian and gradient of your function
  2. Find an (approximately) optimal point within a radius around your position
  3. Take a step given the computed information
  4. Adjust the size of the radius depending on how well the model predicted your actual improvement: shrink it after a poor step, grow it after a good one
  5. Repeat until convergence or a tolerance is reached

Of course these short steps don't quite capture the complexity; you need a few more parameters and a bit more math to find the right point in your radius and to adjust the step and the radius (if that interests you, see Alg. 1 and 2 in the Appendix). We can observe the algorithm in practice and look at how the radius changes over time:

Fig. 6 — Magnitude of change in function value (y-axis) over steps taken (x-axis). The radius Δ changes with the improvement of the function values.

We see that the radius tracks how well the model predicts the actual improvement: while the steps still yield good improvement the radius is kept large or increased, and once the steps predict poorly or the optimizer closes in on the minimum the radius shrinks (see Fig. 5 and 6). You can see this in the first steps of the optimization on the attractive-sector function in Fig. 6.
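To make the radius update concrete, here is a minimal trust-region sketch. It solves the subproblem with the cheap (not optimal) Cauchy point and uses the standard acceptance thresholds from Alg. 4.1 in [3]; the constants are assumptions, not the exact values behind Fig. 5 and 6.

import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess

def trust_region(f, grad, hess, x0, delta=1.0, delta_max=10.0,
                 eta=0.15, tol=1e-8, max_iter=500):
    """Trust-region method with a Cauchy-point step (cf. Alg. 1 and 2 in the Appendix)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        # Cauchy point: minimize the quadratic model along -g inside the radius
        gBg = g @ B @ g
        tau = 1.0 if gBg <= 0 else min(1.0, np.linalg.norm(g)**3 / (delta * gBg))
        p = -tau * delta / np.linalg.norm(g) * g
        # ratio of actual to model-predicted reduction
        pred = -(g @ p + 0.5 * p @ B @ p)
        rho = (f(x) - f(x + p)) / pred
        if rho < 0.25:                # model predicted poorly: shrink the region
            delta *= 0.25
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)   # model did well: grow it
        if rho > eta:                 # accept the step only if it helped enough
            x = x + p
    return x

print(trust_region(rosen, rosen_der, rosen_hess, np.array([-1.5, 2.0])))

scipy also exposes proper trust-region variants through minimize, e.g. method='dogleg', 'trust-ncg' or 'trust-exact'.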

Answers to our starting questions

To answer the question "Did my model converge?", I suggest the following:

  1. Test out different optimization algorithms and compare their performance and the minima they identify: are they global?
  2. If your problem surface is flat, i.e. there are only small changes during the optimization procedure: run more iterations, tighten the tolerance (very small values for the convergence difference), and maybe try jumping to different areas of the function.
  3. Start from multiple different starting points and take the mean or median of the results to benchmark performance.
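To put points 1 and 3 into practice with scipy, here is a small sketch; the search box for the random starts is an assumption, so choose it to match your domain:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

starts = np.random.uniform(-5, 5, size=(10, 2))   # assumed search box
for method in ["CG", "BFGS", "Nelder-Mead"]:
    results = [minimize(rosen, x0, method=method,
                        jac=rosen_der if method != "Nelder-Mead" else None)
               for x0 in starts]
    best = min(results, key=lambda r: r.fun)
    print(f"{method:12s} median f*: {np.median([r.fun for r in results]):.2e} "
          f"best x*: {np.round(best.x, 3)}")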

Conclusion | The Rabbit Hole

We have seen how two families of optimization methods work: the straight-lined line-search optimizer and the trust-region optimizer. This is only the entrance of the rabbit hole. With each function and problem that we look at, we can discover something new. We can analyze how our Jacobian or Hessian matrices have to behave to make our optimizer do its work, and we have to look at the shape and properties of our functions.
Of course, there exists a vast number of proofs and theorems that form the mathematical backbone of what we have just looked at.
If you want to dive deep and get your hands on the raw mathematics, I recommend working through the book Numerical Optimization (2nd ed.) by J. Nocedal and S. J. Wright [3].

Acknowledgements

Much of the work that I have shown here I did while attending the Numerical Optimization course at KU (DIKU). Without the guidance of the course instructors, this would not have been possible.

References

[1] MathWorks, Test functions for global optimization algorithms.
[2] C. Geiger and C. Kanzow, Quasi-Newton-Verfahren, Springer, 1999.
[3] J. Nocedal and S. J. Wright, Numerical Optimization (2nd ed.), Springer, 2006.

Appendix

Trust Region Algorithm

Fig. A1 — Algorithm for the trust-region optimization shown in Fig. 5.
Fig. A2 — Algorithm to solve for the step p; solving for λ is another subproblem.


I am a Data Scientist and M.Sc. student in Bioinformatics at the University of Copenhagen. You can find more content on my weekly blog http://laplaceml.com/blog