NON-CLASSICAL OPTIMIZATION COURSE
Hello and welcome back to this full course on Non-Classical Optimization! In this post we will start with the first algorithm: Hill Climbing! If you are new to this course, you can check out the overview below!
Non-Classical Optimization Algorithms (FULL COURSE) Overview
To start off, what is Hill Climbing? Hill Climbing belongs to the field of local search, where the goal is to find the minimum or maximum of an objective function. The algorithm is considered a local search because it takes small steps relative to its current position, hoping to find a better one.
Table of Contents
- Overview and Basic Hill Climber Algorithm
- Common Problems
- Hill Climber Variants
- Building Our Implementation
- Testing Objective Functions
- Code
- Conclusion
Overview and Basic Hill Climber Algorithm
Hill Climber receives its name from being analogous to a hiker climbing to the top of a mountain by stepping towards the next highest part of the mountain. The hiker never steps down the slope; each step is taken up the slope so that the new position is higher in elevation than the previous one.
We can take this idea to the field of optimization by stepping in different directions from the current position and moving in the direction that maximizes the objective function (or minimizes it, for minimization problems)! Because Hill Climber always moves to a position that is better than or equal to the current one, it is considered a greedy algorithm. For example, take the two-dimensional problem below:

Our goal is to find the maximum point, which occurs at (0,0). How does Hill Climber find it? First we choose a random initial point, test a step in every possible direction, and move into the position with the highest function value. For this problem, there are eight possible moves from every position: two along the x axis only (positive and negative), two along the y axis only (positive and negative), and four diagonal moves that combine a step in x with a step in y. After testing these eight candidate positions, we move into the best one. Depicted below on the left hand side is this exact scenario: our initial point is in red, and the eight colored points around it are the possible positions to step towards. The step taken is to the yellow upper-left point, as it yielded the best value. We simply repeat this process until convergence. The right hand side shows an initial point together with the full path the algorithm takes.
For example, take the picture on the lower left hand side below. Our initial point is the red point. We can see the steps taken by Hill Climber as blue vectors before the global maximum is finally reached at the blue point.


The basic Hill Climber algorithm is depicted below. First, we randomly choose an initial state; then we select the variables to step in and the step sizes, and evaluate all of the generated new positions. After testing, we move into the best position and repeat the process. Termination occurs when the algorithm has either reached a user-defined maximum number of iterations or stagnated/converged: when the current position does not change for a specified number of iterations.
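As a concrete sketch of this loop, here is a minimal Python version for maximization (the names and structure are my own; the course's actual implementations appear later in this post):

```python
import numpy as np

def hill_climb(f, x0, step_sizes, max_iter=1000):
    """Greedy (steepest ascent) hill climbing loop for maximization."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_iter):
        best_x, best_f = x, fx
        # Test a forward and a backward step in every variable.
        for i in range(len(x)):
            for direction in (+1.0, -1.0):
                cand = x.copy()
                cand[i] += direction * step_sizes[i]
                fc = f(cand)
                if fc > best_f:
                    best_x, best_f = cand, fc
        if best_f <= fx:   # no neighbor improves: stagnation/convergence
            break
        x, fx = best_x, best_f
    return x, fx
```

Note that in this deterministic form, a single iteration without improvement already means convergence; the stagnation counter described above matters more for the stochastic variants later.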

Common Problems
When implementing the basic version of the algorithm, three common issues arise:
- For multi-dimensional problems, how many variables do we test before taking a step?
- What should the step size be in each direction?
- How do we avoid getting stuck in local optima or on plateaus?
If we have an n-dimensional objective function, there exist 3^n - 1 possible moves to consider when diagonal moves are allowed: each variable can step forward, step backward, or stay put, excluding the case where every variable stays put. For the two-input Gaussian example above, that gives 3² - 1 = 8 possible moves. The count grows exponentially because the "diagonal" moves capture interactions between multiple variables at once. The obvious problem is that evaluating on the order of 3^n points at each timestep becomes extremely expensive once n grows past around 15. We can reduce this by not considering the "diagonal" states and only allowing single-variable moves, which leaves 2n possible moves (forwards and backwards for each variable). Even so, a greedy approach must test 2n states before choosing the best one to move into, which can still eat up a limited budget of function evaluations for large n. A common remedy is to reduce the number of variables further by taking a random sample of the variables and testing steps only in those.
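For illustration, here is one way such a neighbor generator might look in Python (the function name and NumPy-based representation are my own choices, not the course code):

```python
import numpy as np

rng = np.random.default_rng(0)

def axis_neighbors(x, step_sizes, n_sample=None):
    """Generate single-variable neighbors only (no diagonals): 2 per variable.

    If n_sample is given, perturb just a random subset of variables,
    cutting the cost from 2n to 2*n_sample evaluations per iteration.
    """
    indices = np.arange(len(x))
    if n_sample is not None:
        indices = rng.choice(indices, size=n_sample, replace=False)
    neighbors = []
    for i in indices:
        for direction in (+1.0, -1.0):
            cand = x.copy()
            cand[i] += direction * step_sizes[i]
            neighbors.append(cand)
    return neighbors
```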
Step size, on the other hand, is a tricky subject. Large step sizes cover ground quickly but often hop over the minimum or maximum. Small step sizes give good convergence behavior by not skipping over the optimum; however, if the optimum is far from the current position, it will take many iterations to reach it. We will cover this later on.
Because Hill Climber is a local search, it struggles to find the global minimum or maximum of highly multimodal functions, where many local optima exist. The standard algorithm has no mechanism to escape these local traps. In addition, the algorithm will stall if it reaches a plateau where all nearby points have the same function value, as it cannot tell which direction to step in when every option has the same outcome.
Hill Climber Variants
As detailed above, three common problems plague Hill Climber, and many variants have been developed in response. The algorithm described thus far is known as Steepest Ascent Hill Climber. The traditional Simple Hill Climber instead tests neighboring positions one by one and moves into the first that yields a better value, rather than testing all neighbors and moving into the best. Stochastic Hill Climber is another popular variant that selects and moves into a position probabilistically. These three variants are well known; the rest of the variants described below do not have traditional names.
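To make the Simple Hill Climber distinction concrete, a single first-improvement iteration might look like the following sketch (scanning neighbors in random order is one common choice, not something the text prescribes):

```python
import numpy as np

rng = np.random.default_rng()

def simple_hill_climb_step(f, x, fx, step_sizes):
    """One iteration of Simple Hill Climber: scan neighbors in random order
    and move into the FIRST one that improves, rather than the best one."""
    for i in rng.permutation(len(x)):
        for direction in (+1.0, -1.0):
            cand = x.copy()
            cand[i] += direction * step_sizes[i]
            fc = f(cand)
            if fc > fx:            # first improvement wins (maximization)
                return cand, fc
    return x, fx                   # no improving neighbor: local optimum
```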
A true greedy approach has to evaluate either 3^n - 1 or 2n possible positions per iteration, depending on whether diagonal moves are allowed. As stated before, this wastes computational time when n is large. To reduce it, we can check either a single random variable or a handful of random variables each iteration. This can be called Stochastic-Variable Selection Hill Climber.
On the other hand, if we always evaluate all 2n possible positions, the algorithm will follow the exact same path deterministically from the initial point to convergence, since it always chooses the locally best neighbor. This is problematic when the algorithm gets stuck in a local optimum. To introduce variance, the algorithm can instead move into positions probabilistically, weighted by their function values. Take the example below where we have five possible neighbors: we calculate their function values and divide each by the sum of all the values to obtain a probability:

Now, instead of always choosing neighbor 2 (the one with the largest F(x) value), we sample from these probabilities to add some variance to the path. This can be called Stochastic-Neighbor Selection Hill Climber. The obvious drawback is that it takes many more iterations to converge; however, the added variance might be exactly what is needed to escape a local optimum.
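A sketch of that selection step is below. One caveat I am adding: dividing raw values by their sum only produces valid probabilities when all values are non-negative, so this sketch shifts the values by their minimum first; the description above implicitly assumes positive function values.

```python
import numpy as np

rng = np.random.default_rng()

def stochastic_neighbor_choice(neighbors, values):
    """Pick a neighbor with probability proportional to its objective value."""
    v = np.asarray(values, dtype=float)
    v = v - v.min()                  # shift so all weights are non-negative
    if v.sum() == 0:                 # plateau: all neighbors equal
        probs = np.full(len(v), 1.0 / len(v))
    else:
        probs = v / v.sum()          # normalize to probabilities
    i = rng.choice(len(neighbors), p=probs)
    return neighbors[i]
```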
Selecting the step size has a dramatic influence on the convergence and quality of the algorithm. Keeping the step size static simplifies things but does not allow the algorithm to adapt to the landscape. One solution is to use a reduction schedule on the step size, allowing larger steps early on for exploration and smaller ones later for convergence. Many reduction schedules exist, from linear, exponential, and cosine decay to uniform random steps. These variants can be called Scheduled-Step Size Hill Climbers.
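As an illustration, the decaying schedules could be implemented as follows (the signature is my own; the "uniform" variant is the random uniform step covered in our implementation below):

```python
import numpy as np

def scheduled_step(initial, final, t, t_max, kind="linear"):
    """Reduce the step size from `initial` to `final` over t_max iterations."""
    frac = t / t_max
    if kind == "linear":
        return initial + (final - initial) * frac
    if kind == "exponential":
        return initial * (final / initial) ** frac
    if kind == "cosine":
        return final + 0.5 * (initial - final) * (1 + np.cos(np.pi * frac))
    raise ValueError(f"unknown schedule: {kind}")
```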
Building Our Implementation
For our implementations, we will create two different versions of Hill Climber. First, we will create the classic Steepest Ascent Hill Climber, which generates 2n possible neighbor states to evaluate before choosing the best to move into. Second, we will use a mixture of Scheduled-Step Size and Stochastic-Variable Selection. Instead of testing every possible step at the current position, this algorithm selects a random set of variables at each iteration and tests steps only in those variables. For the step size, the algorithm takes a random step, either forward or backward, in each selected variable using a uniform step size: we sample a uniformly random value between two bounds, for example from U(-0.1, 0.1), and use it as the step.
For our Steepest Ascent Hill Climber, we need to define a step size. Because the domain of each variable can differ, we can't assume a single global step size; instead, we define a step size per variable. The classic way to do this is to take a percentage of the total domain size of each variable. For example, suppose we have the domain bounds x=[-5,5] and y=[0,15] in a two-input problem. Our lower bound vector is [-5, 0] and our upper bound vector is [5, 15]. We compute the total bounds, which measure how large the domain is in each variable, by subtracting the lower bound from the upper bound: [5-(-5), 15-0] = [10, 15]. Then we apply a percentage step size, let's say 0.01: 0.01*[10, 15] = [0.1, 0.15]. These are our per-variable step sizes, 0.1 for x and 0.15 for y. Here is the implementation for Steepest Ascent:
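Below is a sketch consistent with that description; the definitive code is in the GitHub repository linked at the end. Clipping candidates to the domain bounds is my own addition:

```python
import numpy as np

def steepest_ascent(f, lower, upper, step_pct=0.01, max_evals=100_000,
                    minimize=True, rng=None):
    """Steepest Ascent Hill Climber with per-variable step sizes.

    Step sizes are step_pct times each variable's domain width.
    With minimize=True the climber descends (as in the experiments below).
    """
    rng = rng or np.random.default_rng()
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    steps = step_pct * (upper - lower)       # per-variable step sizes
    sign = -1.0 if minimize else 1.0         # internally always maximize sign*f
    x = rng.uniform(lower, upper)            # random initial position
    fx = sign * f(x)
    evals, n = 1, len(x)
    while evals + 2 * n <= max_evals:
        best_x, best_f = x, fx
        for i in range(n):                   # all 2n single-variable neighbors
            for d in (+1.0, -1.0):
                cand = x.copy()
                cand[i] = np.clip(cand[i] + d * steps[i], lower[i], upper[i])
                fc = sign * f(cand)
                evals += 1
                if fc > best_f:
                    best_x, best_f = cand, fc
        if best_f <= fx:                     # stagnation: no neighbor improves
            break
        x, fx = best_x, best_f
    return x, sign * fx
```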
For our second implementation, the mixture of Scheduled-Step Size and Stochastic-Variable Selection, we need a few modifications. First, instead of checking all 2n possible neighbors, we select only a random sample of the variables. Second, instead of adding or subtracting a static step size, we use a uniform step size, adding a uniform random value drawn between minus and plus the step size. Here is the implementation:
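Again, here is a sketch consistent with the description (the repository holds the definitive version; defaults such as n_vars=10 match the experiments later in this post):

```python
import numpy as np

def stochastic_uniform_climber(f, lower, upper, step_pct=0.1, n_vars=10,
                               max_evals=100_000, minimize=True, rng=None):
    """Stochastic-Variable Selection + uniform step size Hill Climber.

    Each iteration samples n_vars random variables and tests one neighbor
    per variable, stepping by a value drawn from U(-step, +step).
    """
    rng = rng or np.random.default_rng()
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    steps = step_pct * (upper - lower)       # per-variable maximum step
    sign = -1.0 if minimize else 1.0
    x = rng.uniform(lower, upper)
    fx = sign * f(x)
    evals = 1
    while evals + n_vars <= max_evals:
        best_x, best_f = x, fx
        for i in rng.choice(len(x), size=n_vars, replace=False):
            cand = x.copy()
            cand[i] = np.clip(cand[i] + rng.uniform(-steps[i], steps[i]),
                              lower[i], upper[i])
            fc = sign * f(cand)
            evals += 1
            if fc > best_f:
                best_x, best_f = cand, fc
        x, fx = best_x, best_f               # greedy move (or stay put)
    return x, sign * fx
```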
Testing Objective Functions
Now it is time to test our two implementations on some real objective functions! If you remember from the course overview page, we will test our algorithms on three functions: Sphere, Shubert, and Eggholder, where the goal in each is to find the global minimum. The global minimum is F(X)=0 for Sphere, F(X)=-12870.88 for Shubert (about -12.87 per dimension, so it changes with n; this value is for n=1000), and F(X)=-959.64 for Eggholder. Sphere is a convex function with no local minima, while Shubert and Eggholder are highly multimodal with many local minima and maxima.
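For reference, here are the standard formulations of these three functions, which I believe match the ones used in this course (Shubert in its additive n-dimensional form, whose minimum of roughly -12.87 per dimension gives -12870.88 at n=1000):

```python
import numpy as np

def sphere(x):
    """Convex; global minimum f(0) = 0."""
    return np.sum(x ** 2)

def shubert(x):
    """Additive n-dimensional Shubert; about -12.87 per dimension."""
    j = np.arange(1, 6)
    return np.sum([np.sum(j * np.sin((j + 1) * xi + j)) for xi in x])

def eggholder(p):
    """2-D only; global minimum of about -959.64 at (512, 404.23)."""
    x, y = p
    return (-(y + 47) * np.sin(np.sqrt(abs(x / 2 + (y + 47))))
            - x * np.sin(np.sqrt(abs(x - (y + 47)))))
```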



For both Sphere and Shubert we will test with n=1000 and allow 100,000 function evaluations. Because each run depends on the random initial position, every configuration is re-run 30 times on each function and the mean result is kept for comparison. In addition, the step size percentage will be tested at the following values: [1, 0.5, 0.1, 0.01]. Note that a step size percentage of 1 is extremely large for a local search, since the maximum step can span the entire domain; however, when we test our uniform step size, this helps create exploration, as the step is sampled uniformly from the entire domain rather than being static.
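A minimal harness for these repeated runs might look like this (the domain bounds in the usage comment are hypothetical placeholders; the actual bounds come from the course overview):

```python
import numpy as np

def benchmark(algorithm, f, lower, upper, n_runs=30, **kwargs):
    """Re-run the climber n_runs times; report mean, std, and best result."""
    results = np.array([algorithm(f, lower, upper, **kwargs)[1]
                        for _ in range(n_runs)])
    return results.mean(), results.std(), results.min()

# Example (hypothetical bounds for the 1000-d Sphere function):
# mean, std, best = benchmark(steepest_ascent, sphere,
#                             [-5.12] * 1000, [5.12] * 1000,
#                             step_pct=0.01, max_evals=100_000)
```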
Testing Steepest Descent
First, let's test our Steepest Descent algorithm (Steepest Ascent applied to minimization). Because this algorithm tests 2n possible neighbors and n=1000, each iteration costs 2,000 function evaluations. Therefore, the maximum number of iterations is 50. Here are the results:


As we can see, with only 100,000 function evaluations, the Steepest Descent climber does not find minimum points anywhere close to the actual global minima. This is probably because we only ran it for 50 iterations, so let's run for 1,000 iterations instead, equating to 2 million function evaluations! Here are the results:


Even though we ran for 20x more iterations, the best points found are still far from the actual global minima. Notice that for the Sphere function, step percentages of 0.5 and 0.1 performed best, while for Shubert a step percentage of 1 performed best. This shows that the step percentage is another hyperparameter that needs to be tuned for the particular problem.
Testing Stochastic Variable with Uniform Step
Now let's test our custom Stochastic Variable with Uniform Step climber! To keep things simple, we will generate neighbors in only 10 variables, which is itself another hyperparameter to tune. At each iteration we therefore test 10 neighbors, one for each randomly sampled variable, equating to 10 function evaluations per iteration, so the maximum number of iterations is 10,000 to reach the 100,000 function evaluation budget. Here are the results:


As we can see, this custom implementation of Hill Climber did far better than the classical Steepest Descent Hill Climber. This showcases the power of random step sizes. Even though we used a uniform distribution here, one could instead sample from a Gaussian or any other distribution. The best step percentage for the Sphere function was 0.1, while it was 1 and 0.5 for the Shubert function. To keep the comparison fair, let's see how the algorithm performs with 2 million function evaluations:


As we can see above, with 2 million evaluations, the global minimum was nearly found with step percentage 0.01 for the Sphere function and step percentage 1 for the Shubert function.
Testing Eggholder Function
One might object that Steepest Descent performs poorly due to the high dimensionality of the domain, n=1000 in the previous two objective functions. So let's test both algorithms on the two-input Eggholder function with a limit of 1,000 function evaluations. Here are the results:


As we can see from above, Steepest Descent found a good minimum in its best run, but its mean values are relatively poor and its standard deviations large. The mean values for the custom hill climber, on the other hand, were roughly twice as good as Steepest Descent's; it also had lower standard deviations and better absolute minimum values.
Code
If you are interested in toying around with the algorithms above, all code is available below on my GitHub repository for this course!
NonClassicalOptimization/Unit 1: Hill Climber at main · OUStudent/NonClassicalOptimization
Conclusion
In conclusion, Hill Climber is a local search method used in machine learning for optimization: finding the global minimum or maximum of an objective function. Hill Climber has many variants due to the struggles of its classic form. In this post we compared the classic Steepest Descent Hill Climber with a custom-made Stochastic Variable Selection with Uniform Step Size Hill Climber. The latter outperformed the former on three challenging objective functions: Sphere, Shubert, and Eggholder.
Stay tuned for the next Unit in the series where we will cover Simulated Annealing!