
Evolutionary Strategy: A Theoretical Implementation Guide

Photo by Sebastian Unrau on Unsplash

Here, I’ll be talking about the intuition and some statistics behind the Evolutionary Strategy described by OpenAI in this paper.

If you want some code, here it is!

A Brief Introduction to Optimization

In a traditional optimization setting, the goal is to find the global minimum of some function. For example, the minimum of the function f(x) = x² is at x = 0:

Image by Author: The graph of f(x) = x². Clearly, the minimum point of this function is at x = 0 when f(x) = 0

The solution of this example is fairly obvious, but these functions can become much more complex and exist in higher dimensions where the solution isn’t so trivial. The idea behind traditional optimization methods is to use the gradient of the function to ‘slide down the slope’ until you reach the bottom. As you might imagine, these algorithms all require the gradient of the function to be well-defined; in fact, some, like the Newton-Raphson method, even require second-order derivative information! Often, these are very difficult or time-intensive to compute. Is there a way to optimize a function without needing this extra information?
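For a sense of what these gradient-based methods look like, here is a minimal sketch of gradient descent on f(x) = x²; the learning rate of 0.1 and starting point of 3.0 are arbitrary illustrative choices:

```python
# Gradient descent on f(x) = x^2: note that it needs
# the analytic gradient f'(x) = 2x to be known in advance.
x, lr = 3.0, 0.1
for _ in range(50):
    x -= lr * 2 * x  # step downhill along the gradient
print(x)  # converges towards the minimum at x = 0
```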

Some Intuition behind ES

ES is another optimization method. What separates it from more traditional optimization algorithms is how little information about the function itself the strategy requires: no direct gradient information is needed at all for ES! That means we can apply it to far more abstract optimization problems where the function being optimized is essentially a black box.

So how does this actually work? Suppose there is a function, F, we want to optimize with respect to some parameters. This is an iterative algorithm where we continuously update our position, given by the parameters, to hopefully converge to an optimal point (in this case, the algorithm optimizes for maxima rather than minima). From a starting position, we sample points randomly and calculate the function value of those points to gather information about the function itself. Using this information, we can update our position to move in an ascending direction! Essentially, we move some distance in the direction where the sampled points yielded larger function values.

The Equation

Let’s briefly go through the update equation to see what’s going on. We want to optimize some function, F, with respect to some parameters, θ:

Image by Author: ES Update Equation
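In case the image doesn’t render for you, here is the update rule written out (a reconstruction of the form given in the OpenAI paper, in this article’s notation):

```latex
\theta_{t+1} = \theta_t + \alpha \, \frac{1}{n\sigma} \sum_{i=1}^{n} F(\theta_t + \sigma \epsilon_i)\, \epsilon_i
```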

At each time step, t, we update the parameters as defined above. The most important feature here is the sum term, which estimates the ascent direction. Here, we calculate the weighted sum of a population of n points sampled randomly around the current position, θ_t, where the weights are the function values of those points. The sampling is done via a Gaussian distribution:

Image by Author: Sample with a Gaussian distribution
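Written out (again reconstructing the imaged formula), each of the n sampled points takes the form:

```latex
x_i = \theta_t + \sigma \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, I), \quad i = 1, \dots, n
```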

Here, σ in the update equation represents the standard deviation of this noise. Finally, α represents the learning rate, controlling how far we travel in the direction determined by the sum term.
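To make this concrete, here is a minimal NumPy sketch of the full loop. This is not the code linked above; the function name es_maximize and its default hyperparameters are illustrative choices of mine.

```python
import numpy as np

def es_maximize(F, theta0, alpha=0.05, sigma=0.1, n=50, steps=200):
    """Evolutionary Strategy: iteratively move theta in an estimated ascent direction."""
    theta = np.array(theta0, dtype=float)
    history = []
    for _ in range(steps):
        # Sample n Gaussian perturbations around the current position.
        eps = np.random.randn(n, theta.size)
        # Evaluate the objective at each perturbed point (no gradients needed).
        F_vals = np.array([F(theta + sigma * e) for e in eps])
        # Weighted sum of perturbations, weights = function values.
        grad_est = (F_vals[:, None] * eps).sum(axis=0) / (n * sigma)
        theta += alpha * grad_est
        history.append(F(theta))
    return theta, history

# Example: maximize f(x) = -x**2, optimum at x = 0.
theta, curve = es_maximize(lambda x: -(x ** 2).sum(), theta0=[3.0])
print(theta)  # should approach [0.]
```

Notice that F is only ever evaluated, never differentiated, which is the whole appeal of the method.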

Let’s Talk Stats!

The goal here is to get a deeper understanding of the equation and predict some of the algorithm’s behaviour. Here is the equation again, so you don’t need to scroll up 😊

Image by Author: ES Update Equation

First, what does the distribution of θ_{t+1} look like? Well, unfortunately this depends on the function, F, being optimized. However, if we assume the terms in the sum have finite variance whatever the case, then the sum obeys the Central Limit Theorem, and we can assume a Gaussian distribution as n increases to infinity!

In this case, we can derive expressions for the expectation and variance and feel fairly confident about the distribution (note: we also assume independent sampling).

Image by Author: Expectation and variance of transition
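Written out (a reconstruction consistent with the update rule above, assuming the n samples are i.i.d.):

```latex
\mathbb{E}\left[\theta_{t+1}\right] = \theta_t + \frac{\alpha}{\sigma}\,\mathbb{E}\!\left[F(\theta_t + \sigma\epsilon)\,\epsilon\right],
\qquad
\mathrm{Var}\left[\theta_{t+1}\right] = \frac{\alpha^2}{n\sigma^2}\,\mathrm{Var}\!\left[F(\theta_t + \sigma\epsilon)\,\epsilon\right]
```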

You might be wondering what the point of all that was. These equations don’t seem immediately useful, and there’s a lot of uncertainty bound up in the function itself, F. However, they do give us some intuition into hyperparameter tuning and algorithm behaviour:

  • A larger n will have no effect on the expected value of the next state, but will reduce the variance around that expected value leading to more stability. This can be interpreted as increasing the information density of the search, although the cost of this is increased computation time.
  • Increasing α pushes us further along an ascent direction, which can allow for faster convergence. However, it also increases the variance of the update, leading to a larger jitter around the optimal solution. If it’s too big, it can even lead to instability as the algorithm diverges.
  • Altering σ is less certain, since it is an input to F. However, intuitively it increases the search space, which allows the algorithm to escape local maxima traps. This comes at the cost of reduced information density, so increasing σ likely benefits from an increase in n.

Experiments

For simplicity, let’s use the function f(x) = -x² in our tests. For reference, here’s what the learning curve looks like in general:

Image by Author: ES learning curve for f(x) = -x²

There are two phases to the algorithm: the transition state, where the algorithm moves towards the solution, and the steady state, where it has arrived at the solution. Ideally, we want the transition state to be as short as possible whilst having minimal variance in the steady state. Let’s see how these metrics change as we alter the hyperparameters.
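As a sketch of how these two metrics could be measured (the starting point, convergence threshold, and defaults below are my own illustrative assumptions, not the article’s exact protocol):

```python
import numpy as np

def run_metrics(alpha=0.05, sigma=0.1, n=50, steps=500, tol=0.05, seed=0):
    """Estimate convergence time and steady-state std of ES on f(x) = -x^2."""
    rng = np.random.default_rng(seed)
    theta = np.array([3.0])  # assumed starting position
    trace = []
    for _ in range(steps):
        eps = rng.standard_normal((n, 1))
        F_vals = -(theta + sigma * eps) ** 2          # objective at sampled points
        theta = theta + alpha * (F_vals * eps).sum(axis=0) / (n * sigma)
        trace.append(theta[0])
    trace = np.array(trace)
    # Convergence time: first step after which theta stays within tol of the optimum.
    converged = np.abs(trace) < tol
    t_conv = next((t for t in range(steps) if converged[t:].all()), steps)
    # Steady-state std: spread of theta after convergence.
    ss_std = trace[t_conv:].std() if t_conv < steps else np.nan
    return t_conv, ss_std

print(run_metrics(alpha=0.05), run_metrics(alpha=0.2))
```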

Image by Author: Convergence time and steady state standard deviation for changing α

As expected, increasing the learning rate, α, decreases the convergence time at the cost of increasing the steady state standard deviation. Since the convergence time appears to plateau, increasing α beyond a point will lead to only marginal improvement to the convergence time at the cost of a decrease in steady state performance.

Image by Author: Convergence time and steady state standard deviation for changing σ

Increasing the sampling noise standard deviation, σ, increases the time to convergence. This is expected, since each update carries less information density, which is also reflected in the generally increased steady state standard deviation; the decreasing deviation at very small σ, however, is unexpected.

Image by Author: Convergence time and steady state standard deviation for changing n

Increasing population size, n, also conforms to expectations. The decreasing convergence time and steady state standard deviation align with the idea of increasing information density, although the effect plateaus very fast compared to changes in the other hyperparameters.


ES is very simple to implement and can yield great results. Next time you need to optimize a function (hint: machine learning 👀), keep this tool in mind!

I hope you found this article insightful and do let me know your thoughts 😁

