Evolution Strategies for Reinforcement Learning

Solving the famous inverted pendulum problem using nothing more than brute force.

Guillaume Crabé
Towards Data Science

--

Photo by Daniel Pelaez Duque on Unsplash

In the last article, the goal we set ourselves was to improve Deep Q-Learning with prioritized experience replay, in other words, to give the algorithm a bit of help in judging what is important and should be remembered and what is not. More generally, given the current state of the art, algorithms tend to perform better when helped by human intervention. Take the example of image recognition, and let’s say you want to classify apples and bananas. Your algorithm would definitely be more accurate with the prior knowledge that bananas are yellow than if it had to learn that by itself. This help can also take the form of over-engineering a set of hyper-parameters that only optimizes one very specific task. As for reinforcement learning, a way to prove that an algorithm generalizes is to test it on multiple environments. This is exactly why the OpenAI Gym environments were made: they allow researchers to test out their algorithms through a simple interface that makes it very easy to switch between environments.

Now, about prioritized experience replay: the publication showed that it generalizes well across most environments, yet this little bit of human help did not pay off in our case. After all, maybe this environment needs more randomness to be solved. In Deep Q-Learning, randomness is created by using an epsilon-greedy policy, which amounts to “how about we don’t take the most optimized action from time to time and see what happens”. This is also called exploration. But this is actually a rather pale form of randomness, since it is based on a stochastic (probabilistic) policy. In that case, how about making it totally random?

Photo by Katie Smith on Unsplash

Reinforcement learning randomness cooking recipe:

  • Step 1: Take a neural network with a set of weights, which we use to transform an input state into a corresponding action. By taking successive actions guided by this neural network, we collect and add up each successive reward until the experience is complete.
  • Step 2: Now add the randomness: from this set of weights, generate another set by adding random noise to the original weight parameters, that is, modify them a bit using a sampling distribution, for example a Gaussian distribution. Run a new experience and collect the total reward.
  • Step 3: Repeat the random sampling of weight parameters until you achieve the desired score (see the sketch below).
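To make the recipe concrete, here is a minimal sketch on CartPole. The linear policy (`action = argmax(state @ weights)`), the noise scale and the score threshold are illustrative assumptions, not taken from the original post, and the classic 4-tuple gym step API is assumed:

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")

def run_episode(weights):
    """Step 1: play one full episode with the given weights and add up the rewards."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = int(np.argmax(state @ weights))       # forward pass of a linear "network"
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

base = np.zeros((env.observation_space.shape[0], env.action_space.n))  # initial weights
for iteration in range(1000):
    # Step 2: perturb the original weights with Gaussian noise and run an experience
    candidate = base + np.random.randn(*base.shape)
    reward = run_episode(candidate)
    # Step 3: repeat until the desired score is reached
    if reward >= 195.0:
        print(f"Reached a score of {reward} after {iteration + 1} random samples")
        break
```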

This is the most random thing you could do: pull out a neural network with random weights and see if that works; if not, try again. The truth is, this is very unlikely to work (or at least to work in a reasonable amount of time). Yet some very promising papers published in recent years, which provide very competitive results on tasks such as making humanoid robots learn how to walk, are not so far from applying this very basic cooking recipe.

Let’s get to it. Now imagine that instead of sampling around the same initial set of weights, at each sampling iteration you compare your reward with the reward obtained by the previous set of weights. If the reward is better, it means your neural network has a better idea of what the optimal policy is, so you can now start from there to sample the next set of weights. This process is called hill climbing.

The analogy is pretty straightforward: you are trying to optimize your total reward (which lies at the summit of the hill), and you are taking successive steps. If a step brings you closer to the top, you confidently start from there for the next step. Otherwise, you go back to your previous position and try another direction. It actually looks very much like gradient ascent, where you optimize by “climbing” the function you are trying to maximize. The difference lies in how the neural network is updated: in hill climbing, you do not backpropagate to update the weights, you only use random sampling.
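As a quick illustration (reusing the `run_episode` helper from the sketch above), the whole idea fits in a few lines: keep the new weights only if they improved the reward, otherwise sample again around the previous best. The noise scale is an arbitrary choice for this sketch.

```python
import numpy as np

best_weights = np.zeros((4, 2))        # CartPole: 4 state variables, 2 actions
best_reward = run_episode(best_weights)
noise_scale = 0.5                      # illustrative value

for iteration in range(1000):
    candidate = best_weights + noise_scale * np.random.randn(*best_weights.shape)
    reward = run_episode(candidate)
    if reward > best_reward:           # this step brought us closer to the top of the hill
        best_weights, best_reward = candidate, reward
    if best_reward >= 195.0:
        break
```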

Hill climbing actually belongs to a group of algorithms called black box optimization algorithms.

We don’t know exactly what function we are trying to optimize; we can only feed it inputs and observe a result. Based on the output, we can modify our input to try to reach the optimum. Actually, that sounds very familiar! Indeed, reinforcement learning algorithms also rely on a black box, since they are based on an environment which provides the agent with rewards (output) for each step taken (input). For example, when you are trying to teach a humanoid robot how to walk, you have absolutely no prior knowledge of the kinematics or dynamics of the model you are trying to actuate, not even an idea of what gravity means! This is quite a black box, isn’t it? As seen below, what we are trying to optimize is our own approximation function with respect to the environment.

In the case of the deep Q-network, we have an idea of the function we are trying to optimize, since we are performing back-propagation, so in some sense only the environment is considered a black box. Whereas for hill climbing, we are blindly modifying a set of weights without any knowledge of how these weights are used, which can itself be considered a black box.

Now, let’s climb some hills and see how that would be implemented. We start with another environment called Cart Pole, which is basically an inverted pendulum. This is usually a very good environment to test out algorithms and ideas, as it is rather easy to solve and has no particular local optima. Cart pole is also popular because it was used in classic control theory (which provides an analytical solution).

Let’s first watch what an inverted pendulum looks like in real life, here balanced by an impressive robot from Naver Labs:

Now let’s try to balance the cart-pole from our gym environment. In this implementation, we assume that an agent is provided (code on GitHub) which can evaluate the reward for a full episode.
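The gist embedded in the original post is not reproduced here, so below is a minimal sketch of what such a hill-climbing loop with “elite” weights could look like. It assumes an `agent.evaluate(weights)` method that plays one full episode and returns the total reward (the actual agent lives in the GitHub repo); the population size, noise scale and elite fraction are illustrative values.

```python
import numpy as np

def hill_climb(agent, dim, n_iterations=500, pop_size=50, elite_frac=0.2, sigma=0.5):
    """Hill climbing around the best weights found so far, keeping an 'elite' fraction."""
    n_elite = int(pop_size * elite_frac)
    best_weights = sigma * np.random.randn(dim)

    for iteration in range(1, n_iterations + 1):
        # sample a population of candidate weights around the current best
        population = [best_weights + sigma * np.random.randn(dim) for _ in range(pop_size)]
        rewards = np.array([agent.evaluate(w) for w in population])

        # keep only the candidates that produced the best rewards and average them
        elite_idx = rewards.argsort()[-n_elite:]
        best_weights = np.array([population[i] for i in elite_idx]).mean(axis=0)

        if agent.evaluate(best_weights) >= 195.0:   # CartPole-v0 "solved" threshold
            print(f"Environment solved in {iteration} iterations")
            break
    return best_weights

# e.g. hill_climb(agent, dim=8)   # 4 state variables x 2 actions for CartPole
```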

Quite simple, isn’t it? We actually went a bit fancy here by adding a small improvement: instead of using all the rewards to compute the new weights, we only use a fraction of them, the “elite” weights, the ones that produced the best rewards. We could also be greedier and take only the weights with the single highest reward, but that would be less robust and would not handle local optima well. Computing the new weights with all the results, on the other hand, is more robust but also slower. Using a fraction of the results strikes a balance between robustness and fast convergence.

Hill-climbing results

The algorithm can solve the cart pole environment in 104 iterations, with about 470 s of processing time. Let’s enjoy watching the trained agent perform its task. Definitely not as good looking as a humanoid robot arm balancing a weight, but that’s already something!

Now, hill climbing is only one possible black-box algorithm. In this paper, the OpenAI team (the company behind OpenAI Gym as well) came up with another version which, they say, can solve complex RL problems such as the MuJoCo locomotion tasks or the Atari games collection. They also claim they can solve these tasks much faster than the most efficient RL algorithms, by a factor of at least 10. Let’s see what the algorithm behind such a success looks like:

Evolution Strategies

Mathematical notation aside, what is shown here is very similar to the hill-climbing algorithm: first generate sets of weights by sampling randomly from a Gaussian distribution, evaluate the reward for each set of weights, and compute the new weights using all the results. The main difference lies in the update: while hill climbing just averages the best weights, Evolution Strategies uses an update rate to move slowly in the best direction. As with most RL algorithms, this is meant to avoid falling straight into local optima and adds robustness. Apart from that, this is very much like hill climbing. And that is able to perform better than the most complex RL algorithms? The keyword is actually in the title of the snapshot: parallelization.
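Before turning to that keyword, here is a rough sketch (not the paper’s exact code) of the weight update described above: each Gaussian perturbation is weighted by the normalized reward it obtained, and a learning rate controls how slowly we move in that direction. `agent.evaluate` is again the assumed episode-reward function, and the hyper-parameter values are illustrative.

```python
import numpy as np

def evolution_strategies_step(agent, weights, pop_size=50, sigma=0.1, learning_rate=0.01):
    """One ES iteration: sample perturbations, weight them by reward, move slowly."""
    noise = np.random.randn(pop_size, weights.size)                     # epsilon_i
    rewards = np.array([agent.evaluate(weights + sigma * n) for n in noise])

    # normalize rewards so the update is insensitive to their absolute scale
    normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # move in the reward-weighted average direction of the noise, scaled by the learning rate
    gradient_estimate = noise.T @ normalized / (pop_size * sigma)
    return weights + learning_rate * gradient_estimate
```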

Parallelization means splitting the processing tasks and executing them at the same time. Here, when you want to evaluate the rewards for each set of weights, you can simply evaluate all of them at the same time and gather the results when they are all done. With a single computer this would of course require an incredibly powerful machine, but the trick is that the computation is not limited to one machine only: parallelization can be done by sharing the computing tasks with other computers. For instance, the OpenAI team stated that to teach a humanoid robot to walk they used 1,440 CPUs across 80 machines.

Parallelizing tasks when solving RL problems is nowadays a must if you want to achieve decent results. For example, the relatively famous A3C algorithm is explicitly designed to take advantage of parallel computation, being part of what its authors call asynchronous methods for deep RL. The difference with Evolution Strategies (ES) is that ES is entirely built to take advantage of parallelization: even though it is a much simpler piece of code, stacking more and more computers leads to better results.

Even though I (and probably you) don’t have 80 computers at hand, it is still interesting to try to run this algorithm with asynchronous computation, to get a better grasp of how this could actually work and see if we can outperform the hill-climbing results.

To try parallelizing the evolution strategies algorithm, we need to dig into parallel computation in Python and the different ways of performing multi-threading. Fortunately, we can rely on a very good article which details exactly what we are looking for, namely guidance on which library is most suited to our case, with some snippets of code showing how to use it. (Actually it was supposed to be part of this article, but shhhh.)

>>>>> Insert analysis of multi-threading in Python here <<<<<

Using the newly acquired knowledge, we can try to parallelize the evolution strategies algorithm.
The base layer of our implementation is the same as previously presented: it relies on an agent that can evaluate the total reward for a provided set of weights. The agent simply performs a forward pass through the neural network built with the specified weights to find out the action to take at every step, and sums up the rewards provided by the gym environment.
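As a reminder of what that agent boils down to, here is a simplified sketch in which a plain NumPy linear policy stands in for the PyTorch network used in the repo (the classic 4-tuple gym step API is assumed):

```python
import gym
import numpy as np

class Agent:
    """Evaluates the total episode reward obtained with a given flat weight vector."""

    def __init__(self, env_name="CartPole-v0"):
        self.env = gym.make(env_name)
        self.state_size = self.env.observation_space.shape[0]
        self.action_size = self.env.action_space.n

    def evaluate(self, weights):
        # rebuild the "network" from the flat weight vector
        w = weights.reshape(self.state_size, self.action_size)
        state = self.env.reset()
        total_reward, done = 0.0, False
        while not done:
            action = int(np.argmax(state @ w))          # forward pass -> greedy action
            state, reward, done, _ = self.env.step(action)
            total_reward += reward
        return total_reward
```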

We now want to leverage our threading knowledge by choosing the right library for our case. We want to optimize the convergence time of the evolution strategy by parallelizing the heavy computation. On top of that, we wish, in the future, to be able to use the cores of multiple separate computers. The choice is then obvious: we want to go with the multiprocessing library.

It turned out that with multiprocessing, some data types could not be successfully serialized, which is the case for the agent, probably due to the presence of pytorch data types. To work around this issue, the agents are defined as global variables so that they remain accessible from the worker processes. This is not a good design, and it cannot be used when the computation is shared between multiple computers, but for testing it will do as a first way to measure the efficiency of the parallel computation.

The base of the algorithm is as follows: at each iteration, launch a worker for each parallel agent that you want to use. In each worker, sample a set of weights and call the evaluate function. Collect all the rewards generated by the workers and update the weights. Note that the update requires the rewards as well as the sampled weights, but each worker only returns the reward. This relies on the seed trick: np.random is actually not that random; if you call np.random.randn multiple times, then reset the seed and reiterate, it will provide the exact same results. Taking advantage of this, instead of returning the set of weights (which could be huge for complex neural networks), we can return only the seed and then recreate the weights when needed. The time saved by not serializing/deserializing the weights can be significant for some applications. Here is how this implementation could look:
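The original gist is not embedded here, so what follows is a hedged reconstruction of that idea, reusing the `Agent` sketch above: agents live in a global list, each worker draws a perturbation from a seed, evaluates it, and returns only `(seed, reward)`; the parent process regenerates the same noise from the seeds to perform the update. Names such as `sample_reward` and the hyper-parameter values are assumptions.

```python
import multiprocessing as mp
import numpy as np

NUM_AGENTS = 50
SIGMA, LEARNING_RATE = 0.1, 0.01

# global agents so that the workers can reach them without pickling pytorch objects
agents = [Agent() for _ in range(NUM_AGENTS)]

def sample_reward(agent_id, weights, seed):
    """Worker task: recreate the noise from the seed, evaluate, return only (seed, reward)."""
    np.random.seed(seed)
    noise = np.random.randn(weights.size)
    reward = agents[agent_id].evaluate(weights + SIGMA * noise)
    return seed, reward

def es_iteration(pool, weights):
    seeds = np.random.randint(0, 2**31 - 1, size=NUM_AGENTS)
    results = [pool.apply_async(sample_reward, args=(i, weights, int(seeds[i])))
               for i in range(NUM_AGENTS)]
    seeds_and_rewards = [r.get() for r in results]

    # regenerate the exact same noise vectors from the returned seeds
    rewards, noises = [], []
    for seed, reward in seeds_and_rewards:
        np.random.seed(seed)
        noises.append(np.random.randn(weights.size))
        rewards.append(reward)
    rewards = np.array(rewards)
    normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return weights + LEARNING_RATE * (np.array(noises).T @ normalized) / (NUM_AGENTS * SIGMA)

# usage sketch: pool = mp.Pool(processes=4); weights = es_iteration(pool, weights)
```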

Running this code has a problem, though: the multiprocessing apply_async call gets blocked at the start of the second iteration. Seemingly, some internal lock specific to multiprocessing is blocking the computation. Since the first iteration was successful, we can guess that some remnant of that first iteration is blocking the beginning of the next one. It could be that, when going out of scope, the objects handled by mp.Pool() are not destroyed properly.
One possible explanation is that multiprocessing creates its worker processes with the “fork” method by default, which copies the parent process environment, instead of “spawn”, which re-evaluates all the variables. It is possible that not all the data types (and the ones underlying them) have a well-defined copy behaviour, which could lead to such a deadlock.
How about using the “spawn” method then? Since multiprocessing then re-evaluates all the variables, the gym environment as well as the agent itself get the same treatment, which is very time consuming (much more so than calling the sample_reward function itself). Even though this approach would still work, it would definitely not allow us to reach the target we set: improving the computation efficiency of the evolution strategies algorithm.

So now what? Multiprocessing does not seem like a viable option, and the only other solution is to use a thread pool, which is limited by the Global Interpreter Lock (GIL). We can still measure the performance with that approach to see if we gain anything compared to a basic sequential implementation. In both cases we use 50 agents, each able to generate the total reward for a randomly sampled set of weights:
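A minimal sketch of that GIL-bound variant, using `concurrent.futures.ThreadPoolExecutor` (the exact pool class used in the repo may differ) and reusing the `Agent` and the ES update from the earlier sketches:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def es_iteration_threaded(agents, weights, sigma=0.1, learning_rate=0.01):
    """Same ES iteration as before, but the episode evaluations run in a thread pool."""
    noise = np.random.randn(len(agents), weights.size)

    with ThreadPoolExecutor(max_workers=len(agents)) as executor:
        futures = [executor.submit(agent.evaluate, weights + sigma * noise[i])
                   for i, agent in enumerate(agents)]
        rewards = np.array([f.result() for f in futures])

    normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return weights + learning_rate * (noise.T @ normalized) / (len(agents) * sigma)
```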

The algorithm runs until the average reward hits 195. In both cases the behavior looks similar, and the environment is solved in an equivalent number of episodes. So far so good: the algorithm is the same, the only difference being the parallelization of the computation. However, the computation time differs, and is actually longer when using the thread pool executor. This echoes the results found when benchmarking the different threading approaches; we only confirmed what was expected.

We now find ourselves in a deadlock of our own: our preferred threading library is not working as expected, and the other approaches are actually slower than the sequential one. The way we do threading needs to be different. If only we could parallelize the computation directly inside the gym environment, things would be much easier…

Guess what? This interface does… kind of exist. There is an interface called VecEnv, which is literally an array of OpenAI gym environments. It actually sounds quite similar to the approach we just took, except that VecEnv acts as a wrapper around the traditional gym environment and allows you to perform the “step” operation in parallel. In the following example of the evaluate function, the “step” function now accepts an array of actions, and returns an array of rewards and new states:
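The original snippet is not reproduced here; the sketch below uses `gym.vector.AsyncVectorEnv` as a stand-in for the VecEnv wrapper mentioned above, with one environment per set of weights (the classic 4-tuple step API and a linear policy are again assumptions):

```python
import gym
import numpy as np

def evaluate_population(weight_population, env_name="CartPole-v0"):
    """Evaluate one full episode per weight matrix, stepping all environments in parallel."""
    n_envs = len(weight_population)
    vec_env = gym.vector.AsyncVectorEnv([lambda: gym.make(env_name) for _ in range(n_envs)])

    states = vec_env.reset()
    total_rewards = np.zeros(n_envs)
    dones = np.zeros(n_envs, dtype=bool)

    while not dones.all():
        # one forward pass per environment, then a single vectorized "step" call
        actions = np.array([int(np.argmax(states[i] @ weight_population[i]))
                            for i in range(n_envs)])
        states, rewards, step_dones, _ = vec_env.step(actions)
        total_rewards += rewards * (~dones)      # stop accumulating once an episode is done
        dones |= step_dones

    vec_env.close()
    return total_rewards
```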

Whereas before we parallelized the entire process of accumulating the rewards for a full episode, here we launch multiple workers at each “step”. The benefit is that the parallel call happens after the neural network forward pass has already been performed, which means that the problems encountered in the previous section, related to pytorch data types and multiprocessing, should be avoided. Let’s check the performance results:

While we were expecting a massive improvement in computation time, the opposite actually happened. The algorithm converged in about 500 iterations, compared with 400 for the previous results, but it took about three times longer to compute.
Now let’s take the “step” function apart and measure its performance separately from the rest of the algorithm.

We create two agents: one implements the usual “step” function, taking one action as an argument and giving back the new state and the obtained reward, while the other uses the VecEnv (vector of environments) and implements a “step” function taking an array of actions as an argument and outputting an array of states and rewards. We call “step” a fixed number of times in each case and compare the results:
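The benchmark itself is only sketched here: it simply times a fixed number of “step” calls on a single environment versus the same total number of steps on a vectorized one, with random actions and `gym.vector.AsyncVectorEnv` again standing in for VecEnv (on some platforms this needs to run under an `if __name__ == "__main__":` guard):

```python
import time

import gym

N_STEPS, N_ENVS = 10_000, 4

# sequential baseline: one environment, N_STEPS * N_ENVS individual "step" calls
env = gym.make("CartPole-v0")
env.reset()
start = time.time()
for _ in range(N_STEPS * N_ENVS):
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()
print(f"sequential: {time.time() - start:.2f}s")

# vectorized version: N_ENVS environments stepped together, N_STEPS times
vec_env = gym.vector.AsyncVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(N_ENVS)])
vec_env.reset()
start = time.time()
for _ in range(N_STEPS):
    # the batched action space samples one action per environment;
    # finished episodes are reset internally by the vector env
    vec_env.step(vec_env.action_space.sample())
print(f"vectorized: {time.time() - start:.2f}s")
vec_env.close()
```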

We know that VecEnv parallelizes by using multiprocessing internally, which is the right approach and is supposed to provide a significant computation improvement. However, we observe that even in the most advantageous case, where the number of workers equals the number of cores of the PC (4 in this case), the performance is still behind the simple sequential case.

One reason for this lack of performance could be that the environment is too simple: the “step” function takes only an infinitesimal amount of time to return, while multiprocessing still needs to serialize and deserialize data (a cost we cannot measure directly, as it is hidden inside the VecEnv implementation). It is likely that in more complex environments, such as the humanoid robot learning to walk, the “step” function would require much more computation and would thus take full advantage of the parallel implementation.

Conclusion:

  • While we were initially planning to improve our lunar lander, we opted to solve the inverted pendulum problem first, as an easy way to benchmark the efficiency of the implementation.
  • We observed that a naive parallel implementation, separating the full evaluation of an experience’s reward into different worker processes, does not work due to PyTorch data types that cannot be serialized.
  • We used a ThreadPool executor instead and confirmed that this approach is still limited by the GIL (Global Interpreter Lock).
  • Finally, we tried the VecEnv wrapper around the gym environment, which provides an interface to execute the “step” function across multiple workers. However, this approach did not appear to be successful, potentially because our environment is too simple to take advantage of the multi-threaded approach.

To be done:

  • Try the algorithm with a more complex environment (e.g. the MuJoCo environments).

Github repo: https://github.com/Guillaume-Cr/evolution_strategies

--


Robotics Software Engineer, dreaming of creating robots with intelligence beyond the scope of the human brain. https://github.com/Guillaume-Cr