Reinforcement learning without gradients: evolving agents using Genetic Algorithms

Paras Chopra
Towards Data Science
10 min read · Jan 7, 2019


During the holidays, I wanted to ramp up my reinforcement learning skills. Knowing absolutely nothing about the field, I did a course where I was introduced to Q-learning and its “deep” equivalent (Deep Q-learning). That’s where I came across OpenAI’s Gym, which provides several environments for an agent to play in and learn from.

The course was limited to Deep Q-learning, so I read more on my own and realized there are now better algorithms, such as policy gradients and their variations (for example, the Actor-Critic method). If this is your first time with reinforcement learning, I recommend the following resources, which I found helpful for building good intuition:

I feel blessed that, as a community, we share so much: within a couple of days of knowing nothing about reinforcement learning, I was able to replicate what was state of the art just 3 years ago, playing Atari games using pixel data. Here’s a quick video of my agent (in green) playing Pong against the built-in AI using only pixel data.

This sort of feels like a personal achievement!

The Problem with Reinforcement Learning

My agent trained just fine using the policy gradients algorithm, but it took a full two days and nights of training on my laptop, and even after that it wasn’t playing perfectly.

I know that in the grand scheme of things two days isn’t a lot, but I was curious why training in reinforcement learning is so slow. Reading and thinking more about this, I realized that reinforcement learning is slow because gradients are (almost) non-existent and therefore not very useful.

Gradients help in supervised learning tasks such as image classification by giving useful information on how parameters (weights, biases) of the network should be changed for better accuracy.

Imagine this surface representing error for different combinations of weights and biases (the lower the better). Starting from a randomly initialized point (of weights and biases), we want to find the values that minimize the error (the lowest point). Gradients at each point represent the direction of downhill, so finding the lowest point is equivalent to following the gradient. (Wait, did I just describe the stochastic gradient descent algo?)
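
To make this concrete, here is a toy version of that descent in PyTorch. The two-parameter “bowl” below is just a stand-in for a real error surface, but the update rule is the one described above: compute the gradient, then step against it.

    import torch

    # A toy error surface: a simple bowl over two parameters (a stand-in for a real loss landscape)
    params = torch.randn(2, requires_grad=True)    # random starting point (our "weights and biases")

    def error(p):
        return (p ** 2).sum()                      # the lowest point is at (0, 0)

    learning_rate = 0.1
    for step in range(100):
        loss = error(params)
        loss.backward()                            # the gradient points uphill...
        with torch.no_grad():
            params -= learning_rate * params.grad  # ...so we step in the opposite direction
            params.grad.zero_()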

In image classification, after every mini-batch of training, backpropagation provides a clear gradient (direction) for each of the parameters in the network. In reinforcement learning, however, gradient information arrives only occasionally, whenever the environment gives out a reward (or penalty). Most of the time, our agent takes actions without knowing whether they were useful or not. This lack of gradient information makes our error landscape look like this:

Image via the excellent blog post Expressivity, Trainability, and Generalization in Machine Learning

The cheese-like surface represents the parameter space of our agent’s network: no matter what the agent does, the environment gives no reward, and hence there’s no gradient (i.e. with zero error/reward signal, we don’t know in which direction to change the parameters for better performance next time). The few holes in the surface represent low error/high reward, corresponding to parameters of agents that perform well.

Do you see the issue with policy gradients now? A randomly initialized agent is likely to land on the flat surface (rather than in a hole), and from the flat surface it’s incredibly hard to move towards better performance because there’s no gradient information. Because the (error) surface is flat, a randomly initialized agent does a more-or-less random walk and remains stuck with a bad policy for a long time. (This is why it took me days to train an agent. Hint: maybe the policy gradient approach is no better than random search?)

As the article titled “Why RL is flawed” clearly argued:

What if the board game you were trying to learn was Go — how would you start learning it? You would read the rules, learn some high-level strategies, recall how you played similar games in the past, get some advice… right? Indeed, it’s at least partially because of the learning from scratch limitation of AlphaGo Zero and OpenAI’s Dota bots that it is not truly impressive compared to human learning: they rely on seeing many orders of magnitude more games and using far more raw computational power than any human ever can.

I think the article hits the nail on the head. RL is inefficient because it doesn’t tell the agent what it is supposed to do. Not knowing what to do, the agent starts doing random things; only occasionally does the environment give out a reward, and then the agent must figure out which of the thousands of actions it took led to that reward. That’s not how humans learn! We’re told what needs to be done, we develop skills, and rewards play a relatively minor role in our learning.

If we trained children through policy gradients, they’d always be confused about what they did wrong and would never learn anything. (Photo via Pixabay)

Gradient-free approaches to Reinforcement Learning

As I was exploring alternatives to gradient-based approaches to RL, I hit upon a paper titled Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning. The paper combined what I was learning (reinforcement learning) with what I’ve always been excited about (evolutionary computing), so I went about implementing the algorithm in the paper and evolving an agent.

Note that, strictly speaking, we didn’t even have to implement a genetic algorithm. As hinted above, the same paper found that even a completely random search was able to discover good agents. This means that if you keep randomly generating agents, eventually you’ll find one that performs well, sometimes faster than with policy gradients. I know, it’s insane, but it just illustrates our original point: RL is fundamentally flawed because it gives us very little information to train an algorithm with.

What is a Genetic Algorithm?

The name sounds fancy, but under the hood it’s perhaps the simplest algorithm you can devise for exploring a landscape. Consider an agent in an environment (like Pong) that’s implemented via a neural network. It takes pixels as input and outputs the probabilities of the actions available to it (move the paddle up, move it down, or do nothing).

Image via Deep Reinforcement Learning: Pong from Pixels

Our task in reinforcement learning is to find the parameters (weights and biases) of the neural network that make the agent win more often and hence get more rewards. So far so good?

Pseudocode for Genetic Algorithms

  • Simply imagine the agent to be an organism
  • Its parameters are its genes, which specify its behavior (the policy)
  • Rewards indicate the organism’s fitness (i.e. the higher the rewards, the higher the likelihood of survival)
  • In the first iteration, you start with X agents with randomly initialized parameters
  • Some of them, by pure chance, will perform better than others
  • Just like natural evolution in the real world, you implement survival of the fittest: simply take the fittest 10% of agents and replicate them until you have X agents again for the next iteration. Kill the weakest 90% (if this sounds morbid, you can rename the kill function as give-moksha!)
  • While replicating the top 10% fittest agents, add a tiny amount of random Gaussian noise to their parameters so that in the next iteration you get to explore the neighborhood around the parameters of the best agents
  • Keep the top-performing agent as is (without adding noise) so that you always preserve the best of the best, as insurance against the Gaussian noise causing a drop in performance

That’s it. You’ve understood the core of Genetic Algorithms (GA). There are many (fancy) variations of GA where there’s all sorts of (dirty) sex(ual recombination) between two agents, but the paper on deep neuroevolution implemented the vanilla GA with the pseudo-code above, and that’s what I implemented in my code as well. (You can access the Jupyter notebook with the code in my GitHub repository.)

Yellower regions are regions with lower error (higher rewards/performance). Blue dots are all agents. Green ones are the top 10% and the red dot is the best of the best. Notice how gradually the entire population moves towards the region with the lowest error. (Image via Visual Guide to Evolutionary Strategies)

For more visualizations on how evolutionary algorithms work, I highly recommend going through this insanely well-done post: A Visual Guide to Evolution Strategies.

Code for implementing Deep Neuroevolution for Reinforcement Learning

I t̸r̸a̸i̸n̸e̸d̸ evolved an agent that balances a pole on a moving cart (aka CartPole-v0). Here’s the entire code: https://github.com/paraschopra/deepneuroevolution

Using PyTorch, we parametrize the agent through a neural network with two hidden layers (I wanted to retain the “deep” part :). For CartPole, a one-layer network might also have done just fine.
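
To give a feel for the shape of it, here is a minimal sketch of such an agent; the class name, the hidden-layer width of 32 and the softmax over CartPole’s two actions are illustrative choices of mine, not necessarily what the repository uses. The imports below are shared by the snippets that follow.

    import copy
    import gym
    import numpy as np
    import torch
    import torch.nn as nn

    class CartPoleAgent(nn.Module):
        """Policy network: 4 observations in, probabilities over 2 actions out."""
        def __init__(self, hidden=32):             # hidden width is an illustrative choice
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(4, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2), nn.Softmax(dim=-1),
            )

        def forward(self, obs):
            return self.net(obs)

    def return_random_agents(num_agents):
        # one freshly (randomly) initialized network per agent; we never need gradients
        agents = [CartPoleAgent() for _ in range(num_agents)]
        for agent in agents:
            for param in agent.parameters():
                param.requires_grad = False
        return agents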

Here’s the main loop of evolution:
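
The full version is in the notebook; what follows is a condensed sketch of the same loop, with the helper signatures approximated. The helpers (return_random_agents above, run_agents_n_times, return_children and friends below) need to be defined before this part will run.

    # Condensed sketch of the evolution loop (helper functions are sketched further down)
    num_agents = 500        # population size
    top_limit = 20          # number of top agents used as parents
    generations = 1000      # maximum number of generations

    agents = return_random_agents(num_agents)     # randomly initialized population
    elite_index = None                            # no elite yet in the first generation

    for generation in range(generations):
        # average reward of every agent over 3 episodes (a single run could just be lucky)
        rewards = run_agents_n_times(agents, runs=3)

        # indexes of the best agents, highest reward first
        sorted_parent_indexes = np.argsort(rewards)[::-1][:top_limit]

        top5_mean = np.mean(np.sort(rewards)[-5:])
        print(f"Generation {generation} | top-5 average reward: {top5_mean:.1f}")

        # children = mutated copies of randomly chosen parents, plus one untouched elite
        agents, elite_index = return_children(agents, sorted_parent_indexes, elite_index)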

The code is pretty much self-explanatory and follows the pseudo-code that I wrote earlier in this post. Mapping specifics to the pseudo-code:

  • Our population size is 500 (num_agents) and we generate agents randomly in the first iteration (return_random_agents)
  • Out of the 500, we select only the top 20 as parents (top_limit)
  • The maximum number of iterations we want to run the loop for is 1000 (generations), although for CartPole a decently performing agent is usually found within a couple of iterations
  • In each generation, we first run all agents and get their average performance over 3 runs (a single run could just be lucky, so we average). This is done through the run_agents_n_times function.
  • We sort agents in descending order of their rewards (performance). The sorted indexes are stored in sorted_parent_indexes.
  • Then we take the top 20 agents and choose randomly among them to make children for the next iteration. This happens in the return_children function (see below). The function adds a small Gaussian noise to all parameters while copying an agent, but retains one best-of-the-best elite agent (without adding any noise).
  • With these children as the new parents, we run through the entire loop again, until all 1000 generations are done or we find an agent with good performance (in the Jupyter notebook, I print the average performance of the top 5 agents; when it’s good enough, I manually interrupt the loop)

The return_children function looks like this:
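
(Sketched here with approximated signatures; see the repository for the exact version.)

    def return_children(agents, sorted_parent_indexes, elite_index):
        children_agents = []

        # all but one child: a mutated copy of a randomly chosen top parent
        for _ in range(len(agents) - 1):
            parent_index = np.random.choice(sorted_parent_indexes)
            children_agents.append(mutate(agents[parent_index]))

        # the last slot is reserved for the untouched best-of-the-best agent
        children_agents.append(add_elite(agents, sorted_parent_indexes, elite_index))
        elite_index = len(children_agents) - 1     # remember where the elite sits

        return children_agents, elite_index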

You see that, first, it selects a random agent out of the top 20 agents (with indexes contained in sorted_parent_indexes) and then calls the mutate function to add random Gaussian noise before appending it to the children_agents list. In the second part, it calls the add_elite function to add the best-of-the-best agent to the children_agents list.

Here’s the mutate function:
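
(Again a sketch; the mutation_power value of 0.02 is only a plausible default here, not necessarily what the notebook uses.)

    def mutate(agent):
        child_agent = copy.deepcopy(agent)
        mutation_power = 0.02      # hyper-parameter, loosely analogous to a learning rate

        for param in child_agent.parameters():
            # Gaussian noise of the same shape as the parameter, scaled by mutation_power
            noise = np.random.randn(*param.data.shape).astype(np.float32)
            param.data += mutation_power * torch.from_numpy(noise)

        return child_agent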

You can see that we iterate through all the parameters and simply add Gaussian noise (via np.random.randn()) multiplied by a constant (mutation_power). The multiplicative factor is a hyper-parameter and loosely resembles the learning rate in gradient descent. (By the way, this method of iterating through all parameters is slow and inefficient. If you know a faster way in PyTorch, let me know in the comments.)

Finally, the function add_elite is as follows:
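
(A sketch once more. Including last generation’s elite among the candidates is my reading of “always have a copy of our previous best” below; the candidate count of 10 and the 5 evaluation runs come from the description that follows.)

    def add_elite(agents, sorted_parent_indexes, elite_index, top_candidates=10, runs=5):
        # candidates: the current top agents, plus last generation's elite if we have one
        candidate_indexes = list(sorted_parent_indexes[:top_candidates])
        if elite_index is not None:
            candidate_indexes.append(elite_index)

        # re-evaluate each candidate over several episodes to be sure which one really is best
        scores = [return_average_score(agents[i], runs=runs) for i in candidate_indexes]
        best_index = candidate_indexes[int(np.argmax(scores))]

        # return an untouched deep copy of the best agent (no mutation)
        return copy.deepcopy(agents[best_index])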

This code simply takes the top 10 agents and runs them 5 times to be doubly sure of which one is the elite based on average performance (via return_average_score). Then it copies the elite via copy.deepcopy and returns it as is (without mutation). As mentioned previously, the elite ensures that we always have a copy of our previous best and it prevents us from randomly drifting (via mutations) into a zone where there is no good agent.
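
For completeness, here is what the evaluation helpers could look like. This assumes the classic Gym API (reset returning an observation, step returning four values) that CartPole-v0 used at the time, and it greedily picks the most probable action; the notebook may well sample from the output distribution instead.

    def return_average_score(agent, runs=3):
        env = gym.make("CartPole-v0")
        total_reward = 0.0
        for _ in range(runs):
            obs = env.reset()
            done = False
            while not done:
                with torch.no_grad():
                    action_probs = agent(torch.from_numpy(obs).float())
                action = int(torch.argmax(action_probs))    # greedy action choice
                obs, reward, done, _ = env.step(action)
                total_reward += reward
        return total_reward / runs

    def run_agents_n_times(agents, runs=3):
        # average reward of each agent over several episodes
        return [return_average_score(agent, runs) for agent in agents]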

That’s it! Let’s see our evolved CartPole agent in action.

You, Sir, are a product of evolution.

Genetic Algorithms are not perfect. For example, there’s no guidance on how to choose the multiplicative factor when adding Gaussian noise. You simply have to try a bunch of numbers and see which one works. Moreover, in many cases they may fail completely. I tried multiple times to evolve an agent for Pong, but training was very slow and I gave up.

I contacted the author of the paper on deep neuroevolution, Felipe Such. He mentioned that for some games (including Pong) neuroevolution fails but for others (such as Venture) it is much faster than policy gradients.

What will YOU evolve?

The code for deep neuroevolution in my repository is generic enough to work with any neural network implemented in PyTorch. I encourage you to try your hand at various tasks and architectures. Please let me know if you succeed, fail or get stuck!

Good luck evolving your own world.

PS: Check out my previous hands-on tutorials on a) Generating Philosophy from a small corpus and b) Bayesian Neural Networks

Thanks Nirant Kasliwal and Felipe Such for reviewing the draft of this post.

Follow me on Twitter

I regularly tweet on AI, deep learning, startups, science and philosophy. Follow me on https://twitter.com/paraschopra

