Evolving a Neural Network in a Sparse Reward Environment

Using genetic algorithms to solve the Lunar Lander Continuous environment with a sparse reward

Gabriele Sgroi, PhD
Towards Data Science


Photo by Winston Chen on Unsplash

Genetic algorithms are a powerful method, inspired by biological evolution, for solving optimization problems. They build solutions through an iterative process of random mutation and crossover, selecting the best solutions (according to a quantity to be maximized) for reproduction. A key element in this selection is the fitness function, which assigns to each solution a score called the fitness of the solution. The fitness should correspond to the objective to be maximized in the optimization problem. Each solution is parametrized by a set of parameters, called genes, that constitute the chromosome of the solution. The ensemble of all current solutions at each step is called the population. A genetic algorithm consists of a sequence of steps, called generations, during which the following operations are performed (a minimal code sketch of the loop follows the list):

  • The fitness of each solution is evaluated;
  • The parents of the next generation are selected from the population as specified by a selection function, favoring solutions with the highest fitness;
  • Each pair of parents produces new solutions by mixing the genes in their chromosomes as prescribed by a crossover function;
  • The genes of the new solutions are mutated according to a mutation function;
  • Low-fitness solutions are removed from the population.
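
To make these steps concrete, here is a minimal sketch of one generation in Python. The callables `evaluate_fitness`, `select_parents`, `crossover`, and `mutate` are placeholders for the components described later in the post, not the API of the package linked below.

```python
def run_generation(population, evaluate_fitness, select_parents, crossover, mutate):
    """One generation of a generic genetic algorithm (illustrative sketch)."""
    # 1. Evaluate the fitness of every solution in the population.
    fitnesses = [evaluate_fitness(solution) for solution in population]

    # 2.-4. Repeatedly pick two parents (favoring high fitness), recombine
    # their chromosomes, and mutate the resulting child.
    offspring = []
    while len(offspring) < len(population):
        parent_1, parent_2 = select_parents(population, fitnesses)
        child = crossover(parent_1, parent_2)
        offspring.append(mutate(child))

    # 5. Low-fitness solutions are dropped simply because they are not
    #    selected to reproduce: the offspring replace the old population.
    return offspring
```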

As the number of generations increases, the population fills with higher-fitness individuals, as these have a higher probability of propagating their genes to the next generations. Furthermore, mutations that give a solution a fitness advantage will combine and accumulate in the offspring thanks to crossover.

Genetic algorithms are very versatile and can be applied to a variety of optimization problems as long as it is possible to write a fitness function whose global maximum solves the task.

A very interesting application is the use of genetic algorithms to optimize the weights of a neural network as opposed to the usual gradient descent approach. Although genetic algorithms will probably converge more slowly for many tasks (as we are giving up a valuable piece of information, the gradient of the loss function), they can overcome some of the limitations that come with gradient-based approaches. For instance, genetic algorithms won’t suffer from the usual exploding/vanishing gradient problems. Another situation in which genetic algorithms may be an interesting alternative to gradient-based methods is when the gradients are almost always null or uninformative.

A use case in which genetic algorithms might be useful is solving sequential decision-making problems with sparse rewards. Sequential decision-making problems are tasks that require a sequence of actions in order to be solved. As an example, consider the problem of escaping from a maze: at each crossing we need to choose which direction to follow, and that choice depends on the current position in the maze. Reward sparsity occurs when many decision steps do not come with an extrinsic reward, for instance when the reward is provided only at the end of the task. In our example, this could happen if the reward is only given after successfully escaping the maze, with no reward given at intermediate steps. This kind of problem poses a challenge to simple reinforcement learning algorithms, as many steps won't provide informative gradients.

As I was curious to see how genetic algorithms might perform in a sparse reward environment, I used them to evolve a neural network on the Lunar Lander Continuous environment from OpenAI gym. In the vanilla version, this environment has a reward shaped in such a way that it is not sparse. To simulate a sparse reward environment, I used as the fitness function the sum of the reward obtained at each time step over an entire episode. In this way, for the genetic algorithm, there is no difference between the original environment providing a reward at each time step and one in which the reward is given at once at the last step.
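
In code, this fitness is just the return of one episode. Here is a minimal sketch, assuming gym's classic (pre-0.26) step API and a `policy` callable that maps observations to actions (a stand-in for the neural network described later):

```python
import gym

def episode_return(policy):
    """Fitness of a policy: total reward accumulated over a single episode."""
    env = gym.make("LunarLanderContinuous-v2")
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)               # 2-D action in [-1, 1]
        observation, reward, done, _ = env.step(action)
        total_reward += reward                     # only the final sum is used
    env.close()
    return total_reward
```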

I made the full code available in this Colab Notebook that uses the genetic algorithm package I implemented in this GitHub repository. Let’s have a look together at the various parts.

The Environment

As anticipated, I have used the Lunar Lander Continuous environment from OpenAI gym. The environment expects our agent to provide an action consisting of a two-dimensional vector with values bounded between -1 and 1. The first entry controls the main engine: it varies the power from 0% (-1) to 100% (+1). However, the main engine cannot work below 50% power. The second entry controls the side engines: in the interval [-1, -0.5] it fires the left engine, in [-0.5, 0.5] the side engines are off, and in [0.5, 1] it fires the right engine.
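
For illustration only, these thresholds can be spelled out in a small hypothetical helper; this simply paraphrases the behaviour described above and is not part of the gym API.

```python
def describe_action(action):
    """Human-readable reading of a 2-D Lunar Lander Continuous action."""
    main, side = float(action[0]), float(action[1])

    # Main engine: power scales linearly from 0% at -1 to 100% at +1,
    # but the engine cannot run below 50% power, i.e. it is off for main <= 0.
    main_engine = "off" if main <= 0 else f"~{50 * (main + 1):.0f}% power"

    # Side engines: left for [-1, -0.5), off in [-0.5, 0.5], right for (0.5, 1].
    if side < -0.5:
        side_engine = "left engine firing"
    elif side > 0.5:
        side_engine = "right engine firing"
    else:
        side_engine = "side engines off"
    return main_engine, side_engine
```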

The environment provides the agent with an observation consisting of an 8-dimensional vector containing the following information (a small helper that names these components follows the list):

  • the x coordinate of the lander;
  • the y coordinate of the lander;
  • the horizontal velocity of the lander;
  • the vertical velocity of the lander;
  • the angle of the lander;
  • the angular velocity of the lander around its center;
  • a boolean indicating if the left leg is in contact with the ground;
  • a boolean indicating contact of the right leg with the ground.
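
For readability, these components can be unpacked by position, as in this hypothetical helper (the field names are mine, not part of the environment's API):

```python
def unpack_observation(obs):
    """Name the 8 components of a Lunar Lander observation vector."""
    x, y, vx, vy, angle, angular_velocity, left_contact, right_contact = obs
    return {
        "x": x, "y": y,                            # lander position
        "vx": vx, "vy": vy,                        # horizontal / vertical velocity
        "angle": angle,                            # orientation of the lander
        "angular_velocity": angular_velocity,
        "left_leg_contact": bool(left_contact),    # ground-contact flags
        "right_leg_contact": bool(right_contact),
    }
```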

The target landing location, consisting of a smooth horizontal segment, is always located at the origin of the coordinates. At the beginning of each episode, the lander starts with a random speed in a random direction. This stochasticity makes the problem significantly more difficult for the genetic algorithm since sometimes good solutions may receive a poor total reward due to the randomness of the initial conditions. As the total reward is the metric that we use to propagate solutions to the next generations, some good solutions may not be able to pass on their genes due to mere bad luck. I expect that this will slow the convergence of the algorithm, but it will make it more robust to different initial conditions.

The Genetic Algorithm

I will now explain in detail the various parts of the genetic algorithm I ran my little experiment with.

Mutation function: I used a Gaussian mutation with a mutation probability of 0.1. This means that each gene of the newly created children solutions has a 0.1 probability of being mutated. Mutation consists of adding a random value sampled from a Gaussian distribution with mean 0 and standard deviation 0.1.
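
A Gaussian mutation with these settings could look like the following sketch (the implementation in the linked package may differ in details):

```python
import numpy as np

def gaussian_mutation(chromosome, mutation_prob=0.1, sigma=0.1, rng=None):
    """Mutate each gene with probability mutation_prob by adding N(0, sigma) noise."""
    rng = rng or np.random.default_rng()
    chromosome = np.array(chromosome, dtype=float)        # work on a copy
    mask = rng.random(chromosome.shape) < mutation_prob   # genes to mutate
    chromosome[mask] += rng.normal(0.0, sigma, size=int(mask.sum()))
    return chromosome
```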

Crossover function: I used single-point crossover with 0.5 probability. This means that a new solution obtained by mating 2 parent solutions has a probability of 0.5 of being produced by single-point crossover from the two parents and a 0.5 probability of being equal to parent 1. Single-point crossover selects a random index, i, from 0 to the length of the chromosomes, and creates a new chromosome made up of the genes of parent 1's chromosome up to position i and the genes of parent 2's chromosome from position i to the end.
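
A corresponding sketch of the crossover step, under the same caveat:

```python
import numpy as np

def single_point_crossover(parent_1, parent_2, crossover_prob=0.5, rng=None):
    """With probability crossover_prob, splice the two parents at a random index;
    otherwise return a copy of parent 1."""
    rng = rng or np.random.default_rng()
    parent_1 = np.asarray(parent_1, dtype=float)
    parent_2 = np.asarray(parent_2, dtype=float)
    if rng.random() >= crossover_prob:
        return parent_1.copy()
    i = rng.integers(0, len(parent_1) + 1)   # crossover point, 0..len inclusive
    return np.concatenate([parent_1[:i], parent_2[i:]])
```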

Fitness function: the fitness assigned to each solution is the total reward obtained in an episode. Notice again that this does not make use of the detailed step-by-step reward provided by the environment.

Selection function: at the end of each generation, parents are chosen by sampling solutions from a Boltzmann distribution depending on their fitnesses. This means that a solution has a probability of being selected as a parent equal to

pᵢ = exp(fᵢ / T) / Σⱼ exp(fⱼ / T)

where fᵢ is the fitness of the solution considered and the sum in the denominator runs over all the solutions in the population. The temperature, T, of the Boltzmann selection regulates the selection pressure on the solutions: at high temperature the solutions have a similar probability to be selected, while at low temperature the solutions with higher fitness have a significantly higher probability of being selected. I started with a temperature T=1 and slowly decreased it with the number of generations n as

Image by Author.

down to a minimum of 0.1, after which I held it fixed.
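
Here is a sketch of Boltzmann selection with a temperature floor; the precise decay schedule used in the experiment is not reproduced, only its general shape (start at T=1, decay with the generation count, clamp at 0.1):

```python
import numpy as np

def boltzmann_selection(fitnesses, n_parents, temperature, rng=None):
    """Sample parent indices with probability proportional to exp(f_i / T)."""
    rng = rng or np.random.default_rng()
    fitnesses = np.asarray(fitnesses, dtype=float)
    temperature = max(temperature, 0.1)        # temperature floor
    logits = fitnesses / temperature
    logits -= logits.max()                     # for numerical stability
    probabilities = np.exp(logits)
    probabilities /= probabilities.sum()
    return rng.choice(len(fitnesses), size=n_parents, p=probabilities)
```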

Elitism: in order to avoid the risk of losing some good solutions due to the random effects of mutations and crossover, the top 10% of the solutions are passed on to the next generation without any change.
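
Elitism then amounts to copying the best few solutions into the next generation verbatim, for example:

```python
def elite(population, fitnesses, elite_fraction=0.1):
    """Return the top elite_fraction of solutions, unchanged."""
    n_elite = max(1, int(elite_fraction * len(population)))
    ranked = sorted(range(len(population)), key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in ranked[:n_elite]]
```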

I chose a population of 50 solutions, each corresponding to a neural network with randomly initialized weights.

The Neural Network

Let's now take a look at the neural network whose weights will constitute the genes of the solutions in the population. I chose a simple architecture: a feedforward neural network with 2 hidden layers of 256 features each. Each layer is passed through a hyperbolic tangent activation function, which constrains the output to the interval [-1, 1]. The output of the last layer is used directly as the action. It is interesting to note that, in the usual gradient descent setting, the use of tanh activations for intermediate layers would lead to saturating gradients and would perhaps not be a good choice. In our case, instead, we don't expect such problems since we have no gradients, and the tanh activation indeed seems to work quite well. In fact, the saturating parts of the activation could even be beneficial in our setting, as they encourage diversity of the solutions in the population: different weights can produce similar activations.
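
Since the chromosome of each solution is just a flattened vector of the network's weights, the policy can be written as a plain NumPy forward pass. The layer sizes follow the description above; the flattening convention, the presence of biases, and the initialization scale below are my own choices for illustration.

```python
import numpy as np

OBS_DIM, HIDDEN, ACT_DIM = 8, 256, 2
# Shapes of the weight matrices and bias vectors, in flattening order.
SHAPES = [(OBS_DIM, HIDDEN), (HIDDEN,),
          (HIDDEN, HIDDEN), (HIDDEN,),
          (HIDDEN, ACT_DIM), (ACT_DIM,)]
N_GENES = int(sum(np.prod(s) for s in SHAPES))

def unflatten(chromosome):
    """Split a flat gene vector back into weight matrices and bias vectors."""
    params, start = [], 0
    for shape in SHAPES:
        size = int(np.prod(shape))
        params.append(np.asarray(chromosome[start:start + size]).reshape(shape))
        start += size
    return params

def policy(observation, chromosome):
    """Map an 8-D observation to a 2-D action in [-1, 1] using tanh layers."""
    observation = np.asarray(observation, dtype=float)
    w1, b1, w2, b2, w3, b3 = unflatten(chromosome)
    h1 = np.tanh(observation @ w1 + b1)
    h2 = np.tanh(h1 @ w2 + b2)
    return np.tanh(h2 @ w3 + b3)      # used directly as the environment action
```

A population of 50 such chromosomes could then be initialized with, e.g., `np.random.default_rng().normal(0, 0.1, size=(50, N_GENES))`, where the 0.1 scale is just an illustrative choice.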

Results

Let’s finally visualize the result of my little experiment!

As a disclaimer, I didn't have much time to extensively optimize the parameters of the algorithm or the neural network architecture, so I would not be surprised if the results could be easily improved. Also, the evolution is terminated at 180 generations. This value is completely arbitrary: I stopped there due to time constraints. The data seem to indicate that the performance was still improving, so letting the algorithm run for more generations could yield better results.

Below is the graph of the best fitness of each generation

Image by Author.

and the average fitness of all the solutions in the population at each generation (the shaded area represents the region [mean-standard deviation, mean+standard deviation])

Image by Author.

As we can see, not only does the best solution of each generation dramatically improve, with a sharp rise after 100 generations, but the mean fitness of all the solutions in the population also steadily increases over time. Thus, the population as a whole is evolving to solve the task better and better!

Let’s visualize the evolution process by looking at episodes generated by the solution with the highest fitness at different generations.

Generation 50

5 episodes generated using the highest fitness solution at generation 50. Image by Author.

Generation 100

5 episodes generated using the highest fitness solution at generation 100. Image by Author.

Generation 150

5 episodes generated using the highest fitness solution at generation 150. Image by Author.

Generation 180

5 episodes generated using the highest fitness solution at generation 180. Image by Author.

We see an impressive improvement in the skills of the latest generations!

While after 50 generations the neural network has learned nothing but free fall, by generation 100 we start to see stability in the orientation of the lander, with some attempts at a soft landing. By generation 150 the best solution is able to avoid crashes, and at the final generation (180) we observe a perfect landing on target!

A non-trivial problem is to find, amongst the explored solutions, the one that performs the task best. In fact, the fitness of each solution is evaluated at the end of a single episode and, due to the stochasticity of the environment, the total reward of the same solution across different episodes has a large variability. Because of this, if we were to pick the solution corresponding to the highest recorded fitness, we would risk choosing a solution that doesn't perform as well on average but just had a lucky run (either because of a particularly easy environment configuration or because it is adapted to solve only a handful of situations well). A solution to this problem could have been to use as fitness not just the reward of a single episode but the average reward over N such episodes, which could also help make the evolution process more stable. This, however, becomes computationally expensive (since fitness evaluation is the slowest part of the algorithm) as either N or the number of solutions in the population gets large. Alternatively, since the entire population is, on average, improving with the number of generations, it makes sense to restrict attention to the solutions of the last generation: they should be, on average, better than the solutions of previous generations. Still, computing the average reward over 100 episodes for all the solutions in the population was very slow, so I performed the evaluation on the solution with the highest fitness in the last generation. I remark that, as mentioned, there is no guarantee that this is the best solution (by average reward) ever explored, nor even the best of the last generation. However, this solution already scores an impressive 238.84 average reward over 100 episodes!
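
Averaging the return of a single solution over many episodes is straightforward with the `episode_return` sketch from earlier; it is just slow when applied to every solution in the population:

```python
def average_return(policy, n_episodes=100):
    """Average total reward of a fixed policy over n_episodes independent episodes."""
    return sum(episode_return(policy) for _ in range(n_episodes)) / n_episodes
```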

Conclusion

I am very satisfied with the results of my little experiment. After only 180 generations, and with little to no tuning of the parameters or the neural network architecture, the genetic algorithm is able to find a solution that reliably solves the Lunar Lander Continuous task. The algorithm succeeded despite being given no detailed information on the fine-grained reward of the environment, only the cumulative reward at the end of each episode, to simulate a sparse reward environment. I believe that, in general, genetic algorithms can constitute a competitive alternative to the usual reinforcement learning approaches for tasks in which only a sparse reward is available.

I hope you enjoyed this post and I would be happy to hear your thoughts!
