Best practices for Reinforcement Learning

Lifting the curses of time and cardinality.

Nicolas Maquaire
Towards Data Science

--

Machine learning is research-intensive. It involves significantly more uncertainty than classic programming, and this has a real impact on product management and product development.

Image via Shutterstock under license to Nicolas Maquaire.

Developing an intelligent product with good performance is very difficult. In addition, the production environment can be expensive to run. This combination of challenges can make the business model of many startups risky.

In my last article, I described the challenges newcomers face when using artificial intelligence in fog computing. More specifically, I detailed what it takes to run inference at the edge.

In this article, I’ll describe what I believe are some best practices for starting a Reinforcement Learning (RL) project. I’ll do this by illustrating some lessons I learned while replicating DeepMind’s performance on video games. This was a fun side project I worked on.

Google DeepMind achieved human-level or better performance on dozens of Atari games with the same network (see Human-level control through deep reinforcement learning). So then, let’s see if we can achieve the same results and find out which best practices are needed to be successful!

You can find the source code in the following GitHub repository. Additionally, for readers who want to learn how my algorithm works, I published Breakout explained and e-greedy and softmax explained: two Google Colab notebooks where I explain Expected Sarsa and the implementation of the two policies, ɛ-greedy and softmax.

Time and cardinality curses

The main concerns every RL practitioner deals with are uncertainty coupled with a seemingly unlimited number of technical options, and very long training times.

I call these the time and cardinality curses of RL. I believe the best practices for every person or every team starting a reinforcement learning project are:

  1. Build a working prototype, even if it performs poorly or only solves a simpler problem
  2. Try to reduce the training time and memory requirements as much as possible
  3. Improve accuracy by testing different network configurations or technical options
  4. Check, check again, and then check again every line of your code

To these best practices, I would add:

  1. Monitor reliability. Sometimes, luck is not repeatable
  2. Parallelism is your friend. Test different ideas in parallel

Let’s start by tackling a very simple textbook case: the OpenAI Gym Acrobot. Then, we’ll move on to some more challenging games: Breakout and Space Invaders.

If you’re interested in building knowledge before continuing, I recommend reading Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

If you’re actively participating in a project, I recommend reading Andrew Ng’s “Machine Learning Yearning.”

Learning with OpenAI Acrobot

Image by Author

Before tackling complex RL projects, my recommendation is to start with a simple one: you will find more literature on the internet, it will be easier to find solutions, and, more importantly, it’s faster to test new ideas (fail fast, fail good).

OpenAI Gym offers a wealth of environments. Because of my specialization in control systems, I decided to start with the Acrobot. I worked on a very similar project while studying for my engineering degree: the double pendulum.

As illustrated, the acrobot system has two joints and two links, where only the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link above the horizontal line.

Anatomy of the agent

What is Reinforcement learning?

Well, it’s a direct implementation of the idea that we learn by interacting with our environment. In that way, it mimics the brain.

Image by Author

In RL, at each step of the game, we decide the best action and then we retrieve the reward and move into the next state. For the Acrobot, the state consists of the sin() and cos() of the two rotational joint angles and the joint angular velocities.

[cos(theta1) sin(theta1) cos(theta2) sin(theta2) thetaDot1 thetaDot2]
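As a concrete illustration, here is a minimal interaction loop with the Acrobot environment using the classic OpenAI Gym API (newer Gymnasium releases return slightly different tuples from reset() and step()); the random action is only a placeholder for the agent’s policy.

```python
# A minimal sketch of one Acrobot episode with random actions (classic Gym API).
import gym

env = gym.make("Acrobot-v1")
state = env.reset()  # [cos(theta1), sin(theta1), cos(theta2), sin(theta2), thetaDot1, thetaDot2]

done = False
while not done:
    action = env.action_space.sample()               # placeholder: the agent's policy goes here
    next_state, reward, done, info = env.step(action)
    state = next_state

env.close()
```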

In the book Reinforcement Learning: An Introduction, Sutton and Barto describe different Temporal Difference (TD) techniques. TD learning refers to a class of model-free reinforcement learning methods that update value estimates from other learned estimates, without waiting for the end of an episode. In deep RL, a neural network is used to approximate the value function, which estimates how good each action is.

I started with the most common RL algorithms: Expected Sarsa and its special case, Q-Learning. Both algorithms use a policy. Roughly speaking, a policy is the agent’s behavior function. The policy uses the value-function estimate to decide the best action. We will use soft policies (ɛ-greedy and softmax), meaning that every action has a chance of being executed. A policy is ɛ-greedy when the best action is selected with probability 1-ɛ and a random action is chosen with probability ɛ. My favorite, the softmax policy, assigns a preference to each action according to its action-value estimate.
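To make the two policies concrete, here is a minimal NumPy sketch of ɛ-greedy and softmax action selection over a vector of action-value estimates q (the function names are mine, not taken from the repository).

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q, epsilon):
    # With probability epsilon, explore with a random action; otherwise exploit.
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_policy(q, tau):
    # Each action gets a preference proportional to exp(q / tau); the
    # temperature tau controls how far we drift from greedy behaviour.
    prefs = np.exp((np.asarray(q) - np.max(q)) / tau)  # subtract max for numerical stability
    probs = prefs / prefs.sum()
    return int(rng.choice(len(q), p=probs))
```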

In Human-level control through deep reinforcement learning, DeepMind uses Q-Learning with an ɛ-greedy policy.

Each of the two algorithms we are going to use comes with a few hyperparameters and many options. To name the most important ones: the learning rate (𝜶) and discount rate (𝞬), the batch size, epsilon (ɛ) or tau (𝝉) for finding the right balance between exploration and exploitation, the size of the experience replay memory, the number of exploration steps, the number of annealing steps, the model update frequency, the weight initialization, and the optimizer.
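For reference, here is what such a configuration might look like in code. The values below are illustrative starting points of the kind found in DQN-style tutorials, not the exact settings behind the runs in this article.

```python
# Illustrative hyperparameter values only -- adjust per environment and algorithm.
hyperparams = {
    "learning_rate": 1e-4,          # alpha
    "discount_rate": 0.99,          # gamma
    "batch_size": 32,
    "epsilon_start": 1.0,           # e-greedy exploration
    "epsilon_end": 0.1,
    "annealing_steps": 1_000_000,   # steps over which epsilon decays
    "tau": 0.5,                     # softmax temperature (if using softmax instead)
    "replay_memory_size": 1_000_000,
    "exploration_steps": 50_000,    # steps played before learning starts
    "model_update_frequency": 4,    # one gradient step every N environment steps
    "target_update_frequency": 10_000,  # copy weights to the target network
}
```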

Of course, the list of hyperparameters goes on and on.

This is what I mean by the curse of cardinality.

The curse of cardinality. Image via Shutterstock under license to Nicolas Maquaire.

This is why my first best practice is to build a “working” prototype first, and only then focus on performance.

So then, how can we lift the curse of cardinality?

First, I recommend searching the web for similar implementations to understand the hyperparameters other practitioners use. Beyond helping you reach success, this will help you develop your intuition, which is extremely important. As a matter of fact, ‘intuition’ is one of the words Andrew Ng uses the most in his fantastic Deep Learning Specialization course.

On my first attempt to rock the Acrobot, a few of my runs converged. Below, we can see the success rate over ten games. Of course, there is definitely room for improvement but my algorithm is learning and that’s a good start.

Horizontal axis unit is hours of training. Vertical axis unit is the success rate over ten games. Graph by author.

On the graph you can also see that I trained the networks for more than a week. The networks started to converge around the second day. This teaches us a good lesson: As AI practitioners, we wait and wait and wait. This is the curse of time!

Looking back now, if I tally the number of hours I’ve spent in front of my machine scrutinizing the loss and accuracy of my many attempts, it’s certainly equivalent to receiving a degree in RL psychology! I am very proud to have become an expert in the psychology of RL algorithms.

Some of the AIs I trained perform like champs (the Good). A few are suicidal and perform worse than pure randomness (the Bad). Others have regular burnouts (the Ugly).

The Good (blue), the Bad (green) and the Ugly (red). Graph by author.

Developing your intuition helps you find solutions to your problems. Here, the Bad had a problem with best-action selection and the Ugly had an issue with its target network.

Much like the bias and variance methodology in deep learning, developing your own diagnostic intuition lets you pinpoint problems more quickly.

Now that we have a champ, let’s see how we can improve its performance.

Training the agent

GPU vs CPU

One of the main differences between Deep Learning and Reinforcement Learning is that there is no pre-existing dataset. The dataset (the experience replay, or memory, in RL) is created as the agent interacts with the environment. This induces a performance bottleneck because the pipeline depends on CPU operations, which is why most tutorials run their TensorFlow operations on the CPU. I am pretty sure your first attempts will perform better with all TensorFlow operations placed on the CPU.

Training on the GPU is absolutely not automagical. But when done correctly, it improves speed as you can see from the graph below.

Horizontal axis in hours of training. Red: most operations on the GPU; blue: operations on the CPU only. Graph by author.

The performance bottleneck is created by moving zillions of small objects back and forth between the CPU’s memory and the GPU’s memory. Therefore, it’s important to understand where you create your TensorFlow variables and how you leverage the eager execution of TensorFlow 2.x.

On my machine, the limiting factor is often the CPU’s memory, which stores all the images of the experience buffers. I usually don’t see high GPU memory consumption because the data is created and managed in the CPU’s memory. I use the TensorFlow data API (tf.data) to create a dataset that loads the images and feeds them to the GPU. I also don’t see a high GPU utilization rate because each process is waiting both on the CPU to play the next step and on the dataset to yield a mini-batch.
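Here is a minimal sketch of that idea with the tf.data API (TensorFlow 2.4+ for output_signature): a Python generator samples mini-batches from a replay buffer living in CPU memory, and prefetch() overlaps that CPU work with training on the GPU. The replay_buffer object, shapes, and dtypes are placeholders of my own, not the article’s actual code.

```python
import tensorflow as tf

def sample_batches():
    # Runs on the CPU: sample (s, a, r, s', done) mini-batches from the replay buffer.
    while True:
        states, actions, rewards, next_states, dones = replay_buffer.sample(32)
        yield states, actions, rewards, next_states, dones

dataset = (
    tf.data.Dataset.from_generator(
        sample_batches,
        output_signature=(
            tf.TensorSpec(shape=(32, 84, 84, 4), dtype=tf.int16),   # frames stored compactly
            tf.TensorSpec(shape=(32,), dtype=tf.int32),
            tf.TensorSpec(shape=(32,), dtype=tf.float32),
            tf.TensorSpec(shape=(32, 84, 84, 4), dtype=tf.int16),
            tf.TensorSpec(shape=(32,), dtype=tf.float32),
        ),
    )
    .prefetch(tf.data.AUTOTUNE)  # overlap CPU sampling with GPU training
)
```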

Additionally, the data type you use plays an important performance role. Using int16 instead of float32 increases speed and makes it feasible to keep several one-million-transition experience replays on your machine.
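A quick back-of-the-envelope check shows why the data type matters so much, assuming 84x84 greyscale frames (a common preprocessing choice, not necessarily the one used here):

```python
# Memory footprint of one million 84x84 frames: float32 vs int16.
import numpy as np

n_frames, h, w = 1_000_000, 84, 84
print(np.dtype(np.float32).itemsize * n_frames * h * w / 1e9)  # ~28.2 GB
print(np.dtype(np.int16).itemsize * n_frames * h * w / 1e9)    # ~14.1 GB
```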

Being careful with your data type helps you to train multiple networks in parallel and go faster, which further lifts the curse of time.

These steps clearly support our second best practice: Trying to reduce the training time and memory requirements as much as possible.

Hyperparameters and network architecture

Now that we’ve made the algorithm as efficient as possible with regard to speed and memory consumption, we can focus on lifting the curse of cardinality. Of course, we still have a lot of pieces to fit together, but it’s more manageable. We can now launch multiple runs in parallel and get results faster.

I recommend setting all your hyperparameters to the community’s commonly accepted values. The best way to find these values is to look for papers that dig into similar use cases and see what parameters they use. Then, do a manual search on the 2 or 3 most important hyperparameters. Don’t forget to use powers of 10 to effectively sweep the entire range of each hyperparameter, as in the sketch below. Some are really sensitive (particularly the softmax temperature).
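A powers-of-ten sweep can be as simple as the following sketch, where run_training() is a hypothetical wrapper around your training loop:

```python
# Sweep the learning rate and softmax temperature on a log scale.
import itertools

learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
temperatures = [1e-1, 1e0, 1e1]

for lr, tau in itertools.product(learning_rates, temperatures):
    score = run_training(learning_rate=lr, tau=tau)   # placeholder training function
    print(f"lr={lr:.0e}  tau={tau:.0e}  score={score:.2f}")
```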

For the network architecture, I recommend replicating the architecture from any of the interesting papers you find. Then, try a different number of hidden layers or a different number of nodes. Also, keep in mind that the kernel initializer of your layers is extremely important.

For example, I noticed that Expected Sarsa with a softmax policy does not converge when the last dense layers use variance scaling instead of the default Glorot initialization.
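In Keras, that difference comes down to the kernel_initializer argument of the final dense layers; a minimal illustration (the layer size is arbitrary):

```python
import tensorflow as tf

# Keras default: Glorot (Xavier) uniform initialization.
glorot_head = tf.keras.layers.Dense(
    128, activation="relu", kernel_initializer="glorot_uniform")

# The alternative that did not converge in my Expected Sarsa + softmax runs.
variance_scaling_head = tf.keras.layers.Dense(
    128, activation="relu",
    kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2.0))
```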

To illustrate the improvements you can get from a different number of nodes, I added a new run to the previous CPU/GPU comparison. The only difference between the pink and the red plots is that I swapped the last two layers (from 256 and 128 units to 128 and 256).

Horizontal axis in hours of training. Red with the new layer arrangement. Graph by author.

If you use a classic network architecture and a stack of 4 states as input, hundreds of runs have led me to believe that the most important hyperparameters are:

  1. The learning rate
  2. The exploration and annealing parameters (or temperature for softmax)
  3. The initialization of your layers
  4. The frequency of the main network parameters update
  5. The frequency of the target network updates

The improvements we obtained by following our best practices are pretty good. As you see below, we significantly improved the accuracy and the compute time.

Improvements obtained with GPU and a little network tuning. Graph by author.

Needless to say, this can have a very important impact on the go-to-market timeline and related costs of a new product. This clearly supports our third best practice: improve accuracy by testing different network configurations or technical options.

As a side note to readers who are interested in further improving the accuracy and convergence time of a similar use-case: my next step would be to use tile coding. I’m pretty sure this will further improve both speed and performance.

Thanks to our work on the Acrobot, we now have a good platform to try something more challenging: Reinforcement learning for computer vision.

RL and computer vision with Atari

Breakout and Space Invaders

Image by Author

Breakout is an arcade game developed and published by Atari, released on May 13, 1976. To play: a layer of bricks lines the top third of the screen, and the goal is to destroy them all!

Image by Author

Space Invaders is a 1978 arcade game created by Tomohiro Nishikado. It was the first fixed shooter and set the template for the shoot ’em up genre. The goal is to defeat wave after wave of descending aliens with a horizontally moving laser to earn as many points as possible.

Let’s see how our algorithm performs on these games! As a sneak peek, the two animated GIFs were captured during our eval sessions.

There is not much to do to support these new games. You need to declare the new environment but, most importantly, you need to adapt the network and the data pipeline. For the new network, we will use a classic ConvNet architecture, the same one Google DeepMind used in the Nature paper: three convolution layers followed by dense layers, sketched below. Then, we need to update the data pipeline. This is a bit trickier, as we have to store a million images.
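Here is a sketch of such a DQN-style network in Keras, with the filter sizes reported in the Nature paper; treat it as my reading of that architecture rather than the exact model used for the runs in this article.

```python
import tensorflow as tf

def build_q_network(n_actions):
    # Input: a stack of four 84x84 preprocessed frames.
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(84, 84, 4)),
        tf.keras.layers.Lambda(lambda x: tf.cast(x, tf.float32) / 255.0),  # scale pixels
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(n_actions),   # one action-value estimate per action
    ])
```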

Most of the tutorials you can find on the internet store a history of states. A history of states stacks five consecutive images generated by the environment: the first four estimate the current action-value function and the last four (shifted by one image) estimate the next one. Considering the impact that storing histories has on memory consumption, and our second best practice, we’ll only store the individual states (the images). The data pipeline will re-stack the states on the fly, as in the sketch below.
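A minimal sketch of that on-the-fly stacking: store each frame once and, at sampling time, slice four consecutive frames for the current state and the four frames shifted by one step for the next state. Episode-boundary handling is omitted, and `frames` is a placeholder buffer of individual images.

```python
import numpy as np

def sample_transition(frames, t):
    # Current state: frames t-3 .. t ; next state: frames t-2 .. t+1.
    state = np.stack(frames[t - 3 : t + 1], axis=-1)       # shape (84, 84, 4)
    next_state = np.stack(frames[t - 2 : t + 2], axis=-1)  # shape (84, 84, 4)
    return state, next_state
```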

You can find a detailed explanation of what I did in this Google Colab.

As a side note: if, like me, you use OpenAI Gym to test different algorithms or technical options and want to transfer from OpenAI Gym environments to real-world problems, be aware that it is crucial to train in a stochastic environment. Many tutorials you can find on the internet use deterministic environments. Stochasticity is a crucial requirement for robustness, and it is very unlikely your algorithm will work on real-world problems if you train and validate in a deterministic OpenAI Gym environment. I cover this topic in “Are the space invaders deterministic or stochastic?”

Unfortunately, during this update, I made many mistakes and spent an awful amount of time removing those bugs. Why? Because even with a few mistakes, the network was still learning and converging. The accuracy was obviously far off, but it was working.

This is a crucial difference with classic programming.

When the network is learning and plateauing at a modest score, it’s easy to jump to the conclusion that the hyperparameters need to be adjusted. I fell into this trap quite a few times. Often, you remove a bug from the data pipeline and launch a series of runs only to notice a couple of days later that you have another bug. And this is not limited to the data pipeline. I admit I found bugs in each and every part of my code.

This supports our fourth best practice: Check, check again, and then re-check every line of your code.

In my experience, the best way to handle this is to create a separate notebook that proves each line of code is working. It’s easy to inadvertently use matrix multiplication instead of element-wise multiplication, and it’s easy to make mistakes with TensorFlow or NumPy casting. I encourage you to be very cautious with your TensorFlow code.
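Two of those silent mistakes, written out as tiny notebook checks (purely illustrative):

```python
import numpy as np

# Matrix multiplication and element-wise multiplication are easy to mix up.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])
assert not np.allclose(a @ b, a * b)   # matmul and element-wise give different results

# An integer buffer silently truncates floats you thought you were storing.
rewards = np.array([0.5, 1.0], dtype=np.float32)
buffer = np.zeros(2, dtype=np.int16)
buffer[:] = rewards                    # 0.5 is truncated to 0 without any warning
assert buffer[0] == 0
```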

Conclusion

While we are just scratching the surface of the technical challenges you will have building an intelligent product, I hope this article gave you a good understanding of some best practices to follow to successfully jumpstart your RL project.

As with any research-intensive project, the time and cardinality curses should always be factored into your product management and team organization.

I am a strong believer that machine learning can truly provide the thrust we need to solve many of our rising concerns.

I am happy to help anybody with a solid vision! And, I am very open to feedback.

Thank you for reading this article!
