
Tips for Running High-Fidelity Deep Reinforcement Learning Experiments

In this article, we discuss some techniques for running scientifically rigorous, high-fidelity deep reinforcement learning experiments.


Photo by Jason Leung on Unsplash

Despite incredible recent algorithmic advances in the field, deep reinforcement learning (DRL) remains notorious for being computationally expensive, prone to "silent bugs", and difficult to tune. These difficulties make running high-fidelity, scientifically rigorous reinforcement learning experiments paramount.

In this article, I will discuss a few tips and lessons I’ve learned for mitigating these difficulties in DRL – tips I never would have learned from a reinforcement learning class. Thankfully, I have had the chance to work with some amazing research mentors who have shown me both how, and more importantly, why, the following techniques are so important for running RL experiments:

  1. Set (All) Your Seeds
  2. Run (Some of) Your Seeds
  3. Ablations and Baselines
  4. Visualize Everything
  5. Start Analytic, then Start Simple
  6. When in Doubt, Look to the (GitHub) Stars

I’m sure there are many more tips and tricks from seasoned reinforcement learning practitioners out there, so the above list is by no means exhaustive. In fact, if you have tips and tricks of your own that you’d like to share, please comment them below!

Let’s get started!

1. Set (All) Your Seeds

Photo by Brett Jordan on Unsplash

Being able to reproduce your experiments is crucial for publishing your work, validating a prototype, deploying your framework, and keeping your sanity. Many reinforcement learning algorithms have some degree of randomness/stochasticity built-in, for instance:

  1. How your neural networks are initialized. This can affect the initial value estimates of your value neural networks and the actions selected by your policy neural networks.
  2. The initial state of your agent. This can affect the transitions and rollouts the agent experiences.
  3. If your policy is stochastic, then the actions your agent chooses. This can affect the transitions you sample, and even entire rollouts!
  4. If the environment your agent is in is also stochastic, this can also affect the transitions and rollouts that your agent samples.

As you might have guessed from the points above, one way to ensure reproducibility is to control the randomness of your experiments. This doesn’t mean making your environment deterministic and completely free of randomness, but rather setting seeds for your random number generators (RNGs). This should be done for every package you use that relies on random number generation – for instance, if we use stochastic functions from the Python packages torch, gpytorch, numpy, and random, we can set the relevant seeds using the following function (gpytorch draws its randomness from PyTorch’s RNG, so seeding torch covers it):

import random
import numpy as np
import torch

def set_seeds(seed):
    torch.manual_seed(seed)  # Sets seed for the PyTorch (CPU) RNG
    torch.cuda.manual_seed_all(seed)  # Sets seeds for the RNGs of all GPUs
    np.random.seed(seed)  # Sets seed for the NumPy RNG
    random.seed(seed)  # Sets seed for Python's built-in random RNG

Try it out yourself – if you set the seeds for every source of randomness in your RL experiments, you should see that two runs with the same seed produce identical results! This is a good first step for setting up your RL experiments.
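As a quick sanity check, here is a minimal sketch (assuming the set_seeds function above has been defined, with torch and numpy imported) that verifies re-seeding reproduces the same random draws:

set_seeds(42)
first_torch, first_np = torch.randn(3), np.random.rand(3)

set_seeds(42)
second_torch, second_np = torch.randn(3), np.random.rand(3)

# Both draws should match exactly after re-seeding.
assert torch.equal(first_torch, second_torch)
assert np.allclose(first_np, second_np)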

2. Run (Some of) Your Seeds

When validating your RL framework, it is critical to test your agents and algorithms on multiple seeds. Some seeds will produce better results than others, and by running on just a single seed, you could simply have gotten lucky (or unlucky). In the RL literature, it is commonplace to run anywhere from 4 to 10 random seeds per experiment.

How do we interpret these multi-seed results? One way is to compute means and confidence intervals for your metrics, e.g. for rewards or network losses, as shown in the plot below (and in the code sketch that follows it). This gives you an idea of both:

i. The average performance of your agent (via the mean) across seeds.

ii. The variation of performance of your agent (via the confidence interval) across seeds.

Example plot showing mean performance (solid lines) and confidence intervals (color bars) for different sets of random seeds (in this case, corresponding to different experiments). Image source: author.
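As a rough illustration, here is a minimal sketch of how you might compute and plot the per-step mean and an approximate 95% confidence interval across seeds. The reward array below is placeholder data; in practice it would hold one row of logged rewards per seed:

import matplotlib.pyplot as plt
import numpy as np

# Placeholder reward curves with shape [n_seeds, n_steps].
rewards = np.cumsum(np.random.rand(5, 1000), axis=1)

mean = rewards.mean(axis=0)
# Approximate 95% confidence interval via the standard error of the mean.
ci = 1.96 * rewards.std(axis=0) / np.sqrt(rewards.shape[0])

steps = np.arange(rewards.shape[1])
plt.plot(steps, mean, label="Mean reward across seeds")
plt.fill_between(steps, mean - ci, mean + ci, alpha=0.3, label="95% confidence interval")
plt.xlabel("Training step")
plt.ylabel("Reward")
plt.legend()
plt.show()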

3. Ablations and Baselines

Photo by Anne Nygård on Unsplash

An ablation refers to removing a component from a system. How can you best test the effect of a component in your reinforcement learning system? One way is to run the reinforcement learning algorithm with and without this component – an ablation study. To compare the results fairly, it is critical that these different configurations are run with the same seeds. Running with the same seeds is what allows us to make "apples-to-apples" comparisons between configurations.

A similar, though not equivalent, way to think about your RL experimental procedure is to use baselines – verifiably correct algorithms or routines that your algorithm(s) build on. Running a baseline test answers the question: "How much does my algorithm improve upon what has already been done?"

Example ablation study with different experiments. Image source: author.
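To make this concrete, here is a minimal sketch of how an ablation study over shared seeds might be organized. The run_experiment function and configuration flags below are hypothetical stand-ins for your own training harness:

import random

def run_experiment(config, seed):
    # Hypothetical stand-in for your training loop; returns a final score.
    random.seed(seed)
    score = random.gauss(0.0, 1.0)
    score += 1.0 if config["use_target_network"] else 0.0
    score += 0.5 if config["use_reward_normalization"] else 0.0
    return score

SEEDS = [0, 1, 2, 3, 4]
CONFIGS = {
    "full": {"use_target_network": True, "use_reward_normalization": True},
    "no_target_network": {"use_target_network": False, "use_reward_normalization": True},
    "no_reward_norm": {"use_target_network": True, "use_reward_normalization": False},
}

# Every configuration is run on exactly the same seeds for apples-to-apples comparisons.
results = {name: [run_experiment(cfg, seed) for seed in SEEDS] for name, cfg in CONFIGS.items()}
for name, scores in results.items():
    print(f"{name}: mean score over seeds = {sum(scores) / len(scores):.2f}")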

4. Visualize Everything

Reinforcement learning can be difficult to debug, because sometimes bugs don’t simply manifest as errors – your algorithm may run, but the agent’s performance may be sub-optimal because some quantity isn’t being computed correctly, a network’s weights aren’t being updated, etc. To debug effectively, one strategy is to do what humans do well: visualize! Some useful visualization tools and quantities to consider visualizing include:

a. TensorBoard: This tool can be configured with TensorFlow and PyTorch, and can be used to visualize a multitude of quantities, such as rewards, TD error, and losses (see the logging sketch after this list).

Example of some plots from TensorBoard. Image source: author.
You aren’t restricted to only generating plots with TensorBoard! Check out the Appendix for code you can use to generate GIFs from image data! Image source: [3].

b. Reward surfaces: If your state and action spaces are low-dimensional (or you only want to visualize a subset of their dimensions) and you have a closed-form reward function, you can visualize the reward surface parameterized by states and actions.

Example of a reward surface parameterized over states and actions for the OpenAI Gym Pendulum environment. See the Appendix for code on how to generate this plot.

c. Distributions/histograms of parameters: If your parameters change over time, or if you rerun your parameter sets over multiple experiments, it may also be helpful to visualize the distributions of your parameters to get a sense of your RL model’s performance. Below is an example of visualizing hyperparameters for Gaussian Process Regression.

Example of a parameter/hyperparameter distribution from Gaussian Process Regression. Image source: author.
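For reference, here is a minimal sketch of logging scalars and parameter histograms to TensorBoard from PyTorch using torch.utils.tensorboard; the metric names and the placeholder policy network are illustrative, not part of any particular framework:

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # view with: tensorboard --logdir runs
policy = nn.Linear(4, 2)  # placeholder policy network

for step in range(100):
    episode_reward = float(torch.randn(1))  # placeholder metric; log your real rewards here
    writer.add_scalar("reward/episode", episode_reward, step)

    # Log weight distributions to spot frozen or exploding parameters.
    for name, param in policy.named_parameters():
        writer.add_histogram(f"policy/{name}", param, step)

writer.close()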

5. Start Analytic, then Start Simple

5a. Start Analytic

Before you evaluate your algorithm in a dynamic environment, ask yourself: Is there an analytic function I can evaluate this on? This is especially valuable for tasks in which you are not provided with a ground truth value. There are a multitude of test functions that can be used, ranging in complexity from a simple elementwise sine function to test functions used for optimization [4].

Below is an example of a Rastrigin test function that can be used for optimization:

An example of a Rastrigin test function. Image source: author.
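For reference, here is a minimal sketch of the Rastrigin function that you could use to generate targets for checking whether your function approximator can fit a highly non-convex surface (the grid range of [-5.12, 5.12] is the conventional evaluation domain):

import numpy as np

def rastrigin(x, a=10.0):
    # Rastrigin test function; x has shape (..., n_dims).
    x = np.asarray(x)
    n_dims = x.shape[-1]
    return a * n_dims + np.sum(x**2 - a * np.cos(2 * np.pi * x), axis=-1)

# Evaluate on a 2D grid, e.g. to produce regression targets or a surface plot.
grid = np.linspace(-5.12, 5.12, 200)
xx, yy = np.meshgrid(grid, grid)
zz = rastrigin(np.stack([xx, yy], axis=-1))
print(zz.shape)  # (200, 200)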

Once you’re confident that your model can fit complex analytic forms such as the Rastrigin test function above, you are ready to start testing it on real reinforcement learning environments. Running these analytic tests ensures that your model is capable of approximating complex functions.

5b. Start Simple

You’re now ready to transition your RL model to an environment! But before you evaluate your model on a complex environment, for instance one with a state space of 17 dimensions [1], perhaps it would be better to start your evaluation procedure in an environment with just 4 [1]?

This is the second recommendation of this tip: start with simple environments for evaluating your model, for the following reasons:

(i) They will (generally) run faster and require less computing resources.

(ii) They will (generally) be less susceptible to the "curse of dimensionality" [2].

The OpenAI Gym CartPole-v1 environment is an example of a good starting environment, since its state space has only 4 dimensions, and its action space has only 1 dimension. Image source: author.
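As a quick check (assuming the gym package is installed), you can inspect an environment's dimensionality before committing to it:

import gym

# CartPole-v1: a good starting point with small state and action spaces.
env = gym.make("CartPole-v1")
print(env.observation_space.shape)  # (4,) -> 4-dimensional state space
print(env.action_space)             # Discrete(2) -> a single discrete action dimension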

6. When in Doubt, Look to the (GitHub) Stars

Photo by Astrid Lingnau on Unsplash

We’ve all been there. Our RL code isn’t working, and we have no clear idea why, despite countless hours of debugging and evaluating. One possible reason for this is a poor setting of hyperparameters, which can have profound effects on agent performance – sometimes in very subtle ways.

When in doubt, look around at what’s worked before, and see how your RL configurations, particularly hyperparameter configurations, compare to the tried-and-tested configurations that your fellow RL practitioners have discovered. Here are just a few benchmark resources that may be helpful for this task:

  1. Spinning Up (OpenAI)
  2. OpenAI Baselines
  3. Ray/RLlib RL-Experiments
  4. Stable Baselines
  5. TensorFlow-Agents Benchmarks

Additionally, if you’re using a reinforcement learning package such as RLlib or TensorFlow-Agents, many of the default parameters that come with your RL classes were selected for a reason! Unless you have a strong reason to change them, the defaults were likely chosen to help you build a successful model with little modification needed 🙂

Summary

Photo by Adam Lukomski on Unsplash

Congrats, you made it, and thanks so much for reading! In this article, we talked about the importance of running high-fidelity, scientifically-rigorous experiments in deep reinforcement learning, and some methods through which we can achieve this. Here’s to running higher-fidelity, more reproducible, and more explainable RL experiments!

Again, if you have tips and tricks of your own that you’d like to share, please comment them below!

Acknowledgments

A special thanks to my mentors at MIT Distributed Robotics Laboratory for teaching me these tips. Learning about these techniques has been truly invaluable and has made me a far better researcher.

Thanks for reading 🙂 Please follow me for more articles on reinforcement learning, computer vision, programming, and optimization!

References

[1] Brockman, Greg, et al. "OpenAI Gym." arXiv preprint arXiv:1606.01540 (2016).

[2] Keogh E., Mueen A. (2017) Curse of Dimensionality. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1_192.

[3] Wilko Schwarting, Tim Seyde, Igor Gilitschenski, Lucas Liebenwein, Ryan Sander, Sertac Karaman, and Daniela Rus. "Deep Latent Competition: Learning to Race Using Visual Control Policies in Latent Space." November 2020.

[4] Test Functions for Optimization, https://en.wikipedia.org/wiki/Test_functions_for_optimization.

Appendix

Generating Analytic Reward Functions
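Below is a minimal sketch of how a reward surface like the one in Tip 4 can be generated, using the closed-form reward of the OpenAI Gym Pendulum environment (a quadratic penalty on the normalized angle, angular velocity, and applied torque), with the angular velocity fixed at zero for simplicity:

import matplotlib.pyplot as plt
import numpy as np

def pendulum_reward(theta, torque, theta_dot=0.0):
    # Closed-form reward of the OpenAI Gym Pendulum environment.
    theta = ((theta + np.pi) % (2 * np.pi)) - np.pi  # normalize angle to [-pi, pi)
    return -(theta**2 + 0.1 * theta_dot**2 + 0.001 * torque**2)

# Evaluate the reward over a grid of states (angles) and actions (torques).
thetas = np.linspace(-np.pi, np.pi, 200)
torques = np.linspace(-2.0, 2.0, 200)  # Pendulum's torque limits are +/- 2.0
theta_grid, torque_grid = np.meshgrid(thetas, torques)
reward_grid = pendulum_reward(theta_grid, torque_grid)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(theta_grid, torque_grid, reward_grid, cmap="viridis")
ax.set_xlabel("Angle (rad)")
ax.set_ylabel("Torque (N·m)")
ax.set_zlabel("Reward")
plt.show()

Generating GIFs from Image Data

Here is a minimal sketch of turning a directory of saved frames into a GIF with the imageio package; the frames/ directory and file naming pattern are hypothetical:

import glob

import imageio

# Hypothetical directory of sequentially numbered frames: frame_000.png, frame_001.png, ...
frame_paths = sorted(glob.glob("frames/frame_*.png"))
frames = [imageio.imread(path) for path in frame_paths]
imageio.mimsave("rollout.gif", frames)  # frame timing can be adjusted via the GIF writer's options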

