Notes from Industry

Why You (Probably) Shouldn’t Use Reinforcement Learning

Josiah Coad
Towards Data Science
6 min read · Aug 30, 2021


There is a lot of hype around reinforcement learning. And for good reason… it’s quite possibly one of the most important machine learning advancements on the path toward general AI. But beyond general interest, you may eventually come to the question: is it right for your application?

I currently work on a team building vision-enabled robotics, and as a former RL researcher, I was asked to answer this question for my team. Below, I’ve outlined some of the reasons you may not want to use reinforcement learning in your application, or at least think twice before walking down that path. Let’s dive in!

Castle Rock, CA. Image by Author

Extremely Noisy

Below are two learning plots from a game with a max score of 500. So which learning algorithm was better? Trick question. They were exactly the same; the second plot is just a rerun of the first. The only difference between the training session that totally rocked it and learned a perfect policy and the one that miserably failed was the random seed.

Training curves for DQN on CartPole. Image by Author
  • Small changes in random initialization can greatly affect training performance, so reproducing experimental results is challenging.
  • This noise makes it very hard to compare algorithms, hyperparameter settings, etc., because you don’t know whether improved performance comes from the change you made or is just a random artifact.
  • You need to run 20+ training sessions under the exact same conditions to get consistent, robust results (a minimal multi-seed sketch follows this list). This makes iterating on your algorithm very challenging (see the note below about how long these experiments can take!)
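
If you want to see this for yourself, here is a minimal sketch of a multi-seed experiment, assuming Stable-Baselines3 and the Gymnasium CartPole environment; the seed count and step budget are illustrative, not recommendations.

```python
# A minimal sketch: rerun the exact same DQN setup under several random seeds
# and look at the spread. Assumes stable-baselines3 (and gymnasium) installed.
import numpy as np
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

scores = []
for seed in range(5):  # 20+ seeds is more robust; 5 keeps the demo short
    model = DQN("MlpPolicy", "CartPole-v1", seed=seed, verbose=0)
    model.learn(total_timesteps=50_000)
    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=20)
    scores.append(mean_reward)

print(f"mean={np.mean(scores):.1f}, std={np.std(scores):.1f}, per-seed={scores}")
```

The only thing that changes between runs is the seed, yet the per-seed scores can range from near-perfect to hopeless.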

A large number of hyperparameters

One of the most successful algorithms on the market right now is Soft Actor-Critic (SAC), which has nearly 20 hyperparameters to tune. Check for yourself (a sketch of that surface follows the list below)! But that’s not the end of it…

  • In deep RL, you have all the normal deep learning choices related to network architecture: number of layers, nodes per layer, activation function, max pooling, dropout, batch normalization, learning rate, etc.
  • Additionally, you have 10+ hyperparameters specific to RL: buffer size, entropy coefficient, gamma, action noise, etc.
  • Additionally, you have “hyperparameters” in the form of reward shaping (“reward art”) to get the agent to act the way you want it to.
  • Tuning even one of these can be very difficult! See the notes about extreme noise and long training times… now imagine tuning 30+.
  • As with most hyperparameter tuning, there’s not always an intuitive setting for each of these, nor a foolproof way to efficiently find the best values. You’re really just shooting in the dark until something seems to work.
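
To get a feel for the size of that search space, here is a sketch of what just instantiating SAC looks like in Stable-Baselines3. The values shown are roughly the library defaults and are placeholders, not advice, and the network architecture adds another whole layer of choices via policy_kwargs.

```python
# A sketch of the SAC hyperparameter surface in Stable-Baselines3.
# Values are roughly the library defaults; treat them as placeholders.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",             # any continuous-action env works here
    learning_rate=3e-4,        # optimizer step size
    buffer_size=1_000_000,     # replay buffer capacity
    learning_starts=100,       # steps collected before training begins
    batch_size=256,            # minibatch size per gradient update
    tau=0.005,                 # soft target-network update coefficient
    gamma=0.99,                # discount factor
    train_freq=1,              # env steps between gradient updates
    gradient_steps=1,          # gradient updates per training call
    ent_coef="auto",           # entropy coefficient (auto-tuned)
    target_update_interval=1,  # target network update frequency
    target_entropy="auto",     # target for entropy auto-tuning
    policy_kwargs=dict(net_arch=[256, 256]),  # plus all the usual DL choices
)
```

And this is before you touch the reward function at all.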

Still in research and development

As RL is still in its early days, the research community is still working out the kinks in how advances are validated and shared. This causes headaches for those of us who want to use the findings and reproduce the results.

  • Papers are ambiguous about implementation details. You can’t always find the code, and it’s not always clear how to turn some of the complex loss functions into code. Papers also tend to leave out the little hand-wavy tweaks they used to get that superior performance.
  • When code does get out there on the internet, for the reason listed above, implementations differ slightly from one another. This makes it hard to compare the results you’re getting to someone else’s online. Is my comparatively bad performance because I introduced a bug, or because they used a trick I don’t know about?

Hard to debug

  • Recent methods throw the kitchen sink of techniques at the problem to get cutting-edge results. This makes it really hard to keep the code clean, which in turn makes it hard to follow other people’s code or even your own!
  • On a related note, because there are so many moving parts, it’s really easy to introduce bugs and really hard to find them. RL often has multiple networks learning at once, and there’s a lot of randomness in the learning process, so things may work on one run and not the next. Was it a bug you introduced or a fluke of the random seed? Hard to say without running many more experiments. Which takes…. TIME.

Extremely sample inefficient

Model-free learning means we don’t try to build or learn a model of the environment, so the only way to learn a policy is by interacting with the environment directly. On-policy means we can only learn and improve our policy with samples collected by acting under the current policy, i.e. we have to throw all those samples away and collect new ones as soon as we run a single gradient update. PPO, for example, is a state-of-the-art model-free, on-policy algorithm. All of this means we have to interact with the environment a lot (as in millions of steps) before learning a good policy.
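
Here is a conceptual sketch of that on-policy loop, assuming a Gymnasium-style environment API; policy, ppo_update, and rollout_len are hypothetical stand-ins for illustration, not a real library interface.

```python
# Conceptual sketch of on-policy training: collect a fresh batch of experience
# with the current policy, update on it, then throw it away and start over.
def train_on_policy(env, policy, ppo_update, n_iterations, rollout_len=2048):
    for _ in range(n_iterations):
        rollout = []                                  # fresh buffer every iteration
        obs, _ = env.reset()
        for _ in range(rollout_len):
            action = policy.act(obs)                  # act with the *current* policy
            next_obs, reward, terminated, truncated, _ = env.step(action)
            rollout.append((obs, action, reward))
            obs = env.reset()[0] if (terminated or truncated) else next_obs
        ppo_update(policy, rollout)                   # a few gradient steps on this batch
        # The rollout was generated by the now-outdated policy, so it gets
        # discarded; every iteration pays for a brand-new batch of env steps.
```

The cost of every gradient update is therefore measured in environment interactions, not just compute.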

All those environment steps may be passable if we have high-level features in a relatively low-fidelity simulator. For example,

Image of Humanoid Environment by https://gym.openai.com/
  • Humanoid takes 5 hours to learn how to walk (2 million steps)

But as soon as we move to low-level features, like image space, our state space grows a lot, which means our network must grow a lot too, e.g. we have to use CNNs.
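
A quick back-of-the-envelope comparison shows the jump in input size; the 84×84×4 frame stack is a common Atari preprocessing choice, used here purely for illustration.

```python
# Rough comparison of input sizes: high-level features vs. pixel observations.
high_level_features = 4        # e.g. CartPole: cart position/velocity, pole angle/velocity
pixel_features = 84 * 84 * 4   # grayscale frames, 4 stacked (a typical DQN-style input)
print(pixel_features)                          # 28224 input values per observation
print(pixel_features // high_level_features)   # ~7000x more inputs to learn from
```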

Atari Phoenix. Image by mybrainongames.com
  • Atari games such as Phoenix take roughly 12 hours (40–200 million steps)

And things get even worse when we start introducing 3D high-fidelity simulators like CARLA.

CARLA Driving Simulator. Image by Unreal Engine
  • Training a car to drive in CARLA takes ~3–5 days (~2 million steps) with a GPU

And it’s even worse if the policy is notably complex.

  • In 2019, OpenAI’s agent (OpenAI Five) beat the world champions at Dota 2. How long did the agent take to train, you ask? 10 months 🙊

What if we wanted to train in the real world instead of a simulator? Here, we are bound by real-time steps (whereas in simulation we could run steps faster than real time). This could take weeks or, even worse, just be entirely intractable. For more on this, look up “the deadly triad of RL”.
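
Some back-of-the-envelope arithmetic makes the point; the 20 Hz control rate is an assumption purely for illustration, and your robot may well be slower.

```python
# Rough wall-clock cost of collecting RL-scale experience in the real world.
control_hz = 20                      # assumed real-world control rate
for n_steps in (2_000_000, 40_000_000):
    days = n_steps / control_hz / 3600 / 24
    print(f"{n_steps:,} steps at {control_hz} Hz ≈ {days:.0f} days of wall-clock interaction")
# ~2M steps is already more than a full day of nonstop interaction; ~40M steps
# is over three weeks, before counting resets, breakages, or human supervision.
```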

Sim 2 real gap

What if we wanted to train in a simulator and then deploy in the real world? This is the case with most robotics applications. However, even if an agent learns to play well in a simulator, that doesn’t necessarily mean it will transfer to the real world. It depends on how good the simulator is. Ideally, we’d make the simulator as close to real life as possible, but see the previous section for the problem with high-fidelity simulators.

Unpredictability & Inexplainability

  • Even a well-trained RL agent can be unpredictable in the wild. We may try to punish disastrous behaviors severely, but we still have no guarantee that the agent won’t choose such an action, since in the end we are only optimizing the expectation of total reward (see the objective written out after this list).
  • Explainability: this is a problem with deep learning in general, but in reinforcement learning it takes on new importance, since the networks are often choosing how to move physical machinery that could damage people or property (as in self-driving or robotics). The RL agent may make a disastrous control decision, and we have no idea exactly why, which in turn means we don’t know how to prevent it in the future.
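
To make the expectation point concrete, the standard RL objective maximizes the expected discounted return over trajectories drawn from the policy (written here in the usual notation, with discount factor γ and reward r); nothing in it bounds the worst-case trajectory.

```latex
% The usual RL objective: maximize the *expected* discounted return over
% trajectories drawn from the policy. Nothing here constrains the worst case,
% so a severely punished but low-probability disaster can still survive
% in an "optimal" policy.
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
```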

Conclusion

Well, I don’t know if that was depressing or a buzzkill to read. I meant it as a reality check to cut through the hype, so I did go pretty hard. But I should also qualify all these points with the fact that these issues are the very reason RL is such a hot research area, and people are actively working on many, if not all, of these pain points. That makes me optimistic about the future of RL, but recognizing that these are still open problems is what makes me a realistic optimist.

Believe it or not, I wouldn’t totally discount RL for industrial applications… it is really awesome when it works. I’d just make sure you know what you’re getting yourself into so you don’t overpromise and underestimate the timeline. :)
