Reinforcement Learning: The quirks

I have been working on Reinforcement Learning for the past few months and all I can say is: it is different. This is a write-up of the common quirks and frustrations of Reinforcement Learning I have experienced.

I have been applying variations of the A3C and the GA3C algorithm on various OpenAI Gym environments as part of my internship. I did not have any extensive Reinforcement Learning experience before that, apart from some introduction courses, so this was very new to me.

I quickly learned that Reinforcement Learning is very different from the default classification and regression tasks I had done previously — but it is very exciting! Let me tell you about problems I encountered during all of this.

The time

In Deep Reinforcement Learning, training data usually gets collected “on the job”, meaning that the agent starts with no prior knowledge of its environment and collects experiences while interacting with the simulation.

This means that it takes time to collect the data in a memory buffer, and then more time to pass it to the algorithm and update the network weights. This can be optimized with an asynchronous approach, such as the A3C algorithm mentioned above, although that makes the implementation considerably more complicated. As a result, even simple agents take hours to train, and more complex systems need multiple days or even weeks.
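The collect-then-train loop described above can be sketched in a few lines. This is a minimal illustration, not the A3C implementation from my internship: `ToyEnv` is a hypothetical stand-in for an OpenAI Gym environment (it mimics Gym’s `reset`/`step` interface), and the 5-tuple layout of the buffer is one common convention, not the only one.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical stand-in for an OpenAI Gym environment (same reset/step interface)."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Move right (action 1) or left (action 0); reaching position 5 ends the episode.
        self.pos += 1 if action == 1 else -1
        done = self.pos >= 5
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}

def collect_experience(env, buffer, n_steps, policy):
    """Interact with the environment and store (state, action, reward, next_state, done) tuples."""
    state = env.reset()
    for _ in range(n_steps):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

# The agent starts with no prior knowledge, so the initial policy is random.
buffer = deque(maxlen=10_000)  # bounded memory buffer of experiences
collect_experience(ToyEnv(), buffer, n_steps=100, policy=lambda s: random.choice([0, 1]))
```

Every one of those 100 environment steps has to happen before a single gradient update can use the data, which is exactly where the training time goes.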

Computational Power

Working at NVIDIA AI brings some obvious benefits with it: near-unlimited computational power.

All the worse, then, to realize that most Reinforcement Learning work is done on the CPU.

Using the GA3C algorithm, some of the workload moves to the GPU, allowing bigger and more complex networks, but in my experience the bottleneck is usually the CPU, which is very unusual in the rest of the Deep Learning space.

Exploration vs. Exploitation

This one hurts the most. The “Exploration vs. Exploitation” dilemma is a very common problem. Basically, there is a trade-off between exploring your environment and taking the best known action. Let’s look at this in the context of a maze.

If you explore too much, your bot will always try to find new ways to solve the maze better. That sounds good at first, but it also means that if your bot finds the perfect way through the maze, it won’t trust that this is the best route and will try another one the next time, possibly never finding the “best way” again.

If you explore too little and exploit too much, your bot might find one way to solve the maze and just keep taking it. That might mean it walks through every single corridor before reaching the center of the maze, but since that route is good enough, it will keep using it forever.
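In practice this trade-off is often tuned with a single knob. A minimal epsilon-greedy sketch (the decay schedule and its constants below are arbitrary illustrative choices, not values from this post):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action), otherwise exploit (best known action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))      # explore: try something new
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best estimate

# A common recipe: start fully exploratory and decay toward mostly exploiting,
# keeping a small floor so the agent never stops exploring entirely.
epsilon = 1.0
for step in range(1000):
    epsilon = max(0.05, epsilon * 0.995)
```

Set the floor too low and you get the second failure mode above (the bot locks onto a mediocre route); keep epsilon too high and you get the first (the bot never commits to the best one).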

In my experience, the second case happens far more often. I read a brilliant article by Arthur Juliani about exploration techniques, and I can recommend it to anyone experimenting with RL.