How to match DeepMind’s Deep Q-Learning score in Breakout

Fabio M. Graetz
Towards Data Science
9 min read · Aug 26, 2018


If you are as fascinated by Deep Q-Learning as I am but never had the time to understand or implement it, this is for you: in one Jupyter notebook I will 1) briefly explain how Reinforcement Learning differs from Supervised Learning, 2) discuss the theory behind Deep Q-Networks (DQN) by telling you where to find the respective explanations in the papers and what they mean, and 3) show how to implement the components needed to make it work in Python and TensorFlow.

In 2013 a London-based startup called DeepMind published a groundbreaking paper called Playing Atari with Deep Reinforcement Learning on arXiv: the authors presented a variant of Reinforcement Learning called Deep Q-Learning that is able to successfully learn control policies for different Atari 2600 games, receiving only screen pixels as input and a reward when the game score changes. The agent even surpasses human expert players in some of those games! This is an astonishing result because previous "AIs" were limited to one single game, for instance chess, whereas the types and contents of the games in the Arcade Learning Environment vary significantly and yet no adjustment of the architecture, learning algorithm or hyperparameters is needed.

No wonder DeepMind was bought by Google for 500 million dollars. The company has since been one of the leading institutions advancing Deep Learning research, and a follow-up article on DQN was later published in Nature.

Now, five years later, there are even more advanced reinforcement learning algorithms, but in my opinion Deep Q-Learning is still an extremely impressive method that is well worth studying (and, more importantly, getting to work).

I was in the first year of my Ph.D. in Theoretical Astrophysics when I first watched DeepMind’s video showing a DQN agent learning to play the game Breakout and discovering that it could “dig a tunnel” around the side that allows the ball to hit blocks by bouncing behind the wall. I was immediately fascinated and promised myself that I would try to recreate this result once I could afford to dedicate some time to it.

Now, two years later, I have implemented DQN, learned a lot about neural networks and Reinforcement Learning along the way, and therefore decided to write a hopefully comprehensive tutorial on how to make it work for other people who, like me, are fascinated by Deep Learning.

Deep Q-Learning needs several components to work: an environment which the agent can explore and learn from, preprocessing of the frames of the Atari games, two convolutional neural networks, an answer to the exploration-exploitation dilemma (ε-greedy), rules for updating the neural networks' parameters, error clipping, and a buffer called the Replay Memory in which past game transitions are stored and from which they are sampled during learning.
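To give you a taste of one of these components, here is a minimal replay memory sketch. It is illustrative only (the class and method names are my own, not the notebook's), and a real buffer would store the stacked frames far more memory-efficiently.

import random
from collections import deque

import numpy as np

class ReplayMemory:
    # Minimal replay buffer: store transitions, sample random minibatches.

    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, terminals = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, terminals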

I describe the individual components in detail, each immediately followed by its implementation. If you want to learn more, continue reading in the notebook:

Here I will describe the experiments I performed using the notebook and, more importantly, the small details I discovered while performing them that proved to be crucial to making DQN work well: several threads discuss problems with matching the scores DeepMind reported (see here, here and here), and I initially struggled with similar problems.

Let’s get started: Pong

I started my experiments with the environment Pong, which a DQN agent learns relatively quickly because of its simplicity: the agent controls a paddle that can be moved up and down, and the goal is to hit the ball so that the opponent cannot reach it. The game is over once one player reaches a score of 21 points.

After 30 minutes of training

The implementation in the notebook creates a gif after every epoch, which allows you to observe the agent learn. I personally never get tired of watching the network's improvements, amazed by the thought that humans figured out how to make a machine learn to play these games simply by looking at them.

At the beginning of training, the DQN agent performs only random actions and thus gets a reward of around -20 (which means that it loses hopelessly). After 30 to 45 minutes of training, the agent has already learned to hit the ball and is able to score its first points.

The solid line shows the training reward (training episode scores averaged over the last 100 episodes) and the dashed line the evaluation score (average episode score after 10,000 frames of greedy gameplay). One epoch is equivalent to 50,000 parameter updates or 200,000 frames.

Training and evaluation reward for the environment Pong
DQN winning like a boss

The evaluation score quickly reaches the maximum value of +21. The training score remains lower because of the ε-greedy policy with annealing epsilon used during training. In the gif on the left, all points are scored in almost exactly the same way, meaning that the agent discovered a nearly ideal strategy. I'd love to see two DQN agents learn to play against each other, but currently OpenAI's gym does not support multiplayer Pong.
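The annealing schedule can be as simple as a linear interpolation. The sketch below is illustrative: the function names are my own, and the constants (epsilon annealed from 1.0 to 0.1 over the first million frames) are the ones listed in Extended Data Table 1 of Mnih et al. 2015, not necessarily what you will end up using.

import numpy as np

def annealed_epsilon(frame_number, eps_initial=1.0, eps_final=0.1, annealing_frames=1000000):
    # Linearly anneal epsilon from eps_initial to eps_final over annealing_frames,
    # then keep it constant at eps_final.
    slope = (eps_final - eps_initial) / annealing_frames
    return max(eps_final, eps_initial + slope * frame_number)

def epsilon_greedy_action(q_values, epsilon, rng=np.random):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if rng.uniform() < epsilon:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))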

Next: Breakout

Let us take a look at Breakout, the environment shown in the video that initially made me want to implement DQN myself: the agent digs a tunnel at the side of the screen that allows the ball to hit blocks by bouncing behind the wall.

In the Nature paper (Mnih et al. 2015), DeepMind reports an evaluation score of 317 for Breakout. I ran the notebook on Breakout overnight hoping for a similar result. After the initial success with Pong, I excitedly woke up the next morning to check on the agent's progress, only to discover that the reward had hit a plateau at around 35 without any further improvement. The same problem regarding DQN and Breakout (without a final answer to what the problem is) was reported here: DQN solution results peak at ~35 reward.

The problem with online courses (which I love a lot, don't get me wrong) is that everything has a tendency to work right away, and things only start to get really exciting (and of course sometimes frustrating) once they don't go as planned and you need to figure out why.

Breakout, first try

By reducing the learning rate I got the reward up to around 50, which was still only about 15% of what DeepMind reported.

During the following weeks, I continued to add improvements to the code, ran experiments overnight or longer, and gradually worked my way up to a training score of close to 300 and a maximum evaluation score of around 400.

Breakout, first vs last try

Wang et al. 2016 report a score of 418.5 (Double DQN) and 345.3 (Dueling DQN) for Breakout (Table 2). The best evaluation episode I saw during my experiments had a score of 421 (shown at the beginning of this article). If you want to try this yourself, clone the repository and run the notebook. The corresponding network parameters are included, and since I trained Breakout using a deterministic environment (BreakoutDeterministic-v4), you should get the same score.
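If you want to inspect the environment before training, a few lines with the classic gym API (the version current when this article was written) are enough; the printed action meanings below are what I would expect for the four-action Breakout ROM.

import gym

# BreakoutDeterministic-v4 repeats each action for exactly 4 frames,
# which removes the frame-skip randomness and makes runs reproducible.
env = gym.make('BreakoutDeterministic-v4')
print(env.unwrapped.get_action_meanings())  # expected: ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
print(env.action_space.n)                   # 4

frame = env.reset()
frame, reward, terminal, info = env.step(env.action_space.sample())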

Which adjustments were needed to achieve this improvement?

  1. Use the right initializer! DQN uses the ReLU activation function, and the right initializer is He et al. 2015, equation 10 (click here for a detailed explanation). In TensorFlow, use tf.variance_scaling_initializer with scale = 2 (see the sketch after this list).
  2. Make sure you are updating the target network at the right frequency: the paper says that the target network update frequency is measured in the number of parameter updates (which occur every four frames, see Extended Data Table 1 in Mnih et al. 2015), whereas in the DeepMind code it is measured in the number of action choices/frames the agent sees. Let's take a look at the DeepMind code: here you can see that by default update_freq=4 and target_q=10000 (lines 14 and 31). Here you can see that by default parameter updates occur every 4 steps (self.numSteps%self.update_freq==0) and here that the target network is updated every 10,000 steps (self.numSteps%self.target_q==1); the target network update frequency is thus measured in numSteps and not in parameter updates.
  3. Make sure that your agent is actually trying to learn the same task as in the DeepMind paper!
    I found that passing the terminal state to the replay memory when a life is lost (as DeepMind did) makes a huge difference. This makes sense, since otherwise there is no negative reward for losing a life and the agent does "not notice that losing a life is bad".
    DeepMind used a minimal set of four actions in Breakout (xitari); several versions of OpenAI gym's Breakout have six actions. Additional actions can drastically alter the difficulty of the task the agent is trying to learn! The Breakout-v4 and BreakoutDeterministic-v4 environments have four actions (check with env.unwrapped.get_action_meanings()).
  4. Use the Huber loss function (Mnih et al. 2013 call this error clipping) to avoid exploding gradients. Gradients are clipped to a certain threshold value if they exceed it. Observe that, in comparison to the quadratic loss function, the derivative of the green curve in the plot shown below does not increase (or decrease) for x>1 (or x<−1).
Error clipping to avoid exploding gradients
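To make these four adjustments concrete, here is a minimal sketch (TensorFlow 1.x, classic gym API) of how they could be wired together. It is not the notebook's actual code: the network is reduced to its essentials, the optimizer and learning rate are placeholders for whatever you tune, and the replay memory, ε-greedy policy and target network only appear as comments.

import gym
import tensorflow as tf  # TensorFlow 1.x API

# --- Adjustments 1 and 4: He initializer and Huber loss -----------------------
n_actions = 4
state_ph  = tf.placeholder(tf.float32, [None, 84, 84, 4])  # stacked, preprocessed frames
action_ph = tf.placeholder(tf.int32, [None])
target_ph = tf.placeholder(tf.float32, [None])  # r + gamma * max_a' Q_target(s', a')

init = tf.variance_scaling_initializer(scale=2.0)  # He et al. 2015 for ReLU layers
x = tf.layers.conv2d(state_ph, 32, 8, strides=4, activation=tf.nn.relu, kernel_initializer=init)
x = tf.layers.conv2d(x, 64, 4, strides=2, activation=tf.nn.relu, kernel_initializer=init)
x = tf.layers.conv2d(x, 64, 3, strides=1, activation=tf.nn.relu, kernel_initializer=init)
x = tf.layers.dense(tf.layers.flatten(x), 512, activation=tf.nn.relu, kernel_initializer=init)
q_values = tf.layers.dense(x, n_actions, kernel_initializer=init)

q_taken = tf.reduce_sum(q_values * tf.one_hot(action_ph, n_actions), axis=1)
loss = tf.losses.huber_loss(labels=target_ph, predictions=q_taken)  # "error clipping"
train_op = tf.train.AdamOptimizer(learning_rate=1e-5).minimize(loss)  # optimizer and lr illustrative

# --- Adjustments 2 and 3: step-based update cadence, terminal on life loss ----
UPDATE_FREQ = 4             # one parameter update every 4 action choices
TARGET_UPDATE_FREQ = 10000  # copy online weights to the target network every 10,000 action choices

env = gym.make('BreakoutDeterministic-v4')
frame = env.reset()
lives = env.unwrapped.ale.lives()
num_steps = 0

for _ in range(1000):  # shortened loop for illustration
    num_steps += 1
    frame, reward, terminal, info = env.step(env.action_space.sample())  # replace with epsilon-greedy

    life_lost = info['ale.lives'] < lives
    lives = info['ale.lives']
    # Store the transition with terminal=True whenever a life is lost,
    # even though the episode itself continues:
    # replay_memory.add(state, action, reward, next_state, terminal or life_lost)

    if num_steps % UPDATE_FREQ == 0:
        pass  # sample a minibatch from the replay memory and run train_op here

    if num_steps % TARGET_UPDATE_FREQ == 0:
        pass  # copy the online network's weights to the target network here

    if terminal:
        frame = env.reset()
        lives = env.unwrapped.ale.lives()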

Feel free to play with the notebook. If you invest GPU time to optimize hyperparameters for better performance or if you try it with other games, please write me in the comments; I'd be thrilled to know.

If you want to train the network yourself, set TRAIN to True in the first cell of the notebook.

Consider making your computer accessible remotely, which I described in this blog post: Accessing your Deep Learning station remotely and setting up wake on lan. You can convert the notebook to a Python script using jupyter-nbconvert --to script DQN.ipynb and then run it in a tmux session, which I described here: jupyter and tensorboard in tmux. This has the advantage that you can detach from the tmux session and reattach to it remotely in order to check on the agent's progress or make changes to the code wherever you are.

Summaries in TensorBoard

If you want to use TensorBoard to monitor the network's improvements, type tensorboard --logdir=summaries in a terminal in which the respective virtual environment is activated. Open a browser and go to http://localhost:6006. This works remotely as well.
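If you write your own summaries, a couple of scalar ops are enough to get plots like the ones above. The names and values below are illustrative, not the ones the notebook uses.

import tensorflow as tf  # TensorFlow 1.x API

reward_ph = tf.placeholder(tf.float32, name='reward_placeholder')
summary_op = tf.summary.merge([tf.summary.scalar('training_reward', reward_ph)])

with tf.Session() as sess:
    writer = tf.summary.FileWriter('summaries', sess.graph)
    # After every epoch, log the averaged training reward (21.0 is just a dummy value):
    summ = sess.run(summary_op, feed_dict={reward_ph: 21.0})
    writer.add_summary(summ, global_step=1)
    writer.flush()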

I found DQN quite fiddly to get working. There are many little details to get right or it doesn’t work well. In addition to that, the experiments often have to run overnight (at least) which means that debugging and improving is slow if you have just one GPU.

Regardless of that, it was time well spent in my opinion, as I learned a LOT about neural networks and Reinforcement Learning, especially through the debugging and improving. Look at the gif below: at one point the agent starts to aim directly at the sides to dig a tunnel and hit the blocks from above, exactly what caught my attention two years ago in DeepMind's video and what I wanted to replicate.

Observe how the agent starts to dig a tunnel at the side :)

I therefore encourage you to try to implement DQN from scratch yourself and, where necessary, look at my code for comparison to find out how I implemented the different components. Feel free to write me a message if you need clarification or hints; I'd love to help.

Have fun :)

Fabio

