Advanced DQNs: Playing Pac-man with Deep Reinforcement Learning

Jake Grigsby
Towards Data Science
22 min read · Jun 29, 2018


art by Yojama

In 2013, DeepMind published the first version of its Deep Q-Network (DQN), a computer program capable of human-level performance on a number of classic Atari 2600 games. Just like a human, the algorithm played based on its vision of the screen. Starting from scratch, it discovered gameplay strategies that let it meet (and in many cases, exceed) human benchmarks. In the years since, researchers have made a number of improvements that super-charge performance and solve games faster than ever before. We’ve been working to implement these advancements in Keras — the open source, highly accessible machine learning framework — and in this post, we’ll walk through the details of how they work and how they can be used to master Ms. Pac-man.

A Brief Introduction to Reinforcement Learning

DQN, and similar algorithms like AlphaGo and TRPO, fall under the category of reinforcement learning (RL), a subset of machine learning. In reinforcement learning, an agent exists within an environment and looks to maximize some kind of reward. It takes an action, which changes the environment and feeds it the reward associated with that change. Then it takes a look at its new state and settles on its next action, repeating the process endlessly or until the environment terminates. This decision-making loop is more formally known as a Markov decision process (MDP).

It’s easy to see how Atari games like Ms. Pac-man fit into the MDP framework. The player looks at the game screen and chooses from the different buttons and joystick positions in an effort to increase their score. But that simplicity hides a daunting task that anyone who’s ever played a video game has probably taken for granted: making instant decisions based on one of the nearly infinite pixel combinations that can show up on screen at any given time, most of which you are very unlikely to have ever encountered before. In case that wasn’t hard enough, video games are technically partially observable MDPs: you are forced to make choices based on an indirect representation of the game (the screen) rather than the code/memory itself, which hides some of the information you need to make a fully-informed decision.

Lucky for us, video games like Pac-man do provide two useful simplifications: frame rates and controllers. The finite number of buttons and joystick positions lets us map that enormous observation space to a discrete and manageable action space, and the job gets simpler still once you realize it doesn’t need to be done continuously, because you can only press one button per frame (most RL agents go further than that and only make new decisions every 3 or 4 frames). The challenge is then difficult but definable: how do we use our experience playing a game to make decisions that increase the score, and generalize that decision-making process to new positions we’ve never been in before?
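To make that loop concrete, here’s a minimal sketch of the observe-act-reward cycle in OpenAI Gym’s Ms. Pac-man environment, with a random stand-in for the agent (the ‘Deterministic-v4’ variant repeats each action for 4 frames, matching the decision rate mentioned above; this is illustrative rather than our exact training script):

```python
import gym

# 'MsPacmanDeterministic-v4' repeats each chosen action for 4 frames,
# so a new decision is only needed every 4th frame.
env = gym.make('MsPacmanDeterministic-v4')

state = env.reset()
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                 # stand-in for the agent's decision
    next_state, reward, done, info = env.step(action)  # environment returns the new observation
    total_reward += reward
    state = next_state                                 # the new observation becomes the current state

print('Episode reward:', total_reward)
```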

The Basics of Deep Q-Networks

DQNs leverage the incredible success of neural networks to solve that challenge. The architecture can be split into two parts. First, a series of convolutional layers learns to detect increasingly abstract features of the game’s input screen (we’ll get into the details of how this works and what it learns to detect later). Then, a dense classifier maps the set of those features present in the current observation to an output layer with one node for every button combination on the controller.

Neural network schematic from DeepMind’s 2015 paper
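As a rough Keras sketch of that architecture (the layer sizes follow the 2015 paper, the 84 x 84 x 4 input is the preprocessed frame stack we’ll cover later, and the exact build inside keras-rl differs in its details):

```python
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten

def build_q_network(n_actions, input_shape=(84, 84, 4)):
    """Convolutional feature extractor followed by a dense Q-value head."""
    model = Sequential()
    model.add(Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (4, 4), strides=2, activation='relu'))
    model.add(Conv2D(64, (3, 3), strides=1, activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(n_actions, activation='linear'))   # one Q value per controller action
    return model

model = build_q_network(n_actions=9)   # Ms. Pac-man exposes 9 button/joystick combinations
```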

These output values are the agent’s best prediction of the reward it can expect from taking each action. This ‘state to predicted reward’ approach is known as value-based learning, and it is centered around the action-value function Q(s,a): the total value of taking action a in state s, measured as the sum of future rewards r, each adjusted by a discount factor gamma that determines how far into the future the agent is expected to plan. The optimal strategy is then written:
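$$
\pi^{*}(s) = \operatorname*{arg\,max}_{a}\, Q(s, a),
\qquad \text{where} \quad
Q(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right]
$$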

Which is just to say that at each step we take the action with the highest predicted value. Pretty straightforward once we can calculate those Q values, but of course our Pac-man agent can’t actually see the future, and it can’t store a unique Q value for every possible state. That’s where the deep learning comes in. By mapping pixel images to Q values, our neural network acts as a Q function approximator: it can’t see the future, but with enough training it can learn to predict it.
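That learning signal comes from a loss function of roughly this form, where theta refers to the weights of the network being trained and theta-minus to the periodically-refreshed copy we’ll describe in a moment:

$$
L(\theta) = \mathbb{E}\Big[\big(\,r + \gamma \max_{a'} Q(s', a';\, \theta^{-}) - Q(s, a;\, \theta)\,\big)^{2}\Big]
$$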

That training is accomplished using gradient descent on the above loss function, which is a form of temporal difference (TD) error. It can be thought of as the difference between the ‘true’ or target Q values and our current estimates of them, where the target value is the immediate reward plus the Q value of the action we will take in the next state. Of course, that value is also calculated by our network, but the overall expression is inherently more accurate because it has access to at least the first reward term. Even so, this is definitely the math equivalent of trying to hit a moving target, as the true values are calculated by the same network we are training. This is the big discrepancy between supervised learning — another subset of machine learning that includes tasks like image classification and sentiment analysis — and reinforcement learning. Supervised learning uses labeled datasets, meaning the target values are set by humans and assumed to be accurate and unchanging. Reinforcement learning creates its own, ever-shifting dataset, both because the network generates its own target values and because the network’s action choices directly affect which states it reaches in its environment, and therefore what it has to learn about.

To help manage that, we take two extra stabilization measures. First, we duplicate the neural network, using the second copy, or target network, to generate the target values, and the original, or online network, to generate the estimates. Only the online network is trained, but we copy its weights over to the target network every so often to refresh it with more optimized parameters. Second, we use something called an experience buffer. The buffer is a dataset of our agent’s past experiences, where an experience is defined as (s, a, r, t, s’): s, a and r keep their previous definitions, t is a boolean that tells the agent whether this was the terminal state of the episode, and s’ is the state that followed s when the agent took action a. You’ll notice that an experience entry contains every variable needed to compute the loss function. So, instead of learning as the agent plays Pac-man, we’re really just moving Pac-man around the screen based on what it has already learned, while adding all of those experiences to the buffer. Then we can take experiences from storage and replay them to the agent so that it can learn from them and take better actions in the future. This is conceptually similar to how humans replay memories in order to learn from them, and the process is fittingly named experience replay. Most importantly, it ensures that the agent learns from its entire history (or at least as much of it as we can practically store before we start overwriting the oldest data), rather than its most recent trajectory.
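Here’s a stripped-down sketch of those two pieces (uniform sampling from a replay buffer, plus a periodically refreshed target network), assuming two Keras models built like the one above; the names are illustrative and keras-rl’s real implementation is considerably more involved:

```python
import random
from collections import deque
import numpy as np

GAMMA = 0.99
replay_buffer = deque(maxlen=1000000)    # oldest experiences are overwritten first

def store(s, a, r, t, s_next):
    """Save one (s, a, r, t, s') experience for later replay."""
    replay_buffer.append((s, a, r, t, s_next))

def train_step(online_model, target_model, batch_size=32):
    """Sample uniformly from the buffer and fit the online network to TD targets."""
    batch = random.sample(replay_buffer, batch_size)
    states = np.array([e[0] for e in batch])
    next_states = np.array([e[4] for e in batch])

    q_values = online_model.predict(states)        # current estimates (the moving target)
    next_q = target_model.predict(next_states)     # bootstrap values from the frozen copy

    for i, (s, a, r, t, s_next) in enumerate(batch):
        q_values[i][a] = r if t else r + GAMMA * np.max(next_q[i])

    online_model.fit(states, q_values, verbose=0)

def refresh_target(online_model, target_model):
    """Every so often, copy the online weights into the target network."""
    target_model.set_weights(online_model.get_weights())
```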

It’s worth noting that these architecture decisions are what classify DQN as an off-policy, model-free algorithm. Model-free because it learns to predict the value associated with a position, but doesn’t attempt to build a model of the inner workings of its environment in order to make predictions about the next state it will see, and off-policy because its training examples are generated by a past version of itself (and therefore a past version of its decision-making policy).

There’s one more important detail to get out of the way, and that’s the exploration issue. There’s a never-ending debate in reinforcement learning over whether it’s best to take a risk and explore new ideas or to exploit what is already known. Imagine what would happen if the algorithm always pressed the button that it thought would lead to the highest reward. It would likely take that same action every single time, never trying anything else and therefore never learning and improving any further. For example, it might learn on its very first attempt that going right from the start zone can get it a couple points, and then just continue to go right no matter how many ghosts or walls stand in its way; it never experimented with going left, so it has no way to accurately estimate the reward it would generate by doing so. The original DQN solves this problem in an arbitrary but surprisingly effective way. We initialize a variable, epsilon, to 1.0. At every step, we generate a random number between 0 and 1. If this number is less than epsilon, we take an action completely at random, regardless of what the agent has to say about that action’s Q value. Because epsilon starts at 1, we do this 100% of the time. But as training progresses, we anneal epsilon down to around 0.1, meaning we take the best action 90% of the time and explore a new, random direction the other 10%. Interestingly, we never let epsilon hit 0 in practice (even during testing/evaluation, where we typically use 0.05); this makes sure that Ms. Pac-man can never get stuck in a corner or stop moving indefinitely. This method is known as epsilon greedy.
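In code, the whole scheme fits in a few lines (a minimal sketch; the annealing length shown here is illustrative):

```python
import random
import numpy as np

EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.1, 1000000

def epsilon_at(step):
    """Linearly anneal epsilon from 1.0 down to 0.1, then hold it there."""
    fraction = min(float(step) / ANNEAL_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)

def epsilon_greedy_action(q_values, step):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if random.random() < epsilon_at(step):
        return random.randrange(len(q_values))     # random exploratory action
    return int(np.argmax(q_values))                # action with the highest predicted Q value
```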

Now that we’re getting into the experiments/code, it’s probably a good time to mention that all of these results (including weight files for all of the trained models), and the scripts that generated them can be found on this project’s GitHub repo. The inner workings of the DQN algorithm are implemented in the open source library keras-rl. However, it may be some time before the more advanced techniques we’ve been working on and will be describing below have been approved for a merge into the master version. In the meantime, you can access them on our fork.

Something to know about DQNs (and RL algorithms in general) is that they are sample inefficient and expensive to train. The standard approach in the DQN literature is to run 200 million frame training sessions. That’s about 930 hours of human-speed gameplay or roughly 20 days of training on a single P4000 GPU (at least for the final DQN variant we’ll be getting to). We don’t have the resources to pull something like that off for every version of the algorithm. Instead, we chose to run these comparison experiments for 10 million frames (enough to highlight the discrepancies), and to train our final, best-performing agent for a more extended 40 million step run.

With that out of the way, it’s time to see how the algorithm we’ve described performs on the task at hand.

Vanilla DQN Takes on Ms. Pac-man

An effective way to monitor and analyze an agent’s success during training is to chart its cumulative reward at the end of each episode.

The reward function is proportional to (but not equal to, and actually much less than) the in-game score, so it doesn’t mean all that much on its own. However, the agent is clearly able to learn something about the game in order to consistently improve over time. Also notice the diminishing return on investment, just like any learning curve — human or machine.

Let’s check out some gameplay and see how it does.

Amazingly, the agent has learned to navigate around the maze. It even does a pretty good job of clearing large areas when there are still dense pockets to be collected. However, it has a very hard time finding its way back to the lonely few dots that slip away, and it seems to have total meltdowns whenever it finds itself stuck in the middle of two pockets and has to decide which one to go after first (like at 0:29). It also does not seem to actively hunt down the ghosts for points when they become vulnerable, and only triggers that behavior by accident, on its way to gathering up all the dots in each corner of the maze.

Double Q-Learning

In 2015, van Hasselt et al. applied double q-learning to DQN, a modification that improved performance on some games. Notice that, in the loss function from above, we use the target network both to calculate the Q values for each action in the next state and to select which of those actions we will want to take (the highest one). It turns out this can lead to some overestimation problems, particularly when the discrepancy between the target network and online network causes them to recommend totally different actions given the same state (as opposed to the same action with slightly different Q values). To solve this, we take away the target network’s responsibility for determining the best action; it simply generates the Q values, and we let the online model decide which to use. More formally, Double DQN generates target values according to:
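$$
y = r + \gamma\, Q\big(s',\ \operatorname*{arg\,max}_{a'} Q(s', a';\, \theta);\ \theta^{-}\big)
$$

with the online network (theta) choosing the action and the target network (theta-minus) evaluating it.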

Dueling Architecture

An important detail of RL theory is that the Q function can actually be split up into the sum of two independent terms: the value function V(s), which represents the value of being in the current state, and an advantage function A(s,a), which represents the relative importance of each action, and helps ensure that the agent takes the best action possible, even if that choice may not have any immediate impact on the game score.

In 2016, Wang et al. published the dueling network architecture, which splits the neural network into two streams that share a convolutional base, one for estimating each function.

Vanilla DQN architecture (top) vs. Dueling DQN (bottom). Dueling estimates the advantage and value functions separately, before combining them to arrive at the overall Q function.

The green arrows in the architecture schematic above stand in for the method we use to combine them. Unfortunately, the naive solution (adding A(s,a) to V(s)) isn’t effective, because the network isn’t incentivized to optimize V and A independently, and therefore we can’t guarantee that those are what it’s learning at all. Instead, we use an alternative approach:
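$$
Q(s, a;\, \theta, \alpha, \beta) = V(s;\, \theta, \beta) + \Big( A(s, a;\, \theta, \alpha) - \max_{a'} A(s, a';\, \theta, \alpha) \Big)
$$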

Where alpha and beta represent the parameters of the advantage and value streams, respectively. Now the network can use the knowledge of its chosen action to push that action’s advantage to 0, so that Q(s,a) is approximately equal to V(s).
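As a sketch of what that looks like in Keras’s functional API (a simplified stand-in for keras-rl’s dueling option, reusing the convolutional base from earlier):

```python
from keras.models import Model
from keras.layers import Input, Conv2D, Flatten, Dense, Lambda
from keras import backend as K

def build_dueling_network(n_actions, input_shape=(84, 84, 4)):
    """Shared convolutional base feeding separate value and advantage streams."""
    frames = Input(shape=input_shape)
    x = Conv2D(32, (8, 8), strides=4, activation='relu')(frames)
    x = Conv2D(64, (4, 4), strides=2, activation='relu')(x)
    x = Conv2D(64, (3, 3), strides=1, activation='relu')(x)
    x = Flatten()(x)

    value = Dense(512, activation='relu')(x)
    value = Dense(1)(value)                         # V(s): one scalar per state

    advantage = Dense(512, activation='relu')(x)
    advantage = Dense(n_actions)(advantage)         # A(s, a): one value per action

    # Q(s, a) = V(s) + (A(s, a) - max_a' A(s, a'))
    q_values = Lambda(lambda streams: streams[0] +
                      (streams[1] - K.max(streams[1], axis=1, keepdims=True)))([value, advantage])

    return Model(inputs=frames, outputs=q_values)
```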

Prioritized Experience Replay

When humans look back to learn from our mistakes, we optimize the process to spend time where it’s most needed. When you fail a midterm and decide you need to do better on the final, you go in and look specifically at the questions you got wrong; you don’t skim through the pages or just check the even multiples of 3. Up until now, our agent has been sampling from its replay buffer uniformly at random. In the context of Pac-man, that means our agent is being presented with frames where it was coasting down a straightaway with no enemies in sight much more often than the one time it entered a four-way intersection with ghosts on three sides, chose wrong, and paid the price. But what if it could learn from the experiences where it is the most wrong?

Schaul et al.’s 2016 paper proposes a solution, known as prioritized experience replay (PER). In PER, we use an additional data structure that keeps a record of the priority of each transition. Experiences are then sampled proportional to their priorities:
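$$
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}
$$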

Where alpha is a hyperparameter that indicates how aggressive we want this sampling bias to be. The priorities themselves are just the magnitude of the agent’s temporal difference error the last time it trained on that experience. In this way, our agent is more likely to learn from its most inaccurate predictions, smoothing out its trouble spots and generally becoming much more sample efficient.

There is also the detail of importance sampling weights, which are meant to compensate for the fact that we are now intentionally feeding the network data that will cause abnormally large gradient updates.
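Each sampled experience’s gradient update is scaled by a weight of the form:

$$
w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}
$$

(in practice these weights are also normalized by the largest weight in the batch, so they only ever scale updates down).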

These importance weights are controlled by a hyperparameter beta, which determines how aggressively we want to compensate for those large updates. The beta value is typically annealed over the duration of the run, rising from an initialization of around 0.6 and ending at 1.0 (full compensation). Even so, PER-equipped agents tend to use lower learning rates (typically around 0.25x) to safeguard against catastrophic collapse.
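A toy numpy version of the sampling scheme (real implementations use a sum-tree so that sampling stays fast with millions of stored experiences):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    """Draw indices proportional to priority^alpha and return their IS weights."""
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()

    indices = np.random.choice(len(priorities), size=batch_size, p=probs)

    weights = (len(priorities) * probs[indices]) ** -beta
    weights /= weights.max()       # normalize so the weights only ever scale updates down
    return indices, weights
```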

Prioritized Double Dueling vs Pac-man

While each of these three improvements can deliver a significant boost on its own, the great thing about them is that they can all exist in the same algorithm, somewhat longwindedly referred to as a Prioritized Double Dueling DQN (PDD). Here is the learning curve, plotted against our previous version:

While their performance is equally poor during the heavy exploration phase of the first few thousand episodes, PDD begins to break away at the 2k episode mark. And while that gap may seem tough to distinguish, the gameplay video tells a better story:

The new additions have allowed our agent to seek out isolated dots with clear intent — no more random wandering around the maze. It also seems to have made the association between picking up the energy pills in the corners and earning a higher reward, most likely from accidentally running into ghosts once that behavior makes them vulnerable. Interestingly, it hasn’t fully fleshed out that strategy, as it typically makes a beeline to each corner of the maze without even waiting for its invincibility to expire and without targeting the fleeing ghosts, whereas an experienced human player might save those powers for dire situations or at the very least space them out to maximize their effect.

Noisy Networks for Exploration

Earlier, we talked about the debate between exploration and exploitation, and the somewhat arbitrary solution called epsilon greedy. The problem with this approach is its reliance on a scheduled hyperparameter. How do we know how much exploration we’ll need to solve a game like Ms. Pac-man? The only real way to find out is to test a whole bunch of different values, a time-consuming and expensive process. It would be much better if the agent could learn to control its exploration the same way it learns to control the rest of its decision-making — through gradient descent. In 2017, Fortunato et al. released the solution that made that possible: Noisy Networks.

Because it calculates the Q values as normal but adds randomness (noise) to the action selection, the epsilon greedy approach falls under the umbrella of action space noise exploration methods. In a Noisy Network, we instead inject noise into the weights of the neural network itself (the parameter space). As a result, the agent will consistently recommend actions that it didn’t ‘intend’ to, ensuring exploration, and we can make action decisions based solely on the highest Q value output.

The difference between action and parameter space noise, from a very similar technique released by OpenAI around the same time.

Specifically, Noisy Networks replace the Dense classifiers in the model with Noisy Dense layers, defined by the operation:
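$$
y = \big(\mu^{w} + \sigma^{w} \odot \varepsilon^{w}\big)\, x + \big(\mu^{b} + \sigma^{b} \odot \varepsilon^{b}\big)
$$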

Where all of the parameters are learnable except for the epsilon terms, which are generated using factorized Gaussian noise before each training and inference step. The agent can then learn to manipulate the other variables, particularly the two sigmas, to tune out the noise — but it can do so at the pace its environment demands, without any arbitrary interference from us. To illustrate this, we can chart the average sigma weight over time. Here is the result from the 10 million step training run we’ll get more into in a little bit:

Because the noise injection is element-wise multiplied by this term, the agent can learn to ignore the noise by setting sigma to 0. It’s interesting to note that this method tends to naturally bottom-out above 0 just like we designed our epsilon schedule to — ensuring at least some level of exploration for the duration of training.
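For reference, here’s a small numpy sketch of the factorized Gaussian noise step described above (variable names and shapes are illustrative, not the exact layer in our fork):

```python
import numpy as np

def scale_noise(x):
    """Noise-scaling function from the NoisyNet paper: f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_dense_forward(x, mu_w, sigma_w, mu_b, sigma_b):
    """One forward pass with a fresh sample of factorized Gaussian noise."""
    n_in, n_out = mu_w.shape
    eps_in = scale_noise(np.random.randn(n_in))     # one noise value per input unit
    eps_out = scale_noise(np.random.randn(n_out))   # one noise value per output unit

    eps_w = np.outer(eps_in, eps_out)               # factorized weight noise
    eps_b = eps_out                                 # bias noise

    weights = mu_w + sigma_w * eps_w                # mu and sigma are learned; epsilon is not
    biases = mu_b + sigma_b * eps_b
    return x.dot(weights) + biases
```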

N-step TD

One last improvement we’ll make is a change to the target in our loss function, which helps the agent more closely approximate the theoretical cumulative discounted reward. Basically, this:
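$$
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a';\, \theta^{-})
$$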

becomes this:
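$$
y_t^{(n)} = r_t + \gamma\, r_{t+1} + \dots + \gamma^{\,n-1}\, r_{t+n-1} + \gamma^{\,n} \max_{a'} Q(s_{t+n}, a';\, \theta^{-})
$$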

in an effort to emulate this:
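$$
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}
$$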

Where n is a hyperparameter typically set to either 3 or 5. The idea here is to improve target accuracy by bootstrapping from sequences of transitions — a method that lies somewhere in the middle of the Temporal Difference -> Monte Carlo spectrum. You can read a lot more about this in Chapter 7 of this textbook.

The Final Result

Once again, the best part of these changes is that we can stack them into the same, super-charged algorithm, presumably called a NoisyNet N-step Prioritized Double Dueling Deep Q-Network, which really rolls off the tongue. When DeepMind added one more thing to this late last year, they started calling it “Rainbow”, either because they were throwing in the towel on the whole name-stacking thing or because a bunch of PhDs sat around a conference room looking at the colors on this chart and had an unfortunate lapse in creativity:

from Rainbow: Combining Improvements in Deep Reinforcement Learning

For the record, that additional feature is distributional RL, in which the agent learns to predict reward distributions for each action. We hope to return to this in the future.

Here is the learning curve for our final DQN variant, plotted against the previous iterations:

These last few extensions go a long way towards improving both overall performance and sample efficiency — with a wire-to-wire victory over both previous versions. One final reminder: 10 million steps is not enough to say anything definitive about overall performance when it comes to DQNs. However, all of the papers we’ve been referencing/implementing show similar results extrapolated over much larger time frames, and our shorter experiments show that the hierarchy holds true even on more manageable training runs.

We left NoisyNstepPDD running for a total of 40 million steps. Now let’s see how it plays.

We’re now consistently clearing the first level, pulling off tight maneuvers between ghosts, navigating towards left-over dots with a purpose, and making good use of the shortcut exits on each side of the screen. Its most common cause of death is running into invisible ghosts, a ‘glitch’ caused by the Atari’s strict sprite-render limits.

What Does DQN ‘See’?

Atari-playing DQNs make sense of the screen with the help of a convolutional neural network (CNN) — a series of convolutional layers that learn to detect relevant game information from the mess of pixel values. Unfortunately, we don’t give it the chance to learn directly from the full game screen — compute resources demand some shortcuts. Instead, we pre-process the raw input, downsizing it from the original 210 x 160 to a more manageable 84 x 84 and converting it to grayscale.

What we would see (left) vs. what our agent sees (right)

Of course, this does mean we are throwing away color information that could be helpful. For example, each colored ghost in Pac-man has its own AI, and expert players can use those differences to forecast where they will go, but it’s unlikely our agent can tell them apart at all. This is done to chop off two of the three channels an RGB image would have, significantly reducing the load on our CNN. Also note that the resize operation even manages to delete the dots to the left and right of the ghosts’ start box; luckily, these are typically picked up by chance as Ms. Pac-man moves to either edge of the maze. These preprocessing steps are a pretty standard practice across the Atari RL domain, so we decided to keep them here, even if they aren’t optimized for this specific game.

We fill the now-empty third dimension with a stack of the 4 most recent frames; this lets the agent get an understanding of the speed and direction of each sprite.

The stack of frames that makes up one complete input.

In the example above, the agent would know that it is rounding a corner counter-clockwise and that a fruit has recently appeared and is moving away from it. A stack of frames also adds some protection against the sprite-flicker effect, assuming each ghost is rendered at least once every 4 frames (which doesn’t always seem to be the case).
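A minimal sketch of that preprocessing pipeline (using OpenCV for the resize; in practice keras-rl’s processor and memory classes handle these steps, so this is just to make the shapes concrete):

```python
from collections import deque
import cv2
import numpy as np

FRAME_SIZE = (84, 84)
frame_stack = deque(maxlen=4)           # the 4 most recent processed frames

def preprocess(raw_frame):
    """Grayscale and downsize a raw (210, 160, 3) Atari frame to 84 x 84."""
    gray = cv2.cvtColor(raw_frame, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, FRAME_SIZE, interpolation=cv2.INTER_AREA)
    return resized.astype(np.uint8)     # uint8 keeps the replay buffer small

def get_state(raw_frame):
    """Append the newest frame and return the stacked (84, 84, 4) network input."""
    frame_stack.append(preprocess(raw_frame))
    while len(frame_stack) < 4:         # pad with copies at the start of an episode
        frame_stack.append(frame_stack[-1])
    return np.stack(frame_stack, axis=-1)
```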

All of our implementations use DeepMind’s CNN architecture from 2015, consisting of three layers. The first learns to recognize 32 low-level features. Conceptually, we might expect one feature channel to track the remaining dots, one to track the ghosts, another to track the player, etc. By accessing the layer’s output and laying its activations over the most recent frame in the input stack, we can get a sense of what they might actually be learning. Probably the most surprising thing we found was our agent’s ghost and player-tracking system; rather than using one channel to keep a constant watch on the ghosts, and another to get Ms. Pac-man’s position in the maze, our agent has learned a radar-by-committee approach, in which a channel might track one ghost at a time, and even switch ghosts depending on the frame.

Ghost-busting by committee: the agent uses an ensemble of feature channels to determine the position of the game’s sprites. Bright white spots indicate areas of high activation. The maze background is added for context.

In this example, channels 13 and 16 seem to track individual ghosts, while channel 22 tracks the player and channel 1 sees multiple sprites at once. The system is far from perfect, as there doesn’t seem to be a feature map that picks up the ghost near the center start zone. That ghost will likely be identified in a future frame, and a good portion of the agent’s success on any given episode seems to be connected to how long these ‘blind spots’ last.

There are other channels that pick up very specific features, even if it’s hard for humans to conceptualize what those features might be:

The agent has learned a number of well-defined features that are difficult to conceptualize and vary by frame. In this case, 29 and 0 might both be looking at the player; 28 and 11 are anyone’s guess.

Even less helpful are the channels of total noise, of which there are still a disappointing number after 40m steps. This would likely clear up as training continued, with noisy channels narrowing to look more like 11 and 28 from above, while 11 and 28 might develop into more useful representations themselves.

High-noise activations

The great thing about CNNs is that their features get more and more abstract as you get deeper into the network. We can use gradient-ascent on the outputs of the third and final convolutional layer to generate class activation maps, indicating which parts of the image most contributed to the agent’s decision. This technique has some similarities to tracking the eye movement/attention of a human playing a video game or watching a movie. The results reveal a decision-making process that a human player might recognize.
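Here’s a rough sketch of how a map like that can be pulled out of the model (a Grad-CAM-style computation; the layer name, channels-last data format, and TensorFlow backend are all assumptions):

```python
import numpy as np
from keras import backend as K

def activation_map(model, state, action_index, conv_layer_name='conv2d_3'):
    """Weight the final conv layer's activations by their gradient w.r.t. one Q value."""
    conv_output = model.get_layer(conv_layer_name).output
    q_value = model.output[:, action_index]               # Q value of the chosen action
    grads = K.gradients(q_value, conv_output)[0]          # d(Q) / d(conv activations)
    fetch = K.function([model.input], [conv_output, grads])

    activations, gradients = fetch([state[None, ...]])    # add a batch dimension
    weights = gradients.mean(axis=(1, 2))                 # per-channel importance, shape (1, C)
    cam = np.maximum(np.sum(activations * weights[:, None, None, :], axis=-1), 0)[0]
    return cam / (cam.max() + 1e-8)                       # normalized heatmap for overlaying
```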

The class activation map for the agent’s decision.

Here, the agent, located at (20, 30), is coming up on an intersection and needs to decide where to go next. The heatmap indicates that the agent’s attention is focused on the left side of the screen, and is clustered pretty tightly around the path of dots that remain. It’s also noticing the ghosts at (25,40) and in the upper-center as well as the smaller dot cluster on the far right side. It ends up deciding to go right and up, which would be the equivalent of pushing the Atari joystick diagonally between those two directions. Interestingly, our agent seems to prefer these in-between inputs to the four standard directions, probably because the walls of the maze typically make one of those directions invalid, so it ends up being more efficient to try both at once. It could also be that this allows Ms. Pac-man to round tight corners without frame-perfect inputs (the game forces her right until the first frame there isn’t a wall overhead, then usually turns her up), a useful tactic for an agent that only gets to make a decision once every 4 frames.

In this example, from the end of the first level, we get some empirical proof of what we suspected when we first added PER — that the agent is capable of seeking out lone dots. It knows it’s safe from the three ghosts to the left and is only focused on the last remaining dot.

While their decision-making strategies may look similar on the surface, it’s important to remember that DQN doesn’t process information like a human. It learns to recognize certain groups of pixels as being ‘good’ or ‘not as good’. It can’t avoid ghosts, because it doesn’t even know what a ghost is; it learns to avoid clusters of numbers in a matrix that are similar to clusters of numbers that gave it a negative reward in the past. Where a human can lean on past experience and outside concepts (‘ghosts are bad, I should run’), DQN has nothing. Recent research has shown just how much of an advantage that prior knowledge can be, and how its absence is probably a big contributor to the slow learning progress of RL algorithms. Borrowing a diagram from that paper, imagine how difficult it would be to play a game if it looked like this:

What’s going on? What am I controlling? What’s the goal? The only way to find out is to try something and see what happens. Welcome to the world your computer sees.

Where Does it Go From Here?

As great as DQN is, it has room to improve and many directions to branch out in. For one, it learns extremely slowly. There has been a lot of work on jumpstarting RL algorithms using techniques from imitation and few-shot learning. Intuition says there’s also room for optimization in the memory mechanism; we’re already sampling based on importance, but we could be using similar metrics to decide which experiences to save and which ones to overwrite. The convolutional architecture that is widely used, and that we held constant in this post, has been mostly stagnant since its release in 2015; would recent advancements in the field of computer vision boost our agent’s ability to comprehend its environment? In addition, DQN has to learn each new game from scratch; what if it could take its knowledge from different-but-similar tasks, like Wizard of Wor, another maze-based Atari game, and transfer some of it over to Pac-man? That would take us from replicating human performance to replicating the human learning process — using a lifetime of continuous learning to tackle brand-new challenges.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Written by Jake Grigsby with Zabih Yousuf

Cavalier Machine Learning, University of Virginia

www.cavml.com

June 2018
