
Getting an AI to play Atari Pong with deep reinforcement learning

A detailed instructional

First we dominate Pong, and next, the world…

Getting a computer program to learn how to play Atari games may seem like a very difficult task. Well, I’ll tell you what … it is. But believe me, if I can do it then so can you. This post assumes you have read through and at least somewhat understand the theory of deep RL (linked below). But beware, the theory is not enough to get an agent to learn Atari! With reinforcement learning, everything is in the implementation and the devil is in the details! So, the rest of the post will be focused on implementing the code line by line to get our agent working.

What is Deep Reinforcement Learning?

Implementation:

For implementation, we will be using the OpenAI Gym environment. For the agent’s neural network, I will be building a CNN using Keras. We will first tackle Pong, then in a separate article, we will get the agent to play Breakout (it takes a lot longer to train). Really take your time and read through my code to understand what is going on.

Also, I highly recommend you do this project on Google Colab if you don’t have a powerful computer. If you don’t know what Google Colab is, google it and check it out! They provide GPUs and CPUs for you to run code on for free.

The strategy here is this: we receive the current game frame from OpenAI Gym. A tensor of the pixel values from the 4 most recent frames is our current state (more on this later). Based on the epsilon greedy strategy, we either take a random action or we input our current state into our CNN to get our action. We then take our action, receive a reward, and are brought to a new state. We store our (S, A, R, S’) values in memory for training. After each action, we randomly sample data from our agent’s memory and train our agent using the loss function we derived in the article on deep RL theory (linked above).

A Quick OpenAI Gym Tutorial

OpenAI Gym is a library full of Atari games (among other environments). This library lets us easily test our understanding without having to build the environments ourselves. After you import gym, there are only 4 functions we will be using from it: gym.make(env), env.reset(), env.step(a), and env.render().

  • gym.make(env): This simply gets our environment from OpenAI Gym. We will be calling env = gym.make('PongDeterministic-v4'), which says that our env is Pong.
  • env.reset(): This resets the environment back to its first state.
  • env.step(a): This takes a step in the environment by performing action a. It returns the next frame, reward, a done flag, and info. If the done flag == True, then the game is over.
  • env.render(): This shows the agent playing the game. We are only going to use env.render() when checking the performance of our agent. I don’t think this works in Google Colab or any other notebook, though I may be wrong. Because of this, you may have to save the weights of the agent, then load them on your local machine and render the game there.

A code example of how to implement these functions is written below:
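
Here is a minimal sketch of those four functions in use. It assumes an older gym release, where env.reset() returns a frame and env.step() returns four values; newer gym/gymnasium versions changed both signatures.

```python
import gym

# Build the Pong environment and play one episode with random actions.
env = gym.make('PongDeterministic-v4')

frame = env.reset()                     # the first frame of a new game
done = False
score = 0

while not done:
    action = env.action_space.sample()  # a random action, just to exercise the API
    frame, reward, done, info = env.step(action)
    score += reward
    # env.render()                      # watch the game (may not work in a notebook)

print('Episode finished with score:', score)
env.close()
```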

Preprocessing Frames

Currently, the frames received from OpenAI Gym are much larger than we need, with a much higher resolution than necessary. See below:

We don’t need any of the white space at the bottom, or any pixel above the white stripe at the top. Also, we don’t need this to be in color.

Firstly, we crop the image so that only the important area is displayed. Next, we convert the image to grayscale. Finally, we resize the frame using cv2 with nearest-neighbor interpolation, then convert the image datatype to np.uint8. See the code below for implementation…
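
Something along these lines works as a sketch; the exact crop bounds below are assumptions you should tune by eye for your environment.

```python
import cv2
import numpy as np

def preprocess_frame(frame):
    """Crop, grayscale, and shrink a raw (210, 160, 3) Pong frame down to 84x84 uint8."""
    frame = frame[30:-12, 5:-4]          # crop out the score bar and borders (illustrative bounds)
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame = cv2.resize(frame, (84, 84), interpolation=cv2.INTER_NEAREST)
    return frame.astype(np.uint8)
```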

The resulting image looks like this:

Supplying Directional Information

Put yourself in the agent’s shoes for a second and look at the frame below:

Now, which way is the ball moving … well, there’s no way to know. We can only know the position, since we only have one timestep. There is simply not enough information provided in one frame to know the direction. If given 2 timesteps we can know the velocity of the ball, if given three then we can know the acceleration, and so on.

Recall that our frames are a matrix of pixel values. To add the directional information into our input, we can simply stack our frames together to form a tensor. Taking our frame’s shape to be (84, 84), the shape of our state will be (84, 84, 4) if we stack the most recent 4 frames together. So at time t in-game, our current state is a stack of the frames at t, t-1, t-2, and t-3.
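
In NumPy terms, that is just a stack along a new last axis. A tiny illustration with placeholder frames:

```python
import numpy as np

# Four preprocessed 84x84 frames (placeholders here), oldest to newest: t-3, t-2, t-1, t.
frames = [np.zeros((84, 84), dtype=np.uint8) for _ in range(4)]

state = np.stack(frames, axis=-1)
print(state.shape)   # (84, 84, 4)
```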

Agent Memory

Before I get into implementing agent memory, I want to explain what inspired researchers to use it in the first place.

When I first approached this project, I tried to train the agent using the data generated in real-time. So at timestep t, I would use the (S,A,R,S’) data from time (t-1) to train my agent.

The best way I can explain the problem with this is by going back to my early days in college. In my sophomore year, I had statics theory shoved down my throat for a year straight (I pursued mechanical engineering). Then in my junior year, I did nothing with statics and instead had machine design shoved down my throat. Come senior year, I’d forgotten everything there was to forget about statics.

In our environment, there might be 100 frames of the ball just moving to the left side of the screen. If we repeatedly train our agent on the same exact situation, then eventually it will overfit to this situation and it won’t generalize to the entire game, just like how I forgot about statics while studying machine design.

So, we store all our experiences in memory, then randomly sample from the whole list of memories. This way the agent learns from all his experiences while training, and not just his current situation (to beat this dead horse a little more, imagine if you couldn’t learn from the past, and instead you could only learn from the exact present).

Alright, now for implementation. For the memory, I make a separate class with 4 separate deques (a first-in, first-out list of fixed size). The lists contain the frames, actions, rewards, and done flags (which tell us if a state was terminal). I also add a function that allows us to add to these lists.
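
A sketch of that class might look like this; the maximum length is an assumption, so pick it to fit your RAM.

```python
from collections import deque

class Memory:
    """Replay memory: four fixed-size, first-in-first-out deques, as described above."""

    def __init__(self, max_len=400_000):
        self.max_len = max_len
        self.frames = deque(maxlen=max_len)
        self.actions = deque(maxlen=max_len)
        self.rewards = deque(maxlen=max_len)
        self.done_flags = deque(maxlen=max_len)

    def add_experience(self, next_frame, next_action, next_reward, next_done_flag):
        self.frames.append(next_frame)
        self.actions.append(next_action)
        self.rewards.append(next_reward)
        self.done_flags.append(next_done_flag)
```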

The Environment

Next, we get into implementing our environment. I do this by making 4 functions:

  • make_env()
  • initialize_new_game()
  • take_step() … this is where all the magic happens
  • play_episode()

I’m going to show the code before I get into the specifics so you can follow along.
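
Here is a sketch of the two simpler functions (take_step() and play_episode() are sketched in their own sections below). It assumes the preprocess_frame() and Memory sketches from earlier, and that the agent holds a Memory instance in agent.memory.

```python
import gym

def make_env(name='PongDeterministic-v4'):
    """A thin wrapper around gym.make()."""
    return gym.make(name)

def initialize_new_game(env, agent):
    """Reset the environment, then pad memory with the starting frame and a dummy
    action/reward/done three times, so a full 4-frame state exists after the
    first real step."""
    starting_frame = preprocess_frame(env.reset())
    dummy_action, dummy_reward, dummy_done = 0, 0, False
    for _ in range(3):
        agent.memory.add_experience(starting_frame, dummy_action, dummy_reward, dummy_done)
```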

The make_env() function is self-explanatory. It just calls the gym.make() function.

The initialize_new_game() function resets the environment, then gets the starting frame and declares a dummy action, reward, and done. This data is added to our memory 3 times. Remember, we need 4 frames for a complete state: 3 frames are added here, and the last frame is added at the start of the take_step() function.

The take_step() function is a bit complicated. I’ll list the pseudocode below, followed by a sketch in code:

  1. Update the agent’s total timesteps
  2. Save weights every 50000 steps
  3. Now, we call env.step using the last action in our agent’s actions list. This returns the next frame, reward, done flag, and info
  4. Resize the next frame and get the next state from the last 4 frames.
  5. Get the next action using the next state. This is why we are able to use the last action in the agent’s action list in step 3
  6. If the game is over, then return the score and a terminal flag
  7. Add the experience to memory
  8. If debugging, render the game. We won’t be using this, but I found it useful when trying to get this to work.
  9. If the agent’s memory contains enough data, then have the agent learn from memory. More on this later.
  10. Return the agent’s score and a false terminal flag
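
Putting those ten steps together, a sketch of take_step() could look like the following. It leans on the earlier sketches; attribute names like total_timesteps and starting_mem_len are assumptions, and the weights filename is illustrative. The one liberty taken is adding the experience to memory before the terminal check, so the done flag gets recorded.

```python
import numpy as np

def take_step(env, agent, score, debug=False):
    # 1. Update the agent's total timestep count
    agent.total_timesteps += 1

    # 2. Save the weights every 50,000 steps
    if agent.total_timesteps % 50_000 == 0:
        agent.model.save_weights('recent_weights.hdf5')

    # 3. Step the environment using the last action in the agent's action list
    next_frame, reward, done, info = env.step(agent.memory.actions[-1])

    # 4. Preprocess the frame and build the next state from the last 4 frames
    next_frame = preprocess_frame(next_frame)
    new_state = np.stack([agent.memory.frames[-3], agent.memory.frames[-2],
                          agent.memory.frames[-1], next_frame], axis=-1)

    # 5. Choose the next action from the next state (this is what step 3 uses on the next call)
    next_action = agent.get_action(new_state)

    # 7. Add the experience to memory (done here so the terminal flag is stored too)
    agent.memory.add_experience(next_frame, next_action, reward, done)

    # 6. If the game is over, return the score and a true terminal flag
    if done:
        return score + reward, True

    # 8. Render while debugging
    if debug:
        env.render()

    # 9. Once memory holds enough data, learn from a random minibatch
    if len(agent.memory.frames) > agent.starting_mem_len:
        agent.learn()

    # 10. Return the running score and a false terminal flag
    return score + reward, False
```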

As you can see, this function works with the agent class a lot. This class will be introduced shortly.

The play_episode() function is pretty self-explanatory. It just initializes a new game, then calls the take_step() function until a true terminal flag is returned. Then the episode score is returned.
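
In code, that is only a few lines (again a sketch built on the functions above):

```python
def play_episode(env, agent, debug=False):
    initialize_new_game(env, agent)
    done = False
    score = 0
    while not done:
        score, done = take_step(env, agent, score, debug)
    return score
```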

The Agent Class

This class has 5 functions, including the constructor. These functions are: __init__(), _build_model(), get_action(), _index_valid(), and learn().

The _build_model() function just constructs a CNN for the agent. Note that there are no max-pooling layers in the CNN, since pooling eliminates some needed spatial information. I use the Huber loss and the Adam optimizer.
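
Written as a standalone function (in the project it is a method on the Agent class), a sketch might look like this. The layer sizes follow the classic DeepMind DQN architecture and the learning rate is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_actions=6, learning_rate=0.00025):
    """Strided convolutions only, no pooling layers, one output per action."""
    model = models.Sequential([
        layers.Conv2D(32, 8, strides=4, activation='relu', input_shape=(84, 84, 4)),
        layers.Conv2D(64, 4, strides=2, activation='relu'),
        layers.Conv2D(64, 3, strides=1, activation='relu'),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dense(n_actions, activation='linear'),   # one value estimate per action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss=tf.keras.losses.Huber())
    return model
```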

The get_action() function is just an implementation of epsilon greedy. We generate a random number, and if it is less than our epsilon value, we take a random action. If not, then we pass the current state into our CNN and return the action with the highest output.
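
As a standalone sketch (in the project, epsilon and the model live on the Agent class):

```python
import random
import numpy as np

def get_action(model, state, epsilon, n_actions=6):
    """Epsilon greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q_values = model.predict(state[np.newaxis, ...], verbose=0)   # add a batch dimension
    return int(np.argmax(q_values[0]))
```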

The _index_valid() function is a helper to the learn() function. It simply checks to see if there was a terminal frame within the past 4 frames of our memory, at the given index. We don’t want to create a state that straddles 2 games.
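
Something like this (a sketch; the exact window you check depends on how you build states in learn()):

```python
def index_valid(memory, index):
    """The 4-frame state ends at `index` and the next state ends at `index + 1`.
    If any done flag sits among frames index-3 .. index, the sampled transition
    would straddle two games, so reject it."""
    for i in range(4):
        if memory.done_flags[index - i]:
            return False
    return True
```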

Before I explain the learn function, you may have noticed the model_target attribute of the class. This is a technique used to decrease training noise. Basically, we copy our CNN, so we now have our model and something called our model target. Now, recall that when training we input the next state into our CNN to help generate the target for the error function. Notice that after the weights are updated, the target within the error function will change, since the CNN is being used to generate this target and the CNN weights were just changed.

Below is a reminder of our loss function:
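
(The post shows this as an image; written out in the notation of the caption below, it is roughly the temporal-difference error, which the code feeds through the Huber loss:)

L ≈ ( R + γ · V(t+2) − V(t+1) )²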

V(t+2) is an output of our NN. So if we update our weights once, then V(t+2) will change a bit. This means our target is moving, which in practice has proven to be inefficient.

Someone at Google had the idea that we could use the target model to generate the targets for our loss function, then every so often just set the weights of the target model to the weights of our main model. So they gave that a go, and it increased performance and decreased training time (which is why I implemented it here).

So now onto the learn() function. We will be training our CNN with a minibatch size of 32. The steps are listed below, followed by a sketch in code.

  1. Our first step is to gather 32 states, next_states, actions, rewards, and done flags.
  2. We pass our states through our model and our next states through our target model. The output we get from our model will be modified in the next step so it can be used as our Vtrue.
  3. Now for each state in our minibatch, we first figure out which action was taken. Then we calculate the true value of taking this action and replace the outputted value with this true value. The not next_done_flags[i] code just ensures that if we are at a terminal state, the next reward is not taken into account. Also, gamma is called a discount factor. It just makes current rewards hold more weight than future rewards. Gamma can be taken out without fear of breaking the agent.
  4. Now we fit our model using our states and the labels we created.
  5. Now update epsilon and how many times the agent has learned
  6. Copy the model weights to the target model every 10,000 steps
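
Putting those six steps together, a sketch of learn() could look like the following. It uses the Memory and index_valid() sketches from earlier; the hyperparameter values, attribute names like epsilon_min, epsilon_decay, and learns, and the index bookkeeping are assumptions that should match however you append experiences in take_step().

```python
import random

import numpy as np

def learn(agent, batch_size=32, gamma=0.95):
    states, next_states, actions, rewards, next_done_flags = [], [], [], [], []

    # 1. Gather 32 valid experiences from memory
    while len(states) < batch_size:
        index = random.randint(4, len(agent.memory.frames) - 2)
        if not index_valid(agent.memory, index):
            continue
        state = np.stack([agent.memory.frames[index - 3 + i] for i in range(4)], axis=-1)
        next_state = np.stack([agent.memory.frames[index - 2 + i] for i in range(4)], axis=-1)
        states.append(state)
        next_states.append(next_state)
        actions.append(agent.memory.actions[index])
        rewards.append(agent.memory.rewards[index + 1])
        next_done_flags.append(agent.memory.done_flags[index + 1])

    # 2. Current states through the model, next states through the target model
    labels = agent.model.predict(np.array(states), verbose=0)
    next_state_values = agent.model_target.predict(np.array(next_states), verbose=0)

    # 3. For the action actually taken, replace the output with its bootstrapped target
    for i in range(batch_size):
        action = actions[i]
        labels[i][action] = rewards[i] + (not next_done_flags[i]) * gamma * max(next_state_values[i])

    # 4. Fit the model on the corrected labels
    agent.model.fit(np.array(states), labels, batch_size=batch_size, epochs=1, verbose=0)

    # 5. Decay epsilon and count the update
    if agent.epsilon > agent.epsilon_min:
        agent.epsilon -= agent.epsilon_decay
    agent.learns += 1

    # 6. Copy the model weights to the target model every 10,000 learning steps
    if agent.learns % 10_000 == 0:
        agent.model_target.set_weights(agent.model.get_weights())
```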

And that’s it for the agent. The learn function is where things went wrong for me when I first tried this project. So if you are having trouble check here first.

main.py

Now for our final file. I’m not going to go into much detail here; the file is pretty self-explanatory. Just initialize your agent and environment, then call the play_episode() function in a loop. Every so often, print out some useful metrics; every 100 episodes, I print out a graph that shows a moving average of the scores. After training my agent, I realized it would be more useful to plot the scores against the steps taken instead of games played. If you are recreating this project, I suggest plotting the scores every 10,000 or so steps.
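
The skeleton is roughly this (a sketch; Agent is the class described in the previous section, and the episode count and print interval are arbitrary):

```python
import numpy as np

env = make_env()
agent = Agent()                           # the Agent class sketched above

scores = []
for episode in range(1_000):
    score = play_episode(env, agent)
    scores.append(score)
    if (episode + 1) % 10 == 0:
        avg = np.mean(scores[-100:])      # moving average over the last 100 games
        print(f'episode {episode + 1} | score {score} | 100-game avg {avg:.2f} | epsilon {agent.epsilon:.3f}')
```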

Summary:

You can find the complete, finished project here. Training took about 24 hours. The final result is the agent on the right in the gif at the top of this post.

My first time doing this took weeks. It really is a difficult project. If you are having trouble, really take the time to understand the theory and read through my code.

This particular RL algorithm is called Deep Q Learning. Deep Q Learning is usually treated as an introduction to deep RL, but it is not the most effective algorithm out there. My favorite algorithm to use is A2C, which I will go over in a future post.

But before I do, let’s prove just how cool this algorithm is. I once showed this to a friend and he said, “Well, couldn’t you have just hardcoded the paddle to follow the ball?” Although he is right, the cool part about this project is not just that we got it to play Pong. It is that the agent learned to play Pong. This algorithm is a step towards general Artificial Intelligence. A way to prove that we have obtained some degree of general artificial intelligence is to see if this agent can learn multiple games. I will leave a link here to an instructional on how to get this agent to learn the game Breakout when finished.

All images, gifs, and code snippets by the author!

Thank you for reading! If this post helped you in some way or you have a comment or question then please leave a response below and let me know! Also, if you noticed I made a mistake somewhere, or I could’ve explained something more clearly then I would appreciate it if you’d let me know through a response.

