DEEP REINFORCEMENT LEARNING EXPLAINED – 15

In the previous post, we presented solution methods that represent the action-values in a small table, which we referred to as a Q-table. In the next three posts of the "Deep Reinforcement Learning Explained" series, we will introduce the reader to the idea of using neural networks to expand the size of the problems that we can solve with reinforcement learning, presenting the Deep Q-Network (DQN), which represents the optimal action-value function as a neural network instead of a table. In this post, we will give an overview of DQN and introduce the OpenAI Gym framework of Pong. In the next two posts, we will present the algorithm and its implementation.
Atari 2600 games
The Q-learning method that we covered in previous posts solves the problem by iterating over the full set of states. However, we often have far too many states to track. An example is Atari games, which can show a huge variety of different screens; in this case, the problem cannot be solved with a Q-table.
The Atari 2600 game console was very popular in the 1980s, and many arcade-style games were available for it. The Atari console is archaic by today’s gaming standards, but its games are still challenging for computers, and they remain a very popular benchmark within RL research (using an emulator).

In 2015 DeepMind presented the so-called Deep Q-Network (DQN), or Deep Q-Learning, algorithm that learned to play many Atari video games better than humans. The research paper that introduces it, applied to 49 different games, was published in Nature (Human-Level Control Through Deep Reinforcement Learning, doi:10.1038/nature14236, Mnih et al.) and can be found here.
The Atari 2600 game environment can be reproduced through the Arcade Learning Environment in the OpenAI Gym framework. The framework has multiple versions of each game; for the purposes of this post, the PongNoFrameskip-v4 Environment will be used, consistent with the code below.
We will study this algorithm because it allows us to learn tips and tricks that will be very useful in future posts in this series. DeepMind’s Nature paper contained a table with all the details of the hyperparameters used to train the model on the 49 Atari games used for evaluation. However, our goal here is much more modest: we want to solve just the Pong game.
As we have done in some previous posts, the code presented in this post has been inspired by the code of Maxim Lapan who has written an excellent practical book on the subject.
The entire code of this post can be found on GitHub and can be run as a Colab google notebook using this link.
Our previous examples for FrozenLake or CartPole were not demanding in terms of computational requirements, as observations were small. However, from now on, that is not the case. The version of the code shared in this post converges to a mean score of 19.0 in about 2 hours (using an NVIDIA K80). So don’t get nervous during the execution of the training loop. 😉
Pong
Pong is a table tennis-themed arcade video game featuring simple two-dimensional graphics, manufactured by Atari and originally released in 1972. In Pong, one player scores if the ball passes by the other player. An episode is over when one of the players reaches 21 points. In the OpenAI Gym framework version of Pong, the Agent is displayed on the right and the enemy on the left:

There are three actions an Agent (player) can take within the Pong Environment: remaining stationary, vertical translation up, and vertical translation down. However, if we use the method action_space.n, we see that the Environment has 6 actions:
import gym
import gym.spaces
DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
test_env = gym.make(DEFAULT_ENV_NAME)
print(test_env.action_space.n)
6
Even though the OpenAI Gym Pong Environment has six actions:
print(test_env.unwrapped.get_action_meanings())
['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
three of the six are redundant (FIRE is equivalent to NOOP, LEFT is equivalent to LEFTFIRE, and RIGHT is equivalent to RIGHTFIRE).
DQN Overview
At the heart of the Agent in this new approach we find a deep neural network instead of the Q-table we saw in the previous post. It should be noted that the Agent is only given raw pixel data, what a human player would see on the screen, without access to the underlying game state (position of the ball, paddles, etc.).
As a reinforcement signal, it is fed the change in game score at each time step. At the beginning, when the neural network is initialized with random values, it is really bad, but over time it begins to associate situations and sequences in the game with appropriate actions and learns to actually play the game well (something that, without a doubt, the reader will be able to verify with the code presented in this series).
Input space
Atari games are displayed at a resolution of 210 by 160 pixels, with 128 possible colors for each pixel:
print(test_env.observation_space.shape)
(210, 160, 3)
This is still technically a discrete state space, but it is very large to process as is, and we can optimize it. To reduce this complexity, some minimal processing is performed: the frames are converted to grayscale and scaled down to a square 84 by 84 pixel block. Now let’s think carefully about whether we can determine the dynamics of the game from this single fixed image. There is certainly ambiguity in the observation: for example, we cannot know in which direction the ball is going. This obviously violates the Markov property.
The solution is to maintain several observations from the past and use them as a state. In the case of Atari games, the authors of the paper suggested stacking 4 subsequent frames together and using them as the observation at every state. For this reason, the preprocessing stacks four frames together, resulting in a final state space size of 84 by 84 by 4:

Output
Unlike the traditional reinforcement learning setup we have presented so far, where only one Q-value is produced at a time, the Deep Q-Network is designed to produce, in a single forward pass, a Q-value for every possible action available in the Environment:

This approach of having all Q-values calculated with one pass through the network avoids having to run the network individually for every action and helps to increase speed significantly. Now, we can simply use this vector to take an action by choosing the one with the maximum value.
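As a quick illustration, greedy action selection with such a network boils down to one forward pass and an argmax. This is just a minimal sketch: net stands for a network like the DQN class defined in the next sections, and state for a single preprocessed observation of shape (4, 84, 84) stored as a NumPy array:
import numpy as np
import torch

state_v = torch.tensor(np.expand_dims(state, 0), dtype=torch.float32)  # add a batch dimension
q_values = net(state_v)                              # shape: (1, n_actions)
action = int(torch.argmax(q_values, dim=1).item())   # index of the action with the highest Q-value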
Neural Network Architecture
The original DQN Agent used the same neural network architecture for all 49 games, taking an 84x84x4 image as input.
The screen images are first processed by three convolutional layers. This allows the system to exploit spatial relationships in the image. Also, since four frames are stacked and provided as input, these convolutional layers also extract some temporal properties across those frames. Using PyTorch, we can code the convolutional part of the model as:
nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.ReLU()
where input_shape is the observation_space.shape of the Environment (so input_shape[0], the number of stacked frames, acts as the number of input channels).
The convolutional layers are followed by one fully connected hidden layer with ReLU activation and one fully connected linear output layer that produces the vector of action values:
nn.Linear(conv_out_size, 512),
nn.ReLU(),
nn.Linear(512, n_actions)
where conv_out_size is the number of values in the output of the convolutional part for an input of the given shape. This value is needed by the first fully connected layer constructor, and it could be hard-coded since it is a function of the input shape (for an 84×84 input, the output of the convolutional part has 3,136 values). However, in order to code a generic model (for all the games) that can accept different input shapes, we will use a simple function, _get_conv_out, that accepts the input shape and applies the convolutional part to a fake tensor of that shape:
def _get_conv_out(self, shape):
    o = self.conv(torch.zeros(1, *shape))
    return int(np.prod(o.size()))

conv_out_size = self._get_conv_out(input_shape)
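As a quick sanity check (just an illustrative calculation, not part of the original code), we can reproduce the 3,136 figure mentioned above by applying the standard convolution output-size formula, (size - kernel) // stride + 1, to an 84×84 input:
size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:   # the three convolutional layers
    size = (size - kernel) // stride + 1          # 84 -> 20 -> 9 -> 7
print(64 * size * size)                           # 64 channels x 7 x 7 = 3136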
Another issue to solve is the requirement of feeding the convolutional output to the fully connected layers: the batch of 3D tensors coming out of the convolutional part must be reshaped into a batch of 1D vectors. In our code, we suggest solving this problem in the forward() function, where we can perform this reshaping using the view() function of the tensors.
The view() function reshapes a tensor with the same data and number of elements as the input, but with the specified shape. The interesting thing about this function is that it lets one single dimension be -1, in which case it is inferred from the remaining dimensions and the number of elements in the input (the method does the math to fill in that dimension). For example, if we have a tensor of shape (2, 3, 4, 6), which is a 4D tensor with 144 elements, we can reshape it into a 2D tensor with 2 rows and 72 columns using view(2, 72). The same result could be obtained with view(2, -1), since the -1 is inferred as 144 / 2 = 72.
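A quick check of this behaviour (the tensor here is just for illustration, assuming torch has been imported):
t = torch.zeros(2, 3, 4, 6)      # 4D tensor with 144 elements
print(t.view(2, 72).shape)       # torch.Size([2, 72])
print(t.view(2, -1).shape)       # torch.Size([2, 72]); the -1 is inferred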
In our code, the tensor actually has the batch size in the first dimension, and we flatten the 4D tensor coming out of the convolutional part (the first dimension is the batch size and the second is the color channel, which is our stack of subsequent frames; the third and fourth are the image dimensions) into a 2D tensor that serves as input to our fully connected layers, obtaining Q-values for every input in the batch.
The complete code for class DQN that we just described is written below:
import torch
import torch.nn as nn
import numpy as np

class DQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)
We can use the print
function to see a summary of the network architecture:
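For instance, a minimal sketch of how such a summary can be obtained; the (4, 84, 84) input shape and the 6 actions correspond to the wrapped Pong environment described in the next section:
net = DQN((4, 84, 84), n_actions=6)
print(net)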
DQN(
  (conv): Sequential(
    (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
  )
  (fc): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=6, bias=True)
  )
)
OpenAI Gym Wrappers
In DeepMind’s paper, several transformations (such as the already mentioned conversion of the frames to grayscale and their scaling down to a square 84 by 84 pixel block) are applied to the interaction with the Atari platform in order to improve the speed and convergence of the method. In our example, which uses the OpenAI Gym simulator, these transformations are implemented as OpenAI Gym wrappers.
The full list is quite lengthy and there are several implementations of the same wrappers in various sources. I used the version from Lapan’s book, which is based on the OpenAI Baselines repository. Let’s introduce the code for each one of them.
For instance, some games such as Pong require the user to press the FIRE button to start the game. The following code corresponds to the wrapper FireResetEnv, which presses the FIRE button in environments that require it for the game to start:
class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        self.env.reset()
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset()
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset()
        return obs
In addition to pressing FIRE, this wrapper checks for several corner cases that are present in some games.
The next wrapper that we will require is MaxAndSkipEnv, which implements a couple of important transformations for Pong:
import collections

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        super(MaxAndSkipEnv, self).__init__(env)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs
On one hand, it allows us to speed up training significantly: the chosen action is simply repeated on intermediate frames, the rewards are accumulated, and an action decision is made only every N steps (four by default). We can afford this because processing every frame with a neural network is quite a demanding operation, while the difference between consecutive frames is usually minor.
On the other hand, it takes the maximum of every pixel over the last two frames and uses that as the observation returned for the step. Some Atari games have a flickering effect (the game draws different portions of the screen on even and odd frames, a normal practice among Atari 2600 developers to increase the number of sprites shown given the platform’s limitations). For the human eye such quick changes are not visible, but they can confuse a neural network.
Remember that we already mentioned that, before feeding the frames to the neural network, every frame is scaled down from 210×160 with three color channels (RGB) to a single-channel grayscale 84×84 image, using a colorimetric grayscale conversion. Different approaches are possible. One of them is cropping non-relevant parts of the image and then scaling it down, as is done in the following code:
import cv2

class ProcessFrame84(gym.ObservationWrapper):
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0, high=255,
                                                shape=(84, 84, 1), dtype=np.uint8)

    def observation(self, obs):
        return ProcessFrame84.process(obs)

    @staticmethod
    def process(frame):
        if frame.size == 210 * 160 * 3:
            img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
        elif frame.size == 250 * 160 * 3:
            img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
        else:
            assert False, "Unknown resolution."
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
        resized_screen = cv2.resize(img, (84, 110),
                                    interpolation=cv2.INTER_AREA)
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)
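As a quick check of the wrapper's static method (the random frame below is just a stand-in for a real emulator screen, assuming numpy and cv2 are imported as above):
fake_frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = ProcessFrame84.process(fake_frame)
print(processed.shape, processed.dtype)   # (84, 84, 1) uint8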
As we already discussed, as a quick solution to the lack of game dynamics in a single game frame, the class BufferWrapper stacks several (usually four) subsequent frames together:
class BufferWrapper(gym.ObservationWrapper):
    def __init__(self, env, n_steps, dtype=np.float32):
        super(BufferWrapper, self).__init__(env)
        self.dtype = dtype
        old_space = env.observation_space
        self.observation_space = gym.spaces.Box(
            old_space.low.repeat(n_steps, axis=0),
            old_space.high.repeat(n_steps, axis=0),
            dtype=dtype)

    def reset(self):
        self.buffer = np.zeros_like(self.observation_space.low,
                                    dtype=self.dtype)
        return self.observation(self.env.reset())

    def observation(self, observation):
        self.buffer[:-1] = self.buffer[1:]
        self.buffer[-1] = observation
        return self.buffer
The input shape of the tensor has a color channel as the last dimension, but PyTorch’s convolution layers assume the color channel to be the first dimension. This simple wrapper changes the shape of the observation from HWC (height, width, channel) to the CHW (channel, height, width) format required by PyTorch:
class ImageToPyTorch(gym.ObservationWrapper):
    def __init__(self, env):
        super(ImageToPyTorch, self).__init__(env)
        old_shape = self.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0,
                                                shape=(old_shape[-1],
                                                       old_shape[0], old_shape[1]),
                                                dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)
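For instance (an illustrative check with a dummy array, assuming numpy is imported as np), np.moveaxis turns an 84×84×1 observation into the 1×84×84 layout that PyTorch expects:
dummy_obs = np.zeros((84, 84, 1), dtype=np.float32)
print(np.moveaxis(dummy_obs, 2, 0).shape)   # (1, 84, 84)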
The screen obtained from the emulator is encoded as a tensor of bytes with values from 0 to 255, which is not the best representation for an NN. So, we need to convert the image into floats and rescale the values to the range [0.0…1.0]. This is done by the ScaledFloatFrame
wrapper:
class ScaledFloatFrame(gym.ObservationWrapper):
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0
Finally, the following simple function, make_env, will be helpful: it creates an environment by its name and applies all the required wrappers to it:
def make_env(env_name):
    env = gym.make(env_name)
    env = MaxAndSkipEnv(env)
    env = FireResetEnv(env)
    env = ProcessFrame84(env)
    env = ImageToPyTorch(env)
    env = BufferWrapper(env, 4)
    return ScaledFloatFrame(env)
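With all the wrappers in place, we can quickly check (just as an illustration) that the resulting observations have the 4×84×84 shape discussed earlier:
env = make_env(DEFAULT_ENV_NAME)
print(env.observation_space.shape)   # (4, 84, 84)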
What is next?
This is the first of three posts devoted to the Deep Q-Network (DQN), in which we have provided an overview of DQN and an introduction to the OpenAI Gym framework of Pong. In the next two posts (Post 16, Post 17), we will present the algorithm and its implementation, and we will cover several tricks for DQNs that improve their training stability and convergence.
Deep Reinforcement Learning Explained Series
by UPC Barcelona Tech and Barcelona Supercomputing Center
A relaxed introductory series that gradually and with a practical approach introduces the reader to this exciting technology that is the real enabler of the latest disruptive advances in the field of Artificial Intelligence.
About this series
I started to write this series in May, during the period of lockdown in Barcelona. Honestly, writing these posts in my spare time helped me to #StayAtHome because of the lockdown. Thank you for reading this publication in those days; it justifies the effort I made.
Disclaimers – These posts were written during the lockdown period in Barcelona as a personal distraction and a way to disseminate scientific knowledge, in case it could be of help to someone, but without the intention of being an academic reference document in the DRL area. If the reader needs a more rigorous document, the last post in the series offers an extensive list of academic resources and books that the reader can consult. The author is aware that this series of posts may contain some errors and that the English text would need revision were it intended as an academic document. But although the author would like to improve the content in quantity and quality, his professional commitments do not leave him free time to do so. However, the author agrees to refine all those errors that readers report, as soon as he can.