Are the Space Invaders deterministic or stochastic?

A discussion of the techniques used to inject stochasticity into the ALE and OpenAI Gym.

Nicolas Maquaire
Towards Data Science

--

Image by author

Abstract

Google Deepmind achieved human-level performance on 49 Atari games using the Arcade Learning Environment (ALE). This article describes the methods I used to reproduce this performance and discusses the efficiency of the mechanisms used by Deepmind and OpenAI for injecting stochasticity into the ALE.

Introduction

As a side project, I spent some time trying to reproduce Deepmind’s human-level performance on Breakout and Space Invaders. Although I understand there are many better-performing architectures, my goal was to use the same network as the one presented in Deepmind’s 2015 Nature paper (Human-level control through deep reinforcement learning). I did this to better understand the challenges Deepmind faced while playing some iconic games from my childhood. One of those challenges is particularly interesting to me: are the environments stochastic or deterministic? Did Deepmind and OpenAI fight deterministic waves of Space Invaders? In this article, we discuss the efficiency of the mechanisms used by Deepmind and OpenAI for injecting stochasticity into the ALE.

The source code can be accessed at https://github.com/NicMaq/Reinforcement-Learning

This repository contains the code I used to support my conclusions, as well as the data from my TensorBoard runs, to encourage discussion and facilitate comparisons.

Additionally, for readers who want to learn how my algorithm works, I published Breakout explained and e-greedy and softmax explained. These are two Google Colab notebooks where I explain Expected SARSA and the implementation of the two policies, e-greedy and softmax.

Lastly, reinforcement learning is an exciting and promising field of artificial intelligence, but be aware that it is cursed. Read my post on best practices for RL to accelerate your path to success.

Game scores

In Deepmind’s 2015 Nature paper we find two tables presenting the results (Extended Data Table 2 and Extended Data Table 3). In Table 2, Deepmind listed the highest average episode score reported over the first 50 million frames, evaluating every 250,000 frames for 135,000 frames. In Table 3, they listed the highest average episode score reported over the first 10 million frames.
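As a sketch of how I read that metric (this is my interpretation, not Deepmind’s actual code): every 250,000 training frames, run an evaluation phase of 135,000 frames, average the episode scores of that phase, and report the highest of those per-phase averages.

```python
import numpy as np

def best_average_episode_score(eval_phases):
    """eval_phases: one list of episode scores per evaluation phase.
    Returns the highest per-phase average, i.e. the number reported in the tables."""
    return max(float(np.mean(scores)) for scores in eval_phases)

# Hypothetical example with three evaluation phases:
phases = [[210, 250, 180], [310, 330, 305], [290, 300, 280]]
print(best_average_episode_score(phases))  # -> 315.0
```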

I don’t have the computing power of Deepmind, so I compared my results with Table 3. Over the first ten million frames (used for training the agent), Deepmind reported a highest average score of 316.8 for Breakout and 1088.9 for Space Invaders.

OpenAI open-sourced Baselines, their internal effort to reproduce reinforcement learning algorithms with performance on par with published results.

To the best of my knowledge, the Baselines team did not publish results for the same network as Deepmind. The closest we can find is a network trained with double Q-learning, which, as its name indicates, is an improvement on Q-learning. Nevertheless, it was interesting to read through the reports and check whether my results were comparable. So, provided that OpenAI used the same methodology as Deepmind, and in particular a frame skip of 4, the score they obtained on Breakout was between 360 and 410 over the first 10 million frames (see the Baselines DQN results).

graph from https://openai.com/blog/openai-baselines-dqn/

Finally, the score for Space Invaders reported in the 2017 ALE paper for a DQN was 673.

The methodology I used is discussed in detail in a later section; I tried to follow Deepmind’s methodology rigorously. Below are the results I got for Breakout and Space Invaders using almost the same evaluation procedure. I committed my TensorBoard runs to the GitHub repository:

Breakout (Run 20200902182427): my best score is 427

Image by author

Space Invaders (Run 20200910151832): my best score is 1085

Image by author

Congratulations, I successfully reproduced Deepmind’s performance on Breakout and Space Invaders.

But did I, really?

One of the major differences between Deepmind’s code and mine is that Deepmind uses the ALE directly while I am using OpenAI Gym. And, certainly the most significant difference is how we inject stochasticity into the games.

The ALE is deterministic and therefore, OpenAI Gym implements different techniques for injecting stochasticity in the games. Let’s discuss which techniques are the closest to Deepmind’s methodology and their efficiency.

Determinism and stochasticity of the ALE and OpenAI Gym

While there is no reference to determinism in the original 2013 ALE paper, Machado, Bellemare et al. write in 2017 that one of the main concerns with the ALE is that “in almost all games, the dynamics within Stella itself are deterministic given the agent’s actions.”

The original Atari 2600 console had no feature for generating random numbers. As a consequence, the ALE is also fully deterministic. As such, it is possible to achieve high scores by simply memorizing a good sequence of actions rather than learning to make good decisions. Such an approach is not likely to be successful beyond the ALE. The stochasticity of the environment is a critical factor in encouraging the robustness of RL algorithms and their ability to transfer to other tasks.

Various approaches have been developed to add forms of stochasticity to the ALE dynamics (often at a later date than the publication of Deepmind’s paper). In the 2017 ALE paper we can find:

  1. Sticky actions
  2. Random frame skips
  3. Initial no-ops
  4. Random action noise

Google Deepmind uses a fixed frame skip of 4, a maximum of 30 initial no-ops, and random action noise.

The ALE is deterministic. We can read in the 2017 ALE paper that “Given a state s and a joystick input a there is a unique next state s′, that is, p(s′ | s, a) = 1.” So, if at each step of the game there is only one possible outcome, we should always achieve the same score with the same network and the same initial state.
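A quick way to check this in practice is to replay the same fixed action sequence twice and compare the episode scores. Below is a minimal sketch assuming the classic (pre-0.26) Gym API; the environment id, seed and sequence length are only illustrative:

```python
import gym
import numpy as np

def rollout_score(env_id, actions, seed=0):
    """Play a fixed, pre-generated action sequence and return the episode score."""
    env = gym.make(env_id)
    env.seed(seed)          # classic (pre-0.26) Gym API
    env.reset()
    total = 0.0
    for a in actions:
        _, reward, done, _ = env.step(a)
        total += reward
        if done:
            break
    env.close()
    return total

# One fixed action sequence, reused for both rollouts (Breakout has 4 actions).
actions = np.random.RandomState(42).randint(0, 4, size=2000)

# In a fully deterministic environment, the two scores are always identical.
print(rollout_score("BreakoutNoFrameskip-v4", actions))
print(rollout_score("BreakoutNoFrameskip-v4", actions))
```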

In the following paragraphs, we will study the variance of the scores we achieved to evaluate the level of stochasticity added by the different approaches. We will also support our conclusions by looking at the distribution of the results.

Stochasticity with sticky actions

In the 2017 ALE paper, Machado, Bellemare et al. recommend sticky actions to enforce stochasticity. Sticky actions add stickiness to the agent’s actions: at every step, with some probability ς (0.25 in the paper), the environment repeats the previously executed action instead of the agent’s new action.

“Our proposed solution, sticky actions, leverages some of the main benefits of other approaches without most of their drawbacks. It is free from researcher bias, it does not interfere with agent action selection, and it discourages agents from relying on memorization. The new environment is stochastic for the whole episode, generated results are reproducible.”
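The mechanism itself is simple. The sketch below is only an illustration of the idea as a Gym wrapper, with ς = 0.25 as used in the 2017 ALE paper; in practice, the Gym v0 environments obtain the same effect through the ALE’s repeat_action_probability setting.

```python
import gym
import numpy as np

class StickyActions(gym.Wrapper):
    """At every step, execute the previously executed action with probability varsigma."""

    def __init__(self, env, varsigma=0.25):
        super().__init__(env)
        self.varsigma = varsigma
        self.last_action = 0  # NOOP

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if np.random.rand() < self.varsigma:
            action = self.last_action  # the agent's new action is ignored for this step
        self.last_action = action      # remember the action that was actually executed
        return self.env.step(action)
```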

We conducted two experiments to learn how to differentiate a stochastic from a deterministic environment. Our deterministic environment is BreakoutNoFrameskip-v4. We can read about this environment in OpenAI’s source code: “No frameskip. (Atari has no entropy source, so these are deterministic environments)”. Our stochastic environment is BreakoutNoFrameskip-v4 with sticky actions, i.e. BreakoutNoFrameskip-v0.

Below and on the left, we can see the distribution of results we got for the deterministic environment, and on the right, for the stochastic environment.

Graph by author

As expected, we have a very narrow distribution of results for the deterministic environment and a broader distribution for the stochastic environment.

It’s also very interesting to look at the histogram of the results for both environments:

Graph by author

We can see on the left that, for each of the 100 games of our evaluation phase, we achieved the same score. On the right, the stochasticity added by the sticky actions caused the agent to achieve a variety of scores.

To measure the spread, we can compute the variance of the results. Below is a comparison of the variance of the deterministic and stochastic environments.

Graph by author

The variance of the results for the deterministic environment is always zero, while it is greater than zero for the stochastic environment.
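For reference, the spread statistics above are simply, for each evaluation phase, the variance and histogram of the 100 episode scores. A minimal sketch with hypothetical score lists:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical evaluation results: 100 episode scores per environment.
deterministic_scores = np.full(100, 385.0)                    # the same score every episode
stochastic_scores = np.random.normal(360.0, 40.0, size=100)   # spread-out scores

print("deterministic variance:", np.var(deterministic_scores))  # 0.0
print("stochastic variance:   ", np.var(stochastic_scores))     # > 0

plt.hist(deterministic_scores, bins=20, alpha=0.5, label="deterministic")
plt.hist(stochastic_scores, bins=20, alpha=0.5, label="sticky actions")
plt.xlabel("episode score")
plt.ylabel("count")
plt.legend()
plt.show()
```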

Let’s compare these first results with the distribution and variance of the other techniques.

Frame skipping

Frame skipping consists of repeating the last action chosen by the agent for a number n of consecutive frames, so the agent only sees one in every n + 1 frames. With random frame skipping, n changes at every decision point.
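As an illustration only (Gym’s Atari environments implement this internally through their frameskip parameter), a frame-skip wrapper could look like the sketch below, following Gym’s convention where skip is the number of emulator frames per agent decision:

```python
import gym
import numpy as np

class FrameSkip(gym.Wrapper):
    """Repeat each agent action for `skip` emulator frames, sum the rewards,
    and return only the last observation.

    `skip` can be an int (fixed frame skipping, as Deepmind used) or a
    (low, high) tuple for random frame skipping, e.g. (2, 4)."""

    def __init__(self, env, skip=4):
        super().__init__(env)
        self.skip = skip

    def step(self, action):
        if isinstance(self.skip, int):
            n = self.skip
        else:
            n = np.random.randint(self.skip[0], self.skip[1] + 1)
        total_reward, done, info = 0.0, False, {}
        obs = None
        for _ in range(n):
            obs, reward, done, info = self.env.step(action)  # classic 4-tuple Gym API
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```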

In OpenAI Gym, frame skipping can skip a random number (2, 3, or 4) of frames. The environments that skip a fixed number of frames are the {}Deterministic-v4 and {}Deterministic-v0 environments. The environments that skip a random number of frames are the {}-v4 and {}-v0 environments.

Deepmind used fixed frame skipping, so we will use BreakoutDeterministic-v4.

Below is a comparison between the result distribution of BreakoutNoFrameskip-v4 on the left and BreakoutDeterministic-v4 on the right. BreakoutDeterministic-v4 is the same environment as BreakoutNoFrameskip-v4 (deterministic), but with the addition of fixed frame skipping.

Graph by author

Both distributions are very similar. For BreakoutDeterministic-v4, we obtained the distribution of a deterministic environment.

If we compute the variance for the two experiments, all values are equal to zero.

Graph by author

In the 2017 ALE paper, Machado, Bellemare et al.’s conclusion about the random frame-skipping technique was that, “beside injecting stochasticity, frame skipping results in a simpler reinforcement learning problem and speeds up execution.”

As our experiments show, a fixed frame-skipping technique does not inject stochasticity, but it clearly simplifies the learning problem and speeds up convergence (see the scores below).

Graph by author

Below is the distribution of one of my experiments with Breakout-v4, which uses random frame skipping. We can clearly see that this distribution is similar to those of our stochastic environments.

Graph by author

Random frame skipping adds stochasticity but also adds complexity. We can read in the 2017 ALE paper: “Discounting must also be treated more carefully, as this makes the effective discount factor random.”

My code does not take this time distortion into consideration. With a random skip of n frames, a one-step update effectively discounts by γ^n (γ being the discount factor), so the discount factor itself becomes a random variable; handling this properly in a temporal-difference method would certainly prove challenging.

As it simplifies the learning problem and speeds up convergence, we will use BreakoutDeterministic-v4 as our deterministic environment for the remainder of this article.

Initial no-ops

Another technique used by Deepmind was to change the initial state by executing a random number k of no-op actions at the start of each episode (k ≤ 30). This technique is not implemented by OpenAI Gym.

My implementation of no-ops for Breakout encourages a diversity of initial states by randomly moving the paddle before sending the “FIRE” action (for a random number of steps k, I send either the “RIGHT” or the “LEFT” action).
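Roughly, the idea looks like the sketch below; this is a simplified version assuming the classic Gym step API and Breakout’s default action indices (the exact code is in the repository linked above):

```python
import numpy as np

# Breakout's default action set in Gym: 0 = NOOP, 1 = FIRE, 2 = RIGHT, 3 = LEFT
FIRE, RIGHT, LEFT = 1, 2, 3

def randomized_reset(env, max_initial_moves=30):
    """Reset the environment and diversify the initial state: move the paddle
    a random number of times in a random direction, then FIRE to launch the ball."""
    obs = env.reset()
    k = np.random.randint(1, max_initial_moves + 1)
    move = np.random.choice([RIGHT, LEFT])
    for _ in range(k):
        obs, _, done, _ = env.step(move)  # classic 4-tuple Gym API
        if done:
            obs = env.reset()
    obs, _, _, _ = env.step(FIRE)
    return obs
```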

On the left, we can see the distribution of results for our deterministic environment (BreakoutDeterministic-v4) and on the right, the distribution of results when the agent executes k initial no-op actions (k ≤ 30).

Graph by author

The comparison between the two charts is interesting. While we don’t have the same pattern as a deterministic environment, the spread is definitely narrower than the spread of our stochastic environment.

Graph by author

The number of distinct results is increased but remains lower than in a stochastic environment.

Graph by author

When studying the variance, we notice that the environment shows signs of determinism. The variance is close to zero for a few evaluation phases and significantly lower than the variance of our stochastic environment. The no-ops technique injects little stochasticity; it is less efficient than the other techniques.

This confirms what Machado et al. state in the 2017 ALE paper: “The environment remains deterministic beyond the choice of starting state.”

Random action noise

Finally, we will observe the efficiency of random action noise. Random action noise keeps a small probability of replacing the agent’s selected action with a random action. This technique is used by Deepmind: during the evaluation phase, they keep a 5% probability of executing a random action.
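In code, this is nothing more than an ε-greedy evaluation policy. A minimal sketch, where greedy_action and q_values are placeholder names for whatever the network predicts:

```python
import numpy as np

def noisy_action(greedy_action, n_actions, epsilon=0.05):
    """With probability epsilon, replace the agent's chosen action with a random one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return greedy_action

# During evaluation (names are illustrative):
# action = noisy_action(int(np.argmax(q_values)), env.action_space.n, epsilon=0.05)
```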

On the left is BreakoutDeterministic-v4 with epsilon = 0 and on the right is BreakoutDeterministic-v4 with epsilon = 0.05.

Graph by author

We can observe that keeping epsilon non-zero clearly injects stochasticity. The results obtained are very close to the ones we got with BreakoutNoFrameskip-v0. On the graph below, we can also see that the variance increases over the first 20 million frames.

Graph by author

Injecting stochasticity with random action noise is clearly effective, although it comes with an important drawback: it biases the policy and decreases performance.

This supports what Machado, Bellemare et al. wrote in the 2017 ALE paper: “Random action noise may significantly interfere with the agent’s policy.”

Space Invaders

Let’s validate our previous conclusions on Space Invaders.

Deterministic versus stochastic

First, let’s validate that we have the same differences between SpaceInvadersDeterministic-v4 (deterministic) and SpaceInvadersDeterministic-v0 (stochastic).

Graph by author
Graph by author

We observe in these two distributions the same patterns as for Breakout. Also, the variance is zero for the deterministic environment and non-zero for the stochastic environment.

Initial no-ops

For Breakout, the initial no-ops injected little stochasticity.

Let’s compare SpaceInvadersDeterministic-v4 with SpaceInvadersDeterministic-v4 with initial no-ops (k ≤ 30).

Graph by author
Graph by author
Graph by author

We can say that SpaceInvadersDeterministic-v4 with no-ops behaves the same way as Breakout: it shows signs of deterministic behaviour, and we obtained a variance significantly lower than that of our stochastic environment.

Random action noise

Let’s compare SpaceInvadersDeterministic-v4 (left) with SpaceInvadersDeterministic-v4 with a random action noise (right).

The two distributions below clearly validate that keeping epsilon at a non-zero value injects stochasticity.

Graph by author

And, the variance is almost the same as the variance of our stochastic environment.

Graph by author

Conclusion

In this article, we showed that there are several valid methods to inject a certain level of stochasticity into the ALE. However, some of these methods come with undesirable side effects: random frame skipping distorts time; initial no-ops show signs of determinism and sometimes have no effect on the game; and random action noise penalizes the behaviour policy.

This corroborates the conclusion of Machado, Bellemare et al., who write “our recommendation is to use sticky actions”, ultimately proposing sticky actions as a standard training and evaluation protocol.

Therefore, my preference when training on OpenAI Gym goes toward the {}Deterministic-v0 environments, which give good stochasticity without time distortion.
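Concretely, for Breakout that means creating the environment as below; to my understanding, the v0 variants enable the ALE’s sticky actions while the Deterministic suffix fixes the frame skip:

```python
import gym

# Fixed frame skipping (no time distortion) + sticky actions (stochasticity).
env = gym.make("BreakoutDeterministic-v0")
```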

If you are training an RL algorithm, it is always worth checking the distribution of your scores for signs of determinism.

Now, to answer the question of whether Deepmind fought deterministic waves of Space Invaders, I would say no: they used two methods to inject stochasticity (initial no-ops and random action noise).

Using both methods seems excessive, as random action noise alone is sufficient to ensure stochasticity. Perhaps Deepmind started their experiments with no-ops and noticed they were overfitting on some games. It’s unclear, and only Deepmind can say.

Although it is difficult to understand the Baselines methodology, it’s clear from the results that they added no-ops. It’s unclear whether they also used random action noise. Was using no-ops sufficient to add enough stochasticity to the games to prevent overfitting?

I was in the same performance range as Deepmind when using a stochastic Breakout environment. When transferring to Space Invaders with the same hyperparameters, my results were significantly lower. Maybe it’s related to the differences in methodology, or there is still some fine-tuning of the hyperparameters to do to get good performance on other games.

I hope you enjoyed seeing the differences between a deterministic and a stochastic environment! And I hope this article will help you transfer your networks to other tasks, accelerating your path to success.

References

Deepmind’s 2015 nature paper

Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

2013 ALE paper

M. G. Bellemare, Y. Naddaf, J. Veness and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents, Journal of Artificial Intelligence Research, Volume 47, pages 253–279, 2013.

2017 ALE paper

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. J. Hausknecht, M. Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents, Journal of Artificial Intelligence Research, Volume 61, pages 523–562, 2018.

Methods and Hyperparameters
