I Placed 4th in my First AI Competition. Takeaways from the Unity Obstacle Tower Competition

Joe Booth
Towards Data Science
9 min read · Jul 30, 2019


Over the last few years, most of my spare time has been spent tinkering, learning, and researching machine learning, specifically reinforcement learning and digital actors.

Recently I decided to participate in the Obstacle Tower Challenge. To my surprise, my early efforts briefly topped the table, and I placed 2nd in the first round as my agent mastered the first 10 levels (floors 0 through 9) of the tower.

Whilst my agent spent most of the competition in 2nd place, it never fully mastered floor 10 and eventually placed 4th.

Round One

My first step was to reproduce the findings of the paper, ‘Obstacle Tower: a Generalization Challenge in Vision, Control, and Planning,’ which I did using the Rainbow example code.

Obstacle Tower Challenge authors found that PPO did not generalize well.

I decided to switch to PPO. Although PPO did not perform well in the Obstacle Tower paper, it has a high ceiling, and I wanted to understand why it had struggled with this environment.

I hypothesized that by implementing the best practices and optimizations used in state-of-the-art PPO implementations, there should be no theoretical upper limit to its performance (at least compared with other model-free reinforcement learning algorithms).

I used the code from the paper, Large-Scale Study of Curiosity-Driven Learning, which enabled my agent to score 9.4 and take first place for a short time. When training, I found the curiosity bonus (the --int_coeff parameter) did not seem to help performance; in fact, I had to reduce its weight before the algorithm could learn the environment (--int_coeff=0.025, --ext_coeff=0.975).
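
In code terms, the curiosity bonus is just a weighted sum of the two reward streams. Here is a minimal sketch using the coefficients I ended up with (the function and argument names are illustrative, not the actual codebase's API):

```python
import numpy as np

def mix_rewards(extrinsic, intrinsic, ext_coeff=0.975, int_coeff=0.025):
    """Weighted sum of the environment (extrinsic) and curiosity (intrinsic) rewards.

    Illustrative sketch only; the coefficients mirror the values I settled on.
    """
    return ext_coeff * np.asarray(extrinsic) + int_coeff * np.asarray(intrinsic)

# A step where the environment pays nothing but the curiosity module is excited:
# mix_rewards(extrinsic=0.0, intrinsic=1.0) -> 0.025
```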

I found the TensorFlow framework unintuitive to work with, so I looked for an alternative PyTorch codebase. I chose ‘pytorch-a2c-ppo-acktr-gail’ and built my code around its implementation of the PPO algorithm. I was able to reproduce my earlier score.
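
For readers unfamiliar with PPO, the heart of the algorithm is the clipped surrogate objective. A minimal PyTorch sketch of that loss (a simplification, not the pytorch-a2c-ppo-acktr-gail code itself):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.1):
    """Clipped surrogate objective at the core of PPO.

    All tensors are per-sample values gathered from a rollout; this is a
    simplified sketch rather than any particular library's implementation.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic bound and negate it, since optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```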

Performance of Agent in Round One.

Through further refinement and training, I was able to get a stable score of 10, whereby the agent was consistently able to finish the first 10 floors of the challenge on public and test seeds.

Here are some examples of the agent completing puzzles:

Easy Key Puzzles: lower levels have relatively simple key/door placement, where the key is on or close to the path through the level. The agent is able to learn without any prior semantic knowledge of the relationship between keys and doors.
Hard Key Puzzles: later levels have a more complex key/door placement where the agent must search for the key.
Double Jump Puzzle: this was one of the harder puzzles in the first 10 levels. The agent learns to do a double jump. Some seeds have a blocking mechanic, whereby the agent must complete this puzzle to get the key and progress.

Round One Paper

I felt that I had gone a long way toward demonstrating my hypothesis, and used the time between round one and round two to perform some experimental analysis and write a technical paper, PPO Dash: Improving Generalization in Deep Reinforcement Learning, with the accompanying source code at github.com/sohojoe/ppo-dash.

My agent was outperforming the then-current published state-of-the-art from the Obstacle Tower paper with some simple but deliberate modifications to the basic PPO algorithm. I hoped to show the individual value of each of the best practices and optimizations, much like the DeepMind paper, ‘Rainbow: Combining Improvements in Deep Reinforcement Learning.’

PPO Dash: Improving Generalization in Deep Reinforcement Learning

For the ablation study, I trained each modification in isolation and then combined them incrementally. I ran each of the 17 different experiments 3 times over 10m training steps, each individual run taking about 5 hours.

The paper was a bit of a bust, with only one of the modifications (Action Space Reduction) being statistically significant.

Either the other modifications are not of value, or we need more than 10m training steps to isolate their value. I would assume the latter, but given that the ablation study’s total training time exceeded 250 hours (or 2 weeks), this is a little depressing.

I recommend reading the technical paper for a detailed breakdown of the best practices and optimizations; however, here is a summary:

  • Action Space Reduction: I chose to be aggressive here because I wanted the network to focus on the broader generalization goal and not be burdened with learning the semantic relationships between actions that a human takes for granted. I settled on a set of 8 actions (an 85% reduction from the original 54 actions); a sketch of this kind of action wrapper follows this list. In the paper, I tested action spaces of 6, 8, 20, 27, and 54 over 10m steps.
Comparison of action sets and their impact on the maximum floor reached after 10m steps.
  • Frame Stack Reduction: the goal of Frame Stack Reduction is to increase learning performance by reducing the number of required network inputs. I found I was able to reduce the stack of historical frames to a single black-and-white frame (see the preprocessing sketch after this list).
  • Large Scale Hyperparameters: typically a PPO implementation’s hyperparameters are tuned for the Atari benchmark. My hypothesis was that the Obstacle Tower Challenge environment is closer to the Unity Maze environment in Burda et al. (2019), as both environments are 3D in nature and both have sparse rewards. I made the following changes to the hyperparameters: 8 epochs, a constant learning rate of 1e-4 (no linear decay), entropy coefficient of 0.001, 8 mini-batches, 32 concurrent agents, and 512 steps per epoch.
  • Vector Observations: the goal of adding Vector Observations was to maximize the use of visible pixels and to remove the burden of the policy needing to learn visually encoded state information. I also added the previous action and previous reward to the vector.
  • Normalized Observations: the goal of normalizing observations is to help increase the variance per pixel.
  • Reward Hacking: I made the following changes to enhance the reward signal. Completing the floor: add the remaining health to encourage the policy to finish each floor as quickly as possible (the reward ranges between 1 and 4). Puzzle completion: give a reward of 1 (by default, this is 0.1). Health pickup: give a reward of 0.1 for picking up a health pickup (blue dot). Game over: give a reward of -1.
  • Recurrent Memory: the motivation for adding recurrent memory was that it would help the policy learn temporal aspects of solving puzzles.
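
The action-space reduction mentioned in the first bullet can be expressed as a thin gym.ActionWrapper that maps a small Discrete set onto Obstacle Tower’s MultiDiscrete([3, 3, 2, 3]) branches. The 8-entry lookup table below is illustrative; the exact action set I used is in the PPO Dash paper:

```python
import gym
import numpy as np

class ReducedActions(gym.ActionWrapper):
    """Map a small Discrete action set onto Obstacle Tower's
    MultiDiscrete([3, 3, 2, 3]) branches (roughly: movement, camera, jump, strafe;
    check the environment docs for the exact branch ordering).

    The 8-action table is an illustrative reduction, not the exact set from the paper.
    """
    LOOKUP = np.array([
        [0, 0, 0, 0],  # no-op
        [1, 0, 0, 0],  # forward
        [2, 0, 0, 0],  # backward
        [0, 1, 0, 0],  # rotate camera one way
        [0, 2, 0, 0],  # rotate camera the other way
        [1, 0, 1, 0],  # forward + jump
        [0, 0, 0, 1],  # strafe one way
        [0, 0, 0, 2],  # strafe the other way
    ])

    def __init__(self, env):
        super().__init__(env)
        self.action_space = gym.spaces.Discrete(len(self.LOOKUP))

    def action(self, act):
        return self.LOOKUP[act]
```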

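Likewise, the frame-stack reduction from the second bullet boils down to a small observation wrapper: collapse the RGB frame to a single grayscale channel rather than stacking several past frames. A rough sketch, assuming an (H, W, 3) image observation (my actual preprocessing may differ in details such as resolution):

```python
import gym
import numpy as np

class SingleGrayFrame(gym.ObservationWrapper):
    """Collapse an (H, W, 3) RGB frame to one grayscale channel instead of
    stacking several past frames. Illustrative sketch only."""

    def __init__(self, env):
        super().__init__(env)
        h, w, _ = env.observation_space.shape
        low = float(env.observation_space.low.min())
        high = float(env.observation_space.high.max())
        self.observation_space = gym.spaces.Box(low=low, high=high,
                                                shape=(h, w, 1), dtype=np.float32)

    def observation(self, obs):
        # Luminance-weighted average; the weights sum to 1, so the value range is preserved.
        gray = obs.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
        return gray[..., None]  # keep a trailing channel axis
```
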
My instinct is that 10m is just too few steps to demonstrate the impact of these modifications. For example, I was only able to score 9.2 without recurrent memory. For round two, I dropped Normalized Observations. It’s not clear how impactful Reward Hacking was, and I am not sure I would use it again.

Exploration Puzzles vs Semantic Puzzles

Many researchers see the challenge of sparse-reward environments, such as Montezuma’s Revenge, as an ‘exploration problem’, and have proposed solutions such as curiosity (maximizing exploration of uncertain states) and empowerment (maximizing future optionality). While these can be useful in solving some puzzles, such as the double jump puzzle, I do not think they capture puzzles such as the block puzzle introduced on level 10.

When a human player comes across the block puzzle, they have a rich semantic knowledge from years of playing video games and watching Indiana Jones movies. They know the semantic relationships: ‘player can push crates,’ ‘triggers open doors,’ and that ‘puzzles can be reset.’

The benefit of semantic relationships is that one can apply graph theory to chain semantics together and solve problems. We often attribute genius to the act of connecting two seemingly unrelated ideas to provide a novel solution to a hard problem. Perhaps if we can model environments semantically, then our AI will exhibit some of this genius.

Illustration of different roles and types of natural language information in reinforcement learning. From the paper ‘A Survey of Reinforcement Learning Informed by Natural Language’

I am not the first person to have this insight. The paper, ‘A Survey of Reinforcement Learning Informed by Natural Language’ has a good overview of the active research in this area.

Project Sugar Cube

My idea is to create a kind of Turing Test for visual OpenAI Gym-compatible environments. Imagine two terminals, Terminal A and Terminal B, either of which can be controlled by a human, by replayed data, or by an AI policy.

The idea is to train two agents: a Terminal A agent, which learns to issue ‘instructions’ or ‘thoughts’ plus rewards, and a Terminal B agent, which learns to master the OpenAI Gym environment.

Project Sugar Cube: Terminal A gives instructions to Terminal B

Terminal A views the visual environment feed from Terminal B and issues text-based ‘instructions’ or ‘thoughts’ to Terminal B, along with a +1 / -1 reward signal.

Project Sugar Cube: Terminal B tries to complete the goal

Terminal B is the OpenAI Gym environment with a couple of modifications: it receives text input and an additional reward signal from Terminal A.

Project Sugar Cube: Terminal A gives a positive or negative reward based on Terminal B’s performance.

Terminal A then gives positive or negative rewards to Terminal B.

Because each terminal is abstracted, one can take a staged approach. For example, one stage may focus on training Terminal B on prerecorded Terminal A input and rewards.
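
To make the abstraction a little more concrete, here is a sketch of what the Terminal B side could look like: a Gym wrapper that accepts a text instruction and an extra reward channel from Terminal A. Everything here, the class and method names and how the instruction is exposed, is a hypothetical design rather than the prototype’s actual API:

```python
import gym

class TerminalB(gym.Wrapper):
    """Hypothetical Terminal B wrapper for Project Sugar Cube.

    Terminal A (human, replay, or policy) pushes a text instruction and a
    +1/-1 reward, and both are folded into what the Terminal B agent sees.
    """

    def __init__(self, env):
        super().__init__(env)
        self._instruction = ""
        self._terminal_a_reward = 0.0

    def receive_from_terminal_a(self, text, reward=0.0):
        """Called by the Terminal A side."""
        self._instruction = text
        self._terminal_a_reward += reward

    def step(self, action):
        obs, env_reward, done, info = self.env.step(action)
        reward = env_reward + self._terminal_a_reward
        self._terminal_a_reward = 0.0
        # Expose the latest instruction alongside the pixels.
        return {"image": obs, "instruction": self._instruction}, reward, done, info

    def reset(self, **kwargs):
        self._instruction = ""
        self._terminal_a_reward = 0.0
        return {"image": self.env.reset(**kwargs), "instruction": self._instruction}
```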

I chose to spend time developing a prototype of Project Sugar Cube, knowing that it was a long shot to pay off in time for the competition, but also knowing I had a plan B to address the block puzzle.

I am close to releasing the prototype, which supports two human users and the Obstacle Tower Environment for data collection. I plan to use the Google Cloud credits I won in Round One toward collecting data over the next year.

Round Two Finale

Round two saw a significant update to the Obstacle Tower Environment. Initially, I struggled to reproduce my round one results and found I needed to randomize across all visual styles. I left this training running while I focused on the Project Sugar Cube prototype.

I switched back to focusing on the competition about two weeks before the end date. My ‘plan B’ was to inject prerecorded demonstrations of a human completing the block puzzle.

I had hoped to use inverse reinforcement learning by driving the agent with prerecorded inputs until just before the puzzle was complete, then slowly increasing the number of steps the agent had to complete on its own. The problem I hit was that the environment is somehow not fully reset when restarting a level, so the replay often fell out of sync.

I switched strategies and injected the raw observations, rewards, and actions into the training data. Initially, I made it so that when the agent got to level 10, it would do one pass using a prerecording of that level, then a pass using the policy. This was slow to learn, so I implemented the ability to set the number of policy steps to take between random replays.
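
Schematically, the demonstration injection is a rollout loop that, every so often, replaces the policy’s actions with a recorded action sequence and stores those transitions as if the policy had produced them. The helper names and sampling scheme below are illustrative, not my exact implementation:

```python
import random

def collect_rollout(env, policy, storage, demos, policy_steps_per_replay=2048):
    """Interleave on-policy steps with replayed demonstration actions.

    `demos` is a list of recorded action sequences for the block-puzzle floor;
    `storage` is whatever buffer the PPO update consumes. Schematic only.
    """
    obs = env.reset()
    steps_since_replay = 0
    while not storage.full():
        if steps_since_replay >= policy_steps_per_replay and demos:
            # Replay one recorded demonstration, logging it like any other rollout.
            for action in random.choice(demos):
                next_obs, reward, done, _ = env.step(action)
                storage.add(obs, action, reward, done,
                            log_prob=policy.log_prob(obs, action))
                obs = env.reset() if done else next_obs
                if done:
                    break
            steps_since_replay = 0
        else:
            action = policy.act(obs)
            next_obs, reward, done, _ = env.step(action)
            storage.add(obs, action, reward, done,
                        log_prob=policy.log_prob(obs, action))
            obs = env.reset() if done else next_obs
            steps_since_replay += 1
```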

Final agent solving the block puzzle. Note how it struggles on the first few attempts.

Again this was slow, but it did learn to complete the block puzzle some of the time. It also completed higher levels, which was satisfying, as I only provided demonstrations of level 10.

However, it did not fully generalize. It would master some seeds, only to forget them as it mastered others.

It was too late at this point to try any significant new strategies. I tried recording more demonstrations. I tried removing the demonstrations, and the agent would slowly forget how to complete the block puzzle. I also tried starting training from level 10, but this seemed to make things worse.

Across my different submissions, the agent did pass level 10 on every test seed at some point, but never all on the same run. I tried an ensemble of separately trained network instances, which may have helped a tiny bit.
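
The ensemble was nothing more sophisticated than averaging the action distributions of several checkpoints before choosing an action. Roughly (the action_probs method is an assumption about the policy interface, not the real checkpoints’ API):

```python
import torch

def ensemble_act(obs, policies):
    """Average action probabilities across several trained policies.

    Assumes each policy exposes `action_probs(obs)` returning a tensor of
    per-action probabilities for a single observation.
    """
    with torch.no_grad():
        probs = torch.stack([p.action_probs(obs) for p in policies]).mean(dim=0)
    return torch.argmax(probs).item()
```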

I recommend reading Unixpickle’s write-up of his winning agent; he also used PPO with demonstrations, along with some interesting tricks. It sounds like my agent hit a plateau similar to the one his agent hit before he implemented Prierarchy, replacing the entropy bonus with a KL term. I want to try adding the KL term to my code, as it would be interesting to see whether the block puzzle is solvable without the other tricks he used.
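
Conceptually, the change is small: instead of adding an entropy bonus to the PPO loss, penalize the KL divergence between the current policy and a fixed prior (for example, a policy trained on the demonstrations). A hedged sketch, assuming categorical action distributions; the coefficients and the KL direction are my guesses, not Unixpickle’s exact formulation:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def policy_regularizer(dist, prior_dist=None, ent_coef=0.001, kl_coef=0.01):
    """Return a term to add to the PPO loss: either the usual entropy bonus
    or a KL-to-prior penalty in its place.

    `dist` and `prior_dist` are Categorical action distributions over the same
    batch of observations; the coefficients are placeholders.
    """
    if prior_dist is None:
        return -ent_coef * dist.entropy().mean()              # standard PPO entropy bonus
    return kl_coef * kl_divergence(dist, prior_dist).mean()   # Prierarchy-style KL term
```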

Final Thoughts

I really enjoyed my time working on the competition, interacting with other contestants on Twitter, and the inspiration it gave me to work towards semantics & RL with Project Sugar Cube.

My final score was 10.8. I had dropped out of the top three, which meant no prizes (cash, Google Cloud credits, travel grants). However, the goal of winning was never as important to me as developing my intuition, improving my discipline, and working towards my long-term goal of creating digital actors.

