Curiosity in Deep Reinforcement Learning

Understanding Random Network Distillation

Michael Klear
Towards Data Science

Learning to play Montezuma’s Revenge, a previously difficult deep RL task, has seen big breakthroughs with Exploration by Random Network Distillation (source: Parker Brothers Blog)

Learning to play Atari games is a popular benchmark task for deep reinforcement learning (RL) algorithms. These games offer a nice balance between simplicity and complexity: some games (e.g., Pong) are simple enough to solve with basic algorithms like vanilla policy gradient, while other games are complex enough to stump even the most advanced algorithms.

In between the simplest and most complex games is a useful spectrum of tasks that has become central to many deep RL research papers.

Taken from the OpenAI Blog.

A previously “unsolved” Atari game, Montezuma’s Revenge, was recently solved (sort of) by an algorithm that is able to surpass human performance in terms of points scored. Researchers were able to encourage the agent to explore different rooms in the first level, which is a great way to earn points in this particular game.

Exploration Through Curiosity

When a human plays adventure games (like Montezuma’s Revenge), there is an intrinsic desire to explore. Game designers build such games to encourage this behavior, often requiring gamers to explore in order to make progress. This is arguably what makes adventure games fun (ask anyone who’s ever enjoyed Skyrim).

Adventure games like Montezuma’s Revenge or Skyrim take advantage of the player’s natural desire to explore, making exploration a key component to completing in-game tasks.

Most deep RL algorithms perform “exploration” through a stochastic policy: actions are randomly sampled from an action probability distribution output by a neural network. The result, especially early on (before the policy has had time to converge), is an apparently random selection of actions.
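As a rough illustration (not tied to any particular paper or library beyond PyTorch itself), sampling actions from a categorical policy might look something like this; the architecture and sizes are placeholders:

```python
import torch
from torch import nn

# A minimal sketch of a stochastic policy for a discrete action space
# (e.g., Atari). The network shape here is illustrative only.
class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)

policy = Policy(obs_dim=4, n_actions=6)
obs = torch.randn(1, 4)            # a dummy observation
dist = policy(obs)
action = dist.sample()             # random sampling is the "exploration"
log_prob = dist.log_prob(action)   # used later by the policy-gradient update
```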

This works in some cases. Pong, for example, can be solved by moving the paddle around randomly and observing outcomes. A few lucky moments where the ball is deflected can get optimization started.

In a game like Montezuma’s Revenge, this behavior gets us nowhere; imagine the avatar randomly moving left, right, and jumping, starting from the beginning of the game. He’ll just end up falling into lava or walking right into an enemy without ever earning points. Without points, and therefore without a reward, the algorithm has no signal to optimize.

So, you’re just going to flail around randomly? Good luck with that (source).

Curiosity

There has been a lot of emphasis on finding better ways to explore. Curiosity-based exploration can be seen as an attempt to emulate the curiosity-driven behavior of a human gamer.

But how do we create a curious agent?

There are a number of approaches that can accomplish this, but one approach, which uses next-state prediction, is particularly interesting because of its simplicity and scalability.

The basic idea is to train an independent predictive model alongside the policy model. This predictive model takes the current observation and the selected action as input and produces a prediction of the next observation. Because the predictive model is continually trained via supervised learning on the transitions the agent actually experiences, its prediction error should be small on well-explored trajectories and large on less-explored ones.

What we can do, then, is create a new reward function (called “intrinsic reward”) that provides rewards proportional to the loss of the predictive model. Thus, the agent receives a strong reward signal when exploring new trajectories.
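A minimal sketch of this idea, assuming a simple forward model over flat observation vectors (all names, shapes, and hyperparameters are illustrative, not taken from any specific paper):

```python
import torch
from torch import nn
import torch.nn.functional as F

# Forward dynamics model: predicts the next observation from the current
# observation and the chosen action. Architecture is a placeholder.
class ForwardModel(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, 128), nn.ReLU(),
            nn.Linear(128, obs_dim),
        )

    def forward(self, obs, action):
        action_onehot = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([obs, action_onehot], dim=-1))

forward_model = ForwardModel(obs_dim=4, n_actions=6)
opt = torch.optim.Adam(forward_model.parameters(), lr=1e-4)

def curiosity_step(obs, action, next_obs):
    """Return an intrinsic reward and keep training the forward model."""
    pred_next = forward_model(obs, action)
    error = F.mse_loss(pred_next, next_obs, reduction="none").mean(dim=-1)
    intrinsic_reward = error.detach()   # reward proportional to prediction error
    opt.zero_grad()
    error.mean().backward()             # supervised update on visited transitions
    opt.step()
    return intrinsic_reward
```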

“Learning to explore” through curiosity in level 1 leads to faster progress in level 2 using next-state prediction in this Mario emulator task (source).

This technique led to some encouraging results in a Super Mario emulator.

Procrastinating Agents: The TV Problem

This technique is not perfect. There’s a known problem: agents are attracted to stochastic or noisy elements in their environment. This is sometimes called the “white noise” problem or the “TV problem”; it’s also been referred to as “procrastination.”

To demonstrate this effect, imagine an agent who learns to navigate through a maze by observing the pixels that he sees.

A next-state prediction curious agent learns to successfully navigate a maze (source).

The agent ends up doing well; he’s driven to look for unexplored parts of the maze because of his ability to make good predictions in well-explored areas (or, rather, his inability to make good predictions in unexplored areas).

Now put a “TV” on the wall of the maze displaying randomly-selected images in rapid succession. Due to the random source of images, the agent cannot accurately predict what image will come next. The predictive model will produce high loss, which gives high “intrinsic” rewards to the agent. The end result is an agent who prefers to stop and watch TV rather than continue exploring the maze.

The next-state prediction curious agent ends up “procrastinating” when faced with a TV, or source of random noise, in the environment (source).

Avoiding Procrastination with Random Network Distillation

A solution to the noisy TV problem is proposed in Exploration by Random Network Distillation (RND), a very recent paper published by some of the good folks at OpenAI.

The novel idea here is to apply a similar technique to the next-state prediction method described above, but to remove the dependence on the previous state.

Next-State prediction vs. RND overview (source).

Rather than predicting the next state from the current state and action, RND takes the observation from the next state and makes a prediction about that same next state. On its own, that’s a pretty trivial prediction task, right?

The purpose of the random network part of RND is to take this trivial prediction task and translate it into a hard prediction task.

Use a Random Network

This is a clever, albeit counter-intuitive, solution.

The idea is that we use a randomly-initialized neural network to map the observations to a latent observation vector. The output of this function itself is actually unimportant; what’s important is that we have some unknown, deterministic function (a randomly-initialized neural network) that transforms observations in some way.

The task of our predictive model, then, is not to predict the next state, but to predict the output of the unknown random model given an observed state. We can train this model using the outputs of the random network as labels.
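In code, this boils down to two networks with the same interface: a fixed, randomly initialized target and a trainable predictor. A hedged sketch, with placeholder architectures (the paper itself works on pixel observations with convolutional networks):

```python
from torch import nn

def make_net(obs_dim, feat_dim=64):
    # Placeholder architecture; purely for illustration.
    return nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, feat_dim),
    )

obs_dim = 4
target_net = make_net(obs_dim)      # randomly initialized and then frozen
predictor_net = make_net(obs_dim)   # trained to match the target's outputs

for p in target_net.parameters():   # never update the random target network
    p.requires_grad_(False)
```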

When the agent is in a familiar state, the predictive model should make good predictions of the expected output from the random network. When the agent is in an unfamiliar state, the predictive model will make poor predictions about the random network output.

In this way, we can define an intrinsic reward function that is again proportional to the loss of the predictive model.

Conceptual overview of the intrinsic reward computation. Only the next-state observation is used.
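Continuing the sketch above, the intrinsic reward falls out directly: run the next-state observation through both networks, use the predictor’s error as the reward, and use that same error as the supervised loss that trains the predictor:

```python
import torch
import torch.nn.functional as F

# Uses target_net and predictor_net from the previous sketch.
opt = torch.optim.Adam(predictor_net.parameters(), lr=1e-4)

def rnd_intrinsic_reward(next_obs):
    with torch.no_grad():
        target_feat = target_net(next_obs)   # "labels" from the fixed random net
    pred_feat = predictor_net(next_obs)
    error = F.mse_loss(pred_feat, target_feat, reduction="none").mean(dim=-1)
    intrinsic_reward = error.detach()        # high error = unfamiliar observation
    opt.zero_grad()
    error.mean().backward()                  # train only the predictor
    opt.step()
    return intrinsic_reward
```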

This can be interpreted as a “novelty detection” scheme, where the computed loss is higher when the observation is new or unfamiliar to the predictive model.

The authors use MNIST as a proof-of-concept for this idea. In this experiment, they feed MNIST characters from one class through a randomly-initialized neural network. They then train a parallel network to predict the output of the random network given its input. As expected, they see the loss from the parallel network on the target class drop as the number of training examples from the target class increases.

MNIST proof-of-concept, from the research paper.
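For intuition, a rough approximation of this experiment (not the paper’s exact setup; the class choice, architectures, and training budget below are all assumptions) might look like this:

```python
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# Train a predictor to match a frozen random network on digits of one class,
# then compare its error on that class vs. digits it has never seen.
mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

def make_net():
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(),
                         nn.Linear(128, 32))

target, predictor = make_net(), make_net()
for p in target.parameters():
    p.requires_grad_(False)

target_class = 0
imgs = torch.stack([img for img, lbl in mnist if lbl == target_class][:512])
with torch.no_grad():
    labels = target(imgs)                      # fixed "random network" outputs

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
for step in range(200):                        # train only on the target class
    loss = F.mse_loss(predictor(imgs), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# After training, the predictor's error should be low on unseen examples of the
# target class and noticeably higher on digits from classes it has never seen.
```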

In this way, when the agent sees a source of random noise, it doesn’t get stuck. It no longer tries to predict the unpredictable next frame on the screen; it just needs to learn how those frames get transformed by the random network.

Exploring Montezuma’s Revenge

Previous next-state prediction curiosity mechanisms failed to solve Montezuma’s Revenge because agents tended to latch onto degenerate sources of novelty (much like the noisy-TV problem described above), but RND seems to have overcome these issues.

Agents driven by curiosity explore rooms and learn to collect keys, which allow them to unlock new rooms.

Despite this success, the agent only “occasionally” passes the first level. This is because careful management of key usage is required to get through the final door and complete the level. Internal-state models, such as LSTMs, will be required to overcome this barrier.

So, while RND has gotten agents to surpass average human performance in terms of points scored, there is still a ways to go before mastering the game.

This is a part of a series of posts about experimental deep reinforcement learning algorithms. Check out some previous posts in the series:

Understanding Evolved Policy Gradients
