Generalization in Reward Learning
Assessing Generalization in Reward Learning with Procedurally Generated Games (1/2)
Authors: Anton Makiievskyi, Liang Zhou, Max Chiswick
Note: This is the first of two blog posts (part two). In these posts, we describe a project we undertook to assess the ability of reward learning agents to generalize. The implementation for this project is available on GitHub.
This first post will provide a background on reinforcement learning, reward learning, and generalization, as well as summarize the main aims and inspirations for our project. If you have the requisite technical background, feel free to skip the first couple sections.
About Us
We are a team that participated in the 2020 AI Safety Camp (AISC), a program in which early career researchers collaborate on research proposals related to AI safety. In short, AI safety is a field that aims to ensure that as AI continues to develop, it does not harm humanity.
Given our team’s mutual interests in technical AI safety and reinforcement learning, we were excited to work together on this project. The idea was originally suggested by Sam Clarke, another AISC participant with whom we had fruitful conversations over the course of the camp.
Reinforcement Learning
In reinforcement learning (RL), an agent interacts with an environment with the goal of earning rewards. Ultimately, the agent wants to learn a strategy in order to maximize the rewards it obtains over time. First things first, though: what exactly is an agent, and what are rewards? An agent is a character that interacts with some world, also known as an environment, by taking actions. For instance, an agent could be a character playing a video game, the car in a self-driving car simulation, or a player in a poker game. Rewards are simply numbers that encode the agent’s goal by indicating whether what happens to the agent is preferable or not. For example, picking up a coin may give a positive reward, while getting hit by an enemy may give a negative reward.
In RL, the state represents everything about the current situation in the environment. What the agent can actually see, however, is an observation. For example, in a poker game, the observation may be the agent’s own cards and the previous actions of the opponent, while the state also includes the cards of the opponent and the sequence of cards in the deck (i.e., things the agent can’t see). In some environments like chess where there is no hidden information, the state and the observation are the same.
Given observations, the agent takes actions. After each action, the agent will get feedback from the environment in the form of:
- Rewards: Scalar values, which can be positive, zero, or negative
- A new observation: Taking the action moves the agent to a new state, and the agent receives a new observation of that state. (The environment also indicates whether the new state is "terminal", meaning whether the current interaction is finished or still in progress. For example, completing a level or getting eaten by an opponent will terminate many games.)
In RL, our goal is to train the agent to be really good at a task by using rewards as feedback. Through one of many possible training algorithms, the agent gradually learns a strategy (also known as a policy) that defines what action the agent should take in any state to maximize reward. The goal is to maximize reward over an entire episode, which is a sequence of states that an agent goes through from the beginning of an interaction to the terminal state.
Hugely successful agents have been trained to superhuman performance in domains such as Atari and the game of Go.

Let’s look at how a sample algorithm might work, using the video game Mario as an example. Let’s say that Mario has an enemy to his right, a mushroom to his left, and nothing above (see figure below). With those three action options, he might get a reward of +2 if he goes left, -10 if he goes right, and 0 if he goes up. After Mario takes an action, he will be in a new state with a new observation, and will have earned a reward based on his action. Then it will be time for another action, and the process goes on. Recall that the intention is to maximize rewards for an entire episode, which in this context is the series of states from the beginning of the game for the duration of Mario’s life.

The first time the algorithm sees this situation, it might select an option randomly since it doesn’t yet understand the consequences of the available actions. As it sees this situation more and more, it will learn from experience that in situations like these, going right is bad, going up is ok, and going left is best. We can’t directly explain to a dog how to fetch a ball, but by giving it treats (rewards) when it does, the dog learns by reinforcement. Similarly, Mario’s actions are reinforced by feedback from experience: mushrooms are good and enemies are not.
How does the algorithm work to maximize the rewards? Different RL algorithms work in different ways, but one might keep track of the results of taking each action from this position, and the next time Mario is in this same position, he would select the action expected to be the most rewarding according to the prior results. Many algorithms select the best action most of the time, but also sometimes select randomly to make sure that they are exploring all of the options. (Note that at the beginning, the agent usually acts randomly because it hasn’t yet learned anything about the environment.)
It’s important to keep exploring all of the options to make sure that the agent doesn’t find something decent and then stick with it forever, possibly ignoring much better alternatives. In the Mario game, if Mario first tried going right and saw it was -10 and then tried up and saw it was 0, it wouldn’t be great to always go up from that point on. It would be missing out on the +2 reward for going left that hadn’t been explored yet.
Imagine that you tried cooking at home and didn’t like the food, and then went to McDonald’s and had a fantastic meal. You found a good "action" of going to McDonald’s, but it would be a shame (and not great health-wise) if you kept eating at McDonald’s forever and didn’t try other restaurants that may end up giving better "rewards".
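To make this concrete, here is a minimal sketch (not any particular published algorithm) of keeping a table of average rewards per action and choosing actions "epsilon-greedily": mostly picking the best-known action, but occasionally exploring a random one. The states, actions, and rewards are the hypothetical ones from the Mario example above.

```python
import random
from collections import defaultdict

# Hypothetical sketch: epsilon-greedy action selection over a table of
# average observed rewards, using the Mario situation from the example above.
ACTIONS = ["left", "right", "up"]
EPSILON = 0.1  # fraction of the time we explore by acting randomly

# value_table[state][action] -> [total reward seen, number of times tried]
value_table = defaultdict(lambda: {a: [0.0, 0] for a in ACTIONS})

def choose_action(state):
    """Mostly pick the action with the best average reward so far,
    but explore a random action EPSILON of the time."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    stats = value_table[state]

    def avg(action):
        total, count = stats[action]
        # Untried actions look infinitely promising, so they get tried first
        return float("inf") if count == 0 else total / count

    return max(ACTIONS, key=avg)

def update(state, action, reward):
    """Record the reward observed after taking `action` in `state`."""
    stats = value_table[state][action]
    stats[0] += reward
    stats[1] += 1

# One hypothetical step: Mario sees the situation from the figure, chooses an
# action, and records the reward (+2 mushroom, -10 enemy, 0 for going up).
state = "enemy_right_mushroom_left"
action = choose_action(state)
update(state, action, {"left": 2, "right": -10, "up": 0}[action])
```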
Generalization
RL is often used in game settings like Atari. One problem with using RL in Atari games (which are similar to Mario-style games) is their sequential nature. After winning one level, you advance to the next level, and keep going through levels in the same order. The algorithms may simply memorize exactly what happens in each level and then fail miserably when facing the slightest change in the game. This means that the algorithms may not actually be understanding the game, but instead learning a sequence of button presses that leads to high rewards for particular levels. A better algorithm, instead of memorizing button presses, "understands" the structure of the game and can therefore adapt to unseen situations, or generalize.
Successful generalization means performing well in situations that haven’t been seen before. If you learned that 2 × 2 = 4, 2 × 3 = 6, and 2 × 4 = 8, and then could figure out that 2 × 6 = 12, that means that you were able to "understand" the multiplication and not just memorize the equations.

Let’s look at a generalization example in the context of an email spam filter. These usually work by collecting data from users who mark emails in their inbox as spam. If a bunch of people marked the message "EARN $800/DAY WITH THIS ONE TRICK" as spam, then the algorithm would learn to block all of those messages for all email users in the future. But what if the spammer noticed his emails were being blocked and decided to outsmart the filter? The next day he might send a new message, "EARN $900/DAY WITH THIS ONE OTHER TRICK". An algorithm that is only memorizing would fail to catch this because it was just memorizing exact messages to block, rather than learning about spam in general. A generalizing algorithm would learn patterns and effectively understand what constitutes a piece of spam mail.
Reward Learning
Games generally have very well-defined rewards built into them. In a card game like Blackjack, rewards correspond to how much the agent wins or loses each hand. In Atari, rewards are game dependent, but are well-specified, such as earning points for defeating enemies or finishing levels and losing points for getting hit or dying.
The image below is from a classic reinforcement learning environment called CartPole, where the goal is to keep a pole balanced upright on a moving cart, and where a reward of +1 is provided for every time step that the pole stays upright. The agent moves the cart left or right to try to keep the pole balanced, and the longer it keeps the pole balanced, the more +1 rewards it receives.
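For readers who want to see what interacting with CartPole looks like in code, here is a minimal sketch using the OpenAI Gym interface. The exact return values of reset() and step() vary between Gym versions, so treat this as illustrative rather than exact.

```python
import gym  # classic OpenAI Gym API; newer gymnasium versions differ slightly

# A random agent interacting with CartPole for one episode.
# Each call to step() returns a reward of +1 while the pole stays upright.
env = gym.make("CartPole-v1")
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()            # push the cart left or right at random
    observation, reward, done, info = env.step(action)
    total_reward += reward                        # +1 per time step the pole stays balanced
print(f"Episode finished with total reward {total_reward}")
```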

However, many tasks in the real world do not have such clearly defined rewards, which leads to limitations in the possible applications of reinforcement learning. This problem is compounded by the fact that even attempting to specify clearly defined rewards is often difficult, if not impossible. A human could provide direct feedback to an RL agent during training, but this would require too much human time.
One approach called inverse reinforcement learning involves "reverse engineering" a reward function from demonstrations. For complex tasks, figuring out the reward function from demonstrations is very difficult to do well.
Reward learning involves learning a reward function, which describes how much reward is received in each situation in the environment, i.e. a mapping from the current state and action to the reward received. The goal is to learn a reward function that encourages the desired behavior. To train an algorithm to learn a reward function, we need another source of data, such as demonstrations of successfully performing the task. The learned reward function outputs a reward prediction for each state, after which standard RL algorithms can be used to learn a strategy by simply substituting these predicted rewards in place of the usually known true rewards.
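As a rough illustration (assuming a PyTorch setup, and not the exact architecture used in any of the papers below), a learned reward function can be as simple as a small neural network that maps an observation to a single predicted reward:

```python
import torch
import torch.nn as nn

# A minimal sketch of a learned reward function: a small network that maps an
# observation vector to a predicted scalar reward.
class RewardModel(nn.Module):
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),   # one number: the predicted reward for this observation
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

# During RL training, the environment's true reward (if any) is simply replaced
# by the model's prediction:
# predicted_reward = reward_model(torch.as_tensor(observation, dtype=torch.float32))
```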

Prior work (described below as Christiano et al. 2017) provides an example that illuminates how difficult learning the reward function can be. Imagine teaching a robot to do a backflip. If you aren’t a serious gymnast, it would be challenging to give a demonstration of successfully performing the task yourself. One could attempt to design a reward function that an agent could learn from, but this approach often falls victim to non-ideal reward design and reward hacking. Reward hacking means that the agent can find a "loophole" in the reward specification. For example, if we assigned too much reward for getting in the proper initial position for the backflip, then maybe the agent would learn to repeatedly move into that bent over position forever. It would be maximizing rewards based on the reward function that we gave it, but wouldn’t actually be doing what we intended!
A human could supervise every step of an agent’s learning by manually giving input on the reward function at each step, but this would be excessively time consuming and tedious.
The difficulty in specifying the rewards points towards the larger issue of human-AI alignment whereby humans want to align AI systems to their intentions and values, but specifying what we actually want can be surprisingly difficult (recall how every single genie story ends!).
Relevant Papers
We’d like to look at several recent reward learning algorithms to evaluate their ability to learn rewards. We are specifically interested in how successful the algorithms are when faced with previously unseen environments or game levels, which tests their ability to generalize.
To do this, we leverage a body of prior work:
- Deep reinforcement learning from human preferences – 2017 by Christiano et al.
- Reward learning from human preferences and demonstrations in Atari – 2018 by Ibarz et al.
- Leveraging Procedural Generation to Benchmark Reinforcement Learning – 2019 by Cobbe et al.
- Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations – 2019 by Brown, Goo et al.
The first two papers were impactful in utilizing reward learning alongside deep reinforcement learning, and the third introduces the OpenAI Procgen Benchmark, a useful set of games for testing algorithm generalization. The fourth paper proposed an efficient alternative to the methods of the first two works.
Deep reinforcement learning from human preferences (Christiano et al. 2017)
The key idea of this paper is that it’s a lot easier to recognize a good backflip than to perform one. The paper shows that it is possible to learn a predicted reward function for tasks in which we can only recognize a desired behavior, even if we can’t demonstrate it.
The proposed algorithm is shown below. It alternates between learning the reward function through human preferences and learning the strategy, which are both initially random.
Repeat until the agent is awesome:
1. Show two short video clips of the agent acting in the environment with its current strategy
2. Ask a human in which video clip the agent was better
3. Update the reward function given the human’s feedback
4. Update the strategy based on the new reward function
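Step 3 is where the reward model is actually fit to the human’s answers. Here is a minimal sketch of that update, assuming the Bradley-Terry style preference model described in the paper: the probability that the human prefers a clip is proportional to the exponentiated sum of predicted rewards along that clip. The names `reward_model`, `clip1_obs`, `clip2_obs`, and `label` are hypothetical placeholders rather than names from the authors’ code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, clip1_obs, clip2_obs, label):
    """label = 0 if the human preferred clip 1, 1 if they preferred clip 2."""
    sum_r1 = reward_model(clip1_obs).sum()   # total predicted reward of clip 1
    sum_r2 = reward_model(clip2_obs).sum()   # total predicted reward of clip 2
    logits = torch.stack([sum_r1, sum_r2])
    # Cross-entropy between the model's preference probabilities and the human label;
    # minimizing this pushes the reward model to agree with the human's choices.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([label]))
```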
The simulated robot (shown in the figure below) was trained to perform a backflip using about 900 preference queries and under an hour of human time, a task that would be very difficult to demonstrate or to manually create rewards for.

Experiments were performed in the MuJoCo physics simulator and in Atari games. Why run these experiments in Atari, where the true rewards are already known? Doing so gives the opportunity to assign preferences automatically instead of having a human manually give feedback about two video clips. We can get automatic (synthetic) feedback by simply ranking the clip with the higher true reward as the better one. This lets us run experiments very quickly because no human effort is needed. Furthermore, in this case we can assess the performance of the algorithm by comparing the learned reward function to the true rewards given by the game.
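A synthetic labeler of this kind can be as simple as the following sketch, where `clip1_true_rewards` and `clip2_true_rewards` are hypothetical lists of the per-step rewards reported by the game for each clip:

```python
# Sketch of synthetic (automatic) preference labeling: the clip with the higher
# true game score is marked as preferred, so no human needs to be queried.
def synthetic_label(clip1_true_rewards, clip2_true_rewards):
    """Return 0 if clip 1 is 'preferred', 1 if clip 2 is 'preferred'."""
    return 0 if sum(clip1_true_rewards) >= sum(clip2_true_rewards) else 1
```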
Reward learning from human preferences and demonstrations in Atari (Ibarz et al. 2018)
This paper built on the prior work by performing additional experiments in the Atari domain with a different setup and a different RL algorithm. Their main innovation is to utilize human demonstrations at the beginning in order to start with a decent strategy, whereas the original algorithm would have to start with an agent acting completely randomly since no rewards are known at the beginning. The addition of these human demonstrations improved learning significantly in three of nine tested Atari games relative to the no-demonstrations method used by Christiano et al.
Leveraging Procedural Generation to Benchmark Reinforcement Learning (Cobbe et al. 2019)
OpenAI, an AI research lab, developed reinforcement learning testbed game environments called the Procgen Benchmark, which includes 16 unique games. Within each game, all levels are similar and share the same goal, but the actual components like the locations of enemies and hazards are randomly generated and therefore can be different in each level.
This means that we can train our agent on many random levels and then test it on completely new levels, allowing us to understand whether the agent is able to generalize its learning. Note the contrast to Atari games, in which training is done on sequential levels where the enemies, rewards, and other game objects are always in the same places. Furthermore, when agents trained on such sequential, non-randomly generated games are evaluated, they are tested on those same levels in the same order. An important machine learning principle is to train with one set of data and test with another set to truly evaluate the agent’s ability to learn and generalize.
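Concretely, Procgen exposes this train/test split through its level generation parameters. Below is a minimal sketch of creating separate training and testing environments; the parameter names come from the Procgen gym interface, while the specific values are arbitrary choices for illustration.

```python
import gym

# Train on a fixed pool of procedurally generated levels, then evaluate on
# levels outside that pool. `num_levels` and `start_level` control which
# level seeds the environment is allowed to generate.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=200,      # only levels 0..199 are used for training
    start_level=0,
    distribution_mode="easy",
)

test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,        # 0 means the full, unrestricted distribution of levels
    start_level=10_000,  # start far away from the training seeds
    distribution_mode="easy",
)
```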
We looked primarily at four environments from Procgen in our work:
- CoinRun: Collect the coin at the end of the level while dodging enemies
- FruitBot: Eat fruit and avoid non-fruit foods like eggs and ice cream
- StarPilot: Side scrolling shooter game
- BigFish: Begin as a small fish and eat other strictly smaller fish to get bigger
Below are screenshots from each of the games. The agent view uses a lower resolution so that the algorithm requires less computation. The human view shows how the game would look if a human were playing.


The main experiments in the paper involved training agents in all 16 unique games over a range of 100 to 100,000 training levels each, while keeping the training time fixed. These agents were then tested on levels that they had never played before (this is possible because each level is uniquely auto-generated). They found that agents need to be trained on as many as 10,000 levels of the game (training levels) before they are able to demonstrate good performance on the test levels.
The StarPilot game plot below shows the training performance in blue and the testing performance in red. The y-axis is the rewards and the x-axis is the number of levels used for training. Note that the x-axis is on a logarithmic scale.

We see that the agent does very well during training when there are few levels, and that training performance then dips before climbing back up. Why would the agent get worse when given more levels to train on? Since the training time is fixed, an agent trained on only 100 levels repeats the same levels over and over and can easily memorize everything (but then does very poorly at test time on the unseen levels). With 1,000 levels, the agent has to spread its time over more levels and therefore can’t learn each level as well. Once we reach 10,000 and more levels, the agent sees such a diversity of levels that it performs well because it has begun to generalize its understanding. We also see that the test performance quickly improves to nearly the level of the training performance, suggesting that the agent is able to generalize well to unseen levels.
Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations (Brown, Goo et al. 2019)
The algorithm proposed in this paper, called T-REX, is different from the previously mentioned reward learning methods in that it doesn’t require ongoing human feedback during the learning procedure. While the other algorithms require relatively little human time compared to supervising every agent action, they still require a person to answer thousands of preference queries. A key idea of T-REX is that the human time commitment can be significantly reduced by completing all preference feedback at the beginning, rather than continuously throughout the learning procedure.
The first step is to generate demonstrations of the game or task that is being learned. The demonstrations can either come from a standard Reinforcement Learning agent or from a human.
The main idea is that we can get a lot of preference data by extracting short video clips from these demonstrations and assigning preferences to them based only on a ranking of the demonstrations they came from. For example, with 20 demonstrations, each demo would get a rank of 1 through 20. A large number of short video clips would be taken from each of these demonstrations, and each clip would be assigned the ranking of the demo it came from, so whenever two clips are compared, the preference goes to the clip from the better-ranked demo. The reward function would then be learned from these preferences.
This is in contrast to the approach of the prior works that require human preference input over each and every pair of 1–2 second clips. Here, we only require human preference input to rank the initial demonstrations. T-REX’s disadvantage, however, is that it is using an approximation. Not all clips from a higher ranked demonstration should be preferred to clips from a lower ranked demonstration, but the idea is that on average, the preferences would work well and the procedure would suffice to learn a reward model.
Providing a ranking over demonstrations is the same as giving preferences between every pair of them. For example, if we had three demos and ranked them 3>1>2, this means that we would generate the rankings of 3>1, 3>2, and 1>2. Then randomly generated clips from the demos would be given the same preference ranking based on which demos the clips came from.
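Here is a minimal sketch of that clip-labeling procedure. The names and sampling details are hypothetical and simplified relative to the paper; the key point is that a clip’s preference label comes entirely from the rank of the demonstration it was cut from.

```python
import random

# Sketch: turn ranked demonstrations into preference-labeled clip pairs.
# `demos` is a hypothetical list of demonstrations ordered from worst to best,
# where each demonstration is a list of observations.
def sample_preference_pairs(demos, num_pairs=5000, clip_len=50):
    pairs = []
    for _ in range(num_pairs):
        # Pick two different demonstrations; their positions in the list give their ranks.
        i, j = random.sample(range(len(demos)), 2)
        lo, hi = (i, j) if i < j else (j, i)      # `hi` is the better-ranked demo
        clip_lo = random_clip(demos[lo], clip_len)
        clip_hi = random_clip(demos[hi], clip_len)
        # Label: every clip from the better demo is assumed to be preferred.
        pairs.append((clip_lo, clip_hi, 1))       # 1 means "the second clip is preferred"
    return pairs

def random_clip(demo, clip_len):
    """Cut a random contiguous clip of up to `clip_len` steps from a demonstration."""
    start = random.randint(0, max(0, len(demo) - clip_len))
    return demo[start:start + clip_len]
```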
The T-REX paper showed that having just 12 demonstrations was sufficient to learn a useful reward model. There are 12 * 11 / 2 = 66 distinct pairs for any 12 objects, so ranking 12 demonstrations from 1 to 12 is equivalent to answering up to 66 queries about which demo is better, which is ~100 times fewer queries than required by the algorithm by Christiano et al. Again, although the T-REX ranked demonstrations method is more efficient, it is sacrificing precision due to the simplifying assumption that all clips from a better demo are better than all clips from a worse demo.
Brown and Goo et al.’s Atari-based experiments showed that T-REX was competitive against the Ibarz et al. method that was previously described. It was able to learn better-than-demonstrator quality agents using only 12 demonstrations and their corresponding preference (rank) labels.
The figure below shows a comparison between the scores from human demonstrations and the scores achieved by the T-REX algorithm in five Atari games. T-REX soundly outperformed the human scores in 3 of the 5 games, though it was not able to earn any points in Montezuma’s Revenge.

T-REX also exceeded the performance of a state-of-the-art behavioral cloning algorithm (BCO) and an imitation learning algorithm (GAIL) in 7 out of 8 games, as shown in the chart below, while also beating the best available demonstrations in 7 out of 8 games. (Behavioral cloning algorithms try to act as closely to the demonstrations as possible, while inverse reinforcement learning algorithms attempt to recover a reward function from expert demonstrations.)

Next: Implementations and Experiments
Given T-REX’s strong results and conceptual simplicity, we decided to build our initial experiments around combining this algorithm with the Procgen game environments, which gives us a highly efficient reward learning algorithm and a variety of benchmark games for testing generalization. We will explain the details of our implementation, the experimental results, and the issues we faced in the second blog post of this series.