Generalization in Reward Learning
Assessing Generalization in Reward Learning with Procedurally Generated Games (2/2)
Authors: Anton Makiievskyi, Liang Zhou, Max Chiswick
Note: This is the second of two blog posts (part one). In these posts, we describe a project we undertook to assess the ability of reward learning agents to generalize. The implementation for this project is available on GitHub.
In the first post, we reviewed some fundamental background material and described the inspiration for our project as well as its aims. In doing so, we discussed a number of papers that served as the launching point for our experiments.
We based our implementations on T-REX [Brown and Goo et al. 2019]. We chose it for its straightforward setup, as well as the open-source implementation provided by the authors, which would save us substantial time. According to the paper, the algorithm showed strong results on Atari games, so we wanted to evaluate how well those results would carry over to Procgen environments, where levels would be randomly generated and thus could not be memorized.
We ran the T-REX algorithm on four Procgen game environments: CoinRun, FruitBot, StarPilot, and BigFish. As a brief recap: the algorithm is supposed to learn to play the games well without ever having access to the game score. The game objective must be inferred solely from the provided set of ranked demonstrations; an agent is then trained on the rewards predicted by the resulting reward model.
These are the algorithm steps that we implemented:
- Generate demonstrations of varying quality and rank them by the total true reward they earn.
- Train a reward model that, given two demonstrations, predicts which one is ranked higher.
- Train a reinforcement learning agent using the learned reward model in place of the environment's true reward.
Creating our dataset of demonstrations
In order to train anything at all, we needed a dataset. For T-REX, the dataset consists of ranked demonstrations from the environment. How many of them? In the T-REX paper, the authors were able to obtain good results on Atari with just 12 ranked demonstrations. However, we suspected that we would need quite a bit more, as we believed that the Procgen environments could be a good deal more difficult to learn because of the inherently random levels.
Either way, we needed to produce demonstrations of varying quality from the four Procgen environments we wanted to test on. To do so, we could either play the games ourselves or we could train Reinforcement Learning agents on those environments, using the true rewards provided by the actual environments themselves.
We chose the latter; trained agents let us generate demonstrations hundreds of times faster than the manual approach. Why is it "legal" to use the true rewards when the purpose of reward learning is to learn without being given that information? Because the true rewards are used only to produce and rank the demonstrations: during training, the algorithm has access only to the demonstrations and their ranking, not to the rewards those demonstrations earned.
Fortunately, the Procgen paper came with code that we could use to train a number of agents to various degrees of performance. Even with this speedup, generating "fresh" demos would still have meant a few minutes of waiting at the start of every experiment. That would have substantially slowed our iteration on the code, so we decided to generate all of the demonstrations before running any experiments – thousands of them for each environment – and rank them by the total reward earned.
This way, we created a robust source of demonstrations for any environment, of any quality, of any length, whenever we wanted them.
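To give a flavor of what this looks like, here is a minimal sketch of generating and ranking a demonstration bank. The `policy_act` function stands in for one of the pretrained Procgen agents and is a placeholder, not code from our repository.

```python
import gym

def generate_demo(env, policy_act, max_steps=1000):
    """Roll out one episode and record frames, actions, and the total true reward."""
    obs = env.reset()
    frames, actions, total_reward = [], [], 0.0
    for _ in range(max_steps):
        action = policy_act(obs)  # pretrained agent of some skill level (placeholder)
        frames.append(obs)
        actions.append(action)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return {"frames": frames, "actions": actions, "return": total_reward}

# Hypothetical usage: build a bank of demos once, then rank by total true reward.
# env = gym.make("procgen:procgen-fruitbot-v0", distribution_mode="easy")
# demos = [generate_demo(env, policy_act) for _ in range(2000)]
# demos.sort(key=lambda d: d["return"])  # the ranking T-REX trains on
```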
Our first reward models
Now that we had our demonstration dataset, it was time to get to training our very first reward models using the T-REX procedure.
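The heart of the T-REX procedure is a pairwise ranking loss: the reward model should assign a higher summed reward to the better-ranked of two demonstration snippets. Below is a minimal PyTorch sketch of that idea; the tiny convolutional network and the 64x64 input assumption are our own simplification, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Tiny convolutional reward model: maps a stack of frames shaped
    (N, 3, 64, 64) to one scalar predicted reward per frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, stride=3), nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.LeakyReLU(),
            nn.Flatten(),
            nn.LazyLinear(64), nn.LeakyReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, frames):
        return self.net(frames)  # (N, 1) per-frame rewards

def trex_loss(model, snippet_worse, snippet_better):
    """Ranking loss: the summed predicted reward of the better snippet should
    exceed that of the worse one (cross-entropy over the pair of sums)."""
    r_worse = model(snippet_worse).sum()
    r_better = model(snippet_better).sum()
    logits = torch.stack([r_worse, r_better]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([1])  # index of the better snippet
    return F.cross_entropy(logits, target)
```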
In order to evaluate the quality of a reward model, we could train an agent on it with a reinforcement learning algorithm and see how that agent performed, but this is computationally expensive. So we looked for an interim metric of how close a trained reward model was to the true reward. We settled on a simple measure: the correlation coefficient between predicted and true episode returns. This interim metric is probably less reliable than a full RL evaluation, but introducing it reduced the time each experiment took by about 5x, allowing us to gather many more data points for assessing the algorithm. We could then select the best reward models and run an RL algorithm only on those, instead of on every model.
Our goal was a high correlation, across environment episodes, between the returns predicted by the trained reward model and the true environment returns, which would suggest that the trained reward model resembles the true (target) reward. For each of our four chosen environments, we ran tests that varied the number of demonstrations provided to T-REX, and measured the correlation between the total reward predicted by the reward model for a game and the actual reward collected in that game.
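Concretely, the interim metric amounts to something like the sketch below; `reward_model` and `episodes` are placeholders, and Pearson and Spearman coefficients are shown simply to illustrate that more than one correlation measure can be tracked.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def episode_return_correlations(reward_model, episodes):
    """Correlate predicted vs. true episode returns over a batch of rollouts.

    `episodes` is a list of (frames, true_rewards) pairs for whole episodes;
    `reward_model(frames)` is assumed to return one predicted reward per frame.
    """
    true_returns = [float(np.sum(true_rewards)) for _, true_rewards in episodes]
    pred_returns = [float(np.sum(reward_model(frames))) for frames, _ in episodes]
    return pearsonr(true_returns, pred_returns)[0], spearmanr(true_returns, pred_returns)[0]
```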
Have a look at our first results:

On the x-axis is the number of demonstrations used in the experiment, and on the y-axis is the correlation coefficient. The orange and blue lines show two different correlation measures. Each large point is the average of 5 runs at that number of demos; the individual runs are shown as the smaller points.
We made three main observations from this figure:
- The correlations look quite strong in all of the games except for CoinRun.
- The correlation increases with the number of demos used. As we had guessed, 12 demos wouldn’t be sufficient in the more complex Procgen environments. The correlations are fairly steady and high from 100 to 200 demos.
- The variance of the runs decreases significantly as we get to 100 and 200 demos. We see that the small points representing the runs group much closer together as the number of demos increases.
Our next step was to train reinforcement learning agents on Procgen using our reward models that showed the highest correlation in place of true rewards from the environment. Here’s what we got:
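(Quick aside before the results: "using a reward model in place of the true reward" can be thought of as wrapping the environment so that the RL algorithm only ever sees the model's predictions, with the true reward kept aside for evaluation. The wrapper below is an illustrative sketch using the old Gym step API, not our actual training setup, which was built on the code released with the Procgen paper.)

```python
import gym

class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's true reward with the learned reward model's
    prediction, so a standard RL algorithm trains against the learned reward."""
    def __init__(self, env, reward_model):
        super().__init__(env)
        self.reward_model = reward_model

    def step(self, action):
        obs, true_reward, done, info = self.env.step(action)
        # Frame preprocessing (e.g., tensor conversion) is elided for brevity.
        predicted_reward = float(self.reward_model(obs))
        info["true_reward"] = true_reward  # kept only for logging/evaluation
        return obs, predicted_reward, done, info
```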


Something’s wrong
The clips generally look pretty good, but something’s off. If you look at FruitBot, you’ll see that although our trained agent has learned not to hit walls, it hasn’t learned a single thing about avoiding non-fruit items, which collectively constitute half of the game objects. Recall that the mission in the game is to survive while eating as much fruit as possible and avoiding all other foods. Here’s what an ideal agent would look like:

We conjectured that our trained agents were mainly learning to survive, that is, to keep the episode going for as long as possible (optimizing what’s called the "live-long" objective). An example of an environment where surviving directly leads to reward is CartPole (shown in the previous blog post), where the agent receives a reward of +1 for every timestep it keeps the pole from falling over. In FruitBot, a large reward is received for making it to the end of the level, so perhaps the agent had learned the importance of reaching the end, but not the importance of avoiding non-fruits.
Before jumping to conclusions, though, we wanted to do some further testing. We realized that this "live-long" objective makes a great deal of sense as a general baseline for our reward models: a good reward model trained on a Procgen environment should do better than a trivial reward model that returns a +1 reward for every timestep the agent stays alive, as in CartPole. Comparing against such a baseline would tell us whether our models had learned anything beyond mere survival. So we ran some comparisons.
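The live-long baseline itself is trivial to write down, which is part of what makes it such a useful sanity check. A sketch of the comparison we ran, with `true_rewards_per_episode` as a placeholder for per-episode lists of true rewards:

```python
import numpy as np

# The live-long "reward model" gives +1 per timestep, so its predicted
# episode return is simply the episode length.
def live_long_return(episode_length):
    return float(episode_length)

# Hypothetical comparison: correlate the baseline's predicted returns
# (episode lengths) with true returns, exactly as for the learned models.
# lengths = [live_long_return(len(ep)) for ep in true_rewards_per_episode]
# returns = [sum(ep) for ep in true_rewards_per_episode]
# baseline_corr = np.corrcoef(lengths, returns)[0, 1]
```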
Below we show the same correlation plots as above, but now with the live-long baselines (red dashed lines). Our reward models generally perform close to, or slightly worse than, this baseline, even with 100 and 200 demonstrations! CoinRun is an exception only because the baseline itself correlates so poorly there.

Although a correlation of over 75% between the learned reward model and the true reward might seem quite strong, the high live-long baseline correlations were concerning: they show that a reward model can be highly correlated with the true reward while capturing nothing about the game objective beyond episode length!
What to do?
Deep reinforcement learning is tricky. Results that look encouraging might not be so great upon closer inspection, as we saw in the last section. We focused our efforts on FruitBot, where the issue was most glaring, and looked at how true and predicted reward varied across individual episodes:

Each graph shows the true (green) and predicted (orange) rewards of an agent playing in different levels. On the x-axis is the timestep of the episode, and on the y-axis is the cumulative reward earned up until that timestep. In FruitBot, if an episode reaches 420 timesteps, the episode automatically ends and the agent is given a large reward (indicated by the green spikes near the end). Ideally, the green and orange lines would look very similar since the learned reward should be similar to the environment’s true reward.
Unfortunately, they don’t. In fact, other than the spike near the end, the predicted and true rewards seem almost unrelated to each other. What is going on?
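For reference, per-episode plots like these take only a few lines to produce; in this sketch, `true_rewards` and `predicted_rewards` are assumed to be per-timestep arrays for a single episode, with the latter coming from the reward model.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cumulative_rewards(true_rewards, predicted_rewards):
    """Plot cumulative true vs. predicted reward over one episode."""
    t = np.arange(len(true_rewards))
    plt.plot(t, np.cumsum(true_rewards), color="green", label="true reward")
    plt.plot(t, np.cumsum(predicted_rewards), color="orange", label="predicted reward")
    plt.xlabel("timestep")
    plt.ylabel("cumulative reward")
    plt.legend()
    plt.show()
```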
We looked for bugs in our code and tried several ways of improving our reward model. Without getting too much into the details, we applied the standard, tried-and-true ways of "massaging" a training procedure, including early stopping and regularization, as well as changes to the neural network architecture of the reward model.
None of these changes, however, seemed to make any significant difference, as shown in these updated plots:

Next, we decided to simplify the task itself. If the agent could learn in a simplified version, we would gradually increase the difficulty to figure out which specific elements were causing the problems. We knew that the algorithm had worked in the Brown et al. paper with only 12 demonstrations on Atari games, so we tried to make the Procgen task at least as easy.
First, we modified Procgen so that game levels appear in a fixed, sequential order instead of being randomly generated. This mirrors how Atari games work, and it makes the task much easier for an agent to learn.
Moreover, we went far beyond 12 demonstrations and gave the T-REX algorithm 150 demonstrations, providing much more data for learning and, in principle, a more precise reward model. Yet even after these modifications, the FruitBot reward model wasn’t satisfactory.
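For reference, Procgen exposes options that make this kind of simplification straightforward. The snippet below shows roughly what we mean by a "sequential" FruitBot, though the exact option values here are illustrative rather than our precise configuration.

```python
import gym

# Restrict the game to a fixed pool of levels that advance in order,
# rather than sampling a freshly generated level on every reset.
env = gym.make(
    "procgen:procgen-fruitbot-v0",
    num_levels=200,               # finite, fixed pool of levels
    start_level=0,
    use_sequential_levels=True,   # levels advance in a fixed order
    distribution_mode="easy",
)
```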

At this point, we were convinced that we had done something incorrectly, because it seemed that our modified tasks were at least as easy as those in the Brown paper, where the T-REX algorithm trained successful agents.
Back to the source
After not being able to produce good reward models even in the simplified setting, we decided to have a closer look at the original implementation and results reported by the authors to pinpoint exactly where we diverged. Because the authors had originally written the algorithm to work on Atari, we couldn’t make direct comparisons. However, we could look at how the reward models they provided would fare when compared to the live-long baseline we had established earlier. So, we took a T-REX-trained reward model from the Atari game Space Invaders and checked how its predicted reward would compare with the true reward given by the environment:

The red line should look like the green line, but they are quite different, just as we had seen in our earlier plots for the FruitBot game. It seems the reward model had learned to predict a roughly constant positive reward, much like the live-long baseline.
Upon further investigation, we noted that in the implementation supplied by the authors, the output of the T-REX reward model was passed through a sigmoid function, restricting predicted reward values to the range between 0 and 1.

We suspected that this would bias the reward model towards the live-long objective, since it makes it easy for the model to hand out a small, roughly constant positive reward at every timestep.
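To make the concern concrete: if the per-timestep reward fed to the RL agent is sigmoid(network output), it always lies in (0, 1), so a network whose outputs drift even moderately positive hands the agent a near-constant reward every step – which is exactly the live-long objective. An illustrative snippet (not the authors' code):

```python
import torch

def bounded_reward(network_output):
    """Reward actually passed to the RL agent: sigmoid keeps it in (0, 1)."""
    return torch.sigmoid(network_output)

# A reward network whose raw outputs hover around +3 already yields roughly
# 0.95 reward on every timestep, so the agent is effectively rewarded for
# episode length alone.
print(bounded_reward(torch.tensor([3.0, 3.2, 2.8])))  # ≈ tensor([0.95, 0.96, 0.94])
```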
We contacted Daniel Brown, the lead author of the T-REX paper, to discuss these concerns, and he was immensely responsive and helpful in answering our questions and debugging possible issues. As it turns out, he had considered this issue himself and even ran the relevant experiments in one of his later papers – he trained agents with both the T-REX reward model and the live-long objective and reported how their performance compared. According to the reported results, the T-REX algorithm in fact produced a meaningful model that outperformed the baseline.
We decided to run the corresponding experiments ourselves. As we mentioned earlier, we couldn’t afford to run as many "full" tests as we would have liked due to computational constraints, so we only ran experiments for the FruitBot environment in the easiest (sequential) mode.

As you can see, in our experiments we couldn’t get agents that outperformed those trained on the live-long baseline. To be fair, FruitBot is a tough game for the T-REX algorithm, and in other games we did sometimes get agents that outperformed the baseline, but only rarely.
We are not experienced deep reinforcement learning researchers, and it is possible that we are mistaken in our conclusions, but to us it seems that when the T-REX algorithm does beat the baseline, it does so by a rather small margin. We had planned to look at how this margin would change once we introduced the randomness of Procgen levels, and given how close to the baseline we were starting from, any difference would have been very hard to detect.
Wrapping up
After some fairly lengthy discussions, we decided it would be best to move on from T-REX and try another reward learning algorithm. After all, the original inspiration for our project isn’t unique to T-REX – we want to investigate reward learning algorithms in general. We initially picked T-REX because it seemed the easiest to implement and to modify; in retrospect, however, it might have been better to start with a more established reward learning method, especially since we were applying it in a more difficult setting. The most natural candidate is Christiano et al. 2017 (discussed in post one), widely cited as an important result that helped kick off the use of reward learning in deep reinforcement learning.
Although implementations of the algorithm from this paper are available on GitHub, none were in a condition that we could integrate very easily into our project. Therefore, the easiest way to move forward would be to re-implement this from scratch.
Reflection
Originally, we were interested in tackling the problem of measuring the capability of reward learning agents to generalize. We believed this was an important problem because, in order for an agent to be practically deployed in the real world, it would need to be able to robustly maneuver in unseen situations. Previous tests of generalization have been limited to standard reinforcement learning algorithms, and we hoped to help bridge that gap by applying the same tools, such as Procgen, to reward learning.
In short, we found that the reward functions we learned were quite brittle. After training a number of reward learning agents on a number of different Procgen environments and under a number of different parameter settings, we were unable to obtain reward functions that could then be used to further robustly train new agents. We experimented with the neural network architecture, but this was difficult as there were few principles to guide our exploration other than intuition. We tried some classic changes – adding convolutional layers, changing filter sizes, changing the loss function to predict multiple steps, etc. – and obtained middling results that were sometimes better than what we already had, but not as good as we’d hoped.
As with any deep learning project, performance can hinge on precise details of the architecture or training procedure. Although we tried to replicate the method from the T-REX paper as faithfully as we could, it is always possible that some parameter settings exist that would have produced good reward functions.
Takeaways
Although our project unfortunately didn’t lead to interesting results, we gained invaluable experience and learned important lessons, some of which we’d like to share:
- Write down your progress and plans throughout the project – writing down your thoughts clarifies your thinking. We would have saved ourselves a lot of time by carefully thinking through our next experiments. It’s tempting to hit the ground running as soon as possible, but that’s not as productive as it can seem.
- Start with established algorithms. It’s important to get results that work and then to optimize the implementation afterwards!
- Establish baselines early – we’ve heard this a million times, but still didn’t do it as early as we should have. We could have spotted the main issue much earlier.
- Don’t shy away from asking people for help (once you’ve done your homework). We’re not suggesting emailing people with questions you could Google in a couple of minutes. But if you’ve based your research on a paper and made interesting progress, or have coherent concerns about a previously reported result, you certainly should try to reach out to the authors. Over the course of the project, we communicated with a number of mentors and other researchers relevant to our project, including Daniel Brown, Jan Leike, and Adam Gleave, among others. We were introduced to some of them through the AI Safety Camp, but others we simply cold-emailed. All of them were very kind and gave thoughtful, detailed responses to our questions. We’re immensely grateful for their help.
This sums up our journey, its results, and lessons learned. We wish every aspiring researcher had a chance to experience what we did throughout this project and we’re infinitely grateful to the people who made this possible.