The Pursuit of (Robotic) Happiness: How TRPO and PPO Stabilize Policy Gradient Methods

Cody Marie Wild
Towards Data Science
Jul 9, 2018 · 16 min read


Reinforcement Learning strikes me as the wild west of machine learning right now: a place full of drama and progress, with dreams of grand futures hovering on everyone’s horizons. But it’s also a place a bit removed from the rules and assumptions governing the worlds of supervised and unsupervised learning. One of the most salient breaks from those typical assumptions is that, in policy gradient methods in particular, you can’t reliably know whether the direction you’re moving in will actually improve your reward. As a result, a number of recent policy gradient innovations — two of which this post will discuss — focus on how to optimize more confidently in conditions where your estimates of the best direction to pursue are of meaningfully higher variance.

A Quick Detour Into Terminology

  • “Trajectory” — A trajectory is a chain of states, actions, and rewards. Trajectories are collected by following some policy that determines which action you’ll take at a given state, but that policy doesn’t need to be a learned or even sensible one; trajectories can just as well be collected under a random policy.
  • “State” — This is typically used, loosely, to mean: the set of environmental inputs that the agent interacts with at any given point in time.
  • “Reward” — Depending on context, this could refer either to the reward given at a specific timestep, or to the discounted reward from a given action’s timestep until the end of the trajectory.

You Can’t Always Know What You Want: The Problem of Credit Assignment

The fundamental framing of a reinforcement learning problem is that of an agent operating in an environment: it takes in observations (otherwise known as a state, s), and takes actions, which trigger the environment to produce some reward, r, and also update to a new state that results from a given action. An important thing to note about this setup is that it implicitly assumes that the environment has some ability to make judgments, either about actions or preferred states of the world.

This may seem obvious, but that’s probably only because we’re so used to hearing about “reward” as being a fundamental component of RL. But, rather than thinking of reward simply as some fully abstract concept, it’s worth thinking through different kinds of environments, and how well they do and don’t lend themselves to a quantified reward.

On one side of the spectrum, you have points-based games like many of the Atari ones, where — because having many points is a definitional component of what it means to win the game — points can provide a quite direct signal of reward. In the middle, imagine something like Go. In this situation, there is a well-defined notion of winning or losing, but the environment doesn’t automatically supply assessments of how valuable any given intermediate step is, or how close it brings you to winning the game. In fact, a substantive part of what we want the agent to do, in a case like that, is develop knowledge of what intermediate game states are preferred over others.

And, on the other extreme, we have the problem of teaching a robot to walk. In this case, any rewards have to be explicitly designed by humans to capture what we think it means to walk effectively; they’re not inherently provided by the structure of the environment. All of this is just to say: the notion of reward can be a useful abstraction, but you should remember that it is used to describe a quite wide range of feedback mechanisms.

This discussion of reward feeds into what I believe is the single most defining fact about reinforcement learning: that you don’t have visibility into the function you’re trying to optimize. This is a subtle point, and one often made offhandedly, in a tone of it being more obvious than it is.

Imagine a case where, in supervised learning, your model produces a set of softmax values (the traditional output structure for a classification network). If you compare those outputs to the one-hot-encoded true label, you can calculate your cross-entropy error as a direct analytic function of each of those softmax outputs, which allows you to backpropagate how your loss would change if any of those softmax outputs were to change, and so on back into the network. But, in RL, instead of your loss being a differentiable function into which you have full, equation-level visibility, it’s instead a black box controlled by the universe. You take some series of actions, and receive some series of rewards, but it’s not possible to directly calculate how much you should upregulate or downregulate the probabilities of each specific action in order to increase your reward.
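To make the contrast concrete: in the supervised case, for softmax outputs p computed from logits z and a one-hot label y, the loss and its gradient are available in closed form,

L(p, y) = -\sum_i y_i \log p_i, \qquad \frac{\partial L}{\partial z_i} = p_i - y_i,

and the chain rule carries that gradient back through the rest of the network. No analogous closed-form expression exists for the gradient of environmental reward with respect to your parameters.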

You can create estimates of the gradient of reward with respect to your parameters — the approach taken by policy gradients, one of the two main strains of modern RL — but these estimates have higher inherent variance than those in supervised learning. In supervised learning, small minibatch sizes give you variance in your current estimate of loss magnitude, but the actual derivative of loss with respect to parameters is a deterministic quantity, defined by the network. By contrast, in reinforcement learning, you’re estimating the derivative itself. Especially in many-step problems, where a single action early on can lead to a totally different path, the space of possible trajectories that a given policy can lead to is very wide, which means you need more samples to get a good estimate of how well a given policy is performing.

Meet the Policy Gradient

Source: Trust Region Policy Optimization paper. This simulated walker is an example of the kind of continuous-valued problem (where the policy outputs continuous-valued torques for each joint) for which policy gradient methods are particularly well-suited.

To build up the intuition for a policy gradient approach, start by thinking about the most straightforward form of reinforcement learning: an evolutionary method. In a prototypical evolutionary approach, called the Cross Entropy Method, you define some distribution over policy parameters — let’s say a Gaussian with some mean and variance — and collect N trajectories under N sets of parameters sampled from it. Then, take the top p% highest-performing trajectories, and fit a new Gaussian parameter distribution to the parameter samples that produced that top p%. So, on the next round, your N samples are drawn from a distribution corresponding to the best samples from the last round.
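As a rough sketch of that loop (a toy numpy version; evaluate_policy is a stand-in I’m assuming for “roll out a trajectory with these parameters and return its total reward”), it might look like this:

```python
import numpy as np

def cross_entropy_method(evaluate_policy, n_params, n_samples=50,
                         elite_frac=0.2, n_iterations=100):
    """Toy cross-entropy method over policy parameters.

    evaluate_policy is assumed to take a parameter vector, roll out a
    trajectory under the corresponding policy, and return its total reward.
    """
    mean = np.zeros(n_params)        # mean of the Gaussian over parameters
    std = np.ones(n_params)          # per-parameter standard deviation
    n_elite = max(1, int(n_samples * elite_frac))

    for _ in range(n_iterations):
        # Draw N parameter vectors and score each one by collecting a trajectory.
        samples = mean + std * np.random.randn(n_samples, n_params)
        returns = np.array([evaluate_policy(theta) for theta in samples])

        # Keep the top p% of parameter vectors and refit the Gaussian to them.
        elite = samples[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

    return mean
```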

The idea behind this is that, within a parameter set that generated a good trajectory, each parameter within that set is likely to be closer to an optimal parameter setting. While simple methods like this work surprisingly well sometimes, they have the disadvantage of updating parameter values randomly, without following any explicit gradient that carries information about what changes to parameter values are likely to improve performance. Now, one might reasonably argue: you just said, not 10 paragraphs ago, that in reinforcement learning, you can’t specify a gradient chain from your parameters to your reward, because the environmental function between actions and reward is a black box.

And that’s right: you can’t make that calculation directly. But, in a network that outputs a softmax over actions, you can calculate a gradient that will move that softmax towards some different set of actions. And, so, if you define an estimated loss gradient with respect to your actions, you can backprop through your action distribution to improve that estimated loss.

A basic policy gradient along these lines operates by attaching to each action something called an Advantage Estimate. This numerically answers the question of “how much better is this action than the average, or expected, action I’d take at this state”. Importantly, advantage estimates use future discounted rewards: all the rewards that accumulate from the point where that action was taken onwards, with discount rates applied to rewards further in the future. Note that this is different from the evolutionary approach, where each action is implicitly given partial responsibility for the rewards of the full trajectory, rather than just the rewards that came after it. Given this advantage estimate attached to each action, a basic policy gradient approach uses as its objective the expected advantage of a given trajectory, implicitly weighting each action according to its probability by calculating and summing advantage estimates for actions sampled from the probabilistic policy. While each method uses its own slight modifications, this basic framework is the conceptual foundation of policy gradient approaches.
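For concreteness, here is a minimal sketch of that surrogate objective in PyTorch (the tensor names are my own assumptions, not anything from a particular codebase): the log-probability of each sampled action, weighted by its advantage estimate, so that gradient descent on this loss amounts to gradient ascent on expected advantage.

```python
import torch

def policy_gradient_loss(policy_net, states, actions, advantages):
    """Basic policy gradient loss: advantage-weighted log-probabilities.

    states:     tensor of states visited while sampling from the current policy
    actions:    (long) tensor of the discrete actions that were actually sampled
    advantages: per-action advantage estimates, computed from discounted future
                rewards minus a baseline, and treated as fixed numbers here
    """
    logits = policy_net(states)                    # unnormalized action scores
    log_probs = torch.log_softmax(logits, dim=-1)  # log pi(a|s) for every action
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi of sampled actions

    # Minimizing this loss performs gradient ascent on expected advantage.
    return -(taken * advantages).mean()
```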

What to Expect When You’re Expecting

Many machine learning problems are defined in terms of minimizing or maximizing some kind of expected value. For those who are a little further from their statistics days, the expected value of a function is defined as:
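\mathbb{E}_{x \sim p(x)}[f(x)] \;=\; \sum_x p(x)\, f(x)

(with the sum becoming an integral when x is continuous).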

Expectations are defined with respect to some probability distribution, p(x), from which you expect x to be sampled, and can conceptually be described as “what you’d get if you were to sample x from p(x), and average the f(x) you got from those samples”. This is salient to bring up in the context of reinforcement learning because it helps illustrate yet another subtle difference between it and supervised learning.

In a canonical supervised learning problem, f(x) is some loss function — for example, cross-entropy loss between the predicted softmax outputs and the targets — and the p(x) distribution is the distribution over inputs that exists within your training set. In words, this problem framing seeks to minimize expected loss, which translates to building a network that has low error for a certain set of inputs x: the ones that constitute your training set. This is done implicitly, by the fact that the network by definition only sees examples that exist in the training set, and will see an example (or cluster of examples, close in input space) more frequently if it is high probability under the empirical distribution of the training set, p(x). Here, p(x) is outside of your control, and f(x), the network + loss function, is what you’re optimizing when you train your network.

Now, imagine a reinforcement learning problem, where we want to maximize the expected rewards from a given policy.
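In symbols:

\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] \;=\; \max_\theta \sum_\tau p_\theta(\tau)\, R(\tau)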

The structure of the equation is nearly the same here, except that x has been replaced with tau, to signify that each tau is now a trajectory, drawn from the distribution p-theta induced by the policy. (Tau is the canonical symbol for a trajectory). However, here, what you’re optimizing has flipped. Because optimizing a policy causes you to choose different actions, updating your policy parameters impacts the expected value through the distribution of trajectories over which reward is calculated, whereas in supervised learning the distribution of samples was dictated by a fixed training set. And where before our network optimized the loss/reward function that acted upon the samples, here we have no visibility into the reward function R(tau), and our only lever for impacting expected reward is changing the distribution of trajectories we sample over.

This fact has consequences for the failure modes that policy gradient methods can experience. In a supervised learning setting, no matter how many ill-advised optimization steps we take, we won’t impact the distribution of samples that we’re learning over, since that’s outside of our control. By contrast, in a policy gradient setting, if we end up in a particularly poor region of action-space, that can leave us in a position where we have very few usefully informative trajectories to learn from. For example, if we were trying to navigate a maze to find a treat, and had accidentally learned a policy of just staying still, or spinning in circles, we would have a hard time reaching any actions that had positive reward, since we can only learn to increase an action probability by experiencing that action and seeing that it leads to reward.

Don’t Worry, You Can Trust Me

The aforementioned problem of catastrophic collapse has historically led hyperparameter optimization in policy gradient methods to be a fragile and fiddly process: you need to find a learning rate sufficient to actually make progress on the problem, but not so high that you frequently end up terminating your learning process by accident. This is particularly salient in situations where a catastrophic end to your learning process wouldn’t just mean having to rerun some code, but would actually mean the destruction of a physical robotic agent that might have just driven off a cliff.

A question we really want to answer, when we’re about to update our policy, is: can we make an update that is guaranteed to improve on our current policy, or, at minimum, not worsen it? A way of fully calculating this — referring to our policy post-update as the “new policy” and pre-update as the “old policy” — would be to:

  1. Know which states the new policy will visit, and with what probability it will visit them.
  2. At each state, take a sum over actions of: the new-policy probability of each action at that state, multiplied by how much better that action is than the average old-policy action at that state. (This quantity, “how much better is this action than the average action at this state”, is termed Advantage).
  3. The above sum operation amounts to taking the expected Advantage at each state, sampled according to the new policy’s action distribution at that state.
  4. We weight each state’s expected new-policy advantage by how likely that state is to be reached by the new policy.
  5. All told, this would get us the expected advantage of the new policy over the old one, and lets us confirm that the expected advantage is positive before we make a move.
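In the TRPO paper’s notation, where pi is the old policy, pi tilde the new one, rho the discounted state-visitation distribution, eta the expected return, and A_pi the old policy’s advantage function, that calculation is:

\eta(\tilde{\pi}) \;=\; \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)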
This equation is basically a mathematical restatement of items 1–5 above.

The new policy’s probability of a given action at a given state is easy to calculate: we can simply feed the data from any state — even one sampled by a different policy — into a copy of the policy network with updated parameters, and get that network’s action distribution.

What is more difficult, however, is the distribution of states that would be reached by the new policy. As with many expectations, we don’t have any way of directly calculating the probability of ending up in a given state under a given policy. So, rather than explicitly summing over states and calculating p(s) for each, we estimate via sampling: if we take the set of states reached by a policy, and calculate the new-policy advantage for each, then we are implicitly weighting by p(s), because more probable states will be more likely to be included in a sample. To calculate that without sampling, we would need to know the transition function — the distribution over next environment states given an action taken at a given state — and that’s an unknown environmental black box.

The approach of Trust Region Policy Optimization (TRPO) is to calculate an estimate of the advantage quantity described above, but to do so using the distribution of states from the old policy, rather than the state distribution from the new policy, which we could essentially only acquire via potentially costly sampling.
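Written out in the same notation, that surrogate objective is:

L_\pi(\tilde{\pi}) \;=\; \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)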

Note: eta(pi), the first term, is just the expected reward of the old policy, which we already know from having sampled from it. You can see that, in contrast to the exact identity above, the distribution over states is from the old policy (pi) rather than the new policy (pi tilde).

The validity of this approximation is directly connected to how different the old-policy state distribution is from the new-policy one. If the updated policy reaches the same states with similar enough probability, then this approximation is more likely to hold. Intuitively, this makes sense: if the new policy frequently ends up in some state the old policy never or rarely saw, where all of its actions are far worse than those the old policy would take, then an old-policy-approximated loss that gives that state low or zero weight would be inappropriately optimistic about the performance of the new policy.

While the proof of this is a bit long and involved for this post, the original TRPO paper showed that if you put a limit on the KL Divergence between the action distributions (or action distribution parameters, if you’re learning a parametrized policy) of the new and old policies, that translates into a limit on the difference in the new and old policy state distributions. And, if there’s a strong enough limit on that divergence, we know that our new policy state distribution is close enough to the old policy one that our approximated loss function should be valid.
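Put more concretely, and up to some notational simplification, the problem TRPO actually solves is the surrogate objective under a hard constraint on the average KL divergence between the old and new action distributions:

\max_\theta \; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right] \le \delta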

A Farewell to KL

Empirically, the TRPO approach performs quite well: problems that before required precise, problem-specific hyperparameter searches were now solvable with a set of reasonable parameters that transferred well across problems. However, one challenge of TRPO was enforcing the hard KL divergence constraint: doing so involves second-order machinery (approximating the constraint with the Fisher information matrix and solving the resulting linear system with conjugate gradient), which added complexity and computing time to implementations of the method.

OpenAI’s release of Proximal Policy Optimization (PPO) last year sought to combat that specific trouble. To understand that approach, we first need to understand something about the way that the surrogate loss function L-pi is calculated, and, specifically, the way that expected advantage is calculated (shown below).

While our model structure does give us the ability to directly calculate the probability of every action at a given state, for reasons of reduced computation we don’t tend to explicitly calculate the advantage of every action at a given state and weight it by its probability. Instead, we once again calculate an implicit expectation via sampling: we sample actions according to our network’s softmax action distribution, and so actions at a given state show up in accordance with how probable they are. However, recall that with TRPO, the whole goal was to be able to infer the behavior of a new, post-update policy using samples drawn from the old policy. Because we’re using samples that are drawn under the old policy, each action is implicitly being weighted by its probability of being sampled under the old policy. So, to correct this, in TRPO we take each advantage estimate sampled under the old policy, and multiply it by a ratio: probability under the new policy, divided by probability under the old. Conceptually, this just cancels out the implicit old-policy probability by dividing by the explicit old-policy probability (to get to 1) and then multiplying by the new-policy probability.
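In symbols, writing r_t(theta) for that probability ratio and A-hat_t for the advantage estimate at timestep t, the surrogate expected advantage becomes:

\hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right] \;=\; \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\, \hat{A}_t \right]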

Slightly different symbols than the last few equations, but hopefully still clear.

This probability ratio, which the PPO paper terms r-theta, is the focus of PPO’s modified approach. Instead of optimizing that surrogate expected-advantage loss under a KL divergence constraint, PPO proposes the following objective (note that this set of symbols wraps the expectation over states and actions into the E-hat_t term):
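L^{\text{CLIP}}(\theta) \;=\; \hat{\mathbb{E}}_t\!\left[ \min\!\Big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right]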

In words, this is calculating two things, and taking the minimum of them:

  1. R-theta * Advantage: The term with the ratio as described above, just taking new action probability divided by old.
  2. A clipped R-theta * Advantage: Here, epsilon is defined to be some fairly small value, say 0.2, so that we clip the probability ratio to lie between 0.8 and 1.2. Then, we multiply that clipped ratio by the advantage (sketched in code just below).
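As a minimal sketch (PyTorch again, with tensor names I’m assuming; a real implementation would add value-function and entropy terms and an outer optimization loop), the clipped surrogate loss looks something like this:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective from PPO, negated so it can be minimized.

    new_log_probs: log pi_theta(a_t | s_t) under the policy being updated
    old_log_probs: log pi_theta_old(a_t | s_t), recorded when the data was collected
    advantages:    advantage estimates for the sampled actions
    """
    # r_t(theta): probability ratio of the new policy to the old, per sampled action.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Elementwise minimum: the pessimistic estimate of the surrogate advantage.
    return -torch.min(unclipped, clipped).mean()
```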

The fundamental operating theory of this approach is the same as TRPO: conservatism, and caution towards updates that would take you into regions not much explored by the old policy. Let’s think through why by walking through a few possible cases.

A few made-up illustrative examples

Clipping only ever makes our estimate of the new policy’s advantage more pessimistic: it kicks in when a really good action has become likelier under the new policy, or when a bad action was likelier under the old one and has been downweighted by the new one. In the former case, we limit how much credit the new policy can get for upweighting good actions, and in the latter case, we limit how much slack we cut the new policy for downweighting bad actions. The overall gist is that we give the algorithm less incentive to make updates that would lead to strong changes in the action or state distribution, because we’re muting both kinds of signal that might lead to such an update occurring.

In all but one case, PPO outperforms or is tied for top performance with competing algorithms. Source: PPO paper

A meaningful advantage of this approach is that, unlike with TRPO’s hard KL divergence constraint, this clipped objective can be directly optimized using gradient descent, allowing for cleaner and faster implementations.

Chesterton’s Policy Fence

In a 1929 book, G.K. Chesterton wrote:

In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle…There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”

This caution towards radically new approaches, and wariness of running too enthusiastically towards change, even if it superficially looks to be an improvement, is the fundamental theme running through both of the policy gradient improvement methods discussed here. At the risk of getting excessively flowery, it’s quite the metaphorically resonant idea: gaining confidence through gradual, cautious steps, taking you just a bit closer to a better place.

Remaining Questions

As is now becoming my habit, I’d like to end with some questions:

  • In many problems that have sparse or nonexistent rewards until the moment of success, I would expect reinforcement learning methods in general to have trouble. Are there any known studies of whether Q-learning or policy gradient methods perform better in these low-reward-density scenarios?
  • Relatedly, I would love to understand how crucial reward shaping — the process of creating hand-engineered rewards — is to solving different kinds of problems.
  • I never got totally clear on why enforcing the KL divergence constraint within TRPO requires conjugate gradient methods.

Even within the topic of policy gradients alone, there was so much richness and nuance to the field that I didn’t have time to touch on here: the role of advantage estimates in reducing variance, the pros and cons of policy gradients vs Q-learning, the role of stochastic policies in leading to better learning. If learning more about any of the above sounds interesting to you, I have a somewhat unorthodox recommendation: John Schulman (researcher behind both TRPO and PPO, currently of OpenAI)’s doctoral thesis. Given that it wasn’t explicitly written as public-facing teaching material, it is remarkably well and clearly written, and contains some of the best explanations of concepts that I was able to find during my research.

