
I got a question about Generalized Advantage Estimation (GAE) on my article implementing the Phasic Policy Gradient (PPG) algorithm, so I thought I’d follow up here with some extra details 😊
Here is the original paper [1]. I’ll be describing the theory more intuitively and explaining how to implement it in code. If you’d like to experiment with this and other RL techniques, do check out my open source library, Pearl.
As a summary, this method allows us to estimate the advantage and value functions in reinforcement learning with easy control over a bias variance trade-off. Let’s get into it!
Setting the Scene
We are trying to teach some agent to solve an environment (e.g. a game). In order to do this, the agent must learn the expected outcome of its actions. Here, we will formalize the equations that will give the agent this information.
Trajectories
As an agent navigates the environment states, it collects rewards telling it whether the action it just took was good or bad. The sequence of actions, states and rewards an agent encounters over time is known as a trajectory, which we’ll define here with the symbol τ.
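As a rough sketch, a trajectory could be represented in code as an ordered list of steps (the names `Step` and `Trajectory` below are purely illustrative, not Pearl’s API):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Step:
    """One element of a trajectory: the state seen, the action taken and the reward received."""
    state: Any
    action: Any
    reward: float

# A trajectory tau is simply the ordered sequence of steps the agent experienced.
Trajectory = List[Step]
```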
Returns
At the end of a trajectory, we can look back and compute the discounted sum of future rewards at each step in the trajectory, also known as the return. More formally, the return is defined:
$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}$$
The γ term introduces the discount as a number between 0 and 1. There are a few reasons we want to do this:
- Mathematically, an infinite sum of rewards may not converge to a finite value and is hard to deal with in equations. The discount factor helps with this.
- More intuitively, future rewards do not provide immediate benefits. We prefer immediate gratification; in fact, we do the same thing when pricing stocks and derivatives!
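As a quick illustrative sketch (the function name `discounted_return` is my own, not from the article’s library), the return of a finite reward sequence can be computed backwards in a single pass:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finite list of rewards."""
    g = 0.0
    # Work backwards so each step builds on the next: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three steps of reward 1 with gamma = 0.9 gives 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```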
Value Function
The value function, V, is defined as the expected return of a state:
$$V(s_t) = \mathbb{E}\left[ G_t \mid s_t \right]$$
The agent should prefer states with a higher value because these states have a higher expected total trajectory reward associated with them. This is used in almost every reinforcement learning algorithm.
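In practice we only observe samples of this expectation. A crude, purely illustrative Monte Carlo sketch would estimate V(s) by averaging the returns observed from s over many trajectories:

```python
import numpy as np

def monte_carlo_value(returns_from_state):
    """Estimate V(s) as the sample mean of the returns G_t observed whenever
    the agent passed through state s across many collected trajectories."""
    return float(np.mean(returns_from_state))
```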
Advantage Function
The advantage function, A, is the expected return of a state (value) subtracted from the expected return given an action in a state. More intuitively, it tells us how much better or worse the action taken was compared to the overall expected return:
$$A(s_t, a_t) = \mathbb{E}\left[ G_t \mid s_t, a_t \right] - V(s_t)$$
Why do we use this instead of just the expected return given an action in a state (the first term)? Intuitively, the advantage has lower variance across Monte Carlo runs: subtracting the value function acts as a baseline that shrinks the magnitude of the estimates without changing which actions look better than others.
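As a made-up numerical example: suppose the expected return from state s under the policy is V(s) = 4, while the expected return after forcing a particular action a is 5. Then

$$A(s, a) = \mathbb{E}\left[ G_t \mid s, a \right] - V(s) = 5 - 4 = 1,$$

so action a is one unit of return better than the policy’s average behaviour from s.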
Bias Variance Trade-off
First, let’s get some definitions going:
- Bias: a biased estimator systematically misses the quantity it is trying to estimate. Formally, an estimator is unbiased if its expected value equals the true quantity; otherwise it is biased.
- Variance: an estimator with high variance produces a wide spread of values from one sample to the next. Ideally, an unbiased estimator should also have low variance so that it consistently matches the true quantity across inputs. Formally, it is measured just like the variance of any random variable, as written out below.
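For an estimator V̂ of a true quantity V, these two properties can be written as:

$$\mathrm{Bias}\big[\hat{V}\big] = \mathbb{E}\big[\hat{V}\big] - V, \qquad \mathrm{Var}\big[\hat{V}\big] = \mathbb{E}\Big[\big(\hat{V} - \mathbb{E}\big[\hat{V}\big]\big)^2\Big]$$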
Unfortunately, we generally don’t have the exact form for the value or advantage functions. Let’s use a neural network to model the value function since this only requires a state input rather than state-action inputs. Given this, how do we estimate the advantage function with an imperfect value function estimator? First, note:
$$\mathbb{E}\left[ G_t \mid s_t, a_t \right] = r_t + \gamma\, \mathbb{E}\left[ G_{t+1} \mid s_{t+1} \right] = r_t + \gamma V(s_{t+1})$$
That is, the expected return given an action in a state is equivalent to the reward of the state-action pair (which is assumed to be deterministic here) plus the discounted expected return of the next state.
Therefore, the advantage can be estimated as:
$$\hat{A}(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Note that this estimator is highly reliant on the value function estimator. If this has high bias, then the TD estimate will also have high bias!
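A minimal sketch of this one-step (TD) advantage estimate, assuming the value inputs come from the learned value network (the function name is illustrative):

```python
def td_advantage(reward, value, next_value, gamma=0.99):
    """One-step TD advantage estimate: delta = r + gamma * V(s') - V(s).

    `value` and `next_value` are the network's estimates of V(s) and V(s'),
    so any bias in the value network flows straight into this estimate.
    """
    return reward + gamma * next_value - value
```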

Ideally, we want to try and get around this. In the reinforcement learning setting, we generally take n steps in an environment for each update. Therefore, it’s quite easy to extend the TD estimate for extra time steps:
$$\hat{A}^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) - V(s_t)$$
Doing this reduces the bias: a larger proportion of the estimate now comes from actual rewards rather than from the value function estimate, and the value estimate of the nth state is scaled by γ^n, a much smaller number.
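A sketch of the n-step version, again with an illustrative signature:

```python
def n_step_advantage(rewards, value, nth_value, gamma=0.99):
    """n-step advantage estimate for the first state of a rollout segment.

    `rewards` holds r_t, ..., r_{t+n-1}; `value` is V(s_t) and `nth_value`
    is V(s_{t+n}). The value estimate is now scaled by gamma**n, so the bias
    it introduces is damped.
    """
    n = len(rewards)
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**n * nth_value - value
```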

However, doing this also has a downside: the new estimator has higher variance, because every extra reward term adds more sampled noise.
In summary, extending the TD estimate to include more reward steps reduces the bias of the estimator at the cost of increased variance. We can pick any number, i, between 1 and n to put in the extended advantage estimate, A^{(i)}(s, a). The question is, how do we pick this number?
GAE Equation
A pretty good solution is to take an exponentially weighted average of the extended advantage estimators A^{(i)}(s, a) for i between 1 and n. Let’s look at the final form directly from the paper, where δ_t is the TD advantage estimate for time step t:
$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l} \quad \text{where} \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \qquad \text{[1]}$$
Here, λ is the exponential weight discount. Importantly, this is the lever to control the bias variance trade-off! Note that if we set this to 0, then we are left with the TD advantage estimate (high bias, low variance) and if we set it to 1, this is the equivalent of choosing i = n for the extended advantage estimate (low bias, high variance).
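Writing out the two extremes makes this concrete (at λ = 1 the sum telescopes, and the tail value term vanishes at episode termination):

$$\lambda = 0:\quad \hat{A}_t = \delta_t \qquad\qquad \lambda = 1:\quad \hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l}\, \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t)$$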
Code
A minimal Python (NumPy) sketch of the GAE computation is shown below; note that the function name compute_gae and the exact argument layout here are illustrative rather than Pearl’s exact API:
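```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages for an n-step rollout by iterating backwards.

    Assumes `values` has length n + 1 (it includes the value of the state
    after the final step) and `dones` follows the convention used in this
    article: 0 = episode ended at this step, 1 = episode continues.
    """
    n = len(rewards)
    advantages = np.zeros(n, dtype=np.float32)
    last_advantage = 0.0
    for t in reversed(range(n)):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
        # with the bootstrap term masked out at episode boundaries.
        delta = rewards[t] + gamma * values[t + 1] * dones[t] - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}, also masked.
        last_advantage = delta + gamma * lam * dones[t] * last_advantage
        advantages[t] = last_advantage
    return advantages
```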
Is it just me or was that surprisingly simple? This is the beauty of the exponential average! It can be implemented easily and runs in time linear in the number of steps 😁
The trick is to start from the end and work our way backwards so we don’t calculate the same quantities over and over again. Let’s lay this out more explicitly for an n-step trajectory:
$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots + (\gamma\lambda)^{n-1-t}\,\delta_{n-1}$$
$$\hat{A}_{t+1} = \delta_{t+1} + (\gamma\lambda)\,\delta_{t+2} + \cdots + (\gamma\lambda)^{n-2-t}\,\delta_{n-1}$$
Therefore:
$$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$$
Furthermore, note the addition of the ‘done’ flags here indicating whether an episode has ended or not (0 = done, 1 = not done). This is important to include so we don’t take future episode rewards into account when calculating the advantage of a step in the current episode.
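For instance, running the compute_gae sketch from above on some made-up numbers, where the first episode ends after the second step:

```python
import numpy as np

rewards = np.array([1.0, 1.0, 1.0, 1.0], dtype=np.float32)
values  = np.array([0.5, 0.4, 0.6, 0.3, 0.2], dtype=np.float32)  # V(s_0) ... V(s_4)
dones   = np.array([1.0, 0.0, 1.0, 1.0], dtype=np.float32)       # step 1 ends an episode

advantages = compute_gae(rewards, values, dones, gamma=0.99, lam=0.95)
# Because dones[1] = 0, neither V(s_2) nor any later delta leaks into the
# advantages of steps 0 and 1; the next episode is treated independently.
print(advantages)
```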
Phew! Made it to the end 🎉🎉
If you found this article useful, do consider:
- giving me a follow 🙌
- subscribing to my email notifications to never miss an upload 📧
- using my Medium referral link to directly support me and get access to unlimited premium articles 🤗
Promotions out of the way, do let me know your thoughts on this topic and happy learning!!
References
[1] High-Dimensional Continuous Control Using Generalized Advantage Estimation by John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel: https://arxiv.org/abs/1506.02438