
The paper describing the Trust Region Policy Optimization (TRPO) algorithm, authored by Schulman et al. (2015), is foundational in modern Reinforcement Learning. It is rooted in early works on natural policy gradients by Amari (1998) and Kakade (2001), as well as refinements made since.
TRPO rapidly gained popularity and mainstream acceptance, as it empirically performed better and more stably than natural policy gradient algorithms. Although it has since been surpassed by Proximal Policy Optimization (PPO), its contributions remain important to grasp.
Although I will briefly recap some core concepts, I assume you already have an understanding of both policy gradient and natural policy gradient methods. This article will address the monotonic improvement theorem behind TRPO (focusing on the intuition) and the three implementation changes that distinguish it from natural policy gradients.
Policy gradient (first-order optimization)
The main idea of policy gradient algorithms is fairly straightforward. We have a parameterized policy πθ, which returns actions with a certain probability. If we observe that an action yields above-average rewards, we increase the probability of that action.
To guide the updates, we define an objective function J(θ) that depends on πθ, and compute its gradient ∇θJ(θ) (a vector of partial derivatives) w.r.t. the parameters θ.
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \sum_{\tau} P(\tau;\theta)\, R(\tau)$$
By sampling multiple trajectories, we can contrast their rewards R(τ) with each other – the derivatives tell us in which direction to update θ. The steeper the slope of the objective function, the larger our update will be.
The probability distribution P(τ;θ) reflects both the stochastic policy and the exogenous uncertainty embedded in the environment. Fortunately, we can derive an expression that only depends on the policy πθ. If this policy is differentiable w.r.t. θ, we can compute the gradient ∇θJ(θ) and optimize the objective function. This yields the following update function:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]$$
The key challenge is to select the appropriate learning rate α. The first-order derivatives tell us in which direction to step, but not by how much. In fact, a good value for α in one update might be awful for the next update. Over- and undershooting (visualized below) are commonplace in first-order optimization; the gradients simply do not provide information on how large the step size should be.
![First-order gradient algorithms provide no information on appropriate step sizes and often overshoot or stall. [image by author]](https://towardsdatascience.com/wp-content/uploads/2022/10/0jrxOdTjTTjzLoha_.png)
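To make this concrete, here is a minimal NumPy sketch of a first-order, REINFORCE-style update for a linear-softmax policy. The parameterization and the fixed learning rate α are illustrative assumptions on my part, not part of the original derivation:

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi_theta(.|s) for a linear-softmax policy; phi are the state features."""
    logits = phi @ theta                       # shape: (n_actions,)
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, action):
    """grad_theta log pi_theta(action|s) for the linear-softmax policy."""
    probs = softmax_policy(theta, phi)
    onehot = np.eye(theta.shape[1])[action]
    return np.outer(phi, onehot - probs)       # same shape as theta

def policy_gradient_step(theta, trajectories, alpha=0.01):
    """theta <- theta + alpha * grad J(theta), estimated from sampled trajectories.

    Each trajectory is a tuple (steps, total_return), where steps is a list of
    (phi, action) pairs and total_return is R(tau).
    """
    grad = np.zeros_like(theta)
    for steps, total_return in trajectories:
        for phi, action in steps:
            grad += grad_log_pi(theta, phi, action) * total_return
    grad /= len(trajectories)
    # The fixed learning rate alpha is exactly the weak spot discussed above:
    # nothing tells us whether this step size over- or undershoots.
    return theta + alpha * grad
```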
For more detail on policy gradients, check out the following article:
Natural policy gradients (second-order optimization)
To remedy the step size problem of first-order optimization, natural policy gradients also include second-order derivatives that capture the sensitivity of the gradient w.r.t. weight changes. In essence, this second-order information measures how much the policy πθ (recall this is a probability distribution, which moves over a statistical manifold) changes. For this purpose, we compute the difference between the policy before and after the update, known as the KL-divergence:
$$D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta+\Delta\theta}\big) = \mathbb{E}_{s}\Big[\sum_{a} \pi_\theta(a \mid s) \log \frac{\pi_\theta(a \mid s)}{\pi_{\theta+\Delta\theta}(a \mid s)}\Big]$$
By putting a cap ϵ on the divergence, we derive the following scheme:
$$\max_{\Delta\theta} \; J(\theta + \Delta\theta) \quad \text{s.t.} \quad D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta+\Delta\theta}\big) \leq \epsilon$$
As Reinforcement Learning relies on samples, we perform a number of approximations to use natural gradients in practice. Most notably, we perform a Taylor expansion on the modified objective function, using first-order expansion on the objective and second-order expansion on the KL-divergence. The derivation is quite notation-heavy, so I’ll skip straight to the resulting weight update formula:
$$\Delta\theta = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^{\top} F(\theta)^{-1} \nabla_\theta J(\theta)}} \; F(\theta)^{-1} \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta + \Delta\theta$$
We replaced α with a dynamic step size that depends on the local sensitivity of the objective function, represented by the Fisher information matrix F(θ). Intuitively, we aim for the largest update within the divergence threshold ϵ. In flat regions we can safely perform larger updates; in curved regions we are more cautious. As such, the behavior is often the opposite of first-order optimization, carefully approaching reward peaks.
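A minimal NumPy sketch of this update, assuming we already have a batch of score vectors ∇θ log πθ(a|s) and a vanilla gradient estimate g (the variable names are mine). Note that it forms and solves the full Fisher matrix explicitly, which is precisely the cost TRPO later avoids:

```python
import numpy as np

def natural_gradient_step(theta, g, score_vectors, epsilon=0.01):
    """One natural policy gradient update on a flat parameter vector theta.

    g:             estimate of the vanilla policy gradient grad J(theta)
    score_vectors: array (n_samples, n_params) of grad log pi_theta(a|s)
                   for the sampled state-action pairs
    epsilon:       cap on the (second-order approximated) KL-divergence
    """
    # Fisher information as the average outer product of the score vectors
    F = score_vectors.T @ score_vectors / len(score_vectors)
    F += 1e-6 * np.eye(theta.size)             # damping for numerical stability

    nat_grad = np.linalg.solve(F, g)           # F^-1 g, an O(|theta|^3) operation
    step_size = np.sqrt(2.0 * epsilon / (g @ nat_grad + 1e-12))
    return theta + step_size * nat_grad
```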
For a detailed explanation on natural policy gradients, please check out the following article:
Natural Policy Gradients In Reinforcement Learning Explained
Problems with natural policy gradients
By capping policy changes and taking into account local sensitivities, natural gradients often converge better and faster than vanilla gradients. The approach also suffers from a number of drawbacks though.
First of all, although offering an engaging theoretical framework, natural gradients require a lot of approximations. The second-order Taylor expansion may well misrepresent the actual distance between policies, causing step sizes to be too large. As such, natural policy gradients do not guarantee that the constraint on KL-divergence is met! In practice, one may still take steps that are too large and violate the constraint.
Second, the Fisher information matrix F is a |θ|×|θ| matrix, which may take substantial memory to store. More restrictive, however, is computing the inverse matrix F^-1, an operation of O(|θ|³) complexity. For neural networks containing thousands or even millions of parameters, this becomes computationally prohibitive.
Third, despite drifting quite far from the theoretical framework, natural gradient algorithms do not check whether the update actually improves the objective. Due to the approximations, the proposed weight update may actually worsen performance.
To summarize, natural policy gradients suffer from the following key problems:
- The KL-constraint might be violated due to approximations, causing the analytically derived step size to be too large.
- The inverse matrix F^-1 takes too long to compute, being an operation of O(|θ|³) complexity.
- We don’t check whether the update actually improves the policy. Due to all the approximations, this might not be the case.
TRPO tackles all three problems – to a degree – with some targeted adjustments. Before addressing the implementation changes, it is necessary to discuss some theory though.
![Illustration of gradient search (first-order optimization) vs. trust region search (second-order optimization). The first follows the steepest slope (at the risk of overshooting or stalling), whereas the latter adjusts its search space to the curvature of the statistical manifold [Photo adapted by author from Dawid Zawiła on Unsplash]](https://towardsdatascience.com/wp-content/uploads/2022/10/1mNPeOwe6iXw_601lRLql0Q.png)
Trust Region Policy Optimization (TRPO)— Theory
If you understand natural policy gradients, the practical changes should be comprehensible. To fully appreciate them, however, it is good to have an understanding of the monotonic improvement theorem.
We start by recapping the advantage A(s,a), which is a reshaped reward signal. Boilerplate stuff in Reinforcement Learning. The advantage indicates – for a given policy π and in a given state s – the expected cumulative reward of a specific action a minus the overall expected value. Thus, the advantage describes the relative attractiveness of that action.
$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$$
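As a small illustration, a Monte Carlo estimate of the advantage from a single sampled trajectory could look as follows; the value baseline `values` is assumed to come from some learned critic, and practical implementations typically use more refined estimators:

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.99):
    """Monte Carlo advantage estimates for one trajectory.

    rewards: r_0 ... r_{T-1} observed along the trajectory
    values:  V(s_0) ... V(s_{T-1}) from a baseline / critic
    Returns A(s_t, a_t) ~= (discounted return-to-go) - V(s_t).
    """
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):               # discounted return-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values)
```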
Stochastic policies influence the action selection in trajectories τ, with the environment dictating the emergence of states. As such, we can compute expected rewards in terms of trajectories generated under the current policy. The objective function J(πθ) describes these discounted rewards, with the expectation stemming from the policy πθ:
$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, r_t\Big]$$
To describe the difference in expected rewards between two policies, we can take the expected reward of the original policy and add the expected advantage of the new policy [proof in Kakade & Langford (2002) or Schulman et al. (2015)]. The expression takes the adjusted action probabilities under the new policy, but the advantages computed under the old policy (i.e., we don’t have to resample):
$$J(\pi_{\theta+\Delta\theta}) = J(\pi_\theta) + \mathbb{E}_{\tau \sim \pi_{\theta+\Delta\theta}}\Big[\sum_{t=0}^{\infty} \gamma^t A^{\pi_\theta}(s_t, a_t)\Big]$$
As the time horizon is infinite, the above can be re-expressed in terms of the state distribution ρ(πθ+Δθ):
$$J(\pi_{\theta+\Delta\theta}) = J(\pi_\theta) + \sum_{s} \rho_{\pi_{\theta+\Delta\theta}}(s) \sum_{a} \pi_{\theta+\Delta\theta}(a \mid s)\, A^{\pi_\theta}(s,a)$$
The term ρ(πθ+Δθ) is problematic though – how can we know the state distribution corresponding to the updated policy without sampling it? To remedy this problem, we instead use the state distribution under the present policy, at the cost of introducing an approximation error:
$$J(\pi_{\theta+\Delta\theta}) \approx J(\pi_\theta) + \sum_{s} \rho_{\pi_\theta}(s) \sum_{a} \pi_{\theta+\Delta\theta}(a \mid s)\, A^{\pi_\theta}(s,a)$$
Next, we have to make things suitable for simulation, as we do not have the full state and action sets at our disposal. We replace the sum over ρ with an expectation – which can be estimated using Monte Carlo sampling – and use importance sampling to replace the sum over actions. With importance sampling, we sample actions from the present policy and reweight them by the ratio of new to old action probabilities:
$$J(\pi_{\theta+\Delta\theta}) \approx J(\pi_\theta) + \mathbb{E}_{s \sim \rho_{\pi_\theta},\, a \sim \pi_\theta}\Big[\frac{\pi_{\theta+\Delta\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s,a)\Big]$$
The expected advantage that describes the quality of the updated policy relative to the old one is known as the surrogate advantage 𝓛_πθ(π_θ+Δθ):
$$\mathcal{L}_{\pi_\theta}(\pi_{\theta+\Delta\theta}) = \mathbb{E}_{s \sim \rho_{\pi_\theta},\, a \sim \pi_\theta}\Big[\frac{\pi_{\theta+\Delta\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s,a)\Big]$$
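A sample-based estimate of this surrogate advantage is straightforward to compute. The sketch below assumes we stored the log-probabilities of the sampled actions under the old policy and can evaluate them under the candidate policy (the helper name is mine):

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, advantages):
    """Sample estimate of the surrogate advantage L_pi_theta(pi_theta+dtheta).

    logp_new:   log pi_{theta+dtheta}(a|s) for actions sampled under the OLD policy
    logp_old:   log pi_theta(a|s) for the same (s, a) pairs
    advantages: A^{pi_theta}(s, a), computed under the OLD policy
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))   # importance weights
    return np.mean(ratio * np.asarray(advantages))
```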
The approximation error can be expressed in terms of the worst-case KL-divergence between both policies. As such, we can denote the following inequality:
$$J(\pi_{\theta+\Delta\theta}) \geq J(\pi_\theta) + \mathcal{L}_{\pi_\theta}(\pi_{\theta+\Delta\theta}) - C \cdot D_{KL}^{\max}(\pi_\theta, \pi_{\theta+\Delta\theta})$$
Schulman et al. derive a value for C, and as such a lower bound for the improvement of the objective function. Thus, if we improve the right-hand side, we are guaranteed to improve the left-hand side as well. In essence, if the surrogate advantage 𝓛_πθ(π_θ+Δθ) exceeds the worst-case approximation error C⋅D_KL^max, we are sure to improve the objective.
This is the monotonic improvement theorem in a nutshell. The corresponding procedure is a Minorization-Maximization (MM) algorithm. If we improve the lower bound M = 𝓛_πθ − C⋅D_KL^max, we also improve the objective by at least the same amount.
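For reference, the bound derived in Schulman et al. (2015) can be written as follows, with γ the discount factor:

$$
J(\pi_{\theta+\Delta\theta}) - J(\pi_\theta) \;\geq\; \underbrace{\mathcal{L}_{\pi_\theta}(\pi_{\theta+\Delta\theta}) - C \cdot D_{KL}^{\max}(\pi_\theta, \pi_{\theta+\Delta\theta})}_{M}, \qquad C = \frac{4\gamma \max_{s,a}\lvert A^{\pi_\theta}(s,a)\rvert}{(1-\gamma)^2}
$$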
![Illustration of the MM algorithm. When improving the lower bound L(θ)-C⋅KL, we are guaranteed to also improve the objective function η(θ) [source: Schulman, OpenAI, 2017]](https://towardsdatascience.com/wp-content/uploads/2022/10/1jEJ55nyHSIuyy-9UwicC2Q.png)
This theoretical result is reassuring, but not overly practical. First, D_KL^max (i.e., the largest divergence over the state space) is hard to compute, so for the sample-based implementation we replace it with the expected (mean) divergence. Second, the proposed algorithm harshly penalizes policy divergences, consequently yielding very small, conservative updates.
Despite the practical shortcomings, the monotonic improvement theorem provides important background information to fully understand the mechanisms of TRPO.
Trust Region Policy Optimization (TRPO) – Practice
In terms of practical implementation, TRPO is not that different from the early natural policy gradient algorithms. There are three main improvements, each addressing a problem in the original algorithm. Note that TRPO aims to verify whether proposed updates actually improve the objective, in line with the monotonic improvement theorem.
I. Conjugate gradient method [x = F^-1∇θJ(θ)]
From natural gradients, we already derived the following equivalence, allowing us to express the Fisher information matrix as the expected outer product of the gradients we already have (i.e., we do not have to compute the Hessian containing all second-order derivatives):
$$F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\big]$$
Unfortunately, computing the inverse Fisher matrix F^-1(θ) is a time-consuming and often numerically unstable procedure. Especially for neural networks, a |θ|×|θ| matrix can grow very large, and an O(|θ|³) inversion is infeasible to compute.
The good news: we are not interested in the inverse matrix itself. If you check the equations, we just need the product F^-1∇θJ(θ). If we could compute x = F^-1∇θJ(θ) – or, equivalently, solve F⋅x = ∇θJ(θ) – we don't actually need the inverse.
Enter the conjugate gradient method. Explaining the full procedure stretches beyond the scope of this article, but suffice it to say that it is an established numerical procedure to approximate x. This way, we can sidestep the need to compute the inverse. Conjugate gradient methods generally converge within |θ| iterations (often far fewer), such that we can handle large matrices.
The use of conjugate gradients is not unique to TRPO – for instance, the Truncated Natural Policy Gradient deploys the same approach – but they are an integral part of the algorithm nonetheless.
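A compact sketch of the idea, assuming we can form a matrix-free Fisher-vector product from sampled score vectors (practical TRPO implementations typically obtain this product by differentiating the sampled KL-divergence instead); the function names are illustrative:

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g without ever forming or inverting F.

    fisher_vector_product: callable v -> F v
    g:                     the policy gradient estimate
    """
    x = np.zeros_like(g)
    r = g.copy()                               # residual g - F x (x = 0 initially)
    p = g.copy()                               # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = r_dot / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def make_fvp(score_vectors, damping=1e-3):
    """Matrix-free Fisher-vector product: F v = (1/n) S^T (S v) + damping * v.

    S has shape (n_samples, n_params); the |theta| x |theta| matrix is never built.
    """
    def fvp(v):
        return score_vectors.T @ (score_vectors @ v) / len(score_vectors) + damping * v
    return fvp
```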
II. Line search [D_KL(π_θ||π_θ+Δθ)≤δ]
Although the natural gradient provides the optimal step size given the constraint placed on KL-divergence, due to all the approximations we may actually not meet that constraint.
TRPO addresses this problem by performing a line search – not unlike the line search in typical gradient methods – iteratively reducing the size of the update until the first update that does not violate the constraint. This procedure can be seen as shrinking the trust region, i.e., the region within which we trust the update to actually improve the objective.
For the reduction, an exponentially decaying rate α^j is used, with 0<α<1 and j∈N. If the first update (α⁰=1) meets the conditions, we preserve the original natural gradient step size. If not, we keep shrinking the trust region until the update is satisfactory.
![Line search algorithm. The algorithm iteratively reduces the size of the update (using exponential decay α^j), until divergence falls within the threshold (II) and the surrogate advantage is nonnegative (III) [source: Joshua Achuam, UC Berkeley]](https://towardsdatascience.com/wp-content/uploads/2022/11/0RU2cHrMG4feOQnFd.png)
So, unlike natural policy gradients, which presume the divergence constraint is met, the line search performed in TRPO actually enforces it. Also note that the monotonic improvement theorem utilizes a penalty, whereas the implementation uses a constraint (i.e., the trust region).
From the pseudo-code, we see that D_KL(θ‖θ_k) ≤ δ is not the only condition that must be satisfied. It also asserts that 𝓛(θ) ≥ 0. This is the improvement check, which we explain next.
III. Improvement check [𝓛(θ)≥0]
The final distinction makes a lot of sense. Rather than presuming the update will improve the surrogate advantage 𝓛(θ), we actually check it. Recall that we compute advantages based on the old policy, using importance sampling to adjust the probabilities. Naturally the check takes some time, but it is reassuring to verify that our update actually improves the policy before accepting it.
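A minimal sketch of how checks II and III can be combined in a backtracking line search; `surrogate_fn` and `kl_fn` are assumed helper functions (my naming) that re-evaluate the sampled surrogate advantage and KL-divergence for a candidate parameter vector:

```python
def line_search(theta, full_step, surrogate_fn, kl_fn,
                delta=0.01, alpha=0.8, max_backtracks=10):
    """Backtracking line search used to accept or shrink the natural gradient step.

    full_step:    the proposed update Delta-theta from the conjugate gradient step
    surrogate_fn: theta -> sample estimate of the surrogate advantage L(theta)
    kl_fn:        theta -> sample estimate of D_KL(pi_old || pi_theta)
    delta:        trust-region radius (cap on the sampled KL-divergence)
    alpha:        exponential decay factor, 0 < alpha < 1
    """
    for j in range(max_backtracks):
        candidate = theta + (alpha ** j) * full_step
        improves = surrogate_fn(candidate) >= 0.0        # check III: improvement
        within_trust_region = kl_fn(candidate) <= delta  # check II: KL constraint
        if improves and within_trust_region:
            return candidate                             # accept the (shrunken) step
    return theta                                         # reject the update entirely
```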
![Pseudocode for TRPO algorithm. TRPO performs a conjugate gradient algorithm, a line search that constrains sample KL-divergence and a check on improving surrogate advantage [source: OpenAI, MIT license]](https://towardsdatascience.com/wp-content/uploads/2022/10/13bOuhvM8261eKtFGxuj20w.png)
Closing words
Natural policy gradients offered a considerable improvement over traditional gradient algorithms, which struggled to determine appropriate learning rates and often stalled or overshot while learning. TRPO again offered a significant step forward, primarily by simply checking whether an update is an improvement. The conjugate gradient approximation makes TRPO suitable for policies with larger numbers of parameters, although this approach had already been proposed in earlier works.
TRPO empirically outperforms earlier natural gradient algorithms on many tasks, yet is not without flaws itself:
- Although F^-1 no longer needs to be computed, estimating F can still be a cumbersome task (requiring memory and substantial sample batches). TRPO therefore scales poorly to deep neural networks;
- Conjugate gradient algorithms are readily available, yet their incorporation makes the implementation of RL algorithms more complicated;
- TRPO tends to perform poorly on tasks that utilize convolutional or recurrent networks (CNNs and RNNs);
- As a second-order algorithm, it is notably slower than first-order algorithms, and does not take full advantage of excellent optimizers such as Adam;
- When architectures have multiple outputs – as is often the case with neural networks outputting both policies and value functions – there is no consistent metric for divergence to use;
- In general, TRPO is fairly difficult to explain and interpret, making implementation, debugging and training hard.
Proximal Policy Optimization (PPO) – introduced by Schulman et al. at OpenAI in 2017 – is empirically competitive with TRPO (often even outperforming it), and considerably simpler to implement. Therefore, it has de facto replaced TRPO. Nonetheless, TRPO represents a significant milestone in the development of natural policy gradient methods, and deserves the attention of students in the field of Reinforcement Learning.
Liked the article? You might also appreciate the following RL articles:
The Five Building Blocks of Markov Decision Processes
Why Reinforcement Learning Doesn’t Need Bellman’s Equation
Cliff-Walking Problem With The Discrete Policy Gradient Algorithm
Further reading
Papers:
- Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural computation, 10(2), 251–276.
- Kakade, S. M. (2001). A natural policy gradient. Advances in neural information processing systems, 14.
- Kakade, S. M., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. In International Conference on Machine Learning.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. In International Conference on Machine Learning.
Lecture slides:
- Levine, S. Advanced Policy Gradients (CS 285). UC Berkeley.
- Achiam, J. (2017). Advanced Policy Gradient Methods. UC Berkeley.
- Fragkiadaki, K. Natural Policy Gradients, TRPO, PPO (CMU 10703). Carnegie Mellon.
- Schulman, J. (2017). Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. OpenAI.
Blog posts: