Proximal Policy Optimization (PPO) is presently considered state-of-the-art in Reinforcement Learning. The algorithm, introduced by OpenAI in 2017, seems to strike the right balance between performance and comprehensibility. It is empirically competitive with strong baseline algorithms, even vastly outperforming them on some tasks. At the same time, it is sufficiently simple for practical adoption by a wide range of users, which cannot be said for every RL algorithm.
On the surface, the difference between traditional policy gradient methods (e.g., REINFORCE) and PPO is not that large. Judging from the pseudo-code of both algorithms, you might even argue they look rather similar. However, there is a rich theory behind PPO, which is needed to fully appreciate and understand the algorithm.
I will briefly discuss the main points of policy gradient methods, natural policy gradients, and Trust Region Policy Optimization (TRPO), which together form the stepping stones towards PPO.
Vanilla policy gradient
A good understanding of policy gradient methods is necessary to follow this article. If you are not there yet, you might want to read this article first:
In the context of RL, a policy π is simply a function that returns a feasible action a given a state s. In policy-based methods, the function (e.g., a neural network) is defined by a set of tunable parameters θ. We can adjust these parameters, observe the differences in resulting rewards, and update θ in the direction that yields higher rewards. This mechanism underlies the notion of all policy gradient methods.
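As a minimal illustration of a parameterized stochastic policy, here is a small PyTorch sketch. It assumes a toy discrete problem with a four-dimensional state and two actions (arbitrary choices for the example):

```python
import torch

# A tiny stochastic policy: a neural network with parameters θ maps a
# state to a probability distribution over actions, from which we sample.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 32),   # four-dimensional state (illustrative)
    torch.nn.Tanh(),
    torch.nn.Linear(32, 2),   # two discrete actions (illustrative)
)

state = torch.randn(4)                                    # dummy state s
probs = torch.softmax(policy_net(state), dim=-1)          # π_θ(a|s)
action = torch.distributions.Categorical(probs).sample()  # sampled action a
```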
The policy π_θ(a|s) is stochastic, meaning that the parameters dictate the sampling probabilities of actions a, and thereby influence the probability of following a trajectory τ = (s_1, a_1, …, s_n, a_n). As such, we can express the objective function as dependent on θ:
J(θ) = E_{τ∼π_θ} [ R(τ) ], where R(τ) denotes the cumulative reward of trajectory τ.
With the stochastic policy, we sample various actions – for which we can measure differences in rewards – with corresponding probabilities. The observed rewards, combined with the action probabilities, yield a reward signal. To determine the update direction that improves the objective function, policy gradient methods rely on the gradient ∇_θ J(θ) (a vector of partial derivatives). This yields the following update function:
θ_{k+1} = θ_k + α ⋅ ∇_θ J(θ_k), with ∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(a|s) ⋅ R(τ) ]
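To make the update concrete, here is a minimal PyTorch sketch of one such step, assuming a discrete action space and a policy_net like the one sketched earlier; episode collection and return computation are left out:

```python
import torch

# One vanilla policy gradient (REINFORCE) update on a batch of episode data.
# `states` has shape [T, state_dim], `actions` [T] (long), `returns` [T].
def reinforce_update(policy_net, optimizer, states, actions, returns):
    logits = policy_net(states)                       # [T, n_actions]
    log_probs = torch.log_softmax(logits, dim=-1)     # log π_θ(a|s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Gradient ascent on E[log π_θ(a|s) · R(τ)], written as a loss to minimize
    loss = -(chosen * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```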
Although we have a reward signal and an update direction, we don’t know by how much we should update the policy. Selecting the right learning rate α is a challenge. This problem is aggravated by the inconsistent impact of weight updates. Consider weight updates of equal magnitude applied to the two pairs of policies below. Clearly, the distribution on the left is affected much more than the one on the right. However, the Euclidean distance (i.e., parameter distance) is the same!
![Comparison of normal distribution pairs. The left has μ_1=0, μ_2=1 and σ_1=σ_2=0.3. The right has μ_1=0, μ_2=1 and σ_1=σ_2=3.0. Although the Euclidean distance between both pairs is 1, it is obvious the pair on the right is much more similar than the one on the left. [image by author]](https://towardsdatascience.com/wp-content/uploads/2022/11/1-GoXs5-XjqqYTlfyi48x-g.png)
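To put this observation in numbers, the small NumPy sketch below compares the closed-form KL divergence for the two Gaussian pairs from the figure; the Euclidean distance between the means is 1 in both cases, yet the divergences differ by two orders of magnitude:

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL divergence between two univariate Gaussians."""
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

# Both pairs have the same parameter (Euclidean) distance: |0 - 1| = 1
print(kl_gaussian(0.0, 0.3, 1.0, 0.3))  # narrow pair -> KL ≈ 5.56
print(kl_gaussian(0.0, 3.0, 1.0, 3.0))  # wide pair   -> KL ≈ 0.056
```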
Due to the risk of large policy changes, samples are only used once. We then use the updated policy to re-sample, which is rather inefficient. Another problem is that reward trajectories may vary wildly, causing swings in updates.
In summary, vanilla policy gradients suffer from three major drawbacks:
- Sample inefficiency – Samples are only used once. After that, the policy is updated and the new policy is used to sample another trajectory. As sampling is often expensive, this can be prohibitive; unfortunately, after a large policy update, the old samples are simply no longer representative, so they cannot be reused.
- Inconsistent policy updates – Policy updates tend to overshoot and miss the reward peak, or stall prematurely. Especially in neural network architectures, vanishing and exploding gradients are a severe problem. The algorithm may not recover from a poor update.
- High reward variance – Policy gradient is a Monte Carlo learning approach, taking into account the full reward trajectory (i.e., a full episode). Such trajectories often suffer from high variance, hampering convergence. [This issue can be addressed by adding a critic, which is outside of the article’s scope]
Natural policy gradients
Not familiar with natural policy gradients? Read this article first:
Natural Policy Gradients In Reinforcement Learning Explained
Natural policy gradients, introduced by Amari (1998) and Kakade (2001), address the problem of selecting appropriate learning rates. To do so, the method embeds second-order derivatives that quantify how sensitive the policy is to weight updates.
The Kullback-Leibler (KL) divergence measures how much a policy π_θ changes due to an update. Remember that the policy is represented by a probability distribution, which determines the probability of selecting a given action. KL divergence is defined as follows:
D_KL(π_θ ‖ π_{θ+Δθ}) = E_{a∼π_θ} [ log π_θ(a|s) − log π_{θ+Δθ}(a|s) ]
Now, if we restrict the divergence of the update to be no more than ϵ, we can update weights using the following scheme:
Δθ* = argmax_{Δθ} J(θ + Δθ)   subject to   D_KL(π_θ ‖ π_{θ+Δθ}) ≤ ϵ
The scheme above is not practically usable, given that RL works with sampled observations. To factor in the divergence, a Taylor expansion on the modified objective function is used. We perform a first-order expansion on the objective and a second-order expansion on the KL divergence. After some mathematical effort, the following update formula emerges:
θ_{k+1} = θ_k + √( 2ϵ / (∇_θ J(θ_k)ᵀ F(θ_k)⁻¹ ∇_θ J(θ_k)) ) ⋅ F(θ_k)⁻¹ ∇_θ J(θ_k)
In summary, the hard-to-determine learning rate α has been replaced with a dynamic learning rate, depending on local sensitivity of the objective function. This sensitivity is represented by the Fisher information matrix F(θ). When deployed, the natural gradient update formula allows larger updates in insensitive regions (e.g., plateaus) and smaller ones in curved regions (e.g., close to reward peaks).
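As a rough sketch (not a production implementation), such an update could be written as follows, estimating the Fisher matrix from sampled score vectors; all names are illustrative placeholders:

```python
import numpy as np

def natural_gradient_step(theta, grad_j, score_vectors, epsilon=0.01):
    """One natural policy gradient step.

    grad_j        : gradient of the objective J(θ), shape [d]
    score_vectors : sampled gradients ∇_θ log π_θ(a|s), shape [N, d],
                    used to estimate the Fisher information matrix
    epsilon       : KL-divergence budget of the update
    """
    # Empirical Fisher matrix: F ≈ E[∇log π ∇log πᵀ]
    F = score_vectors.T @ score_vectors / len(score_vectors)

    # Natural gradient direction F⁻¹∇J (small ridge term for numerical stability)
    nat_grad = np.linalg.solve(F + 1e-8 * np.eye(len(theta)), grad_j)

    # Dynamic step size derived from the KL constraint
    step_size = np.sqrt(2 * epsilon / (grad_j @ nat_grad))
    return theta + step_size * nat_grad
```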
In theory, natural gradients ensure stable updates with appropriate step sizes. Because the magnitude by which the policy changes is restricted, we might also re-use samples more than once, enhancing sample efficiency. However, in reality natural gradients exhibit a number of shortcomings:
- Inverting the Fisher matrix F is cumbersome – Matrix inversion is an operation of O(N³) complexity. Given that neural networks often contain thousands of parameters, this is often prohibitive. Additionally, the inversion may fail due to numerical instabilities introduced by the approximations made.
- The KL-constraint might not be satisfied – The Taylor expansions are only an approximation of the theoretical objective function. As such, the analytically derived step size is often too large and still exhibits overshooting behavior.
- Policy improvement is not verified – Despite all the reassuring theoretical guarantees offered by natural gradients, the approach does not check whether an update actually yields an improvement. Again, due to the approximations this may not be the case.
Trust Region Policy Optimization (TRPO)
As TRPO is the direct predecessor of PPO, it might be good to catch up on it first:
Since the advent of natural policy gradients in the late 1990s, a number of improvements have been made. These best practices have been combined in the popular TRPO algorithm, introduced by Schulman et al. in 2015.
For the sake of brevity, I will omit a lot of detail again. Two equations are crucial, though. First, the surrogate advantage 𝓛_{π_θ}(π_{θ+Δθ}) – rooted in importance sampling – reflects the expected advantage of an update:
𝓛_{π_θ}(π_{θ+Δθ}) = E_{s,a∼π_θ} [ (π_{θ+Δθ}(a|s) / π_θ(a|s)) ⋅ A^{π_θ}(s,a) ]
Second, we can bound the approximation error of the surrogate advantage by means of the worst-case KL-divergence between the policies before and after updating. This yields the following inequality:
J(π_{θ+Δθ}) − J(π_θ) ≥ 𝓛_{π_θ}(π_{θ+Δθ}) − C ⋅ D_KL^max(π_θ ‖ π_{θ+Δθ})
The penalty coefficient C has been derived analytically, yielding a rather severe penalty on divergence. If we followed the theoretical TRPO approach, we would make only very small updates and converge slowly.
As such, the practical TRPO algorithm differs from the theoretical underpinning on some crucial points. It is important to distinguish between the theoretical model of TRPO (penalty-based) and the practical implementation (constraint-based) – the distinction is relevant when later discussing PPO.
Concretely, the practical implementation of TRPO improves upon natural gradients with the following three adjustments:
- Conjugate gradient method – The natural policy gradient computation only requires the product F⁻¹∇_θ J(θ); we have no interest in the inverse matrix itself. The conjugate gradient method can approximate this product much more quickly than matrix inversion could, greatly speeding up the update procedure (a sketch follows after this list).
- Line search – Due to simplifications, the analytical step size derived for natural policy gradients might violate the KL-divergence cap. TRPO starts with the originally suggested step size and checks whether the divergence constraint is violated after the update. If it is, the step is iteratively and exponentially shrunk (using a shrinking parameter α^j) until the divergence constraint is met.
- Improvement check – Ultimately, we want our updates to improve the policy. TRPO acknowledges this by verifying whether the surrogate advantage 𝓛(θ) improves after the update, before accepting it. Recall that, due to the approximations, the theoretical guarantees no longer hold.
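To illustrate the first adjustment, here is a minimal sketch of the conjugate gradient method; fisher_vector_product is a placeholder for a routine that computes the product Fv directly (e.g., via automatic differentiation), so F is never formed or inverted explicitly:

```python
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximate x = F⁻¹g using only Fisher-vector products Fv."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - Fx (x starts at zero)
    p = r.copy()            # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```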
The latter two improvements are both visible in the algorithmic snippet below:
![TRPO line search algorithm. The algorithm ensures that the update does not violate the divergence constraint (using an exponential decay α^j) and the update does not decrease performance of the policy [source: Joshua Achiam, UC Berkeley]](https://towardsdatascience.com/wp-content/uploads/2022/11/0RU2cHrMG4feOQnFd.png)
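The same logic can be expressed as a rough Python sketch; surrogate and kl_between are placeholder callables for the surrogate advantage and the KL divergence between old and new policy:

```python
def backtracking_line_search(theta, full_step, surrogate, kl_between,
                             delta, alpha=0.8, max_backtracks=10):
    """Shrink the proposed step until the KL constraint and the
    improvement check are both satisfied (sketch of TRPO's line search)."""
    old_value = surrogate(theta)
    for j in range(max_backtracks):
        theta_new = theta + (alpha ** j) * full_step  # exponentially shrink step
        if (kl_between(theta, theta_new) <= delta
                and surrogate(theta_new) > old_value):
            return theta_new                          # accept the update
    return theta                                      # reject: keep old policy
```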
TRPO resolves a number of problems associated with natural policy gradients and has garnered widespread adoption in the RL community. However, a number of shortcomings remain, in particular:
- Inability to handle large parameter matrices – Despite deploying the conjugate gradient method, TRPO still struggles with large Fisher matrices, even when they don’t need to be inverted. Unfortunately, deep neural networks contain many parameters, and accurately approximating F requires large numbers of samples.
- Second-order optimization is slow – The practical implementation of TRPO is constraint-based and requires computing the aforementioned Fisher matrix, which significantly slows down the updating procedure. Furthermore, we cannot take advantage of (first-order) stochastic gradient optimizers such as ADAM.
- TRPO is complicated – TRPO is quite hard to explain, implement and debug. When training does not yield the desired results, it can be tricky to pinpoint how to improve performance. As such, it is not the most user-friendly RL algorithm.
Proximal Policy Optimization (PPO)
In 2017, Schulman et al. introduced Proximal Policy Optimization (PPO), building on the TRPO foundations. With all the background information given, what does PPO do better and differently than prior algorithms?
There are two main variants of PPO to discuss (both introduced in the 2017 paper): PPO Penalty and PPO Clip. We will first have a look at the penalty-based variant, which is conceptually closest to TRPO.
PPO – Variant with Adaptive KL Penalty
The theoretical foundation of TRPO encapsulates KL divergence as a soft penalty – as seen in the previous section – yet the actual implementation is constraint-based. The theoretical TRPO analytically derives a penalty coefficient to multiply the divergence with, yet in practice this penalty is often overly restrictive, yielding only very minor updates. Thus, the question is how to reliably determine a scaling parameter β that allows meaningful updates, yet avoids excessive drift:
max_θ E_t [ (π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)) ⋅ A_t − β ⋅ D_KL( π_{θ_old}(·|s_t) ‖ π_θ(·|s_t) ) ]
The problem with this penalty is that it is very hard to determine a single value for β that works across multiple problem settings. In fact, even for the same problem, characteristics may change over time. As we do not place a hard cap on divergence, sometimes we'd want to penalize more severely than at other times.
PPO solves the matter in a pragmatic way. It sets a ‘target divergence’ δ; we would like our updates to be somewhere within the neighborhood of this divergence. The target divergence should be large enough to substantially alter the policy, but small enough for updates to be stable.
After each update, PPO checks the size of the update. If the realized divergence exceeds the target divergence by more than a factor of 1.5, we penalize divergence harder in the next iteration by doubling β. The other way around, we halve β if updates are too small, effectively expanding the trust region. You can see some similarities with TRPO's line search here, yet the PPO search works in both directions.
As you might expect, doubling and halving β based on exceeding 1.5 times some target threshold is not the result of an elaborate proof; these values are determined heuristically. We will violate the constraint from time to time, but correct it fairly quickly by adjusting β. Empirically, PPO seems to be quite insensitive to the numerical settings. In summary, we sacrifice some mathematical rigor in favor of pragmatism.
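The heuristic is small enough to sketch in a few lines (with the target divergence δ passed as target_kl; names are illustrative):

```python
def adapt_kl_penalty(beta, measured_kl, target_kl):
    """Heuristic β update of the penalty-based PPO variant."""
    if measured_kl > 1.5 * target_kl:
        beta *= 2.0   # updates too large: penalize divergence harder
    elif measured_kl < target_kl / 1.5:
        beta /= 2.0   # updates too small: relax the penalty
    return beta
```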
The outline for the penalty-based variant of PPO is given below. Note it is fairly similar to the theoretical model of TRPO, but uses some heuristic approximations to make it practically applicable as well.
As PPO updates are fairly small, it is possible to re-use generated samples to some extent, with the importance sampling term correcting for updated action probabilities. PPO performs K update steps; implementations often deploy early stopping when exceeding a certain divergence threshold.
![PPO, penalty variant. The objective function embeds a penalty on the KL-divergence. The corresponding weight β_k is dynamically updated based on the measured divergence relative to the target divergence [source: Schulman et al. 2017]](https://towardsdatascience.com/wp-content/uploads/2022/11/12fD_6bgEjZOztFZyig74Aw.png)
PPO is simpler to implement than natural gradients and TRPO, requiring no analytical results, constraint-based programs or second-order derivatives. Instead, we can just perform the update using popular stochastic gradient descent algorithms like ADAM (drastically increasing speed), and adjust the penalty when updates are too large or too small.
In summary, PPO is much easier to work with than TRPO, while being competitive and often even outperforming it. Let’s see if we can do even better than that.
PPO – Variant with clipped objective
Clipped PPO is the variant currently in vogue, and is typically what we mean when discussing 'PPO'. It usually outperforms the penalty-based variant and is simpler to implement.
Rather than bothering with changing penalties over time, we simply restrict the range within which the policy can change. As advantages achieved by updates outside the clipping range are not used for updating purposes, we provide an incentive to stay relatively close to the existing policy. The objective function still depends on the surrogate advantage, but now looks like this:
𝓛^CLIP(θ) = E_t [ min( ρ_t(θ) ⋅ A_t , clip(ρ_t(θ), 1−ϵ, 1+ϵ) ⋅ A_t ) ]
Here, ρ_t is the importance sampling ratio:
ρ_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
The terms (1-ϵ)⋅A and (1+ϵ)⋅A do not depend on θ, thus yielding a gradient of 0. Consequently, samples outside the trusted region are effectively tossed, discouraging overly large updates. As a result, we don’t explicitly constrain the policy update itself, but simply ignore advantages stemming from overly deviating policies. As before, we can simply use an optimizer like ADAM to perform the updates.
![In this variant of PPO, the surrogate advantage is clipped. If the updated policy deviates from the original one by more than ϵ, the sample yields a gradient of 0. This mechanism avoids overly large updates of the policy, retaining it within a trusted region [image by Schulman et al. 2017]](https://towardsdatascience.com/wp-content/uploads/2022/11/1RUEQ7RXzywlV63nZ0ldJUg.png)
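A minimal PyTorch sketch of the clipped surrogate objective could look as follows; advantage estimation and sampling are assumed to happen elsewhere:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective, returned as a loss to minimize.

    log_probs_new : log π_θ(a_t|s_t) under the current policy
    log_probs_old : log π_θ_old(a_t|s_t), detached (no gradient)
    advantages    : advantage estimates A_t
    """
    ratio = torch.exp(log_probs_new - log_probs_old)           # ρ_t(θ)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()   # negate because optimizers minimize
```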
‘Why the min operator?’, you may ask. Why not always clip? In general, the min operator ensures cautious updates by establishing a pessimistic bound. There is one specific case that deserves some detailed attention: the advantage A is negative and ρ_t > 1+ϵ. In other words, we performed a large update that increases the probability of taking worse actions. In this case, ρ_t ⋅ A is lower than (1+ϵ) ⋅ A, so the min operator selects the unclipped term, yielding a stronger signal to rectify updates in the wrong direction.
There are six non-trivial cases (i.e., policies should be different and advantage nonzero) with corresponding update behavior. For an understanding of PPO, it’s good to take a moment and absorb the cases summarized in the table below:
![Summary of the six non-trivial cases of the PPO clipped objective. When clipped, the gradient is zero and does not contribute to training, discouraging large policy updates [table by author, based on Bick, 2021]](https://towardsdatascience.com/wp-content/uploads/2022/11/18kQ1WK6jgJcZw8HdPwzKDg.png)
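As a sanity check of the table's logic, the small script below (illustrative, not from the paper) differentiates the clipped objective for a handful of (ρ_t, A) combinations; whenever the objective is actually clipped, the gradient with respect to ρ_t comes out as zero:

```python
import torch

epsilon = 0.2
for rho, adv in [(1.1, 1.0), (1.5, 1.0), (0.7, 1.0),
                 (1.5, -1.0), (0.7, -1.0), (0.9, -1.0)]:
    rho_t = torch.tensor(rho, requires_grad=True)
    clipped = torch.clamp(rho_t, 1 - epsilon, 1 + epsilon)
    objective = torch.min(rho_t * adv, clipped * adv)
    objective.backward()
    # Clipped cases (e.g., ρ=1.5 with A>0, or ρ=0.7 with A<0) yield zero gradient
    print(f"ρ={rho:4}, A={adv:4} -> gradient w.r.t. ρ: {rho_t.grad.item():+.1f}")
```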
To achieve the desired behavior we should tune ϵ, which serves as an implicit restriction on KL divergence. Empirically, ϵ=0.1 and ϵ=0.2 are values that work well. These values drift a bit from the theoretical foundations of natural policy gradients (the theory builds on local approximations assuming π_θ = π_{θ_k}), but empirically it gets the job done.
The full algorithm is detailed below.
![PPO, clipped objective variant. The maximum divergence between new and old policy is clipped. If policies deviate too far, their samples are not used in updates, discouraging large updates. Note that this article uses ρ_t instead of r_t to denote the importance sampling ratio [source: Schulman et al. 2017]](https://towardsdatascience.com/wp-content/uploads/2022/11/1er2e6DH5efzjWSVK9dmgig.png)
PPO2?
You might have heard the term PPO2. Do we need more math still?
Fortunately, PPO2 is simply an updated version of the algorithm as released by OpenAI. PPO2 is an implementation with vectorized environments that is optimized for GPU and better supports parallel training. It has a number of other differences (e.g., advantages are normalized automatically and value functions are clipped as well), but it uses the same mathematical foundations as outlined in this article. If you plan to directly use the OpenAI implementation, simply remember that the original PPO implementation is considered obsolete and you should use PPO2 nowadays.
Closing words
Since its introduction in 2017, PPO has quickly established itself as the go-to algorithm in continuous control problems. Five years is a long time in machine learning, yet within its class of benchmark problems (primarily classic control problems and Atari games), it remains highly competitive.
It appears PPO strikes the right balance between speed, caution and usability. Although it lacks the theoretical guarantees and mathematical finesse that natural gradients and TRPO offer, PPO tends to converge faster and better than its competitors. Somewhat paradoxically, it appears that simplicity pays off in Deep Reinforcement Learning.
Still not done with Reinforcement Learning for today? Check out the following articles!
Why Reinforcement Learning Doesn’t Need Bellman’s Equation
A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0
Further reading
Papers
- Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural computation, 10(2), 251–276.
- Bick, D. (2021). _Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization_ (Master thesis). Rijksuniversiteit Groningen.
- Kakade, S. M. (2001). A natural policy gradient. Advances in neural information processing systems, 14.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. In International Conference on Machine Learning.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Lecture slides
- Levine, S. Advanced Policy Gradients (CS 285). UC Berkeley.
- Achiam, J. (2017). Advanced Policy Gradient Methods. UC Berkeley.
- Fragkiadaki, K. Natural Policy Gradients, TRPO, PPO (CMU 10703). Carnegie Mellon.
- Schulman, J. (2017). Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. OpenAI.