
Phasic Policy Gradient (PPG) Part 1

An Intuitive Dive into the Theory Behind PPG

This is Part 1 of a two part series in which I’ll be discussing the theory behind the PPG algorithm. Be sure to check out Part 2 next for an implementation in PyTorch.

Photo by Antoine Dautry on Unsplash

Prerequisites

It helps if you have a working knowledge of Proximal Policy Optimization (PPO) and the general maths behind actor critic methods. Otherwise, I provide a series of links to resources throughout the article to help out 🙂

The Dilemma of Joint Actor Critic Networks

The actor-critic framework relies on optimizing two different objectives at the same time: the policy and the value function. Briefly, the policy, generally denoted π(a|s), is the probability distribution over actions that the agent samples from given an observed environment state, while the value function describes the expected (discounted) return of being in a particular environment state. Neural networks are great at approximating functions! But a question arises: should we use a single neural network to approximate both of these functions at the same time?

Let’s consider the advantages of doing this:

  1. Empirically, features trained by each objective can be used to better optimize the other 🙂
  2. Using one network rather than two uses less memory 🙂

However, there are also some disadvantages:

  1. There is always a risk that the optimization of one objective will interfere with the optimization of the other when shared parameters are used 🙁
  2. The same level of sample reuse is forced on both objectives. The value function objective can be optimized off-policy, so it could tolerate far more sample reuse than the policy objective 🙁

The Algorithm

The idea behind PPG is to decouple the training of both objectives whilst still allowing for some injection of the learned value function features into the policy network. To do this, disjoint policy and value networks are used as shown in Figure 1:

Figure 1: Disjoint policy and value networks [1]

The two θs represent the network parameters (weights and biases) of the respective networks. Furthermore, two distinct training phases are used, the policy phase and the auxiliary phase.
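To make this concrete, here's a minimal PyTorch sketch of the two disjoint networks, with the auxiliary value head attached to the policy network. The class names and layer sizes are my own illustration rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Policy network π_θπ with an auxiliary value head (illustrative sizes)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.aux_value_head = nn.Linear(hidden, 1)       # V_θπ, used only in the auxiliary phase

    def forward(self, obs):
        features = self.body(obs)
        return self.policy_head(features), self.aux_value_head(features).squeeze(-1)

class ValueNetwork(nn.Module):
    """Separate value network V_θV."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        return self.value_head(self.body(obs)).squeeze(-1)
```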

Policy Phase

The goal of the policy phase is to optimize the training of the policy network itself.

During this phase, the agent is trained with the same objectives from Proximal Policy Optimization (PPO). Specifically, the policy network is trained using the following objective:

L^clip(θ) = Ê_t[ min( r_t(θ)Â_t, clip(r_t(θ), 1 − ε, 1 + ε)Â_t ) ]

PPO policy objective function [1]

Where:

r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)

Ratio between the current policy and the old policy before the update [1]

Here Â represents the estimated advantage function. The idea is to clip updates so that the new policy doesn't deviate too much from the old policy in any single update, leading to more stable training. Often, an entropy bonus is added to this objective to encourage exploration (since a uniform distribution has maximum entropy). For more details, see [2].
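As a rough sketch of what this looks like in code, here's one way to compute the clipped surrogate loss (plus an entropy bonus) in PyTorch. The tensor names and the clip_eps/entropy_coef defaults are assumptions for illustration, not taken from the paper's code:

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages,
                    entropy, clip_eps=0.2, entropy_coef=0.01):
    """Clipped surrogate objective, negated so it can be minimised."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()         # maximise objective => minimise negative
    return policy_loss - entropy_coef * entropy.mean()          # entropy bonus encourages exploration
```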

The value network is also updated here with the mean square error between the predicted value and the target value:

L^value = Ê_t[ ½ (V_θV(s_t) − V̂_t^targ)² ]

Value network objective function [1]

Both the value target and the advantage estimate are computed with Generalized Advantage Estimation (GAE), an exponentially weighted average of n-step TD estimates over all possible rollout lengths. For more detail on this, see [3] and [4].
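Here's a minimal sketch of GAE over a single rollout, assuming rewards, values and dones are 1-D tensors collected from the environment and treating the γ and λ defaults as placeholder values:

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.999, lam=0.95):
    """Generalized Advantage Estimation over one rollout (sketch)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * lam * not_done * gae                      # exponential average of n-step TDs
        advantages[t] = gae
    value_targets = advantages + values   # targets used in the value loss
    return advantages, value_targets
```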

Auxiliary Phase

The goal of the auxiliary phase is to optimize the training of the value network whilst also injecting some value features into the policy to aid its future training.

The value network is updated just like in the policy phase with the same objective function.

Notice the auxiliary value head on the policy network in Figure 1. This is the key to injecting learned value features into the policy network! We can use this head to train the policy network on any arbitrary auxiliary objective. In this case we want to add useful value features, so we have the head predict the value of an input environment state, just like the value network does. The same value loss is used, except the predicted value is taken from the auxiliary head instead of from the value network:

L^aux = Ê_t[ ½ (V_θπ(s_t) − V̂_t^targ)² ]

Value objective function for the policy network during the auxiliary phase [1]

However, there is still the risk that optimizing this auxiliary objective will interfere with the policy objective. We don't want the auxiliary phase to destroy the progress made during the policy phase! To prevent this, we simply add a Kullback-Leibler (KL) divergence term between the policy just before the auxiliary phase and the current policy:

L^joint = L^aux + β_clone · Ê_t[ KL( π_θold(·|s_t) ‖ π_θ(·|s_t) ) ]

Auxiliary objective function [1]

For those unfamiliar with KL divergence, it comes from information theory and describes how dissimilar two distributions are, reaching its global minimum of 0 when the two distributions are identical. The β coefficient controls how strongly the current policy is pulled towards the previous policy: the larger it is, the larger the gradient of the KL term, so gradient descent favours staying close to the old policy over optimizing the auxiliary value objective, and vice versa.
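Putting the two terms together, a rough PyTorch sketch of the joint auxiliary loss for the policy network might look like this (the distribution objects and the default β value are assumptions for illustration):

```python
import torch
from torch.distributions import kl_divergence

def aux_policy_loss(aux_value_pred, value_targets, old_dist, new_dist, beta_clone=1.0):
    """Auxiliary-phase loss for the policy network: aux value error + KL to the old policy."""
    aux_value_loss = 0.5 * (aux_value_pred - value_targets).pow(2).mean()
    kl = kl_divergence(old_dist, new_dist).mean()   # KL(π_old ‖ π) keeps the policy from drifting
    return aux_value_loss + beta_clone * kl
```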

Tying Everything Together

Figure 2 shows how the policy and auxiliary phases are combined to create the PPG algorithm and Figure 3 shows the default hyperparameters used in the paper.

Figure 2: PPG Algorithm [1]
Figure 3: PPG default hyperparameters [1]

We run the auxiliary phase once every N iterations of the policy phase. Since the main goal of the algorithm is to optimize the policy, and the auxiliary phase doesn't directly optimize it, this frequency parameter N can be set relatively high.

To save tuning another hyperparameter, the number of value epochs is set equal to the number of policy epochs. It could be set higher, but that isn't necessary since the value network is mainly targeted during the auxiliary phase anyway.

The policy phase is on-policy, so it doesn’t benefit from sample reuse. As a result, the number of policy epochs is kept low and a new batch of experiences is collected after each policy update.

The value function is off-policy, so the auxiliary phase benefits from high sample reuse relative to the policy phase since its goal is to optimize the value network. As a result, the number of auxiliary epochs is higher than the number of policy epochs.
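To make the scheduling concrete, here's a tiny sketch of how these phase hyperparameters fit together. The names are mine and the numeric values reflect my reading of the paper's defaults, so treat them as assumptions rather than gospel:

```python
# Illustrative phase schedule for PPG (names and values are my own sketch).
PHASE_HYPERPARAMS = {
    "n_policy_iterations": 32,  # N: policy-phase iterations between auxiliary phases
    "policy_epochs": 1,         # E_pi: low, since the policy objective is on-policy
    "value_epochs": 1,          # E_V: matched to E_pi to avoid tuning another knob
    "aux_epochs": 6,            # E_aux: higher, the value function tolerates more reuse
    "beta_clone": 1.0,          # weight on the KL term in the auxiliary objective
}

def is_auxiliary_iteration(iteration, n_policy_iterations=32):
    """Run the auxiliary phase once every N policy-phase iterations."""
    return (iteration + 1) % n_policy_iterations == 0
```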

Earlier, I talked about how the β coefficient balances the KL divergence term against the auxiliary value objective. In the paper, it is set to 1 by default. However, the appropriate value also depends on how the KL divergence is calculated. Unfortunately, the KL divergence doesn't have a closed-form solution for all distributions, so an approximation is sometimes used:

Approximate form for reverse KL Divergence

The derivation for this can be found in my article here, and [5] covers several such approximations. This approximation is generally smaller than the full KL divergence, so the β coefficient should be adjusted depending on which form is used.
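For illustration, one common sample-based estimator (one of several discussed in [5]) computes the KL from the log-probabilities of the sampled actions. The sketch below shows that idea and is not necessarily the exact form used above:

```python
import torch

def approx_kl(old_log_probs, new_log_probs):
    """Monte-Carlo estimate of KL(π_old ‖ π) from actions sampled under π_old."""
    # E_{a ~ π_old}[ log π_old(a|s) - log π(a|s) ]
    return (old_log_probs - new_log_probs).mean()
```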

Concluding Thoughts

PPG offers a simple but powerful solution to many of the problems presented by joint network PPO algorithms. Let’s review the advantages:

  1. Better sample efficiency 🙂
  2. More controlled value injection to the policy 🙂

However, as with anything, there are disadvantages:

  1. More hyperparameters to think about 🙁
  2. Increased memory requirements with two networks 🙁

On the second disadvantage, it's possible to use a single network and simply detach the value-function gradient at the last shared layer during the policy phase. This performs essentially the same as the two-network solution, which makes the point moot.
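As a rough sketch of that single-network variant (the class name and layer sizes are my own illustration), the value head can simply be fed detached features during the policy phase so that value gradients never reach the shared layers:

```python
import torch
import torch.nn as nn

class SharedNetwork(nn.Module):
    """Single shared body with policy and value heads (illustrative sizes)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs, detach_value=True):
        features = self.body(obs)
        logits = self.policy_head(features)
        # During the policy phase, stop value gradients at the last shared layer.
        value_features = features.detach() if detach_value else features
        value = self.value_head(value_features).squeeze(-1)
        return logits, value
```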

One exciting thing to think about is what other auxiliary objectives can be used to further improve performance!

I hope this article has been somewhat enlightening and be sure to check out Part 2 for the implementation in PyTorch.

Photo by Karsten Würth on Unsplash


References

[1] Phasic Policy Gradient by Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman https://arxiv.org/abs/2009.04416

[2] Proximal Policy Optimization Algorithms by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov https://arxiv.org/abs/1707.06347v1

[3] High-Dimensional Continuous Control Using Generalized Advantage Estimation by John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, Pieter Abbeel https://arxiv.org/abs/1506.02438

[4] A (Long) Peek into Reinforcement Learning by Lilian Weng https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html#combining-td-and-mc-learning

[5] Approximating KL Divergence by John Schulman http://joschu.net/blog/kl-approx.html

