Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Thomas Simonini
Towards Data Science
8 min read · Sep 3, 2018


We launched a new free, updated, Deep Reinforcement Learning Course from beginner to expert, with Hugging Face 🤗

👉 The new version of the course: https://huggingface.co/deep-rl-course/unit0/introduction

The chapter below is the former version, the new version is here 👉 https://huggingface.co/deep-rl-course/unit5/introduction?fw=pt

A few weeks ago, OpenAI made a breakthrough in Deep Reinforcement Learning: they beat some of the best Dota 2 players in the world with OpenAI Five, a team of 5 agents. Unfortunately, they lost during the second experiment.

Dota2

This breakthrough was made possible thanks to a powerful hardware architecture and a state-of-the-art algorithm: PPO, aka Proximal Policy Optimization.

The central idea of Proximal Policy Optimization is to avoid too large a policy update. To do that, we use a ratio that tells us the difference between our new and old policy and clip this ratio to the range [0.8, 1.2]. Doing so ensures that our policy update will not be too large.

Moreover, PPO introduced another innovation: training the agent by running K epochs of gradient descent over sampled mini-batches. If you read my article about A2C with Sonic The Hedgehog, we already implemented that.
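As a rough illustration of that idea, here is a minimal sketch of the epoch/mini-batch loop in Python, assuming a rollout stored as NumPy arrays and a hypothetical `ppo_update_on_minibatch` callback that performs one gradient step (neither is the exact code from the course repo):

```python
import numpy as np

def ppo_train_step(rollout, ppo_update_on_minibatch, n_epochs=4, minibatch_size=64):
    """Run K epochs of updates over shuffled mini-batches of one rollout.

    `rollout` is assumed to be a dict of NumPy arrays (states, actions,
    old log-probs, advantages, returns) sharing the same first dimension.
    `ppo_update_on_minibatch` is a hypothetical callback doing one gradient step.
    """
    n_samples = len(rollout["states"])
    indices = np.arange(n_samples)

    for _ in range(n_epochs):                      # K epochs over the same rollout
        np.random.shuffle(indices)                 # reshuffle the data each epoch
        for start in range(0, n_samples, minibatch_size):
            batch_idx = indices[start:start + minibatch_size]
            minibatch = {key: value[batch_idx] for key, value in rollout.items()}
            ppo_update_on_minibatch(minibatch)     # one gradient step on this mini-batch
```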

So today, we'll dive into the PPO architecture and implement a Proximal Policy Optimization (PPO) agent that learns to play Sonic the Hedgehog 1, 2 and 3!

However, to understand PPO you first need to master A2C; if that's not the case, read the A2C tutorial here.

The problem with the Policy Gradient objective function

Remember that when we studied Policy Gradients, we learned about the policy objective function, or if you prefer, the policy loss.

The idea was that by taking a gradient ascent step on this function (which is equivalent to taking gradient descent on the negative of this function), we push our agent to take actions that lead to higher rewards and avoid bad actions.

However, the problem comes from the step size:

  • Too small, and the training process is too slow.
  • Too high, and there is too much variability in the training.
When there is enormous variability in the training (Robot Icons made by Smashicons)

That's where PPO is useful: the idea is that PPO improves the stability of the actor's training by limiting the policy update at each training step.

To do that, PPO introduced a new objective function, the "clipped surrogate objective function," which constrains the policy change to a small range using a clip.

Introducing the Clipped Surrogate Objective Function

First, as explained well in this Stack Overflow answer, instead of using log π to trace the impact of the actions, we can use the ratio of the probability of the action under the current policy to the probability of the action under the previous policy.

rt(θ) = πθ(at | st) / πθold(at | st) (taken from the PPO paper)

As we can see, rt(θ) denotes the probability ratio between the new and old policy:

  • If rt(θ) > 1, the action is more probable under the current policy than under the old one.
  • If rt(θ) is between 0 and 1, the action is less probable under the current policy than under the old one.

As a consequence, our new objective function could be:

L^CPI(θ) = Êt[ rt(θ) Ât ] (taken from the PPO paper)
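In code, since we usually store log-probabilities rather than raw probabilities, the ratio is simply the exponential of their difference. Here is a minimal NumPy sketch; the arrays are illustrative placeholders, not values from the course repo:

```python
import numpy as np

# Log-probabilities of the taken actions under the old and current policy,
# plus the estimated advantages (illustrative placeholder values).
old_log_probs = np.array([-1.20, -0.60, -2.10])
new_log_probs = np.array([-1.00, -0.90, -1.50])
advantages    = np.array([ 0.50, -0.30,  1.20])

# rt(θ) = π_new(a|s) / π_old(a|s), computed from log-probs for numerical stability.
ratios = np.exp(new_log_probs - old_log_probs)

# Unclipped surrogate objective: the mean of rt(θ) * Ât.
surrogate = np.mean(ratios * advantages)
print(ratios, surrogate)
```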

However, without a constraint, if the action taken is much more probable under our current policy than under our former one, this leads to a large policy gradient step and, as a consequence, an excessive policy update.

Consequently, we need to constrain this objective function by penalizing changes that move the ratio away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2). By doing that, we ensure that the policy update is not too large, because the new policy can't be too different from the old one.

To do that we have two solutions:

  • TRPO (Trust Region Policy Optimization) uses a KL-divergence constraint outside of the objective function to constrain the policy update. But this method is more complicated to implement and takes more computation time.
  • PPO clips the probability ratio directly in the objective function with its clipped surrogate objective function.
The Clipped Surrogate Objective function

With the clipped surrogate objective function, we have two probability ratios: one non-clipped and one clipped to the range [1 − 𝜖, 1 + 𝜖], where epsilon is a hyperparameter that defines the clip range (in the paper, 𝜖 = 0.2).

Then, we take the minimum of the clipped and non-clipped objectives, so the final objective is a lower (pessimistic) bound on the unclipped objective.
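Putting the pieces together, here is a minimal NumPy sketch of the clipped surrogate objective (with 𝜖 = 0.2 as in the paper); it reuses the same inputs as the snippet above and is a simplified illustration, not the course's exact implementation:

```python
import numpy as np

def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    """Clipped surrogate objective from the PPO paper (a quantity to maximize).

    To train with gradient descent, you would minimize the negative of this value.
    """
    ratios = np.exp(new_log_probs - old_log_probs)               # rt(θ)
    unclipped = ratios * advantages                               # rt(θ) * Ât
    clipped = np.clip(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # Element-wise minimum: a pessimistic (lower) bound on the unclipped objective.
    return np.mean(np.minimum(unclipped, clipped))
```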

Consequently, we have two cases to consider:

Taken from PPO paper
  • Case 1: When the advantage is > 0

If Ât > 0, it means that the action is better than the average of all the actions in that state. Therefore, we should encourage our new policy to increase the probability of taking that action in that state.

Consequently, this means increasing rt(θ): the probability under the new policy (the numerator) grows, while the old-policy probability in the denominator stays constant.

Taken from PPO paper

However, because of the clip, rt(𝜽) will only grow to as much as 1 + 𝜖. It means that this action can't become, say, 100x more probable than under the old policy.

Why? Because we don't want to update our policy too much, and for a simple reason: taking that action in that state was only one try. It doesn't mean it will always lead to a highly positive reward, so we don't want to be too greedy, because that can lead to a bad policy.

→ To summarize, in the case of a positive advantage, we want to increase the probability of taking that action in that state, but not too much.

  • Case 2: When the advantage Ât is < 0

If Ât < 0, the action should be discouraged because of its negative effect on the outcome. Consequently, rt will be decreased (the action becomes less probable under the current policy than under the old one), but because of the clip, rt will only decrease to as little as 1 − 𝜖.

Again, we don't want to make a big change in the policy by greedily reducing the probability of taking that action to zero just because it led to a negative advantage once.

To summarize, thanks to this clipped surrogate objective, we restrict how far the new policy can move from the old one, because we remove the incentive for the probability ratio to move outside of the interval [1 − 𝜖, 1 + 𝜖]. The clip also acts on the gradient: if the ratio is > 1 + 𝜖 or < 1 − 𝜖 (and the clipped term is the one selected by the minimum), the gradient is equal to 0 (no slope). For instance, with 𝜖 = 0.2 and a positive advantage, a ratio of 1.5 contributes 1.2 · Ât to the objective, exactly like a ratio of 1.3, so pushing the ratio further gives no gradient signal.

So we see that these clipping regions prevent us from being too greedy and trying to update too much at once, updating outside of the region where this sample offers a good approximation.
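To make the "no slope" point concrete, here is a tiny numeric check (a standalone illustration with made-up numbers): with a positive advantage and 𝜖 = 0.2, pushing the ratio from 1.3 to 1.4 does not change the objective at all, because both are clipped to 1.2.

```python
import numpy as np

advantage = np.array([1.0])   # positive advantage
epsilon = 0.2

for ratio in (0.9, 1.1, 1.3, 1.4):
    r = np.array([ratio])
    objective = np.minimum(r * advantage,
                           np.clip(r, 1 - epsilon, 1 + epsilon) * advantage)
    print(ratio, float(objective[0]))
# Inside the clip range the objective still grows with the ratio (0.9 -> 1.1),
# but beyond 1 + epsilon it stays flat at 1.2, so the gradient there is zero.
```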

The final clipped surrogate objective loss combines the clipped policy objective with a value-function error term and an entropy bonus (taken from the PPO paper): L_t^{CLIP+VF+S}(θ) = Êt[ L_t^{CLIP}(θ) − c1 L_t^{VF}(θ) + c2 S[πθ](st) ].
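As a hedged sketch of how that combined loss could be assembled in NumPy (the coefficient values c1 = 0.5 and c2 = 0.01 and the argument names are illustrative defaults, not the exact values or code from the course repo):

```python
import numpy as np

def ppo_total_loss(new_log_probs, old_log_probs, advantages,
                   value_predictions, returns, entropy,
                   clip_epsilon=0.2, vf_coef=0.5, entropy_coef=0.01):
    """Combined PPO loss to minimize: -clipped surrogate + value error - entropy bonus."""
    ratios = np.exp(new_log_probs - old_log_probs)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    policy_loss = -np.mean(np.minimum(unclipped, clipped))      # maximize the surrogate

    value_loss = np.mean((returns - value_predictions) ** 2)    # squared-error critic loss
    entropy_bonus = np.mean(entropy)                            # encourages exploration

    return policy_loss + vf_coef * value_loss - entropy_coef * entropy_bonus
```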

Implementing a PPO agent in A2C style that plays the Sonic the Hedgehog series (Sonic 2 and 3)

So now, we’re ready to implement a PPO agent in A2C style. A2C style means that it follows the same process explained in the A2C article.

Again, this implementation is much more complex than the former implementations in this course. We're beginning to implement state-of-the-art algorithms, so we need to be more and more efficient with our code. That's why, in this implementation, we'll separate the code into different objects and files.

To implement a PPO agent, you'll need to read the notebook below, which contains a schema of the complete PPO process and an explanation of each part of the code.

The implementation is in the GitHub repo here.

That's all! You've just created an agent that learns to play Sonic the Hedgehog 1, 2 and 3. That's awesome! You'll need about 10 to 15 hours of training on a single GPU to get a good agent.

Don't forget to implement each part of the code by yourself. It's really important to try to modify the code: change the hyperparameters, use another environment. Experimenting is the best way to learn, so have fun!

Take time to consider all the achievements you’ve made since the first chapter of this course: we went from simple text games (OpenAI taxi-v2) to complex games such as Doom and Sonic the Hedgehog using more and more powerful architectures. And that’s fantastic!

Next time, we'll work on one of the most exciting new strategies in Deep Reinforcement Learning: Curiosity-Driven Learning.

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see it here on Medium. And don't forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep Learning, Stay awesome!
