Policy Based Reinforcement Learning, the Easy Way

Step by step approach to understanding Policy Based methods in Reinforcement Learning

Ziad SALLOUM
Towards Data Science


Photo by Jomar on Unsplash

Update 1: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com

Update 2: If you are new to the subject, it might be easier for you to start with the Reinforcement Learning Policy for Developers article.

Introduction

Suppose you are in a new town, you have no map and no GPS, and you need to reach downtown. You can try to assess your current position relative to your destination, as well as the effectiveness (value) of each direction you take. You can think of this as computing the value function. Or you can ask a local, and he tells you to go straight, and when you see a fountain to turn left and continue until you reach downtown. He gave you a policy to follow.
Naturally, in this case, following the given policy is much less complicated than computing the value function on your own.

In another example, consider that you are managing an inventory, and you decide that when the count of an item drops below a certain limit you issue a buy order to refill your stock. This policy is far simpler than studying the activity of the customers, and their buying habits and preferences, in order to predict the impact on your stock…

Sure enough, value functions can lead to a policy, as seen in previous articles, but there are other methods that learn a parameterized policy that can select actions without consulting a value function (this is not entirely accurate, since a value function is still needed to increase accuracy).

So the main idea is to be able to determine at a state (s) which action to take in order to maximize the reward.

The way to achieve this objective is to fine-tune a vector of parameters, denoted 𝜽, in order to select the best action to take for policy 𝜋.
The policy is written 𝜋(a|s, 𝜽) = Pr{At = a | St = s, 𝜽t = 𝜽}, which means that the policy 𝜋 is the probability of taking action a when in state s and the parameters are 𝜽.
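
To make this notation concrete, here is a minimal sketch (an illustration of my own, not from the original article) of a parameterized policy: the parameters 𝜽 turn the features of a state into a probability for each action, and the agent samples its action from that distribution. The feature vector and the Softmax-style parameterization (detailed in Part 2) are assumptions made for the example.

```python
import numpy as np

def pi(phi_s, theta):
    """pi(a|s, theta): probability of each action, given state features phi(s)."""
    preferences = theta @ phi_s              # one scalar preference per action
    preferences -= preferences.max()         # subtract max for numerical stability
    e = np.exp(preferences)
    return e / e.sum()                       # normalize into a distribution

def select_action(phi_s, theta, rng):
    """Sample an action according to pi(.|s, theta)."""
    probs = pi(phi_s, theta)
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))              # 3 actions, 4 state features
phi_s = np.array([1.0, 0.5, -0.2, 0.3])      # phi(s) for the current state
print(pi(phi_s, theta))                      # three probabilities summing to 1
print(select_action(phi_s, theta, rng))      # an action index sampled from pi
```

Learning then consists of nudging 𝜽 so that actions leading to higher rewards get higher probability.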

Advantages

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
    When the action space is large, memory usage and computation consumption grow rapidly. Policy-based RL avoids this because the objective is to learn a set of parameters whose size is far smaller than the size of the space.
  • Can learn stochastic policies
    Stochastic policies are better than deterministic policies, especially in two-player games, where if one player acts deterministically the other player will develop countermeasures in order to win.

Disadvantages

  • Typically converges to a local rather than a global optimum
  • Evaluating a policy is typically inefficient and has high variance
    Policy-based RL has high variance, but there are techniques to reduce this variance (see Part 4 below).

Stochastic Policy

First, it is important to note that stochastic does not mean randomness in all states: a policy can be stochastic only in those states where it makes sense.
Usually, maximizing reward leads to a deterministic policy. But in some cases deterministic policies are not a good fit for the problem; for example, in any two-player game, playing deterministically means the other player will be able to come up with countermeasures in order to win all the time. In the Rock-Paper-Scissors game, if we play deterministically, meaning the same shape every time, then the other player can easily counter our policy and win every game.

So in this game the optimal policy is stochastic, which is better than any deterministic one.

Blue Print

Before delving into the details of math and algorithms, it is useful to have an overview of how to proceed, a kind of blue print:

  1. Find an objective function which can be used to assess the effectiveness of the policy. In other words, something that tells us how good the results given by the policy are.
  2. Define the policies.
    What we mean is to list some useful policies that can be used in the learning process.
  3. Naive Algorithm.
    Propose an algorithm that makes direct use of the policies in order to learn the parameters.
  4. Improved Algorithms
    Find algorithms that improve the objective function in order to maximize the effectiveness of the policy.

Part 1 of the Blue Print: Finding the Objective Function

Remember that in the blue print above we talked about finding an objective function with the aim of assessing the effectiveness of the policy. In this section we will define the objective function and some of its useful derivations.
(More details about the policy gradient can be found in the article Policy Gradient Step by Step).

Objective Function

When talking about maximizing a function, one of the methods that stands out is the gradient.

But how are we going to maximize the rewards based on 𝜽?
One way to do it is to find an objective function J(𝜽) such that

J(𝜽) = V𝜋𝜽(s0)

where V𝜋𝜽 is the value function for policy 𝜋𝜽, and s0 is the starting state.

In short, maximizing J(𝜽) means maximizing V𝜋𝜽(s0).
It follows that

∇J(𝜽) = ∇V𝜋𝜽(s0)

According to the Policy Gradient theorem,

∇J(𝜽) ∝ Σs 𝝻(s) Σa q𝜋(s,a) ∇𝜋(a|s, 𝜽)

where 𝝻(s) is the on-policy state distribution under 𝜋 (meaning the probability of being at state s when following policy 𝜋), q𝜋(s,a) is the action-value function under 𝜋, and ∇𝜋(a|s, 𝜽) is the gradient of 𝜋 given s and 𝜽.
Finally, the symbol ∝ means "proportional to".

So the theorem says that ∇J(𝜽) is proportional to the sum of the q function times the gradient of the policy, over all actions, at the states that we might be in. But how do we use the gradient of 𝜋(a|s, 𝜽) in a form that we can actually estimate from sampled experience?

It turns out that it is possible, as the following derivation shows:

∇J(𝜽) ∝ Σs 𝝻(s) Σa q𝜋(s,a) ∇𝜋(a|s, 𝜽)
      = Σs 𝝻(s) Σa 𝜋(a|s, 𝜽) q𝜋(s,a) ∇𝜋(a|s, 𝜽) / 𝜋(a|s, 𝜽)
      = E𝜋[ q𝜋(St,At) ∇Log 𝜋(At|St, 𝜽) ]

Reminder: (Log x)′ = 1/x, so ∇𝜋 / 𝜋 = ∇Log 𝜋; this is why multiplying and dividing by 𝜋(a|s, 𝜽) makes the Log appear.

∇Log 𝜋𝜃(s,a) is called the score function.
Note that the gradient of the objective can now be expressed as an expectation over the states and actions encountered while following 𝜋, which is something we can estimate by sampling. If you are asking yourself why, check the Wikipedia article on Expected Value.
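
As a sanity check (a sketch of my own, not from the article), the following snippet compares the exact gradient of E𝜋[q] with the Monte-Carlo estimate of E𝜋[q · ∇Log 𝜋] for a toy three-action softmax distribution; the q values and the parameters are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1, 0.4])     # preferences of a softmax over 3 actions
q = np.array([1.0, 2.0, 0.5])          # made-up q(s, a) values for a single state

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

p = pi(theta)

# Exact gradient of J(theta) = sum_a pi(a|theta) q(a):
# for a softmax, dJ/dtheta_k = pi_k * (q_k - sum_a pi_a q_a)
exact_grad = p * (q - p @ q)

# Score-function estimate: sample A ~ pi and average q(A) * grad log pi(A)
samples = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[samples] - p         # grad_theta log pi(a) for a softmax
estimate = (q[samples][:, None] * score).mean(axis=0)

print(exact_grad)                      # analytic gradient
print(estimate)                        # Monte-Carlo estimate, should be close
```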

Parameters Update

Since this is a gradient method, the update of the parameters (that we are trying to optimize) is done in the usual gradient-ascent way:

𝜽t+1 = 𝜽t + 𝝰 ∇J(𝜽t)

where 𝝰 is the learning rate (step size) and ∇J(𝜽t) is estimated using the score-function expression above.

Part 2 of the Blue Print: Define Policies

This section explains a few standard policy parameterizations, such as the Softmax and Gaussian policies. We use these policies in RL algorithms to learn the parameters 𝜽.
In practice, whenever the RL algorithm refers to ∇Log 𝜋𝜃(s,a), we plug in the formula of the chosen policy.

Softmax Policy

The Softmax policy uses a softmax function to convert numerical preferences into a distribution of probabilities, which means it assigns a probability to each possible action.

Softmax is mostly used in the case of discrete actions:

𝜋(a|s, 𝜽) = exp(𝜽ᵀ𝜙(s,a)) / Σb exp(𝜽ᵀ𝜙(s,b))

It follows that the score function is

∇Log 𝜋𝜃(s,a) = 𝜙(s,a) − E𝜋𝜃[𝜙(s,·)]

where

E𝜋𝜃[𝜙(s,·)] = Σb 𝜋(b|s, 𝜽) 𝜙(s,b)

and 𝜙(s,a) is the feature vector of the state-action pair.
You can check the full derivation here.
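
As a hedged illustration, here is a minimal NumPy sketch of a linear Softmax policy and its score function. The state-action features 𝜙(s,a) are stacked into a matrix, one row per action; the shapes and values are assumptions made for the example, not part of the original article.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(a|s, theta) for a linear Softmax policy.

    phi_s: state-action features phi(s, a) stacked per action, shape (n_actions, d)
    theta: parameter vector, shape (d,)
    """
    prefs = phi_s @ theta
    prefs -= prefs.max()                     # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def softmax_score(theta, phi_s, a):
    """Score function: grad log pi(a|s, theta) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi_s)
    return phi_s[a] - probs @ phi_s

rng = np.random.default_rng(1)
phi_s = rng.normal(size=(3, 4))              # 3 actions, 4 features per pair
theta = np.zeros(4)
print(softmax_policy(theta, phi_s))          # uniform probabilities at theta = 0
print(softmax_score(theta, phi_s, a=1))      # direction that increases log pi(1|s)
```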

Gaussian Policy

A Gaussian policy is used in the case of a continuous action space, for example when driving a car: steering the wheel or pressing the gas pedal are continuous actions, because they are not chosen from a small set of options; you can (in theory) pick any rotation angle or any amount of gas flow.

The policy is a Gaussian whose mean is a linear function of the state features, 𝝁(s) = 𝜽ᵀ𝜙(s) (with a fixed or separately parameterized standard deviation 𝝈), and the score function becomes

∇Log 𝜋𝜃(s,a) = (a − 𝝁(s)) 𝜙(s) / 𝝈²
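
Here is a minimal hedged sketch of a Gaussian policy with a linear mean and a fixed standard deviation; the specific feature vector and the value of 𝝈 are assumptions made for the example.

```python
import numpy as np

SIGMA = 0.5                                  # fixed standard deviation (an assumption)

def mu(theta, phi_s):
    """Mean of the Gaussian policy: mu(s) = theta^T phi(s)."""
    return theta @ phi_s

def sample_action(theta, phi_s, rng):
    """Sample a continuous action a ~ N(mu(s), sigma^2)."""
    return rng.normal(mu(theta, phi_s), SIGMA)

def gaussian_score(theta, phi_s, a):
    """Score function: grad log pi(a|s, theta) = (a - mu(s)) * phi(s) / sigma^2."""
    return (a - mu(theta, phi_s)) * phi_s / SIGMA**2

rng = np.random.default_rng(2)
theta = np.array([0.1, -0.3, 0.2])
phi_s = np.array([1.0, 0.5, -1.0])           # phi(s) for the current state
a = sample_action(theta, phi_s, rng)
print(a, gaussian_score(theta, phi_s, a))
```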

Part 3 of the Blue Print: Naive Algorithm

This section gives algorithms that make use of the policies and of the objective function above in order to learn parameters that yield the best agent behavior.

REINFORCE (Monte-Carlo Policy Gradient)

This algorithm uses Monte-Carlo sampling to create episodes according to the policy 𝜋𝜃. Then, for each episode, it iterates over the steps of the episode and computes the total return G(t). Finally it uses G(t) and the score function ∇Log 𝜋𝜃(s,a) (which can come from a Softmax policy or another parameterization) to update the parameters: 𝜽 ← 𝜽 + 𝝰 𝛾^t G(t) ∇Log 𝜋𝜃(St,At).

(REINFORCE pseudocode, from the Sutton & Barto book, Reinforcement Learning: An Introduction.)
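
Below is a minimal, hedged sketch of REINFORCE with the linear Softmax policy from Part 2. The environment interface (env.reset() and env.step(a) returning the next state, the reward, and a done flag) and the feature function are assumptions made for the illustration, not part of the original article or of a specific library.

```python
import numpy as np

GAMMA, ALPHA = 0.99, 0.01

def softmax_probs(theta, phi_s):
    prefs = phi_s @ theta
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def run_episode(env, theta, features, rng):
    """Generate one episode following pi_theta; return a list of (phi_s, a, r)."""
    episode, state, done = [], env.reset(), False
    while not done:
        phi_s = features(state)                  # (n_actions, d) feature matrix
        probs = softmax_probs(theta, phi_s)
        a = rng.choice(len(probs), p=probs)
        state, r, done = env.step(a)             # assumed environment interface
        episode.append((phi_s, a, r))
    return episode

def reinforce(env, features, d, n_episodes=1000, seed=0):
    """Monte-Carlo policy gradient with a linear Softmax policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(n_episodes):
        episode = run_episode(env, theta, features, rng)
        G = 0.0
        # Walk the episode backwards so G accumulates the return from step t.
        for t in reversed(range(len(episode))):
            phi_s, a, r = episode[t]
            G = r + GAMMA * G
            probs = softmax_probs(theta, phi_s)
            score = phi_s[a] - probs @ phi_s     # grad log pi(a|s, theta)
            theta += ALPHA * (GAMMA ** t) * G * score
    return theta
```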

Part 4 of the Blue Print: Improved Algorithm

We have said that policy-based RL has high variance. However, there are several algorithms that can help reduce this variance, among which are REINFORCE with Baseline and Actor-Critic.

REINFORCE with Baseline Algorithm

The idea of the baseline is to subtract from G(t) an amount b(s), called the baseline, with the purpose of reducing the large variations (the variance) in the results.
Provided that b(s) does not depend on the action a, it can be shown that the equation of ∇J(𝜽) is still valid.

So now the question would be how to choose b(s) ?

One of the choices for the baseline is an estimate of the state value, v̂(St,w), where w is a parameter vector learned by some method such as Monte Carlo.
So b(St) = v̂(St,w).

The REINFORCE with Baseline algorithm then follows the same steps as REINFORCE, but it uses 𝛿 = G(t) − v̂(St,w) in place of G(t) when updating 𝜽, and uses the same 𝛿 to update w (see the sketch below).
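
A minimal hedged sketch of the change relative to plain REINFORCE, assuming a linear state-value baseline v̂(s,w) = wᵀx(s); the state features x(s), the episode format, and the two learning rates are assumptions made for the example.

```python
import numpy as np

GAMMA, ALPHA_THETA, ALPHA_W = 0.99, 0.01, 0.05

def softmax_probs(theta, phi_s):
    prefs = phi_s @ theta
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def baseline_episode_update(episode, theta, w):
    """One REINFORCE-with-baseline pass over an episode.

    episode: list of (x_s, phi_s, a, r) tuples collected under pi_theta, where
             x_s is the state feature vector used by the baseline v(s, w) = w.x_s
             and phi_s is the (n_actions, d) policy feature matrix for the state.
    """
    G = 0.0
    for t in reversed(range(len(episode))):
        x_s, phi_s, a, r = episode[t]
        G = r + GAMMA * G
        delta = G - w @ x_s                          # G(t) minus the baseline
        w = w + ALPHA_W * delta * x_s                # move v(s, w) toward G(t)
        probs = softmax_probs(theta, phi_s)
        score = phi_s[a] - probs @ phi_s             # grad log pi(a|s, theta)
        theta = theta + ALPHA_THETA * (GAMMA ** t) * delta * score
    return theta, w
```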

Actor Critic Algorithm

(A detailed explanation can be found in the Introduction to Actor Critic article.)
The Actor-Critic algorithm uses TD learning to compute the value function that is used as a critic.
The critic is a state-value function. It is useful to assess how things go after each action: the critic evaluates the new state to determine whether there has been an improvement or not. That evaluation is the TD error:

𝛿 = R + 𝛾 v̂(S′,w) − v̂(S,w)

𝛿 is then used to adjust the parameters 𝜽 and w.
In short, both 𝜽 and w are adjusted so as to correct this error.

(One-step Actor-Critic pseudocode, from the Sutton & Barto book, Reinforcement Learning: An Introduction.)
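
A minimal hedged sketch of one-step Actor-Critic with a linear Softmax actor and a linear critic; as before, the environment interface and the two feature functions are assumptions made for the illustration.

```python
import numpy as np

GAMMA, ALPHA_THETA, ALPHA_W = 0.99, 0.01, 0.05

def softmax_probs(theta, phi_s):
    prefs = phi_s @ theta
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def actor_critic(env, policy_features, value_features, d_theta, d_w,
                 n_episodes=1000, seed=0):
    """One-step Actor-Critic: the critic v(s, w) = w.x(s) supplies the TD error."""
    rng = np.random.default_rng(seed)
    theta, w = np.zeros(d_theta), np.zeros(d_w)
    for _ in range(n_episodes):
        state, done, I = env.reset(), False, 1.0     # I accumulates gamma^t
        while not done:
            phi_s = policy_features(state)           # (n_actions, d_theta)
            probs = softmax_probs(theta, phi_s)
            a = rng.choice(len(probs), p=probs)
            next_state, r, done = env.step(a)        # assumed environment interface
            x_s, x_next = value_features(state), value_features(next_state)
            v_next = 0.0 if done else w @ x_next     # bootstrap unless terminal
            delta = r + GAMMA * v_next - w @ x_s     # TD error (the critic's signal)
            w += ALPHA_W * delta * x_s               # critic update
            score = phi_s[a] - probs @ phi_s         # grad log pi(a|s, theta)
            theta += ALPHA_THETA * I * delta * score # actor update
            I *= GAMMA
            state = next_state
    return theta, w
```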
