On-Policy vs. Off-Policy Learning

Understanding Importance Sampling for Off-Policy Learning

Abhishek Suran
Towards Data Science


In this article, we will try to understand the difference between on-policy learning and off-policy learning, which can be confusing for people new to reinforcement learning, and then dive into the concept of importance sampling for off-policy learning. Let us first look at two terms before moving further.

  1. Target Policy pi(a|s): the policy that the agent is trying to learn, i.e., the agent is learning a value function for this policy.
  2. Behavior Policy b(a|s): the policy that the agent uses for action selection, i.e., the agent follows this policy to interact with the environment.
Example of behavior and target policies (image made via https://app.diagrams.net/)

On-Policy Learning:

On-policy learning algorithms evaluate and improve the same policy that is used to select actions. That means we evaluate and improve the very policy the agent is already following for action selection. In short, [Target Policy == Behavior Policy]. Examples of on-policy algorithms are SARSA and on-policy Monte Carlo control.

Off-Policy Learning:

Off-policy learning algorithms evaluate and improve a policy that is different from the policy used for action selection. In short, [Target Policy != Behavior Policy]. Examples of off-policy algorithms are Q-learning and Expected SARSA (which can be used either on-policy or off-policy).
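
As a quick illustration, the standard one-step updates make the difference concrete. SARSA bootstraps from the action At+1 actually chosen by the policy the agent is following, while Q-learning bootstraps from the greedy (max) action, i.e., it evaluates a greedy target policy even though actions are selected by an exploratory behavior policy:

Q(St, At) ← Q(St, At) + α · [ Rt+1 + γ · Q(St+1, At+1) - Q(St, At) ]        (SARSA, on-policy)

Q(St, At) ← Q(St, At) + α · [ Rt+1 + γ · max_a Q(St+1, a) - Q(St, At) ]     (Q-learning, off-policy)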

Note: the behavior policy must cover the target policy, i.e., b(a|s) > 0 wherever pi(a|s) > 0.

Why use Off-Policy?

Some benefits of Off-Policy methods are as follows:

  1. Continued exploration: because the agent learns about the target policy while acting with a separate behavior policy, it can keep exploring the environment while still learning the optimal policy. An on-policy agent that must keep exploring can only learn a near-optimal, exploratory policy.
  2. Learning from demonstration: the agent can learn from demonstrations, i.e., from trajectories generated by a human or by another policy.
  3. Parallel learning: experience collected by several behavior policies can be reused for the same target policy, which can speed up convergence.

Using Importance Sampling for Off-Policy Learning

Now that we know the difference between off-policy and on-policy learning, the question that arises is: how can we estimate the expectation of state values under one policy while following another? This is where importance sampling comes in handy. Let us understand it with the Monte Carlo update rule.

Image via Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto

As you can see, the update rule consists of an average of all the sampled returns from a state. These returns are generated by following the behavior policy b(a|s), but we want to estimate values for the target policy pi(a|s), which would require returns sampled under pi(a|s). We can correct for this by multiplying each return sampled under the behavior policy by the ratio ‘ρ’. The value of ‘ρ’ is the probability of the trajectory under the target policy pi(a|s) divided by the probability of the same trajectory under the behavior policy b(a|s). A trajectory's probability is the probability that the agent takes action ‘At’ in state ‘St’, moves into state ‘St+1’, then takes action ‘At+1’, and so on until the terminal time T. Each step contributes two factors: the probability of taking action ‘At’ in state ‘St’ (the stochastic policy) and the probability of ending up in state ‘St+1’ after taking action ‘At’ in state ‘St’ (the stochastic environment). Since the environment's transition probabilities appear in both the numerator and the denominator of ‘ρ’, they cancel, so ‘ρ’ depends only on the two policies.
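
In the notation of Sutton and Barto's book (the source of the figure above), the ratio for the trajectory from time t to the episode's terminal time T is

ρ_{t:T-1} = Π_{k=t}^{T-1} [ pi(A_k|S_k) · p(S_{k+1}|S_k, A_k) ] / [ b(A_k|S_k) · p(S_{k+1}|S_k, A_k) ] = Π_{k=t}^{T-1} pi(A_k|S_k) / b(A_k|S_k)

and the ordinary importance-sampling estimate of a state's value is the average of the weighted returns,

V(s) ≈ Σ_{t ∈ T(s)} ρ_{t:T-1} · G_t / |T(s)|

where T(s) is the set of time steps at which state s is visited and G_t is the return that follows time t.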

Derivation of Importance Sampling:

Consider a random variable ‘x’ sampled from a probability distribution ‘b’, and suppose we want to estimate the expected value of ‘x’ with respect to the target distribution ‘pi’. The expectation can be written as
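
In symbols (assuming, for simplicity, that x takes values in a discrete set; the continuous case replaces the sum with an integral):

E_pi[X] = Σ_x pi(x) · x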

Now, we can multiply and divide by the probability of sampling x via ‘b’. Moving b(x) into the denominator under pi(x) gives us the importance-sampling ratio ‘ρ’.
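
Written out, this step is:

E_pi[X] = Σ_x ( pi(x) / b(x) ) · b(x) · x = Σ_x ρ(x) · b(x) · x,   where ρ(x) = pi(x) / b(x)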

Now, if we treat x·ρ(x) as a new random variable, we can write its mean as an expectation under ‘b’.
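
That is:

E_pi[X] = E_b[ X · ρ(X) ]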

Finally, this expectation can be estimated from data as a weighted sample average, where the ratios ‘ρ’ act as weights.
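
With samples x_1, …, x_n drawn from ‘b’, the estimate is:

E_pi[X] ≈ (1/n) · Σ_{i=1}^{n} x_i · ρ(x_i)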

So we can sample ‘x’ from ‘b’ and estimate its expected value under ‘pi’ using the formula above.
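
To see this work numerically, here is a minimal Python sketch (the discrete distributions ‘pi’ and ‘b’ below are made-up illustrative values, not from the article). It draws samples only from the behavior distribution ‘b’ and recovers the mean under the target distribution ‘pi’ by weighting each sample with ρ(x) = pi(x)/b(x):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative discrete distributions over the values x = 0, 1, 2 (made-up numbers).
    xs = np.array([0.0, 1.0, 2.0])
    pi = np.array([0.1, 0.3, 0.6])   # target distribution pi(x)
    b = np.array([0.4, 0.4, 0.2])    # behavior distribution b(x); b(x) > 0 wherever pi(x) > 0

    # Sample only from the behavior distribution.
    idx = rng.choice(len(xs), size=100_000, p=b)
    x = xs[idx]
    rho = pi[idx] / b[idx]           # importance-sampling ratio for each sample

    print("true E_pi[X]      :", np.dot(xs, pi))    # 1.5
    print("weighted estimate :", np.mean(x * rho))  # close to 1.5
    print("naive estimate    :", np.mean(x))        # close to E_b[X] = 0.8

The weighted average converges to the mean under ‘pi’ (1.5 here), while the unweighted average converges to the mean under ‘b’ (0.8), which is exactly the gap the importance-sampling ratio corrects for.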

So, this concludes the article. Thank you for reading; I hope you enjoyed it and were able to follow what I wanted to explain. I hope you will read my upcoming articles as well. Hari Om…🙏

References:

  1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018.