Till now, we have extended our reinforcement learning topic from discrete states to continuous states and have elaborated a bit on applying tile coding to on-policy learning, that is, learning that follows the trajectory the agent takes. Now let's talk about off-policy learning in continuous settings. While in discrete settings on-policy learning can easily be generalised to off-policy learning (say, from Sarsa to Q-learning), in continuous settings the generalisation can be a little troublesome and in some scenarios can cause divergence issues. In this article, I will:
- Introduce the problems caused by the extension to off-policy learning
- Illustrate the problem by giving the Baird counter example
- Introduce a potential method to solve the problem

Problems of Off-Policy Learning
The most prominent consequence of off-policy learning is that it may not converge in continuous settings. The main reason is that the distribution of updates in the off-policy case does not follow the on-policy distribution, that is, the state and action used for the update might not be the state and action the agent actually takes. Let's get into the update formula and discuss the problem further. In the continuous-state off-policy case, the 1-step update follows:

w ← w + α·ρ·δ·∇v(S, w), with δ = R + γ·v(S′, w) − v(S, w)
Notice that compared to the on-policy version, there is one more term ρ, called the importance sampling ratio. ρ is the probability of the action under the target policy π divided by its probability under the behaviour policy b. In control, the target policy is typically the deterministic greedy policy with respect to the current estimate of the action-value function; it becomes a deterministic optimal policy while the behaviour policy remains stochastic and more exploratory, for example an ϵ-greedy policy. Recall that in the discrete-state case of Q-learning, the temporal difference is max_a Q(S', a) - Q(S, A) (disregarding the reward), so each update uses the maximum Q value (coming from the greedy policy) instead of the value of the action actually taken by the agent (the behaviour policy). In continuous space, importance sampling is leveraged to estimate values under the target policy from samples generated by the behaviour policy. For more explanation of ρ, you can refer to [here](https://stats.stackexchange.com/questions/237085/how-to-correctly-compute-rho-in-reinforcement-learning-with-importance-sampli?rq=1), and for more explanation of importance sampling, please refer to [here](https://medium.com/@zhangyue9306/importance-sampling-introduction-e76b2c32e744).
The rest of the formula is the same as in on-policy learning, where w is the weight vector and δ is the temporal difference (note that δ is slightly different depending on whether the problem is episodic or not; for the non-episodic case, please refer to here).
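To make the update concrete, below is a minimal sketch of a single semi-gradient off-policy TD(0) step with linear features; the function name and arguments are my own and not part of the original code.

```python
import numpy as np

def off_policy_td0_step(w, x, x_next, reward, rho, alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with linear features v(s) = w·x(s)."""
    # temporal difference for the episodic case
    delta = reward + gamma * np.dot(w, x_next) - np.dot(w, x)
    # with linear features, the gradient of v(S, w) with respect to w is just x(S)
    return w + alpha * rho * delta * x
```

For instance, if the behaviour policy chose the greedy action with probability 0.5, the ratio would be ρ = 1 / 0.5 = 2; if it chose an action the target policy never takes, ρ would be 0 and the step would leave w unchanged.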
Referring to Sutton's description in his book, the divergence of off-policy learning is caused by:
One transition occurs repeatedly without w being updated on other transitions. This is possible under off-policy training because the behaviour policy might select actions on those other transitions which the target policy never would.
No worries if you are a little confused here; let's proceed to an example and see how the weights diverge.
Baird Counter Example
This is a famous and concise example in illustrating the divergence feature of off-policy learning.
Problem Description
(Figure: the seven-state Baird counterexample with dashed and solid actions.)
The dashed action takes the system to one of the six upper states with equal probability, whereas the solid action takes the system to the seventh state. The behaviour policy b selects the dashed and solid actions with probabilities 6/7 and 1/7, so that the next-state distribution under it is uniform (the same for all nonterminal states), which is also the starting distribution for each episode. The target policy π always takes the solid action, so the on-policy distribution (for π) is concentrated in the seventh state. The reward is 0 on all transitions. The discount rate is γ = 0.99. (The target policy is fixed only for simplicity and illustration purposes.)
Notice that inside each circle is the representation of that state in terms of the weight vector.
In the following implementation, the step size is α = 0.01, and the initial weights are w = (1, 1, 1, 1, 1, 1, 10, 1).
Implementation
Check out the full implementation.
Initialisation
Besides some general settings, note that self.features holds the representation of each state and self.weights is the weight vector (its initial value is purposely set). The product of features and weights gives the value of each state.
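As a rough sketch of what the initialisation might look like (the class name and exact attribute layout are my assumptions; only self.features, self.weights and self.prob are referenced in the article):

```python
import numpy as np

class BairdCounterExample:
    def __init__(self, alpha=0.01, gamma=0.99):
        self.alpha = alpha        # step size
        self.gamma = gamma        # discount rate
        self.prob = 1 / 7         # probability of the solid action under the behaviour policy
        # one feature row per state: the six upper states are 2*w_i + w_8,
        # the seventh (bottom) state is w_7 + 2*w_8
        self.features = np.zeros((7, 8))
        for i in range(6):
            self.features[i, i] = 2
            self.features[i, 7] = 1
        self.features[6, 6] = 1
        self.features[6, 7] = 2
        # initial weights are purposely set to (1, 1, 1, 1, 1, 1, 10, 1)
        self.weights = np.array([1., 1., 1., 1., 1., 1., 10., 1.])
```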
Action
Action choosing and taking follow the rules above: the behaviour policy picks the solid action with probability 1/7 and the dashed action otherwise; the solid action leads to the seventh state, while the dashed action leads to one of the six upper states with equal probability.
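A possible sketch of the two methods, continuing the class above (the method names are my own):

```python
    # methods of the BairdCounterExample sketch above
    def choose_action(self):
        # behaviour policy: solid with probability 1/7, dashed otherwise
        return 'solid' if np.random.random() < self.prob else 'dashed'

    def take_action(self, action):
        # solid always moves to the seventh (bottom) state,
        # dashed moves to one of the six upper states with equal probability
        return 6 if action == 'solid' else np.random.randint(6)
```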
Value Function
The value function, as stated above, returns the value of a state as the dot product of its feature vector and the weight vector.
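A possible sketch, continuing the class above:

```python
    # method of the BairdCounterExample sketch above
    def value(self, state):
        # linear approximation: v(s) = features(s) · w
        return self.features[state] @ self.weights
```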
Run Off-Policy
With the above preparations, we are ready to let the agent learn. At each round, the agent repeats the process of state, action, next state, next action, and so on. Because we set the target policy to always choose the solid action, the importance sampling ratio is 0 when the agent takes the dashed action and 1/self.prob when it takes the solid action. The sarsa argument controls whether on-policy learning is used (in on-policy learning the target policy is always the behaviour policy, so ρ = 1); this is for result comparison only.
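Here is a sketch of what the run function could look like under the assumptions above; the steps and sarsa arguments mirror the description, but the details of the original code may differ.

```python
    # method of the BairdCounterExample sketch above
    def run(self, steps=1000, sarsa=False):
        state = np.random.randint(7)                 # starting distribution is uniform
        history = []
        for _ in range(steps):
            action = self.choose_action()
            next_state = self.take_action(action)
            if sarsa:
                rho = 1.0                            # on-policy: target equals behaviour, so ρ = 1
            elif action == 'dashed':
                rho = 0.0                            # the target policy never takes the dashed action
            else:
                rho = 1 / self.prob                  # π(solid|s) / b(solid|s) = 1 / (1/7) = 7
            # reward is 0 on every transition, so δ = γ·v(S') − v(S)
            delta = self.gamma * self.value(next_state) - self.value(state)
            # semi-gradient update: the gradient of v(S, w) w.r.t. w is the feature vector of S
            self.weights = self.weights + self.alpha * rho * delta * self.features[state]
            history.append(self.weights.copy())
            state = next_state
        return history
```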
Learning Result
(Figure: weight trajectories over training, off-policy semi-gradient TD versus Sarsa.)
See that in off-policy learning the weights diverge to infinity, while the on-policy method (Sarsa in this case) converges. Now let's look at the image again and explain the reason:

For example, when the agent takes a solid action from the leftmost state (2w1 + w8) to the bottom state (w7 + 2w8) and w8 increases through the formula above, the values of both states increase, since they share the weight w8. However, taking the dashed action from the bottom state contributes nothing to off-policy learning, because ρ is 0 in that case. As a result, the value of w8 keeps increasing and never decreases. This is what Sutton said:
One transition occurs repeatedly without w being updated on other transitions. This is possible under off-policy training because the behaviour policy might select actions on those other transitions which the target policy never would.
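To make the mechanism concrete, here is a rough worked step under the initial weights (illustrative numbers only, derived from the setup above):

```python
import numpy as np

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])
x_left   = np.array([2., 0., 0., 0., 0., 0., 0., 1.])   # leftmost state: 2*w1 + w8 = 3
x_bottom = np.array([0., 0., 0., 0., 0., 0., 1., 2.])   # bottom state: w7 + 2*w8 = 12
alpha, gamma, rho = 0.01, 0.99, 7.0                       # solid action: ρ = 1 / (1/7) = 7

delta = gamma * w @ x_bottom - w @ x_left                 # 0.99 * 12 - 3 = 8.88
w += alpha * rho * delta * x_left                         # both w1 and w8 increase
print(w[0], w[7])                                         # ≈ 2.24 and ≈ 1.62
```

Updates for dashed transitions are multiplied by ρ = 0 and so contribute nothing, while with the initial weights every solid transition from an upper state produces a large positive δ, pushing both that state's own weight and the shared w8 up.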
Solutions
In terms of solutions, the first is simply not to use off-policy learning in continuous settings and to stick with on-policy learning instead. Another is to change the target error function. Till now, all function-approximation updates have been based on the target error (true_value - current_value)^2, but to make the training process converge, Sutton in his book suggests minimising the Projected Bellman Error (PBE). For a detailed explanation please refer to his book here, chapter 11.
Here I give a simple implementation of this method and compare it with classic off-policy learning. By minimising the PBE, one gets an update formula of:

w ← w + α·ρ·(δ·x − γ·x′·(xᵀ·v))
v ← v + β·ρ·(δ − vᵀ·x)·x

where β is another step-size parameter, x is the feature vector, and vᵀ·x approximates the expected value of ρ·δ. This method is called TD with gradient correction (TDC).
Implementation of TDC
The only difference lies in the run function:
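A sketch under the same assumptions as before (the secondary step size β and the initialisation of the secondary vector v are my choices, not from the original code):

```python
    # method of the BairdCounterExample sketch above
    def run_tdc(self, steps=1000, beta=0.005):
        v = np.zeros(8)                              # secondary weight vector
        state = np.random.randint(7)
        history = []
        for _ in range(steps):
            action = self.choose_action()
            next_state = self.take_action(action)
            rho = 0.0 if action == 'dashed' else 1 / self.prob
            x, x_next = self.features[state], self.features[next_state]
            delta = self.gamma * self.value(next_state) - self.value(state)   # reward is 0
            # main weights:      w ← w + α·ρ·(δ·x − γ·x'·(xᵀ·v))
            self.weights = self.weights + self.alpha * rho * (delta * x - self.gamma * x_next * (x @ v))
            # secondary weights: v ← v + β·ρ·(δ − vᵀ·x)·x
            v = v + beta * rho * (delta - v @ x) * x
            history.append(self.weights.copy())
            state = next_state
        return history
```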
The process follows exactly the formula above, and with TDC we get a learning process of:

(Figure: weight trajectories under TDC, slowly converging.)

See that the weights converge, though slowly, towards the optimal values. There are also other methods introduced to solve this problem; you can explore them yourself if you are interested (check out the full implementation).
Lastly, quoting the closing section of Sutton's book:
The whole area of off-policy learning is relatively new and unsettled. Which methods are best or even adequate is not yet clear. Are the complexities of the new methods introduced at the end of this chapter really necessary? Which of them can be combined effectively with variance reduction methods? The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.
Reference: