The False Promise of Off-Policy Reinforcement Learning Algorithms

Do off-policy reinforcement learning algorithms deliver what they promise?

We have all witnessed the rapid development of reinforcement learning methods over the last couple of years. The biggest attention has gone to off-policy methods, and the reason is quite obvious: they scale really well in comparison to other methods. Off-policy algorithms can (in principle) learn from data without interacting with the environment. This is a nice property: it means that we can collect our data by any means we see fit and infer the optimal policy completely offline. In other words, we use a behavioral policy different from the one we are optimizing. Unfortunately, this doesn’t work out of the box the way most people think, as I will describe in this article. Before we start talking algorithms and math, a small heads-up:

To understand this article, you should probably understand basic concepts in reinforcement learning.

A shining example of an off-policy algorithm is DDPG (Deep Deterministic Policy Gradient). Supposedly. Why do I say supposedly? Well, the behavioral policy in DDPG is not completely decorrelated from the policy being optimized. Anyone who is familiar with the implementation of DDPG and its variants knows that exploration results from adding some kind of noise to the actions selected by the policy. Therefore we cannot call DDPG a pure off-policy algorithm, because the policy that we optimize over is actually used to acquire data (to explore).
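To make this concrete, here is a minimal sketch of how DDPG-style exploration typically looks. The `toy_policy` is a hypothetical stand-in for the learned deterministic actor; the original DDPG paper used Ornstein-Uhlenbeck noise, while plain Gaussian noise (shown here) is the common modern simplification:

```python
import numpy as np

def behavioral_action(policy, state, noise_std=0.1, low=-1.0, high=1.0):
    """DDPG-style exploration: take the deterministic policy's action,
    add Gaussian noise, and clip to the valid action range."""
    action = policy(state)
    noisy = action + np.random.normal(0.0, noise_std, size=np.shape(action))
    return np.clip(noisy, low, high)

# Hypothetical deterministic policy, just for illustration:
toy_policy = lambda s: np.tanh(s)

a = behavioral_action(toy_policy, np.array([0.5, -0.2]))
```

Note that the data-collecting (behavioral) policy here is literally the optimized policy plus noise, which is exactly why the two are correlated.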

What about Q-learning, or deep Q-learning and all of its variants: are they true off-policy algorithms? No, they are not. In Q-learning, we do something called epsilon-greedy exploration, which basically means that with a small probability we choose a random action instead of the current optimal estimate of the action. So the behavioral policy is not that different from the policy that we are optimizing over.
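Epsilon-greedy action selection fits in a few lines; a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Behavioral policy in Q-learning: with probability epsilon pick a
    uniformly random action, otherwise the greedy action under the
    current Q estimate. q_values holds Q(s, a) for each action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With a small epsilon the behavioral policy follows the optimized Q-function most of the time, which is the correlation the article is pointing at.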

As it turns out, the fact that these algorithms are not pure off-policy algorithms is the very reason they work consistently. This is exactly what was shown in [1]. Interestingly, the paper bears the title “Off-Policy Deep Reinforcement Learning without Exploration,” which may be a bit misleading, since exploration is definitely needed in reinforcement learning. The point is that we generally don’t want to use the policy we are optimizing for exploration in off-policy methods (this is what should make them off-policy in the first place).

One of the experiments the authors of [1] conducted was to train a DDPG policy truly off-policy, on experience collected by another DDPG policy. They took two completely different initial policies: one was trained iteratively while doing the data acquisition, while the other was never used for data acquisition at all and was trained only on the data acquired by the first. This was evaluated on the Hopper task in OpenAI Gym, which is not even that hard:

If you take a look at the resulting average-return graph (the return here is essentially the distance traveled within the episode), the results are quite shocking (if you don’t know what is happening behind the scenes):


One can see that the policy actually used to collect the data (the orange line) improved into regions of high return, whereas the other policy, which was only updated from the buffer (the blue line), failed to achieve a high average return. What should we think about this? Are we doomed? Are the robots going to remain miserable mindless creatures?

It is actually not that hard to understand. The situation is completely explainable through good old-fashioned statistical learning theory. The problem with reinforcement learning is that the data-generating distribution, i.e. the distribution of the transitions we obtain by interacting with the environment, itself depends on the policy used for interacting (logically). This means that, in a truly off-policy setting, the distribution of the transitions in our collected experience or dataset doesn’t necessarily reflect the distribution induced by the policy that we are optimizing.

In other words, our training set isn’t distributed like the actual data we are going to encounter. This can be quite a big problem for machine learning algorithms in general. Nevertheless, people have come up with “clever” tricks to mitigate it, such as importance sampling, which essentially reweights the examples based on their likelihood under the target distribution. The authors of [1] argue, however, that this is not enough to alleviate the problem when the batch used for training doesn’t contain any high-likelihood transitions.
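A minimal sketch of the importance-sampling idea (function names are mine, for illustration): samples drawn from the behavior distribution are reweighted by w(x) = p_target(x) / p_behavior(x) so that averages behave as if taken under the target distribution.

```python
import numpy as np

def importance_weights(target_logp, behavior_logp):
    """w(x) = p_target(x) / p_behavior(x), computed from log-probabilities."""
    return np.exp(np.asarray(target_logp) - np.asarray(behavior_logp))

def is_estimate(f_values, target_logp, behavior_logp):
    """Estimate E_target[f(x)] from samples drawn under the behavior
    distribution, by averaging the reweighted values w(x) * f(x)."""
    w = importance_weights(target_logp, behavior_logp)
    return np.mean(w * np.asarray(f_values))
```

The failure mode [1] points at is visible here: if the batch simply contains no samples from the regions where the target policy puts its mass, no reweighting of the samples that *are* there can fix the estimate.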


To address this mismatch between the data-generating distribution and the training distribution, the authors of [1] derived the Batch-Constrained Q-learning algorithm (BCQ). The idea of the algorithm is quite simple, and it addresses the discrepancy between the behavioral policy’s data-generating distribution and the distribution induced by the current policy directly, although the implementation is a bit involved. The idea is to learn a conditional generative model over actions, so that only state-action pairs that are likely to have occurred in the experience are generated for calculating the TD-target in the TD-error update. A quick recap for those of you who forgot how the bootstrapping procedure in the TD-error update works; this is the TD-error:
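In case the equation image doesn’t render, the TD-error can be written as follows (my reconstruction in the notation the next paragraph uses; the subscripts θ and θ′ for the online and target network parameters are my addition):

\[
\delta = \big\| \, r + \gamma \, Q_{\theta'}\big(s', \pi(s')\big) - Q_{\theta}(s, a) \, \big\|
\]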


So basically we compute the error between the current reward plus our discounted next-step estimate and the current-step estimate. The procedure is called bootstrapping because we bootstrap on our own future estimates to update our current estimates. Now notice the π(s’) in the formula. This is where the magic happens: the policy that we are optimizing may generate actions that have nothing to do with the data-generating distribution, which is conditioned on our behavioral policy. Note that the r + γ·Q term inside the norm is referred to as the TD-target, and the norm of the difference between the target and the current estimate is referred to as the TD-error.

Instead of using the policy directly in the TD-error update, a separate perturbation policy is trained that applies small perturbations to the actions sampled from the generative model. The generative model, in this case, is trained on the data encountered. In a nutshell, this means that the policy resulting from BCQ takes the following form:
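For reference, the BCQ policy has roughly this form (my reconstruction; ξ_φ is the perturbation model, G_ω the conditional generative model, and n the number of sampled candidate actions — the exact notation in [1] may differ slightly):

\[
\pi(s) = \underset{a_i + \xi_{\phi}(s, a_i)}{\operatorname{argmax}} \; Q_{\theta}\big(s,\, a_i + \xi_{\phi}(s, a_i)\big), \qquad a_i \sim G_{\omega}(s), \; i = 1, \dots, n
\]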


The perturbation policy is trained by maximizing the Q-value over the state-action pairs in the buffer. This again is quite natural, since we want to take the actions that maximize the long-term reward. As I said earlier, these perturbations are kept small, since we don’t want to generate actions that deviate too much from the ones encountered in the experience:
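The training objective for the perturbation policy can be written as (my reconstruction; B denotes the batch of stored transitions, and the output of ξ_φ is clipped to a small range [−Φ, Φ]):

\[
\phi \leftarrow \underset{\phi}{\operatorname{argmax}} \sum_{(s, a) \in \mathcal{B}} Q_{\theta}\big(s,\, a + \xi_{\phi}(s, a)\big)
\]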


Additionally, another trick can be used to discourage taking actions with an uncertain outcome. We can train two Q-networks in parallel and take the minimum of the two when computing the TD-target value, effectively penalizing high-variance regions in the optimization. The authors actually opted for a convex combination of the minimum and the maximum, parametrized by a parameter λ. By tweaking λ, we can trade off how heavily uncertainty is penalized.
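A minimal sketch of this twin-Q target (function name and argument layout are mine): `q1_next` and `q2_next` hold the two networks’ value estimates for the candidate next actions, and the target takes the best candidate under the λ-combined estimate.

```python
import numpy as np

def twin_q_td_target(reward, q1_next, q2_next, gamma=0.99, lam=0.75, done=False):
    """TD-target using two Q networks: a convex combination of the
    element-wise minimum and maximum of the two next-state estimates.
    lam=1 recovers the pure minimum (Clipped Double Q-learning),
    lam=0 the pure maximum."""
    combined = lam * np.minimum(q1_next, q2_next) \
        + (1.0 - lam) * np.maximum(q1_next, q2_next)
    # Best candidate action under the combined estimate; no bootstrap
    # term on terminal transitions.
    return reward + (0.0 if done else gamma) * np.max(combined)
```

With λ close to 1, candidate actions on which the two networks disagree get a pessimistic value, which is precisely how high-variance regions are discouraged.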


In this way, we can be reasonably sure that the transition distribution we use to fit our policy is similar to the distribution in the training data. Although this is not exactly trivial to achieve for continuous state-action spaces, because we need to model the data-generating distribution with a generative model such as a variational autoencoder, I find it quite remarkable that this issue is only now being addressed. The requirement that the training data be distributed similarly to the actual data is part of the fundamentals of statistical learning theory. [1] is a shining example of why we need better theoretical analysis of reinforcement learning algorithms. Paper verdict:
