The Power of Offline Reinforcement Learning: Part I

RL algorithms that could potentially scale to real-world problems

Or Rivlin
Towards Data Science


Limitations of Online RL

Reinforcement learning has grown rapidly in the past few years, from tabular methods that can only solve simple toy problems to powerful algorithms that tackle incredibly complex tasks such as playing Go, learning robotic manipulation skills or controlling autonomous vehicles. Unfortunately, adoption of RL for real-world applications has been somewhat slow, and while current RL methods have proven their ability to find high-performing policies for challenging problems with high-dimensional raw observations (such as images), actually using them is often difficult or impractical. This is in stark contrast to supervised learning methods, which are highly prevalent in many fields of industry and research and are used with great success. Why is that?

Most RL research papers and implementations are geared towards the online learning setting, in which the agent interacts with an environment and gathers data, using its current policy and some exploration scheme to explore the state-action space and find higher-reward areas. This is typically illustrated in the following manner:

[Figure: the online RL loop — an agent interacts with the environment, gathers experience and updates its policy (original art)]

Such online RL algorithms interact with the environment and use the gathered experience either immediately or via some replay buffer to update the policy. The important thing is that data is gathered directly in the environment and only that data is used for learning, with learning and gathering interleaved.

This introduces several difficulties:

- The agent must gather sufficient data for learning each skill/task, which could be prohibitively expensive for systems like robots or autonomous cars.

- The agent interacts with the environment using a partially trained policy, which might take potentially unsafe actions, such as administering a wrong drug to a patient.

- The need to gather specialized data for each task using the training environment often induces a very narrow distribution of states, which can make the policy brittle to slight changes and therefore untrustworthy for deployment.

These are not the only difficulties one might face when applying RL to real-world problems, but they could be a major factor in the decision to not use RL for your task. It is enough to briefly glance at current RL research papers to see that even relatively simple simulated tasks often require millions of interaction steps to learn a good policy, so how practical would it be to try this on real robots and gather such quantities of data for each new task?

Interestingly, these problems are not as common in supervised learning. When training an image classifier or object detection network, practitioners often have access to vast datasets of labeled data from diverse real-world settings. This is why many such supervised learning models sometimes generalize surprisingly well even on input images that are quite different from those encountered during training, and can often be fine-tuned for new tasks with very little labeled task data. A similar pattern can be seen in the NLP community, where large models pretrained on enormous datasets are very helpful for learning new tasks, making the process practical with only modest requirements in terms of labeled task data.


So why not do the same when learning policies? Suppose we want to learn some robotic skill from images; can’t we just use a model pretrained on ImageNet? As it turns out, this doesn’t have quite the same benefit we are used to in supervised learning: such a model gives us no clue about which actions to take, since it was not trained on any task. What else can we do to alleviate this issue and make RL more applicable to real problems?

Off-policy RL and distribution shift

A very impressive paper was published in 2018, called “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation”. In this paper the authors (from Google) used several robots to simultaneously gather data and train a policy for grasping objects in a bin. They ran their experiment for a few months straight, performing 580K grasp attempts in the process and producing a state-of-the-art grasping policy. At its core, their method is based on Q-learning, an RL method rooted in dynamic programming. Our goal in RL is to find a policy that maximizes the expected value:
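
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right], \qquad \pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{s}\left[V^{\pi}(s)\right]$$

where $\gamma \in (0, 1)$ is the discount factor and $r_t$ is the reward received at step $t$.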

The value function of a policy at a given state tells us the sum of discounted rewards we can expect to collect by following the policy from that state, and the objective in RL is to find the policy for which the expected value over all states is the largest. We can also define the Q-value, which is the value we expect to get by taking a specific action and then following the policy from that point on:
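
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\; a_{0} = a\right]$$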

In Q-learning we try to find the optimal Q-function (or the Q-function of an optimal policy) by minimizing the Bellman error:
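
$$\mathcal{L}(Q) = \mathbb{E}_{(s, a, r, s')}\left[\left(Q(s, a) - \left(r + \gamma \max_{a'} Q(s', a')\right)\right)^{2}\right]$$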

A Q-function for which the Bellman error is zero for all state-action pairs is guaranteed to be optimal, and we can extract the optimal policy by taking the action with the highest Q-value at every state:
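
$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$$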

Of course, for real problems the Q-function is approximated by some deep neural network, and some modifications to the basic Q-learning recipe are needed to make it work that way. The QT-Opt paper used such a Q-learning based method and got very impressive results (which required Google-scale resources), but the authors also performed another interesting experiment: they took all the data gathered during training and tried to train a new Q-function from scratch using only that data, without any further interaction with the robots.

In principle, this should have worked. If we look at the math above, we can see that the Q-learning algorithm is actually agnostic to the source of the data, meaning we are allowed to apply it to data gathered by any other policy or procedure, and in particular we should be able to use it on the logged training data. This property is called off-policy learning, and it is one of the main reasons that Q-learning is an efficient learning method, as it can reuse data from arbitrary sources. Surprisingly, however, the authors observed a significant performance gap between the original policy (trained on the same data it was gathering, as it was gathering it) and the one trained using the static data, with learning from the static data yielding much worse performance. This procedure of learning a policy or Q-function from static data with no further interaction with the environment is called offline RL (sometimes called batch RL), as opposed to the online RL setting in which we gather new data directly from the environment.

[Figure: the offline RL setting — data from various sources is collected into a static buffer, a policy is trained on that buffer and only then deployed (original art)]

The figure above shows what offline RL looks like: data is gathered by some source or sources (a policy, a script, human demonstrators, etc.) and kept in a buffer; a policy is trained offline using only that data and is deployed to the real world only after being fully trained.

If the gathered data was good enough to learn the first policy successfully, why does it not work well when training offline? Let us examine the Bellman error from our Q-learning algorithm again:
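
$$\mathcal{L}(Q) = \mathbb{E}_{(s, a, r, s')}\left[\left(Q(s, a) - \left(r + \gamma \max_{a'} Q(s', a')\right)\right)^{2}\right]$$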

When implementing it in practice we do something like this:
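
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(Q_{\theta}(s_{i}, a_{i}) - \left(r_{i} + \gamma \max_{a'} Q_{\bar{\theta}}(s'_{i}, a')\right)\right)^{2}$$

where the expectation is replaced by an average over a sampled batch of transitions and $\bar{\theta}$ denotes a (typically slowly updated) target copy of the network, one of the standard modifications mentioned above.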

We try to minimize the bellman error for transitions of the form:
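
$$(s, a, r, s')$$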

These contain the state, the action taken in that state, the reward received for taking that action, and the state observed afterwards. When calculating the target for the Bellman error:
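
$$y = r + \gamma \max_{a'} Q(s', a')$$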

We take the best Q-value our Q-function thinks it can get by taking an action a’ and use it to correct our current estimate of the Q-value at the state s. This is called bootstrapping and is a key aspect of Q-learning. Since our Q-function is not optimal (especially during early phases of training), this target value will surely be wrong sometimes, and errors will propagate into the estimate of the Q-value at s. This is where online RL comes to the rescue: since errors in the Q-function estimate will cause the policy to take wrong actions, the policy will experience firsthand the outcomes of those actions and can eventually correct its mistakes. In the offline case, however, this breaks down; the policy does not interact with the environment and does not gather more data, and therefore has no way to know it is propagating wrong values, let alone correct them.
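To make the mechanics concrete, here is a minimal sketch of such a bootstrapped update, assuming a discrete action space, PyTorch, and illustrative network/variable names (this is not the QT-Opt setup, which handles continuous actions):

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_q_net, batch, gamma=0.99):
    """TD loss for one batch of transitions (s, a, r, s', done).

    Works the same way whether the batch comes from a replay buffer that is
    still being filled (online) or from a fixed dataset (offline).
    """
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions that were actually taken in the data.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: the max over a' is exactly where optimistic,
        # out-of-distribution errors creep in when learning purely offline.
        next_q = target_q_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q
    return F.mse_loss(q_sa, target)
```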

Even worse, using the max operator results in an almost adversarial search for the numerically highest Q-values, which will often be very wrong. In many cases this can cause the predicted Q-values to diverge and training to fail completely: our Q-function thinks it can get extremely high returns while in reality it cannot. Formally, this problem arises because there is a distribution shift between the policy that gathered the data and the policy we are learning now; when we evaluate our Q-function on s’ and look for the action with the highest score, we are querying our model on state-action pairs it has never seen before and might be completely wrong about.

So how can we alleviate this problem and learn well in the offline setting? One approach is to constrain the policy so that it stays close to the data-gathering policy via some divergence measure such as the KL-divergence. This prevents the Q-function from propagating highly optimistic out-of-distribution Q-values, but it might also prevent us from learning a better policy than the data-gathering one. This can be a major issue if the data-gathering policy is not very good, since a major appeal of offline RL is improving over what we had before; otherwise we could just use behavior cloning and be done with it. A more subtle constraint is to force the policy to choose actions close to those that appear in the dataset, while allowing the policy itself to diverge from the data-gathering policy. At first glance this might seem pointless: how can the policy be different if it is forced to use the same actions that appear in the dataset? The following figure illustrates this:

[Figure: trajectory stitching — the data contains a trajectory from A to B and another from B to C; their overlap allows learning a policy that goes from A to C (original art)]

In the above figure, the right image depicts two trajectories present in our data, one going from A to B and the other from B to C. We would like to learn to reach C from A, but that trajectory does not actually appear in the dataset. Fortunately, there is some overlap between the trajectories, which allows learning the path from A to C using only the actions in the dataset. This is one of the main strengths of RL by dynamic programming: it can stitch together parts of trajectories and learn a better policy than any present in the data. The caveat, however, is that enforcing such a constraint (use only actions from the dataset) in a realistic manner in large state and action spaces requires us to model the data-gathering policy somehow, which usually entails approximating it first with some neural network and then using that approximation during offline RL. This approximate data-gathering policy is a potential source of errors, and some research papers demonstrate that improvements in behavior modeling improve the performance of such offline RL methods.

This is actually a more challenging problem than it might seem at first glance from the literature. Our hope in offline RL is to use vast datasets of prior experience accumulated over time from multiple sources, which might include things like scripted policies and human demonstrators that behave in a non-Markovian way, making modeling of their behavior very difficult. Ideally, our RL algorithm would not require such behavior models and would remain simple and scalable.

Conservative Q-learning

Recently, researchers at Berkeley published the paper “Conservative Q-Learning for Offline Reinforcement Learning”, in which they developed a new offline RL algorithm called Conservative Q-learning (CQL), which seems to perform very well while being relatively simple and maintaining some nice properties. As we have seen, when performing naïve Q-learning or actor-critic algorithms on offline data, we propagate highly optimistic Q-values when minimizing the Bellman error, and the max operator in the Bellman error seeks these errors out. CQL addresses this issue with a simple addition to the objective function.

In standard Q-learning our loss was:
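
$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_{\theta}(s, a) - \left(r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')\right)\right)^{2}\right]$$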

The authors suggest the following addition:
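Written schematically:

$$\min_{\theta}\; \alpha\, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[Q_{\theta}(s, a)\right] + \mathcal{L}(\theta)$$

where $\mu$ is a distribution over actions that puts its weight on actions with high predicted Q-values (in the paper, choosing $\mu$ adversarially leads to a log-sum-exp over the Q-values), and $\alpha$ controls the strength of the penalty.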

By actively trying to minimize the Q-values of the state-actions our policy thinks are high, we gradually push down all those optimistic errors and force the Q-values to be no larger than they actually should be. The authors prove that with a proper choice of the coefficient α, the resulting Q-function is bounded from above by the “true” Q-values (which are unknown), and is thus a conservative estimate of those values. They empirically demonstrate that this is indeed the case and the predicted Q-values are lower than the ones obtained by deploying the learned policy. In fact, the resulting Q-values are a bit too conservative, and the authors propose another addition to the loss function:
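
$$\min_{\theta}\; \alpha\left(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[Q_{\theta}(s, a)\right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q_{\theta}(s, a)\right]\right) + \mathcal{L}(\theta)$$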

This addition tries to maximize the Q-values of state-actions that appear in the dataset, encouraging the policy to stick to more familiar actions and making Q-values less conservative. The authors prove that the expected resulting Q-values are upper bounded by the “true” Q-function, and demonstrate that this variant produces better results and more accurate Q-values. They test their method on a wide range of offline RL benchmarks and show that it is superior to existing methods.
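As a rough illustration, a discrete-action version of this regularized loss might look like the sketch below (in PyTorch); it follows the spirit of the paper’s log-sum-exp variant, but the names are mine and many details of the actual implementation are omitted:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Sketch of a CQL-style loss: standard TD error plus a conservative
    penalty that pushes down optimistic Q-values and pushes up the
    Q-values of actions that actually appear in the dataset."""
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                   # Q(s, .) for every action
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions

    # Standard bootstrapped Bellman error term.
    with torch.no_grad():
        next_q = target_q_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q
    td_loss = F.mse_loss(q_sa, target)

    # Conservative term: log-sum-exp over all actions ("push down" what the
    # learner is optimistic about) minus the dataset actions ("push up").
    conservative = (torch.logsumexp(q_all, dim=1) - q_sa).mean()

    return td_loss + alpha * conservative
```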

The nice thing about this method is that it is relatively simple, tackles the problem at its heart and operates in a way that makes sense intuitively. It also has the benefit of not requiring a model of the data gathering policy, which eliminates a potential source of errors and removes redundant machinery and models from the process.

Generalizing skills using prior datasets

We have seen that there is great potential in using vast offline datasets for learning strong policies that generalize well in the real world, and that improvements in offline RL algorithms are bringing that vision closer to reality, but one thing kept bothering me. Most of the research papers I have seen assume that we want to learn to perform some task X, and that we have a large offline dataset annotated with the proper rewards for that task X. This seems unrealistic: the entire premise of offline RL is being able to utilize large datasets gathered from diverse sources over time, and it is dubious to assume we would be able to annotate such datasets in hindsight with a reward for our task. How would one even go about annotating an image dataset of robots opening doors with a reward for grasping objects?

A new paper by the authors of the CQL paper, called “COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning”, addresses this issue and demonstrates that unlabeled offline data can be used to enhance and generalize a smaller annotated dataset for our task. The authors use the example of a robot that is trained to grasp objects placed inside an open drawer, using data that was gathered by some scripted policy (the task data). In addition, there exists a larger dataset of the robot interacting with the environment for other tasks, such as opening and closing a drawer, picking and placing objects and so forth (the prior data). Training our policy on the task data using some offline RL algorithm such as CQL will yield good performance on the task, and the robot will likely be able to grasp the objects from the open drawer with high probability. However, if we were to change the initial state significantly in some way, such as closing the drawer, it would be unrealistic to expect our policy to succeed, as the task data contains no such information.

The authors suggest a simple solution: annotate the task data with sparse binary rewards (+1 for completing the task and 0 otherwise) and annotate all the prior data with a reward of 0. The datasets are then combined and CQL is trained on the resulting large dataset. As we have seen, offline RL algorithms that use dynamic programming have the seemingly magical ability to stitch together parts of trajectories and learn something greater than the sum of its parts. The authors demonstrate that CQL was able to propagate the Q-values from the final goal-reaching states all the way back to the initial condition (open drawer, object placed inside) and further out to states with the drawer closed, thus generalizing the task to new and unseen initial conditions that were never encountered when gathering the task data.
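A sketch of what that relabeling step might look like (the function and variable names here are hypothetical, not taken from the COG codebase):

```python
def build_combined_dataset(task_transitions, prior_transitions, is_task_success):
    """Merge task data (sparse binary reward) with prior data (zero reward)
    into a single dataset that an offline RL algorithm such as CQL can train on."""
    combined = []
    for (s, a, s_next) in task_transitions:
        reward = 1.0 if is_task_success(s_next) else 0.0  # +1 only on task completion
        combined.append((s, a, reward, s_next))
    for (s, a, s_next) in prior_transitions:
        combined.append((s, a, 0.0, s_next))              # prior data always gets 0
    return combined
```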

This in my opinion is a powerful demonstration of what a simple and elegant algorithm can do when provided with prior unlabeled data. I hope that in the future we will see better methods developed and vast datasets gathered that will bring RL to wide adoption in industry and research, unlocking its full potential.

For the interested reader, I highly suggest reading Sergey Levine’s (the head of the Berkeley research group) Medium article on offline RL.
