Weekly review of Reinforcement Learning papers #6

Every Monday, I present 4 publications from my research area. Let’s discuss them!

Quentin Gallouédec
Towards Data Science


Image by the author

[← Previous review] [Next review →]

Paper 1: MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale

Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jonschkowski, R., Finn, C., … & Hausman, K. (2021). MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale. arXiv preprint arXiv:2104.08212.

Intelligent robots: an inexhaustible source of inspiration for science fiction, and a rich field of study for reinforcement learning. To be truly “intelligent”, a robot must master a large repertoire of skills (this is called a generalist robot). Many robots have successfully learned a single task using reinforcement learning, but they all share one problem: learning requires a lot of training. A general-purpose robot therefore remains science fiction.

In this article, the authors lay some groundwork for getting closer to this generalist robot. They suggest, for example, that several aspects of learning can be shared between tasks: exploration, experience, and representations. In this framework, learning is continuous: new tasks can be learned on the fly by reusing past experience collected while learning other tasks. The method is called MT-Opt. The authors show several examples of tasks learned with it (alignment, rearrangement, …).

Oh yes, one more thing: learning can also be distributed among several agents (here, several robots). Thanks to this collaboration, MT-Opt learned 12 real-world tasks with learning distributed across 7 robots.
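To make the experience-sharing idea more concrete, here is a minimal sketch (my own illustration, not the authors’ implementation) of a shared multi-task replay buffer in which an episode collected for one task can be relabeled with another task’s reward and reused for training. The per-task reward functions and the example tasks are hypothetical.

```python
import random


class SharedMultiTaskBuffer:
    """Toy multi-task replay buffer: episodes collected for any task can be
    relabeled with another task's reward function and reused for training."""

    def __init__(self, task_reward_fns):
        # task_reward_fns: {task_name: fn(observation, action) -> reward}
        self.task_reward_fns = task_reward_fns
        self.episodes = []  # list of (task_name, [(obs, action), ...])

    def add_episode(self, task_name, transitions):
        self.episodes.append((task_name, transitions))

    def sample_for_task(self, task_name, batch_size):
        """Sample transitions collected for *any* task and relabel their
        rewards with `task_name`'s reward function (cross-task data reuse)."""
        reward_fn = self.task_reward_fns[task_name]
        batch = []
        for _ in range(batch_size):
            _, transitions = random.choice(self.episodes)
            obs, action = random.choice(transitions)
            batch.append((obs, action, reward_fn(obs, action)))
        return batch


# Hypothetical usage with two made-up tasks
buffer = SharedMultiTaskBuffer({
    "lift": lambda obs, act: float(obs["height"] > 0.1),
    "align": lambda obs, act: float(abs(obs["offset"]) < 0.01),
})
buffer.add_episode("lift", [({"height": 0.2, "offset": 0.5}, [0.1, 0.0])])
print(buffer.sample_for_task("align", batch_size=1))
```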

Paper 2: Discovering Diverse Athletic Jumping Strategies

Pictures speak louder than words:

Yin, Z., Yang, Z., van de Panne, M., & Yin, K. (2021). Discovering Diverse Athletic Jumping Strategies. ACM SIGGRAPH 2021.

The authors of this paper present a learning framework for athletic skills, specifically the high jump.

A little historical digression: the jump that all athletes use today is called the Fosbury flop (it consists of going over the bar back-first). It was performed for the first time in the final of the 1968 Olympic Games. Before that, athletes jumped facing the bar, which was much less efficient. Nobody before Dick Fosbury had thought of this jump. He was crowned Olympic champion, although he did not break the world record.

Starting from a simulated character model, the authors applied reinforcement learning to learn the high jump (note that the run-up phase before the jump does not have to be learned). No demonstrations are used; the agent has to discover the best strategy to clear the bar on its own. So that the agent learns humanly feasible strategies, the actions are restricted to natural poses. By encouraging policies that differ from those already found, a wide variety of strategies emerges. Among them: the Fosbury flop, which remains the best strategy, reaching 2.00 m in simulation. I find the idea super fun.
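The diversity machinery in the paper is more sophisticated than this, but the spirit of encouraging policies that differ from those already found can be illustrated with a simple novelty bonus: the task reward is augmented with a term that grows when the current strategy differs from the strategies discovered so far. Everything below (the feature vector, the weight `beta`) is a hypothetical illustration, not the authors’ formulation.

```python
import numpy as np


def novelty_bonus(strategy_features, archive, beta=1.0):
    """Reward bonus that grows with the distance between the current
    strategy's features and the closest previously discovered strategy."""
    if not archive:
        return beta  # first strategy: maximal bonus
    distances = [np.linalg.norm(strategy_features - past) for past in archive]
    return beta * min(distances)


# Hypothetical usage: features could summarize the body pose at bar clearance
archive = [np.array([0.1, 0.9]), np.array([0.8, 0.2])]  # strategies found so far
current = np.array([0.5, 0.5])                          # candidate strategy
total_reward = 1.7 + novelty_bonus(current, archive)    # task reward + bonus
print(total_reward)
```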

Paper 3: Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors without Human Intervention

Gupta, A., Yu, J., Zhao, T. Z., Kumar, V., Rovinsky, A., Xu, K., Devlin, T., & Levine, S. (2021). Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors without Human Intervention. arXiv preprint arXiv:2104.11203.

Reinforcement learning applied to robotics has a special feature: it is meant to be deployed on a real robot. There are two approaches. Either (1) the learning is done on a simulated robot and the learned policy is then deployed on a real robot, or (2) the learning is done directly in the real world. Each approach has its advantages and pitfalls. If the learning is done in simulation, one is free to simulate hundreds of robots simultaneously, and much faster than real time. But since a simulation is always a simplification of reality, the policy learned in simulation is difficult to deploy on the real robot. The second option is to learn directly on a real robot. The gap between simulation and reality no longer exists, but the cost of interaction with the environment is much higher: time cannot be accelerated, and there is often only one robot available. There is one last limitation to real-world learning, and it is the subject of this paper: the environment requires human intervention, or complex engineering, to be reset (e.g., closing a drawer again if the task is to open the drawer).

In this paper, the authors address this limitation with an approach that exploits the fact that some tasks are, in essence, the reset of another. For example (this is not what is done in the paper, but it illustrates the principle well), opening a drawer is the reset of the task that consists in closing the drawer, and conversely, closing the drawer is the reset of the task that consists in opening it. The idea is to learn such tasks simultaneously: the robot first learns to close the drawer, and once the drawer is closed, a new learning episode starts in which it learns to open the drawer, and so on.

Figure from the article: Tasks and transitions for lightbulb insertion in simulation. The goal is to recenter a lightbulb, lift it, flip it over, and then insert it into a lamp.
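Here is a minimal sketch of this reset-free alternation, assuming a hypothetical environment interface (`observe`, `step`) and per-task policies; the real system in the paper is richer (it learns across a graph of tasks), but the skeleton is the same: when one task’s episode ends, switch to the task that undoes it instead of resetting.

```python
def reset_free_training(env, policies, task_pairs, episodes=100, max_steps=200):
    """Alternate between a task and its 'reset' task without ever resetting
    the environment by hand.

    policies:   {task_name: policy}, where policy(obs) -> action
    task_pairs: {task_name: reset_task_name}, e.g. {"open": "close", "close": "open"}
    """
    task = next(iter(task_pairs))  # start with an arbitrary task
    obs = env.observe()
    for _ in range(episodes):
        policy = policies[task]
        for _ in range(max_steps):
            action = policy(obs)
            obs, reward, done = env.step(task, action)
            policy.update(obs, action, reward)  # any off-policy RL update
            if done:
                break
        # No manual reset: the next task starts from whatever state the
        # previous one left the environment in.
        task = task_pairs[task]
```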

This method was tested on a real robot, which was able to learn several tasks in just over two days without any human intervention, and without any complicated tinkering to reset the environment. This learning trick will probably become the standard for learning complex behaviors on real robots.

Paper 4: Online and Offline Reinforcement Learning by Planning with a Learned Model

Schrittwieser, J., Hubert, T., Mandhane, A., Barekatain, M., Antonoglou, I., & Silver, D. (2021). Online and Offline Reinforcement Learning by Planning with a Learned Model. arXiv preprint arXiv:2104.06294.

Model-based reinforcement learning is the most widely used solution for data-efficient learning. It is usually performed either offline (from fixed data) or online (by interacting with the environment). Current approaches first learn a model of the environment and then train the policy on this model; they do not directly use the learned model to improve the policy and the value function. This is precisely what the method the authors call Reanalyse proposes: using the learned model directly for policy and value improvement, which can be done both offline (from stored data) and online (by interacting with the environment). Reanalyse thus allows learning anywhere on the spectrum between fully online and fully offline. The method is compatible with model-based algorithms, in particular MuZero; combining the two gives what the authors call MuZero Unplugged.

Figure from the article: Final scores in Ms. Pac-Man for different Reanalyse fractions. MuZero unplugged can learn from any data budget.
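In essence, Reanalyse reruns the search with the latest model on stored trajectories to produce fresh policy and value targets, and the online/offline trade-off is simply the fraction of updates drawn from reanalysed data. Here is a schematic sketch, with hypothetical helpers (`collect_episode`, `mcts_targets`) standing in for acting and for the model-based search:

```python
import random


def training_step(model, replay_buffer, env, reanalyse_fraction,
                  collect_episode, mcts_targets):
    """One schematic update mixing fresh and reanalysed data.

    reanalyse_fraction = 1.0 -> purely offline (learn from stored data only)
    reanalyse_fraction < 1.0 -> part of the updates come from new interaction
    collect_episode(env, model) and mcts_targets(model, trajectory) are
    hypothetical helpers, not an actual MuZero API.
    """
    if random.random() < reanalyse_fraction:
        # Reanalyse: rerun the search with the *current* model on a stored
        # trajectory to produce improved policy and value targets.
        trajectory = replay_buffer.sample()
    else:
        # Online: act in the environment and store the new trajectory.
        trajectory = collect_episode(env, model)
        replay_buffer.add(trajectory)
    targets = mcts_targets(model, trajectory)  # planning with the learned model
    model.update(trajectory, targets)
```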

MuZero Unplugged reaches a new state of the art in both online and offline reinforcement learning. It outperforms previous baselines on the Atari benchmark for all orders of magnitude of data budget. I like this natural idea of a unified algorithm that can learn with any data budget and handle both offline and online data without special adaptation.

Bonus paper: Backpropagation applied to handwritten zip code recognition

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541–551.

Once again this week, an article that should be known. This paper, published in 1989, presented for the first time the approach of using backpropagation to learn convolution kernel coefficients directly from images. The goal of the publication was to recognize handwritten ZIP code digits (these same digits would later give rise to the MNIST dataset). This approach is now used throughout computer vision, including in reinforcement learning when the observation space is an image.

Figure from the article: Convolutional Neural Network architecture
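For reference, here is a minimal convolutional network in the spirit of that 1989 architecture, written with PyTorch (a modern library, obviously not what the original paper used); the layer sizes are illustrative, not those of the paper.

```python
import torch
import torch.nn as nn


class TinyConvNet(nn.Module):
    """Small CNN: the convolution kernels are learned by backpropagation,
    which is exactly the idea introduced in the 1989 paper."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=2), nn.Tanh(),
            nn.Conv2d(8, 16, kernel_size=5, stride=2), nn.Tanh(),
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)  # learned convolution kernels
        return self.classifier(x.flatten(1))


# 28x28 grayscale digits, as in MNIST
logits = TinyConvNet()(torch.randn(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```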

It is important to note that this approach did not immediately gain traction. The algorithms are computationally intensive, and in practice training was very long. It was only in the 2000s that parallelizing the matrix computations on GPUs made this approach considerably more efficient.

It was a great pleasure to present my readings of the week. Feel free to send me your feedback.
