Weekly review of Reinforcement Learning papers #4

Every Monday, I present 4 publications from my research area. Let’s discuss them!

Quentin Gallouédec
Towards Data Science


Image by author

[← Previous review] [Next review →]

To the memory of Andréas.

Paper 1: On the role of planning in model-based deep reinforcement learning

Hamrick, J. B., Friesen, A. L., Behbahani, F., Guez, A., Viola, F., Witherspoon, S., … & Weber, T. (2020). On the role of planning in model-based deep reinforcement learning. arXiv preprint arXiv:2011.04021.

What is the contribution of planning in reinforcement learning? It’s hard to say: planning is a component of many very powerful algorithms, such as MuZero. But to what extent is this planning phase necessary for good learning results? This is the question the authors of this publication try to answer. To do so, they revisit MuZero and confront it with a variety of environments under different ablations.

Here is a digest of their answer. Planning is useful (phew), but not always very effective. In some cases that one would intuitively describe as requiring a lot of reasoning, like Sokoban, extensive planning is not necessary. In others, like 9x9 Go, learning performance depends strongly on the depth of planning. On the other hand, planning alone is not enough for good generalization. This suggests that identifying good policy biases may matter more than learning better models when it comes to driving generalization.

Intuitively, the ability to predict the future is important for learning a good policy. I find it interesting to put this intuition to an empirical test on a benchmark model-based algorithm.

Paper 2: Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots

Li, Z., Cheng, X., Peng, X. B., Abbeel, P., Levine, S., Berseth, G., & Sreenath, K. (2021). Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots. arXiv preprint arXiv:2103.14295.

Bipedal locomotion is a great demonstration of the power of machine learning. The control of most robots is not based on learning: classical control methods, known for decades, give very satisfactory results. But these methods have never been robust enough to make a bipedal robot walk. It is at this frontier that machine learning becomes most interesting.

In this paper, the authors present a reinforcement learning framework adapted to the control of a bipedal robot. In this framework, a first learning phase takes place in simulation. However, the simulation inevitably differs from the real world; this is called the sim2real gap. This is why they use domain randomization: the simulation constants are no longer constant, but vary from one simulation episode to the next. This makes the policy more robust to the domain shift it will undergo when deployed on the real robot.
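To make the idea concrete, here is a minimal sketch of domain randomization as a Gym-style environment wrapper. The parameter names, the ranges, and the `set_physics_params` hook are illustrative assumptions on my part, not the exact quantities randomized in the paper.

```python
import numpy as np
import gym


class DomainRandomizationWrapper(gym.Wrapper):
    """At every reset, re-sample the physical constants of the simulator.

    The parameters and the `set_physics_params` hook below are illustrative
    assumptions, not the exact quantities randomized in the paper.
    """

    def __init__(self, env, param_ranges):
        super().__init__(env)
        self.param_ranges = param_ranges  # e.g. {"ground_friction": (0.5, 1.5)}

    def reset(self, **kwargs):
        # Draw a new value for every randomized parameter.
        sampled = {
            name: np.random.uniform(low, high)
            for name, (low, high) in self.param_ranges.items()
        }
        # Hypothetical hook: the simulator must expose a way to apply them.
        self.env.unwrapped.set_physics_params(sampled)
        return self.env.reset(**kwargs)


# Training the policy on this ever-changing family of simulators makes it
# more robust to the real robot's (unknown) dynamics, e.g.:
# env = DomainRandomizationWrapper(
#     gym.make("CassieWalk-v0"),  # hypothetical environment id
#     {"ground_friction": (0.4, 1.2), "torso_mass_scale": (0.8, 1.2)},
# )
```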

Figure from the article: recovering from foot sliding

The learned policies allow the bipedal robot to perform a set of interesting behaviors: in the figure, for example, the robot is destabilized by making its feet slide, and we can see it react correctly to this perturbation! It also performs other tasks: walking fast, turning, carrying an additional load…

Another nice demonstration of the power of deep RL applied to robot control. Go watch their video. I find this achievement quite telling, since each of us learned to walk during our first months.

Paper 3: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

Parisotto, E., & Salakhutdinov, R. (2021). Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation. arXiv preprint arXiv:2104.01655.

For some applications, especially real-time robot control, the response time of the learned model must be low. The robot must react quickly to variations in its environment, so the learned model must be simple enough to allow inference within the constraints of real-time control. Very often, the models learned in reinforcement learning are indeed simple. As Andrej Karpathy put it:

“Everything I know about design of ConvNets (resnets, bigger=better batchnorms etc.) is useless in RL. Superbasic 4 layers ConvNets works best.” [source]

But then, how can we take advantage of the large models that make supervised learning so successful? To answer this question, the authors of this publication propose an “Actor-Learner Distillation” (ALD) procedure. It continually transfers the learning progress of a large learner network to a smaller actor network. This allows them to use, for example, the very effective but very heavy Transformer architecture on non-Markovian (i.e., partially observable) environments, and to distill this large model into a lighter LSTM.
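Here is a minimal sketch of the distillation idea, assuming discrete actions and PyTorch; the actual ALD losses, and the way they are interleaved with the RL updates, are more involved than this.

```python
import torch
import torch.nn.functional as F


def policy_distillation_loss(learner_logits, actor_logits):
    """KL(learner || actor): pull the small actor's action distribution
    toward the large learner's. A simplified stand-in for the ALD losses."""
    learner_probs = F.softmax(learner_logits, dim=-1).detach()  # teacher not updated here
    actor_log_probs = F.log_softmax(actor_logits, dim=-1)
    return F.kl_div(actor_log_probs, learner_probs, reduction="batchmean")


# During training, observations collected by the fast LSTM actor are replayed
# through both networks, and the actor is updated with this loss alongside
# its usual RL objective, e.g.:
# loss = policy_distillation_loss(transformer_policy(obs), lstm_policy(obs))
```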

Figure from the article: Overview of ALD.

They test it on fairly simple environments (I-Maze 9x9 and Meta-Fetch), and the learned model manages to combine the lightness of an LSTM with the performance of a Transformer. It could well help reconcile reinforcement learning with supervised learning, to the benefit of reinforcement learning!

Paper 4: pH-RL: A personalization architecture to bring reinforcement learning to health practice

Hassouni, A. E., Hoogendoorn, M., Ciharova, M., Kleiboer, A., Amarti, K., Muhonen, V., … & Eiben, A. E. (2021). pH-RL: A personalization architecture to bring reinforcement learning to health practice. arXiv preprint arXiv:2103.15908.

Always the same problem: in simulation or in games, reinforcement learning has proven itself. But what about the real world? In this paper, the authors present a general reinforcement learning architecture for a health problem: personalization, and more specifically personalization of mobile applications. They call it pH-RL (personalization in e-Health with RL). This architecture allows health applications to be personalized, through learning, and the level of personalization is adjustable.

In fact, they propose a guide for introducing a reinforcement learning model into a mobile health application. They demonstrate the effectiveness of their approach with the MoodBuster application (an online platform for treating psychological complaints). Empirically, they show that the learned model correctly selects the actions and messages needed to maximize daily adherence to the therapeutic modules.

I love this kind of article that makes the connection to healthcare. The results are interesting but I can’t help but remark: beware of technological solutionism, especially in healthcare.

Bonus Paper: A multimillion-year-old record of Greenland vegetation and glacial history preserved in sediment beneath 1.4 km of ice at Camp Century

Christ, A. J., Bierman, P. R., Schaefer, J. M., Dahl-Jensen, D., Steffensen, J. P., Corbett, L. B., … & Southon, J. (2021). A multimillion-year-old record of Greenland vegetation and glacial history preserved in sediment beneath 1.4 km of ice at Camp Century. Proceedings of the National Academy of Sciences, 118(13).

Photo by Christian Pfeifer from Pexels

Fifty years later, a forgotten sample reveals Greenland’s alarming history. Thanks to pieces of rock and soil accidentally collected in the middle of the Cold War, which nobody paid attention to for decades, researchers have shown that the Greenland ice cap melted completely about one million years ago.

In 1966, American researchers were sent to Camp Century, in Greenland, to carry out a 1,400 m deep drilling. The objective? Officially, to unlock the secrets of survival in the Arctic. Unofficially, to hide 600 nuclear missiles under the ice sheet, within reach of Soviet Russia. Under the direction of Chester Langway, a core of 1.4 km of ice and 3 m of sub-glacial sediment was extracted, frozen, and transferred to a warehouse at the University at Buffalo. A unique archive, which would eventually be forgotten for several decades. In the 1990s, Chester Langway considered destroying these samples to make space in the university’s freezers, but it was the University of Copenhagen that eventually took them.

In 2017, during a major clean-up, the famous samples were distributed to several teams around the world, including one at the University of Vermont. In 2021, these researchers studied the samples and discovered fossilized vegetation dating back about one million years. This means that a million years ago, at Camp Century, there was probably a boreal forest, not a glacier.

However, this amazing discovery is quite frightening: a million years ago, the average temperature was only 2 to 3 degrees warmer than today. Ironically, this is exactly the rise predicted for the next 50 years. The melting of Greenland would cause a 6 to 7 meter rise in sea level. So what are we waiting for? Do we continue to make adjustments or do we really get off our butts to prevent this?

It was with great pleasure that I presented my readings of the week to you. Feel free to send me your feedback.
