Every Monday, I present 4 publications from my research area. Let’s discuss them!

Paper 1: Reinforcement Learning with Random Delays
Ramstedt, S., Bouteiller, Y., Beltrame, G., Pal, C., & Binas, J. (2020). Reinforcement Learning with Random Delays. arXiv preprint arXiv:2010.02966.
Delays between action and reward are common and are a central problem in RL, including in the real world: an action can produce a reward immediately (e.g., the negative reward of pain right after a fall) or with a very long delay (doing well in school eventually lands you a job that keeps financial trouble at bay). The whole intermediate spectrum exists as well: an action can produce rewards arbitrarily distant in time. Conversely, a reward at a given moment cannot systematically be attributed to a single past action; it most likely results from all the previous actions, each contributing to a greater or lesser extent.
In this paper, the authors introduce the following paradigm: a delayed environment is obtained by wrapping an undelayed environment (in which an action immediately produces the associated observation and reward) in a delayed communication dynamic.

Actions reach this delay-free environment with a delay, and observations reach the agent with a delay. They call this a Random-Delay Markov Decision Process (RDMDP).
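To make the wrapping concrete, here is a minimal sketch of such a delayed environment, assuming a Gym-style `env` with `reset`/`step` and constant delays; the actual RDMDP of the paper allows the delays to be random and time-varying.

```python
from collections import deque

class DelayWrapper:
    """Minimal sketch: wraps an undelayed env so that actions take effect
    after `action_delay` steps and observations reach the agent after
    `obs_delay` steps. The real RDMDP allows random, varying delays."""

    def __init__(self, env, action_delay=1, obs_delay=1, noop_action=0):
        self.env = env
        self.action_buffer = deque([noop_action] * action_delay)
        self.obs_delay = obs_delay
        self.obs_buffer = None

    def reset(self):
        obs = self.env.reset()
        # Pre-fill the observation buffer so the agent initially sees stale states.
        self.obs_buffer = deque([obs] * (self.obs_delay + 1))
        return self.obs_buffer.popleft()

    def step(self, action):
        # The agent's action is queued; the env executes an older one.
        self.action_buffer.append(action)
        delayed_action = self.action_buffer.popleft()
        obs, reward, done, info = self.env.step(delayed_action)
        # The fresh observation is queued; the agent receives an older one.
        self.obs_buffer.append(obs)
        return self.obs_buffer.popleft(), reward, done, info
```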
Next, the authors introduce an algorithm derived from Soft Actor-Critic (SAC), called Delay-Correcting Actor-Critic (DCAC), which adds a partial resampling of stored trajectories from the replay buffer. This allows them to estimate multi-step off-policy state values. The results appear better than those obtained with plain SAC, the main advantage being robustness to the delay, whether it is large or close to zero.
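As a rough illustration of what a multi-step value estimate looks like, here is a generic n-step target (my simplification, not the authors' exact DCAC correction, which additionally resamples part of the stored trajectory to stay consistent with the sampled delays).

```python
def multi_step_target(rewards, bootstrap_value, gamma=0.99):
    """Generic n-step value target: discounted rewards accumulated over a
    short window (e.g. the delay) plus a bootstrapped value at the end."""
    target = bootstrap_value
    for r in reversed(rewards):      # fold rewards back from the horizon
        target = r + gamma * target
    return target
```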
Paper 2: Reinforced Attention for Few-Shot Learning and Beyond
Hong, J., Fang, P., Li, W., Zhang, T., Simon, C., Harandi, M., & Petersson, L. (2021). Reinforced Attention for Few-Shot Learning and Beyond. arXiv preprint arXiv:2104.04192.
One of the main limitations of Machine Learning is that it is {everything}-intensive: data-intensive, computationally intensive, time-intensive and energy-intensive. A major challenge is therefore to make learning algorithms less resource-hungry. This is the objective of few-shot learning, which consists in correctly recognizing samples of unseen classes from a very small number of support examples.
In this article, the authors connect RL and few-shot learning by training an attention mechanism with a reinforcement learning algorithm. The agent is thus trained to adaptively locate areas of interest in the feature space. The reward function is constructed so that the agent is rewarded when the attended features lead to a correct prediction.
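As a heavily simplified sketch of that reward signal (my own illustration, with a hypothetical `classifier` module, not the authors' code):

```python
import torch

def attention_reward(features, attention_weights, classifier, labels):
    """Illustrative reward: +1 when the classifier is correct on the
    attention-reweighted features, -1 otherwise."""
    attended = features * attention_weights      # re-weight the feature maps
    logits = classifier(attended)                # classify the attended features
    correct = logits.argmax(dim=-1) == labels    # per-sample correctness
    return correct.float() * 2.0 - 1.0           # map {0, 1} to {-1, +1}
```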
The results suggest that the learned representation becomes more discriminative in few-shot scenarios. Likewise, the classification results appear satisfactory. Reinforcement Learning could therefore be of real help in this domain.
Paper 3: Automating turbulence modelling by multi-agent reinforcement learning
Novati, G., de Laroussilhe, H. L., & Koumoutsakos, P. (2021). Automating turbulence modelling by multi-agent reinforcement learning. Nature Machine Intelligence, 3(1), 87–96.
Anyone who has done some fluid mechanics knows that, in practice, the only way to describe the flow of a fluid is through simulation: the underlying equations only admit solutions in very special cases, far too simple for aviation or meteorological applications. Recently, machine learning has made it possible to greatly increase the realism of simulations. The authors present multi-agent reinforcement learning as a tool for automatic turbulence model discovery. In more technical detail: the authors use large-eddy simulations of isotropic turbulence and reward the agents with a measure of how well the statistical properties of direct numerical simulations are recovered. The agents must identify critical spatiotemporal patterns in the flow, the goal being to estimate the unresolved physics at the subgrid scale.
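To give a flavour of such a reward (purely illustrative, not the authors' implementation), the agents could be scored on how closely a statistic of the coarse simulation, such as the energy spectrum, matches the DNS reference:

```python
import numpy as np

def statistics_recovery_reward(les_spectrum, dns_spectrum, eps=1e-12):
    """Illustrative reward: negative mean squared log-discrepancy between a
    statistic of the coarse (LES) flow and its DNS reference; the closer the
    recovered statistics, the higher the reward."""
    log_error = np.log(les_spectrum + eps) - np.log(dns_spectrum + eps)
    return -float(np.mean(log_error ** 2))
```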

The results suggest that this approach is of real interest and that it generalizes well to flows never encountered during training. The learned model gives good results while avoiding the cost of directly resolving the full, complex equations of fluid dynamics.
Paper 4: Parrot: Data-Driven Behavioral Priors for Reinforcement Learning
Singh, A., Liu, H., Zhou, G., Yu, A., Rhinehart, N., & Levine, S. (2020). Parrot: Data-Driven Behavioral Priors for Reinforcement Learning. arXiv preprint arXiv:2011.10024.
Consider an environment in which a robot must learn to grasp an object. It will first explore the environment, failing most of the time to even touch the object. This can be a considerable obstacle to learning, especially if the reward is sparse. Besides, this is not how we humans learn. We reuse skills we have learned in other areas and try to use them for new tasks: learning to play tennis can be hard, but we can reuse skills we have already learned, such as running, or gripping the racket; we never actually start from scratch.
The question addressed in this paper is: how can we enable this kind of useful pre-training for RL agents? The authors propose so-called behavioral priors, i.e. priors learned from previous experience on other tasks. These priors can be reused to quickly learn new tasks, while preserving the agent's ability to try new behaviors.
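Here is a minimal sketch of the idea (my own simplification: Parrot itself learns an invertible, observation-conditioned mapping from prior data, which is not reproduced here). The agent acts in a latent space, and a frozen, pre-trained prior network maps each latent to a plausible action, so even random exploration produces sensible behaviors.

```python
import torch
import torch.nn as nn

class BehavioralPriorWrapper:
    """Illustrative wrapper: the RL agent outputs a latent z, and a prior
    network (pre-trained on data from other tasks, frozen here) maps the
    current observation and z to an environment action."""

    def __init__(self, env, prior: nn.Module):
        self.env = env
        self.prior = prior.eval()            # hypothetical prior taking (obs, z)

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, z):
        with torch.no_grad():
            obs_t = torch.as_tensor(self.obs, dtype=torch.float32)
            z_t = torch.as_tensor(z, dtype=torch.float32)
            action = self.prior(obs_t, z_t).numpy()   # latent -> plausible action
        self.obs, reward, done, info = self.env.step(action)
        return self.obs, reward, done, info
```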

The results are compelling: the learning performance is much better than that of the baselines. I don't think this is a big surprise, and it is quite reassuring, since reusing experience should increase learning capabilities. The real achievement of this paper is a mechanism that strikes a flexible trade-off between reusing already learned skills and exploring new behaviors.
It was with great pleasure that I presented my readings of the week to you. Feel free to send me your feedback.