Weekly review of Reinforcement Learning papers #5

Every Monday, I present 4 publications from my research area. Let’s discuss them!

Quentin Gallouédec
Towards Data Science



[← Previous review] [Next review →]

Paper 1: Machine Translation Decoding beyond Beam Search

Leblond, R., Alayrac, J., Sifre, L., Pislar, M., Lespiau, J., Antonoglou, I., Simonyan, K., & Vinyals, O. (2021). Machine Translation Decoding beyond Beam Search. arXiv preprint arXiv:2104.05336.

Let’s talk about machine translation. BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one language to another: the closer a machine translation is to a professional human translation, the better the score. The best decoding results are usually obtained with beam search, a heuristic search algorithm that explores a graph while keeping only a limited set of children at each node, which reduces the memory required to run it. But the authors are not satisfied with beam search, because it does not take into account the metrics practitioners actually care about. What are these metrics? The brevity of this review prevents me from detailing them here, so I refer you to the paper. To replace beam search, they propose to use RL, and more specifically MCTS-based algorithms.
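
To make the baseline concrete, here is a minimal sketch of beam search decoding for an autoregressive translation model. This is not the paper’s code: `log_prob_next_tokens`, `bos`, and `eos` are hypothetical placeholders for the model’s next-token distribution and the special tokens.

```python
def beam_search(log_prob_next_tokens, bos, eos, beam_size=4, max_len=50):
    # Each hypothesis is (token sequence, cumulative log-probability).
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Expand every live hypothesis with every candidate next token.
            for token, logp in log_prob_next_tokens(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the `beam_size` most probable partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:  # every surviving hypothesis has ended with EOS
            break
    finished.extend(beams)
    # Return the highest-scoring hypothesis.
    return max(finished, key=lambda c: c[1])[0]
```

The paper’s point is precisely that maximizing this cumulative log-probability is not the same as maximizing the metrics practitioners care about.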

They show a surprising result: there is no single best algorithm across all metrics. Which algorithm performs best depends heavily on the metric considered. In some cases, though, MCTS-based algorithms obtain much better results than beam search. Over 23 pages, the authors explore the strengths and weaknesses of this new approach to translation, and they provide several appendices that make it easy to reproduce their results. The foundations are laid for this new approach to machine translation.

Paper 2: Learning with AMIGo: Adversarially motivated intrinsic goals

Campero, A., Raileanu, R., Küttler, H., Tenenbaum, J. B., Rocktäschel, T., & Grefenstette, E. (2020). Learning with AMIGo: Adversarially motivated intrinsic goals. arXiv preprint arXiv:2006.12122.

It is hard to learn in an environment where rewards are sparse: there is nothing to guide the agent toward areas of high reward. One solution is to introduce what is called an intrinsic reward. Concretely, this is a reward that does not come from the environment, but from the agent itself. But what is the purpose of such a reward, and how do we decide when to give it? Here is a simple example: the agent can give itself a reward whenever it visits a new state. This encourages it to be curious and to discover new states. But that is not the type of reward used here. Let me explain briefly.
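
To illustrate the “visit a new state” example above (which, again, is not the mechanism used in AMIGo), here is a minimal sketch of a count-based novelty bonus; the state representation and the `scale` parameter are assumptions for the sake of the example.

```python
from collections import defaultdict

class NoveltyBonus:
    """Count-based intrinsic reward: rarely visited states earn larger bonuses."""

    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)
        self.scale = scale

    def __call__(self, state):
        key = tuple(state)  # assumes the state can be turned into a hashable tuple
        self.counts[key] += 1
        # Bonus decays as 1/sqrt(visit count), so novel states pay the most.
        return self.scale / self.counts[key] ** 0.5

# Hypothetical usage: total_reward = extrinsic_reward + bonus(observation)
```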

The authors propose a framework in which the agent is split into a “teacher” and a “student”. The teacher learns to propose increasingly challenging goals, and the student learns to achieve them. The agent can thus improve and acquire general skills without ever needing a reward from the environment. In effect, it is as if the agent learns to operate in a world without using that world’s rewards.

Figure from the paper: the two modules of AMIGo, a goal-generating teacher and a goal-conditioned student policy.
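
Here is a schematic sketch of one teacher/student round, under my own simplifying assumptions (the `goal_reached` check and the exact reward values are placeholders, not the paper’s reward shaping); the gist is that the teacher is rewarded for goals that are achievable but not trivial.

```python
def teacher_student_round(teacher, student, env, difficulty_threshold):
    """One simplified teacher/student interaction, AMIGo-style."""
    obs = env.reset()
    goal = teacher.propose_goal(obs)

    steps, reached, done = 0, False, False
    while not done:
        action = student.act(obs, goal)
        obs, _, done, _ = env.step(action)  # the extrinsic reward is ignored
        steps += 1
        if goal_reached(obs, goal):         # hypothetical helper
            reached = True
            break

    # Student: intrinsic reward for reaching the teacher's goal.
    student.update(reward=1.0 if reached else 0.0)
    # Teacher: rewarded when the goal was reached, but not too quickly.
    too_easy = reached and steps < difficulty_threshold
    teacher.update(reward=1.0 if (reached and not too_easy) else -1.0)
```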

Taking a step back, I find this idea very interesting from a philosophical point of view. We learn new things every day, but the world we live in has no purpose (or it is well hidden). We create our own goals to progress.

The results look very good. Training was performed on a harder variant of a gridworld environment and, in some configurations, the agent managed to obtain rewards where no state-of-the-art algorithm had succeeded before. Awesome.

Paper 3: AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control

Peng, X. B., Ma, Z., Abbeel, P., Levine, S., & Kanazawa, A. (2021). AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. arXiv preprint arXiv:2104.02180.

If the movements of animated characters look so realistic, it’s mostly thanks to motion capture. You know, those actors in plain suits with little tracking balls on every joint. The results are really good. But if the results are so good, it’s because the range of motion is limited. For more complex movements, sophisticated machinery is needed to get a satisfactory result.

What is the link with Deep RL, you may ask? In this work, the authors propose an approach based on adversarial learning that removes the need to hand-design imitation objectives.

For the agent, the objective is twofold: accomplish a high-level task and adopt the character’s style. For the high-level task, it is often easy to define an associated reward function: for forward locomotion, for example, the reward is simply the distance covered. For style imitation, the agent must produce movements as close as possible to a dataset of the character’s natural motions. The advantage is that the motion clips in this dataset can be fairly generic and do not need to correspond exactly to the task at hand.
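
As a sketch of how these two objectives can be combined, here is a simplified reward in the spirit of AMP: a discriminator scores how “natural” a state transition looks compared to the motion dataset, and that score is blended with the task reward. The weighting and the exact squashing of the discriminator output are assumptions here, not the paper’s exact formulation.

```python
import torch

def combined_reward(task_reward, discriminator, state_transition,
                    w_task=0.5, w_style=0.5):
    """Blend a hand-defined task reward with a learned style reward.

    `discriminator` is assumed to return a scalar score that is high when
    the transition resembles the motion-capture dataset.
    """
    with torch.no_grad():
        score = discriminator(state_transition)
        # Squash the score into a bounded, non-negative style reward.
        style_reward = torch.clamp(1.0 - 0.25 * (score - 1.0) ** 2, min=0.0)
    return w_task * task_reward + w_style * style_reward.item()
```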

The results obtained are very realistic: the character does cartwheels and somersaults and even plays soccer; you should check it out!

Paper 4: Asymmetric self-play for automatic goal discovery in robotic manipulation

OpenAI, Plappert, M., Sampedro, R., Xu, T., Akkaya, I., Kosaraju, V., … & Zaremba, W. (2021). Asymmetric self-play for automatic goal discovery in robotic manipulation. arXiv preprint arXiv:2101.04882.

Alice and Bob. These are the names of the two robot agents in competition. Alice’s goal? To do things that Bob can’t. Bob’s goal? To achieve what Alice has done. Here’s how the learning process works:
(1) Alice begins to interact with the environment. She can move objects, push them… After a few steps, we freeze the scene. This is the “target” state.
(2) Bob starts in the same initial state as Alice. By interacting with the environment, Bob must find a way to reproduce the target state.

Figure from the article: an initial state is sampled from a predefined distribution, Alice generates a goal state, and Bob tries to reach that goal state.
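
Here is a schematic sketch of one Alice/Bob round as described above. The `env.get_state()` / `env.set_state()` snapshot helpers, the step budgets, and the `states_match` check are hypothetical; they only serve to make the loop concrete.

```python
def self_play_round(alice, bob, env, alice_steps, bob_steps):
    """One simplified round of asymmetric self-play."""
    # (1) Alice interacts for a while; the frozen final scene becomes the goal.
    initial_obs = env.reset()
    start_snapshot = env.get_state()            # hypothetical state snapshot
    obs = initial_obs
    for _ in range(alice_steps):
        obs, _, _, _ = env.step(alice.act(obs))
    goal_state = env.get_state()

    # (2) Bob restarts from the same initial state and must reproduce the goal.
    env.set_state(start_snapshot)
    obs, solved = initial_obs, False
    for _ in range(bob_steps):
        obs, _, _, _ = env.step(bob.act(obs, goal_state))
        if states_match(env.get_state(), goal_state):   # hypothetical check
            solved = True
            break

    # Alice is rewarded when Bob fails; Bob is rewarded when he succeeds.
    alice.update(reward=0.0 if solved else 1.0)
    bob.update(reward=1.0 if solved else 0.0)
```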

Once Bob succeeds in reproducing the scenes Alice proposes, Alice proposes more complicated states to reproduce. We simply allow Alice to interact with the environment for longer: the longer she interacts, the more she modifies the scene, and the harder it becomes for Bob to reproduce it.

This idea is very clever, since each target state is guaranteed to be reachable (Alice reached it, so at least one solution exists). Moreover, because Bob is trained to reproduce arbitrary target states, a successfully trained Bob is not limited to a single task (as we often see in robotics) but can handle a wide variety of tasks. I invite you to look at the complexity and diversity of the tasks the trained agent is able to perform; here is a small overview.

Figure from the article: the final states Bob succeeds in reaching.

I bet this paper will become a milestone in robot learning.

Bonus paper: Improved protein structure prediction using potentials from deep learning

Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., … & Hassabis, D. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706–710.

Again this week, the bonus article has no connection to reinforcement learning. We’ll talk about AlphaFold. It’s news you must have seen last year: DeepMind scientists have solved the protein folding problem. But what does that mean exactly?

Proteins are the basis of the biochemistry of life. Understanding how they work would allow huge advances in our understanding of the mechanisms of life. The idea is quite simple: proteins are made of a sequence of amino acids, which are in a way the building blocks of proteins. There are about twenty of these elementary building blocks. By assembling these amino acids in a specific order, we obtain a protein. But a protein does not remain an amorphous chain: it folds and takes on a final shape, and it is this shape that gives the protein its chemical properties. Hence the question: how do proteins fold?

We know that a given amino acid sequence folds in a unique way. The challenge is therefore to predict, from its amino acid sequence, the shape the protein will take. This problem has been studied for 50 years, and every two years a competition, CASP, is organized to determine the best folding prediction algorithm. Here are the results of the last edition:

Ranking of participants in CASP14, as per the sum of the Z-scores of their predictions. Figure from the official CASP14 webpage.

The bar on the far left, vastly taller than the second one, is AlphaFold, the model proposed by DeepMind. (The ordinate is a summed z-score; in short, the higher the value, the better the predictions.) How did they manage to get such good results? In a nutshell, they use deep learning: a network is trained to predict the distances between pairs of residues, and from these predicted distances they can accurately reconstruct the shape of the protein.
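
As a toy illustration of the “predicted distances to 3D shape” step (only a caricature; AlphaFold’s actual potential, its torsion-angle parameterization, and its optimization are far more involved), here is a sketch that recovers 3D coordinates whose pairwise distances match a hypothetical predicted distance matrix `pred_dist`, by gradient descent on a squared error.

```python
import numpy as np

def fold_from_distances(pred_dist, n_steps=2000, lr=0.01, seed=0):
    """Toy distance-geometry solver: fit coordinates to predicted distances."""
    rng = np.random.default_rng(seed)
    n = pred_dist.shape[0]
    coords = rng.normal(size=(n, 3))                  # random initial structure
    for _ in range(n_steps):
        diff = coords[:, None, :] - coords[None, :, :]    # (n, n, 3) displacements
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n)  # +eye avoids divide-by-zero
        err = dist - pred_dist                            # residual for each pair
        np.fill_diagonal(err, 0.0)
        # Gradient of the sum of squared residuals with respect to the coordinates.
        grad = 4.0 * (err / dist)[:, :, None] * diff
        coords -= lr * grad.sum(axis=1)
    return coords
```

With a consistent distance matrix, the recovered coordinates match the underlying structure only up to rotation, translation, and reflection, which is all pairwise distances can determine.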

This publication is a great demonstration of the positive applications that AI can have for humanity. Congratulations.

It was with great pleasure that I presented my readings of the week. Feel free to send me your feedback.
