
Paper 1: Sparse Reward Exploration via Novelty Search and Emitters
Paolo, G., Coninx, A., Doncieux, S., & Laflaquière, A. (2021). Sparse Reward Exploration via Novelty Search and Emitters. arXiv preprint arXiv:2102.03140.
The major trade-off in reinforcement learning is exploration versus exploitation. Exploration is necessary to find new rewards, and exploitation to capitalize on the knowledge already acquired. Some environments, such as robotic ones, have the particularity that rewards are very rare (so-called sparse rewards). The purpose of this work is to propose an algorithm that explores efficiently in the sparse-reward setting. The authors call their method SERENE (SparsE Reward Exploration via Novelty search and Emitters). How does it work?

The main originality of this algorithm is that it clearly separates the exploration process from the exploitation process. In other words, the agent starts in exploration mode, then switches to exploitation mode, and so on. How does the agent decide whether to explore or exploit? This is the role of the meta-scheduler, which allocates the right amount of time to each mode.
The exploration phase is actually reward-agnostic, and seeks only to maximize the novelty of the states encountered. This makes it possible to efficiently discover new areas that would never have been found by chasing the reward alone. Then, in the exploitation phase, the agent creates local instances of optimization algorithms, which the authors call emitters. These emitters are instances of a reward-based evolutionary algorithm, suited to optimizing the reward over a small area of the space.
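To make this alternation more concrete, here is a minimal Python sketch of how a meta-scheduler could interleave a reward-agnostic novelty phase with local reward-based emitters. Every name, the 1-D toy "behavior" space, and the budget logic are my own illustration of the idea, not the authors' implementation.

```python
import random

# Toy sketch of a SERENE-like loop (illustrative assumptions throughout):
# the real method evolves policies and works on behavior descriptors,
# which are abstracted here as points in [0, 1].

def novelty(behavior, archive, k=5):
    """Novelty = mean distance to the k nearest behaviors already archived."""
    if not archive:
        return float("inf")
    dists = sorted(abs(behavior - b) for b in archive)
    return sum(dists[:k]) / min(k, len(dists))

def emitter(center, reward_fn, steps=50, sigma=0.05):
    """Local reward-based search around a promising point (a stand-in for
    the paper's evolutionary emitters)."""
    best, best_r = center, reward_fn(center)
    for _ in range(steps):
        cand = best + random.gauss(0.0, sigma)
        if reward_fn(cand) > best_r:
            best, best_r = cand, reward_fn(cand)
    return best, best_r

def serene_like(total_budget=10_000, chunk=1_000,
                reward_fn=lambda x: max(0.0, 1 - abs(x - 0.8) * 10)):
    archive, promising, spent = [], [], 0
    while spent < total_budget:
        # Exploration: reward-agnostic novelty search.
        for _ in range(chunk):
            b = random.uniform(0.0, 1.0)      # stand-in for rolling out a policy
            if novelty(b, archive) > 0.02:
                archive.append(b)
            if reward_fn(b) > 0:              # a reward was found: remember the area
                promising.append(b)
        spent += chunk
        # Exploitation: the meta-scheduler hands a budget to the local emitters.
        for center in promising:
            emitter(center, reward_fn, steps=chunk // max(1, len(promising)))
        spent += chunk
    return archive, promising
```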
The results obtained are not spectacular, but in my opinion the idea of separating the exploration phases from the exploitation phases makes the algorithm more readable, and allows the trade-off to be managed in a much finer way.
Paper 2: Real-time Attacks Against Deep Reinforcement Learning Policies
Tekgul, B. G., Wang, S., Marchal, S., & Asokan, N. (2021). Real-time Attacks Against Deep Reinforcement Learning Policies. arXiv preprint arXiv:2106.08746.
Deep reinforcement learning agents are, like us, susceptible to attacks. What is an attack? Imagine a trading agent trained to buy and sell stocks. By training an attacking agent, it is possible to fool the first agent and make it take bad decisions by carefully perturbing its observations. The agent could then buy a considerable amount of a falling stock, or on the contrary, sell a very profitable stock at the wrong moment.

The main limitation of such attacks until now was their slowness, which prevented them from being deployed in real time. In this paper, the authors propose a method based on Universal Adversarial Perturbations (UAP). They show on Atari 2600 games that, by correctly training the perturbation, it is possible to make the performance of the attacked agent drop catastrophically while applying a perturbation of very low amplitude (0.5%). They also show that, unlike previous methods, the low computational cost of their approach allows it to be used in real time.
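The mechanic is easy to picture: a single perturbation is computed offline, then reused on every observation at test time, which is what makes the attack cheap. Below is a rough sketch of that application step, assuming the perturbation has already been trained; the amplitude bound, clipping, and shapes are illustrative, not the exact formulation of the paper.

```python
import numpy as np

# Illustrative sketch (not the authors' code): applying a pre-computed
# universal adversarial perturbation (UAP) to every observation at test time.

def apply_uap(observation, uap, epsilon=0.005):
    """Add a fixed perturbation, bounded in amplitude, then clip to valid pixel range.

    `epsilon` mirrors the ~0.5% amplitude mentioned above; the exact norm and
    bound used in the paper may differ.
    """
    perturbation = np.clip(uap, -epsilon, epsilon)   # enforce the amplitude budget
    return np.clip(observation + perturbation, 0.0, 1.0)

# Usage: the same `uap` is reused at every timestep, so no per-step optimization
# is needed, which is what enables a real-time attack.
obs = np.random.rand(84, 84, 4)                      # fake Atari-like observation stack
uap = np.random.uniform(-0.01, 0.01, obs.shape)      # stand-in for a trained UAP
perturbed = apply_uap(obs, uap)
```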
Is this good news? Not sure.
Paper 3: DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning
Zha, D., Xie, J., Ma, W., Zhang, S., Lian, X., Hu, X., & Liu, J. (2021). DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning. arXiv preprint arXiv:2106.06135.
DouDizhu is a very popular card game in China, played by three players. Each player starts with a hand of cards, and the objective is to get rid of them before the other players do. The rules are simple, which makes the game very easy to learn. Nevertheless, it has the particularity of being extremely hard to master: to play well, you need to think mathematically, probabilistically, and strategically.

You can imagine that some people wanted to know whether this game could resist reinforcement learning. The authors of this article present DouZero, a DRL agent trained through self-play that masters this game. What is interesting is the implementation. In this game, at each turn it is possible to play one or more cards, but not just any cards: there is a limited number of legal actions, which makes the implementation a bit more complex. To handle this, the authors propose two main ingredients: action encoding and parallel actors (see the sketch below). The trained agent achieves performance that exceeds that of any currently available bot. It seems that no game will resist RL…
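The legal-action constraint is what shapes the implementation: rather than outputting a distribution over a fixed action set, the agent scores each legal card combination and plays the best one. Here is a toy sketch of that idea, with a simplified count-vector encoding and a dummy scorer standing in for the trained network; DouZero's actual encoding and value network are richer.

```python
import numpy as np

# Hypothetical sketch of legal-action filtering; all names below are illustrative.

RANKS = "3 4 5 6 7 8 9 T J Q K A 2 B R".split()   # B/R = the two jokers

def encode_action(cards):
    """Encode a card combination as a count vector over the 15 ranks."""
    vec = np.zeros(len(RANKS), dtype=np.float32)
    for c in cards:
        vec[RANKS.index(c)] += 1.0
    return vec

def choose_action(state_features, legal_actions, q_value):
    """Evaluate only the legal combinations and return the highest-scoring one."""
    scores = [q_value(state_features, encode_action(a)) for a in legal_actions]
    return legal_actions[int(np.argmax(scores))]

# Usage with a dummy scorer in place of the trained network:
dummy_q = lambda s, a: float(a.sum())             # toy heuristic: prefer playing more cards
best = choose_action(np.zeros(10), [["3"], ["3", "3"], ["K"]], dummy_q)
print(best)                                       # -> ['3', '3']
```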
Paper 4: Model-Based Reinforcement Learning via Latent-Space Collocation
Zhu, C., Nagabandi, A., Daniilidis, K., Mordatch, I., & Levine, S. (2020). Model-Based Reinforcement Learning via Latent-Space Collocation.
The problem with reinforcement learning based on visual perception is that the observation space is huge (the product of all possible values of each pixel). Of course, most observations in this space are not reachable, and only a minority will ever be visited in a given environment. Nevertheless, this is enough to make reinforcement learning, and in particular planning-based methods, difficult. A solution that works very well is to build a latent representation of the observations: a more compact version that, in theory, retains all the information present in the input. But let's face it, long-term planning is still far from solved. In practice, algorithms using a latent observation space can often predict a few future observations, but remain unable to plan over the long term.

Their hypothesis is the following: instead of planning a sequence of actions (as is usually done in the literature), it should be easier to solve a task by planning a sequence of states. The intuition is that each action has important consequences for the rest of the trajectory, so the smallest error quickly compounds into a considerable prediction error. Instead, the authors use the collocation method, which optimizes a sequence of states to maximize the reward while constraining the trajectory to remain dynamically feasible.
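To make the collocation idea concrete, here is a toy NumPy illustration of optimizing a whole state sequence by gradient ascent, trading off reward against a penalty for violating known linear dynamics. The paper works in a learned latent space with a proper constrained-optimization solver, so treat this purely as a caricature of the principle; the dynamics matrix, goal, and weights are made up.

```python
import numpy as np

# Toy collocation sketch: plan a trajectory of states directly, balancing
# reward (distance to a goal) against a dynamics-violation penalty for the
# assumed linear dynamics z_{t+1} = A z_t.

T, DIM, LAM, LR = 10, 2, 10.0, 0.01
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])                       # stand-in for a learned latent model
goal = np.array([1.0, 0.0])                      # reward = -distance to the goal

states = np.zeros((T, DIM))                      # the plan: one state per timestep
for _ in range(2000):
    diff = states[1:] - states[:-1] @ A.T        # dynamics violation at each step
    grad = -2.0 * (states - goal)                # gradient of the reward term
    grad[1:]  += -2.0 * LAM * diff               # feasibility penalty w.r.t. z_{t+1}
    grad[:-1] +=  2.0 * LAM * diff @ A           # feasibility penalty w.r.t. z_t
    states += LR * grad                          # gradient ascent on the whole trajectory
    states[0] = 0.0                              # pin the first state to the current one

print(states[-1])   # the tail of the plan approaches the goal while staying near-feasible
```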
It is interesting to see that this method, which they call LatCo (for Latent Collocation), yields a very strong improvement in some environments but not in others. It would be worth investigating the cause of this difference in results.
Thank you for reading my article to the end. I would be delighted to read your comments.