Train your reinforcement learning agent to handle unlucky scenarios and avoid accidents
In this post I present our recent NeurIPS 2022 paper (co-authored with Yinlam Chow, Mohammad Ghavamzadeh and Shie Mannor) about risk-averse Reinforcement Learning (RL). I discuss why and how risk aversion is applied in RL, the limitations of current approaches, and how we propose to overcome them. An application to accident prevention in autonomous driving is demonstrated. Our code is also available on GitHub.
TL;DR
Risk-averse RL is crucial when applying RL to risk-sensitive real-world problems. To optimize in a risk-averse manner, current methods focus on the part of the data corresponding to low returns. We show that in addition to severe data inefficiency, this also leads to inevitable local optima. Instead of focusing on low returns directly, we propose to focus on high-risk conditions of the environment. We devise a variant of the Cross-Entropy Method (CEM) that learns and over-samples these high-risk conditions, and use this module as part of our Cross-Entropy Soft-Risk algorithm (CeSoR). We show cool results on driving and other problems.
The cross-entropy method that we use to over-sample high-risk conditions is available as a PyPI package. Of course, it is applicable well beyond sampling examples for RL. We also provide a tutorial about the CEM and the package.
Background (I): why risk aversion?
Reinforcement Learning (RL) is a subfield of machine learning that supports learning from limited supervision as well as planning. These properties make RL very promising for applications that require decision making, e.g., driving, robotic surgery and finance. In recent years, RL has demonstrated impressive success in a variety of games, to the point that a movie was made about its performance in the game of Go. Yet, RL struggles to find its way into real-world applications.
One challenge in closing the gap between video games and robotic surgery is that the latter is highly risk-sensitive: while a gaming bot is allowed to occasionally falter, a real-world system like a medical device must perform reasonably and reliably under any circumstances. In other words, in the real world we are often interested in optimizing a risk measure of the agent's returns – instead of optimizing the average return. A common risk measure to optimize is the Conditional Value at Risk (CVaR); essentially, CVaR_α of a random variable (such as the return) is the average over its α lowest quantiles – instead of the average over the whole distribution – where α corresponds to the risk level we're interested in.
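As a quick illustration (a minimal NumPy sketch of the definition, not part of the paper's code), the empirical CVaR of a sample of returns is simply the mean of its lowest α-fraction:

```python
import numpy as np

def empirical_cvar(returns, alpha=0.05):
    """Mean of the lowest ceil(alpha * N) returns, i.e., the empirical CVaR_alpha."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

# A policy that is usually fine, but occasionally catastrophic:
rng = np.random.default_rng(0)
sample = np.where(rng.random(10_000) < 0.02, -100.0, 1.0)
print(sample.mean())                 # average over the whole distribution
print(empirical_cvar(sample, 0.05))  # average over the 5% worst returns: far lower
```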
Background (II): traditional risk-averse RL
Intuitively speaking, the standard Policy Gradient approach for CVaR optimization (CVaR-PG) considers a batch of N episodes (trajectories) collected by the agent, takes only the αN episodes with the lowest returns, and applies a PG step to them.
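In (simplified, hypothetical) code, the core of such an update might look like the sketch below, where returns holds the N episode returns and log_probs the corresponding sums of action log-probabilities under the current policy; this only illustrates the selection logic, and is not the GCVaR implementation from our repository:

```python
import numpy as np
import torch

def cvar_pg_loss(returns, log_probs, alpha=0.05):
    """Surrogate loss for a CVaR policy-gradient step.

    returns:   numpy array of N episode returns.
    log_probs: list of N scalar tensors, each the sum of action log-probabilities
               of the corresponding episode under the current policy.
    Only the alpha*N lowest-return episodes contribute to the gradient.
    """
    k = max(1, int(np.ceil(alpha * len(returns))))
    worst = np.argsort(returns)[:k]      # indices of the k lowest-return episodes
    baseline = returns[worst].max()      # empirical VaR_alpha, used as a baseline
    # Gradient ascent on E[(R - baseline) * grad-log-prob]  ->  minimize the negative.
    return -torch.stack(
        [float(returns[i] - baseline) * log_probs[i] for i in worst]
    ).mean()
```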
Below we discuss the crucial limitations of this CVaR-PG approach. In the paper, we also provide evidence for similar limitations in other approaches beyond PG, e.g., Distributional RL.
The limitations of CVaR-PG
Sample inefficiency: the data inefficiency of CVaR-PG is straightforward: it essentially throws away a fraction 1-α of our data, which is typically 95%-99%! If instead we could sample only episodes corresponding to the α worst cases of the environment and optimize with respect to them, then clearly we could restore the sample efficiency of standard (risk-neutral) algorithms, i.e., improve data efficiency by a factor of 1/α. As discussed below, that's exactly what the Cross-Entropy Method aims to do.
Blindness to success: CVaR-PG does not only throw away most of the data; it throws away all the successful episodes in the data! If our agent happens to explore an exciting new strategy for dealing with a challenging scenario, the optimizer will immediately discard this episode as "high return, thus irrelevant". We refer to this phenomenon as blindness to success. In our paper, we show theoretically that in environments with discrete rewards, this inevitably causes the gradients to vanish – resulting in a local optimum.
Illustrative example – the Guarded Maze: In the Guarded Maze, the agent needs to reach the green target (whose location is constant) as quickly as possible. However, the shortest path passes through the red zone, which is sometimes occupied by an officer who charges random bribery fees (based on true stories from a country that shall not be named here). On average, the shortest path is still optimal, despite the rare negative rewards. However, the longer and safer path is CVaR-optimal (e.g., for α=5%).
We implemented the GCVaR algorithm, which is a standard realization of CVaR-PG. As shown in the sample episode above, GCVaR learned to avoid the risky short path, yet failed to learn the alternative long path: every time it encountered the long path, its return was high and thus the episode was not fed to the optimizer. GCVaR was blind to the successful strategy of the long path – even though it often explored it!
CeSoR to the rescue
Our method CeSoR (Cross-Entropy Soft-Risk) addresses the issues described above using two mechanisms.
Soft risk: as discussed above, CVaR-PG uses αN episodes out of every batch of N episodes. On one hand, this yields a consistent estimator of the true CVaR policy gradient. On the other hand, blindness to success leads this gradient into a local optimum. We adopt a simple solution to this tradeoff: we replace α with α', which begins at 1 and gradually decreases to α. This way, in the beginning the gradient looks beyond the local optima towards successful strategies; yet in the final training phase, it is again a consistent estimator of the CVaR policy gradient.
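For concreteness, here is one possible (hypothetical) schedule for α'; the actual scheduling in CeSoR is a tunable design choice:

```python
def soft_risk_alpha(step, total_steps, alpha, warmup_frac=0.5):
    """Soft-risk level alpha'(step): starts at 1 and decays linearly to alpha.

    A hypothetical linear schedule; the actual scheduling is a tunable design choice.
    """
    warmup = max(1, int(warmup_frac * total_steps))
    if step >= warmup:
        return alpha
    return 1.0 + (step / warmup) * (alpha - 1.0)
```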
Cross-Entropy Method (CEM): soft risk alone is not enough – we still have two problems. First, as discussed above, we throw away data and lose sample efficiency whenever α' < 1. Second, the soft risk itself may undermine our intended risk aversion: even if training ends with α' = α, what if the agent converges to a risk-neutral policy before that?
To address these issues, we assume control over certain conditions of the training environment. For example, when learning to drive, we can choose the roads and times of our rollouts, which affect the driving conditions. Under this assumption, we use the Cross-Entropy Method (CEM) to learn which conditions lead to the lowest returns, and then over-sample these conditions. The CEM is a really cool and less-familiar-than-it-should-be method for sampling and optimization. In a separate tutorial, I present the method and demonstrate how to use it in Python using the cem package.
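The snippet below is a minimal NumPy sketch of one CEM iteration over a one-dimensional Gaussian distribution of environment conditions; it only illustrates the idea, and does not reflect the interface of the cem package (see the tutorial for that):

```python
import numpy as np

def cem_update(mean, std, conditions, returns, elite_frac=0.1, smoothing=0.7):
    """One CEM iteration over a 1-D Gaussian sampler of environment conditions.

    The sampler is re-fitted toward the conditions that produced the LOWEST returns,
    so that high-risk conditions are over-sampled in the next training batch.
    (Illustrative only; the cem package exposes its own interface.)
    """
    conditions = np.asarray(conditions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    n_elite = max(1, int(elite_frac * len(conditions)))
    elite = conditions[np.argsort(returns)[:n_elite]]   # lowest-return conditions
    # Smoothed re-fit of the Gaussian parameters to the elite set.
    new_mean = smoothing * mean + (1 - smoothing) * elite.mean()
    new_std = smoothing * std + (1 - smoothing) * (elite.std() + 1e-3)
    return new_mean, new_std
```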
Once we over-sample the high-risk conditions, we no longer need to throw away as many episodes as before. In particular, if the CEM achieved its objective perfectly – sampling exactly from the α-tail of the original returns distribution – it would increase the sample efficiency by a factor of 1/α (as mentioned above). In practice, the CEM achieves a more moderate improvement.
By over-sampling high-risk conditions, we also keep the risk aversion, neutralizing the negative side-effect of soft risk: the soft risk allows the optimizer to learn policies with high returns, while the CEM sampler still preserves the risk aversion.
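Putting the two mechanisms together, a conceptual outline of the training loop looks as follows. It reuses the illustrative cvar_pg_loss, soft_risk_alpha and cem_update sketches from above, plus a hypothetical collect_episode(env, policy, condition) helper; it is a sketch of the idea, not the CeSoR implementation from our repository:

```python
import numpy as np

def cesor_train(policy, optimizer, env, alpha, n_iters=500, n_episodes=100):
    """Conceptual outline of CeSoR training (illustrative only).

    Assumes a hypothetical collect_episode(env, policy, condition) helper that
    runs one episode under the given environment condition and returns
    (total_return, sum_of_log_probs), plus the cvar_pg_loss, soft_risk_alpha
    and cem_update sketches defined above.
    """
    cem_mean, cem_std = 0.0, 1.0                     # Gaussian sampler over env conditions
    for t in range(n_iters):
        # 1. Sample environment conditions from the (risk-tilted) CEM sampler.
        conditions = np.random.normal(cem_mean, cem_std, size=n_episodes)
        rollouts = [collect_episode(env, policy, condition=c) for c in conditions]
        returns = np.array([r[0] for r in rollouts])
        log_probs = [r[1] for r in rollouts]

        # 2. Soft risk: PG step on the alpha'(t)-tail, with alpha' decaying from 1 to alpha.
        loss = cvar_pg_loss(returns, log_probs, alpha=soft_risk_alpha(t, n_iters, alpha))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # 3. CEM update: shift the condition sampler toward the lowest-return conditions.
        cem_mean, cem_std = cem_update(cem_mean, cem_std, conditions, returns)
```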
The main principle of CeSoR can be summarized as follows: to be risk-averse, focus on high-risk scenarios – not on poor agent strategies. This is illustrated in the figure below.
The phenomena discussed above are well demonstrated in the training process of the Guarded Maze, as shown in the figure below:
- Standard CVaR-PG (GCVaR) does explore the long path (top left figure), but never feeds it to the optimizer (bottom right figure). Thus, it eventually learns to do nothing. Note that using the CEM alone (without soft risk) cannot solve this limitation.
- Soft-Risk alone (SoR, without the CEM) eliminates the blindness to success and does feed the long path to the optimizer (bottom right). However, it begins as risk-neutral and thus prefers the short path (top right). By the time it becomes risk-averse again, the actor has already converged to the short-path policy, and the long path is not explored anymore.
- Only CeSoR observes the "good" strategy (thanks to soft risk) and judges it under "bad" environment variations (thanks to the CEM), converging to the long path (bottom left).
Learning to be a safe driver
We tested CeSoR on a driving benchmark, where our agent (in blue) has to follow a leader (in red) from behind as closely as possible – but without bumping into it. The leader may drive straight, accelerate, brake or change lanes.
As displayed in the first figure of this article, CeSoR improves the agent's CVaR_1% test return by 28% on this benchmark, and in particular eliminates all the accidents made by the risk-neutral driver. More interestingly, CeSoR learns intuitive behaviors corresponding to safe driving: as shown in the figure below, it uses the gas and the brake slightly less often (right figure), and keeps a larger distance from the leader (left figure).
Finally, we note that the CEM sampler itself performed as desired. In the Driving Game, we let the CEM control the leader's behavior (note that the leader is part of the environment, not the agent). As shown below, the CEM increased the relative frequency of turns and emergency brakes made by the leader during training. These leader behaviors were amplified in a controlled manner – to align the agent's experience with the α-tail of the true returns distribution.
Summary
In this post, we saw that risk-averse objectives in RL are more challenging to train than the standard expected-value objective – due to blindness to success and sample inefficiency. We introduced CeSoR, which combines soft risk to overcome blindness to success with the CEM sampler to improve sample efficiency and preserve risk aversion. On a driving benchmark, CeSoR learned an intuitive safe-driving policy and prevented all the accidents experienced by the alternative agents.
This work is but a starting point towards more efficient and effective risk-averse RL. Future research may improve CeSoR directly (e.g., through the soft-risk scheduling), or extend it beyond policy gradient methods and beyond the CVaR risk measure.