
Performance awareness in RL
Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make decisions sequentially in a given environment. While most research in the field focuses on training the agent, in many applications (especially risk-sensitive ones such as medical systems and autonomous vehicles) training has to stop and the agent has to be frozen before it is deployed to production. In such settings it is essential to be aware of the agent's performance at any point in time. In particular, it is critical to know as soon as possible if the performance begins to deteriorate (e.g. if the car controller starts faltering, we'd like to notice before it actually crashes into anything).
Since we focus on the post-training phase of the agent, one may argue that this problem belongs to statistical monitoring rather than to RL. However, two factors make the RL context particularly interesting. First, in RL we have several characteristic sources of information: the rewards, the observations, and the agent's own understanding (which may be expressed, for example, through state estimation or a value function). Second, in RL all these signals usually arrive in an episodic manner, where within an episode the signal is neither independent nor identically distributed (and in our framework not Markov, either).
Optimal test for rewards deterioration
In our work we focus on the rewards of the agent, which allows the monitoring system to be entirely external and independent of the agent. We assume that a reference dataset of "valid" rewards is available (in real applications this could be, for example, recordings of the system's tests), and we test sequentially in order to detect whenever the rewards deteriorate relative to the reference.
Such a test is very straightforward to write: just take the mean reward over the recent episodes and compare it to the reference. However, it turns out that this is highly sub-optimal: for example, if a certain time-step in the episode has highly-varying rewards, whereas another time-step has very stable rewards, then clearly the second one is more informative for detecting degradation.
In our work, we use the Neyman-Pearson lemma to show that if the rewards are assumed to be normally distributed, then an optimal test (i.e. one with maximal power for a given significance level) considers a weighted mean of the rewards rather than a simple mean, with weights corresponding to the row-sums of the inverse covariance matrix of the rewards. We also show that in the absence of normality, the weighted mean is still better than the simple mean (even if not necessarily optimal over all possible tests). Finally, we quantify its advantage over the simple mean and show that it grows with the heterogeneity of the spectrum of the rewards covariance matrix.
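To make this concrete, here is a minimal NumPy sketch of the weighted-mean statistic. The function name, the ridge regularization, and the normalization are illustrative assumptions, not the paper's exact implementation; the key point is that the weights are the row-sums of the inverse covariance matrix estimated from the reference episodes.

```python
import numpy as np

def degradation_statistic(reference_rewards, recent_rewards, ridge=1e-6):
    """Weighted-mean degradation statistic (sketch, not the paper's exact code).

    reference_rewards: (n_ref, T) array of per-time-step rewards from "valid"
        reference episodes.
    recent_rewards:    (n_recent, T) array of rewards from the episodes under test.
    """
    T = reference_rewards.shape[1]
    # Estimate the T x T covariance of rewards across reference episodes;
    # a small ridge term keeps the matrix invertible.
    cov = np.cov(reference_rewards, rowvar=False) + ridge * np.eye(T)
    # Weights are the row-sums of the inverse covariance matrix.
    weights = np.linalg.inv(cov).sum(axis=1)
    weights = weights / weights.sum()  # optional normalization

    # Weighted mean of the recent rewards, compared against the reference baseline;
    # a strongly negative value suggests degradation.
    recent_score = recent_rewards.mean(axis=0) @ weights
    reference_score = reference_rewards.mean(axis=0) @ weights
    return recent_score - reference_score
```

Intuitively, time-steps with stable rewards (small variance, weak correlation with the rest) receive large weights, while noisy time-steps are down-weighted, which is exactly the behavior motivated above.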
Tuning the test threshold
So we know which statistic to use for the degradation test, but what is the threshold below which we should cry "degradation"? We wish to use bootstrapping over the reference dataset to determine this threshold, but there are two difficulties: (1) the rewards are not i.i.d., so naive resampling in the bootstrap would ruin the distribution of the signal; (2) the tests are conducted sequentially on overlapping windows, so a simple false-positive probability is not a good descriptor of the test's significance.
We deal with the non-i.i.d. issue by exploiting the episodic setup and sampling complete episodes. For the sequential framework, we define the significance of the test through the requirement "during h episodes of sequential tests, the probability of a false alarm must not exceed α", and devise the bootstrapping accordingly in BFAR – Bootstrap for False Alarm Rate control in non-i.i.d. signals.
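The following is a hedged sketch of the idea (not the exact BFAR procedure): sample whole reference episodes to preserve the within-episode dependence, simulate h sequential tests per bootstrap run, and set the threshold at the α-quantile of the per-run minima, so that the probability of any false alarm within h episodes is approximately α. All names and parameters here are assumptions for illustration.

```python
import numpy as np

def bfar_threshold(reference_rewards, statistic_fn, n_recent, horizon,
                   alpha=0.05, n_bootstrap=1000, rng=None):
    """Bootstrap a test threshold that controls the false-alarm rate (sketch).

    reference_rewards: (n_ref, T) array of rewards from "valid" episodes.
    statistic_fn: e.g. degradation_statistic(reference, recent) from above.
    n_recent: number of episodes in each test window.
    horizon: h, the number of sequential tests over which alpha is controlled.
    """
    rng = np.random.default_rng(rng)
    n_ref = reference_rewards.shape[0]
    run_minima = []
    for _ in range(n_bootstrap):
        # Sample enough complete episodes for one simulated run of `horizon` tests;
        # sampling whole episodes keeps the within-episode dependence intact.
        idx = rng.integers(0, n_ref, size=horizon + n_recent - 1)
        episodes = reference_rewards[idx]
        # Slide a window of n_recent episodes and evaluate each sequential test.
        stats = [statistic_fn(reference_rewards, episodes[t:t + n_recent])
                 for t in range(horizon)]
        run_minima.append(min(stats))
    # A simulated (degradation-free) run raises a false alarm iff its minimal
    # statistic falls below the threshold; taking the alpha-quantile of the
    # minima bounds that probability by roughly alpha.
    return np.quantile(run_minima, alpha)
```

At run time, the monitor would compute the same statistic over each new window of episodes and declare degradation whenever it falls below the bootstrapped threshold.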
Numerical results
To test our procedures, we trained an agent on several environments, then modified the environments and let the rewards deteriorate (with no further training). In all environments and for all statistical tests, false alarms were successfully controlled in the absence of environment modifications – which indicates the success of BFAR. In the presence of modifications, our suggested test detected degradation faster and more often than the alternative tests – often by orders of magnitude. As competitors, we chose the naive simple mean (the statistic usually used in RL in practice), CUSUM (from sequential testing), and Hotelling's test (from multivariate mean-shift testing).

This work was published at the RWRL Workshop at NeurIPS 2020 under the title "Drift Detection in Episodic Data: Detect When Your Agent Starts Faltering".