Model-based micro-data reinforcement learning

What are the crucial model properties and which model to choose?

Balázs Kégl
Towards Data Science


Joint work with Gabriel Hurtado and Albert Thomas. This research is part of a broader theme to build AI autopilots for engineering systems.
The corresponding research paper.
Open source code to reproduce all our results.
Video teaser.

Context

One of our main research objectives at the Noah’s Ark Lab in Huawei France is to build AI autopilots for engineering systems.

We can improve various metrics: making the systems better, cheaper, more reliable, safer, or more energy efficient. Typical systems we are working on include the cooling system of a data center, the system that manages connections in a wireless antenna, or your local Wi-Fi network. The applications of the technology we are developing are countless: given that engineering systems are the backbone of most industry and transportation, making AI useful in this domain is arguably a multi-trillion-dollar endeavor.

What makes engineering systems special is that they are physical (unlike computers, they do not get faster over time) and that they are tightly controlled by systems engineers. This means that we need to learn a good control policy from small offline data and iterate with systems engineers slowly: a typical iteration of policy validation and data taking may take several weeks, even months. Model-based RL (MBRL) is considered the best approach for this micro-data regime, and it fits well the human work process we can hope to implement within the constraints of a real engineering system. In addition, systems engineers can validate the learned simulator on its own, increasing their trust in the process.

A typical controlled engineering system. System = airplane, engineer = pilot. (All images by author.)

Where to start?

Scanning the literature, we found excellent model-based techniques and this immensely useful horizontal study comparing the best-known models on many systems. The MBRL research field seems to operate in an engineering paradigm where the main objective is to design end-to-end systems that beat the state of the art on some benchmark systems. Our goal is to learn generalizable principles and to state and rigorously test hypotheses, while also attempting to obtain good performance on benchmark systems. More pragmatically, we would like to develop

  1. a toolkit (a model and agent zoo) and
  2. well-established guidelines

so that the data scientist (and eventually the systems engineer) can make informed decisions given his or her system to automate. For this, we started with a systematic study on models and proper ways to compare and tune them. We defined a set of criteria to compare models, not only on their performance but also on their practical usability. We designed rigorous and fair experiments, tested and (in)validated hypotheses, and found some curious experimental facts that we cannot yet explain.

We made a choice to do a vertical study: instead of running the methods on several environments, we chose one, the relatively simple yet challenging Acrobot system. Contrary to what the name suggests, Acrobot mimics a gymnast on the horizontal bar rather than an acrobat. It is an underactuated double pendulum (the only action is a discrete torque at the second joint: the gymnast can kick their legs forward, backward, or do nothing). It is noiseless but chaotic at the top equilibrium. We use Acrobot with a continuous (dense) reward (the height of the tip) since it is closer to what we have in engineering systems than Acrobot with the (sparse) 0/1 reward (height > 3 on our scale, the horizontal line in the video below).

The Acrobot system: underactuated double pendulum, mimicking a gymnast.
The Acrobot system with a random shooting agent and a trained DARMDN model.

We used two setups to test the different aspects of the models. The “sincos” system uses the sine and cosine of the two angles as prediction targets, eliminating the discontinuities at ±π while introducing a strong (functional) y-interdependence between the observables. This is the usual setup in the literature.

How different models handle y-interdependence (dependence between observables even given the history). Multivariate Gaussian models (like Gaussian processes) “spread” the uncertainty in all directions. Mixtures of multivariate Gaussians (DMDN) may “tile” the nonlinear y-interdependence by smaller Gaussians, but the match is not perfect unless we use a lot of components. Autoregressive mixtures (DARMDN) may put the right amount of uncertainty on y2|y1, learning, for example, the noiseless relationship between cos θ and sin θ.
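To make the autoregressive idea concrete, here is a minimal sketch of a DARMDN-style model for two observables, written in PyTorch. The class names, layer sizes, and component count are illustrative assumptions, not the architecture of the paper; the point is the factorization p(y1, y2 | x) = p(y1 | x) · p(y2 | x, y1), with each factor a one-dimensional mixture of Gaussians (x stands for the history features).

```python
# Minimal sketch of a deep autoregressive mixture density net (DARMDN-style).
# Hypothetical sizes and names; the paper's architecture may differ.
import torch
import torch.nn as nn

class MixtureHead(nn.Module):
    """Predicts a 1-D mixture of Gaussians for a single observable."""
    def __init__(self, in_dim, n_components=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3 * n_components))

    def forward(self, x):
        log_w, mu, log_sigma = self.net(x).chunk(3, dim=-1)
        return torch.log_softmax(log_w, -1), mu, log_sigma.clamp(-5, 5)

    def log_prob(self, x, y):
        log_w, mu, log_sigma = self(x)
        comp = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(y.unsqueeze(-1))
        return torch.logsumexp(log_w + comp, dim=-1)

    def sample(self, x):
        log_w, mu, log_sigma = self(x)
        idx = torch.distributions.Categorical(logits=log_w).sample().unsqueeze(-1)
        return torch.normal(mu.gather(-1, idx), log_sigma.exp().gather(-1, idx)).squeeze(-1)

class DARMDN(nn.Module):
    """p(y1, y2 | x) = p(y1 | x) * p(y2 | x, y1): y-interdependence is explicit."""
    def __init__(self, x_dim):
        super().__init__()
        self.head1 = MixtureHead(x_dim)
        self.head2 = MixtureHead(x_dim + 1)    # conditioned on y1 as well

    def log_prob(self, x, y):                  # y has shape (batch, 2)
        return (self.head1.log_prob(x, y[:, 0])
                + self.head2.log_prob(torch.cat([x, y[:, :1]], -1), y[:, 1]))

    def sample(self, x):
        y1 = self.head1.sample(x)
        y2 = self.head2.sample(torch.cat([x, y1.unsqueeze(-1)], -1))
        return torch.stack([y1, y2], -1)
```

Training maximizes log_prob on observed transitions; simulation calls sample, which is all a shooting-type planner needs.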

The “raw angles” system keeps the angles as the prediction targets, challenging the models at ±π where the angle trajectories jump and Acrobot behaves chaotically (small differences in the initial conditions lead to qualitatively different futures).

How different model types deal with uncertainty and chaos around the discontinuity at ±π on the Acrobot “raw angles” system. The Acrobot is standing up at step 18 and hesitates whether to stay left (θ1 > 0) or go right (θ1 < 0 with a jump of 2π). Deterministic and homoscedastic models underestimate the uncertainty, so a small one-step error leads to picking the wrong mode and huge errors down the horizon. A heteroscedastic unimodal model correctly estimates the large uncertainty (due to the jump) but represents it as a single Gaussian, so futures are not sampled from the modes. The multimodal model correctly represents the uncertainty (two modes, each with a small sigma) and leads to a reasonable posterior predictive after ten steps.
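A toy NumPy calculation makes this concrete (the numbers are made up, not the actual Acrobot dynamics): if the next angle either stays just left of +π or wraps around by 2π, a single Gaussian fit centers its mass between the two modes, on states the real system never visits, while a two-component mixture keeps sampling plausible futures.

```python
# Toy illustration of the +/-pi jump; illustrative numbers, not Acrobot dynamics.
import numpy as np

rng = np.random.default_rng(0)
theta = 3.1                      # first angle, just left of +pi: the gymnast hesitates

# Hypothetical true futures: stay left (small increase) or go right (jump of 2*pi).
futures = np.where(rng.random(1000) < 0.5, theta + 0.05, theta + 0.05 - 2 * np.pi)

# Unimodal fit: the mean lies between the modes, far from every real future.
print("unimodal mean:", futures.mean())          # close to 0, a state never visited

# Bimodal fit: samples stay on the two real modes, each with a small sigma.
modes = np.array([theta + 0.05, theta + 0.05 - 2 * np.pi])
samples = modes[rng.integers(0, 2, size=1000)] + rng.normal(0.0, 0.01, size=1000)
print("bimodal sample means:", samples[samples > 0].mean(), samples[samples < 0].mean())
```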

Choosing a single system definitely limits the generality of our findings. On the other hand, it allowed us to pay attention to the specifics of the system and to find solutions that turned out to improve on the state of the art several-fold (in terms of sample complexity). Above all, it made us learn a lot, both about Acrobot and about the models we tested.

Model requirements

We defined seven discrete-valued (binary or ternary) requirements on which the models can be compared. They go from hard “must have” model properties to softer ones that a practicing data scientist may find important. We complement them with graded (non-binary) computational costs at training and simulation time.

  1. It should be computationally easy to properly simulate observables given the system trace. This is a hard requirement; otherwise the model cannot be interfaced with popular control techniques that require such simulations (see the interface sketch after this list).
  2. Easy computation of an explicit likelihood for easy model selection and tuning.
  3. Ability to model what we named y-interdependence: the dependence between system observables, even given the historical trace. The popular angle representation with sine and cosine is an example where this property makes the model more credible to a systems engineer.
  4. Heteroscedastic models are able to vary their uncertainty estimate as a function of the input state or trace. We found that even when using the deterministic mean prediction at planning time, allowing heteroscedasticity at training time alleviates error accumulation down the horizon.
  5. Multi-modal posterior predictives can handle uncertainty around discrete jumps in the system state that lead to qualitatively different futures.
  6. We should be able to model different observable types, for example discrete/continuous, finite/infinite support, positive, heavy tail, multimodal, etc. Engineers often have strong prior knowledge on distributions that should be used in the modelling.
  7. Complex multivariate density estimators rarely work out of the box on a new system. We are aiming at reusability of our models (not mere reproducibility of our experimental results). In (iterated) offline RL, system models need to be retrained and retuned automatically. Both of these require robustness and debuggability: self-tuning, gray-box models, and tools that help the modeler pinpoint where and why the model fails.
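Requirements 1 and 2 can be read as a small interface contract that every candidate model must satisfy. The sketch below is a minimal version of that surface (the method names are illustrative, not the interface of our released code): a planner only needs sample_next, and model selection only needs log_likelihood.

```python
# Minimal model interface implied by requirements 1 and 2 (illustrative names).
from abc import ABC, abstractmethod
import numpy as np

class SystemModel(ABC):
    @abstractmethod
    def sample_next(self, trace: np.ndarray, action: np.ndarray) -> np.ndarray:
        """Requirement 1: draw one sample of the next observables given the
        system trace (history of observables and actions); this is all that
        shooting-type planners need."""

    @abstractmethod
    def log_likelihood(self, trace: np.ndarray, action: np.ndarray,
                       next_obs: np.ndarray) -> float:
        """Requirement 2: explicit log p(next_obs | trace, action), used for
        model selection and tuning on held-out transitions."""
```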

Metrics

We have introduced a rigorous experimental framework.

Models can be tested on static data sets, using various metrics: log likelihood, precision (R2), outlier rate, and calibratedness. Some of these metrics can be computed using Monte-Carlo simulation several steps ahead.
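As an illustration, a k-step Monte Carlo evaluation on a static trace can be sketched as follows, assuming a model exposing the sample_next method above (here the trace is reduced to the last observable for brevity); the paper's metrics (log likelihood, outlier rate, calibratedness) need more care, we only show the rollout and an R2 on the Monte Carlo mean.

```python
# Sketch of a k-step Monte Carlo evaluation on logged data (simplified).
import numpy as np

def k_step_mc_predictions(model, obs, actions, k=10, n_mc=50):
    """n_mc sampled k-step-ahead predictions for every valid start index."""
    T = len(obs) - k
    preds = np.empty((T, n_mc) + obs.shape[1:])
    for t in range(T):
        for m in range(n_mc):
            o = obs[t]
            for j in range(k):                    # roll the learned model forward
                o = model.sample_next(o, actions[t + j])
            preds[t, m] = o
    return preds

def r2_of_mc_mean(preds, obs, k=10):
    """Precision (R2) of the Monte Carlo mean against the true k-step targets."""
    target = obs[k:k + len(preds)]
    resid = ((preds.mean(axis=1) - target) ** 2).sum()
    total = ((target - target.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total
```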

Models can be tested fairly on the dynamical system. We fixed the planning agent to random shooting with fixed hyperparameters (optimized to do well on the real system; see the planner sketch after the list below). We introduced

  1. a normalized asymptotic metric, RMAR: the mean asymptotic reward (MAR) divided by the optimal MAR, and
  2. a sample complexity metric, MRCP: the number of system access steps required to reach 70% of the asymptotic optimal MAR.
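For reference, a minimal random shooting planner looks like the sketch below; the horizon and the number of sampled sequences are placeholder values, not the tuned hyperparameters we actually fixed.

```python
# Sketch of a random shooting planner on a learned model (illustrative values).
import numpy as np

def random_shooting_action(model, obs, reward_fn, n_actions=3,
                           horizon=25, n_sequences=200, seed=None):
    """Sample random action sequences, roll each out on the learned model,
    and return the first action of the highest-return sequence."""
    rng = np.random.default_rng(seed)
    best_return, best_first_action = -np.inf, 0
    for _ in range(n_sequences):
        seq = rng.integers(0, n_actions, size=horizon)   # Acrobot: 3 discrete torques
        o, ret = obs, 0.0
        for a in seq:
            o = model.sample_next(o, a)                  # simulate with the model
            ret += reward_fn(o)                          # dense reward: height of the tip
        if ret > best_return:
            best_return, best_first_action = ret, int(seq[0])
    return best_first_action
```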

We report rigorous confidence intervals based both on random seeds and on running the systems for a long time after convergence.

Findings

The sincos Acrobot system

The sincos system requires no multimodality, so essentially

all deep mixture density models do well, improving the current state-of-the-art sample complexity two- to four-fold.

Dynamic metrics (RMAR = asymptotic performance, max = 1000; MRCP = learning pace in system access steps) and learning curves on the Acrobot sincos system. All mixture models beat the previous SOTA two- to four-fold on sample complexity.

To our biggest surprise, we do not even need probabilistic models:

deterministic models (forecasting the mean of the posterior predictive) are optimal, but only if we use a heteroscedastic log-likelihood loss at training time.

A deterministic neural net trained with the classical MSE target one step ahead is suboptimal both on the dynamic scores and on the static scores ten steps ahead. We have no verified explanation for this; our current hypothesis is that heteroscedasticity allows the model to downweight certain training points that are hard to predict.
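The hypothesis is easy to see in the loss itself: in the heteroscedastic Gaussian negative log-likelihood, the squared error of every training point is divided by a learned, input-dependent variance, so points for which the model predicts a large sigma contribute less to the fit of the mean, whereas the classical MSE weighs all points equally. A minimal PyTorch-style sketch (illustrative, not our exact training code):

```python
# The two one-step training losses for a net predicting mu (and log_sigma).
import torch

def mse_loss(mu, y):
    # Classical MSE: every training point has the same weight.
    return ((mu - y) ** 2).mean()

def heteroscedastic_gaussian_nll(mu, log_sigma, y):
    # Negative log-likelihood with input-dependent sigma: hard-to-predict points
    # (large learned sigma) are effectively downweighted in the squared-error
    # term, even if only the mean mu is used later at simulation time.
    return (log_sigma + 0.5 * ((y - mu) / log_sigma.exp()) ** 2).mean()
```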

10-step forecasts of deterministic neural models. Optimizing MSE one step ahead (NN) leads to suboptimal performance down the horizon while optimizing log likelihood (DARMDN) in a mixture model performs better, even if we ignore the uncertainty estimate at simulation time (we simulate from the mean). The results hold even with a single Gaussian component (not shown).

Flows (RealNVP) and variational autoencoders (VAE) are suboptimal even on the sincos system.

The raw angles Acrobot system

This system requires multimodal posterior predictives at ±π.

The only models that do well are those that use explicit mixture densities.

Dynamic metrics (RMAR = asymptotic performance, max = 1000; MRCP = learning pace in system access steps) and learning curves on the Acrobot raw angles system. Only mixture models that can represent multimodal posterior predictives perform well.

In principle, flows (RealNVP) and variational autoencoders (VAE) can model multimodal posterior predictives, but in our experiments on the raw angles Acrobot system we were not able to tune them to do so.

Conclusion

Overall, our preferred model is the deep autoregressive mixture density net, which we call DARMDN (“darm-dee-en”; similar to RNADE). Not only does it perform optimally on both Acrobot systems, but autoregressivity means that we can model one variable at a time, so DARMDN is easy to adapt to engineering systems, which come with heterogeneous system observables.

The classical non-autoregressive mixture density net (DMDN) uses a mixture of axis-aligned multivariate Gaussians. It also nails the two Acrobot systems, but the practicing data scientist has less control over the individual observable models, for example when they are discrete or have heavy tails.

The bagged single-Gaussian DMDN(1) is equivalent to what the PETS algorithm uses. It is a mixture density model, but it is learned using ensembling instead of directly optimizing the log likelihood. Ensembling is a variance reduction technique which improves sample complexity in the extreme low-data regime, but it cannot learn multimodal posterior predictives, so it remains suboptimal on the raw angles system.
