
Historically, control systems have been built by first approximating the controlled system (or "plant") with a well-understood model, such as the linear dynamics and quadratic costs underlying the Linear Quadratic Regulator (LQR) or a tabular Markov Decision Process (MDP), and then designing a (near-)optimal controller for this assumed model. This approach works quite well, as evidenced by the plethora of application areas classical control theory supports, from keeping airplanes aloft to keeping chemical plants from blowing up. Indeed, research in these areas has produced a readily available storehouse of design principles and rules of thumb that can be used to generate, with little effort, a menu of controllers with tolerable performance.
Another advantage of this classical approach is the interpretability of the resulting control policies: they are intuitively satisfying and make sense. This is one of the reasons why simple linear feedback controllers and threshold policies have remained the mainstay of control theory for so long, despite there being plenty of room for improvement in performance. In contrast, you would be hard-pressed to explain to a layperson, for example, what the 500,000 weights of your favorite deep neural network actually "mean," despite the fact that it "works."
Often, however, owing to the sheer complexity of modern control applications or the ever-increasing demand for faster deployment and better performance, this approach can fall considerably short of expectations. For instance, poorly understood systems – and the resulting misspecified models – could lead to overfitting and subpar controllers in practice. Or the amount of data required to approach near-optimality with these methods might adversely impact both performance and deployment speed. The advent of systems as complex as fleets of autonomous vehicles and robotic teams, therefore, now calls for a paradigmatic change in the way we look at the fundamental problem of control.

One emerging, general-purpose reinforcement learning (RL) approach in this new scenario is to start from a given, pre-designed ensemble of "basic" (or "atomic") controllers, which (a) allows the given controllers to be flexibly combined into policies richer than the atomic ones (we will use the terms "policy" and "controller" interchangeably), and, at the same time, (b) preserves the basic structure of the given class of controllers, conferring a high degree of interpretability on the resulting hybrid policy. Here, we use the usual meaning of the term policy, i.e., a mapping that outputs an action for every state of the plant.
What is Improper Learning? In machine learning parlance, given a class of controllers (or hypotheses or classifiers, as the case may be), an algorithm that picks strictly from among those available is called a Proper Learner, while one that may output a hypothesis from outside the given class is said to be an Improper Learner [2]. A simple example is a classification problem with a finite set of N linear predictors, i.e., N weight vectors. In this scenario, a learning algorithm which (post training) always picks the best from among the N given predictors would be called a proper learner. Alternatively, the algorithm could output the best predictor from the convex hull of this set, and would then be called an improper learner. Over the years, while improper learning (IL) has received some attention within the statistical and online learning communities, it remains largely unexplored in control. One of the objectives of this article, therefore, is to bring this technique to the attention of researchers and practitioners as a promising and timely direction of investigation.
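As a toy illustration (in Python, with made-up data and a crude random search standing in for a real optimizer), the sketch below contrasts the two learners on the linear-predictor example just described: the proper learner commits to the best of the N given weight vectors, while the improper learner is also allowed to search over their convex hull.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, made-up binary classification data (for illustration only).
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# A finite base class of N linear predictors (weight vectors).
W = rng.normal(size=(5, 2))                      # N = 5 atomic predictors

def error(w):
    """Misclassification rate of the linear predictor sign(X @ w)."""
    return np.mean(np.sign(X @ w) != y)

# Proper learner: commit to the single best atomic predictor.
proper_err = min(error(w) for w in W)

# Improper learner: also search over convex combinations of the atomic
# predictors (a crude random search over the simplex stands in for a
# real optimizer here).
improper_err = proper_err
for _ in range(2000):
    alpha = rng.dirichlet(np.ones(len(W)))       # point in the probability simplex
    improper_err = min(improper_err, error(alpha @ W))

print("proper  error:", proper_err)
print("improper error:", improper_err)
```

By construction, the improper learner can never do worse than the proper one here, since the atomic predictors themselves lie in the convex hull being searched.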
Statistical Learning. Improper Learning has already made inroads in statistical learning, and the results have been clearly encouraging. One obvious example is the technique of Boosting, explored thoroughly in the context of classification in [5]. The celebrated AdaBoost algorithm has by now been adapted to virtually every application area in machine learning, from classification to algorithmic trading, even winning its inventors the Gödel Prize in 2003. More recently, the problem of model misspecification was investigated in the context of supervised learning in [3], which showed a dramatic improvement in regret performance with improper learners even when an incorrect parametric model was used (proper learners were shown to perform much worse). Similarly, the problem of finite-time regret in logistic regression was recently investigated in [4], where, once again, regret performance was significantly improved by leveraging improper learning. Note that in both these cases, the learner only had to expand its search to include convex combinations of the available, atomic predictors.
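As a further illustration of boosting as improper learning, the scikit-learn sketch below (on synthetic data) compares a single decision stump, a member of the "proper" base class, with an AdaBoost ensemble of stumps; the ensemble's weighted vote is a hypothesis that no individual stump can represent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The "proper" base class: single depth-1 decision trees (stumps).
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# AdaBoost outputs a weighted vote over many stumps -- a hypothesis that is
# itself not a stump, i.e., an improper learner relative to the base class.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

print("single stump accuracy    :", stump.score(X, y))
print("boosted ensemble accuracy:", boosted.score(X, y))
```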
Improper Learning in Control. Improper Learning is beginning to receive attention within the control community, and already two distinct approaches can be observed. The first [6] follows the paradigm described above, using a base or atomic class of (non-adaptive) control policies along with an adaptive meta-learner that combines the outputs of these policies to produce an improper controller whose performance is strictly better than that of any controller in the base class. Indeed, [6] also shows examples where a stabilizing controller emerges from a set of unstable atomic controllers. Importantly, the improper controller interacts with the controlled system only through the base controllers, i.e., it picks a base controller in each round, which in turn implements its control action on the system. The control policy that emerges from the adaptive algorithm need not match any of the base policies at every system state and is hence clearly improper.
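To make this first architecture concrete, here is a minimal sketch – not the gradient-based algorithm of [6], but a simple exponential-weights stand-in – in which an adaptive meta-learner maintains a distribution over three hypothetical fixed feedback gains for a scalar plant and, in each round, interacts with the plant only through the single base controller it samples. The plant parameters, the gains, and the step size are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup (assumed, not from [6]): a scalar plant
#   x_{t+1} = a * x_t + b * u_t + noise,
# and a base class of K fixed linear feedback gains u = -k * x.
a, b = 1.2, 1.0                          # open-loop unstable plant (assumed)
gains = np.array([0.1, 0.8, 1.5])        # K = 3 atomic controllers (assumed)
K = len(gains)

log_w = np.zeros(K)                      # meta-learner's weights over controllers
eta = 0.1                                # step size (assumed)
x = 1.0

for t in range(1000):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    i = rng.choice(K, p=p)               # meta-learner picks ONE base controller...
    u = -gains[i] * x                    # ...which alone acts on the plant

    x_next = a * x + b * u + 0.01 * rng.normal()
    loss = min(1.0, x_next**2 + 0.1 * u**2)   # quadratic stage cost, clipped to [0, 1]

    log_w[i] -= eta * loss / p[i]        # importance-weighted (bandit) update
    x = float(np.clip(x_next, -10.0, 10.0))   # keep the toy simulation bounded

p = np.exp(log_w - log_w.max())
p /= p.sum()
print("learned mixture over base controllers:", np.round(p, 3))
```

The randomized mixture the meta-learner ends up playing need not coincide with any single base controller, which is exactly what makes the resulting policy improper.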
On the flip side, the second approach [1] essentially extends the idea of boosting to control. This involves maintaining a set of instances of a "weak learning" algorithm (such as Online Gradient Descent). The weak learners are assumed to be adaptive and offer control suggestions to a non-adaptive booster. The booster, in turn, combines these suggestions into a control action, which it implements on the controlled system. The same logic as before shows why the booster can be considered an improper learner. Note, however, that the architecture here is essentially the reverse of the one in [6] – the combining controller is non-adaptive and interacts directly with the system.
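The sketch below mirrors this second architecture on the same made-up scalar plant; it conveys only the data flow (adaptive weak learners, a non-adaptive combiner acting directly on the plant) and not the actual boosting machinery of [1]. Each weak learner maintains its own feedback gain via online gradient descent on the one-step cost its own suggestion would incur, while the booster simply averages the suggestions and applies the result.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same hypothetical scalar plant as above: x_{t+1} = a*x_t + b*u_t + noise.
a, b = 1.2, 1.0
theta = np.array([0.0, 0.5, 2.0])    # each weak learner keeps its own feedback gain
lr = 0.05                            # weak learners' OGD step size (assumed)
x = 1.0

for t in range(500):
    # Each ADAPTIVE weak learner offers a control suggestion u_i = -theta_i * x.
    suggestions = -theta * x

    # NON-ADAPTIVE booster: a fixed (here uniform) combination of the suggestions,
    # applied directly to the plant.
    u = suggestions.mean()
    x_next = a * x + b * u + 0.01 * rng.normal()

    # Each weak learner runs online gradient descent on the one-step cost it
    # would incur if its own suggestion were applied: ((a - b*theta_i) * x)^2.
    grad = -2.0 * b * x**2 * (a - b * theta)
    theta = theta - lr * grad

    x = float(np.clip(x_next, -10.0, 10.0))

print("weak learners' gains after training:", np.round(theta, 3))
```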
Moving Forward. While these preliminary attempts look promising, there is much room for improvement. For instance, are the two architectures described above the only ones possible? Is there a principled way to choose the base class for a given application? On the theory side, questions about regret bounds and convergence rates abound. How does one extend this theory to settings with multiple learning agents? Plenty of fundamental questions, both theoretical and practical, thus remain open and provide exciting opportunities for researchers to move the needle in this emerging field of control.
Keywords: Improper Learning, Reinforcement Learning, Boosting, MetaRL, AdaBoost.
References
[1] Naman Agarwal, Nataly Brukhim, Elad Hazan, and Zhou Lu. Boosting for control of dynamical systems. In International Conference on Machine Learning, pages 96–103. PMLR, 2020.
[2] Wikipedia contributors. Distribution learning theory – Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Distribution_learning_theory, 2020.
[3] John Duchi, Annie Marsden, and Gregory Valiant. On misspecification in prediction problems and robustness via improper learning. arXiv preprint arXiv:2101.05234, 2021.
[4] Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. In Conference On Learning Theory, pages 167–208. PMLR, 2018.
[5] Robert E Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2013.
[6] Mohammadi Zaki, Avi Mohan, Aditya Gopalan, and Shie Mannor. Improper learning with gradient-based policy optimization. arXiv preprint arXiv:2102.08201, 2021.