Thoughts and Theory

Optimal Control, a Preamble

If you know what I min

Guillaume Crabé
Towards Data Science
21 min read · May 22, 2021


Photo by John Fowler on Unsplash

What do Nature, AI, economic theories, classical mechanics, our brain, and almost everything else have in common? They try to optimize.

You are not even thinking about it, but your brain is constantly trying to optimize. Every single movement is optimized to consume as little energy as possible. Every animal is trying to optimize its survival rate. Economists are trying to optimize profit, and Nature as a whole is trying to optimize too.

This simple fact was a revelation for me back in my school days. At the time, I was investigating the fascinating topic of bubble geometries. Now consider this simple experiment. Let's say you make a simple cube out of wires (only the edges), and plunge it into a solution of liquid soap (the same one that kids use to make bubbles). What do you expect to come out of it? The answer is rather surprising, as you can see for yourself below:

Source: Youtube, The Science of Bubbles (Full Science Documentary) | Spark

Now the solution to this riddle is actually rather simple. The physical phenomenon behind this is called surface tension, which stems from the fact that some molecules prefer to be together rather than in contact with another group of molecules, here water and air. Water molecules are going to arrange themselves in a shape that minimizes the total surface of contact with air, leading to this very peculiar shape. This is also the reason why bubbles are spherical, as without constraints (the wires) the minimal surface that encapsulates a given volume is indeed the surface of a sphere. But it does look less cool.

Now, this simple experiment does look incredible to me. It looks like Nature is actually solving complex equations and trying to optimize its energy consumption. Considering that bubbles are a geometrical solver, we could almost base scientific computation on bubbles (almost…). The important result here is that Nature is solving an optimization problem: reducing a quantity given a set of constraints. Here, reducing the surface of soapy liquid deployed to encompass the air that we blew into the bubble (the constraint).

General formulation

Let’s now try to transcribe the notion of “optimization” into a mathematical formulation.

Broadly speaking, in mathematics, the notion of optimization is related to finding the minimum of a function. Given a function "f" which takes as input a vector of n real numbers and maps it to a real number, the problem can be formulated as below: find x* such that
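f(x*) ≤ f(x) for all x in ℝⁿ

x* = argmin f(x), the argument minimizing f over ℝⁿ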

These two formulations are equivalent. Note that the input domain can also be restricted to a smaller subset of ℝⁿ.

Pretty basic, hey? Now, this can hide an immensely high level of complexity depending on the nature of "f", whether it is a "nice" function or not. Solving this problem is actually what a lot of the algorithms out there in the wild do. One way or another, you often end up having to find the minimum of a function. Take neural networks from the field of Deep Learning, for example. At every step of the algorithm, what we do is simply try to find the minimal value of a certain cost function.

As is often done, we can try to reformulate the problem into a nicer shape, equivalent but actually tractable. In particular, we can find two necessary conditions for a given x to be a solution:
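∇f(x*) = 0  and  ∇²f(x*) ≥ 0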

Note that, since the gradient of a function of n variables is a vector of size n, and the gradient of that gradient is a matrix of size (n x n), the second-order condition needs some explanation. The "greater than or equal to 0" condition does not mean that every term of the matrix needs to be greater than or equal to 0. Instead, it means that the matrix needs to be positive semi-definite. Also, note that the second-order gradient of a function is called the Hessian. So in this case, we would say that the Hessian of "f" needs to be positive semi-definite.

How to simply explain these two conditions? The first condition is analogous to the well-known result from high school: a point can only be a minimum of a function if the derivative at that point is equal to 0. Well, what we presented is the same, except that we broaden it to the multi-variable case, thus taking the gradient.
One thing that was omitted during high school is that there is another necessary condition for that point to be a minimum: the second derivative taken at this point should be greater than or equal to 0. Similarly, the condition on the Hessian transposes this result from the single-variable case to the multi-dimensional case.
If these two conditions are respected, then we can say that this point is a stationary point (but we cannot tell whether it is a minimum yet).

But we can go further than that. If the Hessian of "f" is positive definite, then these two conditions are sufficient (in the single-variable case, we would likewise have a sufficient condition if the second derivative is strictly positive), and we can know for sure that the point is a minimum:
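∇f(x*) = 0  and  ∇²f(x*) > 0 (positive definite Hessian)  ⟹  x* is a local minimum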

All is well, we have put the problem in a more tractable form. This means we could solve it analytically if f is nice enough. If it is not, we need to use numerical approaches which try to guess the result iteratively. But finding out these conditions enables us to verify the result provided by these approaches. For example, we can detect that a returned solution is a stationary point but not an actual minimum by looking at the Hessian of "f" at that point.

Image by Author

Above is depicted a saddle shape. At the saddle point, the gradient is null, but the Hessian is not positive semi-definite (it has both a positive and a negative eigenvalue), so the point is not a minimum.
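We can check this numerically. Below is a small sketch using NumPy, taking f(x, y) = x² − y² (a classic saddle, assumed here as a stand-in for the surface above): the gradient vanishes at the origin while the Hessian is indefinite.

import numpy as np

# Saddle surface f(x, y) = x**2 - y**2
def gradient(v):
    x, y = v
    return np.array([2 * x, -2 * y])

def hessian(v):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

point = np.array([0.0, 0.0])
print(gradient(point))                    # [0. 0.] -> stationary point
eigenvalues = np.linalg.eigvalsh(hessian(point))
print(eigenvalues)                        # [-2.  2.] -> indefinite Hessian
print(bool(np.all(eigenvalues >= 0)))     # False -> not a minimum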

We can also talk briefly about the common numerical approaches to solving such a problem. As neural networks do, we can find the minimum using the gradient descent method or Newton's method. Another approach is to use the so-called evolution strategies that I studied in more detail in a previous article. However, these approaches work in unconstrained cases only.
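As a rough sketch of what gradient descent looks like in practice (here on the simple quadratic bowl f(x) = x₁² + x₂², chosen purely for illustration):

import numpy as np

# Minimize f(x) = x[0]**2 + x[1]**2 with plain gradient descent.
def grad(x):
    return 2 * x                     # gradient of the quadratic bowl

x = np.array([3.0, -2.0])            # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    x = x - learning_rate * grad(x)  # step against the gradient

print(x)  # very close to the true minimum [0, 0]

# Newton's method also uses the Hessian: x <- x - H^(-1) @ grad(x).
# For this quadratic the Hessian is 2*I, so a single Newton step lands on the minimum.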

Constrained case

All is well, but when we try to perform optimization, we are often facing some sort of constraints, all the more when the things that we are trying to optimize are real, mechanical, physical entities. An economist has to take into account the fact that resources are limited; a control engineer, that their mechanical system cannot exceed a certain range of velocities and accelerations, or else see the system fall apart. Constraints are thus an important part of optimization theory.
We can break the constraint types into two:

  • Equality constraints
  • Inequality constraints

Using these constraints, the formulation of the optimization problem can now be stated as: Find x* such that
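minimize f(x) subject to hᵢ(x) = 0 (the equality constraints) and gᵢ(x) ≤ 0 (the inequality constraints)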

where hᵢ(x) represents the i-th equality constraint and gᵢ(x) the i-th inequality constraint.

Let’s take a quick example:

Here, the function to minimize is a convex paraboloid (whose equation is a quadratic function), and our constraint is a line (with equation x = y, blue dots). If the function to minimize has dimension N, the constraint is always at most of dimension N-1, so if the function to optimize lies in three dimensions, then the constraint lies in a two-dimensional space.

Image by Author

The solution to the problem, the minimum point, needs to belong to the paraboloid of course. But the linear constraint brings in a new requirement. Let's imagine that for each point of the paraboloid with coordinates (x, y, z), we draw the line x = y at the same z-level. For that point to respect the constraint, it must be on that line. If we do that for every point of the surface that we want to optimize, what we end up doing is projecting the constraint onto our surface. The result of the projection is called the set of admissible points of the problem (orange dots):

Image by Author

As seen, the projection is exactly a parabola. This means that the problem is reduced to finding the minimum point of this parabola instead of the initial surface. More generally, adding constraints to an optimization problem reduces the overall dimensionality of the entity to minimize (but still increases the overall complexity).

Now, what if we consider inequality constraints? Let's take the equation of our line, x = y, and transform it into an inequality constraint, x ≥ y.

Image by Author

The blue dots correspond to the area that satisfies the inequality constraint x ≥ y in the (x, y) plane. The orange dots are the points belonging to the paraboloid which also respect the inequality constraint x ≥ y.

As can be seen, the constraint still lies in a lower dimensionality (the x, y plane). However, due to the inequality, we end up with an area instead of a line. All the same, to reduce the optimization problem, we need to project it onto the surface to minimize in order to end up with the set of admissible points.
The notion of projection is easy to grasp here, but it becomes much harder to visualize as dimensionality grows. It remains an important aspect of the optimization field nonetheless.

Now, that was the geometrical visualization. In practice, to find the minimum of a function with equality constraints, the general tool is the method of Lagrange multipliers, which allows formulating the problem in a more tractable way by introducing new coupling variables.

For inequality constraints, we use the same procedure, and an equivalent formulation of the problem can be found using the KKT conditions.
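To make this concrete, here is a small sketch using scipy's SLSQP solver (one possible choice among many), applied to a paraboloid in the spirit of the one pictured above, first with the equality constraint x = y and then with the inequality constraint x ≥ y:

from scipy.optimize import minimize

# Paraboloid to minimize, shifted so that the constraints actually matter.
def f(v):
    x, y = v
    return (x - 1.0) ** 2 + (y + 1.0) ** 2

# Equality constraint: x - y = 0  (i.e. x = y)
eq_con = {"type": "eq", "fun": lambda v: v[0] - v[1]}
# Inequality constraint, in scipy's convention fun(v) >= 0: x - y >= 0  (i.e. x >= y)
ineq_con = {"type": "ineq", "fun": lambda v: v[0] - v[1]}

res_eq = minimize(f, x0=[0.0, 0.0], method="SLSQP", constraints=[eq_con])
res_ineq = minimize(f, x0=[0.0, 0.0], method="SLSQP", constraints=[ineq_con])

print(res_eq.x)    # about [0, 0]: the best point on the line x = y
print(res_ineq.x)  # about [1, -1]: the unconstrained minimum already satisfies x >= y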

Path Optimization

A lot of real-life optimization problems cannot be formulated using the above formalism.

Take Google Maps: that tool can provide you with the fastest public transport route. It is indeed an optimization problem. Google is trying to find the route that minimizes the overall transportation time from a starting point to a destination. What roughly happens is that Google will browse through all the possible paths that you could take and choose the fastest. Now, the formulation described in the last section is suited to finding an optimal point of a given function, whereas here we are trying to find the optimal function (a function here is taken to be equivalent to a path) out of a collection of functions (paths). It looks like we should take functions as the varying parameters, and find the optimal function. Obviously, we need a better formulation.

The first attempt at solving a problem of this type came with the famous brachistochrone problem: find the trajectory along which a bead sliding between two given points reaches the end in the shortest time.

Brachistochrone problem, source: Wikipedia

Intuitively, we could say that if we decompose each trajectory into small segments, and find out the time cost for the bead to move along each of these segments, then by summing them we can find the total time for each trajectory. Put into a mathematical formulation:
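J = Σᵢ Tᵢ (the sum of the times spent on each small segment between a and b),  J* = min of J over all possible trajectories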

where a and b are respectively the starting and ending points of the problem, and J represents the total cost of our path, here the total travel time of the bead, J* being the optimal cost.
So now, what does this mean? Contrary to the previous formulation where we find the minimum over one variable (or a vector of variables), here we find the minimum over a set of paths, as can be seen in the brachistochrone example. Still, each path can be associated with a single value, its cost. We have a mapping from functions (paths) to real numbers; such a mapping is called a functional.
Here, the optimal path is the path that leads to the optimal total cost J*.

Now, this formulation looks a bit too simple. Let's try to reformulate. We can see that the time for the bead to traverse a segment depends on multiple things. First, the length of the segment that we are considering. If we define the curve by a function y so that each point of the curve can be described by its coordinates (x, y(x)), then the time T needed for the bead to roll along a segment depends at least on x and y(x). Since we know that gravity plays a role as well, we can imagine that the speed of the bead is directly related to the inclination of the segment: the more inclined, the faster the speed. Therefore, the time T also depends on the inclination, in other words, on y'(x). All in all, the function T depends on x, y, and y'. We now have the formulation:
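J(y) = ∫ₐᵇ T(x, y(x), y′(x)) dx,  and we look for the path y* that minimizes J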

The general formulation in the context of the calculus of variations

In a similar fashion, the optimal path y* is the path that leads to the optimal total cost J*.
Now, this is actually the general formulation for a field of mathematical analysis called the "calculus of variations", and guess what, this field was largely started because of the brachistochrone problem.

If seeing a variable of a function as the derivative of another function bothers you, just think about it as a normal variable. It could just as well be written z for all we know; the only reason we keep it written as a derivative is that the common tool for solving this family of problems, the Euler-Lagrange equation, takes full advantage of this fact.
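For reference, in the notation above, the Euler-Lagrange equation states that an optimal path y must satisfy:

∂T/∂y − d/dx (∂T/∂y′) = 0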

The name "calculus of variations" sounds a bit enigmatic at first, but the logic behind it is rather simple; we need to look more precisely at what the Euler-Lagrange equation is trying to achieve. It follows the same principle as finding the minimum of a function: we sample a point of the function to optimize, and we look around to see whether the nearby points are all above our considered point (by looking at the gradient). If that's so, we are at a local minimum. Well, it is the same with paths. We take a path, distort it a little bit in multiple ways (by taking the gradient of the integral), and look at whether the distorted paths all result in a higher cost. Thus we are looking at small variations over paths… calculus of variations.

Globally, we can use the same formulation to represent a vast number of problems characterized by a local cost that is accumulated over a path to form a total cost. Obviously, all navigation problems fit in, but not only those. Economists are also especially interested in this kind of problem, as the term "cost" itself may suggest…

From Path Optimization to Lagrangian Mechanics

Now comes some magic.

Have you ever heard about Lagrangian mechanics? The concept is quite interesting. Instead of formulating the equations of motion in a reference frame as Newtonian mechanics does, Lagrangian mechanics tries to get rid of any particular frame by reasoning in terms of energy. The analog of the famous F = m*a in Lagrangian mechanics then becomes:
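S = ∫ L(t, y, y′) dt,  and the physical trajectory is a stationary point of the action S (where the Lagrangian L is the kinetic energy minus the potential energy of the system)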

It does get a bit wordy, but in the end, we can see that in the case of classical mechanics, the Lagrangian function L can be thought of as the cost of motion. Compared to our previous path optimization formulation, the variables have changed a bit, but overall it is the same thing again. Instead of integrating over a distance, we are integrating over time, which is equivalent (we could have done so for the brachistochrone problem).
Recall from the first part that a stationary point of a function is a point at which the gradient of the function vanishes. We need to be a bit more precise in the formulation here, as the functional depends on both y and t. Here the gradient would be taken with respect to y, as we are looking at differences between paths.

So if we consider that the functional S represents the total cost for a given system to move, finding a stationary point of S then means that, in most cases, we are trying to find the minimum of S! Rephrasing: the trajectory of a system is the one that minimizes the total cost of motion over a time interval. And here it is not us trying to optimize some value; these are the laws of motion of the system. This means we have another indication that nature is trying to minimize its expense of energy, and this has been very elegantly put into equations by Joseph-Louis Lagrange.

We can see the closeness between the calculus of variations formulation and the Lagrangian mechanics formulation. This is no coincidence: Lagrange is one of the main contributors to this mathematical field, and the equations solving this family of problems even bear his name. In some sense, we can say that Lagrangian mechanics is the calculus-of-variations way of formulating the laws of motion (if that makes sense?).

Lagrange did not stop there though. Maybe you already saw his name somewhere else in this article. Indeed, the Lagrange multipliers appeared when we wanted to find the minimum of a function under constraints.
The reason why Lagrange did not stop there is that the formulation we just presented is only applicable to a few cases. If we dig a bit deeper, we find out that, since the terms in the Lagrangian are related to the energy of the system, the Lagrangian formulation as written above can only encompass conservative forces (as the potential energy of a conservative force depends only on the current state of the system and not on past states). The problem is, most forces in the wild are not conservative. Take the example of the pendulum:

Source: Wikipedia

We have two forces in play: gravity (mg) and the tension of the rod T. Gravity is a conservative force, no problem, but the tension of the rod is not. That means that, as things stand, we cannot solve the equations of motion with the Lagrangian formulation.
Unless we state the tension T as a constraint over the position of the pendulum.
In the same fashion, we can then solve this minimization problem with constraints using the Lagrange multipliers. Indeed, without them, the formulation is pretty ineffective at solving real-world problems, and Lagrange did push this idea of multipliers together with his formulation of the equations of motion in his treatise "Mécanique analytique". The general formulation of the problem becomes:
The curve y describes the trajectory of the system if y is a stationary point of the functional:
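S = ∫ [ L(t, y, y′) + Σᵢ λᵢ(t)·fᵢ(y, t) ] dt,  where the fᵢ(y, t) = 0 are the constraints and the λᵢ the associated Lagrange multipliers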

Rephrasing: the formulation can be thought of as finding the trajectory of a system such that the system minimizes its expense of energy subject to constraints.

Control Theory

Finally, we are there. Let’s see how we can transpose all the results that we saw above to the field of control.

First, what is the so-called Control Theory?

Let's take a simple example. If you fire a bomb from a cannon, since you know about the laws of motion (maybe you are an expert in Lagrangian mechanics), you are able to predict where this bomb is going to fall.
Now if there is a gust of wind, the bomb may deviate from its initial trajectory and end up falling on an undesired target. Moral of the story? Don't trust physicists, trust engineers.

Source: Youtube, FLEX TAPE® Commercial

What do engineers do? "Why not add some propellers to the bomb?" one of them says. "Like this, we can adjust the trajectory of the bomb in real time to force it to follow the desired path!" "Great," another engineer says, "but how do we decide how much thrust to give to the propellers?" And that (in my imagination) is how the field of control theory came to light.

The control theory idea is:
— Take a dynamic system having its own laws of evolution through time
— Add a unit that enables you to change the state of the system
The system thus has an input value (what we provide to the unit) and an output value (the new state of the system).
— Take a corrective action that drives the system to the desired state by feeding input values to the system. The part in charge of feeding adequate input values to the system so that the desired state is reached is called the controller.

Control theory is mostly interested in how to define this controller. Given a desired state to reach, how do we provide input values to the system so that this state is reached as fast as possible, with maximal precision, and while ensuring robustness, that is, making sure that the system keeps a certain level of stability and does not degenerate?

In this context, a missile is a system in evolution according to the laws of motion. The propellers are the unit enabling corrective actions. The input value of the system is the thrust of the propellers, the output is the position and velocity of the missile.
Similarly, an air conditioner is a unit allowing us to take corrective actions on the temperature of the room subject to the laws of thermodynamics. The input of the overall (room + air conditioner) system is the voltage provided to the air conditioner, the output is the new room temperature.
And when we enter a room and turn on the device, we want the desired temperature to be reached as fast as possible, but we also don't want the room to turn into a fridge for no apparent reason (robustness).

The visualization tool of choice for control theory is the block diagram. It enables us to see the overall input that we are trying to add on top of our system's original dynamics. For example:

Open-loop block diagram

This is an example of an open loop. The value x(t) corresponds here to the desired state of the system. We first apply our controller action, then feed this value to our system. In this case, the action of the controller does not depend on the resulting system output. Looking at the air conditioner example, it would be equivalent to fixing the voltage of the air conditioner so that it outputs a stream of air at constant temperature. And if you decide to open your window, the controller won't adapt. You might then not be able to reach the desired temperature.

Engineers then created the notion of the closed loop. Basically, it consists in looking not at the desired state, but at the difference between the current state of the system and the desired state. That way, the controller becomes much more robust against potential perturbations (opening the window).

Closed-loop block diagram

As we can see in the diagram above, what is fed to the controller is not the desired state x anymore, but the error e, which is the difference between the desired state and the output of the system. In addition, we note the presence of a sensor. In most cases, the output of a system is not directly known by the controller and needs to be measured. An air conditioner needs a sensor to determine the new room temperature, and a missile needs a tracking tool to determine its location in space.
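To make the closed loop concrete, here is a toy simulation of a proportional controller driving a room temperature toward a setpoint, in the spirit of the air conditioner example (all the constants below are made up for illustration):

# Toy closed-loop simulation of a proportional controller (illustrative constants only).
desired_temp = 21.0      # setpoint (deg C)
room_temp = 30.0         # initial room temperature (deg C)
outside_temp = 30.0      # the room slowly drifts back toward this temperature
kp = 5.0                 # proportional gain of the controller
dt = 1.0                 # time step (minutes)

for step in range(30):
    error = desired_temp - room_temp      # e = desired state - measured output
    control = kp * error                  # controller: u = Kp * e (cooling power)
    # Plant: the temperature drifts toward the outside temperature,
    # plus the effect of the control input on the room.
    room_temp += dt * (0.05 * (outside_temp - room_temp) + 0.1 * control)

print(round(room_temp, 2))  # about 21.8: close to the setpoint, with the small
                            # steady-state error typical of a pure proportional controller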

Optimal Control

After this introduction to the theory of control, we can focus on one of its sub-branches, optimal control.

As we saw, control focuses on multiple aspects: stability, precision, and speed. In optimal control, we define a cost that can be computed at each time step and which reflects the quantity that we want to minimize. It could be time, or precision, or both at the same time, and we can also have other quantities to optimize (fuel consumption, etc.).

In the first place, let's consider that the quantity we want to optimize is time. Given an initial system state and a desired state to reach, the goal is to find the sequence of inputs to the system that steers it to the desired state in the shortest time possible. In other words, if we have found the optimal controls, there cannot be another sequence of controls that would bring the system to the desired state in a shorter amount of time.
Let's take the example of a car, which can be controlled along a straight line by choosing how much we press the gas pedal. The optimal control problem can be expressed as: starting from A and stopping at B, how should the driver press the gas pedal to stop at B as fast as possible? Or in control terms, the problem is equivalent to saying "find the time-optimal sequence of controls u(t) (the angle of the gas pedal) from A to B" (resulting in a motion of the car x(t)).

Given our experience on optimality, we would almost want to put this into a mathematical formulation. Something like this:
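J = ∫ 1 dt from t0 to tf = tf − t0,  to be minimized, with the car starting at A at time t0 and stopped at B at time tf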

Now if you are still following, you might be wondering: how are we supposed to find u(t) and x(t) with that? They are not even in the integral… That's where dynamics kick in. All the systems considered in optimal control are dynamic systems, which means that they follow a law of evolution through time of the form x'(t) = f(x(t)).
If you take the air conditioner, the temperature is governed by the laws of thermodynamics, and the missile by the laws of motion, as is our considered car. Don't you think that these should be taken into account to know how fast we can get to point B? No need to make things complicated, let's just add the dynamics as a constraint of the minimization problem:
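minimize J = ∫ 1 dt from t0 to tf,  subject to x′(t) = f(x(t))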

Still no controls u in the equation? Let's think a bit… The air conditioner follows the laws of thermodynamics, but we are messing with them by adding an influx of colder air. The missile has a nice free-falling trajectory, but by activating the propellers, we do modify its trajectory. Could it be that the equation of the dynamics needs to take the controls into account as well?
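minimize J = ∫ 1 dt from t0 to tf,  subject to x′(t) = f(x(t), u(t))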

Now that does look better. Let's try to move toward a general case. Here we only wanted to optimize the total time of the problem. What if we also want to consider the total gas consumption? The gas consumption is directly related to the controls u that we apply to the gas pedal, therefore they should appear in the cost. And what if you are driving from Alaska to California and suffer a lot from the cold? Then you had better get away from Alaska fast (while you would be okay around California). It looks like the cost would also need to depend on the position x(t) of the car.

An optimal control problem should then be formulated as follows:
Find the sequence of controls u(t) for every t such that we minimize the quantity:
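J = ∫ L(t, x(t), u(t)) dt from t0 to tf,  subject to the dynamics x′(t) = f(x(t), u(t))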

General formulation of an optimal control problem

Note that the function representing the cost is written with a capital L. L like… Lagrange? Indeed, control theorists decided to name it the Lagrangian, as this optimal control formulation is derived from the calculus of variations. The difference is that this Lagrangian does not depend on (t, x, x′) but on (t, x, u). But as we can see, the equation of the dynamics links u to x′, so it is about the same really.

Great, it looks like this equation can be solved in the framework of the calculus of variations using the Euler-Lagrange equations…
Wait a minute… The constraint is non-holonomic (it cannot be expressed as a restriction on the trajectory's positions alone)? The Euler-Lagrange equations can't be used? So you are telling me that we went through all of this formalism on the calculus of variations for nothing? (No!)

Fortunately, some talented minds were able to derive results for this optimal control formulation using strategies from the calculus of variations. For example, Lagrange multipliers are still used in the derivation to take other, holonomic constraints into account. The theorem resulting from this work is called the Pontryagin maximum principle.
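Very roughly, and omitting many technical conditions, the principle introduces a costate λ(t) and a Hamiltonian built from the Lagrangian and the dynamics, and states necessary conditions of optimality along these lines:

H(t, x, u, λ) = L(t, x, u) + λᵀ·f(x, u)

x′ = ∂H/∂λ,  λ′ = −∂H/∂x,  and the optimal control u*(t) minimizes H along the optimal trajectory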

However, the field of optimal control is by far not limited to Pontryagin and his theorem. As implementing the latter is computationally very heavy, other methods have emerged, each of them able to find the optimal controls in specific cases. One of them is called Dynamic Programming, a branch that would later lead to Reinforcement Learning and Deep Reinforcement Learning. There are also different cases to consider, such as finite-horizon versus infinite-horizon problems (whether we have time limits or not), and continuous versus discretized representations. Some solutions, such as the Pontryagin principle, are analytical; others are numerical.

As we can see, the field of optimal control has plenty of methods to solve the initial problem: finding the best controls optimizing a given cost. However, all these methods start from the same formulation that we presented above. The literature on the topic often has a steep learning curve and rarely presents any context for the problem formulations that are used. This article was written precisely to provide some of that introductory background, and I hope it can help you go further into the field of optimal control.

To go further:



Robotics Software Engineer, dreaming of creating robots with intelligence beyond the scope of the human brain. https://github.com/Guillaume-Cr