Thoughts and Theory

The idea of autonomous systems excites me, and applying reinforcement learning to everything to achieve autonomy seems tempting. But it is not always that easy, and using optimal control might sometimes be the better solution. While I do see big potential in combining optimal control and machine learning to enhance the performance of physical systems (e.g. Learning Model Predictive Control [1]), I don’t want to go into too much detail on that topic here, maybe in an upcoming post. What I do want to demonstrate in this post are the similarities (and differences), on a high level, between optimal control and reinforcement learning, using a simple toy example that is famous in both the control engineering and the reinforcement learning community: the Cart-Pole from OpenAI Gym. I will not reimplement a reinforcement learning algorithm for the Cart-Pole (there is no need to reinvent the wheel); instead, I encourage you to read the post by Matthew Chan about applying Q-Learning to the Cart-Pole task or this nice tutorial on arXiv.org.
I will start off by briefly covering the Cart-Pole, then go into more detail on an optimal control approach to solving the task and its implementation in Python, and end with a short discussion. So let’s go!
The Cart-Pole

The Cart-Pole consists of a pole connected to a horizontally moving cart. To solve the task, the pole has to be balanced by applying a force F to the cart. The system is nonlinear, since the rotation of the pole introduces trigonometric functions into the force balance equations. Furthermore, the system’s equilibrium with the pole in the upright position is unstable, as small disturbances will cause the pole to swing down.
The states of the Cart-Pole are the distance s of the cart, the velocity ṡ of the cart, the angle of the pole θ and the angular velocity of the pole θ̇. The parameters are mₚ as the mass of the pole, mₖ as the mass of the cart and Jₚ as the moment of inertia of the pole. The equations describing the dynamics of the system, as they are implemented in the Cart-Pole environment of OpenAI Gym, can be seen here.
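For reference, they have roughly the following form (a sketch rather than a verbatim copy of the Gym source, with l denoting half the pole length as in Gym’s convention; the 4/3 factor stems from the moment of inertia Jₚ of a uniform pole rotating about its end):

$$
\ddot{\theta} = \frac{g\sin\theta - \cos\theta\,\frac{F + m_p l\,\dot{\theta}^2\sin\theta}{m_k + m_p}}{l\left(\frac{4}{3} - \frac{m_p\cos^2\theta}{m_k + m_p}\right)},
\qquad
\ddot{s} = \frac{F + m_p l\left(\dot{\theta}^2\sin\theta - \ddot{\theta}\cos\theta\right)}{m_k + m_p}
$$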
Now that we have defined our system and its dynamics, let’s try to control it using optimal control!
Optimal Control of the Cart-Pole
In this section I will provide links to videos made by Brian Douglas in cooperation with MATLAB as he explains the fundamentals of control beautifully!

To control the cart we will design a linear quadratic regulator (LQR), which will give us an optimal control gain K. We feed back the states x of the environment, and K determines our input u into the system – the force F that we want to apply to the cart to balance the pole.
However, as the name LQR already suggests, we need a linear model of our system, but when we analyzed our system we saw that it is nonlinear. Fortunately, control theory has a tool called linearization, which corresponds to a first-order Taylor expansion around a desired operating point (here’s a video covering the fundamentals of linearization). If we perform the linearization around the upper equilibrium of our system, where our states are x = (s, ṡ, θ, θ̇)ᵀ = (0, 0, 0, 0)ᵀ, we obtain a linear system which can be written in state space representation as:
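As a sketch of what this looks like for the Gym dynamics above (writing M = mₖ + mₚ and D = l(4/3 − mₚ/M)), the linearized model is of the form

$$
\dot{x} = A\,x + B\,u, \qquad
A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & -\frac{m_p l}{M}\frac{g}{D} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{g}{D} & 0 \end{pmatrix}, \qquad
B = \begin{pmatrix} 0 \\ \frac{1}{M} + \frac{m_p l}{M^{2} D} \\ 0 \\ -\frac{1}{M D} \end{pmatrix},
$$

with the input u being the force F.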

Now that we have our state space representation, we can start designing the LQR. On a high level, the LQR tries to find an optimal control gain K by minimizing the cost function below,
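For a continuous-time system, this cost has the standard quadratic form

$$
J = \int_0^\infty \left( x^\top Q\,x + u^\top R\,u \right) \mathrm{d}t,
$$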

where Q and R are weight matrices which have to be defined by us. Q states how important performance is, and R how expensive the usage of our control is. Therefore, Q and R define what we actually regard as optimal. Minimizing this cost looks hard at first, but since the cost function is quadratic, we know that there exists one unique minimum, which can be found by substituting the state feedback law u = -Kx and setting the derivative to zero. What we then get is the algebraic Riccati equation:
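In its continuous-time form it reads

$$
A^\top P + P A - P B R^{-1} B^\top P + Q = 0.
$$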

After solving for P we can calculate our controller gain as K = R⁻¹BᵀP. This is all the math we need, so let’s get to the fun part, the application, and look at how we can implement this using Python and OpenAI Gym.
First, let’s define the linearized dynamics of the system:
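A minimal sketch in Python could look like this; the numerical values are the defaults from the Gym Cart-Pole source (taken here as assumptions), and the matrices follow the linearization sketched above:

```python
import numpy as np

# Default parameters of the Gym Cart-Pole (assumed values)
g = 9.8      # gravitational acceleration
m_k = 1.0    # mass of the cart
m_p = 0.1    # mass of the pole
l = 0.5      # half the pole length (Gym convention)

M = m_k + m_p                    # total mass
D = l * (4.0 / 3.0 - m_p / M)    # denominator term from the Gym dynamics

# Linearized dynamics x_dot = A x + B u around the upright equilibrium,
# with state x = (s, s_dot, theta, theta_dot) and input u = F
A = np.array([
    [0.0, 1.0, 0.0,                      0.0],
    [0.0, 0.0, -(m_p * l / M) * (g / D), 0.0],
    [0.0, 0.0, 0.0,                      1.0],
    [0.0, 0.0, g / D,                    0.0],
])
B = np.array([
    [0.0],
    [1.0 / M + m_p * l / (M**2 * D)],
    [0.0],
    [-1.0 / (M * D)],
])
```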
Second, let’s calculate the optimal controller:
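One way to do this is with SciPy’s solver for the continuous-time algebraic Riccati equation; the particular Q and R below are just an assumed starting point and can be tuned:

```python
import numpy as np
from scipy import linalg

# Weight matrices: Q penalizes state deviations, R penalizes control effort
Q = np.eye(4)          # assumed choice, tune to taste
R = np.array([[1.0]])  # a larger R makes using the force F "more expensive"

# Solve the continuous-time algebraic Riccati equation for P ...
P = linalg.solve_continuous_are(A, B, Q, R)

# ... and compute the feedback gain K = R^-1 B^T P
K = np.linalg.inv(R) @ B.T @ P
```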
And last, let’s define a function that we can call to calculate the input force F during runtime:
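A possible sketch (the helper name apply_state_controller is purely illustrative); it returns both the discrete Gym action, i.e. the direction of the push, and the continuous force u = -Kx:

```python
def apply_state_controller(K, x):
    """State feedback u = -K x; returns the Gym action (direction) and the force."""
    u = (-K @ x).item()         # force we would like to apply to the cart
    action = 1 if u > 0 else 0  # Gym's Cart-Pole: action 1 pushes right, 0 pushes left
    return action, u
```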
Now we can set up the simulation!
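A rough version of the loop is sketched below; it assumes the classic Gym API, where reset() returns only the observation and step() returns four values, and it uses the unwrapped environment so that force_mag can be overridden and more than 200 steps can be simulated:

```python
import gym
import numpy as np

# Unwrapped environment: lets us override force_mag and run past the step limit
env = gym.make('CartPole-v0').env
obs = env.reset()
trajectory = []

for step in range(400):
    env.render()

    # direction and desired force from the state feedback controller
    action, force = apply_state_controller(K, obs)

    # override the fixed 10 N magnitude; `action` then only sets the direction
    env.force_mag = float(np.clip(abs(force), 0.0, 10.0))

    obs, reward, done, info = env.step(action)
    trajectory.append(obs)

env.close()
```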
As a technical detail: in the original Cart-Pole the force magnitude is fixed to 10 N, but it can be adjusted at each time step by overriding force_mag (as in the sketch above); the variable action then only defines the direction of the push.
Results

If we now run the simulation, we can plot our states over the time steps. The Cart-Pole task is considered solved when 199 time steps are reached without the pole falling over. However, since we designed the controller and did not have to learn it, the 199 time steps are always achieved. So let’s have a look at the first 400 time steps! What we can observe from the diagram are mainly two things:
- Our system oscillates. We could have looked at the poles of our closed-loop system and seen that we have complex conjugate poles, but I did not want to go into too much detail in the analysis of the system. In fact, we could even place the poles ourselves to obtain a non-oscillating system by using pole placement. However, if we do so, we are no longer guaranteed to solve the task optimally!
- All states converge to 0. This is what we hoped to achieve with our controller. All states converging to zero means that the system is converging towards the upper equilibrium point around which we linearized it. Therefore, the state feedback controller stabilizes the unstable upper equilibrium point – so that’s pretty cool!

We can also evaluate how the choice of R affects the force which is applied to the system. The usage of our actuating variable (the force F) is "cheap" for a small weight matrix R, so the magnitude of F is large. As we increase R, the magnitude of F decreases, since we have increased "the cost of using F". This is to be expected, as R is defined for exactly this purpose, but it is still nice to see that the math behind it all actually works!
Discussion
We have solved the Cart-Pole task from OpenAI Gym, which was originally created to validate reinforcement learning algorithms, using optimal control. Q-Learning in the post by Matthew Chan was able to solve this task in 136 iterations. We only needed one iteration – in fact, we needed zero iterations! Assuming that our dynamics model of the system is correct, we can mathematically guarantee stability and hence guarantee to solve the task, without actually running a single iteration. And we control the system optimally, since we used an LQR, which minimizes the cost function stated above. So if we need zero iterations and are guaranteed to solve the task optimally, why do we need reinforcement learning? Why don’t we just always use optimal control?

- Sometimes, we can’t! For the control approach we need to know the system dynamics. Sometimes it is not possible to model the system because it is too complex, and then we cannot use such control algorithms. For reinforcement learning, we don’t need any prior knowledge of the system: a reinforcement learning algorithm can learn a model of the dynamics (model-based reinforcement learning) or try to solve the task without such a model (model-free reinforcement learning, e.g. Q-Learning).
- Sometimes, it’s too expensive! Engineering is expensive, so what if we could reduce cost by applying or creating algorithms which reduce the engineering effort required to control a system? Also, in simulations, trials (iterations) are very cheap, and we can run many of them to collect data and create powerful agents. While we could attempt to model the simulation, such as a game, and create knowledge-based controllers, this would be very time-inefficient. However, once we move to physical systems, things are very different. Running the Cart-Pole experiment 1000 times on real hardware takes a long time, and at some point the physical system will fail. In that time it might have been easier to simply set up the force balance equations.
- Sometimes, we don’t want to! Autonomous systems are fascinating! Look at PILCO [2] learning to swing up and balance a pole within just 7 trials – without any prior knowledge! Now if that doesn’t get you excited, nothing will!
Ok, but when should you use what? There are many questions which have to be answered here. If the task is some kind of simulation, e.g. winning a game, the answer should probably be reinforcement learning (look at AlphaGo [3] or AlphaFold [4]). But if we look at physical systems such as robots, more questions arise: How expensive are iterations in terms of time, wear and tear? How expensive is modeling? Is the system safety-critical? Do we need mathematical guarantees? In the end, it comes down to questions like these. And powerful approaches that combine control theory and machine learning are already emerging, with more to come!
Key Takeaways
The first key takeaway from this post should be to realize that optimal state feedback control and reinforcement learning are very similar in the way they are set up. In both, we measure or receive information about the states of the environment. We then use this information about the state to interact with the environment through an action (in our case, this action was applying the force F) and influence the behavior of the environment (in our case, stabilizing the unstable upper equilibrium point). And both approaches try to do this using some measure of optimality: for the LQR it was minimizing a cost function, for Q-Learning it is maximizing reward. The second key takeaway of this post should be that the main difference between optimal control and reinforcement learning is the prior knowledge we have of our system – do we know the dynamics or do we not know them?
Some parts of this post are oversimplified, both from a control perspective and from a reinforcement learning perspective, but I tried to paint the "big picture". If you found any mistakes or have any questions, feel free to comment! Thank you for reading!
References
[1] U. Rosolia and F. Borrelli, Learning Model Predictive Control for iterative tasks. A Data-Driven Control Framework (2016), CoRR
[2] M. P. Deisenroth and C. E. Rasmussen, PILCO: A Model-Based and Data-Efficient Approach to Policy Search (2011), Proceedings of the 28th International Conference on Machine Learning
[3] D. Silver et al., Mastering the game of Go with deep neural networks and tree search (2016), Nature
[4] A. W. Senior et al., Improved protein structure prediction using potentials from deep learning (2020), Nature