
In the previous two articles, (1) Markov States, Markov Chain, and Markov Decision Process, and (2) Solving Markov Decision Process, I set up a foundation for developing a detailed understanding of reinforcement learning (RL). The RL problem is formulated as a Markov Decision Process (MDP), which can be solved for an optimal policy (i.e., which action an agent needs to take to achieve a goal). The solution can be obtained either by sweeping through the entire state space or by recognizing that the overall problem can be solved recursively from smaller subproblems. The latter is the topic of dynamic programming.
Dynamic Programming
Dynamic programming is a way to solve complex problems by splitting them into smaller, simpler subproblems. It is a bottom-up approach: the smaller subproblems are solved first, and their solutions are then combined, working upward, to solve the larger problem.
In order to use dynamic programming, a problem must have two specific characteristics:
- Optimal Substructure: The optimal solution to the overall problem can be constructed by combining the optimal solutions of the smaller sub-problems.
- Overlapping Sub-problems: The smaller sub-problems within the larger problem overlap with each other, meaning they share some common elements. This allows us to reuse solutions to sub-problems that have already been solved, rather than having to solve them multiple times.
When a problem has both of these characteristics, dynamic programming lets us solve it efficiently, as the small sketch below illustrates.
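As a quick, RL-agnostic illustration of these two characteristics, consider the classic minimum-coin-change problem. The bottom-up sketch below is a made-up example (not from the references): it builds optimal answers for small amounts and reuses them for larger ones.

```python
def min_coins(coins, amount):
    """Minimum number of coins needed to make `amount` (bottom-up DP).

    Optimal substructure: the best answer for `amount` is 1 + the best
    answer for `amount - coin`, minimized over all coin denominations.
    Overlapping subproblems: the same smaller amounts are needed by many
    larger amounts, so we store them in `best` and reuse them.
    """
    INF = float("inf")
    best = [0] + [INF] * amount          # best[a] = min coins for amount a
    for a in range(1, amount + 1):       # solve smaller subproblems first
        for coin in coins:
            if coin <= a and best[a - coin] + 1 < best[a]:
                best[a] = best[a - coin] + 1
    return best[amount] if best[amount] < INF else None


print(min_coins([1, 5, 10, 25], 63))     # -> 6  (25 + 25 + 10 + 1 + 1 + 1)
```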
We can understand this in the context of longitudinal control in automated driving, where the goal is to minimize fuel consumption. Consider a vehicle whose state consists of (v, a), i.e., velocity and acceleration. We can define the set of possible actions as accelerating, decelerating, or maintaining a constant speed, and let f(v, a) be the fuel consumption as a function of speed and acceleration. Such a function is a reasonable model of fuel consumption (see [4]).
We can use dynamic programming to find the optimal sequence of actions that minimizes the total fuel consumption over a given time horizon.
We can define a cost function C(t, s) that represents the minimum fuel consumption required to reach state s at time t, and compute it with the following recursion:

C(t, s) = min over s' [ C(t-1, s') + f(s) ]

where s' ranges over the predecessors of s (the states from which s can be reached in one step) and f(s) = f(v, a) is the fuel consumed in state s.
We can obtain an optimal sequence of actions by tracking the action that was taken at each time step to reach the minimum cost at that time. For example, suppose at time t the vehicle is in state s = (v, a), and the minimum cost to reach this state is C(t, (v, a)). If that minimum cost was achieved by taking the action "accelerate" at time t-1, then the optimal sequence of actions includes "accelerate" at time t-1.
A Python-style pseudocode implementation of the above example can be found at https://gist.github.com/rahulbhadani/a119916d26e251b939b817dde03ad032.
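For readers who prefer a self-contained inline illustration, below is a minimal sketch of the same recursion, assuming a coarse speed grid, a one-second time step, and a made-up fuel model fuel(v, a); none of these numbers come from [4].

```python
import itertools

# --- Hypothetical discretization and toy fuel model, made up for illustration ---
SPEEDS = [float(v) for v in range(10, 21)]   # speed grid, m/s
ACCELS = [-1.0, 0.0, 1.0]                    # acceleration grid, m/s^2
ACTIONS = {"decelerate": -1.0, "hold": 0.0, "accelerate": 1.0}
DT = 1.0                                     # time step, s
HORIZON = 5                                  # number of time steps

def fuel(v, a):
    """Toy stand-in for f(v, a): fuel burned during one step at speed v, accel a."""
    return 0.1 + 0.02 * v + 0.3 * max(a, 0.0) ** 2

def step(state, action):
    """Deterministic successor state (v', a') after applying an action."""
    v, _ = state
    a_new = ACTIONS[action]
    v_new = min(max(v + a_new * DT, SPEEDS[0]), SPEEDS[-1])   # clamp to the grid range
    return (v_new, a_new)

def plan(start):
    """Forward DP: C[t][s] = min fuel to reach state s at time t, with backtracking."""
    states = list(itertools.product(SPEEDS, ACCELS))
    INF = float("inf")
    C = [{s: INF for s in states} for _ in range(HORIZON + 1)]
    parent = [{} for _ in range(HORIZON + 1)]   # parent[t][s] = (predecessor, action)
    C[0][start] = 0.0
    for t in range(1, HORIZON + 1):
        for s_prev in states:
            if C[t - 1][s_prev] == INF:
                continue
            for action in ACTIONS:
                s = step(s_prev, action)
                cost = C[t - 1][s_prev] + fuel(*s)   # C(t, s) = C(t-1, s') + f(s)
                if cost < C[t][s]:
                    C[t][s] = cost
                    parent[t][s] = (s_prev, action)
    s_best = min(C[HORIZON], key=C[HORIZON].get)     # cheapest reachable final state
    best_cost, s, actions = C[HORIZON][s_best], s_best, []
    for t in range(HORIZON, 0, -1):                  # walk back to recover the actions
        s, a = parent[t][s]
        actions.append(a)
    return list(reversed(actions)), best_cost

print(plan(start=(15.0, 0.0)))
```

Running the sketch simply prints the recovered action sequence and its total (toy) fuel cost; with this particular fuel model, the cheapest plan is to decelerate toward the lowest grid speed.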
Approximate Dynamic Programming
Approximate dynamic programming (ADP) is used for solving problems that are large and complex, and usually (but not always) stochastic.
ADP is used to overcome the curse of dimensionality that typically arises when solving Bellman's equation exactly over a large state space.
ADP is about learning what to learn, and how to learn it, to make better decisions over time [5].
In ADP, the true value function Vπ(s) is replaced by a statistical approximation V̂π(s). Further, unlike exact dynamic programming, ADP typically steps forward in time (with several algorithmic variations). The approximate function V̂π(s) can be obtained in a number of ways, such as:
- Look up tables
- Parametric models
- Non-parametric models
Alternatively, the Monte Carlo method can be used for approximating the expectations.
Methods such as discrete models, neural networks, and kernel regression can be viewed as instances, or combinations, of the above three strategies. A detailed treatment of value-function approximation deserves its own article, but the sketch below gives a flavor of the parametric route.
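The following minimal sketch fits a linear (parametric) approximation V̂(s) = w · phi(s) to Monte Carlo estimates of the return. The random-walk chain, the fixed random policy, and the hand-picked features phi(s) = (1, s, s²) are all assumptions made up for this illustration, not taken from the references.

```python
import random
import numpy as np

# --- Toy setup: everything below is hypothetical, chosen only for illustration ---
GAMMA = 0.9
N_STATES = 5                     # small random-walk chain with states 0..4

def features(s):
    """Hypothetical feature vector phi(s) for state index s."""
    return np.array([1.0, float(s), float(s) ** 2])

def sample_episode(start=2, max_len=30):
    """Roll out a fixed random policy; reward +1 only upon reaching the last state."""
    s, traj = start, []
    for _ in range(max_len):
        s_next = min(max(s + random.choice([-1, 1]), 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        traj.append((s, r))
        s = s_next
        if r > 0:
            break
    return traj

def monte_carlo_fit(n_episodes=2000):
    """Least-squares fit of w so that w . phi(s) approximates the observed returns."""
    X, y = [], []
    for _ in range(n_episodes):
        G = 0.0
        for s, r in reversed(sample_episode()):   # discounted return, accumulated backwards
            G = r + GAMMA * G
            X.append(features(s))
            y.append(G)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return w

w = monte_carlo_fit()
print([round(float(w @ features(s)), 3) for s in range(N_STATES)])   # V_hat for each state
```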
Readers should note that with ADP, the resulting policy may not be optimal; we often have to settle for a sub-optimal policy.
Conclusion
In this article, I discussed the specifics of dynamic programming with an example of longitudinal control in autonomous driving. Dynamic programming is an efficient way to solve MDPs, which are at the heart of RL problems. In the next article of this series, I will talk about value iteration and policy iteration. A future article will be dedicated to strategies that can be employed for approximating value functions.
If you have not checked out the first two articles in the Foundational RL series, please be sure to check them out:
- https://towardsdatascience.com/foundational-rl-markov-states-markov-chain-and-markov-decision-process-be8ccc341005
- https://towardsdatascience.com/foundational-rl-solving-markov-decision-process-d90b7e134c0b
References
[1] Mohit Sewak, Deep Reinforcement Learning. https://doi.org/10.1007/978-981-13-8285-7
[2] Aske Plaat, Deep Reinforcement Learning, Springer Singapore. https://doi.org/10.1007/978-981-19-0638-1
[3] Warren B. Powell, Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions, Wiley, 2022. ISBN 9781119815051.
[4] Lee, J.W., Gunter, G., Ramadan, R., Almatrudi, S., Arnold, P., Aquino, J., Barbour, W., Bhadani, R., Carpio, J., Chou, F.C., and Gibson, M., "Integrated Framework of Vehicle Dynamics, Instabilities, Energy Models, and Sparse Flow Smoothing Controllers," in Proceedings of the Workshop on Data-Driven and Intelligent Cyber-Physical Systems, May 2021, pp. 41–47. https://doi.org/10.1145/3459609.3460530
[5] Warren B. Powell, "What you should know about approximate dynamic programming," Naval Research Logistics (NRL) 56, no. 3 (2009): 239–249. https://doi.org/10.1002/nav.20347
[6] Lucian Buşoniu, Bart De Schutter, and Robert Babuška, "Approximate dynamic programming and reinforcement learning," in Interactive Collaborative Information Systems, pp. 3–44, Springer, Berlin, Heidelberg, 2010. https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_028.pdf