
In this series of blog posts, I am excited to share with you my passion for the Reinforcement Learning (RL) paradigm. My mission is to provide an in-depth exploration of RL, combining theoretical knowledge with practical examples that tackle real-world problems.
Course details
Who is this course intended for?
This course is for anyone who wishes to learn about RL. It is recommended that you have some prior coding experience in Python and be familiar with the use of Jupyter Notebooks. Furthermore, it would be beneficial to have a basic understanding of neural networks when tackling the chapters in the second part of this course. If you lack prior knowledge of neural networks, you can still use them and treat them as black box function approximators; however, I suggest taking the time to learn more about them before progressing through the latter chapters.
Why should I care about Reinforcement Learning?
RL is an incredibly exciting field that has seen exponential research growth in recent years. Despite the successes that have been achieved, bridging the gap between academia and industry is not as straightforward as it is in other Machine Learning fields. RL is often viewed as too impractical for commercial applications, but I am here to dispel that notion and arm you with practical knowledge of RL methods to bolster your problem-solving toolbox.
What will I learn?
This is the first in a series of blog posts where you will:
- Learn to identify use cases of RL and model your real-world scenarios to fit the RL paradigm
- Become familiar with the various RL algorithms and understand their advantages and limitations
- Understand the principles of exploration and exploitation and how to balance them for optimal performance
- Analyse and interpret the results of RL experiments
Here is a sneak peek of the techniques we will cover:
- Multi-armed bandits
- Dynamic Programming
- Monte Carlo Methods
- Temporal Difference Learning
- Deep Q-Networks
- Policy Gradient Methods
- Actor-Critic Methods
Introduction
Some of the most significant breakthroughs in Artificial Intelligence are inspired by nature, and the RL paradigm is no exception. This simple yet powerful concept is closest to how we humans learn and can be seen as an essential element of what we would expect from an artificial general intelligence: learning through trial and error. This approach to learning teaches us about cause and effect, and how our actions impact our environment. It teaches us how our behaviour can either harm or benefit us, and enables us to develop strategies to achieve our long-term goals.
What is RL?
The RL paradigm is a powerful and versatile Machine Learning approach that enables decision makers to learn from their interactions with the environment. It draws from a wide range of ideas and methodologies for finding an optimal strategy to maximize a numerical reward. With a long history of connections to other scientific and engineering disciplines, research in RL is well-established. However, while there is a wealth of academic success, practical applications of RL in the commercial sphere remain rare. The most famous examples of RL in action are computers achieving super-human performance on board games such as chess and Go, as well as on video games such as Atari classics and StarCraft. In recent years, however, we have seen a growing number of industries adopt RL methods.
How is it used today?
Despite the low level of commercial adoption of RL, there are some exciting applications in fields such as:
- Health: Dynamic treatment regime; automated diagnosis; drug discovery
- Finance: Trading; dynamic pricing; risk management
- Transportation: Adaptive traffic control; autonomous driving
- Recommendation: Web search; news recommendation; product recommendation
- Natural Language Processing: Text summarization; question answering; machine translation; dialog generation
Sometimes less is more
A good way to gain an understanding of RL use cases is to consider an example challenge. Let us imagine we are trying to help our friend learn to play a musical instrument. Each morning, our friend tells us how motivated they feel and how much they have learned during yesterday’s practice, and asks us how they should proceed. For reasons unknown to us, our friend has a limited set of studying choices: Taking a day off, practicing for one hour, or practicing for three hours.
After observing our friend’s progress, we have noticed a few interesting characteristics:
- It appears that the progress our friend is making is directly correlated with the number of hours they practice.
- Consistent practice sessions make our friend progress faster.
- Our friend does not do well with long practicing sessions. Every time we instructed them to study for three hours, the next day they felt tired and unmotivated to continue.
From our observations, we have created a graph modeling their learning progress using state machine notation.
[Figure: state diagram of our friend’s three moods (neutral, motivated, demotivated), with arrows showing how each practice choice changes their mood the next day]
Let us discuss again our findings based on our model:
- Our friend has three distinct emotional states: neutral, motivated, and demotivated.
- On any given day, they can choose to practice for zero, one, or three hours, except when they are feeling demotivated – in which case studying for zero hours (or not studying) is their only available option.
- Our friend’s mood is predictive: In the neutral state, practicing for one hour will make them feel motivated the following day, while practicing for three hours will leave them feeling demotivated and not practicing at all will keep them in a neutral state. Conversely, in the motivated state, one hour of practice will maintain our friend’s motivation, while three hours of practice will demotivate them and no practice at all will leave them feeling neutral. Lastly, in the demotivated state, our friend will refrain from studying altogether, resulting in them feeling neutral the next day.
- Their progress is heavily influenced by their mood and the amount of practice they put in: the more motivated they are and the more hours they dedicate to practice, the faster they will learn and grow.
Why did we structure our findings like this? Because it helps us model our challenge using a mathematical framework called finite Markov decision processes (MDPs). This approach helps us gain a better understanding of the problem and how to best address it.
Markov Decision Processes
Finite MDPs provide a useful framework to model RL problems, allowing us to abstract away from the specifics of a given problem and formulate it in a way that can be solved using RL algorithms. In doing so, we are able to transfer learnings from one problem to another, instead of having to theorise about each problem individually. This helps us to simplify the process of solving complex RL problems. Formally, a finite MDP is a control process defined by a four-tuple:
$$(S, A, P, R)$$
The four-tuple (S, A, P, R) defines four distinct components, each of which describes a specific aspect of the system. S and A define the sets of states and actions, respectively, while P denotes the transition function and R the reward function. In our example, we define our friend’s mood as our set of states S and their practice choices as our set of actions A. The transition function P, visualised by the arrows in the graph, shows us how our friend’s mood will change depending on the amount of studying they do. Finally, the reward function R measures the progress our friend makes, which is influenced by their mood and the practice choices they make.
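To make this more tangible, here is a minimal sketch of how the example could be encoded in Python. The dictionary layout and the names (states, actions, P, R) are my own choices for illustration; only the transitions described above and the knowledge gains used later in this post are encoded, and rewards for state-action pairs the text does not mention are simply left out.

```python
# A minimal sketch of the practice example as a finite MDP.
states = ["neutral", "motivated", "demotivated"]
actions = {
    "neutral": ["0 hours", "1 hour", "3 hours"],
    "motivated": ["0 hours", "1 hour", "3 hours"],
    "demotivated": ["0 hours"],  # skipping practice is the only option here
}

# Deterministic transition function P: (state, action) -> next state
P = {
    ("neutral", "0 hours"): "neutral",
    ("neutral", "1 hour"): "motivated",
    ("neutral", "3 hours"): "demotivated",
    ("motivated", "0 hours"): "neutral",
    ("motivated", "1 hour"): "motivated",
    ("motivated", "3 hours"): "demotivated",
    ("demotivated", "0 hours"): "neutral",
}

# Reward function R: (state, action) -> units of knowledge gained.
# Only the values mentioned in this post are listed here.
R = {
    ("neutral", "1 hour"): 1,
    ("motivated", "1 hour"): 2,
    ("demotivated", "0 hours"): 0,
}
```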
Policies and value functions
Given the MDP, we can now develop strategies for our friend. Drawing on the wisdom of our favourite cooking podcast, we are reminded that to master the art of cooking one must develop a routine of practicing a little every day. Inspired by this idea, we develop a strategy for our friend that advocates for a consistent practice schedule: practice for one hour every day. In RL theory, strategies are referred to as policies or policy functions, and are defined as mappings from the set of states to the probabilities of each possible action in that state. Formally, a policy π assigns to each state s a probability distribution over the actions a available in that state.
$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$
To adhere to the "practice a little every day" mantra, we establish a policy with a 100% probability of practicing for one hour in both the neutral and motivated states. In the demotivated state, however, we skip practice 100% of the time, since it is the only available action. This example demonstrates that policies can be deterministic: instead of returning a full probability distribution over the available actions, they return a degenerate distribution in which a single action is taken exclusively.
$$\pi(\text{1 hour} \mid \text{neutral}) = 1, \qquad \pi(\text{1 hour} \mid \text{motivated}) = 1, \qquad \pi(\text{0 hours} \mid \text{demotivated}) = 1$$
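In code, this policy could be represented as a simple mapping from each state to a probability distribution over actions. A small sketch, reusing the state and action names from above:

```python
# Our "practice a little every day" policy: each state maps to a
# probability distribution over actions. All the mass sits on a single
# action, which is what makes the policy deterministic.
policy = {
    "neutral": {"1 hour": 1.0},
    "motivated": {"1 hour": 1.0},
    "demotivated": {"0 hours": 1.0},
}
```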
As much as we trust our favourite cooking podcast, we would like to find out how well our friend is doing by following our strategy. In RL lingo, we speak of evaluating our policy, which is done using a value function. To get a first impression, let us calculate how much knowledge our friend gains by following our strategy for ten days. Assuming they start out feeling neutral, they will gain one unit of knowledge on the first day and two units on each day thereafter, resulting in a total of 19 units. Conversely, if our friend had already been motivated on the first day, they would have gained 20 units of knowledge, and if they had started out feeling demotivated, they would have gained only 17 units.
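If you would like to verify these numbers yourself, a short simulation built on the P, R, and policy dictionaries sketched above reproduces them:

```python
def knowledge_after(start_state, days=10):
    """Total (undiscounted) knowledge gained over `days` days when
    following our policy from a given starting mood."""
    state, total = start_state, 0
    for _ in range(days):
        action = next(iter(policy[state]))  # the single action our policy picks
        total += R[(state, action)]
        state = P[(state, action)]
    return total

for mood in ["neutral", "motivated", "demotivated"]:
    print(mood, knowledge_after(mood))  # prints 19, 20 and 17 respectively
```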
While this calculation may seem a little arbitrary at first, there are actually a few things we can learn from it. Firstly, we intuitively found a way to assign our policy a numerical value. Secondly, we observe that this value depends on the mood our friend starts in. That said, let us have a look at the formal definition of value functions. A value function v of state s is defined as the expected discounted return an agent receives starting in state s and following policy π thereafter. We refer to v as the state-value function for policy π.
$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$
Where we define the state-value function as the expected value E of the discounted return G when starting in state s. As it turns out, our first approach is in fact not far off the actual definition. The only difference is that we based our calculations on the sum of knowledge gains over a fixed number of days, as opposed to the more objective expected discounted return G. In RL theory, the discounted return is defined as the sum of discounted future rewards:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Where R denotes the reward received at a future timestep, and each reward is weighted by a power of the discount rate, denoted by a lowercase gamma. The discount rate lies in the interval from zero to one and determines how much value we assign to future rewards. To better understand its effect on the sum of rewards, let us consider the special cases of setting gamma to zero or to one. By setting gamma to zero, we consider only immediate rewards and disregard any future rewards, meaning the discounted return equals only the reward R at timestep t+1. Conversely, when gamma is set to one, we assign all future rewards their full value, and the discounted return equals the plain sum of all future rewards.
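As a small illustration, here is how the discounted return of a finite reward sequence could be computed, along with the two special cases just discussed. The reward values are arbitrary and only serve to show the effect of gamma:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * R_{t+k+1} over a finite sequence of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [2, 2, 2, 2, 2]  # five arbitrary future rewards
print(discounted_return(rewards, 0.0))  # 2.0  -> only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 10.0 -> every reward counts in full
print(discounted_return(rewards, 0.9))  # ~8.19 -> later rewards count less
```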
Equipped with the concept of value functions and discounted returns we can now properly evaluate our policy. Firstly, we need to decide on a suitable discount rate for our example. We must discard zero as a possible candidate, as it would not account for the long-term value of knowledge generation we are interested in. A discount rate of one should also be avoided, as our example does not have a natural notion of a final state; thus, any policy that includes regular practice of the instrument, no matter how ineffective, would yield an infinite amount of knowledge given enough time. Hence, choosing a discount rate of one would make us indifferent between having our friend practice every day or once a year. After rejecting the special cases of zero and one, we have to choose a suitable discount rate between the two. The smaller the discount rate, the less value is assigned to future rewards, and vice versa. For our example, we set the discount rate to 0.9 and calculate the discounted returns for each of our friend’s moods. Let us start with the motivated state. Instead of considering only the next ten days, we calculate the sum of all discounted future rewards, resulting in 20 units of knowledge. The calculation is as follows¹:
$$v_\pi(\text{motivated}) = \sum_{k=0}^{\infty} 0.9^k \cdot 2 = \frac{2}{1 - 0.9} = 20$$
Note that by introducing a discount rate smaller than one, the sum of an infinite number of future rewards remains finite. The next state we wish to analyze is the neutral state. In this state, our agent chooses to practice for one hour, gaining one unit of knowledge, and then transitions to the motivated state. This simplifies the calculation tremendously, as we already know the value of the motivated state.
$$v_\pi(\text{neutral}) = 1 + 0.9 \cdot v_\pi(\text{motivated}) = 1 + 0.9 \cdot 20 = 19$$
As a final step, we can also calculate the value function of the demotivated state. The process is analogous to the neutral state, resulting in a value of a little over 17.
$$v_\pi(\text{demotivated}) = 0 + 0.9 \cdot v_\pi(\text{neutral}) = 0.9 \cdot 19 = 17.1$$
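The same three values can also be recovered programmatically. Since both the policy and the transitions are deterministic, each state value satisfies v(s) = R(s, a) + gamma * v(s'), where a is the action our policy picks and s' the resulting state, so we can find the fixed point by simply iterating this update. This sketch again builds on the P, R, and policy dictionaries from above:

```python
def evaluate_policy(policy, P, R, gamma=0.9, sweeps=500):
    """Iteratively approximate v_pi(s) for a deterministic policy and MDP."""
    v = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        for s in policy:
            a = next(iter(policy[s]))  # the single action chosen in state s
            v[s] = R[(s, a)] + gamma * v[P[(s, a)]]
    return v

print(evaluate_policy(policy, P, R))
# approximately {'neutral': 19.0, 'motivated': 20.0, 'demotivated': 17.1}
```

You can reuse the same function to evaluate any alternative deterministic policy you come up with, as long as you add the corresponding transition and reward entries to P and R.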
By examining the state-value functions of our policy in all states, we can deduce that the motivated state is the most rewarding, which is why we should instruct our friend to reach it as quickly as possible and remain there.
I encourage you to consider developing alternative policy functions and evaluating them using the state-value function. While some of them may be more successful in the short term, they will not generate as many units of knowledge as our proposed policy in the long term². If you want to dig deeper into the math behind MDPs, policies and value functions, I highly recommend "Reinforcement Learning – An Introduction" by Richard S. Sutton and Andrew G. Barto. Alternatively, I suggest checking out the "RL Course by David Silver" on YouTube.
What’s next?
What if our friend was not into music, but instead asked us to help them build a self-driving car, or our supervisor instructed our team to develop an improved recommender system? Unfortunately, discovering the optimal policy for our example will not help us much with other RL problems. Therefore, we need to devise algorithms that are capable of solving any finite MDP³.
In the following blog posts you will explore how to apply various RL algorithms to practical examples. We will start with tabular solution methods, which are the simplest form of RL algorithms and are suitable for solving MDPs with small state and action spaces, such as the one in our example. We will then delve into deep learning to tackle more intricate RL problems with arbitrarily large state and action spaces, where tabular methods are no longer feasible. These approximate solution methods will be the focus of the second part of this course. Finally, to conclude the course, we will cover some of the most innovative papers in the field of RL, providing a comprehensive analysis of each one, along with practical examples to illustrate the theory.
[1] In our case, the state-value function is simply equal to the discounted return. However, if the environment or the policy were to be nondeterministic, then we would need to sum over all the discounted returns, each weighted by their likelihood.
[2] Proving that the proposed policy is truly optimal in the long run is beyond the scope of this blog post. However, I encourage those interested to seek out a proof as an exercise. I would start by defining a hypothetical optimal policy, then try to demonstrate that the value function of the proposed policy is greater than or equal to the value function of the optimal policy for every state of the environment.
[3] Not all RL problems fit the mathematically idealised form of finite MDPs. Many interesting RL problems require a relaxed formulation. Nevertheless, algorithms developed for finite MDPs will often perform well on RL problems that do not fit the strict formulation.