Deep Reinforcement Learning Explained – 02

Today we start the second post in the series "Deep Reinforcement Learning Explained". As we announced in the first Post, one of the main aspects of this series is its orientation to practice; however, we need some theoretical background before we start coding. In this post, we will explore specific assumptions and abstractions in their strict mathematical form. Don't panic; your patience will be rewarded!
Here we will introduce the mathematical representation and notation of the concepts presented in the previous post, which will be used repeatedly in this series. The reader will learn to represent these kinds of problems with a mathematical framework known as a Markov Decision Process (MDP), which allows modeling virtually any complex Environment.
Often, the dynamics of the environments are hidden and inaccessible to the Agent; however, as we will see in future posts, DRL agents do not need to know the precise MDP of a problem to learn robust behaviors. But knowing about MDP is essential for the reader because agents are commonly designed with the assumption that an MDP, even if inaccessible, is running under the hood.
1. Markov Decision Processes
The Markov Decision Process (MDP) provides a mathematical framework for solving the RL problem. Almost all RL problems can be modeled as an MDP. To understand an MDP, first, we need to learn about the Markov property and the Markov process.
The Markov property states that the future depends only on the present and not on the past. The Markov process consists of a sequence of states that strictly obey the Markov property.
When an RL problem satisfies the Markov property, i.e., the future depends only on the current state s and action a, but not on the past, it is formulated as a Markov Decision Process (MDP).
For the moment, we can consider that an MDP consists basically of a five-tuple <S,A,R,p,γ>, whose components are the following (a minimal code sketch of this tuple appears right after the list):
- S – a set of states
- A – a set of actions
- R – reward function
- p – transition function
- γ – discounting factor
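For readers who think in code, here is a minimal, purely illustrative sketch of this five-tuple as a Python container (an assumption of this sketch; nothing later in the series depends on this exact class):
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    states: Sequence[int]     # S: the set of states
    actions: Sequence[int]    # A: the set of actions
    reward: Callable          # R: reward function, e.g., r(s, a, s')
    transition: Callable      # p: one-step dynamics, e.g., p(s', r | s, a)
    gamma: float              # γ: discount factor in [0, 1]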
Let’s describe each of them below; however, we need to clarify something about the mathematical notation used in this series before going ahead.
1.1 About the Mathematical Notation
Given the practical and introductory nature of this series, I will try to make the explanation understandable to a general audience without requiring them to rigorously follow the mathematical formulation.
However, I will try to maintain rigor in the formulas for those readers who want to follow the mathematical details. This is not easy, because the Medium.com editor has certain limitations for writing formulas and different font types.
Therefore, in the text, we will be a bit lax with the notation. For example, the Medium editor does not allow subscripts, so when we want to refer to the Agent's state s at time t, we will use the notation St.
Similarly, to refer to a state, action, or reward, we will use uppercase and lowercase letters interchangeably, at our convenience, to make the text easier to read in Medium.
I will do my best to make the text understandable. In some cases, I will include mathematical symbols and formulas as images created with LaTeX to remedy the shortcomings of the Medium editor.
In any case, before proceeding, for full mathematical rigor, please refer to the appendix (at the end) of this post.
1.2 States
A state is a unique and self-contained configuration of the problem. The set of all possible states is called the state space. Some states are special, such as starting states and terminal states.
In our Frozen-Lake example, used in the previous post, the state space of the Environment is composed of 16 states as shown in the drawing:

From a programming point of view, we can inspect it as follows (assuming the env created in the previous post with gym.make("FrozenLake-v0")):
print("State space: ", env.observation_space)
State space: Discrete(16)
In the Frozen-Lake Environment, for instance, there is only one starting state (which is state 0) and five terminal states (states 5,7,11,12, and 15):

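If you prefer to check this programmatically, here is a small sketch that scans the Environment's transition dictionary (env.env.P, which we will examine in detail in Section 2) and collects the states flagged as terminal; it should print the five terminal states listed above:
import gym

env = gym.make("FrozenLake-v0")
terminal_states = set()
for state, transitions_by_action in env.env.P.items():
    for action, transitions in transitions_by_action.items():
        for prob, next_state, reward, done in transitions:
            if done:
                terminal_states.add(next_state)
print("Terminal states:", sorted(terminal_states))   # [5, 7, 11, 12, 15]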
1.3 Actions
At each state, the Environment makes available a set of actions, an action space, from which the Agent will choose an action. The Agent influences the Environment through these actions, and the Environment may change states as a response to the action taken by the Agent. The Environment makes the set of all available actions known in advance.
In the Frozen-Lake Environment, there are four available actions in every state: UP, DOWN, RIGHT, or LEFT:
print("Action space: ", env.action_space)
Action space: Discrete(4)
Now that we have presented states and actions, we can revisit the Markov property. The probability of the next state St+1, given the current state St and current action At at time t, is the same as if we provided the entire history of interactions. In other words, the probability of moving from one state to another, given the same action, is the same regardless of all the states and actions encountered before that point. In the Frozen-Lake example, we know that from state 2 the Agent can only transition to state 1, 3, 6, or 2, and this is true regardless of whether the Agent's previous state was 1, 3, 6, or 2. That is, we don't need the history of states visited by the Agent at all.
Let us take this opportunity to explain that we can categorize action spaces into two types:
- Discrete action space: When our action space consists of actions that are discrete. For instance, in the Frozen-Lake Environment, our action space consists of four discrete actions: up, down, left, right, and so it is called a discrete action space. A discrete environment is one where the environment’s action space is discrete.
- Continuous action space: When our action space consists of actions that take continuous values. For instance, when we drive a car, our actions have continuous values, such as the car's speed or the angle by which we rotate the wheel. We will work with continuous action spaces later in this series. A continuous environment is one where the environment's action space is continuous. (A short check of both kinds of action space is sketched right after this list.)
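As a quick sanity check of this distinction, here is a short sketch comparing a discrete action space with a continuous one; it assumes the classic Gym Pendulum-v0 environment (which has a continuous Box action space) is available:
import gym

for name in ["FrozenLake-v0", "Pendulum-v0"]:
    env = gym.make(name)
    space = env.action_space
    kind = "discrete" if isinstance(space, gym.spaces.Discrete) else "continuous"
    print(f"{name}: {space} -> {kind} action space")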
2. Transition Function
Which state the Agent will arrive in (that is, how the Environment changes its state) is decided by the transition function, denoted by p, which indicates the probability of moving from one state to the next. It represents the one-step dynamics of the Environment: the probability of the next state and reward, given the current state and current action.
Remember that when the Environment responds to the Agent at time step t, it considers only the state and action at the previous time step t-1. It does not care what states were presented to the Agent more than one step prior, nor does it look at the actions the Agent took before the last one. And finally, how much reward the Agent is collecting has no effect on how the Environment chooses to respond.
Because of this, we can completely define how the Environment decides the state and reward by specifying the transition function p as we did here. The function p defines the dynamics of the MDP.
To summarize, when we have a real problem in mind, we will need to specify the MDP as a way to formally define the problem, and therefore the function p. The Agent will know the states, actions, and rewards, along with the discount factor, but the function p will be unknown to it. Despite not having this information, the Agent will still have to learn, from interaction with the Environment, how to accomplish its goal.
Depending on the Environment, transitions can be either deterministic or stochastic. Let's see what the transition function looks like in each case.
2.1 Transition function for deterministic Environment
Imagine a version of Frozen-Lake whose surface is not slippery. We can create the Environment in this deterministic mode with the argument is_slippery=False:
env = gym.make('FrozenLake-v0', is_slippery=False)
In this case, the probability at time t of the next state St+1, given the current state St and action At, is always 1. In other words, it is a deterministic Environment where there is always a single possible next state for each action. Here, we can consider the transition function as a simple lookup table, a two-dimensional (2D) matrix. In our Frozen-Lake example, we can obtain it with env.env.P, which outputs the function as a dictionary:
{
0: {0: [(1.0, 0, 0.0, False)],
1: [(1.0, 4, 0.0, False)],
2: [(1.0, 1, 0.0, False)],
3: [(1.0, 0, 0.0, False)]},
1: {0: [(1.0, 0, 0.0, False)],
1: [(1.0, 5, 0.0, True)],
2: [(1.0, 2, 0.0, False)],
3: [(1.0, 1, 0.0, False)]},
.
.
.
14: {0: [(1.0, 13, 0.0, False)],
1: [(1.0, 14, 0.0, False)],
2: [(1.0, 15, 1.0, True)],
3: [(1.0, 10, 0.0, False)]},
15: {0: [(1.0, 15, 0, True)],
1: [(1.0, 15, 0, True)],
2: [(1.0, 15, 0, True)],
3: [(1.0, 15, 0, True)]}
}
In this output, env.env.P returns all the states (many removed for clarity; go to the notebook for the complete output). Each state maps to a dictionary whose keys are the possible actions (0, 1, 2, 3) from that state. Each action, in turn, maps to a list of tuples, where each tuple contains the probability of the transition, the next state, the reward, and whether the Game terminates there or not (done=True if the next state is a HOLE or the GOAL).
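For example, using the deterministic env created above, we can read the entry for taking action RIGHT (2) in state 14 and unpack each tuple:
for prob, next_state, reward, done in env.env.P[14][2]:
    print(f"probability={prob}, next_state={next_state}, reward={reward}, done={done}")
# In this deterministic Environment there is a single tuple:
# probability=1.0, next_state=15, reward=1.0, done=True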
For example, in this "not slippery" Environment, if we execute the sequence/plan (which I call the "good plan") indicated in the next drawing, the Agent will arrive safely at the goal:

We can check with the following code that this is a plan that allows the Agent to achieve the objective:
import gym

actions = {'Left': 0, 'Down': 1, 'Right': 2, 'Up': 3}
good_plan = (2 * ['Down']) + ['Right'] + ['Down'] + (2 * ['Right'])

env = gym.make("FrozenLake-v0", is_slippery=False)
env.reset()
env.render()
for a in good_plan:
    new_state, reward, done, info = env.step(actions[a])
    env.render()
    if done:
        break

Here the Environment reacts deterministically to the Agent's actions but, as we remember from the previous post, the original Environment reacts stochastically to the Agent's actions to simulate slipping.
2.2 Transition function for Stochastic Environment
We introduced earlier that which state the Agent will arrive in is decided by the transition function. In a stochastic Environment, at time t the transition function p maps a transition tuple (St, At, St+1) to the probability of transitioning from the source state St to the target state St+1 when taking action At.
Now, to capture all the details about the Environment and its possible reactions to the Agent's actions, the transition function can no longer be represented as a 2D matrix as in the deterministic case. Instead, we need a 3D matrix with the dimensions source state, action, and target state, where each element indicates the probability of transitioning from a source state St to a target state St+1 given action At.
To verify that we are talking about a 3D matrix, we could obtain the transition function as before, but now for the slippery Environment:
env = gym.make("FrozenLake-v0")
print(env.env.P)
{
0: {0: [(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 4, 0.0, False)],
1: [(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 4, 0.0, False),
(0.3333333333333333, 1, 0.0, False)],
2: [(0.3333333333333333, 4, 0.0, False),
(0.3333333333333333, 1, 0.0, False),
(0.3333333333333333, 0, 0.0, False)],
3: [(0.3333333333333333, 1, 0.0, False),
(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 0, 0.0, False)]},
.
.
.
14: {0: [(0.3333333333333333, 10, 0.0, False),
(0.3333333333333333, 13, 0.0, False),
(0.3333333333333333, 14, 0.0, False)],
1: [(0.3333333333333333, 13, 0.0, False),
(0.3333333333333333, 14, 0.0, False),
(0.3333333333333333, 15, 1.0, True)],
2: [(0.3333333333333333, 14, 0.0, False),
(0.3333333333333333, 15, 1.0, True),
(0.3333333333333333, 10, 0.0, False)],
3: [(0.3333333333333333, 15, 1.0, True),
(0.3333333333333333, 10, 0.0, False),
(0.3333333333333333, 13, 0.0, False)]},
15: {0: [(1.0, 15, 0, True)],
1: [(1.0, 15, 0, True)],
2: [(1.0, 15, 0, True)],
3: [(1.0, 15, 0, True)]}
}
This transition function implements a 33.3% chance of transitioning to the intended cell (state) and a 66.6% chance of transitioning to the orthogonal directions. There is also a chance of bouncing back to the state we came from when we are next to a wall.
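To make the three dimensions explicit, we can pack env.env.P into a NumPy array indexed by (source state, action, target state). This is just an illustrative sketch using the slippery env created above:
import numpy as np

n_states = env.observation_space.n
n_actions = env.action_space.n

# P3[s, a, s'] = probability of landing in s' when taking action a in state s
P3 = np.zeros((n_states, n_actions, n_states))
for s, transitions_by_action in env.env.P.items():
    for a, transitions in transitions_by_action.items():
        for prob, next_s, reward, done in transitions:
            P3[s, a, next_s] += prob   # accumulate in case a next state appears twice

print(P3[14, 2])   # distribution over the 16 next states from state 14, action RIGHT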
A visual representation of the Environment as a graph can be helpful, with nodes representing the states and edges, labeled with probabilities (and rewards), representing the possible transitions from state to state. For simplicity and clarity, the image below only includes the transition function for all actions of state 14. This subset of states illustrates all possible transitions without too much clutter.

Now, if we execute the same sequence of "good plan" actions several times, we can see that it does not behave in a deterministic way and gives very different results. We will come back to this example in future posts.
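We can verify this with a small experiment: run the "good plan" many times in the slippery Environment and count how often the Agent actually reaches the GOAL. This sketch reuses the actions dictionary and the good_plan list defined earlier:
from collections import Counter

env = gym.make("FrozenLake-v0")   # slippery by default
results = Counter()
for _ in range(1000):
    env.reset()
    reward, done = 0.0, False
    for a in good_plan:
        _, reward, done, _ = env.step(actions[a])
        if done:
            break
    results["reached GOAL" if reward > 0 else "did not reach GOAL"] += 1
print(results)   # the split varies from run to run, showing the stochasticity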
3. Reward Function and Discount Factor
Once an action is taken, the Environment delivers a reward as a measure of the goodness of the transition, using a reward function. What is a reward function?
3.1 Reward Function
The reward function is usually denoted by r(s,a) or r(s,a,s'). It represents the reward our Agent obtains when transitioning from state s to state s' by performing action a. It is just a scalar value that the Agent obtains from the Environment at each time step (or every fixed number of time steps), and it can be positive or negative, large or small.
It's important to note that the term "Reinforcement" in "Reinforcement Learning" comes originally from behavioral science. It refers to a stimulus delivered immediately after a behavior to make that behavior more likely to occur in the future. The fact that this name is borrowed is no coincidence. We should design the reward as a feedback mechanism that tells the Agent whether it has chosen the appropriate action. The reward is our way of telling the Agent that it is doing a "good job" or a "bad job"; in other words, of telling our Agent how well it has behaved.
For example, consider an Agent who would like to learn to escape a maze. Which reward signal will encourage the Agent to escape the maze as quickly as possible? The reward can be -1 for every time step that the Agent spends inside the maze. Once the Agent escapes, it receives a reward of +10, and the episode terminates. Now, consider an Agent who would like to learn to balance a plate of food on her head. Which reward signal will encourage the Agent to keep the plate balanced for as long as possible? For instance, it can be a reward of +1 for every time step that the Agent keeps the plate balanced on her head. If the plate falls, the episode terminates. I hope that with these two simple examples, the reader has gained an intuition for how the reward can help "Reinforce" the learning of the Agent.
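As an illustration, the maze example could be encoded with a reward function sketched like this (the name and signature are hypothetical, only meant to show the r(s, a, s') shape):
def maze_reward(state, action, next_state, escaped):
    # Hypothetical reward for the maze example: -1 for every time step spent
    # inside the maze, +10 when the Agent escapes (and the episode terminates).
    return 10.0 if escaped else -1.0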
In summary, the reward gives the Agent feedback about its success, and that reward should reinforce the Agent's behavior in a positive or negative way. Nevertheless, it only reflects the success of the Agent's recent activity, not all the successes achieved by the Agent so far. What an Agent is trying to achieve is the largest accumulated reward over its sequence of actions, and for that we need another piece of feedback, the return. But before introducing the return, let me introduce the last component of an MDP, the discount factor.
3.2 Discount factor
We already advanced that the task the Agent is trying to solve may or may not have a natural ending. Tasks that have a natural ending, such as a game, are called episodic tasks. The sequence of time steps from the beginning to the end of an episodic task is called an episode. The Frozen-Lake Environment presents episodic tasks because there are terminal states; there are clear goal and failure states. Conversely, tasks that do not have a natural end are called continuing tasks, such as learning forward motion.
Because of the possibility of infinite sequences of time steps, we need a way to discount the value of rewards over time; that is, we need a way for telling the Agent that getting +1’s is better sooner than later. So, we commonly use a positive real value less than one to exponentially discount the value of future rewards. The further into the future we receive the reward, the less valuable it is in the present.
This parameter is called the discount factor, gamma, or discount rate, denoted by γ, and its range is [0,1]. The Agent uses the discount factor to adjust the importance of rewards over time. The later the Agent receives a reward, the less attractive it is to present calculations. In other words, the Agent is more interested in securing rewards that come sooner and are more likely than rewards that come later and are less likely.
4. Problem Setup
In the previous post, we presented the Reinforcement Learning cycle where an RL Agent interacts with an Environment over time. Summarizing, at each time step t, the Agent receives a state s, and selects an action a, following a strategy (a policy). According to the Environment dynamics, the Agent receives a scalar reward r, and transitions to the next state s’. In an episodic problem, this process continues until the Agent reaches a terminal state, and then it restarts.
4.1 Episode and Trajectory
The Agent interacts with the Environment by performing actions, starting from the initial state until it reaches a final state. This Agent-Environment interaction, starting from the initial state S0 until the final state, is called an episode.
But sometimes we are interested in only part of the Agent-Environment interaction. At an arbitrary time step t, the interaction between the Agent and the MDP Environment has evolved as a sequence of states, actions, and rewards. Each step is defined as a transition:
(St, At, Rt+1, St+1)
We define a trajectory as just a part of an episode, composed of a sequence of transitions. A trajectory is slightly more flexible than an episode because there are no restrictions on its length; it can correspond to a full episode or only part of one. We represent a trajectory with τ:
τ = St, At, Rt+1, St+1, At+1, Rt+2, … , St+H
We denote the length of the trajectory by H (or h), where H stands for Horizon.
4.2 Discounted Return
We introduced in Post 1 that, in general, an Agent is designed to maximize the cumulative reward it receives in the long run. Now that we have introduced gamma, we can define how the return is calculated from gamma and the rewards, for continuing tasks (and for episodic tasks). We define the discounted return, denoted Gt, at time step t, as this quantity:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σk≥0 γ^k Rt+k+1
For every time step t, we calculate the discounted return Gt as a sum of the subsequent rewards, where more distant rewards are multiplied by the discount factor raised to the power of the number of steps away. If gamma equals 1, the discounted return simply equals the sum of all subsequent rewards (the cumulative reward, i.e., the return). If gamma equals 0, the discounted return is just the immediate reward, with no contribution from subsequent rewards.
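A minimal sketch of this calculation in Python, assuming the list rewards holds the rewards Rt+1, Rt+2, … collected after time step t:
def discounted_return(rewards, gamma):
    # Gt = Rt+1 + gamma * Rt+2 + gamma**2 * Rt+3 + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))   # 0.9**3 = 0.729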
When we set the discount factor to a small value (close to 0), we give more importance to immediate rewards than future rewards. When we set the discount factor to a high value (close to 1), it implies that we give more importance to future rewards than to immediate rewards. The extreme values are useful only in corner cases, and most of the time, gamma is set to something in between.
Whether an Agent gives more importance to immediate or to future rewards depends on the task. In some tasks, future rewards are more desirable than immediate rewards, and vice versa. In a chess game, for example, the goal is to defeat the opponent's king. If we give more importance to the immediate reward, which is acquired by actions such as one of our pawns taking an opposing piece, the Agent will probably learn to pursue this sub-goal instead of learning the real goal.
Actions have short- and long-term consequences, and the Agent needs to gain an understanding of the complex effects its actions have on the Environment. As we will see along this series, the discounted return helps the Agent with this task: the Agent will always choose an action with the goal of maximizing the return. But, as we will also see, in general an Agent can't predict with complete certainty what the future rewards are likely to be, and therefore the discounted return, so it has to rely on a prediction or an estimate, depending on the family of methods we use to learn. This is where the key concepts of policy functions and value functions come into play.
4.3 Policy
Let’s see how the Agent makes the decision to meet its objectives: to find a sequence of actions that will maximize the discounted return G during an episode. In other words, the Agent must have a plan, a sequence of actions from the START state to the GOAL state.
We previously introduced a "good plan" for the Frozen-Lake example, which intuitively seems the best. But when we run it in a stochastic environment, even the best of plans can fail, because the actions taken will not always work the way we intend. Remember that in the Frozen-Lake Environment, the unintended effects have an even higher probability: 66.6% vs. 33.3%.
In particular, it can happen that, due to the environment's stochasticity, our Agent lands on a cell not covered by our plan. Then what? What we need is a plan for every possible state, a "universal plan", a policy that covers all possible states.
A policy defines the Agent's behavior in an environment. It is the strategy (e.g., some set of rules) that the Agent employs to determine the next action based on the current state. Typically denoted by 𝜋, a policy is a function that determines the next action a to take given a state s.
The policy that we just mentioned is called a deterministic policy. A deterministic policy 𝜋(𝑠) tells the Agent to perform one particular action a in a state s. Thus, a deterministic policy maps each state to one particular action.
But in general, we will deal with more general policies, defined as probabilities rather than concrete actions; in other words, a stochastic policy, which has a probability distribution over the actions that an Agent can take in a given state. So, instead of performing the same action every time it visits the state, the Agent performs different actions each time, based on the probability distribution returned by the stochastic policy. A stochastic policy is typically denoted by 𝜋(𝑎|𝑠).
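As a rough sketch (the arrays below are placeholders, not policies we will actually use), in Frozen-Lake a deterministic policy can be stored as one action per state, while a stochastic policy is a probability distribution over actions for each state:
import numpy as np

n_states, n_actions = 16, 4

# Deterministic policy: one concrete action per state (here all LEFT, as a placeholder)
pi_deterministic = np.zeros(n_states, dtype=int)

# Stochastic policy: a probability distribution over actions for each state
# (here uniform, i.e., a purely random policy)
pi_stochastic = np.full((n_states, n_actions), 1.0 / n_actions)

state = 2
action = np.random.choice(n_actions, p=pi_stochastic[state])   # sample a ~ 𝜋(a|s)
print("Sampled action in state", state, ":", action)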
As we will see later in this series, the policy may change as the Agent gains experience during the learning process. For example, the Agent may start from a random policy, where the probability of all actions is uniform, and hopefully learn to optimize it toward a better policy. In an initial iteration, such an Agent performs a random action in each state and tries to learn whether that action is good or bad based on the reward obtained. Over a series of iterations, the Agent learns to perform, in each state, the good actions that give it a positive reward. Finally, the Agent will have learned a good, or optimal, policy.
We can categorize a stochastic policy into two types:
- Categorical policy: when the action space is discrete and the policy uses a categorical probability distribution over the action space to select actions. For instance, in the Frozen-Lake Environment, we select actions based on a categorical probability distribution (a discrete distribution) because the action space of the environment is discrete.
- Gaussian policy: when the action space is continuous and the policy uses a Gaussian probability distribution over the action space to select actions. We will talk about this category of policy in future posts.
Even for reasonably simple environments, we can have a variety of policies, so we need a method to automatically find optimal policies. The optimal policy is the policy that gets the Agent the best reward and helps it achieve its goal: it tells the Agent the correct action to perform in each state so that the Agent can obtain a good reward. This is where value functions come into play.
4.4 Predicting future rewards: Value functions
A value function determines what is good for the Agent in the long run, unlike the immediate reward. We have two types of value functions, the state-value function and the action-value function, which help the Agent learn and find optimal policies.
The state-value function, also referred to as the V-function, measures the goodness of each state. It tells us the total return we can expect in the future if we start from that state; in other words, how good or bad it is to be in a particular state, according to the discounted return G, when we follow a certain policy. However, the discounted return G by itself is not very useful in practice, as it is defined for a specific episode and can therefore vary widely, even for the same state. But if we calculate the mathematical expectation 𝔼[.] of the return for a state by averaging over a large number of episodes, we get a much more useful value: the V-function.
Moreover, we can extend the definition of the state-value function by defining a value for each state-action pair, which is called the action-value function, also known as the Q-function. It tells us how good or bad it is to take a specific action from among the actions we can choose in the state we are in.
Estimating the state-value function and action-value function is an essential component of Reinforcement Learning methods. In future posts in this series, we will go into more detail and cover different ways of calculating and estimating these functions.
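As a small preview of what this estimation can look like, here is a rough Monte Carlo sketch that averages the discounted returns of many episodes played with a uniformly random policy, giving an estimate of the value of the start state (the number of episodes and the value of gamma are arbitrary choices for this sketch):
import gym
import numpy as np

env = gym.make("FrozenLake-v0")
gamma = 0.99
returns = []
for _ in range(5000):
    env.reset()                      # every episode starts in state 0
    g, discount, done = 0.0, 1.0, False
    while not done:
        action = env.action_space.sample()          # uniformly random policy
        _, reward, done, _ = env.step(action)
        g += discount * reward
        discount *= gamma
    returns.append(g)
print("Estimated V(start) under the random policy:", np.mean(returns))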
5. Summary
We have reached the end of this post! This post doesn't pretend to be a complete theoretical framework; it introduces the minimum formalism to get started, and we will introduce new formulations as we need them throughout the series. We have covered several important fundamental concepts of RL.
In the next three posts (Post 3, Post 4, Post 5), we will review the Deep Learning concepts (and PyTorch) required for programming Agents for Reinforcement Learning problems. If you have previous knowledge of Deep Learning, PyTorch, and TensorBoard, you can skip these three posts and go to Post 6, where we solve a Reinforcement Learning problem using the Cross-Entropy method.
Post updated on 11/12/2020
Appendix: Mathematical Notation
In general, we tend to use the notation used in the textbook Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. This book is "the" classic text with an excellent introduction to Reinforcement Learning fundamentals.
Dr. Richard S. Sutton is a distinguished research scientist at DeepMind and a renowned professor of computing science at the University of Alberta. Dr. Sutton is considered one of the founding fathers of modern computational Reinforcement Learning. Dr. Andrew G. Barto is a professor emeritus at the University of Massachusetts Amherst and was the doctoral advisor of Dr. Sutton.
The main definitions and mathematical symbols that we will use throughout this series, related to this post, are the following:

Deep Reinforcement Learning Explained Series
by UPC Barcelona Tech and Barcelona Supercomputing Center
A relaxed introductory series that gradually and with a practical approach introduces the reader to this exciting technology that is the real enabler of the latest disruptive advances in the field of Artificial Intelligence.
About this series
I started to write this series in May, during the period of lockdown in Barcelona. Honestly, writing these posts in my spare time helped me to #StayAtHome because of the lockdown. Thank you for reading this publication in those days; it justifies the effort I made.
Disclaimers – These posts were written during this period of lockdown in Barcelona as a personal distraction and as dissemination of scientific knowledge, in case it could be of help to someone, but without the purpose of being an academic reference document in the DRL area. If the reader needs a more rigorous document, the last post in the series offers an extensive list of academic resources and books that the reader can consult. The author is aware that this series of posts may contain some errors and that the English text would benefit from a revision if the purpose were an academic document. But although the author would like to improve the content in quantity and quality, his professional commitments do not leave him free time to do so. However, the author commits to refining any errors that readers report, as soon as he can.