
Markov Decision Processes and Bellman Equations

An Introduction to Reinforcement Learning: Part 2

A Baby Robot’s Guide To Reinforcement Learning

All images by author.

Introduction

In the first part of this series on Reinforcement Learning we saw how the behaviour of an agent could be evaluated, to measure how well it performed on a given problem.

The problem environment was split into a set of states and each of these states was given a value that reflected how good it was to be in that state. By selecting actions that moved Baby Robot in the direction of increasing state values, we were able to help him find his way to the exit of a simple grid level.

Additionally, we managed to do this using a limited amount of theory. Rather than defining the full Reinforcement Learning problem up front, we derived our equations as we needed them, starting simply and building on those we’d used before. As it turns out, the simple steps we’ve taken have already brought us close to defining the full Reinforcement Learning problem. In this part we’ll take the final few steps needed to reach our goal.


Part 1 can be found here…

State Values and Policy Evaluation


Code

The associated [Jupyter Notebook](http://Reinforcement_Learning/Part 2 – Markov Decision Processes and Bellman Equations.ipynb) for this article can be found on GitHub. This contains all of the code required to set up and run the levels and algorithms described below.


Markov Processes (Markov Chains)

In the last part, Baby Robot managed to find his way out of a simple grid level. Unfortunately he still hasn’t made his way back to his mum and instead finds himself in the new level shown below:

While this is a pretty small level, and Baby Robot can see the exit, the roof has been leaking and so the floor is covered in puddles. Unfortunately, Baby Robot wasn’t really designed to work in the wet. Puddles will slow him down and, even worse, can cause him to skid. As a result it’s not guaranteed that he’ll move in the desired direction if he starts in a square that contains a puddle.

We do however know all of the information about the system, including the probability that a puddle will cause a skid to occur:

  • Small puddles have a 20% chance of skidding, so 80% of the time the chosen direction will be taken.
  • Big puddles have a 60% chance of skidding, so the chosen direction will only be taken 40% of the time.

When a skid occurs it will cause Baby Robot to move in a direction different to the one that was chosen. The probability of this happening will be split evenly over the other possible directions. Additionally, we’re initially only observing the environment. Baby Robot will be trying to make his way to the exit, but is at the mercy of the system dynamics. He has no control over the path that he takes and his next state will just be decided by the transition probability of his current state. At this stage there are no actions or rewards.

All of this can be summarised in the state transition diagram shown below.

State transition probabilities for the leaky grid level.

Initially, Baby Robot is undecided as to which way to move. Therefore, from the start state, he’ll either move East to S1 or South to S3, with equal probability 0.5.

State S1 contains a large puddle so, while Baby Robot would like to move to state S2, there’s only a 0.4 chance of this actually happening. A more likely outcome is that he’ll skid and will instead either end up in state S4 or back at the start. Since the probability of skidding is 0.6, and this is split equally between the 2 possible directions, the probability that he’ll end up in S4 is 0.3 and the probability he’ll go back to the start is also 0.3.

State S4 also contains a large puddle, so will behave in the same way as state S1. From S4 Baby Robot would like to move to the exit, but there’s only a 0.4 chance he’ll actually move in this direction. Instead he may skid and either move to state S1 or to state S3, both with probability 0.3.

Similarly, if he begins by moving to state S3, which contains a small puddle, the chance that his next move takes him to state S4, where he’d like to go, is 0.8, and the chance of skidding back to the start is 0.2. S2 also contains a small puddle, so from here there’s a 0.8 chance he’ll move to the exit and a 0.2 chance he’ll skid back to S1.

These state transition probabilities can also be expressed in a state transition matrix, as shown below. This matrix tells us how likely it is, when starting from any given state, that we’ll end up in each of the other states.

State transition matrix for the leaky grid level

Note that the probabilities of leaving any state always add up to one, as shown on the state transition diagram and also for each row in the state transition matrix. Additionally, when Baby Robot reaches the exit this is the terminal state from which he’ll not move again. Therefore, at the exit, the transition probability of staying at the exit is 1.0.
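As a quick illustration (separate from the notebook code), the state transition matrix can be written down directly, using the probabilities listed above, and we can check that every row sums to one:

```python
import numpy as np

# The states of the leaky grid level, in the order used for the matrix rows/columns.
states = ["Start", "S1", "S2", "S3", "S4", "Exit"]

# State transition matrix: each row holds the probabilities of moving
# from that state to every other state.
P = np.array([
    #  Start   S1    S2    S3    S4   Exit
    [  0.0,   0.5,  0.0,  0.5,  0.0,  0.0 ],   # Start: East to S1 or South to S3
    [  0.3,   0.0,  0.4,  0.0,  0.3,  0.0 ],   # S1: big puddle, 0.6 chance of skidding
    [  0.0,   0.2,  0.0,  0.0,  0.0,  0.8 ],   # S2: small puddle, 0.2 chance of skidding
    [  0.2,   0.0,  0.0,  0.0,  0.8,  0.0 ],   # S3: small puddle
    [  0.0,   0.3,  0.0,  0.3,  0.0,  0.4 ],   # S4: big puddle
    [  0.0,   0.0,  0.0,  0.0,  0.0,  1.0 ],   # Exit: terminal state, so it stays put
])

# The probabilities of leaving any state always add up to one.
assert np.allclose(P.sum(axis=1), 1.0)
```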

Beginning at the start of the level, we can follow a series of paths through the level until we reach the exit. Each of these paths represents an episode and each episode will follow a random trajectory that is defined by the system dynamics. Due to the randomness of the state transitions, the paths can be of any length.

So, each of the following episodes is a valid path through the level:

  • Start, S1, S4, Exit
  • Start, S3, S4, S1, S2, Exit
  • Start, S1, S2, S1, S4, Exit

For the level given above, and its accompanying state transition diagram and matrix, we’ve described a Markov Process (also known as a Markov Chain). This is a model of a stochastic process in which a sequence of events occur, with the probability of an event occurring being dependent only on the current state.

The complete set of states in our grid level makes up the state space, which defines all of the possible states in the system. Each of these states is independent of the previous states and therefore satisfies the Markov Property. All of the information about a state is contained in that state. As a result, we don’t need to know what’s happened in the past in order to predict what will happen in the future.

In practice it’s unlikely that a system’s transition probabilities are known in advance. Therefore, to estimate how likely it is to move from one state to another, it’s possible to observe multiple episodes and then take the average. This approach, of taking random samples to calculate estimates, is known as Monte Carlo Sampling and we’ll examine this fully in a future article.
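As a rough sketch of this idea (again separate from the notebook code), we could use the transition matrix defined above to simulate Baby Robot’s wandering, count the transitions that are actually observed, and divide by the totals to estimate the probabilities:

```python
import numpy as np

states = ["Start", "S1", "S2", "S3", "S4", "Exit"]

# The true transition matrix (as above) - here it's only used to simulate episodes.
P = np.array([[0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
              [0.3, 0.0, 0.4, 0.0, 0.3, 0.0],
              [0.0, 0.2, 0.0, 0.0, 0.0, 0.8],
              [0.2, 0.0, 0.0, 0.0, 0.8, 0.0],
              [0.0, 0.3, 0.0, 0.3, 0.0, 0.4],
              [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])

rng = np.random.default_rng(42)
counts = np.zeros_like(P)

# Observe a large number of episodes, counting every transition that occurs.
for _ in range(10_000):
    s = 0                                    # each episode begins at the Start state...
    while s != 5:                            # ...and ends when the Exit state is reached
        s_next = rng.choice(len(states), p=P[s])
        counts[s, s_next] += 1
        s = s_next

# Estimate each transition probability as the fraction of times it was observed.
row_totals = counts.sum(axis=1, keepdims=True)
P_estimate = np.divide(counts, row_totals,
                       out=np.zeros_like(counts), where=row_totals > 0)
print(np.round(P_estimate, 2))               # close to P for every non-terminal state
```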


Markov Reward Processes

The Markov process defines a state space and the transition probabilities of moving between those states. It doesn’t specify which states are good states to be in, nor if it’s good to move from one state to another. For this we need to add rewards to the system and move from a Markov process to a Markov Reward process.

In the first part of this series, we defined that each time Baby Robot moved from one square to the next he would receive a reward of -1. This encouraged him to take the shortest path through the level, since the longer he took the more negative his reward would be. Remember we’re trying to maximise the total accumulated reward, so having a large negative reward isn’t good.

In this leaky level we’ve specified that puddles will cause Baby Robot to slow down, with big puddles taking longer to traverse than small ones. With this in mind we can modify our rewards as follows:

  • Small puddles take twice as long to move through as a standard grid square. Therefore, to account for this, a reward of -2 will be given for moving into a state with a small puddle.
  • Similarly, large puddles take 4 times as long to move through, so a reward of -4 will be given for moving into a state that has a large puddle.

Here we’ve defined that the penalty for being in a puddle occurs when Baby Robot moves into a puddle, so the increased negative reward will be given when moving into the state with the puddle. It would also be perfectly reasonable to attribute the rewards when exiting the state containing the puddle, but we’re trying to emphasise that it’s not good to step into big puddles!

We’ve added these rewards to the state transition diagram, as shown below. The rewards obtained for each transition are shown in red, below the associated transition probability.

State reward process for the leaky level

Note that we’re still only observing the system and haven’t yet introduced the concept of actions. Therefore, Baby Robot will move from one state to the next with a probability given by the current state’s transition probabilities, except now he’ll receive a reward when he makes each move.

From the Start state, Baby Robot can move in two possible directions: he can either move into the large puddle in state S1 and receive a reward of -4, or he can move to state S3, where there’s a small puddle, and be given a less negative reward of -2. Because Baby Robot doesn’t yet have the ability to choose which path he takes, moving to S1 or S3 is equally likely, and each will happen with a probability of 0.5.

Although Baby Robot can’t yet influence the path he takes through the level, the addition of rewards does allow us to evaluate how good it is to follow a particular path. From this, we can work out how good it is to be in any particular state, given the defined transition probabilities.

In part 1 we introduced the concept of a return. This is the total accumulated reward, starting from a time step ‘t’. For example, for the sample path {Start, S1, S4, Exit}, the return would be (-4 - 4 - 1) = -9. Note that although we have a ‘Start’ state in our system, representing where Baby Robot enters the grid level, it’s not necessary to begin at this position when calculating a return. For example, at time step ‘t’ we could say that Baby Robot was at state S1 and the return would then be calculated from this position until the end of the episode.

Additionally, since the random nature of state transitions means that, in theory, sequences could potentially be of infinite length, we added a discount factor to our return calculation. This prevents the return value from becoming excessively large and focuses attention on rewards that are obtained in the immediate future. At each increasing time step the reward is multiplied by an increasing power of the discount factor and the discount factor is chosen to be a value between 0 and 1. Therefore, the contribution of the reward obtained at each step decreases with time.

This gives us the discounted return formula:

Equation 1: discounted return
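Written out in full, the discounted return is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$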

where:

  • Gₜ = the return (the total amount of reward accumulated over an episode, starting at time ‘t’)
  • Rₜ₊₁ = the reward obtained at time t+1
  • γ (gamma) = the discount factor, where 0 ≤ γ ≤ 1.

If we pick any state as the starting point (i.e. where we are at time ‘t‘) and follow a path to the end of the episode, adding the discounted reward at each time step, we get a return value that will be a very rough estimate of the state’s value, describing how good it was to be in the initial state. If the return value is a large negative value then this isn’t very good.

Obviously, given the random nature of the paths through the system, the return value of a single episode isn’t the most accurate estimate of a state’s value. So instead we take the expected value, which is effectively the average value over infinite episodes. This gives us the state value function v(s):

Equation 2: state value function
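Written out, the state value function is the expected return when starting from state ‘s’:

$$v(s) = \mathbb{E}\left[\, G_t \mid S_t = s \,\right]$$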

So, at time ‘t‘ if we start in state ‘s‘ and measure the expected return, we’ll get the value of state ‘s‘, describing how good it is to be in that state.

Given that we haven’t yet introduced actions into the mix, the value function may seem to be of little use. We could calculate how good it is to be in a particular state, but don’t yet have any way to make use of this knowledge to let us take a path that would give the maximum return. However, by making a slight re-adjustment to this equation we can make it massively more powerful.

Look again at equation 1, the return value starting at time ‘t‘. Currently this is expressed in terms of all future rewards. Although the rewards have been discounted, and so future rewards contribute less to the final return, if you want to calculate the return to full accuracy you in theory need to consider the rewards obtained at all future time steps. This makes things rather impractical, since for every state you’ll need to consider every reward that could be obtained in the future.

But it’s possible to change equation 1, from being expressed in terms of all future rewards, to instead make it recursive, expressing it in terms of the future return:

Equation 3: Return at time ‘t’ given in terms of the immediate reward and the next return.

Now the return is given in terms of the immediate reward plus the discounted return that will be given at the next time step. By substituting this into equation 2, we can also express the state value in terms of the immediate reward and the future discounted return:

Equation 4: The state value function expressed in terms of the immediate reward and the next return.

But Gₜ₊₁ is just the return at the next time step, at which point we’ll be in the next state and, by equation 2 again, its expected value is just the value of that next state. So we can also express the state value in a recursive way, making it a function of the immediate reward and the value of the next state:

Equation 5: The Bellman Equation
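Collecting these steps together (equations 3 to 5):

$$G_t = R_{t+1} + \gamma G_{t+1}$$

$$v(s) = \mathbb{E}\left[\, R_{t+1} + \gamma G_{t+1} \mid S_t = s \,\right]$$

$$v(s) = \mathbb{E}\left[\, R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s \,\right]$$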

This rearrangement of the state value function, decomposing it into the immediate reward Rₜ₊₁ and the discounted value of the next state γv(Sₜ₊₁), is known as the Bellman Equation, which is arguably the fundamental equation of Reinforcement Learning. Using this, the value of any state can be calculated simply by looking ahead to the next state, as opposed to having to inspect every future state. Once the values of states are known, the best actions can be chosen to move between these states and ultimately the best policy may be found to solve the given problem.

We can now apply this to the states in our Markov Reward process, to calculate their values. To do this, we first need to express the Bellman Equation in a slightly different form, as shown below:

Equation 6: The Bellman Equation expressed in terms of transition and reward probabilities.
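Written out, this version of the Bellman Equation is:

$$v(s) = \sum_{s', r} p(s', r \mid s)\left[\, r + \gamma\, v(s') \,\right]$$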

In equation 6 we’ve simply changed from expressing the state value function as an expectation, to instead averaging over each of the transitions that can happen in state ‘s‘.

There are a couple of points worth noting about this equation:

  • The joint sum over all of the next states and the rewards is just a convenient way of writing that we’re summing over all next states and all rewards. So in reality there are 2 sums: one for the next states, which uses the transition probability of moving to state s′, and one for the rewards, given the probability of receiving reward ‘r‘ when we start in state ‘s’.
  • p(s′,r|s) is the probability of moving to state s′ and getting reward ‘r’, given that we start in state ‘s‘.
  • These probabilities are multiplied by the immediate reward that is obtained and the discounted value of the next state.

In this format we can apply this equation to the states of our leaky grid level. So, for example, the Start state has 2 possible transitions, each of which takes place with a probability of 0.5, and which give rewards of -4 and -2 when moving to states S1 and S3 respectively. Its state value is therefore given by:

Equation 7: the state value of the Start state.
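Using the transition probabilities and rewards given above, this is:

$$v(\text{Start}) = 0.5\left[\, -4 + \gamma\, v(S_1) \,\right] + 0.5\left[\, -2 + \gamma\, v(S_3) \,\right]$$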

Since this is a very simple level, which we know will ultimately end up at the exit state and terminate, we can simplify this further by setting the discount factor ‘γ’ to 1. However, we’re still left with the value of the Start state expressed in terms of the values of states 1 and 3, which we don’t yet know.

In the first part of this series we saw how state values could be calculated using iterative policy evaluation. Starting with the initial assumption that each state had a value of zero, we could iteratively improve this estimate until we converged on the true state values. Although we didn’t explicitly state it at the time, we were already using a partial form of the Bellman Equation to accomplish this.

Applying iterative policy evaluation to our Markov Reward process version of the leaky level gives the state values shown below. These values were obtained after 32 iterations, when the maximum change in any state value had converged to less than 0.001 (although the values shown below are only to 1 decimal place):

State values for the MRP.

From the converged state values we can see that, beginning at the start state, moving to the exit will, on average, incur a reward penalty of -16.3. Similarly, if we’ve made it as far as state S2, then we can expect, on average, to incur a penalty of -4 to reach the exit.

So, now that we know all of the state values, we can plug the values for S1 and S3 into equation 7 above, to check we’ve got the correct value for the Start state:

v(Start) = 0.5 [-4 -11.9] + 0.5 [-2 -14.8] = -16.35

Similarly, the value for S2 is calculated from the values of S1 and the Exit:

v(S2) = 0.8 [-1] + 0.2 [-4 -11.9] = -0.8 -3.18 = -3.98

The first 23 iterations of policy evaluation on the leaky grid level are shown below. It can be seen how the values converge (only the first 23 of the total 32 iterations are shown, since after this point the changes in the state values are smaller than the 1 decimal place we’re displaying):

Iterative policy evaluation for the leaky grid level MRP

From the calculated state values it’s easy to see that the best route through the grid would be Start, S1, S2, Exit, to incur the lowest penalty. However, in an MRP we have no way of selecting actions and therefore have no way of choosing which path to take. To enable this we need to add actions and move to a Markov Decision Process (MDP).


Markov Decision Processes (MDPs)

The Bellman Equation allows us to calculate state values by simply doing a one-step look ahead to the next state. Using these values we can then see which states are good to be in and which states should be avoided. However, in a Markov Reward Process we don’t have any control over which states we move to. For this we need to introduce actions and instead move to a Markov Decision Process (MDP).

Actions allow decisions to be made, so it becomes possible to choose the state to move to and the reward to obtain, although, in a stochastic system, you may not always end up in the chosen state or receive the expected reward. Both the transition probability and the reward function now depend on the action. The action itself is selected by a policy, which defines which action should be chosen in any particular state.

By making a very slight adjustment to the Bellman Equation, we can modify it to take account of the policy:

Equation 8: The Bellman equation for policy π

Equation 8 is such a subtle change to the standard Bellman equation of equation 5 that it would be easy to miss. The difference is that the values of both state ‘s’ and the next state Sₜ₊₁, where we’ll move to at the next time step, are now expressed in terms of the policy π. This means that if we are in state ‘s’ at time ‘t’, we’ll choose an action given by the policy ‘π’ and we’ll continue to select actions according to this policy in all future states.

As we did for the MRP, we can change this equation into a usable form by expressing it in terms of the probabilities of taking a particular action ‘a’, ending up in a successor state s′ and receiving a reward ‘r’, summed over all of these quantities:

Equation 9: State value under policy π
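Written out, the state value under policy π is:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$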

This is identical to equation 6, where we expressed the Bellman Equation in terms of transition and reward probabilities for the MRP, except now actions have been added.

So the value of state ‘s’ when following policy ‘π’ is equal to:

  • The sum over all actions that can be taken in a particular state.
  • Each action has a probability of occurring that is governed by the policy. So π(a|s) is the probability that action ‘a‘ will be taken given that we’re in state ‘s‘. Summing over these probabilities is effectively taking the average of all the actions for the state.
  • The reward obtained for taking an action and the next state, where we end up after taking that action, are also stochastic, so we take the average of these by summing over the probabilities of them occurring multiplied by the immediate reward obtained and the discounted value of the next state.

In the first part of this series we only looked at deterministic actions. When an action was taken we knew exactly what the next state would be and the amount of reward we’d get for taking the action. Now, in the full form of the Bellman Equation, the next state and reward for an action are determined by a probability distribution, p(s′,r|s,a). This is the probability of moving to state s′ and getting reward ‘r’, given that we start in state ‘s’ and take action ‘a’.

This is shown below for the possible actions from state S4 of our sample level:

State S4 actions, transition probabilities and rewards.

S4 is a state that contains a large puddle, so taking an action in this state has a large chance of causing a skid to occur. As a result the probability of ending up in the desired state, after taking the chosen action, is only 0.4. There’s a 0.6 chance of skidding and instead ending up in a state that wasn’t the original target. This probability is split equally over the other possible states.

So, for example, from S4’s 3 possible actions, North, East and West, Baby Robot would obviously like to choose to move East and arrive at the exit. However, he’s only got a 0.4 chance of actually reaching this target state and claiming the standard -1 reward for doing so. A more likely outcome is that he’ll skid, in which case there’s a 0.3 chance of ending up at S1, giving him a large negative reward of -4, and an equal 0.3 chance that he’ll end up at state S3 and receive a reward of -2.


Q-Values

If you know the values of the returns that can be expected from taking each individual action it becomes a lot easier to find the best actions to take. Consequently, due to the large amount of knowledge that individual action values can give about a system, they’ve been assigned their own value letter and are referred to as Q-Values.

(We’ve already seen Q-values when we looked at Multi-Armed Bandits. In that case they were used to describe the expected reward for the actions available in the single state bandit problem.)

In equation 9, given above, the state value function is given by the sum over all actions, where the probability of taking each action is multiplied by the value of that action. Therefore, the right-hand sum, taken over the next states and rewards, represents the value of taking a particular action. We can separate this out, to give us the action-value function, when taking action ‘a‘ in state ‘s‘ under policy ‘π‘:

Equation 10: Action value function for policy π
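Written out, the action-value function is:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$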

For state S4, shown in the previous diagram, Baby Robot would like to take the action that leads him to the exit. Using equation 10 we can calculate the action-value for taking this action, to see if this would be a good action to take. Since this is a finite MDP we’ll use a discount factor of γ = 1. Additionally, since the exit is a terminal state, by definition it will have a state value of zero. We haven’t yet calculated the state value functions for S1 and S3, the other possible states where Baby Robot could end up if he skids, so for now we’ll just represent these in their ‘V’ form.

This gives the following action-value for taking the East action in state S4:

  • q(S4, East) = 0.4 × (-1) + 0.3 × (-4 + V(S1)) + 0.3 × (-2 + V(S3))

In comparison, taking the 2 other possible actions in this state would give:

  • q(S4, North) = 0.4 × (-4 + V(S1)) + 0.3 × (-2 + V(S3)) + 0.3 × (-1)
  • q(S4, West) = 0.4 × (-2 + V(S3)) + 0.3 × (-4 + V(S1)) + 0.3 × (-1)

This lets us see how the value for each action can be calculated, although without knowing the values of all the states we can’t yet work out the final action values. To do this we can use the iterative Policy Evaluation method that we looked at in part 1.


Policy Evaluation with the Bellman Expectation Equation

As a reminder, Policy Evaluation initially sets the estimated value of all states to zero and then does repeated sweeps through all of the states to gradually improve these estimates, until eventually it converges on the true state values.

We previously used this to calculate the values of states where the actions were deterministic; when you took an action you always ended up in the desired state. Now things are more complicated, since the actions are stochastic; taking an action doesn’t guarantee you’ll end up in the target state and, as a result, the reward obtained may also vary. So, in this case, we’ll need to implement Policy Evaluation using the full Bellman Expectation equation, which takes the expectation over the possible actions in a state and over the expected next states and rewards that can result from choosing an action.

The Bellman Expectation equation, given in equation 9, is shown in code form below. Here it’s easy to see how each of the two sums is simply replaced by a loop in the code. For each state, the first loop calls a function ‘get_π’, which returns all of the possible actions and their probabilities (i.e. the policy). The second loop then calls ‘get_p’ and iterates over all of the possible next states for the current action, calculating the Q-value for each action (this part of the code implements equation 10).
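A minimal sketch of this structure is given below. It assumes two helper functions, get_pi(state) (standing in for ‘get_π’) and get_p(state, action), which return the policy’s (action, probability) pairs and the dynamics’ (next state, reward, probability) tuples respectively; the notebook’s actual implementation differs in its details:

```python
def calc_state_value(state, V, gamma=1.0):
    """Apply the Bellman Expectation equation (equation 9) to a single state.

    Assumes the two helpers described above:
      get_pi(state)        -> (action, probability) pairs, i.e. π(a|s)
      get_p(state, action) -> (next_state, reward, probability) tuples, i.e. p(s',r|s,a)
    """
    state_value = 0.0
    for action, action_prob in get_pi(state):        # outer sum: over the policy's actions
        q_value = 0.0
        for next_state, reward, prob in get_p(state, action):
            # equation 10: immediate reward plus the discounted value of the next state
            q_value += prob * (reward + gamma * V[next_state])
        state_value += action_prob * q_value          # weight each action by π(a|s)
    return state_value


def policy_evaluation(states, theta=0.001, gamma=1.0):
    """Sweep repeatedly over all states until the largest change is below theta."""
    V = {s: 0.0 for s in states}                      # initial estimate: all values zero
    while True:
        delta = 0.0
        for s in states:
            new_value = calc_state_value(s, V, gamma)
            delta = max(delta, abs(new_value - V[s]))
            V[s] = new_value
        if delta < theta:
            return V
```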

(for the full code implementation, check out the GitHub notebook)

Combining the above function with repeated sweeps across the states allows us to calculate the state value for all of the states. This is slightly different to what we did in part 1, since the results of the actions are no longer deterministic.

For example, as he did on the first simple level from part 1, Baby Robot decides to choose which way to go by tossing a coin (except now, in certain states, he’ll need a 3-sided coin!). In other words, in any state, the actions will be chosen with equal probability. However, due to the puddles in the level, where he ends up may not be where he was hoping to get to. The next state and the reward he gets are determined by the system dynamics.

The state values obtained when running Policy Evaluation on this stochastic policy are shown below. These values eventually converge after 63 iterations.

Iterative policy evaluation for a stochastic policy on the leaky grid level MDP

These values can be seen to be quite a bit worse than those obtained for the MRP. When we chose the transition probabilities for the MRP we chose values that would point in the direction of the exit. Effectively we hard-coded actions into the system. Now, in the MDP, we’re choosing actions randomly and so Baby Robot will end up spending a lot more time exploring the level.

However, although these values are worse than in the MRP, by acting greedily with respect to these values, Baby Robot can improve his policy and move from a stochastic policy to one that is optimal in terms of finding the best route from the start to the exit.


The Bellman Optimality Equation

To navigate through our grid level in the shortest amount of time, or to gain the maximum reward in any Reinforcement Learning problem, in every state we want to select the action that gives the maximum expected return. In other words, since the policy is the thing that decides what actions to choose for each state, we want to find the optimal policy.

For any state, if we search through the set of all possible policies, we can find the action that gives the maximum return. This is the optimal action-value function:

Equation 11: The optimal action-value function is given by taking the action with the maximum action-value over all policies
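In symbols, the optimal action-value is the maximum, over all policies, of the action-value:

$$q_*(s, a) = \max_{\pi}\, q_\pi(s, a)$$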

Then, if for every state we know which is the optimal action, we can simply always choose the optimal action and get the optimal state-value function:

Equation 12: The optimal state-value function
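In symbols, the optimal state value is found by always choosing the best action:

$$v_*(s) = \max_{a}\, q_*(s, a)$$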

So the optimal value function for a state is given by selecting the action that gives the maximum return and then always selecting the optimal action in all future states.

As with the optimal action-value function, the optimal state-value function occurs when the state value is maximum over all policies. When this is true for all states, so that the value of every state is equal to or greater than the value it would have under any other policy, we’ve found the optimal policy.

By substituting the value of ‘q‘ from equation 10 into the optimal state-value function in equation 12, we get the Bellman Optimality Equation. The optimal value for a state is given by selecting the best action in that state and thereafter following the optimal policy to choose the best actions in all subsequent states:

Equation 13: The Bellman Optimality Equation defining the optimal state-value function.
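Written out, the Bellman Optimality Equation for the state value is:

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma\, v_*(s') \,\right]$$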

In the next part we’ll see how the Bellman Expectation and Optimality equations can be used in Policy and Value Iteration to locate the optimal policy.

For now, we can just act greedily with respect to the state values we found above for the stochastic policy to get the optimal actions. From these we can then calculate the state values for this optimal policy. The iterative policy evaluation for the optimal policy is shown below. This actually converges in 31 iterations and produces state values that are much better than those given by a random policy.

Iterative policy evaluation for the optimal policy on the leaky grid level MDP
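Acting greedily is itself just a one-step look-ahead: in each state we calculate the action-value of every available action and pick the largest. A rough sketch, reusing the same assumed get_p helper along with an assumed actions(state) function that lists the moves available in a state:

```python
def greedy_policy(states, actions, V, gamma=1.0):
    """Build a greedy policy: in each state choose the action with the highest Q-value.

    Assumes get_p(state, action) -> (next_state, reward, probability) tuples, as before,
    and actions(state) -> the list of actions available in that state.
    """
    policy = {}
    for s in states:
        q_values = {}
        for a in actions(s):
            # action-value (equation 10) under the current value estimates V
            q_values[a] = sum(prob * (reward + gamma * V[s_next])
                              for s_next, reward, prob in get_p(s, a))
        if q_values:                                  # terminal states have no actions
            policy[s] = max(q_values, key=q_values.get)
    return policy
```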

Using this optimal policy, Baby Robot can now navigate through this simple leaky level and, while he may still skid when he encounters a puddle, he will reach the exit in the shortest possible time.

One such path through the level is shown below. A skid occurs when the actual state that Baby Robot ends up in doesn’t match his intended target state.

Baby Robot following the optimal policy through the leaky level

Summary

In the first part of this series we helped Baby Robot to escape from a simple grid level. We did this by evaluating the state values and then choosing a path based on these values, to let us find the shortest route through the level. Although we didn’t know it at the time, the equations we were using were the Bellman equations and the system we were operating in could be described by a Markov Decision Process. We’ve now covered both of these.

A basic Markov process defines a set of states, each of which satisfies the Markov property, where no knowledge of the past is required. The transition probabilities define the probability of moving from one state to the next and, in a Reinforcement Learning problem, these are given by the current policy.

When rewards are added to the Markov process we, unsurprisingly, get a Markov Reward process. The Bellman Equations let us combine the transition probabilities with these rewards to calculate the value of each state, to give a measure of how good it is to be in each state.

Although in a Markov Reward process we can calculate the value of each state, it’s still not possible to decide how to move through the states to maximise the rewards. To enable this, they need to be extended to allow actions to be taken and this is exactly what happens when we move to Markov Decision processes. When actions are added, a Markov Decision Process can be used to fully describe a Reinforcement Learning problem’s environment, and how an agent acts within that environment.


What’s Next

We’ve now covered all of the basic theory that’s required to describe a Reinforcement Learning problem. Additionally, we’ve already seen how we can use Policy Evaluation to calculate the state values from which, for simple problems, we were able to find the optimal policy.

In the next part we’ll extend this, to use the Bellman Equations that we’ve just looked at, in the Policy and Value Iteration algorithms. These will allow us to discover the optimal policy in more advanced settings.


< Part 1                                                    Part 3 >
State Values and Policy Evaluation        Policy and Value Iteration
