Reinforcement Learning Tutorial Series

Monte Carlo Methods

Sagi Shaier
Towards Data Science
8 min read · Nov 20, 2020


This is part 5 of the RL tutorial series, which provides an overview of the book “Reinforcement Learning: An Introduction” (second edition) by Richard S. Sutton and Andrew G. Barto. The book is available for free here.

[ref]

This article is part of a series. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, and Part 7!

Chapter 5 — Monte Carlo Methods
Unlike previous chapters, where we assumed complete knowledge of the environment, here we’ll estimate value functions and find optimal policies from experience.
We start looking at model-free learning, where we don’t know how a state transitions into the next state given our actions.
In general, “Monte Carlo” describes algorithms that rely on random sampling. In this chapter we use it to describe methods that sample episodes from the environment.

Monte Carlo methods require only experience: they sample states, actions, and rewards while interacting with the environment.
They are a way to solve RL problems by averaging sample returns.
And since we are going to average returns, the book focuses on Monte Carlo for episodic tasks (with a continuing task it would be difficult to compute the average). Only once an episode ends do the value estimates and policies change.

How chapters 3 and 4 worked
To see how the previous chapters updated the value of states, let’s look at the Bellman equation again

[ref]
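
For reference, the Bellman equation for the state-value function under a policy π (as used in the previous chapters) is:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```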

In those cases, we knew how state “s” transitions into the next state “s’” because we had a model of the environment.

In model free learning, we don’t have the transition function p(s’, r | s, a).
Instead, we update the states by averaging the returns we experience while traveling through those states.
Basically, until now we knew all the transitions (s → s’), so we could update a state’s value by computing over them directly. Now, however, we have to actually explore: starting from state “s”, we see what the next states and actions look like from experience (i.e. we sample states and actions), and we update the value of “s” by averaging the returns we collect as we explore.

How the sampling returns differ from before

Sampling returns (left) vs backup diagram vπ (right) [ref]

In sampling returns, we update the value of state “S” based on sample episodes that pass through that state (left image).
In comparison, in the backup diagram from before, we looked one step ahead at all of the next states S’ and used them to update state S.

Monte Carlo (MC) Prediction
Here we’ll see how we can learn the state-value function for a given policy. The difference from the previous model-based RL methods is in how we evaluate the policy and estimate the values.

From previous chapters we’ve seen that the value of a state is the expected return starting from that state and then following a particular policy.
An easy way to estimate it based on experience, would be to average the returns we observe after visiting a state. As we interact with the environment more and observe more returns, the average should converge to the expected value. That’s the idea behind all Monte Carlo methods.

Suppose we want to estimate the value of a state Vπ(s). Each time we visit state “s” in an episode is called a “visit to s”. Clearly, we may visit “s” many times in a single episode. Let’s call the first time we visit “s” in an episode the “first visit to s”.

In first-visit MC, we estimate the value of state “s” by averaging the returns we observe after our first visit to “s”.

In every-visit MC, we estimate the value of “s” by averaging the returns after all visits to “s”.

First visit MC [ref]
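
To make the procedure concrete, here is a minimal Python sketch of first-visit MC prediction. It assumes episodes have already been collected as lists of (state, reward) pairs, where the reward is the one received after leaving that state, and that states are hashable; the function name and data layout are illustrative, not from the book.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns observed after the
    first visit to s in each episode."""
    returns_sum = defaultdict(float)   # total return observed per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Walk the episode backwards, accumulating the discounted return.
        for state, reward in reversed(episode):
            G = gamma * G + reward
            # Overwriting as we move backwards means the value we keep is
            # the return that followed the earliest (first) visit to the state.
            first_visit_return[state] = G

        for state, g_first in first_visit_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]

    return V
```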

Monte Carlo Estimation of Action Values
As we’ve seen, if we have a model of the environment it’s quite easy to determine the policy from the state values (we look one step ahead to see which action gives the best combination of reward and next-state value).
But if we don’t have a model of the environment, state values are not enough. In that case, it is useful to estimate action values (the values of different actions in a state) rather than state values.

Thus, a main goal of MC methods is to estimate the optimal action values q∗.

To obtain q∗, we first look at policy evaluation for action values.
That means estimating qπ(s, a): the expected return when we start in state “s”, take action “a”, and then follow policy π.
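
In symbols, that is:

```latex
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a \right]
```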

This is similar to what we talked about with state values (Vπ), except that now we’re talking about visiting a state-action pair, rather than just a state.

You can think about it as being a little more specific. A single state may have several actions. So by visiting a state we have several options (i.e. actions) we can take. However, when we talk about state-action pair, we are always talking about taking that specific action in that specific state.

In the first-visit MC method we average the returns after the first time we took that action in that state.

In the every-visit MC method we estimate the value of that state-action pair by averaging the returns that have followed visits to it.

The only problem is that many state–action pairs may never be visited. For example, with a deterministic policy we’ll only take one action per state (the one the policy favors), and hence we’ll only observe returns for that one action. This is the general problem of maintaining exploration: for policy evaluation to work, we need to make sure we visit all actions in every state.

One way to go about this is to specify that every episode starts in some state–action pair, and that every pair has a probability > 0 of being selected as the starting pair. The book calls this the assumption of exploring starts.

MC ES [ref]

But this assumption isn’t always realistic. For example, when we learn directly from actual interaction with the environment, we usually can’t choose arbitrary starting conditions.

A more common way to go about it is to consider only stochastic policies in which the probability of every action in every state is nonzero.

Monte Carlo Control
We now look at how our MC estimation can be used for control, meaning to approximate optimal policies.

The idea is to follow generalized policy iteration (GPI), where we will maintain an approximate policy and an approximate value function.
We continuously alter the value function to better approximate the value of the current policy, and we continuously improve the policy with respect to that value function (see the previous chapter).

GPI [ref]

The policy evaluation part is done exactly as described in the previous chapter, except that we’re evaluating state–action pairs rather than states.

The policy improvement part is done by taking greedy actions in each state. That is, for any action-value function “q”, and for every state “s”, the greedy policy chooses the action with maximal action-value:

[ref]
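
In other words, the greedy policy can be written as:

```latex
\pi(s) \doteq \arg\max_{a} q(s, a)
```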

Monte Carlo Control without Exploring Starts
To make sure that all actions keep being selected infinitely often, we must continue to select them.
There are two approaches to ensure this: on-policy methods and off-policy methods.

On-policy methods: we try to evaluate or improve the policy that we have.

Off-policy methods: we have two policies; we try to evaluate or improve one of them while using the other to generate behavior (i.e. to explore).

This section focuses on an on-policy method that doesn’t use the assumption of exploring starts.

In on-policy methods the policy is generally soft, meaning π(a|s) > 0 for every state “s” and action “a”, while gradually moving closer to a deterministic optimal policy. Here the policy is ε-greedy (see part 2).

On policy first visit MC [ref]
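
As a rough illustration of the ε-greedy part (a sketch with my own function name and layout, not the book’s pseudocode), the action probabilities for a single state with estimated action values Q(s, ·) could be computed like this:

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Return ε-greedy action probabilities for one state.

    `q_values` is assumed to be a 1-D array of estimated action values
    Q(s, a), one entry per action. Every action keeps at least
    probability ε / |A|, so the policy stays soft.
    """
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    best_action = int(np.argmax(q_values))
    probs[best_action] += 1.0 - epsilon
    return probs

# Example: acting with the soft policy during an episode.
q_values = np.array([0.2, 1.5, -0.3])
action = np.random.choice(len(q_values), p=epsilon_greedy_probs(q_values))
```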

Off-policy Prediction via Importance Sampling
Suppose now that our episodes are generated by a different policy than the one we want to evaluate (i.e. an off-policy method).

The way we go about this is by keeping a target policy π, which is the policy we want to become optimal, and a behavior policy b, which is our exploration policy.
The idea is to use episodes generated by the behavior policy to explore the environment, and then use them to update our target policy π, i.e. to estimate Vπ or qπ.

For this method to work, we have to relate π and b using importance sampling.
This is a technique for estimating expected values under one distribution given samples from another distribution.
Basically, we weight the returns by the relative probability of their trajectories occurring under π and b. This weight is called the importance-sampling ratio.

Example:
Consider the 2 vectors below:

π (left) and b (right) [image taken by me]

Looking at the target policy π, we see a 20% chance of taking action A1, whereas the behavior policy b has a 70% chance of taking A1. This kind of difference in action probabilities is why we need to weight the returns differently under π (left) and b (right).

So let’s see how importance sampling is used to relate the probabilities of the two policies, so that we can update π even though we’re behaving according to b.

Importance-sampling ratio formula [ref]

Looking at the formula for the importance-sampling ratio, it is the product, over the trajectory, of the probability of each action given its state under the target policy divided by the probability of that same action under the behavior policy. And again, we do this so we can learn off-policy.
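
Written out, the ratio for the part of the trajectory from time t up to the end of the episode at time T is:

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```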

Imagine that the behavior policy interacts with the environment and receives a return Gt = 10.
If the actions b took are, say, 3 times more likely under π than under b, the ratio is 3, so we weight the return by 3 and use a weighted return of 30 when updating π:

π(a|s) > b(a|s) → ρ > 1, e.g. ρ = 3 → weighted return = 10 × 3 = 30

On the other hand, if the behavior is much more likely under b than under π, the return gets a smaller weight when updating π:

π(a|s) < b(a|s) → ρ < 1, e.g. ρ = 0.25 → weighted return = 10 × 0.25 = 2.5

Incremental Implementation
This is similar to chapter 2, where we saw that to compute the average of a sequence of numbers you only need the running total and the count of numbers, not the whole sequence.
Incremental implementation is another important idea for saving memory while averaging returns.
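
As a minimal sketch of that idea (the variable names are mine, not the book’s): instead of storing every return, we keep only the current average and the count, and update the average in place.

```python
def incremental_average(old_average, new_return, n):
    """Running average of returns after seeing the n-th return.

    Equivalent to averaging all n returns, but only the previous
    average and the count need to be stored:
        new_avg = old_avg + (1 / n) * (new_return - old_avg)
    """
    return old_average + (new_return - old_average) / n

# Example: averaging the returns 10, 6, 8 one at a time.
avg = 0.0
for n, G in enumerate([10.0, 6.0, 8.0], start=1):
    avg = incremental_average(avg, G, n)
print(avg)  # 8.0
```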

If you’re interested in more of my work you can check out my Github, my scholar page, or my website
