
Monte Carlo Learning

Reinforcement Learning using Monte Carlo Method

In this article I will cover the Monte Carlo method of reinforcement learning. In an earlier article I briefly covered the Dynamic Programming methods (Value Iteration and Policy Iteration). Dynamic Programming requires a model (the agent knows the MDP transitions and rewards), and the agent does planning: once the model is available, the agent plans its action in each state. There is no real learning by the agent in the Dynamic Programming method.

The Monte Carlo method, on the other hand, is a very simple concept: the agent learns about states and rewards by interacting with the environment. The agent generates sampled experience, and the value of a state (or state-action pair) is calculated as the average return observed from it. Below are the key characteristics of the Monte Carlo (MC) method:

  1. There is no model (the agent does not know the MDP transitions)
  2. The agent learns from sampled experience
  3. The state value vπ(s) under policy π is learned by averaging the returns observed across sampled episodes (value = average return)
  4. Values are updated only after a complete episode, which is why convergence is slow
  5. There is no bootstrapping
  6. It can only be used for episodic problems

Consider a real-life analogy: Monte Carlo learning is like an annual examination, where the student completes the episode at the end of the year and the exam result is the return. If the goal is to find how students in a class score over a calendar year (the episode), we can take the results of a few sample students and average them to estimate the class score (don't take the analogy point by point, but at a holistic level it captures the essence of MC learning). By contrast, TD learning, or temporal difference learning, which I will cover in a future post, updates values at every time step and does not need to wait until the end of the episode. It is more like a weekly or monthly test: the student can adjust their performance after every small interval based on the score (reward) received, and the final score is the accumulation of all the weekly tests (total reward).

Value function = Expected Return

The expected return is the discounted sum of all rewards collected from that state onwards.

In the Monte Carlo method, instead of the expected return, we use the empirical return that the agent has sampled while following the policy.
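Written out in standard notation (assuming the usual discounted return with discount factor γ and terminal time step T):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T,
\qquad
v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
```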

Going back to our very first example of gem collection: the agent follows the policy and completes an episode, collecting rewards in the form of gems at each step along the way. To get the value of a state, the agent sums up all the gems collected in each episode starting from that state. Refer to the diagram below, where 3 sample episodes are collected starting from state S05. The total reward collected in each episode (with the discount factor taken as 1 for simplicity) is as follows:

Return(Sample 01) = 2 + 1 + 2 + 2 + 1 + 5 = 13 gems

Return(Sample 02) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems

Return(Sample 03) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems

Observed mean return (based on 3 samples) = (13 + 15 + 15)/3 = 14.33 gems

Thus, as per the Monte Carlo method, the state value vπ(S05) is 14.33 gems, based on 3 samples following policy π.
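As a quick sanity check, the same sample-average computation in Python (the reward lists below simply restate the three sampled episodes above):

```python
# Per-step rewards (gems) collected in each sampled episode starting from S05
sample_returns = [
    sum([2, 1, 2, 2, 1, 5]),  # Sample 01 -> 13 gems
    sum([2, 3, 1, 3, 1, 5]),  # Sample 02 -> 15 gems
    sum([2, 3, 1, 3, 1, 5]),  # Sample 03 -> 15 gems
]

# Monte Carlo estimate of v_pi(S05): plain average of the sampled returns
print(sum(sample_returns) / len(sample_returns))  # 14.333...
```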

Monte Carlo Backup diagram

The Monte Carlo backup diagram would look like the one below (refer to the 3rd blog post for more on backup diagrams).

There are two types of MC learning policy evaluation (prediction) methods:

First Visit Monte Carlo Method

In this case, only the first visit to a state within an episode is counted (even if the agent comes back to the same state multiple times in the episode, only the first visit counts). The detailed steps are below, followed by a short code sketch:

  1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are maintained across episodes)
  2. The first time-step t at which state s is visited in an episode, increment the counter: N(s) = N(s) + 1
  3. Increment the total return: TR(s) = TR(s) + Gt
  4. The value is estimated by the mean return: V(s) = TR(s)/N(s)
  5. By the law of large numbers, V(s) → vπ(s) (the true value under policy π) as N(s) approaches infinity

Refer to the diagram below for a better understanding of the counter increment.
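A minimal sketch of first-visit MC prediction in Python, under the assumption that each sampled episode is available as a list of (state, reward) pairs collected while following π (the function and variable names are illustrative, not from the original article):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) from sampled episodes using first-visit Monte Carlo.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward) pairs gathered while following policy pi.
    """
    N = defaultdict(int)     # visit counter N(s)
    TR = defaultdict(float)  # total return TR(s)
    V = defaultdict(float)   # value estimate V(s)

    for episode in episodes:
        # Backwards pass: compute the return G_t for every time step
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if state in seen:          # only the FIRST visit in this episode counts
                continue
            seen.add(state)
            N[state] += 1
            TR[state] += G
            V[state] = TR[state] / N[state]  # value = mean return
    return V

# Usage: V = first_visit_mc_prediction(sampled_episodes)
```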

Every Visit Monte Carlo Method

In this case, every visit to a state within an episode is counted. The detailed steps are below:

  1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are maintained across episodes)
  2. Every time-step t at which state s is visited in an episode, increment the counter: N(s) = N(s) + 1
  3. Increment the total return: TR(s) = TR(s) + Gt
  4. The value is estimated by the mean return: V(s) = TR(s)/N(s)
  5. By the law of large numbers, V(s) → vπ(s) (the true value under policy π) as N(s) approaches infinity

Refer to the diagram below for a better understanding of the counter increment.
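The every-visit variant needs only one change to the sketch above: drop the first-visit check, so repeated visits within the same episode all contribute to N(s) and TR(s). Again a minimal sketch under the same assumptions:

```python
from collections import defaultdict

def every_visit_mc_prediction(episodes, gamma=1.0):
    """Same as the first-visit sketch, but WITHOUT the first-visit check:
    every occurrence of a state in an episode contributes to the average."""
    N, TR, V = defaultdict(int), defaultdict(float), defaultdict(float)
    for episode in episodes:
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:       # no `seen` set here
            N[state] += 1
            TR[state] += G
            V[state] = TR[state] / N[state]
    return V
```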

Usually, MC values are updated incrementally after every episode (there is no need to store old episode returns; a running mean for each state can be updated after every episode).

Update V(s) incrementally after each episode S1, A1, R2, …, ST. For each state St with return Gt:

N(St) ← N(St) + 1

V(St) ← V(St) + (1/N(St)) * (Gt − V(St))

Usually, in place of 1/N(St), a constant learning rate (α) is used, and the above equation becomes:

V(St) ← V(St) + α * (Gt − V(St))
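A minimal sketch of this incremental update for one finished episode; passing alpha=None uses the running-mean step 1/N(St), while a constant alpha gives the learning-rate form (the dictionaries and names are illustrative assumptions):

```python
def incremental_mc_update(V, N, visit_returns, alpha=None):
    """Update V(s) in place after one finished episode.

    `visit_returns` is assumed to be a list of (state, G_t) pairs for the
    visits being counted (first-visit or every-visit). With alpha=None the
    running-mean step 1/N(s) is used; a constant alpha gives the
    learning-rate form instead.
    """
    for state, G in visit_returns:
        N[state] = N.get(state, 0) + 1
        step = alpha if alpha is not None else 1.0 / N[state]
        # V(St) <- V(St) + step * (Gt - V(St))
        V[state] = V.get(state, 0.0) + step * (G - V[state])
    return V
```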

For policy improvement, the Generalized Policy Iteration (GPI) concept is used: the policy is improved using the action-value function estimated by the Monte Carlo method.
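The article does not spell out the improvement step, but a common choice in MC control is to act ε-greedily with respect to the current Monte Carlo estimate of Q(s, a). A hedged sketch, with Q assumed to be a dict keyed by (state, action):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick an action epsilon-greedily w.r.t. the current MC estimate of Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```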

Monte Carlo methods have the following advantages:

  • Zero bias
  • Good convergence properties (even with function approximation)
  • Not very sensitive to initial values
  • Very simple to understand and use

But they have the following limitations as well:

  • MC must wait until the end of an episode before the return is known
  • MC has high variance
  • MC can only learn from complete sequences
  • MC only works for episodic (terminating) environments

Even though the MC method takes time, it is an important tool for any reinforcement learning practitioner.

Thanks for reading. You can connect with me on LinkedIn.



