
Reinforcement Learning, Part 3: Monte Carlo Methods

From casinos to AI: unveiling the power of Monte Carlo methods in complex environments

Introduction

Reinforcement learning is a domain in Machine Learning that introduces the concept of an agent who must learn optimal strategies in complex environments. The agent learns from its actions that result in rewards given the environment’s state. Reinforcement learning is a difficult topic and differs significantly from other areas of machine learning. That is why it should only be used when a given problem cannot be solved otherwise.

In this article, we are going to discover Monte Carlo algorithms. Compared to standard approaches, their elegance lies in the fact that they do not require any knowledge about the environment’s dynamics. This aspect makes all the difference because in real life we do not normally know all possible transition probabilities between states.

Note. To fully understand the concepts included in this article, it is highly recommended to be familiar with the main concepts of reinforcement learning as well as the policy improvement methods introduced in the first two articles of this series.

Reinforcement Learning, Part 1: Introduction and Main Concepts

Reinforcement Learning, Part 2: Policy Evaluation and Improvement

About this article

  • In Part 1, we introduced the main concepts of reinforcement learning: the framework, policies, value functions and Bellman equations.
  • In Part 2, we explored the GPI methodology, which aims to find optimal policies by alternating between policy evaluation and policy improvement.

Despite their importance, the principles we learned in Part 2 are only applicable in perfect environments whose dynamics are fully known. When the transition probabilities p(s’, r | s, a) are not given (which is the case in the majority of real-life problems), it becomes much harder to search for optimal strategies.

Fortunately, there exist techniques in reinforcement learning that can optimize strategies without using knowledge about the environment’s dynamics. Monte Carlo methods are examples of such algorithms.

This article is based on Chapter 5 of the book "Reinforcement Learning" written by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.

Note. The source code for this article is available on GitHub. Note that the code generates Plotly diagrams, which are not rendered in GitHub notebooks. If you would like to look at the diagrams, you can either run the notebook locally or navigate to the results folder of the repository.

ML-medium/monte_carlo/blackjack.ipynb at master · slavastar/ML-medium

Model-free vs model-based methods

In reinforcement learning, the term model refers to anything the agent uses to predict the environment’s response to its actions.

Reinforcement Learning algorithms can be classified into two categories:

  • Model-based: methods that use a model to plan actions before they are taken.
  • Model-free: methods without a model. The algorithm tries to learn to associate actions and respective returns.

The algorithm we studied in the previous article is an example of a model-based approach, since the agent uses the given transition probabilities p(s’, r | s, a).

Monte Carlo (MC) methods neither estimate nor use p(s’, r | s, a), so they are model-free. Instead, they learn from experience – sequences of states, actions and rewards. Using the agent’s trajectories, they approximate expected values by averaging observed returns and rely on GPI to obtain optimal policies.

In this article, we will be focusing only on episodic tasks. They are easier to study in the context of MC methods, since the experience from each episode is obtained independently of other episodes. The more episodes are available, the more information can be used to better approximate value functions.

V-function estimation

By definition, the v-value of a state corresponds to the expected return the agent receives by starting from that state. With the information from the agent’s experience, we can average all returns after visits to that state. This way, the computed average value will represent the v-value of the state.

Having more experience is very beneficial, since each state’s average is then computed from more returns, improving the overall approximation.

Based on the fact that the agent can be in the same state multiple times during an episode, there exist two versions of the MC algorithm for V-function estimation:

  • The first-visit MC method: only returns of the first visits to a state are considered;
  • The every-visit MC method: returns during all visits to a state are considered.
The first-visit MC method pseudocode. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

The first-visit MC algorithm has been studied more in the field.

As the number of visits goes to infinity, both methods converge to the true V-function by the law of large numbers.
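To make the procedure concrete, here is a minimal Python sketch of the first-visit version. It assumes a hypothetical generate_episode(policy) helper that plays one episode under the given policy and returns a list of (state, reward) pairs, where each reward is the one received after leaving the corresponding state.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, n_episodes, gamma=1.0):
    """Estimate the V-function of a policy by averaging first-visit returns."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(n_episodes):
        # one episode is a list of (state, reward) pairs produced under the policy
        episode = generate_episode(policy)
        states = [s for s, _ in episode]

        G = 0.0
        # iterate backwards so that G accumulates the discounted return
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # first-visit check: update only if the state does not occur earlier
            if state not in states[:t]:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

The every-visit variant is obtained by simply dropping the first-visit check and updating the averages at every time step.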

Example

We will have a look at the example of a popular casino game – blackjack.

Rules

The game’s objective is to score more points than the dealer without exceeding 21. The rules are simple: the player is dealt two cards and the dealer reveals a single card. Then the player has to make one of two decisions:

  • Hit: the player takes an additional card and this card’s value is added to the player’s score. If the player’s score is greater than 21, then they lose the game. The player can continue hitting as much as they would like to.
  • Stick: the player does not take any more cards and passes the turn to the dealer. The dealer keeps taking cards until they obtain 17 or more points. If the dealer’s score is greater than 21, then the player wins the game. If the dealer’s score is between 17 and 21, then the person with more points wins and the other person loses; if both have the same number of points, the game is a draw.

The cards have the following values shown in the table below:

Blackjack card values. The top row refers to playing cards and the bottom row to their values.

The ace is a special game card: it counts as 11 if the overall score with it does not exceed 21 (in this case, the ace is called usable); otherwise, the ace counts as 1.
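As a small illustration of this rule, here is a sketch (my own helper, not part of the article’s notebook) that computes the value of a hand and whether its ace is usable:

```python
# face cards count as 10; the ace is stored as 1 and upgraded to 11 when usable
CARD_VALUES = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8,
               "9": 9, "10": 10, "J": 10, "Q": 10, "K": 10, "A": 1}

def hand_value(cards):
    """Return (score, usable_ace) for a list of card labels, e.g. ["A", "7"]."""
    score = sum(CARD_VALUES[c] for c in cards)
    # count one ace as 11 if that does not push the score above 21
    if "A" in cards and score + 10 <= 21:
        return score + 10, True
    return score, False

print(hand_value(["A", "7"]))        # (18, True)  - the ace is usable
print(hand_value(["A", "7", "9"]))   # (17, False) - the ace falls back to 1
```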

State definition

The player’s decision in blackjack depends only on three factors:

  • The player’s current score (between 12 and 21);
  • Whether the player has a usable ace or not (binary value);
  • The dealer’s score with the revealed card (between 2 and 11).

That is why it is logical to define the state set S as the set of all possible tuples containing these three variables. For simplicity, we will not be concerned about trivial states where the player’s score is less than 12, since in such cases it is always optimal for the player to hit. As a result, we have only 10 ⋅ 2 ⋅ 10 = 200 states. This number of states is low compared to most reinforcement learning problems, so it should be relatively easy for the Monte Carlo algorithm to calculate value functions.

The first-visit MC method is equivalent to the every-visit MC method in blackjack, since every game state can be visited at most once during an episode.
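Putting the three variables together, a state can be represented as a simple tuple. The sketch below uses my own naming conventions (the dealer’s ace is encoded as 11 here, whereas the diagrams below denote it by 1) and also defines the "stick only upon reaching 20" policy evaluated in the next section:

```python
from itertools import product

# state = (player_sum, usable_ace, dealer_card): 10 * 2 * 10 = 200 states
STATES = list(product(range(12, 22), (False, True), range(2, 12)))
assert len(STATES) == 200

HIT, STICK = 0, 1

def stick_on_20_policy(state):
    """The evaluated policy: stick only when the player's sum is 20 or 21."""
    player_sum, usable_ace, dealer_card = state
    return STICK if player_sum >= 20 else HIT
```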

V-function

The diagram below shows the V-function estimated by the Monte Carlo algorithm over 5 ⋅ 10⁶ iterations under the policy according to which the player sticks only upon reaching 20 points.

The V-function for the policy, according to which the player sticks if they reach 20 points. The horizontal axis denotes the player’s card sum and the vertical axis shows the dealer’s first card. The dealer’s ace is denoted by 1.

Here are some interesting observations we can make:

  • The overall policy of sticking only after reaching 20 performs poorly. The only two situations in which the player has an advantage are when their score is either 20 or 21. If a player has 18, for example, and takes another card, it is very likely that they will exceed 21 (by drawing any card with a value of 4 or greater) and lose the game.
  • The color structure of both heatmaps is very similar. However, the presence of a usable ace gives the player a "second chance" if their score exceeds 21. Therefore, the corresponding v-values for cases with usable aces are higher than those without it.
  • The obtained v-values for games with a usable ace are less precise. This is explained by the fact that these cases occur less frequently in blackjack. A possible solution would consist of generating more episodes.

Theoretically, we could have used dynamic programming algorithms to estimate the values of blackjack states, but it would be much harder. For instance, imagine that the player has 13 points. To calculate the v-value for that state, we would need all transition probabilities to states with 14 up to 21 points, which are extremely hard to estimate with probability formulas, since there are too many possible game combinations.

Q-function estimation

In model-based methods, the estimated V-function is enough to find the optimal policy: in any state, we can always choose an action such that the sum of the action’s reward (given by p(s’, r | s, a)) and the next state’s return is maximal.

In model-free methods (which is the MC case), having only the V-function is not sufficient. The component we are missing is the reward of choosing any given action in every state; without it, we cannot estimate the overall return of taking a certain action. At the same time, we cannot use the Bellman equation to establish the relationship between v- and q-values: in this scenario, we would still need p(s’, r | s, a).

An elegant trick to overcome this issue is to look at the problem from a slightly different angle. We can take all possible pairs (s, a) ∈ S × A and treat them as states. To find the values of these states, we can utilise the MC method for V-function estimation! Computing this V-function is equivalent to estimating the Q-function of the original problem.

The idea of MC methods for estimating Q-functions consists of calculating the average return during all visits to a given (state, action) pair.

In particular, to apply the first-visit MC method, we can derive the visit criteria: a state-action pair (s, a) is visited if the state s is visited and the action a is chosen in it.
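A sketch of this idea in Python, mirroring the V-function estimation above but keyed by (state, action) pairs. It again assumes a hypothetical generate_episode(policy) helper, which now returns (state, action, reward) triples:

```python
from collections import defaultdict

def first_visit_mc_q_prediction(generate_episode, policy, n_episodes, gamma=1.0):
    """Estimate the Q-function of a policy by averaging first-visit returns
    of (state, action) pairs."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = defaultdict(float)

    for _ in range(n_episodes):
        # one episode is a list of (state, action, reward) triples
        episode = generate_episode(policy)
        pairs = [(s, a) for s, a, _ in episode]

        G = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # a pair (s, a) is "visited" when s is visited and a is chosen in it
            if (state, action) not in pairs[:t]:
                returns_sum[(state, action)] += G
                returns_count[(state, action)] += 1
                Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]
    return Q
```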

Advantages

Compared to the dynamic programming approach we have covered in the previous article, Monte Carlo methods for estimation of value functions have significant advantages:

  • every state is estimated independently of the values of other states;
  • the algorithmic complexity for updating a single state depends on the number of episodes and not on the total number of states. Since it is up to us to define the number of episodes to average the ultimate results, we have more flexibility, even if the total number of states is high;
One of the advantages of using MC methods over dynamic programming is that the algorithmic complexity to compute value functions only depends on the number of episodes and not on the number of states.
  • in many environment settings, it is almost impossible to theoretically estimate the transition probabilities for every state and action to obtain the environment’s dynamics. At the same time, MC methods do not need this information and can elegantly learn from observations, which are easy to generate.

Example

Let us continue with the previous blackjack example but this time for the Q-function estimation.

Q-function

As we already know, there are two possible actions in blackjack: hit and stick. If we combine them with the possible states, we get 200 ⋅ 2 = 400 (state, action) pairs. This makes it easy for the Monte Carlo algorithm to estimate any Q-function under a given policy.

The diagram below shows the Q-function for a random policy where p(hit) = p(stick) = 0.5.

The Q-function for the random policy. The horizontal axis denotes the player’s card sum and the vertical axis shows the dealer’s first card. The dealer’s ace is denoted by 1.
  • As in the previous scenario, the diagrams with and without a usable ace are similar to each other, except that the usable ace gives the player a higher chance of winning in every state.
  • The diagram with the overall lowest q-values is the one for the action hit in the case without a usable ace. Two factors contribute to this: first, by hitting, the player already risks going bust; second, there is a 50% chance (according to our policy) that the player will hit again, which significantly increases the chances of losing the game. As a consequence, the last column for 21 player points consists entirely of -1 values, since the player is guaranteed to lose the game by hitting in all of these situations.

Exploration vs exploitation trade-off

Note. After understanding how to evaluate Q-functions, it would be logical to discuss how to find optimal policies. However, before doing that, we first need to go through several important concepts that will help us construct an efficient algorithm.

At the moment, there is a major issue with the current approach to Q-function estimation: the total number of (state, action) pairs is high, and many of them might never be visited. If the policy is deterministic, then for a given state, it might always greedily choose only one action in the following learning iterations. As a result, we will never be able to explore other (state, action) pairs which could have potentially led to higher returns.

A policy is called deterministic if, in a given state, the agent always takes the same single action. Such a policy assigns a probability of 1 to this action and 0 to all other actions of that state. If this condition is not satisfied, then the policy is called stochastic.

The described problem is a classic formulation of exploration and exploitation trade-off. On the one hand, we want to exploit the best action to obtain the highest possible reward. On the other hand, we need to explore all possible actions to determine which of them results in bigger rewards.

That is why it is necessary to strive for the right balance between exploitation and exploration. Speaking of Monte Carlo methods, there exist several approaches, one of which will be described below.

Example

Imagine an agent learning the optimal strategy in the maze below. The agent’s objective consists of collecting the maximal possible return.

Maze example. If the policy is updated greedily, then it is likely that the agent will not explore the reward at D2.

The agent starts at A1 and can go in any direction at every iteration. The terminal state is located at A3 and gives a reward R = 1. Apart from this, there is a large reward R = 10 at D2 that we, obviously, want the agent to collect. At the same time, there is a hole at C1. If the agent steps there, the episode ends and the reward R = -3 is given to the agent. Stepping into other cells brings the reward R = 0.

The optimal strategy would consist of collecting the reward at D2 and then navigating to the terminal state at A3. Unfortunately, this would not always be the case if the balance between exploration and exploitation is not adjusted appropriately.

That is, if the agent always greedily chooses the action with the maximal return, then after the first several iterations it might settle on the path A1-A2-A3 as the optimal one without exploring any other options.

In another scenario, the agent might initially go to the right, but if it falls into the hole at C1, then policy evaluation will assign negative state or state-action values to the cells around C1. If the policy is greedy or the exploration rate is very small, then the agent will never explore the reward at D2.
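To make the setup concrete, the reward structure of this maze can be written down as a small lookup table. The cell names follow the description above; treating only A3 and C1 as terminal cells is my reading of that description:

```python
# reward received when the agent steps into a cell; unspecified cells give 0
REWARDS = {"A3": 1, "D2": 10, "C1": -3}
TERMINAL_CELLS = {"A3", "C1"}   # reaching the goal or falling into the hole ends the episode

def step_reward(cell):
    """Reward for stepping into the given cell."""
    return REWARDS.get(cell, 0)
```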

Exploring starts

One of the simplest ways to explore more (state, action) pairs is to explicitly start episodes multiple times from every possible (state, action) combination, so that averages are obtained even in rare scenarios. Another approach consists of randomly sampling the starting state of every episode with a non-zero probability for all states. This approach is called exploring starts.

The exploring starts technique makes the distribution of (state, action) pairs more aligned and increases chances of observing returns in rare environment situations.

Despite its simplicity, exploring starts has a considerable disadvantage: the data distribution used for exploring starts might differ a lot from the distribution the agent actually encounters when interacting with the environment. As a consequence, the agent might learn a policy that is optimal for an unrealistic data distribution, and that policy will then not be optimal in the real environment.
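The sampling step of exploring starts can be sketched as follows, assuming a hypothetical generate_episode_from(state, action, policy) helper that is able to begin an episode from an arbitrary (state, action) pair:

```python
import random

def exploring_start(states, actions):
    """Sample a starting (state, action) pair so that every pair
    has a non-zero probability of beginning an episode."""
    return random.choice(states), random.choice(actions)

# inside the estimation loop, every episode would then begin from a random pair:
# start_state, start_action = exploring_start(STATES, [HIT, STICK])
# episode = generate_episode_from(start_state, start_action, policy)
```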

Conclusion

In this article, we have gone through Monte Carlo methods, which are incredibly useful in reinforcement learning. Their ability to optimize policies in environments without any knowledge of their dynamics makes them applicable to many more problems than the policy improvement algorithms discussed in the previous part.

Nevertheless, we have only talked about ways to estimate V- and Q-functions and have not covered in detail the full Monte Carlo algorithm for finding an optimal policy, since we first have to address the problem of balancing exploration and exploitation. To do that, we will introduce ε-greedy policies in the next part, combine them with the exploring starts approach and make several improvements over the current version of the algorithm.

Resources

  • Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second Edition. MIT Press, 2018.

All images unless otherwise noted are by the author.

