

Introduction to Tabular MC Reinforcement Learning: A Blackjack Example

Finding optimal moves in the classic gambler’s game

Blackjack is one of the most widely played gambling games in the world. Historically known as Black Jack and Vingt-Un ("Twenty-One"), the game is believed to have European roots dating back to the 17th century.

Reinforcement learning is a high-level term for a discipline that seeks to maximize some reward without explicit instructions being given to the learner, or agent. Instead, the agent must interact with an environment to learn rules, or policies, that generalize across that environment.

One of the ways to perform a reinforcement learning task is to use Monte Carlo methods, which in programming broadly refers to any method that relies on randomization to find an optimum. In layman's terms, it means you are creating an agent much like a newborn baby: at the beginning it mostly has no idea what it is doing, and its actions might as well be random. Such agents learn entirely through trial and error. As they succeed or fail, they build up a "rule of thumb", or policy, until they finally know which move to take in each situation. In this article, we will perform the simplest form of such a learning task on the most basic version of Blackjack, without the rule variants that many casinos use.

(Fig 1) Reinforcement learning by direct RL, without model-based planning (shown in dashed lines)

An on-policy (policy-based decision making) direct RL loop requires the agent (learner) to know its own states and available actions. A value can be tied to the state, the action, or both, as in this example. The learner starts from a blank slate and begins interacting with the environment, which provides experience, usually in the form of rewards. A policy update reflects how good or bad those rewards were, so that the agent formulates better actions in the future. For simplicity, we will not use a model in this example.

An important step in reinforcement learning is to find a way to represent the environment, which is usually easier said than done. However, for a game like Blackjack, it is quite straightforward. To avoid redundancy, only key components of the Python code are shown. (Full code available here)


First, the distribution of cards is defined. To keep the rules simple, we make the simplifying assumption that there is no usable Ace (i.e., an Ace never counts as 11), which is not how most casinos play today. Also, speaking from experience, a tower of decks is reloaded only once the cards are used up; therefore, the number of cards here is assumed to be finite and played through to the end.
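
As a rough illustration, here is a minimal Python sketch of such a finite card pool, assuming a six-deck shoe, Aces that always count as 1, and face cards that count as 10. The names build_shoe and draw are hypothetical, not the author's code.

```python
import random

N_DECKS = 6  # assumed size of the "tower" of decks loaded at once


def build_shoe(n_decks=N_DECKS):
    """Return a shuffled, finite list of card values for n_decks standard decks."""
    one_deck = [1] + list(range(2, 11)) + [10, 10, 10]  # A (as 1), 2-10, J, Q, K
    shoe = one_deck * 4 * n_decks                       # four suits per deck
    random.shuffle(shoe)
    return shoe


def draw(shoe):
    """Draw one card; the shoe is finite and played until it runs out."""
    return shoe.pop()
```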

Second, we define the dealer, who we assume plays strictly by the cards in its own hand, hitting until it reaches a minimum total of 15:
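
Here is a minimal sketch of such a dealer, reusing the draw helper from the sketch above. The class layout is an assumption; only the hit-below-15 rule comes from the description.

```python
class Dealer:
    """A dealer that looks only at its own hand and hits until reaching 15."""

    def __init__(self):
        self.hand = []

    def total(self):
        return sum(self.hand)

    def play(self, shoe):
        # Keep hitting while below the minimum of 15, then stop.
        while self.total() < 15:
            self.hand.append(draw(shoe))
        return self.total()
```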

Lastly, the player (agent) must be defined with more components that allow it to learn:

The key input parameters here are:

  • Epsilon, ϵ: the exploration factor. Ranging from 0 to 1, it determines how much randomness goes into the agent's decisions.
  • Gamma, γ: the discount factor. We will use an n-step (n = 1) look-forward that adds the expected reward of the state one step ahead of the current state, multiplied by the discount factor.

and the important initializations for the player are:

  • actions – the actions available to the agent: hit or pass
  • states – all of the exact states (380 in total): (player's sum of hand) x (whether the hand has 5 cards) x (dealer's top card), i.e., every combination of the state categories we would like the agent to account for
  • rewards, values & policy – their hierarchical data structures hold a numeric value for each action at each state (i.e., for all s∈S, a∈A(s)). Initial values are important to kick-start the learning (a sketch of these initializations follows below).
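
Here is a sketch of how these initializations might look, assuming hand sums of 2-20, a 5-card flag, and dealer top cards of 1-10 (19 x 2 x 10 = 380 states). The attribute names are hypothetical, not the author's exact code.

```python
from itertools import product


class Player:
    """The learning agent's bookkeeping: states, actions, values, returns, policy."""

    def __init__(self, epsilon=0.1, gamma=0.9):
        self.epsilon = epsilon          # exploration factor
        self.gamma = gamma              # discount factor for the 1-step look-forward
        self.actions = ["hit", "pass"]  # available actions

        # (player's sum of hand) x (has 5 cards?) x (dealer's top card) = 380 states
        self.states = list(product(range(2, 21), [False, True], range(1, 11)))

        # One numeric entry per (state, action) pair; zeros kick-start the learning.
        self.values = {(s, a): 0.0 for s in self.states for a in self.actions}
        self.returns = {(s, a): [] for s in self.states for a in self.actions}

        # Start from a uniform random policy: P(a | s) = 1 / |A(s)|.
        self.policy = {s: {a: 1.0 / len(self.actions) for a in self.actions}
                       for s in self.states}
```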

And of course, the player must decide what to do at each round based on its own policy:
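
One way to implement that decision is to sample an action from the stored policy probabilities for the current state, as in this sketch (written as a standalone function here; in the full code it would presumably be a Player method):

```python
import random


def decide(player, state):
    """Sample the next action from the agent's current policy at this state."""
    actions = list(player.policy[state].keys())
    weights = [player.policy[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```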

When you see this type of data tabulation, it may remind you of concepts from dynamic programming, which requires us to "remember". The core of what makes the player an agent is its ability to learn:

The pseudocode used for the updates is shown here, in sequential fashion:

  • 1-step look-forward expectation of the state-action return (G) within an episode (a game)
  • Update of Returns by appending G to Returns
  • Update of value by averaging the Returns for the state-action pair
  • Update of the ε-greedy policy at each state for optimal vs. non-optimal actions
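
Putting those four steps together, an update pass over one episode could look like the sketch below. The 1-step look-forward return is taken here as G = r + γ · max_a Q(s′, a), i.e., the reward plus the discounted best action value at the next state, and the ε-greedy update gives the best-valued action probability 1 - ε + ε/|A(s)| and every other action ε/|A(s)|. The function name and the (state, action, reward, next_state) episode format are assumptions, not the author's exact code.

```python
def update_from_episode(player, episode):
    """Apply the four update steps to every (state, action) visited in an episode."""
    for state, action, reward, next_state in episode:
        # 1-step look-forward expectation of the state-action return G.
        future = 0.0
        if next_state is not None:
            future = max(player.values[(next_state, a)] for a in player.actions)
        G = reward + player.gamma * future

        # Append G to Returns and re-average to update the value estimate.
        player.returns[(state, action)].append(G)
        history = player.returns[(state, action)]
        player.values[(state, action)] = sum(history) / len(history)

        # Epsilon-greedy policy improvement for this state.
        best = max(player.actions, key=lambda a: player.values[(state, a)])
        n = len(player.actions)
        for a in player.actions:
            if a == best:
                player.policy[state][a] = 1 - player.epsilon + player.epsilon / n
            else:
                player.policy[state][a] = player.epsilon / n
```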

Performing the aforementioned updates results in the following convergence. For more details, please see Generalized Policy Iteration (GPI).

Finally, we put together the game, where the agent plays iteratively for as many rounds as needed to train its policy; the rules of the game are defined here:
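
Below is a sketch of what that loop could look like, tying together the helpers from the earlier sketches (build_shoe, draw, Dealer, Player, decide, update_from_episode). The simplified game logic and the reward values (+1 win, -1 loss, 0 push) are assumptions for illustration, not the author's exact rules.

```python
def play_one_game(player, dealer, shoe):
    """Play one simplified game; return (state, action, reward, next_state) tuples."""
    hand = [draw(shoe), draw(shoe)]
    dealer.hand = [draw(shoe), draw(shoe)]
    dealer_top = dealer.hand[0]
    steps = []  # (state, action) pairs visited by the player

    # Player's turn: keep deciding until it passes, busts, hits 21, or holds 5 cards.
    while True:
        state = (sum(hand), len(hand) >= 5, dealer_top)
        action = decide(player, state)
        steps.append((state, action))
        if action == "pass" or len(hand) >= 5:
            break
        hand.append(draw(shoe))
        if sum(hand) >= 21:  # made 21 or busted: no more decisions
            break

    # Settle the game: bust loses, 5-card Charlie wins, otherwise compare hands.
    if sum(hand) > 21:
        reward = -1
    elif len(hand) >= 5:
        reward = 1
    else:
        dealer_total = dealer.play(shoe)
        if dealer_total > 21 or sum(hand) > dealer_total:
            reward = 1
        elif sum(hand) < dealer_total:
            reward = -1
        else:
            reward = 0

    # Intermediate steps get reward 0; only the final step carries the outcome.
    episode = []
    for i, (state, action) in enumerate(steps):
        next_state = steps[i + 1][0] if i + 1 < len(steps) else None
        r = reward if i == len(steps) - 1 else 0
        episode.append((state, action, r, next_state))
    return episode


def train(n_games=20_000):
    """Play many games, reloading the shoe when it runs low and learning after each."""
    player = Player(epsilon=0.1, gamma=0.9)
    shoe = build_shoe()
    for _ in range(n_games):
        if len(shoe) < 30:  # reload a fresh tower of decks when nearly used up
            shoe = build_shoe()
        episode = play_one_game(player, Dealer(), shoe)
        update_from_episode(player, episode)
    return player
```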

After playing 20,000 games, the following shows the state-value function that the agent perceives and how its policy is biased towards certain actions at the various sums of hand. Other factors (such as the dealer's top card or the number of cards in hand that could satisfy a 5-card Charlie) can be viewed in the same form.

The first thing to point out in the results is that the agent loses a lot less by abiding by these rules than by playing randomly (Fig 2a). If you play randomly, you can expect to lose over 40% of your money. Playing optimally, however, actually lets you gain around 10%! This might be one reason why casinos impose a betting fee or require players to maintain a minimum bet.

We can clearly see the expected trends in the state-value function, where values increase towards 21 as the odds of winning increase (Fig 2c). There is an important peak at sums of 10 to 11, where the state value is higher due to the greater chance of reaching 19 to 21 with a face card, followed by a sharp turn to the negative beyond 11. Unfortunately, in none of the 20,000 trials did the agent get a 5-card Charlie (Fig 2b), which shows how unlikely that is. As for the action policy, the agent generally chooses to hit when the sum of hand is less than 13 and to pass when the sum gets larger (Fig 2d).

(Fig 2) An evaluation of RL agent performance after n = 20,000 trials, for cases where the dealer's top card = 5. (Fig 2a) The return of the RL player (orange) clearly outperforms both the random player (green) and the negative control (blue). (Fig 2b) The average state value of the agent in states with/without a 5-card Charlie and (Fig 2c) in each state of sum of hands. (Fig 2d) The probability of taking an action given each state of sum of hands.

If you are new to blackjack, consider learning some of these rules from the agent in this simplified simulation of the real game, which doesn't have usable Aces. If you want more advanced or situational insights, consider improving the simulation by adding real casino rules and letting the agent learn to beat the odds.

In conclusion, this demonstrates one of the most naïve implementations of Reinforcement Learning that learns by keeping records of experiences and making judgments directly based on them. There are many things you can tweak just in this system, such as the exploration hyperparameters. If you would like to take reinforcement learning to the next level, you may add a supervised learning model such as a Neural Net or time series forecasting model that seeks to find patterns in these experiences to plan policy ahead. Doing so could greatly boost the performance and rate of convergence towards our goal.

As a fun follow-up question, if you would like an agent to learn if it wants to choose to gamble at all, how could you implement it?

