Tackling the UNO Card Game with Reinforcement Learning

A complete walk-through starting from scratch

Bernhard Pfann, CFA
Towards Data Science


During quarantine times like these, the classic UNO card game has become one of the most frequently played games between my girlfriend and me. After countless rounds and several major winning streaks on each side, I was curious whether an optimal game strategy could be derived analytically, one that would significantly outperform in the long run.

My journey consisted of the following parts:

  1. Creating a game engine of the UNO card game in Python from scratch
  2. Obtaining game statistics from simulating a series of 100,000 games
  3. Implementing basic Reinforcement Learning techniques in order to discover an optimal game strategy

1. UNO Card Game Engine

In order to train a Reinforcement Learning (RL) agent to play intelligently, a fully fledged game environment needs to be in place. It has to capture all the mechanics and rules so that the agent can interact with the game just as a real human player would.

The object-oriented nature of Python allows structuring the code intuitively into class objects for cards, deck, player, turn, and game. Here are the main ideas…

1.1 Card

The most granular element of the game is the card itself, which has a “color” and a “value” property. A card can also be evaluated for whether it is playable, depending on the currently open card.
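
Translated into code, a minimal sketch of such a class could look as follows (the attribute and method names here are illustrative and may differ from those in the repo):

```python
class Card:
    """A single UNO card with a color and a value."""

    def __init__(self, color, value):
        self.color = color  # "RED", "GREEN", "BLUE", "YELLOW", or None for wild cards
        self.value = value  # "0"-"9", "SKIP", "REVERSE", "+2", "+4", "WILD"

    def playable(self, open_card):
        """A card is playable if it is wild or matches the open card's color or value."""
        return (
            self.color is None
            or self.color == open_card.color
            or self.value == open_card.value
        )
```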

1.2 Deck

A fixed list of UNO cards represents the card deck. It is initialized and shuffled at the beginning of each game. Cards can be drawn from it, and a refill mechanism is implemented in case the deck runs out of cards.
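
A possible sketch of the deck, assuming a simple list-based implementation (build and refill logic simplified; names are illustrative):

```python
import random

COLORS = ["RED", "GREEN", "BLUE", "YELLOW"]
SPECIALS = ["SKIP", "REVERSE", "+2"]

class Deck:
    """A shuffled stack of UNO cards that players draw from."""

    def __init__(self):
        self.cards = self._build()
        random.shuffle(self.cards)

    def _build(self):
        cards = []
        for color in COLORS:
            cards.append(Card(color, "0"))  # one "0" per color
            for value in [str(n) for n in range(1, 10)] + SPECIALS:
                cards += [Card(color, value), Card(color, value)]  # two of each other card
        cards += [Card(None, "WILD") for _ in range(4)]
        cards += [Card(None, "+4") for _ in range(4)]
        return cards  # 108 cards in total

    def draw(self, discard_pile):
        """Draw the top card; if the deck is empty, refill it from the discard pile."""
        if not self.cards:
            self.cards = discard_pile[:-1]  # take everything except the open card
            del discard_pile[:-1]
            random.shuffle(self.cards)
        return self.cards.pop()
```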

1.3 Player

A player has a name, hand cards, and playable hand cards as properties, and needs to perform a variety of functions (see the sketch after this list). He has to:

  • Identify the current state he is in (cards he holds etc.)
  • Recognize the possible actions he can take (cards he can play)
  • Incorporate the decision making logic from the RL-agent
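
A sketch of how such a player could be wired up, with the decision logic passed in as a strategy callable (the names `strategy`, `playable_cards`, and `choose_card` are illustrative):

```python
class Player:
    """Holds a hand of cards and delegates the card choice to a strategy."""

    def __init__(self, name, strategy):
        self.name = name
        self.hand = []
        self.strategy = strategy  # e.g. a random policy or the RL-agent's policy

    def playable_cards(self, open_card):
        """The possible actions: all hand cards that can be placed on the open card."""
        return [card for card in self.hand if card.playable(open_card)]

    def choose_card(self, open_card):
        """Ask the strategy for a decision; returns None if nothing is playable."""
        options = self.playable_cards(open_card)
        return self.strategy(self.hand, options, open_card) if options else None
```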

1.4 Turn

This class bundles a player's action sequence during a turn. After the initialization, which deals 7 starting cards to each player, the active player (the player whose turn it is) either draws or plays a card. In this setting, player_1 utilizes the RL-agent, while player_2 follows a random strategy.

1.5 Game

Finally, a card game comprises nothing more than a series of turns. Therefore, turns are iterated until one of the players meets the winning condition (holding no hand cards).
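
The game loop then boils down to something like the following (a simplified sketch that ignores the effects of special cards such as Skip, Reverse, +2 and +4 to keep it short):

```python
def play_game(player_1, player_2):
    """Alternate turns until one player holds no hand cards."""
    deck, discard = Deck(), []
    for player in (player_1, player_2):
        player.hand = [deck.draw(discard) for _ in range(7)]  # 7 starting cards each
    discard.append(deck.draw(discard))                        # first open card

    active, waiting = player_1, player_2
    while True:
        card = active.choose_card(discard[-1])
        if card is None:                        # nothing playable: draw a card
            active.hand.append(deck.draw(discard))
        else:                                   # play the chosen card
            active.hand.remove(card)
            discard.append(card)
        if not active.hand:                     # winning condition
            return active
        active, waiting = waiting, active       # hand the turn over
```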

The code snippets above are meant to convey the basic game mechanics. The full script can be accessed in my GitHub repo: https://github.com/bernhard-pfann/uno-card-game_rl.

2. Game Statistics from Simulations

With a fully functioning game engine, we are now able to run as many simulations of the game as we like. This is helpful for getting insights into the statistical features of the game, and it also facilitates the training of an RL-agent. I was mainly interested in the following questions:

  • How many turns do games last?
  • How big is the advantage of the player making the first turn?
  • What are the most likely situations in the course of a game?

2.1 Game Length

Within the generated sample of 100,000 simulations, the average game length is 41 turns. The significantly lower mode of 13 turns results from the distribution's right skew (the portion of games with more than 100 turns).

Distribution of turns per game from 100,000 simulated games

The longest game in the sample lasted 327 turns, while the shortest took only 3 turns, which is only possible through a combination of multiple special cards (Skip, Reverse, +2, +4).

2.2 Advantage of the First Mover

A player who starts the game has a certain edge over his opponent. To quantify this advantage, the cumulative long-run win-rate of two randomly playing agents is observed in the following two scenarios:

  1. Player 1 always gets to make the first move (red simulations)
  2. Player 1 gets to make the first move every 2nd game (violet simulations)

Cumulative Win-Rate from 30,000 simulated games

In a fair set-up with alternating first movers, the win-rate of two randomly playing agents converges to 50%, an indication that the game engine works as expected.

More interesting is the fact that the player who always makes the first move consistently wins 52% of all games. This corresponds to a roughly 4 percentage point (52% vs. 48%) higher likelihood of winning against the opponent.

2.3 The Course of Play

As in sports analytics, a heatmap can be used to display where the course of play was concentrated. By keeping track of every state and action a player has experienced, we can identify the most likely situations in the game.

Number of state-action occurrences during 100,000 simulated games

The axes of the heatmap denote a player's number of hand cards, as well as the action taken at the respective point in time. It stands out that most of the time players linger in the “mid-game”, holding 4–5 cards. It makes sense that normal cards (0–9) of all colors are played most often, since they are the most common cards in the deck.

3. Application of Reinforcement Learning

Searching for the optimal strategy in the UNO card game is a classical use case for Reinforcement Learning. The game itself can be framed as a finite Markov Decision Process (MDP), which implies the following characteristics:

  • States: Each step of the game can be described by a state
  • Actions: The decision-maker interacts with the game by taking actions based on the state he is in
  • Reward: Taking certain actions can lead to a desirable terminal state (e.g. winning the game), which is rewarded

The stochastic element, which is inherent in the randomly drawn cards as well as the opponent's moves, requires numerous simulations to identify a long-run optimum. The basic techniques I applied are Monte Carlo and Q-Learning with a discrete state-action matrix.

3.1 Defining State-Space

In a supervised machine learning set-up, a function is fitted to map the available features to the output variable. RL, on the other hand, evaluates each action sequentially and individually at each step. Therefore states, actions, and rewards need to be defined.

A state can represent any information available to the decision-maker that is useful for describing the current situation of the game. This could be the type of cards he holds, the number of cards the opponent holds, or information regarding cards that have already been played.

At first sight, UNO might appear to be a very simplistic card game due to its limited set of rules. However, when figuring out the possible combinations of cards a player can hold, it quickly gets out of hand. To be precise, there are about 10¹⁶³ combinations. Since the RL-agent has to learn an optimal action for each state, it makes sense to limit the number of states.

Thus, I formed a 16-digit state identification that captures which cards the agent holds and which he can play, differentiating only between colors and special cards (Skip, Reverse, etc.).

Identification of a state

By capping each property of the state identification at a maximum value of 2, the state-space has been limited to roughly 270,000 possibilities.
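
To illustrate the idea, a hand could be compressed into such a capped count vector roughly as follows (a sketch only: the exact digit order, grouping and handling of wild cards in the repo may differ):

```python
GROUPS = ["RED", "GREEN", "BLUE", "YELLOW", "SKIP", "REVERSE", "+2", "+4"]

def encode_state(hand, open_card, cap=2):
    """Map a hand to 16 capped digits: 8 counts of held cards, 8 counts of playable cards."""
    held = {g: 0 for g in GROUPS}
    playable = {g: 0 for g in GROUPS}
    for card in hand:
        key = card.value if card.value in GROUPS else card.color
        if key in held:
            held[key] = min(held[key] + 1, cap)
            if card.playable(open_card):
                playable[key] = min(playable[key] + 1, cap)
    return tuple(held[g] for g in GROUPS) + tuple(playable[g] for g in GROUPS)
```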

3.2 Possible Actions

The cards that the agent decides to play represent the actions he takes. Applying the same aggregation logic as above results in 9 different actions: Red, Green, Blue, Yellow, Skip, Reverse, +2, +4, Wild Color.

Q-table initialized at zero
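
One simple way to represent such a table is a dictionary that maps every encountered state to a row of zeros, one entry per aggregated action (a sketch; the repo may store it differently):

```python
import numpy as np
from collections import defaultdict

ACTIONS = ["RED", "GREEN", "BLUE", "YELLOW", "SKIP", "REVERSE", "+2", "+4", "WILD"]

# Every state starts with a zero q-value for each of the 9 aggregated actions
q_table = defaultdict(lambda: np.zeros(len(ACTIONS)))
```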

3.3 Monte Carlo Agent

Given the discrete state-action matrix, the agent navigates through it by simulating multiple games. While the matrix is initialized with all values at zero, Monte Carlo (MC) updates all visited state-action values after every completed game.

Q-value update function (Monte Carlo): Q(s, a) ← Q(s, a) + α · (R − Q(s, a))

The q-value of state s and action a is updated depending on the reward R achieved in this episode and the step-size parameter alpha. In order to decide which action to take in a given state, the epsilon-greedy form of the algorithm chooses (see the sketch after this list):

  • With probability epsilon: a random action
  • With probability 1 − epsilon: the action with the maximum q-value
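
Continuing the Q-table sketch from above, the epsilon-greedy choice and the Monte Carlo update could look roughly like this (illustrative, not the exact code from the repo):

```python
import random
import numpy as np

def epsilon_greedy(q_table, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return int(np.argmax(q_table[state]))

def mc_update(q_table, episode, reward, alpha):
    """After a finished game, nudge every visited (state, action) pair towards the episode reward R."""
    for state, action in episode:  # episode: list of (state, action) pairs visited by the agent
        q_table[state][action] += alpha * (reward - q_table[state][action])
```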

3.4 Q-Learning

In its basic form, Q-learning works in a similar way. However, while MC waits for the completion of each episode before updating q-values, Q-learning updates them at every step, with a lag of one step.

Q-value update function (Q-learning): Q(s, a) ← Q(s, a) + α · (r + max_a′ Q(s′, a′) − Q(s, a))

The q-value thereby depends on the step-size parameter, the reward r of the next step, and the q-value of the next state s′ and action a′.
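
The corresponding one-step update can be sketched as follows (undiscounted, matching the two-parameter setup described below; terminal-state handling is omitted for brevity):

```python
def q_learning_update(q_table, state, action, reward, next_state, alpha):
    """One-step Q-learning: move Q(s, a) towards r + max_a' Q(s', a')."""
    target = reward + np.max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
```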

Both algorithms consequently take the same two parameters, which have the following effects:

  • Alpha: A higher step-size parameter increases the change in q-values at each update, while preventing the values from converging closer to their true optimum
  • Epsilon: A higher epsilon grants more exploration of actions that do not appear profitable at first sight. At the same time, this dilutes the optimal game strategy once it has been picked up by the agent.

4. Results

Now that we have specified the models, let us see if they are useful. To this end, games are simulated with one player acting randomly and the other one leveraging the RL-algorithm. As an evaluation criterion, I use the cumulative long-run win-rate of the RL-player.
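
The evaluation itself is just a running mean over the sequence of game outcomes, for example:

```python
import numpy as np

def cumulative_win_rate(outcomes):
    """outcomes: 1 if the RL-player won a game, 0 otherwise (in chronological order)."""
    outcomes = np.asarray(outcomes, dtype=float)
    return np.cumsum(outcomes) / np.arange(1, len(outcomes) + 1)
```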

Cumulative Win-Rate from 100,000 simulated games

After running 100,000 simulations for each of the two RL-models, a consistent outperformance over the random agent can be confirmed. Given the enormous number of episodes, the result is statistically significant, even though the margin of outperformance is limited.

A win-rate of 51.5% means that the RL-algorithm is about 3 percentage points (51.5% vs. 48.5%) more likely to win a game against a random player.

4.1 Q-Table

To see how well the algorithms have learned the game, a closer inspection of the Q-table can be helpful. Since the full table is of size 270,000 x 8, states have been aggregated by the number of hand cards.

The steadily declining violet curve reflects the fact that states with fewer hand cards are closer to winning a game. These states therefore carry higher Q-values, which leads the RL-agent to choose actions that reduce the number of hand cards as quickly as possible in order to achieve the reward.
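
Assuming the number of hand cards can be derived from a stored state (the helper `hand_size_of` below is hypothetical), the aggregation could be done roughly like this:

```python
import pandas as pd

def aggregate_q_by_hand_size(q_table, hand_size_of):
    """Average the maximum q-value per state, grouped by the number of hand cards."""
    rows = [
        {"hand_cards": hand_size_of(state), "max_q": values.max()}  # hand_size_of: hypothetical helper
        for state, values in q_table.items()
    ]
    return pd.DataFrame(rows).groupby("hand_cards")["max_q"].mean()
```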

Aggregated Q-Values from Q-Learning Model

4.2 State-Action Coverage

To judge whether the models have been trained on enough simulations, the state-action coverage can be assessed. For this purpose, the distinct state-action pairs that have occurred during the 100,000 games are counted.
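
Counting the coverage amounts to collecting the distinct (state, action) pairs seen so far, for example:

```python
def state_action_coverage(histories):
    """histories: one list of (state, action) pairs per simulated game, in chronological order."""
    seen, coverage = set(), []
    for game in histories:
        seen.update(game)
        coverage.append(len(seen))  # cumulative number of distinct pairs after each game
    return coverage
```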

In total, both models have seen ca. 60,000 different combinations. At first sight, this is lower than expected, since we now know that the 100,000 games had on average 41 turns (41 x 100,000 = 4.1 million turns). However, we also showed how concentrated the course of play is towards certain card combinations, which appear many times within the sample.

State-Action Coverage during 100,000 simulated games

The quicker initial increase in coverage of the Monte Carlo model does not come as a surprise, given the nature of its update function.

5. Conclusion

Even though the RL-models are able to consistently outperform a random player by some margin, it remains unclear how the performance would turn out against a real human player. There are several limitations that hinder the algorithm from achieving more impressive results:

  • Only a few moves in each game require strategic evaluation, since most of the time players do not hold more than one playable card in their hand. In only ca. 32% of playable turns do players have multiple options available, where intelligent behavior could lead to outperformance.
  • Not all relevant state information has been included, since this simple form of RL with a discrete state-action matrix only works for a limited set of states. Information about the number of cards the opponent holds, or the color distribution of already played cards, could help to fine-tune a game strategy.
  • Finally, it is a game of luck! Compared to games like Monopoly or Chess, UNO brings in a large amount of stochasticity, which diminishes the potential for strategic supremacy.

The bottom line is that I will continue to trust my gut feeling when challenged to the next game of UNO. The project is available in my GitHub repository, and anybody more acquainted with advanced RL-methods is welcome to pick it up from there.
