DouZero: Mastering DouDizhu with Reinforcement Learning

Explaining a strong AI system for DouDizhu, the most popular Chinese poker game

Henry Lai
Towards Data Science


https://douzero.org/. Image by Author.

Starting with AlphaGo and AlphaZero, Artificial Intelligence (AI) has made encouraging progress in recent years, beating top human players in Go, Chess, Texas Hold'em, Dota 2, StarCraft, and more. So, what is the next challenge?

In this article, I would like to introduce an ICML paper that presents a new AI system for the Chinese poker game DouDizhu (斗地主 in Chinese). DouDizhu is the most popular poker game in China, with hundreds of millions of daily active players and many tournaments held every year. It is easy to learn but difficult to master. Interestingly, DouDizhu remains a grand challenge for AI because of its mix of competition and collaboration and the large number of possible moves from turn to turn.

What is DouDizhu and Why is it Challenging?

DouDizhu is a shedding-type game in which a player's objective is to empty their hand of all cards before the other players do. The two Peasant players play as a team against the Landlord player, and the Peasants win if either of them is the first to run out of cards. Each game has a bidding phase, where the players bid for the Landlord role based on the strength of their hand cards, and a card-playing phase, where the players play cards in turn.

In short, the following properties make DouDizhu very challenging for AI:

  1. Imperfect information: Unlike in Go or Chess, the players in DouDizhu cannot observe the other players' hands. This leads to an explosion of possibilities.
  2. Both competition and collaboration: The two Peasant players need to cooperate to fight against the Landlord player. Most previous poker AIs are designed for purely competitive settings, such as Texas Hold'em, or purely collaborative settings, such as Hanabi. Handling both competition and collaboration at once is an open challenge.
  3. Large and complex action space: There are 27,472 possible moves due to combinations of cards, and the set of legal actions varies significantly from turn to turn. Existing AIs often only support small and simple action spaces.
An example hand and its legal combinations. Image by Author, source: https://arxiv.org/abs/2106.06135

Mastering DouDizhu from Scratch with DouZero

Despite these challenges, I am excited to introduce a recent ICML paper and the DouZero project. DouZero, for the first time, realizes a strong DouDizhu AI that can be trained from scratch with deep reinforcement learning to reach human-level performance. Interestingly, the magic behind DouZero is surprisingly simple yet very effective. The authors have also developed an online demo based on our RLCard project. You can try the demo here.

Let's dive into the techniques behind DouZero. The core algorithm is called Deep Monte-Carlo (DMC), a deep version of the classical Monte-Carlo (MC) methods for reinforcement learning. The goal of MC methods is to estimate the Q-values Q(s, a), where s stands for the state and a for the action, and to use them to improve the policy by iteratively executing the following procedure:

  1. Generate an episode using the current policy.
  2. For each (s, a) pair that appears in the episode, calculate and update Q(s, a) with the return averaged over all the samples of (s, a).
  3. For each s in the episode, update the policy to pick the action with the largest Q(s, a).

The return in Step 2 is usually the discounted cumulative reward. In Step 1, we can use an epsilon-greedy policy to balance exploration and exploitation. MC methods are the most basic reinforcement learning algorithms: they directly approximate the target values without bootstrapping (i.e., without using the value of the next state to estimate the value of the current state, as DQN does).
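To make the procedure concrete, here is a minimal sketch of tabular Monte-Carlo control in Python. The Gym-like `env` interface (`reset`, `step`, `legal_actions`) is an assumption for illustration, not DouZero's actual API.

```python
import random
from collections import defaultdict

def mc_control(env, num_episodes=10_000, epsilon=0.1, gamma=1.0):
    """Tabular every-visit Monte-Carlo control (illustrative sketch)."""
    q = defaultdict(float)      # Q(s, a) estimates
    counts = defaultdict(int)   # number of returns averaged into each Q(s, a)

    def greedy(state, legal_actions):
        return max(legal_actions, key=lambda a: q[(state, a)])

    for _ in range(num_episodes):
        # Step 1: generate an episode with the epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            legal = env.legal_actions()
            if random.random() < epsilon:
                action = random.choice(legal)        # explore
            else:
                action = greedy(state, legal)        # exploit
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Step 2: update Q(s, a) with a running average of discounted returns.
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            counts[(state, action)] += 1
            q[(state, action)] += (g - q[(state, action)]) / counts[(state, action)]
        # Step 3 is implicit: greedy() always reads the latest Q estimates.
    return q
```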

Let's look at DMC, which combines MC with neural networks. In MC, Q(s, a) is represented by a table. We now replace the Q-table with a Q-network, which can have multiple layers, and use a mean-squared-error (MSE) loss to update it in Step 2. This gives us the so-called DMC, i.e., a deep version of MC.
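With a Q-network, the update in Step 2 becomes a regression of the network's prediction onto the observed return. The sketch below assumes a PyTorch `q_net` that maps encoded (state, action) pairs to a scalar; the tensor shapes are illustrative assumptions, not DouZero's exact implementation.

```python
import torch
import torch.nn.functional as F

def dmc_update(q_net, optimizer, states, actions, returns):
    """One gradient step of Deep Monte-Carlo (illustrative sketch):
    regress Q(s, a) onto the Monte-Carlo return with an MSE loss."""
    pred = q_net(states, actions).squeeze(-1)   # predicted Q(s, a), shape (batch,)
    loss = F.mse_loss(pred, returns)            # target is the observed episode return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```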

While MC methods are often criticized for being unable to handle incomplete episodes and for being inefficient due to high variance, the authors found that DMC is very suitable for DouDizhu. On the one hand, DouDizhu is an episodic task, so we do not need to handle incomplete episodes. On the other hand, DMC is very simple to implement and run, so it can be easily parallelized to generate many samples per second, which alleviates the high-variance issue. DouZero further enhances DMC with multiple processes and action encoding (which I will elaborate on later) and, surprisingly, achieves very strong performance.

Comparison with DQN and Policy Gradient

Deep Q-Learning (DQN) and Policy Gradient (PG) are the most popular reinforcement learning algorithms. Why choose DMC instead of DQN or PG?

In PG, we often use a classifier-like function approximator whose output size scales linearly with the number of actions. However, in DouDizhu there are 27,472 possible actions, which makes the output very large and hard to train. In practice, there are usually not many legal actions in each turn, and the actions in DouDizhu can be naturally encoded as card matrices. Thus, a more promising way is to take the action as an input, so that we can avoid having 27,472 actions in the output. Value-based methods handle this situation naturally, since the Q-value takes both the state and the action as input.
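Here is a minimal sketch of this action-as-input idea, assuming a PyTorch `q_net` that scores encoded (state, action) pairs (hypothetical names, not DouZero's API). The state features are simply repeated once per legal action, so the output size never depends on the full 27,472-move action space.

```python
import torch

@torch.no_grad()
def choose_action(q_net, state_feats, legal_action_feats, epsilon=0.0):
    """Score only this turn's legal actions and pick the best one
    (epsilon-greedily during training). Illustrative sketch."""
    num_legal = legal_action_feats.shape[0]
    if epsilon > 0 and torch.rand(1).item() < epsilon:
        return int(torch.randint(num_legal, (1,)).item())    # explore
    # Repeat the state once per legal action so the network scores pairs.
    repeated_state = state_feats.unsqueeze(0).expand(num_legal, -1)
    q_values = q_net(repeated_state, legal_action_feats).squeeze(-1)
    return int(q_values.argmax().item())                      # index into the legal actions
```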

Similarly, DQN also approximates Q-values. However, implementing DQN in DouDizhu is tricky. DouDizhu has a large set of possible actions, and different subsets of them are legal in different turns. The bootstrapping step in DQN relies on a maximum operation over the next state's actions, which is computationally expensive and awkward to implement over such a variable action space. In addition, DQN may suffer from over-estimation when the action space is large.

In contrast, DMC is very easy to implement and easy to accelerate with multiple processes. Although DMC may suffer from high variance issues, the authors showed it works well in DouDizhu with parallel actors.

Implementing DMC in DouDizhu

Action encoding. Image by Author. Source: https://arxiv.org/abs/2106.06135

The state in DouDizhu includes the player's hand cards, the historical moves, and the number of cards left in the other players' hands; an action is a set of cards to play. DouZero encodes every card combination into a 15 x 4 matrix. This representation is very general and can represent any set of cards, so DouZero encodes both states and actions as such card matrices.
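For illustration, here is one simple way to build such a 15 x 4 card matrix in NumPy: one row per card type, with the number of copies held marked across the four columns. The exact layout in DouZero's code may differ slightly; this is only a sketch.

```python
import numpy as np

# 13 ranks (3 through 2) plus the two jokers give 15 card types in total.
RANKS = ['3', '4', '5', '6', '7', '8', '9', 'T', 'J', 'Q', 'K', 'A', '2', 'BJ', 'RJ']

def encode_cards(cards):
    """Encode a list of cards as a 15 x 4 binary matrix (illustrative sketch)."""
    matrix = np.zeros((15, 4), dtype=np.int8)
    for rank in cards:
        row = RANKS.index(rank)
        count = int(matrix[row].sum())
        matrix[row, count] = 1      # mark one more copy of this card type
    return matrix

# Example: a pair of 3s, a single King, and the red joker.
print(encode_cards(['3', '3', 'K', 'RJ']))
```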

Neural Architecture. Image by Author. Source: https://arxiv.org/abs/2106.06135

For the neural network, DouZero encodes the historical moves with an LSTM. The output of the LSTM is then concatenated with the other features, and everything is fed into six layers of MLP to generate the Q-value. In each step, we simply input the state together with all the legal actions, the network predicts a Q-value for each legal action, and we choose the action with the largest one.
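Below is a sketch of such a network in PyTorch: an LSTM summarizes the historical moves, its final output is concatenated with the remaining state/action features, and a six-layer MLP produces a scalar Q-value. The layer sizes are assumptions for illustration rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DMCQNetwork(nn.Module):
    """DouZero-style Q-network sketch: LSTM over historical moves + 6-layer MLP."""
    def __init__(self, feature_dim, history_dim, lstm_hidden=128, mlp_hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(history_dim, lstm_hidden, batch_first=True)
        layers, in_dim = [], feature_dim + lstm_hidden
        for _ in range(5):                      # five hidden layers ...
            layers += [nn.Linear(in_dim, mlp_hidden), nn.ReLU()]
            in_dim = mlp_hidden
        layers.append(nn.Linear(in_dim, 1))     # ... plus an output layer: six MLP layers in total
        self.mlp = nn.Sequential(*layers)

    def forward(self, history, features):
        # history: (batch, seq_len, history_dim); features: (batch, feature_dim)
        lstm_out, _ = self.lstm(history)
        summary = lstm_out[:, -1, :]            # last LSTM output summarizes the moves
        return self.mlp(torch.cat([summary, features], dim=-1))
```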

DouZero uses three different networks for the three positions in DouDizhu, with minor differences in the features. The authors then run 45 actors in parallel on a 4-GPU server for simulation. After days of training, DMC surprisingly learns very strong policies. The official implementation is here. We have also added DMC support to the RLCard project, a toolkit for reinforcement learning in card games that currently supports 8 environments, including DouDizhu, and 4 algorithms, including DouZero.

Summary

The success of DouZero suggests that classical Monte-Carlo methods, with some enhancements, can deliver very strong results in a hard domain such as DouDizhu. Monte-Carlo methods are easy to implement, have very few hyperparameters, and are nonetheless very powerful. I hope you enjoyed the read and that this insight helps you develop your own AI agents and tackle other problems. I will go through the codebase of DMC and DouZero in later posts.
