
Update: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com
This article describes an implementation of Neural Fictitious Self-Play (NFSP) for the Leduc Hold’em poker game, based on code by Eric Steinberger. The full source code can be found in his GitHub repository.
If you are new to the topic, it is better to start with these articles first: Introduction to Fictitious Play, Fictitious Self-Play, and Neural Fictitious Self-Play.
Disclaimer: this article does not aim to explain every detail of the implementation, but rather to highlight its main structure and the mapping between the code and the academic solution.
The implementation involves distributed computation, which adds a level of complexity to the code. In this article, however, we will focus on the algorithm per se and bypass the distributed computation aspect. For this purpose, we will draw a parallel with the academic algorithm below.

Leduc Hold’em
First, let’s define the Leduc Hold’em game. Here is a definition taken from DeepStack-Leduc:
Leduc Hold’em is a toy poker game sometimes used in academic research (first introduced in Bayes’ Bluff: Opponent Modeling in Poker). It is played with a deck of six cards, comprising two suits of three ranks each (often the king, queen, and jack – in our implementation, the ace, king, and queen). The game begins with each player being dealt one card privately, followed by a betting round. Then, another card is dealt faceup as a community (or board) card, and there is another betting round. Finally, the players reveal their private cards. If one player’s private card is the same rank as the board card, he or she wins the game; otherwise, the player whose private card has the higher rank wins.
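To make these rules concrete, here is a minimal sketch of the showdown logic in Python. It is written for illustration only and is not taken from the repository:

```python
# Illustrative only -- a tiny model of the Leduc Hold'em showdown rules
# described above, NOT code from the PokerRL / NFSP repository.
RANKS = {"Q": 0, "K": 1, "A": 2}  # two suits of each rank, six cards in total

def showdown_winner(p0_card: str, p1_card: str, board_card: str) -> int:
    """Return 0 or 1 for the winning seat, or -1 for a split pot."""
    p0_pair = RANKS[p0_card] == RANKS[board_card]
    p1_pair = RANKS[p1_card] == RANKS[board_card]
    if p0_pair != p1_pair:
        return 0 if p0_pair else 1            # pairing the board card wins outright
    if RANKS[p0_card] == RANKS[p1_card]:
        return -1                             # same rank, neither pairs: split pot
    return 0 if RANKS[p0_card] > RANKS[p1_card] else 1  # otherwise higher rank wins

# Example: player 0 holds a King, player 1 an Ace, the board shows a King.
assert showdown_winner("K", "A", "K") == 0
```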
Global View
The main class is the Driver (workers/driver/Driver.py), whose run() method sets everything in motion: it sets up the main loop and the execution of the algorithm at each iteration, as seen in the following image.

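Roughly speaking, each iteration of that loop generates data through self-play and then trains the two networks. The sketch below is purely illustrative; the class and method names are hypothetical and do not reflect the repository’s actual API:

```python
# Hypothetical sketch of what the main loop set up by Driver.run() does each
# iteration; names are placeholders chosen for readability, not the real API.
class Driver:
    def __init__(self, high_level_algo, evaluator, n_iterations, eval_every):
        self.algo = high_level_algo          # wraps the NFSP steps (cf. _HighLevelAlgo.py)
        self.evaluator = evaluator           # measures exploitability of the average strategy
        self.n_iterations = n_iterations
        self.eval_every = eval_every

    def run(self):
        for it in range(self.n_iterations):
            self.algo.play_one_iteration()    # generate data by self-play
            self.algo.train_average_policy()  # supervised learning on the reservoir buffer
            self.algo.train_q_network()       # DQN-style learning on the circular buffer
            if it % self.eval_every == 0:
                self.evaluator.evaluate(it)   # log exploitability over time
```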
The Algorithm
The bulk of the action happens in _HighLevelAlgo.py, where it is easy to distinguish the different parts of the academic solution.

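At its core, NFSP maintains, for each player, a reservoir memory M_SL that feeds the average-policy (supervised learning) network and a circular memory M_RL that feeds the best-response (Q) network. The following sketch, with hypothetical names, illustrates these two memories:

```python
# Illustrative only: the two memories NFSP keeps per player, which the
# high-level algorithm fills during play and samples from during training.
import random
from collections import deque

class ReservoirBuffer:
    """M_SL: stores (state, best-response action) pairs via reservoir sampling,
    so the average-policy network trains on a uniform sample of past play."""
    def __init__(self, capacity):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, item):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = item

class CircularBuffer:
    """M_RL: a sliding window of recent transitions for the Q-network."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)
```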
Play Phase
Let’s zoom in on the play phase. The code does not map perfectly to the academic solution, but it is not far from it; it is simply organized differently. The sampling of the action a(t) happens in the SeatActor class. You can think of a SeatActor as a player; there is more than one of them.
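In NFSP terms, this sampling mixes two modes: with probability η (the anticipatory parameter) the player acts as a best response using its Q-network, and otherwise it samples from its average-policy network. Here is a minimal sketch of that idea, not the repository’s code; the network objects and the ε value are placeholders:

```python
# Hedged sketch of NFSP action sampling (anticipatory parameter eta).
# q_net and avg_policy_net are assumed to be callables returning numpy arrays.
import numpy as np

def sample_action(obs, q_net, avg_policy_net, legal_actions, eta=0.1, eps=0.06):
    if np.random.rand() < eta:
        # Best-response mode: epsilon-greedy on the Q-network.
        if np.random.rand() < eps:
            a = int(np.random.choice(legal_actions))
        else:
            q = q_net(obs)                                  # action values
            a = max(legal_actions, key=lambda i: q[i])
        return a, True          # True -> (obs, a) should also go to M_SL
    # Average-policy mode: sample from the supervised-learning network.
    probs = avg_policy_net(obs)                             # distribution over actions
    p = probs[legal_actions]
    a = int(np.random.choice(legal_actions, p=p / p.sum()))
    return a, False
```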
On the other hand, PokerEnv.py (PokerRL/game/_/rl_env/base) is in fact the game engine, where the game dynamics and rules are executed. Each time a "player" acts in the game, the PokerEnv.step() method is called.
Finally, the LearnerActor (workers/la/local.py) is the class that coordinates the different SeatActors. It contains the play() method, which calls several classes and methods in the sequence below:


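To make the play phase concrete, here is a rough sketch of what such a coordination loop looks like. All names here (env.current_player, actor.act(), the memory objects) are hypothetical and only meant to mirror the sequence described above:

```python
# Hypothetical sketch of a LearnerActor-style play() loop: reset the environment,
# ask the SeatActor whose turn it is for an action, advance the game with
# env.step(), and store the resulting experience in the NFSP memories.
def play(env, seat_actors, n_hands, br_memories, avg_memories):
    for _ in range(n_hands):
        obs, done = env.reset(), False
        while not done:
            seat = env.current_player                        # whose turn it is
            action, used_best_response = seat_actors[seat].act(obs)
            next_obs, rewards, done, _ = env.step(action)    # game engine advances
            br_memories[seat].add((obs, action, rewards[seat], next_obs, done))
            if used_best_response:
                avg_memories[seat].add((obs, action))        # data for the avg-policy net
            obs = next_obs
```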
Result
According to Eric Steinberger, the code achieves the same results as the academic paper:
The 100 mbb/g (= 0.1 bb/g in the paper) line is crossed after 230k iterations (4:24 hrs) and the 60 mbb/g line after 850k iterations (19 hrs).
The graph shows the exploitability curve as a function of the number of NFSP iterations.

As a reminder, exploitability measures a strategy’s worst-case performance. An introduction to exploitability can be found in these slides.
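For reference, the standard definition in a two-player zero-sum game (not specific to this repository) is:

```latex
% Exploitability of a strategy profile (\sigma_1, \sigma_2): how much a
% best-responding opponent could gain against it, averaged over both seats.
\[
\text{expl}(\sigma) \;=\; \frac{1}{2}\Big( \max_{\sigma_1'} u_1(\sigma_1', \sigma_2)
\;+\; \max_{\sigma_2'} u_2(\sigma_1, \sigma_2') \Big)
\]
% A Nash equilibrium has exploitability 0; the curve above reports this value
% as training progresses.
```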
Conclusion
This was a quick overview of the NFSP implementation as done in practice by Eric Steinberger (check the source code here). It should help you find your way around the code, which has a rather complex structure, and give you a clear understanding of where the real action happens.
Finally, I would like to thank Eric Steinberger for reviewing this article and sharing his remarks. Thanks also to Arman Didandeh for proofreading the article.