
Update: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com
Introduction
Using Reinforcement Learning in a zero-sum game requires more involved methods than standard Fictitious Play. Standard Fictitious Play is used in Normal Form Games, which do not consider time. To apply Reinforcement Learning to zero-sum games, another approach is needed. This article is based on the paper "Fictitious Self-Play in Extensive-Form Games" by Johannes Heinrich & David Silver.
Extensive Form Game
Normal Form Games are modelled as a table where the actions (called strategies) of each player are the headers of the rows and columns, and the content of each cell is the payoff resulting from the strategies employed by the players.
The problem with this form is that it does not capture sequence or time. For this reason, Extensive Form Games use a tree-like representation, where each node belongs to a player and each edge is a strategy (action), so the sequence of play can be clearly seen as the turn passes from node to node.

Note that for each Extensive Form Game, we can find a Normal Form Game that yields the same results.
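As a concrete (and entirely made-up) illustration, the two representations could be encoded roughly as follows: the normal form is a payoff table indexed by joint strategies, while the extensive form is a tree of decision nodes.

```python
# Normal form: a payoff table indexed by (player 1 strategy, player 2 strategy).
# Example: Matching Pennies, a zero-sum game; entries are player 1's payoff.
normal_form = {
    ("Heads", "Heads"): +1, ("Heads", "Tails"): -1,
    ("Tails", "Heads"): -1, ("Tails", "Tails"): +1,
}

# Extensive form: a tree where each node belongs to a player and each edge is
# an action; leaves hold player 1's payoff (zero-sum). This game and its
# payoffs are hypothetical, purely for illustration.
extensive_form = {
    "player": 1,
    "actions": {
        "L": {"player": 2, "actions": {"l": +2, "r": -1}},
        "R": {"player": 2, "actions": {"l": -3, "r": +1}},
    },
}
```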
Learning
Learning aims to find which strategies to play in order to reach a Nash Equilibrium, so that we won't lose (on average). Put another way, the learning process finds the probabilities (a probability distribution) with which to play each strategy (or action) so that we won't lose on average. For example, in Rock-Paper-Scissors the equilibrium mixed strategy plays each action with probability 1/3.
ε-Best Response
In standard Fictitious Play, each player seeks the best response to their opponent's moves. The best response is the strategy that leads to the best possible value or reward in the current situation. Let R be the reward of the best response. We define the ε-best response as the set of responses whose rewards r satisfy r ≥ R − ε.
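A minimal sketch of this definition, with hypothetical actions and reward values, could look like this:

```python
# Rewards of each candidate response against the opponent's current strategy
# (hypothetical values, for illustration only).
rewards = {"fold": 0.0, "call": 0.8, "raise": 1.0}

def epsilon_best_responses(rewards, epsilon):
    """Return the responses whose reward r satisfies r >= R - epsilon,
    where R is the reward of the exact best response."""
    best = max(rewards.values())
    return {a for a, r in rewards.items() if r >= best - epsilon}

print(epsilon_best_responses(rewards, epsilon=0.0))   # {'raise'}: the exact best response
print(epsilon_best_responses(rewards, epsilon=0.25))  # {'call', 'raise'}
```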
Generalised Weakened Fictitious Play
A generalised weakened fictitious play is an iterative process that computes a mixed strategy π from the current strategy, an ε-best response bɛ, and some additional perturbations M.

The important take-away from this method is that it converges in the case of a two-player zero-sum game. The reason this matters will become clearer later on.
Details of the generalised weakened fictitious play can be found in a paper written by Leslie & Collins.
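As a rough sketch of a single iteration (the exact conditions on the step sizes, tolerances and perturbations are in that paper), the update mixes the current strategy with an ε-best response plus the perturbation term. The code below is schematic; the strategies are plain probability vectors and the numbers are made up:

```python
import numpy as np

def gwfp_step(pi, b, m, alpha):
    """One generalised weakened fictitious play update (schematic sketch).

    pi    : current mixed strategy (probability vector)
    b     : an epsilon-best response to pi (probability vector)
    m     : perturbation term M (required to vanish over time)
    alpha : step size, with alpha_t -> 0 and sum of alpha_t = infinity
    """
    # In the exact formulation the perturbations are constrained so that
    # the result remains a valid mixed strategy.
    return (1 - alpha) * pi + alpha * (b + m)

# Hypothetical usage over three actions.
pi = np.array([0.5, 0.3, 0.2])
b = np.array([0.0, 1.0, 0.0])                    # epsilon-best response: play action 2
pi = gwfp_step(pi, b, m=np.zeros(3), alpha=0.1)
print(pi)                                        # [0.45 0.37 0.18]
```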
Extensive-Form Fictitious Play (XFP)
Extensive Form Fictitious Play employs the concept of generalised weakened fictitious play to iteratively compute the Nash Equilibrium, so in a two-player zero-sum game XFP converges.

However, there is a problem: XFP suffers from the curse of dimensionality. At each iteration, computation needs to be performed at all states of the game, and the number of states grows exponentially with the size of the game.
For this reason, Reinforcement Learning comes into play, giving rise to Fictitious Self-Play.
Fictitious Self-Play (FSP)
Consider an extensive-form game. In their paper, Heinrich and Silver demonstrate that, when the opponents' strategies are held fixed, an extensive-form game induces a Markov Decision Process (MDP), and is thus suitable for Reinforcement Learning methods. The resulting method is called Fictitious Self-Play (FSP).
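To make the MDP point concrete, here is a minimal sketch of how a game with a fixed opponent strategy can be wrapped as an ordinary environment with a reset/step interface. Everything here is hypothetical (the game object, its methods, and the treatment of observations), and imperfect-information details are glossed over:

```python
class FixedOpponentEnv:
    """Sketch: with the opponent's strategy held fixed, the game seen by one
    player behaves like an MDP (states, actions, stochastic transitions, rewards)."""

    def __init__(self, game, opponent_strategy, player_id):
        self.game = game                            # hypothetical game object
        self.opponent_strategy = opponent_strategy  # fixed policy: state -> action
        self.player_id = player_id

    def reset(self):
        self.state = self.game.initial_state()
        self._play_opponent_turns()
        return self.state

    def step(self, action):
        # The agent acts, then the fixed opponent responds; from the agent's
        # point of view this is just a stochastic state transition.
        self.state = self.game.apply(self.state, action)
        self._play_opponent_turns()
        reward = self.game.reward(self.state, self.player_id)
        done = self.game.is_terminal(self.state)
        return self.state, reward, done

    def _play_opponent_turns(self):
        while (not self.game.is_terminal(self.state)
               and self.game.current_player(self.state) != self.player_id):
            self.state = self.game.apply(self.state, self.opponent_strategy(self.state))
```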
Remember that XFP computes a best response and a strategy update at each iteration, but this computation is exponential. What FSP does is replace those two computations with machine learning methods, as follows:
- Best response: FSP approximates the best response using Reinforcement Learning. As the name implies, the best response is not computed exactly; instead something close to it is obtained, which is the ε-best response.
- Strategy update: FSP updates the average strategy as a supervised learning task, where each player learns a transition model of their own behaviour.
IMPORTANT DETAIL: To ensure that FSP conforms to generalised weakened fictitious play (remember that we need the convergence property), it is proven that the data should be sampled from the best response with probability η and from the current strategy with probability (1 − η). This can be seen in the GenerateData method of the algorithm below: σ ← (1 − η)π + ηβ
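A minimal sketch of this sampling step, assuming hypothetical strategy objects that map a state to an action, might look like the following:

```python
import random

def sample_action(state, avg_strategy, best_response, eta):
    """Sample an action from sigma = (1 - eta) * pi + eta * beta: play the best
    response beta with probability eta, and the average strategy pi otherwise.
    avg_strategy and best_response are hypothetical callables (state -> action)."""
    if random.random() < eta:
        return best_response(state)   # beta, with probability eta
    return avg_strategy(state)        # pi, with probability 1 - eta
```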

Algorithm Explanation
- Episodes of the game are simulated from the agents' strategies, using the GENERATEDATA method (a schematic sketch of the full loop is given after this list).
- The data is stored in the form of episodes of transition tuples (current state, action, next reward, next state).

- The resulting experience (data) is stored in two types of size-limited FIFO memories per agent:
- Mʀʟ stores the experience of the agent's opponents' behaviour. In other words, it simulates the "environment behaviour" facing the agent, and a Reinforcement Learning algorithm uses it to learn an approximate best response.
- Msʟ stores the agent's own behaviour and is used in Supervised Learning to update the agent's average strategy.
- Each agent computes an approximate best response by reinforcement learning off-policy from Mʀʟ.
- Each agent updates its own average strategy by supervised learning from the memory Msʟ.
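Putting these steps together, a schematic sketch of the loop (not the paper's exact pseudocode) might look like the following. The helper callables, the memory sizes, and the way strategies and transitions are represented are all assumptions made for illustration:

```python
from collections import deque

def fictitious_self_play(generate_episodes, rl_best_response, sl_average_strategy,
                         initial_strategy, n_iterations, eta,
                         rl_memory_size, sl_memory_size):
    """Schematic FSP loop for one agent.

    generate_episodes   : callable(pi, beta, eta) -> list of episodes, each a pair
                          (own_transitions, opponent_transitions) of tuples
                          (state, action, reward, next_state)
    rl_best_response    : callable(M_RL) -> new approximate best response beta
    sl_average_strategy : callable(M_SL) -> new average strategy pi
    """
    m_rl = deque(maxlen=rl_memory_size)   # FIFO memory of opponents' behaviour
    m_sl = deque(maxlen=sl_memory_size)   # FIFO memory of the agent's own behaviour
    avg_strategy = best_response = initial_strategy

    for _ in range(n_iterations):
        # GENERATEDATA: simulate episodes from sigma = (1 - eta)*pi + eta*beta
        # and store the resulting transition tuples.
        for own, opponents in generate_episodes(avg_strategy, best_response, eta):
            m_sl.extend(own)
            m_rl.extend(opponents)

        # Approximate best response: off-policy reinforcement learning from M_RL.
        best_response = rl_best_response(m_rl)

        # Average strategy update: supervised learning from the agent's own behaviour.
        avg_strategy = sl_average_strategy(m_sl)

    return avg_strategy
```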
Experiments
The two algorithms are evaluated in two parameterized zero-sum imperfect-information games: Leduc Hold'em and River poker. In a two-player zero-sum game, the exploitability of a strategy profile π is defined as δ = R1 + R2, where R1 is the reward player 1 obtains by playing a best response against player 2's strategy, and R2 is the reward player 2 obtains by playing a best response against player 1's strategy. Ideally, at a Nash Equilibrium, δ tends towards zero.
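As a tiny sketch of this metric, assuming a hypothetical best_response_value helper that returns the reward a player can get by best-responding to the opponent's strategy:

```python
def exploitability(pi1, pi2, best_response_value):
    """delta = R1 + R2: the sum of the best-response rewards against each
    player's strategy. At a Nash Equilibrium delta tends towards zero."""
    r1 = best_response_value(player=1, opponent_strategy=pi2)
    r2 = best_response_value(player=2, opponent_strategy=pi1)
    return r1 + r2
```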
Figure 2 shows that in 6-card and 60-card Leduc Hold'em poker, XFP clearly outperformed FSP.

In Figure 3, FSP improved its average strategy profile much faster than the full-width variant (XFP) in both instances of River poker. In River poker, FSP obtained an exploitability of 0.11 after 30 seconds, whereas after 300 seconds XFP was still exploitable by more than 0.26.

Conclusion
Heinrich & Silver argue that XFP, being a fictitious play algorithm, preserves the convergence guarantees of fictitious play in games with the fictitious play property. However, it suffers from the curse of dimensionality. FSP is a sample-based approach that implements generalised weakened fictitious play in a machine learning framework. It remains an open question whether guaranteed convergence can be achieved with a finite computational budget per iteration.