Training an Agent to Master a Simple Game Through Self-Play

Simulate games and predict the outcomes.

Sébastien Gilbert
Towards Data Science


A robot computing some additions. Image by the author, with help from DALL-E 2.

Introduction

Isn’t it amazing that everything you need to excel in a perfect information game is there for everyone to see in the rules of the game?

Unfortunately, for mere mortals like me, reading the rules of a new game is only a tiny fraction of the journey to learn to play a complex game. Most of the time is spent playing, ideally against a player of comparable strength (or a better player who is patient enough to help us expose our weaknesses). Losing often and hopefully winning sometimes provides the psychological punishments and rewards that steer us towards playing incrementally better.

Perhaps, in the not-too-distant future, a language model will read the rules of a complex game such as chess and, right from the start, play at the highest possible level. In the meantime, I propose a more modest challenge: learning by self-play.

In this project, we’ll train an agent to learn to play perfect-information, two-player games by observing the results of matches played by previous versions of itself. The agent will approximate a value (the expected game result) for any game state. As an additional challenge, our agent won’t be allowed to maintain a lookup table of the state space, as this approach would not be manageable for complex games.

Solving SumTo100

The game

The game we are going to discuss is SumTo100. The goal of the game is to reach a sum of 100 by adding numbers between 1 and 10. Here are the rules:

  1. Initialize sum = 0.
  2. Choose a first player. The two players take turns.
  3. While sum < 100:
  • The player chooses a number between 1 and 10 inclusive. The selected number gets added to the sum without exceeding 100.
  • If sum < 100, the other player plays (i.e., we go back to the top of point 3).
  4. The player who added the last number (reaching 100) wins.
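
To make these rules concrete, here is a minimal sketch of what a game authority for SumTo100 could look like. The class name and method signatures are my own illustration, not necessarily those used in the repository:

```python
class SumTo100GameAuthority:
    """Encodes the rules of SumTo100: legal moves, transitions, and the win condition."""

    TARGET = 100

    def legal_moves(self, state: int) -> list[int]:
        # A legal move is a number between 1 and 10 that doesn't push the sum past 100.
        return [move for move in range(1, 11) if state + move <= self.TARGET]

    def play(self, state: int, move: int) -> int:
        # Deterministic transition: the chosen number is added to the sum.
        assert move in self.legal_moves(state)
        return state + move

    def is_won(self, state: int) -> bool:
        # The player who reaches exactly 100 wins.
        return state == self.TARGET
```

Note that, with this reading of the rules, a player facing a sum of 99 has a single legal move (adding 1), which will matter later.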

Two snails minding their own business. Image by the author, with help from DALL-E 2.

Starting with such a simple game has many advantages:

  • The state space has only 101 possible values.
  • The states can get plotted on a 1D grid. This peculiarity will allow us to represent the state value function learned by the agent as a 1D bar graph.
  • The optimal strategy is known:
    - Reach a sum of 11n + 1, where n ∈ {0, 1, 2, …, 9}

We can visualize the state value of the optimal strategy:

Figure 1: The optimal state values for SumTo100. Image by the author.

The game state is the sum after an agent has completed its turn. A value of 1.0 means that the agent is sure to win (or has won), while a value of -1.0 means that the agent is sure to lose (assuming the opponent plays optimally). An intermediate value represents the estimated return. For example, a state value of 0.2 indicates a slightly favourable state, while a state value of -0.8 indicates a likely loss.
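
As a sanity check, the shape of Figure 1 can be reproduced directly from the 11n + 1 rule. The snippet below is a small illustrative sketch; the function name and the plotting details are mine, not the repository’s:

```python
import matplotlib.pyplot as plt

def optimal_state_value(sum_after_move: int) -> float:
    # Leaving a sum of 11n + 1 (1, 12, 23, ..., 89, 100) guarantees a win
    # for the player who just moved; any other sum loses against optimal play.
    return 1.0 if sum_after_move % 11 == 1 else -1.0

states = list(range(101))
values = [optimal_state_value(s) for s in states]

plt.bar(states, values)
plt.xlabel("Sum after the agent's move")
plt.ylabel("Optimal state value")
plt.show()
```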

If you want to dive into the code, the script that performs the whole training procedure is learn_sumTo100.sh, in this repository. Otherwise, bear with me as we go through a high-level description of how our agent learns by self-play.

Generation of games played by random players

We want our agent to learn from games played by previous versions of itself, but in the first iteration, since the agent has not learned anything yet, we’ll have to simulate games played by random players. At each turn, the players will get the list of legal moves from the game authority (the class that encodes the game rules), given the current game state. The random players will select a move randomly from this list.
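
Assuming the hypothetical SumTo100GameAuthority sketched earlier, generating one random game could look like this:

```python
import random

def simulate_random_game(authority) -> list[int]:
    """Plays one game with two random players and returns the successive sums, starting at 0."""
    state = 0
    states = [state]
    while not authority.is_won(state):
        move = random.choice(authority.legal_moves(state))
        state = authority.play(state, move)
        states.append(state)
    return states
```

In the returned list, the odd-indexed sums are the states reached by the first player, and the even-indexed sums (after the initial 0) are those reached by the second player.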

Figure 2 is an example of a game played by two random players:

Figure 2: Example of a game played by random players. Image by the author.

In this case, the second player won the game by reaching a sum of 100.

We’ll implement an agent with access to a neural network that takes a game state (after the agent has played) as input and outputs the expected return of the game. For any given state (before the agent has played), the agent gets the list of legal actions and their corresponding candidate states (we only consider games with deterministic transitions).

Figure 3 shows the interactions between the agent, the opponent (whose move selection mechanism is unknown), and the game authority:

Figure 3: Interactions between the agent, the opponent, and the game authority. Image by the author.

In this setting, the agent relies on its regression neural network to predict the expected return of game states. The better the neural network can predict which candidate move yields the highest return, the better the agent will play.
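
In code, the greedy part of this decision could be sketched as follows, assuming a PyTorch regression network that takes a single normalized number as input (a sketch of such a network appears further down). The function name and the state encoding are illustrative assumptions:

```python
import torch

def choose_greedy_move(authority, value_network, state: int) -> int:
    """Picks the legal move whose resulting state has the highest predicted return."""
    best_move, best_value = None, float("-inf")
    with torch.no_grad():
        for move in authority.legal_moves(state):
            candidate_state = authority.play(state, move)
            # The network evaluates the state *after* the agent has played.
            state_tensor = torch.tensor([[candidate_state / 100.0]])
            predicted_return = value_network(state_tensor).item()
            if predicted_return > best_value:
                best_move, best_value = move, predicted_return
    return best_move
```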

Our list of randomly played matches will provide us with the dataset for our first pass of training. Taking the example game from Figure 2, we want to punish the moves made by player 1, since its behaviour led to a loss. The state resulting from the last action gets a value of -1.0, since it allowed the opponent to win. The other states get discounted negative values by a factor of γᵈ, where d is the distance from the last state reached by the agent. γ (gamma) is the discount factor, a number in [0, 1] that expresses the uncertainty in the evolution of a game: we don’t want to punish early decisions as harshly as the last ones. Figure 4 shows the state values associated with the decisions made by player 1:

Figure 4: The state values, from the point of view of player 1. Image by the author.

The random games generate states with their target expected return. For example, reaching a sum of 97 has a target expected return of -1.0, and a sum of 73 has a target expected return of -γ³. Half the states take the point of view of player 1, and the other half take the point of view of player 2 (although it doesn’t matter in the case of the game SumTo100). When a game ends with a win for the agent, the corresponding states get similarly discounted positive values.
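
Here is a small sketch of how those targets could be computed from the states reached by one player. The discount factor value below is an assumption for illustration; the repository may use a different one:

```python
GAMMA = 0.9  # illustrative discount factor

def discounted_targets(agent_states, agent_won: bool, gamma: float = GAMMA):
    """Maps each state reached by the agent to gamma^d * (+1 or -1),
    where d is the distance from the agent's last state."""
    final_value = 1.0 if agent_won else -1.0
    last_index = len(agent_states) - 1
    return [(state, final_value * gamma ** (last_index - i))
            for i, state in enumerate(agent_states)]
```

The last state gets ±1.0, the one before it ±γ, the one before that ±γ², and so on.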

Training an agent to predict the return of games

We have all we need to start our training: a neural network (we’ll use a two-layer perceptron) and a dataset of (state, expected return) pairs. Let’s see how the loss on the predicted expected return evolves:

Figure 5: Evolution of the loss as a function of the epoch. Image by the author.

We shouldn’t be surprised that the neural network doesn’t show much predicting power over the outcome of games played by random players.
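
For reference, here is a minimal sketch of what the regression setup could look like. The hidden-layer size, optimizer, number of epochs, and the Tanh output squashing the prediction to [-1, 1] are illustrative assumptions, not necessarily the repository’s choices:

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """A two-layer perceptron mapping a normalized game state to a predicted return in [-1, 1]."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Tanh(),  # the expected return lives in [-1, 1]
        )

    def forward(self, state):
        return self.layers(state)

def train(network, states, targets, epochs: int = 100, learning_rate: float = 1e-3):
    """One training pass over the (state, expected return) dataset with a mean squared error loss."""
    optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate)
    loss_fn = nn.MSELoss()
    inputs = torch.tensor(states, dtype=torch.float32).unsqueeze(1) / 100.0
    labels = torch.tensor(targets, dtype=torch.float32).unsqueeze(1)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(network(inputs), labels)
        loss.backward()
        optimizer.step()
```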

Did the neural network learn anything at all?

Fortunately, because the states can get represented as a 1D grid of numbers between 0 and 100, we can plot the predicted returns of the neural network after the first training round and compare them with the optimal state values of Figure 1:

Figure 6: The predicted returns after training on a dataset of games played by random players. Image by the author.

As it turns out, through the chaos of random games, the neural network learned two things:

  • If you can reach a sum of 100, do it. That’s good to know, considering it is the goal of the game.
  • If you reach a sum of 99, you’re sure to lose. Indeed, in this situation, the opponent has only one legal action, and that action leads to a loss for the agent.

The neural network learned essentially to finish the game.

To learn to play a little better, we must rebuild the dataset by simulating games played between copies of the agent with their freshly trained neural network. To avoid generating identical games, the players play a bit randomly. An approach that works well is choosing moves with the epsilon-greedy algorithm, using ε = 0.5 for each player’s first move, then ε = 0.1 for the rest of the game.
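
Building on the hypothetical choose_greedy_move sketched earlier, the epsilon-greedy selection is a thin wrapper:

```python
import random

def choose_epsilon_greedy_move(authority, value_network, state: int, epsilon: float) -> int:
    """With probability epsilon, plays a random legal move; otherwise plays the greedy move."""
    if random.random() < epsilon:
        return random.choice(authority.legal_moves(state))
    return choose_greedy_move(authority, value_network, state)
```

The caller passes ε = 0.5 for each player’s first move and ε = 0.1 afterwards.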

Repeating the training loop with better and better players

Since both players now know that they must reach 100, reaching a sum between 90 and 99 should be punished, because the opponent would jump on the opportunity to win the match. This phenomenon is visible in the predicted state values after the second round of training:

Figure 7: Predicted state values after two rounds of training. Sums from 90 to 99 show values close to -1. Image by the author.

We see a pattern emerging. The first training round informs the neural network about the last action; the second training round informs it about the penultimate action, and so on. We need to repeat the cycle of game generation and training at least as many times as there are actions in a game.
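
Putting the earlier sketches together, the outer loop could look like the following. The number of games per round is an assumption, and the parity bookkeeping simply separates the states reached by the winner from those reached by the loser:

```python
def self_play_training_loop(authority, value_network, number_of_rounds: int = 25,
                            games_per_round: int = 1000, gamma: float = 0.9):
    """Alternates between generating epsilon-greedy self-play games and retraining the network."""
    for _ in range(number_of_rounds):
        states, targets = [], []
        for _ in range(games_per_round):
            # Play one game between two copies of the current agent.
            game_states = [0]
            move_index = 0
            while not authority.is_won(game_states[-1]):
                epsilon = 0.5 if move_index < 2 else 0.1  # first move of each player
                move = choose_epsilon_greedy_move(authority, value_network,
                                                  game_states[-1], epsilon)
                game_states.append(authority.play(game_states[-1], move))
                move_index += 1
            # The player who reached the last state (100) won; states alternate between players.
            last = len(game_states) - 1
            winner_states = [s for i, s in enumerate(game_states) if i >= 1 and i % 2 == last % 2]
            loser_states = [s for i, s in enumerate(game_states) if i >= 1 and i % 2 != last % 2]
            for state, target in (discounted_targets(winner_states, agent_won=True, gamma=gamma)
                                  + discounted_targets(loser_states, agent_won=False, gamma=gamma)):
                states.append(state)
                targets.append(target)
        train(value_network, states, targets)
```

After each round, the freshly trained network is used to generate the next batch of games.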

The following animation shows the evolution of the predicted state values after 25 training rounds:

Figure 8: Animation of the state values learned along the training rounds. Image by the author.

The envelope of the predicted returns decays exponentially, as we go from the end towards the beginning of the game. Is this a problem?

Two factors contribute to this phenomenon:

  • γ directly damps the target expected returns, as we move away from the end of the game.
  • The epsilon-greedy algorithm injects randomness into the players’ behaviour, making the outcomes harder to predict. There is an incentive to predict a value close to zero, to protect against cases of extremely high losses. However, the randomness is desirable, because we don’t want the neural network to learn a single line of play. We want the neural network to witness blunders and unexpected good moves, both from the agent and the opponent.

In practice, this shouldn’t be a problem, because in any situation we compare the values of the legal moves from a given state, and those share a comparable scale, at least for the game SumTo100. The scale of the values doesn’t matter when we choose the greedy move.

Conclusion

We challenged ourselves to create an agent that can learn to master a two-player game of perfect information, with deterministic transitions from one state to the next, given an action. No hand-coded strategies or tactics were allowed: everything had to be learned by self-play.

We could solve the simple game of SumTo100 by running multiple rounds of pitting copies of the agent against each other and training a regression neural network to predict the expected return of the generated games.

The insight we gained prepares us well for the next rung on the ladder of game complexity, but that will be for my next post! 😊

Thank you for your time.
