Training an Agent to Master Tic-Tac-Toe Through Self-Play

Surprisingly, a software agent never gets tired of the game.

Sébastien Gilbert
Towards Data Science


Ah! primary school! This was the time when we learned valuable skills, such as literacy, arithmetic, and playing tic-tac-toe optimally.

Photo by Solstice Hannan on Unsplash

Playing a tic-tac-toe match with your friend without getting caught by the teacher is an art. You must discreetly pass the game sheet under the desk while giving the impression of being attentive to the subject matter. The fun was probably more about the undercover operation than the game itself.

We cannot teach the art of avoiding getting caught in the classroom to a software agent, but can we train an agent to master the game?

In my previous post, we studied an agent learning the game SumTo100 through self-play. It was an easy game that allowed us to display the state value, which helped us build an intuition about how the agent learns the game. With tic-tac-toe, we are tackling a much larger state space.

You can find the Python code in this repository. The script that performs the training is learn_tictactoe.sh:

#!/bin/bash
declare -i NUMBER_OF_GAMES=30000
declare -i NUMBER_OF_EPOCHS=5

export PYTHONPATH='./'

python preprocessing/generate_positions_expectations.py \
--outputDirectory=./learn_tictactoe/output_tictactoe_generate_positions_expectations_level0 \
--game=tictactoe \
--numberOfGames=$NUMBER_OF_GAMES \
--gamma=0.95 \
--randomSeed=1 \
--agentArchitecture=None \
--agentFilepath=None \
--opponentArchitecture=None \
--opponentFilepath=None \
--epsilons="[1.0]" \
--temperature=0

dataset_filepath="./learn_tictactoe/output_tictactoe_generate_positions_expectations_level0/dataset.csv"

python train/train_agent.py \
$dataset_filepath \
--outputDirectory="./learn_tictactoe/output_tictactoe_train_agent_level1" \
--game=tictactoe \
--randomSeed=0 \
--validationRatio=0.2 \
--batchSize=64 \
--architecture=SaintAndre_1024 \
--dropoutRatio=0.5 \
--learningRate=0.0001 \
--weightDecay=0.00001 \
--numberOfEpochs=$NUMBER_OF_EPOCHS \
--startingNeuralNetworkFilepath=None


for level in {1..16}
do
dataset_filepath="./learn_tictactoe/output_tictactoe_generate_positions_expectations_level${level}/dataset.csv"
python preprocessing/generate_positions_expectations.py \
--outputDirectory="./learn_tictactoe/output_tictactoe_generate_positions_expectations_level${level}" \
--game=tictactoe \
--numberOfGames=$NUMBER_OF_GAMES \
--gamma=0.95 \
--randomSeed=0 \
--agentArchitecture=SaintAndre_1024 \
--agentFilepath="./learn_tictactoe/output_tictactoe_train_agent_level${level}/SaintAndre_1024.pth" \
--opponentArchitecture=SaintAndre_1024 \
--opponentFilepath="./learn_tictactoe/output_tictactoe_train_agent_level${level}/SaintAndre_1024.pth" \
--epsilons="[0.5, 0.5, 0.1]" \
--temperature=0

declare -i next_level=$((level + 1))
python train/train_agent.py \
"./learn_tictactoe/output_tictactoe_generate_positions_expectations_level${level}/dataset.csv" \
--outputDirectory="./learn_tictactoe/output_tictactoe_train_agent_level${next_level}" \
--game=tictactoe \
--randomSeed=0 \
--validationRatio=0.2 \
--batchSize=64 \
--architecture=SaintAndre_1024 \
--dropoutRatio=0.5 \
--learningRate=0.0001 \
--weightDecay=0.00001 \
--numberOfEpochs=$NUMBER_OF_EPOCHS \
--startingNeuralNetworkFilepath="./learn_tictactoe/output_tictactoe_train_agent_level${level}/SaintAndre_1024.pth"

done

The script loops through calls to two programs: generate_positions_expectations.py, which simulates matches and writes a dataset of game states paired with their expected returns, and train_agent.py, which trains the agent's neural network on that dataset.

The training cycle

The agent learns the game through a cycle of generating matches and training a neural network to predict the match outcome from the game state:

Figure 1: The cycle of match generation and neural network training. Image by the author.

Generation of matches

The cycle starts with the simulation of matches between random players, i.e., players that choose randomly from the list of legal actions in a given game state.

Why are we generating matches played randomly?

This project is about learning by self-play, so we can’t give the agent any a priori information about how to play. In the first cycle, since the agent has no clue about good or bad moves, the matches must be generated by random play.

Figure 2 shows an example of a match between random players:

Figure 2: Example of a tic-tac-toe match played randomly. Image by the author.

What lesson can we learn by observing this match? From the point of view of the ‘X’ player, we can assume this is an example of poor play since it concluded in a loss. We don’t know which moves are responsible for the defeat, so we’ll assume that all the decisions made by the ‘X’ player were bad. If some decisions were actually good, we count on statistics to rectify their predicted state values: other simulations will pass through similar states and may end differently.

The last action from player ‘X’ is given a value of -1. The other actions receive a discounted negative value that geometrically decays by a factor of γ (gamma) ∈ [0, 1] as we go backward toward the first move.

Figure 3: Target values for game states. Image by the author.

States from matches that resulted in a win receive similarly discounted positive values. States from drawn matches are given a value of zero. The agent takes the point of view of both the first and the second player.
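To make the discounting concrete, here is a minimal sketch of how the target values could be assigned (a hypothetical helper, not the repository's code), assuming the outcome is +1 for a win, -1 for a loss, and 0 for a draw:

def discounted_targets(number_of_moves, outcome, gamma=0.95):
    """Assign a target value to each state the player acted from.

    outcome: +1 for a win, -1 for a loss, 0 for a draw.
    The state before the last move receives the full outcome; earlier
    states are discounted geometrically, going backward to the first move.
    """
    return [outcome * gamma ** (number_of_moves - 1 - move_index)
            for move_index in range(number_of_moves)]

# Example: the 'X' player made 4 moves in a lost match (outcome = -1)
print([round(value, 4) for value in discounted_targets(4, outcome=-1)])
# [-0.8574, -0.9025, -0.95, -1.0]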

The game states as tensors

We need a tensor representation for the game state. We’ll use a [2x3x3] tensor where the first dimension represents the channels (0 for ‘X’ and 1 for ‘O’), and the two other dimensions are the rows and the columns. The occupancy of a (row, column) square gets encoded as a 1 in the (channel, row, column) entry.

Figure 4: The game state representation by a [2x3x3] tensor. Image by the author.
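As an illustration, the encoding could be done along these lines (a minimal sketch with a hypothetical helper name, assuming the board is a 3x3 grid holding ‘X’, ‘O’, or None):

import torch

def board_to_tensor(board):
    """Encode a tic-tac-toe board as a [2, 3, 3] tensor.

    Channel 0 holds the 'X' occupancies and channel 1 the 'O' occupancies;
    an occupied (row, column) square becomes a 1 in (channel, row, column).
    """
    state = torch.zeros(2, 3, 3)
    for row in range(3):
        for column in range(3):
            if board[row][column] == 'X':
                state[0, row, column] = 1.0
            elif board[row][column] == 'O':
                state[1, row, column] = 1.0
    return state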

The pairs of (state tensor, target value) obtained by the generation of matches constitute the dataset on which the neural network will train in each round. The dataset is built at the beginning of the cycle, taking advantage of the learning that has occurred in previous rounds. While the first round generates purely random play, the subsequent ones generate gradually more realistic matches.

Injecting randomness into the play

The first round of match generation pits random players against each other. The subsequent rounds pit the agent against itself (hence “self-play”). The agent is equipped with a regression neural network trained to predict the match outcome, which allows it to choose the legal action that yields the highest expected value. To promote diversity, the agent chooses actions based on an epsilon-greedy algorithm: with probability (1-ε), the best action gets chosen; otherwise, a random action gets chosen.
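As a sketch, the epsilon-greedy choice could look like this (a hypothetical helper; it assumes the network has already produced an expected value for each legal action, in the same order as the actions):

import random

def epsilon_greedy_choice(legal_actions, predicted_values, epsilon):
    """Pick the highest-valued legal action with probability (1 - epsilon);
    otherwise pick a random legal action to keep the matches diverse."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    best_index = max(range(len(legal_actions)),
                     key=lambda index: predicted_values[index])
    return legal_actions[best_index]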

Training

Figure 5 shows the evolution of the validation loss over five epochs, for up to 17 training rounds:

Figure 5: The evolution of the validation loss for various training round numbers. Image by the author.

We can see that the first few training rounds show a fast decrease in the validation loss, and then there seems to be a plateau around a mean squared error loss of 0.2. This trend shows that the agent’s regression neural network gets better at predicting the outcome of a match played against itself, from a given game state. Since the actions from both players are non-deterministic, there is a limit to the predictability of the match outcome. That explains why the validation loss stops improving after some rounds.
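For a concrete picture of what each training round optimizes, here is a minimal PyTorch sketch of one training step. The layer sizes and the Adam optimizer are assumptions; the real SaintAndre_1024 model and training loop are defined in the repository. The hyperparameters mirror the ones passed to train_agent.py above.

import torch

# A stand-in regression network: flatten the [2, 3, 3] state and predict
# a single expected-outcome value. Not the actual SaintAndre_1024 model.
network = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(18, 1024),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.5),
    torch.nn.Linear(1024, 1),
)
optimizer = torch.optim.Adam(network.parameters(), lr=0.0001, weight_decay=0.00001)
criterion = torch.nn.MSELoss()

state_batch = torch.zeros(64, 2, 3, 3)  # a batch of encoded game states
target_batch = torch.zeros(64, 1)       # the corresponding discounted target values

prediction = network(state_batch)
loss = criterion(prediction, target_batch)  # the mean squared error plotted in Figure 5
optimizer.zero_grad()
loss.backward()
optimizer.step()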

Improvement from round to round

With the game SumTo100, we could represent the state on a 1D grid. However, with tic-tac-toe, we can’t directly display the state value evolution. One thing we can do to measure the improvement is to pit the agent against the previous version of itself and observe the difference between the wins and the losses.

Using ε = 0.5 for the first action of both players and ε = 0.1 for the rest of the match, running 1000 matches per comparison, this is what we get:

Figure 6: Comparison of the agent with its previous version. Image by the author.

The number of wins exceeded the number of losses (showing an improvement) for the first 10 training rounds. After that, the agent didn’t improve from round to round.
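For reference, the head-to-head tally boils down to a loop like the following sketch; play_match is a hypothetical caller-supplied function, not part of the repository’s API.

def compare_with_previous_version(play_match, number_of_matches=1000):
    """Tally the agent's wins, losses, and draws against its previous version.

    play_match(agent_plays_first) is a hypothetical callable that plays one
    match and returns +1 for an agent win, -1 for a loss, and 0 for a draw.
    The agent plays first in half of the matches.
    """
    wins = losses = draws = 0
    for match_index in range(number_of_matches):
        outcome = play_match(agent_plays_first=(match_index % 2 == 0))
        if outcome > 0:
            wins += 1
        elif outcome < 0:
            losses += 1
        else:
            draws += 1
    return wins, losses, draws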

Test

Time to see how our agent plays tic-tac-toe!

One useful feature of a regression neural network is that it lets us display the agent’s evaluation of every legal move. Let’s play a game against the agent, showing how it judges its options.
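One way to obtain those per-move evaluations is to write each legal move into a copy of the state tensor and let the network score the resulting position, along these lines (a hypothetical sketch; it assumes the network maps a batch of [2, 3, 3] state tensors to one expected return per state):

import torch

def evaluate_legal_moves(network, state_tensor, legal_moves, player_channel):
    """Return the predicted expected return for each legal (row, column) move.

    Each candidate move is written into a copy of the [2, 3, 3] state tensor
    (channel 0 for 'X', channel 1 for 'O') before querying the network.
    """
    evaluations = {}
    with torch.no_grad():
        for (row, column) in legal_moves:
            candidate_state = state_tensor.clone()
            candidate_state[player_channel, row, column] = 1.0
            evaluations[(row, column)] = network(candidate_state.unsqueeze(0)).item()
    return evaluations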

Manual play

The agent starts, playing ‘X’:

Figure 7: A match played against the agent, with the action evaluations. Image by the author.

That’s how you get brutally crushed by a soulless tic-tac-toe machine!

As soon as I put the ‘O’ in the (1, 0) square, the expected return increased from 0.142 to 0.419, and my fate was sealed.

Let’s see how it does when the agent plays second:

Figure 8: A match played against the agent, with action evaluations. Image by the author.

It didn’t fall into the trap, and the match was a draw.

Matches against a random player

If we simulate a large number of matches against a random player, this is what we get:

Figure 9: Results of 1000 matches against a random player. Image by the author.

Out of 1000 matches (the agent played first in half of them), the agent won 950, lost none, and drew the remaining 50. This is not proof that our agent plays optimally, but it has certainly reached a decent level of play.

Conclusion

As a follow-up to Training an Agent to Master a Simple Game Through Self-Play, where the game was easy to crack and the state space was small, we used the same technique to master tic-tac-toe. Although this is still a toy problem, the state space of tic-tac-toe is large enough that the agent’s regression neural network has to find patterns in the state tensors, and those patterns allow it to generalize to unseen state tensors.

The code is available in this repository. Give it a try, and let me know what you think!
