
How to create an AI that plays tic tac toe with reinforcement learning

The AI is unbeatable if it goes first

A short video of the AI kicking my ass in tic tac toe! The AI is X, I am O. See how he traps me? Video by author

In this article, we will create two agents that play each other at tic tac toe until one of them reaches mastery.

Writing a program that learns to play tic tac toe is a first step in learning how Reinforcement Learning works. For this project, I assume you have already been introduced to the theory of tabular reinforcement learning (which you can learn more about here). This includes understanding what the value of a state is and how to find it. I will go in depth on implementing this theory in code. After you finish this project, you are only one neural network away from getting the agent to learn to play video games (and maybe some more useful stuff too)!

Implementation:

First, we will need a Tic Tac Toe game to be our environment. The implementation posted below should suffice.
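Since the exact environment code matters less than its shape, here is a minimal sketch of what it might look like. Only the spot_placeholders list and the get_action function are names that come up again later; everything else here (the Player and TicTacToe classes, check_winner, print_board) is just my placeholder structure:

```python
class Player:
    """A bare-bones player; symbol is 1 for X, 2 for O."""
    def __init__(self, symbol):
        self.symbol = symbol


class TicTacToe:
    def __init__(self, player1, player2):
        self.player1 = player1
        self.player2 = player2
        # 0 = empty, 1 = X, 2 = O; spots numbered 0-8, row by row
        self.spot_placeholders = [0] * 9

    def print_board(self):
        for row in range(3):
            print(self.spot_placeholders[3 * row: 3 * row + 3])

    def check_winner(self):
        """Return 1 or 2 if that player has three in a row, 0 for a draw, None otherwise."""
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]
        s = self.spot_placeholders
        for a, b, c in lines:
            if s[a] != 0 and s[a] == s[b] == s[c]:
                return s[a]
        if 0 not in s:
            return 0   # cat's game (a draw)
        return None    # game still in progress

    def get_action(self, player):
        """For now, ask a human which open spot to take -- this is what we automate later."""
        spot = int(input(f"Player {player.symbol}, pick an open spot (0-8): "))
        while self.spot_placeholders[spot] != 0:
            spot = int(input("That spot is taken, pick another (0-8): "))
        self.spot_placeholders[spot] = player.symbol
```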

This is a classic implementation of tic tac toe where the spots on the board are numbered 0 through 8, row by row from top to bottom and left to right within each row.

I will first show how to encode our states, with each state being a specific configuration of the game board. Then we will tackle the two functions that stand between us and tic tac toe autonomy: the get_action function and the update_values function. Once those are done, you should be able to build two agents and have them play each other until at least one has mastered the game.

Representing states as numbers

As I said, we need to store every state the agent encounters in its memory. For tic tac toe, we define the current configuration of the game board as the state. In my implementation, the game board is described by a list called ‘spot_placeholders’. The problem is that if we store every state ever visited, each of which is itself a list, inside another list, we end up with a huge list full of lists. This is very inefficient. Instead of a list full of lists, let’s aim for a list full of integers, with each integer representing a unique state.

To figure out how to represent each state as a number, let’s start by manually encoding a few states and trying to identify a pattern. Note that in the list ‘spot_placeholders’, a 1 means there is an X in that spot and a 2 means there is an O in that spot.

  • State: [0,0,0,0,0,0,0,0,0] = 0
  • State: [1,0,0,0,0,0,0,0,0] = 1
  • State: [2,0,0,0,0,0,0,0,0] = 2
  • State: [0,1,0,0,0,0,0,0,0] = 3
  • State: [1,1,0,0,0,0,0,0,0] = 4
  • State: [2,1,0,0,0,0,0,0,0] = 5
  • State: [0,2,0,0,0,0,0,0,0] = 6
  • State: [1,2,0,0,0,0,0,0,0] = 7
  • State: [2,2,0,0,0,0,0,0,0] = 8

One possible function we could use to enumerate these states is:
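Reading the board as a nine-digit base-3 number reproduces every example above (each spot i contributes its value times 3 to the power i):

state_number = spot_placeholders[0]·3⁰ + spot_placeholders[1]·3¹ + … + spot_placeholders[8]·3⁸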

This should feel pretty natural if you’re used to working with binary

Or in code:
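As a rough sketch (the function name state_to_number is just my placeholder):

```python
def state_to_number(spot_placeholders):
    """Encode a board (a list of nine 0/1/2 entries) as a unique base-3 integer."""
    state_number = 0
    for i, spot in enumerate(spot_placeholders):
        state_number += spot * (3 ** i)
    return state_number


# sanity check against the examples above: [2, 1, 0, ...] -> 2*1 + 1*3 = 5
print(state_to_number([2, 1, 0, 0, 0, 0, 0, 0, 0]))  # prints 5
```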

If this function doesn’t make sense to you, don’t stress out about it. It is very specific to tic tac toe and probably won’t be used much in future projects. If you just copy and paste the code without fully understanding how it works, not much harm will be done to your understanding of RL. Just know that you can improve efficiency by encoding states in some way.

Epsilon Greedy

The epsilon greedy algorithm is how we will decide which action our agent will take. We want the agent to take random actions at first, but once it starts getting the hang of things we want it to play to win. So what we do is set a variable ε that represents the agent’s chance of taking a random action, and slowly decrease ε over time. The pseudocode is this:

  1. Initialize ε = 1
  2. Generate a random number between 0 and 1
  3. If this random number is less than ε, explore the state space (take a random action).
  4. Otherwise, perform the action that leads to the state with the highest value.
  5. Decrease ε by a small amount

ε continues to fall until it reaches some predetermined floor. You can think of ε as the probability that the agent will take a random action.
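In code, the whole strategy boils down to a few lines; the floor and the step size below are values I picked arbitrarily:

```python
import random

epsilon = 1.0            # current chance of taking a random action
EPSILON_MIN = 0.05       # predetermined floor (an assumed value)
EPSILON_STEP = 0.0001    # small decrease applied after each game (an assumed value)

def should_explore():
    """Return True when the agent should take a random (exploratory) action."""
    return random.random() < epsilon

# after every game, shrink epsilon until it reaches the floor
epsilon = max(EPSILON_MIN, epsilon - EPSILON_STEP)
```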

The Get Action Function

Currently, our get_action function asks the user which spot they would like to take. Since we are making this completely autonomous, you are free to clear out the get_action function and start from scratch. What we want to do is find all the possible next states the agent can be in, then find which one of those states has the highest value. We do this by looping through the game board (spot_placeholders): if a spot is open, we record what the new state would be if the agent took that free spot. We then look in the agent’s state value table and find the value of this state. If this state has the highest value, we move to the spot that results in this state.

The pseudocode for my implementation of this function is the following:

  1. Check whose turn it is
  2. Find all the next possible states that the player can be in.
  3. Perform ε greedy to either take a random action or be greedy (take the best action).
  4. If greedy, loop through the next possible states and find the state with the highest value.
  5. Try to perform this action, but if the agent doesn’t have any of these states in its table then do a random action.

The code for my implementation of this function can be found below:
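Here is a sketch of how that can look, continuing the classes from earlier and reusing the state_to_number helper. The state_values dictionary on the player is my own name for the agent’s state value table:

```python
import random


class Player:
    def __init__(self, symbol, epsilon=1.0):
        self.symbol = symbol          # 1 for X, 2 for O
        self.epsilon = epsilon        # chance of taking a random action
        self.state_values = {}        # encoded state -> estimated value


class TicTacToe:
    # __init__, print_board, check_winner, etc. are unchanged from the earlier sketch

    def get_action(self, player):
        # 1-2. find every open spot and the state the board would be in after taking it
        candidate_spots, candidate_states = [], []
        for spot, occupant in enumerate(self.spot_placeholders):
            if occupant == 0:
                next_board = self.spot_placeholders.copy()
                next_board[spot] = player.symbol
                candidate_spots.append(spot)
                candidate_states.append(state_to_number(next_board))

        # 3. epsilon-greedy: explore with probability epsilon, otherwise be greedy
        if random.random() < player.epsilon:
            chosen = random.randrange(len(candidate_spots))
        else:
            # 4-5. greedy: pick the known state with the highest value; if none of the
            # candidate states is in the table yet, fall back to a random action
            known = [(i, player.state_values[s])
                     for i, s in enumerate(candidate_states)
                     if s in player.state_values]
            if known:
                chosen = max(known, key=lambda pair: pair[1])[0]
            else:
                chosen = random.randrange(len(candidate_spots))

        self.spot_placeholders[candidate_spots[chosen]] = player.symbol
        return candidate_states[chosen]   # useful for recording visited states
```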

Note that I also added the epsilon attribute to the player class for the ε greedy strategy.

The Update Values Function

Currently, we can make two agents play each other, but these two agents have no memory whatsoever. It’s like every game they play is their first-ever game of tic tac toe. So, let’s fix that.

What we do is keep a list of all the states visited in an episode. At the end of the episode, we need to call a function that applies the value update function to every state in this list for each agent. Notice that we only receive a reward at the last state of the episode. Therefore, we can just set the value of the last state to the reward. The pseudocode for my implementation is the following:

  1. Set the value of the last state equal to either 1 or -1 (reward if game won or punishment if game lost)
  2. Append any new, previously undiscovered states to the player’s memory
  3. Apply the update function starting from the second to last state and moving towards the first state

The code for my implementation is:
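A sketch of that function, assuming the usual tabular update V(s) ← V(s) + α·(V(s′) − V(s)) with a learning rate α that I set to 0.1 arbitrarily, and a reward of 0 for a draw:

```python
def update_values(player, visited_states, reward, alpha=0.1):
    """Back up state values at the end of an episode.

    visited_states is the ordered list of encoded states this player passed through;
    reward is +1 for a win, -1 for a loss (0 for a draw is my assumption).
    alpha is the learning rate (0.1 is just an assumed value).
    """
    # 1. the terminal state's value is simply the reward
    player.state_values[visited_states[-1]] = reward

    # 2. make sure every visited state exists in the table (new states start at 0)
    for state in visited_states:
        player.state_values.setdefault(state, 0.0)

    # 3. sweep backwards from the second-to-last state toward the first,
    #    nudging each value toward the value of the state that followed it
    for i in range(len(visited_states) - 2, -1, -1):
        current, nxt = visited_states[i], visited_states[i + 1]
        player.state_values[current] += alpha * (
            player.state_values[nxt] - player.state_values[current]
        )
```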

I added a call to the update function in the play_game function here as well.
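Roughly, such a play_game function might look like this (the turn-swapping and bookkeeping details are my own sketch, built on the pieces above):

```python
def play_game(game, player1, player2):
    """Play one full game and let both players learn from it."""
    visited = {player1: [], player2: []}       # states each player moved into
    current, other = player1, player2

    while game.check_winner() is None:
        new_state = game.get_action(current)    # epsilon-greedy move
        visited[current].append(new_state)
        current, other = other, current         # swap turns

    result = game.check_winner()                # 1, 2, or 0 (draw)
    for player in (player1, player2):
        if result == 0:
            reward = 0                          # cat's game (assumed reward)
        elif result == player.symbol:
            reward = 1                          # win
        else:
            reward = -1                         # loss
        update_values(player, visited[player], reward)
```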

Putting it all together

We are now able to get two agents to play each other and become tic tac toe pros all on their own. You can find the code for my finished project below:
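In outline, the finished project is just a training loop that plays many games while ε decays, plus a small test function that swaps one agent for a human. The episode count, decay schedule, and function names here are my assumptions:

```python
def train(episodes=50000):
    player1, player2 = Player(1), Player(2)
    for episode in range(episodes):
        game = TicTacToe(player1, player2)       # fresh board each episode
        play_game(game, player1, player2)
        for p in (player1, player2):             # decay exploration over time
            p.epsilon = max(0.05, p.epsilon - 1.0 / episodes)
    return player1, player2


def test(trained_player):
    """Play one game against the trained agent: it plays X, the human plays O."""
    trained_player.epsilon = 0                   # no more random moves
    human = Player(2, epsilon=0)
    game = TicTacToe(trained_player, human)
    while game.check_winner() is None:
        game.get_action(trained_player)          # the agent moves greedily
        game.print_board()
        if game.check_winner() is not None:
            break
        spot = int(input("Your move (0-8): "))   # human picks an open spot
        game.spot_placeholders[spot] = human.symbol
    game.print_board()
```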

I added a test function so that you can play against the computer. You will find that if you go first, the agent doesn’t play very well at all; this is because tic tac toe is heavily biased toward whoever goes first (which messes up training). Adjusting training time, the size of rewards, and other hyperparameters may help the agent play better when it goes second (for example, you could reward the player who goes second for a cat’s game, i.e. a draw, instead of punishing them).

Project Summary

In summary, we make an agent play the game of tic tac toe thousands of times and record what happens in every state of the game. We then assign a value to each state and correct this value using the value update function. When the agent is trying to win, he will always move to the state with the highest value. We call this tabular RL since we keep all the states and their values in a table.

But what if this were a 4x4, 5x5, or even a 10x10 game of tic tac toe? The number of possible states grows exponentially as the game board grows: our 3x3 board has at most 3⁹ = 19,683 encoded states, a 4x4 board already has 3¹⁶ ≈ 43 million, and a 5x5 board has 3²⁵ ≈ 847 billion. This means that our value table will eventually be far too large to be practical! If only there were a way to just approximate the value of each state.

If only there existed a function that approximates the value of each possible action, given the current input…

Well, such a function exists, and we can approximate it using a neural network. This is the topic of deep reinforcement learning, which I will write an article on soon and post the link here.

Thank you for reading! If this post helped you in some way or you have a comment or question then please leave a response below and let me know! Also, if you noticed I made a mistake somewhere, or I could’ve explained something more clearly then I would appreciate it if you’d let me know through a response.

