
Creating a Tic-Tac-Toe game with a Q-learning AI which masters the game

You'll never win against it – all in pure Python

Photo by Alex Knight on Unsplash

In this tutorial, we are going to implement a tic-tac-toe game in Python. The fun part is that we will create an AI agent that learns to play the game perfectly and almost never loses to you, all without specifying any manual rules. We will also code in pure Python and won’t use any external libraries.

The complete code of this tutorial is available here on GitHub.


Let’s Create the Game

First, we should create the game logic so we can use it both to train our AI and to play against it. The logic is quite simple; take a look at its code:
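
Below is a minimal sketch of what that class can look like. The method names (play, _get_winner, is_ended, get_valid_actions), the render flag, and the 1/-1 player encoding come from the description that follows; the board representation and the printed symbols are assumptions.

```python
class TicTacToe:
    def __init__(self, render=False):
        self.board = [[0] * 3 for _ in range(3)]  # 0 = empty, 1 / -1 = the two players
        self.turn = 1                             # player 1 moves first
        self.render = render

    def play(self, x, y):
        # Place the current player's move at (x, y) and pass the turn.
        # Returns the winner (1 or -1) if there is one, otherwise None.
        self.board[x][y] = self.turn
        self.turn = -self.turn
        if self.render:
            self._print_board()
        return self._get_winner()

    def _get_winner(self):
        lines = list(self.board)                                            # rows
        lines += [[self.board[r][c] for r in range(3)] for c in range(3)]   # columns
        lines.append([self.board[i][i] for i in range(3)])                  # main diagonal
        lines.append([self.board[i][2 - i] for i in range(3)])              # anti-diagonal
        for line in lines:
            if abs(sum(line)) == 3:  # three identical non-empty cells
                return line[0]
        return None

    def is_ended(self):
        # The game is over when no empty cell remains
        return all(cell != 0 for row in self.board for cell in row)

    def get_valid_actions(self):
        # Coordinates of all empty cells
        return [(x, y) for x in range(3) for y in range(3) if self.board[x][y] == 0]

    def _print_board(self):
        symbols = {1: 'x', -1: 'o', 0: '.'}
        for row in self.board:
            print(' '.join(symbols[cell] for cell in row))
        print()
```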

We use 1 for the first player and -1 for the second player. The play function accepts x and y as arguments, places the current player’s move at those coordinates on the 3×3 board, and then passes the turn to the other player. If there is a winner it returns the winner; otherwise it returns None.

The logic for detecting the winner is in the _get_winner function. It simply checks for wins first in the rows, then in the columns, and finally in the diagonals.

The is_ended function checks whether any empty cell remains on the board and returns whether the game is over. The get_valid_actions function returns the coordinates of all empty cells, which will be useful for training our AI agent.

As you may have noticed, the class has a render attribute. If we pass render=True, it will print the game board to the terminal after each move. But during training, the agent will play thousands of games in a few seconds, so we don’t want it to print the board after every move. That’s why we’ve made rendering optional.

So the structure of the game is like this: we call the play function in a loop until we have a winner or the game ends in a draw.
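
As a rough sketch of that loop, assuming the TicTacToe class above and two players that simply pick random valid moves:

```python
import random

game = TicTacToe(render=True)
winner = None
while winner is None and not game.is_ended():
    x, y = random.choice(game.get_valid_actions())
    winner = game.play(x, y)

print('winner:', winner)  # 1, -1, or None for a draw
```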

Now that we have our game class, let’s proceed to the next part.

Creating the Q Table and the Agent

On each turn, our AI agent must decide which cell to place its move in. To do so, it needs an estimation of the reward for each (state, action) pair, so it can choose the action that results in the maximum reward. The Q table holds these estimations such that:

Q[state][action] -> reward estimation.

But how can we calculate these estimations? We will use the Q-learning algorithm to iteratively update these values until they converge to the true ones.

Here’s the learning process:

  • The agent gets the current state of the game and chooses an action. At first the Q values are not updated yet and we want the agent to explore the game, so we add some randomness: it chooses a random action with probability p, or the action with the maximum estimated reward with probability 1-p. As training continues and the agent explores the game environment, we gradually decrease p from 1 to 0, so that in the end the agent only chooses the best action.
  • The agent applies the chosen action to the game. Then, as the opponent, we play a random move, which gives us the new state. If the agent wins, the reward is 100; if it loses, the reward is -100; otherwise the reward is 0.
  • Now we can update the Q value of the (state, action) pair with this formula: Q(S, A) = Q(S, A) + α ∗ (R + γ ∗ maxₐ Q(S′, a) − Q(S, A)). Here S is the current state, A is the chosen action, and R is the immediate reward we got after applying the action. α is the step size, which controls how much weight we give to the new value compared to the current one. γ is the discount factor, which controls how much weight we give to future rewards compared to the immediate reward. maxₐ Q(S′, a) is the maximum Q value over all actions in the new state S′ (a code sketch of this update follows the list).
  • Repeat all three steps above until the Q estimations are good enough. If the game ends, create a new one and continue the loop.
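
To make the update rule concrete, here is roughly what a single update step can look like in code. The values of alpha and gamma are just examples, q is the nested state/action table introduced above, and the function name is illustrative:

```python
alpha, gamma = 0.5, 0.9  # example step size and discount factor

def q_update(q, state, action, reward, new_state, new_actions):
    # max_a Q(S', a): the best estimated value achievable from the new state
    # (0 if the game ended and there are no actions left)
    future = max((q[new_state][a] for a in new_actions), default=0)
    q[state][action] += alpha * (reward + gamma * future - q[state][action])
```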

I’ve implemented a Q class which holds the estimations. It has two functions: one for updating its values using the above formula, and one for getting the best action:
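
A minimal sketch of such a Q class, assuming states are hashable (for example, a tuple encoding of the board) and actions are (x, y) tuples; apart from the values attribute, the names and hyperparameter defaults are illustrative:

```python
from collections import defaultdict

class Q:
    def __init__(self, alpha=0.5, gamma=0.9):
        self.alpha = alpha  # step size
        self.gamma = gamma  # discount factor
        # values[state][action] -> estimated reward, 0 by default
        self.values = defaultdict(lambda: defaultdict(int))

    def update(self, state, action, reward, new_state, new_actions):
        # max_a Q(S', a) over the actions available in the new state
        future = max((self.values[new_state][a] for a in new_actions), default=0)
        current = self.values[state][action]
        self.values[state][action] = current + self.alpha * (
            reward + self.gamma * future - current
        )

    def get_best_action(self, state, actions):
        # Pick the action with the highest estimated reward in this state
        return max(actions, key=lambda a: self.values[state][a])
```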

If you are wondering about the defaultdict part, it is a dict subclass that calls a factory function to supply missing values. A defaultdict never raises a KeyError; instead, it returns the specified default value. In this case, if we access self.values[X][Y] for any missing X and Y, it will return 0.
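
For instance, with the nested defaultdict assumed above:

```python
from collections import defaultdict

values = defaultdict(lambda: defaultdict(int))
print(values['unseen-state'][(0, 1)])  # 0, no KeyError for missing keys
values['unseen-state'][(0, 1)] += 10
print(values['unseen-state'][(0, 1)])  # 10
```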

Now that we have the Q class, we should create an agent which uses the TicTacToe and Q classes to do the training process that we have described above:
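
Here is a sketch of what such an agent can look like, built on the TicTacToe and Q sketches above. Apart from eps and learn, the names and hyperparameters (such as the number of training games) are illustrative:

```python
import random

class Agent:
    def __init__(self, episodes=20000):
        self.q = Q()
        self.episodes = episodes
        self.eps = 1.0  # probability of choosing a random action

    def choose_action(self, state, actions):
        if random.random() < self.eps:
            return random.choice(actions)               # explore
        return self.q.get_best_action(state, actions)   # exploit

    def learn(self):
        for _ in range(self.episodes):
            game = TicTacToe(render=False)
            while True:
                state = tuple(cell for row in game.board for cell in row)
                actions = game.get_valid_actions()
                action = self.choose_action(state, actions)
                winner = game.play(*action)

                # The opponent plays a random move (if the game isn't over yet)
                if winner is None and not game.is_ended():
                    winner = game.play(*random.choice(game.get_valid_actions()))

                done = winner is not None or game.is_ended()
                reward = 100 if winner == 1 else -100 if winner == -1 else 0
                new_state = tuple(cell for row in game.board for cell in row)
                new_actions = [] if done else game.get_valid_actions()
                self.q.update(state, action, reward, new_state, new_actions)

                if done:
                    break
            # Gradually shift from exploration to exploitation
            self.eps = max(0.0, self.eps - 1.0 / self.episodes)
```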

The eps attribute is the probability of choosing a random action. As we said, we gradually decrease it after each game in the learn function.

Now we need a main file to glue it all together:
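
A sketch of that main file, again assuming the classes sketched above; the exact prompt, messages, and input format are assumptions:

```python
def play(agent):
    game = TicTacToe(render=True)
    winner = None
    while winner is None and not game.is_ended():
        # The agent (player 1, 'x') moves first, using only its learned Q values
        state = tuple(cell for row in game.board for cell in row)
        action = agent.q.get_best_action(state, game.get_valid_actions())
        winner = game.play(*action)
        if winner is not None or game.is_ended():
            break
        # Then the user (player -1, 'o') enters the coordinates of their move
        x, y = map(int, input('your move (x y): ').split())
        winner = game.play(x, y)

    if winner == 1:
        print('you lost!')
    elif winner == -1:
        print('you won!')
    else:
        print('draw!')


if __name__ == '__main__':
    agent = Agent()
    agent.learn()
    while True:
        play(agent)
```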

Here we have a play function. It first creates a game; this time render is True so we can see the game board as we play. Then, in a loop, the agent takes an action first, and then the user is asked to input an action: the coordinates of the cell where we want to place our move. This continues until the game ends with a win, a loss, or a draw.

In the main part, we first create the agent and then call its learn function. Now the agent is ready to beat us, and in a while loop we call the play function so we can play against it interactively.

I played against the agent multiple times, and I think it has learned the game perfectly; it seems impossible to win against it.

Here’s a sample output of the gameplay; the agent is x and we’re o:


Summary

We first created our TicTacToe game logic so we could use it to train our agent and play against it. Then we described the Q-learning algorithm and implemented an agent that uses it to play and learn the game iteratively.

You can get all the code from here and play with it yourself.

Feel free to ask any questions.

Thanks and happy coding!

