Reinforcement learning: Q-learning with illegal actions from scratch

A tic-tac-toe example showing how to write an RL Q-learning algorithm when some actions are forbidden in particular states

Shuyang Xiang
Towards Data Science

--

[The tic-tac-toe game animation]

One day, I was watching my 18-month-old son learning how to eat with a spoon, and I realized that he put the spoon down and asked me for more food as soon as his bowl was empty. He adjusted the way he handled the spoon according to the position and quantity of the food in the bowl, and it was natural for him to stop making useless efforts once there was no food left. This reminded me of some RL problems I have come across, where not all actions are allowed under specific circumstances: an elevator can never go up when it is already on the top floor, and an HVAC system stops cooling once the lowest allowed temperature is reached.

In this article, I am going to take the tic-tac-toe game as a simple example and build a Q-learning algorithm in the presence of actions that are illegal under the game rules. In the following, you will read about:

  1. The RL environment of the tic-tac-toe game.
  2. The Q-learning algorithm with illegal actions.

All the code is available on my GitHub in case you need more details.

The tic-tac-toe environment

The tic-tac-toe game, also known as Xs and Os, is a game for two players who take turns marking the spaces of a three-by-three grid with X or O. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row is the winner.

We will be using Gym to set up the tic-tac-toe environment. In general, an RL environment has four key functions: initialization, reset, step, and render.

Initialization

The initialization function mainly aims to initialize the reward, done (the flag indicating whether the game is over), and, most importantly, to set up the action space and the observation (state) space. In the tic-tac-toe game, each of the two players (let us call them player O and player X) can take 9 possible actions: marking one of the nine cells of the grid. A state is represented by a three-by-three array in which each element can take three possible values: 0 for not marked, 1 for marked by player O, and 2 for marked by player X. Note that not all actions are allowed in all states: a player cannot mark a cell that is already taken.

import gym
from gym import spaces
import numpy as np

class TicTacToe(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        super(TicTacToe, self).__init__()
        self.states = np.zeros((3, 3))  # the 3x3 board: 0 empty, 1 player O, 2 player X
        self.counter = 0                # number of moves played so far
        self.done = 0                   # becomes 1 when the game is over
        self.reward = 0
        # 9 possible actions, one per cell of the board
        self.action_space = spaces.Discrete(9)
        # a state is the 3x3 board whose entries take values in {0, 1, 2}
        self.observation_space = spaces.Box(low=0, high=2, shape=(3, 3))

Reset

The reset function sets the environment back to an initial state. In our example, we simply set the done and reward values to zero and set the state to the one where nothing has been marked on the game board yet.

def reset(self):
    self.states = np.zeros((3, 3))
    self.counter = 0
    self.done = 0
    self.reward = 0
    return np.array(self.states)

Step

The step function is a function of the action, and it is probably the most essential part of the environment. It describes how the state of the environment transitions under a given action and what reward we receive. In the tic-tac-toe game, there are four possible outcomes once a legal action is executed: player O wins, player X wins, the game continues, or no one wins and the game ends in a draw.

This part of the code is omitted here due to space limitations, but the implementation is quite direct: after every action, we only need to check whether the game continues and, if not, who wins. We set the reward to 100 if player O wins and -100 if player X wins, and we give no reward for intermediate steps while the game is not over. Please check the details on my GitHub.
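For completeness, here is a minimal sketch of what such a step function could look like, reusing the numpy import and the attributes defined in __init__ above; it is not the exact implementation from the repository, just one way to write it under the conventions used so far (player O marks 1, player X marks 2, and O moves first):

def step(self, action):
    # place the current player's mark: O (1) on even turns, X (2) on odd turns
    mark = 1 if self.counter % 2 == 0 else 2
    self.states[int(action / 3)][action % 3] = mark
    self.counter += 1

    # check every row, column and the two diagonals for three identical marks
    lines = [self.states[i, :] for i in range(3)] + \
            [self.states[:, j] for j in range(3)] + \
            [self.states.diagonal(), np.fliplr(self.states).diagonal()]
    for line in lines:
        if line[0] != 0 and line[0] == line[1] == line[2]:
            self.done = 1
            self.reward = 100 if line[0] == 1 else -100  # +100 if O wins, -100 if X wins
            return np.array(self.states), self.reward, self.done, {}

    if self.counter == 9:  # the board is full and nobody has won: a draw, reward stays 0
        self.done = 1
    return np.array(self.states), self.reward, self.done, {}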

Render

The render function renders the environment to the screen, e.g. it prints the current state of the game board:

[Rendered game board, image by author]
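A render function along these lines would do the job; this is only a sketch, and the symbols printed for each cell are my own choice rather than necessarily the ones used in the original repository:

def render(self, mode='human'):
    # print the board: '-' for an empty cell, 'O' for player O, 'X' for player X
    symbols = {0: '-', 1: 'O', 2: 'X'}
    for row in self.states:
        print(' '.join(symbols[int(cell)] for cell in row))
    print()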

Q-learning with illegal actions

Once the tic-tac-toe environment is set up, we are ready to train the agent (player). In this example, I am going to use the most basic RL algorithm, Q-learning. In a word, Q-learning seeks to learn a policy that maximizes the total reward. It is a tabular method that creates a q-table of shape [state, action] and updates and stores the values of the q-function during training. When training is done, the q-table is used as a reference to choose the action that maximizes the expected reward. Note that every state in our game is represented by a three-by-three array, so we first need a function (state_to_number) to convert every state into an integer that can index a row of the q-table:

import random
import numpy as np

class QAgent:  # the class name is chosen here for readability; see the GitHub repo for the original
    def __init__(self, alpha=0.1, gamma=0.6, epsilon=0.1, epochs=5000):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        self.epochs = epochs    # number of training episodes

    def state_to_number(self, state):
        # read the flattened 3x3 board as a base-3 number
        state = state.reshape(9)
        number = 0
        for i, num in enumerate(state):
            number += num * 3 ** (len(state) - i - 1)
        return int(number)
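As a quick sanity check, the encoding reads the flattened board as a base-3 number: the empty board maps to 0, and a board fully marked by player X (all 2s) maps to 3^9 - 1 = 19682 (QAgent is the name used for the agent class in the sketch above):

agent = QAgent()
print(agent.state_to_number(np.zeros((3, 3))))    # 0, the empty board
print(agent.state_to_number(np.full((3, 3), 2)))  # 19682 = 3**9 - 1, every cell marked by X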

Recall that in our tic-tac-toe environment, the q-table cannot be filled everywhere: a player cannot mark a cell that has already been marked by himself or his opponent. In this case, we simply mask the forbidden actions in the following way: we set the corresponding elements of the table to nan and choose the argmax of the corresponding row while ignoring nan values. Some people suggest giving a large negative reward for an invalid action instead, but that is not a good idea because the invalid action might still be chosen with a small probability.
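To make the masking idea concrete, here is a tiny illustration with made-up numbers: without the mask, argmax would pick the illegal action 1, while nanargmax picks the best legal action instead.

import numpy as np

q_row = np.array([0.5, 1.2, -0.3, 0.9])  # q-values of one state (made-up numbers)
q_row[[1, 3]] = np.nan                   # actions 1 and 3 are illegal in this state
print(np.nanargmax(q_row))               # 0: the best action among the legal ones

The learn function below applies exactly this trick inside its exploitation branch.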

def learn(self, env):
    if env.env_name == "tictactoe":  # env_name is set in the full environment code on GitHub
        # one row per encoded state, one column per action
        self.q_table = np.zeros((3 ** 9, env.action_space.n))
    for i in range(self.epochs):
        state = env.reset()
        steps, reward = 0, 0
        done = False
        while not done:
            if random.uniform(0, 1) < self.epsilon:
                action = env.action_space.sample()  # explore the action space
                # forbid illegal actions: resample until the chosen cell is empty
                while state[int(action / 3)][action % 3] != 0:
                    action = env.action_space.sample()
            else:
                # mask illegal actions with nan directly in the q-table row,
                # then exploit the learned values with nanargmax
                action_value_list = self.q_table[self.state_to_number(state)]
                for action, action_value in enumerate(action_value_list):
                    if state[int(action / 3)][action % 3] != 0:
                        action_value_list[action] = np.nan
                action = np.nanargmax(action_value_list)
            next_state, reward, done, info = env.step(action)
            # standard Q-learning update
            old_value = self.q_table[self.state_to_number(state), action]
            next_max = np.nanmax(self.q_table[self.state_to_number(next_state)])
            new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
            self.q_table[self.state_to_number(state), action] = new_value
            state = next_state
            steps += 1

The figure below shows the cumulative mean reward over the training episodes: it converges to around 80, that is, player O has roughly an 80% chance of winning the game. And of course, no invalid action will be taken in future games.

[Cumulative mean reward during training, image by author]
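Once trained, the q-table can be used greedily at play time. Below is a minimal sketch of how player O could pick its next move, assuming the TicTacToe environment and a trained QAgent from above; the greedy_action helper is my own addition for illustration and is not part of the original repository:

def greedy_action(agent, state):
    # copy the q-table row, mask the cells that are already taken, and pick the best legal action
    values = agent.q_table[agent.state_to_number(state)].copy()
    for a in range(9):
        if state[int(a / 3)][a % 3] != 0:
            values[a] = np.nan
    return np.nanargmax(values)

env = TicTacToe()
state = env.reset()
action = greedy_action(agent, state)          # agent trained earlier with agent.learn(env)
state, reward, done, info = env.step(action)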

Conclusion

In this article, I used the tic-tac-toe game to show how to handle RL problems in which some actions are illegal, and how to mask those actions during training, with a simple Q-learning example built from scratch. The masking step is necessary to prevent the agent from taking invalid actions while learning.
