Creating a Custom Environment for TensorFlow Agent — Tic-tac-toe Example

An introductory blog on custom TF-Agent environment

Suraj Regmi
Towards Data Science


Backdrop

Reinforcement learning is an emerging field of AI that has shown a lot of promise in areas like gaming, robotics, manufacturing, and aerospace. It gained traction after beating human champions in games like Go[1] and Chess[2] in the mid-2010s. Google acquired DeepMind[3], a highly respected AI startup behind many of the reinforcement learning breakthroughs of the 2010s. Similarly, OpenAI was founded in late 2015 by Elon Musk, Sam Altman, and others[4], who pledged US$1 billion to conduct research in artificial intelligence; its stated aim is to promote and develop friendly AI in a way that benefits humanity as a whole. OpenAI Five, a project of OpenAI, demonstrated the ability to achieve expert-level performance, learn human-AI cooperation, and operate at internet scale in the game Dota 2[5]. Recently, Google applied reinforcement learning to chip placement, one of the most complex and time-consuming stages of the chip design process, with the objective of minimizing PPA (power, performance, and area), and showed that the generated placements are superhuman[6].

Reinforcement learning has been around since the 1950s[7] and has produced many interesting applications in games and machine control. It did not make headlines, however, until 2013, when researchers from DeepMind showed its use on Atari games, outperforming humans in most of them[8]. The defining improvement was the use of neural networks to learn the Q-values[9]. As with every other field of AI, neural networks revolutionized reinforcement learning through the introduction of deep reinforcement learning[9]. Since then, reinforcement learning has been everywhere, gaining popularity at an unprecedented scale. At the recent ICLR conference (ICLR 2020), reinforcement learning was the most frequent paper tag[10].

Figure: Reinforcement learning cluster at ICLR 2020

So, What Is Reinforcement Learning?

Unlike supervised machine learning, where labelled data is available, reinforcement learning is not provided with explicit labels. In reinforcement learning, an agent performs actions on an environment, and the state of the environment changes as a result. Based on the feedback (reward or penalty) given by the environment for an action, the agent learns an optimal policy. A child learning to walk resembles the reinforcement learning paradigm: keeping balance is the reward, and losing balance is the penalty. A more theoretical explanation can be found in introductory reinforcement learning blogs, and readers new to reinforcement learning are strongly encouraged to read one.

TF-Agents

TF-Agents is a library for reinforcement learning in TensorFlow that makes the design and implementation of reinforcement learning algorithms easier by providing various well-tested, modifiable, and extendable modular components. This helps both researchers and developers with quick prototyping and benchmarking.

The stable version of TF-Agents can be installed with the following commands:

pip install --user tf-agents
pip install --user tensorflow==2.1.0

More details about TF-Agents can be found on the official TF-Agents page[11].
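
To quickly confirm the installation, one can print the installed version (an optional check, assuming the package exposes the usual __version__ attribute):

import tf_agents
print(tf_agents.__version__)  # prints the installed TF-Agents version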

Environment

An environment is the setting in which the agent performs actions. The agent interacts with the environment, and the state of the environment changes as a result. While implementing reinforcement learning algorithms for an application, we need an environment for that application. Though TF-Agents provides environments for some popular problems like CartPole, we sometimes find ourselves needing to build custom environments. Here, I will show the implementation of Tic-tac-toe by building a custom environment.

Custom Environment for Tic-tac-toe

To focus on building the custom environment, we simplify the game of Tic-tac-toe. Instead of two players, the simplified Tic-tac-toe has only one player. The player chooses positions randomly, and if the chosen position has already been taken, the game ends.

Let’s start first by doing the required imports.

import tensorflow as tf
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts

Environments can be either Python environments or TensorFlow environments. Python environments are simpler to implement, but TensorFlow environments are more efficient and allow natural parallelization. What we do here is create a Python environment and use one of the wrappers to automatically convert it to a TensorFlow environment.
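
As a small illustration of this wrapping pattern (my example, using the suite_gym module imported above; it assumes Gym is installed), a ready-made Python environment such as CartPole can be converted to a TensorFlow environment in one line. Our custom environment will be wrapped the same way later on.

# load a ready-made Gym environment as a Python environment
cartpole_py_env = suite_gym.load('CartPole-v0')
# wrap it so that it works with TensorFlow tensors
cartpole_tf_env = tf_py_environment.TFPyEnvironment(cartpole_py_env)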

Constituents

Creating a custom environment consists primarily of writing four methods: action_spec, observation_spec, _reset, and _step. Let's see what each of them means:

action_spec: describes the specifications (TensorSpecs) of the action expected by step
observation_spec: defines the specifications (TensorSpec) of observations provided by the environment
_reset: returns the current situation (TimeStep) after resetting the environment
_step: applies the action and returns the new situation (TimeStep)

SimplifiedTicTacToe Class

Now, let's start creating the SimplifiedTicTacToe class. The class inherits from py_environment.PyEnvironment so that it can use the already-available methods and properties.

The Tic-tac-toe board has nine positions. Let’s label them from 0 to 8 (inclusive). The player can put the mark in one of those positions. So, an action is a value from 0 to 8.

An observation is the state of the environment. The observation spec contains the specifications of observations provided by the environment. As the board has 9 positions, the shape of an observation is (1, 9). If a position is occupied, we denote its state by 1, and otherwise by 0. Initially, the board is empty, so we represent the state of the environment by nine zeros.

class SimplifiedTicTacToe(py_environment.PyEnvironment):

    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=8, name='play')
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(1, 9), dtype=np.int32, minimum=0, maximum=1, name='board')
        self._state = [0, 0, 0, 0, 0, 0, 0, 0, 0]
        self._episode_ended = False

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

After the game ends, we should reset the environment (or state). To do that, we can write a method called _reset on the custom environment we created. The method must return the default state of the environment at the start of the game.

    def _reset(self):
        # state at the start of the game
        self._state = [0, 0, 0, 0, 0, 0, 0, 0, 0]
        self._episode_ended = False
        return ts.restart(np.array([self._state], dtype=np.int32))

It is worth mentioning episodes and steps here. An episode is one instance (or one life) of the game. If the game ends or a life is lost, the episode ends. A step, on the other hand, is a discrete time index that increases monotonically within an episode. With each change in the state of the game, the step count increases until the game ends.

Let’s also define two methods for checking if some spot is empty and if all the spots are occupied.

    def __is_spot_empty(self, ind):
        return self._state[ind] == 0

    def __all_spots_occupied(self):
        return all(i == 1 for i in self._state)

Now, there is one last method we need to write: _step. It applies the action and returns the new situation in the game. The situation is represented by the TimeStep class in TF-Agents. A TimeStep carries four pieces of information: observation, reward, step_type, and discount. Details about each of them can be found in the TF-Agents documentation.
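
To make these four pieces of information concrete, here is a small illustration (my addition, not from the original walkthrough) using the same ts helper functions our environment will use; board is just a hypothetical observation array:

# a board with one position marked, shaped (1, 9) like our observation spec
board = np.array([[0, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=np.int32)

# ts.restart builds the first TimeStep of an episode: step_type FIRST, reward 0.0, discount 1.0
first = ts.restart(board)
# ts.transition builds an intermediate TimeStep: step_type MID, with the reward and discount we pass
middle = ts.transition(board, reward=0.05, discount=1.0)
# ts.termination builds the final TimeStep: step_type LAST, discount 0.0
last = ts.termination(board, reward=-1)

print(first.step_type, middle.step_type, last.step_type)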

While writing the _step method, we first check whether the episode has ended. If it has, we call the _reset method. Otherwise, we check whether the position to be marked is empty. If it is not empty, the episode ends. If it is empty, we place the mark in that position and check whether it was the last step. Depending on whether it was the last step, we return either a termination or a transition.

    def _step(self, action):
        if self._episode_ended:
            # the previous action ended the episode, so reset the environment
            return self.reset()

        if self.__is_spot_empty(action):
            # mark the chosen position
            self._state[action] = 1

            if self.__all_spots_occupied():
                self._episode_ended = True
                return ts.termination(np.array([self._state], dtype=np.int32), 1)
            else:
                return ts.transition(np.array([self._state], dtype=np.int32), reward=0.05, discount=1.0)
        else:
            # the position was already occupied: the game ends with a penalty
            self._episode_ended = True
            return ts.termination(np.array([self._state], dtype=np.int32), -1)

For each valid step played, a reward of 0.05 is given. A reward of 1 is given when all 9 positions get marked. If the game ends with fewer than 9 positions marked, a reward of -1 is received. A discount of 1.0 is used so that rewards do not decay with time/steps.
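
Since the Python environment class is now complete, we can optionally check that it conforms to its own specs using the utils module imported earlier. This runs a few episodes with random actions and raises an error if anything is inconsistent (an optional check, not part of the original walkthrough).

# run a handful of random episodes to verify the specs and returned TimeSteps
environment = SimplifiedTicTacToe()
utils.validate_py_environment(environment, episodes=5)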

Now, let’s create the TensorFlow environment.

python_environment = SimplifiedTicTacToe()
tf_env = tf_py_environment.TFPyEnvironment(python_environment)

Hurrah! The TensorFlow environment has been created!
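
As a quick sanity check (illustrative; the variable names below are mine), we can reset the wrapped environment and take one step. The TensorFlow environment batches everything, so observations and actions carry an extra batch dimension of 1.

# reset returns a batched TimeStep; the observation has shape (1, 1, 9)
time_step = tf_env.reset()
print(time_step.observation)
# actions are batched too, hence a tensor of shape (1,)
next_time_step = tf_env.step(tf.constant([4], dtype=tf.int32))
print(next_time_step.reward)  # 0.05 for a valid first move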

Let’s Play

Now, let’s play the game for 10000 episodes.

time_step = tf_env.reset()
rewards = []
steps = []
number_of_episodes = 10000

for _ in range(number_of_episodes):
    episode_reward = 0
    episode_steps = 0
    tf_env.reset()
    while True:
        # pick a random position from 0 to 8
        action = tf.random.uniform([1], 0, 9, dtype=tf.int32)
        next_time_step = tf_env.step(action)
        if tf_env.current_time_step().is_last():
            break
        episode_steps += 1
        episode_reward += next_time_step.reward.numpy()
    rewards.append(episode_reward)
    steps.append(episode_steps)

I am interested in knowing the mean number of steps. So, I execute the following code.

mean_no_of_steps = np.mean(steps)
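
For completeness (my addition, not part of the original analysis), the mean episode reward can be computed the same way:

mean_reward = np.mean(rewards)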

I got a mean number of steps of 3.4452. That means one can expect the game to end around the fourth step. We played 10000 episodes, so we expect the sample mean to estimate the expectation of the distribution well. Therefore, let's work out the theoretical expectation of the random variable and see how well it matches the experimental estimate.

Expectation of the Number of Steps

Let X be the random variable representing the number of steps played before a repetition occurs.

X can be written as the sum of nine indicator random variables, X_1, X_2, …, X_9, where X_i is 1 if there is no repetition up to and including the ith step, and 0 otherwise. We need to find the expectation of X.

Note that if X_i = 0 for some value of i, then X_j = 0 for all j > i.

By linearity of expectation, E[X] = E[X_1] + E[X_2] + … + E[X_9].

Now, let's calculate the value of E[X_i].

The random variable X_i is 1 if there is no repetition through the ith step. The probability of that is:

P(no repetition through the ith step) = (number of non-repetitive sequences of i picks) / (total number of sequences of i picks)
= P(9, i) / 9^i

Since X_i is an indicator variable, its expectation equals this probability. So the expectation E[X] becomes the sum of these probabilities.

Hence, the expectation is:

E[X] = Σ (from i = 1 to 9) P(9, i) / 9^i ≈ 3.46
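
As a quick numerical check (my addition, not from the original article), this sum can be evaluated directly in Python:

# evaluate E[X] = sum over i of P(9, i) / 9^i
from math import perm  # available in Python 3.8+

expected_steps = sum(perm(9, i) / 9 ** i for i in range(1, 10))
print(round(expected_steps, 4))  # approximately 3.4583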

Another elegant approach to finding the expectation can be found in Henk Brozius's answer on Quora[12].

The theoretical expectation (about 3.46) is very close to the experimental estimate (3.4452). That gives some satisfaction.

That is how a custom environment can be created in TensorFlow. TF-Agents provides modular components that make prototyping easier. You are off to a good start. There is a lot more we can do with TF-Agents. Keep exploring! The sky is the limit!

Interested readers are highly encouraged to explore the following links to learn more about creating a custom TensorFlow environment.

  1. Environments: https://www.tensorflow.org/agents/tutorials/2_environments_tutorial
  2. tf_agents.specs.BoundedArraySpec: https://www.tensorflow.org/agents/api_docs/python/tf_agents/specs/BoundedArraySpec

If you have any questions, comments, or confusion, feel free to comment here. I will try my best to answer them.

References

  1. https://www.newscientist.com/article/2132086-deepminds-ai-beats-worlds-best-go-player-in-latest-face-off/
  2. https://www.theguardian.com/technology/2017/dec/07/alphazero-google-deepmind-ai-beats-champion-program-teaching-itself-to-play-four-hours
  3. https://techcrunch.com/2014/01/26/google-deepmind/
  4. https://openai.com/blog/introducing-openai/
  5. https://openai.com/projects/five/
  6. https://ai.googleblog.com/2020/04/chip-design-with-deep-reinforcement.html
  7. http://incompleteideas.net/book/first/ebook/node12.html
  8. https://arxiv.org/abs/1312.5602
  9. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (O'Reilly Media, 2019)
  10. https://iclr.cc/virtual_2020/paper_vis.html
  11. https://www.tensorflow.org/agents
  12. https://www.quora.com/Suppose-we-are-drawing-a-number-randomly-from-the-list-of-numbers-1-to-9-After-each-draw-the-number-is-replaced-How-can-we-find-the-expected-number-of-draws-after-which-we-encounter-the-first-repetition/answer/Henk-Brozius?__filter__=all&__nsrc__=1&__sncid__=5298994587&__snid3__=8434034083


Data Scientist at Blue Cross and Blue Shield, MS CS from UAH — the views and the content here represent my own and not of my employers.