Training an Agent to beat Grid World

Anson Wong
Towards Data Science
4 min read · Sep 28, 2017


What is Grid World? Grid World is a 2D rectangular grid of size (Ny, Nx) with an agent starting off at one grid square and trying to move to another grid square located elsewhere. Such an environment is a natural one for applying reinforcement learning algorithms to discover optimal paths and policies for agents on the grid to reach their desired goal grid squares in the least number of moves. Personally, I have always found solving Grid World to be a great way to begin learning about reinforcement learning because of its intuitive geometry and its relevance to many real-world applications.

In our implementation of Grid World we start the agent at the top-left grid corner (0, 0) with the aim of arriving at the bottom-right grid corner (Ny-1, Nx-1) in a minimal number of steps, which is (Ny - 1) + (Nx - 1) = Ny + Nx - 2 steps. The agent's only allowed actions are moving up, down, left, or right by one grid square.

To beat this game we use an on-policy Monte Carlo reinforcement learning algorithm with reward-average sampling, together with an epsilon-greedy agent to navigate the Grid World environment. Tabular forms of the rewards, state-action values, and policy are used in this implementation (appropriate here for small grid sizes), and much of the underlying theory can be found in Sutton & Barto's "Reinforcement Learning" textbook. Our Python implementation of this algorithm can be found at my GitHub:

https://github.com/ankonzoid/LearningX/tree/master/classical_RL/gridworld

An example of the final optimal policy found by running our code is:

Final policy:

[[2 1 1 2 2 2 2]
 [2 2 1 1 2 1 2]
 [1 2 1 1 2 2 2]
 [2 1 1 2 1 1 2]
 [1 1 2 2 2 1 2]
 [1 1 1 1 1 2 2]
 [1 1 1 1 1 1 3]]

action['up'] = 0
action['right'] = 1
action['down'] = 2
action['left'] = 3

Note that the optimal policy is not unique in this case! As long as every action in the final trained policy moves either "down" or "right", we know we have arrived at a (non-unique) optimal policy.

For the remainder of this article, we will accompany our code with a brief walkthrough of what makes up the agent and the environment.

For more of my blogs, tutorials, and projects on Deep Learning and Reinforcement Learning, please check out my Medium and my GitHub.

Walkthrough of methods used in our Python code

I find that defining 4 distinct object classes makes our code conceptually easier to understand. In particular, the classes are:

(1) Environment
(2) Agent
(3) Brain (of agent)
(4) Memory (of agent)

The training cycle involves having the agent interact with the environment by taking actions, collecting the corresponding rewards, and transitioning from state to state. The agent improves its future decision-making through operations performed by the brain, which processes the past state and action histories of the episodes stored in the memory. It is this brain-and-memory combination that assigns value to the sequences of states and actions that lead to high or low long-term rewards for the agent.
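As a rough sketch of how these pieces fit together (the class and method names here are illustrative only, not the exact API of the repository code), one training episode could look like this:

def run_episode(env, agent, brain, memory):
    # One pass of the training cycle: act, collect rewards, then learn.
    # The env/agent/brain/memory interfaces used here are the illustrative
    # ones sketched in the sections below, not necessarily the repo's exact API.
    state = env.reset()                                    # start at (0, 0)
    memory.clear()                                         # wipe the previous episode's history
    done = False
    while not done:
        action = agent.get_action(state, brain)            # epsilon-greedy choice
        next_state, reward, done = env.step(state, action)
        memory.append(state, action, reward)               # log for the brain to process later
        state = next_state
    brain.update(memory)                                   # Monte Carlo update after the episode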

Question 1: How do we set up the environment?

We define a 2D grid and the actions that an agent is allowed to take at each grid square, i.e. an agent can move in all 4 directions from an interior square, 3 directions on a grid edge, and 2 directions at a grid corner. We also define the starting state of the agent to be at coordinates (0, 0), corresponding to the top-left corner of the grid, and (Ny-1, Nx-1) to be the terminal state at the bottom-right corner of the grid. As for rewarding the agent, we give a reward of R = +100 for arriving at the desired goal grid square, and a reward of R = -0.1 on every other move to incentivize the agent to reduce the number of excess moves taken to get to the goal.
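A minimal environment along these lines might look like the sketch below (class and method names are my own, not necessarily those used in the repository; the action encoding matches the table printed above):

class Environment:
    # Illustrative environment sketch: a (Ny, Nx) grid with moves
    # encoded as 0 = up, 1 = right, 2 = down, 3 = left.
    def __init__(self, Ny=7, Nx=7):
        self.Ny, self.Nx = Ny, Nx
        self.moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

    def reset(self):
        return (0, 0)                                      # top-left starting state

    def allowed_actions(self, state):
        # 4 moves from an interior square, 3 from an edge, 2 from a corner
        y, x = state
        return [a for a, (dy, dx) in self.moves.items()
                if 0 <= y + dy < self.Ny and 0 <= x + dx < self.Nx]

    def step(self, state, action):
        dy, dx = self.moves[action]
        next_state = (state[0] + dy, state[1] + dx)
        done = (next_state == (self.Ny - 1, self.Nx - 1))  # bottom-right goal reached?
        reward = 100.0 if done else -0.1                   # goal reward vs. small step penalty
        return next_state, reward, done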

Question 2: How do we set up the agent?

We choose an epsilon-greedy policy for the agent, meaning that for every action decision there is a probability epsilon that the agent tries out a completely random action (which does not require the agent's brain); otherwise it chooses an action greedily, i.e. argmax{action} Q(state, action) (which does require the agent's brain).
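In code, the epsilon-greedy decision could be sketched as follows (the function name, the default epsilon value, and the (Ny, Nx, 4)-shaped Q table are illustrative assumptions, not the repository's exact code):

import numpy as np

def get_action(state, Q, allowed_actions, epsilon=0.5):
    # Epsilon-greedy: with probability epsilon explore at random (no brain
    # needed); otherwise exploit by taking the argmax of Q over the allowed actions.
    if np.random.rand() < epsilon:
        return np.random.choice(allowed_actions)           # random exploration
    y, x = state
    q_values = Q[y, x, allowed_actions]                    # look up the brain's Q estimates
    return allowed_actions[int(np.argmax(q_values))]       # greedy exploitation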

Question 3: How do we set up the brain?

We initialize a tabular state-action value function Q(s, a) that will be iteratively updated to drive the agent's exploitation mechanism. Recall that Q(s, a) represents the expected long-term (discounted or undiscounted) reward collected by taking action "a" in state "s" and then following the optimal policy for the remainder of the state transitions.
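Written out with a discount factor gamma (setting gamma = 1 recovers the undiscounted case), this definition reads:

Q(s, a) = E[ r_1 + gamma*r_2 + gamma^2*r_3 + ... | start in state s, take action a, then act optimally ]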

In particular, for our Grid World example code we use a reward-average sampling technique as our Q(s, a) update method: a simple method that computes Q(s, a) as the average total reward collected over the episodes in which (s, a) was experienced by the agent (please refer to Sutton & Barto for more details). The Q(s, a) update happens after every episode: we take the total reward collected during the episode and update exactly the Q(s, a) values for the (s, a) state-action pairs that the agent experienced during that episode; it is the memory that stores this information for the brain to process.
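A sketch of this reward-average update on a tabular Q and a visit-count table N is given below (again illustrative names and shapes, not the repository's exact code):

import numpy as np

Ny, Nx, n_actions = 7, 7, 4
Q = np.zeros((Ny, Nx, n_actions))   # tabular state-action value estimates
N = np.zeros((Ny, Nx, n_actions))   # visit counts used for the running average

def update_Q(Q, N, episode_memory):
    # Reward-average Monte Carlo update after one episode.
    # episode_memory is the list of (state, action, reward) tuples collected
    # by the memory object during the episode.
    G = sum(reward for (_, _, reward) in episode_memory)   # total reward of the episode
    for (y, x), action, _ in episode_memory:
        N[y, x, action] += 1
        # incremental running average: Q <- Q + (G - Q) / N
        Q[y, x, action] += (G - Q[y, x, action]) / N[y, x, action]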

Question 4: How do we set up the memory?

The purpose of the memory is to facilitate the computations done by the brain by keeping track of the states, actions, and rewards that occurred during an episode. Therefore it is as simple as iteratively updating our counters during and after the training episodes.
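Concretely, the memory can be little more than a running list of (state, action, reward) tuples, as in this sketch (illustrative names, not the repository's exact code):

class Memory:
    # Illustrative memory sketch: a log of the current episode for the brain.
    def __init__(self):
        self.episode = []                                  # (state, action, reward) tuples

    def append(self, state, action, reward):
        self.episode.append((state, action, reward))       # record one transition

    def clear(self):
        self.episode = []                                  # reset before the next episode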


