
Reinforcement Learning – Implement Grid World From Scratch


Introduction to Value Iteration

When you try to get your hands on Reinforcement Learning, it’s likely that the Grid World Game is the very first problem you encounter. It is the most basic as well as classic problem in reinforcement learning, and implementing it on your own is, I believe, the best way to understand the basics of reinforcement learning. Meanwhile, it is super fun to implement your own game and see how a robot manages to learn on its own!


Rules

Grid Board

The rule is simple. Your agent/robot starts at the bottom-left corner (the ‘start’ sign) and ends at either +1 or -1, which is the corresponding reward. At each step, the agent has 4 possible actions: up, down, left and right, while the black block is a wall that the agent cannot pass through. In order to make it more straightforward, our first implementation assumes that each action is deterministic, that is, the agent will go where it intends to go. For instance, when the agent decides to take the action up at (2, 0), it will land in (1, 0) rather than (2, 1) or elsewhere. (We will add uncertainty in our second implementation.) However, if the agent hits the wall, it will remain at the same position.

Board

So let’s get cracking on the code! Firstly, let’s set up some global rules of the board. (full code)
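
Here is one way those global rules might look. The constant names and the classic 3×4 layout (win state at the top right, lose state just below it, a wall at (1, 1), start at the bottom left) are my assumptions for illustration, not necessarily the exact values in the full code:

```python
# Global rules of the board -- a minimal sketch. The constant names and the
# classic 3x4 layout (win at the top right, lose just below it, a wall at
# (1, 1), start at the bottom left) are assumptions for illustration.
BOARD_ROWS = 3
BOARD_COLS = 4
WIN_STATE = (0, 3)    # reaching this cell gives reward +1
LOSE_STATE = (1, 3)   # reaching this cell gives reward -1
WALL = (1, 1)         # the black block the agent cannot enter
START = (2, 0)        # the bottom-left 'start' cell
DETERMINISTIC = True  # every action goes exactly where it intends to
```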

And as a grid game, it needs a State to represent each state (position) of our agent and give a reward according to that state.
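
A minimal sketch of such a State class, assuming the board constants above (the method names are illustrative):

```python
# A sketch of a State class, assuming the board constants above.
# Method names are illustrative, not necessarily the article's exact API.
class State:
    def __init__(self, position=START):
        self.position = position  # (row, col) of the agent
        self.is_end = False

    def give_reward(self):
        # The reward depends only on the current position.
        if self.position == WIN_STATE:
            return 1
        if self.position == LOSE_STATE:
            return -1
        return 0

    def check_end(self):
        # The game ends once the agent reaches either terminal cell.
        if self.position in (WIN_STATE, LOSE_STATE):
            self.is_end = True
```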

When our agent takes an action, the State should have a function that accepts the action and returns a legal position for the next state.
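
A sketch of that rule as a standalone helper, again assuming the constants above; moves that would leave the board or run into the wall leave the agent where it is:

```python
# A sketch of the "legal next position" rule, assuming the same constants.
# Moves that would leave the board or land on the wall keep the agent in place.
def next_position(position, action):
    row, col = position
    moves = {"up": (row - 1, col), "down": (row + 1, col),
             "left": (row, col - 1), "right": (row, col + 1)}
    new_row, new_col = moves[action]
    if (0 <= new_row < BOARD_ROWS and 0 <= new_col < BOARD_COLS
            and (new_row, new_col) != WALL):
        return (new_row, new_col)
    return position  # illegal move: the agent stays where it is
```

With this helper, next_position((2, 0), "up") returns (1, 0), matching the example above.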

Agent

This is the artificial intelligence part, as our agent should be able to learn from the process and think like a human. The key to the magic is value iteration.

Value Iteration

What our agent will finally learn is a policy, and a policy is a mapping from state to action that simply instructs the agent what to do at each state. In our case, instead of learning a mapping from state to action directly, we will leverage value iteration to first learn a mapping from state to value (the estimated reward), and based on that estimation, at each state our agent will choose the action that gives the highest estimated reward.
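
As a sketch of what that mapping might look like in code (names are illustrative, and it assumes the next_position helper above): every state starts with value 0, and the greedy choice is simply the action whose next state has the highest estimate.

```python
# A sketch of the state-value mapping and the greedy action choice, assuming
# the board constants and next_position helper from the earlier sketches.
ACTIONS = ("up", "down", "left", "right")

# Every state starts with an estimated value of 0.
state_values = {(r, c): 0.0
                for r in range(BOARD_ROWS)
                for c in range(BOARD_COLS)}

def best_action(position):
    # Pick the action whose resulting state has the highest estimated value.
    return max(ACTIONS, key=lambda a: state_values[next_position(position, a)])
```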

There is not going to be any head-scratching math involved, as the core of value iteration is amazingly concise.

V(St) ← V(St) + α[V(St+1) − V(St)]   (the value iteration update)

This is the essence of value iteration. Super neat, right? This formula applies to almost all reinforcement learning problems, so let me explain how our agent evolves from an infant to an expert based on this single line of formula. Value iteration, just as its name suggests, updates its values (estimated rewards) at each iteration (end of a game).

At first, our agent knows nothing about the grid world (the environment), so it simply initialises all rewards to 0. Then it starts to explore the world by randomly walking around. Surely it will endure lots of failures at the beginning, but that is totally fine. Once it reaches the end of the game, with either reward +1 or reward -1, the whole game resets, the reward propagates in a backward fashion, and eventually the estimated values of all states along the way are updated based on the formula above.

Let’s take a closer look at the formula. The V(St) on the left is the updated value of that state, the one on the right is the current non-updated value, and α is the learning rate. The formula simply says that the updated value of a state equals the current value plus a temporal difference, which is what the agent learned from this iteration of game playing (the value of the next state, V(St+1)) minus the previous estimate. For example, let’s say there are 2 states, S1 and S2, both of which have an estimated value of 0. In this round of playing, our agent moves from S1 to S2 and gets reward 1, so S2’s value becomes 1, and the new estimate of S1 = S1 + α(S2 − S1), which is 0 + 0.1(1 − 0) = 0.1 (assuming α is 0.1 and the reward at S1 is 0).
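
A quick numeric check of that example, using only the numbers from the text:

```python
# A quick numeric check of the example above, using the numbers from the text.
alpha = 0.1              # learning rate
v_s1, v_s2 = 0.0, 1.0    # S2's value has already been set to the reward 1
v_s1 = v_s1 + alpha * (v_s2 - v_s1)
print(v_s1)              # 0.1
```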

We record all states our agent visits, and at the end of the game, we update the estimates in reverse order.
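
A sketch of that backward pass, assuming we recorded the visited positions in order and reuse the state_values table from the earlier sketch (names are illustrative):

```python
# A sketch of the end-of-game backward pass, assuming we recorded the visited
# positions in order and use the state_values table from the earlier sketch.
def propagate_reward(visited_positions, state_values, reward, alpha=0.1):
    for position in reversed(visited_positions):
        # V(s) <- V(s) + alpha * (target - V(s)); the target starts as the
        # final reward and then becomes the freshly updated value of s.
        state_values[position] += alpha * (reward - state_values[position])
        reward = state_values[position]
```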

Exploration & Exploitation

There is one last thing we need to talk about. Once our agent finds a path to the +1 reward, should it stick to it and forever follow that path (exploitation), or should it give other paths a chance (exploration) in the hope of finding a shorter one? In practice, we balance exploration and exploitation in order to avoid our agent getting stuck in a local optimum. Here our agent will choose its action based on a certain exploration_rate.
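
A sketch of that choice as an epsilon-greedy rule, assuming the state_values table, ACTIONS tuple and best_action helper from the earlier sketches:

```python
# A sketch of epsilon-greedy action selection, assuming the state_values table,
# ACTIONS tuple and best_action helper from the earlier sketches.
import random

def choose_action(position, exploration_rate=0.3):
    if random.random() < exploration_rate:
        return random.choice(ACTIONS)  # explore: try a random action
    return best_action(position)       # exploit: follow the highest estimate
```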

Play

That’s it! This is all we need for a grid world game. We can now let our agent play the game!
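
Tying the sketches above together, a training loop might look roughly like this (again, illustrative names, not the exact code from the repository):

```python
# A sketch of the training loop, tying the earlier sketches together: the agent
# plays full games, records the positions it visits, and propagates the final
# reward backwards at the end of each game. All names are illustrative.
def play(rounds=50):
    for _ in range(rounds):
        position = START
        visited = []
        while position not in (WIN_STATE, LOSE_STATE):
            action = choose_action(position)
            position = next_position(position, action)
            visited.append(position)
        reward = 1 if position == WIN_STATE else -1
        propagate_reward(visited, state_values, reward)
```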

50 rounds of playing

These are the estimates of each state after playing 50 rounds of the game. As our actions are deterministic, we can get the best action at each state by following the highest estimate! The full code is here; go play with it, and you are welcome to contribute if you find anything to update!
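
If you want to inspect the result yourself, a small helper (illustrative, built on the sketches above) can print the value table row by row so the greedy policy is easy to read off:

```python
# A small helper (illustrative) that prints the learned estimates row by row,
# so the best action at each state can be read off as the neighbouring cell
# with the highest value.
def show_values():
    for r in range(BOARD_ROWS):
        print(" | ".join(f"{state_values[(r, c)]:6.3f}"
                         for c in range(BOARD_COLS)))
```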

For now, we have focused on value iteration and a deterministic game world. However, in real situations the agent will not always land in the position it hopes to reach. Next, let us explore non-deterministic game playing and Q-learning in more depth.


