
What is Tabular Reinforcement Learning?

The first step in learning how to make your AI teach itself

Tabular reinforcement learning is the first algorithm you should learn if you want to master the art of getting your AI to teach itself in an evolving, constantly changing environment.

In general, the main idea behind Reinforcement Learning is that you have an Agent, Environment, State, Actions, and Rewards.

  • Agent: The agent is the thing that is living in the environment, taking in input, and performing certain actions.
  • Environment: The environment is the world that the agent is in.
  • State: A state is basically all the inputs that the agent is receiving at a particular moment.
  • Actions: The actions are all the ways that the agent can interact with its environment.
  • Rewards/Punishments: Rewards are given to the agent after performing an action. They are just numbers, so if the agent does something bad, we may give it a punishment of -1. If it does something good, we would give it a reward of +1.

In the environment, we have an agent. This agent takes in some type of input from the environment, this input is the agent’s state. Based on its state, the agent takes some action. This action brings the agent to a new state. Depending on if this new state is good or bad, the agent receives a reward/punishment. The diagram below helps to visualize this process:

Note the order of events written in blue. The environment offers a state to the agent, the agent takes an action based on this state, this action changes the environment and offers a reward/punishment. Image by author
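To make this loop concrete, here is a minimal sketch of one episode in code. The `env` object and `choose_action` function are hypothetical stand-ins for whatever environment and policy you end up building, not anything defined in this article.

```python
# A minimal sketch of the agent-environment loop for one episode.
# `env` and `choose_action` are hypothetical stand-ins: `env.reset()` is
# assumed to return a starting state, and `env.step(action)` a tuple of
# (new state, reward, whether the episode is over).

def run_episode(env, choose_action):
    state = env.reset()                         # the environment offers a state
    total_reward, done = 0.0, False
    while not done:
        action = choose_action(state)           # the agent acts based on its state
        state, reward, done = env.step(action)  # the action changes the environment
        total_reward += reward                  # ...and yields a reward/punishment
    return total_reward
```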

To further drive the point home, if you think of yourself as an agent then the entire universe is your environment. You have 5 senses or 5 ways to receive input. At any particular time, your current state is what you are seeing, hearing, smelling, feeling, and tasting. You perform an action based on your current state. These actions eventually lead to either a reward (promotion at work maybe) or a punishment (getting fired). The idea is that you will learn from these punishments and rewards so that your future actions will lead to better rewards.

To get some more jargon out of the way, an episode is one full cycle of the agent interacting with the environment. For example, if the environment were a game of tic tac toe, an episode would be one game of tic tac toe.

Agent Memory/Value Table

As the agent interacts with the environment, we can store all the states that the agent has been in and assign a value to each of these states. A state’s value is a measure of how many rewards the agent can expect to receive for the rest of the episode (for right now, understanding what this means quantitatively is not important). I will explain how to assign the correct values in a moment, but for now, just know that it is possible.
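In code, this memory can be as simple as a dictionary that maps each state the agent has seen to a value, with unseen states defaulting to 0. A minimal sketch, assuming states are stored as something hashable such as a tuple describing the board:

```python
from collections import defaultdict

# The value table: maps a (hashable) state to its estimated value.
# States the agent has never visited default to a value of 0.
value_table = defaultdict(float)

# Example: the empty tic tac toe board (state s0), stored as a tuple of cells.
s0 = ("-",) * 9
value_table[s0] = 0.0   # starts at 0 and gets updated as episodes are played
```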

For example, say our agent is just starting its first game of tic tac toe. The environment is the game of tic tac toe, and its current state would be the empty game board. The agent would store this unique state, say s0, in its memory. Say our agent then takes a random action, call it a0. Recall that taking an action brings the agent to a new state, let’s call it s1. Now say our agent continues to play against its opponent and ends up losing the game (it will receive some kind of punishment at this point).

Since the agent lost, we can say that every state the agent was in during this episode has a low value. Now say our agent starts its second game and is back in s0. It can look in its memory and see what happens if action a0 is taken: a0 leads to state s1, which has a low value, since the last time the agent was in that state it ended up losing.

Since the agent is trying to win, it will only pick actions that lead to states with high values. To start game 2, the agent avoids action a0, since it leads to a low-value state, and picks another action at random, say a2. Again, the agent continues to play, but this time it ends up winning. Since the agent won, we can say that every state the agent was in during episode 2 has a high value. Hopefully, the figure below helps in visualizing this process.

Note that the values are actually numbers; we will get into finding the correct numerical value of each state next. Image by author

You might see a problem with what we have so far. What if action a0 really is the best action to take, and it was another action taken after a0 that caused us to lose that game? As it stands, if our agent always avoids a0, it will never discover that it is the best action to take. To solve this, we always give our agent a chance to take a random action. If our agent takes a random action, it has a chance to take a0 and give that action another shot. We call this exploring the state space. We usually have the agent take completely random actions at first, then, as time goes on, we slowly reduce the chance that the agent takes a random action. An algorithm that does this is discussed in the tic tac toe project linked at the bottom of this page.
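One common way to implement this trade-off is an epsilon-greedy rule: with probability epsilon take a random action, otherwise take the action that leads to the highest-valued next state, and shrink epsilon a little after every episode. The sketch below assumes a hypothetical `next_state(state, action)` helper and the defaultdict-style value table from earlier; it illustrates the idea and is not the exact algorithm from the project.

```python
import random

def pick_action(state, actions, value_table, next_state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)   # explore the state space
    # Exploit: pick the action whose resulting state has the highest value.
    return max(actions, key=lambda a: value_table[next_state(state, a)])

# Decay schedule: start fully random, then slowly become greedy.
epsilon = 1.0
for episode in range(10_000):
    # ... play one episode, choosing moves with pick_action(..., epsilon) ...
    epsilon = max(0.01, epsilon * 0.999)   # reduce the chance of random actions
```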

The agent will continue this process until it knows the best action to take in every state it can possibly be in. To summarize, we are literally just telling the agent to try everything, recording everything that happens, and finding which actions lead to the best states (the states that lead to the largest reward).

Finding the value of each state

To find the correct value of each state, we start by setting the value of each state to 0.

A state’s correct value is equal to the expected total future reward for that episode.
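Written as a formula (standard notation, not taken from the article), with T the final step of the episode, this is:

```latex
V(s_t) \;=\; \mathbb{E}\!\left[\, R_t + R_{t+1} + \dots + R_T \,\right]
```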

See the figure below to help let this soak in.

It is best to start at the last state. In S5 we receive a reward of +1, so the value of S5 is 1. S4 receives no reward, but it leads to S5, which has a value of 1. Therefore, the value of S4 is the reward it received plus the value of S5: 0 + 1 = 1. Similarly, V(S3) = R3 + V(S4) = -1 + 1 = 0. In general, we can say:

The value of a state = the reward received in that state + the value of the next state
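As a quick check on the arithmetic above, here is a tiny sketch that applies this rule backward over the three rewards from the figure (-1 in S3, 0 in S4, +1 in S5):

```python
# Rewards received in S3, S4 and S5 from the worked example above.
rewards = [-1, 0, +1]

values = [0] * len(rewards)
values[-1] = rewards[-1]                     # V(S5) = +1
for t in range(len(rewards) - 2, -1, -1):
    values[t] = rewards[t] + values[t + 1]   # V(St) = Rt + V(St+1)

print(values)   # [0, 1, 1] -> V(S3) = 0, V(S4) = 1, V(S5) = 1
```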

It is important to realize that we are sampling over many episodes, so instead of setting the value explicitly to Rt + V(St+1), we turn this into an iterative process and just take a small step in the direction of Rt + V(St+1). This brings us to the update V(St) ← V(St) + α·[Rt + V(St+1)], where α (the step size) is a small positive number.

But there is a very subtle problem with the update above. Imagine we are updating state s1, whose current value V(s1) is 0, with Rt = 0 and V(s2) = 1. Every application of the update adds α·[0 + 1] to V(s1), so if we apply it repeatedly, V(s1) will blow up to infinity instead of settling at its correct value of 1.
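A quick numeric sketch of the problem, using a step size of 0.1 (the numbers are just for illustration):

```python
# Flawed update: V(s1) <- V(s1) + alpha * (Rt + V(s2)), with no offset term.
V_s1, V_s2, R_t, alpha = 0.0, 1.0, 0.0, 0.1

for _ in range(5):
    V_s1 = V_s1 + alpha * (R_t + V_s2)
    print(round(V_s1, 1))   # 0.1, 0.2, 0.3, 0.4, 0.5 ... keeps growing, never settles at 1
```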

We can fix this problem by offsetting the update by the current value of the state. This brings us to our value update function: V(St) ← V(St) + α·[Rt + V(St+1) − V(St)].

This should look a lot like gradient descent. If this function is repeatedly applied, it will make the state values converge to their correct values.

Another way to see why we must use the difference in values: we need to take a small step toward the correct value relative to the current value of the state.
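Applied to one episode, the update is swept backward over the visited states, from the last state to the first. A minimal sketch, assuming `value_table` is the dict-like table from earlier and `states`/`rewards` hold, in order, the states visited during the episode and the rewards received in them:

```python
ALPHA = 0.1   # step size: how big a step we take toward Rt + V(St+1)

def update_values(value_table, states, rewards):
    """Apply V(St) <- V(St) + alpha * (Rt + V(St+1) - V(St)), last state first."""
    next_value = 0.0                        # there is no state after the last one
    for state, reward in zip(reversed(states), reversed(rewards)):
        target = reward + next_value        # Rt + V(St+1)
        value_table[state] += ALPHA * (target - value_table[state])
        next_value = value_table[state]
```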

Theoretical Section Summary

In summary, the agent begins by playing several episodes. While it is playing these episodes, it sometimes takes a random action to explore the state space, and other times takes whatever action leads to the state with the highest value. After each episode, we update the value of each state the agent was in during that episode using the value update function, starting from the last state and moving backward to the first state. Eventually, this leads us to the correct value for each state. If the agent then only takes actions that lead to the highest-valued states, it will have mastered the game (hopefully).
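Tying everything together, a full training loop along these lines might look like the sketch below. The environment interface (`reset`, `step`, `legal_actions`, `peek`) is entirely hypothetical and will differ from however the tic tac toe environment is built in the project; the point is only to show how exploration, episode memory, and the backward value updates fit together.

```python
import random
from collections import defaultdict

ALPHA = 0.1   # step size for the value updates

def train(env, num_episodes=10_000):
    """Sketch of tabular value learning over a hypothetical `env` interface."""
    value_table = defaultdict(float)
    epsilon = 1.0
    for _ in range(num_episodes):
        # --- play one episode, remembering every state and reward ---
        state, done = env.reset(), False
        states, rewards = [state], [0.0]           # no reward in the starting state
        while not done:
            actions = env.legal_actions(state)     # hypothetical helper
            if random.random() < epsilon:          # explore the state space
                action = random.choice(actions)
            else:                                  # exploit: highest-valued next state
                action = max(actions, key=lambda a: value_table[env.peek(state, a)])
            state, reward, done = env.step(action)
            states.append(state)
            rewards.append(reward)
        # --- backward value updates over this episode ---
        next_value = 0.0
        for s, r in zip(reversed(states), reversed(rewards)):
            value_table[s] += ALPHA * (r + next_value - value_table[s])
            next_value = value_table[s]
        epsilon = max(0.01, epsilon * 0.999)       # explore less as training goes on
    return value_table
```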

Alright, enough with the boring stuff. If all this theory is still a bit fuzzy to you, don’t worry; true learning happens when we apply this stuff. So, without further ado, let’s build an agent that kicks ass in tic tac toe!

Thank you for reading! If this post helped you in some way or you have a comment or question then please leave a response below and let me know! Also, if you noticed I made a mistake somewhere, or I could’ve explained something more clearly then I would appreciate it if you’d let me know through a response.

