Reinforcement learning (RL) 101 with Python

Dynamic programming (DP), Monte Carlo (MC) and Temporal difference (TD) to solve the gridworld state-value function

Gerard Martínez
Towards Data Science


In this post we will introduce a few basic concepts of classical RL applied to a very simple task called gridworld, in order to solve the so-called state-value function: a function that tells us how good it is to be in a certain state, based on the future rewards that can be achieved from that state. To do so we will use three different approaches: (1) dynamic programming, (2) Monte Carlo simulations and (3) Temporal-Difference (TD) learning.

The Basics

Reinforcement learning is a discipline that tries to develop and understand algorithms to model and train agents that can interact with their environment to maximize a specific goal. The idea is quite straightforward: the agent observes its current state St, takes an action At, which leads it to state St+1 and yields a reward Rt+1. The following scheme summarizes this iterative process: St → At → Rt+1, St+1 → At+1 → Rt+2, St+2 → …

Agent-environment interaction cycle. Source: Reinforcement Learning: An Introduction (Sutton, R., Barto A.).

An example of this process would be a robot with the task of collecting empty cans from the ground. For instance, the robot could be given 1 point every time it picks up a can and 0 the rest of the time. You can imagine that the robot could have several available actions, e.g. move forward/back/left/right, extend the arm up/down, etc. If the robot were fancy enough, the representation of the environment (perceived as states) could simply be a picture of the street in front of the robot. The robot would be set free to wander around and learn to pick up the cans, for which we would give a positive reward of +1 per can. We could then set a termination condition, for instance picking up 10 cans (reaching reward = 10). The robot would loop through the agent-environment cycle until the terminal state was reached, which would mean the end of the task, or episode, as it is known.

The gridworld task

A representation of the gridworld task. Source: Reinforcement Learning: An Introduction (Sutton, R., Barto A.).

The gridworld task is similar to the aforementioned example, just that in this case the robot must move through the grid to end up in a termination state (grey squares). Each grid square is a state. The actions that can be taken are up, down, left or right, and we assume that these actions are deterministic, meaning that every time the robot chooses to go up, it goes up. The only exception is when the robot hits a wall: in that case the final state is the same as the initial state (it cannot break through the wall). Finally, every move (or attempted move against a wall) yields a reward of -1, except when the initial state is a terminal state, in which case the reward is 0 and no further action needs to be taken because the episode has already ended.

Now, there are different ways the robot could pick an action. The set of rules by which the robot picks its actions is called the policy. In the simplest of cases, imagine the robot moves in every direction with the same probability, i.e. there is a 25% probability it moves up, 25% left, 25% down and 25% right. Let’s call this the random policy. Following this random policy, the question is: what is the value of each of the gridworld states/squares, i.e. how good is it for the robot to be there?
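Before moving on to the solution methods, here is a minimal sketch of how the gridworld dynamics and the random policy described above could be coded. It assumes, as in the book’s figure, that the grey terminal squares are the two opposite corners of a 4x4 grid; the names step and random_policy are illustrative, not from any particular library.

```python
import random

GRID_SIZE = 4
TERMINALS = [(0, 0), (GRID_SIZE - 1, GRID_SIZE - 1)]  # the grey squares (assumed corners)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic gridworld dynamics: returns (next_state, reward)."""
    if state in TERMINALS:
        return state, 0  # the episode has already ended, no further reward
    row, col = state
    d_row, d_col = ACTIONS[action]
    next_state = (row + d_row, col + d_col)
    # Bumping into the wall leaves the robot where it was
    if not (0 <= next_state[0] < GRID_SIZE and 0 <= next_state[1] < GRID_SIZE):
        next_state = state
    return next_state, -1

def random_policy(state):
    """The random policy: each of the four actions with probability 0.25."""
    return random.choice(list(ACTIONS))

# Example: trying to move up from the top-right corner bumps into the wall
print(step((0, 3), "up"))  # ((0, 3), -1)
```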

Dynamic programming and policy iteration: evaluation and improvement

If the objective is to end up in a grey square, it is evident that the squares next to a grey one are better, because there is a higher chance of ending up in a terminal state while following the random policy. But how can we quantify how good each of these squares/states is? Or, equivalently, how can we calculate a function V(s) (known as the state-value function) that gives us the true value of each state s?

Let’s first talk about the concept of value. Value can be thought of as the sum of all future rewards that can be achieved starting from a state. The intuitive difference between value and reward is like the difference between happiness and pleasure. While immediate pleasure can be satisfying, it does not ensure long-lasting happiness, because it only takes into account the very next reward and not all the rewards that come after it. In RL, the value of a state works the same way: the total value is not only the immediate reward but the sum of all future rewards that can be achieved.
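In the notation of Sutton and Barto, the value of a state s under a policy π is the expected return Gt, the (possibly discounted) sum of all future rewards:

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

where gamma (γ) is the discount factor that we will meet again in a moment.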

One way to solve for the aforementioned state-value function is policy iteration, and in particular its policy-evaluation step, an algorithm from a field of mathematics called dynamic programming. The algorithm is shown in the following box:

Iterative policy evaluation algorithm. Source: Reinforcement Learning: An Introduction (Sutton, R., Barto A.).

The key step of the algorithm is the assignment to V(s), which you can find commented here:

Source: edited from Reinforcement Learning: An Introduction (Sutton, R., Barto A.)

The idea is that we start with a value function that is a 4x4 array (the same size as the grid) filled with zeros. We then iterate over every state and compute its new value as the weighted sum, over the four actions, of the reward (-1) plus the value of each neighbouring state s’. Notice two things: first, V(s’) is the current estimate of the value of the successor/neighbouring state s’ (at the beginning this estimate is 0, since we initialize the value function with zeros); second, V(s’) is multiplied by gamma, the discount factor. In our case we use gamma = 1, but the idea of the discount factor is that immediate rewards (the r in our equation) are more important than future rewards (reflected by the value of s’), and we can adjust gamma to express this fact.
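Written out, the assignment in the box above is the expected update of iterative policy evaluation, which for the random policy on our deterministic gridworld (reward -1 per move) reduces to a simple average over the four moves:

$$V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right] = \frac{1}{4}\sum_a \left[-1 + \gamma V(s'_a)\right]$$

where s'_a is the square reached by taking action a from s (or s itself when the move bumps into the wall).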

Finally, notice that we can repeat this process over and over, “sweeping” through all the states and updating the state-value function. These values get iteratively updated until convergence is reached. In fact, in the iterative policy evaluation algorithm you can see that we compute a delta that reflects how much the value of a state changes with respect to its previous value. These deltas decay over the iterations and should approach 0 in the limit.

Here’s an example of how the value function is updated:

Source: Reinforcement Learning: An Introduction (Sutton, R., Barto A.)

Notice in the right column that, as we update the values of the states, we can generate better and better policies until we reach the optimal “rules” the robot must follow to end up in a termination state as fast as possible.

Finally, here’s a Python implementation of the iterative policy evaluation and update. Observe at the end how the deltas for each state decay to 0 as we reach convergence.
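A minimal sketch of such an implementation, assuming a 4x4 grid with the grey terminal squares in opposite corners, the random policy, gamma = 1 and a small convergence threshold theta, could look like this (the name policy_evaluation is illustrative):

```python
import numpy as np

GRID_SIZE = 4
TERMINALS = [(0, 0), (GRID_SIZE - 1, GRID_SIZE - 1)]  # the grey squares (assumed corners)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def policy_evaluation(gamma=1.0, theta=1e-4):
    """Iterative policy evaluation for the random policy on the 4x4 gridworld."""
    V = np.zeros((GRID_SIZE, GRID_SIZE))  # value function initialized to zeros
    deltas = []
    while True:
        delta = 0.0
        for row in range(GRID_SIZE):
            for col in range(GRID_SIZE):
                if (row, col) in TERMINALS:
                    continue  # terminal states keep a value of 0
                new_v = 0.0
                for d_row, d_col in ACTIONS:
                    nr, nc = row + d_row, col + d_col
                    if not (0 <= nr < GRID_SIZE and 0 <= nc < GRID_SIZE):
                        nr, nc = row, col  # bounced off the wall
                    new_v += 0.25 * (-1 + gamma * V[nr, nc])  # weighted sum over actions
                delta = max(delta, abs(new_v - V[row, col]))
                V[row, col] = new_v
        deltas.append(delta)
        if delta < theta:  # the sweeps have converged
            return V, deltas

V, deltas = policy_evaluation()
print(np.round(V, 1))  # converged values, roughly between -14 and -22 off the terminals
```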

Monte Carlo methods

While the previous approach assumes complete knowledge of the environment, this is often not the case. Monte Carlo (MC) methods are able to learn directly from experience, or episodes, rather than relying on prior knowledge of the environment dynamics.

The term “Monte Carlo” is often used broadly for any estimation method whose operation involves a significant random component.

Interestingly, in many cases it is possible to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.

Here’s the algorithm to estimate the value function following MC:

Source: Reinforcement Learning: An Introduction (Sutton, R., Barto A.)

The Monte Carlo approach to solving the gridworld task is somewhat naive but effective. Basically, we can run n simulations starting from random points of the grid and let the robot move randomly in the four directions until a termination state is reached. For each step of a simulation we save four values: (1) the initial state, (2) the action taken, (3) the reward received and (4) the final state. In the end, a simulation is just an array containing x arrays of these values, x being the number of steps the robot had to take to reach a terminal state.

Now, for each simulation, we iterate backwards over this “experience” array and keep a running return G, computed at every step as the reward received at that step plus the G accumulated so far (weighted by gamma, the discount factor). We append each G to an array Returns(St) for the state visited at that step. Finally, for each state we compute the average of Returns(St) and use it as the state’s value at that particular iteration.

Here you can find a Python implementation of this approach applied to the same task as before: the gridworld.
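A minimal sketch of this approach, following an every-visit variant of the description above (random starting square, random policy, one list of returns per state), could look like this; run_episode and mc_prediction are illustrative names:

```python
import random
from collections import defaultdict

import numpy as np

GRID_SIZE = 4
TERMINALS = [(0, 0), (GRID_SIZE - 1, GRID_SIZE - 1)]  # the grey squares (assumed corners)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def run_episode():
    """Follow the random policy from a random non-terminal square until termination."""
    state = random.choice([(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)
                           if (r, c) not in TERMINALS])
    episode = []  # list of (state, reward) pairs, one per step
    while state not in TERMINALS:
        d_row, d_col = random.choice(ACTIONS)
        nr, nc = state[0] + d_row, state[1] + d_col
        if not (0 <= nr < GRID_SIZE and 0 <= nc < GRID_SIZE):
            nr, nc = state  # bounced off the wall
        episode.append((state, -1))
        state = (nr, nc)
    return episode

def mc_prediction(n_episodes=5000, gamma=1.0):
    """Monte Carlo prediction of the state values under the random policy."""
    returns = defaultdict(list)  # Returns(St): every G observed for each state
    for _ in range(n_episodes):
        episode = run_episode()
        G = 0.0
        for state, reward in reversed(episode):  # iterate from the end of the experience
            G = gamma * G + reward
            returns[state].append(G)
    V = np.zeros((GRID_SIZE, GRID_SIZE))
    for (row, col), g_list in returns.items():
        V[row, col] = np.mean(g_list)  # the value is the average of the returns
    return V

print(np.round(mc_prediction(), 1))
```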

Note that decreasing gamma can reduce the convergence time, as we can see in the last two plots, which use gamma = 1 and gamma = 0.6. The good sides of this approach are that:

  1. Technically, we don’t have to compute the values of all the states if we don’t want to. We could just focus on a particular grid square and start all the simulations from that initial state, sampling episodes that include that state and ignoring all others. This can radically decrease the computational expense.
  2. As we said before, this approach does not require a full understanding of the environment dynamics and we can learn directly from experience or simulation.

Temporal-difference Learning

The last method we will explore is temporal-difference (TD) learning. This third method is said to combine the best of dynamic programming with the best of the Monte Carlo approach. Here are some of its strong points:

  1. As in dynamic programming, when optimizing the value function we use the estimated value of the next state to enrich the prediction of the current one. This process is called bootstrapping.
  2. As in Monte Carlo, we don’t need a model of the environment dynamics and can learn directly from experience.
  3. Furthermore, unlike MC, we don’t have to wait until the end of the episode to start learning. In fact, in the case of TD(0), or one-step TD, we learn at each and every step we take. This is particularly powerful because, on one hand, the nature of the learning is truly “online” and, on the other hand, we can deal with tasks that do not have a clear terminal state, learning and approximating value functions indefinitely (suitable for non-deterministic, non-episodic or time-varying settings).
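The rule that makes this step-by-step learning possible is the TD(0) update: after every single step, the current estimate is nudged towards the bootstrapped target R(t+1) + γV(S(t+1)) with step size alpha (α):

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$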

Here’s the algorithm to calculate the value function using temporal-difference:

Source: Reinforcement Learning: An Introduction (Sutton, R., Barto A.)

And here’s the Jupyter notebook with the Python implementation:
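A minimal sketch of TD(0) prediction on the same gridworld under the random policy could look like the following; the alpha, gamma and episode-count values are illustrative, not necessarily those used in the notebook:

```python
import random

import numpy as np

GRID_SIZE = 4
TERMINALS = [(0, 0), (GRID_SIZE - 1, GRID_SIZE - 1)]  # the grey squares (assumed corners)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def td0_prediction(n_episodes=5000, alpha=0.1, gamma=1.0):
    """TD(0) estimation of the state values under the random policy."""
    V = np.zeros((GRID_SIZE, GRID_SIZE))
    non_terminals = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)
                     if (r, c) not in TERMINALS]
    for _ in range(n_episodes):
        state = random.choice(non_terminals)
        while state not in TERMINALS:
            d_row, d_col = random.choice(ACTIONS)
            nr, nc = state[0] + d_row, state[1] + d_col
            if not (0 <= nr < GRID_SIZE and 0 <= nc < GRID_SIZE):
                nr, nc = state  # bounced off the wall
            reward = -1
            # TD(0) update: learn at every step, bootstrapping on V of the next state
            V[state] += alpha * (reward + gamma * V[nr, nc] - V[state])
            state = (nr, nc)
    return V

print(np.round(td0_prediction(), 1))
```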

Notice that adjusting the alpha and gamma parameters is critical in this case for reaching convergence.

Acknowledgements

Finally, I’d like to mention that most of the work here is inspired by or drawn from the latest edition of Richard S. Sutton and Andrew G. Barto’s book Reinforcement Learning: An Introduction, an amazing work that the authors have made publicly accessible here.
