Reinforcement Learning — TD(λ) Introduction (1)

Applying the offline λ-return on Random Walk

Jeremy Zhang
Towards Data Science


In this article, we will be talking about TD(λ), a generic reinforcement learning method that unifies both Monte Carlo simulation and the 1-step TD method. We have covered TD methods extensively, and if you remember, I said that the TD(n) method is also a unification of MC simulation and 1-step TD. However, in TD(n) one has to keep track of all the traces along the way and update the current estimate based on the value n steps ahead. In TD(λ), we are going to see a more elegant unification.

In this article, we will be:

  1. Learning the idea of TD(λ)
  2. Introducing the forward view update method — the offline λ-return
  3. Applying the method to the random walk example

The Idea of TD(λ)

TD(λ) is, in fact, an extension of the TD(n) method. Remember that in TD(n), we have the accumulated reward (the n-step return) of the following form:
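G_t:t+n = R_t+1 + γ R_t+2 + … + γ^(n-1) R_t+n + γ^n v(S_t+n, w)

i.e. the discounted rewards over the next n steps plus the bootstrapped estimate of the state n steps ahead (γ is the discount factor).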

This value estimate up to step t+n is used to update the value at step t. What TD(λ) does is average these values; for example, using

0.5*G_t:t+2 + 0.5*G_t:t+4

as the target value. But instead of using direct weights, it uses λ as a parameter to control the weights and make them sum up to 1:

TD(λ)

I bet the image looks familiar: an agent starts from a state, takes an action and reaches the next state, then chooses another action, and the SARSA process goes on and on. The first column is in fact the 1-step TD return, which is assigned a weight of 1-λ; the second column is the 2-step return, which has a weight of (1-λ)λ; …, until the last column, the full return to the end of the episode, which is assigned a weight of λ^(T-t-1) (T is the length of the episode). Note that the weights decay as n increases and their total sum is 1. A more general form of the TD(λ) target, the λ-return, is:
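G_t^λ = (1-λ) * Σ_{n=1}^{T-t-1} λ^(n-1) G_t:t+n + λ^(T-t-1) G_t

where G_t is the full (Monte Carlo) return up to the end of the episode.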

From the formula, we can see that when λ = 1, only the last term is kept, and this is essentially the Monte Carlo method, as the state-action process goes all the way to the end. When λ = 0, the target reduces to G_t:t+1, which is the 1-step TD method. For 0 < λ < 1, the method becomes a weighted mix of TD(n) targets.

Off-line λ-Return (Forward View)

With the target G_t^λ defined, we are now ready for our first algorithm. The summarised update formula (the offline λ-return update) can be defined as:
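w_{t+1} = w_t + α [ G_t^λ - v(S_t, w_t) ] ∇v(S_t, w_t)

i.e. the usual semi-gradient update with the λ-return as the target. For the tabular random walk below, ∇v(S_t, w_t) is simply an indicator of the visited state, so the update nudges that state's weight toward the target.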

The update rule is the same as the general semi-gradient method; the only difference lies in the target value stated above.

Referring to an image from Sutton’s book, this method is also called the forward view learning algorithm: at each state, the update process looks forward to the values of G_t:t+1, G_t:t+2, …, and updates the current state based on a weighted combination of them.

Forward Update on Random Walk

Now let’s get to the implementation of the algorithm on the random walk example. We have learnt the random walk example here, but FYI: in a random walk, an agent starts at the middle position, and at each step it has equal probability of moving one step to the left or to the right (the action policy is fixed); an episode ends only when the agent reaches either the leftmost or the rightmost state.

We are going to implement a 19-state random walk, and although the state space is in fact discrete, we can still apply the generalisation (function approximation) algorithm to it.

Value Function

The value function is simple and explicit. We have 19 states plus 2 terminal states, so in total we have 21 states, and each state has a weight, which is essentially its value estimate. The value function returns the value of a specific state, and the learn function updates the current estimate based on the difference delta, which in this case is G_t - v(S_t, w_t) (alpha is the learning rate).
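A minimal sketch of such a value function (the class and method names here follow the description above but are my own reconstruction, not necessarily the author's exact code):

```python
import numpy as np

class ValueFunction:
    # 19 non-terminal states plus 2 terminal ones -> 21 weights,
    # each weight being the value estimate of that state
    def __init__(self, n_states=19, alpha=0.2):
        self.alpha = alpha                     # learning rate
        self.weights = np.zeros(n_states + 2)

    def value(self, state):
        # value estimate of a specific state
        return self.weights[state]

    def learn(self, state, delta):
        # delta = G_t - v(S_t, w_t); move the estimate toward the target
        self.weights[state] += self.alpha * delta
```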

Some General Functions

As this is not our first time implementing a random walk, I will list some of the commonly shared functions together:

At each state, the agent calls chooseAction, takeAction and giveReward, and repeats until it reaches the end of the game.
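A sketch of these shared pieces (the RandomWalk class, the state numbering with 0 and 20 as terminal states, and the +1/-1 terminal rewards are my assumptions, chosen to match the 19-state example in Sutton's book):

```python
import numpy as np

class RandomWalk:
    # 19 non-terminal states (1..19); 0 and 20 are terminal
    def __init__(self, start=10):
        self.start = start
        self.state = start
        self.states = [start]   # trace of visited states
        self.reward = 0         # only the final reward is non-zero
        self.end = False

    def chooseAction(self):
        # fixed policy: step left or right with equal probability
        return np.random.choice([-1, 1])

    def takeAction(self, action):
        self.state += action
        return self.state

    def giveReward(self, state):
        # assumption: +1 at the rightmost end, -1 at the leftmost end, 0 otherwise
        if state == 20:
            return 1
        if state == 0:
            return -1
        return 0
```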

Play and Update

Again, the major difference lies in the playing process and in computing delta for each state.

In each episode, we first need to keep track of all states until the game ends, in order to get a forward view when updating the value of each state. self.states records all states along the way, and self.reward keeps only the latest reward, since all rewards are 0 along the way except at the final state.
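Continuing the RandomWalk sketch above, the playing process might look like this (names reconstructed from the description, not taken from the original snippet):

```python
    def play(self):
        # roll out a full episode first, so that each state can later
        # look forward to the returns that follow it
        self.state = self.start
        self.states = [self.start]
        self.reward = 0
        self.end = False
        while not self.end:
            action = self.chooseAction()
            state = self.takeAction(action)
            self.reward = self.giveReward(state)   # non-zero only at the ends
            self.states.append(state)
            if state == 0 or state == 20:
                self.end = True
```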

The second part is to update the state value estimates after the game ends, using the λ-return formula defined above.

For each state at time t and step n, we need to compute the value of G_t:t+n and combine these values with decaying weights to get the target value for S_t. The function gt2tn computes this value given t and n.
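Since all intermediate rewards are 0 and no discounting is used in the random walk, G_t:t+n is either the final reward (if step t+n reaches or passes the end of the episode) or the current estimate of the state n steps ahead. A sketch of gt2tn under that assumption:

```python
    def gt2tn(self, valueFunc, t, n):
        # n-step return G_t:t+n for the undiscounted random walk:
        # intermediate rewards are all 0, so it is either the final reward
        # or the bootstrapped value of the state n steps ahead
        T = len(self.states) - 1          # index of the terminal state
        if t + n >= T:
            return self.reward
        return valueFunc.value(self.states[t + n])
```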

The variable gtlambda (computed in the update sketch below) is the target value. In addition, we also set a truncate value: when lambda_power becomes too small, we simply ignore the remaining terms. With the target gtlambda and the current value from valueFunc, we can compute the difference delta and update the estimate using the learn function we defined above.
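Putting it together, the update after each episode might look like the following sketch (lambda_ and truncate are hypothetical parameter names; the weighting follows the λ-return formula above):

```python
    def update(self, valueFunc, lambda_=0.8, truncate=1e-3):
        # offline λ-return: after the episode ends, update every visited state
        T = len(self.states) - 1
        for t in range(T):
            gtlambda = 0
            lambda_power = 1                       # λ^(n-1)
            for n in range(1, T - t):
                gtlambda += lambda_power * self.gt2tn(valueFunc, t, n)
                lambda_power *= lambda_
                if lambda_power < truncate:        # remaining weights are negligible
                    break
            gtlambda *= 1 - lambda_
            if lambda_power >= truncate:
                # final term: the full return (here just the final reward),
                # weighted by λ^(T-t-1)
                gtlambda += lambda_power * self.reward
            state = self.states[t]
            delta = gtlambda - valueFunc.value(state)
            valueFunc.learn(state, delta)
```

A training run would then simply alternate play() and update(valueFunc) over a number of episodes, and compare the learned estimates with the true state values to obtain RMS error curves like the ones below.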

Off-line λ-Return & TD(n)

Remember that in the TD(n) post, we applied the n-step TD method on a random walk with exactly the same settings. Now, with the off-line λ-return introduced, let’s compare the learning results of both:

Image from Reinforcement Learning: An Introduction

We can see that both RMS error curves are similar and the results are comparable. Generally, the best learning result occurs with an intermediate learning rate and λ value (I copied the image from Sutton’s book, as it is much clearer than the one I plotted).

That is it for the forward view update; please check out the full implementation here. In the next post, we will learn the backward view update, a more elegant TD(λ) method that uses eligibility traces.

Reference:

Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018.
