Update: The best way to learn and practice the TD method is to go to http://rl-lab.com/gridworld-td
Suppose you are driving your car equipped with a GPS. At the start of your journey, the GPS gives you an estimate of the arrival time (based on statistical data). As you drive and hit traffic jams (or not), it refines its estimate and gives you updated arrival times.
You notice that at each portion of the trip you are provided with some estimate about the arrival time.
Now suppose that your GPS does not give you any estimate, but instead stores the data until you arrive and then gives you a detailed report on how much time each part of the road took. Would this be useful to you?
The answer is: it depends on what you want to do. But you will surely appreciate getting feedback early on, even if it is not very accurate.
This is the difference between Monte Carlo and Temporal Difference. The second GPS in the example behaves like Monte Carlo, because it waits until you arrive at the destination before computing the time estimate for each portion of the trip, while the first one behaves like Temporal Difference.
As a matter of fact, if you merge the Monte Carlo (MC) and Dynamic Programming (DP) methods, you obtain the Temporal Difference (TD) method.
Note: TD will be noted TD(0), which means it looks ahead one step. TD(0) is a particular case of TD(n).
Recall that in MC we play a whole episode until the end, then we compute the discounted return for each state that appeared in the episode. We play a large number of episodes, then we average the different values obtained for each state.
In DP we randomly initialize all the state values, then we iteratively compute the value of each state from the (previously computed) values of the surrounding states. We keep doing this until there is no considerable improvement in any state value.
Policy Prediction
We have seen that in MC we play the episode until the end, then we move backward to assign each state the discounted return G of the episode. But this means we have to wait until the end to know G. In TD(0), on the other hand, we update the current state based on the estimate of the next state. Remember the GPS example: at one point the GPS might notice that your speed dropped to 10 km/h and update its arrival estimate by +30 minutes, but this might be a very brief slowdown, and a few minutes later, when you are accelerating again, the GPS adjusts its estimate by -20 minutes. The same happens in TD(0): V(s) is updated according to the following formula:
![where α is the step size ∈ ]0,1], 𝛄 is the discount factor](https://miro.medium.com/1*s80b7_pfTReXENYNaCUK4g.png)
V(s) ← V(s) + α · [r + γ · V(s') − V(s)], where α ∈ ]0,1] is the step size and γ is the discount factor.
This is an incremental average computation. Check "Incremental Average Computation" at the end of the article for details.
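To make the formula concrete, here is a one-step sketch in Python; the states, values, and reward below are made-up numbers for illustration only:

```python
# Made-up values for a single observed transition s -> s' with reward r.
V = {"s": 0.0, "s_prime": 0.5}   # current state-value estimates
alpha, gamma = 0.1, 0.9          # step size and discount factor
r = 1.0                          # reward received on the transition

# TD(0): pull V(s) toward the one-step target r + gamma * V(s').
V["s"] = V["s"] + alpha * (r + gamma * V["s_prime"] - V["s"])
print(V["s"])   # ~0.145
```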
Clearly the estimate of V(s) won't be accurate based on one episode only. Same as your car trip that day: you won't have a good estimate of the total time, nor of the time of each portion, by doing the trip only once. Maybe that day you were lucky and there was no traffic jam, or on the contrary you were unlucky and got stuck in an out-of-the-ordinary jam caused by a car accident. But if you do the trip every day (play more episodes), you will be able to refine your estimates with each passing day.
The algorithm (in pseudo code) for policy evaluation in TD(0) is as follows:
Evaluate_Policy(policy):
    randomly_initialize_non_terminal_states_values()
    Loop number_of_episodes:
        let s = start_state()
        # Play episode until the end
        Loop until game_over():
            # get the action to perform on state s according to the given
            # policy 90% of the time, and a random action 10% of the time
            let a = get_action(policy, s, 0.1)
            # make the move from s using a and get the new state s'
            # and the reward r
            let (s', r) = make_move(s, a)
            # incrementally compute the average at V(s). Notice that V(s)
            # depends on an estimate of V(s') and not on the return G as in MC
            let V(s) = V(s) + alpha * [r + gamma * V(s') - V(s)]
            let s = s'
        End Loop
    End Loop
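Below is a minimal, runnable Python sketch of this pseudo code. It assumes a tiny 5-state random-walk environment and evaluates an equiprobable random policy; the environment, policy, and hyper-parameters are illustrative choices, not from the article:

```python
import random

# Illustrative 5-state random walk: states 1..5 are non-terminal,
# 0 and 6 are terminal. A reward of 1 is given only when exiting on the right.
TERMINALS = (0, 6)
ALPHA, GAMMA = 0.1, 1.0

def start_state():
    return 3  # always start in the middle

def get_action(s):
    # The policy being evaluated: go left or right with equal probability.
    return random.choice([-1, +1])

def make_move(s, a):
    # Make the move from s using a; return the new state s' and the reward r.
    s_next = s + a
    r = 1.0 if s_next == 6 else 0.0
    return s_next, r

def evaluate_policy(num_episodes=5000):
    # Random init for non-terminal states, 0 for terminals (as in the pseudo code).
    V = {s: (0.0 if s in TERMINALS else random.random()) for s in range(7)}
    for _ in range(num_episodes):
        s = start_state()
        while s not in TERMINALS:   # play the episode until the end
            a = get_action(s)
            s_next, r = make_move(s, a)
            # TD(0) update: V(s) is pulled toward the bootstrapped
            # target r + gamma * V(s'), not toward the full return G.
            V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
            s = s_next
    return V

if __name__ == "__main__":
    V = evaluate_policy()
    # The values of states 1..5 should approach 1/6, 2/6, ..., 5/6.
    print({s: round(v, 2) for s, v in V.items()})
```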
Policy Control
Policy control in TD(0) has two implementations: Sarsa and Q-Learning.
SARSA is an On-Policy method, which means it computes the Q-value according to a certain policy and then the agent follows that policy.
Q-Learning is an Off-Policy method. It consists of computing the Q-value according to a greedy policy, but the agent does not necessarily follow the greedy policy.
SARSA
As usual, when it comes to performing actions we need to compute the Action-Value function Q(s, a), because it maps each state-action pair to an estimate of its value. In the Monte Carlo article we explained why V(s) alone is not enough to determine the optimal policy (the plan, or the action to take at each state). So suppose we are at state s and we want to compute the Q-value of action a. As we have seen earlier, TD(0) uses an incremental average to compute the value of a state, and this average is expressed in terms of the value of the next state. Since we are computing Q-values, we need the Q-value of the next state s'. However, Q takes two parameters: the state and the action.

The way SARSA resolves this, in order to get that Q-value, is to choose an action a' (using the epsilon-greedy method) at state s', and then, when the agent actually arrives at s', it will perform action a'.
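Written as an update rule (this is the same formula that appears in the pseudo code below):

Q(s, a) ← Q(s, a) + α · [r + γ · Q(s', a') − Q(s, a)]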
The figure below gives an example of SARSA.


In the left grid the agent is at state s and computes the value of the action going North (blue arrow); to do this computation, it needs the Q-value at s' for the action going East (grey arrow). The right grid shows that once the agent has moved to state s', it follows the action previously chosen by the policy and computes the Q-value of the action going East (blue arrow).
The following is the pseudo code of SARSA:
SARSA():
    # initialization
    for each state s in AllNonTerminalStates:
        for each action a in Actions(s):
            Q(s,a) = random()
    for each s in TerminalStates:
        Q(s,_) = 0  # Q(s) = 0 for all actions of s
    Loop number_of_episodes:
        let s = start_state()
        # get the action to perform on state s according to the given
        # policy 90% of the time, and a random action 10% of the time
        let a = get_epsilon_greedy_action(s, 0.1)
        # Play episode until the end
        Loop until game_over():
            # make the move from s using a and get the new state s'
            # and the reward r
            let (s', r) = make_move(s, a)
            # choose the action to perform on state s'.
            # a' will be executed in the next iteration, but for the
            # moment it is used to get Q(s', a')
            let a' = get_epsilon_greedy_action(s', 0.1)
            # incrementally compute the average at Q(s,a)
            let Q(s, a) = Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
            let s = s'  # move to the next state
            let a = a'  # use the action a' determined above
        End Loop
    End Loop
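For concreteness, here is a minimal, runnable Python sketch of SARSA on an assumed 4x4 gridworld; the grid, rewards, and hyper-parameters are illustrative choices, not from the article:

```python
import random

# Illustrative SARSA on an assumed 4x4 gridworld: start bottom-left, goal
# top-right, reward -1 per step and 0 on reaching the goal.
SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E
START, GOAL = (3, 0), (0, 3)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def make_move(s, a):
    # Make the move from s using a; moves off the grid leave the agent in place.
    r, c = s[0] + a[0], s[1] + a[1]
    s_next = (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))
    return s_next, (0.0 if s_next == GOAL else -1.0)

def get_epsilon_greedy_action(Q, s):
    # Random action 10% of the time, best-known action 90% of the time.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(num_episodes=2000):
    # Random initialization for non-terminal states, 0 for the terminal state.
    Q = {((r, c), a): (0.0 if (r, c) == GOAL else random.random())
         for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
    for _ in range(num_episodes):
        s = START
        a = get_epsilon_greedy_action(Q, s)
        while s != GOAL:
            s_next, reward = make_move(s, a)
            a_next = get_epsilon_greedy_action(Q, s_next)  # executed next step
            # On-policy update: the target uses Q(s', a') of the action
            # that will actually be taken at s'.
            Q[(s, a)] += ALPHA * (reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

if __name__ == "__main__":
    Q = sarsa()
    print(max(Q[(START, a)] for a in ACTIONS))  # best Q-value at the start state
```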
Q-Learning
Q-Learning is similar to SARSA, except that when computing Q(s,a) it uses the greedy policy to determine Q(s', a') at the next state s'. Remember that the greedy policy selects the action that gives the highest Q-value. However, and this is important, the agent does not necessarily follow that greedy policy.
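Written as an update rule (again, the same formula that appears in the pseudo code below):

Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]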

The image below illustrates the mechanism of Q-Learning:

The left grid shows the agent at state s computing the value of Q for the action going North (blue arrow). For this computation it uses the Q-value determined by the greedy policy at state s' (orange arrow). The right grid shows the agent after moving to state s'; it does not necessarily follow the action determined by the greedy policy (orange arrow), and in this example it chooses a random action instead (blue arrow).
The Q-Learning algorithm (in pseudo code) is as follows:
QLearning():
    # initialization
    for each state s in AllNonTerminalStates:
        for each action a in Actions(s):
            Q(s,a) = random()
    for each s in TerminalStates:
        Q(s,_) = 0  # Q(s) = 0 for all actions of s
    Loop number_of_episodes:
        let s = start_state()
        # Play episode until the end
        Loop until game_over():
            # get the action to perform on state s according to the given
            # policy 90% of the time, and a random action 10% of the time
            let a = get_epsilon_greedy_action(s, 0.1)
            # make the move from s using a and get the new state s'
            # and the reward r
            let (s', r) = make_move(s, a)
            # choose the max Q-value (qmax) on state s'
            let qmax = get_max_qvalue_on_state(s')
            # incrementally compute the average at Q(s,a)
            let Q(s, a) = Q(s, a) + alpha * [r + gamma * qmax - Q(s, a)]
            let s = s'  # move to the next state
        End Loop
    End Loop
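And a minimal, runnable Python sketch of Q-Learning, on the same kind of assumed 4x4 gridworld as the SARSA sketch above (again, the environment and hyper-parameters are illustrative choices):

```python
import random

# Illustrative Q-Learning on an assumed 4x4 gridworld: start bottom-left,
# goal top-right, reward -1 per step and 0 on reaching the goal.
SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E
START, GOAL = (3, 0), (0, 3)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def make_move(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    s_next = (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))
    return s_next, (0.0 if s_next == GOAL else -1.0)

def get_epsilon_greedy_action(Q, s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_learning(num_episodes=2000):
    Q = {((r, c), a): (0.0 if (r, c) == GOAL else random.random())
         for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
    for _ in range(num_episodes):
        s = START
        while s != GOAL:
            a = get_epsilon_greedy_action(Q, s)   # exploratory behaviour policy
            s_next, reward = make_move(s, a)
            qmax = max(Q[(s_next, a2)] for a2 in ACTIONS)  # greedy target at s'
            # Off-policy update: the target uses the best Q-value at s',
            # even if the agent later takes a different action there.
            Q[(s, a)] += ALPHA * (reward + GAMMA * qmax - Q[(s, a)])
            s = s_next
    return Q

if __name__ == "__main__":
    Q = q_learning()
    print(max(Q[(START, a)] for a in ACTIONS))
```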
Incremental Average Computation
This paragraph shows how the incremental average computation is derived. The terms of the average are rearranged so that both A(n+1) and A(n) appear.
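Writing A(n) for the average of the first n samples x(1), …, x(n):

A(n+1) = [x(1) + … + x(n) + x(n+1)] / (n+1)
       = [n · A(n) + x(n+1)] / (n+1)
       = [(n+1) · A(n) − A(n) + x(n+1)] / (n+1)
       = A(n) + [x(n+1) − A(n)] / (n+1)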

Notice that 1/(n+1) plays the role of the term alpha in the state-value and action-value update formulas.
Conclusion
Temporal Difference has an advantage over Dynamic Programming because it does not require a model of the environment, that is, its rewards and transition probability distributions. TD also has an advantage over Monte Carlo methods, since there is no need to wait until the end of the episode to know the return; only one time step is required.