Introduction to Reinforcement Learning: Temporal Difference, SARSA, Q-Learning

Alessandro Paticchio
Towards Data Science
8 min read · Oct 11, 2022



Reinforcement Learning is one of the most intricate fields of Machine Learning, due to its mathematical complexity, as well as the ambitious tasks it tries to achieve.

Simply put, the final goal of an RL algorithm is to make an agent capable of acting in an environment so as to maximize the total reward.

Easier said than done: that single sentence hides a variety of questions, such as:

  1. How do I define a “goal” and a “reward”?
  2. How do I make an agent “act”?
  3. How do I model the “environment”?

The problem is way harder than it seems, yet a lot of progress has been made so far, the most famous examples being scientific breakthroughs such as AlphaGo and AlphaTensor by DeepMind.

Nonetheless, if you want to understand a field deeply, you have to start from its foundations. The scope of this post is to give a high-level overview of RL and to introduce three algorithms that form the basis of many more advanced ones: Temporal Difference, SARSA, and Q-Learning.

Reinforcement Learning: the building blocks

In RL we leverage a simple model to describe how an agent interacts with its environment: the Markov Decision Process (MDP).

MDPs are a mathematical framework for modeling situations where dynamics are partly controlled by an agent and partly caused by the characteristics of the environment. In an MDP, the agent continuously interacts with the environment, which responds with rewards that affect the agent’s subsequent actions. Let’s have a look at the following picture:

A simple MDP, image by the author

At each time step t, the agent receives a representation of the environment’s state Sₜ, which conditions the choice of an action Aₜ. At the next time step t+1, the agent receives a reward Rₜ₊₁ that informs it about how good its past action was, as well as a new representation of the environment.

In a finite MDP, the sets of states Sₜ, actions Aₜ, and rewards Rₜ have a finite number of elements, and Sₜ and Rₜ have well-defined discrete probability distributions that depend only on the preceding state and action. These characteristics usually make the problem more tractable. However, not all problems can be modeled with finite states, actions, or rewards; still, many of the assumptions and conclusions we will make generalize easily to continuous settings.

The continuous interaction between an agent and the environment generates a trajectory:

S₀, A₀, R₁, S₁, A₁, R₂, S₂, A₂, R₃…

The probability of transitioning from state Sₜ to Sₜ₊₁ is thereby influenced by the choice of the action Aₜ, but, given Sₜ and Aₜ, it is conditionally independent of all previous states and actions. Therefore we say that the state transitions of an MDP satisfy the Markov Property.

The final goal of the agent is to maximize the total return over the long run, which usually takes into account a discount factor γ and is calculated as:

Gₜ = Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + γ³ Rₜ₊₄ + …

Discounted return, equation from [1]

The discount factor γ determines how much the agent craves immediate rewards (γ close to 0) or takes a more long-term view and aims for high returns in the future (γ close to 1).
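
To make the discounting concrete, here is a minimal Python sketch that computes the discounted return for a short, hypothetical sequence of rewards (the numbers are made up purely for illustration).

```python
# Discounted return G_t for a finite sequence of rewards R_{t+1}, R_{t+2}, ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]  # hypothetical rewards received after time t

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```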

The agent should behave in order to transition to states that yield high rewards. The behavior of an agent is determined by a policy π(a|s), a function that outputs the probability of performing an action a, given that the agent is in state s. Under a given policy, we can define a state-value function (and a state-action value function), which measures the expected return the agent gets from starting in s (and performing a) and following π thereafter.

vπ(s) = Eπ[Gₜ | Sₜ = s]        qπ(s, a) = Eπ[Gₜ | Sₜ = s, Aₜ = a]

State-value function and state-action value function, equations from [1]

Now, RL algorithms typically pursue two goals:

  1. Prediction: estimating the value functions of a given policy π, i.e. computing the estimates V(s) ≈ vπ(s) and Q(s, a) ≈ qπ(s, a).

  2. Control: finding the optimal policy π*, namely the policy that maximizes the return in the long run. Its value functions are the optimal v*(s) and q*(s, a), and the optimal greedy policy is obtained by choosing the action that yields the highest Q-value: π*(s) = argmaxₐ q*(s, a) (equation from [1]). A tiny sketch of this greedy extraction follows the list.
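
As a concrete illustration, here is a minimal Python sketch of extracting a greedy policy from a tabular Q estimate. The array shapes and values are hypothetical, just to show the mechanics.

```python
import numpy as np

# Hypothetical tabular Q estimate: one row per state, one column per action.
n_states, n_actions = 5, 3
Q = np.random.rand(n_states, n_actions)

# Greedy policy: for each state, pick the action with the highest Q-value.
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)  # one action index per state
```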

Note that these two tasks are inherently correlated: if an agent is following a policy π, my first goal is to evaluate how good that policy is (Prediction). Once I have estimated the value functions of the policy, I can decide how to change it in order to obtain an even better value function (Control).

Knowing the model of the environment, namely the probability distributions of state transitions and rewards, gives a significant advantage in solving these two tasks. Nonetheless, the model is often hidden or too complex to handle, so algorithms instead learn from experience (or episodes), i.e. sequences of states, actions, and rewards that end in a terminal state. For instance, playing a full game of Tic Tac Toe is an episode, with the terminal state reached when one of the players wins (or the game ends in a draw).

A simple Tic Tac Toe episode, image by the author

The list of algorithms available to solve Prediction and Control tasks is long, and each of them can obtain optimal results under certain assumptions. Here I decided to cover three pillar algorithms that work fairly well overall and also represent the foundations of many more sophisticated ones:

  • Temporal Difference
  • SARSA
  • Q-Learning

Temporal Difference

Temporal Difference is said to be the central idea of Reinforcement Learning since it learns from raw experience without a model of the environment. It solves the Prediction problem, namely it estimates the value function of the states of an MDP.

The problem with learning from experience is that one normally has to wait for the end of the episode (and hence for the final return) to update the estimates of the value functions. TD overcomes this obstacle by waiting only a single step before performing the update.

V(Sₜ) ← V(Sₜ) + α [Rₜ₊₁ + γ V(Sₜ₊₁) − V(Sₜ)]

The update rule for the simplest version of TD, where α is the update rate. Equation from [1]

At each time step t, the value function of state Sₜ is immediately updated based on the reward Rₜ₊₁ received and on the current estimate for the new state Sₜ₊₁. This is the simplest version of TD, called TD(0), where the zero indicates that one waits a single step before performing the update; the idea extends to more complex algorithms like n-step TD and TD(λ). As you may notice, the update rule for one state makes use of the estimate of another: this technique is called bootstrapping.

Procedure of TD(0), from [1]
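
Below is a minimal Python sketch of tabular TD(0) prediction, not the book's pseudocode verbatim. It assumes a hypothetical `env` object exposing reset() -> state and step(action) -> (next_state, reward, done), and a `policy(state)` function returning an action.

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    V = defaultdict(float)  # value estimates V(s), initialized to 0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(s) toward the bootstrapped target R + γ V(s')
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

Note how the target uses the current estimate V(next_state): that is the bootstrapping mentioned above.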

We mentioned the pros of using TD; here is a summary:

  1. It doesn’t need a model of the environment
  2. Its estimates converge (and usually quickly) to the true state-value function
  3. It performs online, incremental updates, namely it does not need to wait for the end of an episode to update its estimates of the value function.

SARSA

I know what you’re thinking: what a weird name for an algorithm. Well, the name SARSA indeed comes from the transition from state S to S′:

(Sₜ, Aₜ, Rₜ₊₁, Sₜ₊₁, Aₜ₊₁)

The State-Action-Reward-State-Action transition from S to S′ that gives SARSA its name

SARSA is an on-policy TD method to solve the Control problem, which means it tries to find an optimal policy for a given task. Why “on policy”?

In on-policy learning, the optimal value function is learned from actions taken using the current policy π(a|s).

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ Q(Sₜ₊₁, Aₜ₊₁) − Q(Sₜ, Aₜ)]

The update rule for SARSA, equation from [1]. Note that we now estimate Q(s, a) instead of V(s)

Indeed, SARSA is defined as “on-policy” because we leverage the current policy to select the action used in the update of Q(s, a). If that’s not clear yet, wait until you see what Q-Learning does and compare the difference.

Now, how does SARSA learn the optimal policy? We mentioned it’s a TD method, which means it makes use of the TD update rule to estimate the value function. But how does it improve the policy to obtain a greater value function?

SARSA accomplishes this by updating the policy π in an ε-greedy fashion: at each step we perform:

  • the action with the greatest Q(s, a), with probability 1 − ε
  • a random action otherwise, each action having probability ε / N(a) of being picked, where N(a) is the number of possible actions

The parameter ε balances the exploitation vs. exploration tradeoff: you want to exploit what you already know, namely performing the best known action, but you also want to explore other possibilities and potentially discover better actions.
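
Here is one common way to write ε-greedy action selection in Python, as a rough sketch; the Q structure (a mapping from state to an array of action values) is an assumption for illustration.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))
    # Otherwise, exploit: pick the action with the highest estimated Q-value.
    return int(np.argmax(Q[state]))
```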

SARSA procedure, from [1]
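
As a rough, self-contained sketch (not the book's pseudocode verbatim), tabular SARSA could look like the following. The `env` interface (reset()/step()) and `n_actions` are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))

    def select(state):  # epsilon-greedy selection under the current Q
        if np.random.rand() < epsilon:
            return int(np.random.randint(n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state = env.reset()
        action = select(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = select(next_state)
            # On-policy target: uses the action actually chosen by the current policy
            target = reward + gamma * Q[next_state][next_action] * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```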

Once the procedure ends, provided we have visited every state-action pair extensively, we will have an optimal ε-greedy Q*(s, a), which readily leads to an optimal ε-greedy policy.

Q-Learning

Last but not least, Q-Learning is instead an off-policy TD method to solve the Control problem.

In off-policy learning, the optimal value function is learned from actions taken independently of the current policy π(a|s).

Q-Learning is actually very similar to SARSA. The only (but meaningful) difference lies in the update rule of Q(s, a), which is:

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ)]

The update rule for Q-Learning, equation from [1]

In this case, the Q-value is not updated according to the action chosen by the ε-greedy policy we are following: it is based on the best action we could take from the next state Sₜ₊₁ (and this is what makes it very different from SARSA!).

Q-Learning procedure, from [1]
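
For comparison, here is the same kind of sketch for tabular Q-Learning, under the same hypothetical `env` interface as the SARSA sketch above; the only meaningful change is the update target.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy exploration
            if np.random.rand() < epsilon:
                action = int(np.random.randint(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy target: uses the best action in the next state,
            # regardless of what the behavior policy will actually do there
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```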

Likewise, as long as each state-action pair is visited an infinite number of times, Q-Learning converges to the optimal Q*(s, a) and to the optimal policy.

So… SARSA or Q-Learning?

We saw two algorithms for solving the Control problem; however, there’s a difference in the way they learn. During learning, SARSA’s updates are guided by the chosen ε-greedy action, which might be (and often is) sub-optimal. On the other hand, Q-Learning always uses the greedy action in the update of Q(s, a).

This (big) difference makes it so that the policy learned by SARSA is only near-optimal.

So shouldn’t you always choose Q-Learning? Well, technically you could. But there are cases where choosing the greedy action implies some risk and costly errors, which you may avoid or limit by choosing a more conservative algorithm like SARSA.

Conclusions

In this post, I summarized the key concepts of RL, starting from the basic elements, such as the concepts of states, actions, and rewards, up to sophisticated methods to find the best policy to reach a specific goal.

Reinforcement Learning is a set of complex methods, sometimes hard to understand and implement, yet those methods are often based on the concepts illustrated here, hence I suggest you master them before advancing further.

I hope you found this post useful! If that’s the case, why don’t you…
1. clap 👏
2. follow me ❤️
3. check my GitHub page 💻
4. add me on LinkedIn! 🤝 (I’m always open for a chat!!!)

Credits

[1] Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction”
