Intro to reinforcement learning: temporal difference learning, SARSA vs. Q-learning

Gentle explanation and implementation of SARSA and Q-learning in the context of CartPole game.

Viet Hoang Tran Duong
Towards Data Science


Reinforcement learning (RL) is surely a rising field, hugely influenced by the performance of AlphaZero (one of the strongest chess engines to date). RL is a subfield of machine learning that teaches agents to perform actions in an environment to maximize rewards over time.

Among RL’s model-free methods is temporal difference (TD) learning, with SARSA and Q-learning (QL) being two of the most used algorithms. I chose to explore SARSA and QL to highlight a subtle difference between on-policy and off-policy learning, which we will discuss later in the post.

This post assumes you have basic knowledge of the agent, environment, action, and rewards within RL's scope. A brief introduction can be found here.

The outline of this post includes:

  • Temporal difference learning (TD learning)
  • Parameters
  • QL & SARSA
  • Comparison
  • Implementation
  • Conclusion

We will compare these two algorithms via the CartPole game implementation. This post’s code can be found here: QL code, SARSA code, and the fully functioning code. (The fully functioning code has both algorithms implemented and trained on the CartPole game.)

The TD learning section will be a bit mathematical, but feel free to skim through it and jump directly to QL and SARSA.

Temporal Difference Learning (TD Learning)

One of the problems with the environment is that rewards are usually not immediately observable. For example, in tic-tac-toe and similar games, we only know the reward at the final move (terminal state). All other moves have an immediate reward of 0.

TD learning is an unsupervised technique to predict a variable's expected value in a sequence of states. TD uses a mathematical trick to replace complex reasoning about the future with a simple learning procedure that can produce the same results. Instead of calculating the total future reward, TD tries to predict the combination of immediate reward and its own reward prediction at the next moment in time. (more info can be found here)

Mathematically, the key concept of TD learning is the discounted return:

V*_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …

where the value at time t combines the discounted rewards in the future, implying that future rewards are valued less. The TD error is the difference between the ultimate correct value (V*_t) and our current prediction (V_t): TD error = V*_t − V_t.

And similar to other optimization methods, the current value is updated by its old value + learning_rate * error:

V_t ← V_t + α·(V*_t − V_t)

Alright, that’s enough math for the day. The equations above are the core idea for TD learning, which will help us understand QL and SARSA codes.
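To make that update concrete before moving on, here is a minimal sketch of a single TD(0) update in Python. The names V, alpha, and gamma are illustrative choices for this sketch, not taken from the post’s notebooks:

```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) update of a value table V (e.g., a dict mapping state -> estimated value)."""
    td_target = reward + gamma * V[next_state]  # immediate reward + discounted future prediction
    td_error = td_target - V[state]             # how far off the current estimate is
    V[state] += alpha * td_error                # nudge the estimate toward the target
```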

Parameters

Alpha (α): the learning rate. This parameter shows how much we should adjust our estimates based on the error. The learning rate is between 0 and 1. A large learning rate adjusts estimates aggressively and might lead to fluctuating training results that never converge. A small learning rate adjusts slowly and takes more time to converge.

Gamma (γ): the discount rate. How much we value future rewards. The discount rate is between 0 and 1. The bigger the discount rate, the more we value future rewards.

e (coming up in the next section on the “e-greedy” policy): the exploration vs. exploitation trade-off. We explore new options with probability e and stick with the current best action with probability 1-e. A larger e implies more exploration while training.
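As a rough illustration, a common set of starting values might look like the snippet below; these particular numbers are illustrative, not necessarily the ones used in my notebooks:

```python
# Illustrative starting values; tune them for your own environment.
alpha = 0.1            # learning rate: how strongly each error adjusts the estimate
gamma = 0.99           # discount rate: how much future rewards are valued
epsilon = 1.0          # initial exploration probability for the e-greedy policy
epsilon_min = 0.01     # never stop exploring entirely
epsilon_decay = 0.995  # shrink epsilon after each game to explore less over time
```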

QL & SARSA

QL and SARSA both store an estimated value (the expected reward) for each state and action, to be updated over time. A state is usually represented by the coordinates of its components. If the environment is continuous, we would have an infinite number of states. To counter this problem, we discretize the state space by segmenting it into buckets.

The way I did this is by splitting the continuous space into grids and writing a separate function to discretize the observations (4 variables, each split into 10 boxes → 10⁴ boxes in my Q table). For more rigorous applications, you can segment them into more boxes (say, 100 boxes instead of 10). The more boxes you have, the more detail your model can learn, but the longer it takes to train and the more memory it needs.

Be cautious about the number of boxes you use: too few and you can end up with an underperforming model; too many and you will need a lot of memory. I also recommend limiting the region of the game. For example, in the CartPole game in OpenAI’s gym, I ignore the white space above the image and the area near the boundary to limit my state space. These are design choices you have to make while building the models.
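As a sketch of what this could look like for CartPole (the clipping bounds and bucket count below are illustrative design choices, not my exact notebook values):

```python
import numpy as np

# Discretize CartPole's 4 continuous observations into 10 buckets each.
N_BUCKETS = 10
LOW  = np.array([-2.4, -3.0, -0.21, -3.0])   # cart position, cart velocity, pole angle, pole angular velocity
HIGH = np.array([ 2.4,  3.0,  0.21,  3.0])   # velocities are clipped at illustrative bounds

def discretize(obs):
    """Map a continuous observation to a tuple of bucket indices (a key into the Q table)."""
    ratios = (np.clip(obs, LOW, HIGH) - LOW) / (HIGH - LOW)               # scale each variable to [0, 1]
    buckets = np.minimum((ratios * N_BUCKETS).astype(int), N_BUCKETS - 1)  # bucket index 0..9
    return tuple(buckets)

# Q table: 10 x 10 x 10 x 10 states, 2 actions (push left, push right)
Q = np.zeros((N_BUCKETS,) * 4 + (2,))
```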

As they need to memorize the state space, QL and SARSA work best for games with constrained, limited actions (like CartPole, moving left and right) and not well on more complex ones (like chess, with many possible moves). To avoid storing the entire state space for games with a larger state space, we can use a Deep Q-Network (DQN). This approach combines reinforcement learning with neural networks. We shall talk more about the DQN in a later post. For now, back to QL and SARSA.

Quick definition: a “policy” is a strategy that an agent uses to pursue a goal. Greedy (always choosing the best known value) is a policy. The greedy algorithm can get us stuck in a local optimum. Hence, we have “e-greedy,” a policy that explores with probability e and follows the current best path with probability 1-e. e-greedy is applied to balance the exploration and exploitation of reinforcement learning (learn more about exploring vs. exploiting here). In this implementation, we use e-greedy as the policy.
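As a minimal sketch, e-greedy action selection over a discretized Q table could look like this (it assumes a NumPy Q table and bucket-tuple states like the discretize sketch above; epsilon is the e from the parameters section):

```python
import random
import numpy as np

def choose_action(Q, state, epsilon):
    """e-greedy policy: explore with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[-1])   # explore: a random action
    return int(np.argmax(Q[state]))            # exploit: the action with the highest Q value
```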

Back to QL & SARSA: they are both updated using the TD learning formula above, with a slight difference (Q is the table that stores the value for each discretized state and action, s represents the state, and a represents the action):

QL

Math (equation 6.8 in “Reinforcement Learning: An Introduction”):

Q(s, a) ← Q(s, a) + α·[r + γ·max_a_ Q(s_, a_) − Q(s, a)]

Code:
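(The notebook linked above has the full version; below is a minimal sketch of the update, assuming a NumPy Q table indexed by the bucket tuples from a discretize helper like the one sketched earlier.)

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning (off-policy) update for a tuple-indexed Q table."""
    # The target uses the best possible value of the next state, ignoring the policy.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state + (action,)] += alpha * (td_target - Q[state + (action,)])
```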

SARSA

Math (equation 6.7 in “Reinforcement Learning: An Introduction”):

Q(s, a) ← Q(s, a) + α·[r + γ·Q(s_, a_) − Q(s, a)]

Code (e-greedy policy):
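(Again, the full notebook is linked above; this is a minimal sketch under the same assumptions, where next_action is the action the e-greedy policy actually selected for the next state.)

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    """One SARSA (on-policy) update for a tuple-indexed Q table."""
    # The target uses the value of the action the e-greedy policy actually chose next.
    td_target = reward + gamma * Q[next_state + (next_action,)]
    Q[state + (action,)] += alpha * (td_target - Q[state + (action,)])
```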

The Difference

The difference is very subtle. For QL, the off-policy algorithm, when passing the value of the next state (s_) back to the current state, it takes the maximum possible reward of that new state and ignores whatever policy we are using. For SARSA, which is on-policy, we still follow the policy (e-greedy) to pick the next action (a_), and pass the reward corresponding to that exact a_ back to the previous step.

To reiterate, QL considers the best possible case if you get to the next state, while SARSA considers the reward if we follow the current policy at the next state. Hence, if our policy is greedy, SARSA and QL will be the same. But we are using e-greedy here, so there is a slight difference.

Comparison

QL and SARSA are both excellent initial approaches for reinforcement learning problems. A few key notes on when to use QL or SARSA:

  • Both approaches work in a finite environment (or a discretized continuous environment)
  • QL directly learns the optimal policy while SARSA learns a “near” optimal policy. QL is a more aggressive agent, while SARSA is more conservative. An example is walking near the cliff. QL will take the shortest path because it is optimal (with the risk of falling), while SARSA will take the longer, safer route (to avoid unexpected falling).
  • In practice, if you are iterating quickly in an environment where mistakes are cheap, QL should be your choice. However, if mistakes are costly (for example robots, where even a small unexpected failure matters), then SARSA is the better option.
  • If your state space is too large, try exploring the Deep Q-Network. I will hopefully write up a post about this topic soon. Stay tuned!

Implementation

For the CartPole game, OpenAI’s gym has a prebuilt environment. A few basic gym calls are listed below (learn more about OpenAI gym here):
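This is a minimal sketch assuming the classic gym API (version 0.25 or earlier, where reset returns just the observation and step returns four values):

```python
import gym

env = gym.make("CartPole-v0")            # create the prebuilt CartPole environment
state = env.reset()                      # start a new game; returns the initial observation
done = False
while not done:
    action = env.action_space.sample()   # here: a random action (0 = push left, 1 = push right)
    state, reward, done, info = env.step(action)  # apply the action and observe the result
    # env.render()                       # uncomment to watch the game (works locally, not in Colab)
env.close()
```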

Next, as QL and SARSA work best in a discrete state space and the CartPole game is continuous, we will discretize it into smaller bins. The more bins, the better the performance: more bins help the model account for more specific states, leading to better overall play. However, more bins also require training on more games, costing computational power. If time, computational power, or storage space are your constraints, stay with a small number of bins; otherwise, you are welcome to try a larger number. Either way, start with a small number of bins to check the performance before scaling up.

Below are a few graphs showing the performance of my algorithms. Note that in OpenAI’s gym CartPole game, the maximum number of steps you can reach is 200 (the game terminates itself at that point).

QL training:

SARSA training:

QL testing:

SARSA testing:

The training graphs show the performance of the agent over many games. The x-axis indicates the number of games trained on, and the y-axis shows the maximum number of steps the agent survived (capped at 200 due to OpenAI gym’s setup).

The testing graphs show the performance in the implementation phase (after training is finished). The histograms show the distribution of outcomes for each model over 1,000 games.

The results are as expected: SARSA (usually) plays it safer than QL, so it may take less dramatic actions during the game, leading to better performance. That is one possible explanation for SARSA outperforming QL here.

Combine models? Why not?

Random thought: if neither model works great on its own (in my case SARSA already did well, but suppose it had not), we could try combining both models.

The decision will be based on “beta * QL + (1-beta) * SARSA”.

In some cases, it could help performance if you tune the “beta” well. Tuning beta means varying it to see which value gives the best results; it is more of an art than a science. You can loop through multiple values of beta and see which yields the highest score.
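A sketch of what that blended decision could look like, given two trained Q tables; the Q_ql and Q_sarsa names and the tuple-indexed layout are assumptions of this sketch:

```python
import numpy as np

def combined_action(Q_ql, Q_sarsa, state, beta):
    """Pick the action with the highest blended value: beta * QL + (1 - beta) * SARSA."""
    blended = beta * Q_ql[state] + (1 - beta) * Q_sarsa[state]
    return int(np.argmax(blended))

# To tune beta, loop over candidate values (e.g. 0.0, 0.1, ..., 1.0),
# play a batch of evaluation games with each, and keep the best-performing one.
```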

Our SARSA agent is already consistently successful for this use case, so there is no need for aggregation here, but it might be worth trying for other games! Let me know in the comments if this combined version works better than the individual models in any other games!

Conclusion

This was a quick introduction to and comparison of QL and SARSA. I hope it helps you understand SARSA and QL better and see the differences between on-policy and off-policy learning.

As promised, here is the code for all the algorithms and agents (with existing Q tables). If you want to see a video of the agent playing the game, run it locally (for example, in VS Code) instead of Google Colab.

Feel free to leave a response with any thoughts or questions and give my blog a follow if you enjoyed this post and want to see more! You can also find me on LinkedIn. Enjoy learning!

References:
1. Ashraf, M. (2018, December 3). Reinforcement Learning Demystified: Exploration vs. Exploitation in Multi-armed Bandit setting. Retrieved from https://towardsdatascience.com/reinforcement-learning-demystified-exploration-vs-exploitation-in-multi-armed-bandit-setting-be950d2ee9f6
2. Dabney, W., & Kurth-Nelson, Z. (n.d.). Dopamine and temporal difference learning: A fruitful relationship between neuroscience and AI. Retrieved from https://deepmind.com/blog/article/Dopamine-and-temporal-difference-learning-A-fruitful-relationship-between-neuroscience-and-AI
3. Lee, D. (2020, April 12). Reinforcement Learning, Part 1: A Brief Introduction. Retrieved from https://medium.com/ai³-theory-practice-business/reinforcement-learning-part-1-a-brief-introduction-a53a849771cf
4. OpenAI. (n.d.). A toolkit for developing and comparing reinforcement learning algorithms. Retrieved from https://gym.openai.com/docs/
5. Stanford PDP Lab. (2015, December 16). Chapter 9: Temporal-Difference Learning. Retrieved from https://web.stanford.edu/group/pdplab/pdphandbook/handbookch10.html
6. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
7. Tabor, P. (2020, June 23). sarsa.py. Retrieved from https://github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/Fundamentals/sarsa.py
