
Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns an optimal policy, mapping states to actions, by interacting with an environment. In this article, I will introduce two of the most commonly used RL algorithms: Q-Learning and SARSA.
Like Monte Carlo (MC) methods, Q-Learning and SARSA are model-free RL algorithms: they do not use the transition probability distribution of the underlying Markov Decision Process (MDP), but instead learn the optimal policy from experience. The main difference is that MC must sample a whole trajectory before it can update the value function and improve the policy. For some problems, collecting a whole trajectory is time-consuming, so it is useful if the algorithm can update the policy after every action rather than only at the end of an episode.
In Q-Learning and SARSA, we only need a one-step transition (s, a, r, s′) instead of the whole trajectory. Moreover, these two algorithms also highlight the difference between on-policy and off-policy learning, which I will discuss later in this article.
Q-Learning
The update rule in Q-Learning is as follows:

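In the notation of Sutton & Barto [1], with learning rate α and discount factor γ, the rule can be written as:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$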
The difference between the new sample and the old estimate (the temporal-difference error) is used to nudge the old estimate towards the target.
![Figure 1: Q-Learning - An off-policy learning algorithm [1]](https://towardsdatascience.com/wp-content/uploads/2021/06/1AligqHeKpZQfeks7KErqIA.png)
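As a minimal sketch of this rule, assuming a tabular setting with a NumPy array `Q` of shape `[n_states, n_actions]` (the variable names here are illustrative, not from the original post), one update step looks like this:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-Learning update on a tabular Q (NumPy array of shape [n_states, n_actions])."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the best action in s'
    td_error = td_target - Q[s, a]              # difference between new sample and old estimate
    Q[s, a] += alpha * td_error                 # move the old estimate towards the target
    return Q
```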
A Step-by-step Example
Suppose an environment of rooms modeled as an MDP. The rooms are numbered 0 through 6 and are connected by doors (arrows in the figure below). The objective is for the agent to reach room 5. Note that only rooms 2, 4 and 6 lead directly into room 5 (the destination).

Goal: Place the agent in any room and, from that room, reach room 5. Reward: Doors that lead directly to the goal give an instant reward of 100; doors not directly connected to the target room give a reward of 0.
This tutorial introduces the concepts of Q-learning through a simple, easy-to-understand example. It describes how an agent learns from an unknown environment using reinforcement learning.


According to the Q-table above, we can select actions greedily based on the maximum Q-value. For example, if the agent is in room 1, it has two routes that lead to room 5: it can move to either room 4 or room 6, both of which have the same maximum Q-value of 80. From either room, the agent then moves to room 5, which has the maximum Q-value of 100. Similarly, if the agent is in room 3, it can move to either room 1 or room 4, which share the same maximum Q-value.
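To illustrate this greedy read-out of a learned Q-table (the function and variable names below are hypothetical placeholders, and actions are assumed to be indexed by the room the agent moves to), one could follow the maximum Q-value from any starting room:

```python
import numpy as np

def greedy_path(Q, start, goal, max_steps=20):
    """Follow the action with the maximum Q-value from `start` until `goal` is reached.
    Assumes the action index corresponds to the room the agent moves into."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        state = int(np.argmax(Q[state]))  # pick the neighbouring room with the highest Q-value
        path.append(state)
    return path
```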
SARSA
The update rule in SARSA is as follows:

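Again in the notation of [1], with the next action a′ chosen by the current policy:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma\, Q(s', a') - Q(s, a) \,\right]$$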
The update rule is similar to Q-Learning, except that the bootstrap target uses the Q-value of the action actually taken in the next state rather than the maximum over all actions.
![Figure 3: SARSA - an on-policy learning algorithm [1]](https://towardsdatascience.com/wp-content/uploads/2021/06/1mHNdrdmeMe_EUVALDTr3aw.png)
The ε-greedy exploration in the algorithm means that, with probability ε, the agent takes a random action (and otherwise acts greedily with respect to the current Q-values). This increases exploration; without it, the agent may get stuck in a local optimum.
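A minimal sketch of ε-greedy action selection, assuming a tabular `Q` and NumPy (names here are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)  # explore
    return int(np.argmax(Q[state]))          # exploit
```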
SARSA is on-policy: it updates the Q-table with (S, A, R, S′, A′) samples generated by the current policy, where (S′, A′) are the next state and next action in the transition. After reaching S′, the agent takes action A′ and uses Q(S′, A′) to update the Q-value. Q-Learning, in contrast, is off-policy: it uses the maximum possible Q-value in state S′ to update the current Q-value. However, the action with the maximum Q-value may not be the action the agent actually takes next, because with probability ε it takes a random action. In other words, the action used in the Q-Learning update can differ from the action the agent actually takes.
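The distinction shows up in a single line of the update: the target either bootstraps from the greedy action (Q-Learning) or from the action the policy actually takes next (SARSA). A sketch, using a dummy Q-table and a hypothetical one-step sample purely for illustration:

```python
import numpy as np

# Hypothetical one-step sample and Q-table, purely for illustration.
Q = np.zeros((4, 2))
s_next, a_next, r, gamma = 1, 0, 1.0, 0.9

# Off-policy (Q-Learning): the target bootstraps from the best action in s',
# even if the epsilon-greedy behaviour policy picks a different action next.
target_q_learning = r + gamma * np.max(Q[s_next])

# On-policy (SARSA): the target bootstraps from the action a' the current
# epsilon-greedy policy actually selects in s'.
target_sarsa = r + gamma * Q[s_next, a_next]
```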
We'll use the OpenAI Gym toolkit to solve the FrozenLake environment. Gym offers a wide variety of environments, from Atari 2600 games to text-based games.
The tutorial below uses the SARSA algorithm to solve FrozenLake from the gym environment.
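The original post embeds the full tutorial code; as a stand-in, here is a minimal SARSA sketch for FrozenLake, assuming the classic gym API where `env.step()` returns `(obs, reward, done, info)` and `env.reset()` returns the initial state (newer gymnasium releases differ). The environment name and hyperparameters are illustrative choices:

```python
import gym
import numpy as np

env = gym.make("FrozenLake-v1")              # environment id may differ by gym version
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))

alpha, gamma, epsilon = 0.1, 0.99, 0.1       # hyperparameters chosen for illustration

def epsilon_greedy(state):
    if np.random.random() < epsilon:
        return env.action_space.sample()     # explore
    return int(np.argmax(Q[state]))          # exploit

for episode in range(5000):
    state = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        next_action = epsilon_greedy(next_state)   # A': the action the current policy actually takes
        # SARSA update: bootstrap from Q(S', A')
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])
        state, action = next_state, next_action
```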
References
[1] Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. The MIT Press.