To learn the basics of Reinforcement Learning (RL), Sutton & Barto’s [1] textbook example of cliff walking is an excellent start, providing an illustration of both Q-learning and SARSA. As you might know, SARSA is an acronym for State, Action, Reward, State, Action – the trajectory used to update the value functions. To preserve the analogy, Q-learning may be summarized as State, Action, Reward, State or SARS (note the second action doesn’t matter!). The difference is that SARSA is on-policy and Q-learning is off-policy. What on earth does that mean?
Maneuvering the cliff
Before going into that question, let’s first define our cliff world. Our brave agent stands near a cliff and tries to reach the other side in the minimal number of steps (the allowed moves are left, right, up, down). To model this objective, reaching the goal ends the episode and gives a reward of +10, each intermediate step costs -1, and falling off the cliff costs a whopping -100. Needless to say, walking off cliffs should be avoided.
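To make the setup concrete, here is a minimal sketch of this reward structure (the 4×12 grid layout is assumed from the standard cliff-walking example; whether falling off the cliff ends the episode or resets the agent is an implementation choice):

```python
# Illustrative sketch of the cliff world described above (assumed 4x12 grid).
N_ROWS, N_COLS = 4, 12
START, GOAL = (3, 0), (3, 11)               # bottom-left start, bottom-right goal
CLIFF = {(3, col) for col in range(1, 11)}  # the cliff lies between them

def reward(next_state):
    """Reward for stepping onto next_state."""
    if next_state == GOAL:
        return 10     # reaching the goal (and ending the episode)
    if next_state in CLIFF:
        return -100   # falling off the cliff
    return -1         # every intermediate step
```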

In the beginning, we are absolutely clueless, but we quickly learn by observation. After tumbling off the cliff a few times or roaming around in no-man’s land, we should learn how to reach the destination. With each step we observe a reward and use the temporal difference (TD) update equation. Let’s look at SARSA first:
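In standard notation – with learning rate $\alpha$ and discount factor $\gamma$, both of which the simplified discussion below drops – the SARSA update is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]$$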

In plain English: we compare our predictions to observed rewards, and use the difference to make better predictions. In plain mathematics: we measure the error $(r_{t+1} + Q_{t+1}) - Q_t$ and update our estimate for $Q_t$ with that error. Note that when $Q_t = r_{t+1} + Q_{t+1}$ for all states and actions, our algorithm has converged.
Let’s write out what the Q-values represent, because this is important for understanding. Consider the reward trajectory of a full episode. With simplified notation, we could define our Q-values as follows:
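For an episode that ends at time $T$, and ignoring discounting as above, that would amount to something like:

$$Q_t = r_{t+1} + r_{t+2} + \dots + r_T, \qquad Q_{t+1} = r_{t+2} + \dots + r_T, \qquad \dots, \qquad Q_{T-1} = r_T$$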

Of course, rewards are uncertain, so we cannot simply set our Q-values equal to a single reward trajectory. The idea is straightforward though: the Q-value predicts the value of a trajectory. When $r_T = -100$ due to falling into the cliff, all Q-values along that trajectory would be heavily affected by it.
Now, Q-learning works slightly differently from SARSA:
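The only change sits in the bootstrap term: instead of the Q-value of the action actually taken at the next state, Q-learning plugs in the best Q-value available there:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a \in A(s_{t+1})} Q(s_{t+1}, a) - Q(s_t, a_t) \big]$$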

You see why Q-learning may be called a SARS algorithm: for our update, the action we actually took at $t+1$ doesn’t matter when evaluating our action $a_t$ – $a_{t+1}$ does not appear in the equation. Maybe $a_{t+1}$ led us off the cliff in our real sample trajectory; it doesn’t matter, because we take the maximum value over the future actions rather than the value of the real one.
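To make the distinction concrete, here is a minimal tabular sketch of both update rules (a sketch with assumed names and hyperparameters, not the article’s actual implementation):

```python
import numpy as np

ALPHA, GAMMA = 0.1, 1.0   # illustrative learning rate and discount factor
Q = np.zeros((48, 4))     # 48 tiles x 4 moves, as in the cliff world

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstrap with the action actually taken at s_next."""
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next, a_next] - Q[s, a])

def q_learning_update(s, a, r, s_next):
    """Off-policy: bootstrap with the best action at s_next; a_next is never needed."""
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
```

The SARS nature of Q-learning is visible right away: the second function never asks which action was actually taken next.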
Learning by exploring
During training, both algorithms select the best action at every tile, given our current beliefs (‘greedy’ action selection):
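In symbols, at every state $s_t$ we pick:

$$a_t = \arg\max_{a \in A(s_t)} Q(s_t, a)$$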

Especially early on, these beliefs (the Q-values) might be completely wrong – if we always take the longest path because it worked once, we’d never learn there is a shortcut. Exploration is key to get better policies.
An epsilon-greedy policy mostly follows the greedy approach, but sometimes does something else. Suppose that 5% of the time ($\epsilon = 0.05$), we take a random action. When standing at the edge, that random action could just tell us to ‘step to the right’ when we really, really shouldn’t. For both our Q-learning agent and our SARSA agent, such an action would end badly.
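A minimal epsilon-greedy selection could look like the sketch below (again illustrative; the function and variable names are not taken from the article’s repository):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def epsilon_greedy_action(Q, s, epsilon=0.05, n_actions=4):
    """With probability epsilon pick a random move, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: random left/right/up/down
    return int(np.argmax(Q[s]))              # exploit: current best action at tile s
```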
It gets more interesting when standing one tile away from the cliff. We can see why our sure-footed Q-learning agent is quite comfortable moving closer to the edge, while our wobbly SARSA friend might prefer to keep a safe distance. For its update, Q-learning assumes the best possible action will be taken at the next tile; the action actually taken there doesn’t matter.
![Temporal difference update with SARSA (on-policy). The large negative reward obtained when falling down the cliff is used to update the Q(s,a) values for all preceding states and actions, using the actual trajectory. [own work by author]](https://towardsdatascience.com/wp-content/uploads/2021/06/15i_eooWI0skZxY2WJgt8dA.png)
![Temporal difference update with Q-learning (off-policy). The large negative reward obtained when falling down the cliff does not influence earlier Q-values, as the optimal trajectory is used for the updates. [own work by author]](https://towardsdatascience.com/wp-content/uploads/2021/06/1wu23e-rTianWOYi7E58D6A.png)
The result? A trained Q-learning agent often follows the optimal path, whereas the SARSA agent takes a bit of a detour. See for instance the simulated sample paths below:

Beyond the cliff (on-policy vs. off-policy)
Ok so far, but cliff walking is a stylized textbook example. A mountain goat could figure this out (they actually do so on a daily basis). What can we do with this information for actual RL problems?

It is unlikely you will use either SARSA or Q-learning for more challenging reinforcement learning problems. In both cases we need to fill a table with Q-values. For an accurate lookup table, we must observe each state-action pair multiple times to learn all $Q(s,a)$. The cliff walking problem only has 192 state-action pairs (48 × 4), but typical RL problems are way too large to capture in a lookup table.
Despite not using the actual algorithms, the principles of on-policy and off-policy learning remain the same. In short: (i) an update function that uses the actual reward trajectory to update Q-values is on-policy; (ii) an update function that includes a maximum operator is likely off-policy. Whatever exploration scheme we use, the on-/off-policy dilemma remains relevant – we dig deeper into that below. The cliff-walking problem merely illuminates the trade-offs and behaviors.
Behavior policy or target policy?
To fully comprehend the dilemma between on-policy and off-policy learning, one should grasp the distinction between behavior and target policies. When evaluating RL algorithms, it’s important to clarify which performance we actually measure.
The behavior policy is a policy that includes random exploration; the target policy does not. If we always follow the best path we learned (the target policy), there’s no reason for a detour around the cliff. If we deploy the behavior policy in reality, we should take caution. Often, we use the behavior policy to learn in a safe setting – most likely a simulation – and deploy the resulting target policy in real life. However, this is not always the case.
To illustrate: suppose we train an actual robot (rather than a virtual agent) to safely maneuver to the other side of a cliff. You don’t want your expensive robot to drive down the cliff while happily exploring. If new observations have real-life consequences, you will obviously be much more cautious while exploring. Also consider testing your trading algorithm on historical data or learning by setting it loose in the stock market with real money – very different implications.
A more common reason for careful exploration is that experiments are often computationally expensive. As Q-learning has less consistent reward patterns, its trajectories have a larger variance than SARSA’s and convergence is more challenging. In addition, the embedded maximization problem $\max_{a \in A(s_t)} Q(s_t, a)$ might take substantial effort to solve – not every action space contains only four moves.
Exploration is rarely ‘free’, even in a virtual setting.
Head-to-head
The burning question: which algorithm is better? Which one should you use? Q-learning or SARSA?
On the surface, Q-learning clearly provides a shorter path (often the optimal one) and thus a better solution, but that ignores the nuances hiding underneath. The figures below show a clear trade-off: SARSA on average receives a larger reward (fewer falls into the cliff), whereas Q-learning on average requires fewer steps (lower impact of failed explorations). Of course, this only applies to the behavior policy – the target policy obtained with Q-learning is clearly superior.

Once again, we cannot simply compare target policies and conclude that Q-learning is better. That happens to hold for $\epsilon = 0.05$, but substitute it with $\epsilon = 0.001$ and we may obtain optimal paths with both. With exploration tending to 0, SARSA and Q-learning converge to the same approach. Exploration typically decreases while learning – for instance by setting $\epsilon = 1/n$ – and with that the apparent benefit of Q-learning vanishes.
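As a trivial sketch of such a decay schedule (interpreting $n$ as an episode counter is an assumption here):

```python
def epsilon(n):
    """Decaying exploration rate: epsilon = 1/n for episode counter n >= 1."""
    return 1.0 / max(n, 1)
```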
‘It depends’ would be the most faithful answer to the question ‘which one is better?’, but not the most useful one. Let’s try to derive some rules of thumb.
Use on-policy learning (SARSA) when:
- Learning is expensive
- Computational effort is a bottleneck
- You don’t need to explore everything
- The action problem is hard to solve
- You don’t mind fine-tuning the exploration parameter
- You are willing to sacrifice some rewards for robustness
- Rewards while learning matter
- You are risk-averse
Use off-policy learning (Q-learning) when:
- Learning is cheap
- Computational effort is less relevant
- You want to explore a lot
- The action problem is easy to solve
- You don’t want to bother with fine-tuning exploration
- Rewards while learning don’t matter
- Only performance of target policy is important
- You are risk-seeking
Again, both can achieve the same solution quality, proper tuning is essential, it depends on the problem and purpose, there is no one-size-fits-all answer, etc. etc. They are rules of thumb, after all. The best way to figure out which approach you need? Go stand at the edge of the cliff, and decide how it feels.
Want to see the full code? Check it out on my GitHub repository.
Interested in solving the cliff walking problem with Deep Q-learning? Please check out my article with a TensorFlow implementation:
Looking for a policy gradient implementation instead?
Cliff-Walking Problem With The Discrete Policy Gradient Algorithm
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.