
- What are the most common mistakes students make on Reinforcement Learning exams?
- What are some possible questions?
In this series of articles, I will answer these questions and offer insight into how to tackle the test.
Through my teaching experience, I have come across many of the problems students run into, and I will mainly draw on those encounters.
For those who are no longer students, the article may still be an informative way to pick up new knowledge.
Check the previous articles here.
Here, I’ll cover a few more of the questions that I left out of my previous article (you can find it here).
The questions below were the most difficult ones of the entire exercise, but as we will see, they can be answered in a few lines.
The questions test students’ knowledge of probability and reinforcement learning, as well as their problem-solving skills.
The questions are the following:

First, some clarifications:
- π is the policy here, where π(a|s) denotes the probability of choosing action a given a state s.
- The target y is the TD-target, which changes if we use an on-policy or off-policy algorithm (check my previous article if this is not clear https://towardsdatascience.com/an-example-of-reinforcement-learning-exam-rationale-behind-the-questions-part-1-682d1358b571).
- The variant of SARSA (a.k.a. Expected SARSA) is an algorithm where the TD-target is computed according to the following formula (a short code sketch comparing the two targets follows these clarifications):

y' = r_t + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a)
- The bias is just a measure of the goodness of a policy. If the policy is optimal, we expect the bias term to be 0.
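
To make these clarifications concrete, here is a minimal Python sketch (the discount factor, Q-table, policy, and transition values are made up purely for illustration) that computes the SARSA target y and the Expected SARSA target y' for a single transition.

```python
import numpy as np

# Minimal sketch with made-up numbers: compare the SARSA TD-target y
# and the Expected SARSA TD-target y' on a single transition (s, a, r, s_next).

gamma = 0.9                                             # discount factor
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

Q = rng.normal(size=(n_states, n_actions))              # arbitrary Q-table
pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # fixed policy pi(a|s)

s, a, r, s_next = 1, 0, 0.5, 2
a_next = rng.choice(n_actions, p=pi[s_next])            # a_{t+1} sampled from pi(.|s_{t+1})

# SARSA target: uses the single sampled next action
y_sarsa = r + gamma * Q[s_next, a_next]

# Expected SARSA target: averages over the next action under pi
y_expected = r + gamma * np.dot(pi[s_next], Q[s_next])

print(y_sarsa, y_expected)
```

Note that y uses the one next action actually sampled, while y' averages over all next actions weighted by π; this is exactly the difference the exercise asks you to reason about.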
Solution of question (a)
Question (a) is about being comfortable with the tower rule property in probability and knowing what a TD-target is.
This is a classical problem-solving exercise where a student needs to take one step at a time. It is often a good idea to simply write out what one wishes to prove/solve.
First, in order not to get confused, let us write y for the target value of SARSA and y' for the target value of the variant of SARSA. Since Q(s,a) takes the same value in each bias term, the exercise boils down to showing that the following equality holds:

E[ y | s_t = s, a_t = a ] = E[ y' | s_t = s, a_t = a ]
Next: what are y and y'? They are simply the TD-targets, and we already know how to write them down:

y = r_t + γ Q(s_{t+1}, a_{t+1}),    y' = r_t + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a)
Therefore, since the reward r_t and the discount factor γ contribute equally to both sides, the original problem simplifies to proving the following equality:

E[ Q(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ] = E[ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) | s_t = s, a_t = a ]
This is really the first tricky point of the exercise. How can we move on from here? Don't panic.
- Note that on the left-hand side we have two random variables (the state-action pair at time t+1), whilst on the right-hand side there is only one random variable (the state at time t+1).
This is where one needs to refresh their probability toolkit. The idea is to average out the effect of the extra random variable on the left-hand side by using the tower property of conditional expectation.
Simply do the following (read from left to right):

E[ Q(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ] = E[ E[ Q(s_{t+1}, a_{t+1}) | s_{t+1}, s_t = s, a_t = a ] | s_t = s, a_t = a ] = E[ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) | s_t = s, a_t = a ]

- Explanation: consider the expression on the left-hand side and take an average over the next action, pretending that you already know the next state (this gives the expression in the middle). Then write out what that inner average is (final step on the right).
- This shows that the bias of SARSA is equal to the bias of the variant of SARSA (a.k.a. Expected SARSA).
- The tower property lets us deal with one random variable at a time, which makes our lives easier when handling the expectation of many random variables (a quick numerical check of this identity is sketched below).
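
If you prefer to see the identity rather than prove it, the following Monte Carlo sketch (the Q-table, policy, and transition kernel are random, made-up values) estimates both sides of the equality and shows that they agree up to sampling noise.

```python
import numpy as np

# Monte Carlo sketch with made-up numbers: check that averaging
# Q(s_{t+1}, a_{t+1}) over both the next state and the next action equals
# averaging sum_a pi(a|s_{t+1}) Q(s_{t+1}, a) over the next state only.

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2

Q = rng.normal(size=(n_states, n_actions))                         # arbitrary Q-table
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # fixed policy pi(a|s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # kernel P(s'|s,a)

s, a = 0, 1
n_samples = 100_000

# Left-hand side: sample s_{t+1} ~ P(.|s,a), then a_{t+1} ~ pi(.|s_{t+1})
s_next = rng.choice(n_states, size=n_samples, p=P[s, a])
a_next = np.array([rng.choice(n_actions, p=pi[sn]) for sn in s_next])
lhs = Q[s_next, a_next].mean()

# Right-hand side: sample only s_{t+1}, average over the next action analytically
rhs = (pi[s_next] * Q[s_next]).sum(axis=1).mean()

print(lhs, rhs)   # the two estimates should agree up to Monte Carlo noise
```

Notice how the right-hand side never samples the next action: π averages it out analytically, which is precisely what the tower property formalizes.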
Solution of question (b)
In the second question (b), we are asked whether the two algorithms converge to the same Q-values (you can check the questions again below).

This is an open-ended question that forces the student to reflect on the issue: the student needs to understand which hypotheses lead to a positive (or negative) answer.
An important thing to notice is that the policy π is fixed, and does not change over time. This implies that we are not really interested in learning a better policy but just learning the value of π.
How can we start?
- Firstly, we need to assume that the Robbins-Monro conditions on the step sizes are satisfied (the learning rates α_t must satisfy Σ α_t = ∞ and Σ α_t² < ∞).
- Secondly, we know that the variant of SARSA uses a behavior policy that is ϵ-greedy.
These two things imply that the variant of SARSA will converge to the Q-values of the policy π (not the optimal Q-values).
- Since the behavior policy explores each state-action pair infinitely often, the algorithm converges to the Q-values of the policy used to compute the TD-target y, that is, the Q-values of π.
What about SARSA?
- SARSA will learn the value of the state-action pairs that are visited infinitely often.
- However, it may be that some state-action pairs are never visited. For example, consider the case where the only way to reach a specific state z is to perform action a in state s. If that probability is 0 (π(a|s)=0), we will never be able to learn the Q-values in z.

- On the other hand, the variant of SARSA will learn the Q-values in z, since its behavior policy is ϵ-greedy (this guarantees that state z will be visited at some point).
- Therefore, the only remedy is to guarantee that π itself is explorative enough. This requires the following condition to be satisfied: we need a positive probability of picking every action in every state, that is, π(a|s)>0 for all state-action pairs.
- If the latter condition is satisfied, then the Q-values learned through SARSA also converge to the Q-values of the policy π (a small simulation sketch of this setting follows below).
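
To close question (b), here is a sketch of the setting just described on a small, randomly generated MDP (all names and sizes are made up for illustration): a fixed policy π with π(a|s)>0 everywhere is evaluated both by SARSA acting under π itself and by the Expected SARSA variant acting under an ϵ-greedy behavior policy while building its target from π, with Robbins-Monro step sizes α = 1/N(s,a). Under these assumptions, both Q-tables should approach the Q-values of π.

```python
import numpy as np

# Sketch on a made-up random MDP: evaluate a fixed, explorative policy pi
# with (i) SARSA acting under pi itself and (ii) the Expected SARSA variant
# acting under an eps-greedy behavior policy but building its target from pi.
# Step sizes alpha = 1/N(s,a) satisfy the Robbins-Monro conditions.

rng = np.random.default_rng(2)
nS, nA, gamma, eps = 4, 2, 0.9, 0.2

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P(s'|s,a)
R = rng.normal(size=(nS, nA))                   # deterministic reward r(s,a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # fixed target policy, pi(a|s) > 0

# Ground-truth Q^pi via iterative policy evaluation, for comparison.
Q_true = np.zeros((nS, nA))
for _ in range(2000):
    V = (pi * Q_true).sum(axis=1)
    Q_true = R + gamma * (P @ V)

Q_sarsa, N_sarsa = np.zeros((nS, nA)), np.zeros((nS, nA))
Q_exp, N_exp = np.zeros((nS, nA)), np.zeros((nS, nA))

s1 = s2 = 0
a1 = rng.choice(nA, p=pi[s1])                   # SARSA acts with pi itself
for _ in range(100_000):
    # --- SARSA (acting and bootstrapping with pi) ---
    s1_next = rng.choice(nS, p=P[s1, a1])
    a1_next = rng.choice(nA, p=pi[s1_next])
    N_sarsa[s1, a1] += 1
    alpha = 1.0 / N_sarsa[s1, a1]
    Q_sarsa[s1, a1] += alpha * (R[s1, a1] + gamma * Q_sarsa[s1_next, a1_next]
                                - Q_sarsa[s1, a1])
    s1, a1 = s1_next, a1_next

    # --- Expected SARSA variant (eps-greedy behavior, target built from pi) ---
    b = np.full(nA, eps / nA)
    b[Q_exp[s2].argmax()] += 1 - eps            # eps-greedy behavior policy
    a2 = rng.choice(nA, p=b)
    s2_next = rng.choice(nS, p=P[s2, a2])
    N_exp[s2, a2] += 1
    alpha = 1.0 / N_exp[s2, a2]
    Q_exp[s2, a2] += alpha * (R[s2, a2] + gamma * np.dot(pi[s2_next], Q_exp[s2_next])
                              - Q_exp[s2, a2])
    s2 = s2_next

# Both gaps should shrink toward zero as the run gets longer.
print(np.abs(Q_sarsa - Q_true).max(), np.abs(Q_exp - Q_true).max())
```

The 1/N(s,a) schedule is just one convenient choice satisfying the Robbins-Monro conditions; any step sizes with Σ α_t = ∞ and Σ α_t² < ∞ would do.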
Conclusions and next articles
This is the second article of a series where I describe some of the most common questions you can find in Reinforcement Learning tests.
Over the next articles, I will show more exercises and discuss other reinforcement learning problems in more detail.
I will primarily discuss methodology and the theoretical aspects of reinforcement learning, focusing on what teachers expect from students.
I hope this article has inspired you in the way you solve exercises, so that it may be helpful for your upcoming exam or future studies!
Thank you for reading!