
An example of Reinforcement Learning exam – Rationale behind the questions (part 1)

What should you expect in the exam? My experience as a teacher

Photo by Nguyen Dang Hoang Nhu on Unsplash
  • What are the most common mistakes students make on Reinforcement Learning exams?
  • Is there a way to prepare effectively for the exam?
  • What are the most important topics to study in Reinforcement Learning?
  • What are some possible questions?

In this series of articles, I will answer these questions and offer insight into how to tackle the test.

Through my teaching experience at KTH Royal Institute of Technology, I have encountered many of the problems students run into, and I will draw mainly on that experience here. Doctoral students at KTH are required to teach as part of their duties, and it is one of my favorite parts of being a doctoral student.

For those who are no longer students, the article may still be informative and offer some new insights.

This will hopefully be of help to many students.


Introduction to the exercise: Expected SARSA and On-policy vs Off-policy learning

Introduction to the exercise

Last semester (January 2021), according to my supervisor, I was quite creative in designing the exam question. In the figure above, you can find an introduction to the exercise.

Since this is a graduate course, students are expected to develop reflexivity and critical thinking skills as well as an understanding of reinforcement learning.

Thus, I designed the activity to evaluate the students’ skills in the following areas:

  1. Critical thinking
  2. Understanding of basic Reinforcement Learning concepts: on-policy, off-policy learning, and convergence
  3. Ability to adapt an algorithm to different needs

I will briefly describe how I assessed each of these three points.


First question – Critical thinking and understanding

I usually try to teach my students how to think through their work. I firmly believe that this kind of skill is essential and necessary after graduation.

Part of the first exercise of the exam was aimed at testing theoretical knowledge and the ability to evaluate algorithms. Essentially, it evaluates the student’s critical judgment in assessing a piece of work.

The image below shows the pseudo-code of a faulty SARSA algorithm.

This is the first question I asked the students:

  1. Spot all the mistakes in the algorithm. Motivate why those are mistakes, and correct them.

The algorithm contains a number of errors.

The question itself may appear straightforward, but it is not. First of all, I did not provide the total number of mistakes, which makes the exercise resemble a real situation in which you have to evaluate someone else’s work. Additionally, students need to know the basic reinforcement learning concepts in order to spot the errors (otherwise, how could you fix them?).

This question is worth 3 points (out of the exercise’s 10), and students often mistakenly assume that it contains 3 errors, each worth 1 point (the overall exam is worth 50 points, with 5 exercises, each worth 10 points).

Unfortunately, this is a naive way of thinking, and students should avoid thinking in those terms. A mistake may be worth more than 1 point, or less than 1 point.

In total, there are 6 mistakes, each worth half a point.

Nevertheless, the mistakes don’t all share the same importance (in my opinion), and I may use a different approach in the future.

But what are the mistakes?

Most of the mistakes require the student to pay attention to the details, while the rest are simple mathematical errors (a minus sign instead of a plus sign, and so on).

The first mistake is about the learning rate
  • A first mistake involves the Robbins-Monro conditions, one of the cornerstones of reinforcement learning. Stochastic approximation algorithms are used to learn from data in reinforcement learning. The following two conditions are necessary for the convergence of stochastic approximation schemes: ∑ α(t) = ∞ and ∑ α²(t)<∞.
  • In the exercise, the latter requirement is not satisfied. One can simply choose a different learning rate, such as 1/n (a quick numerical comparison of a few schedules is sketched below).
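As a quick, purely illustrative check (finite partial sums can only hint at the limiting behavior), one can compare a few common learning-rate schedules numerically. The snippet below is my own sketch and not part of the exam:

```python
import numpy as np

# Robbins-Monro conditions on the learning rate alpha(t):
#   sum_t alpha(t) = infinity   and   sum_t alpha(t)^2 < infinity
T = 1_000_000
t = np.arange(1, T + 1)

schedules = {
    "1/n": 1.0 / t,                    # satisfies both conditions
    "constant 0.1": np.full(T, 0.1),   # sum of squares grows without bound -> second condition fails
    "1/n^2": 1.0 / t**2,               # the sum itself converges -> first condition fails
}

for name, alpha in schedules.items():
    print(f"{name:>12}:  sum = {alpha.sum():12.2f}   sum of squares = {np.sum(alpha**2):10.4f}")
```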
Three mistakes are about the target value
  • A second major mistake is the computation of the target value y(t). SARSA is an on-policy learning method.
  • This means that the target value should be computed according to the action taken by the behavior policy (in other words, if you took action x, you should use the same action x to compute the target value), and not using the max operator as in Q-learning!
  • Another mistake is the missing discount factor!
  • The correct answer is y(t) = r + λ Q(s(t+1), a(t+1)).
  • A final, really minor mistake is that when the episode is terminal (i.e., we have reached the final state), the target value should equal the reward only (y(t) = r).
Two mistakes are about the update of the Q values
  • Last, but not least, the computation of the Q values also contains some minor errors.
  • We update the value of the current state-action pair, not the successive one! It should be Q(s(t), a(t)) on the left-hand side, and not Q(s(t+1), a(t+1)).
  • Moreover, there is a minus sign in front of α that should be a plus (a corrected update step is sketched after this list).
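Putting the corrections from the lists above together, a minimal self-contained sketch of the fixed, on-policy update might look as follows. The toy chain environment, the ε-greedy behavior policy, and the variable names are my own illustration (the exam used pseudo-code, not this setup), and gamma plays the role of the discount factor λ above:

```python
import numpy as np

# Minimal corrected SARSA sketch on a toy 5-state chain (illustration only).
n_states, n_actions, gamma = 5, 2, 0.9
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    """Toy dynamics: action 1 moves right, action 0 moves left; reward 1 on reaching the last state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, float(done), done

def behavior_policy(s, eps=0.1):
    """Epsilon-greedy behavior policy."""
    return int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))

n = 0
for episode in range(500):
    s, a, done = 0, behavior_policy(0), False
    while not done:
        n += 1
        alpha = 1.0 / n                                    # schedule satisfying the Robbins-Monro conditions
        s_next, r, done = step(s, a)
        a_next = behavior_policy(s_next)                   # on-policy: next action from the same policy
        y = r if done else r + gamma * Q[s_next, a_next]   # discounted target, no max operator
        Q[s, a] = Q[s, a] + alpha * (y - Q[s, a])          # update the *current* pair, with a plus sign
        s, a = s_next, a_next

print(Q)
```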

Second and third questions: critical thinking, understanding, and adaptation

Second and third questions of the exercise

The second and third questions in the exercise gave me a bit of freedom to experiment with the algorithm.

They were intended to test the students’ ability to adapt to unexpected situations.

First, I changed the way the target value is computed. To shake things up a bit, I introduced a new element, the policy π. This lets me assess the students’ understanding of a basic reinforcement learning distinction: off-policy vs. on-policy learning.

Per se, the questions are not difficult. Nevertheless, the introduction of unexpected elements in an exam can have a huge mental impact on students. The majority of them may feel nervous as a result of this change, which may affect their performance.

However, as long as the change does not require complex answers, I feel that the students should be able to handle it.

You can answer question 2 in one line:

  • In question (2) this policy π is like a free parameter. To answer this question, the student needs to understand the difference between on-policy and off-policy learning.
  • In addition, the notation may be intimidating for the student. Not all students can handle mathematical notation at this level, despite attending an engineering school. In my opinion, this reflects a lack of mathematical training, a result of the industry’s need for data scientists/data engineers who can tackle problems that do not require mathematical modeling.
  • The answer to question (2) is plain and simple: π should simply be the behavior policy, i.e., the policy you use to take an action (therefore, the policy μ).

Question 3 is a bit harder, but not by much, and requires a 1–2 line answer.

  • Off-policy algorithms, given that the behavior policy samples all state-action pairs infinitely often, will learn the Q-values of the policy used in the target value computation.
  • Therefore, given a sufficiently exploring behavior policy, the Q values will converge to the Q values of the policy π, and not those of the policy μ (the two kinds of target are sketched below).
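Since the exact formula from the exam is not reproduced here, the following is only a sketch of the idea behind questions 2 and 3, with hypothetical helper names: the target is computed under a policy π that may or may not coincide with the behavior policy μ.

```python
import numpy as np

def sarsa_target(Q, s_next, a_next, r, gamma, done):
    """On-policy target: a_next is the action actually chosen by the behavior policy mu."""
    return r if done else r + gamma * Q[s_next, a_next]

def target_under_pi(Q, s_next, r, gamma, done, pi):
    """Target computed under an arbitrary policy pi (Expected-SARSA style):
    r + gamma * sum_a pi(a|s') Q(s', a).
    If pi equals the behavior policy mu, the scheme is on-policy; if pi differs
    from mu, the Q-values converge to those of pi, provided mu keeps visiting
    every state-action pair."""
    return r if done else r + gamma * float(np.dot(pi[s_next], Q[s_next]))
```

When pi[s_next] puts all of its probability on the action μ would take, the two targets coincide and we are back to plain on-policy SARSA.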

Conclusions and next articles

Photo by Angelina Litvin on Unsplash

This is the first article of a series where I will describe some of the most common questions you can find in Reinforcement Learning tests.

In this article, I showed some simple but tricky questions that I proposed in the last exam. Over the next few articles, I will show more exercises and discuss other reinforcement learning problems in more detail.

I will primarily discuss methodology and the theoretical aspects of reinforcement learning, focusing on what teachers expect from students.

I hope this article has inspired you in the way you solve exercises, so that it may be helpful for your upcoming exam or future studies!

Thank you for reading!

