Introduction to the Deadly Triad Issue of Reinforcement Learning

Quentin Delfosse
Towards Data Science
6 min read · Nov 3, 2020


Image by Author

When I learned about Deep Reinforcement Learning, I read about the deadly triad issue, but couldn’t find any satisfying simple explanation (apart from scientific papers). I have thus written two articles to explain, with intuition (and thus little math), first what it is, and then how it is handled.

This issue states that when one tries to combine TD-learning (or bootstrapping), off-policy learning, and function approximation (such as deep neural networks), instabilities and divergence may arise.

Reinforcement Learning Problem

Here, I assume you are familiar with the typical formulation of a Reinforcement Learning problem. If not, please have a look at one (or more) of these sources:

The video by Arxiv Insights, very clear and complete
The amazing lectures of David Silver (1.5h each)

I will be using the usual notations.

Learning the Value function

The way we want to tackle our RL problem here is by learning the value function. In particular, we are interested in the Q-value function, which, for any state we are in and any action we choose, gives us the return (i.e. how good the action is).

The Q-value function does not give the reward obtained right after performing this action, but the (potentially discounted) sum of all the rewards you will obtain if, after performing this action, you follow your policy. This sum is called the return:

Equation of the return: G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k≥0} γᵏ·R_{t+k+1}

Using stochastic ε-greedy policies

If the agent wants to learn how good an action is, he first has to try it out. This will lead him to another state, where he can again choose between different actions. The idea is to:

  1. Explore: try many actions
  2. Exploit: use the highly rewarded actions more often, to end up with the best possible strategy.

Hence, a classical approach is to first try out completely random actions, so that you learn, for each state, which actions are good and which are bad, and then progressively prioritize the actions that give you the highest rewards.

The policy here is thus (as most of the time in reinforcement learning) stochastic, which means that it acts with some randomness.
Be careful not to confuse stochastic with uniformly random: I am not saying that the agent selects its actions completely at random. It tries to assign a higher probability to good actions than to bad ones, but it still has to explore, i.e. test actions whose outcome it does not know, and thus cannot follow a completely deterministic policy. A classical choice is the ε-greedy policy: with probability ε the agent picks a random action, and otherwise it picks the action with the highest Q-value.
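As a rough Python sketch of such an ε-greedy selection (assuming the Q-values are stored in a dictionary, e.g. a defaultdict, keyed by (state, action) pairs; the names here are purely illustrative):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

The parameter ε is often decayed over time, so that the agent explores a lot at the beginning and exploits more and more as its Q-values become reliable.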

There are different ways to learn this function, based on the Bellman Equations, such as Dynamic Programming, Monte Carlo Methods, and Bootstrapping.

Dynamic Programming

If you have an exact model of the environment (i.e. what reward and state you will obtain by performing a specific action in a specific state), you can use dynamic programming to compute the values of your states and actions.
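As a tiny sketch of this idea (assuming a toy deterministic model stored as a dictionary model[(s, a)] = (reward, next_state), with made-up names and a discount factor gamma; this is an illustration, not tied to any library), Q-value iteration could look like:

def q_value_iteration(states, actions, model, gamma=0.9, n_sweeps=100):
    """Compute Q-values by repeatedly applying the Bellman optimality backup.

    model[(s, a)] is assumed to return (reward, next_state) for a
    deterministic toy environment.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_sweeps):
        for s in states:
            for a in actions:
                reward, s_next = model[(s, a)]
                Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in actions)
    return Q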

However, for environments with a lot of states and actions, or stochastic ones, getting the model of the environment is very expensive. People thus often prefer to use Monte Carlo techniques (as DeepMind did for AlphaGo).

Monte Carlo

With Monte Carlo, the agent updates the value of the action he considers by selecting it and then following his policy until the episode terminates (i.e. it reaches a final state). He can then sum up all the rewards obtained through the episode and get a sample of the return. Repeating this process sufficiently many times eventually makes the estimate converge to the true Q-value of the state/action pair.

This is also very costly if episodes can be very long, as you have to wait for the episode to finish in order to obtain your return.
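To make this concrete, here is a minimal every-visit Monte Carlo sketch. It assumes an episode is already available as a list of (state, action, reward) tuples; the episode format and the Q/counts dictionaries are illustrative choices, not a fixed API:

from collections import defaultdict

Q = defaultdict(float)       # Q-value estimates
counts = defaultdict(int)    # number of returns observed per (state, action)

def mc_update(Q, counts, episode, gamma=0.99):
    """Every-visit Monte Carlo: move Q towards the returns observed in one episode."""
    G = 0.0
    # Walk the episode backwards so the return can be accumulated incrementally.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        counts[(state, action)] += 1
        # Incremental mean: nudge the estimate towards the sampled return.
        Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]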

Temporal-Difference Learning

To find the correct Q-value, we can observe that it is supposed to converge to the return (the discounted sum of rewards) for every state/action pair. The return for a state/action pair is:

  1. The reward obtained for this action (R)
  2. The (discounted) return of the next state/action pair, obtained after selecting the next action in the next state, which is approximated by its Q-value
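In symbols (using γ for the usual discount factor), this decomposition simply reads: return from (S, a) = R + γ × return from (S’, a’), and the second term is approximated by Q(S’, a’).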

Why should the agent wait for the episode to finish? He could do the following:

  1. From state S, pick an action (a) according to the Q-value (in an ε-greedy manner)
  2. Observe the Reward (R) and new State (S’)
  3. Pick an action again (a’), and observe its Q-value
  4. Use the Q-value of this new state and action as an approximation of the remaining return: Q(S, a) ← Q(S, a) + α [ R + γ Q(S’, a’) − Q(S, a) ]

Here, the value is only progressively updated, with a step size α, for stability reasons.

This is a description of the SARSA algorithm. It uses bootstrapping, which consists in using an estimate of the next value to estimate the current one (instead of waiting for the whole episode).
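A minimal tabular sketch of this SARSA update (again assuming Q is a dictionary keyed by (state, action) pairs, with alpha and gamma as the step size and discount factor; the names are illustrative):

def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    """One SARSA step: bootstrap on the Q-value of the action actually taken next."""
    td_target = reward + gamma * Q[(next_state, next_action)]
    td_error = td_target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error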

The next image illustrates how these three techniques work:

MC, TD learning and DP schemes
David Silver’s RL course lecture 4: “Model-Free Prediction”

Mixtures of TD-learning and Monte Carlo exist; they are grouped in the TD(λ) family.

Q-Learning

SARSA is an on-policy learning technique, which means it follows its own policy to learn the value function. Another idea would be to directly use the maximum Q-value of the next state to compute the return.
This is called Q-Learning and works as follows:

  1. From state S, pick an action (a) according to the Q-value (in an ε-greedy manner)
  2. Observe the Reward (R) and new State (S’)
  3. Use the maximum Q-value of the next state to update the current Q-value: Q(S, a) ← Q(S, a) + α [ R + γ max_a’ Q(S’, a’) − Q(S, a) ]

The agent thus uses the best possible Q-value, even if he then does not take the corresponding action. This is coherent, as he will eventually end up always choosing the best action. It can thus help speed up the learning process.

This is called off-policy learning, as you no longer use the policy you are currently following to update your Q-value function (but a greedy one instead).
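Compared with the SARSA sketch above, only the target changes: it bootstraps on the greedy (maximum) Q-value of the next state, whatever action the agent actually takes next (the (state, action) dictionary and the list of actions are, again, illustrative assumptions):

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One Q-learning step: bootstrap on the best Q-value of the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error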

Off-policy learning presents advantages and drawbacks; you can look them up here.

Function approximations (Neural Networks)

When the numbers of states and actions are too high, or when evolving in a continuous environment (which, on a computer, amounts to the same thing), one cannot store a value for every possible state or state/action pair.

The idea is thus to use function approximators, such as neural networks. One would then feed the neural network with the state and obtain the Q-value of each action as output. This is tractable, and we can then use gradient descent techniques to turn this neural net into a good approximation of the Q-value function, telling us which button (or combination of buttons) to press for each image.

Image by the Author

A clear advantage of using approximators is that you get similar values for similar states (which seems consistent and useful), but we will see in the next post that this is not always the case.
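As a rough sketch of what this looks like in practice (here in PyTorch, with made-up state and action dimensions, a single transition, and no replay buffer or target network, so this is an illustration rather than a full DQN):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_step(state, action, reward, next_state, gamma=0.99):
    """One semi-gradient Q-learning step on the network parameters."""
    q_value = q_net(state)[action]
    with torch.no_grad():  # the bootstrapped target is treated as a constant
        target = reward + gamma * q_net(next_state).max()
    loss = (q_value - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note that this little sketch already combines the three ingredients discussed in this article: a bootstrapped target, a greedy (off-policy) maximum inside that target, and a function approximator, which is exactly the combination the deadly triad refers to.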

The Deadly Triad Issue

The deadly triad issue is the fact that combining bootstrapping, off-policy learning, and function approximation often results in instabilities or a lack of convergence. In the next post, I explain why it happens and how it is handled.

