An intro to understanding Double Q-Learning

Update: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com
Q-Learning (Watkins, 1989) is considered one of the breakthrough algorithms in temporal-difference (TD) control for reinforcement learning.
However, in his paper Double Q-Learning, Hado van Hasselt explains how Q-Learning can perform very poorly in some stochastic environments. He points out that the poor performance is caused by a large overestimation of the action values, due to the use of max Q(s', a) in the Q-Learning update.
To remedy this problem, he proposed the Double Q-Learning method.
The Problem
Consider an MDP with four states, two of which are terminal. State A is the start state and has two actions, Right and Left. The Right action gives zero reward and lands in terminal state C. The Left action moves the agent to state B, also with zero reward.

State B has a number of actions, each of which moves the agent to the terminal state D. However (and this is important), the reward R of each action from B to D is a random value drawn from a normal distribution with mean -0.5 and variance 1.0.
The expected value of R is therefore negative (-0.5): over a large number of episodes, the average value of R is less than zero.
It follows that moving Left from A is always a bad idea. However, because some of the sampled values of R are positive, Q-Learning is tricked into believing that moving Left from A maximises the reward. In reality this is a bad decision: even if it pays off in some episodes, in the long run the expected return is negative.
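To make the setup concrete, here is a minimal sketch of this toy MDP in Python. The class name ToyMDP, the state labels and the step interface are illustrative choices of this sketch, not code from the article.

```python
import numpy as np

class ToyMDP:
    """Toy MDP: A -(Right)-> C, A -(Left)-> B -(any action)-> D.

    C and D are terminal. Every transition gives 0 reward except B -> D,
    whose reward is drawn from a normal distribution with mean -0.5, std 1.0.
    """

    def __init__(self, n_b_actions=10, seed=None):
        self.n_b_actions = n_b_actions          # number of actions available at B
        self.rng = np.random.default_rng(seed)
        self.state = 'A'

    def reset(self):
        self.state = 'A'
        return self.state

    def actions(self, state):
        # A has two actions: 0 = Right, 1 = Left. B has n_b_actions actions.
        return range(2) if state == 'A' else range(self.n_b_actions)

    def step(self, action):
        """Return (next_state, reward, done)."""
        if self.state == 'A':
            if action == 0:                      # Right: terminal state C, reward 0
                self.state = 'C'
                return 'C', 0.0, True
            self.state = 'B'                     # Left: move to B, reward 0
            return 'B', 0.0, False
        # Any action at B terminates in D with a noisy, negative-mean reward
        self.state = 'D'
        return 'D', self.rng.normal(-0.5, 1.0), True
```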
So why does Q-Learning overestimate? To answer this question, consider the following scenario. Let X1 and X2 be two random variables that represent the rewards of two actions at state B. We would like to know their expected values E(X1) and E(X2), but we don't know them, so we estimate them with incremental averages 𝝁1 and 𝝁2. These estimates are unbiased, and as the number of samples increases they get closer and closer to E(X1) and E(X2), as shown in the table below.

However, Q-Learning uses max Q(s', a), represented in the table by Max(𝝁). It can be clearly seen from the table (see the red cells) that E(Max(𝝁)) is different from Max E(X). This tells us that Max(𝝁) is not a good estimator for Max E(X): it is biased. In other words, when Q(s, a) is updated with max Q(s', a), it does not move towards the expected value of the actions at state B, which is -0.5.
This scenario gives an intuition for why Q-Learning over-estimates action values. A formal mathematical demonstration can be found in the paper.
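The bias is easy to reproduce numerically. The short simulation below is an illustration of the same idea with arbitrarily chosen sample sizes (not the table from the article): every action has a true expected reward of -0.5, yet the expected maximum of the sample averages comes out well above it.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10          # number of actions at state B
n_samples = 5           # samples per action used to form each average
n_experiments = 10_000  # Monte Carlo repetitions

max_of_means = []
for _ in range(n_experiments):
    # Sample averages (mu_i) of each action's reward: true mean -0.5, std 1.0
    means = rng.normal(-0.5, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    max_of_means.append(means.max())

print("max of true expected values:", -0.5)
print("E[max of sample averages]  :", np.mean(max_of_means))
# The second line is typically well above -0.5 (even positive), illustrating
# the overestimation caused by taking a max over noisy estimates.
```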
The Solution
The proposed solution is to maintain two Q-value functions, QA and QB, each of which is updated using the other for the value of the next state. The update consists of finding the action a* that maximises QA in the next state (i.e. QA(s', a*) = max_a QA(s', a)), then using the value QB(s', a*) to update QA(s, a).
The pseudo code below shows how the algorithm behaves. Note that there is Python code at the end of the article that compares the two methods. Line 3 of the algorithm shows how to choose an action from the two Q-value functions. For example, it is possible to merge the two Q functions (e.g. average the values for each action) and then apply epsilon-greedy.
At each step the algorithm randomly chooses, with equal probability, whether to update QA or QB.
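As a rough Python sketch of this update rule (not the article's pseudo code, nor its final script), one tabular Double Q-Learning step could look like the function below; the table layout, the argument names and the hyper-parameters are assumptions of this sketch.

```python
import random
from collections import defaultdict

def double_q_step(qa, qb, s, a, r, s_next, next_actions, alpha=0.1, gamma=1.0):
    """One Double Q-Learning update. qa and qb map (state, action) -> value."""
    # Randomly pick which table to update (equal probability for QA and QB)
    if random.random() < 0.5:
        q_update, q_eval = qa, qb
    else:
        q_update, q_eval = qb, qa

    if next_actions:  # non-terminal next state
        # Greedy action a* according to the table being updated...
        a_star = max(next_actions, key=lambda act: q_update[(s_next, act)])
        # ...evaluated with the *other* table
        target = r + gamma * q_eval[(s_next, a_star)]
    else:             # terminal next state
        target = r

    q_update[(s, a)] += alpha * (target - q_update[(s, a)])

# Example table initialisation:
qa, qb = defaultdict(float), defaultdict(float)
```

Behaviour actions can then be chosen epsilon-greedily with respect to, for example, the average of qa and qb, as mentioned above.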

The charts below show a comparison between Double Q-Learning and Q-Learning when the number of actions at state B is 10 and 100, respectively. It is clear that Double Q-Learning converges faster than Q-Learning. Notice that as the number of actions at B increases, Q-Learning needs far more training than Double Q-Learning.


Why Does It Work?
Van Hasselt proves in his paper that E[QB(s', a*)] ≤ max_a QA(s', a), where a* is the action that maximises QA in state s'. Because QB is an independent estimate, it is not inflated by the same noise that made a* look best under QA, so over a sufficient number of experiments the value used in the update is, on average, not a maximised (overestimated) value, and QA(s, a) is not systematically pushed upwards.
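This can also be checked numerically. Extending the earlier bias simulation (again an illustration, not code from the paper), we can draw two independent sets of sample averages, pick the best action according to the first set and evaluate it with the second; the resulting double estimate no longer overshoots the true maximum of -0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples, n_experiments = 10, 5, 10_000

single, double = [], []
for _ in range(n_experiments):
    # Two independent sets of sample averages for the same actions (true mean -0.5)
    mu_a = rng.normal(-0.5, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    mu_b = rng.normal(-0.5, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    single.append(mu_a.max())              # single estimator: max over the first set
    double.append(mu_b[np.argmax(mu_a)])   # double estimator: evaluate A's argmax with B

print("true maximum action value :", -0.5)
print("single estimator (biased) :", np.mean(single))  # noticeably above -0.5
print("double estimator          :", np.mean(double))  # close to -0.5
```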
The table below shows the evolution of the Q-values of the Left action at state A as the number of episodes increases. Notice that in Q-Learning, Q(A, Left) is positive because it is pulled up by the positive rewards sampled at state B. Because of this positive value, the algorithm keeps favouring the Left action in the hope of maximising the reward; as you can see, the percentage of Left actions keeps increasing until around the 50th episode. In Double Q-Learning, Q1(A, Left) and Q2(A, Left) start out slightly negative. As a result, the percentage of Left actions starts decreasing very early, which saves training time.

Conclusion
The paper shows that Double Q-Learning might underestimate the action values at times, but it avoids the overestimation bias that Q-Learning suffers from. It also shows that on this type of problem Double Q-Learning reaches good performance levels much more quickly.
Python Code
The following Python code simulates both Double Q-Learning and Q-Learning and outputs a table and a comparative graph for the two methods.