A pair of interrelated neural networks in Deep Q-Network
In DQN and Double DQN models, comparing two interrelated neural networks is crucial.
We will follow a few steps that have been taken in the fight against correlations and overestimations in the development of the DQN and Double DQN algorithms. As an example of the DQN and Double DQN applications, we present the training results for the CartPole-v0 and CartPole-v1 environments. The last section contains some tips on PyTorch tensors.
From lookup table to neural network
Until around 2013, Deep Q-Learning was a Reinforcement Learning (RL) method that worked only with a lookup table. The success of neural networks in Computer Vision has sparked interest in trying them out in RL. The paper “Playing Atari with Deep RL” (V.Mnih et al., 2013, DeepMind) presented the first successful Deep Learning model using a neural network function approximation. In 2015, DeepMind demonstrated that the Deep Q-network agent, receiving only the row pixel data and the game score as inputs, was able to exceed the performance of all previous algorithms.
After these two DeepMind works, the lookup tables were replaced by neural networks and a Q-Learning became a Deep Q-Learning, or, equivalently, Deep Q-Network (DQN). In fact, it was a breakthrough in RL agent training.
The DQN is the algorithm that combines Q-learning with neural networks.
The DQN components
- Pair of Q-Networks (local and target).
- Experience replay — a biologically inspired mechanism.
- ε-greedy mechanism.
- Q-learning, i.e., using the max value for all possible actions.
- Minimize the loss function by gradient descend mechanism.
Correlations are harmful
Reinforcement Learning is known as unstable when a neural network is used as a function approximation. The reasons of this instability are as follows:
- the correlations in the observation sequence,
- small updates to Q-values may significantly change the policy,
- the correlations between Q-values
Q(s_t, a_t)
and the TD-target value
used in the DQN key equation, eq. (2).
Pair of Q-Networks: local and target
One of the important components of the DQN algorithm in the fight against correlations is the use of the target network (q_target
). The target network, (with parameters θ*
) is the same as the local network except that its parameters are copied every τ
steps from the local network (q_local
), so that then θ*_t = θ_t
, and kept fixed on all other steps. In looks as follows:
In CartPole it was set τ = TARGET_UPDATE = 10.
Loss function for DQN agent
Comparing two neural networks representing the same Q-table and finding the point at which these networks are very close is the basic part of the DQN algorithm.
By networks q_local and q_target the tensors current estimate Q(s_t, a_t)
and alternative estimate G_t
, see (1), are calculated. Further, in the function learn()
of the class Agent
- the distance between
Q_targets
andQ_expected
is calculated byMSELoss
function; theloss
value is minimized by gradient descend mechanism; - the
backpropagation
mechanism for theloss
tensor is performed; - the gradient descend algorithm is executed by an optimizer, for example
torch.optim.Adam, torch.optim.RMSprop
or any other.
Experience replay — a biologically inspired mechanism
Another thing that DQN uses to reduce correlations is the experience replay mechanism, which puts data into a specific memory storage and randomly receives data from the memory storage.
ε-greedy mechanism
The DQN ε
-greedy mechanism promotes a reasonable proportion between exploration and exploitation. This mechanism provides the parameter ε
, where 0 < ε < 1
, which allows controlling this proportion. For any ε
, with probability 1-ε
, exploitation is chosen. The exploitation means, that the agent finds the following action by maximizing the Q-value over all possible actions for the given state, see the relevant lines from the function get_action()
:
Annealing ε-greedy mechanism
But how to choose epsilon? One popular option is the annealing ε
-greedy algorithm. For any episode i
, the action is greedy with probability 1-ε
.
In eq.(3), ε_m
is the minimal value of ε
, which must be achieved in the episode number Mε
. By (3) if i = 0
we haveε = 1
; if i = 1
then ε
is close to 1
. For example, for Mε = 50
and ε_m =0.01
, we have ε = 0.98
. Then exploitation is selected with probability 0.02
, and exploration is chosen with probability 0.98
, i.e., a quite random value for action number will be obtained with probability 0.98
. Thus, for first episodes, the action will be chosen very randomly, this is exploration.
If i = Mε
, we get ε = ε_m
. For our example,ε = 0.01
. Then exploitation is chosen with probability 0.99
, and random values for the action number will be obtained only for 1% of cases. For episodes from 0
to Mε
, the value of ε
is decreasing from 1
to ε_m
, and fixed at ε_m
thereafter.
Double DQN vs. DQN
Overestimations in DQN
The DQN algorithm is known to overestimate action values. For the first time, overestimations of DQN were investigated by Thrun and Schwartz (1993). They give an example in which these overestimations asymptotically lead to sub-optimal policies. Suppose, for some state s’
, the true value on all actions a
is Q(s’, a) = 0
and the estimate value for Q-values are distributed some above and below zero. Then the maximum of these estimates > 0 that is the overestimate for true value. In 2015, (Hasselt et. al., DeepMind) shown that estimation errors can drive the estimates up and away from the true optimal values. They supposed the solution that reduces the overestimation: Double DQN.
What is the reason of overestimations? The problem is in max
operator in eqs. (1) and (2). Suppose, the evaluation value for Q(S_t, a)
is already overestimated. Then the action value obtained in eqs. (1) or (2) becomes even more overestimated. The TD-target G_t
in eq.(1) can be rewritten to TD-target G^Q_t
as follows:
Here, θ_t
is the set of weights of the network q_target
, see Table 1 above.
Decoupling action and evaluation
The Double DQN solution is in decoupling the selection of action argmax_a
from the evaluation Q(S_{t+1}, argmax_a)
, see (4). This is done using another network with the set of weights θ’_t
:
By (5) the greedy policy (argmax) is estimated by the neural network
q_local (
weightsθ_t).
The networkq_target (
weightsθ'_t)
is used to fairly evaluate of this policy. This solution is the main idea of the Double DQN.
Loss function for Double DQN agent
For the Double DQN agent case, the tensors Q(s_t, a_t)
and G_t
are calculated as for the DQN, see Tables 1 ans 2.
The main difference between Table 1 and Table 2 is the calculation of the tensor Q_target_next
(evaluation of the Q-value):
- in Table 1,
Q_target_next
is calculated only by one networkq_target,
see (4). - in Table 2 ,
Q_target_next
is calculated using two networks:q_target
andq_local
(see (5) andQ_max_action
).
CartPole-v0 and CartPole-v1 training
A pole is attached by an joint to a cart, which moves along a track. The system is controlled by applying a force of +1 or -1 to the cart.
The CartPole is a binary classification problem
The dimension of the CartPole observation space is 4, since there are 4 features that form the input: cart coordinate, velocity, pole’s angle from vertical and its derivative (pole “falling” velocity ). The CartPole is a binary classification problem because at each time step the agent chooses between moving left
or right
. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
The environment CartPole-v0 is considered solved if the getting average reward > 195 over 100 consecutive trials; CartPole-v1 is considered solved if the getting average reward > 475 over 100 consecutive trials.
Training experiments for CartPole-v0 and CartPole-v1
Here are the results of my experiments with DQN and Double DQN, obtained during training CartPole-v0 and CartPole-v1. For all cases, LEARNING_RATE = 0.001. The greedy parameterε
is changed from 1
to ε_m = 0.01
For both CartPole-v0 and CartPole-v1, we put Mε= 50.
- DQN, CartPole-v0, reward 195 is achieved in episode 962.
- DQN, CartPole-v1, reward 475 is achieved in episode 1345.
For CartPole-v0 we put Mε= 200;
for CartPole-v1, we put Mε= 150.
Recall that Mε
is the number of episode where ε
achieves the minimal valueε_m.
3. Double DQN, CartPole-v0, reward 195 is achieved in episode 612.
4. Double DQN, CartPole-v1, reward 475 is achieved in episode 1030.
Choice of hyperparameter Mε
If Mε
is set too large, then the choice of ε
will be performed for a long time in conditions of high probability (> ε_m
) of exploration. In other words, for a long time ε
will be carried out without information accumulated in the neural network. This means that choosing between moving left
or right
, we can be mistaken in half the cases for a very long time.
If Mε
is set too small, then the choice of ε
will be performed for a long time under conditions of high probability (= ε_m
) of exploitation. This can be very bad in the early stages of neural network training because the choice of action using argmax
will be made from the neural network, which is still very crude. Then in many cases, the chosen action will be mistaken.
Conclusion
In developing the DQN and Double DQN algorithms, three steps were taken in the fight against correlations and overestimations: (1) target and local networks, (2) experience replay mechanism, (3) decoupling the selection from the evaluation. These mechanisms have been developed with substantial use of two interrelated neural networks.
Appendix. A bit about PyTorch tensors
with torch.no_grad()
The PyTorch function no_grad()
excludes some elements from the gradient calculation. It is used when there is confidence that the back-propagation process is not performed. This function reduces memory consumption, see get_action().
A similar effect occurs when using the detach()
function. The with
statement clarifies code corresponding to try...finally
blocks.
optim.zero_grad()
clears old gradients from the last step (otherwise the gradients will be just accumulated from all loss.backward()
calls)
view(1,1)
This function returns a new tensor, the same as the original tensor, but of a different shape. Trying to remove the function view(1,1)
in get_action()
, we get the different shapes of the action tensor in two branches of get_action().
Then in learn()
function we get batch.action
that consists of tensors of various shapes. This is failure. The function view(1,1)
changes the shape from tensor([a])
to tensor([[a]]).
Parameters 1,1
mean the number of elements in each dimension. For example, view(1,1,1,1,1) meanstensor([[[[[a]]]]]).
torch.cat
Concatenates the given tuple of tensors into the single tensor. For example, in learn()
function, batch.state
is the tuple of 64 tensors of shape [1,4]. Function torch.cat
transforms this tuple into the single tensor states
of the shape [64,4] as follows:
reshape(-1)
Why do we use reshape(-1)
to find the Q_targets_next
tensor, see Table 2? In learn()
function we compare two tensors: Q_targets.unsqueeze(1)
and Q_expected.
If we don’t use reshape
function, then by Table 3 these tensors have different shape, then comparison is failure.
For other Deep Reinforcement Learning projects, see my github directory. For interrelations between the Bellman equation and neural networks, see my previous paper. The same article provides more tips on PyTorch.
References
[1] V.Minh et. al., Playing Atari with Deep Reinforcement Learning (2013), arXiv:1312.5602
[2] H.van Hasselt et. al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461
[3] A.Karpathy, Deep Reinforcement Learning: Pong from Pixels (2016), karpathy.github.io
[4] Rubik’s Code, Introduction to Double Q-Learning, (2020), rubikscode.net
[5] S.Karagiannakos, Taking Deep Q Networks a step further, (2018), TheAISummer
[6] V.Minh et. al., Human-level control through deep reinforcement learning, (2015), Nature
[7] R.Stekolshchik, How does the Bellman equation work in Deep RL?, (2020), TowardsDataScience
[8] C.Yoon, Double Deep Q-Networks 2019, TowardsDataScience
[9] S.Thrun and A.Schwartz, Issues in Using Function Approximation for Reinforcement Learning, (1993),
Carnegie Mellon University, The Robotics Institute
[10] F.Mutsch, CartPole with Q-Learning — First experiences with OpenAI Gym (2017), muetsch.io
[11] T.Seno, Welcome to Deep Reinforcement Learning Part 1 : DQN, (2017), TowardsDataScience
[12] https://towardsdatascience.com/dqn-part-1-vanilla-deep-q-networks-6eb4a00febfb