A pair of interrelated neural networks in Deep Q-Network

In DQN and Double DQN models, comparing two interrelated neural networks is crucial.

Rafael Stekolshchik
Towards Data Science
10 min read · Mar 18, 2020



We will follow a few of the steps taken in the fight against correlations and overestimations during the development of the DQN and Double DQN algorithms. As an example of DQN and Double DQN applications, we present training results for the CartPole-v0 and CartPole-v1 environments. The last section contains some tips on PyTorch tensors.

From lookup table to neural network

Until around 2013, Q-Learning was a Reinforcement Learning (RL) method that worked only with a lookup table. The success of neural networks in Computer Vision sparked interest in trying them out in RL. The paper “Playing Atari with Deep RL” (V. Mnih et al., 2013, DeepMind) presented the first successful Deep Learning model that used a neural network as a function approximator. In 2015, DeepMind demonstrated that a Deep Q-Network agent, receiving only the raw pixel data and the game score as inputs, was able to exceed the performance of all previous algorithms.

After these two DeepMind works, lookup tables were replaced by neural networks, and Q-Learning became Deep Q-Learning, or, equivalently, the Deep Q-Network (DQN). In fact, this was a breakthrough in RL agent training.

The DQN is the algorithm that combines Q-learning with neural networks.

From Q-learning to DQN

The DQN components

  1. Pair of Q-Networks (local and target); see the sketch after this list.
  2. Experience replay — a biologically inspired mechanism.
  3. ε-greedy mechanism.
  4. Q-learning, i.e., using the max value over all possible actions.
  5. Minimizing the loss function by the gradient descent mechanism.
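A sketch of such a Q-network for CartPole-like inputs is given below; the hidden-layer sizes and depth are assumptions, not the exact architecture from the original code.

    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one Q-value per action."""
        def __init__(self, state_size=4, action_size=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_size, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, action_size),
            )

        def forward(self, state):
            return self.net(state)

    # the same class is instantiated twice: q_local and q_target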

Correlations are harmful

Reinforcement Learning is known to be unstable when a neural network is used as a function approximator. The reasons for this instability are as follows:

  • the correlations in the observation sequence,
  • small updates to Q-values may significantly change the policy,
  • the correlations between the Q-value Q(s_t, a_t) and the TD-target value

TD-target:

G_t = R_{t+1} + γ · max_a Q(S_{t+1}, a)     (1)

used in the DQN key equation, eq. (2):

DQN key equation:

Q(S_t, A_t) ← Q(S_t, A_t) + α · [R_{t+1} + γ · max_a Q(S_{t+1}, a) − Q(S_t, A_t)]     (2)

Pair of Q-Networks: local and target

One of the important components of the DQN algorithm in the fight against correlations is the target network (q_target). The target network (with parameters θ*) has the same architecture as the local network, except that its parameters are copied from the local network (q_local) every τ steps, so that at those steps θ*_t = θ_t, and they are kept fixed at all other steps. It looks as follows:
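A minimal sketch of this periodic hard update; the names q_target, q_local and the step counter are assumptions:

    # copy the local network's weights into the target network every TARGET_UPDATE steps
    if step % TARGET_UPDATE == 0:
        q_target.load_state_dict(q_local.state_dict())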

For CartPole, we set τ = TARGET_UPDATE = 10.

Loss function for DQN agent

Comparing the two neural networks that represent the same Q-table, and finding the point at which these networks become very close, is a basic part of the DQN algorithm.

The networks q_local and q_target are used to calculate the tensors for the current estimate Q(s_t, a_t) and the alternative estimate G_t, see (1). Then, in the learn() function of the class Agent:

Table 1. Three tensors (in PyTorch), DQN
  • the distance between Q_targets and Q_expected is calculated by the MSELoss function; the loss value is minimized by the gradient descent mechanism;
  • the backpropagation mechanism is applied to the loss tensor;
  • the gradient descent step is executed by an optimizer, for example torch.optim.Adam, torch.optim.RMSprop or any other.
The learn function for DQN
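A minimal sketch of such a learn() function for the DQN agent; the attribute names (self.q_local, self.q_target, self.optimizer), the value of GAMMA, and the batch shapes (rewards and dones as 1-D tensors of size batch, actions as a long tensor of shape [batch, 1]) are assumptions:

    import torch.nn.functional as F

    GAMMA = 0.99  # discount factor (assumed value)

    def learn(self, experiences):
        states, actions, rewards, next_states, dones = experiences

        # current estimate Q(s_t, a_t): Q-values of the taken actions, from q_local
        Q_expected = self.q_local(states).gather(1, actions)            # shape [batch, 1]

        # max over actions of q_target's Q-values for the next states
        Q_targets_next = self.q_target(next_states).detach().max(1)[0]  # shape [batch]

        # TD-target G_t, eq. (1); (1 - dones) removes the bootstrap term at episode end
        Q_targets = rewards + GAMMA * Q_targets_next * (1 - dones)      # shape [batch]

        # distance between the two estimates, then one gradient descent step
        loss = F.mse_loss(Q_expected, Q_targets.unsqueeze(1))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()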

Experience replay — a biologically inspired mechanism

Another mechanism that DQN uses to reduce correlations is experience replay, which stores transitions in a dedicated memory buffer and then samples them from that buffer at random.
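A minimal replay-buffer sketch; the capacity and batch size are assumed values:

    import random
    from collections import deque, namedtuple

    Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

    class ReplayBuffer:
        """Stores transitions and returns random batches of them."""
        def __init__(self, capacity=100_000):
            self.memory = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.memory.append(Experience(state, action, reward, next_state, done))

        def sample(self, batch_size=64):
            # sampling at random breaks the correlations in the observation sequence
            return random.sample(self.memory, batch_size)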

ε-greedy mechanism

The DQN ε-greedy mechanism maintains a reasonable proportion between exploration and exploitation. The mechanism is controlled by a parameter ε, where 0 < ε < 1. For any ε, exploitation is chosen with probability 1 − ε. Exploitation means that the agent selects the next action by maximizing the Q-value over all possible actions for the given state, see the relevant lines from the function get_action():
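The relevant lines can be sketched as follows; the names self.q_local and self.n_actions are assumptions:

    import random
    import torch

    def get_action(self, state, eps):
        if random.random() > eps:
            # exploitation: the greedy action from the local network
            with torch.no_grad():
                return self.q_local(state).max(1)[1].view(1, 1)
        # exploration: a random action
        return torch.tensor([[random.randrange(self.n_actions)]], dtype=torch.long)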

Annealing ε-greedy mechanism

But how should ε be chosen? One popular option is the annealing ε-greedy algorithm. For any episode i, the action is greedy with probability 1 − ε, where ε decreases with the episode number according to eq. (3).

In eq. (3), ε_m is the minimal value of ε, which must be reached at episode number Mε. By (3), if i = 0 we have ε = 1; if i = 1, then ε is close to 1. For example, for Mε = 50 and ε_m = 0.01, we get ε ≈ 0.98. Then exploitation is selected with probability 0.02, and exploration is chosen with probability 0.98, i.e., a random action is chosen with probability 0.98. Thus, for the first episodes, the action is chosen almost at random; this is exploration.

If i = Mε, we get ε = ε_m. For our example, ε = 0.01. Then exploitation is chosen with probability 0.99, and a random action is chosen only in 1% of cases. For episodes from 0 to Mε, the value of ε decreases from 1 to ε_m, and it is fixed at ε_m thereafter.
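A linear-annealing sketch consistent with the behavior described above (ε = 1 at i = 0, ε = ε_m at i = Mε, fixed at ε_m afterwards); the exact form of eq. (3) is not reproduced here, so this linear formula is an assumption consistent with the examples in the text:

    def annealed_epsilon(i, m_eps=50, eps_min=0.01):
        # linear decay from 1 to eps_min over m_eps episodes, then constant
        return max(eps_min, 1.0 - i * (1.0 - eps_min) / m_eps)

    # annealed_epsilon(0) == 1.0, annealed_epsilon(1) is about 0.98, annealed_epsilon(50) == 0.01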

Double DQN vs. DQN

Overestimations in DQN

The DQN algorithm is known to overestimate action values. Overestimation of this kind was first investigated by Thrun and Schwartz (1993) in the context of Q-learning with function approximation. They gave an example in which these overestimations asymptotically lead to sub-optimal policies. Suppose that, for some state s, the true value of all actions a is Q(s, a) = 0, and the estimated Q-values are distributed, some above and some below zero. Then the maximum of these estimates is greater than 0, which is an overestimate of the true value. In 2015, Hasselt et al. (DeepMind) showed that estimation errors can drive the estimates up and away from the true optimal values. They proposed a solution that reduces the overestimation: Double DQN.

What is the reason for this overestimation? The problem is the max operator in eqs. (1) and (2). Suppose the estimate of Q(S_{t+1}, a) is already too high for some action a. Then the action value obtained in eqs. (1) or (2) becomes even more overestimated. The TD-target G_t in eq. (1) can be rewritten as the TD-target G^Q_t as follows:

G^Q_t = R_{t+1} + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t)     (4)

Here, θ_t is the set of weights of the network q_target, see Table 1 above.

Decoupling action and evaluation

The Double DQN solution is to decouple the selection of the action, argmax_a, from the evaluation Q(S_{t+1}, argmax_a), see (4). This is done using another network with the set of weights θ'_t:

G^{Double}_t = R_{t+1} + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ'_t)     (5)

By (5), the greedy policy (argmax) is estimated by the neural network q_local (weights θ_t), while the network q_target (weights θ'_t) is used to fairly evaluate this policy. This decoupling is the main idea of Double DQN.

Loss function for Double DQN agent

For the Double DQN agent, the tensors Q(s_t, a_t) and G_t are calculated in a similar way as for DQN, see Tables 1 and 2.

Table 2. Four Tensors (in PyTorch), Double DQN

The main difference between Table 1 and Table 2 is the calculation of the tensor Q_targets_next (the evaluation of the Q-value):

  • in Table 1, Q_targets_next is calculated by a single network, q_target, see (4);
  • in Table 2, Q_targets_next is calculated using two networks, q_target and q_local (see (5) and Q_max_action).
The learn function for Double DQN
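A minimal sketch of the learn() function for the Double DQN agent; the names follow Table 2, and GAMMA, self.q_local, self.q_target, self.optimizer are the same assumed names and shapes as in the DQN sketch above:

    import torch.nn.functional as F

    GAMMA = 0.99  # discount factor (assumed value)

    def learn(self, experiences):
        states, actions, rewards, next_states, dones = experiences

        # selection: greedy actions for the next states, chosen by q_local, see (5)
        Q_max_action = self.q_local(next_states).detach().max(1)[1].unsqueeze(1)                  # [batch, 1]

        # evaluation: Q-values of those actions, taken from q_target
        Q_targets_next = self.q_target(next_states).detach().gather(1, Q_max_action).reshape(-1)  # [batch]

        # TD-target and current estimate
        Q_targets = rewards + GAMMA * Q_targets_next * (1 - dones)                                # [batch]
        Q_expected = self.q_local(states).gather(1, actions)                                      # [batch, 1]

        loss = F.mse_loss(Q_expected, Q_targets.unsqueeze(1))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()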

CartPole-v0 and CartPole-v1 training

A pole is attached by an un-actuated joint to a cart, which moves along a track. The system is controlled by applying a force of +1 or −1 to the cart.

The CartPole is a binary classification problem

The dimension of the CartPole observation space is 4, since 4 features form the input: the cart coordinate, its velocity, the pole's angle from vertical, and its derivative (the pole's "falling" velocity). CartPole is a binary classification problem in the sense that at each time step the agent chooses between moving left or right. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

The environment CartPole-v0 is considered solved if the average reward over 100 consecutive trials is greater than 195; CartPole-v1 is considered solved if the average reward over 100 consecutive trials is greater than 475.
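Assuming the classic OpenAI Gym API, the environment and its spaces can be inspected as follows:

    import gym

    env = gym.make("CartPole-v0")
    print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
    print(env.action_space.n)           # 2: push left or push right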

Training experiments for CartPole-v0 and CartPole-v1

Here are the results of my experiments with DQN and Double DQN, obtained during training on CartPole-v0 and CartPole-v1. In all cases, LEARNING_RATE = 0.001, and the greedy parameter ε is annealed from 1 to ε_m = 0.01.

CartPole with DQN

For both CartPole-v0 and CartPole-v1, we set Mε = 50.

  1. DQN, CartPole-v0, reward 195 is achieved in episode 962.
  2. DQN, CartPole-v1, reward 475 is achieved in episode 1345.

CartPole with Double DQN

For CartPole-v0 we set Mε = 200; for CartPole-v1 we set Mε = 150. Recall that Mε is the number of the episode at which ε reaches the minimal value ε_m.

3. Double DQN, CartPole-v0, reward 195 is achieved in episode 612.

4. Double DQN, CartPole-v1, reward 475 is achieved in episode 1030.

Choice of the hyperparameter Mε

If Mε is set too large, then for a long time actions will be chosen under a high probability (> ε_m) of exploration. In other words, for a long time actions will be chosen without using the information accumulated in the neural network. This means that, when choosing between moving left or right, we can be mistaken in about half the cases for a very long time.

If Mε is set too small, then very soon actions will be chosen under a high probability (= 1 − ε_m) of exploitation. This can be very bad in the early stages of training, because the action chosen by argmax comes from a neural network that is still very crude. Then, in many cases, the chosen action will be mistaken.

Conclusion

In developing the DQN and Double DQN algorithms, three steps were taken in the fight against correlations and overestimations: (1) target and local networks, (2) experience replay mechanism, (3) decoupling the selection from the evaluation. These mechanisms have been developed with substantial use of two interrelated neural networks.

Appendix. A bit about PyTorch tensors

with torch.no_grad()

The PyTorch context manager torch.no_grad() disables gradient tracking for the operations performed inside its block. It is used when we are sure that backpropagation will not be performed; this reduces memory consumption, see get_action(). A similar effect can be achieved with the detach() function. The with statement provides the same cleanup guarantees as a try...finally block.
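A tiny illustration:

    import torch

    x = torch.ones(2, requires_grad=True)
    with torch.no_grad():
        y = x * 2              # no computation graph is built inside the block
    print(y.requires_grad)     # False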

optim.zero_grad()

clears the old gradients from the last step (otherwise the gradients would be accumulated across all loss.backward() calls).
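A tiny illustration of that accumulation:

    import torch

    w = torch.ones(1, requires_grad=True)
    (2 * w).sum().backward()
    (3 * w).sum().backward()
    print(w.grad)              # tensor([5.]): the two gradients were accumulated

    w.grad.zero_()             # this is what optimizer.zero_grad() does for every parameter
    print(w.grad)              # tensor([0.])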

view(1,1)

This function returns a new tensor with the same data as the original tensor but with a different shape. If we remove the call to view(1,1) in get_action(), the action tensors produced in the two branches of get_action() have different shapes. Then, in the learn() function, batch.action consists of tensors of various shapes, and this is a failure. The function view(1,1) changes the shape from tensor([a]) to tensor([[a]]). The parameters 1,1 give the number of elements in each dimension. For example, view(1,1,1,1,1) produces tensor([[[[[a]]]]]).
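For example:

    import torch

    a = torch.tensor([7])      # shape [1]
    b = a.view(1, 1)           # shape [1, 1]
    print(b)                   # tensor([[7]])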

torch.cat

Concatenates the given tuple of tensors into a single tensor. For example, in the learn() function, batch.state is a tuple of 64 tensors of shape [1,4]. The function torch.cat transforms this tuple into a single tensor states of shape [64,4] as follows:

states = torch.cat(batch.state)
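A small self-contained illustration (the batch here is random data, not real environment states):

    import torch

    batch_state = tuple(torch.randn(1, 4) for _ in range(64))  # 64 tensors of shape [1, 4]
    states = torch.cat(batch_state)
    print(states.shape)        # torch.Size([64, 4])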

reshape(-1)

Why do we use reshape(-1) to compute the Q_targets_next tensor, see Table 2? In the learn() function we compare two tensors: Q_targets.unsqueeze(1) and Q_expected. If we do not use the reshape function, then, as Table 3 shows, these tensors have different shapes and the comparison fails.

Table 3 Shapes of tensors compared in learn() function
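A shape-level illustration of why reshape(-1) is needed; random tensors stand in for the real batch:

    import torch
    import torch.nn.functional as F

    rewards = torch.randn(64)                  # shape [64]
    gathered = torch.randn(64, 1)              # q_target output after gather(), shape [64, 1]

    # without reshape(-1), rewards + gathered would broadcast to shape [64, 64]
    Q_targets = rewards + 0.99 * gathered.reshape(-1)      # shape [64]
    Q_expected = torch.randn(64, 1)                        # shape [64, 1]

    loss = F.mse_loss(Q_expected, Q_targets.unsqueeze(1))  # both [64, 1]: shapes match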

For other Deep Reinforcement Learning projects, see my github directory. For interrelations between the Bellman equation and neural networks, see my previous paper. The same article provides more tips on PyTorch.

References

[1] V. Mnih et al., Playing Atari with Deep Reinforcement Learning (2013), arXiv:1312.5602

[2] H. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461

[3] A. Karpathy, Deep Reinforcement Learning: Pong from Pixels (2016), karpathy.github.io

[4] Rubik's Code, Introduction to Double Q-Learning (2020), rubikscode.net

[5] S. Karagiannakos, Taking Deep Q Networks a step further (2018), TheAISummer

[6] V. Mnih et al., Human-level control through deep reinforcement learning (2015), Nature

[7] R. Stekolshchik, How does the Bellman equation work in Deep RL? (2020), TowardsDataScience

[8] C. Yoon, Double Deep Q-Networks (2019), TowardsDataScience

[9] S. Thrun and A. Schwartz, Issues in Using Function Approximation for Reinforcement Learning (1993), Carnegie Mellon University, The Robotics Institute

[10] F. Mutsch, CartPole with Q-Learning — First experiences with OpenAI Gym (2017), muetsch.io

[11] T. Seno, Welcome to Deep Reinforcement Learning Part 1: DQN (2017), TowardsDataScience

[12] https://towardsdatascience.com/dqn-part-1-vanilla-deep-q-networks-6eb4a00febfb

