Three aspects of Deep RL: noise, overestimation and exploration

Two faces of noise. Noise can be harmful: it can lead to systematic overestimation. However, noise can also be useful, for example as noise for exploration.

Rafael Stekolshchik
Towards Data Science



We touch on various sides of noise in Deep Reinforcement Learning models. Part 1 discusses overestimation, the harmful property resulting from noise. Part 2 deals with noise used for exploration, the useful kind of noise. In the appendix, we look at one more example of noise: adaptive noise.

Part 1. We will see how researchers tried to overcome overestimation in models. The first step is the decoupling of action selection from action evaluation, realized in Double DQN. The second step relates to the Actor-Critic architecture: here we decouple the value neural network (critic) from the policy neural network (actor). DDPG and TD3 use this architecture.

Part 2. Exploration as a major challenge of learning. The main issue here is exploration noise. We look at the models DQN, Double DQN, DDPG and TD3. Neural network models that use noise parameters have greater exploration capabilities and are more successful in Deep RL algorithms.

Appendix. We consider Hill-Climbing, a simple gradient-free algorithm. This algorithm adds adaptive noise directly to the input variables, namely to the weight matrix that determines the neural network.

Part 1. In efforts to overcome overestimation

DQN and Double DQN algorithms turned out to be very successful in the case of discrete action spaces. However, it is known that these algorithms suffer from overestimation. This property is much more harmful than underestimation, because overestimation errors accumulate while underestimation errors do not. Let us see how researchers tried to overcome overestimation.

Overestimation in DQN.

The problem lies in the maximization operator used for the calculation of the target value Gt. Suppose the evaluated value of Q(S_{t+1}, a) is already overestimated. Then, from the DQN key equations (see below), this error propagates to Q(S_t, a) as well.

DQN key equations for Q(S_t, a_t):

G_t = R_t + γ · max_a Q(S_{t+1}, a)

Q(S_t, a_t) ← Q(S_t, a_t) + α · (G_t − Q(S_t, a_t))

Here, R_t is the reward at time t; G_t is the cumulative reward, also known as the TD target; Q(s, a) is the Q-value table of shape [space x action]; α is the learning rate.

Thrun and Schwartz in “Issues in Using Function Approximation for Reinforcement Learning” (1993) observed that using function approximators (i.e., neural networks) instead of just lookup tables (the basic technique of Q-learning) causes some noise on the output predictions. They gave an example in which this overestimation asymptotically leads to suboptimal policies.
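To see why zero-mean noise produces a positive bias, here is a minimal numpy sketch (my illustration, not taken from the paper): four actions all have the true Q-value 0, and the approximator adds unbiased noise to each estimate, yet the max over the noisy estimates is clearly above 0:

import numpy as np

np.random.seed(0)
true_q = np.zeros(4)                 # 4 actions, all with true Q-value 0
n_trials = 100_000

# zero-mean noise added by the function approximator to each Q-estimate
noisy_q = true_q + np.random.normal(0.0, 1.0, size=(n_trials, 4))

print(noisy_q.mean())                # ~0.00: the noise itself is unbiased
print(noisy_q.max(axis=1).mean())    # ~1.03: the max operator overestimates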

Decoupling in Double DQN.

In 2015, van Hasselt et al. in “Deep Reinforcement Learning with Double Q-learning” showed that estimation errors can drive the estimates up and away from the true optimal values. They proposed a solution that reduces the overestimation: Double DQN.

The important thing done in Double DQN is the decoupling of action selection from action evaluation. Let us make this clear.

Gt formula for DQN and Double DQN
  • G_t formula for DQN: G_t = R_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t). The Q-value used for the action selection (in red in the figure) and the Q-value used for the action evaluation (in blue) are determined by the same neural network with the weight vector θ_t.
  • G_t formula for Double DQN: G_t = R_t + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ'_t). The Q-value used for the action selection and the Q-value used for the action evaluation are determined by two different neural networks with weight vectors θ_t and θ'_t. These networks are called current and target.

However, due to the slowly changing policy, the value estimates of the current and target neural networks are still too similar, and this still causes consistent overestimation.
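To make the decoupling concrete, here is a minimal PyTorch-style sketch of the two targets, following the formulas above (the function and network names are mine; terminal-state handling is omitted):

import torch

def dqn_target(reward, next_state, gamma, q_net):
    # the same network both selects and evaluates the next action
    next_q = q_net(next_state).max(dim=1).values
    return reward + gamma * next_q

def double_dqn_target(reward, next_state, gamma, q_net, q_target):
    # the current network selects the action, the target network evaluates it
    best_action = q_net(next_state).argmax(dim=1, keepdim=True)
    next_q = q_target(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q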

Actor-Critic architecture in DDPG.

DDPG is one of the first algorithms that tried to use the Q-learning technique of DQN models for continuous action spaces. DDPG stands for Deep Deterministic Policy Gradient. In this case, we cannot use the maximization operator of Q-values over all actions; however, we can use a function approximator, a neural network representing Q-values. We presume that there exists a certain function Q(s, a) which is differentiable with respect to the action argument a.

However, finding argmax(Q(S_t, a)) over all actions a for the given state S_t means that we must solve an optimization task at every time step. This is a very expensive task. To overcome this obstacle, a group of researchers from DeepMind, in the work “Continuous control with deep reinforcement learning”, used the Actor-Critic architecture. They used two neural networks: one, as before in DQN, is the Q-network representing Q-values; the other is the actor function 𝜋(s), which provides a*, the action that maximizes the value function Q(s, a), as follows:

Actor function 𝜋(s): a* = 𝜋(s) ≈ argmax_a Q(s, a)
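As a hedged sketch (the function and network names are mine, not from the DDPG code), this is how the actor replaces the expensive argmax in the critic's target:

import torch

def ddpg_target(reward, next_state, gamma, actor_target, critic_target):
    # the actor approximates a* = argmax_a Q(s, a), so no optimization
    # over actions has to be solved at every time step
    next_action = actor_target(next_state)
    next_q = critic_target(next_state, next_action)
    return reward + gamma * next_q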

Part 2. Exploration as a major challenge of learning

Why explore?

In addition to overestimation, there is another problem in Deep RL, no less difficult: exploration. We cannot unconditionally believe in the maximum values of the Q-table or in the value of a* = 𝜋(s). Why not? Firstly, at the beginning of training, the corresponding neural network is still “young and stupid”, and its maximum values are far from reality. Secondly, the maximum values may not be the ones that lead us to the optimal strategy even after hard training.

In life, we often have to solve the following problem: to follow the beaten path — there is little risk and little reward; or to take a new unknown path with great risk — but, with some probability, a big win is possible there. Maybe it will be just super, you don’t know.

Exploration vs. exploitation

Exploitation means that the agent uses the accumulated knowledge to select the next action. In our case, this means that for the given state, the agent finds the next action that maximizes the Q-value. Exploration means that the next action will be selected randomly.

There is no rule that determines which strategy is better: exploration or exploitation. The real goal is to find a true balance between these two strategies. As we will see, this balance changes during the learning process.

Exploration in DQN and Double DQN

One way to ensure adequate exploration in DQN and Double DQN is to use the annealing ε-greedy mechanism. For the first episodes, exploitation is selected with a small probability, for example 0.02 (i.e., the action will be chosen almost completely at random), and exploration is selected with a probability of 0.98. Starting from a certain episode number, exploration is performed with a minimal probability ε_m, for example ε_m = 0.01, and exploitation is chosen with probability 0.99. The probability formula of exploration ε can be realized as follows:

Annealing ε-greedy mechanism, probability formula of exploration ε

where i is the episode number. Let Mε = 100, ε_m = 0.01. Then the probability ε of exploration looks as follows:

Gradual decrease in probability from 1 to ε_m = 0.01
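One possible realization of such a schedule (a sketch; the exact formula is the one shown in the figure) is linear annealing from 1 to ε_m over the first Mε episodes:

def epsilon(i, m_eps=100, eps_min=0.01):
    # probability of exploration in episode i:
    # decreases linearly from 1 to eps_min, then stays at eps_min
    return max(eps_min, 1.0 - i / m_eps)

print([epsilon(i) for i in (0, 50, 100, 200)])   # [1.0, 0.5, 0.01, 0.01]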

Exploration in DDPG

In RL models with continuous action spaces, undirected exploration is applied instead of the ε-greedy mechanism. This method is used in DDPG, PPO and other continuous control algorithms. The authors of DDPG (Lillicrap et al., 2015) constructed an undirected exploration policy 𝜋’ by adding noise sampled from a noise process N to the actor policy 𝜋(s):

Policy 𝜋(s) with exploration noise: 𝜋’(s_t) = 𝜋(s_t) + N

where N is noise generated by the Ornstein-Uhlenbeck process, a correlated noise process. In the TD3 paper, the authors (Fujimoto et al., 2018) proposed to use classic Gaussian noise instead; this is the quote:

…we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action. Unlike the original implementation of DDPG, we used uncorrelated noise for exploration as we found noise drawn from the Ornstein-Uhlenbeck (Uhlenbeck & Ornstein, 1930) process offered no performance benefits.
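In code, this kind of undirected exploration boils down to adding noise to the deterministic action (a sketch assuming actions normalized to [-1, 1]; TD3 uses Gaussian noise, while the original DDPG used the Ornstein-Uhlenbeck process):

import numpy as np

def explore(actor, state, std=0.1, low=-1.0, high=1.0):
    # deterministic action from the actor plus Gaussian exploration noise
    action = actor(state)
    noise = np.random.normal(0.0, std, size=np.shape(action))
    return np.clip(action + noise, low, high)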

A common failure mode for DDPG is that the learned Q-function begins to overestimate Q-values; the policy (actor function) then exploits these errors, which leads to significant errors in the learned policy.

Exploration in TD3

The name TD3 stands for Twin Delayed Deep Deterministic. TD3 retains the Actor-Critic architecture used in DDPG and adds three new properties that greatly help to overcome overestimation:

  • TD3 maintains a pair of critics, Q1 and Q2 (hence the name “twin”), along with a single actor. To form the target, TD3 uses the smaller of the two Q-values (see the sketch after this list).
  • TD3 updates the policy (and the target networks) less frequently than the Q-function: one policy (actor) update for every two Q-function (critic) updates.
  • TD3 adds noise to the target action (target policy smoothing). TD3 uses Gaussian noise, not the Ornstein-Uhlenbeck noise used in DDPG.
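A sketch of how the first and the third properties enter the TD target (the names and default values are mine; the delayed policy updates of the second property live in the training loop and are not shown):

import torch

def td3_target(reward, next_state, gamma, actor_target,
               critic1_target, critic2_target,
               std=0.2, noise_clip=0.5, max_action=1.0):
    # target policy smoothing: clipped Gaussian noise on the target action
    mu = actor_target(next_state)
    noise = (torch.randn_like(mu) * std).clamp(-noise_clip, noise_clip)
    next_action = (mu + noise).clamp(-max_action, max_action)
    # clipped double Q-learning: take the smaller of the twin Q-values
    q1 = critic1_target(next_state, next_action)
    q2 = critic2_target(next_state, next_action)
    return reward + gamma * torch.min(q1, q2)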

Exploration noise in trials with PyBullet Hopper

PyBullet is a Python module for robotics and Deep RL based on the Bullet Physics SDK. Let us look at HopperBulletEnv, one of the PyBullet environments associated with articulated bodies:

Trained agent for HopperBulletEnv
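For reference, a minimal way to create this environment (a sketch assuming the pybullet package and the classic gym API used in these trials):

import gym
import pybullet_envs  # registers the PyBullet environments in gym

env = gym.make('HopperBulletEnv-v0')
state = env.reset()
print(env.observation_space, env.action_space)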

The HopperBulletEnv environment is considered solved if the achieved score exceeds 2500. In TD3 trials with the HopperBulletEnv environment, I got, among others, the following results for std = 0.1 and std = 0.3:

Two trials for HopperBulletEnv with TD3, std of noise= 0.1 and 0.3

Here, std is the standard deviation of the exploration noise in TD3. In both trials, the threshold of 2500 was not reached. However, I noticed the following oddities.

  • In the trial with std = 0.3, there are a lot of values near 2500 (though below 2500), and at the same time the average value decreases all the time. This is explained as follows: the number of small values prevails over the number of large values, and the difference between these numbers increases.
  • In the trial with std = 0.1, the average values reach large values, but in general the values decrease. The reason, again, is that the number of small values prevails over the number of large values.
  • It seemed to me that the prevalence of very small values was associated with a too large noise standard deviation. I then decided to reduce std to 0.02, and that was enough to solve the environment.

HopperBulletEnv with TD3, std of noise = 0.02

Appendix. Hill-Climbing algorithm with adaptive noise

Forerunner of tensors

We illustrate the properties of the Hill-Climbing algorithm applied to the Cartpole environment. The neural network model here is so simple that it does not use tensors (no PyTorch, no TensorFlow); the neural network uses only the simplest matrix of shape [4 x 2], a forerunner of tensors.

Class Policy in Hill-Climbing algorithm
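A minimal sketch along the lines of the class shown above (a [4 x 2] weight matrix followed by a softmax over the two Cartpole actions; the names are an illustration, not the exact project code):

import numpy as np

class Policy:
    def __init__(self, s_size=4, a_size=2):
        # the whole "neural network": a [4 x 2] weight matrix
        self.w = 1e-4 * np.random.rand(s_size, a_size)

    def forward(self, state):
        x = np.dot(state, self.w)
        return np.exp(x) / np.sum(np.exp(x))   # softmax over the 2 actions

    def act(self, state):
        probs = self.forward(state)
        return np.argmax(probs)                # greedy deterministic action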

The Hill-Climbing algorithm seeks to maximize a target function Go, which in our particular case is the cumulative discounted reward:

Cumulative discounted reward: Go = Σ_k γ^k · R_k

where γ is the discount factor, 0 < γ < 1, and Rk is the reward obtained at the time step k of the episode. The target function Go looks in Python as follows:

# rewards is the list of rewards collected in one episode, gamma is the discount factor
discounts = [gamma**i for i in range(len(rewards))]
Go = sum([a*b for a, b in zip(discounts, rewards)])

As always in Deep RL, we try to exceed a certain threshold. For Cartpole-v0, this threshold score is 195, and for Cartpole-v1 it is 475. Hill-Climbing is a simple gradient-free algorithm (i.e., we do not use the gradient ascent/gradient descent methods). We try to climb to the top of the curve by only changing the arguments of the target function Go using a certain adaptive noise. However, what is the argument of our target function?

The argument of Go is the weight matrix determining the neural network that underlies our model. Examples of the weight matrix for episodes 0–5 are presented here:

Weight vectors [4 x 2] of the neural network for episodes 0–5

Adaptive noise scale

The adaptive noise scaling for our model is realized as follows. If the current value of the target function is better than the best value obtained so far, we divide the noise scale by 2 and add this noise to the weight matrix. If the current value of the target function is worse than the best obtained value, we multiply the noise scale by 2 and add this noise to the weight matrix associated with the best obtained value. In both cases, the noise that is added is the noise scale multiplied by a random factor, different for each element of the matrix.
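A sketch of this update rule (assuming the Policy class sketched above and a hypothetical run_episode helper that plays one episode and returns the cumulative discounted reward Go):

import numpy as np

policy = Policy()
noise_scale = 1e-2
best_Go, best_w = -np.inf, policy.w.copy()

for episode in range(1000):
    Go = run_episode(policy)                  # hypothetical helper, see above
    if Go >= best_Go:                         # improvement: keep weights, halve the noise scale
        best_Go, best_w = Go, policy.w.copy()
        noise_scale = noise_scale / 2
        policy.w += noise_scale * np.random.rand(*policy.w.shape)
    else:                                     # no improvement: restore best weights, double the noise scale
        noise_scale = noise_scale * 2
        policy.w = best_w + noise_scale * np.random.rand(*policy.w.shape)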

Noise Scale and Score graphs by episodes

For Cartpole-v1, if the weight matrix is initialized with small non-zero values (see the top-left matrix above), the environment is solved in 112 episodes. Note that if the weight matrix is initialized with zeros, the number of episodes increases from 112 to 168. The same holds for Cartpole-v0.

For more information on Cartpole-v0/Cartpole-v1 with adaptive noise scaling, see the project on Github.

A more generic formula for the noise scale

As we saw above, the noise scale adaptively increases or decreases depending on whether the target function is lower or higher than the best obtained value. The scaling factor in this algorithm is 2. In the paper “Parameter Space Noise for Exploration”, the authors consider a more generic formula:

Adaptive noise scale

where α is a noise scale, d is a certain distance measure between the perturbed and non-perturbed policy, and δ is a threshold value. In Appendix C, the authors consider possible forms of the distance function d for the algorithms DQN, DDPG and TRPO.
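Written as code, this adaptive rule reads roughly as follows (a paraphrase of the formula above; here scale stands for the noise scale and factor for a multiplicative constant slightly greater than 1):

def adapt_noise_scale(scale, d, delta, factor=1.01):
    # grow the parameter-noise scale while the perturbed and non-perturbed
    # policies are closer than the threshold delta, shrink it otherwise
    return scale * factor if d <= delta else scale / factor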

References

[1] S. Thrun and A. Schwartz, Issues in Using Function Approximation for Reinforcement Learning (1993), Carnegie Mellon University, The Robotics Institute

[2] H. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461

[3] T. P. Lillicrap et al., Continuous control with deep reinforcement learning (2015), arXiv:1509.02971

[4] Yuxi Li, Deep Reinforcement Learning: An Overview (2018), arXiv:1701.07274v6

[5] S. Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods (2018), arXiv:1802.09477v3

[6] Better Exploration with Parameter Noise, OpenAI.com, https://openai.com/blog/better-exploration-with-parameter-noise/

[7] M. Plappert et al., Parameter Space Noise for Exploration, OpenAI, arXiv:1706.01905v2, ICLR 2018

[8] B. Mahyavanshi, Introduction to Hill Climbing | Artificial Intelligence (2019), Medium

[9] Deep Deterministic Policy Gradient, OpenAI, Spinning Up, https://spinningup.openai.com/en/latest/algorithms/ddpg.html

[10] What Does Stochastic Mean in Machine Learning? (2019), Machine Learning Mastery, https://machinelearningmastery.com/stochastic-in-machine-learning/

[11] C. Colas et al., GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms (2018), arXiv:1802.05054

[12] Ornstein–Uhlenbeck process, Wikipedia, https://en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process

[13] E. Lindwurm, Intuition: Exploration vs Exploitation (2019), Towards Data Science

[14] M. Watts, Introduction to Reinforcement Learning (DDPG and TD3) for News Recommendation (2019), Towards Data Science

[15] T. Stafford, Fundamentals of learning: the exploration-exploitation trade-off (2012), https://tomstafford.staff.shef.ac.uk/?p=48

[16] Bullet Real-Time Physics Simulation (2020), https://pybullet.org/wordpress/

[17] R. Stekolshchik, A pair of interrelated neural networks in DQN (2020), Towards Data Science

[18] R. Stekolshchik, How does the Bellman equation work in Deep RL? (2020), Towards Data Science

