
Newton’s Method with Evolution Strategy for Reinforcement Learning Problems

We extend Salimans et al.'s Evolution Strategy by combining it with Newton's Method to solve reinforcement learning problems.

Introduction

Given an environment with an associated reward function and a policy neural network (i.e. an agent), the goal of deep reinforcement learning is to optimise the weights (i.e. hidden parameters) of the policy neural network such that it maximises the current and all future returns with respect to the reward function (by performing the optimal action for each given state). To put it simply, reinforcement learning is the process of finding an agent such that, given a state in the environment, the actions the agent performs result in optimal current returns and maximum future returns. Although this idea is rather straightforward, the difficulty lies in training the policy neural network. This is because, in order to calculate its gradient, one has to estimate the gradient of the expected reward by sampling, which is not as straightforward as in supervised learning problems. Thus, there exists a vast number of highly complex algorithms in the literature to tackle this problem (e.g. A3C, A2C, DQN, DDPG, TD3, etc.).

Another algorithm for solving such problems is the evolution strategy (ES) of Salimans et al.¹. The authors show that ES rivals the performance of standard reinforcement learning algorithms, using the Atari and MuJoCo environments as benchmarks. They demonstrate that ES is simpler to implement as there is no need for backpropagation, that it is highly parallelisable (a linear speedup is observed), that it can be applied where standard reinforcement learning algorithms are not applicable (e.g. networks that are not differentiable), that it does not suffer in environments with sparse rewards, that it has consistent exploration, and that it has fewer hyperparameters.

Using Salimans et al.'s¹ work as a basis, we extend the evolution strategy method by combining it with Newton's method (NES). In this article, we demonstrate that with our approach one can obtain a training algorithm that is independent of the learning rate, thus reducing the number of hyperparameters with respect to the standard ES method. With numerical experiments, we observe that NES can achieve faster convergence and reach maxima above the standard ES, given appropriately chosen hyperparameters.

Derivation and Main Results

Let s be a state, w be the weights tensor, NN be a policy neural network, α be the learning rate and R be the reward function. Also, define F(w) = R(NN(w, s)). Note that our goal is to maximise the reward function R (i.e. F), and thus we may employ gradient ascent, which can be defined as,

ww + αG(w) ,

where G(w) = ∂F(w)/∂w. However, G cannot be calculated explicitly, and thus it must be estimated, which is not a straightforward task with existing reinforcement learning algorithms. To overcome this difficulty, Salimans et al.¹ propose the following. Let 𝔼(X) = mean([X₁, X₂, ⋅⋅⋅, Xₘ]) be the expectation of some distribution X, where m is the number of samples (i.e. the population size). Also, let σ be some hyperparameter and ϵ be a random sample from a multidimensional normal distribution (with the same dimensions as w) with mean 0 and variance 1, i.e. ϵ ∼ N(0, 1), where dim(ϵ) = dim(w). Thus, the evolution strategy can be expressed as,

ww +α𝔼(ϵF_(_w ** + ϵ**σ))/σ . (1)

Although equation (1) is far simpler than standard reinforcement learning algorithms, we can still reduce its complexity by eliminating the hyperparameter α, which is the goal of this article.
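
For concreteness, equation (1) amounts to averaging the sampled noise weighted by the corresponding returns. A minimal NumPy sketch of this update, with a stand-in quadratic objective in place of the true return F (the objective, learning rate and population size here are purely illustrative), could read,

# minimal sketch of the standard ES update in equation (1)
import numpy as np

def F(w): return -np.sum(w**2)   # stand-in return, maximised at w = 0

w = np.random.randn(3)           # initial weights
alpha, sigma, npop = 0.03, 0.1, 50

for i in range(200):
  N = np.random.randn(npop, w.size)                # noise samples epsilon
  R = np.array([F(w + sigma*n) for n in N])        # returns of the perturbed weights
  w = w + alpha * np.dot(N.T, R) / (npop * sigma)  # equation (1)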

Recall Newton’s method; when applied to our problem, i.e. finding the roots of G, it can be expressed as,

wwG(w)/H(w) , (2)

where H(w) = ∂G(w)/∂w. Now, perturb the weights w by ϵσ (i.e. w + ϵσ) in F and use a Taylor expansion to obtain the following,

F(w + ϵσ) ≈ F(w) + σ(ϵ⋅G(w)) + 0.5σ²(ϵ⋅(ϵH(w))) + ⋅⋅⋅ ,

ϵF(w + ϵσ) ≈ ϵF(w) + σϵ(ϵ⋅G(w)) + ⋅⋅⋅ .

Thus, taking the expectation of the above expansions, noting that 𝔼(ϵ) = 0 (and hence 𝔼(ϵF(w)) = 0), and with some simple rearranging, one finds the following,

𝔼(ϵ(ϵ⋅G(w))) = 𝔼(ϵF(w + ϵσ))/σ , (3)

𝔼(ϵ⋅(ϵH(w))) = 2(𝔼(F(w + ϵσ)) – F(w))/σ² . (4)
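
As a quick sanity check, equations (3) and (4) can be verified numerically by Monte Carlo sampling on a toy quadratic whose gradient and Hessian are known in closed form (the objective, evaluation point and sample size below are purely illustrative),

# illustrative Monte Carlo check of equations (3) and (4) on a toy quadratic
import numpy as np

def F(w): return -np.sum(w**2)    # G(w) = -2w, H(w) = -2I
w = np.array([1.0, -1.0, 0.5])    # arbitrary evaluation point
sigma = 0.1
m = 100000                        # large sample size for a tight estimate

eps = np.random.randn(m, 3)
R = np.array([F(w + sigma*e) for e in eps])

# equation (3): E(eps F(w + eps*sigma))/sigma should approximate G(w), since E(eps(eps.G(w))) = G(w)
print(np.dot(eps.T, R) / (m * sigma), -2*w)

# equation (4): 2(E(F(w + eps*sigma)) - F(w))/sigma^2 should approximate E(eps.(eps H(w))) = trace(H) = -6
print(2*(R.mean() - F(w)) / sigma**2, -6.0)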

Now, substitute equations (3) and (4) into equation (2) to find Newton’s Method with Evolution Strategy, which we define as,

ww – 0.5σ𝔼(ϵF_(w ** + ϵ**σ))/(𝔼(F(_w ** + ϵσ)) – F(w**)) . (5)

As the reader can see, equation (5) eliminates the need for a learning rate. A further advantage of NES over standard ES is that one need not know in advance whether the application requires gradient ascent or gradient descent, as NES can converge to a minimum or maximum point without prior specification.

Using equation (5), the pseudocode for Newton’s Method with Evolution Strategy can be expressed as,

initialise w
for number of episodes:
  epsilons = []
  returns = []
  for number of samples:
    epsilon = standard_normal_distribution(dim=dim(w))
    sample_return = reward_function(policy_neural_network(w + sigma*epsilon))
    epsilons.append(epsilon)
    returns.append(sample_return)
  baseline_return = reward_function(policy_neural_network(w))   # F(w)
  # mean(returns*epsilons) averages each epsilon weighted by its return, i.e. E(eps F(w + eps*sigma))
  w = w - 0.5*sigma*mean(returns*epsilons)/(mean(returns) - baseline_return)   # equation (5)

As the reader can see from the above pseudocode, NES introduces no additional complexity relative to ES, as the sampled returns and the baseline return appearing in the denominator are both already calculated in the standard ES method.

As a simple Python example, we present Karpathy et al.'s² simple ES Python example, modified for NES,

# simple NES example: minimise a quadratic around some solution point
import numpy as np
solution = np.array([0.5, 0.1, -0.3])
def f(w): return -np.sum((w - solution)**2)

npop = 1000    # population size
sigma = 0.1    # noise standard deviation
w = np.random.randn(3) # initial guess
for i in range(300):
  N = np.random.randn(npop, 3)
  R = np.zeros(npop) 
  for j in range(npop):
    w_try = w + sigma*N[j]
    R[j] = f(w_try)
  # E(F(w + eps*sigma)) - F(w), the denominator of equation (5)
  denominator = np.mean(R) - f(w)
  # np.dot(N.T, R)/npop estimates E(eps F(w + eps*sigma)); the whole line is the update of equation (5)
  w = w - 0.5 * sigma * np.dot(N.T, R) / npop / denominator

Numerical Results

As a simple numerical experiment, we compare Lazy Programmer Inc.'s³ simple ES example against our NES algorithm⁴ (due to the random initialisation of the weights, results may vary between runs).

Figure (1): Newton’s method vs standard evolution strategy for various σ, where the maximum return is 0.

As the reader can see from figure (1), NES outperforms the standard ES method, often converging faster and always achieving a higher return.

Further numerical results for various reinforcement learning applications show that the NES method is highly unstable. We often have to clip the update term, handle the denominator-equals-zero case separately (e.g. if 𝔼(F(w + ϵσ)) = F(w), then use the standard ES method) and fix the sign of the denominator, depending on whether one is seeking a maximum (i.e. −∥𝔼(F(w + ϵσ)) – F(w)∥) or a minimum (i.e. +∥𝔼(F(w + ϵσ)) – F(w)∥). In fact, the simple example that we presented in figure (1) is calculated with clipping and with the singular case handled separately. Furthermore, NES requires a relatively large sample size (i.e. a large population size) for convergence, making it impractical for complex applications (i.e. a policy network with a large number of parameters, or with ambiguous maxima or minima points).
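
A minimal sketch of a single stabilised NES update step with these safeguards, where the clipping threshold, the tolerance for the singular case and the fallback ES learning rate are illustrative choices rather than values used in our experiments, could read,

# sketch of a stabilised NES update step with the safeguards described above
import numpy as np

def stabilised_nes_update(w, N, R, F_w, sigma, maximise=True,
                          clip=1.0, fallback_alpha=0.01, tol=1e-8):
  npop = len(R)
  numerator = sigma * np.dot(N.T, R) / npop   # sigma * E(eps F(w + eps*sigma))
  denominator = np.mean(R) - F_w              # E(F(w + eps*sigma)) - F(w)

  if abs(denominator) < tol:
    # singular case: fall back to the standard ES update, equation (1)
    return w + fallback_alpha * np.dot(N.T, R) / (npop * sigma)

  # fix the sign of the denominator according to whether a maximum or a minimum is sought
  denominator = -abs(denominator) if maximise else abs(denominator)

  step = -0.5 * numerator / denominator       # equation (5)
  return w + np.clip(step, -clip, clip)       # clip the update term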

Despite the flaws highlighted above, when NES achieves convergence, we observe faster convergence for relatively large σ and higher returns for relatively small σ compared to the standard ES method (given that the sample size is relatively large).

Note that our numerical experiments are conducted for bespoke applications (i.e. not benchmark applications), and thus, any conclusions implied by our numerical results may be regarded as speculative.

Conclusions

In this article, we combined Newton’s method and Salimans et al.'s¹ evolution strategy (ES) to derive an alternative method for training deep reinforcement learning policy neural networks. With this approach, we retain all the advantages of the standard evolution strategy but with one fewer hyperparameter (i.e. no learning rate) and with no added complexity to the algorithm.

Our numerical results indicate that Newton’s method with evolution strategy (NES) is highly unstable and requires careful tuning of hyperparameters and stabilising methods, implying that the reduction in the number of hyperparameters with respect to standard ES (or any other reinforcement learning algorithm) comes at the cost of higher sensitivity to the remaining hyperparameters.

On occasions when our NES method converges, it shows faster convergence and higher returns (i.e. a policy neural network with weights that are closer to the optimal weights) compared to the standard ES method.

We conclude by reminding the reader that our numerical experiments were conducted for bespoke applications. We leave benchmark numerical experiments as future work.


[1] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever. (7 Sep 2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. https://arxiv.org/abs/1703.03864

[2] Andrej Karpathy, Tim Salimans, Jonathan Ho, Peter Chen, Ilya Sutskever, John Schulman, Greg Brockman, Szymon Sidor. (24 March 2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. https://openai.com/blog/evolution-strategies/

[3] Lazy Programmer Inc, bob7783. (8 May 2019). es_simple.py. https://github.com/lazyprogrammer/machine_learning_examples/blob/master/rl3/es_simple.py

[4] Kavinda Jayawardana. (29 Dec 2020). Newtons_Method_Evolution_Strategy_simple.py. https://github.com/kavjay/Newtons_Method_Evolution_Strategy_simple/blob/main/Newtons_Method_Evolution_Strategy_simple.py

