Entropy in Soft Actor-Critic (Part 2)

Rafael Stekolshchik
Towards Data Science
10 min read · Jun 7, 2021


The SAC algorithm iterates between two steps: policy evaluation and policy improvement. In the policy evaluation step, the algorithm computes the value of the current policy 𝜋 in accordance with the maximum entropy principle and the soft Bellman equation. In the policy improvement step, the policy is updated to be as close as possible to the softmax of the soft Q-function of the previous policy, i.e., by minimizing cross-entropy. For each state, this update is carried out by means of the Kullback-Leibler divergence.


Finding the policy

In Part 1, following the SAC algorithm and the maximum entropy principle, we constructed the soft state-value function V(s), the soft action-value function Q(s, a), and the soft Bellman equation with a built-in entropy term. The next goal is to build an algorithm for finding the best policy 𝜋(a|s). To this end, the SAC algorithm applies the principle of minimum cross-entropy, using the KL-divergence and the softmax function.

KL-divergence

Cross-entropy H(p,q) and KL-divergence

Let p and q be discrete probability distributions over the set of events {x₁,…,x_n} associated with some discrete variable X. The cross-entropy of the distribution q with respect to the distribution p over this set of events is defined as follows:

Cross-entropy (Image by Author)

Here Ep[∙] is the expected value operator with respect to the distribution p.
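Written out explicitly (this is the standard definition; the base of the logarithm is left unspecified, since changing it only rescales the value):

```latex
H(p, q) \;=\; -\sum_{i=1}^{n} p(x_i)\,\log q(x_i) \;=\; -\,\mathbb{E}_p\big[\log q(X)\big]
```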

For discrete probability distributions p and q with the same probability space X, the Kullback–Leibler divergence (or KL-divergence) from q to p is defined as

KL-divergence (Image by Author)

Entropy and cross-entropy are connected by KL-divergence:

Entropy, cross-entropy and KL-divergence (Image by Author)
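For reference, the two formulas shown in the images above can be written in the same notation as:

```latex
D_{KL}(p \,\|\, q) \;=\; \sum_{i=1}^{n} p(x_i)\,\log\frac{p(x_i)}{q(x_i)},
\qquad
H(p, q) \;=\; H(p) + D_{KL}(p \,\|\, q)
```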

The KL-divergence is defined only if, for all x, q(x) = 0 implies p(x) = 0. If p(x) = 0, the corresponding term in the sum is interpreted as 0, since by L’Hôpital’s rule

Contribution of the term with p(x) = 0 (Image by Author)

If p is a fixed reference distribution, then the cross-entropy and the KL-divergence are identical up to an additive constant (since p is fixed). When q = p, the KL-divergence attains its minimum value 0, and the cross-entropy H(p, q) attains its minimum value H(p, q) = H(p).
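These relations are easy to check numerically. Below is a minimal NumPy sketch (not the code of the appendices); base-2 logarithms are used, and q(x) is assumed positive wherever p(x) is:

```python
import numpy as np

def entropy(p):
    # Shannon entropy H(p) in bits; terms with p(x) = 0 are dropped
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) log( p(x) / q(x) )
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # equal: H(p,q) = H(p) + D_KL(p||q)
print(kl_divergence(p, p))                                    # 0.0: the minimum, reached at q = p
```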

KL-divergence and distance metrics

Cross-entropy H(p, q) cannot be a distance metric, since H(p, p) = H(p), which in general is not zero. The KL-divergence cannot be a distance metric either, since it is not symmetric, i.e., Dᴋʟ(p‖q) ≠ Dᴋʟ(q‖p), and it does not satisfy the triangle inequality.

KL-divergence for Gaussian pairs, generated by Python code from App 2.1
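For two univariate Gaussians the KL-divergence also has a simple closed form, which gives exact reference values for pairs like those in the figure above. The sketch below is an illustration only, not the code of App 2.1:

```python
import numpy as np

def kl_gaussians(mu1, std1, mu2, std2):
    # Closed-form D_KL( N(mu1, std1^2) || N(mu2, std2^2) ), in nats
    return (np.log(std2 / std1)
            + (std1 ** 2 + (mu1 - mu2) ** 2) / (2 * std2 ** 2)
            - 0.5)

print(kl_gaussians(0, 2, 8, 2))  # 8.0 for the pair p ~ N(0, 2), q ~ N(8, 2) used below
print(kl_gaussians(0, 2, 0, 2))  # 0.0 for two identical Gaussians
```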

The KL-divergence can be thought of as a measure of how far the distribution q is from the distribution p, but it cannot be used as a distance metric. However, the infinitesimal form of the KL-divergence, given by the matrix of all its second partial derivatives (the Hessian), constitutes the Fisher information metric. According to Chentsov’s theorem [Do17], the Fisher information metric is a Riemannian metric on the space of probability distributions, the central object of Riemannian geometry.

Bias of KL-divergence estimation

The KL-divergence estimate can depend strongly on the number of samples and may not converge to the true value as the number of samples tends to infinity. An example is shown in Figure (a): two fixed Gaussian distributions, p ~ N(0, 2) and q ~ N(8, 2), while the number of samples varies in the interval [500, 5000]. The estimated KL-divergence grows roughly linearly with the number of samples.

If the number of samples is fixed and a distribution parameter changes, the KL-divergence estimate can be significantly biased, and the bias can be highly distribution-dependent. In Figure (b), we again have two Gaussian distributions: p ~ N(0, 2) and q ~ N(3, std), where the mean is 3 and std runs over the interval [1, 20]. In this interval, the estimated KL-divergence changes greatly. See [Pe08], [No18] for more details on KL-divergence estimation.

(a) Dependence of Dᴋʟ(p‖q) on the number of samples, (b) Dependence of Dᴋʟ(p‖q) on the standard deviation (graphs (a) and (b) are generated by Python code from App 2.2).
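The flavor of these experiments can be reproduced with a simple histogram-based estimator. The sketch below is not the code of App 2.2; it only illustrates how strongly a sample-based estimate can depend on the binning and on the sample size:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)

def kl_from_samples(xp, xq, bins=50):
    # Histogram-based estimate of D_KL(p || q) from samples xp ~ p and xq ~ q;
    # the result is sensitive to the number of bins and the number of samples
    lo = min(xp.min(), xq.min())
    hi = max(xp.max(), xq.max())
    p_hist, edges = np.histogram(xp, bins=bins, range=(lo, hi), density=True)
    q_hist, _ = np.histogram(xq, bins=edges, density=True)
    eps = 1e-10                      # avoid division by zero in empty bins
    return entropy(p_hist + eps, q_hist + eps)

for n in (500, 2000, 5000):
    xp = rng.normal(0, 2, n)         # samples from p ~ N(0, 2)
    xq = rng.normal(8, 2, n)         # samples from q ~ N(8, 2)
    print(n, kl_from_samples(xp, xq))
```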

Actor-Critic

Deterministic policy vs. stochastic policy

For the case of a discrete action space, there is the successful DQN (Deep Q-Network) algorithm. One of the successful attempts to transfer the DQN approach to a continuous action space with an Actor-Critic architecture was the DDPG algorithm, whose key component is a deterministic policy [Li15]. The main advantage of a deterministic policy is that the policy gradient can be computed more efficiently than for a stochastic policy, which requires more samples [Si14].

However, a common failure mode of DDPG is that the learned Q-function regularly overestimates Q-values. It is a well-known fact that a good policy cannot be learned if exploration is not efficient enough, [St20b]. A less trivial fact was established in [Ma19]:

“…if exploration does find the reward consistently but not early enough, an actor-critic algorithm can get stuck into a configuration from which rewarded samples are just ignored.”

As demonstrated in [Ma19], with deterministic policy gradient updates the actor update can stall even in a very simple environment (such as 1D-TOY), although the behavior policy regularly finds the reward.

Stochastic actor and term “soft”

In Part 1 we considered the Boltzmann distribution used in statistical mechanics. It is a probability distribution that gives the probability pᵢ of state i of a system:

Boltzmann distribution and partition function (Image by Author)

In statistical mechanics, the normalization denominator of equation (4) is called the partition function and is denoted by Z. Strictly speaking, the partition function should be written with an integral rather than a sum, but for our purposes the discrete case is sufficient.

In probability theory, the softmax function p(x), like the Boltzmann distribution, is a function that takes n real numbers xᵢ, where i runs over the event space {1,…,n}, and normalizes them:

Softmax function p(x) and partition function Z (Image by Author)

The function p maps any real values xᵢ to positive probabilities p(xᵢ) that sum to 1. Thus, the function p is a probability distribution over the event space {1,…,n}. The softmax function is also known as the normalized exponential transformation. Softmax is a way to create a kind of competition between events. In particular, in Soft Actor-Critic the stochastic agent selects actions using softmax. Softmax changes the winner-take-all strategy, which always chooses the maximum value, into a stochastic strategy, where each action can be chosen with some probability. That is the motivation for the term “soft”, [Br89]. Apparently, in Soft Actor-Critic the term “soft” refers to the fact that the softmax function is used.

The partition function Z also occurs in neural networks, with the same meaning as in statistical mechanics: a sum of discrete terms that serves as the normalization denominator in the softmax function, as in equation (5).
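A minimal NumPy sketch of this softening effect is shown below; the temperature parameter is included only for illustration and is not the SAC temperature α update:

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Normalized exponential; subtracting max(x) keeps exp() numerically stable
    z = np.exp((x - np.max(x)) / temperature)
    return z / z.sum()               # z.sum() plays the role of the partition function Z

q_values = np.array([1.0, 2.0, 3.0])
print(softmax(q_values))                    # every action keeps a non-zero probability
print(softmax(q_values, temperature=0.1))   # low temperature -> close to winner-take-all
print(np.argmax(q_values))                  # the "hard" winner-take-all choice
```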

Policy improvement step

In Soft Actor-Critic the policy is updated by the following formula:

Updating the policy (Image by Author)

The first distribution in equation (6) is the distribution 𝜋'(∙|s_t). It belongs to the set of Gaussian distributions Π. The goal is to find, within Π, the distribution that minimizes the cross-entropy, which is the same as minimizing the KL-divergence in equation (2').

“Since in practice we prefer policies that are tractable, we will additionally restrict the policy to some set of policies Π, which can correspond, for example, to a parameterized family of distributions such as Gaussians”, [SAC-1].

The second distribution in equation (6) is constructed using the softmax function, and its denominator is the partition function (5). The values of the soft action-value function Q(s_t, aᵢ) obtained for the previous policy 𝜋_old play the role of the xᵢ values in the softmax function of equation (5).
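Putting the two distributions together, the update in equation (6) can be written, in the notation of [SAC-1] and with the temperature absorbed into Q, as:

```latex
\pi_{\text{new}}(\cdot \mid s_t)
  \;=\; \arg\min_{\pi' \in \Pi}\;
  D_{KL}\!\left( \pi'(\cdot \mid s_t)
  \;\middle\|\;
  \frac{\exp\!\big(Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\text{old}}}(s_t)} \right)
```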

Expected KL-divergence

Let us transform the KL-divergence from equation (6). According to (2), since the logarithm of a ratio is the difference of the logarithms, the KL-divergence takes the following form:

Transformed KL-divergence as difference of logarithms (Image by Author)
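Written out (again with the temperature absorbed into Q), the difference of logarithms is:

```latex
D_{KL}\!\left( \pi'(\cdot \mid s_t) \;\middle\|\; \frac{\exp\!\big(Q(s_t, \cdot)\big)}{Z(s_t)} \right)
  \;=\; \mathbb{E}_{a \sim \pi'}\Big[ \log \pi'(a \mid s_t) \;-\; Q(s_t, a) \;+\; \log Z(s_t) \Big]
```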

We aim to minimize the KL-divergence over the set of Gaussian distributions Π. Since the log-partition function log Z(s_t) depends neither on the action a nor on the choice of policy 𝜋' ∈ Π, it can be discarded. The policy parameters can then be learned by directly minimizing the expected KL-divergence in equation (7), multiplied by the temperature parameter α:

Objective function for minimizing expected KL divergence (Image by Author)
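In the notation of [SAC-2], where D is the replay buffer, φ the policy parameters and θ the critic parameters, this objective reads:

```latex
J_{\pi}(\phi)
  \;=\; \mathbb{E}_{s_t \sim \mathcal{D}}\Big[
          \mathbb{E}_{a_t \sim \pi_{\phi}}\big[
            \alpha \log \pi_{\phi}(a_t \mid s_t) \;-\; Q_{\theta}(s_t, a_t)
          \big]
        \Big]
```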

Equation (7') is used to compute the policy_loss tensor.

Policy calculation steps

Policy calculation steps (Image by Author)

Tips to SAC Implementation

Init function of SAC Agent

In the SAC agent's __init__ function, the self.critic and self.critic_target members are of type QNetwork, and self.policy is of type GaussianPolicy. The QNetwork and GaussianPolicy classes, defined in model.py, are neural networks.

Function __init__ of class soft_actor_critic_agent
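The gist itself is not reproduced here; the sketch below shows what such an __init__ typically contains. The constructor signatures of QNetwork and GaussianPolicy, the argument names and the use of the Adam optimizer are assumptions, and the exact code in [Hop20] may differ:

```python
from torch.optim import Adam
# QNetwork and GaussianPolicy are the neural-network classes defined in model.py
from model import QNetwork, GaussianPolicy

class soft_actor_critic_agent:
    def __init__(self, num_inputs, action_space, device,
                 hidden_size, lr, gamma, tau, alpha):
        self.gamma, self.tau, self.alpha = gamma, tau, alpha
        self.device = device

        # Critic: a QNetwork for the Q-values and a second QNetwork as the target
        self.critic = QNetwork(num_inputs, action_space.shape[0], hidden_size).to(device)
        self.critic_target = QNetwork(num_inputs, action_space.shape[0], hidden_size).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optim = Adam(self.critic.parameters(), lr=lr)

        # Actor: stochastic Gaussian policy
        self.policy = GaussianPolicy(num_inputs, action_space.shape[0],
                                     hidden_size, action_space).to(device)
        self.policy_optim = Adam(self.policy.parameters(), lr=lr)
```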

Policy loss and Critic loss

The tensor policy_loss represents eq. (7) from [SAC-2].
The tensors Q1_loss and Q2_loss represent eq. (5) from [SAC-2].
  • The policy.sample function, which takes state_batch as input, returns the tensors action_batch_pi and log_pi. Both tensors are used in the calculation of policy_loss.
  • There are two critic neural networks, Q1 and Q2, two instances of QNetwork, [Hop20]. Note that SAC (like the TD3 algorithm) uses two critic networks to mitigate positive bias in the policy improvement step.
  • The policy.sample function, which takes next_state_batch as input, returns the tensors next_state_action and next_state_log_pi. Both tensors are used in the computation of the two Q-loss values: Q1_loss and Q2_loss.
  • The two tensors Q1_loss and Q2_loss are calculated with the PyTorch mse_loss() function. The arguments to mse_loss() are obtained in two ways: (I) from the action-value function Q(s, a), and (II) from the soft Bellman equation (Part 1, (10)). The difference between the arguments to mse_loss() is called the soft Bellman residual (see the sketch below).
Calculation of tensors policy_loss and Q1_loss, Q2_loss
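The gist is not reproduced here; the sketch below follows the bullet points above. It assumes that policy.sample returns an action tensor together with its log-probability, that the critic returns both Q1 and Q2 values, and that the replay buffer returns batched tensors of matching shapes; the exact code in [Hop20] may differ:

```python
import torch
import torch.nn.functional as F

def compute_losses(agent, memory, batch_size, gamma, alpha):
    # A minibatch of transitions from the replay buffer (hypothetical API);
    # mask_batch is 0 for terminal transitions and 1 otherwise
    state_batch, action_batch, reward_batch, next_state_batch, mask_batch = \
        memory.sample(batch_size)

    # --- Q1_loss, Q2_loss: soft Bellman residual, eq. (5) in [SAC-2] ---
    with torch.no_grad():
        next_state_action, next_state_log_pi = agent.policy.sample(next_state_batch)
        q1_next, q2_next = agent.critic_target(next_state_batch, next_state_action)
        # Soft Bellman target: minimum of the two target critics plus the entropy term
        min_q_next = torch.min(q1_next, q2_next) - alpha * next_state_log_pi
        next_q_value = reward_batch + mask_batch * gamma * min_q_next

    q1, q2 = agent.critic(state_batch, action_batch)
    Q1_loss = F.mse_loss(q1, next_q_value)   # soft Bellman residual for Q1
    Q2_loss = F.mse_loss(q2, next_q_value)   # soft Bellman residual for Q2

    # --- policy_loss: eq. (7) in [SAC-2], with the minimum of the two critics ---
    action_batch_pi, log_pi = agent.policy.sample(state_batch)
    q1_pi, q2_pi = agent.critic(state_batch, action_batch_pi)
    min_q_pi = torch.min(q1_pi, q2_pi)
    policy_loss = (alpha * log_pi - min_q_pi).mean()

    return policy_loss, Q1_loss, Q2_loss
```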

Implementation schema

What happens in the function update_parameters (Image by Author)

The above diagram basically reflects what happens in the update_parameters function, the core of the SAC algorithm.

Conclusion
The KL-divergence, closely related to cross-entropy, can almost be regarded as a distance metric on the space of probability distributions. In its infinitesimal form, the KL-divergence even coincides with a specific Riemannian metric, the Fisher information metric. This “almost” metric generated by the KL-divergence is used to measure proximity in the space of Gaussian distributions, in which the best policy is sought at each improvement step. It turns out that a stochastic policy works better than a deterministic one; in particular, there are examples where the deterministic policy gradient update stalls and does not improve the policy even in very simple environments.


What will be in Part 3?

How log 𝜋(a|s) is calculated, the reparameterization trick in the context of SAC, and other related topics…

App 1. Gibbs’ inequality

Let p = {p₁, …, p_n} and q = {q₁, …, q_n} be two probability distributions. Suppose I is the set of all i for which pᵢ is non-zero. Then

Gibbs’ inequality

Let us see why Gibbs’ inequality holds. First,

Here, y = ln(x)

Then

The proof of Gibbs’ inequality

The sum of all nonzero values pᵢ is 1, while some nonzero qᵢ may be excluded from the sum, since by definition we keep only those indices for which pᵢ is not equal to zero. To go from ln(x) to log(x), we just divide all terms by ln 2.
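For completeness, the argument shown in the images above can be written out in one line. Here I is the set of indices with pᵢ > 0, and the only fact used is ln x ≤ x − 1 for x > 0:

```latex
\sum_{i \in I} p_i \ln\frac{q_i}{p_i}
  \;\le\; \sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right)
  \;=\; \sum_{i \in I} q_i \;-\; \sum_{i \in I} p_i
  \;\le\; 1 - 1 \;=\; 0,
\qquad\text{hence}\qquad
-\sum_{i \in I} p_i \log_2 p_i \;\le\; -\sum_{i \in I} p_i \log_2 q_i .
```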

App 2. Python code: KL-divergence

App 2.1. Gaussian pairs

App 2.2. Dependence of the KL-divergence on the standard deviation

References

[Br89] J. Bridle, Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters, 1989, NeurIPS Proceedings

[Do17] J. Dowty, Chentsov’s theorem for exponential families, 2017, arXiv

[Hop20] Project — HopperBulletEnv with Soft Actor-Critic (SAC), 2020, github

[Li15] T. Lillicrap et al., Continuous control with deep reinforcement learning, 2015, arXiv

[Ma19] G. Matheron et al., The problem with DDPG: understanding failures in deterministic environments with sparse rewards, 2019, arXiv

[No18] Y.-K. Noh et al., Bias Reduction and Metric Learning for Nearest-Neighbor Estimation of Kullback-Leibler Divergence, 2018, IEEE Xplore

[Pe08] F. Perez-Cruz, Kullback-Leibler divergence estimation of continuous distributions, 2008, IEEE Xplore

[Sa18] Y. Sako, Is the term “softmax” driving you nuts, 2018, Medium

[SAC-1] T. Haarnoja et al., Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018, arXiv

[SAC-2] T. Haarnoja et al., Soft Actor-Critic Algorithms and Applications, 2019, arXiv

[Si14] D. Silver et al., Deterministic Policy Gradient Algorithms, 2014, DeepMind Technologies

[St20a] R. Stekolshchik, A pair of interrelated neural networks in Deep Q-Network, 2020, TDS

[St20b] R. Stekolshchik, Three aspects of Deep RL: noise, overestimation and exploration, 2020, TDS

[St21] R. Stekolshchik, Entropy in Soft Actor-Critic (Part 1), 2021, TDS
