Entropy in Soft Actor-Critic (Part 1)
In probability theory, two principles are associated with entropy: the principle of maximum entropy and the principle of minimum cross-entropy. So from the very beginning we meet two types of entropy, and there are more to come.
The many faces of entropy
First of all, let us emphasize that neither the principle of maximum entropy nor the principle of minimum cross-entropy is a theorem; both are only principles of statistical inference. In this respect they resemble philosophical doctrines, although doctrines with definite mathematical implications. So we have two different types of entropy: entropy and cross-entropy. They are connected by the so-called relative entropy:
$$H(p, q) = H(p) + D_{KL}(p \,\|\, q)$$
Here H(p) is the entropy of p, H(p, q) is the cross-entropy, and D_KL(p ‖ q) is the relative entropy.
Another better known name for relative entropy is Kullback–Leibler divergence, or KL-divergence.
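The connection between the three quantities (cross-entropy equals entropy plus relative entropy) is easy to check numerically. The following is a minimal sketch using only the Python standard library; the function names and the two example distributions p and q are illustrative assumptions, not part of any SAC implementation.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits; outcomes with p_i = 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i log2 q_i (needs q_i > 0 wherever p_i > 0)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) = sum_i p_i log2(p_i / q_i)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

# Cross-entropy decomposes into entropy plus relative entropy.
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(p, q) = {lhs:.4f}, H(p) + KL(p || q) = {rhs:.4f}")
```

Since the relative entropy is zero only when q equals p, the cross-entropy H(p, q) is minimized over q exactly at q = p, where it reduces to H(p).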
“… this measure may be found in the technical literature under nine different names… My own preference has been the term discrimination information and its basis for the principle of minimum discrimination information (MDI)” S.Kullback [5]
Entropy as a learning tool
In Reinforcement Learning, the trade-off between exploration and exploitation is central. Committing to decisions too quickly, without enough exploration, can lead to serious failures, so exploration is a major component of learning. A well-known way to explore is to add noise to the actions. Entropy is another powerful exploration tool: a high-entropy policy helps the agent avoid repeatedly taking actions that exploit the same inconsistency.
Shannon’s entropy
Entropy is a physical property most commonly associated with a state of disorder or uncertainty. Let X be a discrete random variable, and let {x₁, …, xₙ} be the possible values of X, with probabilities pᵢ = p(xᵢ). Then the information entropy, or just entropy, or Shannon’s entropy, is defined as follows:
$$H(p) = -\sum_{i=1}^{n} p_i \log p_i \qquad (2)$$
Intuitively, entropy measures the uncertainty of the distribution p of the variable X. A typical example is the probability distribution associated with a coin. Let the coin be fair, i.e., heads and tails have equal probability: p = p(heads) = p(tails) = 1/2. Then,
H(p) = -(1/2 log(1/2) + 1/2 log(1/2)) = -log(1/2) = -(-1) = 1, since we take logarithms to base 2. This is the maximum uncertainty: the next toss is as hard to predict as it can be. Now take a biased coin with p(heads) = 0.1. Then,
H(p) = -(0.1 log(0.1) + 0.9 log(0.9)) = 0.332 + 0.137 = 0.469. In this case, the uncertainty is significantly less than the maximum uncertainty of 1.
Uniform probability yields maximum uncertainty and therefore maximum entropy.
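The two coin calculations, and the claim about the uniform distribution, can be reproduced with a few lines of Python. This is a minimal sketch; the function name shannon_entropy and the four-outcome distributions are illustrative assumptions.

```python
import math

def shannon_entropy(probs, base=2):
    """Entropy of a discrete distribution; zero-probability outcomes are skipped."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.1, 0.9]
print(f"fair coin:   {shannon_entropy(fair_coin):.3f} bits")
print(f"biased coin: {shannon_entropy(biased_coin):.3f} bits")

# For n outcomes, the uniform distribution attains the maximum value log2(n).
n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]  # hypothetical non-uniform distribution
print(f"uniform over {n}: {shannon_entropy(uniform):.3f} bits")
print(f"skewed over {n}:  {shannon_entropy(skewed):.3f} bits")
```

Any non-uniform distribution over the same four outcomes, such as the skewed one here, has entropy below log2(4) = 2 bits.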
For entropy curve generation, see python code in App 2.
Maximum entropy in statistical mechanics
In statistical mechanics, a Boltzmann distribution is a probability distribution that gives the probability pᵢ of state i of a system:
$$p_i = \frac{e^{-\varepsilon_i / (kT)}}{\sum_{j} e^{-\varepsilon_j / (kT)}}$$
where εᵢ is the energy of state i, T is the temperature of the system, and k is the Boltzmann constant. Among all distributions with a given mean energy, the Boltzmann distribution is the one that maximizes the entropy H(p). The Boltzmann distribution is also called the Gibbs distribution.
“… a method involving the notion of entropy, the very existence of which depends upon the second law of thermodynamics, will doubtless seem to many far-fetched, and may repel beginners as obscure and difficult of comprehension. This inconvenience is perhaps more than counterbalanced by the advantages of a method which makes the second law of thermodynamics so prominent and gives it so clear and elementary an expression.” (J.W.Gibbs, [6])
The Boltzmann distribution is closely related to the softmax function, which plays a central role in machine learning and neural networks. We will return to this issue in the next article (“Entropy in Soft Actor-Critic”, Part 2) [16].
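To make the link with softmax concrete, here is a minimal sketch in which the energies are taken to be the negated logits and kT plays the role of the softmax temperature. The function names, the logits, and the temperatures are illustrative assumptions.

```python
import math

def boltzmann(energies, kT=1.0):
    """Boltzmann probabilities: p_i proportional to exp(-energy_i / kT)."""
    e_min = min(energies)  # shift for numerical stability; cancels after normalization
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits, temperature=1.0):
    """Softmax with temperature: p_i proportional to exp(logit_i / temperature)."""
    z_max = max(logits)  # same stability shift
    weights = [math.exp((z - z_max) / temperature) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(p):
    """Shannon entropy with natural logarithms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits = [2.0, 1.0, 0.1]         # hypothetical scores
energies = [-z for z in logits]  # energy = negative logit

results = {}
for t in (0.5, 1.0, 5.0):
    p_soft = softmax(logits, temperature=t)
    p_boltz = boltzmann(energies, kT=t)
    results[t] = (p_soft, p_boltz, entropy_nats(p_soft))
    print(f"T = {t}: p = {[round(x, 3) for x in p_soft]}, H = {results[t][2]:.3f} nats")
```

Under this identification the two functions return the same probabilities, and raising the temperature flattens the distribution and increases its entropy.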
Bellman equation with built-in entropy term
Objective function with the entropy term
The SAC algorithm is based on the maximum entropy RL framework. The actor learning mechanism optimizes policies to maximize both the expected return and the expected entropy of the policy. The standard objective (maximizing expected return) is augmented with an entropy term H(p):
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{k=0}^{\infty} \gamma^k \Big(R_k + \alpha H\big(\pi(\cdot \mid s_k)\big)\Big)\right] \qquad (3)$$
Here γ is the discount factor, 0 < γ < 1, which ensures that the sum of (bounded) rewards and entropies is finite; α > 0 is the so-called temperature parameter, which determines the relative importance of the entropy term H(p); R_k is the reward at time k; 𝜋(·|s_k) is the probability distribution over actions defined by the policy 𝜋 in state s_k; and τ is the trajectory of (state, action) pairs (s_k, a_k) at times k.
Maximum reward and maximum entropy
In the SAC objective function (3), the reward R_k is augmented as follows:
$$R_k \;\longrightarrow\; R_k + \alpha H\big(\pi(\cdot \mid s_k)\big)$$
This means that the standard maximum reward objective is augmented with entropy maximization. The SAC algorithm aims to maximize the expected return and the entropy simultaneously.
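For a single finite trajectory, the entropy-augmented objective is just a discounted sum of rewards plus α times the policy entropy at each step. The sketch below illustrates this under simplifying assumptions: the trajectory is truncated to four steps, the per-step entropies are given as numbers rather than computed from a policy, and the function name soft_return and all values are hypothetical.

```python
def soft_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """Discounted sum of entropy-augmented rewards: sum_k gamma^k (R_k + alpha * H_k)."""
    return sum(gamma**k * (r + alpha * h)
               for k, (r, h) in enumerate(zip(rewards, entropies)))

# Hypothetical 4-step trajectory: rewards and per-step policy entropies.
rewards = [1.0, 0.0, 0.5, 2.0]
entropies = [1.2, 0.9, 0.7, 0.3]

plain = soft_return(rewards, entropies, alpha=0.0)  # standard discounted return
soft = soft_return(rewards, entropies, alpha=0.2)   # return with entropy bonus
print(f"standard return: {plain:.4f}")
print(f"soft return:     {soft:.4f}")
```

Setting α = 0 recovers the standard discounted return, so the temperature directly controls how much the agent is paid for keeping its policy stochastic.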
“… the maximum entropy formulation provides a substantial improvement in exploration and robustness… maximum entropy policies are robust in the face of model and estimation errors… they improve exploration by acquiring diverse behaviors.” [1]
State-value and action-value functions
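The state-value function V𝜋(s_t) is the expected return if we start in state s_t and follow the policy 𝜋. The state-value function including the entropy term is defined as follows:
$$V^{\pi}(s_t) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{k=0}^{\infty} \gamma^k \Big(R_{t+k} + \alpha H\big(\pi(\cdot \mid s_{t+k})\big)\Big) \,\Big|\, s_t\right]$$
The action-value function Q𝜋(s_t, a_t) is the expected return if we start in state s_t, take an arbitrary action a_t, and follow the policy 𝜋 afterwards. The action-value function including the entropy term is defined as follows:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} + \alpha \sum_{k=1}^{\infty} \gamma^k H\big(\pi(\cdot \mid s_{t+k})\big) \,\Big|\, s_t, a_t\right] \qquad (5)$$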
Soft state-value function
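Note that in the entropy term of the action-value function, the sum starts at k = 1. Thus, the action-value function Q𝜋(s_t, a_t) differs from the state-value function V𝜋(s_t) in exactly one term:
$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q^{\pi}(s_t, a_t)\big] + \alpha H\big(\pi(\cdot \mid s_t)\big) \qquad (6)$$
Substitute expression (2) for the entropy H(p) into (6), written as the expectation H(𝜋(·|s_t)) = E[−log 𝜋(a_t|s_t)]:
$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q^{\pi}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\big] \qquad (7)$$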
The function V𝜋(s) is the so-called soft state-value function, see “Soft Actor-Critic Algorithms and Applications”, p.5.
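The step from (6) to (7) can be checked numerically for a policy over a small discrete action set. SAC itself works with continuous actions, so the discrete setting, the function names, and the Q-values and probabilities below are all simplifying assumptions made for illustration.

```python
import math

def soft_state_value(q_values, policy_probs, alpha):
    """V(s) = E_{a ~ pi}[Q(s, a) - alpha * log pi(a|s)] for a discrete action set."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, policy_probs) if p > 0)

def policy_entropy(policy_probs):
    """H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s), in nats."""
    return -sum(p * math.log(p) for p in policy_probs if p > 0)

# Hypothetical state with three actions.
q_values = [1.0, 2.0, 0.5]
policy_probs = [0.2, 0.7, 0.1]
alpha = 0.2

v_soft = soft_state_value(q_values, policy_probs, alpha)
expected_q = sum(p * q for q, p in zip(q_values, policy_probs))
v_from_entropy = expected_q + alpha * policy_entropy(policy_probs)
print(f"E[Q - alpha log pi] = {v_soft:.4f}")
print(f"E[Q] + alpha H(pi)  = {v_from_entropy:.4f}")
```

The two printed values agree, which is exactly the statement that the soft state value is the expected Q-value plus the entropy bonus for the current state.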
Soft Bellman equation
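Let us separate the first term from the first sum in eq. (5):
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\left[R_t + \sum_{k=1}^{\infty} \gamma^k R_{t+k} + \alpha \sum_{k=1}^{\infty} \gamma^k H\big(\pi(\cdot \mid s_{t+k})\big) \,\Big|\, s_t, a_t\right] \qquad (8)$$
By substituting k = p+1 we get
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\left[R_t + \gamma \sum_{p=0}^{\infty} \gamma^{p} \Big(R_{t+1+p} + \alpha H\big(\pi(\cdot \mid s_{t+1+p})\big)\Big) \,\Big|\, s_t, a_t\right] \qquad (9)$$
The expected value of the expression after the γ in eq. (9) is the state-value function V𝜋(s_{t+1}), so
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\big[R_t + \gamma V^{\pi}(s_{t+1})\big] \qquad (10)$$
Eq. (10) is the modified Bellman equation, see “Soft Actor-Critic Algorithms and Applications”, p.2. This equation is known as the soft Bellman equation. The entropy is an integral part of this equation.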
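The index shift behind the soft Bellman equation can be verified along a single sampled trajectory, where the expectations drop out. The sketch below assumes a finite trajectory with given per-step entropies; the function names and the numbers are hypothetical.

```python
def soft_v(rewards, entropies, t, gamma, alpha):
    """V at step t along one trajectory: sum_{k>=0} gamma^k (R_{t+k} + alpha * H_{t+k})."""
    return sum(gamma**k * (rewards[t + k] + alpha * entropies[t + k])
               for k in range(len(rewards) - t))

def soft_q(rewards, entropies, t, gamma, alpha):
    """Q at step t: the reward sum starts at k = 0, the entropy sum at k = 1."""
    n = len(rewards) - t
    reward_sum = sum(gamma**k * rewards[t + k] for k in range(n))
    entropy_sum = sum(gamma**k * entropies[t + k] for k in range(1, n))
    return reward_sum + alpha * entropy_sum

rewards = [1.0, 0.0, 0.5, 2.0]
entropies = [1.2, 0.9, 0.7, 0.3]
gamma, alpha = 0.9, 0.2

for t in range(len(rewards) - 1):
    q_t = soft_q(rewards, entropies, t, gamma, alpha)
    v_t = soft_v(rewards, entropies, t, gamma, alpha)
    v_next = soft_v(rewards, entropies, t + 1, gamma, alpha)
    # Soft Bellman recursion: Q_t = R_t + gamma * V_{t+1}
    assert abs(q_t - (rewards[t] + gamma * v_next)) < 1e-12
    # V and Q differ by exactly one entropy term: V_t = Q_t + alpha * H_t
    assert abs(v_t - (q_t + alpha * entropies[t])) < 1e-12
    print(f"t = {t}: Q = {q_t:.4f}, R + gamma * V_next = {rewards[t] + gamma * v_next:.4f}")
```

Both assertions hold at every step, which mirrors the soft Bellman recursion and the one-term difference between V and Q on this sample path.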
Actor-Critic Neural Networks
Below we present snippets of the SAC algorithm related to the computation of some tensors and neural networks. The Q-values qf1, qf2, qf1_next, qf2_next are calculated using the neural network critic. The tensors next_state_action, next_state_log_pi are calculated using the neural network policy (actor). The network critic is defined by the class QNetwork, and the network policy is defined by the class GaussianPolicy. There is a third neural network, which will be discussed in the following article (“Entropy in Soft Actor-Critic”, Part 2).
Computing tensors V(s) and Q(s,a)
The arrays state_batch, action_batch, reward_batch, next_state_batch, and mask_batch are taken from previously saved episodes in the class ReplayMemory.
The soft state-value function V(s) from (7) is represented by the tensor min_qf_next, and the soft action-value function Q(s,a) from the Bellman equation (10) is represented by the tensor next_q_value.
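How the two tensors relate to eqs. (7) and (10) can be sketched without a deep learning framework. The code below is not the implementation discussed in this article: plain Python lists stand in for PyTorch tensors, the two helper functions are hypothetical, and it assumes the usual SAC target in which the smaller of two next-state Q-estimates is used and mask_batch is 0 for terminal transitions and 1 otherwise.

```python
def soft_value_next(qf1_next, qf2_next, next_state_log_pi, alpha):
    """Per-sample estimate of V(s') = min(Q1, Q2) - alpha * log pi(a'|s'), cf. eq. (7)."""
    return [min(q1, q2) - alpha * log_pi
            for q1, q2, log_pi in zip(qf1_next, qf2_next, next_state_log_pi)]

def bellman_target(reward_batch, mask_batch, min_qf_next, gamma):
    """Per-sample target Q(s, a) = r + mask * gamma * V(s'), cf. eq. (10)."""
    return [r + m * gamma * v
            for r, m, v in zip(reward_batch, mask_batch, min_qf_next)]

# Hypothetical batch of three transitions; the last one ends its episode.
reward_batch = [1.0, 0.5, -1.0]
mask_batch = [1.0, 1.0, 0.0]
qf1_next = [10.0, 4.0, 7.0]
qf2_next = [9.0, 5.0, 8.0]
next_state_log_pi = [-1.0, -0.5, -2.0]
alpha, gamma = 0.2, 0.99

min_qf_next = soft_value_next(qf1_next, qf2_next, next_state_log_pi, alpha)
next_q_value = bellman_target(reward_batch, mask_batch, min_qf_next, gamma)
print("min_qf_next: ", [round(v, 3) for v in min_qf_next])
print("next_q_value:", [round(v, 3) for v in next_q_value])
```

For the terminal transition the mask removes the bootstrapped term, so its target is just the reward.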
Conclusion
J.W. Gibbs used the notion of entropy in thermodynamics, and John von Neumann extended the classical Gibbs entropy to quantum mechanics, where it characterizes the entropy of entanglement. At von Neumann’s suggestion, Claude Shannon gave the name entropy to his measure of missing information, by analogy with its use in statistical mechanics. This is how information theory was born. The concept of entropy, which characterizes a system as a whole through its individual parts, has proved very productive in various fields of knowledge. Today we see that it is also very useful in systems connected with deep learning and artificial intelligence. We may soon see other concepts and laws that, like entropy, link physics and artificial intelligence.
App 1. Some features of Soft Actor-Critic
Off-policy
SAC is an off-policy algorithm. This means that SAC can reuse data that has already been collected.
memory = ReplayMemory(replay_size)
# Sample a batch of stored transitions from memory (batch_size = 256)
state_batch, action_batch, reward_batch, next_state_batch, mask_batch = memory.sample(batch_size=batch_size)
On-policy learning algorithms such as Proximal Policy Optimization (PPO), by contrast, suffer from poor sample complexity. They require new samples to be collected for each gradient step, which quickly becomes very expensive.
Stochastic Actor-Critic Training
The off-policy algorithm DDPG (deep deterministic policy gradient) can be viewed both as a deterministic actor-critic algorithm and as an approximate Q-learning algorithm. However, the interplay between these two views makes DDPG brittle to hyper-parameter settings. SAC avoids the potential instability associated with approximate inference in prior off-policy maximum entropy algorithms based on soft Q-learning. Instead, SAC combines off-policy actor-critic training with a stochastic actor, and further aims to maximize the entropy of this actor with an entropy maximization objective.
Double-Q trick
Soft Actor-Critic isn’t a direct successor to TD3 (having been published roughly concurrently), but it incorporates the clipped double-Q trick:
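# Critic update: both Q-networks regress onto the same soft Bellman target
qf1, qf2 = self.critic(state_batch, action_batch)
qf1_loss = F.mse_loss(qf1, next_q_value)
qf2_loss = F.mse_loss(qf2, next_q_value)
# Actor update: score freshly sampled actions with the smaller of the two Q-estimates
pi, log_pi, _ = self.policy.sample(state_batch)
qf1_pi, qf2_pi = self.critic(state_batch, pi)
min_qf_pi = torch.min(qf1_pi, qf2_pi)
policy_loss = ((self.alpha * log_pi) - min_qf_pi).mean()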
Two Q-functions are used to mitigate the positive bias in the policy improvement step.
Maximum entropy vs. entropy regularization
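In the PPO algorithm, an entropy regularization term is added to the objective function to ensure sufficient exploration [14]:
$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\Big[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\Big]$$
Here L^{CLIP} is the clipped surrogate objective, L^{VF} is the value-function loss, S is the entropy bonus, and c_1, c_2 are coefficients.
The exploration provided by the principle of maximum entropy in SAC allows the agent to discover better policies faster than algorithms, such as PPO, that only use an entropy regularization term.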
App 2. Python code: Entropy Curve Generation
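A minimal sketch of how the entropy curve of a coin toss can be generated is given below. It is an illustration written for this appendix under stated assumptions, not the original listing: the function names and the grid size are hypothetical, and matplotlib is assumed to be installed only if the plotting function is actually called.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin with P(heads) = p; taken as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def entropy_curve(num_points=101):
    """Return the grid of p values over [0, 1] and the corresponding entropies."""
    ps = [i / (num_points - 1) for i in range(num_points)]
    return ps, [binary_entropy(p) for p in ps]

def plot_entropy_curve():
    """Plot H(p) against p; matplotlib is imported only when plotting is requested."""
    import matplotlib.pyplot as plt
    ps, hs = entropy_curve()
    plt.plot(ps, hs)
    plt.xlabel("p = P(heads)")
    plt.ylabel("H(p), bits")
    plt.title("Entropy of a coin toss")
    plt.grid(True)
    plt.show()

ps, hs = entropy_curve()
peak = max(range(len(hs)), key=hs.__getitem__)
print(f"maximum H = {hs[peak]:.3f} bits at p = {ps[peak]:.2f}")
```

Calling plot_entropy_curve() draws the curve, which is symmetric about p = 0.5, peaks there, and falls to zero at p = 0 and p = 1.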
References
[1] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018, arXiv
[2] Soft Actor-Critic Algorithms and Applications, 2019, arXiv
[3] Reinforcement Learning with Deep Energy-Based Policies, 2017, arXiv
[4] E.T. Jaynes, Information Theory and Statistical Mechanics I, II, 1957
[5] S. Kullback, Letter to the Editor: The Kullback–Leibler distance, 1987
[6] J.W. Gibbs, Graphical Methods in the Thermodynamics of Fluids, retrieved 2011, Wikisource
[7] Soft Actor-Critic, 2018, OpenAI, Spinning Up
[8] Soft Actor-Critic Demystified, 2019, TDS
[9] Three aspects of Deep RL: noise, overestimation and exploration, 2020, TDS
[10] A pair of interrelated neural networks in Deep Q-Network, 2020, TDS
[11] How does the Bellman equation work in Deep RL?, 2020, TDS
[12] Exploration Strategies in Deep Reinforcement Learning, 2020, github.io
[13] Project - HopperBulletEnv with Soft Actor-Critic (SAC), 2020, github
[14] Proximal Policy Optimization Algorithms, 2017, arXiv
[15] Continuous control with deep reinforcement learning, v6, 2015, arXiv
[16] Entropy in Soft Actor-Critic (Part 2), 2021