
Entropy is a concept associated with a state of disorder, randomness, or uncertainty. It can be considered a measure of information for random variables. Traditionally, it is associated with fields such as thermodynamics, but the term has found its way into many other domains.
In 1948, Claude Shannon introduced the notion of entropy in information theory. In this context, an event is considered to offer more information if it has a lower probability of happening; the information of an event is inversely correlated to its probability of occurrence. Intuitively: we learn more from rare events.
The notion of entropy can be formalized as follows:
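For a discrete random variable $X$ with possible outcomes $x$ occurring with probability $p(x)$:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$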

In Reinforcement Learning (RL), the notion of entropy has been deployed as well, with the purpose of encouraging exploration. In this context, entropy is a measure of the unpredictability of actions returned by a stochastic policy.
Concretely, RL takes the entropy of the policy (i.e., probability distribution of actions) as a bonus and embeds it as a reward component. This article addresses the basic case, but entropy bonuses are an integral part of many state-of-the-art RL algorithms.
What is entropy?
First, let’s build a bit of intuition for the concept of entropy. The figure below shows policies with low and high entropy, respectively. The low-entropy policy is nearly deterministic; we almost always select the same action. In the high-entropy policy, there is much more randomness in the action that we select.
![Example of low-entropy policy (left) and high-entropy policy (right). In the high-entropy policy, there is much more randomness in the action selection [image by author]](https://towardsdatascience.com/wp-content/uploads/2023/10/1Gnx-mntD0VAMZmlVzsKMOg.png)
Next, let’s consider the entropy of a coin flip.
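A quick sketch of the computation in NumPy:

```python
import numpy as np

# Fair coin: heads and tails each occur with probability 0.5
probs = np.array([0.5, 0.5])

# Shannon entropy in bits: H = -sum(p * log2(p))
entropy = -np.sum(probs * np.log2(probs))
print(entropy)  # 1.0
```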
Shannon’s entropy uses a logarithm with base 2 (np.log2 in NumPy), and the corresponding unit of measurement is called a bit. Other forms of entropy use different bases; these distinctions are not terribly important for grasping the main idea.
What happens if the coin is loaded? The figure below shows that entropy decreases, as there is more certainty about whether a given outcome will occur.
![Entropy of a coin flip with varying probabilities of head and tail, measured in bits. Entropy peaks when the outcome of the coin flip is most uncertain [image from Wikipedia]](https://towardsdatascience.com/wp-content/uploads/2023/10/08olrwlcRB_pXi9eC.png)
Now let’s compute the entropy of a fair die:
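Again a minimal NumPy sketch, now with six equally likely outcomes:

```python
import numpy as np

# Fair die: six outcomes, each with probability 1/6
probs = np.full(6, 1 / 6)

entropy = -np.sum(probs * np.log2(probs))
print(round(entropy, 2))  # 2.58
```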
Note that the die has a higher entropy (2.58 bits) than the fair coin (1 bit). Although both have uniform outcome probabilities, the individual outcome probabilities of the die are lower (1/6 versus 1/2).
Now, let’s consider the probabilities of a loaded die, e.g., [3/12, 1/12, 2/12, 2/12, 2/12, 2/12]. The corresponding entropy is 2.52 bits, reflecting that outcomes are slightly more predictable now (we are more likely to roll a 1 and less likely to roll a 2). Finally, let’s consider a more heavily loaded die with probabilities [7/12, 1/12, 1/12, 1/12, 1/12, 1/12]. Now, we get an entropy of 1.95 bits. The predictability of the outcomes has further increased, as evidenced by the decreased entropy.
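The same computation can be applied to the loaded dice:

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log2(probs))

print(round(entropy_bits([3/12, 1/12, 2/12, 2/12, 2/12, 2/12]), 2))  # 2.52
print(round(entropy_bits([7/12, 1/12, 1/12, 1/12, 1/12, 1/12]), 2))  # 1.95
```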
Armed with an understanding of entropy, let’s see how we can utilize it in Reinforcement Learning.
Entropy-Regularized Reinforcement Learning
We define entropy in the context of RL, in which action probabilities derive from a stochastic policy π(⋅|s).
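For a given state $s$ and discrete action set $\mathcal{A}$:

$$H\big(\pi(\cdot|s)\big) = -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)$$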

We use the entropy of the policy as an entropy bonus, adding it to our reward function. Note that we do this for every time step, implying that present actions also position us to maximize future entropy:
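A common way to write the resulting objective, with discount factor $\gamma$ and the entropy coefficient $\alpha$ discussed below, is

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t} \gamma^t \Big( r(s_t, a_t) + \alpha\, H\big(\pi(\cdot|s_t)\big) \Big)\right]$$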

This might seem counterintuitive at first. After all, RL aims to optimize decision-making, routinely favoring good actions over bad ones. Why would we alter our reward signals in a way that encourages us to maximize entropy? This would be a good point to introduce the principle of maximum entropy:
"the probability distribution [policy] which best represents the current state of knowledge about a system [environment] is the one with largest entropy, in the context of precisely stated prior data [observed rewards]" – Wikipedia
If the entropy of the policy is large (e.g., shortly after initialization), we don’t know all that much about the impact of different actions. The high-entropy policy reflects that we haven’t yet sufficiently explored the environment and still need to observe rewards from a variety of actions.
Of course, ultimately we do want to take good actions rather than explore endlessly, meaning we have to decide how much emphasis we put on the entropy bonus. This is done through the entropy regularization coefficient α, which is a tunable parameter. For practical purposes, the weighted entropy bonus αH(π(⋅|s)) can simply be viewed as a reward component that encourages exploration.
Note that the entropy bonus is always computed over the full action space, so we add the same bonus when evaluating each action. In a typical stochastic policy approach, action probabilities are proportional to their expected rewards, including the entropy bonus (e.g., by applying a softmax function to them). Thus, if the entropy bonus is very large relative to the rewards, action probabilities end up more or less equal. If it is very small, the rewards dominate the action probabilities.
Python implementation
Time to implement entropy-regularized reinforcement learning. For the purpose of this article we use a basic discrete policy gradient algorithm in a multi-armed bandit context, but it can easily be extended to more sophisticated environments and algorithms.
Remember that policy gradient algorithms have a built-in exploration mechanism; entropy bonuses applied to inherently deterministic policies (e.g., the greedy policy in Q-learning) have more pronounced effects.
Incorporating entropy regularization is quite straightforward. We simply add the entropy bonus to the rewards – so it will be incorporated in the loss function – and proceed as usual. Depending on the algorithm and problem setting, you might encounter a number of variants in the literature and codebases, but the core idea remains the same. The code snippet below (a TensorFlow implementation of discrete policy gradient) illustrates how it works.
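Below is a minimal sketch of what such a training loop can look like in TensorFlow 2.x for the bandit problem described next; the variable names, learning rate, and value of α are illustrative assumptions rather than a definitive implementation:

```python
import numpy as np
import tensorflow as tf

# Slot machines: mean payoffs and (shared) standard deviation
mu = np.array([1.00, 0.80, 0.90, 0.98])
sigma = 0.01

alpha = 0.01           # entropy regularization coefficient (illustrative value)
num_episodes = 10_000

# Trainable logits define the stochastic policy pi(a) = softmax(logits)
logits = tf.Variable(tf.zeros(len(mu)))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

for episode in range(num_episodes):
    with tf.GradientTape() as tape:
        probs = tf.nn.softmax(logits)

        # Sample an action (machine) from the current policy
        action = int(tf.random.categorical(tf.math.log(probs)[None, :], 1)[0, 0])

        # Observe a noisy reward from the chosen machine
        reward = np.random.normal(mu[action], sigma)

        # Entropy bonus over the full action distribution, added to the reward
        entropy = -tf.reduce_sum(probs * tf.math.log(probs + 1e-8))
        total_reward = reward + alpha * float(entropy)

        # Discrete policy gradient (REINFORCE) loss
        loss = -tf.math.log(probs[action] + 1e-8) * total_reward

    grads = tape.gradient(loss, [logits])
    optimizer.apply_gradients(zip(grads, [logits]))
```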
Consider a set of slot machines with mean payoffs μ=[1.00, 0.80, 0.90, 0.98] and standard deviation σ=0.01 for all machines. Clearly, the optimal policy is to play Machine #1 with probability 1.0. However, without sufficient exploration, it is easy to mistake Machine #4 for the best machine.
To illustrate the idea, let’s first see the algorithmic behavior without entropy bonus (α=0). We plot the probability of playing Machine #1. Although each machine starts with an equal probability, the algorithm fairly quickly identifies that Machine #1 yields the highest expected reward and starts playing it with increased probability.
![Here, the algorithm without entropy regularization converges to playing the optimal Machine #1. Left: probability per episode of playing Machine #1. Right: probabilities per machine after 10k episodes [image by author]](https://towardsdatascience.com/wp-content/uploads/2023/10/1T17Zqx15W89b-VFVjiW57w.png)
However, that could have gone very differently… Below we see a run with the same algorithm, only this time it erroneously converges to the suboptimal Machine #4.
![In this case, the algorithm without entropy regularization converges to predominantly playing the suboptimal Machine #4. Left: probability per episode of playing Machine #1. Right: probabilities per machine after 10k episodes [image by author]](https://towardsdatascience.com/wp-content/uploads/2023/10/1ZuEKbQwb7np8HjZ3rU9WVg.png)
Now, we set α=1. This yields an entropy bonus that is large relative to the rewards. Although the probability of playing Machine #1 gradually increases, the regularization component continues to encourage strong exploration even after 10k iterations.
![With entropy regularization, we see that the algorithm still explores a lot after 10k episodes, although slowly recognizing that Machine #1 offers superior rewards. Left: probability per episode of playing Machine #1. Right: probabilities per machine after 10k episodes [image by author]](https://towardsdatascience.com/wp-content/uploads/2023/10/1RWuRr9dNijEzFDLBadO6jQ.png)
Evidently, in practice we don’t know the true best solution, nor how much exploration is desirable. Typically, you’ll encounter values in the neighborhood of α=0.001, but the ideal balance between exploration and exploitation strongly depends on the problem. Thus, it often requires some trial and error to find an appropriate entropy bonus. The coefficient α may also be dynamic, either through a predetermined decay scheme (sketched below) or by learning it as a parameter in its own right.
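A predetermined decay scheme can be as simple as the following sketch (values are purely illustrative):

```python
# Exponentially decay the entropy coefficient towards a small floor value
alpha, alpha_min, decay_rate = 0.1, 0.001, 0.999

for episode in range(10_000):
    # ... run an episode and update the policy using the current alpha ...
    alpha = max(alpha_min, alpha * decay_rate)
```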
Applications in Reinforcement Learning
The principle of entropy regularization can be applied to just about any RL algorithm. For instance, you may add entropy bonuses to Q-values and transform the results into action probabilities through a softmax layer (soft Q-learning). State-of-the-art algorithms such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) typically include an entropy bonus, which has empirically been shown to often enhance performance. Specifically, it offers the following three benefits:
I. Better solution quality
As elaborated earlier, the entropy bonus encourages exploration. This is particularly helpful when dealing with sparse rewards, as we rarely receive feedback on our actions and might erroneously keep repeating sub-optimal actions whose rewards we happen to overestimate.
II. Better robustness
Given that the entropy bonus encourages us to explore more, we will also encounter rare or deviating state-action pairs more often. Because we have encountered a richer and more diverse set of experiences, we learn a policy that is better equipped to handle a variety of situations. This added robustness enhances the quality of the policy.
III. Easier transfer learning
Increased exploration also helps to adapt a learned policy to new tasks and environments. The more diverse set of experiences allows us to adapt better to the new situation, because we have already learned from comparable circumstances. As such, entropy regularization is often useful for transfer learning, making it easier to retrain or update learned policies when dealing with changing environments.
TL;DR
- An entropy bonus encourages exploration of the action space, aiming to avoid premature convergence
- The balance between reward (exploitation) and bonus (exploration) is governed through a coefficient that requires fine-tuning
- Entropy bonuses are commonly used in modern RL algorithms such as PPO and SAC
- Exploration enhances quality, robustness and adaptability to new instance variants.
- Entropy regularization is particularly useful when dealing with sparse rewards, when robustness is important, and/or when the policy should be applicable to related problem settings.
If you are interested in entropy regularization in RL, you might want to check out the following articles as well:
- Soft actor-critic
- A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
Further reading
Ahmed, Z., Le Roux, N., Norouzi, M., & Schuurmans, D. (2019). Understanding the impact of entropy on policy optimization. International Conference on Machine Learning.
Eysenbach, B. & Levine, S. (2022). Maximum Entropy RL (Provably) Solves Some Robust RL Problems. International Conference on Learning Representations.
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement learning with deep energy-based policies. International Conference on Machine Learning.
Reddy, A. (2021). How Maximum Entropy makes Reinforcement Learning Robust. Machine Learning at Berkeley.
Schulman, J., Chen, X., & Abbeel, P. (2017). Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440.
Tang, H. & Haarnoja, T. (2017). Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning. BAIR, Berkeley.
Yu, H., Zhang, H., & Xu, W. (2022). Do You Need the Entropy Reward (in Practice)? arXiv preprint arXiv:2201.12434.