Empowerment as Intrinsic Motivation

In the absence of goals or rewards, be empowered

Chris Marais
Towards Data Science


Having money, influential friends, or a vehicle means that you are more empowered to decide what kind of future you want to live in. It doesn’t necessarily mean that you know which goals are the right ones, but it certainly puts you in a place of power, where many possible futures are available to choose from.

This concept of empowerment was formalised in the context of designing adaptive agents by Klyubin et al. [1]. It is intended as a kind of goal-independent intrinsic motivator for behaviour, and has produced some interesting results for robotics, reinforcement learning and adaptive systems generally. For example, using an empowerment maximisation strategy to control a simulated double pendulum, Jung et al. [4] found the following behaviour:

Maximising empowerment in a double pendulum controller. The controller finds the most unstable point of the pendulum and keeps it there, even though there is no explicitly encoded goal of reaching this state. Image taken from Jung et al. [4]

The double pendulum is driven to its most unstable point (the vertically upright position) and kept there, even though no explicit goal of finding this state was programmed into the controller. The controller simply maximises empowerment, and it so happens that the upright position is the most empowered one, because more possible future states are reachable from this unstable point than from anywhere else.

Empowerment is based on an information-theoretic formalism which considers actions and future sensations as a kind of information transmission channel (like those described by Claude Shannon in his pioneering work on communications engineering [5]). My goal in this article is to explain empowerment in the simplest possible way, including the mathematics of how to calculate it, and its implications for reinforcement learning and AI. At the end of the article I’ve provided a link to my GitHub repository where you can find code implementing empowerment in discrete worlds.

Rewards and Reinforcement Learning

Reinforcement learning is about learning an optimal policy for action in an uncertain world which provides a reward signal at each time step. DeepMind have famously used reinforcement learning techniques to exceed human-level performance on Atari games [2] by training deep networks, using the game score as the reward signal.

Space Invaders Atari game. DeepMind’s Q-learning network outperforms human-level game play by training deep networks using the game pixels as input, and the game score as reward signal.

While this is impressive, biological creatures in the real world face vastly more complex and uncertain worlds — compare the sensory stimuli from your eyes and ears with the simple pixels and score signals that were used as inputs to DeepMind’s neural nets. But more crucially, no obvious reward signal exists at all for biological organisms. There is no all-knowing guardian or ‘oracle’ that tells ants, squirrels, or humans for that matter when they have made a “good” or a “bad” move.

In reality, the most impressive quality of living organisms like ourselves is our ability to continue existing. Not only do our internal organs keep doing what they do to keep us alive, but we generally avoid situations which might lead to our death. Are there any general principles which our brains / bodies might follow so as to increase the chances of our continued existence?

Intrinsic Motivation

Karl Friston has proposed an ambitious framework which suggests that the function of the brain is to minimise a quantity known as free energy [7]. Not only this, but he suggests that minimising this quantity is a must for any organism that aims to avoid the destructive effect of the tendency toward disorder in the universe (the second law of thermodynamics). However, the free energy framework is not the topic of this article, so I won’t go into detail here.

I mention this to introduce the notion of intrinsic motivation. At the base, all organisms must resist the universe’s tendency to disorder in order to continue existing. Most of this work is done by our metabolisms, which harvest energy for useful work, and eject high-entropy waste products. But at the higher level of behaviour, in which much of this useful energy is invested, the question is: are there any general principles which might guide behaviour in the absence of specific goals or reward signals?

Suppose, for instance, that for the moment I have no goal or task to complete, but I know that at some future time a task may arise. In the meantime, is there any principled way to behave so as to maximise my preparedness for this future task?

Empowerment

Well, all else being equal, according to Klyubin, agents should maximise the number of possible future outcomes of their actions. “Keeping your options open” in this way means that when a task does arise, one is as empowered as possible to carry out whatever needs to be done to complete it. Klyubin et al. present the concept nicely in two aptly titled papers: “All Else Being Equal Be Empowered” [1] and “Keep Your Options Open: An Information-Based Driving Principle for Sensorimotor Systems” [3]. Since then, a lot of exciting work in robotics and reinforcement learning has used and extended the concept [8,9].

The main innovation here was formalising empowerment using information theory. To do this, the world is treated as an information-theoretic channel converting actions into future sensory states. By viewing actions and subsequent sensations as being related via a channel, we can precisely quantify empowerment as the information-theoretic channel capacity of the sensor-actuator channel. That is, how much information can I inject into my future sensory states via my actions? Said more intuitively, how manipulable is my environment?

To define this, we need to first define the mutual information between two random variables X and Y. Suppose X and Y are related by a channel X→Y, described by a conditional distribution p(y|x). That is, values x prescribe probability distributions over values y (remember that channels may be noisy, so that x only determines y with some probability, hence a probability distribution rather than an exact value associated with each x). If the inputs x to the channel are distributed as p(x), the mutual information between X and Y is given by the following equation.
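In the standard discrete form (see Cover & Thomas [6]):

$$
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x)\,p(y \mid x)\,\log_2 \frac{p(y \mid x)}{p(y)},
\qquad p(y) \;=\; \sum_{x} p(x)\,p(y \mid x)
$$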

Mutual information between X and Y. Formally, it is the average reduction in uncertainty (entropy) in Y due to knowledge of X (and vice versa, although that may not be immediately clear).

Mutual information quantifies a reduction in uncertainty (entropy) of Y given knowledge of X, and is measured in bits. As it turns out, the relation is symmetric, so that I(X;Y) = I(Y;X).

In our case, the channel of interest is the sensor-actuator channel. Let’s call A the actuator variable (describing actions) and S the variable defining the sensor reading at the next time step. Given a state of the agent/environment system, actions lead to future sensations via a probabilistic rule p(s|a), or channel A→S, determined by the environment dynamics at this particular point in the world. A distribution over actions p(a) is then sufficient to determine a mutual information I(A;S) (exactly as with X and Y above). The channel capacity, here the empowerment, is simply the maximum possible mutual information over action distributions p(a).
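Written out, with $\mathfrak{E}$ denoting empowerment:

$$
\mathfrak{E} \;=\; C\big(p(s \mid a)\big) \;=\; \max_{p(a)} I(A;S)
$$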

For a fixed channel p(s|a), empowerment is the maximum mutual information over action distributions p(a).

Similarly, we can consider the random variable A as describing an n-step action sequence and S the sensor reading n time steps later, hence defining so-called n-step empowerment.
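In the same notation (with everything conditioned on the current state of the world):

$$
\mathfrak{E}_n \;=\; \max_{p(a_t,\dots,a_{t+n-1})} I\big(A_t,\dots,A_{t+n-1};\, S_{t+n}\big)
$$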

To help think about what this means, consider this. When I act, I change my environment, and thus affect what I perceive at the next time step. When all my possible actions result in the exact same sensory reading at the next time step, I have absolutely no control; that is, I am not empowered, and this is reflected in a mutual information of zero (no reduction in uncertainty about the outcome). When each of my actions leads to a distinct sensory reading at the next time step, I have a lot of control, so I am very empowered (reflected in a high mutual information).
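To make that concrete, here is a small Python sketch (illustrative only; the function and the toy channels are mine, not taken from the cited papers or the linked repository) computing I(A;S) for the two extremes: a channel in which every action leads to the same sensor reading, and one in which each action leads to a distinct reading.

```python
import numpy as np

def mutual_information(p_a, p_s_given_a):
    """I(A;S) in bits, for an action distribution p_a of shape (n_actions,)
    and a channel p_s_given_a of shape (n_actions, n_states)."""
    p_s = p_a @ p_s_given_a                      # marginal p(s)
    joint = p_a[:, None] * p_s_given_a           # joint p(a, s)
    mask = joint > 0                             # skip zero-probability terms
    p_s_full = np.broadcast_to(p_s, joint.shape)
    return float(np.sum(joint[mask] * np.log2(p_s_given_a[mask] / p_s_full[mask])))

n_actions = 4
uniform = np.full(n_actions, 1.0 / n_actions)

# No control: every action leads to the same sensor state -> I(A;S) = 0 bits.
no_control = np.zeros((n_actions, n_actions))
no_control[:, 0] = 1.0
print(mutual_information(uniform, no_control))    # 0.0

# Full control: each action leads to a distinct sensor state -> log2(4) = 2 bits.
# (For these two channels the uniform p(a) already achieves the maximum over
# action distributions, so the printed values are also the empowerment.)
full_control = np.eye(n_actions)
print(mutual_information(uniform, full_control))  # 2.0
```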

Klyubin’s Maze World

If your eyes have just glazed over, it’s time to wake up for a revealing example. If you get as excited as I do about these information-theoretic formalisms, it’s worth having a look at Cover & Thomas’s book[6], which covers topics like entropy, mutual information and channel capacity in detail.

In Klyubin’s paper “All Else Being Equal Be Empowered”, a simple maze world example was considered to illustrate the concept. Suppose that an agent exists in a 2D grid world, and can make actions to go North, South, East, West, or Stay. The world contains walls though, making it into a kind of maze. Actions commanding the agent into walls have no effect, and the edges of the world are walled (i.e. no looping round).

Suppose that the agent’s sensor simply reports the absolute position of the agent in the world. The figure below shows the empowerment value at each grid cell in the world, calculated as the channel capacity of the channel relating possible 5-step actions to the sensor reading after the 5-step sequence.

5-step empowerment value at each point in Klyubin’s deterministic maze world. Highly empowered positions are those from which more future states are reachable.

We see that in parts of the world where the agent is more “stuck”, the empowerment is lower (the channel capacity is around 3.5 bits). “Stuck” means that even the most promising distribution over actions only leads to a limited number of future states. Highly empowered states (like those near the middle of the map) correspond to states where many possible futures are open to the agent.
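For a deterministic world like this, the channel capacity takes a particularly simple form: it is log2 of the number of distinct positions reachable by some 5-step action sequence (a noisy channel would instead require an iterative optimisation such as the Blahut-Arimoto algorithm). The Python sketch below computes an empowerment map along those lines. It is not the linked repository code, and the maze layout is a placeholder of my own rather than Klyubin’s actual maze, so the numbers will differ from the figure.

```python
import itertools
import numpy as np

# Grid actions: North, South, East, West, Stay.
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1), '0': (0, 0)}

def step(grid, pos, action):
    """Deterministic dynamics: moves into walls or off the edge have no effect."""
    dr, dc = ACTIONS[action]
    r, c = pos[0] + dr, pos[1] + dc
    if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1] and grid[r, c] == 0:
        return (r, c)
    return pos

def empowerment(grid, pos, n_steps=5):
    """n-step empowerment (bits) at `pos` in a deterministic world:
    capacity = log2(number of distinct end states over all n-step action sequences)."""
    end_states = set()
    for seq in itertools.product(ACTIONS, repeat=n_steps):
        p = pos
        for a in seq:
            p = step(grid, p, a)
        end_states.add(p)
    return float(np.log2(len(end_states)))

# Placeholder maze: 0 = free cell, 1 = wall (not Klyubin's layout).
grid = np.zeros((10, 10), dtype=int)
grid[3, 1:8] = 1
grid[6, 2:9] = 1

emp_map = np.full(grid.shape, np.nan)
for r, c in zip(*np.where(grid == 0)):
    emp_map[r, c] = empowerment(grid, (r, c))
print(np.round(emp_map, 2))
```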

Suppose that a “piece of food” will appear at some position in the grid at time t, but until then, no information about that future position is known. How should one act in the meantime? Well, moving to position [5,5] near the center of the grid is probably the best bet, since more future states are available from that position.

Empowerment for AI

Most interesting human behaviour is not the result of reward-based learning (especially not using tens of thousands of examples, as standard machine learning models do). The psychological literature speaks of “intrinsic motivation”, which basically refers to doing tasks for their own inherent pleasure. This kind of behaviour is useful because these kinds of “inherently pleasing” activities tend to teach us valuable, abstract, re-usable concepts which can be applied to a multitude of potential future tasks (think of a child engaging in play).

For any AI that aims to be general-purpose, universal, or autonomous, learning general concepts is a must — and this means information is inherently valuable. For an embodied agent situated in an environment, the only way to learn about the inherent structure in the environment is by interacting with it — and thus empowered positions, where the outcomes of actions are more diverse, are inherently more interesting / fun parts of the world. Empowerment, and other informationally motivated concepts (check out predictive information and homeokinesis for related ideas) will be important tools as we move from task-specific to general-purpose intelligence.

Thanks for reading my article on empowerment. Check out my GitHub repository which has code implementing empowerment for arbitrary discrete environments.

References

[1] Klyubin, A.S., Polani, D. and Nehaniv, C.L., 2005, September. All else being equal be empowered. In European Conference on Artificial Life (pp. 744–753). Springer, Berlin, Heidelberg.

[2] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), p.529.

[3] Klyubin, A.S., Polani, D. and Nehaniv, C.L., 2008. Keep your options open: An information-based driving principle for sensorimotor systems. PloS one, 3(12), p.e4018.

[4] Jung, T., Polani, D. and Stone, P., 2011. Empowerment for continuous agent-environment systems. Adaptive Behavior, 19(1), pp.16–39.

[5] Shannon, C.E., 2001. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), pp.3–55.

[6] Cover, T.M. and Thomas, J.A., 2012. Elements of information theory. John Wiley & Sons.

[7] Friston, K., 2010. The free-energy principle: a unified brain theory?. Nature Reviews Neuroscience, 11(2), p.127.

[8] Mohamed, S. and Rezende, D.J., 2015. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems (pp. 2125–2133).

[9] Montúfar, G., Ghazi-Zahedi, K. and Ay, N., 2016. Information theoretically aided reinforcement learning for embodied agents. arXiv preprint arXiv:1605.09735.
