Using Reinforcement Learning to infer causality could unlock new frontiers in machine learning and potentially avoid another AI winter

The difficulty of inferring cause and effect is omnipresent. Humans face this challenge every day and do an admirable job of surmounting it, at least compared with our animal cousins. Machine learning pioneers such as Yoshua Bengio even suggest that creating algorithms that can infer cause and effect is the key to avoiding another AI winter and unlocking new frontiers in machine intelligence. In this article I explain how reinforcement learning can be reframed to infer causality, paralleling our human ability to do so and perhaps someday far exceeding it.
For those unfamiliar with reinforcement learning, it refers to a subset of algorithms used to determine optimality in sequential decision-making tasks. It is typically presented in terms of an agent learning to take a series of actions within an environment to receive a reward. These algorithms are readily applicable to problems that can be "gamified" such that there are clearly defined actions and rewards. Recently, reinforcement learning algorithms have received much acclaim for besting humans at Go, StarCraft, and a variety of video games.
The basic means by which such reinforcement learning algorithms achieve their success is a prediction error that is minimized through successive episodes of trial-and-error training. This mechanism corresponds fairly closely to the human dopamine learning system, which also makes use of a prediction error.
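To make the mechanism concrete, here is a minimal tabular Q-learning sketch in Python. The `env` object, with its `reset`, `step`, and `actions` methods, is a hypothetical interface rather than any particular library; the point is simply that the value estimates are nudged by a prediction error (the temporal-difference error) that shrinks over repeated episodes of trial and error.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: act, observe the reward, and nudge each value
    estimate by the prediction (temporal-difference) error."""
    q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(episodes):
        state = env.reset()           # hypothetical environment interface
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(q[(next_state, a)]
                                             for a in env.actions(next_state))
            # The prediction error: how far off the current estimate was.
            td_error = reward + gamma * best_next - q[(state, action)]
            q[(state, action)] += alpha * td_error
            state = next_state
    return q
```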
There are two primary hurdles that make inferring causation a difficult problem: one is complexity and the other is correlation. Next, I will explain how reinforcement learning can be reframed to tackle both.
As mentioned, reinforcement learning is typically presented in terms of an agent making sequential decisions within an environment in pursuit of some reward, such as winning a game. The agent can take actions, and depending on the state of the environment, those actions are either rewarded or not. When they are rewarded, the agent propagates this reward back across all the actions and environment states that led to it eventually receiving the reward. Over many trials, it can thereby determine which combination of actions and environment states were instrumental to receiving the reward and which were superfluous. Looked at another way, the agent determines which combination of actions and environment states were causally related to it receiving a reward and which were not. In effect, the agent is inferring cause and effect in regard to its own actions and a specific reward state of the environment.
The important point is that, at their root, RL algorithms are agnostic about which component of an environment is the agent and which are merely objects. Instead of thinking in terms of an agent, its actions, and an environment, we can think of every object in the environment as a possible agent, and of all the possible states an object can exist in as its action set. In this way, any causal relationship between a given set of components in an environment and a specific end state can be explored simply by changing which object is the agent and which end state is being rewarded.
Let’s unpack this with an example. A fairly common demonstration of RL is a match-to-sample task, in which the agent (often a small furry mammal) must learn to take a series of actions to receive a food reward, for instance by pressing a lever when a light turns on. With training, many animals, as well as AI bots, can solve such tasks. (See the accompanying video of a reinforcement learning agent solving a match-to-sample task.)
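Under the reframing above, this experiment can be described, in a deliberately toy and invented form rather than as the actual apparatus, as a collection of objects, each with the set of states it can occupy, any one of which can be promoted to the role of agent:

```python
# Toy description of a match-to-sample setup: every object is a candidate
# agent, and its "action set" is simply the set of states it can occupy.
objects = {
    "light":     ["on", "off"],
    "lever":     ["pressed", "released"],
    "food_bowl": ["locked", "unlocked"],   # the reward state we care about
    "speaker":   ["tone", "silent"],       # present, but assumed causally irrelevant
}

def as_agent(name):
    """Promote one object to the agent role; everything else stays environment."""
    return {
        "agent": name,
        "actions": objects[name],
        "environment": {k: v for k, v in objects.items() if k != name},
    }
```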
Now imagine we are a naïve social scientist with no knowledge of how this experiment was designed, and we wish to know what causes the food reward to be unlocked. In other words, we want to know the cause-and-effect story behind the match-to-sample task. One way to answer this is to employ an RL algorithm that moves sequentially through the objects in the environment, treats each one as an "agent", and checks whether there is a series of actions this agent/object can take to minimize a prediction error in regard to the unlocking of the food reward. An object that is causally connected to the reward state of unlocking the food bowl will have some actions that allow it to minimize a prediction error in regard to this reward state. One that is not causally related will have no such state/action combinations, no matter how well its action set correlates with the reward state.
Remember, in our reframing, actions simply refer to the different states that an object can exist in. For instance, we could pose the problem in such a way that the light is the agent and its actions are turning on and off. If there exists a causal relation between the actions of whatever object we have chosen as the agent and the reward state we want to understand, then there also exists a prediction error over its action space that can be minimized through successive training episodes. In other words, there is a way this object/agent can causally influence the reward state.
A potential way to unlock the causality story of the environment, then, is to move iteratively across the objects in the environment, treat each one as an agent taking actions, and examine which of them yield a prediction error that can be minimized in regard to the reward state. Again, an object that is not causally connected to the reward state, or is merely correlated with it, will have a prediction error over its action space that is essentially random and will not diminish with more training episodes. In this manner, RL provides a way to algorithmically probe the elements of a given environment for causal relationships, treating each one in turn as an agent attempting to influence the reward state through its actions.
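A sketch of such a probe, under the same toy assumptions as above, might look like the following. Here `make_env_with_agent` and `train_and_record_errors` are hypothetical callables the reader would have to supply (for instance, a wrapper around the earlier Q-learning sketch that records the mean absolute prediction error per episode), and the convergence test is a deliberately crude placeholder.

```python
def probe_causality(object_names, make_env_with_agent, train_and_record_errors,
                    episodes=5000, window=100):
    """Treat each object in turn as the agent and ask whether its prediction
    error with respect to the reward state can actually be driven down."""
    results = {}
    for name in object_names:
        env = make_env_with_agent(name)                   # this object becomes the agent
        errors = train_and_record_errors(env, episodes)   # e.g. mean |TD error| per episode
        early = sum(errors[:window]) / window
        late = sum(errors[-window:]) / window
        # A causally connected object shows a shrinking error over training;
        # a merely correlated one stays roughly flat however long we train.
        results[name] = {
            "error_reduction": early - late,
            "causal_candidate": late < 0.5 * early,   # crude heuristic threshold
        }
    return results
```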
Moreover, the rate at which the prediction error diminishes, frequently referred to as that agent's learning curve, gives some indication of how causally distant the object/agent is from the reward state. While it may be true that a butterfly flapping its wings can cause a hurricane on the other side of the globe, the number of training episodes necessary to predict this occurrence would be astronomical compared with the number needed to minimize a prediction error in regard to a much more proximate cause. Learning curves can therefore be useful for gauging relative causal proximity, all other things being equal.
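One crude way to operationalize that, continuing the sketch above, is to count how many episodes each candidate agent needs before its per-episode prediction error falls below some threshold; fewer episodes suggest a more proximate cause.

```python
def episodes_to_threshold(errors, threshold=0.05):
    """Return the first episode at which the prediction error drops below
    `threshold`; a rough proxy for causal proximity (fewer = closer)."""
    for episode, error in enumerate(errors):
        if error < threshold:
            return episode
    return None  # never converged: likely not causally connected at all
```

Comparing this count across the candidates examined by the probe above would then give a rough relative ordering of causal proximity.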
There are a number of assumptions and limitations that apply to this methodology, namely all those that apply to reinforcement learning itself. It also assumes we have a means of assessing the action space of each agent/object in the environment and that this satisfies the Markov property. In the match-to-sample task, every object in the environment has a small, clearly defined action space. For objects with a large or continuous action space, exploring causation may become computationally intractable. It is also worth noting that this system for inferring causality is only suitable for situations where a large number of sample trials can be generated, just as RL itself is only applicable where "self-play" allows for an enormous number of training episodes. While this may appear a computationally intensive means of exploring causality, with the diminishing cost of compute an increasing set of environments could become approachable by such methods.
One of the enduring challenges of science remains determining causality in complex systems. The human genome is one example: myriad genes and environmental factors contribute to a specific phenotype. In such instances causality can be difficult to infer, and methods such as this could shed light on which "actors" or states are instrumental to achieving a specific outcome. It also opens the field of AI to agents that can "explain" the causal inference used in determining a particular course of action. Another potential application is generating prior knowledge: an AI could store a list of previously inferred causal relationships and apply them to new action spaces when presented with a challenging problem. Such prior knowledge could vastly reduce the number of training episodes necessary to succeed at a task and possibly lead to the kind of one-shot learning demonstrated by humans.