
tl;dr Structural Causal Models (DAGs) Made Easy


"Smoking is associated with cancer" vs "Smoking causes cancer": which is more likely to swing public sentiment? Surely the latter; causal conclusions are stronger than associations. We have more data than ever, yet building causal AI engines remains a significant challenge. Where are we at?

What came first, the chicken or the egg? Source: publicdomainpictures.net

Association vs Causation

Put simply: causal relationships are those that remain invariant when external conditions change; mere associations carry no such guarantee.

Suppose we’d like to understand whether the month of the year is causal for higher ice cream demand. A correlation is a simple model where observed ice cream demand is modelled as a linear function of the month of the year. We can represent this relationship with the conditional probability P(ice cream|month). However, how could a machine determine from association alone whether ice cream demand causes the month of the year or vice versa?

It can’t! This is because we cannot validate P(ice cream|month) under changing conditions. This is called the _static_ property of associative models, which include (but are not limited to) correlation and regression.
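To see how symmetric association really is, consider a small simulation (all numbers and variable names here are illustrative, not from the article): from one and the same joint distribution of month and demand we can compute both P(demand|June) and P(June|demand), and nothing in the data privileges either direction as causal.

```python
import numpy as np

rng = np.random.default_rng(0)
month = rng.integers(1, 13, size=10_000)
# Assume demand is more likely in summer months (6-8); probabilities are made up.
p_demand = np.where((month >= 6) & (month <= 8), 0.8, 0.2)
demand = rng.random(10_000) < p_demand

# From the same observational data we can compute BOTH conditionals:
p_demand_given_june = demand[month == 6].mean()    # P(demand | June), about 0.8
p_june_given_demand = (month[demand] == 6).mean()  # P(June | demand), about 0.19

print(p_demand_given_june, p_june_given_demand)
```

Both quantities are perfectly well-defined, and Bayes' rule converts freely between them; the asymmetry we care about (causation) lives outside the joint distribution.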

Critically, Judea Pearl (the father of modern causal theory) rephrases our meek "correlation does not imply causation" as something more rigorous: _"one cannot substantiate causal claims from associations alone, even at the population level – behind every causal conclusion there must lie some causal assumption that is not testable in observational studies."_ This is in contrast to experiments, where our causal assumptions are directly encoded into the data collection process. So how do we validate the statement "The month of June causes more ice cream demand"?

Some claim causality by simply controlling for variables Z associated with both the treatment variable x and the outcome variable y. For example, when examining whether the month of June (x) causes a spike in ice cream demand (y), the analyst might adjust their model for the weather. If Z is not independent of x, and Z is not independent of y, then Z is said to be a confounding variable. So if we account for weather, could our computer use the updated association model P(y|x) to identify the correct causal relationship among the candidates below?

Not even close. "Controlling for Z" encodes just one of the causal assumptions in play, and it remains unverified; there are still far too many dependencies going on! To see why this is the case, we can illustrate causal relationships with a type of structural causal model called a causal directed acyclic graph.

Causal Directed Acyclic Graph (DAG)

In layman’s terms, a DAG is a directed graph that admits no cycles:

This means that starting at any node in the graph, such as D, there is no path you can walk along that will lead you back to D. Importantly, such a graph can encode all the assumptions of a causal model; this is called a causal DAG. Nodes represent events and directed edges represent causal relationships: E -> F means E is causal for F. Because the graph is acyclic, no event can cause itself. Events are connected by paths, and there are three kinds of paths to consider. Let’s move from letters to our ice cream example to gain some actual context:
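Before walking through the three path types, here is a minimal sketch of how such a graph can be represented and checked for cycles (node names anticipate the ice cream example; the representation is an assumption for illustration, not the article's code):

```python
# A causal DAG as an adjacency dict: an edge E -> F means "E is causal for F".
dag = {
    "month":   ["weather"],
    "weather": ["demand"],
    "economy": ["demand"],
    "demand":  [],
}

def is_acyclic(graph):
    """Depth-first search with three colours: reaching a GREY node
    (one still on the current path) means we found a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)

    def visit(node):
        colour[node] = GREY
        for child in graph.get(node, []):
            if colour[child] == GREY:                       # back edge: cycle
                return False
            if colour[child] == WHITE and not visit(child):
                return False
        colour[node] = BLACK
        return True

    return all(visit(n) for n in graph if colour[n] == WHITE)

print(is_acyclic(dag))          # True: no event can cause itself
dag["demand"].append("month")   # demand -> month would close a loop
print(is_acyclic(dag))          # False: month -> weather -> demand -> month
```

The acyclicity check is exactly the "no path leads back to where you started" property described above.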

DAG 1: Open Paths

(1) Open Paths: June is causal for higher temperatures, which in turn are causal for increased ice cream demand. Weather acts as a mediator between June and ice cream demand. A healthy economy also causes increased ice cream demand, but since no path connects it to month or weather, that event is independent of both.

DAG 2: Backdoor Paths

(2) Backdoor Paths: Here, two variables share a common cause. Higher temperatures could increase crop yield in an agriculture-focused economy, improving economic health and therefore ice cream demand. Higher temperatures also make people want more ice cream directly. Thus weather (or month of the year) acts as a confounding variable for the relationship between economic health and ice cream demand, since it is associated with both. In causal DAGs, confounding variables can always be found by tracing backdoor paths.
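A quick simulation makes the backdoor concrete. Under hypothetical assumptions matching DAG 2 (temperature drives both economic health and demand, and the true effect of economy on demand is fixed by construction at 2.0), the naive regression overstates the effect, while adjusting for the confounder recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
temp = rng.normal(20.0, 5.0, n)                    # weather: the common cause
economy = 0.5 * temp + rng.normal(0.0, 1.0, n)     # temp -> crops -> economy
# True structural effect of economy on demand is 2.0 by construction:
demand = 1.0 * temp + 2.0 * economy + rng.normal(0.0, 1.0, n)

# Naive slope of demand on economy -- the backdoor through temp is left open:
naive = np.polyfit(economy, demand, 1)[0]

# Closing the backdoor: include temp as a regressor alongside economy.
X = np.column_stack([economy, temp, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, demand, rcond=None)
print(naive, coefs[0])   # naive is inflated (well above 2); adjusted is close to 2.0
```

The choice of what to adjust for came from reading the graph, not from the data: the data alone would happily report the inflated naive slope.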

DAG 3: Closed Paths

(3) Closed Paths: Here, one variable has two causes pointing into it. Higher temperatures could make people want to drive around in air-conditioned cars. Also, a healthier economy makes cars more affordable. The consequence? No association between weather and economic health is transmitted through driving. The driving event variable is known as a collider.

Now we see why simply "controlling for confounding variables" can be misleading. In model (1), conditioning on weather means conditioning on the mediator; this closes the path and misrepresents the relationship between the events. Mistakenly controlling for a collider as if it were a confounder in model (3) opens a path between its causes via driving, distorting the overall relationship. This is an equally important consideration in experimental design: studying the effect of weather on ice cream sales in a population selected for driving amounts to conditioning on the collider, introducing selection bias.
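Collider bias is easy to demonstrate numerically. In a hypothetical setup matching DAG 3 (weather and economic health generated independently, driving caused by both), the two causes are uncorrelated overall, but restricting to frequent drivers, i.e. conditioning on the collider, manufactures a spurious negative association:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
temp = rng.normal(0.0, 1.0, n)      # weather
economy = rng.normal(0.0, 1.0, n)   # independent of weather by construction
driving = temp + economy + rng.normal(0.0, 0.5, n)   # collider: two causes point in

r_marginal = np.corrcoef(temp, economy)[0, 1]        # close to zero, as constructed
# "Controlling" for the collider, e.g. by studying only frequent drivers:
mask = driving > 1.0
r_conditional = np.corrcoef(temp[mask], economy[mask])[0, 1]
print(r_marginal, r_conditional)    # near 0 vs clearly negative
```

Intuitively, among frequent drivers, a cool day must be "explained" by a healthy economy and vice versa, so the two causes become negatively related within the selected group.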

Most importantly, examining causal DAGs can guide our choice of control variables: to obtain an unbiased estimate of the effect of economic health on ice cream sales in model (2), controlling for weather is a sound choice.

