Causal inference for data scientists: a skeptical view

How and why causal inference fails us

Aliaksandr Kazlou
May 18, 2020 · 9 min read

Introduction

The purpose of this post is to show why causal inference is hard, how it fails us, and why DAGs don’t help.

Machine learning practitioners are concerned with prediction, rarely with explanation. This is a luxury. We even work on problems for which no data-generating process can possibly be written down. Nobody believes that an LSTM model trained on Shakespeare’s plays works because it approximates Shakespeare. Yet it works.

In recent years, new tools for causal inference have become available for a broader audience. These tools promise to help non-experts not only predict, but explain patterns in data. Directed acyclic graphs and do-calculus are among the most influential ones.

People love shiny tools, and there is danger in that. New tools come with a rush to adopt them, with a feeling of being in the vanguard, with new opportunities. That often makes them unreliable: they get misused, substituted for better theory, better design, or better data.

In this post, I focus on directed acyclic graphs and the Python library DoWhy, because DAGs are especially popular in the machine learning community. The points I make apply equally to the potential outcomes framework, or to any other formal language for expressing causality.

Why causal inference is hard, in theory

Causal inference relies on causal assumptions. Assumptions are beliefs that allow movement from statistical associations to causation.

Randomized experiments are the gold standard for causal inference because the treatment assignment is random and physically manipulated: one group gets the treatment, the other does not. The assumptions here are straightforward, secured by design, and easy to defend.

When there is no control over treatment assignment, say, with observational data, researchers try to model it. Modeling here is equivalent to saying “we assume that after adjusting for age, gender, social status, and smoking, runners and non-runners are so similar to each other as if they were randomly assigned to running.” Then one can regress life expectancy on running, declare that “running increases life expectancy by n%”, and call it a day.
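To make the clunkiness concrete, here is roughly what that modeling step looks like in code. Everything below, from the variable names to the numbers, is invented for illustration; it is the shape of the argument, not anyone’s actual study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# A toy observational dataset (all names and effect sizes are made up).
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "ses": rng.integers(1, 4, n),        # crude social-status proxy
    "smoking": rng.integers(0, 2, n),
})
df["running"] = rng.binomial(1, 0.2 + 0.1 * (df["ses"] == 3) - 0.05 * df["smoking"])
df["life_expectancy"] = (85 - 0.1 * df["age"] - 5 * df["smoking"]
                         + 2 * df["running"] + rng.normal(0, 3, n))

# "Adjusting for confounders" is just putting them on the right-hand side.
# The causal reading of the `running` coefficient rests entirely on the
# untestable claim that these covariates close every back-door path.
fit = smf.ols("life_expectancy ~ running + age + ses + smoking", data=df).fit()
print(fit.params["running"])
```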

The logic behind this approach is clunky. It implicitly assumes we know exactly why people start running or live long, and that the only missing piece is the very thing we are trying to estimate, a story that is neither very believable nor free of circularity. It also assumes that, by a happy coincidence, every part of our model has an available empirical proxy measured without error. Finally, since there is no principled way to check how well the chosen model approximates the real assignment mechanism, all of its assumptions can be debated for eternity.

This brings us to a situation best summarized by Jasjeet Sekhon [1]:

“Without an experiment, natural experiment, a discontinuity, or some other strong design, no amount of econometric or statistical modeling can make the move from correlation to causation persuasive.”

Why causal inference is hard, in practice

The concerns raised above are better demonstrated with practical examples. Although there are plenty, I will stick to three: one each from economics, epidemiology, and political science.

In 1986, Robert LaLonde showed that econometric procedures do not replicate experimental results. He used an experiment in which individuals were randomly assigned to a job training program. Randomization allowed him to estimate the program’s unbiased effect on earnings. He then asked: would we have gotten the same estimate without randomization? To mimic observational data, LaLonde constructed several non-experimental control groups. After comparing the estimates, he concluded that econometric procedures fail to replicate experimental results [2].

Epidemiology has the same problems. Consider the story of HDL cholesterol and heart disease. It was believed that ‘good cholesterol’ protects against coronary heart disease; researchers even declared the observational findings robust to covariate adjustment. Yet several years later, randomized experiments demonstrated that HDL does not protect your heart. This situation is not unique to HDL: many epidemiological findings are later overturned by randomized controlled trials [3].

Democracy-growth studies were a hot topic in political science back in the day. Researchers put GDP per capita or the like on the left side of their equations, democracy on the right, and, to avoid looking unsophisticated, threw in a bunch of controls: life expectancy, educational attainment, population size, foreign direct investment, among others. Of 470 estimates presented in 81 papers published prior to 2006, 16% showed a statistically significant and negative effect of democracy on growth, 20% were negative but insignificant, 38% were positive but still insignificant, and the remaining estimates, and you will be really surprised here, were positive and statistically significant [4].

The pattern is clear: no matter how confident researchers are in their observational research, it does not necessarily bring them closer to the truth.

Why DAGs don’t solve the problem, in theory

DAGs are great. They come with great representational power and nice inferential properties: given the completeness of do-calculus, if an effect is not identifiable with do-calculus, it is not identifiable elsewhere, at least not without additional assumptions. They are also educational: try to draw a simple instrumental variable setup to see for yourself.
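If you want to try that exercise without pen and paper, here is the setup written down as a graph. The node names are generic placeholders of my choosing, not tied to any particular library:

```python
# A minimal instrumental-variable DAG in DOT notation (node names are hypothetical).
# Z is the instrument, X the treatment, Y the outcome, U an unobserved confounder.
# Z touches Y only through X, and Z shares no cause with U: writing the graph down
# makes both assumptions impossible to overlook.
iv_graph = """
digraph {
    Z -> X;
    X -> Y;
    U -> X;
    U -> Y;
}
"""
```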

But that’s not the point. The point is that the benefits DAGs offer come into play too late to rescue us from the horrors of causal inference in observational data. It is true that, given a particular graph, do-calculus tells us what we can estimate and what we can’t. What it does not tell us, however, is how to construct a meaningful DAG in the first place.

There is a quote by George Box [5]:

“Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

The scary tigers here are the too many observable variables to reason about, god knows how many unobservable variables, the things we measure with noise, and the things we cannot measure at all. In that setting the true graph is unknown, and when the true graph is unknown, the answer to whether our inference is correct is “no idea” or “no”.

With this in mind, many claims about DAGs become less confusing once we append ‘if the correct DAG is known’ to them [6], [7]:

“The task of selecting an appropriate set of covariates to control for confounding has been reduced to a simple ‘roadblocks’ puzzle manageable by a simple algorithm” [if the correct DAG is known]

“Wouldn’t it be great if we could generate the same data we used for this plot from our observational data, but make it causal? With modern causal inference approaches, we can!” [if the correct DAG is known]

Why DAGs don’t solve the problem, in practice

DoWhy is a great library. The authors make every effort to remind users that causal inference is hard. Yet the fundamental problem remains. Consider the following quote:

“Conceptually, DoWhy was created following two guiding principles: making causal assumptions explicit and testing robustness of the estimates to violations of those assumptions.”

The core assumption is that the chosen DAG is the correct one among the ocean of alternative DAGs — the assumption for which there are no robustness checks. It is also easily violated.

Let’s violate it. I will use the setting described in the DoWhy tutorial “Different estimation methods for causal inference”. There are 5 common causes W, 2 instruments Z, and one binary treatment v0; all other effects are bounded to lie within [0, 0.5 × βv0], and the outcome y is fully determined by the set of observable variables. The true treatment effect βv0 is 10.

Setup from https://microsoft.github.io/dowhy/dowhy_estimation_methods.html
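For reference, here is a sketch of how that dataset can be generated with DoWhy’s built-in simulator. The call follows the linked tutorial as I read it; argument names may differ across DoWhy versions.

```python
import dowhy.datasets

# Simulate the tutorial-style dataset: true effect beta = 10, five common causes,
# two instruments, a binary treatment.
data = dowhy.datasets.linear_dataset(
    beta=10,
    num_common_causes=5,
    num_instruments=2,
    num_samples=10000,
    treatment_is_binary=True,
)
df = data["df"]             # observed columns: W0..W4, Z0, Z1, v0, y
graph = data["gml_graph"]   # the (correct) DAG over the observed variables, as GML
```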

Although there are numerous ways to mis-specify a model, I will simulate a very simple one: a single variable U is missing. Even with one missing variable, one can draw 511 different combinations of arrows. I will stick to a subset of possible scenarios: U → outcome; U → outcome and treatment; U → outcome and a random common cause; U → treatment and a random common cause; U → a random instrument and the treatment; U → a random instrument and the outcome.
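Here is a sketch of the simplest of those scenarios, U → outcome. The exact mechanism and scale are my own illustrative choices; the other scenarios are built the same way, by also perturbing a treatment, instrument, or common-cause column as a function of U.

```python
import numpy as np

rng = np.random.default_rng(42)
beta = 10  # the true treatment effect in the tutorial setup

# U -> outcome: an unobserved variable shifts y but never appears in the data the
# analyst sees. The 0.5 * beta scale mirrors the "all other effects lie within
# [0, 0.5 * beta_v0]" restriction of the setup.
U = rng.normal(size=len(df))
df_mis = df.copy()
df_mis["y"] = df_mis["y"] + 0.5 * beta * U

# U itself is deliberately NOT added as a column: every DAG drawn over the
# observed variables is now one node and at least one edge away from the truth.
```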

In the tutorial, the authors used six estimators and managed to get close to 10 with five of them: linear regression, propensity score stratification, propensity score matching, propensity score weighting, and instrumental variables. In my simulation, I will use all five methods, analyzing IVs separately, as they don’t rely on the back-door criterion.
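The back-door part of the comparison then looks roughly like this; the method names follow the DoWhy version the tutorial targets and may differ in later releases.

```python
from dowhy import CausalModel

# The analyst's model is built from the observed columns only, i.e. from the
# graph that omits U.
model = CausalModel(
    data=df_mis,
    treatment="v0",
    outcome="y",
    graph=data["gml_graph"],
)
identified_estimand = model.identify_effect()

backdoor_methods = [
    "backdoor.linear_regression",
    "backdoor.propensity_score_stratification",
    "backdoor.propensity_score_matching",
    "backdoor.propensity_score_weighting",
]
for method in backdoor_methods:
    estimate = model.estimate_effect(identified_estimand, method_name=method)
    print(method, estimate.value)  # compare against the true effect of 10
```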

Back-door estimators vs unobserved covariate U

The first thing to notice is that when the back-door criterion is violated, as in the case where U affects both the treatment and the outcome, all estimates are significantly biased. This is hardly surprising: we can’t deliberately violate an assumption and expect a procedure that relies on it to work. Note, however, that the given graph is just one node and two edges away from the true graph, and that is already enough to skew the estimates. In fact, this toy example is built to be forgiving: all other effects are much smaller than the treatment effect, only one common cause is affected at a time, only one unobservable variable is missing, and everything is measured without error. In practice, conditions like these are rare.

Another thing to notice is that the regression estimates do better than the propensity score ones, because regression is the simpler procedure. For instance, the regression is unbiased when U affects the treatment and a common cause, since there is still no open path between the treatment and the outcome. That doesn’t hold for the propensity score estimates: propensity score estimators are two-stage procedures and require two sets of assumptions. To estimate the propensity score itself, the set of observables W must satisfy the back-door criterion with respect to the treatment, and it doesn’t, since there is an open path between W and the treatment through U.
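To see why the two stages matter, here is a bare-bones inverse-propensity-weighting estimator. This is a sketch of the idea, not DoWhy’s implementation: if U also drives the treatment, the first-stage model of the assignment is mis-specified before the reweighting even starts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(W, t, y):
    """Two-stage propensity-score weighting sketch of the average treatment effect.

    Stage 1: model P(treatment | W) from the observed common causes only.
    Stage 2: reweight outcomes by the inverse of the estimated propensity.
    An unobserved U that opens a path into the treatment biases stage 1,
    and the bias carries into the reweighted contrast.
    """
    ps = LogisticRegression(max_iter=1000).fit(W, t).predict_proba(W)[:, 1]
    return np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))

# Usage with the simulated data frame (column names follow the tutorial setup):
W_cols = ["W0", "W1", "W2", "W3", "W4"]
ate = ipw_ate(df_mis[W_cols].values,
              df_mis["v0"].astype(int).values,
              df_mis["y"].values)
```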

Now, let’s move to instrumental variables. Having a good instrument is rare; having two is an unprecedented luxury. In the toy example from the tutorial there are two valid instruments, Z0 and Z1. I will let U affect an instrument chosen at random, even though only Z0 is used for estimation. I will also set both Z0 and Z1 to be continuous.
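The IV estimate itself is obtained the same way as before, just with a different method name; the method_params key follows the tutorial’s DoWhy version and may have changed since.

```python
# IV estimation via DoWhy, using Z0 as the instrument.
iv_estimate = model.estimate_effect(
    identified_estimand,
    method_name="iv.instrumental_variable",
    method_params={"iv_instrument_name": "Z0"},
)
print(iv_estimate.value)
```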

IV estimator vs unobserved covariate U

Here, in contrast to the back-door case, the estimates are not significantly biased even when U affects the treatment and the outcome. The reason the IV estimator is so robust is that there is just one arrow pointing from Z0 to v0. That helps satisfy two assumptions: (i) there is a connection between Z0 and v0, and (ii) removing v0 from the graph leaves no connection between Z0 and y. Then, if Cov[U, Z0] = 0 and Cov[v0, Z0] ≠ 0, the treatment effect is simply Cov[y, Z0] / Cov[v0, Z0].

If, however, an arrow from U to Z0 exists, then Cov[U, Z0] ≠ 0, which violates the exclusion restriction (ii): the requirement that the only path through which Z0 affects the outcome runs through the treatment. In this simulation, that is the case when U affects both the instrument Z0 and the outcome.
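For completeness, here is the ratio written out in code. This is the plain single-instrument Wald estimator under my simplifying assumptions, not DoWhy’s internal implementation.

```python
import numpy as np

def wald_iv_estimate(y, t, z):
    """Single-instrument IV (Wald) estimate: Cov[y, z] / Cov[t, z].

    Valid only under the two assumptions above: z moves the treatment
    (Cov[t, z] != 0) and z reaches the outcome through the treatment alone
    (so Cov[U, z] = 0 for every unobserved U).
    """
    return np.cov(y, z)[0, 1] / np.cov(t, z)[0, 1]

# With the tutorial's column names:
effect = wald_iv_estimate(df_mis["y"], df_mis["v0"].astype(float), df_mis["Z0"])
```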

Conclusion

Directed acyclic graphs and do-calculus may very well be the most effective tools out there. But they will not help you squeeze causal conclusions out of data when those conclusions aren’t already there.

References

[1] J. S. Sekhon, Opiates for the matches: Matching methods for causal inference (2009), Annual Review of Political Science, 12, 487–508.

[2] R. J. LaLonde, Evaluating the econometric evaluations of training programs with experimental data (1986), The American Economic Review, 604–620.

[3] N. Krieger and G. Davey Smith, Response: FACEing reality: productive tensions between our epidemiological questions, methods and mission (2016), International Journal of Epidemiology, 45(6), 1852–1865.

[4] H. Doucouliagos and M. A. Ulubaşoğlu, Democracy and economic growth: a meta‐analysis (2008), American Journal of Political Science, 52(1), 61–83.

[5] G. E. P. Box, Science and statistics (1976), Journal of the American Statistical Association, 71(356), 791–799.

[6] J. Pearl, Theoretical impediments to machine learning with seven sparks from the causal revolution (2018), arXiv preprint arXiv:1801.04016.

[7] A. Kelleher and A. Sharma, Introducing the do-sampler for causal inference (2019), Medium
