Casual Causal Inference

Why do we need causality in data science?

Aleix Ruiz de Villa
Towards Data Science
4 min read · Nov 10, 2018


This is a series of posts explaining why we need causal inference in data science and machine learning (the next one is ‘Use Graphs!’). Causal inference brings a fresh set of tools and perspectives that lets us deal with old problems.

When experimenting is not an option

First off, designing and running experiments (typically A/B tests) is always better than using causal inference techniques: you don’t need to model how the data is generated. If you can do that, go for it!

However, there are many situations where this is not entirely possible:

  • The experiment is unethical (you cannot make a child smoke to test whether smoking causes cancer)
  • The cause does not depend on you (a competitor launches a new product and you want to measure its effect on your sales)
  • You have historical data and want to make the most of it
  • Performing experiments is too costly in money or impact, or too cumbersome to put into practice

A little bit of history

There are three main sources of influence in causal inference: computer science, statistics and epidemiology, and econometrics. Active research on causality started in the 1980s.

The computer science branch has been led by Judea Pearl. Its earliest influences go back to Sewall Wright in the 1920s, when he wrote about graphical models with linear functions. These techniques evolved into what is now known as the Directed Acyclic Graph (DAG) approach.
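
To make the idea concrete, here is a minimal sketch (my own toy example, not from the original post) of how a causal DAG can be written down in Python with networkx. The variable names and the graph structure are illustrative assumptions, not an established model:

```python
import networkx as nx

# A toy causal DAG: smoking -> tar -> cancer, with genotype as a
# (hypothetical) confounder that affects both smoking and cancer.
g = nx.DiGraph()
g.add_edges_from([
    ("genotype", "smoking"),
    ("genotype", "cancer"),
    ("smoking", "tar"),
    ("tar", "cancer"),
])

# A causal DAG must have no directed cycles: nothing can cause
# itself, directly or indirectly.
assert nx.is_directed_acyclic_graph(g)

# The parents of a node are its direct causes in the model.
print(sorted(g.predecessors("cancer")))  # ['genotype', 'tar']
```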

The most popular framework in statistics and epidemiology is known as the Potential Outcomes framework and was proposed by Jerzy Neyman in 1923. This was the starting point for developing causal inference from a more statistical point of view. Donald Rubin is the best-known name in this approach.
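
As an illustration (a toy simulation I made up, not anything from Neyman or Rubin), the following snippet shows the core idea of potential outcomes: each unit has two outcomes, Y(0) and Y(1), we only ever observe one of them, and randomization makes the naive difference in means a good estimate of the average treatment effect, while confounded assignment breaks it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy potential outcomes: Y(0) without treatment, Y(1) with treatment.
# By construction the true effect is +2 for every unit.
y0 = rng.normal(size=n)
y1 = y0 + 2.0

# Randomized assignment: treatment is independent of (Y(0), Y(1)),
# so the difference in observed means recovers E[Y(1) - Y(0)].
t = rng.integers(0, 2, size=n)
y_obs = np.where(t == 1, y1, y0)   # we only ever see one of the two
print(round(y_obs[t == 1].mean() - y_obs[t == 0].mean(), 2))  # ~2.0

# Confounded assignment: units with a high Y(0) are more likely to be
# treated, and the same naive difference is now biased upward.
t_conf = (y0 + rng.normal(size=n) > 0).astype(int)
y_conf = np.where(t_conf == 1, y1, y0)
print(round(y_conf[t_conf == 1].mean() - y_conf[t_conf == 0].mean(), 2))  # ~3.1
```

The confounded case is exactly the situation you face with observational data, and it is what the rest of the causal inference machinery is designed to handle.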

Both frameworks are equivalent, meaning that a theorem in one is a theorem in the other, and every assumption in one can be translated into an equivalent assumption in the other. The differences are a matter of usage: some problems are easier to formulate in one framework, and some in the other. James Robins and Thomas S. Richardson worked on a framework called Single World Intervention Graphs (SWIGs), which acts as a bridge between the two.

Some facts about causality have been known in econometrics for a while. The techniques most popular among econometricians are:

  1. Controlled Regression
  2. Regression Discontinuity Design
  3. Difference-in-Differences
  4. Fixed-Effects Regression
  5. Instrumental Variables

Of course, they can be formulated in terms of the previous frameworks.
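
As a small taste of the econometric toolbox, here is a minimal difference-in-differences sketch in Python. All numbers are made-up toy values; the point is only the mechanics of subtracting the control group’s change from the treated group’s change:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Made-up setup: a common time trend (+1.0), a fixed gap between the
# groups (+0.5), and a true treatment effect (+2.0) that only hits the
# treated group after the intervention.
trend, gap, effect = 1.0, 0.5, 2.0

control_pre  = rng.normal(0.0, 1.0, n)
control_post = rng.normal(trend, 1.0, n)
treated_pre  = rng.normal(gap, 1.0, n)
treated_post = rng.normal(gap + trend + effect, 1.0, n)

# DiD: subtract the control group's change from the treated group's
# change, cancelling both the group gap and the common trend.
did = (treated_post.mean() - treated_pre.mean()) - (
    control_post.mean() - control_pre.mean()
)
print(round(did, 2))  # ~2.0
```

The whole estimate rests on the (untestable) assumption that, absent treatment, both groups would have followed the same trend.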

Philosophical questions (always under debate)

The nature of causality is genuinely hard. The paradox is that we use it continuously in daily life, and it is common sense in many cases, but finding a definition that assesses in which cases A causes B in the real world is very difficult! We would all agree that rain makes the floor wet (I hope… we may find rain deniers, too). But how do we write a definition that picks out the rain, and not the air, as the cause of the wetness? How do we know that the rain is the main cause, and not some unobserved variable that causes both the rain and the wet floor at the same time? Ponder that as much as you’d like… David Hume was already thinking about the nature of causality in the eighteenth century, and many philosophers have written about it since.

The general agreement in the statistics community is that you cannot prove a causal effect without, at the very least, performing an experiment. When you deal with observational data (data obtained passively, without you experimenting), the most you can hope for is to talk about correlation (probabilistic dependency). This has created a scenario where talking explicitly about causality in observational data is taboo.

Even though proving causality in such cases is not possible, there are benefits to talking about causality explicitly. The first one is easy to understand: most of our knowledge as humans about how the world works is observational. You don’t run experiments on everything you know. In some cases it is not even possible: does the sun cause daylight? How do you run an experiment on that, switching the sun on and off? You could try some kind of surrogate experiment and argue that it validates the original hypothesis, but that is not straightforward either. Meanwhile, we all agree that the sun causes daylight. The second argument is to avoid misleading yourself. When you analyze data, it is because you want to arrive at conclusions and take further action. If you think that way, it is because you believe those actions affect (and thus are a cause of) some quantity of interest. So, even if you talk about correlations for the sake of technical correctness, you are going to use those insights in a causal way. If your objective is a causal one, you’d better talk about it explicitly.

Where to start reading

Tutorials

Books

Who am I? I’m Aleix Ruiz de Villa (http://aleixruizdevilla.net), Ph.D. in mathematical analysis and former head of data science at several companies. I live in Barcelona, Catalonia, and I’m a cofounder of badass.cat and bacaina.cat.
