Casual Causal Inference

Solving Simpson’s Paradox

Understand a key toy example in causal inference

Aleix Ruiz de Villa
Towards Data Science
6 min read · Feb 20, 2019


This is the fourth post in a series about causal inference and data science. The previous one was “Observing is not intervening”.

Simpson’s paradox is a great example. At first it challenges our intuition, but then, if we are able to dissect it properly, it gives us a lot of ideas about how to handle the analysis of observational data (data that was not obtained through a well-designed experiment). It shows up in many data analyses. We will walk through it using the well-known case of kidney stones. The techniques explained here can be found in detail in Pearl et al.’s “Causal Inference in Statistics: A Primer”.

Kidney stones example

In a hospital, a doctor was dealing with kidney stones. She had two treatments, say A and B. She split her patients evenly between them (350 for A and 350 for B) and measured each treatment’s success. The results can be found in the following table:

Clearly, treatment B was the better one. The job was done and she could go home. But… wait! She knew she had more information and wanted to have a look at it, just in case… She also had the size of each patient’s stones, and maybe this information was relevant. So she came up with this table:

Hmmm… For small stones A is better, for large stones A is better, but aggregated, B is better?! What the f@#*!!
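To make the paradox concrete, here is a minimal sketch in Python. The tables in the original post are images, so the counts below are the ones most commonly cited for this study (Charig et al., 1986); the exact numbers in the author’s tables may differ slightly, but the pattern is the same.

```python
# Commonly cited kidney-stone counts (Charig et al., 1986):
# (successes, patients) per (treatment, stone size).
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

# Within each stratum, A beats B...
for size in ("small", "large"):
    a_s, a_n = data[("A", size)]
    b_s, b_n = data[("B", size)]
    print(f"{size:>5} stones: A = {a_s / a_n:.0%}, B = {b_s / b_n:.0%}")

# ...but aggregated over sizes, B looks better.
for t in ("A", "B"):
    s = sum(data[(t, size)][0] for size in ("small", "large"))
    n = sum(data[(t, size)][1] for size in ("small", "large"))
    print(f"overall {t}: {s / n:.0%}")
```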

What’s going on?

Then she recalled how patients had been assigned. When a new patient arrived, she made a guess about the size of their stone. Treatment A involved some kind of surgery, while treatment B basically consisted of giving pills. She knew that pills were less effective against large stones, so the hard cases were assigned to treatment A. This process makes comparing the treatments more difficult!

Using a graph, we can depict the data-generating process as follows:

Data generation process

The size affects both the treatment assignment and the chances of recovery. This is called confounding, because you cannot distinguish the effect of the treatment on recovery from the effect of the size.

Extreme group assignments

To understand how assignments can affect our conclusions, imagine these two made-up extreme cases:

  1. A receives only large stones while B receives only small ones
  2. The other way round
Case 1
Case 2

As you can see, assigning all the hard cases to A makes A seem the less effective treatment, while assigning all the hard cases to B makes B seem the less effective one. Makes sense!
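A quick numeric sketch of why this happens, with assumed recovery rates that are deliberately identical for both treatments, so any difference we observe is purely an assignment artifact:

```python
# Assumed recovery rates by stone size, the same for both treatments.
p_recover = {"small": 0.90, "large": 0.70}

# Case 1: A receives only large stones, B only small ones.
print(f"Case 1: A = {p_recover['large']:.0%}, B = {p_recover['small']:.0%}")  # A looks worse

# Case 2: the other way round.
print(f"Case 2: A = {p_recover['small']:.0%}, B = {p_recover['large']:.0%}")  # B looks worse
```

Identical treatments, opposite conclusions, driven entirely by who gets which patients.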

Interventions

The question we want to answer is the following: if the hospital had to choose only one treatment, which one would it be?

As we saw in the previous post “Observing is not intervening”, this can be expressed in a graph that represents the distribution we want to know.

Intervening on treatment

This distribution corresponds to the case where we give treatment A to everyone. Intervening on a variable is defined as removing its dependence on its parents (the variables it directly depends on). The main question in causal inference is whether we can make inferences about this new, intervened distribution while only having information from the data-generating distribution.
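A small simulation can make the distinction tangible. This is a sketch with made-up probabilities: in the observational regime, size influences the treatment choice; under do(A), we cut that dependence and assign treatment A ourselves.

```python
import random

random.seed(0)

# Made-up mechanism: size -> treatment, and (treatment, size) -> recovery.
P_LARGE = 0.5                                     # P(stone is large)
P_A_GIVEN = {"large": 0.8, "small": 0.2}          # doctor sends hard cases to A
P_RECOVER = {("A", "small"): 0.93, ("A", "large"): 0.73,
             ("B", "small"): 0.87, ("B", "large"): 0.69}

def sample(do_treatment=None):
    size = "large" if random.random() < P_LARGE else "small"
    if do_treatment is None:                      # observational: size -> treatment
        treatment = "A" if random.random() < P_A_GIVEN[size] else "B"
    else:                                         # do(treatment): the edge is removed
        treatment = do_treatment
    recovered = random.random() < P_RECOVER[(treatment, size)]
    return treatment, recovered

N = 100_000
observed = [sample() for _ in range(N)]
a_recoveries = [rec for t, rec in observed if t == "A"]
p_obs = sum(a_recoveries) / len(a_recoveries)                      # P(R | A)
p_do = sum(rec for _, rec in (sample("A") for _ in range(N))) / N  # P(R | do(A))

print(f"P(R | A)     ~ {p_obs:.2f}")  # ~0.77, dragged down by the large stones A gets
print(f"P(R | do(A)) ~ {p_do:.2f}")   # ~0.83 = 0.5 * 0.93 + 0.5 * 0.73
```

Conditioning and intervening give visibly different answers as soon as the treatment has causes of its own.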

Adjustment formula

The trick is the following: if we focus only on small stones, then any difference in recovery can be attributed to the treatment, so we can measure the treatment’s effectiveness:

Conditioning on small size stones

We can do the same for large stones. But then, how do we combine these two quantities? We combine them according to how stones would have been distributed if size had not affected the treatment assignment. That means we use the global distribution of stone sizes instead of the size distribution within each treatment.

Distribution of sizes

As you can see, there are roughly equal numbers of large and small stones overall, while within each treatment the distribution of sizes is uneven.

Writing this process down, we get the so-called adjustment formula:

P(R | do(A)) = P(R | Small, A) * P(Small) + P(R | Large, A) * P(Large)

It can be shown (though it is not straightforward) that this formula is precisely what we were looking for: it computes the probability of recovery in the intervened graph using only data obtained from the data-generating graph.

The same can be done for treatment B, and then the results can be compared.
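Here is a minimal sketch of the whole computation, using the same commonly cited counts as before (the author’s tables evidently contain slightly different counts, so the percentages below will not match the post’s exactly):

```python
# (successes, patients) per (treatment, stone size), as in the earlier sketch.
data = {("A", "small"): (81, 87), ("A", "large"): (192, 263),
        ("B", "small"): (234, 270), ("B", "large"): (55, 80)}

# Global size distribution, pooled over both treatments.
n_total = sum(n for _, n in data.values())
p_size = {s: sum(data[(t, s)][1] for t in "AB") / n_total
          for s in ("small", "large")}

def adjusted_recovery(treatment):
    """Adjustment formula: sum over sizes of P(R | size, treatment) * P(size)."""
    return sum(data[(treatment, s)][0] / data[(treatment, s)][1] * p_size[s]
               for s in ("small", "large"))

print(f"P(R | do(A)) ~ {adjusted_recovery('A'):.2f}")  # ~0.83
print(f"P(R | do(B)) ~ {adjusted_recovery('B'):.2f}")  # ~0.78, so A comes out ahead
```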

We go from an initial 3% advantage for treatment B to a 7% advantage for treatment A!!

Well then, let’s always adjust!

Not so fast… There are situations where adjusting will lead to wrong conclusions. In the example above, when measuring the effect of the treatment, size is a nuisance whose effect we want to remove. Now think of a different situation. Imagine you have a treatment for some illness. Moreover, you know (because you measured it) that the treatment affects patients’ blood pressure. At the same time, you know (because you have seen it many times in your career) that blood pressure also affects the chances of recovery. In this case, your data would be generated in this way:

Now the treatment has a direct path and an indirect path (through blood pressure) by which it affects recovery. But you are interested in both! You don’t want to remove either effect. The idea is that in this case you don’t need to apply any adjustment: the direct measurements already give you the right quantity. Or, equivalently, if you intervene on the treatment (recall that this means removing its dependence on its causes), you get the same graph, so there is nothing to do!
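Here is a sketch of that contrast with made-up numbers. Blood pressure acts as a mediator here, and stratifying on it throws away the indirect part of the effect we actually care about:

```python
import random

random.seed(1)

# Made-up mechanism: treatment -> blood pressure -> recovery, plus a direct
# treatment -> recovery path. Treatment is assigned at random (no confounder).
def sample():
    treated = random.random() < 0.5
    high_bp = random.random() < (0.3 if treated else 0.7)  # treatment lowers BP
    p_recover = 0.5 + 0.2 * treated - 0.3 * high_bp        # direct + indirect effects
    return treated, high_bp, random.random() < p_recover

N = 200_000
rows = [sample() for _ in range(N)]
treated = [r for r in rows if r[0]]
control = [r for r in rows if not r[0]]

def recovery_rate(group):
    return sum(rec for _, _, rec in group) / len(group)

# The plain comparison recovers the full (direct + indirect) effect:
print(f"unadjusted: {recovery_rate(treated) - recovery_rate(control):.2f}")  # ~0.32

# Adjusting for blood pressure keeps only the direct path:
p_high = sum(bp for _, bp, _ in rows) / N
def adjusted(group):
    return sum(recovery_rate([g for g in group if g[1] == bp]) * w
               for bp, w in ((True, p_high), (False, 1 - p_high)))
print(f"adjusted:   {adjusted(treated) - adjusted(control):.2f}")            # ~0.20
```

Blindly applying the adjustment formula here would report only the direct effect and understate how well the treatment actually works.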

Conclusions

This simple example gives us strong conclusions on how to analyze causal effects in observational data.

  1. Data does not speak for itself: the direct calculation may lead to wrong conclusions
  2. More data does not solve the problem: you could increase the number of patients and still get the same paradox!
  3. Correlation is not enough: correlation is a symmetric function while causality is not. The latter has a clear directionality.
  4. Different models lead to different conclusions: depending on the situation, we may or may not have to apply the adjustment formula, and so we reach different conclusions. And this was decided using only the graph!
  5. Graphs are a great communication tool: we already saw it in “Use graphs!” and we confirm it once more!
