Correlation Is Not Causation… Or Is It?

How you can go beyond handwaving when talking about correlation

Florent Buisson
Towards Data Science
8 min read · Jan 20, 2022


Photo by Siyan Ren on Unsplash

Correlation: a kink in Analytics’ neck

“Correlation is not causation” is a phrase you hear a lot in analytics (I’ll abbreviate it from now on as CINC, which I choose to hear as “kink”). Many times in my career, I have seen a business analyst or data scientist present a scatter plot of data showing a correlation between two variables A and B and issue that ritual warning. Unfortunately, 90% of the time, they then proceed to do one of two things:

  • They continue as if indeed they had proven that variable A causes variable B. For example, “we can see that the number of marketing emails received is correlated with a customer’s lifetime value. Of course, correlation is not causation. With that said, let’s now talk about how we can ramp up our marketing efforts to increase customers’ LTV”. In that case, CINC is nothing more than a thinly veiled disclaimer, protecting the analyst’s ass in case you foolishly took their conclusions at face value.
  • Alternatively, they state that you can’t draw any further conclusions unless you run a randomized experiment. This approach is more common with analysts trained in statistics, and it has the advantage of being intellectually more honest. However, in practice, business partners often just nod along, and as soon as the speaker has left the room they develop plans based on variable A causing variable B.

However, this sorry state of affairs doesn’t have to be the norm. Whenever we observe a correlation in data, there is actually a limited number of possible scenarios outside of variable A causing variable B:

  1. The observed correlation does not reflect a true correlation in the population of interest;
  2. Variable B causes variable A;
  3. Variable A and variable B share a common cause;
  4. There is a more complex causal structure at play.

1. No true correlation

The simplest case is when there isn’t actually any correlation in the population of interest. There are two ways this can happen: noise (a.k.a. sampling variation) and bias.

NOISE. First, if your sample is “too small”, or if you have drawn too many samples in a row (a.k.a. a fishing expedition), the observed correlation may just be a random fluke. This is a real problem, especially if you rely on p-values as a measure of significance instead of determining economic significance through confidence intervals, but I won’t dwell on it: I feel that most people have a good grasp of that trap, and in most business situations the samples are not that small. If you have a million rows, sampling variation should be very low on your list of potential issues. If your sample is on the smaller side, just use more robust metrics, like the median instead of the mean. People often underestimate how robust the median is, even with very small samples (the math is in the appendix).
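To make the confidence-interval approach concrete, here is a minimal sketch using a percentile bootstrap on made-up data (the variables, sample size, and effect sizes are all illustrative assumptions, not real figures):

```python
import random
import statistics

rng = random.Random(0)

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up data: emails received vs. LTV, with a weak true relationship.
emails = [rng.randint(0, 20) for _ in range(200)]
ltv = [5 * e + rng.gauss(0, 100) for e in emails]

# Percentile bootstrap: resample the rows with replacement many times
# and look at the spread of the recomputed correlation.
n = len(emails)
boot = []
for _ in range(2_000):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(pearson([emails[i] for i in idx], [ltv[i] for i in idx]))
boot.sort()
lo, hi = boot[100], boot[1_900]  # empirical 5th and 95th percentiles

r_obs = pearson(emails, ltv)
print(f"observed r = {r_obs:.2f}, 90% CI = [{lo:.2f}, {hi:.2f}]")
```

If the interval is wide or straddles zero, sampling noise is still a live explanation; if it is narrow and far from zero, you can move on to the other scenarios below.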

BIAS. Bias occurs when your sample is not a good representation of your population of interest. For example, “All customers with an active account last year” is generally a reasonable proxy for “All customers with an active account next year”. On the other hand, “All customers with an active account last year who have provided an email address” is not. Bias is a more insidious problem than noise, because even large samples can fall victim to it, as a recent study about COVID showed [1].

Avoiding bias, or at least recognizing it, doesn’t have to be complicated though. Simply write down as precisely as possible the definition of your sample and the definition of your population of interest. If your sample is truly drawn at random from your population, you’re good to go. In any other case, there might be bias, for example if you reach out to people at random in your population but your sample only includes those who answered or provided complete answers. Try to identify subcategories of people who are part of your population of interest but may be missing or underrepresented in your sample. Pushing it to the limit, if poor older women with a disability and no Internet connection are part of your population, are you reaching them adequately?
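One way to operationalize this check is to compare subgroup shares in the sample against the population. A minimal sketch, with made-up age groups and shares:

```python
# Hypothetical subgroup shares; all numbers are made up for illustration.
population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
sample_shares = {"18-34": 0.45, "35-54": 0.40, "55+": 0.15}

for group, pop_share in population_shares.items():
    # A ratio well below 1 means the group is underrepresented in the sample.
    ratio = sample_shares[group] / pop_share
    flag = "  <-- underrepresented" if ratio < 0.8 else ""
    print(f"{group}: sample/population ratio = {ratio:.2f}{flag}")
```

The 0.8 threshold is an arbitrary rule of thumb; the point is simply to make the comparison explicit instead of leaving it to intuition.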

If you’re thinking “but that’s a minuscule part of my population!”, I invite you to think again. Subcategories may add up to a large share of your population even if each one of them is small. They may also appear small only from your personal point of view. I’m currently living in West Africa and recently struggled to update an iPhone: it required 1) downloading several gigabytes of data, 2) over WiFi (another phone’s hotspot doesn’t work), 3) while charging. But the typical owner of a smartphone in a developing country may not have WiFi at home (their smartphone is their only access to the Internet), and WiFi bandwidth in stores is generally limited, assuming they’ll even let you use an electric plug. This may be an “edge case” if you’re living on the US West Coast, but it probably encompasses hundreds of millions, if not billions, of smartphone users!

2. Reverse causation (B causes A)

The next possibility is that the correlation between variables A and B stems from variable B causing variable A instead of the other way around. For example, a correlation between the number of marketing emails received and customer lifetime value may be due to marketing targeting high-LTV customers with their emails. Once you consider that possibility, it’s generally pretty obvious how it could be happening in your data.

3. Confounder (A and B share a common cause)

The last “simple” case is when A and B share a common cause. For example, maybe marketing budget is assigned at the state level in the US, or at the country level internationally. Then customers in California (resp., in the US) may both have a higher LTV and receive more marketing emails than customers in Tennessee (resp., in Nigeria). Again, once you consider that possibility, it’s generally pretty obvious how it could be happening in your data.

4. Anything else (more complex causal structure)

The first 3 cases probably represent 90% of the situations you’ll encounter in practice, but technically speaking they do not cover all possibilities. For the sake of completeness, I’ll briefly talk about what else is out there.

One category of more complex causal structures is when you control, explicitly or implicitly, for a variable you shouldn’t. For example, a military doctor found that the use of tourniquets on the battlefield was negatively correlated with survival rates; the issue was that his analysis was based on soldiers arriving at the field hospital. But the main benefit of a tourniquet is that it can allow a soldier with a severe wound to survive until they reach the hospital instead of bleeding out. This means that more soldiers overall survive, but a smaller proportion of those who make it to the hospital do, because we’re adding more severe cases to the mix [2]. As a side note, this example could also be interpreted as a bias in data collection (i.e., the negative correlation observed is not representative of the population of interest), which shows that data collection and data analysis are less separate than people often believe.
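This selection effect can be reproduced with a toy simulation. In the sketch below, every probability is an invented assumption: tourniquets raise the chance of reaching the hospital, yet among hospital arrivals the tourniquet group survives less often, simply because they carry the most severe wounds:

```python
import random

def simulate(use_tourniquets, n=200_000, seed=7):
    """Toy battlefield model; every probability below is an illustrative assumption."""
    rng = random.Random(seed)
    overall_survivors = 0
    arrivals = {True: [0, 0], False: [0, 0]}  # tourniquet -> [arrived, survived]
    for _ in range(n):
        severity = rng.random()                  # 0 = light wound, 1 = worst
        tq = use_tourniquets and severity > 0.5  # tourniquets go to severe wounds
        # Tourniquets raise the chance of reaching the hospital alive.
        p_reach = min(1.0, 1.2 - severity + (0.4 if tq else 0.0))
        if rng.random() < p_reach:               # soldier reaches the hospital
            alive = rng.random() < 1.0 - 0.6 * severity
            arrivals[tq][0] += 1
            arrivals[tq][1] += alive
            overall_survivors += alive
    return overall_survivors, arrivals

with_tq, arrivals = simulate(True)
without_tq, _ = simulate(False)

print(f"overall survivors with tourniquets:    {with_tq}")
print(f"overall survivors without tourniquets: {without_tq}")
for tq in (False, True):
    n_arr, n_surv = arrivals[tq]
    print(f"hospital survival rate, tourniquet={tq}: {n_surv / n_arr:.0%}")
```

With tourniquets, more soldiers survive overall, yet the naive hospital-level comparison makes tourniquets look harmful: exactly the trap of conditioning on arrival at the hospital.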

Finally, we have situations that seem to have been designed by Nature to trip up and confuse scientists. For instance, it has been known for a little while that autism is correlated with a simpler gut microbiome (i.e., a less diverse population of bacteria in the gut). Does that mean that the microbiome causes autism? A recent study suggests “no, it’s the other way around”: autistic children often have restricted diets because sensory experiences can overwhelm them, and limited food variety leads to limited microbiome variety. But then, how do we explain that fecal transplants improve the behavior of autistic children? An emerging hypothesis is that “faecal transplantation, by relieving the uncomfortable symptoms brought about directly by unbalanced microbiomes, improves the behaviour of children with autism, yet does so without affecting the neural underpinnings of the condition” [3]. The corresponding causal diagram would then be:

Image by author

Ultimately, science progresses by developing increasingly precise and complete models that account for all the facts at hand. The same applies in business: Achieving a deep understanding of customer (or employee) behaviors requires building accurate causal diagrams, as I explain in my book Behavioral Data Analysis with R and Python [4].

Recap and conclusion

Whenever you observe a correlation between variables A and B in your data, there are exactly 4 possibilities apart from A causing B:

  1. The observed correlation does not reflect a true correlation in the population of interest, either because of sampling noise or bias;
  2. Variable B causes variable A;
  3. Variable A and variable B share a common cause;
  4. There is a more complex causal structure at play.

This means that you don’t have to restrict yourself to “Correlation is not causation”. By carefully thinking about the other possibilities and excluding the implausible ones, you can conclude that “this correlation probably reflects causation, which will be confirmed by running an A/B test once we have determined the action we want to take”. And if things get too complicated, you can build causal diagrams to determine what’s going on.

References

[1] https://news.harvard.edu/gazette/story/2021/12/vaccination-surveys-fell-victim-to-big-data-paradox-harvard-researchers-say/.

[2] This example is from Judea Pearl & Dana MacKenzie, The Book of Why: The New Science of Cause and Effect.

[3] The Economist, “How an upset gut microbiome is tied to autism”.

[4] Florent Buisson, Behavioral Data Analysis with R and Python: Customer-Driven Data for Real Business Results.


Appendix: The robustness of the estimator for the median

Remember that by definition, the median of a population is such that half of the population has a value lower than it, and half the population has a value higher than it. This holds true regardless of the shape of the data distribution, its number of peaks, etc.

This means that if you draw two values x and y at random from that population, there are 4 possibilities:

  • They’re both below the population median, with probability 0.5*0.5 = 0.25;
  • They’re both above the population median, with probability 0.25 too;
  • One is below the population median and the other is above, with probability 0.5.

More generally, if you have N values:

  • They’re all below the median with probability 0.5^N;
  • They’re all above the median with probability 0.5^N;
  • The population median is between the lowest and the highest of the N sample values with probability 1 - 2*(0.5^N).

This means that even with a sample of only 5 values, there is a 94% chance that the population median is bracketed by your sample. With 10 values, that probability reaches 99.8%. Now, I can’t promise that you’ll be happy with the size of that confidence interval, but at least you’ll have a very clear sense of how much sampling variation matters in the situation at hand.
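These figures come straight from the formula above, and a few lines of Python confirm them:

```python
def bracket_probability(n):
    """Probability that the population median lies between the smallest and
    largest of n values drawn independently from the population."""
    return 1 - 2 * 0.5 ** n

for n in (2, 5, 10):
    print(f"n = {n:2d}: {bracket_probability(n):.1%}")
    # n = 5 gives 93.8% (~94%), n = 10 gives 99.8%
```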
