Image by author

No More “What” Without the “Why”

It’s Time that Organisations Unite Machine Learning and Causal Inference

Dr Michael Ganslmeier
Towards Data Science
Jul 23, 2021

Over the last few months, I have had the chance to help various organisations and their leaders leverage their large databases with machine learning. I worked particularly with membership organisations struggling with rising dropout rates (churn), an issue that became even more serious during the pandemic, when individual incomes were declining and the fear of job loss was rising.

Using machine learning, we mined very large membership databases with individual-level information (e.g. age, gender, occupation, marital status, postal code) to identify members with a high dropout risk so they could be targeted ex ante. A classification problem par excellence.
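To make this concrete, here is a minimal sketch of such a churn classifier in Python. The file name, column names, and model choice are illustrative assumptions rather than the exact setup we used.

```python
# Minimal churn-prediction sketch (file and column names are illustrative assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical membership table: one row per member plus a binary churn label
members = pd.read_csv("members.csv")
X = pd.get_dummies(members[["age", "gender", "occupation", "marital_status", "postal_code"]])
y = members["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Rank members by predicted dropout risk so they can be targeted ex ante
members["dropout_risk"] = model.predict_proba(X)[:, 1]
```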

Machine Learning tells us the “What”, Causal Inference the “Why”

Despite the overall good performance of the machine learning models, our clients always came back to one obvious question: why does an individual member leave? Unfortunately, machine learning models are not suited to identifying the causes of things; they are built to predict them.

However, knowing the reason for leaving is of immense business value as it determines the strategic decisions that leaders have to take. For instance, if someone ends her membership because she is moving abroad, offering lower membership fees will do little to nothing to keep her as a member.

These experiences made me realize that combining the predictive power of machine learning, which tells us the “What” (e.g. individuals with a high probability of leaving), with the methods of causal inference, which uncover the “Why” (e.g. the reasons for leaving), is essential to use the massive datasets within organisations to their fullest potential.

Measuring correlation is easy, measuring causality is not

To stick to our churn example, an organisation might be interested in whether an increase in its membership fees (one potential explanatory variable) leads members to leave (the outcome variable). Estimating this causal relationship is anything but trivial. As we all know from Statistics 101, correlation is not causation, and, in fact, the absence of correlation is not the absence of causation either. Thus, the underlying issue of measuring causality cannot be solved with more or even better data; nor is it the kind of (predictive) modelling problem at which machine learning has proven so successful in recent years.
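The second, less familiar half of that statement is easy to demonstrate with a few lines of simulated data: in the toy example below (purely illustrative numbers), the outcome is entirely caused by the explanatory variable, yet the linear correlation between the two is essentially zero.

```python
# Causation without (linear) correlation: a purely illustrative simulation
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 100_000)            # "treatment", symmetric around zero
y = x**2 + rng.normal(0, 0.1, 100_000)   # y is fully caused by x, but non-monotonically

print(np.corrcoef(x, y)[0, 1])  # ~0: no linear correlation despite a causal link
```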

Instead, the reason why causal effects are hard to measure is endogeneity. One of its most important sources is omitted variable bias, which arises when an unobserved factor influences both the explanatory variable of interest and the outcome variable at the same time.

For instance, let’s imagine an organisation considers a new explanatory factor, beyond membership fees, and would like to investigate whether the age of an individual increases the likelihood of leaving the organisation. A commonly used method such as ordinary least squares (OLS) regression will not deliver meaningful results, because both age and the probability of leaving may be related to a confounding factor such as individual income: younger people usually earn lower salaries, and income is information we often do not have.
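The mechanics of this bias are easy to reproduce in a small simulation. In the stylised setup below, where every coefficient is made up for illustration, churn is driven only by income, which in turn rises with age; a naive regression of churn on age alone suggests an age effect, and that effect disappears once income is included.

```python
# Omitted variable bias: a stylised simulation (all coefficients are made up)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50_000

age = rng.uniform(20, 65, n)
income = 1_000 + 40 * age + rng.normal(0, 300, n)   # income rises with age
churn = 5 - 0.001 * income + rng.normal(0, 1, n)    # churn depends on income, NOT on age

# Naive model: churn ~ age (income omitted) -> a spurious "age effect" appears
print(sm.OLS(churn, sm.add_constant(age)).fit().params)

# Model with the confounder: churn ~ age + income -> the age coefficient is ~0
X = sm.add_constant(np.column_stack([age, income]))
print(sm.OLS(churn, X).fit().params)
```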

Unfortunately, omitted variable bias is present in almost every empirical analysis of observational data. For machine learning applications this is not much of a problem per se, because all we want is a good prediction of the outcome variable, the “What”. However, since we often need to know the causes of the “What”, understanding the “Why” is essential if organisations are to take strategic action in their favour. Therefore, beyond conventional ways of measuring associations, we urgently need alternative approaches to reliably quantify causal effects.

Many organisations run the best experiments for causal inference without even knowing

To see what such approaches look like, let’s dive a bit into the broad spectrum of methods for credibly measuring causal relationships. Nowadays, most empirical scientists would argue that randomized controlled trials (RCTs) are the gold standard of causal inference. The logic behind RCTs is rather simple: the researcher splits a sample of individuals randomly into two groups and gives the treatment of interest to one group (the “treatment group”) but not to the other (the “control group”). The difference in the outcome between the treated and the control group is then considered the causal effect of the treatment. If you closely followed the news about the efficacy of the COVID-19 vaccines, you will have noticed that those studies use the same research design: clinical RCTs.
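In code, the core RCT estimate boils down to a difference in mean outcomes between the two randomised groups. The sketch below uses simulated data with a made-up treatment effect of +2.

```python
# The core RCT estimate: difference in mean outcomes, treated vs. control (simulated data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000

treated = rng.integers(0, 2, n).astype(bool)         # random assignment
outcome = 10 + 2.0 * treated + rng.normal(0, 5, n)   # true (made-up) effect of +2

effect = outcome[treated].mean() - outcome[~treated].mean()
t, p = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"estimated effect: {effect:.2f} (p = {p:.3g})")
```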

Despite their strength and credibility in measuring causality, such controlled experiments are often very expensive to conduct, sometimes ethically questionable, and frequently outright impossible. For instance, if we wanted to understand with an RCT whether age has a causal effect on the likelihood of ending a membership, we would need to change the age of randomly selected individuals in the treatment group. Obviously, we cannot change a person’s age. Therefore, we need different approaches.

Such approaches are broadly called quasi-experimental research designs, and they are often the only way to measure causality with observational data. One of the earliest and most astonishing applications of these methods dates back to the ingenious London doctor John Snow, who used water-supply data to work out that cholera was transmitted via water rather than air (the dominant view in 1854), without a single look at the pathogen through a microscope. His discovery has most probably saved millions of lives.

While researchers today strive to live up to Dr Snow’s ingenuity, applying such quasi-experimental methods is anything but trivial. The good news, however, is that many organisations are in fact conducting very large-scale quasi-experiments, often without even knowing it. The analysis of such datasets has enabled astonishing discoveries in business, economics, and political science over the last quarter of a century.

For instance, Hartmann and Klapper (2018) and Stephens-Davidowitz et al. (2017) quantified the returns to television advertising using the Super Bowl as a natural experiment. Anderson and Magruder (2012) found that an “extra half-star rating [on Yelp] causes restaurants to sell out 19 percentage points (49%) more frequently” by adopting a clever regression-discontinuity design. And in a very recent study, Garz and Martin (2020) measured the causal impact of media reporting on vote choice for the incumbent government, a finding that is highly relevant for political campaigning in democracies around the world.
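For readers unfamiliar with the last type of design, the regression-discontinuity logic can be sketched in a few lines: compare units just below and just above a threshold that shifts the treatment as good as randomly. The cutoff, bandwidth, and data below are invented purely for illustration and have nothing to do with the cited studies.

```python
# A minimal regression-discontinuity sketch (cutoff, bandwidth, and data are invented)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 20_000

rating = rng.uniform(3.0, 4.0, n)   # continuous "true" rating (running variable)
above = rating >= 3.75              # displayed rating rounds up at the cutoff
sales = 50 + 5 * rating + 8 * above + rng.normal(0, 10, n)  # made-up jump of +8 at cutoff

# Local linear regression within a narrow bandwidth around the cutoff
bw = 0.1
near = np.abs(rating - 3.75) < bw
X = sm.add_constant(np.column_stack([rating[near] - 3.75, above[near]]))
print(sm.OLS(sales[near], X).fit().params)  # coefficient on `above` ~ the discontinuity
```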

Organisations need to ensure that we use the best of both worlds

All these examples highlight that understanding causal relationships can have large implications for the strategy an organisation adopts to compete successfully in the future. While AI-based solutions have already received a lot of interest beyond academia, causal inference has not yet been extensively leveraged for data-driven decision making.

However, predicting the “What” and understanding the “Why” is, in my opinion, the most comprehensive and valuable way to exploit the wealth of information our organisations are sitting on today. The challenges ahead are huge. But so are the datasets. We have the computational resources. And the (quasi-)experimental setups.

Leaders need to ask themselves: How can we combine the predictive power of machine learning with the strengths of causal inference to leverage our datasets in a post-pandemic world? To stay ahead, we should no longer ask for the “What” without the “Why”.

Reading suggestions
- The Book of Why, by Judea Pearl
- Causal Inference — The Mixtape, by Scott Cunningham
- Mostly Harmless Econometrics: An Empiricist’s Companion, by Joshua D. Angrist and Jörn-Steffen Pischke

Fellow at LSE Methodology | PhD from Oxford | working/worked at International Monetary Fund, World Bank, Oxford, LSE, UCL and KCL