
Unlock the Power of Causal Inference & Front-door Adjustment: An In-depth Guide for Data Scientists

A full explanation of causal inference front-door adjustment with examples including all the Python source code

Photo by Evelyn Paris on Unsplash

Objective

By the end of this article you will understand the magic of causal inference front-door adjustment, which can calculate the effect of an event on an outcome even when other factors affecting both are unmeasured or even unknown, and you will have full access to all of the Python code.

I have scoured the Internet and many books trying to find a fully working example of the front-door formula in Python and I have drawn a blank, so unless there are sources out there that I have missed, what you are about to read is genuinely unique …

Introduction

In a recent article I explored the power of the backdoor adjustment formula to calculate the true effect of an event on an outcome even if there are observable factors that are "confounding" both …

Unlock the Power of Causal Inference: A Data Scientist’s Guide to Understanding Backdoor…

The aim was to establish the true effect of taking a drug on patient recovery rates and the magic of the backdoor adjustment formula recovered this effect even though "male" was obscuring that result because –

  • A higher proportion of males took the drug compared to females
  • Males had a higher recovery rate than females
Image by Author

In this example "male" is a "confounder", but the values for "male" were included in the observation data, and the backdoor adjustment formula was then applied to prove that the drug trial was having a positive impact.

But what if the "confounder" could not be measured and was not included in the data?

A Real World Example

During the 1950s there was a statistical war raging between scientists, who strongly believed that smoking caused respiratory illness, and the tobacco companies, who managed to produce "evidence" to the contrary.

The essence of this evidence was the proposal by the tobacco companies that a genetic factor was responsible both for smokers taking up smoking and for their likelihood of developing respiratory illness. This was a convenient hypothesis for the tobacco companies because it was nearly impossible to test.

Here is a proposal for the causal links between the factors involved …

Image by Author

A Causal Inference Solution

If this is the only data you have, i.e. a simple backdoor path from an unobserved confounder to both an event and an outcome, then there is nothing that can be done; the true effect cannot be recovered.

However there are other "patterns" where the effect can be recovered including the front-door criteria and instrumental variables. This article will fully explain the first of those patterns.

To satisfy the front-door criteria there needs to be an intermediary between the event and the outcome, and in the smoking example it could look like this –

Image by Author

i.e. smoking causes tar and tar causes respiratory illness rather than a direct causal link.

When this pattern exists, the effect of the event (smoking) on the outcome (respiratory illness) can be isolated and recovered irrespective of the influence of an unobserved confounder using the "Front-Door Adjustment Formula" as proposed by Judea Pearl in "The Book of Why" and "Causal Inference in Statistics".

The Book of Why: The New Science of Cause and Effect (Penguin Science)

Causal Inference in Statistics: A Primer

Excluding the influence of an unobserved confounder seems like magic, and the implications genuinely are amazing, but if you follow the steps in the rest of this article you will be able to add this amazing technique to your data science tool bag with just a few lines of Python code!

Getting Started

The first thing we need is some test data. I have created a synthetic dataset using my BinaryDataGenerator class. If you would like the full source code, head over to this article –

How to Generate Synthetic Data for any Causal Inference Project in less than 10 Lines of Code

Image by Author

A summary analysis of the data is as follows –

  • There were 800 people in the sample.
  • 50% of the sample population were smokers (400/800)
  • 95% of smokers had tar deposits (380/400)
  • 5% of non-smokers had tar deposits (20/400)
  • 15% of smokers with tar had respiratory illness (57/380)
  • 10% of smokers with no tar had respiratory illness (2/20)
  • 95% of non-smokers with tar had respiratory illness (19/20)
  • 90% of non-smokers with no tar had respiratory illness (342/380)
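The BinaryDataGenerator class itself lives in the linked article. As a self-contained stand-in, a DataFrame that reproduces these exact counts can be built directly; note that the make_smoking_data name and the group layout below are my own, not from the original –

```python
import pandas as pd

def make_smoking_data() -> pd.DataFrame:
    """Build the 800-row synthetic dataset summarised above.

    Each tuple is (smoking, tar, respiratory, count); the counts
    reproduce the bullet-point summary statistics exactly.
    """
    groups = [
        (1, 1, 1, 57),   # smokers with tar, ill        (15% of 380)
        (1, 1, 0, 323),  # smokers with tar, healthy
        (1, 0, 1, 2),    # smokers without tar, ill     (10% of 20)
        (1, 0, 0, 18),   # smokers without tar, healthy
        (0, 1, 1, 19),   # non-smokers with tar, ill    (95% of 20)
        (0, 1, 0, 1),    # non-smokers with tar, healthy
        (0, 0, 1, 342),  # non-smokers, no tar, ill     (90% of 380)
        (0, 0, 0, 38),   # non-smokers, no tar, healthy
    ]
    rows = [
        {"smoking": s, "tar": t, "respiratory": r}
        for s, t, r, n in groups
        for _ in range(n)
    ]
    return pd.DataFrame(rows)

df = make_smoking_data()
print(len(df))              # 800
print(df["smoking"].sum())  # 400
```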

First Attempt : Using the Pgmpy Library

In my article on backdoor criteria I started by showing a simple solution using pgmpy.

Given how easy it was to apply the backdoor criteria in that example, it should be very straightforward to apply the front-door criteria in the same way. Here is the code that should do it …

The expected result is 4.5% (much more on this later!) but pgmpy crashes with ValueError: Maximum Likelihood Estimator works only for models with all observed variables. Found latent variables: set().

After a lot of research, and after raising an issue with the developers, my conclusion is that pgmpy does not work when applying the "do" operator (i.e. making an intervention) where there is an unobserved confounder, and that pgmpy cannot apply the front-door adjustment formula.

It is worse than that though as the DoWhy library does not work in this instance either.

DoWhy can deal with unobserved confounders when calculating the "Average Treatment Effect" (ATE) but when the "do" operator is being applied to simulate an intervention it fails in the same way as pgmpy.

ATE is applied to continuous variables, so we can ask DoWhy a question like "If carbon dioxide emissions increase by 100 million tonnes, what is the causal effect on global temperatures?" and DoWhy will produce a result.

However, when applying a "do" intervention to discrete, binary data, for example "What is the probability of respiratory illness given that everyone in the sample smokes?", neither pgmpy nor DoWhy can perform the calculation where an unobserved confounder is present, and to date I have not found any other libraries that can.

My backdoor article moved on from the pgmpy implementation to provide an example of the maths to show what pgmpy was doing behind the scenes. In this article an understanding of the maths is required up front so that we can build our own implementation of the front-door adjustment formula in Python …

Second Attempt: Working it Out by Hand

The objective is to calculate the Average Causal Effect (ACE) by simulating the following –

  1. Travel back in time and perform an intervention which forces everyone to smoke.
  2. Perform the same time-travelling trick again and this time force everyone to quit.
  3. Subtract the second result from the first.

Expressed mathematically using the "do" operator this amazing feat looks like this –

ACE = P(Y=1|do(X=1)) − P(Y=1|do(X=0))

And as we know that there is an unobserved confounder and a front-door path in the data, we need to substitute each side of the ACE formula with the front-door adjustment formula as proposed by Judea Pearl …

Let’s start with the left hand side of the ACE formula, substitute in the front-door adjustment formula and use the variables that are present in our data instead of x, y and z. To keep things neat and tidy the following abbreviations will be used: S = smoking, R = respiratory, T = tar …

P(R=1|do(S=1)) = Σt P(T=t|S=1) [ Σs P(R=1|S=s, T=t) P(S=s) ]

t can take the values {0, 1} and s can take the values {0, 1}, so we now need to expand as follows …

P(R=1|do(S=1)) = P(T=0|S=1) [ Σs P(R=1|S=s, T=0) P(S=s) ] + P(T=1|S=1) [ Σs P(R=1|S=s, T=1) P(S=s) ]

… and the inner Σs terms can be further expanded as follows …

Σs P(R=1|S=s, T=t) P(S=s) = P(R=1|S=0, T=t) P(S=0) + P(R=1|S=1, T=t) P(S=1)

Now it should be a simple matter of substituting the conditional probabilities from the data. A Python function will be provided to calculate any conditional probability from data in the next section, but for now here are the values that are needed –

  • P(S=0) = 0.5, P(S=1) = 0.5
  • P(T=0|S=1) = 0.05, P(T=1|S=1) = 0.95
  • P(R=1|S=0, T=0) = 0.90
  • P(R=1|S=1, T=0) = 0.10
  • P(R=1|S=0, T=1) = 0.95
  • P(R=1|S=1, T=1) = 0.15

Substituting these conditional probabilities gives …

P(R=1|do(S=1)) = 0.05 × (0.90 × 0.5 + 0.10 × 0.5) + 0.95 × (0.95 × 0.5 + 0.15 × 0.5)

So …

P(R=1|do(S=1)) = 0.05 × 0.5 + 0.95 × 0.55 = 0.025 + 0.5225 = 0.5475 = 54.75%

… and if you re-calculate all of the steps above again for P(R=1|do(S=0)) the answer is …

P(R=1|do(S=0)) = 0.95 × 0.5 + 0.05 × 0.55 = 0.475 + 0.0275 = 0.5025 = 50.25%

And so the overall Average Causal Effect (ACE) is …

ACE = P(R=1|do(S=1)) − P(R=1|do(S=0)) = 0.5475 − 0.5025 = 0.045 = 4.5%
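The arithmetic above can be checked with a few lines of Python that hard-code the conditional probabilities from the data and evaluate the expanded front-door expression –

```python
# Hard-code the conditional probabilities listed above and evaluate
# the expanded front-door expression.
p_s = {0: 0.5, 1: 0.5}                        # P(S=s)
p_t_given_s = {(0, 0): 0.95, (1, 0): 0.05,    # P(T=t|S=s), keyed (t, s)
               (0, 1): 0.05, (1, 1): 0.95}
p_r_given_st = {(0, 0): 0.90, (0, 1): 0.95,   # P(R=1|S=s,T=t), keyed (s, t)
                (1, 0): 0.10, (1, 1): 0.15}

def p_do(s: int) -> float:
    """P(R=1|do(S=s)) = Σt P(T=t|S=s) Σs' P(R=1|S=s',T=t) P(S=s')"""
    return sum(
        p_t_given_s[(t, s)]
        * sum(p_r_given_st[(s2, t)] * p_s[s2] for s2 in (0, 1))
        for t in (0, 1))

ace = p_do(1) - p_do(0)
print(round(p_do(1), 4), round(p_do(0), 4), round(ace, 4))  # 0.5475 0.5025 0.045
```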

That was a lot of effort to work out the Average Causal Effect by hand! Fortunately, now that the workings of the front-door adjustment formula are fully understood it is relatively easy to convert all of this to Python so that the whole thing can be fully automated for any dataset where the features are discrete values …

Third Attempt: A Reusable Python Function

The third attempt involves building a reusable Python function that implements the maths in the previous section for any simple DAG and any DataFrame, so that the maths can be put to one side once it has been understood.

The implementation of this function will need to make use of conditional probabilities, and it will require a simple Python function to calculate those probabilities from any DataFrame.

I have left the details of the calc_cond_prob function out of this article to keep the focus on front-door adjustment but you can read a full explanation and download the source code from this article …

How to Calculate Conditional Probabilities from Any DataFrame in 3 Lines of Code

Once you have downloaded calc_cond_prob it can be used to easily calculate conditional probabilities from any DataFrame as follows …

p(respiratory=0 | smoking=0, tar=0) = 0.1

… or alternatively the outcome / result and events can be specified explicitly as follows …

p(respiratory=0 | smoking=0, tar=0) = 0.1
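For completeness, here is a minimal stand-in for calc_cond_prob; the real implementation and signature are in the linked article, so treat the query-string interface below as an assumption –

```python
import pandas as pd

def calc_cond_prob(df: pd.DataFrame, outcome: str, events: str) -> float:
    """p(outcome | events), with both expressed as pandas query strings.

    Hypothetical stand-in for the function described in the linked
    article; the real signature may differ.
    """
    given = df.query(events)
    return len(given.query(outcome)) / len(given)

# Tiny demonstration frame: 10 non-smokers without tar, 9 of whom
# have respiratory illness, mirroring p(respiratory=0|...) = 0.1.
df = pd.DataFrame({"smoking": [0] * 10, "tar": [0] * 10,
                   "respiratory": [1] * 9 + [0]})
print(calc_cond_prob(df, "respiratory==0", "smoking==0 and tar==0"))  # 0.1
```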

The previous section explained the Mathematics behind the Pearlean front-door-adjustment formula and provided a fully worked example.

Given those building blocks (and the calc_cond_prob function) a Python function can be developed that will calculate the front-door adjustment formula for any DataFrame that contains the following features –

  • X – treatment
  • Y – outcome
  • Z – mediator

Here is the full source code for front-door adjustment …

… and the function can be called like this …
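The original listing is not reproduced here; a sketch of such a function, together with a call on the smoking dataset (rebuilt inline so the example runs on its own), could look like this –

```python
import pandas as pd

def front_door_adjustment(df: pd.DataFrame, x: str, y: str, z: str) -> float:
    """Average Causal Effect of binary treatment x on binary outcome y,
    recovered through binary mediator z with the front-door formula."""

    def cond_prob(outcome: str, events: str) -> float:
        given = df.query(events)
        return len(given.query(outcome)) / len(given)

    def p_marginal(col: str, val: int) -> float:
        return len(df.query(f"{col}=={val}")) / len(df)

    def p_do(x_val: int) -> float:
        # P(y=1|do(x)) = Σz P(z|x) Σx' P(y=1|x',z) P(x')
        return sum(
            cond_prob(f"{z}=={z_val}", f"{x}=={x_val}")
            * sum(cond_prob(f"{y}==1", f"{x}=={x2} and {z}=={z_val}")
                  * p_marginal(x, x2)
                  for x2 in (0, 1))
            for z_val in (0, 1))

    return p_do(1) - p_do(0)

# Rebuild the smoking dataset and call the function.
groups = [(1, 1, 1, 57), (1, 1, 0, 323), (1, 0, 1, 2), (1, 0, 0, 18),
          (0, 1, 1, 19), (0, 1, 0, 1), (0, 0, 1, 342), (0, 0, 0, 38)]
df = pd.DataFrame([{"smoking": s, "tar": t, "respiratory": r}
                   for s, t, r, n in groups for _ in range(n)])

ace = front_door_adjustment(df, x="smoking", y="respiratory", z="tar")
print(round(ace, 4))  # 0.045
```

This reproduces the 4.5% Average Causal Effect from the hand calculation.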

Conclusion

To start with the elephant in the room, if the effect of smoking was an increase in the average probability of respiratory illness of just 4.5% this would not persuade many smokers to quit.

However, we saw that the individual probability of respiratory illness given smoking is P(respiratory=1|do(smoking=1)) = 54.75%.

The reason the average causal effect is so low is that our fictitious tobacco companies pulled the dastardly trick of stacking the deck by ensuring that lots of non-smokers with respiratory illness made it into the sample in an attempt to obfuscate the truth, i.e. that smoking does cause respiratory illness.

But even with this noise in the data, and even if we accept the unlikely hypothesis that an unmeasurable genetic factor exists that confounds both the event and the outcome, the magic of the front-door adjustment formula has still uncovered a positive causal link between smoking and respiratory illness!

This amazing outcome is unlike anything I have discovered in other data science techniques, and it plays into the most common questions that the customers of my machine learning predictions always ask, i.e. –

  • Why does that happen?
  • What should I do to change the outcome and improve things?

These types of "why?" questions make the knowledge, ability and understanding required to apply front-door adjustment in order to calculate the effect of "interventions" an invaluable addition to the data science toolkit.

Unfortunately the currently available libraries including pgmpy and DoWhy do not work when applying the "do" operator to discrete data sets that include an unobserved confounder and a front-door path.

That is a massive gap in the functionality of those libraries, and having searched at length to find a Python solution with a worked example, both online and in books, I could not find anything.

Unless I have overlooked some examples, that makes this article unique, and I wish I had been able to read it when front-door adjustment first began to fascinate me rather than having to do all of that research myself.

It was a lot of fun though and I really hope you like the result!

Bonus Section

So, having said that pgmpy does not work in this scenario, and having come so far on my learning journey, I decided to write a version of the front-door adjustment formula using pgmpy to correct that omission.

Just to note, I decided to re-factor the formula to make the Python implementation a bit more concise, changing this …

P(Y=y|do(X=x)) = Σz P(Z=z|X=x) [ Σx' P(Y=y|X=x', Z=z) P(X=x') ]

… into this …

P(Y=y|do(X=x)) = Σx' P(X=x') [ Σz P(Z=z|X=x) P(Y=y|X=x', Z=z) ]

… which is mathematically equivalent and is just like saying –

4 x 3 x 1 x 2 x 2 = 4 x 1 x 2 x 2 x 3

Note: see "Causal Inference in Statistics" by Pearl, Glymour and Jewell, p68 (3.15) and p69 (3.16) for a full explanation of this equivalence.

Back to the solution: the first step is to create the causal model using pgmpy classes. To note: the unobserved confounder must be removed from the edges list, as this is what causes the BayesianNetwork.fit() method to crash with a ValueError.

Once the set-up is complete, the front-door formula can be implemented in Python as follows …

And just to prove that it works, the calculation produces exactly the same results as both the manual calculation and the earlier Python function that works directly on the DataFrame.

Connect and Get in Touch …

If you enjoyed this article you can get unlimited access to thousands more by becoming a Medium member for just $5 a month by clicking on my referral link (I will receive a proportion of the fees if you sign up using this link at no extra cost to you).

Join Medium with my referral link – Graham Harrison

… or connect by …

Subscribing to a free e-mail whenever I publish a new story.

Taking a quick look at my previous articles.

Downloading my free strategic data-driven decision making framework.

Visiting my data science website – The Data Blog.

