Introduction
Causal Inference is an emerging branch of data science concerned with determining the cause-and-effect relationships between events and outcomes, and it has the potential to significantly add to the value that machine learning can generate for organisations.
For example, a traditional machine learning algorithm can predict which loan customers are likely to default, thereby enabling proactive intervention with those customers. However, although this algorithm is useful for reducing loan defaults, it has no concept of why they occur. Proactive intervention is valuable, but knowing the reasons for defaults would enable the underlying causes to be addressed; in that world, proactive intervention may no longer be necessary because the factors that lead to defaulting have been permanently removed.
This is the promise of Causal Inference and why it has the potential to deliver significant impact and outcomes to those organisations that can harness it.
There are a number of different approaches, but the most common typically starts by augmenting the data with a "Directed Acyclic Graph" which encapsulates and visualises the causal relationships in the data, and then uses causal inference techniques to ask "what-if" type questions.
The Problem
A Directed Acyclic Graph (DAG) that encapsulates the causal relationships in the data is typically constructed manually (or semi-manually) by data scientists and domain experts working together. Hence the DAG could be wrong, which would invalidate any causal calculations, leading to flawed conclusions and potentially incorrect decisions.
The Opportunity
A range of techniques exists for "Causal Validation" (the process of validating the DAG against the data), and if these techniques work they can minimise or eliminate errors in the DAG, thereby giving confidence that the resulting calculations and conclusions are sound.
The Way Forward
The statistical concept of dependence between random variables can be used to ascertain whether a relationship that exists in the DAG also exists in the data; if it does, the DAG is more likely to be correct, and if not, it is more likely to be incorrect.
Getting Started
We are going to need an example DAG to work through the problem, one with enough nodes and links to enable a deep exploration of causal validation …
Each node in a DAG either has a causal effect on other node(s) or other node(s) have a causal effect on it, and the direction of the arrow is the direction of the causal effect. For example, one of the causes of "B" is "C" and one of the causes of "C" is "F".
The example DAG is fictitious, hence the node letters / names are unimportant; however, "X" is intended to be the "treatment", "Y" is the "effect", and all the other nodes have some causal impact that would, in a real-world example, obscure the true effect of X on Y.
Note that the light-blue nodes have no inputs (exogenous in causal terminology) and the dark-blue nodes have one or more inputs (endogenous in causal terminology).
To get started we will also need some data that matches the DAG. The dataset below is entirely synthetic and has been generated by the author. It exactly encapsulates and matches the structure suggested by the DAG and contains no erroneous or faulty relationships …
Another thing we need before we can get started is a way of extending the pandas DataFrame and Series classes with custom methods so that the code we write is clean, concise and easy to understand.
Here is a link to one of my previous articles that provides an end-to-end tutorial on how to extend data frames and why it is a useful thing to do …
How to Extend Pandas DataFrames with Custom Methods to Supercharge Code Functionality & Readability
Understanding Dependence
One definition of dependence is as follows …
Dependence between two random variables means that the occurrence or value of one variable affects the occurrence or value of the other. Variables are considered dependent if the occurrence or value of one variable provides information about the occurrence or value of the other variable.
To unpack this, let’s take another look at our example DAG and consider the causal factors that affect node Y …
In this visualisation we can see that node Y is caused by (and hence dependent on) 5 different factors – C, E, F, G and X.
Now let's take another look at the data that the DAG is representing …
This synthetic data set was created by the author to facilitate the article so I happen to know that the relationship between node Y and those dependent factors is as follows …
Y = 3C + 3E + 2F + 2.5G + 1.5X + ε
(Note: ε represents the error term)
… and this can be tested and verified by picking a row (in this case I have chosen the 3rd row) and applying that formula to the data …
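As a rough sketch, assuming the synthetic data is held in a pandas DataFrame named df_causal (the name used later in the article), the check for the 3rd row might look like this …

```python
# Rough sketch: verify the generating formula for a single row of the synthetic
# data. `df_causal` (assumed name) holds the dataset; the 3rd row is index 2.
row = df_causal.iloc[2]
predicted_y = 3 * row["C"] + 3 * row["E"] + 2 * row["F"] + 2.5 * row["G"] + 1.5 * row["X"]
error_term = row["Y"] - predicted_y
print(f"Y = {row['Y']}, error term = {error_term}")
```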
Y = -422.1827393983049, error term = 48.75941612372628
We can now see why and how Y is dependent on C, E, F, G and X. If the value of one of those dependent variables changes, the value of Y will also change. We can also see from the DAG that Y should not be dependent (for example) on node D because there is no link between D and Y.
The statement "Y is dependent on C, E, F, G and X" can be represented in a mathematical formula as follows …
… and the statement "Y is independent of D" is represented as follows …
The ⫫ symbol is called a "double up-tack" but the ⫫̸ symbol does not have a commonly accepted name so I have adopted "slashed double up-tack" through personal preference.
Some articles and texts use a single up-tack (⊥ and ⊥̸) instead of double up-tacks but double up-tacks are more common hence that is the standard that I have adopted and used throughout this article and the associated Python code.
To recap then, statistical dependence between two random variables means that "the occurrence or value of one variable affects the occurrence or value of the other" and we now know how this looks visually in the DAG, how to represent it as a mathematical formula (e.g. Y = 3C + 3E + 2F + 2.5G + 1.5X + ε) and also how to represent it using the slashed double up-tack notation (e.g. Y ⫫̸ C, E, F, G, X).
From Dependence to Causal Validation
Causal Inference typically starts with a set of data and then augments that data with a DAG. There are emerging techniques that can reverse engineer a DAG from the data, but they are not yet accurate or consistent, hence the most common approach to developing a DAG is to ask the domain experts what they think the causal relationships are and then to validate or test that DAG against the data, amending it as necessary if validation fails.
The DAG proposes that Y is dependent on C, E, F, G and X. If this dependency also exists in the data then there will be confidence that the causal links pointing into node Y are valid and correct, and there is a mathematical notation that can be used to represent this as follows …
This scary-looking formula is actually very simple to understand. The "G" subscript of the first slashed double up-tack dependency symbol means "in the graph" (i.e. the DAG) and the "D" subscript of the second means "in the data" (note that I have seen a "P" subscript in some of the literature, but "D" makes more sense to me so that is what I have adopted).
Armed with that knowledge, the whole formula can be read as "If Y is dependent on C, E, F, G and X in the graph then Y should also be dependent on C, E, F, G and X in the data".
It follows that we just need a mechanism in Python that can detect dependencies in the data. That mechanism can then be used to check each node in the DAG that has incoming connections; if the dependencies detected in the data match those in the DAG, we can be reasonably confident that there are no spurious connections (causal links) and that the DAG is a valid representation of the data in this respect.
Observing Dependence in the Data
Let’s start by visualising the relationships that exist in the data between C, E, F, G and X and our node of interest Y …
The chart on the right plots Y on the x axis and separate lines for C, E, F, G and X on the y axis. If Y is dependent on these other variables then changing the value of one of them should change the value of Y. This means that there should be a positive or negative coefficient and the lines should exhibit a noticeable slope (either upwards or downwards).
Given that there are definite slopes we can see that Y ⫫̸ C, E, F, G, X is true, i.e. that Y is dependent on C, E, F, G and X in the data.
If however there is no dependence, then changing the value of a variable would have little or no effect on Y, the coefficient would be close to zero and the line would have no slope, i.e. it would be flat.
This can be demonstrated by adding the relationship between Y and D to the chart remembering that there is no causal link from D to Y in the DAG so there should be no relationship between Y and D in the data either …
This is looking exactly how we would expect. C, E, F, G and X all have clear slopes and either a negative or positive coefficient, clearly showing that if the value of those variables changes, the value of Y changes too, so Y is dependent on those variables.
However, the slope for D is flat and the coefficient is very small (just -0.029), so changing the value of D will have a negligible effect on the value of Y; hence the independence relationship Y ⫫ D (Y is independent of D) holds in the data.
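A rough sketch of how such a chart could be produced is shown below; the seaborn calls and styling are illustrative assumptions rather than the author's original plotting code, and df_causal is again assumed to hold the synthetic data …

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Rough sketch: fit and plot a regression line of each candidate parent against
# Y; a visible slope suggests dependence, a flat line suggests independence.
fig, ax = plt.subplots(figsize=(8, 6))
for column in ["C", "E", "F", "G", "X", "D"]:
    sns.regplot(data=df_causal, x="Y", y=column, scatter=False, label=column, ax=ax)
ax.set_xlabel("Y")
ax.set_ylabel("Candidate parent value")
ax.legend()
plt.show()
```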
Implementing Dependency in the Data in Python
The proposed method for detecting dependencies in the data uses the ols function from the statsmodels.formula.api module to perform an ordinary least squares (OLS) regression.
An ols model can be fitted to a data set and the coefficients (or slopes) that exist in the data can be extracted and interpreted. Here is how it is done …
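As a minimal sketch (the names ols_formula, dependent_variable and variables are referred to later in the article; everything else is an assumption), the fitting step might look like this …

```python
import statsmodels.formula.api as smf

# Minimal sketch: regress Y on its proposed parents and inspect the summary.
# `df_causal` is the assumed name of the synthetic dataset.
dependent_variable = "Y"
variables = ["C", "E", "F", "G", "X"]
ols_formula = "Y ~ C + E + F + G + X"

model = smf.ols(formula=ols_formula, data=df_causal).fit()
print(model.summary())
```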
The key data in the summary is the middle table which provides some analysis of the variables C, E, F, G and X in respect of their relationship with Y. For example the ols analysis is proposing the following –
Y = 2.03C + 3.02E + 1.84F + 6.33G + 1.54X − 25.2
and this is not too far away from the formula that I used to create the dataset which was …
Y = 3C + 3E + 2F + 2.5G + 1.5X + ε
The biggest difference is for node G, but for the purposes of validation the magnitude of the coefficient is not really important, just that a coefficient exists and that the slope is not flat.
Apart from the coef column, the other item of interest is the P>|t| (p-value) column, which works as follows …
- The null hypothesis is that there is no relationship between the variable (e.g. E) and the dependent variable (e.g. Y)
- If the p-value is less than the alpha (usually set at 0.05) then the null hypothesis is rejected i.e. there is a relationship i.e. there is dependence.
For example the p-values for E, G and X are all below 0.05 so the null hypothesis can be rejected and dependence can be assumed.
But what about C and F? C has a p-value of 0.076, which is slightly above alpha, and F has a value of 0.275, which is significantly above our chosen alpha (0.05).
We could just increase alpha until we conclude that all of the variables are dependent but that approach will not work very well in the long run as it will start concluding dependence where none exists.
When I did the original development I almost gave up at this point, assuming that ols could not be used as a reliable method to detect dependence across my DAGs and data, but then I took another look at the ols analysis.
A coefficient can be observed for all 5 variables, but the p-value is conclusive for only 3 out of the 5. I then swapped to using the coef value instead, but further down the line I found instances where the p-value worked but the coef did not.
After many frustrating hours and a lot of trial-and-error I established a method which uses both values and which has exhibited a high degree of accuracy when rigorously tested against many different datasets and DAGs.
Here is the method that I use to detect dependency …
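A rough sketch of such a method, using the p-value and coefficient rule explained below (the exact structure is an assumption rather than the author's verbatim implementation), might look like this …

```python
import statsmodels.formula.api as smf

# Rough sketch of the dependency check: a variable is treated as independent of
# the target only when its p-value exceeds 0.05 AND the absolute value of its
# coefficient is 1.0 or less; otherwise it is treated as dependent.
dependent_variable = "Y"
variables = ["C", "E", "F", "G", "X"]
ols_formula = "Y ~ C + E + F + G + X"

results = smf.ols(formula=ols_formula, data=df_causal).fit()

for variable in variables:
    p_value = results.pvalues[variable]
    coefficient = results.params[variable]
    if p_value > 0.05 and abs(coefficient) <= 1.0:
        print(f"VALIDATION ERROR: {dependent_variable} is not dependent on {variable} in the data")
    else:
        print(f"VALIDATION SUCCESS: {dependent_variable} is dependent on {variable} in the data")
```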
VALIDATION SUCCESS: Y is dependent on C in the data
VALIDATION SUCCESS: Y is dependent on E in the data
VALIDATION SUCCESS: Y is dependent on F in the data
VALIDATION SUCCESS: Y is dependent on G in the data
VALIDATION SUCCESS: Y is dependent on X in the data
The test I have adopted through trial and error is as follows …
If the p-value is greater than 0.05 AND the absolute value of the coefficient is less than or equal to 1.0, then assume no dependency; otherwise assume dependency.
This approach does not follow the purely statistical approach, which would be to consider the p-value in isolation, but a significant amount of testing has suggested that it works very reliably.
Optimising the Python Code
One drawback with the approach above is that the formula is embedded in the code, i.e. in ols_formula = "Y ~ C + E + F + G + X" and also in the declarations of dependent_variable and variables, and this will lead to code repetition in a real-world example.
It would be much better if a way could be found to extend the DataFrame class to be able to perform dependency tests generically on any dataset.
Fortunately it is easy to add custom methods to the DataFrame class by using a technique called "monkey patching". If you would like a step-by-step tutorial please take a look at my tutorial article …
How to Extend Pandas DataFrames with Custom Methods to Supercharge Code Functionality & Readability
Here is the optimised code that enables any dependency test to be executed against any dataset …
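A rough sketch of what such a monkey-patched dependence method might look like is shown below; the method name matches the one used in the article, but the signature, defaults and the choice to return the list of failing parents are assumptions …

```python
import pandas as pd
import statsmodels.formula.api as smf

def dependence(self, dependent_variable, variables, alpha=0.05, coef_threshold=1.0):
    """Test a dependency against the data held in this DataFrame.

    Returns the list of variables that the dependent variable is NOT dependent
    on in the data (i.e. the candidate spurious parents); an empty list means
    the dependency validates with no errors.
    """
    formula = f"{dependent_variable} ~ {' + '.join(variables)}"
    results = smf.ols(formula=formula, data=self).fit()

    errors = []
    for variable in variables:
        p_value = results.pvalues[variable]
        coefficient = results.params[variable]
        # No dependence only when the p-value is high AND the coefficient is small
        if p_value > alpha and abs(coefficient) <= coef_threshold:
            errors.append(variable)
    return errors

# Monkey patch: attach the method to every pandas DataFrame
pd.DataFrame.dependence = dependence
```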
Once the DataFrame class has been extended with the dependence method, it is trivially easy to run any dependency test.
For example, we can try out Y ⫫̸ C, E, F, G, X which should validate as True with no errors …
We can try out Y ⫫̸ C, E, F, G, X, D which should validate as False, indicating that "D" is an error because Y is not dependent on it …
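Using the sketched method above, those two checks might look something like this (expected results are based on the outputs described in the article) …

```python
# Hypothetical usage of the sketched dependence method
print(df_causal.dependence("Y", ["C", "E", "F", "G", "X"]))       # expected: [] (no errors)
print(df_causal.dependence("Y", ["C", "E", "F", "G", "X", "D"]))  # expected: ['D']
```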
These tests both pass and the success rate is very high across all of the DAGs and datasets I have tried out to ascertain the accuracy of this approach.
Bringing it All Together
To summarise, the relatively small code-base presented above achieves an impressive outcome: it enables any dependency test to be carried out on any dataset, indicates whether that test passes or not and, where it fails, specifically highlights the errors.
However, more is needed. Let's assume that when we consulted our domain experts the DAG they produced contained an error: those experts had assumed that a causal link (or dependency) existed from node D to node Y.
The proposed DAG would now look like this …
Armed with our new capability we can easily test the DAG out for node Y as follows …
… and as we have seen in the results above, node "D" will be accurately identified as an "error". We have therefore identified a "spurious edge", i.e. a link that exists in the DAG but that does not exist in the data, and this tells us that the DAG must be adjusted to remove that spurious edge in order to be accurate.
It therefore follows that spurious edges can be detected across the whole DAG as follows …
- Start with a proposed DAG.
- Iterate over all nodes.
- Execute a dependency test for all incoming connections.
- Accumulate a list of all errors.
The accumulated list of errors will instantly indicate all spurious edges / connections / dependencies which must be removed from the proposed DAG to produce a new DAG that is free of all spurious edges (i.e. dependencies that exist in the DAG but not in the data).
The code to achieve this is as follows …
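A rough sketch of that loop, building on the dependence method sketched earlier, is shown below; the edge list is reconstructed from the test output that follows, and the function name and return format are assumptions …

```python
# The DAG edges as (cause, effect) tuples, reconstructed from the output below
dag_edges = [
    ("D", "A"),
    ("A", "B"), ("C", "B"),
    ("D", "C"), ("F", "C"),
    ("C", "E"),
    ("A", "X"), ("B", "X"), ("E", "X"), ("F", "X"), ("G", "X"),
    ("C", "Y"), ("E", "Y"), ("F", "Y"), ("G", "Y"), ("X", "Y"),
]

def validate_dag(df, edges, verbose=True):
    """Return the edges in the DAG that are not supported by the data."""
    spurious_edges = []
    nodes = sorted({effect for _, effect in edges})  # nodes with incoming edges
    for node in nodes:
        parents = sorted(cause for cause, effect in edges if effect == node)
        if verbose:
            print(f"{node} ⫫̸ {', '.join(parents)}")
        # Test all incoming connections for this node together
        failed = df.dependence(node, parents)
        spurious_edges.extend((parent, node) for parent in failed)
    return sorted(spurious_edges)

print(validate_dag(df_causal, dag_edges))
```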
Testing the Full Algorithm to Detect Spurious Edges in the DAG
With these few lines of code it is now possible to test any DAG (represented by a set of edges) against any data (represented by a pandas DataFrame) to see if there are any "spurious" edges in the DAG that do not exist in the data.
Let's start by testing the case where the DAG correctly represents all of the causal links in the data (remembering that df_causal correctly represents the DAG, as it was synthetically created by the author to be an exact representation) …
A ⫫̸ D
B ⫫̸ A, C
C ⫫̸ D, F
E ⫫̸ C
X ⫫̸ A, B, E, F, G
Y ⫫̸ C, E, F, G, X
[]
No errors were detected where the DAG matches the data.
Now let’s add a non-existent causal link D => Y into our DAG and re-run the code …
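Using the sketch above, this could look something like the following …

```python
# Hypothetical: append the spurious edge D -> Y to the edge list and re-validate
dag_edges_with_error = dag_edges + [("D", "Y")]
print(validate_dag(df_causal, dag_edges_with_error))
```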
A ⫫̸ D
B ⫫̸ A, C
C ⫫̸ D, F
E ⫫̸ C
X ⫫̸ A, B, E, F, G
Y ⫫̸ C, D, E, F, G, X
[('D', 'Y')]
The "spurious" edge was correctly identified in the DAG! But what about when there are multiple spurious causal relationships in the DAG that do not exist in the data? Will our algorithm still perform?
To test this out a second non-existent causal link A => E is added to the DAG …
A ⫫̸ D
B ⫫̸ A, C
C ⫫̸ D, F
E ⫫̸ A, C
X ⫫̸ A, B, E, F, G
Y ⫫̸ C, D, E, F, G, X
[('A', 'E'), ('D', 'Y')]
This test has also passed. If two spurious causal relationships are added to the DAG that do not exist in the data they are both correctly detected and identified as errors.
Testing the Algorithm to Destruction
These promising results give rise to the question "So, just how accurate is this method?", i.e. how many spurious causal relationships can be added to the DAG before it fails to detect them correctly?
To answer this question the author devised a challenging test that started by identifying every single valid causal link that could exist in the DAG but does not. In the case of this particular DAG the full set of possible links looks like this …
A test harness was then used to randomly select any 3 of the possible missing links at the same time, and to repeat that test for different sets in order to ascertain the accuracy of the validation algorithm.
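A minimal sketch of what such a harness might look like, reusing the helpers sketched earlier, is shown below; the candidate link set, trial count and accuracy measure are illustrative assumptions, and a complete harness would also need to exclude links that introduce cycles …

```python
import random

# Sketch of a test harness: inject a random set of spurious links into the DAG
# and check whether exactly those links are reported as errors.
all_nodes = ["A", "B", "C", "D", "E", "F", "G", "X", "Y"]
existing_edges = set(dag_edges)
possible_spurious = [
    (cause, effect)
    for cause in all_nodes
    for effect in all_nodes
    if cause != effect and (cause, effect) not in existing_edges
]  # a complete harness would also exclude links that create cycles

def run_trials(df, n_links=3, n_trials=100):
    """Return the proportion of trials where all injected links (and only those)
    were detected as spurious."""
    successes = 0
    for _ in range(n_trials):
        injected = random.sample(possible_spurious, n_links)
        detected = validate_dag(df, dag_edges + injected, verbose=False)
        if detected == sorted(injected):
            successes += 1
    return successes / n_trials

print(run_trials(df_causal, n_links=3))
```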
The results are astounding. The simple algorithm presented here detects any combination of 3 spurious links (using the example DAG and data) with 100% accuracy. Even changing the test to select any 12 of the possible spurious links together produces an accuracy of 90%!
Bonus Section: Separate vs. Combined Dependency Testing
Throughout the article the set of dependencies for a given node has been established by looking at all of the "parent" nodes and then creating a single statement of dependencies, for example …
You may be wondering whether the equivalent set of separate tests, one per parent, would work just as well …
One of the challenges the author faced was the initial assumption that these separate tests are equivalent to the single combined test when detecting spurious edges, but trial-and-error during testing led to the firm conclusion that this is not the case.
When looking for the spurious edge D => Y, the combined Y ⫫̸ C, E, F, G, X, D test was 100% reliable, but testing separately for Y ⫫̸ D does not work; this was proven by executing many rounds of automated testing to compare the accuracy of the two methods.
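Using the dependence method sketched earlier, the two styles of test can be compared along these lines …

```python
# Combined test: the spurious parent D is flagged alongside the genuine parents
print(df_causal.dependence("Y", ["C", "E", "F", "G", "X", "D"]))  # expected: ['D']

# Separate test: regressing Y on D alone proved unreliable in the author's testing
print(df_causal.dependence("Y", ["D"]))
```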
The assumption is that, because the formula that encapsulates the relationships between these variables is Y = 3C + 3E + 2F + 2.5G + 1.5X + ε, the OLS test which underpins the implementation of dependency needs to consider all of the variables together, and this also bears out another truism in causal inference …
it is very difficult or maybe even impossible to reverse engineer a DAG from the data but when a "first stab" has been made that gets most of the way there the task becomes achievable
The moral of this section is: consider all of the incoming relationships for each node together when testing for dependency, because if they are tested separately it simply does not work.
Connect and Get in Touch …
If you enjoyed this article, you can get unlimited access to thousands more by becoming a Medium member for just $5 a month by clicking on my referral link (I will receive a proportion of the fees if you sign up using this link at no extra cost to you).
… or connect by …
Subscribing to a free e-mail whenever I publish a new story.
Taking a quick look at my previous articles.
Downloading my free strategic data-driven decision making framework.
Visiting my Data Science website – The Data Blog.