In my previous articles, we learned about confounders and colliders, the features of observational data that hinder establishing reliable causal relationships. Pearl's solution is to draw causal diagrams and use the back-door criterion to find the sets of confounders to control for while leaving the colliders and mediators alone.
However, when confounding variables cannot be observed or measured, it becomes difficult to estimate causality from observational data. To cope with this issue, in Chapter 7 of "The Book of Why," Judea Pearl introduces the do-calculus rules. These rules are particularly useful for the front-door criterion and instrumental variables, which can establish causality even in the presence of unobservable confounders.
In Chapter 8, we explore the amazing world of counterfactuals. Pearl opens with poet Robert Frost's famous lines:
"And sorry I could not travel both
And be one traveler, long I stood…"
Pearl states that although it's impossible to travel both paths or step into the same river twice, our brains can imagine what would have happened had we taken the other path. To make this wisdom precise enough to pass on to robots, Pearl introduces the distinction between necessary causes and sufficient causes and shows how to use Structural Causal Models to conduct counterfactual analysis systematically.
As we pass the midpoint of the book, the chapters get more technical and information-intensive. In the following sections, I will first discuss how to deal with unobserved confounders for Rung 2 interventions (unfortunately, with some math), and then turn to counterfactuals, a Rung 3 application.
The front-door criterion
We start with a causal diagram for understanding the causal impact of X on Y:
Here, X impacts Y through a mediator, M. However, we cannot estimate causality directly from data without controlling for the confounder U. The back-door path X <- U -> Y generates a spurious correlation between X and Y. The back-door criterion tells us to control for U, but what if U is unobservable?
For example, in analyzing the causal relationship between smoking (X) and cancer (Y), we might see this causal diagram:
Here, smoking causes cancer through the accumulation of tar deposits, and there is a confounder, the smoking gene, which, as some researchers have argued, can influence both one's smoking behavior and the chances of getting lung cancer. We cannot collect data on such a gene because we don't even know whether it exists. Thus, the back-door adjustment cannot work in this case.
To get the causal impact, we can use the front-door criterion instead. The front door here is the mediating process: smoking increases tar deposits, which in turn increase the chances of getting cancer. If we cannot estimate the effect of smoking on cancer directly, can we instead combine how smoking causally impacts tar with how tar causally impacts cancer? Here are the steps:
Step 1: Smoking -> Tar
There is only one back door path from smoking to tar:
Smoking <- Smoking Gene -> Cancer <- Tar
And it is blocked by the collider Cancer. So, we can estimate the causal impact of smoking on tar directly from data by calculating conditional probabilities:
The causal impact of smoking on tar is:
P(tar|smoking) - P(tar|no smoking)
Step 2: Tar -> Cancer
There is one back door path from tar to cancer:
Tar <- Smoking <- Smoking Gene -> Cancer
Here, controlling for either smoking or the smoking gene would block this path. Since we don't have data on the smoking gene, we control for smoking instead:
The causal impact of tar on cancer is:
P(cancer|do(tar)) - P(cancer|do(no tar))
where,
P(cancer|do(tar)) = P(cancer|tar,smoking) * P(smoking) +
P(cancer|tar,no smoking) * P(no smoking)
P(cancer|do(no tar)) = P(cancer|no tar,smoking) * P(smoking) +
P(cancer|no tar,no smoking) * P(no smoking)
To estimate how tar causally impacts cancer, we measure the following four quantities from the data:
- The probability of getting cancer given tar deposits among smokers, weighted by the share of smokers:
P(cancer|tar, smoking) * P(smoking)
- The probability of getting cancer given tar deposits among non-smokers, weighted by the share of non-smokers:
P(cancer|tar, no smoking) * P(no smoking)
- The probability of getting cancer without tar deposits among smokers, weighted by the share of smokers:
P(cancer|no tar, smoking) * P(smoking)
- The probability of getting cancer without tar deposits among non-smokers, weighted by the share of non-smokers:
P(cancer|no tar, no smoking) * P(no smoking)
Step 3: Smoking -> Tar -> Cancer
Once we know the causal impact of smoking on tar and the causal impact of tar on cancer, we can derive the unbiased causal impact of smoking on cancer through the front door adjustment:
The causal impact of smoking on cancer is:
P(cancer|do(smoking)) = P(cancer|do(tar)) * P(tar|do(smoking)) +
P(cancer|do(no tar)) * P(no tar|do(smoking))
######Math Alert########################
Since there is no unblocked back-door path between smoking and tar:
P(tar|do(smoking)) = P(tar|smoking)
&
P(no tar|do(smoking)) = P(no tar|smoking)
And from the back-door adjustment for tar and cancer:
P(cancer|do(tar)) = P(cancer|tar,smoking) * P(smoking) +
P(cancer|tar,no smoking) * P(no smoking)
&
P(cancer|do(no tar)) = P(cancer|no tar,smoking) * P(smoking) +
P(cancer|no tar,no smoking) * P(no smoking)
Finally,
P(cancer|do(smoking)) = (P(cancer|tar,smoking) * P(smoking) +
P(cancer|tar,no smoking) * P(no smoking))
* P(tar|smoking)
+
(P(cancer|no tar,smoking) * P(smoking) +
P(cancer|no tar,no smoking) * P(no smoking))
* P(no tar|smoking)
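To make the arithmetic concrete, here is a minimal Python sketch of the front-door adjustment. All the probability values are hypothetical placeholders rather than numbers from the book; in practice, each conditional probability would be estimated from your observational data.

```python
# Front-door adjustment for P(cancer | do(smoking)), spelled out explicitly.
# Every probability below is a hypothetical placeholder, to be estimated from data.

p_tar_given_smoking = 0.80      # P(tar | smoking)
p_tar_given_no_smoking = 0.05   # P(tar | no smoking)
p_smoking = 0.30                # P(smoking) in the population

# P(cancer | tar status, smoking status)
p_cancer = {
    ("tar", "smoking"): 0.20,
    ("tar", "no smoking"): 0.15,
    ("no tar", "smoking"): 0.03,
    ("no tar", "no smoking"): 0.02,
}

def p_cancer_do_tar(tar: str) -> float:
    """Back-door adjustment: average P(cancer | tar, smoking status) over smoking status."""
    return (p_cancer[(tar, "smoking")] * p_smoking
            + p_cancer[(tar, "no smoking")] * (1 - p_smoking))

def p_cancer_do_smoking(smoking: bool) -> float:
    """Front-door adjustment: combine smoking -> tar with tar -> cancer."""
    p_tar = p_tar_given_smoking if smoking else p_tar_given_no_smoking
    return p_cancer_do_tar("tar") * p_tar + p_cancer_do_tar("no tar") * (1 - p_tar)

effect = p_cancer_do_smoking(True) - p_cancer_do_smoking(False)
print(f"P(cancer|do(smoking)) - P(cancer|do(no smoking)) = {effect:.4f}")
```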
Or, in more general terms, for any causal diagram that resembles the previous one in these ways:
- X affects Y only through a mediator M, which serves as the front door;
- There is an unobservable confounder U, correlated with X and Y but not with the mediator M;
The following front-door adjustment formula works.
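In general notation, with Σ_m summing over the values of the mediator M and Σ_x' summing over the values of X, it reads:
P(Y|do(X)) = Σ_m P(m|X) * [ Σ_x' P(Y|m, x') * P(x') ]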
Compare it to the back-door adjustment.
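With Z denoting an observed set of variables that blocks all back-door paths from X to Y, the back-door adjustment reads:
P(Y|do(X)) = Σ_z P(Y|X, z) * P(z)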
Both formulas estimate the causal impact of X on Y, and both remove the do operator successfully. This means we can estimate the causal impact from data, i.e., draw Rung 2 and 3 conclusions from Rung 1 data. In the front-door adjustment formula, the unobserved confounder U also drops out entirely. In the smoking-to-cancer case, we can estimate the impact of smoking without ever measuring the smoking gene.
The front-door criterion opens the door to squeezing a bit more juice out of observational data, even in the face of unmeasurable confounders. However, there are challenges in applying it, since the real world is usually messier and more complicated than textbook causal diagrams. For example, the unobserved confounder could also impact the mediator:
Even with this complication, empirical research shows that as long as the relationship between U and M is weak, the front-door adjustment still performs well. Using the "gold standard" randomized controlled trial (RCT) estimate of P(Y|do(X)) as a benchmark, it provides a better estimate than a back-door adjustment that cannot block all the necessary back doors because of unobserved confounders.
Do-calculus rules
Using the example above that illustrates the front-door adjustment, Pearl also summarizes three rules that provide general guidance for removing the do operators:
- Rule 1 shows that if a variable W is irrelevant to Y, or if its paths to Y are blocked by a variable set Z, then adding or removing W from the conditioning set does not change the probability (see the formulas after this list);
- Rule 2 shows that if a variable set Z blocks all back-door paths from X to Y, then conditioning on Z lets us remove the do operator;
- Rule 3 shows that if there are no causal paths from X to Y, then we can remove the do operator entirely.
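In symbols, the simplified forms of the three rules look roughly like this (the full statements of the do-calculus rules attach precise graphical conditions to each equality):

Rule 1: P(Y|do(X), Z, W) = P(Y|do(X), Z)   if W is irrelevant to Y given Z
Rule 2: P(Y|do(X), Z) = P(Y|X, Z)          if Z blocks all back-door paths from X to Y
Rule 3: P(Y|do(X)) = P(Y)                  if there is no causal path from X to Y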
These three rules set the foundation for do-calculus, which allows us to derive Rung 2 and 3 causal impacts from observational data. Rule 1 tells us which variables are worth collecting; Rule 2 tells us how to derive an intervention, a Rung 2 conclusion, from observational data; Rule 3 tells us whether an intervention will be effective at all.
More math alert: skip if math gives you a headache
Using the three rules, we can revisit the smoking-to-cancer causal diagram:
And see how we utilize Rules 2 and 3 in a more general derivation.
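Written out in plain text, with s = smoking, t = tar, c = cancer, s' ranging over smoking status, and Σ denoting summation over a variable's values, the chain of equalities is:

P(c|do(s)) = Σ_t P(c|do(s), t) * P(t|do(s))                  [Step 1]
           = Σ_t P(c|do(s), do(t)) * P(t|do(s))              [Step 2, Rule 2]
           = Σ_t P(c|do(s), do(t)) * P(t|s)                  [Step 3, Rule 2]
           = Σ_t P(c|do(t)) * P(t|s)                         [Step 4, Rule 3]
           = Σ_t Σ_s' P(c|do(t), s') * P(s'|do(t)) * P(t|s)  [Step 5]
           = Σ_t Σ_s' P(c|t, s') * P(s'|do(t)) * P(t|s)      [Step 6, Rule 2]
           = Σ_t Σ_s' P(c|t, s') * P(s') * P(t|s)            [Step 7, Rule 3]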
To understand how smoking causally impacts cancer, we work along the front-door path smoking -> tar -> cancer in seven steps:
- Step 1 is based on probability theory: we introduce the mediator t (tar) into the expression;
- Step 2 is based on Rule 2. Since the back door from tar to cancer is blocked by s (smoking), we can replace (do(s), t) with (do(s), do(t));
- Step 3 is based on Rule 2 again. Since smoking -> tar has no unblocked back-door path, we can replace do(s) with s;
- Step 4 is based on Rule 3. Since smoking causally impacts cancer only through tar, we can replace P(c|do(s), do(t)) with P(c|do(t));
- Step 5 is based on probability theory again: we sum over the possible values s' of smoking status, in this case smoking vs. non-smoking;
- Step 6 is based on Rule 2. Again, the back door from tar to cancer is blocked by s', so we replace (do(t), s') with (t, s');
- Step 7 is based on Rule 3. Since tar has no causal impact on smoking, we replace P(s'|do(t)) with P(s').
In the last equation, we have removed the do operator completely, and no unobservable variables remain. The next step is to use data to calculate the causal impact.
Instrumental variables
Another way to deal with unobserved confounders is to find instrumental variables. The definition of an instrumental variable is best illustrated with a causal diagram:
Suppose that, when estimating the causal impact of X on Y, there is an unobservable confounder U that prevents us from getting the correct estimate. If there exists a variable Z that satisfies the following conditions:
- Z and U are not correlated. In the diagram, there is no arrow between Z and U;
- Z is the direct cause of X;
- Z only affects Y through X. In other words, there is no direct or indirect arrow from Z to Y except Z->X->Y.
If all three conditions are satisfied, Z is a valid instrumental variable. Instrumental variables are widely used across scientific fields. In the book, Pearl provides an example of using an instrumental variable to study a drug's treatment effect in a clinical trial.
The medicine was invented to lower patients’ cholesterol levels. Although clinical trials are usually randomized, they still face the challenge of noncompliance, where subjects receive the medicine but choose not to take it.
The decision not to take the medicine can depend on multiple factors, such as how sick the patients are, and these factors are usually unobservable or hard to measure. Noncompliance dilutes the estimated effectiveness of the drug, and we don't have a good way to predict how much noncompliance there will be in a clinical trial.
In this case, researchers introduce an instrumental variable, "Assigned," into the RCT design. "Assigned" takes the value one if the patient is randomly assigned to receive the drug and zero if they receive a placebo. This gives us the following causal diagram:
"Assigned" is an instrument variable because:
- The assignment to drug or placebo is randomized among patients. Thus, it is not correlated with any confounder U;
- Which group a patient is assigned to determines which treatment they receive, the drug or a placebo. Thus, "Assigned" is a direct cause of "Received";
- Whether or not a patient is put into the placebo group doesn't affect his or her cholesterol level directly. Thus, "Assigned" only affects Cholesterol through Received.
When we find or establish an instrumental variable, we can estimate three relationships from the data:
- The causal impact of Assigned -> Received;
- The causal impact of Assigned -> Cholesterol;
- The causal impact of Received -> Cholesterol, obtained by dividing the "Assigned -> Cholesterol" effect by the "Assigned -> Received" effect (see the sketch below).
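Here is a minimal Python sketch of that logic on simulated data. The ratio in the last step is the standard Wald-style instrumental-variable estimator for a binary instrument; the data-generating choices (the compliance pattern, the true effect of 5 points, and the unobserved "severity" confounder) are hypothetical assumptions for illustration, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical simulated trial (not data from the book)
severity = rng.normal(0, 1, size=n)                  # unobserved confounder U: how sick the patient is
assigned = rng.integers(0, 2, size=n)                # instrument: randomized assignment to the drug arm
complies = rng.random(n) < 1 / (1 + np.exp(severity - 1.5))   # sicker patients comply less often
received = assigned * complies.astype(int)           # treatment actually taken
# Outcome: cholesterol reduction; true drug effect = 5, sicker patients respond worse
chol_drop = 5.0 * received - 1.0 * severity + rng.normal(0, 2, size=n)

# Naive comparison of takers vs. non-takers is confounded by severity
naive = chol_drop[received == 1].mean() - chol_drop[received == 0].mean()

# Step 1: causal impact of Assigned -> Received (randomization makes this a simple difference)
assigned_on_received = received[assigned == 1].mean() - received[assigned == 0].mean()

# Step 2: causal impact of Assigned -> Cholesterol (the intention-to-treat effect)
assigned_on_chol = chol_drop[assigned == 1].mean() - chol_drop[assigned == 0].mean()

# Step 3: Wald estimator for Received -> Cholesterol
iv_estimate = assigned_on_chol / assigned_on_received

print(f"Naive estimate: {naive:.2f}, IV estimate: {iv_estimate:.2f} (true effect: 5.00)")
```

In this simulation, the IV estimate recovers the true effect while the naive comparison stays biased by the unobserved severity.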
That's the definition of an instrumental variable, along with an example. Pearl's book contains more examples of how instrumental variables can be used to work around unobserved confounders.
Counterfactuals: What could have been?
Moving from Rung 2 to Rung 3 of the causality ladder, we now face the problem of finding what the outcome could have been without the treatment. This differs from Rung 2 interventions in two aspects:
1. From average causal effect to individual causal effect:
So far, the causal impacts we have discussed are focused on a population or subpopulation. For example, does smoking cause cancer in general? However, the more applicable question, especially for solving real-world problems, is about the individual causal effect. If I start smoking, will it cause me to have cancer? If I give this customer a discount, will he or she buy more products? Such personalized causation can be inferred through counterfactual analysis.
2. There are two types of counterfactuals:
In a binary setting, the outcome and the potential outcome each take one of two values. In the smoking-and-cancer example, the outcome is either getting cancer or not getting cancer. Correspondingly, there are two types of potential outcomes and two kinds of causes:
- Necessary causation: if a person gets cancer, the potential outcome of interest is no cancer, and we ask whether smoking is what lawyers call a "but-for" cause: the cancer would not have developed but for the smoking.
- Sufficient causation: if a person does not get cancer, the potential outcome of interest is cancer, and we ask whether smoking would have been enough to produce cancer in this person.
Distinguishing necessary causation from sufficient causation not only helps robots think more like humans in determining causality, but also helps us find better action points given different goals and scenarios. More on this in a later section.
Matching vs. Structural Causal Models (SCM)
When posed with the question of what could have been, we may approach it as a missing data problem. Take the example of figuring out how increasing education can improve income for different people. Here is the data summarized in a table:
Here, we have data entries for different employees: their current experience level (EX) and their highest education level (ED). To simplify, we assume three levels of education: 0 = high school degree, 1 = college degree, and 2 = graduate degree.
Their (potential) salaries at each education level are reported as S0, S1, and S2. Note that since each employee can only have one highest education level at a given point in time, two of the three entries S0, S1, S2 are missing for every employee.
If we treat filling in the question marks in the table as a missing-data problem, we have two approaches:
1. Matching:
We find similar employees and match their salaries across education levels. In the table, we only have one extra feature, Experience. Since Bert and Caroline both have nine years of experience, we can set S2(Bert) = S2(Caroline) = $97,000 and S1(Caroline) = S1(Bert) = $92,500. Two missing entries filled!
2. Linear Regression or more complicated models
Under the strong assumption that the data come from some unknown random source, standard statistical methods find the model that best fits the data. With linear regression, we might find an equation like this for this particular problem:
S = $65,000 + 2,500*EX + 5,000*ED
Looking at the coefficients, the equation tells us that, on average, a one-level increase in education increases salary by $5,000 and an extra year of experience increases it by $2,500. We can use more complicated models as the feature space grows.
What's wrong with these methods? Fundamentally, both are data-driven rather than model-driven. We are still trying to solve a Rung 3 problem with Rung 1 tools. No matter how complicated our models get and how many more features we add to predict the outcome variable, we still face the fundamental flaw of missing the causal mechanism.
In this simple example, one problem is that Experience and Education are not independent of each other. Generally, more education reduces the years of experience an individual can accumulate. If Bert had a graduate degree rather than the college degree he has now, his experience would be lower than nine years. Thus, he would no longer be a good match for Caroline, a graduate-degree holder with nine years of experience. In summary, we have a causal diagram that looks like this:
The causal diagram indicates that Education not only has a direct causal impact on Salary, but it also affects Salary through the mediator Experience. Thus, we will need two equations:
S = f_s(EX, ED, U_s)
where
EX = f_ex(ED,U_ex)
There are several points to make regarding these equations, which constitute the Structural Causal Model (SCM) for this problem:
- Salary is a function of Experience, Education, and an unobservable variable U_s that affects Salary. Note that the unobservable variable is exogenous, meaning it is not correlated with Education or Experience;
- Experience is a function of Education and an unobservable variable U_ex that affects Experience;
- The fact that there is no equation expressing Education as a function of Experience means we assume there is no causal effect from Experience to Education;
- These two equations assume causality running from the variables on the right-hand side (the causes) to the variable on the left-hand side (the outcome);
- The unobservable variables U_s and U_ex quantify uncertainty at the individual level, unlike Bayesian networks, which quantify uncertainty with probabilities attached to causal links. They are independent of Experience and Education and take different values for different individuals;
- The functions f_s and f_ex represent the relationships between the causes and the outcome variables. They can be linear or non-linear, depending on the assumptions.
If we still assume a linear relationship, we will have the following equations based on the data and our understanding of how education affects experience:
S = $65,000 + 2,500*EX + 5,000*ED + U_s
and
EX = 10 - 4*ED + U_ex
With these equations, we can calculate everyone's idiosyncratic factors U_s and U_ex and use them to predict counterfactuals. Take Alice as an example. We know S(Alice) is 81,000, EX(Alice) is 6, and ED(Alice) is 0. First, plug these into the second equation to get U_ex; we start there because it contains only ED, the causal factor of interest:
6 = 10 - 4*0 + U_ex(Alice)
-> U_ex(Alice) = -4
Thus,
EX(Alice) = 10 - 4*ED(Alice) - 4
Here, rather than treating EX(Alice) as a fixed input, we use its observed value indirectly: knowing EX(Alice) equals 6 lets us calculate U_ex(Alice). Then we plug S(Alice), ED(Alice), and U_ex(Alice) into the first equation to get U_s(Alice):
81000 = 65000 + 2500*(10 - 4*ED + U_ex(Alice)) + 5000*ED + U_s(Alice)
-> U_s(Alice) = -9000 + 5000*ED - 2500*U_ex(Alice)
-> U_s(Alice) = -9000 + 5000*0 - 2500*(-4) = 1000
# Note: do NOT treat EX as a free input by plugging EX = 6 in directly,
# as in 81000 = 65000 + 2500*6 + 5000*0 + U_s(Alice);
# EX must come from its own structural equation, because it changes when ED changes.
Here we have the functions for Alice:
S(Alice) = $65,000 + 2,500*EX + 5,000*ED + U_s(Alice)
and
EX(Alice) = 10 - 4*ED + U_ex(Alice)
Once the SCM is ready, we can do counterfactual analysis for Alice and calculate what her salary would have been had she attended college. If ED(Alice) is 1 instead of 0, we first calculate:
EX(Alice) = 10 - 4*1 + (-4) = 2
Then calculate S_1(Alice):
S_1(Alice) = $65,000 + 2,500*2 + 5,000*1 + 1000 = $76,000
Note the different answer we would get from the regression model if we plugged in ED(Alice) = 1 while keeping EX(Alice) = 6:
S = 65000 + 2500*6 + 5000*1 = $85,000 -> Biased estimation
# Alice has six years of experience and a high school degree now;
# she couldn't have accumulated six years of experience if she had gone to college.
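Here is a minimal Python sketch of the same procedure using the two linear structural equations above, with Alice's numbers from the worked example. The three moves are often called abduction (recover the individual's U terms), action (set ED to its counterfactual value), and prediction (propagate through the equations).

```python
# Structural equations assumed in the text (linear, for illustration)
def experience(ed: float, u_ex: float) -> float:
    return 10 - 4 * ed + u_ex

def salary(ex: float, ed: float, u_s: float) -> float:
    return 65_000 + 2_500 * ex + 5_000 * ed + u_s

# Observed facts about Alice
s_obs, ex_obs, ed_obs = 81_000, 6, 0

# 1) Abduction: recover Alice's idiosyncratic factors from what we observed
u_ex = ex_obs - experience(ed_obs, 0)                      # -> -4
u_s = s_obs - salary(experience(ed_obs, u_ex), ed_obs, 0)  # -> 1,000

# 2) Action: give Alice a college degree in the counterfactual world
ed_cf = 1

# 3) Prediction: let experience respond to the new education level, then compute salary
ex_cf = experience(ed_cf, u_ex)      # -> 2 years
s_cf = salary(ex_cf, ed_cf, u_s)     # -> 76,000

print(f"Counterfactual salary for Alice with a college degree: ${s_cf:,.0f}")
```

The key design choice is that EX is recomputed from its structural equation in step 3 rather than held at its observed value, which is exactly what the regression shortcut above gets wrong.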
This is a simple example of using an SCM to understand causal impact and calculate counterfactuals at the individual level. It is an instance of what Pearl calls "the first law of causal inference":
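Y_x(u) = Y_Mx(u)
where M_x denotes the model modified so that X is held fixed at the value x.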
The equation says that the potential outcome Y_x(u) for an individual u can be computed from the modified model M_x. The model here is far more flexible than a linear regression model, as long as the causal structure from the causal diagram is embedded in it.
Necessary Cause (PN) vs. Sufficient Cause (PS)
To understand counterfactuals, we have two different measurements: the probability of necessity (PN) and the probability of sufficiency (PS). To see the difference, let's use an example: a house catches fire because someone struck a match and there is oxygen in the air:
Both the match and the oxygen are causal factors in the house fire. However, they differ in PN and PS. The probability of necessity is:
PN = P(Y_(x=0) = 0 | X=1, Y=1)
In this case, the house caught fire, and a match was struck (Y=1, X=1). This probability asks whether the house would not have caught fire had the match not been struck (Y_(x=0) = 0). This probability is very high because, even with enough oxygen, the house would not have caught fire without the flame from the match.
The same logic applies when we calculate the PN for X = oxygen. The fire would not have occurred if there had not been enough oxygen in the house, even if the match had been struck.
The probability of necessity tells us what would have happened to the outcome variable if the treatment hadn't happened. If this probability is high, the treatment is a necessary cause. In a court setting, proving that the victim wouldn't have died but for the accused's action is enough to convict.
On the other hand, the probability of sufficiency is:
PS = P(Y_(x=1) = 1 | X=0, Y=0)
In this case, the house did not catch fire, and no match was struck (Y=0, X=0). This probability asks whether the house would have caught fire had the match been struck (Y_(x=1) = 1). This probability is also very high because oxygen is usually everywhere, and the house is very likely to catch fire with both oxygen and the flame from a struck match.
The PS for oxygen, however, is very low. A fire is unlikely to occur just because the house has oxygen; oxygen alone is not sufficient to set the house on fire. We need an ignition source, like striking a match.
Thus, the probability of sufficiency tells us what would happen to the outcome variable if the treatment had happened. If this probability is high, the treatment is a sufficient cause. In this case, striking the match qualifies as a sufficient cause of the house fire, but oxygen doesn't.
Why distinguish PS and PN
Why bother making these distinctions? In a nutshell, although multiple variables can be causal factors for an outcome, human brains automatically "rank" these factors. Psychologists have found that when humans imagine actions that could have turned around an undesired outcome, they are more likely to:
- Imagine undoing a rare event rather than a common one. For example, striking a match is a rarer event than having oxygen in the house (hopefully!);
- Blame their own actions, e.g., striking a match, rather than events not under their control.
As Pearl emphasizes, embedding both PN and PS into models would provide a systematic way to teach robots to produce meaningful explanations of observed outcomes.
In addition, understanding the distinction between PS and PN guides us to the right action points. Climate scientists studying the cause of an extreme heat wave might make two different statements:
- PN: There is a 90% probability that man-made climate change was a necessary cause of a heat wave;
- PS: There is an 80% probability that climate change will be sufficient to produce a heat wave this strong at least once every 50 years.
The PN statement tells us about attribution: who was responsible for this heat wave? Man-made climate change. It finds the cause. The PS statement tells us about policy: it says we had better prepare for heat waves like this one because more of them are coming. Who caused it? Not specified. Both statements are informative, just in different ways.
What a long read! This completes the fifth article in this "Read with Me" series on Judea Pearl's "The Book of Why." Thanks for hanging on to the end. Chapters 7 and 8 are definitely information-dense, and I hope this article is helpful to you. If you haven't read the first four articles, check them out here:
- Kick-off: Start with A Cat Story
- Chapter 1&2: Data Tells Us "What" and We Always Seek for "Why"
- Chapter 3&4: Causal Diagram: Confronting the Achilles' Heel in Observational Data
- Chapter 5&6: Why Understanding the Data-Generation Process Is More Important Than the Data Itself
If you're interested, subscribe to my email list to join the ongoing biweekly discussions. One more article is coming; it will focus on mediators and conclude the series:
And a bonus article:
As always, I highly encourage you to read, think, and share your main takeaways here or on your own blog.
Thanks for reading. If you like this article, don’t forget to:
- Check my recent articles about the 4Ds in data storytelling: making art out of science; continuous learning in data science; how I became a data scientist;
- Check my other articles on different topics like data science interview preparation; causal inference;
- Subscribe to my email list;
- Sign up for a Medium membership;
- Or follow me on YouTube and watch my most recent YouTube videos: