
Causal Diagram: Confronting the Achilles’ Heel in Observational Data

"The Book of Why" Chapters 3&4, a Read with Me series


In my previous two articles, I kicked off the "Read with Me" series and finished reading the first two chapters of "The Book of Why" by Judea Pearl. Those articles discuss the necessity of introducing causality to enable human-like decision-making and emphasize the Ladder of Causation that sets the foundation for future discussions. In this article, we will explore the keyholes that open the door from the first to the second rung of the Ladder of Causation, allowing us to move beyond probability and into causal thinking. We will go from Bayes’s rule to Bayesian networks and, finally, to causal diagrams.


From Bayes’s rule to inverse probability

As a fan of detective novels, my favorite series is Sherlock Holmes. I still remember the days and nights I spent reading the stories without noticing time pass by. Years later, many of the case details have faded from my memory, but, like everyone else, I still remember the famous quote:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

Translating this quote into the language of statistics, there are two types of probabilities: forward probability and inverse probability. In Sherlock Holmes’s deductive reasoning, detective work is simply finding the murderer with the highest inverse probability.

Going from forward probability to inverse probability, we are not just flipping the order of the variables but also enforcing a causal relationship. As briefly discussed in the previous article, Bayes’s rule provides a bridge that connects objective data (evidence) with subjective opinions (prior belief). Based on Bayes’s rule, we can calculate conditional probabilities for any two variables. For any variables A and B, given that B has happened, the probability of A happening is:

P(A|B) = P(A&B)/P(B)

The belief that A happens is updated based on the probability of B happening: the less likely B is (the smaller P(B) gets), the more the discovery of B raises my belief in A. Since P(B) is smaller than or equal to 1, P(A|B) is always greater than or equal to P(A&B). In other words, the belief a person attributes to A after discovering B is never lower than the belief that person attributed to A and B together before discovering B. Note that conditional probability applies to all variable relationships, even non-causal ones; inverse probability, however, only applies to causal relationships.

Assume the two events are Cause and Evidence. Forward probability is the probability of Evidence given Cause. Inverse probability, on the other hand, starts from the result and gives the probability of Cause given Evidence. If we can identify the causal relationship between Cause and Evidence, then we can deduce the probability of Cause based on what we observe, which is far more applicable to solving real-world problems.

In the book, Pearl gives an application: estimating the probability of having breast cancer given that a mammogram comes out positive, i.e., P(disease|test). First of all, there is a clear causal relationship: breast cancer is the cause, and the mammogram result is the evidence. When we see a positive test result, it doesn’t mean the patient has cancer for sure, because no test is 100% accurate. However, we can deduce the probability of this patient having breast cancer based on the quality of the test, which is captured by the sensitivity of the test, P(test|disease). The test sensitivity is the forward probability, and it applies to the general population.

In addition, individual-specific information can improve our estimate of the inverse probability for each patient. For example, if a patient comes from a family with several members diagnosed with breast cancer, then a positive test result is more trustworthy than it is for a patient without a family history of cancer. This patient-specific information is added as a prior in the final formula, which indicates how to update the prior (the probability of having breast cancer) given the evidence (observing a positive test result):

Updated odds of D = Likelihood Ratio * Prior odds of D

In mathematical terms, writing D for the disease and T for a positive test result, it is:

P(D|T) / P(no D|T) = [ P(T|D) / P(T|no D) ] * [ P(D) / P(no D) ]

where P(T|D) / P(T|no D) is the likelihood ratio.
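To make the update concrete, here is a minimal Python sketch of the odds-form calculation for the mammogram example. The sensitivity, false-positive rate, and prior below are illustrative numbers I chose for the sketch, not necessarily the exact figures Pearl uses in the book.

```python
# Odds-form Bayesian update for the mammogram example.
# All numbers below are illustrative assumptions, not the book's exact figures.

sensitivity = 0.73      # P(positive test | disease), the forward probability
false_positive = 0.12   # P(positive test | no disease)
prior = 1 / 700         # P(disease) before seeing any test result

likelihood_ratio = sensitivity / false_positive    # ~6.1
prior_odds = prior / (1 - prior)
posterior_odds = likelihood_ratio * prior_odds     # Updated odds = LR * prior odds
posterior = posterior_odds / (1 + posterior_odds)  # convert odds back to a probability

print(f"P(disease | positive test) = {posterior:.4f}")  # roughly 0.009
```

Even after a positive test, the posterior probability stays below 1 percent because the prior odds are so low; a family history of breast cancer would raise the prior odds and, with them, the updated odds.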

Using both conditional probability and likelihood ratios, we can update beliefs in both directions. If we have new information from the cause, we can update our belief in evidence through conditional probability:

P(T|D) = P(T & D)/P(D)

P(T|D) changes because of changes in P(D). If we have new information from the evidence, applying conditional probability in the same way is not correct, since testing positive does not cause you to have breast cancer; the causal relationship runs the other way. However, we can use the likelihood ratio of the evidence to update our belief in the cause.

So far, we have only discussed two causally connected variables, but this rule can be applied across a whole causal network, with parent nodes indicating causes and child nodes showing evidence. A child node updates its belief by applying conditional probability, and a parent node updates its belief by multiplying by the likelihood ratio. Applying these two rules across the whole network is called belief propagation. With these rules, we go beyond Bayes’s rule to understand both how a cause affects the generation of evidence and how observing evidence helps us deduce causes.
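As a small illustration of how evidence propagates back to a cause, the sketch below uses a hypothetical three-node chain (Fire -> Smoke -> Alarm) with made-up probabilities and computes P(Fire | Alarm) by enumerating the joint distribution. This brute-force enumeration arrives at the same posterior that belief propagation would reach through local message passing between the nodes.

```python
# A hypothetical chain Fire -> Smoke -> Alarm (all numbers are illustrative assumptions).
from itertools import product

p_fire = {True: 0.01, False: 0.99}   # prior P(Fire)
p_smoke = {True: 0.90, False: 0.10}  # P(Smoke=True | Fire), keyed by Fire
p_alarm = {True: 0.80, False: 0.05}  # P(Alarm=True | Smoke), keyed by Smoke

def joint(fire, smoke, alarm):
    """P(Fire, Smoke, Alarm), factorized along the chain."""
    p = p_fire[fire]
    p *= p_smoke[fire] if smoke else 1 - p_smoke[fire]
    p *= p_alarm[smoke] if alarm else 1 - p_alarm[smoke]
    return p

# Diagnostic reasoning: update our belief in the cause after observing the evidence.
num = sum(joint(True, s, True) for s in (True, False))                      # P(Fire, Alarm)
den = sum(joint(f, s, True) for f, s in product((True, False), repeat=2))   # P(Alarm)
print(f"P(Fire | Alarm) = {num / den:.3f}  vs  prior P(Fire) = {p_fire[True]:.3f}")
```

Observing the alarm raises the belief in fire well above its prior, even though the alarm is only connected to the fire through the intermediate smoke node.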


Confounders, the Achilles’ heel in nonexperimental studies

Belief propagation helps us understand the interactions among variables if we can identify the causal relationships correctly. In the real world, once we go beyond two variables, we need to expand the causal relationships into a causal diagram to derive causal impacts systematically. But before we move toward the causal diagram, which is the core of this book, let’s briefly discuss what has been preventing us from deriving causality from observational data: confounders.

"Confounding" means "mixing" in English. It is the variable that is correlated with both X and Y. Note the correlation could be both causal and non-causal. Moreover, in the graph below, I didn’t specify arrows between X&Z and Y&Z since, in the causal case, X, Y, and Z can all be either the cause or the result, establishing different causal diagrams that we will discuss in the next section. The left panel shows how having a confounder Z introduces a spurious correlation between X and Y.

In the right panel, if there is a causal relationship between X and Y, a confounder Z that affects both the cause X and the result Y will introduce confounding bias if not treated properly. We will not be able to disentangle the true causal effect of X on Y unless we exclude the impact induced by the confounder.
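A quick simulation makes the left panel concrete. In the sketch below, X has no effect on Y at all, yet the two appear strongly correlated because both are driven by the confounder Z; once we regress out Z, the association disappears. The data-generating process and coefficients are made up purely for illustration.

```python
# Spurious correlation induced by a confounder Z (simulated, illustrative numbers).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)              # confounder
x = 2 * z + rng.normal(size=n)      # Z causes X; X has no effect on Y
y = 3 * z + rng.normal(size=n)      # Z causes Y

print("corr(X, Y) overall      :", round(np.corrcoef(x, y)[0, 1], 3))  # clearly nonzero

# "Controlling" for Z: correlate the parts of X and Y that Z cannot explain.
bx, ax_ = np.polyfit(z, x, 1)       # slope and intercept of X regressed on Z
by, ay_ = np.polyfit(z, y, 1)
x_resid = x - (bx * z + ax_)
y_resid = y - (by * z + ay_)
print("corr(X, Y) controlling Z:", round(np.corrcoef(x_resid, y_resid)[0, 1], 3))  # ~0
```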

In experimental studies, the randomness in assigning subjects to treatment and control groups resolves confounding bias at the source (more on this in the last section). However, conducting experiments to study causal effects is not always practical or ethical, and in those cases we have no choice but to try to derive the true causal impact from observational data. Unlike experimental data, observational data always contain confounders, because there are always factors that affect both the cause and the result.

For example, to study whether smoking causes lung cancer, one of the confounders is age. Different age groups have very different smoking rates, and the older a person is, the higher the chance of getting lung cancer. We have to control for age and other confounders before we can get the true causal impact. The common method statisticians and social scientists use to combat confounding bias is to "control for" as many confounders as possible in their models. There are several issues with this method:

  • Not all confounders are measurable: Intuitively, we can figure out what might be confounders in a causal relationship we are interested in. However, it is not always possible to quantify these variables or find a suitable proxy to include in the model. For example, when studying whether higher education causes higher income, one of the confounders could be "ambition." Ambitious people are more likely to be motivated to pursue higher education and higher-paying jobs, but how can we quantify this subjective variable in an observational study?
  • Omitted variables: No matter how many variables we try to include in a study, it is still very likely that we miss some necessary confounders or their proxies, thus biasing the estimated causal impact.
  • Controlling for confounders can itself induce bias: On the other hand, in practice, statisticians who want to make sure no confounder is left behind will include as many variables as possible in the model to ensure a debiased estimation. However, this overcontrolling can actually introduce bias instead. As written by the political blogger Ezra Klein:

"You see it all the time in studies. ‘We controlled for…’ And then the list starts. The longer the better. Income. Age. Race. Religion. Height. Hair color. Sexual preference. Crossfit attendance. Love of parents. Coke or Pepsi. The more things you can control for, the stronger your study is, or, at least, the stronger your study seems. Controls give the feeling of specificity, of precision…. But sometimes, you can control for too much. Sometimes you end up controlling for the thing you are trying to measure."

Ultimately, we face so many issues when tackling confounding bias because it is a Rung 2 problem that requires us to study the causal relationships among variables. A Rung 1 solution that does not involve causal structure, i.e., one that never draws a causal diagram, will not be enough. In the next section, we will see how causal diagrams help us define and control confounders in a systematic and trustworthy way.


Establishing causal diagrams, the keyholes to causality

Three Basic Structures

To understand what a causal diagram is, we can start with the fundamental building blocks of all networks. There are three basic types of junctions that characterize any pattern of arrows in a network:

  • Chain: A -> B -> C, where B is a mediator between A and C
  • Fork: A <- B -> C, where B is a common cause of A and C
  • Collider: A -> B <- C, where B is a common effect of A and C

The three basic types exist in both Bayesian networks and causal diagrams. Applying Bayes’s rule across variables constructs a Bayesian network, which is nothing more than a compact representation of a huge probability table. If we see the chain structure A -> B -> C in a Bayesian network, the missing arrow between A and C means A and C are independent once we know the value of B. If the same chain structure appears in a causal diagram, in addition to the same independence between A and C once we control for B, we are also adding a flow of causality through the arrows: C is caused by B, B is caused by A, and A is external. If we reverse the structure to C -> B -> A, or change it to the fork A <- B -> C, we will see exactly the same independence between A and C holding B constant, but the causal story has changed drastically. In other words, data can’t tell us everything: no matter how large the dataset, we cannot distinguish A -> B -> C, C -> B -> A, and A <- B -> C without adding subjective causal assumptions.

Additionally, moving from a Bayesian network to a causal diagram, we are also transforming Rung 1 probabilistic thinking into Rung 2 and Rung 3 causal thinking. Instead of the probabilistic expression "once we know the value of B," we can say "once we hold B constant," which is the same as moving from "seeing B" to "intervening on B." In a later section, we will see that this difference is captured by P(Y|X) versus P(Y|do(X)). Bayesian networks can only tell us how likely one event is, given that we observe another event. Causal diagrams, however, can answer interventional and counterfactual questions.
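The claim that data alone cannot distinguish these structures is easy to verify with a simulation. Below, data generated from the chain A -> B -> C and from the fork A <- B -> C show the same statistical signature: A and C are correlated marginally and (approximately) independent once B is held constant. The coefficients are arbitrary choices for illustration.

```python
# Chain vs. fork: identical independence pattern, different causal stories.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def partial_corr(a, c, b):
    """Correlation of A and C after regressing both on B (i.e., 'holding B constant')."""
    a_res = a - np.polyfit(b, a, 1)[0] * b
    c_res = c - np.polyfit(b, c, 1)[0] * b
    return np.corrcoef(a_res, c_res)[0, 1]

# Chain: A -> B -> C
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(size=n)
c = 0.8 * b + rng.normal(size=n)
print("chain  corr(A,C):", round(np.corrcoef(a, c)[0, 1], 2),
      " corr(A,C | B):", round(partial_corr(a, c, b), 2))

# Fork: A <- B -> C
b = rng.normal(size=n)
a = 0.8 * b + rng.normal(size=n)
c = 0.8 * b + rng.normal(size=n)
print("fork   corr(A,C):", round(np.corrcoef(a, c)[0, 1], 2),
      " corr(A,C | B):", round(partial_corr(a, c, b), 2))
```

Both worlds print a clearly nonzero marginal correlation and a near-zero partial correlation, so only our causal assumptions can tell us which story generated the data.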

The back-door criterion

Causal diagrams not only shift us into causal thinking but also equip us with a trustworthy tool for finding and verifying causal effects in observational data. As mentioned in the previous section, identifying the right confounders is the main challenge. To solve this issue, Pearl introduces the do-operator and the back-door criterion.

The key is to figure out the causal diagram. The do-operator erases all the arrows that come into X, thus preventing any information about X from flowing in the noncausal direction. While P(Y|X) reflects the effect of X contaminated by confounding bias, P(Y|do(X)) shows the true causal impact. It asks: holding the confounders constant by blocking their information flows, how will Y change if I change X? Depending on the causal structure, we need to control, or not control, different variables to block these information flows.

In order to get P(Y|do(X)), we need to ensure the information flowing from X to Y travels only along the direct causal path. To achieve this, we need to block every noncausal path between X and Y without perturbing any causal paths. These noncausal paths are called back-door paths: any path from X to Y that starts with an arrow pointing into X. The concept is easier to understand with the following five examples:
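To see the gap between P(Y|X) and P(Y|do(X)) numerically, here is a sketch with a single binary confounder Z that opens the back-door path X <- Z -> Y. Blocking it with the adjustment formula P(Y|do(X)) = sum over z of P(Y|X, Z=z) * P(Z=z) recovers the true effect. The structure and probabilities are assumptions made up for this illustration.

```python
# Back-door adjustment on simulated data with one binary confounder Z.
# All probabilities below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

z = rng.binomial(1, 0.5, n)                          # confounder, e.g. an age group
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))      # Z strongly influences treatment X
y = rng.binomial(1, 0.2 + 0.1 * x + 0.5 * z)         # true effect of X on Y is +0.1

# Naive (Rung 1) comparison: biased because the back-door path X <- Z -> Y is open.
naive = y[x == 1].mean() - y[x == 0].mean()

# Back-door adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)
def p_do(x_val):
    return sum(y[(x == x_val) & (z == zv)].mean() * (z == zv).mean() for zv in (0, 1))

print(f"naive difference   : {naive:.3f}")               # inflated, around 0.4
print(f"adjusted difference: {p_do(1) - p_do(0):.3f}")   # close to the true 0.1
```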

By specifying the causal diagram, we have transformed the process of controlling as many confounders as possible into identifying the back-door paths and figuring out how to block them efficiently. As mentioned in the notes, it is not always right to control for as many variables as possible to ensure a true causal effect. In fact, controlling the wrong variables can:

  • Reduce or block the causal path between X and Y. For example, in Game 1, we will block the causal path between X and Y if we control for A, and partially block it if we control for B, the descendant of A.
  • Introduce collider bias for X and Y. For example, in Game 4, controlling for B will make X and Y dependent even when there is no causal relationship between them (see the sketch after this list). Game 4 is also called the "M-bias" because of its shape.
  • Control the right confounders, not as many as possible. For example, in Game 5, we can choose to control for both A and B at the same time, or just control for C, to achieve the same result.
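The collider bias in Game 4 can be reproduced with a simple simulation of an M-shaped structure: X and Y below are causally unrelated, yet controlling for the collider B manufactures a (negative) association between them. The variable names and coefficients are my own illustrative assumptions, not Pearl’s exact example.

```python
# M-bias: controlling the collider B makes two causally unrelated variables dependent.
# The M-shaped structure and coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

a = rng.normal(size=n)             # unobserved cause of X and B
c = rng.normal(size=n)             # unobserved cause of Y and B
x = a + rng.normal(size=n)         # X <- A
y = c + rng.normal(size=n)         # Y <- C   (note: no effect of X on Y)
b = a + c + rng.normal(size=n)     # B is a collider: A -> B <- C

def slope_of_x(design, target):
    """OLS coefficient on X (the first column of the design matrix)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0]

print("effect of X on Y, nothing controlled :",
      round(slope_of_x(np.column_stack([x, np.ones(n)]), y), 3))      # ~0, correct
print("effect of X on Y, controlling for B  :",
      round(slope_of_x(np.column_stack([x, b, np.ones(n)]), y), 3))   # nonzero, biased
```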

Each of these diagrams can be matched to real-world examples. For instance, Game 1 represents a medical application that estimates the effect of smoking (X) on miscarriages (Y). A is an underlying abnormality induced by smoking, which is unobservable since we don’t know which specific abnormality smoking induces. B is a history of previous miscarriages. It is very tempting to include miscarriage history in the model, but the causal diagram shows that if we do, we will partially deactivate the mechanism through which smoking contributes to miscarriages, and thus underestimate the true causal impact. There are many more real-world applications in these two chapters of Pearl’s book. Even when a causal diagram becomes too complicated for a human brain to work out all the back-door paths, don’t forget that computer algorithms excel at cracking exactly this type of problem.

Why do randomized controlled trials (RCTs) work?

We have discussed using causal diagrams for non-experimental studies at length. How can we use causal diagrams and the back-door criterion to explain why RCTs derive unbiased causal impacts? Let’s look at an example where we try to figure out how different fertilizers affect yield. In the real world, farmers decide which fertilizer to use based on many factors, such as soil fertility and texture, which also affect yield. We can show this in a causal diagram:

All the orange lines show confounding relationships that bias the estimated causal impact of fertilizer on yield. To fix that, we would need to control for all these confounders in the model. Note that this may not be possible, because the "Other" factors here can be hard to name and quantify. However, suppose we now design an experiment that decides which fertilizer to use for each plot purely by drawing random cards. The causal diagram then becomes something like this:

By adding the random card to the diagram, we can remove all the confounding orange lines from the previous diagram, because which fertilizer we use no longer depends on any of these variables. It is purely a randomized decision, affected only by the random card draw. The back-door criterion is met, and we can estimate the causal impact of fertilizer on yield.
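A final simulation, with a hypothetical soil-fertility confounder and made-up numbers, shows why the random card draw works. When farmers choose the fertilizer based on soil fertility, the naive comparison of yields is badly biased; when a coin flip (the random card) decides, the same naive comparison recovers the true effect.

```python
# Why randomization works: fertilizer choice vs. yield with a soil-fertility confounder.
# The data-generating process and numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

fertility = rng.normal(size=n)      # confounder: soil fertility
true_effect = 2.0                   # fertilizer B adds 2 units of yield

def yield_of(fert_b):
    """Yield depends on the fertilizer choice and on soil fertility (plus noise)."""
    return true_effect * fert_b + 5.0 * fertility + rng.normal(size=n)

# Observational world: farmers tend to use fertilizer B on more fertile plots.
obs_b = rng.binomial(1, 1 / (1 + np.exp(-2 * fertility)))
obs_yield = yield_of(obs_b)
print("observational estimate:",
      round(obs_yield[obs_b == 1].mean() - obs_yield[obs_b == 0].mean(), 2))   # far from 2

# Randomized world: the "random card" decides, cutting every arrow into the choice.
rct_b = rng.binomial(1, 0.5, n)
rct_yield = yield_of(rct_b)
print("randomized estimate   :",
      round(rct_yield[rct_b == 1].mean() - rct_yield[rct_b == 0].mean(), 2))   # ~2
```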


That’s all I want to share from Chapters 3 and 4 of "The Book of Why" by Judea Pearl, which completes the third article in this "Read with Me" series. I hope this article is helpful to you. If you haven’t read the first two articles, check them out here:

Read with Me: A Causality Book Club

Data Tells Us "What" and We Always Seek for "Why"

If you are interested, subscribe to my email list to join the biweekly discussions that will become more and more technical:

There are many more details and examples that Pearl shares in the book. As always, I highly encourage you to read, think, and share your main takeaways, either here or in your own blog post.

Thanks for reading.

Reference

The Book of Why by Judea Pearl

