Beyond A/B Testing: Primer on Causal Inference

Making the most out of your experiments and observational data

Wicaksono Wijono
Towards Data Science

--

Applied statistics is seeing rapid adoption of Pearlian causality and Bayesian statistics, which closely mimic how humans learn about the world. Certainly, increased computational power and accessible tools have contributed to their popularity. The real appeal, however, is much more fundamental:

  • interpretability. We can communicate using diagrams and plain language. (Have you tried explaining a p-value to a non-technical audience?)
  • collaboration. These frameworks require expert judgment. The interpretability makes it easier to involve others.
  • relevance. Business questions are better answered under these frameworks.

Outside of experimentation teams, data scientists have traditionally focused on prediction. Yet, as the adage goes, “correlation does not imply causation”.

Your umbrella company wants to increase revenue. You notice that wet ground predicts an increase in sales. Should you go around spraying water on the ground to stimulate sales?

Predictive models, on their own, cannot answer such questions. Important business questions demand causal inference.

Prediction and inference are opposite goals. Correct inference often requires us to sacrifice predictive power. A model tuned to maximize predictive power can lead to incorrect causal inference. We will talk about this in the second half of the article.

This article serves as an introductory guide/reference to causality. As such, it’s quite lengthy. It’s divided into three parts:

  1. Experiments
  2. Quasi-experiments
  3. Observational data

As we go down the list, the study becomes less costly and more feasible to carry out. In exchange, we have to make stronger and more outlandish assumptions.

This introductory article won’t go into heterogeneity of effects, inference on graphs / networks, interference, or time-varying effects. A basic knowledge of hypothesis testing is assumed.

Experiments

Simple A/B Testing

We have come to the point where people colloquially refer to experiments as “A/B tests”. Subjects are randomized into control and treatment groups. Any systematic difference observed between the groups can then be attributed to the intervention.


From a frequentist standpoint, you want to do a Welch’s t-test or proportion test. These frequentist tests assume that you run the test once with an appropriate sample size. I recommend setting up a monitoring dashboard to quickly find out if something is implemented incorrectly (e.g. conversion drops to zero) but the testing should be done only once at the end of an agreed-upon period.

A dashboard of p-values is dangerous because we shouldn’t act on p-values. Early termination (unless you really know what you’re doing) or running an experiment until significance (ew, gross) will make the false positive rate absurdly high, rendering the experiments largely useless. Here is R code to simulate an oversimplified situation where the null hypothesis is true and we’re supposed to test on the sixth week with α = 0.05:

library(data.table)

num_sim <- 100
num_week <- 20
obs_per_week <- 1000

set.seed(123)
p_values <- matrix(NA, nrow = num_sim, ncol = num_week)

for (i in 1:num_sim) {
  for (j in 1:num_week) {
    # control (A) and treatment (B) observations, one column per week,
    # drawn from the same distribution so the null hypothesis is true
    A <- matrix(rnorm(obs_per_week * num_week, 10, 1),
                nrow = obs_per_week,
                ncol = num_week)
    B <- matrix(rnorm(obs_per_week * num_week, 10, 1),
                nrow = obs_per_week,
                ncol = num_week)
    # test on the data available up to week j
    p_values[i, j] <- t.test(
      as.numeric(A[, 1:j]),
      as.numeric(B[, 1:j])
    )$p.value
  }
}

p_values <- melt(
  p_values,
  varnames = c('simulation', 'week'),
  value.name = 'p_value'
)
p_values <- data.table(p_values)

# number of p < 0.05 on each week
p_values[, .(num_significant = sum(p_value < 0.05)), by = week]
mean(p_values[, .(num_significant = sum(p_value < 0.05)),
              by = week]$num_significant)

# number of simulations ever reaching p < 0.05 on any given week
length(p_values[p_value < 0.05, unique(simulation)])

# number of simulations reaching p < 0.05 on or before the sixth week
length(p_values[(p_value < 0.05) & (week <= 6), unique(simulation)])

If you do the test on any single pre-specified week, then on average the false positive rate is the predetermined α = 0.05. Using this seed, early termination results in a 26% false positive rate, and running the experiment until significance, up to 20 weeks, results in a 61% false positive rate. It is very important to pre-register the experiment and agree on the duration in advance.

Ideally, you have some historical data to guesstimate the mean and standard deviation of the control group, which you plug into a sample size calculator to determine how long the experiment should run. Even then, the mean will drift depending on how long the experiment runs, since the mix of new vs loyal customers shifts over time.

In practice, I recommend:

  • calculate power as a function of duration (holding α and effect size fixed). Possibly a sensitivity analysis of how power changes when we modify α or effect size.
  • use the arbitrary α = 0.1 and power = 0.9 instead of the equally arbitrary α = 0.05 and power = 0.8 if you think a false negative is just as bad as a false positive.
  • perform a one-sided test. We usually care more about the sign than the magnitude unless implementing the alternative comes with a change in operational cost.

The last one relates to how A/B testing is often used for decision making. Consider the scenarios:

  • B significantly better than A. Ship B.
  • B significantly worse than A. Keep A.
  • B not statistically different from A. ??? Do we keep A or ship B?

If you’re going to subscribe to the binary nature of the NHST framework, you might as well binarize the decisions too to reap higher statistical power.

As a toy example, suppose the baseline conversion rate is a stable 20% and we get 1000 new visitors per week. How many weeks should we run the experiment, if our decision rule is to ship B if it’s significantly better than A? We can present this plot:

An effect size of 20% is unrealistic. If we care about 10%, then a 6-week experiment gives us pretty high power. And if we want to detect a 5% effect size, then we might have to turn to other options. The power plot is a contour plot that visually supports this kind of decision making.
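As a rough sketch of how such a power curve can be computed (the exact numbers behind the plot aren’t given, so the 50/50 weekly split and α = 0.05 here are assumptions), base R’s power.prop.test is enough:

baseline <- 0.20
weeks <- 1:12
relative_lift <- c(0.05, 0.10, 0.20)

power_curve <- sapply(relative_lift, function(lift) {
  sapply(weeks, function(w) {
    power.prop.test(
      n = 500 * w,                      # visitors per group after w weeks
      p1 = baseline,
      p2 = baseline * (1 + lift),
      sig.level = 0.05,
      alternative = "one.sided"
    )$power
  })
})
rownames(power_curve) <- paste0("week_", weeks)
colnames(power_curve) <- paste0("lift_", relative_lift * 100, "pct")
round(power_curve, 2)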

Getting an adequate sample size is often infeasible. According to a post by booking.com:

Detecting small effects can be challenging. Imagine running an e-commerce website with a typical conversion rate of 2%. Using Booking.com’s power calculator (open-source code here), you can discover that detecting a relative change of 1% to your conversion rate will require an experiment with over 12 million users.

For most companies, A/B testing on 12 million users is downright impossible. In fact, one of the main competitive advantages of large tech companies is that they can run much more powerful experiments — smaller companies cannot replicate this. In this case, I recommend going by gut feel or going Bayesian.

Going by gut feel sounds contrary to “data-driven decision making” but running a frequentist analysis on far too few samples is more like “noise-driven decision making”. It is a waste of resources for something that we know in advance will be inconclusive (due to bad methodology, not significance of results). Garbage in, garbage out.

The Bayesian approach to A/B testing is to treat the group assignment as a random effect, i.e. we use a Bayesian hierarchical model. Alternatively, the prior can be chosen using empirical Bayes. Using a uniform / extremely weak prior is heavily frowned upon (for more details, check my other article).
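A minimal sketch of that hierarchical model, using rstanarm here (any probabilistic programming tool works; the data frame and column names below are made up):

library(rstanarm)

# `dat` is assumed to have one row per user, with the outcome `metric`
# and the assignment `group` (A or B); group enters as a random effect
fit <- stan_lmer(metric ~ (1 | group), data = dat)

ranef(fit)     # posterior medians of the group-level effects
summary(fit)   # also check the group-level standard deviation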

Word of caution: your metric should be aggregated by your sampling unit. If you randomize which users get A or B, then the metric to be tested needs to be X per user. Something like average order value will lead to misleading results because we are not randomizing the orders. Intuitively, something like average order value places zero weight on customers who didn’t buy, while placing more weight on customers who order frequently.

Aggregating by user will discard some information but it is necessary to get the correct results. You can set up a Bayesian hierarchical model instead of aggregating, but those are computationally challenging for large data sets.
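For example, with users as the sampling unit, order-level data could be rolled up to one row per user before testing (the table and column names here are hypothetical):

library(data.table)

# `orders`: one row per order (user_id, order_value)
# `assignments`: one row per user (user_id, group)
revenue_per_user <- orders[, .(revenue = sum(order_value)), by = user_id]
dat <- merge(assignments, revenue_per_user, by = "user_id", all.x = TRUE)
dat[is.na(revenue), revenue := 0]   # users with no orders count as zero, not missing
t.test(revenue ~ group, data = dat)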

(Fractional) Factorial Design

The factorial design is generally superior to a simple A/B test. You either:

  • put in extra work to get better insights. You can answer “which parts work well?” instead of “which version is better?” The insight is transferable to future work. Also, you can analyze interaction effects.
  • use fewer resources to get the same insights. You can reuse the samples to test multiple hypotheses.

Unless you are a tech giant who can get enough statistical power from 1% of the userbase, you should do equal splits. This typically yields the most powerful experiment and has the added benefit of clear methodology. Conclusions can change depending on which sum of squares you use, but the four types (I through IV) yield the same result if your design is balanced (equally split).

Although dummy / one-hot encoding is more common, for experiments we prefer encoding control and treatment as -1 and 1. The effect sizes are more interpretable. Dummy encoding yields coefficients that are relative to a reference cell — rarely the quantity we’re interested in.

In a balanced factorial design with three covariates, you should equally split the subjects into eight groups:

The correlation matrix is a diagonal matrix so the effect size estimates are independent of each other. In fact, even the interaction effects are independent of everything else. Try computing the correlation of this design matrix:
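Here is one way to construct that ±1 design and check the correlation matrix yourself:

# full 2^3 factorial with -1/+1 coding, one row per group
design <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
design$AB <- design$A * design$B
design$AC <- design$A * design$C
design$BC <- design$B * design$C
design$ABC <- design$A * design$B * design$C

round(cor(design), 2)   # identity matrix: main effects and interactions are uncorrelated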

This lets us reuse the samples to test multiple hypotheses and get higher quality insights. To test each hypothesis, we compare the white vs red cells:

I am sorry to call some people out, but there is some terrible advice going around that shows a complete lack of understanding of how factorial designs work. A factorial experiment is not the same as “multivariate testing”. It is not true that you need more samples to obtain the same power as in a simple A/B test. You need the same number of samples because the samples get reused to test each hypothesis; that is the entire point of the factorial design. You can compute sample sizes the same way you would in a simple A/B test, though you might not detect the interaction effect unless it’s glaringly large.

Pretend you want to optimize your banner. We want to test the effect of font size and image size. Perhaps if you test them individually, users prefer large font, and users prefer large pictures. But large font and large picture together makes the banner too crowded — there is a negative interaction effect.

The insights can be transferred to future banner design. If you test A = small font and small picture vs B = large font and large picture, we don’t know what specific change caused the impact. Even worse, B might perform worse than A even though some components work because the large negative (interaction) effects mask the positive effects.

If you want to test many changes at once, the number of banners to make grows exponentially. Testing n covariates requires 2^n combinations. Using the fractional factorial design, you can cut down the number of combinations by at least half if you are okay with being unable to estimate some higher-order interaction effects.
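As a sketch, here is the classic half fraction of a 2^4 design in base R: run the full factorial on three factors and alias the fourth factor with their three-way interaction (D = ABC), so 8 runs cover 4 factors instead of 16:

# half fraction of a 2^4 design: 8 runs instead of 16
half_fraction <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
half_fraction$D <- half_fraction$A * half_fraction$B * half_fraction$C

half_fraction
# the price: D is aliased with the ABC interaction, and each two-factor
# interaction is aliased with another one (e.g. AB with CD)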

Sometimes changes are deemed too risky and you are only allowed to experiment on, say, 20% of the user base. When it makes sense to have low, medium, and high settings, keep the 80% “untouchable” group as all 0 (medium setting) while equally splitting the 20% into -1 and +1 (low and high setting) groups. This way you don’t have to discard any of your user base, and the “untouchable” customers enhance statistical power by providing a tighter estimate of the intercept, if linearity is assumed.

Bayesian analysis would treat the covariates as crossed random effects. (Not sure how to analyze the interaction terms.)

Word of caution: factorial designs do not make sense when the covariates cannot possibly be independent, e.g. because of funnels. Suppose your website’s funnel is X → Y → Z. Change(X) will affect who ends up in Y. The effect of change(Y) is dependent on change(X) even if we set it up as a factorial experiment. If possible, I recommend splitting the user base so some get tested only on change(X) while others get tested only on change(Y).

Crossover Design

In a crossover experiment, you randomize the subjects on the order: whether they receive AB or BA. When applicable, this results in tighter intervals because you’re using a paired t-test (or a random effect with many levels). Key assumptions:

  • No carryover effect. Suppose you want to measure heart rate while running vs standing still. If you ask someone to run and then stand still, their heart rate will be higher than if they had stood still without having just run: a carryover effect.
  • A and B are not related to dropout. If you administer B then A, and B causes some subjects to never receive A, then estimates are incorrect.

Due to these restrictions, crossovers are typically used on products rather than people. Products are more reliably on the shelf for two periods. Users might arrive in the first period but not the second, and vice versa.


An example application is to evaluate the display of a snippet on product listings. Hold the product rankings constant for the experiment period. On the product ranking / search page, let half of the products show a snippet (B) and the other half show none (A). Reverse it in the second period. You can reasonably estimate the impact of the snippet on customer behavior.

Well, actually, no. Can you spot what’s wrong with this experiment?

If customers are more likely to choose B over A, then some of the purchases are shifted from A to B. To minimize interference, we can choose product categories that are unrelated to each other and do AB on half of the categories and BA on the other half. The data should be analyzed using a mixed model. Carryover effect is still possible if B is so bad that it causes some customers to leave.
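A sketch of that mixed model with lme4 (the data layout and column names are assumptions; a Bayesian version would handle the category random effect the same way):

library(lme4)

# `dat`: one row per category and period, with columns
#   category, period (1 or 2), snippet (0/1), and the outcome `purchases`
fit <- lmer(purchases ~ snippet + factor(period) + (1 | category), data = dat)
summary(fit)   # the `snippet` coefficient is the estimated effect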

Blocking

Classically, blocking is used when we use multiple measuring devices or need to conduct the test in multiple batches, e.g. you run a bakery, use multiple scales to weigh ingredients, and don’t have an oven large enough for everything at once. Maybe today has higher humidity than yesterday, which affects the end product. We want to take that into account.

Blocking has powerful applications in surveys to keep them short. It splits a factorial experiment into multiple batches, each of which is a fractional factorial.

Page 1 of 48. Estimated completion time: 216 minutes. (source)

I like forced choice surveys (“Which do you prefer: A or B?”) because we can infer things from behavior. What people say and what people do are completely different things. If we set up the options like a factorial experiment, then we can split the questionnaire into smaller surveys.

For instance, one time I ideally wanted each person to answer 32 questions. This is a dumb idea. People won’t complete the survey, and even if they do, they’ll answer randomly towards the end. Instead, through blocking, we can get solid inferences from 8 questions per person.
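As a sketch, a 2^5 factorial of 32 question profiles can be split into four blocks of eight by confounding the blocks with higher-order interactions, so each respondent only sees the 8 profiles in one block:

# full 2^5 factorial: 32 profiles
profiles <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1),
                        D = c(-1, 1), E = c(-1, 1))

# confound blocks with the ABC and CDE interactions (and implicitly ABDE);
# each respondent answers only the 8 profiles in their block
profiles$block <- interaction(profiles$A * profiles$B * profiles$C,
                              profiles$C * profiles$D * profiles$E)
table(profiles$block)   # four blocks of 8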

Other Remarks

There are other, more advanced ways to do your experiment. The response surface method is like gradient-based optimization. Simplex lattice designs are used for optimizing proportions of ingredients and can be used for establishing ranking weights. Much of the research into experimental methodology in industry deals with increasing the power of the experiment, usually through counterfactuals, stratification, choosing observation sites to maximize information gain, or clever ways to essentially turn a t-test into a paired t-test, among others.

Quasi-Experiments

Now that we have covered experiments, we start to go into the territory of uncomfortable assumptions. There are many methods that fall under quasi-experiments: difference-in-differences, interrupted time series, and synthetic controls. However, Google’s CausalImpact package is so powerful and flexible that it should cover most practical needs. These are extremely useful for ad performance measurement and pricing experiments.

Causal Impact

The basic intuition is simple: we observe a time series X with some intervention (like sales data and we launched a marketing campaign). We want to build a counterfactual: what would the time series have been like without the intervention? We look for ingredients (possibly many time series) to put into a blender, and hopefully the end result is a good counterfactual. The difference between observed and counterfactual is our causal effect estimate.

Key assumptions:

  • Changes in X do not affect the ingredients in the synthetic control
  • The relationship between X and the ingredients would’ve continued the same way without the intervention

Most of the work in this type of analysis is in finding and validating the ingredients for the counterfactual. Bad ingredients result in completely arbitrary estimates: even for periods with no intervention, a large “causal effect” might be estimated. Garbage in, garbage out. Here’s an illustrative example:

library(CausalImpact)

set.seed(1)
# two independent random walks; there is no intervention anywhere
x <- cumsum(rnorm(100, 0, 1))
y <- cumsum(rnorm(100, 0, 1))
dat <- zoo(cbind(y, x))  # the response (y) goes in the first column
plot(dat, main = "Two random walks")

impact <- CausalImpact(
  data = dat,
  pre.period = c(1, 80),
  post.period = c(81, 100)
)
plot(impact)

We simulated two random walks with no intervention, yet CausalImpact estimates a huge positive effect:

Garbage in, garbage out. As we start to go beyond experiments, most of the work is on validating the reasonableness of assumptions. In many cases, we cannot even validate assumptions; significant human judgment is needed to assess the sanity of the model.

Once we have the ingredients, the analysis is very easy to do if you follow the CausalImpact documentation.

In practice, the ingredients relate to geographic location, e.g. using LA time series to predict NYC time series.

As a rule of thumb, the post-intervention period shouldn’t be too long (1–3 weeks is ideal) because forecasts break down the farther we look ahead. The pre-intervention period should be about 3–4 times as long as the post-intervention period. We hope there are no major structural changes spanning the entire period, and the series too far in the past might have a different relationship.

Personally what works for me is:

  1. Generate a list of ingredient candidates. Exclude the ones with potential interference. For instance, if you are running an ad campaign in downtown Manhattan, using midtown as an ingredient is a bad idea because the people who saw the downtown ad might make the purchase around midtown.
  2. If you have data going back pretty far, split it into three time periods. Use the farthest back to find your ingredients, the middle to validate, and the most recent to estimate the causal effect.
  3. Using the farthest data, look at correlations. If your time series needs log transform and/or first differencing to be “roughly stationary”, do so. Very high or very low correlations with X indicate potential candidates.
  4. The first validation step is to use the middle period, preferably where you know there was no major intervention. Try building the CausalImpact model with randomly chosen intervention dates, keeping the length of the post-intervention period matched to your study design (a sketch of this placebo check follows the list). The estimated causal effect should be close to zero. Otherwise, look at other potential ingredients.
  5. The second validation step is to use the middle period again. If your CausalImpact uses X~Y+Z, try other combinations like Y~X+Z and Z~X+Y. They should likewise estimate no causal effect.
  6. If the previous steps work well, run the analysis and hope that the estimate is not spurious. We have done our best to ensure that the estimate is valid.
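Here is a sketch of that placebo check from step 4 (the validation series, its integer index, and the 3-week post-period are assumptions to adapt to your own setup):

library(CausalImpact)

# `validation_dat`: a zoo object like `dat` above, covering the middle period,
# response in the first column, no known intervention anywhere in it
placebo_starts <- sample(30:(nrow(validation_dat) - 21), 10)

placebo_effects <- sapply(placebo_starts, function(start) {
  impact <- CausalImpact(
    data = validation_dat,
    pre.period = c(1, start - 1),
    post.period = c(start, start + 20)   # 3-week post-period, matching the study design
  )
  impact$summary["Average", "RelEffect"]  # average relative effect from the summary table
})

round(placebo_effects, 3)   # these should all be close to zero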

The ingredients and X should be chosen before the quasi-experiment is run. The worst case scenario is that the intervention has been done but we cannot build a decent counterfactual for X because none of the ingredients work.

Observational Data

The two previous sections covered cases where we have directly intervened. Sometimes we cannot intervene due to real-life constraints such as ethics or cost, so we only have observational data to work with.

Whether we like it or not, the overwhelming majority of data is observational. Experiments are expensive; tracking user activity is much easier. It’s unfortunate, since small well-collected data can be more useful than large observational data — they can be analyzed with fewer assumptions. It’s hard to draw valid conclusions from observational data alone. Pearl’s causality framework has grown increasingly popular in this space. It is equivalent to Rubin’s potential outcomes framework, but presents assumptions in easy-to-digest diagrams instead of obtuse mathematical equations.

Controlling for barometric pressure, Mount Everest has the same altitude as the Dead Sea. (source)

The most important takeaway is that adding too many predictors can lead to wrong results. Models tuned for predictive performance can lead to incorrect causal inferences. There is a big push against “controlling for everything”. In my previous article about Bayesian statistics, I explained how applied statistics is moving away from mindless procedures to critical thinking. Choosing predictors through lasso or, even worse, stepwise regression, simply does not work for observational data.

In this field, it is common to use linear regression instead of logistic regression to estimate causal effect on a binary outcome. You might scoff at the idea, but it’s done for good reason. Log odds are not collapsible. With observational data, we can get wildly different conclusions if we use non-collapsible metrics and add too many predictors or fall victim to the omitted variable bias. Unless you are absolutely certain that you have the perfect set of predictors (as is the case for randomized controlled experiments), it might be prudent to use linear regression to estimate causal effects.
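Here is a toy simulation of what non-collapsibility means in practice: treatment is randomized and there is no confounding at all, yet the conditional and marginal odds ratios disagree, while the linear (risk difference) scale stays put:

set.seed(42)
n <- 1e5
x <- rbinom(n, 1, 0.5)               # randomized treatment
z <- rnorm(n)                        # a covariate unrelated to x (not a confounder)
p <- plogis(-1 + 1 * x + 2 * z)      # true conditional log odds ratio for x is 1
y <- rbinom(n, 1, p)

coef(glm(y ~ x + z, family = binomial))["x"]   # close to 1
coef(glm(y ~ x, family = binomial))["x"]       # noticeably smaller: the marginal log OR is attenuated
coef(lm(y ~ x))["x"]                           # risk difference, collapsible
coef(lm(y ~ x + z))["x"]                       # roughly the same risk difference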

Propensity matching is omitted from this article. Much has been written about why it’s terrible and how it will lead to wrong estimates. If you can compute propensity scores, you can use inverse propensity weighting instead. Just like stepwise regression, propensity matching is a terrible practice that needs to die out. Please, just don’t.

What Is a Causal DAG?

A Bayesian network is a directed acyclic graph (DAG) that implies a factorization of a joint distribution. The edges (arrows) show dependency. For instance, X → Y means that Y depends on X and the joint distribution p(x,y) can be factored as p(x) p(y|x).

A causal DAG makes the stronger statement that arrows indicate causality. X → Y means that X causes Y. For simplicity, this article will assume that all variables are binary.

Why go through all this trouble? Because once we go past a few variables, it gets really messy, even with four variables:
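For instance, take four variables where U causes V and W, and V and W cause X. The joint distribution factors as p(u, v, w, x) = p(u) p(v|u) p(w|u) p(x|v, w).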

The last term is not p(x|u,v,w) because u affects x only through v and w. Once we know v and w, knowing u provides no additional information. Thus, DAGs have the Markov property.

The DAG is much easier to inspect than the joint probability distribution. Even non-technical folks can chime in on the reasonableness of the assumptions and we can better involve the domain experts. “Does U really cause V?”

The most uncomfortable thing about causal DAGs is that we cannot verify our assumptions, except in cases where the DAG implies some conditional independencies. Why should the edge X → Y exist? This should be discussed with domain experts. Some things we can all agree on, e.g. drunk driving → car accident. Other assumptions can be murky. DAGs are very “subjective”.

Yet, not using DAGs is much worse. A single regression equation can correspond to many DAGs. Which is it? When someone does regression analysis and computes a p-value, what are they even testing? All of these DAGs correspond to Z~X+Y:

There are many other possible DAGs with unobserved variables. When someone computes a p-value on the coefficient of Y, what are they even testing? I’ll explain in a bit, but the interpretation of the coefficient is very different depending on the causal diagram. Analysis done purely through regression sidesteps the question under the guise of “objectivity”.

These causal diagrams allow non-statisticians to discuss the model and assumptions. They put everything in a digestible graphical format. Perhaps the regression model is wrong; perhaps it answers the wrong business question. Other people can help us find out. These causal diagrams should be a must for analyzing observational data.

Rules of a Causal DAG

We can classify paths going through triplet nodes:

  • information passes through a chain X → Y → Z
  • information passes through a fork X ← Y → Z
  • information gets blocked by an inverted fork X → Y ← Z. The Y is known as a collider.

Simply put, in X → Y → Z, Z is caused by X and the path is open. Conditioning on Y flips the switch off: once we know Y, knowing X provides no additional information because X affects Z only through Y. Information from X to Z gets blocked.

Why are the rules the way they are?

chain

Rain causes people to fall by making the ground slippery; the effect is mediated by the slippery ground. If we know that the floor is slippery, then knowing whether it’s raining provides no additional information with regard to people falling. Rain cannot make people fall if the ground is not slippery.

fork

Rain makes the ground wet. Rain causes people to buy umbrellas. Wet ground is correlated with people buying umbrellas. A fork is commonly known as confounding. We can predict umbrella sales by whether or not the ground is wet, but it’s silly to claim that wet ground causes umbrella sales. Controlling for rain will render “wet ground” useless as a predictor. If we know it’s raining, we don’t need to know that the ground is wet.

collider

Colliders are tricky to conceptualize. Spilling water makes the ground wet. Rain makes the ground wet. But obviously spilling water does not cause rain or vice versa. If we know the ground is wet and we just spilled water on the floor, then it reduces the probability of rain. They are two competing explanations. Once the phenomenon has been explained by one thing, it becomes less likely that both causes are at play. When we condition on a collider, two independent events become correlated (“collider bias”). We often condition on a collider not by choice but because of missing data, i.e. censoring or selection bias.

Now that we understand how information flows through a DAG, we can use it to estimate causal effects.

The do() Operator and Backdoor Adjustment

The do() notation indicates that we are changing something. In general, P(Y|X=x) is not the same as P(Y|do(X=x)). The former deals with observation (“Given that we observe X=x…”) while the latter deals with intervention (“Given that we set X to x…”). Or, to put it another way, the former deals with prediction while the latter deals with causal inference. The two goals are usually at odds with each other.

The do(X=x) operator wants to block all paths with an edge coming into X (“backdoor”) and adjust for confounders, while keeping all “front-door” paths open (paths starting with an edge coming out of X). Or, if you prefer thinking in terms of trees: we want to construct a new tree with X as the root node. Let’s use this DAG for illustration:

What is the causal effect of V on X? We want to compute E[X|do(V=1)] - E[X|do(V=0)]. Take a look at all the possible paths:

  • V → X
  • V → W ← X
  • V ← U → W ← X

The only path that has an edge coming into V is the last one. Break it down into triplets:

  • V ← U → W is a fork, so information flows
  • U → W ← X is a collider, so information gets blocked

This path is already blocked by the collider without us doing anything! Had we conditioned on W, we would’ve opened the backdoor and gotten wrong estimates. The correct regression equation is X~V. We don’t need to adjust for anything. While adding U would’ve had no impact, adding W would’ve led to disastrously wrong results (“collider bias”): it improves predictive power but the causal estimate is garbage, reiterating that prediction and inference are opposite goals. Here, I emphasize again: think about what predictors you use. Throwing everything into the regression will lead to wrong results.

This example is very simple. In practice, you will need to supply the DAG and let some algorithms determine which variables you should throw into the regression. (Though there are some heuristics to learn the “simplest DAG” that is consistent with the data.) This concept will be used throughout the rest of this article: variable selection for causal inference should be determined by DAGs.
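The dagitty package does exactly this. A sketch for the DAG above, written out edge by edge:

library(dagitty)

g <- dagitty("dag {
  U -> V
  U -> W
  V -> W
  V -> X
  X -> W
}")

adjustmentSets(g, exposure = "V", outcome = "X")
# returns the empty set: regress X on V alone and do not condition on W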

Sometimes, you run a DAG through the algorithm and discover that it is impossible to estimate the causal effect given your observations! But using the DAG we can discuss what variables we need and which ones are feasible to collect.

Mediation Analysis

mediation

Mediation analysis is concerned with a DAG that looks like this. There is a direct path X → Z and an indirect path X → Y → Z. Remember when I said interpretation of coefficients depends on the DAG?

This DAG claims that a change in X will cause a change in Y, which through the chain will cause a change in Z.

Suppose we fit the regression Z~X+Y. What is the interpretation for β, the coefficient of X? Holding Y constant, an increase of 1 unit of X is associated with an increase of β in Z. See the problem? When we intervene and change X, then Y will change as well.

The total effect is given by (X → Z) + (X → Y → Z). If we assume the latter effect is multiplicative, then we can claim (X → Y → Z) = (X → Y) × (Y → Z). This assumption is reasonable if we work with linear models. A unit increase in X causes an increase of (X → Y) in Y, and that (X → Y) increase in Y translates to a (X → Y) × (Y → Z) increase in Z. We can estimate the causal effect of X on Z using two different ways:

  • regress Z~X. No need to adjust for Y. The coefficient is the total effect.
  • regress Z~X+Y and regress Y~X. Then compute the total effect algebraically (a small simulation follows this list).
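A quick simulation showing the two routes agree (the coefficients are made up; everything is linear):

set.seed(7)
n <- 1e4
x <- rnorm(n)
y <- 2 * x + rnorm(n)                 # X -> Y effect of 2
z <- 3 * x + 4 * y + rnorm(n)         # direct X -> Z effect of 3, Y -> Z effect of 4

coef(lm(z ~ x))["x"]                  # total effect, roughly 3 + 2 * 4 = 11

direct <- coef(lm(z ~ x + y))["x"]    # roughly 3
x_to_y <- coef(lm(y ~ x))["x"]        # roughly 2
y_to_z <- coef(lm(z ~ x + y))["y"]    # roughly 4
direct + x_to_y * y_to_z              # again roughly 11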

Why bother with the second method? Comprehending it is key to understanding front-door adjustment and instrumental variables, which will be covered at the end of the article.

Front-Door Adjustment

Sometimes, you cannot block all the backdoors but can still estimate the causal effect using the front-door adjustment, which works for DAGs that look like:

front-door

Where U is unobserved and we want to estimate the causal impact of X on Z. To be precise, we require:

  • The direct path from X to Z goes through Y
  • There is no unblocked backdoor path from X to Y
  • All backdoor paths from Y to Z get blocked by X

This method is actually very clever. Personally, it was an “aha!” moment when I first saw it. Recall the relationship (X → Y → Z) = (X → Y) × (Y → Z). Then:

  • We cannot directly estimate (X → Y → Z) because of an unobserved confounder.
  • We can estimate (X → Y) because they don’t share any common confounders.
  • We can estimate (Y → Z). There is a direct path Y → Z and a backdoor path Y ← X ← U → Z, and we only need to adjust for X to close the backdoor path.

So we can compute the causal effect by breaking it down into (X → Y → Z) = (X → Y) × (Y → Z) since we can estimate the components in the right hand side.
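Here is a toy simulation of the front-door logic; U is generated but never used in the estimation, playing the role of the unobserved confounder:

set.seed(99)
n <- 1e5
u <- rnorm(n)                         # unobserved confounder
x <- 2 * u + rnorm(n)
y <- 1.5 * x + rnorm(n)               # X -> Y effect of 1.5
z <- 2 * y + 3 * u + rnorm(n)         # Y -> Z effect of 2; true X -> Z effect is 1.5 * 2 = 3

coef(lm(z ~ x))["x"]                  # naive estimate, badly biased by U

x_to_y <- coef(lm(y ~ x))["x"]        # no backdoor between X and Y
y_to_z <- coef(lm(z ~ y + x))["y"]    # adjusting for X closes Y's backdoor path
x_to_y * y_to_z                       # close to the true effect of 3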

Inverse Propensity Weighting

Selection bias is one of the biggest problems of observational data, making inverse propensity weighting (IPW) the bread and butter of this field. For instance, a business might ask “How does subscription affect user engagement?” A naive approach is to compare the means of subscribers vs non-subscribers. However, why would a non-active user even buy a subscription? Active users are more likely to subscribe — some reverse causation. It is completely ridiculous to claim that the act of subscribing causes user engagement to jump that much.

IPW creates a pseudo-population that better represents the true population. It corrects for selection bias. We take weighted averages using 1/propensity as the weights, where propensity is defined as the probability of being selected into that group. In this case, if the user is subscribed, then their propensity is the probability that they are subscribed given the predictors.

Suppose the true DAG is: likes_movies → subscribed, likes_movies → engagement, and subscribed → engagement. We simulate some data and analyze it using R code:

library(data.table)

set.seed(111)
# confounder: whether the user likes movies
likes_movies <- sample(c(0,1), 1000, replace = TRUE)
# subscription probability depends on likes_movies (0.1 vs 0.8)
x1 <- sample(c(0,1), 1000, replace = TRUE, prob = c(0.9, 0.1))
x2 <- sample(c(0,1), 1000, replace = TRUE, prob = c(0.2, 0.8))
subscribed <- ifelse(likes_movies, x1, x2)
# true model: subscribing increases engagement by 20
engagement <- 30 + 100 * likes_movies + 20 * subscribed + rnorm(1000, 0, 10)

dat <- data.table(likes_movies, subscribed, engagement)
dat_agg <- dat[, .(avg_engagement = round(mean(engagement), 2),
                   count = .N),
               by = .(likes_movies, subscribed)]
dat_agg

# naive comparison of means
dat[, .(mean(engagement)), by = subscribed]

# inverse propensity weighting, with logistic regression for the propensity score
propensity_model <- glm(subscribed ~ likes_movies, family = binomial, data = dat)
dat$propensity <- ifelse(
  dat$subscribed,
  fitted(propensity_model),
  1 - fitted(propensity_model)
)
dat[, .(sum(engagement / propensity) / sum(1 / propensity)), by = subscribed]

In this simulation, we make a subscription increase user engagement by 20. The simulated data:

If we naively compare the averages aggregated by subscribed, we get 114 - 63 = 51. But we know the true coefficient is 20! This estimate is completely wrong.

If we use IPW, we get the estimate of 103 - 83 = 20, the true causal effect.

But wait a second. While the previous example is colloquially called selection bias, it’s a confounder in causal literature. We could’ve simply run regression:
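With the simulated data from above, that regression recovers the effect directly:

# adjusting for the confounder gives an estimate close to the true effect of 20
lm(engagement ~ subscribed + likes_movies, data = dat)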

In causality, selection bias refers to conditioning on a collider: for example, we only get to observe records for which some common effect of X and Y takes a particular value.

I struggle coming up with an example so I’ll use Hernan’s example. X is folic acid supplement given to pregnant mothers, which reduces the chance of cardiac malformation. Y is cardiac malformation, which increases the chance of mortality in the womb. While X does reduce mortality by reducing Y, it also leads to healthier fetuses and reduces mortality in other ways. We can only observe cardiac malformation conditional on the baby being born. What is the total effect of X on Y?

Regression on its own cannot answer the question because we are missing the data when the child was never born. We need to use IPW using predicted probability of being missing.

This kind of problem often appears in surveys where non-response rate can depend on multiple factors.

IPW is a very powerful technique but can be difficult to use in practice. Propensities can be very close to 0, leading to unstable estimates (and a near-violation of the positivity assumption). Also, the propensity estimates should be unbiased, so we are restricted to logistic regression without regularization. The propensity scores might be biased anyway if we misspecify the model, e.g. nonlinearities or omitted variables. Use it with care.

Using Doubly Robust Machine Learning

Causal forests are a hot topic but can be confusing, and, from my experience, they don’t scale well. I prefer the simpler doubly robust estimation described by Hernan. I’ll walk us through a DAG where A confounds the effect of X on Y (A → X, A → Y, X → Y):

  1. Identify the variables we need to control for using your DAG. In this case, A.
  2. Split your data into three folds.
  3. Using fold 1, train a model to predict πhat = P(X=1) using A.
  4. Using fold 2, train a model to predict Yhat using A and X.
  5. Using fold 3, train a model to predict DR1 – DR0, which will be explained in a bit.
  6. Rotate through the folds in steps 3–5, then average the predictions from step 5.

Using the model from step 4, we can vary X=0 and X=1 to get Yhat estimates for both cases. We can then plug the predictions into the doubly robust estimator formula:
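In the same notation as the steps above, with Yhat1 and Yhat0 denoting the step-4 predictions with X set to 1 and to 0, this is the standard augmented IPW form (following Hernan):

DR1 = X * Y / πhat - ((X - πhat) / πhat) * Yhat1
DR0 = (1 - X) * Y / (1 - πhat) - ((πhat - X) / (1 - πhat)) * Yhat0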

And we can estimate the causal effect as:
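That is, we average the individual scores over all subjects: causal effect estimate = mean(DR1 - DR0).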

It’s called doubly robust because as long as at least one of the models (propensity or conditional mean) makes unbiased predictions, then this causal effect estimate is unbiased. In other words, even if we get one model wrong, we’re still fine.

So why does this work well? Let’s take a look at DR1 (the proof for DR0 is similar):

If our propensity model is unbiased, then the part on the right has expectation 0, while the part on the left is IPW. IPW is unbiased if the propensity model is unbiased. Hence, this DR1 estimator is unbiased.

Rearranging gives us:
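In the same notation:

DR1 = (X / πhat) * (Y - Yhat1) + Yhat1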

If our conditional mean model is unbiased, then the part on the left has expectation 0, while the part on the right is an unbiased estimator of the potential outcome. Hence, this DR1 estimator is unbiased.

Most machine learning algorithms are biased due to regularization, so it’s likely that both the propensity and conditional mean models are biased. We hope that the bias is small enough such that when we multiply the two biased predictions, the total bias shrinks quickly as n grows. Asymptotically, the estimate follows a normal distribution centered at the truth.

Instrumental Variables

This section has been left for last because it takes a completely different approach.

Much of the literature on instrumental variables (IV) is super confusing, but a DAG makes the intuition clear. In the DAG above, X → Y → Z, and an unobserved U affects both Y and Z. We want to estimate the causal effect of Y on Z. However, U is an unobserved confounder: we know it’s there, but we can’t measure it, so we can’t control for it.

X is an IV because it satisfies the following criteria:

  • X has a causal effect on Y
  • X affects Z only through Y
  • X and Z do not share confounders

So now we think in terms of the causal effect of X on Z. The causal effect X → Z can be decomposed into two parts: X → Y and Y → Z. Based on the assumptions, we:

  • can estimate X → Z
  • can estimate X → Y
  • have relationship (X → Z) = (X → Y) × (Y → Z)

Do you see it? Through some algebra, we can estimate Y → Z !

This is often done by regressing Y~X and then using the fitted values of Y to predict Z. Conceptually, we change Y so that only the edge X → Y is accounted for, breaking the edge U → Y.
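A sketch in base R (toy data with made-up coefficients; in practice a package such as AER’s ivreg also handles the standard errors properly, which the manual two-stage regression below does not):

set.seed(5)
n <- 1e5
u <- rnorm(n)                          # unobserved confounder of Y and Z
x <- rbinom(n, 1, 0.5)                 # instrument
y <- 1 * x + 2 * u + rnorm(n)
z <- 3 * y + 4 * u + rnorm(n)          # true Y -> Z effect is 3

coef(lm(z ~ y))["y"]                   # biased by U

# two-stage least squares by hand
y_hat <- fitted(lm(y ~ x))             # keep only the variation in Y driven by X
coef(lm(z ~ y_hat))["y_hat"]           # close to 3

# equivalently, (X -> Z) / (X -> Y)
coef(lm(z ~ x))["x"] / coef(lm(y ~ x))["x"]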

The whole concept can seem odd, but think about it: how do we interpret IV if X is a perfect predictor of Y? That’s an A/B test! Instrumental variables are “imperfect randomizers”.

In reality, causal effect estimates using IV can have extremely large intervals. If X is weakly predictive of Y and X → Y is small, then we are dividing by a very small number that has a lot of noise and it can lead to bias. Conceptually, a weak IV is like if you ran an A/B test but a lot of people randomized into A choose to go into B, and vice versa. If this noncompliance gets bad enough, you can’t estimate anything. Thus, if you are using two-stage least squares, then X should be a strong predictor of Y. Otherwise, be extremely wary of IV estimates; consider other methods such as Bayesian modeling.

In Closing

This article covers the major simple approaches to causality as I know them. The focus is on practical application. These approaches should be enough to handle the vast majority of problems.

If you think anything is missing or incorrect, please let me know in the comments so I can amend the article. I want this article to be the go-to reference for newer practitioners in this field. It should provide enough guidance to point them to the right methodology for their specific problem and let them search the keywords and concepts on their own.
