Methods for Inferring Causality

Matching, Propensity Weighting, Doubly Robust ML, Regression Discontinuity Design

Zain Ahmed
Towards Data Science



In our previous article, Part 1: Getting Started with Causal Inference, we covered the basics of causal inference and gave a lot of attention to regression. We also discussed that regression is not the only way to close backdoors in a causal estimation design. In this article, we are going to discuss some other methods, all aiming to achieve the same thing: to make the treatment and control groups similar in everything except the treatment.

Matching

The goal of matching is to reduce the bias in the estimated treatment effect in an observational study by finding, for every treated unit, one (or more) non-treated unit(s) with similar observable characteristics, so that the covariates are balanced between the groups. If there is some confounder, say age, which affects both the treatment and the outcome and thereby makes the treatment and control groups incomparable, we can make them comparable by matching each treated unit with a similar unit from the control group. In our example, people with similar ages from the treatment and control groups are compared with each other, and the final effect is the average over all these matched comparisons.

The ATE for matching is calculated as

ATE = 1/N Σᵢ (2Tᵢ − 1) (Yᵢ − Yj(i))

where Yj(i) is the outcome of the unit from the opposite group that is most similar to unit i, and the (2Tᵢ − 1) factor flips the sign so that we match in both directions, treated against control and control against treated.

This helps in removing some bias, but what if the difference in Y₀ of the matched treatment and control units is not 0? Because of matching discrepancies, we may still have bias. Fortunately, there is a way to improve further on the bias reduction journey that we have embarked upon, simply with the help of our old friend. Once again, regression to the rescue!

For a treated unit i, the matched difference is bias-corrected with an outcome model µ̂₀ fitted by regression:

(Yᵢ − Yj(i)) − (µ̂₀(Xᵢ) − µ̂₀(Xj(i)))

µ̂₀(Xᵢ) is the predicted outcome for treated unit i had it not been treated, i.e., an estimate of Y₀ for the treated unit.

µ̂₀(Xj(i)) is the predicted untreated outcome for the control unit j(i) that is matched to treated unit i, i.e., Y₀ for the matched control unit. By subtracting their difference in the ATE equation above, we make sure that discrepancies in Y₀ between treated and matched units do not contribute to the ATE.

Suppose we want to measure the impact of a medication on recovery days, but in our data we have confounders like the age, severity, and gender of the patient.
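The data used in this example is not included here, so below is a minimal Python sketch that simulates data of this shape (the column names and coefficients are made up for illustration) and computes the naive difference in means:

```python
import numpy as np
import pandas as pd

# Hypothetical, simulated data: older and sicker patients are more likely to get the
# medication and also take longer to recover, so the naive comparison is confounded.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(50, 12, n)
severity = rng.uniform(0, 1, n)
gender = rng.binomial(1, 0.5, n)
p_treat = 1 / (1 + np.exp(-(0.1 * (age - 50) + 4 * (severity - 0.5))))
medication = rng.binomial(1, p_treat)
recovery_days = (5 + 0.5 * age + 40 * severity + 2 * gender
                 - 7 * medication + rng.normal(0, 2, n))
df = pd.DataFrame({"age": age, "severity": severity, "gender": gender,
                   "medication": medication, "recovery_days": recovery_days})

# Naive ATE: difference in mean outcomes, ignoring the confounders entirely.
naive_ate = (df.loc[df.medication == 1, "recovery_days"].mean()
             - df.loc[df.medication == 0, "recovery_days"].mean())
print(naive_ate)  # positive, even though the true effect of the medication is negative
```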

ATE = 16.8957995464987

If we directly calculate the ATE without adding any controls, it looks like the medication is increasing recovery days, but we know this is because of confounding: the treatment and control groups are not comparable. Let's see what happens if we match the treatment and control units using KNN and use regression to further reduce the bias (bias correction), as shown above.
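A sketch of this matching-plus-bias-correction step with scikit-learn, continuing from the simulated dataframe above (one possible implementation of the estimator described earlier):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_cols = ["age", "severity", "gender"]
treated = df[df.medication == 1]
control = df[df.medication == 0]

# Scale covariates so no single one dominates the nearest-neighbour distance.
scaler = StandardScaler().fit(df[X_cols])
X_t, X_c = scaler.transform(treated[X_cols]), scaler.transform(control[X_cols])

# Outcome models fit separately on control and treated units,
# used for the regression bias-correction term.
mu0 = LinearRegression().fit(control[X_cols], control.recovery_days)
mu1 = LinearRegression().fit(treated[X_cols], treated.recovery_days)

# For each treated unit find its nearest control unit, and vice versa.
nn_c = NearestNeighbors(n_neighbors=1).fit(X_c)
nn_t = NearestNeighbors(n_neighbors=1).fit(X_t)
match_for_t = control.iloc[nn_c.kneighbors(X_t)[1].ravel()]
match_for_c = treated.iloc[nn_t.kneighbors(X_c)[1].ravel()]

# Bias-corrected matching estimator; the (2T - 1) factor flips the sign for control units.
att = (treated.recovery_days.values - match_for_t.recovery_days.values
       - (mu0.predict(treated[X_cols]) - mu0.predict(match_for_t[X_cols])))
atc = (match_for_c.recovery_days.values - control.recovery_days.values
       - (mu1.predict(match_for_c[X_cols]) - mu1.predict(control[X_cols])))
ate = np.concatenate([att, atc]).mean()
print(ate)
```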

ATE = −7.36266090614142

Voila! This surely makes more sense. Recovery days are reduced by about 7 days because of the medicine, once we have controlled for confounders through matching.

Propensity Score Stratification: Instead of matching on the covariates directly or with some distance metric, we can match on the propensity score, which is the conditional probability of treatment given all the covariates, P(T|X), denoted P(x). Stratification subclassifies the units using quantiles of the propensity score; the treatment effect is calculated within each stratum and the results are combined using weights (the proportion of units in each stratum) to get the final estimate. It has been shown that dividing into five strata can remove up to 90% of the bias.

Let's implement this using the dowhy library for causal inference from Microsoft.
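A minimal sketch with DoWhy, continuing with the simulated medication dataframe from the matching example (some DoWhy versions expect a boolean treatment column for the propensity-score estimators):

```python
from dowhy import CausalModel

# Propensity-score estimators in DoWhy generally expect a binary/boolean treatment.
df_dw = df.assign(medication=df.medication.astype(bool))

# Causal model: medication -> recovery_days, confounded by age, severity, gender.
model = CausalModel(
    data=df_dw,
    treatment="medication",
    outcome="recovery_days",
    common_causes=["age", "severity", "gender"],
)
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Stratify on the estimated propensity score and combine the per-stratum effects.
estimate = model.estimate_effect(
    estimand, method_name="backdoor.propensity_score_stratification"
)
print(estimate.value)
```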

Matching works well when there are lots of control units to match the treatment units with. In propensity terms this is called common support: the propensity score distributions of the treated and untreated should overlap well for the estimates to be reliable.

Propensity Score Weighting

We discussed above that instead of conditioning on the covariates, we can condition on a single value called the propensity score, which is the conditional probability of getting the treatment T given the covariates:

P(x) = P(T = 1 | X = x)

X (the confounders) affects T (the treatment) through the function P(x), so controlling for P(x) indirectly controls for X. We can use the propensity score for matching, or plug it directly into a linear regression to control for bias instead of conditioning on all the confounders. Here, we will use P(x) as a scaling parameter; this method is known as Inverse Probability of Treatment Weighting (IPTW).

Simplifying and integrating over X, we get

E[Y₁ − Y₀] = E[ Y·T / P(x) ] − E[ Y·(1 − T) / (1 − P(x)) ]

From the equation above, for the treatment group we scale the outcome by the inverse of the propensity. A higher weight is given to someone with a low probability of treatment (who looks untreated) but is actually in the treatment group, and vice versa for the control group. This creates a pseudo-population where every unit is scaled by the inverse of its propensity, thereby controlling for all the X's on which the propensity is conditioned.
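To make the formula concrete, here is a hand-rolled IPTW estimate on the simulated medication data from earlier, with the propensity score estimated by logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Estimate the propensity score P(T=1 | X) with a simple logistic regression.
X_cols = ["age", "severity", "gender"]
p = (LogisticRegression(max_iter=1000)
     .fit(df[X_cols], df.medication)
     .predict_proba(df[X_cols])[:, 1])

# IPTW: weight treated outcomes by 1/P(x) and control outcomes by 1/(1 - P(x)).
T, Y = df.medication.values, df.recovery_days.values
iptw_ate = np.mean(T * Y / p) - np.mean((1 - T) * Y / (1 - p))
print(iptw_ate)
```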

Lalonde Data Set

We will be using the Lalonde dataset to find the impact of the treatment on real earnings in 1978, first by using regression and including all confounders in the regression equation, then by using the propensity score in the regression equation instead of the confounding variables, and finally by using the propensity score weighting method from the DoWhy package.

The ATE is around 1671 from the regression with all the confounding variables. Now let's get the propensity score using logistic regression and use that in a linear regression, controlling for P(x) instead of X.
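A sketch of these two regressions, assuming a local lalonde.csv with the usual columns (column names vary slightly between copies of the dataset):

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

# Hypothetical load of the Lalonde data; any copy with the usual columns works.
lalonde = pd.read_csv("lalonde.csv")
confounders = ["age", "educ", "black", "hispan", "married", "nodegree", "re74", "re75"]

# 1) Regression adjusting for all confounders directly.
formula = "re78 ~ treat + " + " + ".join(confounders)
print(smf.ols(formula, data=lalonde).fit().params["treat"])

# 2) Estimate the propensity score and control for P(x) instead of X.
ps = (LogisticRegression(max_iter=1000)
      .fit(lalonde[confounders], lalonde.treat)
      .predict_proba(lalonde[confounders])[:, 1])
lalonde = lalonde.assign(ps=ps)
print(smf.ols("re78 ~ treat + ps", data=lalonde).fit().params["treat"])
```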

Finally, the propensity score weighting method from the DoWhy package:
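A sketch of that estimate through DoWhy, reusing the Lalonde dataframe loaded above (the propensity-score estimators expect a binary/boolean treatment):

```python
from dowhy import CausalModel

model = CausalModel(
    data=lalonde.assign(treat=lalonde.treat.astype(bool)),  # boolean treatment for PS methods
    treatment="treat",
    outcome="re78",
    common_causes=confounders,
)
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Reweight each unit by the inverse of its estimated propensity score.
estimate = model.estimate_effect(
    estimand, method_name="backdoor.propensity_score_weighting"
)
print(estimate.value)
```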

Additional notes on DoWhy

The DoWhy package provides us with some methods for getting more confidence in our results, called refutation methods. Let's understand these using the example above (a short sketch follows the list of refuters below).

Random Common Cause Refuter: Adds randomly generated covariates to the data and reruns the analysis to see if the causal estimate changes or not. The causal estimate should not change much because of a random variable.

Placebo Treatment Refuter: Randomly assigns any covariate as treatment. The causal estimate should move towards zero.

Data Subset Refuter: Similar to cross-validation, it creates subsets of data and measures if our causal estimates vary across subsets. There should not be much variation in our estimates.
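Here is how those refuters might be invoked, continuing with the model and estimate from the DoWhy weighting example above:

```python
# Each refuter perturbs the data or the graph and re-runs the analysis.
for method in ["random_common_cause",         # estimate should barely move
               "placebo_treatment_refuter",   # estimate should move towards zero
               "data_subset_refuter"]:        # estimate should be stable across subsets
    refutation = model.refute_estimate(estimand, estimate, method_name=method)
    print(refutation)
```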

Passing these refutation tests doesn't mean our causal estimate is correct; they just provide more confidence in it. Remember, as we discussed in the previous article, causal inference with observational data requires internal validity: we have to make sure we have controlled for the variables that contribute to bias. This is different from the external validity required of a predictive model, which we check with a train-test split to get reliable predictions (not causal estimates).

If you want to know more about the DoWhy package, the link to the documentation is given in the references. Moving on!

Doubly Robust Estimation

We have used regression models and propensity-score-based methods to control for confounders, but we can also combine the two so that our causal estimates are more robust. This is called Doubly Robust Estimation.

The doubly robust estimator combines the outcome models µ̂₁, µ̂₀ with the propensity model P̂(x):

ATE = 1/N Σᵢ [ Tᵢ(Yᵢ − µ̂₁(Xᵢ)) / P̂(Xᵢ) + µ̂₁(Xᵢ) ] − 1/N Σᵢ [ (1 − Tᵢ)(Yᵢ − µ̂₀(Xᵢ)) / (1 − P̂(Xᵢ)) + µ̂₀(Xᵢ) ]

Let's look at the first part. If µ̂₁(X) is correct and P̂(x) is wrongly estimated, then E[Tᵢ(Yᵢ − µ̂₁(Xᵢ)) / P̂(Xᵢ)] is still close to 0, because Tᵢ selects only the treated cases, and for those (Yᵢ − µ̂₁(Xᵢ)) is very close to 0. So being correct in µ̂₁(X) wipes out the necessity of P̂(x) being correct. Rearranging the terms, we can likewise show that when P̂(x) is correct we don't need µ̂(X) to be correct.
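A hand-rolled sketch of this estimator on the Lalonde data from earlier (the linear and logistic models are just placeholders for whatever outcome and propensity models one prefers):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = lalonde[confounders].values
T = lalonde["treat"].astype(int).values
Y = lalonde["re78"].values

# Propensity model P_hat(x) and the two outcome models mu1_hat, mu0_hat.
p = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# Doubly robust ATE: each half is consistent if either the outcome model
# or the propensity model is correctly specified.
dr_ate = (np.mean(T * (Y - mu1) / p + mu1)
          - np.mean((1 - T) * (Y - mu0) / (1 - p) + mu0))
print(dr_ate)
```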

Causal estimate = 1619.51

Regression Discontinuity Design

Regression discontinuity design can be used whenever there is a clear threshold that separates treatment from control. Based on the threshold, we can reduce bias by identifying the population just below the threshold as control and just above it as treatment. For example, in a mobile game where reaching a certain score changes the player's level, players just below that score can be taken as control and players just above it as treatment. We can then estimate the impact of the level change on other outcomes, say engagement, since people just below and just above the threshold are likely to have similar skill.

Implementing an RDD is as simple as creating a dummy variable whose value is 0 below the threshold and 1 above it. Let's implement this to study the impact of a level change in a game on engagement, where the threshold is at a score of 100 (the level change). The data contains game scores and engagement scores.
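A minimal sketch of this, assuming a hypothetical game_engagement.csv with score and engagement columns and a cutoff at 100:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical game data: a running variable (score) and the outcome (engagement),
# with the level change happening at a score of 100.
games = pd.read_csv("game_engagement.csv")  # assumed columns: score, engagement
threshold = 100

games = games.assign(
    above=(games.score >= threshold).astype(int),   # dummy: 1 above the cutoff, 0 below
    score_centered=games.score - threshold,         # centre the running variable at the cutoff
)

# The coefficient on `above` is the jump in engagement at the threshold.
rdd = smf.ols("engagement ~ above + score_centered", data=games).fit()
print(rdd.params["above"])
```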

Engagement increases by 7.66 units with the change in level at a score of 100. Since we are mainly interested in the effect at the threshold, it makes sense to fit the data points around the threshold better than those far away from it. We can do this by using weights in our linear regression, computed with a kernel that gives more weight to data points near the threshold.
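A sketch of the weighted fit, using a triangular kernel with an assumed bandwidth as the weighting scheme (the exact kernel and bandwidth used here may differ):

```python
import numpy as np
import statsmodels.formula.api as smf

# Triangular kernel weights (an assumed choice): weight 1 at the threshold,
# falling linearly to 0 at a bandwidth of h score points away.
h = 20
weights = np.clip(1 - np.abs(games.score_centered) / h, 0, None)

# Weighted least squares gives more influence to points close to the cutoff.
rdd_w = smf.wls("engagement ~ above + score_centered", data=games, weights=weights).fit()
print(rdd_w.params["above"])
```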

One can see that the impact of level change increased to 8.3 by using weighted linear regression with higher weights near the threshold.

All the methods discussed here and in the previous article are useful, but not completely foolproof. We might not be able to eliminate the bias entirely, as there can always be unobserved confounders affecting our design, but we can reduce bias by controlling for confounders using these methods, and that is surely more reliable than purely predictive inference. Causal insights can also trigger A/B tests, which are the gold standard of causal inference.

In our previous article, we covered the basics of causal inference, and in this article, we covered methods for getting causal estimates. Both articles aimed at the average treatment effect, but what is even better is a more personalized effect of the treatment on treated units, also known as the heterogeneous treatment effect. For example, the average treatment effect tells us whether we should roll out an email marketing campaign, while the heterogeneous treatment effect tells us which customers we should target with the campaign to increase our ROI. In our next article, we will discuss heterogeneous treatment effects and ways to estimate them.
