
Propensity Scores and Inverse Probability Weighting in Causal Inference

A global overview

Aleix Ruiz de Villa
Towards Data Science
Feb 1, 2021


Source: https://pixabay.com/es/photos/horizontales-edad-pesos-930716/

This is joint work with Núria Correa Mañas, Jesus Cerquides, Joan Capdevila Pujol and Borja Velasco within the Causal ALGO Bcn group. You can find a hands-on post by Núria Correa Mañas here!

In this post we are going to talk about two well-known techniques used to estimate Average Treatment Effects (ATEs): propensity score analysis and inverse probability weighting. This post assumes you have a basic grounding in causal inference, that is, you understand the problem of estimating effects in the presence of confounding.

Confounding variables are those that affect both the treatment assignment and the outcome. Image by Author

The main strength of propensity score analysis is its ability to reduce a multidimensional problem to a one-dimensional one. Once propensity scores have been calculated for each observation, we can ensure that we are properly comparing two possibly different populations, the treatment and control groups. Moreover, the effect of the treatment can subsequently be calculated based on these scores alone. Inverse probability weighting provides a different numerical formula with the same objective: calculating ATEs.

ATEs reminder

ATEs stand for average treatment effects. That is, when you have two groups, treated and untreated patients, you want to know what the effect of the treatment is on some outcome (the probability of recovery, for instance). Unless you run a Randomized Controlled Trial (RCT), you cannot just look at the difference in recovery between the two, because the treated and untreated groups potentially have very different sets of attributes.

Treatment groups may have different attribute distributions. Image by Author

Let us first recall the basics of confounding adjustment through the well-known Kidney Stones problem. In this problem, doctors wanted to know which of two treatments, A and B, had a better recovery rate. When they looked at the data they found this strange situation.

Simpson’s paradox — Kidney Stones. Image by Author
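The figures behind this paradox can be checked directly. Below is a minimal Python sketch using the recovery counts commonly cited for this study (successes out of totals, per treatment and stone size); the exact numbers are an assumption taken from the usual presentation of the example:

```python
# Commonly cited recovery counts: (successes, total) per (treatment, stone size).
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(successes, total):
    return successes / total

# Within each stratum, treatment A has the higher recovery rate.
for size in ("small", "large"):
    a = rate(*data[("A", size)])
    b = rate(*data[("B", size)])
    print(f"{size}: A={a:.2f}, B={b:.2f}")

# Aggregated over stone sizes, B looks better: Simpson's paradox.
a_all = rate(81 + 192, 87 + 263)   # 0.78
b_all = rate(234 + 55, 270 + 80)   # ~0.83
print(f"overall: A={a_all:.2f}, B={b_all:.2f}")
```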

When the data was stratified by the size of the stone, treatment A was better in both cases. However, when analyzed altogether, treatment B was better. It turned out that doctors were assigning the treatment based on a prior guess of the size of the stone. Since treatment A was surgery, while B was a kind of pill, larger stones (which were more difficult to cure) tended to be treated mostly with treatment A, producing an imbalance of distributions between treatments. This dynamic can be represented through a Directed Acyclic Graph (DAG):

Image by Author

In this case X is a single variable, but the same formulation works when it is a vector, which will be the case in practice. To remove the effect of confounding (variables that affect both the treatment and the outcome), provided no confounder is missing, we can calculate the ATE with the following adjustment formula.
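In standard notation, the adjustment formula is:

```latex
E[Y \mid do(T = t)] = \sum_x E[Y \mid T = t, X = x] \, P(X = x)
```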

This formula basically says what would happen if you gave treatment t to everyone, not just to the particular selection of patients that received it in the past. Understanding this formula properly takes some time and we will not explain it here. References on this topic can be found here.

Our main goal is to know what would happen if we gave treatment A or B to everyone.
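As an illustration of the adjustment formula, here is a hedged Python sketch using the kidney stones counts commonly cited for this example (the numbers are an assumption): each treatment's recovery rate per stone size is weighted by the overall frequency of that size, P(x).

```python
# Commonly cited kidney stones counts: (successes, total) per (treatment, size).
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

# P(x): overall frequency of each stone size, pooling both treatment arms.
n_small = data[("A", "small")][1] + data[("B", "small")][1]
n_large = data[("A", "large")][1] + data[("B", "large")][1]
n = n_small + n_large
p_x = {"small": n_small / n, "large": n_large / n}

def adjusted_recovery(treatment):
    """Sum over x of P(recovery | treatment, x) * P(x): the adjustment formula."""
    return sum(
        (data[(treatment, size)][0] / data[(treatment, size)][1]) * p_x[size]
        for size in ("small", "large")
    )

print(adjusted_recovery("A"))  # ~0.83: A is better after adjustment
print(adjusted_recovery("B"))  # ~0.78
```

After adjustment, treatment A comes out ahead, in agreement with the stratified comparison.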

Inverse Probability Weighting

As we have seen, the main problem with analyzing observational data is that the groups are imbalanced. This can possibly create a biased comparison between groups.

Motivational example with made up data. Image by Author

One idea that may come to mind is: what if we re-weight each group so that it reflects the global distribution of sizes instead of the one within each treatment group? For instance, for the large stones subgroup

Image by Author

we can use the weights shown in the figure. These weights are precisely the inverses of the propensity score, the probability of being assigned to a particular treatment group given the patient's attributes (we will talk about this in more detail in the next section).

This intuition can be formally reflected in the following formula, where, weighting each observation by the inverse of its propensity score, we arrive at the Inverse Probability Weighting formula

Inverse probability weighting formula
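In standard notation, with propensity score e(x) = P(T = 1 | X = x), this estimator is commonly written as:

```latex
\widehat{ATE} = \frac{1}{n} \sum_{i=1}^{n}
\left( \frac{t_i \, y_i}{e(x_i)} - \frac{(1 - t_i) \, y_i}{1 - e(x_i)} \right)
```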

This formula has a numerical problem: we have to divide by the propensity score. For some x, the probability of treatment may be very low, and dividing by such small values can easily inflate the variance of the estimator. So it is not always advised.
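To see the estimator in action, here is a minimal sketch on made-up data, where the true ATE is 2 and the true propensity score is known (in practice it would be estimated); the data-generating process is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)                    # confounder
e = 1.0 / (1.0 + np.exp(-x))              # propensity score P(T=1 | X=x)
t = rng.binomial(1, e)                    # treatment assignment depends on x
y = 2.0 * t + x + rng.normal(size=n)      # outcome; the true ATE is 2

# Naive comparison is biased: treated patients have larger x on average.
naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse probability weighting recovers the ATE.
ate_ipw = np.mean(t * y / e - (1 - t) * y / (1 - e))

print(f"naive: {naive:.2f}, IPW: {ate_ipw:.2f}")
```

The naive difference in means overestimates the effect, while the IPW estimate lands close to the true value of 2.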

Propensity Score

Let’s view the adjustment formula from another point of view. Recall

where X is a vector of attributes containing all potential confounders. Imagine you want to compare the effect of treatment versus no treatment on the outcome y. You would like to compare

So the previous formula can be read as: take a patient with attributes x, calculate the impact of both the treatment and the non-treatment

and add these quantities weighted by the frequency with which you will see these attributes x, i.e. P(x).

However, there is a problem with this approach: each patient receives only one version of the treatment (treated or not), so one of the two terms will not be available (unless, of course, you have another patient with the exact same attributes, which doesn't happen often). This is why, in some cases, causal inference problems can be seen as missing data problems.

One approximate solution is to find, for each patient with attributes x, another with attributes x' as close as possible. This is known as matching. This intuitive idea comes with a practical problem, namely defining what "as close as possible" means. This is because different attributes may impact our problem differently and it is not clear how to prioritize them accordingly.

Rosenbaum and Rubin, in "The central role of the propensity score in observational studies for causal effects" (1983), came up with a solution. To find a comparable patient, you don't need to find another one with the same attributes. It is enough to find one who has the same probability of being treated! This quantity is known as the propensity score.

That is, for each set of attributes x, you need to calculate
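In symbols, the propensity score is simply:

```latex
e(x) = P(T = 1 \mid X = x)
```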

This can be done with logistic regression (or, in fact, with any machine learning model that suits you).
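A minimal sketch with scikit-learn on made-up data (the attributes, coefficients and sample size are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Made-up observational data: two patient attributes drive treatment choice.
X = rng.normal(size=(n, 2))
logits = 0.8 * X[:, 0] - 0.5 * X[:, 1]
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# Fit P(T=1 | X) and read off the propensity score for each patient.
model = LogisticRegression().fit(X, t)
propensity = model.predict_proba(X)[:, 1]

print(propensity[:5])  # one score in (0, 1) per patient
```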

Calculating propensity scores for the whole population helps us manage the common support assumption. Again, for each patient we need to find a similar one within the other treatment group. What if some type of patient in one group is too different from the rest of the opposite group? This means we cannot find a proper match, so we cannot say what would have happened to this patient under the alternative treatment.

Visualizing the distributions of both groups, we can identify regions where there is no proper match.

source: https://www.slideshare.net/erf_latest/propensity-score-matching-methods

Those patients outside the common support should be removed from the comparison. Rosenbaum and Rubin's paper ensures that after this selection the comparison is proper, meaning unbiased, so it can be formally done.
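One simple trimming rule, an illustration rather than the only option, keeps only the patients whose propensity score falls inside the overlap of the two groups' score ranges:

```python
import numpy as np

def common_support_mask(propensity, t):
    """Keep observations whose propensity score lies inside the
    overlap of the treated and untreated score ranges."""
    lo = max(propensity[t == 1].min(), propensity[t == 0].min())
    hi = min(propensity[t == 1].max(), propensity[t == 0].max())
    return (propensity >= lo) & (propensity <= hi)

# Toy scores: no untreated patient scores above 0.6, and no one
# scores below 0.2, so patients outside [0.2, 0.6] have no counterpart.
propensity = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.4, 0.6])
t = np.array([1, 1, 1, 1, 1, 0, 0, 0])

mask = common_support_mask(propensity, t)
print(propensity[mask])  # 0.1, 0.7 and 0.9 are dropped
```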

Propensity scores can also be used to calculate ATEs. With some calculations (omitted for simplicity), we can see that the difference between treated and untreated can be calculated in a similar fashion as before, but using only propensity scores
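In standard notation, the resulting formula is the adjustment formula with the propensity score e(X) playing the role of X:

```latex
E[Y \mid do(T = t)] = \sum_{p} E[Y \mid T = t, e(X) = p] \, P(e(X) = p)
```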

This formula leads to different algorithms for calculating the ATE, some proposed in Rosenbaum and Rubin's paper.

  • Version 1: For each treated patient, find an untreated patient with the same propensity score and calculate the difference in outcomes. Average all the results.
  • Version 2: For each level of propensity p, sample a treated and an untreated patient with that level and calculate the difference. Average all the results.
  • Sub-classification: Bin the propensity scores and, for each bin, calculate the mean treatment difference. Average all the results, weighting by bin frequency.
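Sub-classification is easy to sketch on made-up data; here the true ATE is 2 and, for clarity, we bin the true propensity scores into 20 strata (the data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)                    # confounder
e = 1.0 / (1.0 + np.exp(-x))              # propensity score P(T=1 | X=x)
t = rng.binomial(1, e)
y = 2.0 * t + x + rng.normal(size=n)      # outcome; the true ATE is 2

# Sub-classification: bin propensity scores, compare outcomes within
# each bin, then average per-bin differences weighted by bin size.
n_bins = 20
edges = np.quantile(e, np.linspace(0, 1, n_bins + 1))
idx = np.clip(np.digitize(e, edges) - 1, 0, n_bins - 1)

ate = 0.0
for b in range(n_bins):
    in_bin = idx == b
    diff = y[in_bin & (t == 1)].mean() - y[in_bin & (t == 0)].mean()
    ate += diff * in_bin.mean()           # weight by P(bin)

print(f"sub-classification ATE: {ate:.2f}")  # close to 2
```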

Conclusions

  • Average Treatment Effects can be calculated using two equivalent formulations: inverse probability weighting and propensity scores
  • Inverse Probability Weighting, since it potentially divides by small probabilities, can suffer from large variance
  • Propensity Scores can be used to find a region of common support
  • Propensity Scores can be used in a variety of ways to calculate ATEs
