Heterogeneous treatment effect and Meta Learners

Causal Inference using S learner, T learner, and X learner

Photo by Campaign Creators on Unsplash

So far, in our previous articles, Getting Started with Causal Inference and Methods for Inferring Causality, we talked about the average treatment effect (ATE), which helps us decide whether a particular treatment should be given at all, for example, whether we should roll out a particular marketing campaign or increase the price of a product (should we treat?). What it doesn't tell us is who we should roll it out to or raise prices for, since the effect of treatment differs across individuals (who do we treat?); if we optimize for this, we stand a better chance of improving our ROI. We now wish to go beyond the ATE discussed in previous blogs and estimate the conditional average treatment effect (CATE), also called the heterogeneous treatment effect, to better personalize treatment regimes and understand how the treatment effect varies across individuals.

Machine learning models do a great job at estimating E[Y|X] because, in the predictive universe, we are passive observers, predicting an outcome Y, say sales, given a set of features X. But in industry there are levers we would like to pull to increase sales, so we are more interested in E[Y|X, T], where X is the set of features we can't or don't want to control and T is the treatment variable we want to intervene on to increase sales. Estimating the heterogeneous treatment effect is then the process of estimating the causal effect of Tᵢ on Yᵢ in the context of Xᵢ.

Most cases of causal inference involve ordering units by treatment effect, for example, to find the units/individuals most likely to respond positively to treatment. Before jumping into Meta Learners, a family of models for heterogeneous treatment effects used by companies like Uber, Swiggy, and Booking.com (case studies linked in this article), let us understand CATE with a price elasticity example, using the simplest model we know: regression. Suppose a juice shop wants to know how it can change prices without losing sales, i.e., price elasticity with respect to temperature, day of the week, etc. Here is the toy dataset for the same.

Let’s build a simple model to get price elasticity, that is, how a unit change in price affects sales. This is exactly what the regression coefficients give us.
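The notebook output is not reproduced here, but a minimal sketch on simulated juice-shop data (all numbers invented; the true price effect is set to -1.0) shows how the regression coefficient on price gives the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
temp = rng.uniform(15, 35, n)       # temperature on the day
price = rng.uniform(3, 10, n)       # price charged for the juice
# Simulated ground truth: each unit of price costs one unit of sales
sales = 120 - 1.0 * price + 0.8 * temp + rng.normal(0, 1, n)

# Regress sales on [1, price, temp]; the price coefficient is the ATE
X = np.column_stack([np.ones(n), price, temp])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta[1])  # close to the true effect of -1.0
```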

We get an average treatment effect (ATE) of price of -1.025, which is constant for all units. But what would be more useful is knowing how price elasticity changes with temperature, as this would enable us to change prices on certain days without losing many sales. We can do this through regression by introducing an interaction term in the equation.
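As a sketch, here is the same regression with a price × temp interaction on simulated data; the true coefficients (-3.45 on price, 0.085 on the interaction) are invented to mirror the fitted values quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
temp = rng.uniform(15, 35, n)
price = rng.uniform(3, 10, n)
# Ground truth: price elasticity varies with temperature, -3.45 + 0.085 * temp
sales = 150 + (-3.45 + 0.085 * temp) * price + 0.8 * temp + rng.normal(0, 1, n)

# Add the interaction term price * temp to the design matrix
X = np.column_stack([np.ones(n), price, temp, price * temp])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Price elasticity at a given temperature: beta_price + beta_interaction * temp
print(beta[1] + beta[3] * 20)  # more negative on a cool day
print(beta[1] + beta[3] * 35)  # less negative on a hot day
```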

Here the price elasticity is given by -3.45 + 0.0848 * temp: the effect of price on sales decreases in magnitude as temperature increases for our juice brand. This is a very simple example, merely used to introduce the concept of CATE. Now let us delve into Meta Learners and more industry-relevant applications of CATE.

Meta Learners

Meta learners are a set of algorithms built on top of machine learning algorithms like Random Forests, XGBoost, or neural networks to help estimate CATE (Künzel et al., 2019).

S learner

In the S learner, or single-estimator learner, the treatment indicator is included as a feature just like any other, without being given any special role. We then make predictions under the different treatment regimes, and the difference between the predictions is our CATE estimate.

Image by the author

We will be using this example throughout, where w is our treatment, which can be any form of intervention, for example, an email sent to a bank customer.
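A minimal sketch of the S learner recipe on simulated data (a linear model stands in for the ML model; the features, coefficients, and true effect of 2 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 2))              # customer features
w = rng.binomial(1, 0.5, n)              # treatment, e.g. email sent or not
y = 2.0 * w + X @ np.array([1.0, -0.5]) + rng.normal(0, 1, n)

# S learner: a single model, with the treatment indicator as just another feature
D = np.column_stack([np.ones(n), X, w])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)

# Predict under both regimes (w=1 and w=0); the difference is the CATE estimate
pred1 = np.column_stack([np.ones(n), X, np.ones(n)]) @ beta
pred0 = np.column_stack([np.ones(n), X, np.zeros(n)]) @ beta
cate = pred1 - pred0
print(cate.mean())  # close to the true effect of 2
```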

The S learner is a very simplistic model that might work well for certain datasets, but since it deploys a single ML model, which is usually regularized, it tends to bias the treatment effect towards zero. If the treatment is very weak in explaining the outcome, the S learner can disregard the treatment variable completely. The next learner overcomes this problem.

T learner

The T learner overcomes the problem of disregarding the treatment completely by building one model per treatment value; for a binary treatment, two models are developed, hence the name T learner.
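A sketch of the T learner on simulated data with a heterogeneous effect (1 + x₀, an invented ground truth), again with linear models standing in for the ML models M₀ and M₁:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 2))
w = rng.binomial(1, 0.5, n)
# Ground truth: treatment effect is 1 + x0, so it varies across units
y = (1.0 + X[:, 0]) * w + X @ np.array([0.5, -0.5]) + rng.normal(0, 1, n)

def fit_linear(features, target):
    D = np.column_stack([np.ones(len(target)), features])
    return np.linalg.lstsq(D, target, rcond=None)[0]

# T learner: one outcome model per treatment arm
b1 = fit_linear(X[w == 1], y[w == 1])    # M1, fit on treated units only
b0 = fit_linear(X[w == 0], y[w == 0])    # M0, fit on control units only

D = np.column_stack([np.ones(n), X])
cate = D @ b1 - D @ b0                   # M1(x) - M0(x)
print(cate.mean())  # close to 1, the average of 1 + x0
```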

Image by the author

The T learner avoids missing a weak treatment effect but still suffers from regularization. Consider the example in this paper from Künzel et al., 2019. Suppose the treatment effect is constant, there is some non-linearity in the outcome Y, and the number of treated units is far smaller than the number of control units.

Source: https://arxiv.org/pdf/1706.03461v3.pdf

Because we have only a few treated observations, M₁ will be a simple linear model, while M₀ will be more complex; since we have more data for the control units, it won't overfit. However, when we estimate the CATE as M₁(x) − M₀(x), subtracting the non-linear M₀ from the linear M₁ gives a non-linear CATE, which is wrong, since the CATE is a constant 1 in this case. To overcome this issue, Künzel et al., in the same paper, propose the X learner.

X learner

The X learner uses information from the control group to develop better estimators for the treatment group and vice versa. It has three steps, one of which is a propensity score model.

Image by the author

Using the same data as above, Künzel et al. explain how the X learner overcomes the drawbacks of the T learner. In the first stage, we develop two models, just as in the T learner. In the second stage, we calculate τ(X, T=0), the imputed treatment effect on the untreated. The model based on τ(X, T=0) is not good: since M₁ is a simple model built using little data, it does not capture the non-linearity in Y, and hence τ(X, T=0) is non-linear. τ(X, T=0) is represented by the red dots, and Mτ0(X) by the red dotted line.

The imputed treatment effect for the treated, τ(X, T=1), represented by the blue dots, is estimated using M₀, which is trained on the larger untreated group; since its imputed treatment effects are correct, we are able to obtain a correct Mτ1(X) model.

https://arxiv.org/pdf/1706.03461v3.pdf

So we have two models, one with the correct imputed treatment effects, developed from the larger untreated group, and the other with incorrect imputed treatment effects. We combine these second-stage models using the propensity score.

e(x) is the propensity score model, and since we have few treated units, e(x) is small, giving a small weight to the incorrect model Mτ0(X) and a higher weight, 1 − e(x), to the correct model Mτ1(X):

τ(x) = e(x) · Mτ0(x) + (1 − e(x)) · Mτ1(x)

https://arxiv.org/pdf/1706.03461v3.pdf
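Putting the three stages together, here is a sketch of the X learner on simulated data mimicking the paper's setup (few treated units, a constant true effect of 1; the linear models and the constant propensity estimate are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
X = rng.normal(size=(n, 2))
w = rng.binomial(1, 0.1, n)              # only ~10% of units are treated
y = 1.0 * w + X @ np.array([1.0, -0.5]) + rng.normal(0, 1, n)

def fit_linear(features, target):
    D = np.column_stack([np.ones(len(target)), features])
    return np.linalg.lstsq(D, target, rcond=None)[0]

D = np.column_stack([np.ones(n), X])

# Stage 1: outcome models per arm, as in the T learner
b1 = fit_linear(X[w == 1], y[w == 1])    # M1, small treated group
b0 = fit_linear(X[w == 0], y[w == 0])    # M0, large control group

# Stage 2: imputed treatment effects, then a model for each group
d1 = y[w == 1] - D[w == 1] @ b0          # tau(X, T=1): observed minus M0
d0 = D[w == 0] @ b1 - y[w == 0]          # tau(X, T=0): M1 minus observed
bt1 = fit_linear(X[w == 1], d1)          # M_tau1
bt0 = fit_linear(X[w == 0], d0)          # M_tau0

# Stage 3: weight the two CATE models by the propensity score e(x);
# treatment is randomized here, so the sample share of treated stands in
e = w.mean()
cate = e * (D @ bt0) + (1 - e) * (D @ bt1)
print(cate.mean())  # close to the true constant effect of 1
```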

We don't have to implement these learners ourselves. There are two causal learning libraries for heterogeneous treatment effects, one from Uber, called CausalML, and another from Microsoft, called EconML, which also provide functions for feature importance and Shapley values.

Implementing X learner using causalml.

EconML implementation of X learners

I strongly recommend going through the Meta Learner and additional supporting functions in CausalML and EconML documentation.

Meta Learners are commonly used in the industry for uplift modeling. Below are a few case studies from Uber, Booking.com, and Swiggy where meta-learners have been used to solve industry-relevant problems.

Case studies and Research: Uplift Modelling using Meta Learners

Uplift Modeling for Multiple Treatments with Cost Optimization at Uber using Meta Learners

Uplift Tutorial: From Causal Inference to Personalisation at Booking.com

Uplift Modelling aka Heterogeneous Treatment evaluation at Swiggy


In this article, we looked at meta-learners as one of the methods for estimating CATE. However, meta-learners do not work with continuous treatments. In the next article, we will look at another method, called Double Machine Learning, which works with both continuous and discrete treatments.
