A Story of Treatment and Response

Introduction to Predicting Treatment Effects on Recipients of Marketing Campaigns

Jürgen Schmidl
Towards Data Science


Image by geralt @ pixabay.com

Imagine that you are responsible for selecting the customers in your company who should receive the next printed advertising material.

Of course, only those customers for whom it has the greatest impact should receive it. But that is easier said than done. Therefore, in this article, I would like to discuss the different ways machine learning can be used to predict treatment effects of a stimulus, such as a promotion.

The method is illustrated with a marketing campaign, but can also be applied to other problems. In addition, there is a proof of concept for the approach presented in this article. You can find it here.

Please note, although I mention sources, this article is based on my experience in this field. If you see any areas for improvement, please drop me a comment.

What is Response and How to Measure It?

Definition of Response

Any person receiving a treatment can react to it in one of two ways: ignore it or respond to it. This response represents the treatment effect.

First, we need to determine how we are going to measure the reaction. In the context of a promotion, the purchase itself is often used in the literature as a binary term for the response.

This may be sufficient if the strength of the response is fairly homogeneous. However, for most use cases, the intensity of the response should also be considered. In our example, this would be the order value, which varies depending on the strength of the reaction.

Therefore, I recommend defining the response as the revenue generated from treating the customer with a promotion.

Measurement of Response

Based on this definition, we can now move on to measuring the treatment effect.

This is also where the first challenge lies. A purchase can occur as a causal effect of a treatment, but it can also occur in a “natural” way. For example, imagine that the customer orders two months after receiving an ad. Would you consider this order to be a causal result of the advertising? In my experience, there are two ways to solve this problem:

Response with treatment reference

One way is to directly attribute the response to a promotional treatment. This can be done by tracking the underlying treatment via voucher codes or advertising-specific item numbers. If such a causal assignment is possible, one can speak of a response with treatment reference.

If the advertising material contains a voucher with no or a very low minimum order value, in my experience almost all customers redeem it. With promotional item numbers, however, it is much more difficult to make a causal assignment, as many customers also search for the items directly on the web. This may vary depending on the specific business.

Illustration for response with treatment reference (Image by Author)
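As a sketch, attribution via voucher codes could look like the following with pandas. All table layouts, column names, and values here are made up for illustration:

```python
import pandas as pd

# Hypothetical order and campaign tables -- names and values are assumptions.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "revenue": [50.0, 20.0, 80.0, 30.0],
    "voucher_code": ["SPRING24", None, "SPRING24", None],
})
campaigns = pd.DataFrame({
    "voucher_code": ["SPRING24"],
    "campaign_id": ["C1"],
})

# Only orders carrying a campaign voucher count as response with
# treatment reference; all other orders are ignored.
attributed = orders.merge(campaigns, on="voucher_code", how="inner")
response = attributed.groupby("customer_id")["revenue"].sum()
```

Customer 3 never redeems a voucher, so no response is attributed, even if the ad in fact triggered a purchase on the regular item number.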

Response without treatment reference

The other way is to attribute sales to the promotion for a specific period of time, such as 30 days after an advertising campaign. This can be described as the response without treatment reference.

Determining the optimal time period requires careful consideration — the evaluation period should be consistent for all advertising materials and should not overlap with subsequent treatments. It should also capture the majority of plausible responses while being kept as short as possible.

Illustration for response without treatment reference (Image by Author)
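A minimal pandas sketch of such a window-based attribution. The 30-day window, the mailing date, and all column names are assumptions:

```python
import pandas as pd

# Hypothetical data -- mailing date and window length are assumptions.
mailing_date = pd.Timestamp("2024-03-01")
window_days = 30

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2024-03-10", "2024-05-15", "2024-03-25"]),
    "revenue": [40.0, 25.0, 60.0],
})

# Count only revenue that falls inside the evaluation window.
in_window = orders["order_date"].between(
    mailing_date, mailing_date + pd.Timedelta(days=window_days)
)
response = orders.loc[in_window].groupby("customer_id")["revenue"].sum()
```

The order placed in May falls outside the window and is therefore not counted as a response, regardless of whether the mailing actually caused it.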

Neither method is perfect. On the one hand, there are customers who consciously or unconsciously conceal the treatment reference (e.g., by not redeeming the voucher or not using the given item number). On the other hand, not every purchase is causally triggered by the treatment, even if it falls within the specified period.

In my opinion, whenever a large proportion of the reactions can be causally attributed to an advertising medium, the revenue with treatment reference should be used. This makes model creation easier and requires less data to deliver good results.

Choose your Approach

Once you have decided on the definition of the label (dependent variable) and created some independent variables, you can start modeling. Depending on how the label looks, different models are needed to predict customer response.

As mentioned in the introduction, I prefer the order value as the label. However, it is also possible to predict only the purchase probability, so both label types are considered in the overview.

ITE = Individual Treatment Effect / CATE = Conditional Average Treatment Effect (Table by Author)

The table above shows multiple methods to model the treatment effect. As you can see, different label definitions require different modeling approaches.

In this article, I will focus on methods for predicting treatment-related responses where a causal relationship can be established (see table above: “With treatment reference”). Uplift modeling is a separate topic that requires its own article. I will link it here as soon as it is published. Until then, I am happy to refer you to Shelby Temple:

Accordingly, this article will only deal with predicting sales that are directly attributable to an advertising campaign. This approach is sufficient for most use cases, especially since uplift modeling requires control groups that have not received any advertising material.

The Dataset

Datasets with multiple periods

To create an effective model, it’s helpful to use data from multiple past promotions. It’s best if these campaigns cover a full year, so the model can understand how different types of advertising material and seasonal changes affect results.

Historical data may not always be needed for customer attributes that the promotion cannot change, such as age and gender. For other attributes, however, such as how long ago the customer last made a purchase, it is important to ensure that the treatment did not affect the independent variables (e.g., by including the purchase made because of the advertisement in a recency variable).

Unfortunately, a single independent variable that is changed by the customer's reaction can be enough to distort the entire model (a form of target leakage). In that case, the model's results in production are significantly worse than they appear in testing. To avoid this, it is advisable to separate the pre-promotion data from the post-promotion responses before any variables are created.
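A minimal sketch of this separation with pandas. The cutoff date, table layout, and column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical transaction log -- a strict cutoff at the mailing date
# prevents the response from leaking into the features.
mailing_date = pd.Timestamp("2024-03-01")

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2024-01-15", "2024-03-05", "2024-02-20"]),
    "revenue": [30.0, 45.0, 55.0],
})

# Features may only use data from BEFORE the mailing ...
pre = transactions[transactions["order_date"] < mailing_date]
features = pre.groupby("customer_id").agg(
    past_revenue=("revenue", "sum"),
    last_order=("order_date", "max"),
)

# ... while the label is built exclusively from data AFTER it.
post = transactions[transactions["order_date"] >= mailing_date]
label = post.groupby("customer_id")["revenue"].sum()
```

Customer 1's March purchase ends up only in the label, never in the recency or revenue features.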

Zero-Inflated Datasets

The next challenge in modeling customer response is the presence of a lot of zero values in the dataset. This is because many customers do not respond to promotional materials and, therefore, do not make a purchase. (For comparison, a response rate of 10% is considered very good for printed advertising material.)

To better illustrate this, I am using Kevin Hillstrom’s dataset included in the Scikit-Uplift package for this article and in the accompanying notebook.

The dataset contains multiple independent variables and the response to an email marketing campaign. The distribution of responses to the email campaign can be displayed as a histogram and looks like this:

Histogram of the response in Hillstrom's dataset. (Image by Author)

Such datasets are usually referred to as zero-inflated datasets. They can be handled in two different ways: Oversampling or Modeling.

For oversampling, I recommend SMOTER (Synthetic Minority Oversampling Technique for Regression) or SMOGN (Synthetic Minority Oversampling Technique for Regression with Gaussian Noise). These algorithms generate synthetic samples of the rare non-zero responses, which can make zero-inflated datasets easier to model.

Modeling

My favorite approach, however, is not to oversample, but to adjust the model according to the data.

Some regressors can handle zero-inflated datasets by default, such as tree-based regressions like a random forest regressor. However, experience has shown that this is not the best way to build such a model.

On the one hand, this is due to the rather low explainability, and on the other hand, to the poorer performance, especially if the dataset contains a large number of independent variables.

Therefore, I usually use a two-step model. Here, the dataset is decomposed into a classification problem and a regression problem.

Example for the use of a two-step model to predict a zero-inflated Dataset (Image by Author)

The advantage of this approach is that the regression no longer has to learn a dataset that includes zero values, since the zero/non-zero classification is handled by its own estimator.

Classification:
The classification component predicts the probability that a data point is zero (i.e., no purchase). The classifier is trained on the complete dataset. (Oversampling can still be useful here.)

Regression:
The regression is trained only on data points that are greater than zero. This way, the regression does not have to deal with zero inflation, and a large selection of regressors can be chosen.

Models using similar approaches are often called response-spend meta-learners (comparable to frequency/severity models in the insurance industry). They typically perform better and offer more flexibility in the choice of models.

The results of the two models can be combined relatively easily by forming the expected purchase value from the probability of purchase and the sales value at purchase.

Example of the function to form the expected value (Image by Author)

As a refresher on the topic of expected values, I recommend reading this article.
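The two-step approach can be sketched with scikit-learn on synthetic zero-inflated data. The data-generating process below is made up for illustration; the model pair (logistic plus linear regression) matches the one used in the accompanying notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic zero-inflated data: ~10% responders, rest zero revenue.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
buys = rng.random(1000) < 0.1
y = np.where(buys, 50 + 10 * X[:, 0] + rng.normal(size=1000), 0.0)

# Step 1: classifier predicts the purchase probability.
clf = LogisticRegression().fit(X, (y > 0).astype(int))
p_buy = clf.predict_proba(X)[:, 1]

# Step 2: regression trained only on responders (y > 0),
# so it never sees the zero inflation.
reg = LinearRegression().fit(X[y > 0], y[y > 0])
spend = reg.predict(X)

# Expected response = P(purchase) * E[revenue | purchase].
expected = p_buy * spend
```

Both estimators can be swapped for any other classifier/regressor pair; only the final multiplication step is fixed.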

Based on the procedure presented here, I have also created a short notebook to serve as a proof of concept. Here, I compare the performance of a random forest regressor to that of the meta-learner, which is fitted with both a linear and logistic regression.

You can find it here: Corresponding Notebook

Evaluation

The meta-learner can be evaluated either as a regression, or the components (regression and classification) can be assessed individually.

In practice, however, if the model's performance is not being communicated to data scientists, it is usually better to avoid metrics such as RMSE or MAE and instead answer the question: 'How well can the model distinguish good customers from bad ones?'

To demonstrate this, I sort the customers by descending predicted quality (from the customer with the highest prediction to the customer with the lowest prediction) and plot the actual or theoretical sales achieved in a campaign. (Note: The campaign used for evaluation must not be included in the training dataset).

Example of a good recipient selection (Synthetic Data, Image by Author)

With a good selection model, the customers with the best forecasts should also generate very high revenue, which should then drop off quickly and approach zero.

Conclusion

In this article, I have shared my experiences using machine learning for customer selection. However, it is important to note that when customer response is not sufficiently measurable, a different approach needs to be taken.

To address this, my next article will extend the procedure presented here to predict uplift (Individual Treatment Effect/Conditional Average Treatment Effect).

Otherwise, I am glad that you have made it this far and would like to thank you for your attention. If you have any questions or suggestions, please feel free to leave a comment.

Dataset:

The dataset is from Kevin Hillstrom’s blog “MineThatData” and is used in many Python packages and scientific publications. In my article, I refer to the implementation of the dataset in the Python package “Scikit-Uplift”, which was released under the MIT license.

Resources:

McCrary, M. Enhanced customer targeting with multi-stage models: Predicting customer sales and profit in the retail industry. J Target Meas Anal Mark 17, 273–295 (2009). https://doi.org/10.1057/jt.2009.22

Torgo, L., Ribeiro, R.P., Pfahringer, B., Branco, P. (2013). SMOTE for Regression. In: Correia, L., Reis, L.P., Cascalho, J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science(), vol 8154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33

Torgo, L., Ribeiro, R.P., Branco, P. (2017). SMOGN: a Pre-processing Approach for Imbalanced Regression. In: Proceedings of Machine Learning Research 2017.


German Business Analyst and Data Scientist with focus on retail companies