From Prediction to Action — How to Learn Optimal Policies From Data (1/4)

Rama Ramakrishnan
Towards Data Science
7 min read · Jun 10, 2021


Which way to go?
Photo by Vladislav Babienko on Unsplash

In data science work, we build models to make predictions. And using these predictions, we make decisions or take action.

Sometimes, the relationship between prediction and action is straightforward:

  • you are deciding, as you leave home for work, whether to take an umbrella. You predict the chance of rain. If it is high enough, you take the umbrella; if not, you leave it at home (source)
  • you are trying to decide whether the person in an image is a particular person, X. If the predicted probability is high enough, you tag the image with that person’s name; if not, you leave it untagged.

Other times, the relationship between prediction and action is more complicated.

Let’s say you are a data scientist working for a subscription business like Netflix. You want to prevent customers from cancelling their monthly subscription by offering them incentives to stay.

This is the well-known churn problem, and a standard Data Science/Machine Learning approach would be as follows.

First, collect historical data on customers who cancelled and customers who didn’t. Use their attributes, their interactions with the business, etc. to form a “feature vector” for each customer. Add a flag that indicates whether the customer cancelled, and create a dataset with one row per customer: the feature vector x and the cancelled/renewed flag.

With this dataset, build a model (e.g., logistic regression) that, given a customer’s feature vector, predicts the probability that they will cancel.

This may be a very complex model. It may have hundreds or thousands of features for each customer (i.e., each x may be a really long vector) and may have been built using state-of-the-art methods (e.g., Deep Learning, XGBoost).
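To ground the setup, here is a minimal sketch of this standard churn model in Python. The feature columns, the toy data, and the choice of scikit-learn’s LogisticRegression are illustrative assumptions (a real model would have far more features), not details of any particular production pipeline:

```python
# A minimal sketch of the standard churn model: predict P(cancel) from features.
# The columns and numbers below are made up for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical data: one row per customer.
df = pd.DataFrame({
    "tenure_months":   [3, 26, 14, 7, 48, 2],
    "monthly_visits":  [1, 20, 8, 2, 15, 0],
    "support_tickets": [4, 0, 1, 3, 1, 5],
    "cancelled":       [1, 0, 0, 1, 0, 1],   # 1 = cancelled, 0 = renewed
})

X = df.drop(columns="cancelled")   # the feature vector x for each customer
y = df["cancelled"]                # the churn flag

churn_model = LogisticRegression().fit(X, y)

# Predicted probability of cancelling for a new customer.
new_customer = pd.DataFrame(
    {"tenure_months": [5], "monthly_visits": [2], "support_tickets": [2]}
)
p_cancel = churn_model.predict_proba(new_customer)[:, 1][0]
print(round(p_cancel, 2))
```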

But what should you do with such a model?

Should you contact customers with a high predicted probability of churning and offer them a reward or discount if they renew?

This is a reasonable approach and is widely used. It will work for some customers but maybe not for all customers (reference).

  • Some customers will ignore the reward and cancel anyway.
  • Others may not have been thinking of leaving at all but will gladly pocket the reward and stay — you just wasted the reward on them.
  • For some others who may not have been thinking of leaving, your reward may remind them that they have not been using your product/service all that much, and they might click the ‘cancel’ button.

And if you have more than one possible action, it gets even more complicated. Customers may react differently to different actions. Customer Sally may renew if offered a “get 20% off your monthly bill if you renew for a year” reward but not if offered a “stream 2 more devices for free” reward. For customer Albert, it may be the opposite.

Imagine if you could predict the probability of churn when a customer is targeted with each possible action you could take, including the action of “do nothing”.

Of course, estimating the probability of churn for each action is just the starting point. In a business setting, you may ultimately care about a financial outcome — for example, the expected profit from each customer over the next 12 months. If you know the cost/revenue impact of each action, you can transform the churn-probability model above into one that predicts the expected profit for each customer under each action.

With this model in hand, for any customer, you can predict the expected outcome for each action and “read off” the best action for that customer (i.e., find the maximum of 3 numbers!).
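To make the “read off the best action” step concrete, here is a small sketch. All of the numbers (the per-action churn probabilities, the 12-month profit if the customer stays, and the cost of each incentive) are made up, and the simple profit formula is an assumption for illustration:

```python
# Turning per-action churn probabilities into expected profit and picking the
# best action. Every number here is hypothetical.
profit_if_stays = 240.0                      # assumed 12-month profit if the customer renews
incentive_cost = {"do_nothing": 0.0,         # assumed cost of each action
                  "20_pct_off": 48.0,
                  "extra_devices": 15.0}

# Predicted probability of cancelling under each action (from the per-action model).
p_cancel = {"do_nothing": 0.45, "20_pct_off": 0.20, "extra_devices": 0.35}

# Expected profit for an action = P(stay) * profit - cost of the incentive
# (rounded to cents for readability).
expected_profit = {a: round((1 - p_cancel[a]) * profit_if_stays - incentive_cost[a], 2)
                   for a in p_cancel}

best_action = max(expected_profit, key=expected_profit.get)
print(expected_profit)   # {'do_nothing': 132.0, '20_pct_off': 144.0, 'extra_devices': 141.0}
print(best_action)       # 20_pct_off, the best of the 3 numbers
```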

We will learn later how to build a model like the one above. But first, let’s step back and abstract out the key elements of the churn problem.

  • a set of customers (I am using ‘customer’ broadly to refer to an entity that you want to apply an action to. For instance, it could be a patient, a website visitor, a prospect, a business, a geographic region, …you get the idea). All the relevant information about the customer (attributes, behavior, and other context) is represented as a vector of features.
  • a set of possible actions
  • the outcome to you when a customer is targeted with an action (importantly, different customers may respond differently to actions)
  • your objective: for each customer, find the action with the best outcome

Let’s pause for a couple of quick definitions.

A function that assigns an action to each feature vector is called a policy.

A function that assigns the best action to each feature vector is an optimal policy.

We want to learn optimal policies.
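In symbols (my notation, not the article’s), with x a customer’s feature vector and A the set of possible actions:

```latex
% A policy maps each feature vector to an action; the optimal policy maps each
% feature vector to the action with the best expected outcome.
\pi : \mathcal{X} \to \mathcal{A}
\qquad\text{and}\qquad
\pi^{*}(x) \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}\left[\,\text{outcome} \mid x,\, a\,\right]
```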

Policy optimization problems are extremely common.

  • Personalized medicine: For a medical condition, there are 4 possible treatments. The effectiveness of each treatment may vary across patients, depending on each patient’s characteristics and medical history. What’s the optimal treatment policy, i.e., which treatment should we “apply” to each patient so as to maximize positive health outcomes?
  • Campaign Targeting: For e-commerce visitors who abandon a shopping cart, we want to email them one of 3 offers: 20% off, free shipping, or a gift with purchase. Different shoppers will respond differently to these offers. What’s the optimal targeting policy, i.e., which offer should we “apply” to each shopper to maximize expected profit? (The churn problem described above is also an instance of campaign targeting.)
  • Health/Safety Monitoring: Cities check business establishments for compliance with health and safety regulations by sending inspectors. Not every building can be checked in every time period with the available staff, so we need an optimal inspection policy: for each establishment, pick one of two actions — inspect this time period, don’t inspect this time period — so as to maximize overall safety. It is not just a matter of predicting which buildings are more likely to have violations; how the building owner responds, i.e., whether a violation found by an inspector actually gets fixed, matters too.

A building may be at higher risk of fire due to old wiring, but other considerations make it difficult to replace the wiring. Other units may have lower predicted risk, but it may be easy and inexpensive to make substantial improvements. Another consideration is responsiveness; if violations entail fines, some firms may be more sensitive to the prospect of fines than others.

Source: Overview article by Prof. Susan Athey of Stanford, which I recommend reading in its entirety.

OK, how do we learn optimal policies?

Let’s look at perhaps the most straightforward situation first.

Returning to the churn example, let’s imagine that we have data from a randomized experiment: a sample of customers were randomly chosen and then randomly assigned to one of the three actions — 20% off, stream two more devices, and do nothing — and their renewal/cancel response was logged.

In this lucky situation, we have three nice datasets, one per action: each contains the feature vectors of the customers who received that action and whether they renewed or cancelled.

(Please note that the customers in these three groups are different. I am using the same x symbols for the customer feature vectors in all three datasets just for convenience.)

We can now build three standard classification models, one per action, each predicting the probability of cancellation from the customer’s feature vector.

With these three models in hand, finding the optimal policy is easy: run any customer’s x through each model, get the three predicted cancellation probabilities, translate them to expected profit (or whatever financial outcome you care about), and pick the action with the highest expected profit.
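Here is a sketch of that recipe under the randomized-experiment assumption above. The column names, the classifier, and the profit/cost numbers are carried over from the earlier illustrative sketches and are assumptions, not prescriptions:

```python
# Three per-action cancellation models from a randomized experiment, and the
# resulting policy. Columns, costs, and the choice of classifier are illustrative.
from sklearn.ensemble import GradientBoostingClassifier

ACTIONS = ["do_nothing", "20_pct_off", "extra_devices"]
FEATURES = ["tenure_months", "monthly_visits", "support_tickets"]   # hypothetical
PROFIT_IF_STAYS = 240.0                                             # assumed 12-month profit
INCENTIVE_COST = {"do_nothing": 0.0, "20_pct_off": 48.0, "extra_devices": 15.0}

def fit_per_action_models(experiment_df):
    """Fit one cancellation model per action from the experiment log.
    experiment_df: one row per customer, with the FEATURES columns, an 'action'
    column (the arm the customer was randomly assigned to), and a 0/1 'cancelled' column."""
    models = {}
    for a in ACTIONS:
        subset = experiment_df[experiment_df["action"] == a]
        models[a] = GradientBoostingClassifier().fit(subset[FEATURES], subset["cancelled"])
    return models

def optimal_action(customer_row, models):
    """Run one customer (a one-row DataFrame) through all three models, convert
    each cancellation probability to expected profit, and pick the best action
    (literally the maximum of three numbers)."""
    expected_profit = {}
    for a in ACTIONS:
        p_cancel = models[a].predict_proba(customer_row[FEATURES])[:, 1][0]
        expected_profit[a] = (1 - p_cancel) * PROFIT_IF_STAYS - INCENTIVE_COST[a]
    return max(expected_profit, key=expected_profit.get)
```

In practice you would also want a train/validation split and probability calibration; and when the log is not from a randomized experiment, this naive approach breaks down, which is exactly the complication discussed next.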

Nice. But as I mentioned earlier, this is the most straightforward situation.

In practice, we may not be so fortunate.

What if there is data, but it is not from a randomized experiment that you can safely use for modeling? Maybe there’s only observational data from the past, where offers were made not to a random sample of customers but (for instance) only to those thought likely to cancel. Building a model on this data and applying it to all customers may lead to bad predictions.

What if an action — say, stream two more devices for free — has never been tried before, so there’s no data on how it affects cancellation behavior? Creative businesspeople are always coming up with new actions/tactics/strategies to try, so this is a real possibility.

We need to handle these and other issues carefully.

But not to worry :-). This problem and its variants have been studied extensively in many fields — Causal Inference, Econometrics, Uplift Modeling in Marketing, and (especially) Contextual Bandits/Reinforcement Learning — and many practical and elegant approaches have been devised. These techniques typically combine prediction models (classification/regression) with smart training-data collection via randomized experiments.

In my opinion, policy evaluation and policy optimization using these techniques is a powerful but under-appreciated and under-utilized data science superpower.

In Parts 2–3–4, we will dive in and acquire this superpower :-)

  • In Part 2, we will describe how to create a dataset that is suited for policy learning.
  • In Part 3, we will learn a simple (and, in my opinion, magical) way to estimate the outcome of any policy.
  • In Part 4, we will learn how to find an optimal policy.
