From Prediction to Action — How to Learn Optimal Policies From Data (4/4)

In this final post of the series, we will learn how to find an optimal policy.

Rama Ramakrishnan
Towards Data Science


Photo by Jens Lelie on Unsplash

Recap:

  • In Part 1, we discussed the need to learn optimal policies from data. Policy optimization covers a vast range of practical situations and we briefly looked at examples from healthcare, churn prevention, target marketing and city government.
  • In Part 2, we described how to create a dataset so that it is suited for policy optimization.
  • In Part 3, we presented a simple way to use such a dataset to estimate the outcome of any policy.

At the outset, it is worth noting that researchers in many fields — causal inference, econometrics, and contextual bandits — have worked on this problem for many years and have devised numerous algorithms to solve it — doubly-robust methods, direct methods, transformed-outcome methods, double machine learning, the X-Learner, the R-Learner and so on. Take a look at Section 3 of this recent survey to see what I mean.

Given this embarrassment of riches, I was faced with the difficult question of what to focus on in this post. But as luck would have it, one of the research teams that have contributed significantly to this area organized “A Contextual Bandit Bake-off” in 2018, in which many of the key algorithms competed against each other.

A relatively simple approach called Importance-Weighted Regression (IWR) performed admirably in the competition, and I describe it below. IWR directly leverages our knowledge of how to build regression models, is intuitive, and can be easily implemented using any supervised regression method of your choice (if you are interested in the underlying theory, please see Practical Contextual Bandits with Regression Oracles).

In Part 2, we assembled a dataset like this:

Image by Author

Or, to be more concrete, let’s use the Netflix churn example with 3 actions: offer 20% off, offer free streaming on 2 more devices, do nothing.

Image by Author

Our objective is to learn a function from this dataset that assigns the best action to each customer, i.e., we want to learn an optimal policy.

Learning an optimal policy using Importance Weighted Regression (IWR) is a three-step process.

Step 1: Split the dataset above based on the actual action that was assigned.

Per-Action Dataset (Image by Author)

You will now have as many datasets as the number of actions.
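If your logged data lives in a pandas DataFrame, this split is a one-liner. Here is a minimal sketch; the column names and rows are made-up stand-ins for the dataset you assembled in Part 2:

```python
import pandas as pd

# Hypothetical logged dataset from Part 2 (made-up rows for illustration):
# one row per customer, with the customer's features, the action that was
# actually assigned, the probability with which it was assigned, and the
# observed outcome.
logged = pd.DataFrame({
    "watched_documentary": [1, 0, 1, 0, 1, 0],
    "action": ["20% off", "free streaming", "do nothing",
               "20% off", "free streaming", "do nothing"],
    "probability": [0.2, 0.3, 0.5, 0.5, 0.3, 0.2],
    "outcome": [1, 0, 0, 1, 1, 0],   # e.g., 1 = customer renewed, 0 = churned
})

# Step 1: one dataset per action
per_action = {action: rows for action, rows in logged.groupby("action")}
print({action: len(rows) for action, rows in per_action.items()})
```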

Step 2: With each dataset, build a regression model that predicts the outcome from the customer feature vector x.

Crucially, you need to use weighted squared loss as the loss function, where the weights are the reciprocals of the numbers in the probability column.

Regression models typically default to minimizing squared loss, so that part is standard. The only extra thing you have to do is pass in a weight for each data point in the training set.

Image by Author

Let’s write this down algebraically so that there’s no ambiguity.
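For each action $a$, we fit a model $\hat{f}_a$ by minimizing the weighted squared loss over just the customers who were actually assigned action $a$:

$$
\hat{f}_a \;=\; \arg\min_{f} \sum_{i \,:\, a_i = a} \frac{1}{p_i}\,\bigl(y_i - f(x_i)\bigr)^2
$$

where $x_i$ is customer $i$'s feature vector, $y_i$ is the observed outcome, and $p_i$ is the entry in the probability column, i.e., the probability with which customer $i$ was assigned action $a_i$.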

That’s it.

(BTW, weighted regression is commonly supported out of the box. In scikit-learn, for instance, linear regression, gradient boosting regression and random forest regression all support weights in a simple and uniform way, using the sample_weight argument of the fit function.

From https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Just set sample_weight to the reciprocals of the numbers in the probability column when you call the fit function.)
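Continuing the sketch from Step 1, Step 2 might then look like this (GradientBoostingRegressor is one choice among many; any scikit-learn regressor that accepts sample_weight would work the same way):

```python
from sklearn.ensemble import GradientBoostingRegressor

feature_cols = ["watched_documentary"]   # the customer feature columns

# Step 2: one weighted regression model per action, with weights equal to
# the reciprocals of the logging probabilities.
models = {}
for action, rows in per_action.items():
    model = GradientBoostingRegressor()
    model.fit(
        rows[feature_cols],
        rows["outcome"],
        sample_weight=1.0 / rows["probability"],
    )
    models[action] = model
```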

At the end of this step, you will have built as many regression models as there are actions.

Step 3: For any customer feature vector x, run x through each of the models to get the predicted outcome for that model's action.

Image by Author

Now, pick the action with the highest predicted outcome.

You have learned an optimal policy!
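And continuing the same sketch, Step 3 amounts to scoring a customer with every per-action model and taking the argmax:

```python
import pandas as pd

def best_action(customer: dict) -> str:
    """Run one customer's feature vector through every per-action model
    and return the action with the highest predicted outcome."""
    x = pd.DataFrame([customer])[feature_cols]
    predicted = {action: model.predict(x)[0] for action, model in models.items()}
    return max(predicted, key=predicted.get)

print(best_action({"watched_documentary": 1}))
```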

As you saw above, the only new thing in IWR (compared to traditional regression) is the use of the reciprocals of the probabilities as weights.

What’s the intuition behind this use of probabilities?

For simplicity, imagine that your customer feature vector x has just one element: a binary variable that is 1 if the customer has ever watched a documentary on Netflix and 0 if they haven't.

After consulting with business stakeholders, let’s say you have come up with the following data collection policy (defined in Part 2):

Data Collection Policy (Image by Author)

Let’s focus on the “20% off” column.

Image by Author

Let’s say we choose a random sample of N customers for data collection and documentary watchers represent a fraction p of the sample, so approximately Np customers in our sample will be documentary watchers and N(1-p) customers won’t be.

Image by Author

Given this, the approximate number of customers in the “20% off” dataset will be:

Image by Author

(i.e., of the Np customers in our sample who watch documentaries, 20% will be assigned “20% off” according to our data collection policy, so we get N × p × 0.2; likewise, of the N(1-p) non-watchers, 50% will be assigned “20% off”, giving N × (1-p) × 0.5)

But we now have a problem. The mix of x=1 and x=0 customers in the “20% off” column is different from the overall mix of x=1 and x=0 customers in the sample. If we build a regression model with this skewed dataset, its prediction of the outcome for the “20% off” action may be biased.

More generally, if you plan to use a model across your entire customer base, the data points you use to build it should have x distributed the same way as in your customer base.

Returning to our example, how can we fix this issue?

One approach is to “make copies” of the points in such a way that the mix of x=1 and x=0 points in the “20% off” column matches the overall mix of x=1 and x=0 customers in the sample.

Let’s say we use a multiplier of u for the x=0 points and v for the x=1 points. After the multiplication (i.e. copying) is done, the new number of points is shown in the rightmost column below.

Image by Author

How should we choose u and v so that the adjusted numbers of customers in the “20% off” column match the numbers in the overall sample?

Image by Author

Simple. Just set the multipliers to the reciprocals of the probabilities!

Set u = 1/0.5 and v = 1/0.2.

Image by Author

The multipliers and the probabilities cancel out nicely and the adjusted numbers match the overall numbers!
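Concretely, with the “20% off” probabilities used above (0.5 for x=0 and 0.2 for x=1), the adjusted counts become:

$$
N(1-p) \times 0.5 \times u \;=\; N(1-p) \times 0.5 \times \frac{1}{0.5} \;=\; N(1-p),
\qquad
Np \times 0.2 \times v \;=\; Np \times 0.2 \times \frac{1}{0.2} \;=\; Np.
$$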

(In practice, we don't make copies of the data points. Instead, we simply plug the reciprocals of the probabilities into the loss function as weights, which has the same effect.)

To summarize, using the reciprocal of the probabilities as weights in the regression undoes the skewing of the data points caused by our data collection policy.

If this reciprocal-of-probabilities business reminded you of how the Horvitz-Thompson Estimator from Part 3 works, your instincts are spot on! In fact, we can use a similar argument to prove that IWR works. If you are curious, an algebraic proof is at the bottom of the post.

(BTW, this idea of using probability reciprocals to make adjustments is a simple example of a very general and powerful technique called importance sampling that is widely used in many fields)

You have learned an optimal policy using IWR and are ready to put it to work.

I recommend one last tweak before you launch.

The world keeps changing, and the customer behavior patterns your optimal policy has captured will gradually go stale. A policy that's optimal today may not be optimal next quarter.

Of course, you can periodically gather new data using a data collection policy and learn a new optimal policy with that data. But there’s a better way.

Add a bit of randomization to your optimal policy and then launch (see the discussion on epsilon-greedy in Part 2 to refresh your memory on how to do this).

By doing so, your optimal policy doubles as a data collection policy. Every so often, you can take the data it collects, re-learn an optimal policy using IWR, and switch to that policy (again launching it with a bit of randomization added).
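As a rough sketch of what that randomization might look like in code, here is epsilon-greedy layered on top of the best_action function from earlier (the value of epsilon is an arbitrary illustrative choice):

```python
import random

ACTIONS = ["20% off", "free streaming", "do nothing"]
EPSILON = 0.05   # exploration rate -- an arbitrary illustrative choice

def epsilon_greedy_action(customer: dict) -> tuple[str, float]:
    """Mostly follow the learned policy, but explore a little.
    Returns the chosen action and the probability with which it was chosen,
    so the resulting logs can feed the next round of IWR."""
    greedy = best_action(customer)
    if random.random() < EPSILON:
        chosen = random.choice(ACTIONS)
    else:
        chosen = greedy
    # Probability of the chosen action under this randomized policy
    if chosen == greedy:
        prob = (1 - EPSILON) + EPSILON / len(ACTIONS)
    else:
        prob = EPSILON / len(ACTIONS)
    return chosen, prob
```

Logging the returned probability alongside the chosen action is what makes the next round of IWR possible.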

In this post, we have described policy optimization as a batch cycle: run a data-collection policy for some time, use IWR on the collected data to learn an optimal policy, deploy the optimal policy with a bit of randomization added, and repeat every so often.

But this can be done in real time. In the extreme case, we can do this after every interaction (i.e., after every x → action → outcome is observed) by incrementally updating the IWR model corresponding to the action that was assigned. See here for more.
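Here is a minimal sketch of that incremental version, assuming scikit-learn's SGDRegressor (whose partial_fit method accepts sample_weight) and a purely numeric feature vector:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

ACTIONS = ["20% off", "free streaming", "do nothing"]

# One online regression model per action
online_models = {action: SGDRegressor() for action in ACTIONS}

def update(x: np.ndarray, action: str, outcome: float, prob: float) -> None:
    """After each interaction, update only the model for the action that was
    actually assigned, weighted by the reciprocal of its assignment probability."""
    online_models[action].partial_fit(
        x.reshape(1, -1), [outcome], sample_weight=[1.0 / prob]
    )

# e.g., a customer with x = [1] was offered "20% off" with probability 0.2
# and the observed outcome was 1 (they renewed)
update(np.array([1.0]), "20% off", 1.0, 0.2)
```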

Before we conclude, let’s consider a very important special case: when we have just two actions to choose from. Often, one of these is the active thing (e.g., give medicine X to the patient, offer a discount to the customer) and the other one is the passive thing (e.g., give the placebo, do nothing).

In causal inference and econometrics in particular, the two-action (or two-treatment) scenario is extremely common, since oftentimes you want to estimate the effect of an intervention (aka action or treatment) on an existing system, compared to not intervening.

In this special case, you can actually learn an optimal policy with just one model that directly predicts the difference in outcomes between the two actions. In some situations, this may be more effective than learning two separate models. To learn more, see section 3 of this recent survey and follow the references therein.
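One common way to build that single model (the “transformed-outcome” idea mentioned at the start of this post, sketched here under the assumption that action 1 is the active treatment and action 0 the passive one) is to regress on the transformed target

$$
z_i \;=\; \frac{a_i\,y_i}{p_i} \;-\; \frac{(1 - a_i)\,y_i}{1 - p_i},
$$

where $p_i$ is the probability that customer $i$ was assigned the active action under the data collection policy. In expectation, $z_i$ equals the difference between the expected outcomes under the two actions for customers like $i$, so a single regression of $z$ on $x$ predicts that difference directly; assign the active action whenever the prediction is positive.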

In this 4-part series, I have tried to extract the most practical elements of policy optimization from a vast and rapidly growing body of research. There’s tons of really interesting material — here are a few recommendations if you want to learn more.

Happy learning!

Optional

For the curious, here’s an algebraic proof of how IWR works.

Image by Author
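In a nutshell, the key identity is that importance weighting makes the expected loss of action $a$'s model look as if every customer had been assigned action $a$:

$$
\mathbb{E}\!\left[\frac{\mathbf{1}\{A = a\}}{P(A = a \mid X)}\,\bigl(Y - f(X)\bigr)^2\right]
\;=\;
\mathbb{E}_X\Bigl[\,\mathbb{E}\bigl[(Y - f(X))^2 \mid X,\, A = a\bigr]\Bigr],
$$

so minimizing the weighted squared loss on the logged data is, in expectation, the same as minimizing ordinary squared loss on a dataset in which everyone received action $a$, which is exactly what the per-action models in Step 2 are meant to approximate.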
