From Prediction to Action — How to Learn Optimal Policies From Data (3/4)

Rama Ramakrishnan
Towards Data Science
5 min read · Aug 6, 2021



In Part 1, we discussed the need to learn optimal policies from data. Policy optimization covers a vast range of practical situations and we briefly looked at examples from healthcare, churn prevention, target marketing and city government.

In Part 2, we described how to create a dataset so that it is suited for policy optimization.

In this post, we describe a simple (and, in my opinion, magical) way to use such a dataset to estimate the outcome of any policy.

In Part 4, we will learn how to use such a dataset to find an optimal policy.

Even when you have the ability to find optimal policies, it is very useful to be able to quickly estimate the outcome of any policy.

For instance, in business settings there’s always interest in simple policies that apply the same action to every customer. These policies are easy to understand and to execute, so business leaders naturally prefer them. To help decide if they should go with a simple policy or not, they may want to know how much outcome they are “giving up” by using a simple policy instead of a complex one. If the drop in expected outcome isn’t that much, we may as well go with the simple policy, right?

(BTW, simple policies are everywhere. In clinical trials, for example, we are interested in the difference in average outcomes between two simple policies: (1) give everyone the medicine, and (2) give everyone the placebo.)

Let’s return to the Netflix customer churn example we have been working with in this blog series.

Suppose your boss comes to you with an idea for a new policy:

If a customer binge-watched a show in the past month, “do nothing”.

Otherwise, if the customer’s viewing hours last month are 10–30% less than the previous month, offer “2 more devices”. If the customer’s viewing hours last month dropped by more than 30% from the previous month, offer “20% off”.

If none of these criteria apply, “do nothing”.
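To make this concrete, here is a minimal sketch of the policy as a Python function. The field names (binge_watched, hours_last_month, hours_prev_month) are placeholders I'm assuming for illustration, not columns from the actual dataset:

```python
def boss_policy(customer):
    """Return the action the proposed policy assigns to one customer.

    `customer` is any dict-like row with (assumed) fields:
    binge_watched (bool), hours_last_month (float), hours_prev_month (float).
    """
    if customer["binge_watched"]:
        return "do nothing"

    # Fractional drop in viewing hours relative to the previous month
    # (assumes hours_prev_month > 0)
    drop = 1.0 - customer["hours_last_month"] / customer["hours_prev_month"]

    if drop > 0.30:
        return "20% off"
    if 0.10 <= drop <= 0.30:
        return "2 more devices"
    return "do nothing"
```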

Sure, you can do an experiment to estimate the average outcome of this policy: take a random sample of customers and run the policy on them, and measure the outcomes. But this will take time and effort.

There’s a faster way. You can use the dataset we prepared in Part 2 to quickly estimate the average outcome of your boss’s brainwave without running an experiment.

The key is a magical thing called the Horvitz-Thompson Estimator, and here's how it works.

Let’s start with the dataset assembled in Part 2.

For every customer in this dataset, determine the action specified by the proposed policy (in practice, this should be a straightforward SQL query or something similar).
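For instance, if the Part 2 dataset is loaded as a pandas DataFrame (the file name and column names below are placeholders of my own, not the ones used in the series), this step can reuse the boss_policy sketch from above:

```python
import pandas as pd

# One row per customer: the features the policy needs, the action that was
# actually assigned, the probability with which it was assigned, and the
# observed outcome. File name and column names are placeholders.
df = pd.read_csv("part2_dataset.csv")

# The action the proposed policy would take for each customer
df["policy_action"] = df.apply(boss_policy, axis=1)
```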

Next, add an “Adjusted Outcome” column to the dataset.

Now for the clever part: for every customer where the actual action that was assigned matches the action specified by the new policy, divide the outcome by the probability with which that action was assigned and stick the result in the adjusted outcome column.

For example, the actual action that was assigned and the action specified by the new policy are both “20% off” for the third customer, so we take the outcome 1.0 and divide by the probability of being assigned the “20% off” action (which is 92%) to get an adjusted outcome of 1.0/0.92 ≈ 1.1 (and similarly for the other three rows where the actions match).

For all the other rows in the dataset (i.e. where the actions don’t match), assign 0.0 as the adjusted outcome.

The “Adjusted Outcome” column is complete. Now, simply take the average of this column and voila! You have an estimate of the average outcome of your boss’s policy!
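In code, the whole calculation is only a few lines. Continuing the earlier sketch (with assumed placeholder columns assigned_action, assigned_prob and outcome for the action actually taken, its assignment probability, and the observed outcome):

```python
import numpy as np

# Adjusted outcome: outcome / assignment probability where the assigned
# action matches the proposed policy's action, and 0.0 everywhere else
match = df["assigned_action"] == df["policy_action"]
df["adjusted_outcome"] = np.where(
    match, df["outcome"] / df["assigned_prob"], 0.0
)

# Horvitz-Thompson estimate: the simple average of the adjusted outcomes
ht_estimate = df["adjusted_outcome"].mean()
print(f"Estimated average outcome of the proposed policy: {ht_estimate:.3f}")
```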

A couple of quick notes:

  • We are dividing by probabilities to calculate the adjusted outcome. Now you understand why I insisted in Part 2 that every probability in the data collection policy needs to be non-zero.
  • In the same vein, it is advisable to make sure the probabilities in your data collection policy aren’t too small. Dividing outcomes by very small probabilities will inflate the adjusted outcome numbers and make the estimate less reliable.
  • There is a ‘normalized’ version of the Horvitz-Thompson Estimator called the Hajek Estimator that has better statistical properties. Instead of taking the simple average of the ‘Adjusted Outcome’ column, we divide the sum of the column by the sum of the reciprocals of the probabilities used to adjust the outcomes (i.e., divide by 1/0.92 + 1/0.45 + 1/0.22 + … + 1/0.56). See Section 5.1 of Yang et al. (2020) for more; there is a short code sketch right after this list.
  • I want to reiterate a caveat from Part 2: if the training dataset was assembled not via a randomized experiment but from historical observational data, the customers in the dataset will not be a random sample, and that can seriously bias the estimated outcome. So take those numbers with a “pinch of salt”.
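Continuing the same sketch, the Hajek variant mentioned above only changes the denominator:

```python
# Hajek (self-normalized) estimate: divide the sum of the adjusted outcomes
# by the sum of 1/probability over the rows where the actions match,
# instead of by the total number of customers
weights = np.where(match, 1.0 / df["assigned_prob"], 0.0)
hajek_estimate = df["adjusted_outcome"].sum() / weights.sum()
```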

Given this ability to evaluate any policy, you can search for a good policy by brainstorming ideas for new policies, estimating how good they are, and trying to improve them.

Or, you can directly find an optimal policy. In Part 4, we will do exactly that.

For the mathematically curious, here's my attempt at a clearer version of the proof on the Wikipedia page that the Horvitz-Thompson Estimator gives an unbiased estimate of the outcome of any policy.
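In compressed form (with notation I'm introducing here): suppose customer $i$ has features $x_i$, was assigned action $a_i$ with known probability $p(a_i \mid x_i) > 0$, and produced outcome $y_i$. Write $y_i(a)$ for the outcome customer $i$ would produce under action $a$, and let $\pi$ be the new policy. The Horvitz-Thompson estimate is

$$\hat{V}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbf{1}\{a_i = \pi(x_i)\}\, y_i}{p(a_i \mid x_i)}.$$

Taking the expectation of a single term over the randomness in the assigned action gives

$$\mathbb{E}\!\left[\frac{\mathbf{1}\{a_i = \pi(x_i)\}\, y_i}{p(a_i \mid x_i)}\right] = \sum_{a} p(a \mid x_i)\, \frac{\mathbf{1}\{a = \pi(x_i)\}\, y_i(a)}{p(a \mid x_i)} = y_i\bigl(\pi(x_i)\bigr),$$

which is exactly the outcome customer $i$ would produce under the new policy. Averaging over customers, the expected value of $\hat{V}(\pi)$ is the average outcome of the new policy, so the estimator is unbiased.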
