From Prediction to Action — How to Learn Optimal Policies From Data (2/4)

Rama Ramakrishnan
Towards Data Science
Jul 21, 2021 · 12 min read


Photo by Beth Macdonald on Unsplash

In Part 1, we discussed the need to learn optimal policies from data. Policy optimization covers a vast range of practical situations (e.g., arguably, every personalization problem is a policy optimization problem) and we briefly looked at examples from healthcare, churn prevention, target marketing and city government.

To learn optimal policies, we need to assemble the right data. In this post, we will describe how to create a dataset so that it is suited for policy learning.

With the right dataset at hand:

  • In Part 3, we will learn a simple (and, in my opinion, magical) way to estimate the outcome of any policy.
  • In Part 4, we will learn how to find an optimal policy.

Let’s recap the problem definition from Part 1. You have:

  • a set of customers (I am using ‘customer’ broadly to refer to an entity that you want to apply an action to. For instance, it could be a patient, a website visitor, a prospect, a business, a geographic region, …you get the idea). All the relevant information about the customer’s attributes, behavior and other contextual information is represented as a vector of features.
  • a set of possible actions
  • the outcome to you when a customer is targeted with an action (importantly, different customers may respond differently to actions)
  • Your objective: for each customer, find the action with the best outcome. Equivalently stated, learn a function that assigns the best action to each customer feature vector. This is called an optimal policy.

Strategies for Dataset Creation

Let’s start with perhaps the most important thing to remember about dataset creation for policy optimization:

The ideal dataset for policy learning comes from a randomized experiment.

If you have the freedom to run such an experiment, great!

But if you can’t run a randomized experiment, don’t despair — if you have access to past observational data, you can still create a training dataset from it.

We will cover both scenarios in detail in this post, starting with the happy scenario where you can run a randomized experiment. We will see that there is much more flexibility here than may be apparent at first glance.

Dataset Creation via a Randomized Experiment

Creating a training dataset by means of a randomized experiment is a 3-step process.

  • Step 1: select a random sample of customers from the customer base.
  • Step 2: decide how to randomly assign actions to each customer in the sample.
  • Step 3: run the experiment and collect data on outcomes.

Step 1 is straightforward so let’s turn to Step 2.

Referring back to the Netflix churn management example from Part 1, recall that we considered three actions: “20% off for next year”, “stream two more devices” and “do nothing”.

For the moment, let’s assume that we have only two actions, “stream 2 more devices” or “do nothing”.

What’s the simplest way to randomly assign one of these two actions to every customer in the sample? For every customer, toss a fair coin. If it comes up heads, assign “2 more devices” and if it comes up tails, assign “do nothing”.

This “scheme” can be described as follows:

More generally, if we have N actions, we can make every action equally likely for every customer. For the 3-action Netflix example, our scheme will look like this:
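To make this concrete, here is a minimal Python sketch of the equal-probability scheme (the action names and `n_customers` are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative action set for the 3-action Netflix example
actions = ["20% off for next year", "stream two more devices", "do nothing"]
n_customers = 5  # size of the experimental sample (illustrative)

# Every action gets probability 1/N for every customer
uniform_policy = pd.DataFrame(
    np.full((n_customers, len(actions)), 1.0 / len(actions)),
    columns=actions,
)
print(uniform_policy)
```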

Here’s where it gets interesting.

It turns out that we can choose any probability for each customer-action combination, as long as two conditions are satisfied:

  • For any customer, the probabilities of all the actions must add up to 1.0 (i.e., the sum of the probabilities in each row must be 1.0). This just ensures that exactly one action is assigned to each customer.
  • Every probability must be non-zero. This ensures that every action has a chance of being assigned to a customer.

If we work with probabilities satisfying these two conditions, the resulting datasets can be safely used for policy evaluation and policy optimization.
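As a quick sanity check, here is a small sketch (the helper name `is_valid_policy` is mine, not standard) that verifies the two conditions for a table of customer-action probabilities:

```python
import numpy as np

def is_valid_policy(prob_table, tol=1e-9):
    """Return True if every row sums to 1.0 and every probability is strictly positive."""
    prob_table = np.asarray(prob_table, dtype=float)
    rows_sum_to_one = np.allclose(prob_table.sum(axis=1), 1.0, atol=tol)
    all_positive = np.all(prob_table > 0.0)
    return bool(rows_sum_to_one and all_positive)

print(is_valid_policy([[1/3, 1/3, 1/3], [0.7, 0.15, 0.15]]))  # True
print(is_valid_policy([[1.0, 0.0, 0.0]]))                     # False: zero probabilities
```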

Here’s an example of probabilities that satisfy the two conditions:

I want to draw your attention to two things:

  • Notice that the probabilities are different for different customers. This is totally fine, since it doesn’t violate the two conditions above. In fact, this ability to choose different probabilities for different customers gives us enormous flexibility, as we will see shortly.
  • Notice that the table above is a kind of policy: given x, instead of picking a specific action, you pick an action randomly according to the probabilities defined for that x. A policy that satisfies the two conditions above is called a behavior policy in the Reinforcement Learning literature, but I will call it a data collection policy (since I find the latter name easier to remember).

The freedom to choose different probabilities for different customers can be very useful in practice.

A common situation is when we don’t want to waste an “expensive” action (e.g., a 50% discount!) on customers who are unlikely to need it.

Concretely, let’s say we want to apply the “20% off for next year” action only to customers with a high probability of cancellation, and not to others, since we suspect that customers at low risk may just pocket the reward and we’d have wasted an expensive discount on them. But we don’t care how often “stream two more devices” and “do nothing” are applied.

We can easily accommodate this business requirement in our data collection policy. Here’s one way.

First, we use historical data to build a model to predict the probability of cancellation for every customer.

You can use any classification approach of your choice to build this model as long as it provides a probability prediction rather than just a 0–1 class prediction. Run this model on your customer sample to get this:
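For example, here is a hedged sketch with scikit-learn; the synthetic data stands in for your historical features and a 0/1 “cancelled” label, and logistic regression is just one convenient choice of classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for historical customer features and a 0/1 "cancelled" label
X_hist, cancelled = make_classification(n_samples=1000, n_features=5, random_state=0)

# Any classifier that outputs probabilities (not just 0-1 labels) will do
churn_model = LogisticRegression(max_iter=1000).fit(X_hist, cancelled)

# Predicted probability of cancellation for each customer in the experimental sample
X_sample = X_hist[:5]  # placeholder for your experimental sample
p_cancel = churn_model.predict_proba(X_sample)[:, 1]
print(p_cancel)
```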

Now, consider a customer in the sample whose predicted probability of cancellation is p. For this customer, set the probability of being assigned the “20% off for next year” action to p. Take the remaining probability (i.e., 1- p) and split it equally among the other actions: “stream two more devices” gets a probability of (1-p)/2 and “do nothing” gets (1-p)/2.

It is easy to confirm that this is a valid data collection policy: all three numbers are non-zero and, of course, they add up to 1.0. (For now, I am assuming that the prediction model doesn’t output probabilities that are exactly 0.0 or 1.0; I will revisit this assumption later in the post.)

Applying this logic, we get this data collection policy:

The probability of cancellation for the first customer is 70% so “20% off” gets a probability of 70%, “stream two more devices” gets 15% and “do nothing” gets 15%.

You can see that customers with a low predicted probability of cancellation (e.g., the last customer in the example) have a low chance of getting the expensive “20% off” discount, which is what we wanted.
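Continuing the sketch, the data collection policy can be built directly from the predicted cancellation probabilities (the `p_cancel` values below are illustrative):

```python
import numpy as np
import pandas as pd

p_cancel = np.array([0.70, 0.40, 0.05])  # illustrative predicted cancellation probabilities

risk_based_policy = pd.DataFrame({
    "20% off for next year":   p_cancel,             # expensive action tied to churn risk
    "stream two more devices": (1 - p_cancel) / 2,   # remaining probability split equally
    "do nothing":              (1 - p_cancel) / 2,
})

# Each row sums to 1.0 and, as long as 0 < p_cancel < 1, every entry is non-zero
print(risk_based_policy)
print(risk_based_policy.sum(axis=1))
```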

This is just one example of how we can flexibly define a data collection policy to accommodate business considerations. Feel free to use your creativity and knowledge of the business to define other data collection policies. Just make sure that the probabilities are non-zero and that they add up to 1.0 for every customer in the sample.

Once a valid data collection policy has been specified, we reach Step 3: running the experiment and measuring the outcomes.

This step is straightforward:

For each customer in your sample, randomly select an action according to that customer’s probabilities, apply the selected action to that customer, and measure the outcome.

If you know how to do random sampling, picking an action according to the probabilities is easy and there are many ways to do it. For a customer with 70%-15%-15% probabilities for the three actions, for example, generate a uniform random number between 0 and 1: if the value falls in 0.0–0.70, pick “20% off for next year”; if it falls in 0.70–0.85, pick “stream two more devices”; and if it falls in 0.85–1.0, pick “do nothing”.
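Here is a minimal NumPy sketch of both approaches, using the 70%-15%-15% customer from the example (the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
actions = ["20% off for next year", "stream two more devices", "do nothing"]
probs = [0.70, 0.15, 0.15]  # this customer's data collection probabilities

# Option 1: let NumPy sample the action directly
chosen = rng.choice(actions, p=probs)

# Option 2: the uniform-random-number trick described above
u = rng.uniform()                      # uniform random number in [0, 1)
cutoffs = np.cumsum(probs)             # [0.70, 0.85, 1.00]
chosen_manual = actions[int(np.searchsorted(cutoffs, u))]

print(chosen, "|", chosen_manual)
```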

When the experiment concludes, the final data will look like this.

Note that we have appended two more columns: the actual action that was assigned to each customer, and the value of the outcome. For simplicity, I am showing a binary outcome, but you can use anything (e.g., next year’s profit).
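As a rough sketch of that layout (all column names and values below are made up purely for illustration):

```python
import pandas as pd

# Each row: customer features, the data collection probabilities,
# the action actually assigned, and the observed outcome
logged_data = pd.DataFrame({
    "x1":            [0.3, 1.2, 0.8],     # customer features (illustrative)
    "x2":            [5, 2, 7],
    "p_20_off":      [0.70, 0.10, 0.33],  # probability of each action for this customer
    "p_two_devices": [0.15, 0.45, 0.33],
    "p_do_nothing":  [0.15, 0.45, 0.34],
    "action_taken":  ["20% off for next year", "do nothing", "stream two more devices"],
    "outcome":       [1, 0, 1],           # e.g., 1 = renewed, 0 = cancelled
})
print(logged_data)
```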

Dataset Creation from Observational Data

Let’s now consider the situation where it is difficult for you to run a randomized experiment to collect data. But we can still build a dataset that can be used for policy evaluation and learning, IF data on actions applied to customers in the past and the resulting outcomes are available (see the table below).

There are two significant caveats we need to keep in mind here:

  • The customers for whom we have historical data on actions and outcomes may not be a random sample of the customer base.
  • While you can see which action was assigned to each customer, there’s no information on how exactly the action was chosen for that customer. A random data collection policy may or may not have been used — we don’t know.

Because of the lack of randomization described above, the results of policy evaluation (to be covered in Part 3) and policy optimization (to be covered in Part 4) must be taken “with a pinch of salt”, i.e., with some measure of skepticism.

With that warning out of the way, let’s look at how to build a training dataset from observational data. The basic idea is to “impute” the customer-action probabilities from the historical dataset, i.e., to fill in the missing values in this table:

How? By using the actual action that was assigned as a class label …

… and building a multi-class classification model with the above dataset!

Now we can fill in the table with the predicted probabilities from the model …

… and our training dataset is ready.
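A hedged sketch of that imputation step with scikit-learn (the historical features and assigned actions below are synthetic stand-ins, and a random forest is just one possible choice of multi-class classifier):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for historical data: customer features and the action each customer received
rng = np.random.default_rng(0)
X_hist = rng.normal(size=(1000, 5))
actions_taken = rng.choice(
    ["20% off for next year", "stream two more devices", "do nothing"], size=1000
)

# Multi-class classifier: features in, assigned action as the class label
propensity_model = RandomForestClassifier(n_estimators=200, random_state=0)
propensity_model.fit(X_hist, actions_taken)

# Imputed customer-action probabilities, one column per action (see propensity_model.classes_)
imputed_probs = propensity_model.predict_proba(X_hist)
print(propensity_model.classes_)
print(imputed_probs[:3])
```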

You may be thinking, “Not so fast. What if the predictive model returns probabilities that are exactly zero for certain customer-action combinations? Wouldn’t that violate the no-zero-probabilities condition we specified early on?”

You are absolutely right: it would violate that condition, and we can’t use those probabilities.

Before we look at how to address this issue, it is worth pointing out that this is not uncommon in practice. In many business situations, “common-sense business rules” decide who gets which action, and the resulting historical data will contain customer-action combinations that never occur, so the imputed probabilities for those combinations will be exactly zero (BTW, you can think of “business rule” as just another name for “policy” in this context).

Consider this example of a business rule:

If a customer binge-watched a show in the past month, “do nothing”.

Otherwise: if the customer’s viewing hours last month are 0–30% less than the previous month, offer “2 more devices”. If the customer’s viewing hours last month dropped by more than 30% from the previous month, offer “20% off”.

Let’s assume that the feature vector x has features that represent binge-watching, % change in viewing hours etc.
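As a sketch, the rule is just a deterministic function from features to actions (the feature names `binge_watched` and `pct_change_hours` are invented for illustration; the rule as stated doesn’t say what to do if viewing hours grew, so I lump that case in with the 0–30% drop):

```python
def business_rule(binge_watched: bool, pct_change_hours: float) -> str:
    """Deterministic policy implementing the business rule above.

    pct_change_hours is last month's viewing hours relative to the previous
    month, e.g. -0.20 means a 20% drop.
    """
    if binge_watched:
        return "do nothing"
    if pct_change_hours >= -0.30:        # dropped by 0-30% (or grew)
        return "stream two more devices"
    return "20% off for next year"       # dropped by more than 30%

print(business_rule(True, 0.10))    # "do nothing"
print(business_rule(False, -0.45))  # "20% off for next year"
```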

If the above rule/policy is in use, historical data will look like the following table (to make the table easier to read, I have grouped the records so that the binge-watchers are together, the 0–30% records are together and so on).

Now, if you run a classification tree algorithm on this historical data using x as the input and the assigned action as the target class label (as explained earlier in the post), it will likely “reverse engineer” your policy from the data and generate a tree like this …

… leading to predicted probabilities that are exactly 0.0 and 1.0.

Since this violates the no-zero-probabilities condition, we can’t use this dataset for policy evaluation and optimization.

What can we do?

First, the bad news: to handle this, we have to run a randomized experiment and collect fresh data.

But there is a possibility of good news. If you can’t run a randomized experiment because you don’t have the infrastructure to do so, I am afraid I don’t have a solution :-(

But if the reason is that

  • your business stakeholders aren’t used to randomized experimentation or
  • they are happy with the current policy and have a “if it ain’t broke, don’t fix it” mindset or
  • they feel that changing the current policy is risky,

you may be able to convince them with the following “gentle” proposal.

Let me run this experiment on a customer sample: For 90% of customers in the sample, we will follow the current policy. For 10% of the customers, we will pick an action randomly.

This popular approach (called epsilon-greedy) may be palatable to the business since at most 10% of customers are treated differently from the current policy.

More formally:

  • set epsilon to a small fraction (e.g., 0.1)
  • for every customer, draw a uniform (0,1) random number. If its value is less than epsilon (e.g., less than 0.1), randomly pick an action with equal probability and assign that action to the customer. If not, use the current policy to decide which action to assign to that customer.

As you can see, a fraction 1 - epsilon of the time (e.g., 90%), this approach will pick the action dictated by the current policy. But a fraction epsilon of the time (e.g., 10%), it will pick one of the three actions with equal probability.

Equivalently, you can think of this as combining two policies with a 90%–10% weighted average…

… to get a new one that satisfies the “no zero probabilities” condition:
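Here is a sketch of that mixture for a single customer (epsilon, the number of actions, and the current policy’s probabilities are illustrative; a deterministic current policy just puts probability 1.0 on one action):

```python
import numpy as np

epsilon = 0.10
n_actions = 3

# Probabilities under the current (here deterministic) policy for one customer
current_policy_probs = np.array([1.0, 0.0, 0.0])

# Mix with the uniform-random policy: with probability epsilon, pick uniformly at random
mixed_probs = (1 - epsilon) * current_policy_probs + epsilon * (1.0 / n_actions)

print(mixed_probs)        # [0.9333... 0.0333... 0.0333...] -- every entry is now non-zero
print(mixed_probs.sum())  # 1.0
```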

By keeping epsilon small, you can assure business stakeholders that you aren’t taking a big risk or incurring a big expense in the name of data collection. This approach can be a good way to “ease” business stakeholders into the experimentation mindset. And as they get more used to it and can tolerate more risk, you can set epsilon to a higher value.

(BTW, what if we have never tried an action historically and therefore have no data about its outcomes? Just add the new action to the “randomly assign an action” branch of epsilon-greedy and start collecting data.)

At this point, using the approaches described above, let’s say you have assembled a dataset like this, with the probabilities all non-zero.

Or, more generally:

This dataset is a very useful thing to have.

In Part 3, we will learn how to use this dataset to quickly estimate the outcome of any policy (i.e., policy evaluation) without running any experiments.

In Part 4, we will learn how to use it to find an optimal policy.
