How to Reduce A/B Testing Duration using Surrogate Metrics

A method developed by LinkedIn to approximate long-term metrics

Michael Berk
Towards Data Science

--

When running A/B tests, we often try to improve long-term metrics. However, properly measuring a long-term metric's impact requires a long experiment duration.

To tackle this problem, researchers at LinkedIn published a 2019 paper that outlines a method for replacing long-term metrics with predictions using short-term covariates.

The prediction method is left to the user; however, the paper outlines requirements that ensure statistical validity. Relative to other methods covered in this series, surrogate metrics are quite labor intensive because you must develop a robust forecasting model. That being said, the hard work pays off: surrogate metrics are one of the most robust options for running A/B tests with long-term metrics.

Here’s how it works…

Technical TLDR

  1. Define your north star metric. North star metrics define the success/failure of an experiment.
  2. Develop a surrogate metric by predicting the north star. The features used for this prediction must be observable within the experiment's timeframe.
  3. Use the surrogate metric in your experiment. Note that when calculating statistical significance, we have to adjust the variance to account for the prediction error of our model.

But, what’s actually going on?

Let’s slow down a bit and try to understand how surrogate metrics actually work.

When should we use surrogate metrics?

Surrogate metrics are really useful when our north star metric is something that takes a long time to measure. For instance, let’s look at lifetime value (LTV), which is the total revenue we expect to receive from a user over their lifetime.

Now, some businesses have very short user lifetimes, but in most cases a user's lifetime makes for a prohibitively long experiment duration. So, we turn to surrogate metrics.

How are surrogate metrics calculated?

As noted above, a surrogate metric is a prediction of the north star metric using predictors observed during our experiment.

Figure 1: Illustration of the difference in time requirement between a predicted surrogate metric and the summed LTV. Image by author.

As illustrated in Figure 1, by simply using the short-term features on the left (in blue), we can forecast an approximation of LTV. The prediction method is up to the creator; however, unlike in most prediction problems, small prediction intervals are especially important here. Why?

Well, when conducting experiments we use our metric's variance to determine the "acceptable" range of values under the null hypothesis. If our observed treatment effect falls outside that range, we reject the null hypothesis and conclude that our treatment had a statistically significant impact.

The larger our prediction error, the larger our "acceptable" range, leading to a low-powered test. In short, a tighter prediction interval increases our ability to detect statistically significant changes.

So, when implementing a surrogate metric, try to use a feature set and algorithm that produce a narrow prediction interval around the estimate.
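
To make this concrete, here's a minimal sketch of fitting a surrogate model and inspecting its prediction-interval width. The feature names and data are hypothetical, and the paper doesn't prescribe a model; plain OLS via statsmodels is used here simply because it provides prediction intervals out of the box.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical short-term covariates, observable during the experiment window.
    rng = np.random.default_rng(0)
    n = 1_000
    df = pd.DataFrame({
        "week1_revenue": rng.gamma(2.0, 10.0, n),
        "week1_sessions": rng.poisson(5, n),
    })
    # Hypothetical long-term north star (LTV), known only for historical users.
    df["ltv"] = 3 * df["week1_revenue"] + 8 * df["week1_sessions"] + rng.normal(0, 20, n)

    # Fit the surrogate model on historical data.
    X = sm.add_constant(df[["week1_revenue", "week1_sessions"]])
    fit = sm.OLS(df["ltv"], X).fit()

    # The surrogate metric is the predicted LTV; also check interval width.
    pred = fit.get_prediction(X).summary_frame(alpha=0.05)
    df["surrogate_ltv"] = pred["mean"]
    width = (pred["obs_ci_upper"] - pred["obs_ci_lower"]).mean()
    print(f"R^2 = {fit.rsquared:.2f}, mean 95% prediction-interval width = {width:.1f}")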

How should you create a surrogate metric?

There’s surrogate metrics in a nutshell, but talk about the requirements of the model as well as some best-practices.

To have a statistically "correct" surrogate metric, that metric must be the only piece of information required to determine the north star metric. In other words, there cannot be missing information in our model that would improve the fit.

This requirement is called the Prentice criterion and is mathematically defined by the following equation:

Figure 2: The Prentice criterion formula. Image by author.

Here, Y is our north star metric and y is a particular value it can take. The probability on the left represents the likelihood that the north star equals y, conditioned on both our surrogate metric S and treatment assignment W. The probability on the right represents the same likelihood, but without conditioning on the treatment.
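
Written out (a reconstruction from the description above, since the original formula appears as an image), the criterion is:

    P(Y = y \mid S, W) = P(Y = y \mid S)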

So, if this equation holds, Y is independent of whether the user is in treatment or control, given the surrogate. And if that's the case, the surrogate metric is the only thing required to predict Y.

In practice it's near impossible to meet this criterion exactly; however, the closer we are to equality, the more statistically valid the results.

3 Tips for picking a metric

In addition to the Prentice criterion, there are three other key concepts that lead to an effective surrogate metric:

First, it’s important to predict the surrogate metric well. The researchers at LinkedIn cited an R² value of 0.69 as effective, but the definition of a good prediction depends on your use case.

Figure 3: Illustration of an effective prediction. Image by author.

Second, as noted above, precise surrogate metrics are more effective than imprecise ones. So, not only do we care about high accuracy, we also care about how precise our estimate is. Precision can be measured by the width of our prediction intervals: the smaller, the better.

Figure 4: Illustration of wide vs. narrow prediction intervals. Image by author.

Third, our estimates would ideally be interpretable. Experiments are used to inform business decisions, so if we can’t understand why something works it’s harder to pitch that decision and develop related ideas.

If you're able to roughly meet the Prentice criterion and follow the above three tips, you should be able to develop an effective surrogate metric.

Adjustments to Improve your Results

Now that we know the requirements for our surrogate metric model, let’s understand how to use this metric. The researchers at LinkedIn suggest two adjustments.

The first adjustment increases the variance used in our statistical significance calculation to account for variance due to prediction error:

Figure 5: Formula for the adjusted variance of our north star metric when using a surrogate. Image by author.

This adjusted variance comes into play when calculating statistical significance. Below, we use the corrected variance in the formula for a t-statistic:

Figure 6: Formula for the t-statistic of the north star metric using a surrogate. Image by author.

In Figure 6, the numerator is the average lift of our surrogate metric, i.e., the difference between the surrogate metric's mean in treatment and in control. The denominator is the square root of the adjusted variance, where Var(μ_s) is the variance of the mean and σ_e² is the variance of the error in our prediction. Note that underscore (_) denotes subscript.
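
Pieced together from that description (the paper's exact notation may differ), the t-statistic in Figure 6 reads roughly as:

    t = \frac{\mu_{S,\mathrm{treatment}} - \mu_{S,\mathrm{control}}}{\sqrt{\mathrm{Var}(\mu_S) + \sigma_e^2}}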

Note that if we were using a perfect surrogate metric prediction, i.e., there were no prediction error (σ_e² = 0), the equation simplifies to the usual t-statistic formula.
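
And here's a minimal numeric sketch of that calculation. It assumes, per the description above, that the adjusted variance is the variance of the mean lift plus the prediction-error variance; sigma_e2 is an estimate of σ_e² taken from your surrogate model's residuals, and all names are illustrative.

    import numpy as np

    def surrogate_t_stat(s_treat, s_ctrl, sigma_e2):
        # Average lift in the surrogate metric: treatment mean minus control mean.
        lift = s_treat.mean() - s_ctrl.mean()
        # Variance of the mean lift (two-sample variance of the difference in means).
        var_mean = s_treat.var(ddof=1) / len(s_treat) + s_ctrl.var(ddof=1) / len(s_ctrl)
        # Inflate by the prediction-error variance; sigma_e2 = 0 recovers the usual t-stat.
        return lift / np.sqrt(var_mean + sigma_e2)

    rng = np.random.default_rng(1)
    t = surrogate_t_stat(rng.normal(10.5, 2.0, 5_000), rng.normal(10.0, 2.0, 5_000), sigma_e2=0.5)
    print(f"adjusted t-statistic = {t:.2f}")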

The second adjustment is optional, but highly recommended. In the above adjustment we greatly increased our "acceptable" range by increasing the variance of our metric. So, to regain statistical power and remove some of this variance, it's suggested that you employ a variance reduction technique.

Variance reduction comes in many forms, but the method cited in the paper is called CUPED. In one line, CUPED uses pre-experiment data to account for natural variation in the data and remove it, thereby reducing our variance.
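
In code, CUPED is only a few lines. Below is a minimal sketch that uses each user's pre-experiment value of the metric as the covariate; the variable names are illustrative:

    import numpy as np

    def cuped_adjust(metric, pre_metric):
        # theta = Cov(metric, covariate) / Var(covariate) minimizes the adjusted variance.
        theta = np.cov(metric, pre_metric)[0, 1] / np.var(pre_metric, ddof=1)
        # Remove the variation explained by the pre-experiment covariate.
        return metric - theta * (pre_metric - pre_metric.mean())

    # Example: the adjusted metric keeps the same mean but has lower variance.
    rng = np.random.default_rng(2)
    pre = rng.normal(100, 15, 10_000)
    post = pre * 0.8 + rng.normal(0, 5, 10_000)
    adjusted = cuped_adjust(post, pre)
    print(f"variance ratio = {adjusted.var() / post.var():.2f}")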

LinkedIn Case Study

LinkedIn was facing a similar problem: determining whether a feature helped people get jobs, a pipeline that can last months. They implemented a surrogate metric that leverages application data to forecast whether a user will get a job. The predicted and true values showed an R² of 0.69.

Prior to correcting the variance, they saw that 30 out of 203 experiments were statistically significant. After correcting the variance, only 2 were stat. sig. However, with a CUPED variance reduction technique, they were able to find 10 stat. sig. experiments.

Implementation Notes

  • The authors did not look into how the length of the experiment improves predictive accuracy/precision. You’d expect that more data leads to better predictions, but these improvements are highly dependent on the subject at hand.
  • CUPED is effective for variance reduction outside of this method and is referenced by Google, Netflix, and Airbnb.
  • The effectiveness of this method depends on whether you can predict the north star metric. If you can develop an accurate and precise model, the method will work well. If not, it will be ineffective.
  • The authors note that it's often helpful to develop separate surrogate metric predictions for different user groups and platforms. Because variance in our prediction is especially harmful, it's sometimes easiest to cut our data into groups and model each individually.

Thanks for reading! I’ll be writing 42 more posts that bring “academic” research to the DS industry. Check out my comments for links/ideas on developing surrogate metrics.
