How to Double A/B Testing Speed with CUPED

Microsoft’s variance reduction technique that’s becoming an industry standard.

Michael Berk
Towards Data Science

--

Controlled-experiment Using Pre-Experiment Data (CUPED) is a variance reduction technique created by Microsoft in 2013. Since then, it has been implemented at Netflix, Booking.com, BBC, and many others.

In short, CUPED uses pre-experiment data to control for natural variation in an experiment’s north star metric. By removing natural variation, we can run statistical tests that require a smaller sample size. CUPED can be added to virtually any A/B testing framework; it’s computationally efficient and fairly straightforward to code.

Finally, it’s pronounced CUE-PED.

Variance reduction impact on statistical significance in an A/B test.

Technical TLDR

1) Pick a pre-experiment covariate (X). The covariate should be highly correlated with the experiment’s north star metric (Y) and should not be impacted by experiment treatments. Often, the best covariate is the north star metric prior to the experiment period.

2) Calculate the CUPED-adjusted metric (Ŷ) for each experimental condition. So, instead of using a metric average (Y) to calculate lift, we use the CUPED-adjusted metric: Ŷ.

Y_HAT = avg(Y) - (cov(X,Y)/var(X)) * avg(X)

3) Calculate the CUPED-adjusted variance for each experimental condition.

VAR_Y_HAT = var(Y) * (1 - corr(X,Y)**2)

4) Use the CUPED-adjusted metrics. In the section below, we’ll walk through using the CUPED-adjusted metrics in a t-test.
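
Here is a minimal Python sketch of these four steps (the simulated data and variable names are illustrative, not from the article):

import numpy as np
from scipy import stats

def cuped_adjust(y, x):
    """Return per-user CUPED-adjusted values Y - theta * X,
    where theta = cov(X, Y) / var(X) within the condition."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * x

# Illustrative simulated data: X is the pre-experiment metric,
# Y is the in-experiment metric, and the treatment adds a small lift.
rng = np.random.default_rng(0)
n = 10_000
x_control = rng.normal(10, 3, n)
x_treatment = rng.normal(10, 3, n)
y_control = x_control + rng.normal(0, 2, n)
y_treatment = x_treatment + rng.normal(0.1, 2, n)

# Steps 2-3: per-user adjusted values; their mean is the CUPED-adjusted
# metric from step 2 and their sample variance is the adjusted variance from step 3.
y_control_adj = cuped_adjust(y_control, x_control)
y_treatment_adj = cuped_adjust(y_treatment, x_treatment)

# Step 4: Welch's t-test, with and without the adjustment.
t_raw, p_raw = stats.ttest_ind(y_treatment, y_control, equal_var=False)
t_cuped, p_cuped = stats.ttest_ind(y_treatment_adj, y_control_adj, equal_var=False)
print(f"unadjusted: t = {t_raw:.2f}, p = {p_raw:.4f}")
print(f"CUPED:      t = {t_cuped:.2f}, p = {p_cuped:.4f}")

With X and Y correlated at roughly 0.8 in this simulation, the CUPED p-value should come out noticeably smaller than the unadjusted one.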

But, how does CUPED actually work?

Ok, let’s pump the brakes a bit and try to understand what’s happening.

A/B Testing Lift Correspondence with P-Value

When running a classical A/B test, there are 3 components that impact our ability to determine statistical significance: sample size (n), standard deviation (σ), and lift (Δ), which is just the difference between the treatment mean and the control mean.

T-statistic formula (for equal variances and n users per group):

t = Δ / (σ * sqrt(2/n))

If we’re looking to find the most extreme t-statistic, our goal would be to increase n and Δ while also decreasing σ. Unfortunately, these values are often assumed to be fixed; we determine sample size at the beginning of an experiment and cross our fingers that the treatment is impactful and thereby has a large lift.

Standard deviation, on the other hand, is a characteristic of the north star metric we pick. Great, so switch out a high variance metric for a lower variance metric? That could work, but CUPED provides an alternative. By using data prior to the experiment, CUPED is able to control for variation inherent to our metric and remove it.

The team at Microsoft identified two methods that leverage concepts from Monte Carlo sampling to reduce variance. Here’s the key point for both: any pre-experiment data is independent of the experiment and therefore can be leveraged to reduce variance. We know this to be true because when we randomize users into control and treatment groups, we can assume that both experimental groups have identical characteristics i.e. all confounding variables are evenly distributed between those groups.

So, because we can pick any covariate that will not be systematically impacted by the treatment, we would ideally find the one that minimizes the variance of Y. Unsurprisingly, often this covariate is just Y prior to the experiment.
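
If several pre-experiment covariates are available, a quick helper like the (hypothetical) one below can compare them by their correlation with Y and pick the strongest:

import numpy as np

def pick_covariate(y, candidates):
    """Pick the candidate covariate most correlated with the metric Y.
    `candidates` maps a name to an array of pre-experiment values,
    one entry per user, aligned with `y`."""
    corrs = {name: abs(np.corrcoef(x, y)[0, 1]) for name, x in candidates.items()}
    best = max(corrs, key=corrs.get)
    return best, corrs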

Computing the Adjusted Metric Value and Variance

Ok, now that we have some background, let’s understand how to calculate CUPED.

CUPED-adjusted metric formula:

Ŷ_i = Y_i - θ * X_i

For each user in each treatment, we will calculate their CUPED-adjusted value where…

  • Ŷi (y-hat) is the CUPED-adjusted metric,
  • Yi is the north star metric,
  • θ (theta) is a constant with a value of cov(X,Y)/var(X), where X and Y correspond to all users in a given treatment,
  • Xi is the covariate, and
  • i is the subscript corresponding to a given user.

That’s it. Pretty straightforward right?

You may notice that this equation looks a lot like linear regression. As it turns out, we are effectively doing the same thing as Ordinary Least Squares (OLS) regression; θ’s optimal value is the same as that of an OLS regression coefficient: cov(X,Y)/var(X).
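
As a quick sanity check (a sketch, not from the article), the θ above matches the slope of an ordinary least-squares fit of Y on X:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10, 3, 5_000)             # pre-experiment covariate
y = 0.8 * x + rng.normal(0, 2, 5_000)    # metric linearly related to x

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
ols_slope = np.polyfit(x, y, 1)[0]       # slope of the least-squares line y ~ x
print(theta, ols_slope)                  # the two agree up to floating-point error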

After doing some more work, we can show that the variance of the adjusted metric is VAR_Y_HAT = var(Y) * (1 - corr(X,Y)**2), the same formula given in step 3 of the Technical TLDR.

Do you see why a high correlation between X and Y would lead to the greatest reduction in variance?
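
As a concrete (illustrative) example: with corr(X, Y) = 0.7, the adjusted variance is var(Y) * (1 - 0.49), roughly half of var(Y). Since the sample size needed to detect a fixed lift scales linearly with variance, a correlation around 0.7 cuts the required number of users (or the experiment's runtime) roughly in half, which is one way to read the "double A/B testing speed" claim in the title.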

Now that we have the CUPED-adjusted metric and the variance of that metric, we can run a t-test and hopefully observe a smaller p-value. And, if you’re calculating by hand, note that the denominator must account for our treatment and control having different variances; this is just Welch’s t-test, and it’s not that bad of a calculation.

t-test formula with CUPED-adjusted metrics:

t = (Y_HAT_treatment - Y_HAT_control) / sqrt(VAR_Y_HAT_treatment / n_treatment + VAR_Y_HAT_control / n_control)

Final Point: Missing Data

CUPED works beautifully until we don’t have pre-experiment data for our users. For instance, we could be running an experiment on new users who have never visited our site before. When this is the case, the simplest solution is to just use the unadjusted metric for those users. If all of your users are first-time users, you may want to select a different pre-experiment X, but that’s up to you.
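
Here is one way to implement that fallback, assuming missing pre-experiment values are stored as NaN (a sketch using the mean-centered form of the adjustment, so that imputed users are effectively left unadjusted):

import numpy as np

def cuped_adjust_with_missing(y, x):
    """CUPED adjustment when some users lack pre-experiment data (NaN in x).
    Missing covariate values are imputed with the observed mean; in the
    mean-centered form Y - theta * (X - mean(X)), those users receive an
    adjustment of zero, i.e., they keep their unadjusted metric."""
    x = np.where(np.isnan(x), np.nanmean(x), x)
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())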

Implementation Notes (Microsoft 2013)

  • In general, the optimal pre-experiment window is 1–2 weeks. A shorter window doesn’t capture enough variance and a longer window captures noise.
  • For longer experiments, a longer pre-experiment window is needed to ensure that the same users are observed both during and prior to the experiment.
  • CUPED performance is highly dependent on the metric; a metric with high variance across the user population will perform well.
  • Make sure that your covariate X will be evenly distributed between treatment and control. If it’s systematically impacted by a treatment, CUPED becomes invalid.
  • The CUPED method above only removes variance that can be accounted for linearly. Non-linear methods will be discussed in future posts.

Check out the comments for a link to the paper and other CUPED resources.
