Don’t Use a T-Test for A/B Testing

How to use multiple linear regression to determine ATE and statistical significance

Michael Berk
Towards Data Science

--

Have you ever wanted to speed up an A/B test? Well, here’s arguably the highest ROI solution for variance reduction in an A/B testing setting.


Frequentist methods are the most common approach to experimentation. However, compared to Bayesian or sequential regimes, frequentist A/B tests often require large sample sizes, which slows down company iteration. Below we will discuss a practical solution that allows us to reach statistical significance in half the time, if not less.

Unlike prior posts, we're not going to review a single paper. Instead, we will bring together research I've covered over the past 8 months, linking relevant resources along the way.

Without further ado, let’s dive in…

Technical TLDR

T-Tests are the simplest frequentist method for determining statistical significance in A/B tests. OLS with the functional form north_star ~ const + treatment_indicator is mathematically identical to a T-Test, but it allows us to include variance-reducing covariates. A famous example is CUPED.

By leveraging a robust causal model instead of a T-Test, we are given the freedom to dramatically reduce variance.

1 — Setup

Let’s slow down a bit and really understand what’s going on. To facilitate communication, we’re going to start with an example.

We work as data scientists at an online app store. Our goal is to determine the best ranking algorithm for app store games (figure 1).

Figure 1: example of an online app store. Image by author.

We can leverage any engagement data, for instance clicks, downloads, rating counts, etc. We can also use a weighted combination of these metrics.

However, to find the best system, we will leverage the gold standard of causal inference: an A/B test. Our control arm will be the default ranking algorithm and each treatment arm will be a new ranking algorithm. Finally, our north star metric, the metric we use as the success criterion for the experiment, will be revenue.

2 — The Baseline: T-Tests

The traditional frequentist A/B testing procedure would leverage a two-sample T-Test to see if our treatment arm shows a statistically significant difference from our control. It’s very simple and efficient to implement, as shown below…

from scipy.stats import ttest_ind
import numpy as np

np.random.seed(1)

mu_c = 5
mu_t = 5 + 0.05  # a 1% lift over control

# Simulate 10,000 revenue observations per arm
c = np.random.normal(loc=mu_c, scale=1, size=10_000)
t = np.random.normal(loc=mu_t, scale=1, size=10_000)

ate = np.mean(t) - np.mean(c)    # 0.04966
percent_lift = ate / np.mean(c)  # 0.0099
p_val = ttest_ind(t, c).pvalue   # 0.00044

In the example above we create a control and treatment array where treatment is, on average, 1% greater than control. This represents a 1% lift, indicating the algorithm used by our treatment arm increases revenue by 1%.

From there, we calculate the percent lift and statistical significance. As shown by the last two lines of code, our simulation returned a lift of 0.99%, very close to the expected 1%. Also note that we observed a p-value of 0.00044, well below the standard 0.05 threshold.

Great! A T-Test works.

Now, while T-Tests are very statistically robust, they leave a lot on the table. Their main shortcoming is that they assume all variance in our metric is truly noise. In the real world, we often call systematic variance "noise" simply because we don't have variables that reliably account for it.

But in most experiments, not all variance is unexplainable. We will use linear regression to control for these systematic trends.

3 — Linear Regression

Linear regression is a much more powerful tool than a basic T-Test. But before we get into it, let’s take a step back.

The hypothesis test we choose is determined by our data-generating mechanism, which here is random assignment to either treatment or control. With random assignment and the central limit theorem, we're guaranteed to meet all the assumptions of a T-Test.

With linear regression, however, we have to be a bit more careful with our covariates. The main issue we must avoid is collinearity between features, especially with our treatment indicator. But that's very doable.

Also note that, unlike with many causal inference models, we don't have to assume the model is correctly specified: given random assignment, we have valid asymptotic inference. So, in plain English, we can use linear regression to determine the ATE and statistical significance.

Let’s see an example.

3.1 — Parity with a T-Test

Below we leverage the statsmodels package to recreate the conclusions from our T-Test…

import statsmodels.api as sm

# 1. Create treatment indicator variable (1 = treat)
is_treat = np.append(np.repeat(1, len(t)), np.repeat(0, len(c)))

# 2. Create independent and dependent vars (indicator + constant)
x = np.array([is_treat, np.repeat(1, len(t) + len(c))]).T
y = np.append(t, c)

# 3. Fit LM
lm = sm.OLS(y, x).fit()

# 4. Observe ATE and stat sig
ate_ols = lm.params[0]   # 0.04966
pval_ols = lm.pvalues[0] # 0.00044
round(ate_ols, 6) == round(ate, 6) # True

We first create a binary treatment dummy variable and a constant. Then we fit our model using both features as predictors against our metric of interest (revenue). Finally, we extract the treatment indicator's coefficient and its p-value.

As we can see, the average treatment effect (ATE) produced by linear regression is identical to the one from a T-Test. Mathematically, the two procedures are the same.

So OLS and a T-Test produce the same results. Why, then, would we use OLS at all?

4 — Multiple Linear Regression

As hinted at by the section title, OLS allows us to use multiple explanatory variables. As we add relevant variables, we theoretically reduce variance and reach statistical significance faster.

Now it's important to reiterate that we must avoid collinear features. This means none of our features can be highly correlated with each other. When they are, not only do the model coefficients behave strangely, but the variance is inflated, which hurts our ability to determine statistical significance.

4.1 — Explanatory Variables

Let's look at another example. In the code below we develop two covariates: seasonal and covariate_random. Note that in the real world, we'd use actual explanatory variables such as day of the week or a user's total revenue.

n = len(c)

# A sinusoidal pattern plus random "high season" spikes
seasonal = np.sin(np.arange(n)) + (np.random.uniform(size=n) > 0.7).astype(int)
covariate_random = np.random.chisquare(5, size=n)

c = c * seasonal + covariate_random
t = t * seasonal + covariate_random

After using both the seasonal and random covariates to transform our treatment and control, we next fit a multiple linear regression. Note that this first regression omits our covariate_random variable. The output can be seen in figure 2…

x = np.array([np.repeat(1, len(t) + len(c)),
              is_treat,
              np.append(seasonal, seasonal)]).T
y = np.append(t, c)

lm = sm.OLS(y, x).fit()
print(lm.summary())
Figure 2: output of MLR where x1 is treatment indicator and x2 is a seasonal component. Image by author.

In figure 2, our treatment indicator (x1) shows a 0.0350 ATE. However, it is not statistically significant, as shown by the 0.469 in the P>|t| column. Also note that the standard error of our treatment indicator is 0.048.

Now, covariate_random has vital information on our dependent variable revenue — we used it to create our treatment/control values. So, let’s add it into the model and see what happens…

Figure 3: output of MLR where x1 is treatment indicator, x2 is a seasonal component, and x3 is our “covariate” variable. Image by author.

As you can see, the coefficient of x1, our treatment indicator, is still 0.0350. That's good to see: we haven't introduced collinearity. What's really impressive is the P>|t| column: our p-value has dropped from 0.469 to 0.044.

By simply including predictive covariates, we're able to reach statistical significance. We didn't have to collect more data, i.e. run the experiment for longer. We just leveraged statistics and knowledge about the variance of our data.

It’s easy to see how this trivial example can generalize to real world situations. If you have a robust causal model, your organization can iterate much more quickly and expose fewer users to sub-optimal treatments.

One final point for this section (thanks to Alvin in the comments): when studying the relationship between the treatment indicator and other explanatory variables, it may also be a good idea to explore interaction terms.
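As a sketch of what that could look like (the weekly seasonal covariate and effect sizes below are made up for illustration), the statsmodels formula API makes it easy to add a treatment-by-covariate interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(1)
n = 20_000

is_treat = np.random.binomial(1, 0.5, size=n)
seasonal = np.sin(np.arange(n) * 2 * np.pi / 7)  # hypothetical weekly cycle

# The treatment lift is larger on high-season days: a true interaction
y = (5 + 0.05 * is_treat + seasonal
     + 0.05 * is_treat * seasonal
     + np.random.normal(size=n))

df = pd.DataFrame({"y": y, "is_treat": is_treat, "seasonal": seasonal})

# `a * b` in a formula expands to a + b + a:b
lm = smf.ols("y ~ is_treat * seasonal", data=df).fit()
print(lm.params[["is_treat", "is_treat:seasonal"]])
```

If the is_treat:seasonal coefficient is significant, the treatment effect genuinely varies with the covariate, and reporting a single ATE hides that heterogeneity.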

4.2 — Covariate Selection with Regularization

It’s important to note that in a real-world example, there will be lots of potential covariates, some of which may include prior purchasing behavior, duration on the platform, time of day, day of week, etc. Theoretically, a robust multivariate model will take all of these into account (assuming there isn’t collinearity between features).

Here, it makes sense to do some feature selection for our causal model. But note that including regularization terms like LASSO in our causal model breaks independence between the covariates and our treatment indicator, thereby ruining causal inference.

If you are building a multivariate causal model, be sure to do your feature selection using an independent dataset. Do not include regularization in the final model.

4.3 — Variance Reduction with CUPED

One final tip is to leverage variance reduction techniques such as CUPED. In one sentence, CUPED uses pre-experiment data to remove "usual" variance from our experiment data. With a variance-reduced ATE, we are able to reach statistical significance much faster.
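Here's a minimal sketch of the CUPED adjustment (simulated data; the pre-experiment revenue covariate is hypothetical):

```python
import numpy as np

np.random.seed(1)
n = 10_000

# Hypothetical pre-experiment revenue per user, correlated with the metric
pre = np.random.normal(loc=5, scale=1, size=n)
is_treat = np.random.binomial(1, 0.5, size=n)
y = pre + 0.05 * is_treat + np.random.normal(scale=0.5, size=n)

# CUPED: subtract the variance explained by the pre-experiment covariate
theta = np.cov(y, pre)[0, 1] / np.var(pre)
y_cuped = y - theta * (pre - np.mean(pre))

# The ATE is preserved while the metric's variance shrinks
ate_cuped = (y_cuped[is_treat == 1].mean()
             - y_cuped[is_treat == 0].mean())
print(np.var(y), np.var(y_cuped), ate_cuped)
```

Because pre-experiment behavior is unaffected by assignment, subtracting theta times the centered covariate leaves the ATE untouched while removing a large chunk of the noise.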

ML models have recently become popular for reducing experiment variance as well, but they involve a lot more work. That said, if you're a large company, it might be worth it.

5 — Summary

In this post we walked through using multiple linear regression for estimating ATE, determining statistical significance, and reducing variance. If you’re using frequentist methods like a T-Test for your experiments, it may benefit you to leverage multivariate OLS instead. However, be sure that you are meeting the assumptions of OLS — if you’re not, your ATE and statistical significance calculations will be wrong.

Thanks for reading! I’ll be writing 17 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.
