What to Do When the A/B Test Result Is Not Significant

4-step Framework to Land on Concrete Actions

Mintao Wei
Towards Data Science



Too Long; Didn't Read (TL;DR)

  • Step 1: Check Experiment Set-up & Data Correctness
  • Step 2: Check if Sample Size and Statistical Power are Sufficient: if the collected samples satisfy the minimum required sample size, conduct non-parametric tests and assess whether the experiment is underpowered. If the traffic is insufficient, keep waiting; drawing inferences from historical trends is a last resort and usually not recommended.
  • Step 3: Check if the New Product Feature is Successful: delve deeper into granular user & product analysis to evaluate the effectiveness of the product feature (e.g. map user journey using heatmaps/funnels, explore product adoption, break down by user groups, etc).
  • Step 4: Determine the Next Move: suggest extending the experiment period or reducing the variance if implied by Step 2, or deliver relevant actionable analysis to product managers if motivated by Step 3.

Introduction

This article discusses how data scientists should react when an experiment turns out to be statistically insignificant. It is best suited for readers who have basic knowledge of A/B testing, such as statistical power and minimum sample size calculation, and are interested in diving into the details.

How often do we observe insignificant results?

Sad but true: unexpectedly insignificant results occur quite often in industry. Based on my personal experience and observations of data science teams, almost one-third of experiments end up with insignificant p-values on north-star metrics. On top of that, indicators other than those key metrics often fail to demonstrate statistical significance as well.

Therefore, it is crucial that we follow scientific methods to interpret these test outputs and generate trustworthy conclusions. Inspired by Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (see Reference), I would like to introduce a sequential approach that takes us from this seemingly clueless state to an actionable one.

What are some common reasons?

The right response cannot take shape until we develop a good understanding of the cause. Assuming no novelty effect, change aversion, or network effect (these deserve a whole chapter of their own and are not the focus of this article), common reasons include:

  1. The testing feature is really not that successful.
  2. The variance of the evaluation metrics is too large to reflect treatment change.
  3. The experiment impacts only a small subset of the randomized population, diluting the evaluation metrics and rendering the lift insignificant.
  4. The actual treatment effect is smaller than the minimum detectable effect, so the observed lift is not declared significant at the pre-determined sample size.

Essentially, the above 4 reasons can be categorized into two key problems:

  • Product Problems: the new feature or strategy is in fact not effective in driving the evaluation metrics.
  • Insufficient Statistical Power Problem: the experiment is underpowered and too insensitive to detect the small effect we anticipate, even though it does exist. This is usually caused by inadequate sample size and large variance.

The first type of problem is easy to comprehend, so I will spend most of this article on the second, which is usually the more common one.

An evaluation of 115 A/B tests at GoodUI.org suggests that most were underpowered (Georgiev 2018).

One cause of insufficient statistical power is that most real-world A/B tests target a niche subset of users rather than the entire population, so the eligible experiment group is no longer on the order of ~1 billion users (the MAU of most large tech companies) but often several orders of magnitude smaller.

For instance, the sample for a UI experiment on the checkout button of an e-commerce platform consists only of users who land on the checkout page. If we assume the platform's MAU is 1 billion, the general user journey is 'landing page → product detail page → cart → checkout page', and the conversion rate is 5% at each stage, then the maximum eligible experiment traffic we can collect in a month is only:

1 billion users × 5% × 5% × 5% = 125,000 users

While an MAU of 1 billion is substantial, 125,000 is not a large sample for drawing statistical inferences on volatile metrics such as GMV per user. Furthermore, if consumer-facing A/B experiments of this kind already suffer from a shortage of traffic, things get even harder when we run tests on other business parties such as creators on TikTok, merchants on Amazon, or advertisers on Facebook. This reiterates the need to understand what data scientists should do when facing insufficient power and insignificant results.
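To make the constraint concrete, the sketch below compares a minimum required sample size against the 125,000 eligible users estimated above. It assumes a standard two-sample t-test setup; the standardized effect size, significance level, and power are illustrative assumptions rather than values from any real experiment.

```python
# A minimal sketch: compare the minimum required sample size for a two-sample
# t-test against the ~125,000 eligible users estimated above. The standardized
# effect size (Cohen's d), alpha, and power are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

eligible_users = 125_000        # monthly eligible traffic from the funnel above
assumed_effect_size = 0.01      # hypothetical Cohen's d for a noisy metric like GMV per user

required_per_group = TTestIndPower().solve_power(
    effect_size=assumed_effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(f"Required users per group:         {required_per_group:,.0f}")
print(f"Eligible users per group (50/50): {eligible_users / 2:,.0f}")
# For small standardized effects, the requirement easily exceeds the eligible
# traffic, which is exactly the underpowered situation discussed here.
```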


Expanding in Detail: What Should Data Scientists Do?

Step 1: Check Experiment Set-up & Data Correctness

  1. Check that the experiment parameters were distributed as expected: double-check that the percentage of traffic exposed to this experiment matches the plan, that users in the treatment group can indeed see all the new product feature changes, and that users in the control group only interact with the current version. These may seem like very basic or even unnecessary sanity checks, but accidents happened more than once on the data science teams in my past organizations.
  2. Ensure the data is correct and trustworthy: Data scientists are required to check the following 3 things specifically:
    - Is the SQL to construct key metrics correct?
    - Is the traffic diversion truly random? (consider using a Chi-squared test on the assignment counts to detect a sample ratio mismatch; see the sketch after this list)
    - Is the traffic in the two groups homogeneous? (check that the groups are balanced on key characteristics such as gender and age; across many such checks the p-values should be roughly uniformly distributed)
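As a concrete illustration of the randomness check, here is a minimal sketch of a sample ratio mismatch (SRM) test; the assignment counts and the planned 50/50 split are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical assignment counts pulled from the experiment platform's logs.
observed = np.array([50_452, 49_613])    # users actually diverted to control / treatment
planned_split = np.array([0.5, 0.5])     # the traffic split configured for the experiment
expected = planned_split * observed.sum()

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
# A very small p-value (a strict cutoff such as 0.001 is common for this check)
# suggests the diversion is not behaving as configured and the results are untrustworthy.
```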

Step 2: Check if Sample Size and Statistical Power are Sufficient

  1. If the actual accumulated traffic hasn’t reached our pre-calculated minimum sample size:
    Waiting is clearly the optimal choice from a statistical perspective. However, we sometimes need to give a recommendation anyway due to external factors such as pushy product managers or upcoming management meetings. Under such circumstances, though it is not recommended, we can make empirical inferences based on historical patterns: if the treatment-minus-control difference has been positive on most days so far, it is likely just a matter of time before it reaches statistical significance. When drawing inferences from these descriptive patterns, we must remind ourselves not to fake a story around p-values.
  2. If the actual accumulated traffic has already reached our pre-calculated minimum sample size:
    2–1. Conduct Non-parametric Test
    It is recommended to start with a non-parametric test to collect more information for reference (see the sketch after this list). Though non-parametric tests generally have less statistical power when parametric assumptions hold, they are more robust and can be more trustworthy when the underlying data is non-normal, skewed, or distorted by extreme values.
    2–2. Assess if the Power is Sufficient
    There is unfortunately no straightforward post-test method to tell whether a test is underpowered. My personal approach is based on empirical experience, taking historical experiments and business context into account. That said, asking ourselves two questions helps in judging whether the A/B test has sufficient power:
    (1) Is our metric diluted by computing it over the broader randomized population rather than just the users actually exposed to the treatment?
    (2) Is our key metric exhibiting a higher variance than we assumed when planning the test?
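The sketch below illustrates both parts of this step: a Mann-Whitney U test as the non-parametric option, and a comparison of the observed metric variance against the value assumed during sample size planning. The simulated per-user GMV data and the planning assumption are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical per-user GMV samples; in practice these come from your metrics table.
control = rng.lognormal(mean=3.0, sigma=1.0, size=20_000)
treatment = rng.lognormal(mean=3.01, sigma=1.0, size=20_000)

# 2-1: non-parametric test, robust to the heavy right tail of GMV-like metrics.
u_stat, p_value = stats.mannwhitneyu(treatment, control, alternative="two-sided")
print(f"Mann-Whitney U test: p = {p_value:.4f}")

# 2-2, question (2): is the metric more volatile than assumed at planning time?
planned_std = 15.0                      # hypothetical std used in the pre-test power calculation
observed_std = np.concatenate([control, treatment]).std(ddof=1)
print(f"Observed std = {observed_std:.1f} vs. planned std = {planned_std:.1f}")
# If the observed std is much larger than planned, the test can be underpowered
# even though the pre-calculated minimum sample size has been reached.
```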

Step 3: Check if the New Product Feature is Successful

The goal of delving deeper is to uncover and rule out any possible product reasons that account for the insignificant results. Some techniques are listed below:
- User Conversion Analysis: visualize user journeys in funnel diagrams and understand how users’ behaviors are affected by the treatment.
- Between-Page Analysis: analyze how treatments on page A affect users’ behaviors on other pages.
- Within-Page Analysis: analyze how the treatment feature on page A affects how users interact with other modules on page A, using heatmaps.
- User Group Analysis: break down experiment cohorts by key business dimensions such as geographic region, traffic source, and new/old users. Explore the possibility that one group shows a statistically significant lift while others are indifferent to the change and dilute the overall treatment effect (a sketch of such a breakdown follows this list).
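For the user group analysis, a minimal per-segment breakdown might look like the following; the column names (group, segment, metric) are hypothetical placeholders for whatever your experiment table actually contains.

```python
import pandas as pd
from scipy import stats

def breakdown_by_segment(df: pd.DataFrame) -> pd.DataFrame:
    """Per-segment lift and p-value. Assumes one row per user with hypothetical
    columns 'group' ('control'/'treatment'), 'segment', and 'metric'."""
    rows = []
    for segment, seg_df in df.groupby("segment"):
        control = seg_df.loc[seg_df["group"] == "control", "metric"]
        treatment = seg_df.loc[seg_df["group"] == "treatment", "metric"]
        _, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
        rows.append({
            "segment": segment,
            "users": len(seg_df),
            "lift": treatment.mean() - control.mean(),
            "p_value": p_value,
        })
    # With many segments, apply a multiple-comparison correction (e.g. Bonferroni)
    # before claiming that any single segment truly responded to the treatment.
    return pd.DataFrame(rows).sort_values("p_value")
```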


Step 4: Determine the Next Move

  1. If the insignificant result is more likely to be an insufficient power problem, we have two categories of approaches to increase the testing sensitivity:
    1–1. Enlarge the Sample Size
    (1) Extend the experiment period to let the test run for a longer time frame;
    (2) Turn up the traffic ratio and expose the experiment to more users.
    1–2. Reduce the Variance
    (1) Designing less volatile metrics (e.g. ‘total number of searches’ usually has a larger variance than ‘the number of searchers’);
    (2) Transforming a metric through binarization or log transformation;
    (3) Implementing stratification or Controlled-experiment Using Pre-Experiment Data (CUPED): stratification reduces variance by combining results from individual strata, and CUPED uses pre-experiment data to control for the intrinsic variance. Both are very useful techniques that can be easily applied to almost every A/B testing framework. (See How to Double A/B Testing Speed with CUPED for more details, and the sketch after this list.)
    (4) Designing paired experiments: if you have several experiments splitting the traffic and each has its own control, consider pooling the separate controls into a larger, shared control group, and pair individuals across the two groups upfront to reduce idiosyncratic variance. This idea has been adopted by the data science team at Alibaba Taobao to build more robust experiments on merchants, who have long suffered from inadequate sample sizes.
  2. If the insignificant result is more likely a product problem, I would suggest formulating a structured report and delivering that message in full. It is encouraged to attach the Step 3 analysis of where the feature fell short and the potential directions for improvement.
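As a reference for item 1-2 (3), here is a minimal sketch of a CUPED adjustment, assuming a pre-experiment version of the same per-user metric is available; the synthetic data and variable names are illustrative only.

```python
import numpy as np
from scipy import stats

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """CUPED: remove the variance explained by the pre-experiment covariate."""
    theta = np.cov(metric, pre_metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())

# Synthetic example: the pre-experiment metric is highly correlated with the
# in-experiment metric, which is what makes the adjustment effective.
rng = np.random.default_rng(0)
n = 10_000
pre = rng.gamma(shape=2.0, scale=10.0, size=2 * n)   # pre-experiment metric per user
y = pre + rng.normal(0.0, 5.0, size=2 * n)           # in-experiment metric per user
y[n:] += 0.3                                         # small treatment effect on the treatment half

adjusted = cuped_adjust(y, pre)                      # theta estimated on the pooled data
raw_p = stats.ttest_ind(y[n:], y[:n], equal_var=False).pvalue
cuped_p = stats.ttest_ind(adjusted[n:], adjusted[:n], equal_var=False).pvalue
print(f"raw p = {raw_p:.4f}, CUPED-adjusted p = {cuped_p:.4f}")
# The adjusted metric has far less variance, so the same treatment effect
# becomes much easier to detect with the same number of users.
```
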
Image by Author — Flow Chart Diagram of the Above-mentioned Framework

Summary and Caveat: Don’t Peek at P-values

Everyone wants impact. Product managers don't want the feature they designed from scratch to be strangled in the cradle by an insignificant result. Data scientists don't want to forgo a potential win they could write up in their performance review either.

However, one should always remember that data scientists are called 'scientists' because they are supposed to use scientific methods to analyze the product, and peeking at p-values or manipulating the data to fake a story is certainly not one of them.

The recommended way to approach an insignificant test result is to exhaustively explore its causes while cautiously guarding against false-discovery mistakes made in pursuit of a statistically significant conclusion.

Reference

Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge: Cambridge University Press.

Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation. W. W. Norton & Company. https://www.amazon.com/Field-Experiments-Design-AnalysisInterpretation/dp/0393979954

Berk, M. How to Double A/B Testing Speed with CUPED. Towards Data Science. https://towardsdatascience.com/how-to-double-a-b-testing-speed-with-cuped-f80460825a90

Georgiev, G. Underpowered A/B Tests — Confusions, Myths, and Reality. https://blog.analytics-toolkit.com/2020/underpowered-a-b-tests-confusions-myths-reality/
