
Why Most A/B Tests Are Not Efficient

Demystifying Pure Exploration (Part 0)

Notes from Industry

A/B testing has many application areas such as drug testing. (Photo by ThisisEngineering RAEng on Unsplash)

For several decades now, A/B testing has been a mainstay of statistics, forming the bedrock upon which the entire edifice of randomized controlled testing, that most sacred of scientific corroboration techniques, has been built. Given the plethora of articles on the topic on this site, I don’t think there is any need to convince the reader of the importance of A/B testing or, for that matter, to recite the ever-expanding list of application areas into which this technique finds its way. One crucial aspect of A/B testing, however, has received surprisingly little attention, and the purpose of this article is to fill that lacuna.

Quick Summary of A/B Testing

The basic procedure itself, as it is currently performed, is quite simple. Take website optimization, for example. A company wishes to launch a new version of its site that it hopes will better persuade users to sign up for the company’s service than the old version does. Users are split into two groups, the Control group (Group A) and the Test group (Group B), and shown the old and new versions respectively. Each user, in signing up or not signing up, is then assumed to provide some noisy feedback about the quality of the webpage version they see. This noisy feedback is modelled via two probability densities (or mass functions, as the case may be) fA and fB with means µA and µB. The percentage of sign-ups in each group then indicates which version is preferable. Statistically, this boils down to using noisy samples to test which of the following hypotheses is true:

H0: µB − µA ≤ 0    vs.    H1: µB − µA > 0

One could also test for a chosen tolerance τ > 0 by replacing zero in the inequalities above with τ. As the figure below shows, methods most commonly used in the industry involve

  • using classical statistics to find the numbers nA and nB of users (I’ll use the general term "samples" in place of "users" from now on) beforehand,
  • conducting the test, and
  • taking a decision.
The traditional way in which A/B tests are conducted. (Image by Author)

Seems sound, doesn’t it? Well, not entirely. The procedure is scientifically correct, but correctness is not the concern of this article. As the title suggests, the intention is to improve the efficiency of A/B testing by introducing the reader to more modern, sophisticated ways of testing hypotheses.

Efficient in what sense? Consider the number of samples consumed by the test, i.e., N = nA + nB. This number, called the sample size, is dictated by several factors, such as the noise variance and the Type I and Type II error margins the designer finds acceptable. Notice that in the above procedure, all decisions related to sampling are taken before the experiment begins. Specifically, there is no notion of taking advantage of the increasing clarity offered by incoming data to aid decision making. I will now argue that this is, in many circumstances, provably suboptimal even when the best possible sample size estimates are used. Naturally, the alternative is to allow the data itself to dictate how many samples to take. In other words, to turn the A/B test online.
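To make the pre-experiment calculation concrete, here is a minimal sketch of how the fixed sample size is typically computed for Gaussian noise. The function name and the numerical values are illustrative assumptions on my part, not a standard recipe from any particular library:

```python
import numpy as np
from scipy.stats import norm

def fixed_sample_size(delta, sigma, alpha=0.05, beta=0.20):
    """Classical per-group sample size for a two-sample z-test that
    detects a mean difference `delta` under noise standard deviation
    `sigma`, with two-sided Type I error `alpha` and power 1 - `beta`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided Type I critical value
    z_beta = norm.ppf(1 - beta)        # critical value for the desired power
    n_per_group = 2 * (sigma * (z_alpha + z_beta) / delta) ** 2
    return int(np.ceil(n_per_group))

# Example: detecting a lift of 0.02 in mean sign-up score with sigma = 0.3
print(fixed_sample_size(delta=0.02, sigma=0.3))  # ~3532 users per group
```

Once these numbers are fixed, N = nA + nB is locked in before the first user arrives, no matter how lopsided the early evidence turns out to be.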


Sequential Statistics

The branch of statistics that allows decision making based on incoming data, and hence does not fix the sample size in advance, is called Sequential Statistics. Here, the incoming data is analyzed sample by sample and, when "enough" evidence has been collected in favor of a hypothesis, a pre-defined stopping rule halts further data collection. A sequential version of the usual A/B test for webpage optimization thus works on an incoming stream of users and consists of

  • a sampling rule that assigns an incoming user to either the control or test group (here, webpage version),
  • a stopping rule, and
  • a decision rule that accepts/rejects H0 based on available evidence.
Sequential tests can provide a speed-up of up to 4x! (Image by Author)

Illustrative Example

Before going further, let’s first convince ourselves of the superiority of sequential tests with a quick example. For a given level of Type I and Type II error, one of the most efficient sequential tests is Abraham Wald’s Sequential Probability Ratio Test (SPRT). Keeping with our webpage example, this test is conducted as follows.

For simplicity, assume µA < µB and that we know fA and fB up to a permutation, i.e., we are only trying to find which of the two webpages generates sign-up samples from fA and which from fB. Let f0 = fA ⊗ fB represent the situation where the control webpage is not better than the test, and f1 = fB ⊗ fA the opposite situation. Two thresholds L0 and L1 (0 < L0 < 1 < L1) are chosen in advance and the following rules are used.

1. Sampling rule:

  • The incoming stream of users is divided into pairs and, within each pair, one user is randomly assigned to each of the two webpages.
  • The nth pair’s reaction is recorded as the sample 2-tuple (Xn, Yn), where Xn denotes the response of the user shown the first webpage and Yn that of the user shown the second.
  • Define Zn = Xn – Yn, and let g0, g1 be the distributions of Zn under f0 and f1 respectively.

2. Stopping Rule:

The cumulative likelihood ratio up to time n, given by

λn = ∏ᵢ₌₁ⁿ g1(Zi) / g0(Zi)

is computed and compared against the thresholds. If either λn < L0 or λn > L1, the test is stopped. Else, the test repeats with pair n + 1.

3. Decision Rule:

If λn < L0, accept H0; if λn > L1, reject H0.

When fA ≡ N(µA, σ²) and fB ≡ N(µB, σ²), which is the case in many natural applications, one can show that the mean sample size of the SPRT is approximately (2σ²)/(µA − µB)². Ordinary A/B tests, on the other hand, require approximately (8σ²)/(µA − µB)² samples. The sequential procedure therefore represents a 4x improvement in efficiency!
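To see the three rules working together, here is a minimal simulation sketch of the test described above in the Gaussian permutation setting. The threshold choices L0 ≈ β/(1 − α) and L1 ≈ (1 − β)/α follow Wald’s classical approximations, and all names and numerical values below are illustrative assumptions, not code from any reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sprt_pairs(mu_a, mu_b, sigma, alpha=0.05, beta=0.05, max_pairs=100_000):
    """Wald's SPRT on the paired differences Zn = Xn - Yn.

    In the permutation setting, g0 = N(mu_a - mu_b, 2*sigma**2)
    (H0: the first page is the worse one) and g1 = N(mu_b - mu_a, 2*sigma**2).
    Returns (decision, pairs_used); decision 0 means accept H0."""
    delta = mu_b - mu_a                  # known gap (mu_a < mu_b assumed)
    log_L0 = np.log(beta / (1 - alpha))  # log of the lower threshold L0
    log_L1 = np.log((1 - beta) / alpha)  # log of the upper threshold L1
    log_lr = 0.0                         # running log(lambda_n)
    for n in range(1, max_pairs + 1):
        x = rng.normal(mu_a, sigma)      # response of the user on page 1
        y = rng.normal(mu_b, sigma)      # response of the user on page 2
        z = x - y
        # log g1(z) - log g0(z) for two Gaussians with means +/- delta
        # and common variance 2*sigma**2 simplifies to z * delta / sigma**2.
        log_lr += z * delta / sigma**2
        if log_lr < log_L0:
            return 0, n                  # stop and accept H0
        if log_lr > log_L1:
            return 1, n                  # stop and reject H0
    return None, max_pairs               # no decision within the budget

# Average sample size over repeated runs (data drawn under H0 here).
# Each pair consumes two users, hence the factor of 2 when comparing
# with N = nA + nB from the fixed-size calculation.
sizes = [sprt_pairs(0.10, 0.15, 0.3)[1] for _ in range(500)]
print(2 * np.mean(sizes))
```

Running this and comparing against the fixed-sample calculation for the same gap and noise level is a quick empirical check of the efficiency gain claimed above.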


The above example, while quite illustrative, does not entirely match reality, since in general the means µA and µB are unknown. In the literature, this example is a special case of a much more general (sequential) statistical construct called a Two-armed Bandit. Specifically, it belongs to a class of problems called "Best Arm Identification," or Pure Exploration. The purpose of this article was to pique the reader’s interest in sequential methods; in the forthcoming parts of this series, we will discuss Multi-armed Bandits in the present context in greater detail.

Approaching A/B testing from the Multi-armed Bandit (MAB) perspective offers several advantages.

  • MAB algorithms for pure exploration do not require knowledge of the means.
  • What if multiple (3 or more) hypotheses need to be tested simultaneously? In our example, the company might want to find the best of several webpage versions. MAB algorithms handle this situation easily.
  • The analysis of pure exploration, as we shall see, is full of very tight bounds on the sample size and can hence improve the efficiency of such tests significantly.

Concluding Remarks

I hope I’ve convinced the reader of the need to understand Best Arm Identification in greater detail. In the next part of this series, we will design and analyze a very simple pure exploration algorithm (no, it’s not Thompson Sampling). We will examine the expression for the sample size of this algorithm piece by piece, and see that it contains important factors connected to natural phenomena that lie at the heart of the theory of stochastic processes.

It is worth noting that much of the research in pure exploration over the last three decades has been informed by these factors. A discussion of what they mean and whether they can be avoided will therefore form the concluding portion of this series.

(Note: references will be provided at the end of Part 2.)

