
Experiments, Peeking, and Optimal Stopping

How to run valid experiments with smaller sample sizes using the Sequential Probability Ratio Test

CAUSAL DATA SCIENCE

Cover, image by Author

In the decade preceding the Second World War, industrial production of war materials increased massively, creating the need to ensure that products, especially munitions, were reliable. Testing war materials is not only expensive but also destructive: bullets, for example, need to be fired in order to be tested.

Therefore, the U.S. government was presented with the following dilemma: how many bullets should one fire out of a batch before declaring the batch reliable? Clearly, if we were to fire all the bullets, we would know the exact number of functioning bullets in a crate. However, there would be no bullets left to use.

Because of the growing relevance of these statistical problems, in 1942 a group of prominent statisticians and economists joined forces at Columbia University’s Statistical Research Group (SRG). The group included, among others, W. Allen Wallis, Jacob Wolfowitz, and Abraham Wald. According to Wallis himself, the SRG was "composed of what surely must be the most extraordinary group of statisticians ever organized, taking into account both number and quality" [3].

Their work was of first-order importance and classified, to the point that Wallis reports:

"It is said that as Wald worked on sequential analysis his pages were snatched away and given a security classification. Being still an "enemy alien", he did not have a security clearance so, the story has it, he was not allowed to know of his results." [Wallis (1980)]

Indeed, the group worked under pressure from the U.S. Army to deliver fast, practical solutions that could be readily deployed in the field. For example, Wallis reports that

_"during the Battle of the Bulge in December 1944, several high-ranking Army officers flew to Washington from the battle, spent a day discussing the best settings on proximity fuzes for air bursts of artillery shells against ground troops, and flew back to the battle to put into effect advice from, among others, Milton Friedman, whose earlier studies of the fuzes had given him extensive and accurate knowledge of the way the fuzes actually performed."_ [Wallis (1980)]

The most prominent result to come out of the SRG experience was undoubtedly the Sequential Probability Ratio Test. The idea first came to Wallis and Friedman, who realized that

"it might pay to use a test which would not be as efficient as the classical tests if a sample of exactly N were to be taken, but which would more than offset this disadvantage by providing a good chance of terminating early when used sequentially." [Wallis (1980)]

The two economists presented the idea to the statistician Jacob Wolfowitz, to whom there initially

"seemed to be something distasteful about the idea of people so ignorant of mathematics as Milton and I venturing to meddle with such sacred ideas as those of most powerful statistics, etc. No doubt this antipathy was strengthened by our calling the new tests ‘supercolossal’ on the grounds that they are more powerful than ‘most powerful’ tests." [Wallis (1980)]

Ultimately, the two economists managed to draw the attention of both Wolfowitz and Wald, who started to work formally on the idea. The results remained top secret until the end of the war, when Wald published his article Sequential Tests of Statistical Hypotheses.

In this post, after a quick introduction to hypothesis testing, we are going to explore the Sequential Probability Ratio Test and implement it in Python.

Hypothesis Testing

When we design an A/B test or, more generally, an experiment, the standard steps are the following:

  1. Define a null hypothesis H₀, usually a zero effect of the experiment on a metric of interest
  2. Define a significance level α, usually equal to 0.05, which represents the maximum probability of rejecting the null hypothesis, when it is true
  3. Define an alternative hypothesis H₁, usually the minimum effect size that we would like to detect
  4. Define a power level 1-β, usually equal to 0.8 (i.e. β=0.2), which represents the minimum probability of rejecting the null hypothesis H₀, when the alternative H₁ is true
  5. Pick a test statistic whose distribution is known under both hypotheses, usually the sample average of the metric of interest
  6. Compute the minimum sample size, in order to achieve the desired power level 1-β, given all the test parameters

Then, we run the test and, depending on the realized value of the test statistic, we decide whether to reject the null hypothesis or not. In particular, we reject the null hypothesis if the p-value, i.e. the probability of observing, under the null hypothesis, a statistic as extreme as or more extreme than the one in the sample, is lower than the significance level α.
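
For instance, with a standard normal test statistic, the two-sided p-value and the corresponding decision can be computed as follows (a minimal illustration with a hypothetical realized value, not code from this post's notebook).

from scipy.stats import norm

z_observed = 2.1                        # hypothetical realized test statistic
p_value = 2 * norm.sf(abs(z_observed))  # two-sided p-value under N(0, 1)
print(p_value, p_value < 0.05)          # ~0.036, True -> reject the null hypothesis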

Remember that rejecting the null hypothesis does not imply accepting the alternative hypothesis.

Peeking

Suppose that halfway through the experiment we were to peek at the data and notice that, for that intermediate value of the test statistic, we would reject the null hypothesis. Should we stop the experiment? If we do, what happens?

The answer is that we should not stop the experiment. If we do, the test would not achieve the desired significance level, or, in other terms, our confidence intervals would have the wrong coverage.

Example

Let’s see what I mean with an example. Suppose our data generating process is a normal distribution with unknown mean μ and known variance σ² = 1: X ∼ N(μ, 1).

Let’s pick the following (arbitrary) hypothesis that we wish to test:

Test hypothesis, image by Author

After each observation n, we compute the z test statistic

z test statistic, image by Author

where X̅ₙ is the sample mean of a sample X₁, X₂, …, Xₙ of size n, σ is the standard deviation of the population, and μ₀ is the population mean under the null hypothesis. The term in the denominator is the standard deviation (standard error) of the sample mean. Under the null hypothesis of zero mean, the test statistic is distributed as a standard normal distribution, N(0,1).

Let’s code the test in Python. I import some code from utils to make the plots prettier.

import numpy as np
from src.utils import *

# z-statistic for H0: mu0 = 0 with sigma = 1, i.e. z = sqrt(n) * sample mean
zstat = lambda x: np.mean(x) * np.sqrt(len(x))
zstat.__name__ = 'z-statistic'

Suppose we want a test with significance level α=0.05 and power 1-β=0.8. What sample size n do we need?

We need a sample size such that

  1. The probability of rejecting the null hypothesis H₀, when H₀ is true, is at most α=0.05
  2. The probability of not rejecting the null hypothesis H₀, when H₀ is false (i.e. H₁ is true), is at most β=0.2

That is, we need to find a critical value c such that

Critical value equations, image by Author

where zₚ is the CDF inverse (or percent point function) at p, and μᵢ are the values of the mean under the different hypotheses.

If we do not know the sign of the unknown mean μ, we have to run a two-sided test. This means that the maximum probability of a type 1 error on each side of the distribution has to be α/2 = 0.025, so the relevant critical value is z₀.₉₇₅.

Combining the two expressions together we can solve for the required minimum sample size.

Minimum sample size equation, image by Author

so that

Minimum sample size equation solved, image by Author

We need at least 785 observations.
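
As a quick sanity check, we can verify this number directly (using scipy here, rather than the notebook's helpers).

import numpy as np
from scipy.stats import norm

alpha, beta = 0.05, 0.2
mu0, mu1, sigma = 0, 0.1, 1

# n >= ((z_{1-alpha/2} + z_{1-beta}) * sigma / (mu1 - mu0))^2
n_min = ((norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) * sigma / (mu1 - mu0)) ** 2
print(np.ceil(n_min))  # 785.0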

We can get a better intuition by graphically plotting the two distributions with the critical value. I wrote a function [plot_test](https://github.com/matteocourthoud/Blog-Posts/blob/main/notebooks/src/figures.py) to draw a standard hypothesis testing setting.

from src.figures import plot_test

plot_test(mu0=0, mu1=0.1, alpha=0.05, n=785)
Hypothesis testing, image by Author

The critical value is such that, given the distributions under the two hypotheses, the rejection area in red is equal to α. The sample size n is such that it shrinks the variance of the two distributions so that the area in green is equal to β.

Let’s now simulate an experiment in which we draw an ordered sequence of observations and, after each observation, we compute the value of the test statistic.
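
The helper experiment used below comes from the notebook's utils; a minimal sketch of what it plausibly does (the exact implementation may differ) is the following.

import numpy as np
import pandas as pd

def experiment(stat, n=785, mu=0, seed=1):
    """Draw n observations from N(mu, 1) and record the running test statistic."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, 1, n)
    stats = [stat(x[:i]) for i in range(1, n + 1)]
    return pd.DataFrame({'x': x, stat.__name__: stats})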

Let’s have a look at what a sample looks like.

df = experiment(zstat)
df.head()
Data snapshot, image by Author

We can now plot the time trend of the test statistic as we accumulate observations during the sampling process. I also mark with horizontal lines the values for rejection of the null hypothesis of a test with α=0.05.

plot_experiment(df, ybounds=[-1.96, 1.96])
Test statistic with sequential sampling, image by Author

In this case, the test statistic never crosses the critical values, so peeking would have had no effect: we would not have stopped the experiment prematurely.

What would happen if we were repeating the experiment many times? Since the data is generated under the null hypothesis, H₀: μ=0, we expect to reject it only α=5% of the time.

Let’s simulate the data-generating process K=100 times.

We plot the distribution of the z-statistic over samples.

simulate_experiments(zstat, ybounds=[-1.96, 1.96], early_stop=False);
Bounds crossed: 3 (67% upper, 33% lower)
Average experiment duration: 785
Test statistic with sequential sampling over 100 simulations, image by Author

In the figure above, I have highlighted the experiments for which we reject the null hypothesis without peeking, i.e. based on the value of the z test statistic at the end of the sampling process. Only in 3 experiments does the final value lie outside the critical values, so that we reject the null hypothesis. This means a rejection rate of 3%, which is very close to the expected rejection rate of α=5% (under the null hypothesis).

What if instead we were impatient and, after collecting the first 100 observations, stopped as soon as we saw the z-statistic crossing the boundaries?

stops_zstat_h0 = simulate_experiments(zstat, xmin=99, ybounds=[-1.96, 1.96], early_stop=True, lw=2);
plt.vlines(100, ymin=plt.ylim()[0], ymax=plt.ylim()[1], color='k', lw=1, ls='--');
Bounds crossed: 25 (48% upper, 52% lower)
Average experiment duration: 523
Test statistic over 100 simulations with peeking, image by Author

In the figure above, I have highlighted the experiments in which the value of the z-statistic crosses one of the boundaries from the 100th observation onwards. This happens in 25 simulations out of 100, which implies a rejection rate of 25%, very far from the expected rejection rate of α=5% (under the null hypothesis). Peeking distorts the significance level of the test.
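
To see the mechanism in a self-contained way, here is a small simulation sketch (independent of the plotting helpers above) that reproduces the inflation of the rejection rate when we peek.

import numpy as np

rng = np.random.default_rng(0)
K, n, crit = 100, 785, 1.96
reject_fixed, reject_peek = 0, 0
for _ in range(K):
    x = rng.normal(0, 1, n)                               # data generated under H0
    z_path = np.cumsum(x) / np.sqrt(np.arange(1, n + 1))  # z-statistic after each observation
    reject_fixed += abs(z_path[-1]) > crit                # decide only at the end
    reject_peek += np.any(abs(z_path[99:]) > crit)        # stop at the first crossing, from obs 100 onwards
print(reject_fixed / K, reject_peek / K)                  # ~0.05 vs. a much higher rate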

One potential solution is the Sequential Probability Ratio Test. However, in order to understand it, we first need to introduce the likelihood ratio test.

Likelihood Ratio Test

The likelihood ratio test assesses how likely it is that the observed data was generated by one of two competing statistical models, by comparing the likelihood of the data under each of them.

In order to perform the likelihood ratio test for hypothesis testing, we need the data generating process to be fully specified under both hypotheses. For example, this would be the case with the following hypotheses:

Test hypothesis, image by Author

In this case, we say that the statistical test is fully specified. If the alternative hypothesis was H₁: μ ≠ 0, then the data-generating process would not be specified under the alternative hypothesis.

When a statistical test is fully specified, we can compute the likelihood ratio as the ratio of the likelihood function under the two hypotheses.

Likelihood ratio, image by Author

The likelihood-ratio test provides a decision rule as follows:

  • If Λ>c, reject H₀;
  • If Λ<c, do not reject H₀;
  • If Λ=c, reject with probability q

The values c and q are chosen to obtain a specified significance level α.

The Neyman–Pearson lemma states that this likelihood-ratio test is the most powerful among all level α tests, in this setting.

Special Case: testing mean of normal distribution

Let’s go back to our example where data is coming from a normal distribution with unknown mean μ and known variance σ² and we want to perform the following test

Test hypothesis, image by Author

The likelihood of a normal distribution with unknown mean μ and known variance σ² is

Likelihood, image by Author

So that the likelihood ratio under the two hypotheses is

Likelihood ratio, image by Author
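
For reference, writing out the ratio of the two normal likelihoods and simplifying (a reconstruction of the formula in the image above, consistent with the code used later), we get

\Lambda_n = \frac{L(\mu_1)}{L(\mu_0)} = \exp\left( \frac{\sum_{i=1}^{n} (x_i - \mu_0)^2 - \sum_{i=1}^{n} (x_i - \mu_1)^2}{2\sigma^2} \right)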

We now have all the ingredients to move on to the final stage of this blog post: the Sequential Probability Ratio Test.

Sequential Probability Ratio Test

Given a pair of fully specified hypotheses, say H₀ and H₁, the first step of the sequential probability ratio test is to compute the cumulative log-likelihood ratio log(Λᵢ) as new data arrive: starting from S₀=0, for i=1,2,…,n we compute

S formula, image by Author

We can now use Sᵢ to generate a simple thresholding scheme:

  • Sᵢ ≥ b: accept H₁
  • Sᵢ ≤ a: accept H₀
  • a < Sᵢ < b: continue monitoring (critical inequality)

where the threshold values a and b (-∞<a<0<b<∞) should depend on the desired type I and type II errors, α and β. How should we pick a and b?

Wald (1945) shows that the choice of the following boundaries delivers a test with an expected probability of type 1 and 2 errors not greater than α and β, respectively.

Optimal bounds, image by Author

The equations are approximations because, with sequential sampling, the statistic can overshoot the boundaries rather than hit them exactly.

Wald and Wolfowitz (1948) proved that the SPRT with these boundaries is optimal: among all tests with the same power 1-β and significance α, it requires, on average, the smallest number of observations.
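
As a hedged sketch (the function below is illustrative and not part of this post's utils), the whole procedure for our normal setting fits in a few lines.

import numpy as np

def sprt(x, mu0=0.0, mu1=0.1, sigma=1.0, alpha=0.05, beta=0.2):
    """Illustrative SPRT: returns ('H0' | 'H1' | 'inconclusive', observations used)."""
    a = np.log(beta / (1 - alpha))      # lower boundary: accept H0
    b = np.log((1 - beta) / alpha)      # upper boundary: accept H1
    s = 0.0                             # S_0 = 0
    for i, xi in enumerate(x, start=1):
        # log-likelihood ratio contribution of a single observation
        s += ((xi - mu0) ** 2 - (xi - mu1) ** 2) / (2 * sigma ** 2)
        if s >= b:
            return 'H1', i
        if s <= a:
            return 'H0', i
    return 'inconclusive', len(x)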

Special Case: testing null effect

Let’s go back to our example where data is coming from a normal distribution with unknown mean μ and known variance σ² and hypotheses H₀: μ=0 and H₁: μ=0.1.

We have seen that the likelihood ratio with a sample of size n is

Likelihood ratio, image by Author

Therefore, the log-likelihood ratio (easier to compute) is

Log-likelihood ratio, image by Author
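
With μ₀ = 0, μ₁ = 0.1, and σ² = 1, this reduces to the following (again a reconstruction, matching the statistic coded below):

\log \Lambda_n = \frac{1}{2} \left[ \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} (x_i - 0.1)^2 \right]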

Simulation

We are now ready to perform some simulations. First, let’s code the log-likelihood ratio test statistic that we have just computed.

# log-likelihood ratio for H0: mu = 0 vs H1: mu = 0.1, with sigma = 1
log_lr = lambda x: (np.sum(x**2) - np.sum((x - 0.1)**2)) / 2
log_lr.__name__ = 'log likelihood-ratio'

We repeat the same experiment we did at the beginning, with one difference: we will compute the log-likelihood ratio as a statistic. The data generating process has μ=0, as under the null hypothesis.

df = experiment(log_lr)
df.head()
Data snapshot, image by Author

Let’s now compute the optimal bounds, given significance level α=0.05 and power 1-β=0.8.

alpha = 0.05
beta = 0.2
a = np.log( beta / (1-alpha) )
b = np.log( (1-beta) / alpha )
print(f'Optimal bounds : [{a:.3f}, {b:.3f}]')
Optimal bounds : [-1.558, 2.773]

Since the significance level and (one minus) the power are different, the bound for the null hypothesis is much closer to zero than the bound for the alternative hypothesis. This means that, in the case of an intermediate effect of μ=0.05, we would be more likely to accept the null hypothesis H₀: μ=0 than the alternative H₁: μ=0.1.

We can plot the distribution of the likelihood ratio over samples drawn under the null hypothesis H₀: μ=0.

plot_experiment(df, ybounds=[a,b])
Log-likelihood ratio with sequential sampling, image by Author

In this particular case, the test is inconclusive within our sampling framework. We need to collect more data in order to come to a decision.

plot_experiment(experiment(log_lr, n=789), ybounds=[a,b]);
Log-likelihood ratio with sequential sampling, image by Author

It takes 789 observations to reach a conclusion, while the fixed sample size computed before was 785. So this procedure can require a larger sample than the fixed-sample test. Is that true on average?

What would happen if we were to repeat the experiment K=100 times?

simulate_experiments(log_lr, ybounds=[a, b], early_stop=True, lw=1.5);
Bounds crossed: 96 (4% upper, 96% lower)
Average experiment duration: 264
Log-likelihood ratio with sequential sampling over 100 simulations, image by Author

We reach a decision in 96 simulations out of 100, and in 96% of them it is the correct one. Therefore, our rejection rate is very close to the expected α=0.05 (under the null hypothesis).

However, for 4 experiments, the test is inconclusive. What would happen if we were to sample until we reach a conclusion in each experiment?

simulate_experiments(log_lr, ybounds=[a,b], early_stop=True, lw=1.5, n=1900);
Bounds crossed: 100 (4% upper, 96% lower)
Average experiment duration: 275
Log-likelihood ratio with sequential sampling over 100 simulations, image by Author

As we can see from the plot, in one particularly unlucky experiment, we need to collect 1900 observations before coming to a conclusion. However, despite this outlier, the average experiment duration is an astounding 275 samples, almost a third of the original sample size of 785!

What would happen if, instead, the alternative hypothesis H₁: μ=0.1 were true?

simulate_experiments(log_lr, ybounds=[a,b], early_stop=True, mu=0.1, lw=1, n=2100);
Bounds crossed: 100 (84% upper, 16% lower)
Average experiment duration: 443
Log-likelihood ratio with sequential sampling over 100 simulations, image by Author

In this case, we make the correct decision in 84% of the simulations, which is very close to the expected value of 80% (under the alternative hypothesis), i.e. the power of the experiment, 1-β.

Moreover, also in this case we need a significantly smaller sample size: just 443 observations on average.


Conclusion

In this post, we have seen the dangers of peeking during a randomized experiment. Prematurely stopping a test can be dangerous since it distorts inference, biasing the coverage of the confidence intervals.

Does it mean that we always need to perform tests with a pre-specified sample size? No! There exist procedures that allow for optimal stopping. These procedures were born for a specific purpose: reducing the sample size as much as possible without sacrificing accuracy. The first and best known is the Sequential Probability Ratio Test, described by Wallis in 1980 as among "the most powerful and seminal statistical ideas of the past third of a century".

The SPRT was not only a powerful tool when testing munition crates 80 years ago; it is still used today for very practical purposes (see, for example, how it is used at Netflix or Uber).

References

[1] A. Wald, Sequential tests of statistical hypotheses (1945), The Annals of Mathematical Statistics.

[2] A. Wald and J. Wolfowitz, Optimum character of the sequential probability ratio test (1948), The Annals of Mathematical Statistics.

[3] W. A. Wallis, The Statistical Research Group, 1942–1945 (1980), Journal of the American Statistical Association.

Code

You can find the original Jupyter Notebook here: Blog-Posts/optimal_stopping.ipynb at main · matteocourthoud/Blog-Posts


Thank you for reading!

I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to Causal Inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.

Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!

