TLDR
Multi-armed Bandits are awesome and go way beyond simple A/B testing. For hypothesis testing with bandits, efficient algorithms and very tight bounds on sample sizes are available. Let’s check them out! 😃
In the Previous Episode…
Part 0 of my series on Demystifying Pure Exploration explained why using sequential statistics can dramatically improve the efficiency of A/B testing in terms of the number of samples required to achieve a desired performance level (CLICK here for Part 0). Therein, I also explained how a statistical construct called a "Multi-armed Bandit" generalizes sequential A/B testing and provided a list of advantages of using the bandit approach.
Sequential hypothesis testing in the Bandit setting has, over the last two decades, received a lot of attention from researchers. Computer scientists and statisticians alike have developed a rich toolkit of algorithms and proof techniques to handle a wide variety of such problems, building upon a foundation of exceptionally rigorous mathematics.
In this and subsequent parts of this series, I hope to present a tiny sliver of the vast literature on this topic to help readers appreciate the awesome power of sequential statistical analysis better. Let’s begin by understanding the problem we’re dealing with and, in particular, what it’s not.
Pure Exploration ≠ Regret Minimization! 😮
The Multi-armed Bandit (MAB) setting involves K actions that some particularly sadistic statistician decided to refer to as ‘Arms’ to invoke the sorry image of a Las Vegas slot machine, aka a 1-armed Bandit, that steals all your money. I will quickly summarize the MAB model before proceeding further (newbies can refer to any number of articles on regret minimization on this site for basics).
Each arm k represents a distribution P(k), 1≤ k≤ K, and ‘pulling’ the arm represents sampling from distribution P(k). At time t = 1, 2, 3,… an MAB algorithm pulls an arm Aₜ. Pulling Arm k at time s produces a sample X(s,k) drawn from the distribution P(k). This sample is called a ‘Reward’.
The mean of distribution P(k) is denoted by θₖ, and the arm with the largest mean is called the Best Arm. Needless to say, we are not a priori aware of any of the reward distributions, except for the class of probability distributions to which they belong. Throughout this article, we will assume that all reward distributions are in the subGaussian class. A random variable Z is said to be σ²-subGaussian if for every λ ∈ ℝ,

𝔼[ exp(λ(Z − 𝔼Z)) ] ≤ exp(λ²σ²/2).
The parameter σ² above essentially behaves like a proxy for the variance of Z (for instance, a Gaussian with variance σ² is σ²-subGaussian, and any reward bounded in [0,1] is 1/4-subGaussian). In the rest of the article, we will assume that our problem has a unique best arm and that this arm is Arm 1. Further, let Δₖ = θ₁ - θₖ denote the mean reward gap between Arms 1 and k. For notational convenience, let’s set Δ₁ = Δ₂.
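To make the setup concrete, here is a minimal Python sketch of such a bandit environment (illustrative code of my own; the class name and interface are not taken from any reference). It uses Gaussian rewards purely for convenience, since they satisfy the subGaussian condition above.

```python
import numpy as np

class GaussianBandit:
    """K arms; pulling arm k returns a N(theta[k], sigma^2) reward,
    which is sigma^2-subGaussian. theta is hidden from the algorithm."""

    def __init__(self, theta, sigma=1.0, seed=0):
        self.theta = np.asarray(theta, dtype=float)  # true means, one per arm
        self.sigma = sigma                           # reward standard deviation
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        return self.rng.normal(self.theta[k], self.sigma)

bandit = GaussianBandit(theta=[0.9, 0.6, 0.5, 0.2])  # Arm 1 (index 0) is best
print(bandit.pull(0), bandit.pull(3))
```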
Regret Minimization: In the above framework, given a fixed time horizon T, the Regret of an MAB algorithm essentially measures the reward lost over all the times it did not pull the best arm, i.e., T·θ₁ minus the expected total reward the algorithm collects.
Regret minimization is NOT the focus of this series. I am interested in piquing the Medium readership’s interest in a radical generalization of A/B testing, variously called "Pure Exploration" and "Best Arm Identification."
Pure Exploration
Given a confidence parameter δ∈(0,1), find the best arm with probability at least 1-δ, using as few arm pulls as possible. So, our algorithms need to find Arm 1 using as few samples as possible.

Example: A medical researcher has K configurations of a drug, the most efficacious of which she wishes to identify.
A natural way to test this is by injecting lab mice with different configurations and observing the success rate of each configuration. Such an experiment is not concerned with how the mice themselves are affected by the drug. Indeed, the protocol employed will have to be rather aggressive at rooting out suboptimal configurations (or arms) quickly (before she runs out of mice and has to run around for more funding).
Gruesome as it may sound, this example helps illustrate the difference between the two problems. Regret minimization algorithms, by the very nature of the problem they deal with, have to ‘tread carefully’, sampling suboptimal arms as rarely as possible. Pure exploration algorithms have no such compulsion and are only concerned with zeroing in on the best action as quickly as possible, heedless of what regret they accrue in the interim.
💡 Given that we only have samples from distributions to work with, a good proxy for the mean θₖ of Arm k is the average reward earned from sampling Arm k until time t. It behooves us, therefore, to study how to control the distance between θₖ and this average. Fortunately, two famous mathematicians developed a technique to do precisely this.
Chernoff and Hoeffding to the Rescue
In the remainder of the article, given random variables X₁, X₂, …, the term ‘sample mean’ will refer to the random variable

Xᵐ(t) = (X₁ + X₂ + ⋯ + Xₜ)/t.
When X₁, X₂, … are also independent and identically distributed (IID) with mean μ, we know that 𝔼Xᵐ(t)=𝔼X₁=μ. To highlight the fact that μ, in a way, represents ground truth about the IID sequence, we will call it the ‘true mean’ (as opposed to the sample mean).
Question 1
How far from μ is Xᵐ(t)? Assuming our IID sequence X₁, X₂, … consists of σ²-subGaussian random variables, standard Chernoff-Hoeffding analysis tells us that

ℙ( Xᵐ(t) - μ > ϵ ) ≤ exp( -tϵ²/(2σ²) ),
and the same bound holds for Xᵐ(t)-μ < -ϵ as well. So, if we want the sample mean to be within ϵ of the true mean μ with probability at least 1-δ, then it suffices to take

t ≥ (2σ²/ϵ²) · log(2/δ).
💡 This means that O((1/ϵ²)·log(1/δ)) samples suffice to bring the sample mean to within ϵ of the true mean with high probability. This brings us to the next question.
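Before we get there, here is a quick numerical sanity check of the bound above (an illustrative sketch with my own function and variable names, assuming unit-variance Gaussian samples, which are 1-subGaussian):

```python
import numpy as np

def hoeffding_sample_size(eps, delta, sigma2=1.0):
    """Smallest t with 2*exp(-t*eps**2 / (2*sigma2)) <= delta,
    i.e. t >= (2*sigma2/eps**2) * log(2/delta)."""
    return int(np.ceil(2.0 * sigma2 / eps**2 * np.log(2.0 / delta)))

rng = np.random.default_rng(0)
mu, eps, delta = 0.3, 0.1, 0.05
t = hoeffding_sample_size(eps, delta)                 # 738 samples here
sample_means = rng.normal(mu, 1.0, size=(5_000, t)).mean(axis=1)
miss_rate = np.mean(np.abs(sample_means - mu) > eps)  # fraction of 'bad' runs
print(t, miss_rate)                                   # miss_rate is well below delta
```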
Question 2 (back to MABs)
Given a 2-armed bandit, how many samples does it take to find the better arm with high probability, assuming Δ = θ₁-θ₂ is known? (Recall that we’ve assumed that Arm 1 is the best arm.)
💡 One way to answer this is to sample both arms enough times to ensure their sample means Xᵐ₁(t) and Xᵐ₂(t) are sufficiently close to the true means θ₁ and θ₂ respectively, and then simply choose the arm with the larger sample mean.
Turns out this works quite well!

Following our analysis above, given that we are now dealing with four intervals instead of two, we need to replace δ/2 by δ/4. Also, as the picture shows, ϵ = Δ/2 in this case: intervals of half-width Δ/2 around θ₁ and θ₂ just touch without overlapping. With this, we see that taking
t ≥ (8σ²/Δ²) · log(4/δ)
samples from each arm ensures that, with probability larger than 1 - δ, Arm 1’s sample mean satisfies Xᵐ₁(t) > θ₁ - Δ/2 = θ₂ + Δ/2 > Xᵐ₂(t), the sample mean of Arm 2. So, choosing the arm with the larger sample mean gives the correct answer with high probability.
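The same recipe in code (again an illustrative sketch with my own names, assuming unit-variance Gaussian arms): pull each arm the prescribed number of times, report the arm with the larger sample mean, and check that the empirical error rate stays well below δ.

```python
import numpy as np

def two_arm_sample_size(gap, delta, sigma2=1.0):
    """Per-arm pulls so that both sample means land within gap/2 of their
    true means w.p. >= 1 - delta (union bound over the 4 deviation events)."""
    return int(np.ceil(8.0 * sigma2 / gap**2 * np.log(4.0 / delta)))

rng = np.random.default_rng(1)
theta1, theta2, delta = 0.5, 0.3, 0.05           # gap 0.2, Arm 1 is best
t = two_arm_sample_size(theta1 - theta2, delta)  # 877 pulls per arm here
trials = 2_000
m1 = rng.normal(theta1, 1.0, size=(trials, t)).mean(axis=1)
m2 = rng.normal(theta2, 1.0, size=(trials, t)).mean(axis=1)
error_rate = np.mean(m1 <= m2)                   # how often Arm 2 'wins'
print(t, error_rate)                             # error_rate is well below delta
```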
A very simple algorithm named Action Elimination or ‘AE’ leverages precisely this principle to find the best arm in a Multi-armed Bandit with high probability. However, AE also has to contend with the added complexity of
- not having any knowledge of the gaps Δᵢ = θ₁ - θᵢ in advance, and
- having to (potentially) deal with more than two arms.
Let’s see how the algorithm manages to get around these issues.
The Action Elimination Algorithm
AE, as the name suggests, is an ‘elimination’-type algorithm in the sense that it proceeds in phases, discarding or… eliminating (aha!) arms that seem to be suboptimal at the ends of phases.
The algorithm cleverly circumvents the need to know the gaps by
- choosing the arm with the largest sample mean at the end of every phase as the current (or empirical) best arm,
- discarding all arms whose sample means are outside a certain distance from the empirical best with this distance chosen at the beginning of the phase, and
- beginning the next phase by choosing a new, smaller distance for arms to survive the coming phase. This means that as phases progress, sample means will have to come closer and closer to the empirical best for the corresponding arm to survive.
The flowchart below illustrates the details of the algorithm. AE begins with all arms and ends when exactly one arm is left at the end of a phase. Notice that since the survival distance ϵₗ decreases exponentially as phases progress, sample means need to get very close to the empirical best to survive. In other words, the algorithm discards arms quite aggressively.

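To complement the flowchart, here is a minimal Python sketch of an elimination routine in the spirit of AE. Be warned that the phase schedule (ϵₗ = 2⁻ˡ, per-phase confidence δ/(2l²K)) and the constants in the per-phase sample size are one common choice, not necessarily those used in the original analysis, and the function and variable names are my own.

```python
import numpy as np

def action_elimination(pull, n_arms, delta, sigma2=1.0, max_phases=20):
    """Elimination-style best-arm sketch.

    pull(k, n) must return an array of n fresh rewards from arm k.
    In phase l we use tolerance eps_l = 2**-l and per-phase confidence
    delta_l = delta / (2 * l**2 * n_arms); each surviving arm is pulled
    enough times for its fresh sample mean to be within eps_l/2 of its
    true mean w.h.p., and arms whose sample mean sits more than eps_l
    below the empirical best are then discarded.
    """
    active = list(range(n_arms))
    for l in range(1, max_phases + 1):
        if len(active) == 1:
            break
        eps_l = 2.0 ** (-l)
        delta_l = delta / (2.0 * l**2 * n_arms)
        # eps_l/2-accuracy for a sigma2-subGaussian sample mean w.p. 1 - delta_l
        n_l = int(np.ceil(8.0 * sigma2 / eps_l**2 * np.log(2.0 / delta_l)))
        means = {k: pull(k, n_l).mean() for k in active}   # old samples discarded
        best = max(means.values())
        active = [k for k in active if best - means[k] < eps_l]
    return active[0]

# Usage: four Gaussian arms with unit variance (1-subGaussian rewards).
rng = np.random.default_rng(2)
theta = np.array([0.9, 0.6, 0.5, 0.2])                    # Arm 1 (index 0) is best
pull = lambda k, n: rng.normal(theta[k], 1.0, size=n)
print(action_elimination(pull, len(theta), delta=0.05))   # prints 0 w.h.p.
```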
Analyzing the number of samples AE consumes before stopping involves showing (with high probability) that
- Arm 1 is never eliminated,
- sample means of arms are never more than ϵₗ away from their true counterparts in every phase l, and
- Arm k survives for at most O(log(8/Δₖ)) phases.
These observations imply that Action Elimination returns the best arm with probability at least 1-δ, having consumed at most the following number of samples:

O( Σₖ (1/Δₖ²) · [ log(K) + log(1/δ) + log log(1/Δₖ²) ] )
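To get a feel for this expression, here is an order-level evaluation for one made-up instance (purely illustrative: constants are dropped and the gaps are hypothetical, so treat the output as a shape rather than a guarantee).

```python
import numpy as np

# Order-level evaluation of the AE bound for an illustrative gap profile.
gaps = np.array([0.4, 0.3, 0.2, 0.1])    # hypothetical Delta_k for the suboptimal arms
K, delta = len(gaps) + 1, 0.05
per_arm = (1.0 / gaps**2) * (np.log(K) + np.log(1.0 / delta)
                             + np.log(np.log(1.0 / gaps**2)))
print(per_arm.round().astype(int))       # per-arm contributions (no constants)
print(int(per_arm.sum()))                # total: arms with small gaps dominate
```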
Good News and Bad News
We will see later that many of the terms in the sample size of AE are actually fundamentally unavoidable. However, there are plenty of issues with this algorithm.
- To begin with, there is no guarantee that Action Elimination ever stops. Sample mean estimates could oscillate, Arm 1 can get kicked out early on, and a whole host of other problems can break the algorithm.
- Even on trajectories where AE does stop, its performance outside the ‘good’ event of probability 1-δ can be arbitrarily bad.
- The algorithm literally throws out all samples collected in a given phase before beginning the next phase.
For these reasons, I’ve decided to call Action Elimination a ‘Toothless’ algorithm (get it? 😉 Toothless… the tamed dragon…). Anyway, getting back to sample sizes, we will now see that despite its toothlessness, the expression for AE’s sample complexity is a treasure trove of valuable information.
Lessons from a Toothless Algorithm
Let’s look at the sample size expression a bit more closely. I’ve highlighted only the ‘order’ terms, so to speak, while hiding irrelevant constants.
O( Σₖ (1/Δₖ²) · [ log(K) + log(1/δ) + log log(1/Δₖ²) ] )
The four terms appearing above are the result of very fundamental issues in statistical analysis and stochastic process theory. In the forthcoming parts of this series, we will see exactly how they arise. In brief,
- Term 1 (Σₖ1/Δ²ₖ) is, in fact, the most important part of a famous lower bound on the number of samples consumed by any good pure exploration algorithm.
- Term 2 (log(K)) is the result of a weak union bound. We will see how this term was eventually removed.
- Term 3 (log(1/δ)) is a fundamental limit in statistical analysis. In fact, this very article hints at why this term simply cannot be avoided. Can you find it? Hint: can you avoid it if I gave you the values of the gaps Δₖ?
- Term 4 (log log(1/Δₖ²)) comes from a fascinating phenomenon called the Law of the Iterated Logarithm that governs the growth rate of sums of independent random variables.
Concluding Remarks
- In the context of pure exploration, AE was one of the first algorithms ever proposed (and rigorously analyzed).
- AE isn’t really "toothless!" I merely used the adjective to stress that more advanced techniques with near-optimal sample complexity guarantees exist. AE has, in fact, been successfully used to solve several problems in both MABs and Reinforcement Learning by multiple researchers (including yours truly 😅 ), usually as a building block within more complex algorithms.
- Many of the issues with AE can be, and have been, fixed over the years. In the forthcoming parts, I will describe a few of the successful attempts at fixing them.
So, in this part of the pure exploration saga, we formally defined the problem and saw how to construct one algorithm to solve it. Performance analysis of the algorithm gave us hints as to what can and cannot be done in this setting.
In the remaining parts of the series, I will explain what makes some of these terms unavoidable and how the log(K) term was eliminated.
NOTE: references will be provided at the end of Part 2.