
Raining on Your Parade: Thompson Sampling is NOT good for A/B testing

Part 0(a) of my Demystifying Pure Exploration series

Photo by National Cancer Institute on Unsplash

If Batch Statistics don’t work, throw Thompson at Your Problem 🤡

Most articles discussing A/B testing on Medium

  • begin and end with batch statistics, or
  • supplement batch statistics with a stopping rule that some software magically produces using a completely opaque procedure that’s conveniently ignored, or
  • recognize, at least empirically, the superiority of sequential statistics, but simply peel ‘batch’ off Step 2 above, stick in vanilla ‘Thompson Sampling’, and wax eloquent about how this will solve all your product testing woes and make you a billionaire overnight.

While I’ve already explained how sequential testing, combined with a sensible, provable stopping rule, performs much better than batch testing, the recent influx of articles of the third kind above shows increasing faith in a myth I intend to dispel in this article.

Myth: Regret and Sample Size Efficiency are essentially the same metric.

💡 Quick Recap of Definitions

In my previous article, I described what Sample Size efficiency means, so to keep this article concise I will only discuss Regret in more detail (continuing with the same basic notation as before). The Regret of a sampling algorithm in our Multi-Armed Bandit (MAB) framework essentially measures the loss in reward incurred over all the times it did not pull the best arm.

As before, I assume that θ₁>θₖ, ∀ k ∈ {2,3,…,K}, and define Δₖ=θ₁-θₖ to be the reward gap of Arm k. At time s ∈ {1,2,…,T}, an MAB algorithm π samples arm Iₛ and receives reward X(s,Iₛ). Its regret is then defined by

Definition of Regret of Algorithm π over T rounds (Image by Author)
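In text form, with the notation above, the standard definition (of which the rendered image is presumably a version) is

$$ R_T(\pi) \;=\; T\,\theta_1 \;-\; \mathbb{E}\!\left[\sum_{s=1}^{T} X(s, I_s)\right], $$

i.e., the expected shortfall in reward relative to an oracle that pulls Arm 1 in every round.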

⚡ Whence the Myth? (Bit of a redux)

Much of the confusion I alluded to above, in my opinion, is attributable to an equivalent definition of regret. Let the random variable Nₖ(T) represent the number of times Algorithm π samples Arm k in T rounds. Then,

Regret in terms of reward gaps (Image by Author)
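Written out in the notation above, this equivalent form is the standard regret decomposition (the image may differ cosmetically):

$$ R_T(\pi) \;=\; \sum_{k=2}^{K} \Delta_k\, \mathbb{E}\big[N_k(T)\big]. $$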

The above equation shows that good regret performance automatically implies sampling suboptimal arms less frequently. Authors (in my opinion) see this and assume that any algorithm that minimizes regret must also find the best arm quickly. This is not the complete picture!

You see, such authors tend to forget the other factor multiplying 𝔼Nₖ(T), namely the gaps Δₖ. Because losing reward is so costly in regret minimization, algorithms designed for this setting tend to step very carefully, treading the knife edge between exploration and exploitation. Finding the best arm quickly, on the other hand, demands very aggressive behavior from the algorithm. Since reward is no longer a factor, the exploration-exploitation dilemma that plagues regret minimization is much less prominent in this latter setting.

Vanilla Thompson Sampling (vTS) has been developed for the express purpose of minimizing regret, and exhibits all the trepidation of its ilk when it comes to arm selection. This is why articles of the second and third kind above are very misleading in their claims.
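To be clear about what ‘vanilla TS’ refers to, here is a minimal sketch for Bernoulli rewards with Beta(1,1) priors (my own illustrative implementation, not code from any particular library). Note that the arm with the largest posterior draw is always the one pulled; that single greedy step is where all the reward-protective caution lives.

```python
import numpy as np

def vanilla_ts(true_means, horizon, rng=np.random.default_rng(0)):
    """Vanilla Thompson Sampling for Bernoulli bandits with Beta(1,1) priors."""
    k = len(true_means)
    alpha = np.ones(k)   # posterior Beta 'alpha' parameters (successes + 1)
    beta = np.ones(k)    # posterior Beta 'beta' parameters (failures + 1)
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        # Draw one plausible mean per arm from its posterior ...
        theta_draw = rng.beta(alpha, beta)
        # ... and greedily pull the arm whose draw is largest.
        arm = int(np.argmax(theta_draw))
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# The suboptimal arm is pulled only rarely: great for regret, but it leaves
# us with very little evidence about that arm.
print(vanilla_ts([0.6, 0.5], horizon=2000))
```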

Small Regret ⇒ Bad Best Action Identification 🤯

Read that again. Notice that I didn’t say "Small Regret ≠ Good Best Action Identification"! The claim I’m making here is far stronger, to wit, that any algorithm that performs well on regret minimization problems necessarily does a bad job at best arm identification. Let’s make this more formal.

It is well known that the best regret-minimizing algorithms show a logarithmic increase in regret, i.e., O(log(T)). Given a bandit instance θ = [θ₁,…,θₖ] from some class Θ of instances, a lower bound on regret performance is

Lower bound on regret performance (Image by Author)
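One standard way such a bound is written, for policies that are "consistent" over Θ, is the Lai–Robbins form (the figure above may display an equivalent variant):

$$ \liminf_{T\to\infty} \frac{R_T(\pi)}{\log T} \;\ge\; \sum_{k:\,\Delta_k>0} \frac{\Delta_k}{\mathrm{KL}(\theta_k,\theta_1)}, $$

where KL(θₖ,θ₁) is the Kullback–Leibler divergence between the reward distributions of Arm k and Arm 1.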

This simply means that a regret-optimal algorithm for class Θ shows asymptotic regret of order log(T) on every bandit instance θ ∈ Θ, matching this lower bound. Now the kicker: for any such algorithm, there always exists at least one "difficult" instance within the same class Θ on which the probability of choosing the wrong arm decays at most polynomially with time! Formally speaking, for any ϵ > 0,

Probability of selecting the WRONG arm (Image by Author)

which renders π rather unreliable for this task. This naturally leads us to conclude that regret minimizing algorithms are provably BAD at Best Arm Identification.

(Disclaimer: this theorem does have a few qualifying conditions which I’ve omitted in the interest of clarity. For more details, please refer to Chapter 33 of Tor and Csaba’s masterpiece.)

This rate of decrease is completely unacceptable in applications because (a) it leads to enormous sample sizes and (b) several better alternatives exist! This latter claim is true even if one doesn’t wish to resort to designing sophisticated stopping times that dictate when the experiment has collected enough data that an action can be declared ‘the best.’
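To get a feel for the numbers (with purely illustrative rates, not tied to any particular algorithm): if the error probability decays polynomially, say like T^(-1/2), then driving it below 1% requires

$$ T^{-1/2} \le 0.01 \;\Longrightarrow\; T \ge 10^{4}, $$

whereas an exponential rate of, say, e^(-T/20) gets there with only T ≥ 20 ln(100) ≈ 92 samples.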

Now that I have (hopefully) convinced you that Captain Planet’s "Reduce, Reuse, Recycle" mantra does not apply to A/B testing, we can move on to discussing more efficient alternatives.

Recycle plastic not algorithms (Source: Wikimedia Commons)

Top-Two Thompson Sampling

Top-Two Thompson Sampling (Top-2 TS) is designed to make Thompson Sampling explore more aggressively. Simply put, it proceeds like vanilla TS but, in each round, randomly chooses one of the top two actions (or arms) and samples it. The probability of randomization is chosen very carefully, and with said choice,

Top-2 TS quickly stops selecting the wrong arm (Image by Author)

Several points are of note here.

  • Since either of the top two arms can be chosen in a round, exploration is clearly being encouraged.
  • Compared with our earlier bound on the probability of choosing the wrong arm, Top-2 TS is orders of magnitude better (from polynomial decrease to exponential decrease)!
  • This constant Γ* that dictates the rate of exponential decrease is also the best achievable.
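For concreteness, here is a minimal sketch of the top-two modification, again for Bernoulli rewards with Beta(1,1) priors (my own illustration, not a reference implementation). One common variant keeps re-drawing from the posterior until an arm other than the leader tops the draw; I cap that loop here so the sketch always terminates quickly.

```python
import numpy as np

def top_two_ts(true_means, horizon, beta=0.5, max_redraws=100,
               rng=np.random.default_rng(0)):
    """Top-Two Thompson Sampling sketch for Bernoulli bandits, Beta(1,1) priors."""
    k = len(true_means)
    a = np.ones(k)                 # posterior Beta 'alpha' parameters
    b = np.ones(k)                 # posterior Beta 'beta' parameters
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        draw = rng.beta(a, b)
        leader = int(np.argmax(draw))       # the arm vanilla TS would pull
        arm = leader
        if rng.random() > beta:
            # With probability 1 - beta, pull a challenger instead: re-draw
            # from the posterior until a different arm tops the draw.
            for _ in range(max_redraws):
                draw = rng.beta(a, b)
                if int(np.argmax(draw)) != leader:
                    arm = int(np.argmax(draw))
                    break
            else:
                draw[leader] = -np.inf      # fallback: best non-leader arm
                arm = int(np.argmax(draw))
        reward = rng.binomial(1, true_means[arm])
        a[arm] += reward
        b[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# The runner-up arm now receives far more pulls than under vanilla TS,
# which is exactly the extra exploration needed for identification.
print(top_two_ts([0.6, 0.5], horizon=2000))
```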

This algorithm, however, doesn’t use a stopping time (in the literature, it belongs to the Fixed Budget category). Many other pure exploration algorithms, by contrast, automatically supply a stopping time at which they stop and declare an arm the best.


Enter Stopping Times Stage Left

Pure exploration algorithms that also decide when they’ve seen enough evidence to crown some action ‘the best’ belong to a category called ‘Fixed Confidence’ algorithms. They are designed to solve the following problem:

Given a confidence parameter δ ∈ (0,1), find the best arm with probability at least 1-δ, using as few arm pulls as possible.

So, in our setting, these algorithms need to find Arm 1 using as few samples as possible. I will now discuss a few prominent examples in this category.

Track and Stop

Garivier and Kaufmann, in their landmark 2016 paper, showed that the stopping time τ of any fixed confidence algorithm satisfies, for a given bandit instance θ = [θ₁,…,θₖ] from some class Θ

Lower bound on sample size (Image by Author)
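A slightly weakened but cleaner reading of their Theorem 1: any algorithm that identifies the best arm with probability at least 1-δ must satisfy

$$ \mathbb{E}[\tau] \;\ge\; c(\theta)\,\log\!\left(\frac{1}{2.4\,\delta}\right), $$

with c(θ) the constant discussed next.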

The instance-dependent constant c(θ) above depends on the frequency with which each of the K arms should be sampled. Suppose w(θ) = [w₁(θ),…,wₖ(θ)] is that frequency vector. This quantity can, of course, be computed only when θ is known, which means that in reality no algorithm knows it beforehand. However, it can be estimated over time!

This is what "track-and-stop" strategies do. They estimate θ from the rewards observed so far, which, at time t, gives them an estimate wᵗ(θ) of w(θ). They then sample the arm whose empirical sampling frequency lags farthest behind the one prescribed by wᵗ(θ), re-estimate θ, and repeat. This continues until enough evidence has been collected in favor of an arm, at which point the algorithm stops and declares that arm the best.
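The tracking step itself is easy to write down. Below is a minimal sketch of just that step (my own illustration): given the current target proportions wᵗ, pull the arm whose pull count lags furthest behind what those proportions prescribe. The computation of wᵗ from the estimated θ, the forced-exploration safeguard, and the Chernoff-style stopping rule from the paper are all omitted here.

```python
import numpy as np

def track(counts, target_weights):
    """One step of a track-and-stop style sampling rule.

    counts[k]        : number of times arm k has been pulled so far
    target_weights[k]: current plug-in estimate of the optimal sampling
                       proportion w_k(theta) (its computation is omitted)
    Returns the arm whose empirical frequency lags furthest behind the
    prescribed one.
    """
    counts = np.asarray(counts, dtype=float)
    t = counts.sum()
    deficit = t * np.asarray(target_weights, dtype=float) - counts
    return int(np.argmax(deficit))

# Example (purely illustrative numbers): arm index 1 is the most
# under-sampled relative to the target proportions, so it is pulled next.
print(track(counts=[40, 10, 50], target_weights=[0.45, 0.20, 0.35]))
```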

Track and stop strategies are provably optimal in sample size in the high confidence regime, i.e., when δ is close to 0. Notice, once again, that the focus is entirely on exploration and regret minimization is not the objective!

(I will be discussing the intuition behind this approach in much greater detail in a future article in this series.)

Lower-Upper Confidence Bound Algorithms

Confidence-based algorithms maintain a set, called a "Confidence Set", that contains the real bandit instance θ with high probability. For each arm k ∈ {1,…,K}, this set is simply an interval around the current empirical mean θ̂ₖ(t), of half-width given by an exploration bonus Cₖ(t,δ). In other words, with probability at least 1-δ, θ̂ₖ(t)-Cₖ(t,δ) ≤ θₖ ≤ θ̂ₖ(t)+Cₖ(t,δ).

Lower-Upper Confidence Bound (LUCB) algorithms, like Top-2 TS, work with two arms in each pair of rounds: (1) the arm hₜ with the largest empirical mean, and (2) among the remaining arms, the arm lₜ with the largest upper confidence bound, i.e.,

LUCB maintains two arms in each pair of rounds (Image by Author)

LUCB algorithms proceed in pairs of rounds, sampling both of these arms in consecutive rounds and updating their estimates. They stop when the LOWER confidence bound of hₜ is larger than the UPPER confidence bound of lₜ (and naturally declare hₜ the best).

For a proper choice of the exploration bonus Cₖ(t,δ), these algorithms can be shown to come within logarithmic factors of the optimal sample size. Designing such a bonus requires a finite time version of the Law of the Iterated Logarithm (LIL), and is beyond the scope of this article. What is to be noted is that, unlike vanilla UCB, LUCB algorithms do not simply sample the arm with the highest upper confidence bound in every round! Once again, the pure exploration version of a popular regret minimization algorithm requires forceful modification to favor aggressive exploration.
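To make this concrete, here is a minimal LUCB sketch for Bernoulli rewards (my own illustration, not a reference implementation). I use a simple Hoeffding-style bonus in place of the LIL-based one just mentioned, so the constants are chosen for readability rather than for the sharpest guarantees.

```python
import numpy as np

def lucb(true_means, delta=0.05, rng=np.random.default_rng(0)):
    """LUCB sketch for Bernoulli bandits with a Hoeffding-style bonus."""
    k = len(true_means)
    sums = np.zeros(k)
    counts = np.zeros(k)

    def pull(arm):
        sums[arm] += rng.binomial(1, true_means[arm])
        counts[arm] += 1

    for arm in range(k):                 # one initial pull per arm
        pull(arm)

    while True:
        means = sums / counts
        t = counts.sum()
        # Simple Hoeffding-style exploration bonus (a stand-in for an
        # anytime, LIL-based bonus).
        bonus = np.sqrt(np.log(4 * k * t**2 / delta) / (2 * counts))
        h = int(np.argmax(means))                  # empirical best arm
        ucb = means + bonus
        ucb[h] = -np.inf
        l = int(np.argmax(ucb))                    # strongest challenger
        # Stop once h's LOWER bound clears l's UPPER bound.
        if means[h] - bonus[h] > means[l] + bonus[l]:
            return h, counts
        pull(h)                                    # otherwise sample both arms
        pull(l)

# Returns the index of the declared best arm and the number of pulls used.
print(lucb([0.6, 0.5, 0.4]))
```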

A few other ways of designing pure exploration algorithms exist, but I believe these three illustrative examples are sufficient for the purposes of this article. In future parts of my series on "Demystifying Pure Exploration," we will encounter other, more exotic creatures and analyze their sample size performance.


Concluding Remarks

Recycling algorithms without understanding the underlying principles upon which they are designed is usually a bad idea and Thompson Sampling, despite all its wonderful advantages, is no exception.

A/B testing, and pure exploration in general, form a class of problems that is much more intricate and demanding (in terms of mathematical maturity) than regret minimization because, unlike the latter, they involve the design of measurable (a) sampling rules, (b) stopping times, and (c) winner selection criteria. Through this article, I hope I’ve convinced the reader that these problems necessarily have to be treated as a class of their own.

This article was a quick detour in my series on "Demystifying Pure Exploration," Parts 0 and 1 of which are already available on TDS. The series will continue with our analysis of how to improve the sample size performance of Action Elimination.

So until next time, remember, kids:

Regret minimizing algorithms are provably BAD at Best Arm Identification.

