Faster A/B testing 🚀— in numbers

Methods other than the traditional hypothesis testing (also known as frequentist) approach are becoming more and more widespread, and I found Bayesian methods to be a meaningful improvement.

Nils Ziehn
Towards Data Science

--

The traditional approach 🎓

Typically, A/B testing means hypothesis testing: we want to “prove that A & B are significantly different”. We can take our test results, plug them into an existing A/B test evaluator, e.g. the one from Evan Miller, and get our results.

In most of these tools, part of the result is the “p”-value. Typically, if it’s less than 5%, the results are considered significant. Interestingly enough, many people don’t know what the “p”-value actually means, and it very often gets misinterpreted. This is part of the problem: if you can’t interpret the test results, it’s hard to draw reasonable conclusions.

The “p”-value is the probability of seeing results at least as extreme as the ones you measured, assuming that A and B come from the same distribution. It is not the probability that A and B are the same.

Why change? 😱

In my opinion, there are 2 significant problems with the normal approach:

  1. Since understanding the “p”-value is hard, people often draw wrong conclusions
  2. The traditional approach requires a very large sample size: with typical parameters, to show an uplift of 10% on a conversion funnel with a 1% success rate, you have to let 320,000 users see your test! (See the rough sketch below.)
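
Under common default settings (a 5% significance level and 80% power, which are assumptions here since the exact calculator settings are not stated above), a standard rule-of-thumb sample-size formula reproduces a number of roughly that size. A minimal sketch:

# Rough sample-size sketch (rule of thumb, not the exact calculator referenced above):
# n per variant ~= 16 * p * (1 - p) / delta^2, assuming 5% significance and 80% power
baseline = 0.01                  # 1% conversion rate
delta = baseline * 0.10          # 10% relative uplift = 0.1% absolute difference
n_per_variant = 16 * baseline * (1 - baseline) / delta ** 2
print(round(n_per_variant))      # ~158,400 users per variant
print(round(2 * n_per_variant))  # ~317,000 users in total, close to the 320,000 above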

Bayes to the rescue! ⛑

To evaluate other tools, I built a simulation environment to test various frameworks and their parameters. Below, I will share my high-level findings and recommend some parameters that I found to be very robust.

Scenario: Let’s say you’re planning to run 10 A/B tests over the coming weeks, one after the other, and if a test is accepted, you will roll out the feature to all future users. Not all of the A/B tests will be positive; sometimes you just don’t know what the users are looking for. I assumed that it’s equally likely that you test a better or a worse version.

In each simulation, I compound the improvements of the accepted tests. So if 3 tests out of 10 were accepted with uplifts of +3%, +7% and +2%, I calculate:

(1.03 * 1.07 * 1.02) - 1 = 0.124… ≈ 12.4%
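
As a tiny sketch of that compounding step (using the same example uplifts):

# Compound the uplifts of the accepted tests from one simulation run
accepted_uplifts = [0.03, 0.07, 0.02]
compounded = 1.0
for uplift in accepted_uplifts:
    compounded *= 1 + uplift
print(round(compounded - 1, 3))  # 0.124 -> ~12.4%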

I repeated the simulation above thousands of times and averaged the compounded results:

As you can see, all frameworks accept positive tests and allow you to roll out improvements to all users. But you can also see that the Bayes (full duration) version produced the biggest uplift (given the exact same A/B test results). How is that possible? The Bayesian approach has a lower bar for accepting small improvements, and since small improvements from multiple tests can compound into big ones, the Bayesian versions win. You can also see that if you run the A/B tests for shorter periods of time, the frameworks cannot make decisions as good as if they had all of the data.

But is there any downside to the Bayesian approach? This question is unfortunately often left out, but to answer it we should not look at the averaged simulation results; instead, we should look at the distributions of the compounded results.

The percentiles show that in a small number of the simulations (~1%), bad decisions by the testing frameworks lead to an overall decrease in performance. And although the simulation makes this visible, in a real-world case you would not notice that you’re hurting your business! In contrast, you can see that the very conservative traditional (frequentist) approach does not have this problem.

Ok, so which one is better? 🤔

Generally, I prefer the Bayesian approach: you can expect to get better results using this method as long as you are not extremely unlucky, and even then you’re not losing much versus what the traditional approach would give you.

You also have to consider the aspect of time: providing less time (and therefore less data) to your decision framework will lead to comparably worse decisions…

Except: if a shorter test duration allows you to run more tests, then this can be a worthwhile trade-off.

Let’s say running your tests in a third of the time allows you to run twice as many tests over the same period. How does this impact the results?

As you can see, running more tests with a shorter test duration does not help much in the frequentist approach, but in the Bayesian case we reduce our possible downside and significantly improve our upside.

Let’s run an infinite number of tests in 0 time 🦁

My simulations show that in the Bayesian framework it’s possible to cut your test time shorter and shorter, as long as you run correspondingly more tests (e.g. cut the time by 10 to run 10 times as many tests). In reality, this has a few problems though:

  1. If the quality of the tests you are running decreases because of the increased frequency, all bets are off.
  2. Often you can’t run as many tests, since somebody needs to build them; in this case, it makes sense to allocate as much time as possible to each test to make the best possible decision.
  3. Generally, you should run your test for at least a week: user behavior on the weekend might be very different from working days, and your test should be able to account for these cases!

Let’s analyze some tests using Bayes 🐳

I suggest that you use the following parameters to protect your downside:

  1. Likelihood of win should be >66%
  2. Margin should be >1%

I will provide Python code below, but you can do this in most other programming languages as well! The code is just an approximation and runs reasonably fast.

# to install dependencies: pip install numpy scipy
import numpy as _numpy
from scipy.stats.distributions import beta as _beta


def get_likelihood_of_win(test_result, margin=0.01):
    """Approximate the probability that B beats A by at least `margin` (relative)."""
    success = 0
    total = 0
    ref_pos = test_result['A']['success_cnt']
    ref_neg = test_result['A']['fail_cnt']
    test_pos = test_result['B']['success_cnt']
    test_neg = test_result['B']['fail_cnt']
    conv_rate_base = max(
        ref_pos / (ref_pos + ref_neg),
        test_pos / (test_pos + test_neg),
    )
    # Numerically integrate over plausible conversion rates x of A:
    # weight each candidate rate by A's Beta density and add the
    # probability that B's rate exceeds x * (1 + margin).
    for x in _numpy.linspace(0, min(conv_rate_base * 2, 1), 100):
        prob = _beta.pdf(x=x, a=ref_pos, b=ref_neg)
        prob_test_wins = 1 - _beta.cdf(
            x=x * (1 + margin),
            a=test_pos, b=test_neg,
        )
        success += prob_test_wins * prob
        total += prob
    return success / total


if __name__ == '__main__':
    test_result = {
        'A': {
            'success_cnt': 10,
            'fail_cnt': 100,
        },
        'B': {
            'success_cnt': 18,
            'fail_cnt': 100,
        },
    }
    likelihood_of_win = get_likelihood_of_win(
        test_result=test_result, margin=0.01
    )
    accepted = likelihood_of_win > 0.66
    print(accepted, round(likelihood_of_win, 2))
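
With the example counts above (A: 10 successes and 100 failures, B: 18 successes and 100 failures), the likelihood of win comes out well above the 0.66 threshold, so the script prints True and the test would be accepted.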

--

VP Product @ Limehome — Numbers driven, moving fast and breaking things. Hooked on product, tech, data & machine learning. https://www.linkedin.com/in/nziehn/