A/B Testing Intuition with Python

A/B testing has been the gold standard for customer-facing products, covering everything from front-end changes, like the color of a button, to more subtle algorithmic changes like search ranking.
Recently, after taking Udacity's A/B Testing course, I came to understand the practical steps of performing an A/B test as:
- Choose a metric (a rate versus a probability has different implications for the calculations later on).
- Decide on the statistical power, minimum detectable effect (lift), and significance level. Also calculate the baseline conversion rate/probability.
- Calculate the sample size and the time needed for the experiment.
- Set up the experiment and do a sanity check.
- Collect the results and perform the analysis.
Of these, steps 1 to 3 are experiment design, step 4 is setup, and step 5 is analysis. While this is systematically straightforward, I have had difficulty grasping the math behind some of the concepts, like statistical power, lift, and significance level. After visiting a few more resources and tinkering around, I decided to put together some Python code to make sense of them. Before that, if you haven't already, please visit Evan Miller's blog for solid resources on A/B testing.
Designing the experiment
This step mostly relies on business needs, so be sure to talk to your stakeholders beforehand. Also, let's familiarize ourselves with some of the concepts:
- Statistical power: The ability of the experiment to correctly identify a positive change, given that there is indeed one.
- Significance level: The probability of wrongly identifying a positive change when there is actually none.
- Minimum detectable effect (MDE): The minimum change that the business would like to detect in this test.
I know statistical power and significance level sound quite abstract at this point, so let's develop some intuition about them. Before we dive deep, I would like to point out one thing that I have seen many questions about, and that is particularly important: there is no hard math behind the MDE; it is entirely based on the needs of the business. It is the product owners who should be telling you, "I want to detect at least a 1% change in conversion in order for it to be worth the effort."
Now let’s get to some action.
1. Creating some data
Let's first assume that you have an exactly equal number of pageviews in the control and experiment groups, which is unlikely in practice but helps with understanding the math.
bcr = 0.1 # Baseline conversion rate
lift = 0.02 # Difference between the 2 groups
size = 4000 # Total size of data, each group will have 2000
df, summary = generate_data_equal_size(size, bcr, lift)
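
The helper generate_data_equal_size lives in the accompanying notebook; a minimal sketch of what it might look like is below (an assumption on my part, and the 'count' and 'ctr' columns of the summary are what the later snippets rely on):
import numpy as np
import pandas as pd

def generate_data_equal_size(size, bcr, lift):
    # Hypothetical helper: simulate clicks for two equally sized groups
    n = size // 2
    group_a = np.random.binomial(1, bcr, n)         # control converts at the baseline rate
    group_b = np.random.binomial(1, bcr + lift, n)  # experiment converts at baseline + lift
    df = pd.DataFrame({'group': ['A'] * n + ['B'] * n,
                       'converted': np.concatenate([group_a, group_b])})
    summary = df.groupby('group')['converted'].agg(['count', 'mean']).rename(columns={'mean': 'ctr'})
    return df, summary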

This tells us that the click-through rate (CTR) of group B (the experiment group) is higher than that of group A. But how do we translate this into statistical confidence and interpret the result?
2. Binomial distribution
First, we know that whether a single user clicks through follows a Bernoulli distribution (either they click or they don't). When multiple users do this (assuming independence), the number of click-throughs (successes) follows a Binomial distribution. In the real world, the parameters of the distributions are derived from the experiment itself:

clicks_A ∼ Binomial(n_A, p_A), clicks_B ∼ Binomial(n_B, p_B)

which can be represented graphically:
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np

# Observed CTRs and group sizes from the summary above
p_A, p_B = summary.loc['A', 'ctr'], summary.loc['B', 'ctr']
n_A, n_B = summary.loc['A', 'count'], summary.loc['B', 'count']
xx = np.arange(150, 300)  # range of click counts to plot

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(xx, binom(n_A, p_A).pmf(xx), alpha=0.5)
ax.bar(xx, binom(n_B, p_B).pmf(xx), alpha=0.5)
plt.xlabel('converted')
plt.ylabel('probability')

However, this doesn't really tell us much. We can see that group B (orange) has a higher number of clicks, but we want to standardize this: what we really want is the distribution of the click-through rates rather than the number of clicks.
Illustrating the Central Limit Theorem
From the above, we know that when multiple users each click or don't click with some probability, the total number of clicks follows a Binomial distribution. But there's another way to think about this.
Let's call the users X₁, X₂, X₃, …, Xₙ. Assume each user Xᵢ clicks through with probability p; then Xᵢ ∼ Bernoulli(p), with mean p and variance p(1−p) (let's not go deep into why this holds for the Bernoulli distribution).
Then, by the Central Limit Theorem, the sample mean p̂ (the observed click-through rate) approximately follows a normal distribution (Equation 1):

p̂ ∼ N(p, p(1−p)/n)

Note the distinction between p̂, the sample mean, which follows the normal distribution above, and p, the click-through probability of a single user.
Let's do a small visualization to understand the above better. Let's simulate 100,000 users, each with a CTR of 0.1:
import numpy as np
import seaborn as sns

p = 0.1  # each user's click-through probability
outcomes = []
for i in range(100000):
    # Each user either clicks (1) or doesn't (0)
    outcome = np.random.binomial(1, p)
    outcomes.append(outcome)
sns.countplot(x=outcomes)

Now let's say we have 2,000 users in each sample. Then we can calculate the sample mean X̄ (which is just the observed CTR, p̂) for each sample. If we do this many times, we get the distribution of those sample means:
import random

sample_means = []
# Simulating 1000 samples
for i in range(1000):
    # For each sample, draw 2000 users (with replacement) from the outcomes above
    sample = random.choices(outcomes, k=2000)
    mean = np.mean(sample)
    sample_means.append(mean)
sns.histplot(sample_means, kde=True)

We can see that the distribution of sample mean does indeed follow a normal distribution.
3. Forming the hypothesis
Since the sample mean follows a normal distribution, we can graph the "assumed" distribution of the sample mean of group A and group B using Equation 1. We can assume the observed p̂ of each group is an unbiased estimator of the population mean.
from scipy.stats import norm

fig, ax = plt.subplots()
xx = np.linspace(0.07, 0.17, 1000)
# Normal approximation of each group's sample mean (Equation 1)
data_a_norm = norm.pdf(xx, p_A, np.sqrt(p_A * (1 - p_A) / n_A))
data_b_norm = norm.pdf(xx, p_B, np.sqrt(p_B * (1 - p_B) / n_B))
sns.lineplot(x=xx, y=data_a_norm, color='blue', ax=ax)
sns.lineplot(x=xx, y=data_b_norm, color='red', ax=ax)
ax.axvline(p_A, color='cyan', linestyle='--')
ax.axvline(p_B, color='orange', linestyle='--')
plt.xlabel('converted')
plt.ylabel('probability')

What we are interested in is the difference between the two groups above. In other words, if I pick a mean μ_A from distribution A and a mean μ_B from distribution B, how likely is μ_B to be greater than μ_A?
Thus, our hypotheses are as follows:

H₀: d = p_B − p_A = 0 (there is no difference between the two groups)
H₁: d = p_B − p_A ≠ 0 (there is a difference between the two groups)

Distribution of the difference in means
We can assume the users in the two groups are independent, so:

d = p̂_B − p̂_A,  Var(d) = Var(p̂_B) + Var(p̂_A) = p_B(1−p_B)/n_B + p_A(1−p_A)/n_A

Taking the square root gives the standard error of the difference, which is the Satterthwaite approximation. Note that this assumes the two variances are different (which is what we are assuming here), unlike the pooled standard error that Udacity uses:

SE_d = sqrt(p_A(1−p_A)/n_A + p_B(1−p_B)/n_B)
SE_pool = sqrt(p_pool(1−p_pool)(1/n_A + 1/n_B)), where p_pool = (n_A·p_A + n_B·p_B)/(n_A + n_B)

Since both distributions are normal, d also follows a normal distribution, so we can graph the distribution of d under the two hypotheses.
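These two formulas translate into something like the sketch below. calculate_SE is my assumed version of the helper used later in this post (the real implementation lives in the notebook), together with a pooled variant:
def calculate_SE(p_A, p_B, n_A, n_B):
    # Unpooled (Satterthwaite-style) standard error of the difference in proportions
    return np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)

def calculate_pooled_SE(p_A, p_B, n_A, n_B):
    # Pooled standard error, as used in the Udacity course
    p_pool = (p_A * n_A + p_B * n_B) / (n_A + n_B)
    return np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
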
4. Calculating the sample size
Now that we understand how the hypotheses are formed, let's continue on to deriving the sample size. Evan Miller wrote an awesome blog post on why you MUST decide on a sample size before running the experiment, and I will try to build an intuition around it.
Recall that our baseline conversion rate is 0.1 and our lift is 0.02 for our population. However, before running the experiment you would not have the lift number, and there is no way to figure it out. As such, we use the minimum detectable effect (MDE), a lift that we would accept as "significant", to calculate the sample size. The tool I personally use, which is also recommended by Udacity's course, is Evan Miller's sample size calculator.
But how does it work?
Power of a test and statistical significance
Let's see how power and the significance level work together. Recall the definitions:
- Statistical power: The ability of the experiment to correctly identify a positive change, given that there is indeed one.
- Significance level: The probability of wrongly identifying a positive change when there is actually none.
So essentially, assuming there is in fact no difference between A and B (H₀ is true), 5% of the time we will observe a difference whose magnitude is at least Zα/2. We can calculate Zα/2 with norm.ppf(1 - alpha/2, d0, SE), where d0 = 0 and SE is the pooled standard error.
The reason we use Zα/2 instead of Zα is that we are running a two-tailed test, as shown in the graph below, where the total shaded area is 5%. That area is our probability of wrongly rejecting H₀. We take 1 - alpha/2 because we are interested in the right tail, and the graph below shows the calculated minimum difference needed.
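plot_null is another helper from the notebook; a rough sketch of what it might do, assuming the pooled SE under H₀, is:
def plot_null(p_A, p_B, n_A, n_B, alpha=0.05):
    # Null distribution of the difference d: centred at 0 with the pooled SE
    p_pool = (p_A * n_A + p_B * n_B) / (n_A + n_B)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
    z_crit = norm.ppf(1 - alpha / 2, loc=0, scale=se_pool)  # ~0.01939 for these inputs

    xx = np.linspace(-4 * se_pool, 4 * se_pool, 1000)
    yy = norm.pdf(xx, 0, se_pool)
    fig, ax = plt.subplots()
    ax.plot(xx, yy)
    # Shade both rejection regions; the total shaded area equals alpha
    ax.fill_between(xx, 0, yy, where=np.abs(xx) >= z_crit, alpha=0.3)
    ax.axvline(z_crit, linestyle='--')
    plt.xlabel('difference d')
    plt.ylabel('probability density')
    return z_crit
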
plot_null(p_A=0.1, p_B=0.12, n_A=2000, n_B=2000)

So under our test, whenever we observe a difference of at least 0.01939, we reject the null hypothesis and say that our test group does indeed perform better than the control group.
In addition, in an A/B test we need to reach a particular power, normally 80%. This means we need to reject H₀ 80% of the time given that the alternative hypothesis H₁ is true. Recall that we reject the null hypothesis whenever the difference is greater than 0.01939. As such, we need to find the alternative distribution such that the probability of observing a difference greater than 0.01939 is at least 80% under this distribution.
Let’s take a look at a few of such alternative distributions:
# We plot three alternative distributions; assume the SE of each is calculated with bcr=0.1 and lift=0.02
SE_0 = calculate_SE(0.1, 0.12, 2000, 2000)
SE_1 = calculate_SE(0.1, 0.12, 2000, 2000)
SE_2 = calculate_SE(0.1, 0.12, 2000, 2000)
plot_multiple_alt(0, 0.01, 0.02, SE_0, SE_1, SE_2)

We can see here that the power is the area under the alternative distribution to the right of the calculated Zα/2 = 0.01939. This area is the probability of observing a difference greater than Zα/2 if the alternative hypothesis is true.
For example, under H₂, where the MDE is 0.02 and the sample size is 2000, the power is only 52%. It is greater than the power when the MDE is 0.01, which makes sense: if the actual effect is larger, our test has an easier time detecting the difference. However, it is still not quite what we wanted.
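We can sanity-check that 52% figure directly, reusing the calculate_SE sketch from above (so the exact value may differ slightly from the notebook's):
SE = calculate_SE(0.1, 0.12, 2000, 2000)
z_crit = norm.ppf(1 - 0.05 / 2, loc=0, scale=SE)      # ~0.0194
power_h2 = 1 - norm.cdf(z_crit, loc=0.02, scale=SE)   # area to the right of z_crit under H2
print(round(power_h2, 2))                             # ~0.52
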
To increase the power in this case, we need to make the bell curves above thinner (for lack of a better word), which means reducing the standard error. If we recall the formula for the SE (hint: there are two of them), both shrink as the sample size grows (they are inversely proportional to the square root of the sample size). So to reduce the SE, we need to increase the sample size. Let's do exactly that.
# Same as before, but with 4000 users per group
SE_0 = calculate_SE(0.1, 0.12, 4000, 4000)
SE_1 = calculate_SE(0.1, 0.12, 4000, 4000)
SE_2 = calculate_SE(0.1, 0.12, 4000, 4000)
plot_multiple_alt(0, 0.01, 0.02, SE_0, SE_1, SE_2)

Above we can see that the curves do appear thinner. We also observe that the Zα/2 threshold under H₀ has shifted left because of the smaller standard error. If we do the same calculation for the area under the H₂ curve, we find that the power is now 82%, which is what we wanted.
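The same quick check as before, again leaning on the assumed calculate_SE sketch:
SE = calculate_SE(0.1, 0.12, 4000, 4000)
z_crit = norm.ppf(1 - 0.05 / 2, loc=0, scale=SE)           # the threshold shifts left to ~0.0137
print(round(1 - norm.cdf(z_crit, loc=0.02, scale=SE), 2))  # ~0.82
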
A simple way to put all of this together:
- Calculate Zα/2 based on H₀; this is a threshold such that, if the observed difference crosses it, we reject H₀.
- If H₁ is true, we should see a difference of at least Zα/2 at least 80% of the time.
Calculating the sample size
So now we know that the sample size is based on the MDE, the power, and the significance level. The easiest way is to scroll through sample sizes like in the gif below (just kidding, but it is cool to visualize how the power changes as a function of sample size). I have built a simple slider in the notebook, which can be found on my GitHub.

The official way to do it is with a formula, which can be found online as linked above. There are a few ways to calculate the required sample size (because of the different assumptions used); the derivation of both formulas is beyond the scope of this article. Here p is the baseline conversion rate, δ the MDE, and p̄ = p + δ/2:

n = (Z₁₋α/2·sqrt(2p(1−p)) + Z₁₋β·sqrt(p(1−p) + (p+δ)(1−p−δ)))² / δ²

n = 2·p̄(1−p̄)·(Z₁₋α/2 + Z₁₋β)² / δ²   (pooled SE)

The animation above uses the second formula, the pooled-SE one, and we can verify that at exactly 3843 users per group the power crosses the 80% mark.
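calc_sample_size is a helper in the notebook; a hypothetical re-implementation based on the two formulas above (per-group sample size, with 'pooled_se' mapping to the second formula and 'evanmiller' to the first) could look like this:
def calc_sample_size(alpha, beta, p, delta, method='pooled_se'):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    if method == 'pooled_se':
        p_bar = p + delta / 2
        n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2
    else:  # 'evanmiller'
        n = (z_alpha * np.sqrt(2 * p * (1 - p))
             + z_beta * np.sqrt(p * (1 - p) + (p + delta) * (1 - p - delta))) ** 2 / delta ** 2
    return int(np.ceil(n))
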
calc_sample_size(alpha=0.05, beta=0.2, p=0.1, delta=0.02, method='pooled_se') # 3843
calc_sample_size(alpha=0.05, beta=0.2, p=0.1, delta=0.02, method='evanmiller') #3623

Analyzing the results
Calculating Confidence Interval
So you have finally finished defining your experiment. Next comes the exciting phase of actually analyzing the experiment's results. To do that, we make use of a concept called the confidence interval (CI). The reason we need a confidence interval is that in statistics we can only express our belief in a number with a certain level of confidence.
In other words, by calculating a 95% confidence interval (the usual choice) for the difference between the two groups, we get two numbers, a lower and an upper bound, such that if we repeated the experiment 100 times, the intervals we construct would contain the true difference about 95 of those times.
The formula for the CI is fairly simple:

(p̂_B − p̂_A) ± Z₁₋α/2 · SE_pool

Let’s visualize an example:

So this shows us that if we observe a difference of 0.01675, we are 95% confident that the actual difference lies between 0.00317 and 0.03033. Here the lower bound of the CI is greater than 0, meaning we are quite confident that variant B is doing better than variant A. In this case, we would reject H₀.
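In code, this is my assumed version of the get_ci helper that the simulation below uses (the actual implementation is in the notebook):
def get_ci(p_pool, p_A, p_B, n_A, n_B, alpha=0.05):
    # Two-sided CI for the difference p_B - p_A, using the pooled SE
    d_hat = p_B - p_A
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
    z = norm.ppf(1 - alpha / 2)
    return d_hat - z * se_pool, d_hat + z * se_pool
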
Let's tie it all together with another simulation. Say we run 1000 identical experiments on the population defined at the start, choosing a sample size of 3843 per group (the magic number!). How many times will we fail to conclude that there is a difference?
Yes, it is about 20% of the time (1 − power). Let's simulate this:
test_results = []
size = 2 * 3843  # 3843 users per group
for i in range(1000):
    df, summary = generate_data_equal_size(size, bcr, lift)
    p_A = summary.loc['A', 'ctr']
    p_B = summary.loc['B', 'ctr']
    n_A = summary.loc['A', 'count']
    n_B = summary.loc['B', 'count']
    p_pool = (p_A * n_A + p_B * n_B) / (n_A + n_B)
    low_ci, up_ci = get_ci(p_pool, p_A, p_B, n_A, n_B, alpha=0.05)
    test_results.append((low_ci, up_ci))
Recall that we will reject the null hypothesis whenever low_ci > 0.
lower_cis = np.array([i[0] for i in test_results])
len(lower_cis[lower_cis > 0])
# ==> 808
So we reject the null hypothesis 808 times and fail to reject it 192 times, even though we sampled from a population with a true difference. That is close enough to the desired 80% power.
That is it. I hope this gives some insight into how to make sure an experiment has enough power, and how to identify experiments that are underpowered.
A special thanks to the article written by Nguyen Ngo, which I used as a reference to build my understanding. I follow his organization quite closely and try to explain, in my own way, some of the concepts I hadn't yet grasped.
All the code can be found here. That’s it for now!
References
- The article by Nguyen Ngo that influenced this post: https://towardsdatascience.com/the-math-behind-a-b-testing-with-example-code-part-1-of-2-7be752e1d06f
- Twitter's A/B testing series: https://blog.twitter.com/engineering/en_us/a/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests.html
- Evan Miller's explanation: https://www.evanmiller.org/how-not-to-run-an-ab-test.html