
In this post, I would like to invite you to continue our intuitive exploration of A/B testing, as seen in the previous post:
To summarize: through simulation and intuition, we were able to show that there is a relationship between Website Version and Signup, since we designed a test with a Statistical Power of 79% that allowed us to reject, with 95% confidence, the hypothesis stating otherwise. In other words, we showed that behavior as biased as ours would be found by chance only 1.6% of the time.
Even though we were satisfied with those results, we still need to prove, at a defined confidence level, that there is a higher-performing version. In practice, we need to test the hypothesis that, on average, version F should win over any other version.
Before we start
Let us revisit our working data from the prior post, where we ended up with 8017 dice rolls, as determined by our Statistical Power target of 80%.
# Biased Dice Rolling Function
DiceRolling <- function(N) {
  Dices <- data.frame()
  for (i in 1:6) {
    if (i == 6) {
      Observed <- data.frame(Version=as.character(LETTERS[i]), Signup=rbinom(N/6, 1, 0.2))
    } else {
      Observed <- data.frame(Version=as.character(LETTERS[i]), Signup=rbinom(N/6, 1, 0.16))
    }
    Dices <- rbind(Dices, Observed)
  }
  return(Dices)
}
# In order to replicate
set.seed(11)
Dices <- DiceRolling(8017) # We expect 80% Power
t(table(Dices))
As a reminder, we designed an R function that simulates a biased dice with a 20% probability of landing on 6 and a 16% chance of landing on any other number.
Additionally, we generated a dummy dataset of 8017 samples, as calculated for 80% Power, representing six different versions of a signup form and the number of leads observed on each. For this dummy set to be random and still have a winning version (F) to serve as our *ground truth,* we generated this table by simulating throws of this biased dice.
The output:

Which should allow us to produce this report:

As seen above, we can observe different Signup Rates for our landing page versions. What is interesting is that even though we planned a precise Signup Probability (Signup Rate) for each version, the observed rates turned out different from the expected (planned) ones.
Let us pause and conduct a "sanity check" on, say, Version C, which shows the largest difference between its Observed (0.14) and Expected (0.16) rates, to check whether there is something wrong with our data.
Sanity Check
Even though this step is not strictly needed, it will serve as a good starting point for building the intuition we will need for our primary goal.
As mentioned earlier, we want to show that our results, even though different from what we expected, are not far from it, since they simply vary around it according to the underlying probability distribution.
In other words, for the particular case of Version C, our hypotheses are as follows:
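In notation, with μC denoting Version C's true signup probability, that is H0: μC = 0.16 versus H1: μC ≠ 0.16 (a two-sided comparison).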

Why did we use means?
This particular case allows us to use either proportions or means since, because we designed our variable to be dichotomous with values 0 or 1, the mean equals the ratio or proportion of signups.
# Results for Version C
VersionC <- Dices[which(Dices$Version=="C"),]
# Mean calculation
mean(VersionC$Signup)

p-Value
We need to find our p-Value, which will allow us to accept or reject our hypothesis based on the probability of finding results "as extreme" as the one we got for Version C within the underlying probability distribution.
This determination, namely whether our mean is significantly different from a true value (0.16), is usually addressed through a variation of the well-known Student's t-Test called the "One-Sample t-Test." Note: since we are working with proportions, we could also use a "Proportion Test," though that is not the purpose of this post.
To obtain the probability of finding results as extreme as ours, we would need to repeat our data collection process many times. Since that procedure is expensive and unrealistic, we will use a method similar to the resampling by permutation we did in our last post, called "Bootstrapping."
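For reference, both classical tests have built-in equivalents in R; a minimal sketch, shown only for comparison since this post takes the simulation route:
# Classical (non-simulated) equivalents, for reference only
t.test(VersionC$Signup, mu = 0.16) # One-Sample t-Test against a mean of 0.16
prop.test(sum(VersionC$Signup), nrow(VersionC), p = 0.16) # One-sample proportion test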
Bootstrapping
Bootstrapping is done by reshuffling one of our columns, in this case Signups, while keeping the other one fixed. What differs from the permutation resampling we did before is that we now allow replacement, as shown below:

It is important to note that we need to allow replacement in this reshuffling process; otherwise, a simple permutation will always result in the same mean, as shown below.
Let us generate 10 Resamples without replacement:
for (i in 1:10) {
  Resample <- sample(VersionC$Signup, replace=FALSE)
  cat(paste("Resample #", i, " : ", mean(Resample), "\n", sep=""))
}

And 10 Resamples with replacement:
for (i in 1:10) {
  Resample <- sample(VersionC$Signup, replace=TRUE)
  cat(paste("Resample #", i, " : ", mean(Resample), "\n", sep=""))
}

Simulation
Let us simulate 30k bootstrapped resamples (with replacement) of Version C's data.
# Let's generate a Bootstrap and find our p-Value, Intervals and t-Scores
set.seed(1984)
Sample <- VersionC$Signup
tScores <- NULL
means <- NULL
for (i in 1:30000) {
  Bootstrap <- sample(Sample, replace = TRUE)
  # Mean of each bootstrapped resample
  means <- rbind(means, mean(Bootstrap))
  # One-sample t-score of the resample against the original sample mean
  SimulationtTest <- t.test((Bootstrap - mean(Sample)) / sd(Sample))
  tScores <- c(tScores, as.numeric(SimulationtTest$statistic))
}
As a result, we got:

Initially, one might expect a distribution similar in shape but centered around 0.16, thus resembling the "true population mean" distribution. Even though we did not recreate the exact "ground truth" distribution (the one we designed), since ours is centered at our sample mean (0.14) instead, we did recreate one that should have roughly the same shape and Standard Error and that should contain the true mean within its range.
We can compare our "bootstrapped standard error" against the "true mean standard error" obtained from the Central Limit Theorem together with the Standard Deviation formula for the Binomial Distribution.

Which allows us to obtain:

Which seems to be quite near to our bootstrapped Standard Error:
# Standard Error: standard deviation of the sampling distribution of the mean
round(sd(means),6)
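As a quick sketch of where the theoretical value above comes from, assuming the designed probability of 0.16 and the roughly 1336 rolls we have for Version C:
# Theoretical standard error of the mean: binomial standard deviation over sqrt(n)
p <- 0.16
n <- nrow(VersionC) # roughly 1336 rolls for Version C
sqrt(p * (1 - p) / n) # roughly 0.01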

This should be enough for us to approximate the true mean's sampling distribution by simulating a Normal Distribution with a mean of 0.16 and a Standard Error of 0.01. With this information, we could find the percent of times a value as extreme as 0.14 is observed.
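For instance, under this normal approximation, the two-tailed probability of a value at least as extreme as 0.14 could be sketched as below; since our sample mean is not exactly 0.14, this will differ slightly from the bootstrap p-Value we compute later:
# Two-tailed tail probability under the approximated true-mean distribution N(0.16, 0.01)
2 * pnorm(0.14, mean = 0.16, sd = 0.01) # roughly 0.046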

As seen above, both our True Mean Distribution (green) and our Bootstrapped Sample Distribution (blue) seem very similar, except that the latter is centered around 0.14.
At this point, we could solve our problem either by finding the percent of times a value as extreme as 0.14 is found within our true mean distribution (area colored in blue), or by finding the percent of times a value as extreme as 0.16 is found within our bootstrapped sample distribution (area colored in green). We will proceed with the latter since this post focuses on simulations based solely on our sample data.
To recap, we need to calculate how many times we observed values as extreme as 0.16 within our bootstrapped sample distribution. Note that in this case our sample mean (0.14) was lower than the expected mean of 0.16, but that is not always the case: as we saw in our results, Version D got 0.17.
In particular, we will perform a "two-tailed test," which means finding the probability of obtaining values as far from the mean as 0.16. Since our sample mean for Version C is 0.14, this is equivalent to saying as low as 0.12 or as high as 0.16, since both values are equally extreme.
For this case, we found:
# Expected Means, Upper and Lower interval (0.14 and 0.16)
ExpectedMean <- 0.16
upper <- mean(means)+abs(mean(means)-ExpectedMean)
lower <- mean(means)-abs(mean(means)-ExpectedMean)
PValue <- mean(means <= lower | means >= upper)
Sum <- sum(means <= lower | means >= upper)
cat(paste("We found values as extreme: ",PValue*100,"% (",Sum,"/",length(means),") of the times",sep=""))

We have found our p-Value, which is relatively low. Now we would like to find the 95% confidence interval of our mean, which should shed some light on which values it might take considering a Type I Error (Alpha) of 5%.
# Data aggregation
freq <- as.data.frame(table(means))
freq$means <- as.numeric(as.character(freq$means))
# Sort Ascending for right-most proportion
freq <- freq[order(freq$means,decreasing = FALSE),]
freq$cumsumAsc <- cumsum(freq$Freq)/sum(freq$Freq)
UpperMean <- min(freq$means[which(freq$cumsumAsc >= 0.975)])
# Sort Descending for left-most proportion
freq <- freq[order(freq$means,decreasing = TRUE),]
freq$cumsumDesc <- cumsum(freq$Freq)/sum(freq$Freq)
LowerMean <- max(freq$means[which(freq$cumsumDesc >= 0.975)])
# Print Results
cat(paste("95 percent confidence interval:n ",round(LowerMean,7)," ",round(UpperMean,7),sep=""))

Let us now compute our Student’s t-score, which is defined as follows:
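That is, the usual one-sample statistic t = (x̄ − μ) / (s / √n), where x̄ is the sample mean, μ the hypothesized mean, s the sample standard deviation, and n the sample size.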

Since we already calculated this formula for every one of our 30k resamples, we can generate our critical t-Scores for 90%, 95%, and 99% confidence intervals.
# Which are the T-Values expected for each Confidence level?
histogram <- data.frame(X = tScores)
library(dplyr)
histogram %>%
  summarize(
    # Find the 0.90 quantile of the simulated t-scores
    q.90 = quantile(X, p = 0.90),
    # ... and the 0.95 quantile
    q.95 = quantile(X, p = 0.95),
    # ... and the 0.99 quantile
    q.99 = quantile(X, p = 0.99)
  )

These values are very near the original Student’s t-Table values for 1335 (N-1) degrees of freedom, as seen here:

To recap: our calculated p-Value was around 3.69%, our 95% interval did not include 0.16, and our absolute t-score of 2.1, as seen in our table, fell just between the critical scores for Alpha 0.05 and 0.01. All of this points to the same outcome: we reject the null hypothesis with 95% confidence, meaning we cannot confirm that Version C’s true mean is equal to 0.16.
Yet we designed this test ourselves, and we know for sure our null hypothesis was correct. Rejecting a true null hypothesis is called a False Positive or Type I Error, which could have been avoided by increasing our confidence level from 95% to, say, 99%.
So far, we have performed the equivalent of a "One-Sample t-Test" through simulations, which means we have determined whether the "sample mean" of 0.14 was statistically different from a known or hypothesized "population mean" of 0.16, our ground truth.
This will serve as a building block for what comes next, since we will now follow a very similar approach to compare our landing page versions against each other and see if there is a winner.
Finding our winner version
We have explored how to compare whether a Sample Mean is statistically different from a known Population Mean; now, let us compare one Sample Mean with another Sample Mean.
For this particular example, let us compare Version F vs. Version A.
This procedure of comparing two independent samples is usually addressed with a test called the "Unpaired Two-Sample t-Test." It is unpaired because we use different (independent) samples, which we assume behave randomly, are normally distributed, and have zero covariance, as we will observe later.
If we were to use the same sample, say at different moments in time, we would use a "Paired Two-Sample t-Test," which, in contrast, compares two dependent samples and assumes a non-zero covariance, which would be reflected in the formula.
In simple words, we want to know how often we observe a positive difference in means, which is equivalent to saying that Version F has a higher mean than Version A and, thus, better performance. We know our current difference in means is as follows:

Since we know our Sample Means are just a single measurement of the real Population Mean for both Version F and Version A and not the true sample mean for neither one, we need to compute the estimated sample distribution for both versions like we did earlier. Unlike before, we will also calculate the difference in means for each resample to observe how it is distributed.
Let us simulate 40k resamples with replacement for Version F and Version A and calculate the difference in means on each:
# Let's select data from Version F and Version A
VersionF <- Dices[which(Dices$Version=="F"),]
VersionA <- Dices[which(Dices$Version=="A"),]
# We simulate 40k
Diff <- NULL
for (i in 1:40000) {
  BootstrapA <- sample(VersionA$Signup, replace = TRUE)
  BootstrapF <- sample(VersionF$Signup, replace = TRUE)
  MeanDiff <- mean(BootstrapF) - mean(BootstrapA)
  Diff <- rbind(Diff, MeanDiff)
}
# We plot the result
totals <- as.data.frame(table(Diff))
totals$Diff <- as.numeric(as.character(totals$Diff))
plot(totals$Freq ~ totals$Diff, ylab="Frequency", xlab="Difference", main="Sampling Difference Distribution")

As we might expect from what we learned earlier, we got a normally distributed shape centered at our previously calculated sample difference of 0.0337. As before, we also know the difference between the true population means of Versions A and F should fall within the range of this distribution.
Additionally, our bootstrap should have provided us with a good approximation of the Standard Error of the difference between the true means. We can compare our "bootstrapped standard error" with the "true mean difference standard error" using both the Central Limit Theorem and the Binomial Distribution.

Which allows us to obtain:

Just like before, it is quite near our bootstrapped Standard Error for the difference of the means:
# Simulated Standard Error of the differences
round(sd(Diff),6)
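As a sketch of where that theoretical number comes from, assuming the designed probabilities of 0.20 and 0.16 and roughly 1336 rolls per version:
# Theoretical standard error of the difference in means for two independent samples
pF <- 0.20; pA <- 0.16; n <- 1336
sqrt(pF * (1 - pF) / n + pA * (1 - pA) / n) # roughly 0.0148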

As designed, we know the true expected difference in means is 0.04. We should have enough data to approximate a Normal Distribution with a mean of 0.04 and a Standard Error of 0.0148, in which case we could find the percent of times a value as extreme as 0 is observed.
This scenario is unrealistic, though, since we do not usually have the population means, which is the whole purpose of estimating through confidence intervals.
In our first example, we compared the sample distribution of Version C against a hypothesized population mean of 0.16. In this case, by contrast, we compare two individual samples with no further information, just as would happen in real A/B testing.
In particular, we want to prove that Version F is superior to Version A, meaning that the difference between means is greater than zero. For this, we need to perform a "One-Tailed" test answering the following question: what percent of the time did we observe a difference in means greater than zero?
Our hypotheses are as follows:
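Written out, with μF and μA denoting the true signup probabilities: H0: μF − μA ≤ 0 versus H1: μF − μA > 0 (a one-sided comparison).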

The answer:
# Percent of times greater than Zero
mean(Diff > 0)

Since our p-Value represents the proportion of times we did not observe a difference in means greater than zero within our simulation, we can calculate it as 0.011 (1 - 0.989). Being lower than 0.05 (Alpha), we can reject our null hypothesis and conclude that Version F performed better than Version A.
If we calculate both 95% confidence intervals and t-Scores for this particular test, we should obtain similar results:
Confidence interval at 95%:
# Data aggregation
freq <- as.data.frame(table(Diff))
freq$Diff <- as.numeric(as.character(freq$Diff))
# Right-most proportion (Inf)
UpperDiff <- Inf
# Sort Descending for left-most proportion
freq <- freq[order(freq$Diff,decreasing = TRUE),]
freq$cumsumDesc <- cumsum(freq$Freq)/sum(freq$Freq)
LowerDiff <- max(freq$Diff[which(freq$cumsumDesc >= 0.95)])
# Print Results
cat(paste("95 percent confidence interval:n ",round(LowerDiff,7)," ",round(UpperDiff,7),sep=""))

As expected, our confidence interval tells us that, with 95% confidence, we should expect a difference of at least 0.0097, which is above zero and therefore indicates better performance.
Unpaired Two-Sample t-Test score:

Similar to our previous values, checking our t-Table for T = 2.31 and 2653 degrees of freedom, we also find a p-Value of roughly 0.01.
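For reference, R's built-in unpaired test (the same call we will use later when estimating Power by simulation) paints essentially the same picture:
# Built-in one-tailed unpaired two-sample t-test: is F's mean greater than A's?
t.test(VersionF$Signup, VersionA$Signup, alternative = "greater")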

Pairwise Comparison
So far, we have compared our Landing Page Version C with a hypothesized mean of 0.16. We have also compared Version F with Version A and found which of the two performed better.
Now we need to determine our absolute winner. We will do a Pairwise Comparison, meaning that we will test every version against every other one to determine our absolute winner, if it exists.
Since we will run a One-Tailed test for each pair and we do not need to test a version against itself, we can reduce the total number of tests, as calculated below.
# Total number of versions
VersionNumber <- 6
# Number of versions compared
ComparedNumber <- 2
# Combinations
factorial(VersionNumber)/(factorial(ComparedNumber)*factorial(VersionNumber-ComparedNumber))
As output, we obtain 15 pairs of tests.
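For completeness, R's built-in choose() returns the same count directly:
choose(6, 2) # 15 pairwise comparisons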

We will skip the process of repeating this 15 times (a sketch of how the loop could be automated appears at the end of this section) and jump straight to the results, which are:

As seen above, Version F performed better than Versions A and C and came close to outperforming B, D, and E, which landed near our selected Alpha of 5%. In contrast, Version C seems to be an odd case: against both D and E, it appears to show a difference in means greater than zero, which we know cannot be true since all three were designed with the same probability of 0.16.
In other words, we failed to reject our Null Hypothesis at 95% confidence even though it is false for F vs. B, D, and E; this situation (a Type II Error) could be addressed by increasing our Statistical Power. In contrast, we rejected a true null hypothesis for D vs. C and E vs. C, which means we incurred a Type I Error, something that could be addressed by defining a lower Alpha (a higher confidence level).
We did design our test to have an 80% statistical power. However, we designed it solely for testing differences between total observed and expected frequencies, not for testing differences between individual means. In other words, we have switched from a "Chi-Squared Test" to an "Unpaired Two-Sample t-Test."
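For reference, here is a minimal sketch of one way those 15 one-tailed comparisons could be looped with R's built-in t.test(); it is an illustration rather than the exact code behind the table above, so its p-Values may differ slightly from the simulated ones:
# Sketch: every pairwise one-tailed comparison (does the second version beat the first?)
Pairs <- combn(LETTERS[1:6], 2)
pValues <- apply(Pairs, 2, function(pair) {
  First <- Dices$Signup[Dices$Version == pair[1]]
  Second <- Dices$Signup[Dices$Version == pair[2]]
  t.test(Second, First, alternative = "greater")$p.value
})
data.frame(Comparison = paste(Pairs[2,], "vs", Pairs[1,]), pValue = round(pValues, 4))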
Statistical Power
We have obtained our results. Even though we could use them as-is and select the versions with the largest overall differences (the ones with the lowest p-Values), we might want to re-test some of the variations to be entirely sure.
As we saw in our last post, Power is calculated as follows:
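That is, Power = 1 - β: the probability of correctly rejecting the null hypothesis when it is in fact false, with β being the probability of a Type II Error.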

As a reminder, Power is a function of:
- Significance criterion: our Type I Error or Alpha, which we set at 5% (95% confidence).
- Effect Magnitude or Size: the difference between our observed and expected values in terms of the standardized statistic in use. In this case, since we are using a Student’s t statistic, this effect (named d) is calculated as the "difference between means" divided by the "Pooled Standard Deviation." It is usually categorized as Small (0.2), Medium (0.5), and Large (0.8).
- Sample size: This represents the total amount of samples (in our case, 8017).
Effect Magnitude
We designed an experiment with a relatively small effect magnitude since our dice is biased on only one face (6), with only a slight additional chance of landing in its favor.
In simple words, our effect magnitude (d) is calculated as follows:

If we calculate this for the expected values of Version F vs. Version A, using the formulas we have learned so far, we obtain:
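In R, using the binomial variances of the two designed probabilities, that computation looks like this:
# Cohen's d for the designed probabilities of Version F (0.2) and Version A (0.16)
(0.2 - 0.16) / sqrt((0.2 * 0.8 + 0.16 * 0.84) / 2) # roughly 0.104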

Sample Size
As we discussed in our last post, we can expect an inverse relationship between sample size and effect magnitude: the larger the effect, the smaller the sample needed to detect it at a given significance level.
Let us try to find the sample size needed to reach 90% Power. We can solve this by iterating over different values of N until we minimize the difference between our target Power and the observed Power.
# Basic example on how to obtain a given N based on a target Power.
# Playing with initialization variables might be needed for different scenarios.
set.seed(11)
CostFunction <- function(n, d, p) {
  # Squared gap between the target Power (p) and the Power achieved with n samples per group
  df <- (n - 1) * 2
  tScore <- qt(0.05, df, lower = FALSE)
  value <- pt(tScore, df, ncp = sqrt(n/2) * d, lower = FALSE)
  Error <- (p - value)^2
  return(Error)
}
SampleSize <- function(d, n, p) {
  # Initialize variables
  N <- n
  h <- 0.000000001
  LearningRate <- 3000000
  HardStop <- 20000
  power <- 0
  # Iteration loop: simple gradient descent on N
  for (i in 1:HardStop) {
    dNdError <- (CostFunction(N + h, d, p) - CostFunction(N, d, p)) / h
    N <- N - dNdError * LearningRate
    tLimit <- qt(0.05, (N - 1) * 2, lower = FALSE)
    new_power <- pt(tLimit, (N - 1) * 2, ncp = sqrt(N/2) * d, lower = FALSE)
    if (round(power, 6) >= p) {
      cat(paste0("Found in ", i, " Iterations\n"))
      cat(paste0(" Power: ", round(power, 2), "\n"))
      cat(paste0(" N: ", round(N)))
      break()
    }
    power <- new_power
  }
}
set.seed(22)
SampleSize((0.2 - 0.16) / sqrt((0.16 + 0.1344) / 2), 1336, 0.9)

As seen above, after iterating over different values of N, we obtained a recommended sample size of 1576 per dice to reach a Power of 0.9.
Let us repeat the experiment from scratch and see which results we get with this new sample size of 9456 (1576 * 6), as suggested by aiming for a solid Statistical Power of 0.9.
# Repeat our experiment with sample size 9456
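As a cross-check (not the method used above), base R's power.t.test() lands on essentially the same number under the same assumptions:
# Cross-check with base R's power.t.test (two-sample, one-sided)
power.t.test(delta = 0.04, sd = sqrt((0.16 + 0.1344) / 2), sig.level = 0.05,
             power = 0.9, type = "two.sample", alternative = "one.sided")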
set.seed(11)
Dices <- DiceRolling(9456) # We expect 90% Power
t(table(Dices))

Before we proceed, let us run a quick sanity check to see if our experiment now has a Statistical Power of 90%; this can be answered by asking the following question:
- If we were to repeat our experiment many times and calculate our p-Value on each, what percent of the time should we expect a p-Value as extreme as 5%?
Let us try answering this question for Version F vs. Version A:
# Proving by simulation
MultipleDiceRolling <- function(k, N) {
  pValues <- NULL
  for (i in 1:k) {
    Dices <- DiceRolling(N)
    VersionF <- Dices[which(Dices$Version=="F"),]
    VersionA <- Dices[which(Dices$Version=="A"),]
    pValues <- cbind(pValues, t.test(VersionF$Signup, VersionA$Signup, alternative="greater")$p.value)
  }
  return(pValues)
}
# Let's replicate our experiment (9456 throws of a biased dice) 10k times
start_time <- Sys.time()
Rolls <- MultipleDiceRolling(10000,9456)
end_time <- Sys.time()
end_time - start_time
How many times did we observe P-Values as extreme as 5%?
cat(paste(length(which(Rolls <= 0.05)),"Times"))

What percent of the time did we observe this scenario?
Power <- length(which(Rolls <= 0.05))/length(Rolls)
cat(paste(round(Power*100,2),"% of the times (",length(which(Rolls <= 0.05)),"/",length(Rolls),")",sep=""))

As calculated above, we observe a Power of roughly 90% (0.896), which confirms our new sample size works as planned. This implies we have a 10% (1 - Power) probability of making a Type II Error or, equivalently, a 10% chance of failing to reject our Null Hypothesis at 95% confidence even though it is false, which is acceptable.
Absolute winner
Finally, let us find our absolute winner by repeating our Pairwise Comparison with these new samples:

As expected, Version F is our absolute winner against every other version. It is also now clear that there is no significant difference between the true means of the remaining versions.
Final Thoughts
We have explored how to simulate two types of tests: the Chi-Squared Test and Student’s t-Tests for one and two independent samples. Additionally, we have examined concepts such as Type I and Type II errors, Confidence Intervals, and the calculation and interpretation of Statistical Power for both scenarios.
It is essential to note that, in traditional use cases, we would save much time and even achieve more accurate results by performing such tests with specialized functions, so following this simulation path is not recommended in practice. In contrast, this type of exercise offers value in helping us develop a more intuitive understanding, which is what I wanted to achieve.
If you have any questions or comments, do not hesitate to post them below.