Many of us have heard about, read about, or even run an A/B test before, which means we have conducted a statistical test at some point. Most of the time, we work with data from first- or third-party sources and perform these tests with ease, using tools that range from Excel to statistical software and even more automated solutions such as Google Optimize.
If you are like me, you might be curious about how these types of tests work and how concepts such as Type I and Type II Error, Confidence Intervals, Effect Magnitude, Statistical Power, and others interact with each other.
In this post, I would like to invite you to take a different approach to one specific type of A/B test, which makes use of a statistic called Chi-Squared. I will walk through this type of test by taking the long but rewarding road of simulations, avoiding libraries and lookup tables, and hopefully build some of the intuition behind it.
Before we start
Even though we could use data from past experiments or third-party sources such as Kaggle, it is more convenient for this post to generate our own data. This lets us compare our conclusions with a known ground truth, which would otherwise most likely be unknown.
For this example, we will generate a dummy dataset representing six different versions of a signup form and the number of leads observed for each. For this dummy set to be random and have a winning version that serves as our *ground truth*, we will generate the table by simulating throws of a biased dice.
For this, we wrote an R function that simulates a biased dice with a 20% probability of landing on 6 and a 16% chance of landing on any other number.
# Biased Dice Rolling Function
DiceRolling <- function(N) {
  Dices <- data.frame()
  for (i in 1:6) {
    if (i == 6) {
      Observed <- data.frame(Version=as.character(LETTERS[i]), Signup=rbinom(N/6,1,0.2))
    } else {
      Observed <- data.frame(Version=as.character(LETTERS[i]), Signup=rbinom(N/6,1,0.16))
    }
    Dices <- rbind(Dices, Observed)
  }
  return(Dices)
}
# Let's roll some dices
set.seed(123) # this is for result replication
Dices <- DiceRolling(1800)
Think of each Dice number as a representation of a different landing version (1–6 or A-F). For each version, we will throw our Dice 300 times, and we will write down its results as follows:
- If we are on version A (1) and throw the Dice and it lands on 1, we consider it to be Signup; otherwise, just a visit.
- We repeat 300 times for each version.
Sample Data
As commented earlier, this is what we got:
# We shuffle our results
set.seed(25)
rows <- sample(nrow(Dices))
t(Dices[head(rows,10),])

We can observe from our first ten results that we got one Signup for F, D, and A. In aggregated terms, our table looks like this:
library(ggplot2)
Result <- aggregate(Signup ~ Version, Dices, sum)
t(Result)
ggplot(Result, aes(x=Version, y=Signup)) +
  geom_bar(stat="identity", position="dodge") +
  ggtitle("Summary Chart")

From now on, think of this table as dice throws, eCommerce conversions, survey responses, or landing page signup conversions as we will use here; it does not matter, so use whatever is more intuitive for you.
For us, it will be signups, so we should produce this report:

Observed Frequencies
We will now aggregate our results, including both our Signup (1) and Did Not Signup (0) counts, which will help us understand how they differ from our expected values or frequencies; this layout is also called a Cross Tabulation or Contingency Table.
# We generate our contingency table
Observed <- table(Dices)
t(Observed)

In summary:

Expected Frequencies
Now that we know what our Cross Tabulation looks like, we can generate a table simulating the results we would expect if all versions performed the same. This is equivalent to saying that each version had the same signup conversion rate, or, in terms of our example, the expected result of an unbiased dice.
# We generate our expected frequencies table
Expected <- Observed
Expected[,1] <- (sum(Observed[,1])/nrow(Observed))
Expected[,2] <- sum(Observed[,2])/nrow(Observed)
t(Expected)

In summary:

Hypothesis Testing
We know our test had a higher-performing version not only by visually inspecting the results but because we purposely designed it to be that way.
This is the moment we have waited for: is it possible for us to prove this solely based on the results we got?
The answer is yes, and the first step is to define our Null and Alternative Hypothesis, which we will later try to accept or reject.

Our alternative hypothesis (H1) is what we want to prove correct; it states that there is, in fact, a relationship between the landing version and the results we observed. In contrast, our null hypothesis (H0) states that there is no relationship, meaning there is no significant difference between our observed and expected frequencies.
Statistic
Our goal is to find how often we would observe data as extreme as ours in a universe where our null hypothesis is correct, meaning where our observed and expected signup frequencies have no significant difference.
A useful statistic that can sum up all these values (six columns, one for each version, and two rows, one for each signup state) into a single value is Chi-Square, which is calculated as follows:
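$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ and $E_i$ are the observed and expected frequencies of each cell in our contingency table.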

We will not go into the details of how this formula is derived, nor its assumptions or requirements (such as Yates's Correction), because that is not the subject of this post. Instead, we will take a numerical approach through simulations, which should shed some light on these types of hypothesis tests.
In summary, if we compute this formula with our data, we get:

# We calculate our X^2 score
Chi <- sum((Expected-Observed)^2/Expected)
Chi

Null Distribution Simulation
We need to obtain the probability of finding a statistic as extreme as the one we observed, which, in this case, is represented by a Chi-Square value of 10.368. In terms of probability, this is also known as our P-Value.
For this, we will simulate a Null Distribution as a benchmark. This means we need to generate a scenario in which our Null Hypothesis is correct, a situation where there is no relationship between the landing version and the observed signup results (frequencies) we got.
A solution that quickly comes to mind is to repeat our experiment from scratch many times, either by re-collecting results or, in the context of this post, by throwing an unbiased dice, and then comparing how our observed results behave in contrast to these new tests. Even though this might seem intuitive at first, in real-world scenarios it is usually not the most efficient solution, since repeating an A/B test many times requires extreme amounts of resources such as time and budget.
Resampling
An excellent solution to the problem discussed above is called resampling. Resampling makes one variable independent of the other by shuffling one of them randomly. If there were an initial relationship between them, this relationship would be lost due to the random shuffling.
In particular, we need to use the original (unaggregated) samples for this scenario. We will then permute one of the columns, the Signup status in this case, several times.
As an example, let us see 2 shuffles of the first 10 samples shown earlier:
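A minimal sketch of what those two shuffles could look like (the `Ten` helper and the seeds here are illustrative, so the exact permutations will differ from the original table):
# Two illustrative shuffles of the Signup column for the ten rows sampled above
Ten <- Dices[head(rows, 10), ]
set.seed(1)
t(transform(Ten, Signup = sample(Signup))) # Shuffle #1
set.seed(2)
t(transform(Ten, Signup = sample(Signup))) # Shuffle #2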

Let us try it now with the complete (1800) sample set:
Permutation #1
Perm1 <- Dices
set.seed(45)
Perm1$Signup <- sample(Dices$Signup)
ResultPerm1 <- aggregate(Signup ~ Version, Perm1, sum)
cat("Permutation #1:nn")
cat("Summarynn")
t(ResultPerm1)
cat("Chi-Squared")
Perm1Observed <- table(Perm1)
sum((Expected-Perm1Observed)^2/Expected)

Permutation #2
Perm2 <- Dices
set.seed(22)
Perm2$Signup <- sample(Dices$Signup)
ResultPerm2 <- aggregate(Signup ~ Version, Perm2, sum)
cat("Permutation #2:\n\n")
cat("Summary\n\n")
t(ResultPerm2)
cat("Chi-Squared")
Perm2Observed <- table(Perm2)
sum((Expected-Perm2Observed)^2/Expected)

As seen in both permutations of our data, we got quite different summaries and Chi-Squared values. We will now repeat this process many times to explore what we obtain at a massive scale.
Simulation
Let us simulate 15k permutations of our data.
# Simulation Function
Simulation <- function(Dices, k) {
  dice_perm <- data.frame()
  permutation <- Dices
  i <- 0
  while (i < k) {
    i <- i + 1
    # We permute our results
    permutation$Signup <- sample(Dices$Signup)
    # We generate our contingency table
    ObservedPerm <- table(permutation)
    # We generate our expected frequencies table
    ExpectedPerm <- ObservedPerm
    ExpectedPerm[,1] <- sum(ObservedPerm[,1])/nrow(ObservedPerm)
    ExpectedPerm[,2] <- sum(ObservedPerm[,2])/nrow(ObservedPerm)
    # We calculate the X^2 statistic for this permutation
    ChiPerm <- sum((ExpectedPerm-ObservedPerm)^2/ExpectedPerm)
    # We append this statistic to our results data frame
    dice_perm <- rbind(dice_perm, data.frame(Permutation=i, ChiSq=ChiPerm))
  }
  return(dice_perm)
}
# Let's resample our data 15,000 times
start_time <- Sys.time()
set.seed(12)
permutation <- Simulation(Dices,15000)
end_time <- Sys.time()
end_time - start_time

Resample Distribution
As we can observe below, our 15k permutations are distributed with a distinct shape that resembles, as expected, a Chi-Square distribution. With this information, we can now calculate in how many of the 15k iterations we observed a Chi-Squared value as extreme as our initial 10.368 calculation.
totals <- as.data.frame(table(permutation$ChiSq))
totals$Var1 <- as.numeric(as.character(totals$Var1))
plot( totals$Freq ~ totals$Var1, ylab="Frequency", xlab="Chi-Squared Values",main="Null Distribution")

P-Value
Let us calculate how many times we obtained a Chi-Square value equal to or higher than 10.368 (our calculated score).
Higher <- nrow(permutation[which(permutation$ChiSq >= Chi),])
Total <- nrow(permutation)
prob <- Higher/Total
cat(paste("Total Number of Permutations:",Total,"n"))
cat(paste(" - Total Number of Chi-Squared Values equal to or higher than",round(Chi,2),":",Higher,"n"))
cat(paste(" - Percentage of times it was equal to or higher (",Higher,"/",Total,"): ",round(prob*100,3),"% (P-Value)",sep=""))

Decision Limits
We now have our P-Value, which means that if the Null Hypothesis is correct, i.e., there is no relationship between Version and Signups, we should encounter a Chi-Square value as extreme as ours only 6.5% of the time. If we think of this purely in terms of dice, we should expect "results as biased as ours" from an unbiased dice at most 6.5% of the time.
Now we need to define our decision limits on which we accept or reject our null hypothesis.
We calculated our decision limits for 90%, 95%, and 99% confidence levels, meaning the Chi-Squared values we should expect as cut-offs at those levels.
# Decision Limits
totals <- as.data.frame(table(permutation$ChiSq))
totals$Var1 <- as.numeric(as.character(totals$Var1))
totals$Prob <- cumsum(totals$Freq)/sum(totals$Freq)
Interval90 <- totals$Var1[min(which(totals$Prob >= 0.90))]
Interval95 <- totals$Var1[min(which(totals$Prob >= 0.95))]
Interval99 <- totals$Var1[min(which(totals$Prob >= 0.99))]
cat(paste("Chi-Squared Limit for 90%:",round(Interval90,2),"\n"))
cat(paste("Chi-Squared Limit for 95%:",round(Interval95,2),"\n"))
cat(paste("Chi-Squared Limit for 99%:",round(Interval99,2),"\n"))

Fact Check

As observed in the classical Chi-Square distribution table, the theoretical values are very similar to the ones we obtained from our simulation, which means our confidence limits and P-Values should be quite accurate.
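As a quick sketch (not part of the original analysis), we can also pull the theoretical cut-offs straight from R's built-in quantile function, using the 5 degrees of freedom of our 6 x 2 table:
# Theoretical Chi-Squared critical values for df = (6 - 1) * (2 - 1) = 5
qchisq(c(0.90, 0.95, 0.99), df = 5)
# approximately 9.24, 11.07, and 15.09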
Hypothesis Testing

As we expected, we can reject the Null Hypothesis and claim that there is a significant relationship between versions and signups. Still, there is a small caveat, and this is our level of confidence. As observed in the calculations above, our P-Value (6.5%) sits just between the 90% and 95% confidence levels, which means that, even though we can reject our Null Hypothesis with 90% confidence, we cannot reject it at 95% or any higher confidence level.
If we claim to have 90% confidence, then we are also claiming there is a 10% chance of wrongly rejecting our null hypothesis (also called a Type I Error, False Positive, or Alpha). Note that, in practice, standard arbitrary values (90%, 95%, 99%) are used, but we could just as easily claim we are 93.5% certain, since we calculated a 6.5% probability of a Type I Error.
Interestingly, even though we know for sure there is a relationship between version and signups, we cannot prove it by mere observation, by simulation, or by running this hypothesis test at a standard 95% confidence level. Failing to reject our Null Hypothesis even though we know it is false is called a False Negative or Type II Error (Beta), and it depends on the Statistical Power of the test, which measures the probability that this does not happen.
Statistical Power
In our hypothesis test, we saw we were unable to reject our Null Hypothesis at standard confidence levels such as 95% or higher. This is due to the Statistical Power (or Power) of the test we designed, which is particularly sensitive to our statistical significance criterion discussed above (Alpha or Type I Error), the effect magnitude, and the sample size.
Power is calculated as follows:
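$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$$

where $\beta$ is the probability of committing a Type II Error.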

In particular, we can calculate our current statistical Power by answering the following question:
- If we were to repeat our experiment many times and calculate our P-Value for each one, what percentage of the time should we expect a P-Value as extreme as 5%?
Let us try answering this question:
MultipleDiceRolling <- function(k, N) {
  pValues <- NULL
  for (i in 1:k) {
    # Each iteration repeats the whole experiment: N throws of our biased dice
    Dices <- DiceRolling(N)
    Observed <- table(Dices)
    # We store the P-Value of a Chi-Squared test on the resulting contingency table
    pValues <- cbind(pValues, chisq.test(Observed)$p.value)
  }
  return(pValues)
}
# Let's replicate our experiment (1800 throws of a biased dice) 10k times
start_time <- Sys.time()
Rolls <- MultipleDiceRolling(10000,1800)
end_time <- Sys.time()
end_time - start_time
How many times did we observe P-Values as extreme as 5%?
cat(paste(length(which(Rolls <= 0.05)),"Times"))

Which percent of the times did we observe this scenario?
Power <- length(which(Rolls <= 0.05))/length(Rolls)
cat(paste(round(Power*100,2),"% of the times (",length(which(Rolls <= 0.05)),"/",length(Rolls),")",sep=""))

As calculated above, we observe a Power of 21.91% (0.219), which is quite low since the gold standard is around 0.8 or even 0.9 (90%). In other words, this means we have a 78.09% (1 - Power) probability of making a Type II Error, or equivalently, a 78% chance of failing to reject our Null Hypothesis at a 95% confidence level even though it is false, which is exactly what happened here.
As mentioned, Power is a function of the following (a brief analytic cross-check follows this list):
- Our significance criterion: this is our Type I Error or Alpha, which we decided to be 5% (95% confidence).
- Effect Magnitude or Size: this represents the difference between our observed and expected values in terms of the standardized statistic in use. In this case, since we used the Chi-Square statistic, the effect (named w) is calculated as the square root of the normalized Chi-Square value and is usually categorized as Small (0.1), Medium (0.3), or Large (0.5) (Ref: Cohen, J. (1988)).
- Sample size: this represents the total number of samples (in our case, 1800).
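As a side note, the same noncentral Chi-Squared calculation used later in the Sample Size section can cross-check the simulated Power analytically; a minimal sketch, assuming the effect size w = 0.04 derived in the next section:
# Analytic power for alpha = 0.05, df = 5, N = 1800 and effect size w = 0.04
w <- 0.04
N <- 1800
ChiLimit <- qchisq(0.05, df = 5, lower.tail = FALSE)
pchisq(ChiLimit, df = 5, ncp = N * w^2, lower.tail = FALSE)
# this should land close to the ~22% Power we estimated by simulation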
Effect Magnitude
We designed an experiment with a relatively small effect magnitude, since our dice was only biased on one face (6), with only a slight additional chance of landing in its favor.
In simple words, our effect magnitude (w) is calculated as follows:

1) Where our Observed Proportions are calculated as follows:

Probabilities of our alternative hypothesis

2) And our Expected Proportions:

Probabilities of our null hypothesis

Finally, we can obtain our effect size as follows:
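Since the underlying probabilities are known by design (a 0.20 signup chance for version F and 0.16 for the rest), here is a minimal sketch of that calculation, using Cohen's w = sqrt(sum((p1 - p0)^2 / p0)) over the twelve cells of our contingency table:
# Cell proportions under the alternative hypothesis (our ground-truth design)
SignupProbs <- c(rep(0.16, 5), 0.20)
p1 <- c(SignupProbs / 6, (1 - SignupProbs) / 6)
# Cell proportions under the null hypothesis (pooled signup rate for all versions)
Pooled <- mean(SignupProbs)
p0 <- c(rep(Pooled / 6, 6), rep((1 - Pooled) / 6, 6))
# Cohen's effect size w
sqrt(sum((p1 - p0)^2 / p0))
# approximately 0.04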

Sample Size
Similarly to our effect size, our sample size, even though it seems large enough (1800), is not big enough to spot the relationship (or bias) at 95% confidence, since our effect size, as we calculated, is very small. We can expect an inverse relationship between sample size and effect magnitude: the larger the effect, the smaller the sample size needed to prove it at a given significance level.
At this point, it might be easier to think of the sample size of our A/B test as dice or even coin throws. It is somewhat intuitive that with a single throw we will be unable to spot a biased dice or coin, but if 1800 throws are not enough to detect this small effect at a 95% confidence level, that leads us to the following question: how many throws do we need?
The same principle applies to the sample size of our A/B test. The smaller the effect, such as small variations in conversion from small changes in each version (colors, fonts, buttons), the bigger the sample and, therefore, the more time we need to collect the data required to accept or reject our hypothesis. A common problem in many A/B tests concerning website conversion in eCommerce is that tools such as Google Optimize can take many days, if not weeks, and much of the time we do not reach a conclusive answer.
To solve this, we first need to define the Statistical Power we want. Next, we will answer the question by iterating over different values of N until we minimize the difference between our target Power and the Power obtained.
# Basic example on how to obtain a given N based on a target Power.
# Playing with initialization variables might be needed for different scenarios.
CostFunction <- function(n,w,p) {
  # Squared difference between the target power (p) and the power obtained
  # with sample size n and effect size w (df = 5, alpha = 0.05)
  value <- pchisq(qchisq(0.05, df = 5, lower = FALSE), df = 5, ncp = (w^2)*n, lower = FALSE)
  Error <- (p-value)^2
  return(Error)
}
SampleSize <- function(w,n,p) {
  # Initialize variables
  N <- n
  h <- 0.000000001
  LearningRate <- 40000000
  HardStop <- 20000
  power <- 0
  # Gradient descent loop: adjust N until the power stops improving
  for (i in 1:HardStop) {
    dNdError <- (CostFunction(N + h,w,p) - CostFunction(N,w,p)) / h
    N <- N - dNdError*LearningRate
    ChiLimit <- qchisq(0.05, df = 5, lower = FALSE)
    new_power <- pchisq(ChiLimit, df = 5, ncp = (w^2)*N, lower = FALSE)
    if(round(power,6) >= round(new_power,6)) {
      cat(paste0("Found in ",i," Iterations\n"))
      cat(paste0(" Power: ",round(power,2),"\n"))
      cat(paste0(" N: ",round(N)))
      break()
    }
    power <- new_power
  }
}
set.seed(22)
SampleSize(0.04,1800,0.8)
SampleSize(0.04,1800,0.9)

As seen above, after iterating over different values of N, we obtained a recommended sample size of 8,017 for a Power of 0.8 and 10,293 for a Power of 0.9.
Let us repeat the experiment from scratch and see what results we get with the new sample size of 8,017 suggested by aiming at a commonly used Power of 0.8.
start_time <- Sys.time()
# Let's roll some dices
set.seed(11) # this is for result replication
Dices <- DiceRolling(8017) # We expect 80% Power
t(table(Dices))
# We generate our contingency table
Observed <- table(Dices)
# We generate our expected frequencies table
Expected <- Observed
Expected[,1] <- (sum(Observed[,1])/nrow(Observed))
Expected[,2] <- sum(Observed[,2])/nrow(Observed)
# We calculate our X^2 score
Chi <- sum((Expected-Observed)^2/Expected)
cat("Chi-Square Score:",Chi,"nn")
# Let's resample our data 15,000 times
set.seed(20)
permutation <- Simulation(Dices,15000)
Higher <- nrow(permutation[which(permutation$ChiSq >= Chi),])
Total <- nrow(permutation)
prob <- Higher/Total
cat(paste("Total Number of Permutations:",Total,"n"))
cat(paste(" - Total Number of Chi-Squared Values equal to or higher than",round(Chi,2),":",Higher,"n"))
cat(paste(" - Percentage of times it was equal to or higher (",Higher,"/",Total,"): ",round(prob*100,3),"% (P-Value)nn",sep=""))
# Lets replicate this new experiment (8017 throws of a biased dice) 20k times
set.seed(20)
Rolls <- MultipleDiceRolling(10000,8017)
Power <- length(which(Rolls <= 0.05))/length(Rolls)
cat(paste(round(Power*100,3),"% of the times (",length(which(Rolls <= 0.05)),"/",length(Rolls),")",sep=""))
end_time <- Sys.time()
end_time - start_time

Final Thoughts
As expected with our new experiment design and a sample size of 8,017, we were able to reduce our P-Value to 1.9%.
Additionally, we observe a Statistical Power equivalent to 0.79 (very near our goal), which implies we were able to reduce our Type II Error (non-rejection of our false null hypothesis) to just 21%!
This allows us to conclude with 95% confidence (in reality 98.1%) that there is, as we always knew, a statistically significant relationship between Landing Version and Signups. Now we need to test, with a given confidence level, which version was the highest performer; this will be covered in Part II of this post.
If you have any questions or comments, do not hesitate to post them below.