A/B tests are powerful tools, but choosing the wrong statistical test can lead to misleading results. This guide will help you pick the right test for your data so that your analysis is reliable and you can make confident recommendations.
Just wrapped up your A/B test? The excitement doesn’t stop there! The real magic happens when you dive into the data and unearth valuable insights. This guide equips you, the Data Analyst or Data Scientist, with a systematic approach to analysing your A/B test results.
A big part of the analysis involves understanding your data and the statistical assumptions behind each test so that you can choose the right one. This step is often overlooked because of the allure of jumping straight to implementation, potentially missing crucial insights.
Depending on what you’re analysing, there is a different set of assumptions to make and hence a different set of tests to choose from. This article guides you through how to choose the ‘correct’ test for your data.
GUIDE
Getting right into it. This table focuses on the metrics a typical mobile-app A/B test would target, though the principles and assumptions apply throughout.
Skip ahead to the relevant section below, where I describe each metric type, how to decide the best test for it, and how to calculate it in Python!
Section 1: Average per User Metrics
Section 2: Categorical Metrics
Section 3: Joint Metrics
Before we start, Basic Definitions:
Null Hypothesis:
Each hypothesis test is built around a "theory" to test, called the Null Hypothesis (H₀). The goal of our analysis is to determine whether the data gives us enough evidence to confidently reject this hypothesis.
The Null Hypothesis (H₀) assumes that there is no difference between the 2 groups, ie, the feature has no impact.
E.g. H₀: μ₁ = μ₂
Hₐ: μ₁ ≠ μ₂
Significance Level (Alpha):
The significance level, or alpha (α), is a measure used to decide if the results of the test are statistically significant or not. This forms part of the base assumptions made before running the test.
In simple terms, the significance level helps you determine if your findings are reliable and not just a fluke. It acts as a threshold for saying, "Hey, this is probably real, not just a coincidence."
P-Value:
The p-value, or probability value, is a measure used in Statistics to help determine the strength of evidence against a null hypothesis.
In simple terms, the p-value is the probability of observing a result at least as extreme as the one in our data, assuming H₀ is true. It is therefore tied to the risk of a false positive: rejecting H₀ when it is actually true (a Type I Error).
If the p-value is high, the observed difference could easily have arisen by chance, so we cannot confidently reject H₀.
P-value < Significance Level → Reject H₀. We have sufficient evidence to conclude that the two groups are significantly different.
P-value > Significance Level → Do not reject H₀. We do not have sufficient evidence to conclude that the two groups are significantly different.
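Every test in this guide ends with the same comparison of the p-value against the significance level. As a small convenience (this helper is my own sketch, not part of any library), you can wrap the decision rule in a reusable function:
Decision rule helper
def interpret_result(p_value, alpha=0.05):
    # Compare the p-value to the significance level and print the conclusion
    if p_value < alpha:
        print("Reject the null hypothesis. There is a significant difference between the groups.")
    else:
        print("Fail to reject the null hypothesis. There is no significant difference between the groups.")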
Selecting the metric to analyse
Before we begin, it’s important to understand what you’re trying to analyse from your A/B test.
For example, if you are revamping the first-time user experience (FTUE), you’d likely be interested in metrics like user retention or conversion. These typically involve yes/no (1 or 0) outcomes, making the Two Proportion Z-Test a good fit.
We’ll explore different types of metrics mentioned in the guide above, explaining why they might be chosen and how to ensure your data aligns with the test’s requirements.
Setting up your data
I have assumed that you already have your data and it is in the format of a table with 3 columns: Unique ID, Variant, Metric.
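If you want to follow along without your own dataset, here is a minimal synthetic example in that format (the numbers and column names are purely illustrative):
import pandas as pd
import numpy as np

# Hypothetical example data: one row per user, in the assumed 3-column format
np.random.seed(42)
df = pd.DataFrame({
    "Unique ID": range(1, 2001),
    "Variant": ["A"] * 1000 + ["B"] * 1000,
    "Metric": np.concatenate([
        np.random.normal(10, 2, 1000),    # e.g. average minutes per user in group A
        np.random.normal(10.5, 2, 1000),  # e.g. average minutes per user in group B
    ]),
})
With the data in this shape, we can split it into the two variant groups: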
# Separate the data into the two variant groups
group_A = df[df['Variant'] == "A"]["Metric"]
group_B = df[df['Variant'] == "B"]["Metric"]
# Change "A" and "B" to the relevant groups in your dataset
# Change "Metric" to the relevant column name with the values
Section 1: Average Per User Metrics
This is the most common type of metric to analyse. It involves independent samples of the data.
In most real-world cases, average-per-user metrics such as Average Revenue per User or Average Time Spent per User are the ones you’ll analyse. If you have a large enough sample size, you could skip ahead and choose Welch’s t-test.
I’ll now go through each flow and describe the steps to decide which path to use –
Flow 1: Large Sample or Normally Distributed:
My general assumption for a ‘Large Sample’ is typically where the number of individual samples is greater than 10,000, though this definition is relative and may differ based on various factors, such as the specific field of study, the type of analysis being performed, and the effect size that needs to be detected.
If n is relatively small, perform a normality test to determine whether you should select a t-test. There are several ways to test for normality. The easiest is to create a histogram and visually inspect the data: if it roughly follows the shape of a Normal distribution, continue on. If you’re still unsure, it’s best to use a formal statistical test, such as the Shapiro-Wilk test for normality. Here is a great article about the various ways to test your data for normality. Keep in mind that each statistical test makes its own assumptions about your data, so bear this in mind before selecting a normality test.
See the code snippet below showcasing how to test your data for normality, either by visual inspection of a histogram or with the Shapiro-Wilk statistical test.
Test for Normality
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

alpha = 0.05  # We assume an alpha here of 0.05

# Variant column identifies the group of the user - group_A / group_B etc
grouped = df.groupby('Variant')

# Plotting histograms for each group to visually inspect the shape of the data
for name, group in grouped:
    plt.hist(group['Metric'])
    plt.title(name)
    plt.show()

# Shapiro-Wilk test to statistically test for normality
for name, group in grouped:
    shapiro_stat, p_value = stats.shapiro(group['Metric'])
    print(f"Shapiro-Wilk Test for Group {name}: W = {shapiro_stat}, p-value = {p_value}")
    if p_value < alpha:
        print("Reject the null hypothesis. The data does not look normally distributed.")
    else:
        print("Fail to reject the null hypothesis. The data looks normally distributed.")
Flow 2a: Equal Variances?
Following on from the Average Per User flowchart above –
If you have confirmed that your data conforms to Normality, then the next step is to check whether the datasets have equal variances. This determines whether you should use the Welch’s t-test or the Student’s t-test.
The main differences between Welch’s t-test and Student’s t-test are around the degrees of freedom and the sample variance estimates. Student’s assumes that both samples have the same variance, while Welch’s does not.
When comparing large sample sizes (n > 10,000) for a hypothesis test using either Welch’s t-test or Student’s t-test, the difference in significance levels between the two tests is typically negligible. This is because the Student’s t-test assumption of equal variances has minimal impact on the test’s accuracy when dealing with large sample sizes.
Even if the assumption of equal variances is violated, the Student’s t-test remains relatively robust when the two groups are of similar size (as is typical in an A/B test), meaning it produces accurate p-values and maintains the desired Type I error rate (the probability of rejecting a true null hypothesis). Large samples also help via the Central Limit Theorem, which states that as the sample size increases, the distribution of sample means approaches a normal distribution, regardless of the underlying population distribution.
In contrast, Welch’s t-test is specifically designed to handle unequal variances, making it more appropriate when the assumption of equal variances is questionable. However, for large sample sizes, the difference in significance levels between Welch’s t-test and Student’s t-test is usually minimal.
If you are concerned about the potential for unequal variances, Welch’s t-test is a safer choice. However, if you want to maximise power and are confident that the sample size is large enough, Student’s t-test can be used.
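To see this in practice, here is a quick synthetic illustration (the sample sizes, means and variances below are made up): with large, similarly sized groups and unequal variances, the two tests return almost identical p-values.
import numpy as np
import scipy.stats as stats

# Hypothetical large samples with unequal variances
np.random.seed(0)
sample_A = np.random.normal(loc=10.0, scale=2.0, size=20000)
sample_B = np.random.normal(loc=10.05, scale=3.0, size=20000)

# Student's t-test pools the variances; Welch's t-test does not
t_student, p_student = stats.ttest_ind(sample_A, sample_B, equal_var=True)
t_welch, p_welch = stats.ttest_ind(sample_A, sample_B, equal_var=False)

print(f"Student's t-test p-value: {p_student:.4f}")
print(f"Welch's t-test p-value:   {p_welch:.4f}")
With groups this large (and of equal size), the two p-values typically agree to several decimal places; the distinction matters more for small or very unbalanced samples.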
See the code snippet below showcasing how to test your data for equal variances using Bartlett’s test. Bartlett’s test is powerful but sensitive to departures from normality. If you would prefer a test that is more robust to non-normal data, Levene’s test may be more appropriate.
Test for Equal Variances
from scipy.stats import bartlett, levene

# Perform Bartlett's test for equal variances (works best on data that conforms to normality)
statistic, p_value = bartlett(group_A, group_B)

# Perform Levene's test for equal variances (less sensitive to the Normality assumption)
# Note: running both lines overwrites the Bartlett result - keep only the test you need
statistic, p_value = levene(group_A, group_B)

# Display test results
print(f"Test statistic: {statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in variances between the groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in variances between the groups.")
Flow 2b: Median or Mean?
Following on from the Average Per User flowchart above –
If you have reasonable doubts that your dataset conforms to Normality, then another statistical method might be more appropriate to analyse your dataset. The next step is to decide whether the Mean or Median of your data is more useful.
Considering the median over the mean in an A/B test can be beneficial in specific scenarios, primarily when dealing with data that might be affected by outliers or very skewed, non-normal distributions.
- Communicating Results: Using the median can offer a clearer and more intuitive interpretation of central tendency, especially when describing typical per-user behaviour. It might be more relatable for stakeholders or non-technical audiences.
- Skewed Distributions: If your data is highly skewed or does not follow a normal distribution, the mean might not accurately represent the typical value. In such cases, the median provides a more robust estimate of central tendency as it’s less influenced by extreme values or the shape of the distribution.
- Outlier Sensitivity: The mean is highly sensitive to outliers or extreme values in a dataset. Even a few outliers can significantly skew the mean, impacting its representativeness of the central tendency. In contrast, the median is less affected by extreme values since it represents the middle value in the dataset when arranged in ascending order.
Both measures have their merits, and choosing between them should align with the nature of your data, the effect of outliers, and the specific insights you aim to derive from your A/B test. You should consider both the mean and median when coming to a conclusion.
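As a quick illustration of the outlier point (using made-up revenue-per-user numbers), a single big spender can drag the mean far above what a typical user generates, while the median barely moves:
import numpy as np

# Hypothetical revenue per user: most users spend little, one spends a lot
revenue = np.array([0.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.0, 250.0])

print(f"Mean:   {np.mean(revenue):.2f}")    # 26.35 - pulled up by the single large spender
print(f"Median: {np.median(revenue):.2f}")  # 1.75 - reflects the typical user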
This is a very useful guide to the Mann-Whitney U Test. As always, it is best to do research to thoroughly understand each test before jumping into it!
Statistical Tests
If you’ve followed the Average Per User flowchart above, you will now have decided on the best test to determine whether the two groups’ metrics are statistically significantly different. See how to perform them below.
Refer to the GUIDE and the Average Per User flowchart above.
Student’s t-test
import scipy.stats as stats

# Student's t-test - This test requires Normality and Equal Variances
t_statistic, p_value = stats.ttest_ind(group_A, group_B)
print(f"Student's t-test: t = {t_statistic}, p-value = {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the groups.")
Welch’s t-test
import scipy.stats as stats

# Welch's t-test - This test requires Normality
t_statistic, p_value = stats.ttest_ind(group_A, group_B, equal_var=False)
print(f"Welch's t-test: t = {t_statistic}, p-value = {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the groups.")
Mann-Whitney U-test
# Mann-Whitney U-test - nonparametric (no Normality assumption), used when the Median is preferred over the Mean
u_statistic, p_value = stats.mannwhitneyu(group_A, group_B)
print(f"Mann-Whitney U-test: U = {u_statistic}, p-value = {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the groups.")
Bootstrapping
# Bootstrapping - for non-Normal data / small sample sizes, when the Mean is preferred
import numpy as np

# Calculate observed difference in means
observed_diff = np.mean(group_B) - np.mean(group_A)

# Combined data
combined_data = np.concatenate((group_A, group_B))

# Number of bootstrap samples
num_samples = 10000  # You can adjust this number based on computational resources

# Bootstrap resampling under the null hypothesis of no difference between groups
bootstrap_diffs = []
for _ in range(num_samples):
    # Resample with replacement
    bootstrap_sample = np.random.choice(combined_data, size=len(combined_data), replace=True)
    # Calculate difference in means for each bootstrap sample
    bootstrap_mean_A = np.mean(bootstrap_sample[:len(group_A)])
    bootstrap_mean_B = np.mean(bootstrap_sample[len(group_A):])
    bootstrap_diff = bootstrap_mean_B - bootstrap_mean_A
    bootstrap_diffs.append(bootstrap_diff)

# Calculate p-value: proportion of resampled differences at least as extreme as the observed one
p_value = np.mean(np.abs(bootstrap_diffs) >= np.abs(observed_diff))
print(f"P-value: {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the groups.")
Section 2: Categorical Metrics
In this section, we will explore categorical metrics. These can be discrete, such as Clicked / Not Clicked, or continuous in the case of multivariate (3+ group) tests.
Follow the Flowchart above to select the best test for your data.
2 Groups
Two Proportion Z-test (Binary Metrics)
These are metrics like Retention, Conversion, Clicked etc.
The two-sample z-test for binomial variables in A/B testing compares the proportions of binary outcomes between two groups. From a statistical viewpoint, a binomial distribution converges to a normal distribution as N gets large. This assumption generally holds quite well, so it makes sense to use a z-test for this.
H₀: p₁ − p₂ = 0
Hₐ: p₁ − p₂ ≠ 0
Two Proportion Z-test
from statsmodels.stats.weightstats import ztest
# Calculate the z-statistic and p-value. This assumes binomially distributed and i.i.d. variables.
z_stat, p_value = ztest(group_A, group_B)
print(f"Two Sample z-test: t = {z_stat}, p-value = {p_value}")
if p_value < alpha:
print("Reject the null hypothesis. There is a significant difference between the groups.")
else:
print("Fail to reject the null hypothesis. There is no significant difference between the groups.")
3+ Groups
Discrete Variables
Pearson’s Chi-Squared Test
The chi-squared test is another powerful tool for analyzing A/B tests, particularly when you have multiple groups in addition to a control group.
It allows you to compare multiple variants simultaneously, without assuming any distributional properties. This method will also work with similar binary variables as above, but using multiple groups instead of just 2. As these groups are likely splitting the sample size into smaller groups, it is important to ensure that the sample sizes for each group remain relatively large.
Pearson’s Chi Sq Test is utilised to determine if there’s a significant association or difference between observed and expected frequencies of categorical data among multiple groups. Therefore, the Null Hypothesis assumes that there is no difference between the groups.
As the data is discrete, we create a contingency table to sum the counts across each Variant. This is then interpreted by the stats.chi2_contingency function.
Pearson’s Chi-Squared Test
# Create a contingency table summing counts across each Variant
contingency_table = pd.crosstab(df['Variant'], df['Metric'])

# Perform the chi-squared test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

if p_value < alpha:
    print("There is a statistically significant difference in the distribution of the metric across groups.")
else:
    print("There is no statistically significant difference in the distribution of the metric across groups.")
Continuous Variables
ANOVA Test
ANOVA is a statistical test used to compare the means of three or more groups. It assesses whether the observed differences between group means are likely due to chance or actually represent significant differences in their underlying populations. It is useful when you release multiple different variants and want to compare the results against each other, saving time compared to deploying individual A/B tests.
ANOVA is relatively robust to violations of normality and variance homogeneity assumptions, especially when sample sizes are reasonably large.
ANOVA Test
# Collect the metric values for each variant group in the data
grouped_data = [df[df['Variant'] == cat]['Metric'] for cat in df['Variant'].unique()]

# Perform ANOVA test
f_statistic, p_value = stats.f_oneway(*grouped_data)

if p_value < alpha:
    print("There is a statistically significant difference in the means of the metric across groups.")
else:
    print("There is no statistically significant difference in the means of the metric across groups.")
This test determines whether the groups are statistically significantly different. One problem we encounter here is identifying the particular group that significantly outperforms the rest. While the collective ANOVA test highlights general deviations, a deeper investigation is needed to identify the specific groups showing statistical differences over the others.
Fortunately, there is a neat function which uses Tukey’s range test offering a structured approach. It generates a comprehensive pairwise comparison table across all group combinations, unveiling statistically significant differences among them.
However, caution is needed due to potential violations of the underlying assumptions of Tukey’s range test. It should primarily be used as a supporting tool to identify the distinct groups in these comparisons. This is a really useful video showing how it is performed and used.
See the code snippet below, which should be used as an aid to the ANOVA test above to identify the particular groups that outperform the rest.
Tukey’s HSD test
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Perform Tukey's HSD test for post-hoc analysis
tukey = pairwise_tukeyhsd(df['Metric'], df['Variant'])
# Print pairwise comparison results
print(tukey.summary())
Section 3: Joint Metrics
Following the GUIDE above, if you’ve determined that your metric is actually a joint metric of 2 or more variables, then you may need to take additional steps to effectively determine the statistical differences between your various groups. This is because the other tests above assume that the metrics you are testing are independent of one another.
Delta t-test
The Delta t-test is a statistical method used to assess the difference in means between two independent groups when the metric is a ratio of two random variables that form a joint distribution. In A/B testing, there are often scenarios where the metric’s components are not independent.
An example of this is Ad Click-Through-Rate. A person may view the same Advertisement multiple times, but only click on it once.
The problem with using the standard t-tests here is that we’re analysing two separate random variables whose ratio forms a joint distribution. The subjects (users) are independent of each other, but the views and clicks within a user are not, so the ratio metric violates the independence assumptions of the Student’s and Welch’s t-tests.
Instead, we use the Delta Method to estimate the variance of the ratio metric.
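For a ratio of per-user sums such as CTR = Σclicks / Σviews (equivalently, the ratio of the per-user means), the delta method gives the following approximation, where X is clicks per user, Y is views per user and n is the number of users in the group:
Var( X̄ / Ȳ ) ≈ (1/n) · (μx² / μy²) · ( σx²/μx² − 2·Cov(X, Y)/(μx·μy) + σy²/μy² )
This is exactly what the var_ratio function in the snippet below computes; the 1/n factor converts the per-user variability into the variance of the group-level ratio (see the Delta Method reference at the end of the article).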
See the code snippet below showing how to calculate this new Variance formula, as well as creating a t-test function using this new variance.
Delta t-test
import numpy as np
import scipy.stats as stats

# Create the new Variance function as described above (delta-method variance of the ratio mean_x / mean_y)
def var_ratio(metric1, metric2):
    mean_x = np.mean(metric1)
    mean_y = np.mean(metric2)
    var_x = np.var(metric1, ddof=1)
    var_y = np.var(metric2, ddof=1)
    cov_xy = np.cov(metric1, metric2, ddof=1)[0][1]
    n = len(metric1)
    # Dividing by n converts per-user variability into the variance of the group-level ratio
    result = (mean_x**2 / mean_y**2) * (var_x/mean_x**2 - 2*cov_xy/(mean_x*mean_y) + var_y/mean_y**2) / n
    return result

# Create this new t-test function, using the new variances above. This is a standard z/t-test on the ratio difference.
def delta_ttest(mean_c, mean_t, var_c, var_t, alpha=0.05):
    mean_diff = mean_t - mean_c
    var = var_c + var_t
    std_e = stats.norm.ppf(1 - alpha/2) * np.sqrt(var)
    lower_ci = mean_diff - std_e
    upper_ci = mean_diff + std_e
    z = mean_diff / np.sqrt(var)
    p_val = stats.norm.sf(abs(z)) * 2
    return z, p_val, upper_ci, lower_ci

# E.g. Here we calculate the significance of the CTR for a control and treatment group.
var_c = var_ratio(control['click'], control['view'])      # Delta variance for the control group
var_t = var_ratio(treatment['click'], treatment['view'])  # Delta variance for the treatment group
mean_c = control['click'].sum() / control['view'].sum()
mean_t = treatment['click'].sum() / treatment['view'].sum()
z, p_value, upper_ci, lower_ci = delta_ttest(mean_c, mean_t, var_c, var_t, alpha)  # Applies the t-test using these new delta variances

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the groups.")
Conclusion:
In summary, while A/B tests are invaluable for experimentation and optimisation, choosing the right statistical test is essential. For robust and reliable results, data scientists should carefully consider the characteristics of their data and the assumptions of each test.
Remember, A/B tests are powerful tools, but choosing the wrong statistical test can lead to misleading results!
References:
1. Test for Normality by Sivasai Yadav Mudugandla
2. Levene’s Test by Kyaw Saw Htoon
3. Mann Whitney U Test by Ricardo Lara Jácome
4. Delta Method: https://arxiv.org/pdf/1803.06336.pdf