
A/B testing is commonly used across industries to make decisions in many aspects of a business. From writing emails to choosing landing pages and implementing specific feature designs, A/B testing can be used to make the best decision based on statistical analysis. This article will cover the basics of the frequentist approach to A/B testing and outline an example of how to derive a decision through A/B testing. I will also provide the associated Python implementation for a specific example.
Table of Contents
- What is A/B Testing
- Frequentist Approach
- Null & Alternative Hypothesis
- Sample Mean Estimate
- Confidence Intervals
- Test Statistics
- P Value
- Example
- Concluding Remarks
- Resources
What is A/B Testing
Inferential statistics is often used to infer something about a population based on observations from a sample of that population. A/B testing is the application of inferential statistics to researching user experience. It consists of randomized experiments with two variants, A and B [1]. Generally, by testing each variant against user response, one can either find statistical evidence that one variant is a better choice than the other or conclude that there is no statistically significant difference between choosing one over the other. Common reasons a company would conduct A/B tests include improving the conversion rate of their users, improving the marketability of their products, increasing their daily active users, etc.
Frequentist Approach
This is the more traditional approach to statistical inference and is commonly introduced in introductory statistics courses taken in university. A simple outline of this approach is as follows.
- Identify the null and alternative hypothesis
  - Null hypothesis: there is no significant difference between the specified populations, any observed difference being due to sampling or experimental error.
  - Alternative hypothesis: a hypothesis which contradicts the null hypothesis.
- Calculate a sample size to achieve the desired level of confidence (usually 95%)
- Calculate the test statistic and map it to a p-value
- Accept or reject the null hypothesis based on whether the p-value is smaller or larger than the chosen significance level
Null & Alternative Hypothesis
Identifying a hypothesis to test generally comes from domain knowledge of your given problem. The null hypothesis is generally a statement regarding the population that is believed to be true. The alternative hypothesis is a statement which contradicts the null hypothesis. A simple example can be outlined in the following scenario: you want to increase the conversion rate of users visiting your website by adding a distinct feature. The null hypothesis would be that adding this distinct feature to the website will have no impact on the conversion rate. The alternative hypothesis would be that adding this new feature will impact the conversion rate. A sketch of how such a test could be run on two variants is shown below.
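For intuition only, here is a minimal sketch of a two-proportion z-test for this kind of conversion-rate scenario. The conversion counts, visitor counts, and variable names are made up for illustration and are not part of the original example:

```python
# A sketch of a two-proportion z-test for the conversion-rate scenario above.
# All counts below are made up purely for illustration.
import numpy as np
from scipy import stats

conversions = np.array([200, 230])   # conversions for variant A (no feature) and B (feature)
visitors = np.array([5000, 5000])    # visitors shown each variant

p_a, p_b = conversions / visitors            # observed conversion rates
p_pool = conversions.sum() / visitors.sum()  # pooled rate assumed under the null hypothesis
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors[0] + 1 / visitors[1]))

z = (p_b - p_a) / se                 # test statistic
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```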
Sample Mean Estimate
The sample mean estimate from a group of observations is essentially an estimate of the population mean [2]. It can be represented by the following formula:
![Where N represents the total number of items in the samples and xi represents the number of occurrences of an event (Image provided by Siva Gabbi [3])](https://towardsdatascience.com/wp-content/uploads/2021/03/1ApiDVfpqTC7aarLAaKS-WQ.png)
In an ideal situation we would want the difference between the sample mean estimates of the variations (A and B) to be large. A larger difference between the two yields a larger test statistic, implying that there is a clearer winner between the variations.
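As a quick illustration of the formula above, the sample mean for each variant is simply the sum of the observations divided by the sample size. The toy arrays below are made up for illustration:

```python
# A toy sketch of the sample mean estimate for two variants.
# 1 = the user converted, 0 = the user did not convert (values are illustrative).
import numpy as np

variant_a = np.array([0, 1, 0, 0, 1, 1, 0, 1])
variant_b = np.array([1, 1, 0, 1, 1, 0, 1, 1])

mean_a = variant_a.mean()  # sum(x_i) / N for variant A
mean_b = variant_b.mean()  # sum(x_i) / N for variant B
print(mean_a, mean_b, mean_b - mean_a)
```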
Confidence Intervals
A confidence interval is a range of values defined such that there is a specified probability that the value of a parameter lies within it. It can be outlined by the following formula:
![u represents the sample mean estimate, t is the confidence level value, sigma is the sample standard deviation and N is the sample size (Image provided by Siva Gabbi [3])](https://towardsdatascience.com/wp-content/uploads/2021/03/1TuYwciDl4Jad4czUY3tkHg.png)
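A minimal sketch of this formula in Python, assuming scipy is available and using made-up observations, could look like the following:

```python
# A sketch of a 95% confidence interval around a sample mean,
# following the formula above: mean ± t * s / sqrt(N).
import numpy as np
from scipy import stats

observations = np.array([12.1, 9.8, 11.4, 10.7, 12.9, 10.2, 11.8, 9.9])  # illustrative data
n = len(observations)
mean = observations.mean()
s = observations.std(ddof=1)           # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)  # t value for a 95% confidence level

margin = t_crit * s / np.sqrt(n)
print(f"95% CI: ({mean - margin:.2f}, {mean + margin:.2f})")
```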
Test Statistics
The test statistic is a point value on a normal distribution which shows how far (in number of standard deviations) the observed sample statistic is from the mean assumed under the null hypothesis. There are various formulations of the test statistic based on the sample size and other factors. A few variations of the formula can be seen in the image below.
![Image from Krista King [6]](https://towardsdatascience.com/wp-content/uploads/2021/03/1NzIJhLKzRkMA84MaLeGViw.png)
Based on the value yielded by the test statistic, one can map it to a p-value and either accept or reject the null hypothesis based on whether the p-value is below or above the chosen significance level.
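As a hedged sketch of one of the variants above, the snippet below computes a one-sample z statistic for a mean with a known population standard deviation; every number in it is made up for illustration:

```python
# A sketch of a one-sample z test statistic for a mean:
# z = (x_bar - mu_0) / (sigma / sqrt(n)). All numbers are illustrative.
import numpy as np

x_bar = 10.5   # sample mean
mu_0 = 10.0    # hypothesized population mean under the null
sigma = 2.0    # (assumed known) population standard deviation
n = 64         # sample size

z = (x_bar - mu_0) / (sigma / np.sqrt(n))
print(f"test statistic z = {z:.2f}")  # 2.00 standard errors above the hypothesized mean
```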
P Value
In statistics, the p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed [4]. The p-value is also called the probability value. If the p-value is low, the observed result is unlikely under the null hypothesis, and the experiment provides statistically significant evidence for a different theory. In many fields, an experiment must have a p-value of less than 0.05 to be considered evidence in favour of the alternative hypothesis. In short, a low p-value is evidence against the null hypothesis. As explained above, once a p-value is identified, interpreting the results is fairly simple.
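Continuing the hypothetical z value from the previous sketch, mapping a test statistic to a two-tailed p-value and comparing it against a 5% significance level could look like this:

```python
# A sketch of turning a z test statistic into a two-tailed p-value and a decision.
from scipy import stats

z = 2.0        # hypothetical test statistic from the sketch above
alpha = 0.05   # significance level

p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```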
Example
The mortgage department of a large bank is interested in the nature of loans of first-time borrowers. This information will be used to tailor their marketing strategy [5]. They believe that 50% of first-time borrowers take out smaller loans than other borrowers. They perform a hypothesis test to determine whether the percentage is the same as or different from 50%. They sample 100 first-time borrowers and find that 53 of these loans are smaller than those of other borrowers. For the hypothesis test, they choose a 5% level of significance.
Null Hypothesis: p = 0.5
Alternative Hypothesis: p != 0.5
This will be run as a two-tailed test.
Given that our significance level is 5% and that this is a two-tailed test, the critical value corresponds to the 1 - 0.05/2 = 0.975 quantile of the standard normal distribution, which yields a critical value of 1.96.
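A minimal sketch of this calculation in Python, using scipy (the exact implementation details here are my own and only illustrative), could look like the following:

```python
# A sketch of the one-proportion z-test for the mortgage example:
# 53 of 100 first-time borrowers with smaller loans, null proportion 0.5.
import numpy as np
from scipy import stats

n = 100        # sample size
x = 53         # borrowers with smaller loans
p_0 = 0.5      # hypothesized proportion under the null
alpha = 0.05   # significance level

p_hat = x / n
z_critical = stats.norm.ppf(1 - alpha / 2)             # ~1.96 for a two-tailed test
z_stat = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)  # test statistic
p_value = 2 * stats.norm.sf(abs(z_stat))               # two-tailed p-value

print(f"critical value = {z_critical:.2f}")
print(f"test statistic = {z_stat:.2f}")
print(f"p-value = {p_value:.3f}")
```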
Based on the code above, the test statistic is 0.6. This is barely above the mean of the standard normal distribution, which is zero; in terms of standard deviations, there is virtually no difference between the sample proportion and the hypothesized proportion.
The test statistic falls within the critical values, hence we fail to reject the null hypothesis. This implies that at a 5% level of significance (95% confidence) we cannot reject the claim that 50% of first-time borrowers take out smaller loans than other borrowers.
Concluding Remarks
In summary, the frequentist approach to A/B testing is used to make a decision based on the statistical significance of the outcome favouring one of the two variants, A or B. This is done by identifying the null and alternative hypothesis associated with the test, determining the sample size, and calculating the test statistic at a certain confidence level. Once the test statistic is obtained, we can determine the p-value and conclude whether we accept or reject the null hypothesis.
Resources
- [1] https://en.wikipedia.org/wiki/A/B_testing
- [2] http://www.stat.yale.edu/Courses/1997-98/101/sampmn.htm
- [3] https://www.dynamicyield.com/lesson/frequentists-approach-to-ab-testing/
- [4] https://simple.wikipedia.org/wiki/P-value
- [5] https://opentextbc.ca/introbusinessstatopenstax/chapter/full-hypothesis-test-examples/
- [6] https://www.kristakingmath.com/blog/test-statistics-for-means-and-proportions
If you enjoyed reading this article, here are a few others written by me which you might also like:
Bayesian A/B Testing Explained
Recommendation Systems Explained
Text Summarization in Python with Jaro-Winkler and PageRank
Link Prediction Recommendation Engines with Node2Vec