A/B testing is like Jenga, a delicate balance of interconnected pieces that form the foundation of a successful experiment. Just like in the game of Jenga, where removing the wrong block can cause the entire tower to crumble, A/B testing relies on multiple components working together. Each piece represents a crucial element of the test, and if any of them fail, the integrity of the experiment can be compromised, leading to inaccurate results or missed opportunities.
And in my experience, I’ve seen great experiment ideas crumble because of very common mistakes that many data scientists commit, myself included! And so, I want to cover four of the most common mistakes in A/B testing (and how to solve them!).
If you’re not familiar with A/B testing and you’re interested in pursuing a career in data science, I strongly recommend you at least familiarize yourself with the concept.
With that said, let’s dive into it!
Problem #1: Setting the statistical power too low.
To recap, statistical power represents the probability of correctly detecting a true effect, or more accurately speaking, it is the conditional probability of rejecting the null hypothesis given that it is false. Statistical power is inversely related to the probability of committing a Type 2 error (false negative).
Generally, it’s common practice to set the power at 80% when conducting a study. Given its definition, this means that if you set the power at 80%, you would fail to reject the null hypothesis, when it is in fact false, 20% of the time. In simpler terms, if there were true effects in 100 conducted experiments, you would only detect about 80 of them on average.
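To make that concrete, here’s a minimal simulation sketch in Python. The conversion rates and per-variant sample size are illustrative assumptions (not from any real experiment), sized so that power is roughly 80%; it runs many A/B tests in which a true effect exists and counts how often the test detects it.

```python
# Simulation sketch of what "80% power" means in practice.
# The conversion rates and sample size below are illustrative assumptions,
# sized so that power is roughly 0.80 at alpha = 0.05.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n = 1_900                        # users per variant
p_control, p_test = 0.10, 0.12   # true conversion rates (a real effect exists)
n_experiments = 1_000
detections = 0

for _ in range(n_experiments):
    control_conversions = rng.binomial(n, p_control)
    test_conversions = rng.binomial(n, p_test)
    _, p_value = proportions_ztest(
        count=[control_conversions, test_conversions], nobs=[n, n]
    )
    detections += p_value < 0.05

# With ~80% power, roughly 800 of the 1,000 true effects get detected
print(f"Detected the true effect in {detections / n_experiments:.0%} of runs")
```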
Why is this a problem?
In a business, especially a tech company, one of the main goals is to learn, build, and iterate as quickly as possible. One of the reasons that large tech companies, like Airbnb and Uber, are able to grow so fast and maintain their market share is because of their ability to constantly evolve.
When you set the statistical power to 80%, you’re accepting that 20% of true effects will go uncaptured, which means 20% fewer successful iterations. Now compound that over, say, 10 years, and you can understand the impact that this may have.
What’s the solution?
The obvious answer to this is to increase the statistical power. How you go about doing so is not so obvious – statistical power is directly related to several other experimental parameters, meaning there are several ways to increase it:
- Increase the sample size. The main way to improve statistical power is to increase the sample size. A larger sample shrinks the standard error of your estimate, which results in narrower confidence intervals and more precise estimates. This is why if you use a power analysis tool like Evan Miller’s A/B test calculator, setting a higher power results in a bigger recommended sample size (see the sketch after this list).
- Adjust the alpha. Power and alpha move in the same direction, and this makes sense if you think about it. If you lowered the alpha from 0.05 to 0.01, the threshold for rejecting the null hypothesis becomes more stringent, which makes it harder to reject, which results in lower statistical power. The opposite also holds – if you increase the alpha, you’re more likely to reject the null and will have higher statistical power, at the cost of a higher false-positive rate.
- Increase the minimum detectable effect (MDE). For a fixed sample size and alpha, power is higher for larger effects, so designing your test around a larger MDE leaves it better powered to detect that effect. That being said, the MDE shouldn’t be chosen just to make the power math work out, which leads me to my next point!
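To see how these parameters trade off numerically, here’s a minimal power-analysis sketch using statsmodels; the baseline conversion rate and MDE are made-up illustration values.

```python
# Power-analysis sketch: how sample size depends on power, alpha, and the MDE.
# Baseline rate and MDE are made-up illustration values.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.10   # assumed baseline conversion rate
mde = 0.02             # absolute lift we want to detect (10% -> 12%)
alpha = 0.05           # significance level
power = 0.80           # desired statistical power (1 - beta)

# Convert the two conversion rates into Cohen's h effect size
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

# Solve for the required sample size per variant
analysis = NormalIndPower()
n_per_variant = analysis.solve_power(
    effect_size=effect_size, alpha=alpha, power=power,
    ratio=1.0, alternative="two-sided",
)
print(f"Required users per variant at 80% power: {n_per_variant:,.0f}")

# Raising power to 90%, lowering alpha, or shrinking the MDE all push
# the required sample size up (the trade-offs described above).
n_at_90 = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=0.90)
print(f"Required users per variant at 90% power: {n_at_90:,.0f}")
```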
Problem #2: Setting the minimum detectable effect (MDE) too low.
The minimum detectable effect (MDE) represents the smallest effect size that an experiment can reliably detect at the chosen sample size, alpha, and power. If the observed effect falls below the MDE, it is too small to be reliably distinguished from random variation or noise in the data.
If that’s the case, then why wouldn’t you want to set the MDE as low as possible? This is where the idea of statistical significance vs practical significance comes into play – while statistical significance focuses on the probability that an effect is not due to random chance, practical significance takes into account the magnitude and implications of the effect in practical terms.
To give an example, I conducted a pricing experiment at KOHO to determine price elasticity for a given product. The end result was statistically significant in that the reduction in price resulted in an increase in product adoption. However, reducing the price as much as we did was not practically significant because despite the increased number of users that subscribed, the price reduction ultimately led to a lower profit overall.
What’s the solution?
You should choose an MDE based on practical effect sizes that are relevant to the context and align with the objectives of the experiment. This ensures that the detected effects are both statistically significant and practically meaningful, while also optimizing resource allocation and avoiding the risk of false negatives.
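As an illustration of what “practically meaningful” can look like, here’s a small sketch that derives a floor for the MDE from the break-even economics of a price cut; every price and rate in it is a hypothetical placeholder, not the actual KOHO figures.

```python
# Sketch: derive a practical floor for the MDE from break-even economics.
# All prices and rates are hypothetical placeholders, not real KOHO figures.
current_price = 10.0     # assumed monthly price
proposed_price = 8.0     # assumed discounted price being tested
current_adoption = 0.05  # assumed share of eligible users who subscribe today

# Revenue per eligible user today, and the adoption rate needed to match it
# at the lower price
revenue_per_user = current_price * current_adoption
breakeven_adoption = revenue_per_user / proposed_price
required_lift = breakeven_adoption / current_adoption - 1

print(f"Adoption must rise by at least {required_lift:.0%} to break even.")
# Any lift smaller than this may be statistically detectable but not
# practically significant, so the MDE should be set at or above it.
```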
Problem #3: Conducting too many hypothesis tests.
Too often have I seen dozens and dozens of A/B tests (hypothesis tests) conducted on virtually the same thing, like testing several price points for a given product, testing various website configurations, and testing multiple marketing campaigns.
Why is this a problem?
The problem with this is that as you conduct more hypothesis tests, you’re more likely to obtain statistically significant results purely by chance. Statistically speaking, this comes down to the alpha that is set when A/B testing. The alpha represents the probability of rejecting the null hypothesis when it is actually true, so with an alpha of 0.05, about 5 out of every 100 tests with no real effect will still come back as significant. Across 10 independent tests at alpha = 0.05, the chance of at least one false positive is 1 - 0.95^10 ≈ 40%.
What’s the solution?
The solution to this is to control for false discoveries (false positives), and there are several methods to achieve this. The most common technique is the Bonferroni Correction, which simply adjusts the significance level (alpha) by dividing it by the number of tests being performed. For example, if you are conducting 10 hypothesis tests and want to maintain an overall alpha of 0.05, you would divide 0.05 by 10, resulting in an adjusted alpha of 0.005 for each test. This correction ensures a more stringent criterion for declaring statistical significance, reducing the chances of false positives.
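As a minimal sketch, here’s how that correction might be applied with statsmodels’ multipletests; the p-values are made-up placeholders for 10 hypothetical tests.

```python
# Sketch of the Bonferroni correction with statsmodels.
# The p-values are made-up placeholders for 10 hypothetical tests.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.004, 0.012, 0.020, 0.030,
            0.041, 0.050, 0.120, 0.300, 0.650]

reject, p_adjusted, _, alpha_per_test = multipletests(
    p_values, alpha=0.05, method="bonferroni"
)

print(f"Per-test alpha: {alpha_per_test:.4f}")  # 0.05 / 10 = 0.005
for p, adj, is_significant in zip(p_values, p_adjusted, reject):
    print(f"p = {p:.3f} -> adjusted p = {min(adj, 1.0):.3f}, "
          f"significant: {is_significant}")
```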
Problem #4: Not accounting for Survivorship Bias.
Another problem I see is that survivorship bias often goes unaccounted for in the experimental design.
Why is this a problem?
Survivorship bias and user tenure have a strong relationship. Consider this: users who don’t find value in a product, and who are unengaged or unprofitable, are unlikely to stay with a company for long. Therefore, it’s important to account for the potential differences in behavior between users of different tenures.
When splitting your control and test groups, failing to account for user tenure can skew your results if there are significant differences in behavior. One group may have a higher average user tenure, which can impact factors like profitability and engagement. In other words, not accounting for user tenure can introduce confounding variables and hinder the analysis of the specific cause-and-effect relationship of interest.
What’s the solution?
Stratified sampling can be used to address the issue of user tenure skewing A/B test results. It involves stratifying (or partitioning) the population into specific segments and then randomly sampling from each segment individually. You can do this as follows (a code sketch follows the list):
- Define User Tenure Groups: Divide your user population into distinct groups based on their tenure with the company. For example, you could create groups such as "New Users" (short tenure), "Mid-Term Users" (moderate tenure), and "Long-Term Users" (extended tenure).
- Determine Sample Sizes: Determine the sample size you want for each tenure group. The sample sizes can be proportional to the size of each group in the overall user population or based on specific considerations, such as the importance of each group or desired statistical power.
- Random Sampling within Each Group: Randomly select users from each tenure group to form the control group and the test group for your A/B test. Ensure that the selection is representative of the users in each group, preserving the proportions of users with different tenure levels.
- Conduct your A/B Test: By using stratified sampling, you reduce bias in your experiment and set yourself up for more reliable results. You can now conduct your experiment in a manner that controls for variables other than the variable of interest.
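Here’s a minimal sketch of that procedure in pandas; the column names, tenure cut points, and synthetic data are all assumptions for illustration.

```python
# Stratified assignment sketch: split control/test within each tenure group.
# Column names, tenure cut points, and the synthetic data are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
users = pd.DataFrame({
    "user_id": range(10_000),
    "tenure_days": rng.integers(1, 1500, size=10_000),
})

# 1. Define tenure groups (cut points are illustrative)
users["tenure_group"] = pd.cut(
    users["tenure_days"],
    bins=[0, 90, 365, np.inf],
    labels=["New", "Mid-Term", "Long-Term"],
)

# 2-3. Within each tenure group, randomly assign half of the users to test
users["variant"] = "control"
for _, group in users.groupby("tenure_group", observed=True):
    test_ids = group["user_id"].sample(frac=0.5, random_state=42)
    users.loc[users["user_id"].isin(test_ids), "variant"] = "test"

# Both variants now preserve the tenure mix of the overall population
print(users.groupby(["tenure_group", "variant"], observed=True).size())
```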
After reading this, you should know four common A/B testing errors and how to solve them – specifically, you should now know how to avoid:
- Setting the statistical power too low
- Setting the minimum detectable effect (MDE) too low
- Conducting too many hypothesis tests
- Not accounting for Survivorship Bias
Avoiding these errors will certainly improve the validity and reliability of your A/B tests, enabling meaningful insights and informed decision-making.
Now go out there and see what you can discover!
Thanks for Reading!
If you enjoyed this article, subscribe and become a member today to never miss another article on Data Science guides, tricks and tips, life lessons, and more!