Once used almost exclusively in academia, particularly in medical research, randomized controlled trials are now a popular method for businesses to make data-driven decisions. Online A/B testing, in particular, is easy to implement and a potentially powerful way to optimize digital processes. By comparing two or more variants, organizations can evaluate the effectiveness of different options and determine the most favorable outcome. However, it is crucial to recognize and address certain limitations to ensure that biases do not undermine the reliability and validity of the results. In this article, we explore three key limitations to consider before running an online A/B test in order to avoid costly bias. Before jumping into the list of what I personally consider the top three issues, let me briefly define A/B testing and a few important concepts.
What is A/B testing?
A/B testing involves presenting different versions, or variants, A and B, to different study subjects (e.g., clients). Online A/B testing could present variations of a webpage, email campaign, user interface, or any other digital asset to a subset of users. The variations typically differ in one or more specific elements, such as design, layout, color scheme, call-to-action, or content. Through carefully controlled experiments, organizations can measure the impact of these variations on user behavior, engagement, and conversion rates.
Randomization
The process begins by dividing the audience randomly into two or more groups, with each group exposed to a different variant. The control group receives the original version (referred to as the baseline or control, provided an original version exists), while the other groups receive modified versions. By tracking user interactions, such as clicks, conversions, time spent on a page, or any other predefined metrics, organizations can compare the performance of the different variants and determine which one yields the desired outcomes.
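To make the randomization concrete, here is a minimal Python sketch of how users could be allocated to variants. The function name, the 50/50 split, and the use of a hash of the user id (so that each user keeps the same variant on every visit) are illustrative assumptions, not a prescription.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant (hypothetical helper).

    Hashing the user id together with the experiment name keeps each user's
    assignment stable across visits while spreading users roughly evenly,
    and effectively at random, across the variants.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # uniform bucket index
    return variants[bucket]

# Example: a user is assigned once and stays in that group for the whole test
print(assign_variant("user_42", "homepage_redesign"))
```

A deterministic hash avoids storing an assignment table, but a simple random draw stored per user works just as well.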
Causality
The primary goal of A/B testing is to properly identify the effect of a change. Without following this strategy carefully, other factors could affect the behavior of the subjects. Imagine that Netflix decides to change its homepage to show the content that is currently watched the most instead of the latest releases (this is a hypothetical example). Then, imagine that the company does not use A/B testing but instead rolls out the change to everyone in April and compares the time spent on the platform and the number of subscribers between March and April. The differences could be caused by the homepage change, but also by a difference in weather, by changes at other online streaming platforms, etc. It would be impossible to identify the cause because multiple factors change at the same time and are therefore confounded. A/B testing solves this issue by randomly allocating subjects to two or more groups and testing them simultaneously, as sketched below. To learn more about causality, I invite you to read my two-part article on the Science and Art of Causality (https://medium.com/towards-data-science/the-science-and-art-of-causality-part-1-5d6fb55b7a7c).
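Because the randomized groups run at the same time, seasonality, weather, or competitors affect both equally, and the remaining difference can be attributed to the change itself. Below is a hedged sketch of such a simultaneous comparison using a two-proportion z-test from statsmodels; the conversion counts are made up for illustration.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results collected over the same period for both groups
conversions = np.array([480, 532])    # converted users: control, treatment (made up)
exposed = np.array([10_000, 10_000])  # users exposed to each variant (made up)

z_stat, p_value = proportions_ztest(conversions, exposed)
print(f"control rate:   {conversions[0] / exposed[0]:.2%}")
print(f"treatment rate: {conversions[1] / exposed[1]:.2%}")
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")
```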
Now, let us dive into three key limitations that organizations should consider before running an online A/B test to avoid costly bias. By understanding and mitigating these limitations, businesses can maximize the value of A/B testing, make more informed decisions, and drive meaningful improvements in their digital experiences.
1. Channel: Uncovering the User’s Perspective
One of the primary limitations of online A/B testing is understanding the reasons behind user preferences for one option over another. Often, the choice between options A and B is not explicitly justified, leaving experimenters to speculate about user behavior. In scientific research, we call this the "channel": the reasoning explaining the rationale for the causal effect. Imagine that your option B incorporates an additional feature on the checkout page (e.g., recommendations for similar products or products frequently bought together). You observe a drop in purchases with option B and hence conclude that it was a bad idea. However, a more careful analysis reveals that the page for option B actually took longer to load. You now have two differences: the content and the waiting time. Hence, going back to the concept of causality, you do not know which one drives the choice; the two are confounded. If you think that loading time is marginal, think again: "[…] experiments at Amazon showed a 1% sales decrease for an additional 100msec, and that a specific experiment at Google, which increased the time to display search results by 500 msecs reduced revenues by 20%" (Kohavi et al., 2007).
Solutions: First, to mitigate this limitation, incorporating additional survey questions can provide valuable insights into users’ motivations and hence minimize the risk of biased interpretations. Second, keeping the number of differences between variants to a minimum helps pin down the cause (e.g., ensuring both versions have the same loading time), as illustrated in the sketch below.
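As a concrete illustration of the second point, one can check whether the variants also differ on a nuisance dimension such as page load time before interpreting the main metric. The sketch below uses simulated latencies and a Welch t-test; the numbers are assumptions, not real measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated page load times in seconds for each variant (illustrative only)
load_time_a = rng.normal(loc=1.2, scale=0.3, size=5_000)
load_time_b = rng.normal(loc=1.5, scale=0.3, size=5_000)

t_stat, p_value = stats.ttest_ind(load_time_a, load_time_b, equal_var=False)
print(f"mean load time A: {load_time_a.mean():.2f}s, B: {load_time_b.mean():.2f}s")
print(f"Welch t-test p-value: {p_value:.3g}")
# A tiny p-value flags a latency gap: content and waiting time would be confounded.
```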
2. Short-Term vs. Long-Term Impact: Beyond Immediate Results
When conducting an online A/B test, it is essential to consider the potential long-term effects of the chosen metric. While short-term metrics, such as click-through rates or immediate conversions, may look favorable initially, optimizing for them could have adverse consequences in the long run. For example, employing clickbait strategies may yield quick views and impressions, but it might negatively impact the audience’s perception and your credibility over time.
Solution: It is crucial to measure multiple metrics that assess both short-term and long-term impact. By evaluating a comprehensive range of indicators, organizations can make more informed decisions and avoid myopic optimization strategies. Long-term impact metrics could include satisfaction surveys and audience retention (e.g., the time spent watching a video or reading an article). That being said, such metrics are not trivial to assess.
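As a rough illustration, the sketch below puts a short-term metric (click-through rate) next to a longer-term proxy (30-day retention) for each variant. The column names and the tiny data frame are hypothetical; in practice these would come from your event logs.

```python
import pandas as pd

# Hypothetical per-user outcomes (in practice, aggregated from event logs)
events = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "variant": ["A", "A", "A", "B", "B", "B"],
    "clicked": [1, 0, 1, 1, 1, 1],
    "returned_within_30d": [1, 1, 0, 0, 1, 0],
})

summary = events.groupby("variant").agg(
    click_through_rate=("clicked", "mean"),
    retention_30d=("returned_within_30d", "mean"),
)
print(summary)
# A variant that wins on click-through but loses on retention deserves extra scrutiny.
```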
3. Primacy and Newness Effects: The Influence of Novelty
Two related limitations arise from the influence of novelty in online A/B testing: the primacy and newness effects. The primacy effect refers to the fact that experienced users might be confused or lost when encountering a change, such as an alteration of a button’s placement or color. Conversely, the newness effect occurs when users are tempted to interact with a new feature simply because of its novelty, an effect that may fade quickly. These effects are particularly prevalent on platforms where users interact regularly, such as social media.
Solution: It is recommended to run experiments over several weeks and observe how the effects change over time. By monitoring how user behavior evolves, experimenters can gain a more comprehensive understanding of the long-term impact of their changes.
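One way to do this, sketched below with made-up weekly conversion rates, is to track the lift of B over A week by week: a lift that shrinks suggests a newness effect, while an initial dip that recovers suggests a primacy effect.

```python
import pandas as pd

# Hypothetical weekly conversion rates for each variant
weekly = pd.DataFrame({
    "week":       [1, 1, 2, 2, 3, 3, 4, 4],
    "variant":    ["A", "B"] * 4,
    "conversion": [0.100, 0.125, 0.101, 0.115, 0.099, 0.108, 0.100, 0.104],
})

pivot = weekly.pivot(index="week", columns="variant", values="conversion")
pivot["lift_B_vs_A"] = pivot["B"] - pivot["A"]
print(pivot)
# Wait for the lift to stabilize before declaring a winner.
```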
Conclusion:
While online A/B testing offers a valuable tool for data-driven decision-making, it is crucial to consider at least these three potential issues. By considering the channel through which users engage, measuring both short-term and long-term impacts, and accounting for primacy and newness effects, organizations can enhance the reliability and validity of their A/B testing results. This is just the tip of the iceberg, and I invite you to read further: Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). Practical guide to controlled experiments on the web: Listen to your customers not to the HiPPO. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 959–967).