With the rise of myriad software development frameworks and ever more accessible cloud infrastructure, building products has never been easier, and more products than ever are competing for user attention. As a result, experimentation is becoming an increasingly important weapon in the arsenal for building successful products at scale that stand out from the crowd. In particular, experimentation, when done correctly, enables 3 key outcomes that help move a product quickly towards product-market fit:
1. Causation instead of correlation
In the absence of a controlled experiment, data analysis often can only establish correlations between inputs and outputs. Correlation is useful to inform what hypotheses to test, but it cannot definitively inform how to change the product to achieve the desired outcome. With properly executed controlled experiments, on the other hand, one can establish a quantified and causal relationship between the inputs being tested and target outcomes of interest. This leads to clear and actionable steps to improve the product.
2. Learning and prioritization
A holistic experimentation approach is not limited to point-wise optimization of specific features; it teaches the experimenter not just what works, but also the scale of its impact, for their specific users and workflows. These learnings can compound over time to spawn new hypotheses and serve as an objective prioritization mechanism that helps product teams focus on the most impactful areas for further iteration and development.
3. Decentralized execution
As products grow in scale and surface area, centralized product planning invariably becomes the bottleneck that prevents new ideas from getting implemented quickly. This stifles innovation and slows down healthy product evolution. A holistic experimentation approach enables core product health metrics to be the ultimate arbiter of what works and what does not. This gives everyone permission to implement and prove that their idea positively impacts the product, without a centralized decision-making body. The centralized authority only has to hold the line on the core product health metrics and the practical thresholds for positive impact. This mode of decentralized execution is a key enabler for rapid product evolution without compromising on product quality.
In the next few sections, I will attempt to describe the ingredients needed and practical steps for getting the most out of your experiments.
What do you need to experiment effectively?
In order for a product organization to effectively leverage experimentation, it needs to build (or purchase) the right infrastructure, develop the processes and best practices, and dedicate the appropriate people and resources. Below are some high level guidelines on what these should look like, but the specific implementation should be tailored and authentic to the particular product and organization.
Infrastructure
There are 4 major components in an experimentation stack:
- Feature management – launch and unlaunch features for specific user segments. For example, show the new onboarding flow to 10% of the Chrome users.
- Experiment engine – sits on top of feature management to manage experiment configurations and serve up experiments. Specifically, this means handling all feature variants and their assignment to randomized user buckets within the target group, as well as physically starting and stopping the experiment. For example, run an A/B/C test with 2 new onboarding flows on 10% of the Chrome users, split the traffic evenly 3 ways, run the experiment for 1 week starting next Monday, and evaluate the variants on the metric of onboarding completion rate. (A minimal sketch of variant assignment follows this list.)
- Data logging – collect appropriate event data from the user-experiment interaction for evaluation and diagnostics. For example, capture events relating to the start of onboarding, individual onboarding actions, and onboarding completion, for each user in the onboarding flow experiment.
- Analytics and reporting – synthesize and present experiment results using rigorous statistics. For example, show that onboarding flow variant C increased onboarding completion rate by 4% relative to control at a significance level of 5%.
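To make the feature management and experiment engine pieces above concrete, here is a minimal sketch of deterministic variant assignment via hashing. The function name, experiment name, and traffic split are hypothetical; production systems layer exposure logging, targeting rules, and mutual exclusion between experiments on top of this basic idea.

```python
import hashlib

def assign_variant(user_id, experiment, traffic, variants):
    """Deterministically assign a user to a variant, or None if not in the experiment.

    Hashing user_id together with the experiment name yields a stable,
    roughly uniform bucket in [0, 1), so the same user always sees the same
    variant and different experiments are randomized independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15   # uniform float in [0, 1)

    if bucket >= traffic:                    # user falls outside the target traffic
        return None
    position = bucket / traffic              # re-scale to [0, 1) within the experiment
    return variants[int(position * len(variants))]

# Example: 10% of users, split evenly across control and two new onboarding flows
print(assign_variant("user-123", "onboarding-flow-test", 0.10,
                     ["control", "flow_a", "flow_b"]))
```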
Process
"Do no harm"
Establish a core set of global evaluation metrics that the entire organization is aligned on. These metrics need to reflect company or product level goals, and should be a part of every experiment evaluation. It’s fine to use experiments to optimize feature level metrics, but nothing should negatively impact the global metrics. This prevents local optimizations that are not accretive to company or product level goals, and is also a key part of what enables nimble, decentralized execution.
Decide on evaluation criteria (and action plan) ahead of time
Not doing so enables the very human tendency toward confirmation bias. People have vested interests in the experiment working out (and sometimes in it not working out), and will often cut the data until it matches their particular bias. This will significantly inflate the false discovery rate and degrade the value of the experiment. As Ronald Coase famously said:
If you torture the data long enough, it will confess to anything
To counter this, all stakeholders should work together to coalesce their perspectives into the overall evaluation plan before the experiment, as well as align on the post-experiment action plan. The team running the experiment should then document and publish these plans, before the experiment is conducted.
Living knowledge base
Experiments should not be treated as a way to settle debates; instead, they should be a way to continuously learn and improve the org’s understanding of what works (and by how much) for your specific users and workflows. Ideally, each experiment spawns new ideas and turns the flywheel of learning. The results of these experiments can then be continuously added to a living knowledge base that compounds over time.
Example experiment cycle
- Align relevant stakeholders on design, evaluation, and implementation plan
- Publish plan
- Run experiment
- Review experiment results with relevant stakeholders
- Roll out / implement the change if results pass evaluation
- Document learnings, and identify follow up areas for further experimentation
People
Champion
A strong champion, who buys into the value of experimentation, is needed to help instill the culture of experimentation and hold the rest of the organization accountable for doing their part. Without this person, it can be difficult to achieve the alignment necessary to truly unlock experimentation as a driver of product growth.
Dedicated resources
Product / engineering / design / data science teams all need to collaborate to hypothesize, design, implement, and run experiments. Post-experiment, they also need to collaborate to learn from the results and iterate. This takes time and resources, which, if not explicitly allocated, will likely lead to deprioritization of experimentation altogether.
Overall program owner
This role is particularly important for orgs that are in the early stages of setting up experimentation. To be clear, this is not the person responsible for actually doing the work of hypothesizing, designing, implementing, and running experiments. This person’s role is to be the on-the-ground counterpart to the champion: helping define best practices, holding people accountable, facilitating alignment, and curating the knowledge base. As the experimentation muscle and culture grow, this role ideally fades away and becomes more diffuse, as the individual teams take on the responsibility for their own experiments.
How do you run successful experiments?
Steps of an A/B test
The basic A/B test is the workhorse of experimentation. Many of the more complex designs are essentially combinations and variations built on collections of A/B tests. As a result, it is worth going in depth to understand how to properly run an A/B test.
1. Choose the evaluation metric
At the risk of beating a dead horse, it is important to do this ahead of time. It is very human to want to massage the data and metrics after the fact; don’t do it. It will inflate the false discovery rate.
The decision should ideally be made on one metric that is as close as possible to the true drivers of the business. If you are having trouble prioritizing, try translating the improvements in the candidate metrics into a dollar value. This helps narrow down the choices and also exposes the implicit assumptions that could lead to local optimization at the expense of global optimization. For example, say you want to test a new advertisement design; the click-through rate might seem like a great metric to use. The implicit assumption here is that by changing the design, the only thing that will be impacted is the click-through rate. But if we work through the math on top-line impact:
revenue = ad impressions × click-through rate × purchase rate × average revenue per purchase
It is possible to improve the click-through rate while also degrading overall revenue, by lowering the purchase rate and the average revenue per purchase. This could happen, for example, if the new ad design attracts lower-propensity customers.
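To make this concrete, here is a small sketch with made-up numbers where the new ad design wins on click-through rate yet loses on overall revenue because the downstream rates degrade:

```python
# Hypothetical numbers to illustrate the decomposition:
# revenue = ad impressions x click-through rate x purchase rate x avg revenue per purchase
def revenue(impressions, ctr, purchase_rate, avg_revenue_per_purchase):
    return impressions * ctr * purchase_rate * avg_revenue_per_purchase

baseline   = revenue(1_000_000, 0.020, 0.10, 50.0)  # current ad design, CTR 2.0%
new_design = revenue(1_000_000, 0.025, 0.07, 45.0)  # higher CTR, but weaker buyers

print(f"baseline revenue:   ${baseline:,.0f}")      # $100,000
print(f"new design revenue: ${new_design:,.0f}")    # $78,750
```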
To illustrate this with a real-world example: in the early days of the Bing search engine, the team optimized the search relevance algorithm on distinct user queries. The implicit assumption was that, everything else being equal, the more users searched, the more ad impressions, and the higher the revenue. This seemed fine until they rolled out a bad relevance algorithm that made the search results markedly worse. It produced a short-term 10% increase in distinct user queries and a 30% increase in ad revenue, because users had to work much harder to find what they were looking for. Over time, users became frustrated by the irrelevant results and stopped using the service. The implicit assumption not accounted for here was that the number of users would not be negatively impacted, which turned out not to be true. For more details on this and other interesting experiments see [1].
2. Run an A/A test if possible
If resources allow, it is very beneficial to run an A/A test (a test where both feature variants are exactly the same) before the actual experiment. This can help you identify potential issues in your experiment stack. Some common problems include issues with randomization [2], biased user sampling, and incorrect data collection. It is also very useful to establish baselines on the average values and variances of the evaluation metrics, as well as the traffic size and distribution of the target population. These are important inputs for properly designing the experiment to achieve adequate power and confidence level.
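One concrete check an A/A test enables is a sample ratio mismatch test: if the observed traffic split deviates from the intended split by more than chance would allow, the randomization or data logging is likely broken. A minimal sketch, with hypothetical counts, using a chi-square goodness-of-fit test:

```python
from scipy.stats import chisquare

# Observed user counts in each arm of an A/A test intended as a 50/50 split
observed = [50_432, 49_105]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:   # a strict threshold is typical for this guardrail check
    print(f"Possible sample ratio mismatch (p = {p_value:.2e}); check bucketing and logging")
else:
    print(f"Split is consistent with 50/50 (p = {p_value:.3f})")
```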
3. Define experiment parameters
Using the baselines measured from the A/A test (or estimates from historical data), the experimenter should explicitly define the experiment parameters. It is entirely up to the experimenter to choose the values of these parameters, but it is important to understand what they mean and how to appropriately customize them for the specific questions and use cases.
Hypothesis
- Hypothesis testing is central to evaluating experimental results, but it can also be somewhat non-intuitive. Therefore, it is usually helpful to be very explicit about the procedure and assumptions: we start from the default assumption that the null hypothesis is true, and only if the data provides sufficient evidence to reject the null hypothesis do we adopt the alternative hypothesis.
- The null hypothesis H0: evaluation metric with the new feature is no better than the baseline
- The alternative hypothesis Ha: evaluation metric with the new feature is better than the baseline
- Note that this is a one-sided hypothesis, meaning we are only testing whether the new feature is better than the baseline. We will use this for the rest of the discussion as well. This is different from a two-sided hypothesis, which tests for inequality, i.e. whether the new feature is either better or worse than the baseline.
Significance threshold
- The cutoff below which a result’s p-value (the probability of seeing data at least as extreme as the experiment results if the null hypothesis is true and there is no real effect) is considered significant. Put another way, it bounds the likelihood of declaring a winner when the results occurred by random chance alone.
- Main decision criterion for rejecting the null hypothesis
- Also known as the critical p-value, and commonly denoted α in equations
- Typically set at 5%
Power
- The probability that the experiment rejects the null hypothesis when the alternative hypothesis is true and there is a real effect. Put another way, the likelihood that the experiment actually detects a real effect of the assumed size.
- Not a decision criterion, but an important consideration in experiment design
- Commonly denoted as 1-β in equations
- Typically set at 80%
Minimum detectable effect
- Given significance level and power, what is the smallest lift in the evaluation metric that the experiment should be able to detect
- Commonly denoted as δ in equations
Traffic allocation ratio
- The portion of traffic to allocate to the test feature vs control, i.e. the sample size of the treatment group divided by the sample size of the control group. Typically ranges between 0 and 1.
- Commonly denoted as r in equations
- 50/50 equal split is most efficient in terms of overall traffic requirement. Any deviation will require more overall traffic to achieve the same level of significance and power
- It is important to note that it may not always be practical to use a 50/50 split. For example, if one is testing out the impact of a discount code, one may want to allocate less traffic to the treatment group to limit the reduction in revenue
- It is often desirable to slowly ramp up the traffic allocation for the new feature, so that if there is a bug that impairs the user experience, or an extreme adverse reaction, it can be caught before it has impacted half of all users. This can be done in a stepwise manner over the course of a few days or weeks, depending on the volume of traffic.
Sample size
- An insufficient number of observations will lead to under-powered and error-prone experiments that either fail to detect a true effect or incorrectly detect a non-existent effect. Therefore, it is very important to calculate the required sample size based on the above parameters. The normal-approximation formulas for a one-sided hypothesis, where the alternative hypothesis is Ha: evaluation metric with new feature > evaluation metric with control, are presented below. For a two-sided hypothesis, where the alternative hypothesis is Ha: evaluation metric with new feature ≠ evaluation metric with control, change α to α/2.
- For a continuous evaluation metric, e.g. revenue per visit, a commonly used normal-approximation formula for the sample size N required for the treatment group is (see [3] for details and derivations)
N = (1 + r) × σ² × (z_{1-α} + z_{1-β})² / δ²
where σ is the standard deviation of the evaluation metric, r is the traffic allocation ratio, z_{x} is the x-th quantile of the standard normal distribution, and the control group size is N / r
- For a rate evaluation metric, e.g. click-through rate, with baseline rate p, the corresponding approximation is (see [4] for details and variations)
N = (z_{1-α} + z_{1-β})² × (r × p × (1 − p) + (p + δ) × (1 − p − δ)) / δ²
See this for more details and example implementations in python.
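A minimal sketch of these calculations in python, based on the normal-approximation formulas above (the exact expressions in [3] and [4] may differ slightly in form); the example inputs are hypothetical:

```python
from scipy.stats import norm

def sample_size_continuous(sigma, delta, alpha=0.05, power=0.80, r=1.0):
    """Treatment group size for a one-sided test on a continuous metric.

    sigma: standard deviation of the metric, delta: minimum detectable effect,
    r: traffic allocation ratio (treatment / control); control size is N / r.
    """
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return (1 + r) * sigma**2 * z**2 / delta**2

def sample_size_rate(p_baseline, delta, alpha=0.05, power=0.80, r=1.0):
    """Treatment group size for a one-sided test on a rate metric."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    p_treat = p_baseline + delta
    variance = r * p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return z**2 * variance / delta**2

# Example: detect a 1 percentage point lift on a 20% onboarding completion rate, 50/50 split
print(f"{sample_size_rate(0.20, 0.01):,.0f} users needed in each group")  # roughly 20k
```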
Experiment length
- The duration of the experiment is generally calculated by dividing the required sample size by the average daily traffic. For products with a large number of users, it may be possible to satisfy the sample size requirement with a couple of days’ worth of traffic. However, it is usually beneficial to run the test over at least a full week, because the characteristics of users may vary significantly across different days of the week. Similar dynamics may exist across holiday vs non-holiday periods. Ignoring these effects when choosing the experiment length may result in biased sampling and incorrect conclusions.
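For example, with made-up traffic numbers, rounding the duration up to whole weeks so every day of the week is equally represented:

```python
import math

required_per_group   = 20_000   # from the sample size calculation above
daily_eligible_users = 6_000    # hypothetical traffic entering the experiment
groups               = 2        # treatment + control with a 50/50 split

days  = math.ceil(groups * required_per_group / daily_eligible_users)
weeks = math.ceil(days / 7)     # round up so every weekday is equally represented
print(f"Run for {weeks * 7} days ({weeks} full week(s))")
```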
4. Run experiment
There are many good reasons for monitoring the metrics while running the experiment, such as catching unintended bugs that impair the user experience or catastrophically bad variant features that are clearly losers. However, this "peeking" also significantly increases the false discovery rate under normal circumstances. If you peek continuously at an A/A test, the chance of finding a significant result at a p-value of 5% is actually 26% [5]. For more details, the reader is encouraged to run the simulation described here to get a feel for this effect.
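A minimal sketch of such a simulation, assuming an A/A setup where a standard two-sample t-test is checked at the end of every day and the experiment is stopped at the first significant result:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n_days, users_per_day = 1_000, 20, 200
false_positives = 0

for _ in range(n_experiments):
    a, b = [], []
    for _ in range(n_days):                          # "peek" at the end of every day
        a.extend(rng.normal(0, 1, users_per_day))
        b.extend(rng.normal(0, 1, users_per_day))    # identical distributions: an A/A test
        if ttest_ind(a, b).pvalue < 0.05:            # stop early and declare a (false) winner
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.0%}")
# Well above the nominal 5%, even though there is never any real effect
```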
There are a few different statistical approaches to combat this problem, e.g. sequential testing [6][7] and Bayesian statistics [8][9]. Without getting into too much detail, these techniques allow the experimenter to call the experiment early given sufficient evidence in the data, without having to wait for the predetermined number of observations. Readers are encouraged to learn more in the provided references.
5. Analyze results
The procedures for a one-sided hypothesis, where the alternative hypothesis is Ha: evaluation metric with new feature > evaluation metric with control, are presented below. Note that these assume the treatment and control groups are independent, and they do not require the two groups to share the same underlying population variance. For more details and variations under different assumptions, readers are encouraged to consult the provided references.
For a continuous evaluation metric, e.g. revenue per visit, a t-test is often used to evaluate the outcome, so named for its use of the t-distribution. A typical procedure (Welch’s unequal-variance form) is to compute the test statistic t = (mean_treatment − mean_control) / sqrt(s_t²/N_t + s_c²/N_c), obtain the one-sided p-value from the t-distribution with the Welch–Satterthwaite degrees of freedom, and compare it against the significance threshold [10].
For a rate evaluation metric, e.g. click-through rate, a z-test is often used to evaluate the outcome, so named for its use of the standard normal distribution. A typical procedure is to compute the pooled rate p = (conversions_treatment + conversions_control) / (N_t + N_c), the test statistic z = (p_t − p_c) / sqrt(p(1 − p)(1/N_t + 1/N_c)), and the one-sided p-value from the standard normal distribution, and compare it against the significance threshold [11].
See this for more details and example implementations in python, scroll to the analysis section.
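A minimal sketch of both tests using scipy (Welch’s unequal-variance t-test for the continuous case and a pooled two-proportion z-test for the rate case); the data and counts below are simulated, and the one-sided alternative argument assumes scipy 1.6 or newer:

```python
import numpy as np
from scipy.stats import ttest_ind, norm

# Continuous metric: one-sided Welch's t-test on simulated revenue per visit
rng = np.random.default_rng(1)
control   = rng.exponential(scale=10.0, size=5_000)
treatment = rng.exponential(scale=10.5, size=5_000)
result = ttest_ind(treatment, control, equal_var=False, alternative="greater")
print(f"t-test p-value: {result.pvalue:.4f}")

# Rate metric: one-sided two-proportion z-test on click-through rate
clicks_c, n_c = 980, 20_000
clicks_t, n_t = 1_085, 20_000
p_c, p_t = clicks_c / n_c, clicks_t / n_t
p_pool = (clicks_c + clicks_t) / (n_c + n_t)
z = (p_t - p_c) / np.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
print(f"z-test p-value: {1 - norm.cdf(z):.4f}")
```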
6. Call the winner
If the calculated p-value is below the significance threshold set before the experiment (typically 5%), one can reject the null hypothesis and declare the new feature being tested a winner. If it is not below the threshold, then there is not enough evidence to say that the new feature is better than the baseline control.
Generalization for A/B/n experiments
A/B/n tests are an extension of the basic A/B tests, where instead of just 2 variants, 3 or more variants of a given element are compared against one another in the experiment. Note that element here refers to a generic unit of variation and could be as large as an entire web page or as small as a single button. Most of the steps from the basic A/B test outlined above apply to A/B/n as well, except for sample size and analysis procedure, where modifications are necessary.
The sample size calculation adjustment for an A/B/n test is relatively straightforward: each element variant needs to satisfy the pairwise criteria against the control group, given the specific parameters of the comparison. In the simplest form with equal allocation, all individual test group sample sizes are the same, and the total sample size reduces to (m + 1) × N, where m is the number of variants and N is the sample size for the control group as determined by the equations outlined above.
The analysis procedure adjustment is a little more complicated, due to the nature of the frequentist statistics used in hypothesis testing. Recall that the p-value of an A/B test, as calculated via the analysis procedure outlined above, is the probability that the data observed in the experiment is purely due to randomness when there is no real difference between treatment and control. Suppose we ran an A/B/n test with 20 variants, and all of them came out better than control at a significance level of 5%. What is the probability that at least one of those results happened due to chance alone? One might be surprised to learn that it is not 5%, but 64%. It is relatively easy to show this using basic probability:
- For each comparison, the probability of a false positive (a result that looks significant purely due to randomness) = 5%
- For each comparison, the probability of not seeing such a false positive = 1 − 5% = 95%
- The probability that at least one of the 20 results is due to randomness = 1 − the probability that none of them are = 1 − (95%)^20 ≈ 64%
This error rate increases with the number of variants tested in the experiment and is sometimes referred to as the familywise error rate (FWER). Two well-known procedures can be used to correct for it when analyzing the results of an A/B/n experiment:
- Bonferroni correction [12] – to control the FWER at a desired level α, each of the individual pairwise comparisons must be tested at a significance level of α/m, where m is the number of variants being tested. For example, if we want the FWER to be no more than 5% in a 20-variant experiment, then each pairwise comparison must be significant at 5%/20 = 0.25%.
- Benjamini-Hochberg procedure [13] – this procedure is motivated by the fact that the Bonferroni correction is quite strict and is likely to miss real effects that do not quite make the statistical cut. The key insight is that the problem can be reframed in terms of the false discovery rate (FDR) instead of the FWER, defined as the number of null hypotheses rejected incorrectly divided by all null hypotheses rejected. In other words, of the pairwise comparisons declared significantly different, what portion is actually due to random chance alone. Controlling the FDR is less strict and leads to higher-powered tests. Tactically, to control the FDR at a level α, one rank-sorts the pairwise comparisons by their p-values in ascending order and rejects the null hypothesis for all comparisons up to the largest rank i for which the p-value ≤ α × i / m, where m is the number of variants tested.
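A minimal sketch of both corrections applied to a set of hypothetical p-values from pairwise comparisons; statsmodels' multipletests implements both methods:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from pairwise comparisons of 8 variants against control
p_values = np.array([0.001, 0.004, 0.012, 0.021, 0.034, 0.048, 0.180, 0.620])

# FWER if we naively accept every result below 0.05 and all nulls are in fact true
print(f"Uncorrected FWER for 8 comparisons: {1 - 0.95 ** len(p_values):.0%}")  # ~34%

reject_bonferroni, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, _, _, _         = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejects:        ", reject_bonferroni)   # 2 of 8 in this example
print("Benjamini-Hochberg rejects:", reject_bh)           # 4 of 8: less strict, higher power
```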
Multivariate experiments
Multivariate testing is a further generalization from A/B/n testing, where instead of testing variants of just one element, one tests variants of 2 or more elements in combination with one another in a single experiment.
While one can think of each variant combination in a multivariate test just as an independent variant, and treat the collection as a big A/B/n test, multivariate design actually offers two significant additional benefits over the more simplistic A/B/n paradigm:
1. Relative impact and prioritization
Multivariate tests can help the experimenter tease out the relative contribution of the individual element. For example, you might want to test the size of a button in conjunction with its color and its display text. One can use a full factorial design, where all combinations of element variants are tested against one another, to figure out the relative impact of size vs color vs display text. The relative impact can then be used to prioritize the most important element for further testing and iteration, leading to more efficient use of resources and faster product improvement.
2. Interaction effects
Multivariate testing can also help the experimenter understand whether some element variants work better (or worse) in tandem than the sum of the individual element performances might suggest. For example, suppose one runs a study on the enjoyment of food where the food is varied between hot dog and ice cream, and the topping is varied between chocolate sauce and mustard. You are likely to find that people enjoy the combination of mustard on ice cream far less than mustard’s general popularity as a condiment would suggest. In this particular example, the effect is fairly obvious – the enjoyment depends heavily on the specific combination of food and topping. In a more complex product, however, this type of interaction effect may be far less self-evident. This is where careful multivariate design can be very helpful in disentangling and quantifying these effects.
In order to understand the relative impact of elements and any potential interaction effects, one must analyze the data produced by the multivariate experiment. It is possible to approach it as a big A/B/n experiment and perform pairwise comparisons of all distinct combinations against the control (with the appropriate error correction procedure), but it is then difficult to parse out the relative impact of each element and any interaction effects, especially if the number of combinations is large. To get at these higher-level attributes more directly, one should use multiple regression techniques [14]. The high-level procedure is as follows:
- Use logistic regression for rate evaluation metrics, and linear regression for continuous evaluation metrics
- Include all elements and any desired interaction terms as input features. E.g. food, topping, food x topping.
- Run appropriate regression algorithm against the evaluation metric, with regularization to eliminate non-impactful input features [15]
- Use cross validation to avoid overfitting and find the optimal regression model [16]
- Optional: if your sample size is fairly large (e.g. > 10k observations), consider creating a random 20% holdout set that is excluded from training of the regression model. This holdout set, sometimes referred to as the test set, can be used to objectively evaluate the true accuracy of the regression model. If the test accuracy of the model is significantly worse than the cross-validation accuracy during training, then the model is likely overfit and should not be trusted.
- Assuming the regression model is not overfit, its coefficients can be used to interpret the relative importance of the various elements and any interaction effects that may be present
Click here for an example multivariate experiment analysis implemented in python
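As a minimal sketch of the regression step, here is a simulated 2x2 example with hypothetical element names (button size and color); the regularization and cross-validation steps above are omitted for brevity:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a 2x2 multivariate test: button size and button color, binary conversion outcome
rng = np.random.default_rng(7)
n = 20_000
df = pd.DataFrame({
    "size": rng.choice(["small", "large"], n),
    "color": rng.choice(["blue", "green"], n),
})
# True (normally unknown) effects: "large" helps, "green" helps, but the combination
# helps less than the sum of the two parts (a negative interaction)
logit = (-2.0
         + 0.30 * (df["size"] == "large")
         + 0.20 * (df["color"] == "green")
         - 0.25 * ((df["size"] == "large") & (df["color"] == "green")))
df["converted"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Logistic regression with main effects and the interaction term
model = smf.logit("converted ~ size + color + size:color", data=df).fit(disp=0)
print(model.summary())
# The coefficient on the size:color term estimates the interaction effect; its sign
# and confidence interval tell you whether the elements work better or worse in tandem
```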
While the multivariate experiment is a useful tool, it also has several drawbacks that need to be considered as part of the overall experiment design process. Specifically:
- More engineering resources – more element variants mean more time and energy devoted to building them. This can lead to long lead times and slow iteration, as experiments cannot start until all variants are ready to launch.
- Longer experiments – each pairwise comparison between the baseline and an element variant combination must satisfy the sample size requirement of a standard A/B test. Since multivariate experiments generally have a larger number of distinct combinations, this tends to lead to significantly larger sample size requirements.
- More complex analysis – multivariate tests generate a lot more data, and as a result, the analysis required to quantify the results is also more complex. More importantly, however, it is also easier to fall prey to confirmation bias, as there are many more ways to slice and dice the data to show positive results. This makes it more likely to falsely discover effects that may not really be there.
Takeaways
- Experimentation is a key component of a successful product strategy
- Invest in infrastructure, process and people to get the most out of experimentation
- Use experiments as a way to learn and prioritize your resources
- Make sure you are optimizing the right metrics for the business – beware of local optima and implicit assumptions
- Choose the appropriate experiment design for your questions and product
- Follow good experiment practice guided by statistics to avoid going down the wrong path
If you’ve made it this far and still have questions, feel free to reach out. Twitter | Linkedin | Medium
References
[1] https://notes.stephenholiday.com/Five-Puzzling-Outcomes.pdf
[2] https://codeascraft.com/2018/11/07/double-bucketing-in-ab-testing/
[3] https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-018-0602-y
[4] https://www2.ccrb.cuhk.edu.hk/stat/proportion/tspp_sup.htm
[5] https://www.evanmiller.org/how-not-to-run-an-ab-test.html
[6] https://www.evanmiller.org/sequential-ab-testing.html
[7] https://codeascraft.com/2018/10/03/how-etsy-handles-peeking-in-a-b-testing/
[8] https://www.evanmiller.org/bayesian-ab-testing.html
[9] http://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf
[10] https://en.wikipedia.org/wiki/Student%27s_t-test
[11] https://online.stat.psu.edu/stat800/lesson/5/5.5
[12] https://en.wikipedia.org/wiki/Bonferroni_correction
[13] Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1, 289–300.
[14] http://www.biostathandbook.com/multipleregression.html
[15] https://en.wikipedia.org/wiki/Regularized_least_squares#Specific_examples
[16] https://en.wikipedia.org/wiki/Cross-validation_(statistics)