Measure A/B Testing Platform Health with Simulated A/A and A/B Tests

How we ensure the reliability of our A/B testing platform with simulations

Qike (Max) Li
Towards Data Science


Contributors: Max Li, Eric Jia


A/B testing plays a critical role in decision-making at data-driven companies. It typically has the final say on go or no-go for a new product design, business strategy, machine learning model, and so on. Inaccuracies in the A/B testing platform therefore degrade every business decision derived from A/B tests. In this article, we share the methodology we use to continuously evaluate the reliability of our A/B testing platform and ensure a trustworthy experimentation platform.

The goal of the continuous evaluation is to measure the type I error rate and the power, which equals 1 − type II error rate. The type I error rate and the power determine the quality of a hypothesis test. Type I errors (false positives) typically take the form of a false claim that a new feature is beneficial or detrimental when it actually has no significant impact on the business metrics. Shipping a feature with no impact is a waste of effort and may lead to further investment in an unfruitful area. Type II errors (false negatives) result in concluding that a beneficial or detrimental product feature is neutral. It is a missed opportunity when we kill a product feature, which may have taken months to build, based on false A/B test results.

Simulated A/A tests measure type I error rates

To measure the type I error rate (false positive rate), we run 500 simulated A/A tests every day using data accumulated from the last seven days. Each simulated A/A test randomly splits users into control and treatment groups and calculates the metrics of the two experiment groups in the same way as the production metric calculation pipeline.
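The sketch below illustrates the idea of one day's simulated A/A tests. It is not our production pipeline: the placeholder metric data, the 500-repetition count, and Welch's t-test are assumptions made for illustration.

```python
# A minimal sketch of simulated A/A tests (illustrative, not the production pipeline).
# `user_metrics` stands in for per-user metric values aggregated over the last 7 days.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
user_metrics = rng.gamma(shape=2.0, scale=5.0, size=100_000)  # placeholder data

ALPHA = 0.05   # nominal p-value threshold
N_REPS = 500   # number of simulated A/A tests per day

false_positives = 0
for _ in range(N_REPS):
    # Randomly split users into control and treatment, as in a real A/A test.
    assignment = rng.integers(0, 2, size=user_metrics.size)
    control = user_metrics[assignment == 0]
    treatment = user_metrics[assignment == 1]
    # Compare group means; any "significant" result here is a false positive.
    _, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    false_positives += p_value < ALPHA

print(f"Observed type I error rate: {false_positives / N_REPS:.3f}")
```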

Measuring the type I error rate accurately requires a large number of A/A tests, but running all of these repetitions with real-time user traffic would be too costly. Simulated A/A tests run offline and therefore have no impact on app/website latency.

We found that 500 repetitions strike a good balance between computation cost and precision in the type I error rate measurement. With the nominal p-value threshold at 0.05, the expected type I error rate (false positive rate) of a reliable A/B testing platform is 5%. Assuming our false positive rate is indeed 5%, each repetition in the Monte Carlo simulation is an independent Bernoulli trial with a success probability of 5%. Hence, the 95% confidence interval (CI) of our false positive rate measurement is

0.05 ± 1.96 × √(0.05 × 0.95 / n)

The CI shrinks with the square root of the number of repetitions (n). The table below demonstrates the calculated CI with various numbers of repetitions.

Table 1. CI shrinks with the square root of the number of repetitions (n)
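For reference, the half-widths in the table follow directly from the normal approximation above. A quick sketch, where the repetition counts shown are illustrative:

```python
# Half-width of the 95% CI for the measured false positive rate,
# assuming the true rate is 0.05 (normal approximation).
import math

P = 0.05   # assumed true false positive rate
Z = 1.96   # z-score for a 95% confidence interval

for n in (100, 500, 1000, 5000):  # illustrative repetition counts
    half_width = Z * math.sqrt(P * (1 - P) / n)
    print(f"n={n:>5}: 0.05 +/- {half_width:.4f}")
```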

Figure 1 demonstrates a dashboard to monitor the false positive rate over time.

Fig 1. Continuous false positive rate evaluation

Further, we continuously evaluate the distributions of the p-values from the simulated A/A tests. The p-value should be uniformly distributed when the null hypothesis is true, and all other assumptions are met [1]. The null hypothesis is μ1=μ2, where μ1 and μ2 are the population means of the control bucket and the treatment bucket, respectively.

Fig 2. p-value distributions of A/A tests
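One way to complement the visual check in Figure 2 is a goodness-of-fit test of the A/A p-values against the uniform distribution. A sketch, assuming the daily p-values are collected in an array (the placeholder values below are for illustration):

```python
# Check whether the p-values from the simulated A/A tests look uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
aa_p_values = rng.uniform(size=500)  # placeholder; use the real A/A p-values here

# Kolmogorov-Smirnov test against the uniform(0, 1) distribution.
ks_stat, ks_p = stats.kstest(aa_p_values, "uniform")
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3f}")
```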

Simulated A/B tests measure power

The statistical power, which equals 1 − type II error (false negative) rate, is equally important but more challenging to evaluate because of the lack of ground truth. We typically do not have product features that are certain to improve the metrics; otherwise, we would have already implemented them. On the other hand, it would be costly and sometimes even unethical to ship an inferior version of the product just to evaluate the false negative rate of an A/B testing platform. Simulated A/B tests let us assess the power in various business scenarios without interfering with the user experience.

We conduct a simulated A/B test in the same way as a simulated A/A test, except that we introduce a synthetic difference in the treatment group, e.g., increasing the metric value by 1% for users in the treatment group. In a simulated A/B test, we know the null hypothesis (i.e., no effect) does not hold because of the synthetically introduced difference. Any neutral result (a result that is not statistically significant) in a simulated A/B test is therefore a type II error (false negative).
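A sketch of simulated A/B tests with a synthetic 1% lift, under the same illustrative assumptions as the A/A sketch above (placeholder data and Welch's t-test rather than the production pipeline):

```python
# A minimal sketch of simulated A/B tests with a synthetic 1% lift.
# By construction, every non-significant repetition is a false negative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
user_metrics = rng.gamma(shape=2.0, scale=5.0, size=100_000)  # placeholder data

ALPHA = 0.05
N_REPS = 500
LIFT = 0.01  # synthetic 1% relative lift applied to the treatment group

significant = 0
for _ in range(N_REPS):
    assignment = rng.integers(0, 2, size=user_metrics.size)
    control = user_metrics[assignment == 0]
    treatment = user_metrics[assignment == 1] * (1 + LIFT)  # inject the lift
    _, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    significant += p_value < ALPHA

power = significant / N_REPS
print(f"Estimated power: {power:.3f}, type II error rate: {1 - power:.3f}")
```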

Furthermore, with simulation, we can evaluate the power of A/B tests in various scenarios. Some illustrative scenarios are:

  • Different levels of metric differences between control and treatment: 0.1%, 0.2%, 0.5%, etc.
  • Different patterns of the differences: a new product feature primarily impacts heavy users (see the sketch below), a new product feature's effectiveness depends on user attributes (such as age, geolocation, language), etc.

Note that a typical power analysis is equivalent to a simulated A/B test with a uniform difference, where the population mean in the treatment bucket is shifted uniformly by some amount X. With simulated A/B tests, we can also understand the statistical power of A/B tests under more sophisticated business assumptions.
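To illustrate a non-uniform scenario such as the heavy-user case above, the injected lift can depend on user attributes. A hedged sketch, where the "heavy user" cutoff (top 10% by metric value) and the 2% lift are arbitrary choices for illustration:

```python
# Sketch of one simulated A/B test repetition where only "heavy users" receive
# the lift, approximating a feature that mainly impacts the most active users.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
user_metrics = rng.gamma(shape=2.0, scale=5.0, size=100_000)   # placeholder data
heavy_user = user_metrics > np.quantile(user_metrics, 0.9)     # top 10% of users

assignment = rng.integers(0, 2, size=user_metrics.size)
lift = np.where(heavy_user, 0.02, 0.0)          # 2% lift for heavy users only
treated_metrics = user_metrics * (1 + lift)

control = user_metrics[assignment == 0]
treatment = treated_metrics[assignment == 1]
_, p_value = stats.ttest_ind(control, treatment, equal_var=False)
print(f"p-value for one heterogeneous-lift repetition: {p_value:.3f}")
```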

Figure 3 shows the results of the simulated A/B tests at different levels of lift, where a < b < c < d. When the lift reaches c%, most metrics have sufficient power. Also, even at the same lift, different metrics have different levels of power because their variances differ.

Fig 3. Continuous false negative rate evaluation

Conclusion

The more we improve our A/B testing platform, the more we realize that the devil is in the details. It is surprisingly hard to get A/B testing right. Something seemingly straightforward can lead to severe drawbacks, and approaches that appear reasonable can yield highly inaccurate results. A systematic evaluation of the health of the A/B testing platform is therefore paramount. The simulated A/A and A/B tests continuously measure the platform's health.

In future work, we aim to make our evaluations more comprehensive and convenient. Our current simulated A/A tests are generic because they use data from all users; we plan to enable experiment-specific simulated A/A tests by scheduling on-demand A/A tests that include only the users exposed to a particular experiment. In addition, we plan to add more sophisticated scenarios to the simulated A/B tests to represent a wider variety of business cases.

Acknowledgments

Thanks to Chao Qi for his contribution to this project. We are also grateful to Pai Liu for his support, and Pavel Kochetkov, Lance Deng, Delia Mitchell for their feedback.

Data scientists at Wish are passionate about building a trustworthy experimentation platform. If you are interested in solving challenging problems in this space, we are hiring for the data science team.

References

[1] D. J. Murdoch, Y.-L. Tsai, and J. Adcock, P-values are random variables (2008), The American Statistician 62(3): 242–245.
