How Wish A/B tests percentiles

A Methodology Study for Detecting Differences in Percentiles in Large-scale Online Experiments

Contributors: Qike (Max) Li & Bo Gao

In this article, we share our journey of developing an A/B testing methodology for percentiles. We cover example use cases, why it is challenging to A/B test percentile metrics, how seemingly simple solutions fail, the statistical test we ended up using, and how well it performs.

At Wish, we A/B test almost all new product features to ensure they enhance the user experience. In each experiment, we conduct hypothesis tests for a number of metrics, and some of those metrics are percentiles, such as P50 (fiftieth percentile, or median) or P95 (ninety-fifth percentile). When testing the difference between two percentiles from two experiment buckets (e.g., P95 of control vs. P95 of treatment), typical hypothesis testing methods, such as the t-test, are not suitable since they compare the means, rather than the percentiles, of two samples.

Example Use Cases

Percentile A/B testing enables us to measure the impact of a new product feature on website performance. Website performance strongly affects user experience and has real monetary effects: Amazon found that just 100 milliseconds of extra load time cost them 1% in sales, and Pinterest reduced wait times by 40%, which led to a 15% increase in search engine traffic and sign-ups.

Percentiles, such as P50 (fiftieth percentile) or P95 (ninety-fifth percentile), are better website performance indicators than the sample mean. For example, suppose product A loads in 0.5 seconds for all users, while product B loads instantly for 95% of users and in 10 seconds for the remaining 5%. Even though the average loading time is the same for the two products, the user experience with product A is much better: product A loads for every user in the blink of an eye, whereas product B takes seemingly forever to load for 5% of users, and that poor experience might cause those users to churn. Throughout this article, we focus on the application of percentile testing to performance metrics.
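
A tiny simulation makes this concrete (illustrative numbers matching the example above, not Wish data; the method argument requires NumPy 1.22+):

    import numpy as np

    product_a = np.full(10_000, 0.5)                                   # every page loads in 0.5 s
    product_b = np.concatenate([np.zeros(9_500), np.full(500, 10.0)])  # 95% instant, 5% take 10 s

    print(product_a.mean(), product_b.mean())             # both means are 0.5 s
    print(np.percentile(product_a, 95, method="higher"),  # 0.5 s
          np.percentile(product_b, 95, method="higher"))  # 10.0 s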

Besides performance metrics, percentile testing unlocks many other applications whenever we care about the impact on the distribution of a metric. For example, percentile testing is also invaluable for transaction metrics. Wish seeks to provide a fair marketplace that grows the business for merchants of all sizes, so when we make a policy change, we aim to boost sales for merchants across the board. Thus, while A/B testing the average sales between the control and treatment buckets, we also compare percentiles like P10, P50, and P90 to ensure the policy does not boost platform sales by heavily favoring a handful of top merchants, which would put smaller merchants at a disadvantage.

Challenges in testing percentiles

A/B testing differences in percentiles of performance metrics is challenging for two reasons:

  • Challenge I: Randomization Unit. Most experimentation platforms randomize users, which guarantees independence between users. But performance metrics are measured at the response level, and response times from the same user are likely correlated, which violates the independence assumption of most statistical tests and may yield a high false-positive rate. This is often referred to as the discrepancy between the randomization unit and the analysis unit [1].
  • Challenge II: Scalability. The approach needs to be scalable enough to run hypothesis tests on billions of rows of input data within an acceptable time frame.

We first explored simple solutions

Can we set up a fixed threshold to detect performance changes?

Does it suffice to check whether the difference between the treatment and control groups is above a fixed threshold, such as a 100 ms absolute difference or a 10% relative change? For example, we would conclude that a product feature introduces latency if the P95 in the treatment bucket is 100 ms higher, or 10% higher, than that in the control bucket.

To see if a fixed threshold is suitable, we retrospectively conducted several A/A experiments and discovered that neither a fixed absolute change nor a fixed relative change works. Note that in those A/A experiments, any difference between the two buckets is not a true difference but the result of natural fluctuation.

We investigated the P95 difference between treatment and control for various APIs and found that a fixed threshold is not robust, for the following reasons:

  • One API can be two orders of magnitude slower than another. A 10 ms difference is undoubtedly a performance degradation for one API but may be natural fluctuation for another.
  • The variation of a sample percentile can be considerable for a less-used API, so a 10% difference in the sample percentile may come from natural variation alone. For a heavily used API, by contrast, a 10% increase in P95 may indicate substantial performance degradation.

In contrast to statistical tests, a fixed threshold does not take into account sample variance (i.e., the response times of some APIs are more variable than those of others) and therefore cannot guarantee a well-controlled Type I error rate (false-positive rate).

Can we apply the proportion Z test?

A two-sample proportion Z test allows us to check whether the difference between two proportions is significant. When there truly is a difference between the P95s of the two experiment buckets, there will also be a difference in the proportions of samples above the pooled P95 (the P95 of the data combined from control and treatment). For illustration, let the P95 latency in the treatment group, in the control group, and pooled be 300 ms, 100 ms, and 200 ms, respectively. Assuming both groups are equal in size, the treatment (control) group is likely to contain significantly more (fewer) data points above 200 ms. To construct a meaningful test using this property, we modify the proportion Z test as follows (a minimal sketch follows the list):

  • Compute the pooled P95.
  • Use the proportion Z test to compare, between treatment and control, the proportions of requests with response times higher than the pooled P95.
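
For illustration, the modified test could look like the sketch below (illustrative Python, not Wish's production code; the function and variable names are ours), using statsmodels' two-sample proportion Z test:

    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    def modified_proportion_ztest(control, treatment, q=95):
        """Compare the q-th percentiles of two buckets via a proportion Z test."""
        control, treatment = np.asarray(control), np.asarray(treatment)
        # Pooled percentile of the combined control + treatment data.
        pooled_p = np.percentile(np.concatenate([control, treatment]), q)
        # Count responses above the pooled percentile in each bucket.
        counts = np.array([(control > pooled_p).sum(), (treatment > pooled_p).sum()])
        nobs = np.array([control.size, treatment.size])
        _, p_value = proportions_ztest(counts, nobs)
        return p_value

Note that this sketch treats every request as an independent observation, which is exactly the assumption that breaks down when users, rather than requests, are randomized.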

This approach seems reasonable for comparing percentiles. Unfortunately, however, it does not address the discrepancy between the randomization unit and the analysis unit (Challenge I). We therefore simulated A/A tests in two different ways to study whether the modified proportion Z test is suitable for percentiles:

  • Retrospectively simulated A/A tests using data from Wish with two different randomizations: 1) randomize users (Randomization I); 2) randomize requests (Randomization II). Note that since these are simulated A/A tests, randomizing requests is possible.
  • Applied the modified proportion Z test to each simulated experiment.

If the test works properly, the p-values resulting from A/A tests should follow a uniform distribution between 0 and 1, and if we set the significance level at 5%, we should expect ~5% of A/A tests to have p-values smaller than 0.05.
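
The simulation can be sketched as follows, on a hypothetical requests table with user_id and response_ms columns, reusing modified_proportion_ztest from the sketch above:

    import numpy as np
    import pandas as pd

    def simulate_aa_pvalues(requests: pd.DataFrame, n_sims=1000, by_user=True, seed=0):
        """Run repeated A/A splits and return the p-values of the modified Z test."""
        rng = np.random.default_rng(seed)
        users = requests["user_id"].unique()
        pvals = []
        for _ in range(n_sims):
            if by_user:    # Randomization I: assign whole users to buckets
                treated = set(rng.choice(users, size=len(users) // 2, replace=False))
                in_treatment = requests["user_id"].isin(treated)
            else:          # Randomization II: assign individual requests to buckets
                in_treatment = rng.random(len(requests)) < 0.5
            pvals.append(modified_proportion_ztest(
                requests.loc[~in_treatment, "response_ms"],
                requests.loc[in_treatment, "response_ms"]))
        return np.array(pvals)

    # Fraction of A/A tests that reject at the 5% level; ~0.05 means Type I error is controlled.
    # false_positive_rate = (simulate_aa_pvalues(requests, by_user=True) < 0.05).mean()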

We discovered that when randomizing at the user level (Randomization I), more than 60% of A/A tests produced false positives. The false-positive rate is high because the randomization unit differs from the analysis unit. On the other hand, when randomizing at the request level (Randomization II, where the randomization unit matches the analysis unit), the p-value distribution is reasonably close to uniform and the Type I error rate is well controlled.

In conclusion, if we could randomize requests, this approach would be reasonable. However, in practice, our randomization is at the user level, which makes this approach invalid.

Is cluster bootstrapping applicable?

To resolve the discrepancy between the randomization unit and the analysis unit, we can apply cluster bootstrapping. However, bootstrapping is not scalable (Challenge II); its computation is prohibitively expensive. The complexity of bootstrapping is essentially O(nB), where n is the number of samples and B is the number of bootstrap iterations. Given the massive user base at Wish, it would take days to run the cluster bootstrapping for one experiment, and we often have hundreds of experiments running at the same time.

Although cluster bootstrapping is not scalable for Wish's data size (or that of most other tech companies), it is theoretically sound. We can use it as the gold standard against which to evaluate the standard error estimates of the sample percentiles.
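
For reference, a bare-bones cluster bootstrap of the sample P95's standard error might look like this (illustrative only; resampling whole users with replacement is what makes it a cluster bootstrap, and also what makes it expensive):

    import numpy as np
    import pandas as pd

    def cluster_bootstrap_se(requests: pd.DataFrame, q=95, n_boot=1000, seed=0):
        """Resample users (clusters), not requests, to estimate the SE of the sample percentile."""
        rng = np.random.default_rng(seed)
        by_user = {uid: grp["response_ms"].to_numpy()
                   for uid, grp in requests.groupby("user_id")}
        user_ids = np.array(list(by_user))
        estimates = []
        for _ in range(n_boot):
            sampled = rng.choice(user_ids, size=len(user_ids), replace=True)
            resampled = np.concatenate([by_user[u] for u in sampled])
            estimates.append(np.percentile(resampled, q))
        return np.std(estimates, ddof=1)

Every bootstrap iteration touches every request, which is why the O(nB) cost becomes prohibitive at Wish's scale.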

Wish’s solution

Our exploration of the approaches described above shows that a method for A/B testing percentiles must address the discrepancy between the randomization unit and the analysis unit (Challenge I) and be scalable (Challenge II).

A method published by LinkedIn in 2019 [2] provides a closed-form analytical solution for the standard error of the sample percentile and is therefore scalable. It also specifically addresses Challenge I. One of the steps in the LinkedIn method requires estimating the probability density at a percentile; for that step, we found Quora's approach easier to use in practice, and it yielded satisfactory results.

We combined the two approaches to estimate the standard error of the sample percentile. With the sample percentile and its standard error, we apply a Z test for the null hypothesis that there is no difference between the percentiles of the two experiment buckets (e.g., control and treatment).

Notations

Suppose we run an A/B test with two variants, where users in each variant get a different experience when viewing product pages. We aim to detect the difference in the P95 of product page load time between the two variants. Note that we use P95 as an example here; in practice we apply this approach to various percentiles. Focusing on one variant, suppose there are:

  • N users indexed by i = 1, 2, . . . , N;
  • User i views Pᵢ products, where Pᵢ’s are independent and identically distributed (i.i.d.) random variables;
  • Xᵢⱼ is the page load time of user i’s jᵗʰ page view. Xᵢⱼ’s from the same user are likely to be positively correlated because all page views from a user with a fast network and a fast device tend to be fast, and vice versa.
  • The sample P95 of {Xᵢⱼ, i = 1, 2, . . . , N; j = 1, 2, . . . , Pᵢ} is denoted Q̂.

Summary statistics

Our data pipeline needs to output the following summary statistics for conducting the hypothesis test:

  • number of users N
  • sample P95 Q̂
  • ∑ᵢ Jᵢ, where Jᵢ is the number of page load times lower than the sample P95 for user i
  • ∑ᵢ Pᵢ, where Pᵢ is the number of page views for user i
  • ∑ᵢ Jᵢ²
  • ∑ᵢ Pᵢ²
  • ∑ᵢ JᵢPᵢ
  • P(95-δ) and P(95+δ), where δ is a tunable window size. For example, with δ = 0.25, we need to calculate P94.75 and P95.25.

Note that all of these summary statistics are aggregated at the experiment bucket level, making the data pipeline much simpler. With regard to the tuning parameter, we found δ = 0.25 led to satisfactory results for percentiles like P50 and P95. It may require more careful tuning when it comes to extreme percentiles like P99 or P99.9.
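
As an illustration, the per-bucket aggregation could be sketched in pandas as follows (column names such as user_id and response_ms are hypothetical, and in production this runs as a distributed job rather than in pandas):

    import numpy as np
    import pandas as pd

    def summary_stats(bucket: pd.DataFrame, q=95.0, delta=0.25):
        """Aggregate one bucket of (user_id, response_ms) rows into the statistics needed for the test."""
        x = bucket["response_ms"].to_numpy()
        q_hat = np.percentile(x, q)   # sample P95 (production uses approximate percentiles; see Data pipeline)
        per_user = (
            bucket.assign(below=bucket["response_ms"] < q_hat)
                  .groupby("user_id")
                  .agg(J=("below", "sum"), P=("response_ms", "count"))
        )
        return {
            "N": len(per_user),       # number of users
            "q_hat": q_hat,
            "sum_J": per_user["J"].sum(),  "sum_J2": (per_user["J"] ** 2).sum(),
            "sum_P": per_user["P"].sum(),  "sum_P2": (per_user["P"] ** 2).sum(),
            "sum_JP": (per_user["J"] * per_user["P"]).sum(),
            "q_lo": np.percentile(x, q - delta),   # P(95 - δ)
            "q_hi": np.percentile(x, q + delta),   # P(95 + δ)
        }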

Hypothesis test

With the summary statistics, the variance of the sample P95 is estimated via the delta method (following the derivation in [2]) as

Var(Q̂) ≈ Var(p̂) / fₓ(Q̂)², where p̂ = ∑ᵢ Jᵢ / ∑ᵢ Pᵢ.

Note, we follow Quora’s approach to estimate the density fₓ(Q̂) in the denominator. The window (P(95-δ), P(95+δ)) contains 2δ percent of the data, so

fₓ(Q̂) ≈ (2δ/100) / (P(95+δ) − P(95-δ))

And the numerator, the variance of the ratio p̂, is estimated from the per-user sums as

Var(p̂) ≈ (1 / (N·P̄²)) · [S²(J) − 2·(J̄/P̄)·S(J, P) + (J̄/P̄)²·S²(P)],

where J̄ = ∑ᵢ Jᵢ / N, P̄ = ∑ᵢ Pᵢ / N, S²(J) = ∑ᵢ Jᵢ² / N − J̄², S²(P) = ∑ᵢ Pᵢ² / N − P̄², and S(J, P) = ∑ᵢ JᵢPᵢ / N − J̄·P̄.

With the variance estimate of the sample P95 in each bucket, we conduct a Z test to compute p-values.

If you are curious about why we can estimate the standard error of the sample P95 this way, please refer to equations 1–6 in the LinkedIn paper [2]. In essence, the percentile equals F⁻¹(percentage), where F is the cumulative distribution function (CDF) of Xᵢⱼ and the percentage is 0.95 in the case of P95. The estimated percentage is a function of ∑ᵢ Jᵢ and ∑ᵢ Pᵢ, and ((1/N)·∑ᵢ Jᵢ, (1/N)·∑ᵢ Pᵢ) approximately follows a bivariate normal distribution by the central limit theorem. Applying the delta method twice, we can derive the distribution of the sample percentile.
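
Putting the pieces together, the test reduces to a few lines once the summary statistics are available. The sketch below reflects our reading of the formulas above and consumes the output of the summary_stats sketch from the previous section; it is illustrative, not production code:

    import numpy as np
    from scipy import stats

    def percentile_var(s, delta=0.25):
        """Delta-method variance of the sample percentile from one bucket's summary statistics."""
        N = s["N"]
        J_bar, P_bar = s["sum_J"] / N, s["sum_P"] / N
        var_J = s["sum_J2"] / N - J_bar ** 2
        var_P = s["sum_P2"] / N - P_bar ** 2
        cov_JP = s["sum_JP"] / N - J_bar * P_bar
        r = J_bar / P_bar
        var_ratio = (var_J - 2 * r * cov_JP + r ** 2 * var_P) / (N * P_bar ** 2)
        # Quora-style density estimate: the window holds 2*delta percent of the data.
        density = (2 * delta / 100.0) / (s["q_hi"] - s["q_lo"])
        return var_ratio / density ** 2

    def percentile_ztest(s_control, s_treatment, delta=0.25):
        """Two-sided Z test for the difference in a percentile between two buckets."""
        diff = s_treatment["q_hat"] - s_control["q_hat"]
        se = np.sqrt(percentile_var(s_control, delta) + percentile_var(s_treatment, delta))
        return diff, 2 * stats.norm.sf(abs(diff / se))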

Data pipeline

The data pipeline for percentile testing is substantially different from that of typical A/B testing, which applies a t-test and requires only three summary statistics per bucket: the sample mean, the sample variance, and the sample size. In addition, computing exact percentiles for a large dataset is slow and resource-intensive, so we calculate approximate percentiles instead, which is an order of magnitude faster and yields satisfactory results. Lastly, since some of the summary statistics are decimal percentiles (e.g., P94.75 and P95.25), the minimum sample size needs to be larger than 10,000, which is far smaller than the typical sample size at Wish.
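
For the approximate percentiles, engines such as Spark expose built-in approximate percentile functions. A hypothetical PySpark sketch of pulling the decimal percentiles per bucket (the table and column names are ours, not Wish's schema):

    from pyspark.sql import functions as F

    # requests_df: hypothetical table with columns bucket, user_id, response_ms.
    # percentile_approx trades a little accuracy for a large speedup over exact
    # percentiles on billions of rows.
    decimal_percentiles = (
        requests_df
        .groupBy("bucket")
        .agg(F.expr("percentile_approx(response_ms, array(0.9475, 0.95, 0.9525))")
              .alias("p94_75_p95_p95_25"))
    )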

Results

The analytical solution yields accurate variance estimates

Using cluster bootstrapping as the gold standard, we compared the standard errors estimated by the analytical solution described above with those estimated by cluster bootstrapping. The standard errors estimated by the two methods are close across different client platforms. Cluster bootstrapping takes hours for a moderate sample size on a server with over 200 cores and 1.5 TB of memory, and it would take days to finish for all experiments at Wish. In contrast, the analytical solution’s computation time is negligible.

Type I error rate is well controlled in A/A tests

To evaluate the Type I error rate, we retrospectively constructed A/A tests for different percentiles using request data from various API calls. The Type I error rate (false-positive rate) is close to 5%, and the p-value distribution is close to uniform on [0, 1].

The analytical solution yields good statistical power

We also evaluated the power (true-positive rate). We took historical request data and randomly split it into 50% control and 50% treatment. For the treatment bucket, we multiplied the response times by 110% or 101%. Even with only a 1% increase in response time, the power to detect the difference is still decent.
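
This evaluation is straightforward to reproduce with the sketches above (illustrative code reusing summary_stats and percentile_ztest; the lift multiplier injects the artificial slowdown):

    import numpy as np

    def simulate_power(requests, lift=1.01, n_sims=200, alpha=0.05, seed=0):
        """Estimate power as the fraction of simulated experiments that detect an injected slowdown."""
        rng = np.random.default_rng(seed)
        users = requests["user_id"].unique()
        hits = 0
        for _ in range(n_sims):
            treated = set(rng.choice(users, size=len(users) // 2, replace=False))
            is_treat = requests["user_id"].isin(treated)
            treatment = requests.loc[is_treat].copy()
            treatment["response_ms"] = treatment["response_ms"] * lift   # inject the slowdown
            _, p = percentile_ztest(summary_stats(requests.loc[~is_treat]),
                                    summary_stats(treatment))
            if p < alpha:
                hits += 1
        return hits / n_sims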

Conclusion

Reliable statistical testing is a key component of a trustworthy experimentation platform, and there is no one-size-fits-all statistical test for different metrics and different types of business. We have shared our effort in developing the percentile test, which unlocks more potential applications of our experimentation platform. Percentile testing is now applied to all experiments at Wish to A/B test their latency impact on API response times. We set guardrail rules to identify experiments that introduce high latency and notify the experiment owners, preventing slow product features from being introduced.

Acknowledgments

Thanks to Bo Gao and Chao Qi for their contributions to this project. We are also grateful to Pai Liu, Shawn Song, Simla Ceyhan for their support, and Pavel Kochetkov, Lance Deng, Caroline Davey, Leah Levine for their feedback on drafts of this post.

Data scientists at Wish are passionate about building a trustworthy experimentation platform. If you are interested in solving challenging problems in this space, we are hiring for the data science team.

[1] A. Deng, J. Lu, and J. Litz, Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions (2017), WSDM: The Tenth International Conference on Web Search and Data Mining, Cambridge, UK.

[2] M. Liu , X. Sun, M. Varshney, and Y. Xu, Large-Scale Online Experimentation with Quantile Metrics (2019), arXiv preprint arXiv:1903.08762

