Top 5 mistakes with statistics in A/B testing

Georgi Georgiev
Towards Data Science
14 min read · Oct 1, 2019

A/B tests (a.k.a. online controlled experiments) are widely used to estimate the effect of proposed changes to websites and mobile apps and as tools for managing the risk associated with such changes. A number of software vendors are competing in this field with custom-built testing rigs used by conversion rate optimizers, landing page optimization specialists, growth experts, product managers, and data analysts.

However, despite the relative maturity of the field, there are several widespread mistakes with using A/B testing statistics. Their effect ranges from making tests less efficient and thus more costly, to rendering the results of an A/B test useless (or worse — completely misleading). I will go over 5 of the most frequently encountered high-impact issues, discuss why I believe they are prevalent, and offer some ways to address them.

Mistake #1: Not testing your A/B testing setup

Nowadays testing is a prerequisite for releasing even the tiniest of changes to a successful website or app, yet oftentimes the A/B testing setup itself escapes this scrutiny.

It shouldn’t!

It is a piece of software (several interconnected pieces, to be precise) and is prone to issues just like any other. If, for some reason, it is unreliable or not working as expected, its output will not be trustworthy. A central function of A/B tests, to serve as a last layer of defense against releasing business-damaging user experiences, would be partially or completely compromised if the A/B testing rig is not working to specification.

The Issue, in short, is that improper randomization, allocation of users between test groups, or tracking can make any A/B test result unreliable. This happens because the statistical models used to evaluate the data require a certain set of assumptions to be fulfilled in order to work. Common assumptions are independence of observations, homogeneity, and certain distributional assumptions (e.g. normality).

The Harm is that results from A/B tests with a mis-specified model would be inaccurate or misleading. Decisions made based on such tests will not only harm business results, but they will also acquire an aura of certainty around them (“we’ve tested it, it’s OK”) so that any issues detected by observational analysis later on are much less likely to be attributed to the tested change.

The Reason for the prevalence of this mistake is that few, if any, of the commercial tools offer built-in detection of violations of statistical assumptions. The situation is no different with many of the custom-built ones I’ve seen in my practice, barring some of the systems used by the likes of Google, Microsoft, and Amazon. Another reason could be that statistical tool outputs are often treated as unquestionable, due to a simple lack of awareness of the need to check statistical assumptions.

Some of the solutions include:

1.) A rigorous quality assurance process before test launch. It can help detect various tracking issues, technical issues that can bias the outcome (e.g. the variant being delivered 500 ms slower than the control simply due to the test delivery process), as well as plainly broken test experiences.

2.) Once the data is in, a goodness-of-fit test can be performed to check whether the actual allocation ratio conforms to the specified allocation ratio. For example, consider an A/B test planned with a 50/50 target allocation.

Is the outcome (variant better than control, p-value: 0.02) to be trusted? We can perform a Chi-square goodness-of-fit test to find out. A low p-value from such a test would suggest a poor fit between the expected and the actual allocation. This is what some call sample ratio mismatch (Crook et al. 2009 (pdf), Dmitriev et al. 2017 (pdf)).

The result for this particular test decidedly points to an issue: the goodness-of-fit test returns a tiny p-value (calculation).

Such a tiny p-value means it was extremely unlikely to observe the allocation we actually ended up with, had the planned 50/50 split been delivered correctly. This doesn’t tell us what the issue was, exactly. Such a sample ratio mismatch could have happened due to improper randomization, issues with the allocation procedure, tracking issues, or a combination of these. Further investigation would be needed to determine the culprit, but at least we know there is an issue and that the results are not to be trusted.
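For those who prefer to run the check themselves, here is a minimal sketch in Python using scipy; the observed counts are hypothetical and serve only to illustrate the mechanics:

```python
# Sample ratio mismatch (SRM) check via a Chi-square goodness-of-fit test.
# The observed counts below are hypothetical, chosen only for illustration.
from scipy.stats import chisquare

observed = [50_514, 49_486]            # users actually assigned to control / variant
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # planned 50/50 allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square statistic = {stat:.2f}, p-value = {p_value:.6f}")

# A very small p-value signals that the observed split is unlikely under the
# planned allocation, i.e. a probable sample ratio mismatch worth investigating.
if p_value < 0.001:
    print("Likely sample ratio mismatch: do not trust the test results as-is.")
```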

3.) Regular A/A tests can be performed to ensure that nominal and actual error rates (especially type I error rates) match. Statistical tests for normality and other assumptions can be performed on the A/A test data to ensure that these assumptions would hold for tests performed using this particular testing setup. Point 8.2 of the aforementioned paper by Crook et al. titled “Seven Pitfalls to Avoid when Running Controlled Experiments on the Web” is a good primer on A/A tests.
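If you want to go beyond a one-off A/A test, a simple simulation can check whether the nominal type I error rate is actually being delivered by your statistical procedure. Below is a sketch assuming a conversion-rate metric evaluated with a two-proportion z-test; the baseline rate, sample sizes, and number of simulations are all hypothetical:

```python
# A/A test simulation: with no true difference between the groups, a correctly
# working setup should flag "significance" at roughly the nominal rate (alpha).
# Baseline rate, sample size, and simulation count are hypothetical.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
alpha, baseline_cr, n_per_group, n_simulations = 0.05, 0.03, 20_000, 2_000

false_positives = 0
for _ in range(n_simulations):
    conv_a = rng.binomial(n_per_group, baseline_cr)
    conv_b = rng.binomial(n_per_group, baseline_cr)  # same true rate: an A/A test
    _, p = proportions_ztest([conv_a, conv_b], [n_per_group, n_per_group])
    false_positives += p < alpha

print(f"observed type I error rate: {false_positives / n_simulations:.3f} "
      f"(nominal: {alpha})")
```

If the observed rate deviates substantially from the nominal one, something in the randomization, tracking, or statistical model deserves a closer look.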

Mistake #2: Measuring the wrong metric

I’ve dedicated quite a few pages of my book “Statistical Methods in Online A/B Testing” to explaining how the statistical and measurement apparatus should be adjusted to suit the business questions one is faced with, and not the other way around.

The reason is that in many cases I see the reverse happening: measuring the data one has, instead of the data one needs. Whatever measurement is taken in such a case is inadequate to answer the business question being posed. However, it is often presented as if it is exactly what was needed.

The Issue is that oftentimes the metric being used cannot answer the business question being put to the test. Conversion rate is not always the correct metric. It should certainly not be the only metric you use when there is variance in the business value attached to conversions.

The Reason, I believe, is partially habit, partially ease of use and convenience. Most A/B testing tools support statistical models for rates of various kinds, as these are simple to work with. Most still don’t support, or have only limited support for, continuous metrics such as the difference in average revenue per user (ARPU) or average order value (AOV), even though these are essential for answering typical business questions related to e-commerce. Reporting tools like Google Analytics also lack proper support for calculating statistics for most continuous metrics.

Another reason could be the larger sample size required plus the additional overhead of needing to estimate the standard deviation from historical data in order to calculate the required sample size / statistical power; this step is much simpler for metrics based on binomial variables such as difference in conversion rate.

The Harm is easy to demonstrate with a simple example from e-commerce. If the business goal of the tested variant is to increase the revenue of the store, this is usually translated to an increase in average revenue per user. If the store makes more per person brought in, this will result in an increase in revenue, all else being equal.

Average revenue per user has two components to it — the conversion rate and the average order value (ARPU = CR x AOV). Measuring just conversion rate and neglecting average order value can easily lead to implementing a variant which, while increasing conversion rate, actually harms the business due to a lower AOV which ends up decreasing ARPU. The test then would have exactly the opposite of the intended effect, due to a poor choice of metric.
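A toy calculation (with made-up numbers) makes the decomposition concrete:

```python
# Toy illustration with hypothetical numbers: a variant that lifts conversion
# rate can still lower ARPU if it pulls average order value down enough.
def arpu(conversion_rate: float, avg_order_value: float) -> float:
    return conversion_rate * avg_order_value

control = arpu(conversion_rate=0.040, avg_order_value=100.0)  # ARPU = 4.00
variant = arpu(conversion_rate=0.044, avg_order_value=85.0)   # ARPU = 3.74

print(f"control ARPU: {control:.2f}, variant ARPU: {variant:.2f}")
# CR is up 10% relative, yet ARPU is down roughly 6.5%: judging by CR alone
# would lead to implementing a change that hurts revenue.
```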

The Solution for the above example is to base the test on ARPU. It is an easy fix on paper, however it may not be so easy in practice, as it may require switching to a software vendor that supports reporting and statistical calculations based on ARPU*.

As noted above, the expected test duration will inevitably increase when switching from CR to ARPU, due to the increased variance of ARPU versus CR, which is just one of its components. However, this is often a fair price to pay to have your business question answered properly.

* In my role as a software vendor I understand why the industry as a whole is slow to offer proper support for continuous metrics: the product becomes more complicated for end users, continuous data is more expensive to store, plus the demand is not quite there yet. However, offering support for such metrics in some of our basic statistical calculators at www.analytics-toolkit.com has proven useful and we are working to extend that to our advanced statistical tools. I encourage others to follow our lead.

Mistake #3: Using two-tailed tests

If you haven’t heard about two-tailed versus one-tailed tests, or two-sided versus one-sided hypotheses, you are not alone. Unless you’ve dug deep into stats you might not have faced this question at all, as most of the time the type of test used is not immediately visible and may only be mentioned in a technical reference or manual. This is likely even less obvious in a custom-built platform.

Space does not permit me to get into the details surrounding the question of one- vs. two-tailed tests here, but if you want to take a deeper dive: I’ve debunked some common myths surrounding the issue and stated the case for one-tailed tests briefly here, in more detail in my book, and in a series of articles at project OneSided.

If you don’t want to get stuck in the details, it suffices to understand that it is all about aligning the business question at hand with the statistical hypothesis being tested. Usually we want to control the risk of implementing an inferior solution, or sometimes the risk of implementing something which is no better than what we already have.

Therefore, most business questions we face in online A/B testing are of the form “we want some guarantee that experience B is better than A as measured by metric X”. These translate properly into a one-sided hypothesis and thus a one-tailed test. We rarely, if ever, pose the question as “we want some guarantee that B is not the same as A, in whichever direction”, which would indeed require a two-tailed test. Even more rarely are we prepared to pay the price of getting the same level of precision for such an answer, as opposed to the previous one.

The Issue is, therefore, that the statistical model of a two-tailed test is not aligned with the business question at hand.

The Reason for its prevalence is that many statistical tools perform two-tailed calculations either exclusively or by default. The discussion would get prohibitively deep if we went into the reasons why that is the current state of affairs, but as a result of it, many end users remain blissfully unaware of this issue.

The Harm from using two-tailed tests is that the reported uncertainty measurement (p-value, confidence level, etc.) significantly overestimates the actual uncertainty, usually by about 100%: if the tool reports an error probability of 0.1 or 10%, the error probability relevant to the one-sided question is in fact 0.05 or 5%. This happens, again, due to the misalignment between the question posed and the tested statistical hypothesis.

Furthermore, a two-tailed test requires between 20% and 60% larger sample size (more users, thus longer A/B test duration) compared to a one-tailed test with the same parameters other than the different hypothesis.

The Solution here depends on your particular situation. If your statistical tools already default to one-tailed calculations, then simply continue using them. If they support both two-tailed and one-tailed calculations, switch to the one-tailed (one-sided) option. If they only support two-tailed calculations, there is still a solution: as long as the error distribution of the parameter of interest is symmetrical, all you need to do is halve the uncertainty estimate.

So, if it is a p-value, and the reported two-sided p-value is 0.1, simply divide by 2 to get the 0.05 one-sided p-value. If it is reported in terms of a confidence level, say 90%, then simply divide 100% - 90% by two and add it to the reported level. So 90% two-sided becomes 90% + (100% - 90%)/2 = 90% + 5% = 95%. If the error distribution is asymmetrical, then I presume you’re already working with tools advanced enough to calculate the one-tailed value on request.
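For symmetrical error distributions, the two conversions above amount to the following (a minimal sketch, not tied to any particular vendor’s output):

```python
# Converting two-tailed outputs to their one-tailed equivalents, assuming the
# error distribution of the estimate is symmetrical (e.g. normal approximation)
# and the observed effect is in the hypothesized direction.
def one_sided_p_value(two_sided_p: float) -> float:
    return two_sided_p / 2

def one_sided_confidence(two_sided_confidence_pct: float) -> float:
    return two_sided_confidence_pct + (100 - two_sided_confidence_pct) / 2

print(one_sided_p_value(0.10))       # 0.05
print(one_sided_confidence(90.0))    # 95.0
```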

If you are unsure which kind of calculation is used, read the technical manual or reference, or contact your vendor’s support staff. If it is a custom built solution, consult its manual or work with its developers to determine what type of calculation is employed.

Mistake #4: Overestimating the sample size for A/B/n tests

A/B/n tests (tests where the control is tested against more than one variant) are another area where a lot of mistakes are made, both in terms of calculating statistical estimates and in terms of estimating the sample size required to achieve a target power for a specified significance threshold and minimum effect of interest. Here I will focus on the latter issue.

The Issue is that often the sample size per test group will be calculated for a simple A/B test, and then multiplied by the number of tested variants plus one, to get the sample size for the A/B/n test.

Consider this scenario: an A/B/n test with 4 variants vs. a control (A/B/C/D/E). A naive calculation would start by estimating that 24,903 users would be required per group for a simple A/B test. Therefore, for an A/B/C/D/E test the total number of users is simply 5 x 24,903 = 124,515.

However, in reality, only 95,880 users (total) are needed for the A/B/C/D/E test with these parameters, or 23% less than the naive calculation.
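To make the naive calculation concrete, here is a sketch of it in Python. The baseline rate, minimum effect of interest, and error rates are hypothetical, so the resulting numbers will differ from those above; the point is that multiplying the per-group A/B figure by the number of groups overshoots, whereas a Dunnett-based calculation would yield a noticeably smaller total.

```python
# The naive A/B/n sample size calculation described above, with hypothetical
# inputs (baseline conversion rate, minimum effect of interest, alpha, power).
# A calculation based on Dunnett's test would give a noticeably smaller total.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.03      # hypothetical 3% baseline conversion rate
mde_relative = 0.10     # hypothetical minimum effect of interest: +10% relative
effect = proportion_effectsize(baseline_cr * (1 + mde_relative), baseline_cr)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="larger"
)

n_groups = 5            # control + 4 variants (A/B/C/D/E)
print(f"per-group sample size for a simple A/B test: {n_per_group:,.0f}")
print(f"naive A/B/C/D/E total: {n_groups * n_per_group:,.0f}  (an overestimate)")
```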

The Reason this happens is that most free sample size calculators, and perhaps some paid ones, only support calculations for one variant versus a control. Even worse — some claim to support calculations for A/B/n tests, but in fact only perform the naive calculation shown above. It is therefore quite easy to fall prey to this mistake.

The Harm is fairly obvious — tests are planned for much longer than needed due to the overestimation in the sample size calculations. Business losses are incurred both due to missed opportunities to implement winners early, and due to missed sales because of running tests with inferior variants for longer than necessary.

The Solution is to use proper sample size calculations. Sample size calculations based on Dunnett’s test result in optimal tests, since they take into account the dependent nature of the comparisons A vs. B, A vs. C, A vs. D, etc. Calculations based on the Holm-Bonferroni method are acceptable, although they result in slightly sub-optimal tests.

A paid solution tailored to online A/B testing is provided by the sample size calculator at Analytics-Toolkit.com (free trial available). There is also a completely free sample size calculator that uses Dunnett’s, though learning to navigate its interface can be more challenging. If you prefer using R, then check out the DunnettTests package.

Mistake #5: Incorrect use of confidence intervals

I’ve seen a number of mistakes in using confidence intervals when evaluating and communicating the results of online A/B tests, but three stick out in particular.

The first one is related to the issue discussed above — one-tailed versus two-tailed tests. Except here it translates into the issue of using one-sided versus two-sided intervals.

I argue that most of the time one is interested in a bound below which parameter values can be ruled out with some confidence level XX%. This is also how intervals are usually interpreted: say, upon observing a two-sided 95% confidence interval spanning from 2% to 5%, one would claim that values below 2% can be ruled out with 95% confidence.

However, this is incorrect! In fact, in this particular example we can rule out values below 3% with 95% confidence. This follows from calculating the correct one-sided interval, which spans from 3% to plus infinity. If we were to construct this interval many times over, 95% of the time its lower limit would lie below the true value of the parameter we are trying to estimate (e.g. the difference in conversion rates).

So, if you are looking to limit values on both sides, use a two-sided interval, but when you claim values outside the interval in one particular direction can be excluded with a given confidence, you need to construct a one-sided interval, not a two-sided one. If your software doesn’t support one-sided intervals, for some reason, simply construct an interval with twice the uncertainty, e.g. the lower bound of a 90% two-sided CI is exactly the lower bound of a 95% one-sided CI (for parameters with symmetrical error distribution).
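Here is a minimal sketch of the relationship between the two kinds of bounds for a difference in proportions, under a normal approximation and with hypothetical counts:

```python
# One-sided vs. two-sided confidence bounds for a difference in proportions,
# using a normal approximation. Counts are hypothetical, for illustration only.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 2_000, 20_000    # control conversions / users
conv_b, n_b = 2_180, 20_000    # variant conversions / users
p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# The 95% one-sided lower bound uses z(0.95) and coincides with the lower bound
# of a 90% two-sided interval; the 95% two-sided bound uses z(0.975) and thus
# actually carries 97.5% one-sided confidence.
lower_95_one_sided = diff - norm.ppf(0.95) * se
lower_95_two_sided = diff - norm.ppf(0.975) * se

print(f"95% one-sided CI: ({lower_95_one_sided:.4f}, +inf)")
print(f"95% two-sided lower bound: {lower_95_two_sided:.4f}")
```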

The second mistake with confidence intervals I see an awful lot in A/B testing, including from major vendors, is the reporting and use of interval estimates for each test group when in fact the parameter of interest is the difference between the group means or proportions.

Very often one would see a graph with a separate confidence interval for each test group, or a “fancier” distribution plot to the same effect (the distributions can be on the same x-axis and overlapping, it’s the same thing). The logic following from such a presentation is to say that since there is some overlap between the interval for the Variant and that of the Control, we cannot conclude the Variant is better than the Control with the desired confidence level.

This logic is flat out incorrect, as in the example above the confidence interval for relative difference in proportions will often exclude zero, meaning that the difference can be accepted as non-zero with the specified confidence level. Using intervals this way is sure to lead to overestimation of the uncertainty of the A/B test result.
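A small worked example (hypothetical counts, normal-approximation intervals) shows how per-group intervals can overlap while the interval for the difference excludes zero:

```python
# Overlapping per-group intervals do not imply a non-significant difference.
# Hypothetical counts; normal-approximation (Wald) intervals throughout. The
# absolute difference is shown; the same logic applies to the relative difference.
from math import sqrt
from scipy.stats import norm

z = norm.ppf(0.975)              # 95% two-sided critical value
conv_a, n_a = 2_000, 20_000      # control
conv_b, n_b = 2_140, 20_000      # variant
p_a, p_b = conv_a / n_a, conv_b / n_b

# Per-group 95% intervals (the kind often plotted side by side)
ci_a = (p_a - z * sqrt(p_a * (1 - p_a) / n_a), p_a + z * sqrt(p_a * (1 - p_a) / n_a))
ci_b = (p_b - z * sqrt(p_b * (1 - p_b) / n_b), p_b + z * sqrt(p_b * (1 - p_b) / n_b))

# 95% interval for the difference in proportions: the estimate that matters
se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_diff = (p_b - p_a - z * se_diff, p_b - p_a + z * se_diff)

print("control:", ci_a, "variant:", ci_b)   # these overlap...
print("difference:", ci_diff)               # ...while this excludes zero
```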

The third mistake I see committed almost universally is the naive transformation of confidence interval bounds calculated for the absolute difference into bounds for the relative difference. For example, take a 95% CI for the absolute difference in conversion rate spanning from 0.0015 to plus infinity.

With a 10% baseline conversion rate, the lower bound would then be converted to 1.5% (0.0015 / 0.10 = 0.015) and the claim would be made that a difference of less than 1.5% can be ruled out with 95% confidence.

This can be demonstrated to be incorrect by calculating the correct confidence interval for relative difference in proportions (Kohavi et al. 2009, pp.154–155 (pdf)). In this example it spans from -2.9% to plus infinity, meaning that differences above -2.9% cannot be ruled out at the 95% confidence level. In this particular case constructing the proper interval also reverses the outcome of the test: the variant can no longer be accepted as superior to the control at the 95% confidence level.
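To illustrate the gap, here is a sketch comparing the naive conversion with a delta-method interval for the relative difference, one common way to construct such an interval. The counts below are hypothetical and will not reproduce the numbers above; the size of the gap depends mostly on how precisely the baseline is estimated.

```python
# A delta-method approximation for a one-sided lower bound on the relative
# (percent) difference in conversion rate, next to the naive conversion of the
# absolute bound. Counts are hypothetical and chosen only for illustration.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 200, 2_000       # control (deliberately the smaller sample)
conv_b, n_b = 2_140, 20_000    # variant
p_a, p_b = conv_a / n_a, conv_b / n_b
var_a = p_a * (1 - p_a) / n_a
var_b = p_b * (1 - p_b) / n_b
z = norm.ppf(0.95)             # 95% one-sided critical value

# Naive: take the absolute-difference bound and divide by the baseline rate
naive_lower_pct = (p_b - p_a - z * sqrt(var_a + var_b)) / p_a * 100

# Delta method: the variance of the ratio p_b / p_a accounts for the fact that
# the baseline itself is estimated with error
ratio = p_b / p_a
se_ratio = ratio * sqrt(var_a / p_a**2 + var_b / p_b**2)
proper_lower_pct = (ratio - 1 - z * se_ratio) * 100

print(f"naive lower bound:        {naive_lower_pct:+.2f}%")
print(f"delta-method lower bound: {proper_lower_pct:+.2f}%")
```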

If you are interested in a more detailed discussion on the difference between intervals for absolute difference and those for relative difference and the extent of the error which can be expected, all accompanied by extensive simulation results, please check out this article on confidence intervals for percent change.

I hope you’ll find this fairly detailed overview of, and set of solutions for, the 5 most common mistakes in using statistical methods in online A/B testing useful and practical. These are costly mistakes I see in my daily practice both as a consultant and as a tool vendor. I really think we should see far fewer instances of these, especially given how easy most of them are to avoid.


Applied statistician and optimizer by calling. Author of “Statistical Methods in Online A/B Testing”. Founder of Analytics-Toolkit.com and GIGAcalculator.com.