The third Ghost of Experimentation: Multiple comparisons

Skyscanner Engineering
Towards Data Science
May 29, 2018


By Lizzie Eardley, with Tom Oliver and Colin McFarland

šŸ‘» This post is part of a series exploring common misconceptions amongst experimentation practitioners that lead to Chasing Statistical Ghosts.

When running an experiment we usually want to learn as much as we possibly can, so it is tempting to look at many different things to gain the most insight. Unfortunately, this comes at a cost: the more comparisons you make, the higher your chance of seeing a false positive.

At Skyscanner we use an experimentation scorecard, which displays the results of a default set of business-level metrics for each experiment. We've found there are many benefits of using a scorecard, such as:

  • Ensuring no high-importance impacts are missed
  • Agreeing as an organisation what we're optimising for, and focusing on the metrics that matter
  • Delivering trustworthy results and standardised analytics at large scale

However, using this kind of scorecard without being aware of the multiple comparison problem could put us in danger of chasing statistical ghosts.

We can never be entirely certain of an outcome, since there is an element of randomness and noise in any experiment: there is always some chance that a result will appear significant even when no true effect exists. How often this happens is called the 'false positive rate'.

At Skyscanner we use an industry-standard 95% confidence level in our experimentation, and hence expect a 5% false positive rate on a single metric. However, as soon as you look at more than one metric, you increase your chances of seeing a false positive. To put it simply, every comparison you make has a chance of showing a false positive, so more comparisons means more false positives.

For an experiment which has no real impact (e.g. an AA test), the probability that it nonetheless appears to have significantly impacted at least one metric grows with the number of metrics in the scorecard. At Skyscanner we have 14 metrics in our experimentation scorecard, which puts us at just above a 50% chance of one or more false positives on any one experiment.

You will also be affected by this inflated false positive rate if your experiment is testing multiple different treatments (aka a multi-variant or ABn test), or if you look at results for a single metric broken down into segments. For example, if you were to look at how an experiment changed a single metric in each of 20 different countries, you would find at least one significantly impacted country over 60% of the time, even when your experiment had no real effect whatsoever. This can be quite dangerous if you are not aware of the inflated likelihood and don't compensate for the ghost.
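If the comparisons were independent, a few lines of Python would reproduce these numbers. This is a simplified sketch (real metrics are rarely perfectly independent):

```python
# Sketch: how the chance of at least one false positive grows with the number
# of comparisons, assuming each comparison is independent and tested at 5%.
alpha = 0.05

def prob_at_least_one_false_positive(n_comparisons, alpha=0.05):
    """P(at least one comparison looks significant) when no true effect exists."""
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 5, 14, 20):
    print(f"{n:2d} comparisons -> {prob_at_least_one_false_positive(n, alpha):.1%}")

# Prints roughly: 5.0%, 22.6%, 51.2% (our 14-metric scorecard) and 64.2%.
```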

Multiple comparison corrections

Luckily there are a number of statistical approaches which can be used to mitigate the multiple testing problem and avoid chasing statistical ghosts. These often involve adjusting the significance threshold, so that the p-value required for an effect to be classed as significant depends on the number of comparisons your test is making (remember the first ghost of this series discussed how to interpret p-values).

One of the most commonly used approaches is the Bonferroni correction, which is a simple scaling of the significance threshold, α, by the number of hypotheses being tested, n, such that

α → α / n .

For example, if you wish to keep a 95% confidence level when running an AB test and looking at 5 different metrics, you could adjust the p-value threshold from α = 0.05 to

α → α / 5 = 0.05 / 5 = 0.01 .

You might use a similar approach when running a multi-variant test with 5 different treatments and looking only at a single metric, since both cases involve 5 comparisons.
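As a minimal sketch of how the adjusted threshold is applied (the five p-values below are invented purely for illustration, not real experiment results):

```python
# Sketch: applying a Bonferroni-adjusted threshold. The five p-values are
# invented for illustration; they could be five metrics in one AB test or
# five treatments in an ABn test measured on a single metric.
alpha = 0.05
p_values = [0.004, 0.020, 0.030, 0.250, 0.700]

adjusted_alpha = alpha / len(p_values)  # 0.05 / 5 = 0.01

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"comparison {i}: p = {p:.3f} -> {verdict} (threshold {adjusted_alpha})")
```

Note that the second comparison (p = 0.02) would have looked significant at the unadjusted 0.05 threshold, but does not clear the adjusted one.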

The Bonferroni correction controls the Family-Wise Error Rate (FWER), which is the probability that one or more of your metrics appears significant when there is in fact no true effect. In the above example of looking at 5 metrics for one AB test, a FWER of 5% means that, when there is no real change, you will see one (or more) of the 5 metrics as significant at most ~5% of the time. Without the Bonferroni correction this number would be around 23%.

Bonferroni is often considered a conservative correction, since it ensures the FWER is at most α. However, if there is any correlation or dependence between the test statistics then applying Bonferroni may be too stringent, leaving the actual FWER much lower than the intended value. Whilst a lower-than-intended error rate might sound like a good thing, the downside is that the experiment's power is reduced more than necessary, so the intended balance between false positives and false negatives is not struck.
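To illustrate this, here is a toy AA-test simulation with strongly correlated metrics. The metric count, correlation level and number of simulations are arbitrary choices (not our actual scorecard); the point is that the realised FWER under Bonferroni drops well below the nominal 5%:

```python
import numpy as np
from scipy.stats import norm

# Toy AA-test simulation: the realised FWER of the Bonferroni correction when
# the metrics' test statistics are correlated. The metric count, correlation
# and number of simulations are arbitrary illustrative choices.
rng = np.random.default_rng(42)
n_metrics, rho, n_sims, alpha = 14, 0.8, 20_000, 0.05

# Equi-correlated covariance matrix for the null (no true effect) z-statistics.
cov = np.full((n_metrics, n_metrics), rho)
np.fill_diagonal(cov, 1.0)

z = rng.multivariate_normal(np.zeros(n_metrics), cov, size=n_sims)
p_values = 2 * norm.sf(np.abs(z))  # two-sided p-values under the null

uncorrected = (p_values < alpha).any(axis=1).mean()
bonferroni = (p_values < alpha / n_metrics).any(axis=1).mean()

print(f"FWER with no correction: {uncorrected:.1%}")  # well above the nominal 5%
print(f"FWER with Bonferroni:    {bonferroni:.1%}")   # noticeably below 5%
```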

How we approach the problem at Skyscanner

Apply corrections

At Skyscanner, a number of the metrics in our scorecard are expected to be highly correlated. For example, the proportion of travellers who book will be correlated with the proportion of travellers who exit to an airline partner, which itself will be correlated with the proportion of travellers who conduct a search. For this reason, instead of the Bonferroni correction discussed above, we primarily use a different, less stringent correction called the Benjamini-Hochberg procedure.

Benjamini-Hochberg differs from Bonferroni in that it limits the False Discovery Rate (FDR) rather than the FWER. If we think of each measurement that is deemed statistically significant as a 'discovery', then the FDR is the fraction of these discoveries which are false (just noise, not a real impact). The subtle difference is that the FWER controls the fraction of experiments with one or more false discoveries, whilst the FDR controls the fraction of all discoveries which are false.
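For reference, here is a minimal sketch of the Benjamini-Hochberg step-up procedure itself. The p-values are invented, and in practice you would typically reach for a library routine (e.g. statsmodels' multipletests with method="fdr_bh"):

```python
# Sketch of the Benjamini-Hochberg step-up procedure. The p-values are
# invented; a library routine such as
# statsmodels.stats.multitest.multipletests(p_values, alpha=0.05, method="fdr_bh")
# does the same job in practice.
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the metric is declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p up
    # Find the largest rank k with p_(k) <= (k / m) * alpha ...
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # ... and declare every hypothesis ranked at or below k significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            significant[idx] = True
    return significant

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.060, 0.320, 0.900]))
# -> [True, True, False, False, False, False, False]
```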

Benjamini-Hochberg works for us at Skyscanner as a way to limit the large number of false positives without suffering as large a reduction in power as we would with a Bonferroni correction. However, the "best" solution is rather subjective and will depend on the relative costs of false positives (from a correction that is not strict enough) and false negatives (from the loss in power).

So if, like us, you wish to measure impacts on many different metrics then a p-value correction can be a useful way to avoid an inflated false positive rate whilst still learning from a variety of metrics.

Pre-register the primary metric of interest

We believe an experiment should be based on a clear and testable hypothesis, that it should primarily target one metric in particular, and that this metric should be 'pre-registered' by deciding on it in advance of starting the experiment.

Hence, we require an experimenter to state their primary metric of interest up front; this metric is treated differently, and its significance will not depend on the Benjamini-Hochberg correction mentioned above. However, if the experiment is a multi-variant (ABn) test then we do use a correction to adjust the p-value threshold for the metric of interest. In this case Bonferroni has an advantage over Benjamini-Hochberg in that it allows for a simple power analysis.
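To sketch why this is convenient: with Bonferroni the adjusted threshold α/n simply slots into a standard sample-size calculation, such as the usual two-proportion approximation below. The baseline rate, detectable effect and treatment count are hypothetical, not Skyscanner figures:

```python
from scipy.stats import norm

# Sketch: sample size per variant for the primary (conversion-style) metric of
# an ABn test, with a Bonferroni-adjusted threshold. Baseline rate, detectable
# effect and treatment count are hypothetical, not Skyscanner figures.
alpha, power = 0.05, 0.80
n_treatments = 4                             # each treatment compared to control
adjusted_alpha = alpha / n_treatments

p_control = 0.10                             # hypothetical baseline conversion rate
p_treatment = 0.11                           # hypothetical minimum detectable effect

z_alpha = norm.ppf(1 - adjusted_alpha / 2)   # two-sided test
z_power = norm.ppf(power)

variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
n_per_group = (z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2

print(f"~{n_per_group:,.0f} users per variant at adjusted alpha = {adjusted_alpha}")
```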

It is still important to know if non-primary metrics were affected, but we consider this more of a backstop for unintended consequences, or a source of learnings to inspire future iterations and new hypotheses. If you do see an impact in an unexpected metric, then…

Re-run experiments with updated hypotheses

We encourage experimenters to conclude and evaluate experiments based on their pre-registered metric of interest. If an experiment shows a significant change to a metric which didn't form part of the original hypothesis, the safest course of action is to call ghost and re-run, this time with the unexpected change as the target metric. The same applies if you see that a particular subset or segment of users is significantly affected. Don't be tempted to start reasoning about how you could have caused these effects until you are convinced that they are actually real!

šŸ‘» More: Chasing Statistical Ghosts.


We are the engineers at Skyscanner, the company changing how the world travels. Visit skyscanner.net to see how we walk the talk!