Understanding the power of your A/B tests

A visual exploration of experimental design choices

Paul Stubley
Towards Data Science


Photo by @shelleypauls on Unsplash

When designing an experiment, a number of decisions need to be made ahead of time. One that I found difficult to visualise was power. The concept is straightforward — if an effect actually exists, how likely are we to find it? — but nailing down how different factors affect it was tricky for me initially. Below, I visualise a few of the factors that affect the power of a test, but first let’s recap a few things.

Let’s use the example of whether or not a person will buy an apple when they go into a greengrocer’s. Historically, let’s say 10% of customers do, which we’ll call p⁰ (usually we’d use subscripts here, but for Medium superscripts are easier). We want to know whether moving the apple-stand closer to the door (the treatment) will increase the proportion of people who buy an apple. After changing the location of the stand, we will call that proportion p¹ (irrespective of whether it changes from p⁰ or not).*

*For this to be a true A/B test, we would also need a control group. In this example, we could move the stand in half our chain’s stores (chosen at random) and not move it at the other half. We would then compare p_treatment with p_control, rather than comparing the proportions before and after the move.

So we set up Null (H⁰) and Alternative (H¹) hypotheses:

H⁰ : There will be no increase in the proportion of people buying apples (p¹≤p⁰).

H¹ : There will be an increase in the proportion of people buying apples (p¹>p⁰).

Now there are a few possible outcomes to this test — either the move causes an effect or it doesn’t, and, separately, either we conclude that it caused an effect or we don’t. The four possibilities are summarised below:

The 4 possible outcomes of the test

Before we run the test, we need to decide a few things:

  • How accepting can we be of false-positives?
  • If an effect is caused, how confident do we want to be that we’ll catch it?
  • What is the smallest effect we want to be able to detect?

The first bullet defines our acceptable Type-I error rate (see the table above), or 𝛼. The second is related to our Type-II error rate, or 𝜷. Specifically, the second point is describing the power of the test, given by 1-𝜷.

That power and Type-II errors are related in this way touches on a fundamental point: the probabilities in the table above sum to one across rows — that is to say, the truth (what actually happened) does not have a probability associated with it; either the treatment caused a change or it didn’t. The probabilities are associated with what we will conclude happened: if an increase did happen, we might conclude it didn’t (with probability 𝜷) or conclude that it did (with probability 1-𝜷).

So, back to it. We decide (ahead of time) an acceptable Type-I error rate, often chosen to be 𝛼=0.05 — i.e. “if there is no effect from our treatment, 5% of the time we will say there is an effect anyway” — and we decide a required power, often chosen to be 1-𝜷=0.8, i.e. 𝜷=0.2 — “if there is an effect from our treatment, we’ll be able to separate it from random fluctuations 80% of the time”. So we have our requirements; how can we adjust the test design to fit them?

Sample size

Imagine for a moment that H⁰ is indeed true: there is no effect of moving the apple-stand. On any given day a few more or fewer people could buy apples, so even though the true proportion is 10%, sampling that proportion on any given day will create a distribution around that 10% mean. The larger the sample, the more likely we are to be close to that 10% mark. We want to understand the difference in proportions ∆p = p¹-p⁰, so this distribution would be centred on zero if H⁰ is true.*

If H⁰ is true, 5% of the time (defined by 𝛼) we will measure a value of ∆p̂ > p_crit

*Remember that our key measurement during this test is a single value of ∆p̂, the difference in the sample proportions of treatment and control groups. We need to take that value and make a decision about whether H⁰ or H¹ is true. We do this by saying “If ∆p̂ is greater than some critical value, we’ll reject H⁰ in favour of H¹”. This critical value is set by 𝛼: under H⁰, the probability of measuring a value above it is exactly 𝛼. It is shown by the grey dashed line in the figures.
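As a minimal sketch of how that critical value falls out of 𝛼 (assuming a one-sided z-test on the difference of two sample proportions, and borrowing the sample size of roughly 3,000 per group that we arrive at below), the calculation looks something like this:

import numpy as np
from scipy import stats

p0 = 0.10      # historical proportion of customers buying an apple
n = 3000       # assumed sample size per group (illustrative; derived later)
alpha = 0.05   # acceptable Type-I error rate

# Under H0, delta_p_hat = p_treatment_hat - p_control_hat is approximately
# Normal(0, sigma0), where sigma0 is the standard error of the difference.
sigma0 = np.sqrt(2 * p0 * (1 - p0) / n)

# Critical value: the point the null distribution exceeds with probability alpha.
p_crit = stats.norm.ppf(1 - alpha) * sigma0
print(f"Reject H0 if the measured delta_p_hat exceeds {p_crit:.4f} (~{p_crit * 100:.2f} p.p.)")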

Now imagine H¹ is true: there is a positive effect of moving the apple-stand closer to the door, and this effect is a +2 p.p. increase, from 10%→12%. Once again, sampling the proportion on any given day will create a distribution around the 12% mean, and the bigger the sample, the more likely we are to be close to that mean. This ∆p distribution will be centred on 2%.

The alternative distribution is centred on ∆p=2%; however, the overlap with the null distribution is large, resulting in a high 𝜷 and low power.

The logic above hints at the first way we can increase an experiment’s power. By increasing the sample size, we decrease the variance of each of the distributions, which increases the proportion of the alternative distribution — the distribution we would be sampling from if the Alternative hypothesis were true — that lies above the False-Positive cutoff* (shown as a grey dashed line). Notice in the figure below that neither the means nor the Type-I error rate changes; the only change is an increasing sample size, which decreases the variance of the possible sampling distributions and so increases the power.

One point that baffled me previously is that we are only ever sampling from one of these distributions in any given test. Either H⁰ is true and we are sampling from the blue distribution, or H¹ is true and we are sampling from the orange — but not both. These figures show how likely it is that you can tell which distribution you are sampling from, based on your experiment design: the more the two distributions overlap, the harder it is to say which one you are sampling from for a given measured ∆p̂.

Changing only the sample size of the experiment adjusts the variance of the distributions

As the sample size increases in the figure above, we reach a power of 80% while keeping our Type-I error rate at 0.05. This defines the sample size for each of our treatment and control groups. Going back to our store example, if 100 people visit the stores in the treatment group each day, and 100 people visit the stores in the control group, we should plan to wait about a month before analysing the experiment (to allow a sample of at least 3,000 in each group).
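A quick way to sanity-check that figure is to solve the usual normal-approximation formula for the required sample size per group (a sketch, assuming a one-sided two-proportion z-test at 𝛼 = 0.05 and power 0.8; not the exact calculation behind the figures):

import numpy as np
from scipy import stats

p0, p1 = 0.10, 0.12        # baseline and treated proportions (+2 p.p. effect)
alpha, power = 0.05, 0.80  # one-sided Type-I error rate and required power

z_alpha = stats.norm.ppf(1 - alpha)  # ~1.645
z_beta = stats.norm.ppf(power)       # ~0.842

# Normal-approximation sample size per group:
# n = (z_alpha + z_beta)^2 * (p0(1-p0) + p1(1-p1)) / (p1 - p0)^2
var_sum = p0 * (1 - p0) + p1 * (1 - p1)
n = (z_alpha + z_beta) ** 2 * var_sum / (p1 - p0) ** 2
print(f"Required sample size per group: ~{int(np.ceil(n))}")  # ~3,000

At 100 visitors per group per day, that is roughly a month of data, matching the estimate above.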

Minimum detectable effect

Another way we could increase the power of our test is to accept that we’ll only be able to detect larger effects of the treatment. In our visual example this means ∆p increases (shifting the orange distribution to the right).

Increasing the minimum-detectable-effect

If we’re willing to believe our treatment will create a +3.5 p.p. effect, and that we are not interested in effects smaller than that, we can accept a smaller sample size and still get the necessary power.
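To make that concrete, here is a rough power calculation at a fixed, hypothetical sample of 1,100 customers per group, using the same normal-approximation sketch as above, comparing the +2 p.p. and +3.5 p.p. effects:

import numpy as np
from scipy import stats

p0 = 0.10
alpha = 0.05
n = 1100                       # hypothetical, much smaller sample per group
z_alpha = stats.norm.ppf(1 - alpha)

for effect in (0.02, 0.035):   # +2 p.p. vs +3.5 p.p.
    p1 = p0 + effect
    se = np.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / n)  # SE of delta_p_hat
    power = stats.norm.cdf(effect / se - z_alpha)
    print(f"Effect = {effect * 100:.1f} p.p. -> power ≈ {power:.0%} at n = {n} per group")

Under these assumptions the larger effect already clears 80% power at around a third of the original sample, while the +2 p.p. effect sits well below it.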

False-positive rate

A final way we can increase our power is to accept a higher Type-I error rate. Without changing either distribution, we can shift the critical value down, essentially increasing the chance of concluding there was an effect, but decreasing the confidence we would have in that conclusion.

Increasing the acceptable False-positive (Type-I error) rate, 𝛼
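As a rough illustration (the same normal-approximation sketch as before, with a hypothetical fixed sample of 1,500 customers per group and the +2 p.p. effect), relaxing 𝛼 buys power at the cost of more false positives:

import numpy as np
from scipy import stats

p0, p1 = 0.10, 0.12   # +2 p.p. effect
n = 1500              # hypothetical fixed sample per group
se = np.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / n)  # SE of delta_p_hat

for alpha in (0.05, 0.10, 0.20):
    z_alpha = stats.norm.ppf(1 - alpha)
    power = stats.norm.cdf((p1 - p0) / se - z_alpha)
    print(f"alpha = {alpha:.2f} -> power ≈ {power:.0%}")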

Whether or not we are willing to accept a higher Type-I error rate is a matter of context. It depends on whether there is an imbalance in the cost associated with making each type of error. A false-positive in our example would cause us to move the apples unnecessarily, perhaps not such a big deal. A false-positive in a medical drug trial, on the other hand, could cause an ineffectual drug to be used in hospitals. On the other side of the coin, a false-negative would cause an effective drug to be dismissed and not used as a future treatment. The acceptable levels of Type-I and Type-II error are therefore highly dependent on context.

As a final note, these three factors do not need to be adjusted in isolation. For example, we could require both 𝛼=0.05 and a power of 1-𝜷=0.8, and solve for ∆p and the sample size together. This traces out a curve of parameter combinations we could choose from for our design.

Solving for 𝛼=0.05 and 1-𝜷=0.8 allows multiple solutions of sample size and ∆p

Note the non-linear relationship between sample size and minimum detectable effect. As the effect we’re looking for gets smaller, the required sample grows rapidly, roughly with the inverse square of the effect size, so small effects are disproportionately costly to detect.
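The whole curve can be traced with the same sketch formula (again under the one-sided normal-approximation assumptions, not the exact calculation behind the figure); notice that the required sample grows roughly as 1/∆p²:

import numpy as np
from scipy import stats

p0 = 0.10
alpha, power = 0.05, 0.80
z_total = stats.norm.ppf(1 - alpha) + stats.norm.ppf(power)

for mde in (0.01, 0.02, 0.03, 0.04, 0.05):   # minimum detectable effect, +1 to +5 p.p.
    p1 = p0 + mde
    var_sum = p0 * (1 - p0) + p1 * (1 - p1)
    n = z_total ** 2 * var_sum / mde ** 2
    print(f"delta_p = {mde * 100:.0f} p.p. -> n per group ≈ {int(np.ceil(n)):,}")

Halving the minimum detectable effect roughly quadruples the sample you need, which is why chasing very small effects gets expensive quickly.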

So there we are. I hope a few of those figures have helped visualise how power changes with different choices of experiment parameters. They certainly helped me figure out what was going on. And if you haven’t run a power analysis before testing in the past, try it out next time you set up an experiment.

If you would like to contact me, you can do so on LinkedIn. If you’d like to see the code and analysis (including the derivations behind each distribution), check it out on GitHub. The animations were made using Matplotlib and imagemagick.

This discussion was inspired by, and led on from, the experimental design section of the Udacity Datascience Nanodegree.
