What Does an A/B Test Measure?

Hint: It’s Not What Most People Think

Elvis
Towards Data Science



An A/B test is pretty much the gold standard for measuring the impact of marketing spend, website features (e.g., language, visuals, fonts), and many other business processes in an objective fashion. It's as close to a "physics experiment" as you can get in business. It's also the source of much confusion and misunderstanding.

Before diving into what an A/B test actually measures, let's first discuss what an A/B test is. An A/B test is defined by two versions of something we want to test, version A and version B, sometimes also known as the test and the control, which are exposed to a random selection of participants. This is done by randomly assigning the pool of participants to the test or the control group. Once assigned, the participants in the test group are exposed to version A and those in the control group to version B. The response of the participants to each version is then measured to determine whether there is any difference between the two versions.
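
To make the mechanics concrete, here is a minimal sketch of the random assignment step in Python (my own illustration, not from the article); the pool of 10,000 visitor IDs and the 50/50 split are assumptions chosen just for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the split is reproducible

# Hypothetical pool of participants (e.g., visitor IDs)
participants = np.arange(10_000)

# Randomly shuffle, then split 50/50 into test (version A) and control (version B)
shuffled = rng.permutation(participants)
test_group = shuffled[: len(shuffled) // 2]
control_group = shuffled[len(shuffled) // 2 :]

print(len(test_group), len(control_group))  # 5000 5000
```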

The randomization of participants is critical for a proper A/B test because it guarantees that the demographics of the test and the control groups are statistically the same. For example, let's say you are a business that markets using television, radio, and online search, and you are evaluating the potential for remarketing ads (those display ads that follow you around the internet) to add incremental sales. You want to determine whether remarketing ads will add any incremental sales, because spending money on people who would already convert will only increase your cost per acquisition without adding any new sales. Unfortunately, it's impossible to find a population of people who do not watch television or listen to the radio, and never use online search when looking for something they want to buy. But if you randomly split visitors to your website into the test and control groups, you will end up with two populations that watch TV, listen to the radio, and search online in a statistically similar manner. This means that, all else being equal, the test and control groups will generate sales at (statistically) the same rate. If in one month the test group generates 100 sales, then the control group will generate almost the same number of sales (e.g., 99, 100, or 101). Any difference between the test and the control group will be due to random chance.
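
As a rough illustration of why this works (again my own sketch, with a made-up 2% underlying sale rate and a made-up population size), you can simulate a population, split it randomly, and see that the two halves generate sales at statistically similar counts.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: every visitor converts with the same underlying 2% rate,
# regardless of whether they watch TV, listen to the radio, or search online.
n_visitors = 200_000
true_rate = 0.02

# Random 50/50 split into test and control
assignment = rng.permutation(n_visitors)
test_idx = assignment[: n_visitors // 2]
control_idx = assignment[n_visitors // 2 :]

# Simulate purchases for everyone, then count sales in each group
purchases = rng.random(n_visitors) < true_rate
test_sales = purchases[test_idx].sum()
control_sales = purchases[control_idx].sum()

print(test_sales, control_sales)  # the two counts differ only by random chance
```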

Now when you perform an A/B test for remarketing ads, you introduce one crucial difference between the test and the control group: the test group is exposed to remarketing ads and the control group is not. This means that the only difference between the two populations is something you introduced: the exposure to remarketing ads. After a certain amount of time, you count how many sales the test group generated and compare that against the number of sales the control group generated. If the number of sales is sufficiently different, then you can be confident that the exposure to remarketing ads indeed changed the behavior of the test group relative to the control group. This, in a nutshell, is what an A/B test does.
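
The article doesn't prescribe a particular statistical test, but a common choice for deciding whether two sale counts are "sufficiently different" is a two-proportion z-test; the sketch below uses statsmodels, and the group sizes of 1,000 participants each are made up for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: sales and group sizes for test (remarketing ads) and control
sales = [200, 100]        # conversions in test, control
visitors = [1000, 1000]   # participants in test, control

# Two-sided test of the null hypothesis: "the sale rate is the same in both groups"
z_stat, p_value = proportions_ztest(count=sales, nobs=visitors, alternative="two-sided")

print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) says the difference is unlikely to be random chance;
# it does not, by itself, say which version is better.
```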

So what exactly does an A/B test measure? In my experience, most people will say an A/B test measures which version is better. That's incorrect. An A/B test measures whether the outcome depends on the exposure to version A or version B, not which version is better. The outcome could be that version A is better or that version A is worse. The A/B test is only measuring how likely it is that there is any difference between version A and version B, not which is better. It is a subtle but important difference (we will explain in a minute why it's important). Technically, a statistician would say that the A/B test determines "whether you should reject or fail to reject the null hypothesis." What is the null hypothesis? It is a fancy term for the following plain English statement: the outcome does not depend on exposure to version A or version B. If the A/B test result tells you to reject the null hypothesis, then it is telling you that the outcome did depend on which version you were exposed to.
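
One intuitive way to see what the null hypothesis says is a permutation test (my own sketch, not something the article mentions): if the outcome truly did not depend on exposure, the test/control labels would be interchangeable, so randomly shuffling the labels should often reproduce a difference as large as the observed one. The 200-versus-100 sales and the 1,000-person groups below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical outcome data: 1 = sale, 0 = no sale, for each participant
test = np.r_[np.ones(200), np.zeros(800)]      # 200 sales out of 1000
control = np.r_[np.ones(100), np.zeros(900)]   # 100 sales out of 1000

observed_diff = test.mean() - control.mean()
pooled = np.concatenate([test, control])

# Under the null hypothesis the labels are interchangeable: shuffle them many times
# and count how often random relabeling produces a difference at least as extreme.
n_shuffles = 10_000
count_extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    diff = shuffled[: len(test)].mean() - shuffled[len(test):].mean()
    if abs(diff) >= abs(observed_diff):
        count_extreme += 1

p_value = count_extreme / n_shuffles
print(f"permutation p-value approx. {p_value:.4f}")  # tiny value -> reject the null hypothesis
```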

But how do you determine if that outcome was good or bad? Well, in the case of the remarketing ad A/B test, you look at the number of sales. If the test group generated 200 sales and the control group generated 100 sales, then you know exposure to the remarketing ads was good for sales. But if the test group generated 50 sales and the control group generated 100 sales, then you know exposure to the remarketing ads was bad for sales. The A/B test doesn't tell you whether the outcome was good or bad, only that the difference between the test and the control is "real"; that being exposed to version A or version B did indeed impact the outcome. In this example, the A/B test might tell you to reject the null hypothesis if the test group gets 200 sales, and it might also tell you to reject the null hypothesis if the test group gets 50 sales. In both cases, the A/B test is simply telling you that the difference you see is "real" and not due to random chance. You must look at the outcome to determine whether it is a positive or negative thing (which depends on context).
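
Here is a small sketch of that point, reusing the hypothetical two-proportion z-test and group sizes from above: both the 200-sale and the 50-sale outcomes lead to rejecting the null hypothesis, and only comparing the two rates afterwards tells you the direction. The describe helper is hypothetical, written just for this illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

def describe(test_sales, control_sales, n_test=1000, n_control=1000, alpha=0.05):
    """Run the test, then look at the rates to determine direction (hypothetical helper)."""
    _, p_value = proportions_ztest([test_sales, control_sales], [n_test, n_control])
    if p_value >= alpha:
        return f"p={p_value:.3f}: fail to reject the null -- no evidence of any effect"
    direction = "better" if test_sales / n_test > control_sales / n_control else "worse"
    return f"p={p_value:.3f}: reject the null -- remarketing ads look {direction} for sales"

print(describe(200, 100))  # difference is "real", and the ads look better
print(describe(50, 100))   # difference is also "real", but the ads look worse
```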

Some may argue this is splitting hairs. Sure, maybe the A/B test doesn't directly tell you if version A is better or worse than version B, but it sure sounds like an immediate logical consequence of the test. This would be true, except that it is very common for business users to ask the following question: if the A/B test says that version A is better than version B with a 99% confidence level, what is the probability that version A is in fact worse than version B? Is it 1%? Definitely not! If you think about the null hypothesis (i.e., the outcome does not depend on exposure to version A or version B), then what the result is telling you is that out of 100 identical tests, 99 would demonstrate that exposure to version A leads to a different outcome than exposure to version B, but 1 out of 100 identical tests would suggest that there is in fact no difference between version A and version B. To use the remarketing example, you might end up with 99 tests where the test group generates about 200 sales (in each test), but in 1 out of the 100 tests you end up with the test group only generating 110 sales (versus 100 sales for the control group), which makes it look like the remarketing ads had very little, if any, influence on the outcome.
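
The exact split of verdicts across repeated tests depends on the sample size and the true effect, so the sample sizes, sale rates, and significance threshold below are all made up; the sketch only illustrates the underlying point that running the "same" experiment over and over does not always produce the same answer.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(seed=7)

# Hypothetical setup: 100 participants per group, true sale rates of 20% (test) and 10% (control).
# Repeating the "identical" experiment many times shows the verdict is not the same every run.
n_per_group, n_experiments = 100, 100
rejected = 0

for _ in range(n_experiments):
    test_sales = rng.binomial(n_per_group, 0.20)
    control_sales = rng.binomial(n_per_group, 0.10)
    _, p_value = proportions_ztest([test_sales, control_sales], [n_per_group, n_per_group])
    if p_value < 0.05:
        rejected += 1

print(f"{rejected} of {n_experiments} identical experiments rejected the null hypothesis")
```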

There is no real way of knowing "What are the chances version A is actually worse than version B?" without running the experiment and looking at the outcome. There are techniques one can use to get a feeling for how likely that might be, but it's not a direct outcome of an A/B test. One technique is estimating a confidence interval for the parameter, but that's done only after the A/B test is completed. For example, using the remarketing ad example again, let's say the test group generated 200 sales and the control group generated 100 sales. One could take the number of people in each group and calculate the "sale rate" by dividing the number of sales by the number of people in each group. If, for example, the sale rate for the test group was 20% and the sale rate for the control group was 10%, you might then calculate the 95% confidence interval for the sale rate as being between 15% and 25% for the test group and between 6% and 14% for the control group (these numbers are made up). One would then have a good feeling that it is in fact very unlikely the remarketing ads actually make things worse.
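
For the confidence-interval technique described here, one option (an assumption on my part, since the article doesn't name a library or method) is statsmodels' proportion_confint; the counts below mirror the made-up 20% and 10% sale rates with 1,000 participants per group.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical numbers mirroring the example: 1,000 participants per group,
# 200 sales in the test group (20% rate) and 100 in the control group (10% rate)
test_ci = proportion_confint(count=200, nobs=1000, alpha=0.05, method="wilson")
control_ci = proportion_confint(count=100, nobs=1000, alpha=0.05, method="wilson")

print(f"test sale rate 95% CI:    {test_ci[0]:.3f} to {test_ci[1]:.3f}")
print(f"control sale rate 95% CI: {control_ci[0]:.3f} to {control_ci[1]:.3f}")
# Non-overlapping intervals suggest it is very unlikely the ads actually made things worse.
```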

By the way, for those who are more technically minded, the formal answer to why the outcome of a statistical test doesn't directly tell you about the true value of the parameter is that a confidence interval is a subset of the parameter space, whereas the acceptance region of a statistical test is a subset of the sample space.
