
How to Successfully Run A/B Tests

Best practices from large tech companies that are aggressively experimenting with everything

Photo by Ousa Chea on Unsplash

All of the most successful tech companies are constantly experimenting and improving their offering to users. A/B tests are a widely used technique to determine the impact of manipulating a feature.

"Randomized A/B or A/B/N tests are considered the gold standard in many quantitative scientific fields for evaluating treatment effects." -Anirban Deb et al., Uber

While they play an important part in many data-driven decisions, A/B tests are hard to run successfully and harder to run at scale.

In this article, we will look at how some of the most successful tech companies design systems to run thousands of A/B tests correctly. We don’t need to reinvent the wheel to run these tests successfully; we can treat what large tech companies are doing with their experimentation platforms as best practices for scalable A/B testing.

If you are interested in learning more about a specific company’s approach to running A/B tests, each of the five companies discussed here (Netflix, Uber, Spotify, Pinterest, and Twitter) has written about its experimentation platform on its engineering blog.


What is an A/B test?

Before getting into the details, it’s important that we are all on the same page about what an A/B test is.

An A/B test is an experiment with a control group and one or more treatment (experimental) groups. The control group experiences the service exactly as before, while each treatment group experiences the service with a changed feature (such as a different color for a button on the homepage).

The goal of the experiment is to determine the impact that a change the treatment group experiences has relative to the control group. The impact needs to be a measurable metric such as time spent on a page, the number of views, or the number of interactions with a button.

Assignment to the treatment and control groups should make the groups as similar as possible, to reduce the risk of confounding factors in the observed treatment effect. This is typically done through randomized assignment.
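To make the randomization concrete, here is a minimal sketch of deterministic hash-based assignment, a common way to implement it: hashing a stable user ID together with the experiment name gives each user a stable, effectively random bucket without storing any state. The function and split below are illustrative, not any particular company’s implementation.

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing the user ID with the experiment name maps each user/experiment
    pair to a stable, effectively random point in [0, 1].
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_group("user-123", "homepage-button-color"))
```

Because the assignment depends only on the user ID and experiment name, the same user always sees the same variant, which keeps the experience consistent across sessions.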


Why A/B tests are important

We can understand the importance of A/B testing through its popularity and its results.

We begin by looking at what the five companies we are studying have to say about the importance of A/B tests in their decision-making process:

"In fact, every product change Netflix considers goes through a rigorous A/B testing process before becoming the default user experience." – Netflix Tech Blog

"There are over 1,000 experiments running on our platform at any given time." – Anirban Deb et al., Uber

"Today almost all product decisions are made with some input from one or more A/B tests." – Johan Rydberg, Spotify

"As a data-driven company, we rely heavily on experiments to guide products and features. At any given time, we have around 1,000 experiments running, and we’re adding more every day." – Shuo Xiang, Pinterest

"Experimentation is at the heart of Twitter’s product development cycle. This culture of experimentation is possible because Twitter invests heavily in tools, research, and training to ensure that feature teams can test and validate their ideas seamlessly and rigorously." -Dmitriy Ryaboy, Twitter

You get the idea. Successful companies are constantly running experiments to improve their platforms, and those experiments inform almost every decision they make.

A/B tests are a powerful tool to leverage. They let companies operating at massive scale test incremental changes without exposing them to all of their customers, and the results generalize reasonably well to the broader customer population.

"However, [changes] are too risky to roll out without extensive A/B testing, which enables us to prove that the new experience is preferred over the old." – Netflix Tech Blog

Netflix reports that A/B tests on the images associated with its titles have driven upwards of 20% to 30% more viewing for those titles.

With impacts of that size, it is clear why companies run so many experiments. Doing so is not only useful but necessary to provide the best possible experience to customers.


Building experimentation platforms that scale

We now understand that all successful businesses are aggressively experimenting before every change they make. How do they successfully run all of these tests at the scale of thousands of concurrent tests?


Essential features for these platforms

All of the experimentation platforms are similarly designed, with at least these three components (names taken from Spotify’s system):

  1. Remote configuration – lets an experimenter select a frontend or backend feature to test on
  2. Metrics catalog – a system that manages, stores, and serves experiment data to a UI or notebook with minimal latency (Spotify’s goal was less than a second)
  3. Experiment planner – lets employees run experiments through an easy-to-use UI

The goals of these platforms are also more or less the same across the five companies (goal names from Pinterest):

  1. Real-time config change: start up and shut down experiments in real time, without a code deploy for each change (see the sketch after this list)
  2. Lightweight process: prevent predictable errors while staying as easy as a normal feature launch
  3. Client-agnostic: users shouldn’t have to learn a new method to run experiments on each platform
  4. Analysis: an easy-to-use analytics dashboard
  5. Scalability: scale both the online serving path and the offline experiment data processing
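To make the real-time config goal concrete, services typically read experiment definitions from a remotely managed config store instead of hard-coding them, so an experiment can be started or killed by changing a config value. The sketch below is illustrative only; the config shape and the idea of polling a remote store are assumptions, not any company’s actual API.

```python
import json

# Hypothetical payload a client might poll from a remote config service.
REMOTE_CONFIG = json.loads("""
{
  "homepage-button-color": {"enabled": true,  "treatment_share": 0.10},
  "new-search-ranking":    {"enabled": false, "treatment_share": 0.50}
}
""")

def experiment_is_active(name: str, config: dict = REMOTE_CONFIG) -> bool:
    """An experiment is shut down by flipping 'enabled' remotely,
    with no code deploy on the client or service side."""
    return config.get(name, {}).get("enabled", False)

print(experiment_is_active("homepage-button-color"))  # True
print(experiment_is_active("new-search-ranking"))     # False
```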

All of the businesses we are analyzing have diverse portfolios of offerings and departments, so their platforms need to generalize across many business needs. Uber places a heavy emphasis on this in most of its data platforms.

"One of our team’s main goals is to deliver one-size-fits-most methodologies of hypothesis testing that can be applied to use cases across the company." – Anirban Deb et al., Uber


How to assign users to treatment and control groups correctly

One of the most important parts of these experiments is assigning users to treatment and control groups correctly. If the groups are not similar enough, the results are meaningless. It is impossible to tell if the observed treatment effect was caused by the experimental manipulation or the differences between groups (known as a confounding factor).

Uber highlights two key issues to watch for to limit the risk of these problems:

  1. Sample size imbalance: if the ratio of the control group’s size to the treatment group’s size is significantly different from the expected ratio, experimenters should check that their randomization mechanism was set up correctly.
  2. Flickers: if a user switched between the control and treatment groups during the experiment (for example, a user who owned an iPhone buys an Android phone), they should be removed from the analysis.

Uber’s platform identifies these issues automatically.
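A sample size imbalance (often called a sample ratio mismatch) can be checked automatically with a chi-squared goodness-of-fit test against the configured split: a very small p-value suggests the randomization is broken. This is a generic SciPy sketch with made-up counts, not Uber’s actual implementation.

```python
from scipy.stats import chisquare

control_n, treatment_n = 50_440, 49_270   # observed group sizes (illustrative)
expected_split = (0.5, 0.5)               # the split the experiment was configured with

total = control_n + treatment_n
expected = [total * p for p in expected_split]
stat, p_value = chisquare([control_n, treatment_n], f_exp=expected)

# A tiny p-value (e.g. < 0.001) means the observed split is very unlikely
# under the configured ratio -- check the randomization before trusting results.
print(f"chi2={stat:.1f}, p={p_value:.2g}")
```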

Spotify describes a few pre-test validity checks it performs to make sure the results can be interpreted:

  1. Sample mismatch: make sure the proportions of users actually observed in the treatment and control groups match the configured split.
  2. Pre-exposure activity: check for pre-experiment differences in how the groups interact with the app (a sketch follows this list).
  3. Increases in crashes: make sure the app keeps working as expected during the experiment.
  4. Property collisions: make sure analysts know when similar experiments are running at the same time, which could prevent an experiment from getting its expected exposure.
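One way to implement a pre-exposure activity check is to compare a key metric for the two groups over a window before the experiment started; if they already differ, the assignment is suspect. This is a minimal sketch with simulated data, assuming pre-period values can be pulled per user, and is not Spotify’s actual check.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Minutes of listening per user in the week *before* exposure (simulated).
pre_control = rng.normal(loc=30, scale=10, size=5_000)
pre_treatment = rng.normal(loc=30, scale=10, size=5_000)

stat, p_value = ttest_ind(pre_control, pre_treatment, equal_var=False)

# Groups were assigned randomly, so pre-experiment behaviour should not differ.
# A significant difference here points to a broken assignment, not a treatment effect.
print(f"pre-exposure t-test p-value: {p_value:.3f}")
```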

How to allocate users to experiments

As the number of tests increases at these companies, it is important to keep track of which users have been assigned to which experiments. Users should not be assigned to an experiment that conflicts with one they are already part of (for example, two experiments manipulating features on the same page).

"If someone ended up using the wrong [allocations], a whole slew of experiments would be impacted" – Johan Rydberg, Spotify

Spotify takes an interesting approach it calls "The Salt Machine." Spotify’s teams work autonomously at whatever pace fits them best, which complicates the assignment of users to experiments (while preserving the critical randomization of assignments). The salt machine reshuffles users across buckets (without stopping ongoing experiments) by hashing user IDs with a tree of "salts." Spotify’s engineering blog has a detailed explanation and visualization.
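A rough sketch of salt-based bucketing: hashing the user ID together with a salt maps each user to one of a fixed number of buckets, and changing the salt reshuffles everyone into fresh, independent buckets. This only illustrates the hashing idea; Spotify’s actual salt machine manages a whole tree of salts and many more details.

```python
import hashlib

def bucket(user_id: str, salt: str, n_buckets: int = 100) -> int:
    """Map a user to one of n_buckets; changing the salt reshuffles users."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Buckets under one salt are uncorrelated with buckets under another,
# so experiments keyed on different salts do not interfere with each other.
print(bucket("user-123", salt="experiments-q3"))
print(bucket("user-123", salt="experiments-q4"))
```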

Netflix gives more responsibility to users running the experiments with two options for allocation: batch allocation and real-time allocation.

Batch allocation lets analysts allocate users to tests using custom queries. This is the more flexible approach and helps analysts find the exact customers they are interested in experimenting on. Drawbacks include that not every allocated user is guaranteed to actually experience the test, and new users cannot be added to the experiment once it starts.

Real-time allocation lets analysts configure less flexible rules, based on how the user is interacting with the app in real time, to assign them to experiments. Because users are allocated while they interact with the app, this guarantees they experience the expected treatment. However, it is difficult to know when the desired number of members will be allocated, and the approach can add app latency (which Netflix mitigates by running the allocation computation in parallel while the app waits on other services).
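The contrast between the two modes can be sketched as follows: batch allocation runs a query over stored profiles up front, while real-time allocation decides at request time based on what the user is doing. The function names, predicates, and data below are hypothetical simplifications, not Netflix’s system.

```python
import random

profiles = [
    {"user_id": "u1", "country": "US", "tenure_days": 400},
    {"user_id": "u2", "country": "BR", "tenure_days": 12},
]

def allocate_batch(profiles, predicate, treatment_share=0.5):
    """Batch allocation: pick eligible users from stored profiles ahead of time.
    Users who never open the app still count, and new users are never added."""
    eligible = [p for p in profiles if predicate(p)]
    return [
        (p["user_id"], "treatment" if random.random() < treatment_share else "control")
        for p in eligible
    ]

def allocate_realtime(request, treatment_share=0.5):
    """Real-time allocation: decide when the user actually hits the feature,
    so every allocated user is guaranteed to be exposed to the treatment."""
    if request["page"] != "homepage":
        return None  # rule not matched; user stays out of the experiment
    return "treatment" if random.random() < treatment_share else "control"

print(allocate_batch(profiles, lambda p: p["tenure_days"] > 90))
print(allocate_realtime({"user_id": "u3", "page": "homepage"}))
```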


Metrics to track

What metrics to track can quickly get complicated as a company’s number of offerings grows. Many of these experimentation platforms therefore support flexible metric definitions so analysts can test exactly what they are interested in.

Twitter allows analysts to use three types of metrics (in increasing order of flexibility for the analyst):

  1. Built-in metrics: defined and owned by experimentation teams
  2. Experimenter-defined: configured metrics created using a lightweight DSL, specifying what "events" should be counted.
  3. Imported metrics: Experimenters create their own metrics and add them to the system.

Uber defines three types of metrics that they allow analysts to create:

  1. Continuous metrics: numeric values
  2. Proportion metrics: a binary indicator of whether or not a user performed an event
  3. Ratio metrics: two columns that allow analysts to compute a ratio (a numerator for the number of completed actions and a denominator for the total number of people who could have done the action).
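Using the three metric types above, a small pandas example shows how each might be computed from per-user event data. The column names and values are made up for illustration and simply follow the definitions in the list.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id":      ["u1", "u2", "u3", "u4"],
    "trips_taken":  [3, 0, 1, 5],
    "minutes_used": [42.0, 0.0, 10.5, 80.0],
})

# Continuous metric: a numeric value per user, e.g. minutes of usage.
continuous = events["minutes_used"]

# Proportion metric: a binary indicator of whether each user performed the event.
proportion = (events["trips_taken"] > 0).mean()   # share of users with >= 1 trip

# Ratio metric: completed actions divided by the users who could have acted.
ratio = events["trips_taken"].sum() / events["user_id"].nunique()

print(continuous.mean(), proportion, ratio)
```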

Pre-processing and validity checks

Before interpreting results, Uber does three preprocessing steps to improve the robustness of their analyses:

  1. Outlier detection: remove irregularities in the data using a clustering-based algorithm.
  2. Variance reduction: increase the statistical power of hypothesis testing (helpful for experiments with a small number of users). They use the CUPED method for this (sketched after this list).
  3. Pre-experiment bias: sometimes randomization does not produce well-balanced group assignments. To correct for this, they use [difference in differences](https://en.wikipedia.org/wiki/Difference_in_differences) to adjust for bias in these assignments.
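A compact sketch of the CUPED adjustment mentioned in item 2: each user’s metric is adjusted using their pre-experiment value of the same metric, which strips out variance unrelated to the treatment. The data is simulated; real implementations also handle missing pre-period data and many other details.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
pre = rng.normal(50, 15, n)               # pre-experiment metric per user
post = 0.8 * pre + rng.normal(10, 5, n)   # in-experiment metric, correlated with pre

# CUPED: subtract the part of the metric explained by pre-experiment behaviour.
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

print(f"variance before: {post.var():.1f}, after CUPED: {adjusted.var():.1f}")
```

The adjusted metric has the same expected treatment effect but a much smaller variance, which is why CUPED boosts statistical power for a given sample size.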

How to effectively interpret results

Getting an accurate p-value calculation is critical. The experiment is not worth running if the results are not easily interpretable. Uber describes four different tests they use to calculate statistical results:

  1. Welch’s t-test – for continuous metrics
  2. Mann-Whitney U test – performs well with skewed data
  3. Chi-squared test – used for proportion metrics
  4. The Delta method – an analytical approximation used for ratio metrics, where the variance cannot be computed directly
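As a hedged illustration of how two of these tests map onto metric types, the SciPy sketch below applies the Mann-Whitney U test to a skewed continuous metric and the chi-squared test to a proportion metric. The data is simulated and the mapping simply follows the list above, not any company’s exact code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Skewed continuous metric (e.g. session length): Mann-Whitney U test.
control = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
treatment = rng.lognormal(mean=3.05, sigma=1.0, size=5_000)
u_stat, u_p = stats.mannwhitneyu(control, treatment, alternative="two-sided")

# Proportion metric (e.g. did the user click?): chi-squared test on a 2x2 table.
table = [[1_200, 3_800],   # control:   clicked, did not click
         [1_320, 3_680]]   # treatment: clicked, did not click
chi2, chi_p, _, _ = stats.chi2_contingency(table)

print(f"Mann-Whitney p={u_p:.3f}, chi-squared p={chi_p:.3f}")
```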

When a company runs thousands of tests, statistical errors (Type 1 and Type 2 errors) are inevitable. To reduce the risk of reporting findings that are not real, it is important to control the false discovery rate (FDR). Uber does this using the Benjamini-Hochberg procedure.
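The Benjamini-Hochberg procedure itself is simple enough to sketch: sort the p-values, compare each to a scaled threshold, and reject every hypothesis up to the largest p-value that clears its threshold. This generic implementation is for illustration; in practice a library routine such as statsmodels’ `multipletests` (method `fdr_bh`) does the same thing.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean array: which hypotheses to reject at the given FDR."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    thresholds = fdr * (np.arange(1, m + 1) / m)   # BH critical values
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.nonzero(passed)[0].max()       # largest rank whose p-value passes
        reject[order[: cutoff + 1]] = True         # reject all hypotheses up to that rank
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))
# -> [ True  True False False False False]
```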


Challenges with scaling these platforms

Back in 2017, Spotify identified four problems it needed to address to scale its experimentation platform:

  1. Faster iteration: reduce the time between spotting a problem with an experiment and adjusting it
  2. Produce fewer events: A/B testing was generating 25% of total event volume, leading to increased processing costs
  3. Improved analysis: support more metrics and more kinds of analysis
  4. Sophisticated coordination: don’t allow the same user to be assigned to multiple conflicting experiments

All of these optimizations were important for allowing Spotify to scale the number of experiments it was running and should be considered when building an experimentation platform.


Conclusion

Running experiments is a necessity in today’s business environment. A/B testing gives us the ability to make causal inferences about customers without testing changes on all of them.

"Thank the team of engineers and data scientists who constantly A/B test their innovations to our adaptive streaming and content delivery network algorithms. What about more obvious changes, such as the complete redesign of our UI layout or our new personalized homepage? Yes, all thoroughly A/B tested." – Netflix Tech Blog

Businesses that experiment provide a better experience to their customers and avoid blindly making changes that could have a large impact on the customer experience.

Twitter views this important process as a cycle of innovation with six steps:

  • Build a hypothesis
  • Define success metrics
  • Test hypothesis
  • Learn
  • Ship
  • Build another hypothesis

We learned how top-performing tech companies tackle issues in these steps successfully, at massive scale.

Thank you for reading this article.

