
How to Optimize your Switchback A/B Test Configuration

An algorithm that determines the most effective randomization points for a switchback experiment.

Thoughts and Theory

In January of 2021, researchers at MIT and Harvard published a paper that outlines a theoretical framework for the optimal design and analysis of switchback experiments. Switchback experiments, also known as time split experiments, sequentially reshuffle users between control and treatment to remove bias inherent to certain data. These methods are popular in 2-sided marketplaces, such as Uber and Lyft, because they allow for robust experimentation on data with finite resources (drivers, riders, etc.).

The algorithm outlined in the paper leverages our knowledge about the carryover effect, the time it takes for our finite resource to replenish, to minimize the variance of our experiment. While the method does require a solving algorithm, such as minimax, the optimization is done before the experiment is run so there’s low computational overhead.

Figure 1: switchback testing framework with 30 minute time splits – image by author.

Here’s how the method works…

0. Technical TLDR

0.1 Determine the Order of the Carryover Effect. The order of the carryover effect is the number of time periods into the future that a treatment’s effects can be observed. One example for rideshare markets is the number of time splits it takes for drivers to become available for another ride.

0.2 Encode the Risk Function to Minimize. Below we have a function that, when minimized, will provide the best randomization points for our specified carryover effect order (m) and experiment length (K).

Figure 2: expression that, when minimized, will produce the optimal randomization points with the smallest variance – source (Equation 8)

0.3 Develop an Optimized Control/Treatment Sequence. Employ a solver to solve the above minimization problem.

0.4 Run the Experiment and Analyze. After we have developed the optimal randomization config and run the experiment, we can proceed with our analysis as usual. The only difference is we will need to leverage a bootstrap-style algorithm to determine our p-value, which is discussed below.

1. But, what’s actually going on?

Ok, let’s slow down a bit and develop some intuition for this method.

1.1 When do we use switchback experiments?

Switchback experiments are most useful when our user pool has structural biases. These structural biases often take the form of a shared pool of limited resources or complex interactions between our users.

Take the classic example of a freelancing site for data scientists. It’s a 2-sided marketplace, meaning that supply and demand concepts apply to both freelancers and employers. If we run a treatment that helps employers find data scientists with Machine Learning (ML) skills, the pool of data scientists would become depleted, thereby affecting the number of data scientists available to the control group. Let’s also say that our metric of interest is the number of contracts signed for ML work (i.e. conversion). Due to the limited supply of data scientists, our control would see a drop in conversion, which would inflate our estimate of the treatment impact.

If the treatment impacts a shared pool of resources, the control group will be affected, thereby invalidating our experiment.

So, how can we trust treatment lift as an accurate measure of our ML feature’s success?

1.2 Definition of Switchback Experiments

Switchback experiments (also called time split experiments) solve this very problem. They are A/B testing frameworks that alternate 100% of traffic between treatment and control.

Figure 3: illustration of the sequential randomization periods – image by author.

As shown in figure 3, every 30 minutes we randomly assign all users in User Group A to either control or treatment. This method can apply to experiments with any number of treatments.

Now, the duration of each time split is fairly arbitrary; the guiding principle is that it should be short enough to give valid insights into our data, but not so short that computation becomes a problem. DoorDash, for example, uses 30-minute windows.

Finally, to measure our experiment lift, we simply take the average of the treatment lift across all time splits. Note that there are more complex lift measurements that treat a user at a given timestamp as the randomization unit, but those are out of the scope of this post.
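As a minimal sketch of that simple averaging approach, here's one way it could look in Python; the per-split numbers and variable names are hypothetical:

import numpy as np

# Hypothetical per-split results: mean conversion in each 30-minute split,
# plus whether that split was serving treatment (1) or control (0).
split_means = np.array([0.21, 0.25, 0.19, 0.27, 0.22, 0.26])
is_treatment = np.array([0, 1, 0, 1, 0, 1])

# Simple switchback lift: average metric over treatment splits
# minus average metric over control splits.
lift = split_means[is_treatment == 1].mean() - split_means[is_treatment == 0].mean()
print(f"estimated lift: {lift:.3f}")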

1.3 A Better Method

While the simple switchback experiment framework above works, it’s not very robust to treatment effects that can last a long time, so researchers at Harvard and MIT developed a better method. Instead of simply randomizing at each time split, we use knowledge about our carryover effect to design an optimal configuration.

First, what’s a carryover effect? From our freelancer example above, when a freelancer is hired, they are removed from the candidate pool for some time while they complete the project. The duration that it takes for a candidate to reenter the pool is our carryover duration.

Often the carryover duration is expressed as the number of time splits it spans; this value is called our carryover effect order and is represented by the letter m. So, for a carryover duration of 2 hours and 30-minute time splits, m = 4.
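As a quick sanity check of that arithmetic (the values below mirror the example above):

import math

# Hypothetical values: the finite resource takes ~2 hours to replenish
# and each time split lasts 30 minutes.
carryover_duration_min = 120
split_length_min = 30

# Carryover effect order m: the number of time splits the carryover spans.
m = math.ceil(carryover_duration_min / split_length_min)
print(m)  # 4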

Second, we must understand what makes a configuration good or bad. The main goal is to develop a randomization framework that creates the proper number of runs of m+1 consecutive splits that are all control or all treatment. With m+1 treatments in a row, we observe the treatment together with its full carryover effect. Conversely, with m+1 controls in a row, we "reset" the carryover and observe outcomes unaffected by it.

Our goal is to alternate our control/treatments across our user groups to minimize the amount of confounding created by our experiment.

Cool. Ready for the giant equation?

Figure 4: the expression we are looking to minimize to develop optimal randomization points. Duplicate of figure 2 – source (Equation 8)

In Figure 4,

  • K represents the total number of time periods we have in our experiment,
  • t_k is the k-th randomization point (the index of the time split at which the k-th reshuffle occurs), and
  • m is our carryover effect order.

The output of this minimization is a vector T* (T-star) of randomization points. So, for instance, for an experiment with K=12 total time splits and a carryover effect order of m=2, we’d see an optimal output of T* = {1, 5, 7, 9}. Here, T* indicates that we reshuffle our control/treatment groups at the 1st, 5th, 7th, and 9th time splits. For all other time splits, we would not reshuffle our control/treatments.
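To see what that configuration means in practice, here’s a small sketch (my own illustration, not from the paper) that turns T* into an assignment path. It assumes a fair coin flip at each randomization point and carries the previous assignment forward everywhere else:

import random

random.seed(0)  # for a reproducible illustration

K = 12                 # total number of time splits
t_star = {1, 5, 7, 9}  # optimal randomization points from the example above

# At each randomization point, flip a fair coin (assumed assignment mechanism);
# at every other time split, keep the previous split's assignment.
assignment = []
current = None
for k in range(1, K + 1):
    if k in t_star:
        current = random.choice(["control", "treatment"])
    assignment.append(current)

print(assignment)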

Figure 5 is a table that visually compares the optimal design described above to an arbitrary design. The top row is our optimized framework and the bottom row is an arbitrary one. Each X corresponds to a randomization point; if there’s an X, we reshuffle the control/treatments.

Figure 5: example from the paper of an optimal config (top row) and a randomly designed switchback (bottom row) – source (Table 1)

A quick technical note – the minimization is adapted from the risk function of our configuration and corresponds to the variance of our lift. The more we minimize the function, the lower our observed variance and the better our estimate of the treatment effect becomes.

The math in the paper is pretty cool and well documented, so check out section 3 of the paper if you’re curious.

1.4 Intuition Behind the Method

Now that we understand what’s going on, let’s understand why it works. For this section, let’s try randomization points at indices F = {1, 6, 7, 8, 9}. t_n corresponds to the n-th element of our F vector, so, for example, t_0 = 1 and t_4 = 9.

Figure 6: leftmost summation of the minimization expression – source (Equation 8)

Starting with the leftmost summation (figure 6), we can see the term will become larger as the difference between sequential randomization indices gets larger.

To make this concrete, let’s quickly work out an example. If k = 0, then t_0 = 1 and t_(0+1) = 6. The difference between the two is 5, which is relatively large; we are penalized by 5² = 25 units. Summing over all the indices, our penalty term comes out to 112.

Figure 7: second leftmost summation of the minimization expression – source (Equation 8)

Next, let’s tackle the second term (figure 7). The expression is from lemma 2 in the paper which states that both the first and last randomization points should be followed by m non-randomized time splits. If randomization comes too early or late, we are penalized.

Working through this expression with our vector, the expression evaluates to 128. We’re not looking great so far.

Figure 8: rightmost summation of the minimization expression – source (Equation 8)

Ignoring the next two terms, which are artifacts of the math and have little interpretable value, let’s focus on the rightmost summation (figure 8). This term represents the gap between sequential randomization points; if there is too much time between reshuffles, we are penalized. Note that (·)⁺ simply denotes the positive part, i.e. max(x, 0).

This term evaluates to 16 and if we sum up all values, including the values we ignored in the above paragraph, we get 112 + 128 + 192 – 16 + 16 = 432.

There’s our method. Pretty straightforward right?

Now, all we have to do is use a solver to find the randomization path that minimizes the expression in figure 4.
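Because K is usually small, even a brute-force search can serve as the solver. The sketch below is my own illustration: it assumes the first time split is always a randomization point (as in the examples above) and treats `risk` as a placeholder for the expression in figure 4, which isn’t reproduced here. For K = 12 that is only 2,048 candidate sets, so brute force is perfectly feasible; larger experiments would warrant a proper solving algorithm as mentioned earlier.

from itertools import combinations

def optimal_randomization_points(K, m, risk):
    """Brute-force search for the set of randomization points with the lowest risk.

    `risk(points, K, m)` is a placeholder for the figure 4 expression,
    which is defined in the paper and not reproduced here.
    """
    best_points, best_risk = None, float("inf")
    # Assume the first time split is always a randomization point and
    # enumerate every subset of the remaining splits to join it.
    for size in range(K):
        for rest in combinations(range(2, K + 1), size):
            candidate = (1, *rest)
            r = risk(candidate, K, m)
            if r < best_risk:
                best_points, best_risk = candidate, r
    return best_points, best_risk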

2. Inference

With an optimized config determined in section 1, we unfortunately can’t use traditional methods to assess our treatment success because we introduced structural bias to our randomizations.

There are two proposed methods for deriving inference from experiments structured in this manner. The first provides an "exact" estimate and the second provides an asymptotic upper limit.

2.1 Exact Method

The exact method allows us to test a sharp null hypothesis, namely that there is no treatment effect in any of our time splits. See the pythonic pseudocode below for an outline of the steps…

# 0. Calculate the lift for the observed data using the Horvitz-Thompson estimator
total_lift = ht_estimator(observed_assignment_path, observed_ys)

# 1. Iterate through a large number of samples
sampled_lifts = []
for _ in range(num_samples):
    # 2. Sample a new assignment path (w) according to the assignment mechanism,
    #    i.e. re-randomize only at the experiment's randomization points
    w = sample_assignment_path(dist)

    # 3. Estimate the lift for this path using the Horvitz-Thompson estimator
    sampled_lifts.append(ht_estimator(w, observed_ys))

# 4. Calculate the p-value as the proportion of sampled lifts that are more
#    extreme than the observed lift
p_val = sum(abs(lift) >= abs(total_lift) for lift in sampled_lifts) / num_samples

# 5. Examine values of interest
print(total_lift, p_val)

If you’re not a fan of the pseudocode, the English version is: we repeatedly sample new assignment paths and recompute the lift under each one, giving us many estimates of theoretical lifts. Note that these sampled paths must follow the same schema that was used in the experiment, i.e. they must have the same randomization points. Also note that the theoretical lifts are calculated using the Horvitz-Thompson estimator, a weighted estimate of each lift.

Leveraging all these samples, we observe what proportion of the time they are more extreme than our true lift. This proportion is our p-value.
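For intuition about the Horvitz-Thompson piece, here is a rough, simplified sketch of what such an estimator could look like in this setting. The exposure definition (the last m+1 splits all treatment or all control) and the probability weights are simplifying assumptions on my part; the exact estimator is defined in section 4 of the paper.

import numpy as np

def ht_lift(y, w, p_treat, p_control, m):
    """Simplified Horvitz-Thompson-style lift estimate (illustrative only).

    A split k contributes only if its last m+1 assignments were all
    treatment ("exposed to treatment") or all control ("exposed to control");
    contributions are inverse-weighted by the probability of that exposure,
    assumed to be supplied by the caller as p_treat / p_control.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    K = len(y)
    total = 0.0
    for k in range(m, K):
        window = w[k - m:k + 1]
        if window.all():          # all treatment in the last m+1 splits
            total += y[k] / p_treat[k]
        elif not window.any():    # all control in the last m+1 splits
            total -= y[k] / p_control[k]
    return total / (K - m)

# Hypothetical usage: per-split outcomes, assignments (1 = treatment), and
# constant exposure probabilities of 0.5 purely for illustration.
y = [0.21, 0.25, 0.19, 0.27, 0.22, 0.26]
w = [1, 1, 1, 0, 0, 0]
print(ht_lift(y, w, p_treat=[0.5] * 6, p_control=[0.5] * 6, m=2))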

2.2 Asymptotic Method

The second method uses the variance of our Horvitz-Thompson estimator to calculate a t-statistic. Unfortunately, the variance calculation and proofs are pretty lengthy, so they’ve been omitted here, but feel free to check out section 4 of the paper for the full notes on both methods.

Instead, we’ll just focus on performance. The asymptotic estimate is more conservative than the exact method; however, the two exhibit similar performance for large lifts. Which method you choose will depend on your needs.

And there you have it. A start-to-finish framework for the optimal switchback experiment.

Implementation Notes

  • Even if you misspecify the order of the carryover effect m, statistical inference is still valid.
  • Section 7.2 of the paper outlines a method to systematically determine m. Note that this method requires experiment data.
  • When selecting the length of a time split, it’s important that it is shorter than the carryover duration so that the carryover effect can be represented as a multiple of several time splits. However, time splits that are too small are computationally intensive for no reason. A rule of thumb is (num_splits / m) > 100, and err on the side of overestimating m.
  • While expansion to a framework with multiple treatments was not covered in this post, the same randomization points would hold for an experiment with any number of treatments.

Thanks for reading! I’ll be writing 48 more posts that bring "academic" research to the DS industry. Check the comments for links/ideas on optimizing your switchback experiments.

