
Randomization, Blocking, and Re-Randomization

Three fundamental pillars for online experimentation

Experimentation and Causal Inference

Introduction

A/B testing has become increasingly popular in the industry. Almost all top tech companies allocate extensive resources and develop in-house Experimentation Platforms (ExP) that support hundreds, even thousands of parallel A/B tests.

Whenever in doubt, A/B test it.

To a large extent, this active testing strategy works exceptionally well: it helps companies mitigate potential risks and generate steady revenue over time. The beauty of A/B testing lies in its ability to objectively evaluate ideas in a scalable way.

The success of any testing strategy depends on multiple components: there are common pitfalls to avoid, best practices to follow, data quality (via A/A tests) to ensure, and user interference to prevent.

The list goes on, as many factors contribute to the success of large-scale experimentation. In today’s blog post, let’s discuss three fundamental statistical concepts that contribute to the high validity of Online Experimentation: randomization, blocking, and re-randomization.


A Toy Example

Let’s explore a typical use case of A/B testing: we want to examine how a new product feature would affect user engagement measured by Daily Active Users (DAU). Existing User Experience research indicates that a small percentage of users (e.g., <2%) are heavy users who engage with the product differently from the rest of the population.

How would you design a study that assesses the new feature’s performance?

The remainder of the post aims to answer this question from the sampling perspective. Part 1 explains why a simple randomization process would fail the mission; Part 2 introduces blocking as the tentative solution and its limitations; Part 3 shows why re-randomization is preferred.

Part 1: Why Don’t We Just Randomize It?

There is a widespread misunderstanding, within the Data Science community and beyond, of randomization and what it is capable of. I often see Product Managers, Data Engineers, and even Data Scientists naively assume that randomization always leads to covariate balance, which, unfortunately, is not true.

Here is a detailed walkthrough of why that assumption is weak. For Online Experimentation, Data Scientists collect a random sample that is representative of the entire customer base (i.e., the population). So, we do not have to worry about the generalizability of the findings, aka external validity.

After collecting a sample, we randomly split the data into two groups and assign the treatment to one group (treatment) but not the other (control). Due to its random nature, we expect no statistical difference across the covariates between the treatment and control groups, ON AVERAGE. This is the secret power of A/B testing: by balancing all covariates except the treatment condition, it makes the treatment effect identifiable.

In contrast, if we do not have randomization, some users may self-select into the treatment or control group, incurring selection bias. If that happens, we cannot naively attribute the differences in the two groups to the intervention and should resort to quasi-Experimentation for viable alternatives. Check out this Medium list (link) for a comprehensive review of major quasi-experimental designs.

Coming back to the original question:

Can we rely on randomization to guarantee the covariate balance between the treatment and control groups?

Unfortunately, the answer is no! Randomization balances covariates ON AVERAGE but cannot guarantee that each realization has balanced covariates. Here is a quick math walkthrough. Let’s assume there is a 5% chance of observing a statistically significant difference in any one covariate; then the probability of having at least one imbalanced covariate across 10 covariates (i.e., the Family-Wise Error Rate) is 1 – (1 – 0.05)¹⁰ ≈ 0.4, or 40%. In other words, if we check 10 covariates, we will observe at least one imbalanced covariate about 40% of the time. That is a pretty high number.
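To see this concretely, here is a minimal simulation sketch in Python. The sample size, the covariate distribution, and the use of per-covariate t-tests are illustrative assumptions, not details from this post:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_users, n_covariates, n_sims, alpha = 2_000, 10, 1_000, 0.05

    imbalanced = 0
    for _ in range(n_sims):
        # Simulate 10 independent covariates and a purely random 50/50 split
        X = rng.normal(size=(n_users, n_covariates))
        treat = rng.permutation(n_users) < n_users // 2
        # Two-sample t-test per covariate between treatment and control
        _, p_values = stats.ttest_ind(X[treat], X[~treat], axis=0)
        imbalanced += (p_values < alpha).any()

    print(f"Splits with >=1 imbalanced covariate: {imbalanced / n_sims:.2f}")
    print(f"Theoretical FWER: {1 - (1 - alpha) ** n_covariates:.2f}")  # ~0.40

The simulated share should land close to the theoretical 40%.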

Given our business scenario with a small percentage of heavy users, a simple randomization process will most likely fail to distribute the heavy users equally into the experimental groups.

For the last time, randomization aims to balance covariates ON AVERAGE but does not guarantee the balance in any specific sample.

There is another, closely related reason why simple randomization is a suboptimal choice. The unequal distribution of the heavy users inflates the variance, which lets small- to medium-sized treatment effects go unnoticed. This is because the minimum required sample size is proportional to the variance, holding other factors constant: if the variance increases by 10 times, we need 10 times more users. If the variance goes up and the sample size stays the same or fails to keep pace, the test will be underpowered and fail to detect any differences.

n ≈ 16σ² / 𝛿²

where:

  • σ²: the sample variance.
  • 𝛿: the difference between the treatment and control groups (the minimum detectable effect).

The above formula approximates the minimum sample size per group if we set α at 0.05 and β at 0.2 (or power at 0.8). Please refer to this post for more information about the above formula.
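As a quick sketch of this rule of thumb in Python (the function name and the example numbers are illustrative assumptions):

    def min_sample_size_per_group(sigma2: float, delta: float) -> int:
        # Rule of thumb for alpha = 0.05, power = 0.8: n ~ 16 * sigma^2 / delta^2
        return int(16 * sigma2 / delta ** 2)

    print(min_sample_size_per_group(sigma2=1.0, delta=0.1))   # 1,600 users per group
    print(min_sample_size_per_group(sigma2=10.0, delta=0.1))  # 16,000: 10x variance -> 10x users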

The next section introduces blocking and why it can work in our favor.

Part 2: Why Is Blocking Better?

As mentioned, a simple randomization process may cause covariate imbalance. Fortunately, the statistical theory of the Design of Experiments offers a solution called blocking, i.e., putting experimental units into blocks (e.g., by device type, age, gender, etc.) so that units within each block are homogeneous.

In our case, our primary interest is the effect of the new product feature. So, it would be great to eliminate the influence of other variables like usage frequency (i.e., heavy vs. light users). In practice, we can set up cutoff thresholds and create three tiers based on usage frequency: low, medium, and high. Then, we randomly split users into the treatment and control groups within each tier.
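Here is a minimal sketch of such blocked randomization in Python; the weekly_sessions metric, the tier cutoffs, and the simulated data are illustrative assumptions:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(7)
    users = pd.DataFrame({
        "user_id": range(10_000),
        "weekly_sessions": rng.exponential(scale=3, size=10_000),
    })
    # Bucket users into three usage tiers (cutoffs are illustrative)
    users["tier"] = pd.cut(
        users["weekly_sessions"], bins=[0, 2, 10, np.inf],
        labels=["low", "medium", "high"], include_lowest=True,
    )

    # Randomize to treatment/control separately within each tier
    users["group"] = "control"
    for idx in users.groupby("tier", observed=True).groups.values():
        mask = rng.permutation(len(idx)) < len(idx) // 2
        users.loc[idx[mask], "group"] = "treatment"

    print(users.groupby(["tier", "group"], observed=True).size())

By construction, each tier is split evenly, so heavy users can no longer pile up in one arm by chance.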

The blocking-and-randomization strategy is more likely to achieve covariate balance than randomization alone. Also, blocking reduces or even eliminates the estimation error contributed by the blocking variable. From the standpoint of variance reduction, blocking removes the variance between the blocks from the error term and so increases test sensitivity.

However, here is the catch: blocking suffers from the Curse of Dimensionality. That is, blocking fails when there are multiple covariates, each with multiple levels. Let’s reuse the previous example and suppose there are 10 variables with 4 levels each. To fully deploy blocking, we need 4¹⁰, or roughly 1 million, possible strata, which is a daunting number to fill.

Do you have enough data/users to fill each stratum?

The number scales exponentially with the dimension, which limits the use cases of blocking.

Practically speaking, we block on the most important variables (e.g., 2–3) and rely on randomization to handle the remaining nuisance variables. Here comes the famous saying in the Design of Experiments:

Block what you can;

Randomize what you cannot.

(George Box)

With blocking, we can answer the question asked at the beginning. If we know a visitor is a heavy or light user a priori (e.g., via log-in information), we can create blocks (strata) and assign each user to the corresponding stratum. In this way, we reduce the variance and make the experiment more sensitive and powerful.

However, blocking is not always feasible: the log-in information may be unavailable, data may arrive in a real-time streaming fashion, or we may only learn the covariates afterward, making it impossible to create blocks beforehand. For instance, we do not know the gender of a random respondent to a telephone marketing campaign before they pick up the phone. Overall, a good rule of thumb is to block on the most important factors and randomize the remaining ones.

In the last section, we discuss another concept called Re-Randomization.

Part 3: Re-Randomization

Neither blocking nor randomization can always guarantee balanced covariates, not to mention the cases where blocking is unavailable. As mentioned above, the high internal validity of experimentation lies in its ability to strictly control for confounding variables and avoid selection bias. If randomization fails to achieve balance across covariates, that validity is compromised.

What to do after unlucky randomization?

Re-Randomization.

Literally, it means we re-run the randomization process for any unlucky realization. The key is to set a threshold on a measure of covariate (im)balance before we run the experiment, and to keep re-randomizing until the assignment passes it. A typical choice of metric is the Mahalanobis Distance between the covariate mean vectors of the treatment and control groups.
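Here is a minimal sketch of the re-randomization loop in Python. The simulated covariates and the acceptance threshold are illustrative assumptions, and the distance omits the usual sample-size scaling since only the acceptance rule matters here:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000, 5))  # covariate matrix: users x covariates

    def covariate_imbalance(X: np.ndarray, treat: np.ndarray) -> float:
        # Mahalanobis-style distance between the group covariate means
        diff = X[treat].mean(axis=0) - X[~treat].mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        return float(diff @ cov_inv @ diff)

    threshold = 0.01  # acceptance threshold, fixed before the experiment starts
    for attempt in range(1, 1_001):
        treat = rng.permutation(len(X)) < len(X) // 2
        if covariate_imbalance(X, treat) < threshold:
            print(f"Accepted split on attempt {attempt}")
            break

Because the acceptance rule is symmetric and fixed before any outcomes are observed, the final split is still a random draw from the set of acceptable assignments.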

Please check this paper for more information on Re-Randomization and the Mahalanobis Distance: Yang et al., 2021, "Rejective Sampling, Re-randomization and Regression Adjustment in Survey Experiments."


If you find my post useful and want to learn more about my other content, please check out the entire content repository here: https://linktr.ee/leihua_ye.


Conclusion

Data Scientists have so much faith in online experimentation that they want to test every single product decision without questioning its validity. In general, this is a promising effort with a high ROI. However, we should bear in mind that randomization does not automatically guarantee covariate balance. For example, the Experimentation Team @Bing reports that 1 out of 4 experiments shows covariate imbalance after randomization. In such cases, we need to apply supplementary tools like blocking and re-randomization to ensure covariate balance.


Enjoyed reading this one?

Please find me on LinkedIn and YouTube.

Also, check my other posts on Artificial Intelligence and Machine Learning.

