
5 Ways to Fail Your Million Dollar Experiments

A/B Tests the Right Way

Experimentation and Causal Inference

Photo by Keyur Nandaniya on Unsplash

Pre-launch experimentation has become an integral part of how big tech companies (e.g., Netflix and Airbnb) build products. As the saying goes, no business decision shall be made without A/B testing it on your customers.

True!

Business experimentation provides rich information about customer behavior and helps us better understand the gap between what customers think and what they actually do.

People think one way and behave another. More often than not, they tell you one thing and do the exact opposite.

Without experiments, there is no way of knowing how customers will react to a new product. It’s like the blind men describing an elephant: we never get the full picture.

Like any research method, the experimental method comes with drawbacks and assumptions, yet industry practitioners often run experiments even when those assumptions are violated. When that happens, we face the worst-case scenario: decision-making with false evidence.

False information is worse than no evidence.

Without information, we simply hold off on a decision; with false information, we make the wrong one. In today’s post, I elaborate on five common mistakes that may sink your million-dollar experiments and the coping strategies for each scenario.


(I have contributed several pieces on business experimentation. Check them out: why experiments, natural experiments, ITS, RDD, and DID).


Causal Inference and Experimentation

Here, I’ll give you an overview of Causal Inference and experimentation and why they matter. In our day-to-day work, we ask questions like: "Does the new UI increase DAU?" "What factors will drive the next million signups?"

Causal inference is the process of tracing an outcome back to its root cause and gauging the size of that cause’s effect on the result of interest.

Experimentation can help us find the causal effect by holding other factors constant while changing only one variable at a time. If random assignment is not possible, Data Scientists can choose quasi-experimental (Five Tricks) and observational methods (e.g., pairing, propensity score matching) to do causal inference.

Put simply, causal inference starts with comparable experimental groups and ends with differences in outcomes. If the experimental groups (treated and control) are not equivalent to begin with, the research design is flawed.
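To make this concrete, here is a minimal sketch in Python that simulates a randomized experiment: users are split 50/50 at random, a small lift is baked into the treated group’s (hypothetical) outcome metric, and a two-sample t-test recovers it. All numbers and metric names here are made up for illustration.

```python
# Minimal sketch: estimate a treatment effect from a randomized experiment
# using simulated data (all numbers are made up for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000

# Random assignment: each user has a 50/50 chance of seeing the new UI.
treated = rng.integers(0, 2, size=n).astype(bool)

# Hypothetical outcome (e.g., sessions per user) with a small true lift of +0.3.
outcome = rng.normal(5.0, 2.0, size=n) + 0.3 * treated

lift = outcome[treated].mean() - outcome[~treated].mean()
t_stat, p_value = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"Estimated lift: {lift:.3f}, p-value: {p_value:.4f}")
```

Because assignment is random, the two groups are comparable on average, and the simple difference in means is an unbiased estimate of the treatment effect.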

Let’s delve into the main course.


1. Choose Your Metrics Carefully

Data Scientists need to set up metrics to gauge an experiment’s performance, and a lousy metric can kill the experiment.

For example, a retail company’s sales are lagging for the quarter, but the management team still wants to hit its targets by offering big promotions.

The Data Team comes up with an experimental design that suggests a 50% discount would do the trick. Should you roll out the promotion as planned?

Probably not!

The promotion may boost short-term metrics but eat into long-term profits, as customers stock up on everything they need now and won’t have to shop again anytime soon.

Choosing valid metrics is as crucial as, if not more crucial than, the research design itself. The last thing we want to see is cannibalization across business lines.

What To Do?

  • Collaborate with the major business stakeholders to come up with the primary and secondary metrics of interest
  • Understand the tradeoffs of your selections
  • Revisit the metrics before, during, and after the experiment, if necessary
  • As a rule of thumb, experimentation tells a good story about short-term metrics but falls short on long-term performance (see the sketch after this list)
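As a rough illustration of that last point, here is a sketch of how one might report a primary metric alongside a longer-horizon guardrail metric, so a short-term win does not hide long-term damage. The column names (group, revenue, repeat_purchase_30d, user_id) are hypothetical.

```python
# Sketch: evaluate the primary metric alongside a guardrail metric.
# Column names are hypothetical placeholders.
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Compare treatment vs. control on the primary and guardrail metrics."""
    return (
        df.groupby("group")
          .agg(primary_revenue=("revenue", "mean"),
               guardrail_repeat_rate=("repeat_purchase_30d", "mean"),
               users=("user_id", "nunique"))
    )

# df = pd.read_csv("experiment_results.csv")   # hypothetical export
# print(summarize(df))
# Ship only if the primary metric improves AND the guardrail does not degrade.
```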

2. Problematic Experiment Designs

As stated above, Data Scientists need to check whether the experimental groups are comparable in key covariates ("apples-to-apples comparison"). Otherwise, the result may not stand.

Certain designs are widely popular in the industry but ill-suited for causal inference. Two stand out: the One-Group Before-After Comparison design and the Treated vs. Non-Treated Groups design.

Here is an example of each. For the first: we see a bump in some metric (e.g., DAU) after rolling out the new UI, so the updated version must be the way to go. For the second: the treated group brings in more revenue than the control group, so the treatment must work.

These designs do not offer "apples-to-apples" comparisons! The treated group may be fundamentally different from the control group, making any direct comparison problematic. It is a typical problem in causal inference called selection bias.

Also, there is no way to rule out alternative explanations, e.g., other significant changes that shipped alongside the UI update.

What To Do?

  • Check with the engineering team about data availability.
  • Is the treatment group similar to the control group? If not, what measures have you taken? (A quick covariate balance check, sketched below, is a good start.)
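The sketch below computes the standardized mean difference (SMD) between treatment and control for a few hypothetical covariates (age, tenure_days, past_spend); a common, though informal, rule of thumb flags |SMD| > 0.1 as imbalance worth investigating.

```python
# Sketch: check whether treatment and control are comparable on key covariates.
# Covariate and column names are hypothetical.
import numpy as np
import pandas as pd

def standardized_mean_diff(df: pd.DataFrame, covariate: str) -> float:
    """Standardized mean difference; |SMD| > 0.1 is a common imbalance flag."""
    t = df.loc[df["group"] == "treatment", covariate]
    c = df.loc[df["group"] == "control", covariate]
    pooled_sd = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
    return (t.mean() - c.mean()) / pooled_sd

# for cov in ["age", "tenure_days", "past_spend"]:
#     print(cov, round(standardized_mean_diff(df, cov), 3))
```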
Photo by George Bale on Unsplash

3. End Your Experiment Too Early

We have carefully chosen a metric that reflects the business interest and deliberately made the experimental groups comparable. Sweet surprise: there is a positive bump in the metric. Shall we end the experiment now?

It’s tempting to do so when there is a positive finding. This is a common cognitive fallacy (confirmation bias): human beings tend to see what we expect to see.

However, my recommendation is to wait it out and see whether the result "regresses to the mean." The positive jump in the metric may be just a blip that disappears if you let the experiment run a little longer.

What To Do?

  • Keep the experiment running after the initial results come in
  • A trick here: with limited data, we can create more data points by applying the treatment and then withdrawing it, repeating the process multiple times
  • Check how the treatment and control groups change over time (the simulation after this list shows why stopping at the first significant peek is risky)
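To see why stopping at the first positive peek is dangerous, the sketch below simulates many A/A-style experiments with no true effect and stops each one as soon as a t-test looks significant. The inflated false-positive rate it prints is the point; the specific batch sizes and counts are arbitrary.

```python
# Sketch: simulate how "peeking" and stopping at the first significant result
# inflates the false-positive rate even when there is no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, checks, batch = 2_000, 10, 200

false_positives = 0
for _ in range(n_experiments):
    a, b = np.empty(0), np.empty(0)
    for _ in range(checks):
        # Both groups draw from the same distribution: no real effect exists.
        a = np.concatenate([a, rng.normal(0, 1, batch)])
        b = np.concatenate([b, rng.normal(0, 1, batch)])
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1   # stopped early and declared a "winner"
            break

print(f"False-positive rate with peeking: {false_positives / n_experiments:.2%}")
# Typically well above the nominal 5% when you stop at the first significant peek.
```

If you must monitor results continuously, consider sequential testing methods that explicitly control the error rate instead of repeatedly applying a fixed-horizon test.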

4. Spillover Effect and Contamination

Here, the key question is: does the treatment group engage with the control group? If so, the result may be contaminated by the spillover effect.

Hypothetically, the experiment randomly assigns the treatment to individual users who live near one another. The treatment and control groups may talk to each other, and the control group unintentionally picks up the treatment.

This fourth mistake is closely related to the next one, as we will see shortly.

What To Do?

  • Talk to your engineering team: at what level can you randomize your treatment (e.g., user-level, city-wide, URL)?
  • Do your experimental groups engage with each other? If so, change the level of random assignment (see the cluster-randomization sketch after this list).
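For instance, if nearby users talk to each other, one option is to randomize whole cities rather than individual users. The sketch below illustrates the idea with made-up city names; note that clustered assignment also changes how you should analyze the results.

```python
# Sketch: randomize at the city level instead of the user level to limit
# spillover between users who interact with each other. City names are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

cities = ["city_a", "city_b", "city_c", "city_d", "city_e", "city_f"]
shuffled = rng.permutation(cities)
# Assign entire cities to treatment or control (half and half).
city_assignment = {c: ("treatment" if i < len(cities) // 2 else "control")
                   for i, c in enumerate(shuffled)}

users = pd.DataFrame({"user_id": range(6_000),
                      "city": rng.choice(cities, size=6_000)})
users["group"] = users["city"].map(city_assignment)

print(users.groupby("group")["city"].nunique())
# With clustered assignment, analyze at the cluster level or use
# cluster-robust standard errors; the effective sample size is the
# number of cities, not the number of users.
```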

5. Randomization Not So Random

For A/B tests and any other experimental designs that rely on randomization, the primary question Data Scientists should ask is: is the assignment really random?

It sounds contradictory, but a random process does not guarantee a random assignment!

Extraneous factors can make some users more likely to end up in the treatment group than others.

Related to the previous point, randomization at the individual URL level may be infeasible because of cross-contamination. In that case, we should choose city-level randomization instead.

Think carefully about the lowest level of random assignment.

What To Do?

  • This is a challenging question! It requires a thorough understanding of the experimentation platform and the ability to differentiate subtle nuances.
  • Randomize at the user level? URL level? City level?
  • Your engineering team knows better than anyone what is and isn’t possible. A sample ratio mismatch check, sketched below, can also flag a broken assignment.
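One practical diagnostic, regardless of the randomization level, is a sample ratio mismatch (SRM) check: compare the observed split of users against the intended split with a chi-square test. The counts below are purely illustrative.

```python
# Sketch: a sample ratio mismatch (SRM) check, a common signal that the
# randomization is not behaving as intended. Counts here are illustrative.
from scipy.stats import chisquare

observed = [50_912, 49_088]          # users actually bucketed into A and B
expected = [sum(observed) / 2] * 2   # expected counts under a 50/50 split

stat, p_value = chisquare(observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")
# A very small p-value (e.g., < 0.001) suggests the split deviates from 50/50
# and the assignment mechanism should be investigated before trusting results.
```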

Conclusion

This post covers five major ways to fail your experiments: metrics, research design, timing, spillover, and randomization. As the old saying goes, the devil is in the details! Data Scientists should check for these five common mistakes when rolling out experiments.


Enjoy reading this one?

Please find me on LinkedIn and YouTube.

Also, check my other posts on Artificial Intelligence and Machine Learning.

