Are we stats sig yet? 🤔

Pre-requisite: This story is for both technical and business folks who have experience running experiments.

Lily Chen
Towards Data Science

--

Photo by Carlos Muza on Unsplash

TL;DR — The goal of experimentation is to make a decision and not to chase a specific significance level.

It is day 14 of running the most critical experiment in your business unit. Your leadership pings you on work chat and asks, “Are we stat sig yet?”

You quickly run the t-test and report back, “Hmm, we are almost stats sig at the 90% significance level.”

“Okay, but when are we going to reach 95% stats sig?” the lead replies.

You point back to the variance plot to explain how some experiments just don’t get to 95% stats sig even when they are well designed. Not quite following the math, the lead responds, “But what do we need to do to get to stats sig?”

Okay…

We’ve all been here, right?

First, let’s remind business folks what ‘stats sig’ means.

What does statistical significance (stats sig) mean?

According to Wikipedia:

In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, a study’s defined significance level, denoted by alpha, is the probability of the study rejecting the null hypothesis, given the null hypothesis was assumed to be true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when p ≤ alpha. The significance level for a study is chosen before data collection, and is typically set to 5% or much lower — depending on the field of study.

According to most data scientists:

Statistical significance means that the result of our experiment is unlikely to be due to random chance. This gives us confidence in our decision, and the industry standard is a 95% significance level (p ≤ 0.05).

According to most leads:

Bulletproof evidence to gain influence, headcount, or a promotion.
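To make the data scientists’ definition concrete, here is a minimal sketch of what that check usually looks like in practice: a two-sample t-test comparing control and treatment, with p compared against alpha. The data below is simulated and every number is made up for illustration.

```python
# A minimal sketch of the usual 'stats sig' check: a two-sample t-test
# on simulated control vs. treatment data (all numbers are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=5000)    # baseline metric values
treatment = rng.normal(loc=10.1, scale=2.0, size=5000)  # variant with a small lift

alpha = 0.05  # the conventional 5% significance level
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("stats sig at 95%!" if p_value <= alpha else "not stats sig (yet)")
```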

Stats sig is important, but it might not be the right thing to focus on. Chasing technical thresholds like 95% significance often slows organizations down and costs them opportunities to learn and iterate. Ultimately, decision-making is never perfect.

Jeff Bezos said “Most decisions should probably be made with somewhere around 70% of the information you wish you had. If you wait for 90%, in most cases, you’re probably being slow.”

Simply put, anything stronger than mere correlation is learning worth acting on. Whether we need the full rigor of a 95% significance level depends on the situation. The right question to ask is: can we make a decision based on the observed impact at the current significance level?
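One way to reframe the question, sketched below with entirely hypothetical numbers: instead of asking “are we at 95% yet?”, look at the confidence interval for the lift at the level you have actually reached, and ask whether any value in that range would change your decision.

```python
# A sketch of decision-making at the significance level we actually have.
# The lift and standard error below are hypothetical placeholders.
from scipy import stats

lift = 0.021       # observed relative lift, e.g. +2.1%
std_err = 0.012    # standard error of that lift estimate
confidence = 0.90  # the level the experiment has actually reached

z = stats.norm.ppf(1 - (1 - confidence) / 2)
low, high = lift - z * std_err, lift + z * std_err
print(f"{confidence:.0%} CI for the lift: [{low:+.1%}, {high:+.1%}]")

# If even the low end clears your bar to ship (say, break-even at 0%),
# waiting for 95% may just be slowing the decision down.
```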

By this point, your stakeholder may ask, “What could be the reason we are not stat sig?” It’s a fair question, and here are the most common ones:

  • The sample size is too small — Most of the time, this can be addressed before you even set up the experiment by running a power analysis to estimate the sample size needed for a given effect size (expected impact); see the sketch after this list. I say ‘most of the time’ because not every product we test has historical data from which to estimate the variance and effect size. The rule of thumb: the smaller the effect size and the higher the variance, the larger the sample needed.
  • The impact is too small — Not every result is pretty. Most experiments fail because the impact simply isn’t there, and running the experiment for longer is not a solution. It’s best to move on if you don’t see the desired impact within the estimated time period.
  • The primary metric sucks — Your metric may not represent your hypothesis, you might be using a lagging indicator that is harder to detect, you may lack guardrail metrics to catch potential downsides, or you might have too many (or too few) metrics. Plenty of articles cover metric design; it’s often more of an art (painstakingly aligning with your stakeholders and refining constantly) than a science.
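For the first point, here is a minimal sketch of a power analysis in Python using statsmodels. The effect size, significance level, and power below are illustrative assumptions, not recommendations:

```python
# A sketch of the power analysis mentioned above: estimate the sample size
# needed per group to detect a given effect (all inputs are assumptions).
from statsmodels.stats.power import TTestIndPower

effect_size = 0.05  # expected impact in standardized (Cohen's d) units
alpha = 0.05        # significance level
power = 0.8         # chance of detecting the effect if it truly exists

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power
)
print(f"~{n_per_group:,.0f} users per group needed")
```

Try halving the effect size: the required sample roughly quadruples, which is exactly why small expected impacts need large (or long-running) experiments.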

In summary, the goal of experimentation is to make a decision, not to chase a specific significance level. Don’t get me wrong, ‘stats sig’ is important — it validates whether your hypothesis is strong, it signals whether your product is working, and it makes everyone happy. But overemphasizing it does not help with decision making. As Bezos implied, the value is in the speed of decision making! Next time someone asks you, “Are we stat sig yet?”, take a deep breath, stay calm, and help them understand why that might not be the most important question. Send them this article, cross your fingers 🤞, and hope that they like it.

This is my first article, so thank you so much for reading all the way to the end! If you found it fun and interesting, please give it a clap and share it! If you want to connect, reach out on LinkedIn.

- ✌ Lily
