Getting Started
Preface
This article is not meant to be a technical article, nor is it meant to be a comprehensive survey of all the different methods out there that control Type I and Type II error rates. It assumes some background knowledge and is primarily focused on motivating a novel paradigm for combating the multiple hypothesis testing problem and introducing a set of tools in R and R Shiny that you can use.
Background
If you’ve ever done statistics or read a research paper about a discovery, the number 0.05 should ring a bell. It refers to a significance threshold of 0.05, which means we tolerate a 5% chance of a "surprising" result when the null hypothesis is true. In hypothesis tests, we compare our p-values against this significance threshold. A p-value is the probability of observing results as extreme as yours, or more extreme, assuming the null hypothesis is correct. The holy value of 0.05 has its purported origins with the British statistician R.A. Fisher, who in 1926 introduced the then-novel concept of statistical significance. He picked this value somewhat arbitrarily, as we can quote from his paper:
Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.
Ever since, scientists around the world have largely stuck to the 5% threshold. If your hypothesis test returns a p-value below 0.05, you reject the null hypothesis and conclude that you have some evidence to support your alternative hypothesis. If it’s above 0.05, you fail to reject the null hypothesis.
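To make this concrete, here is a minimal sketch in base R of a single hypothesis test and the resulting decision (the data are simulated, so the exact numbers are only illustrative):

```r
set.seed(42)
# One-sample t-test: is the mean of this sample different from 0?
x <- rnorm(30, mean = 0.5, sd = 1)  # simulated data with a true effect
test <- t.test(x, mu = 0)

test$p.value         # the p-value of the test
test$p.value < 0.05  # TRUE -> reject the null hypothesis at the 5% level
```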
This is great, but what if you’re doing many, many hypothesis tests? Consider a scenario where we have performed even a small number of hypothesis tests, say 20, and all these tests are truly null (which we’d never know in real life). If we ask ourselves what the probability of observing at least one significant hypothesis test due to chance is, we can calculate:

P(at least one significant result) = 1 - (1 - 0.05)^20 ≈ 0.64
64% is a pretty high probability of observing at least one significant result by chance for just 20 tests, so you can imagine what happens to that probability when we have 1,000 hypothesis tests. 10,000 tests. 1,000,000 tests. Even at 100 hypothesis tests, as you can see in the plot below, the probability of observing at least one significant result approaches 1, or 100% certainty. So what do we do?
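Here is a short base R sketch that reproduces this calculation and the curve in the plot:

```r
alpha <- 0.05
n_tests <- 1:1000

# P(at least one false positive among n independent, truly null tests)
p_at_least_one <- 1 - (1 - alpha)^n_tests

p_at_least_one[20]   # ~0.64 for 20 tests
p_at_least_one[100]  # ~0.99 for 100 tests

plot(n_tests, p_at_least_one, type = "l",
     xlab = "Number of hypothesis tests",
     ylab = "P(at least one false positive)")
abline(h = 1, lty = 2)
```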
Traditional Approach: Bonferroni and Benjamini-Hochberg
In statistical inference, we have two (often competing) goals. The first is to try to minimize the probability of making a Type I error, or rejecting the null hypothesis when we shouldn’t have. That’s what I demonstrated with the above toy example. There has been a lot of past and ongoing research on how to control Type I error rates, but in this article, I’ll focus primarily on controlling the False Discovery Rate and Family-wise Error Rate.
It’s first important to distinguish between the two. The Family-wise Error Rate (FWER) is the probability of making any Type I errors at all. The Bonferroni correction guarantees that the FWER will be at most our chosen significance threshold (let’s use 0.05) by decreasing the significance level for each individual test to 0.05/n, where n is the total number of hypothesis tests.
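In R, the Bonferroni correction is available through the built-in p.adjust() function; here is a quick illustration on some made-up p-values:

```r
pvals <- c(0.001, 0.012, 0.025, 0.047, 0.200)

# Option 1: compare each raw p-value against the adjusted threshold 0.05 / n
pvals < 0.05 / length(pvals)

# Option 2 (equivalent): adjust the p-values and compare against 0.05
p.adjust(pvals, method = "bonferroni") < 0.05
```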
Now’s a good time to tell you the second goal in statistical inference: maximizing power, or minimizing the probability of Type II errors (failing to reject the null hypothesis when you should have). Managing these two goals is a constant balancing act. The Bonferroni correction is known to be quite conservative, meaning that we may miss out on potential discoveries because we have set the significance threshold too stringently.
So, we might want to consider controlling a different metric – the False Discovery Rate (FDR), or the expected proportion of false rejections out of all rejections – most commonly controlled via the Benjamini-Hochberg procedure. It is necessarily less strict because now we’re okay with some number of false positives, whereas when controlling the FWER, we didn’t want any false positives at all. This shift in mindset is acceptable especially in fields such as genomics, where we may be testing thousands or even millions of hypotheses at once. And by controlling the FDR, we have higher statistical power, or a better ability to detect significant findings. All good, right?
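The Benjamini-Hochberg procedure is also built into p.adjust(); on the same made-up p-values it rejects more hypotheses than the Bonferroni correction does:

```r
pvals <- c(0.001, 0.012, 0.025, 0.047, 0.200)

# Benjamini-Hochberg: controls the FDR rather than the FWER
p.adjust(pvals, method = "BH") < 0.05          # rejects three hypotheses

# The stricter Bonferroni correction rejects only one
p.adjust(pvals, method = "bonferroni") < 0.05
```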
The answer to this leading question is: not really. There are two main challenges in our modern data-heavy era.
- The number of hypothesis tests carried out on related data over time is unknown and potentially very large.
- Data repositories are increasingly used, and they grow over time.
Both these challenges manifest due to the current paradigm of scientific research – oftentimes, many teams in different places are working on the same problem. This means that for any given team, no one knows how many total hypotheses are being tested on that data. John Ioannidis makes this point in his famous paper, "Why Most Published Research Findings Are False".
A research finding is less likely to be true … when more teams are involved in a scientific field in chase of statistical significance.
It turns out that many researchers likely still control for multiple hypothesis testing in an offline setting. The offline setting is one where a multiple hypothesis testing procedure (say, Bonferroni) takes into account all the p-values at once (remember that in Bonferroni we divided by the total number of hypothesis tests?). Well, how does anyone know what the total number of hypothesis tests is? Is it all the tests that your research group did? Is it all the tests in the greater NYC area? How could anyone know? If we don’t take into account previous discoveries, what are we really controlling? And in the case where certain research fields deposit their results in publicly available data repositories, a fixed significance threshold fails to adapt to the growing number of hypotheses.
Novel Approach: Online Control
We have now arrived at our principal motivation – in the modern data era, hypothesis testing has taken on a different form which we call the online setting. It’s defined as:
Hypotheses arrive sequentially in a stream. At each step, the investigator must decide whether to reject the current null hypothesis without having access to the number of hypotheses (potentially infinite) or the future p-values, but solely based on the previous decisions.
A growing field of research has emerged to tackle this problem, and I had the wonderful opportunity to join and contribute to this community, which spans the United States and the United Kingdom. I helped to develop an R package that contains an assortment of algorithms that control the FDR and FWER in an online manner, as well as a Shiny app that makes it easy for non-coders to use this package.
The math behind these algorithms is quite technical, so just keep the following benefits in mind:
- You can use your domain knowledge to order the hypotheses and increase statistical power (by testing the hypotheses you believe are most promising first) while still controlling your Type I error. You can thus better achieve both of the aforementioned goals!
- These algorithms use the significance threshold, say 0.05, as a kind of spending budget: you spend some of that 0.05 to perform hypothesis tests, and if you make a discovery (reject a null hypothesis), you get some "wealth" back! A toy sketch of this idea follows this list.
- Lastly, if you stick to traditional methods in our current scientific era, you will likely be led to the wrong conclusions.
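To give a flavour of the spending-budget idea, here is a deliberately simplified toy sketch in R. It is not one of the published online algorithms (their wealth updates are chosen much more carefully, which is what guarantees error control), but it shows the mechanics of spending part of the budget on each test and earning some wealth back after a discovery:

```r
pvals <- c(0.001, 0.20, 0.004, 0.65, 0.03, 0.80, 0.01, 0.55)

wealth     <- 0.05   # start with the full "budget"
spend_frac <- 0.10   # toy rule: spend 10% of current wealth on each test
reward     <- 0.025  # toy rule: earn this much back after each discovery

for (i in seq_along(pvals)) {
  alpha_i  <- spend_frac * wealth           # testing level for hypothesis i
  rejected <- pvals[i] <= alpha_i
  wealth   <- wealth - alpha_i              # pay for the test
  if (rejected) wealth <- wealth + reward   # earn some wealth back on a discovery
  cat(sprintf("Test %d: alpha_i = %.4f, p = %.3f, reject = %s, wealth = %.4f\n",
              i, alpha_i, pvals[i], rejected, wealth))
}
```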
Doing It In R
If you’ve made it this far and are excited or intrigued to witness this magic for yourself, here’s a checklist of how you can get started.
- Read our Quick Start Guide
- Familiarize yourself with some of our algorithms, how they’re different, and under what circumstances you should use them
- Either download the package or open up my Shiny app so that you can get started with online control
- If you want to read more about the package, see this article. If you’re interested in all the gory math details, see the References section in our vignette.
As an example of what some of these algorithms can do, let’s see what the SAFFRON algorithm does on a sample dataset of simulated hypothesis tests.
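Here is a minimal sketch of that comparison, assuming the onlineFDR package and that SAFFRON() and Alpha_spending() accept a plain vector of p-values (exact function names and arguments may differ between package versions; see the Quick Start Guide and vignette for authoritative usage):

```r
# A minimal sketch; assumes the onlineFDR package and that SAFFRON() /
# Alpha_spending() accept a vector of p-values (check your package version).
library(onlineFDR)

set.seed(1)
n <- 200
# Simulated p-values: ~80% true nulls (uniform), ~20% non-nulls (concentrated near 0)
is_null <- runif(n) < 0.8
pvals   <- ifelse(is_null, runif(n), rbeta(n, 0.5, 25))

saffron_res  <- SAFFRON(pvals, alpha = 0.05)         # online FDR control
spending_res <- Alpha_spending(pvals, alpha = 0.05)  # Bonferroni-style alpha-spending

# How many discoveries does each procedure make?
sum(saffron_res$R)
sum(spending_res$R)

# Adaptive testing levels: SAFFRON regains "budget" after discoveries,
# while the alpha-spending thresholds only ever shrink
plot(saffron_res$alphai, type = "l", log = "y",
     xlab = "Hypothesis index", ylab = "Testing level (log scale)")
lines(spending_res$alphai, lty = 2)
legend("topright", legend = c("SAFFRON", "Alpha-spending"), lty = c(1, 2))
```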
Notice how Bonferroni (technically a version of Bonferroni that accounts for a potentially infinite number of hypothesis tests, which is why its threshold is non-constant) gets more and more conservative the more tests there are, while SAFFRON uses the information from past decisions much more intelligently and can gain some "budget" back when it makes a discovery. In this way, we are able to maintain our power while still controlling the rate of Type I errors we make.
Concluding Thoughts
Multiple hypothesis testing is one of the core problems in statistical inference. Reproducibility issues, publication bias, and "p-hacking" are obstacles that obstruct good research and scientific advancement. In our big data era, with more and more hypothesis tests being performed by research groups around the world, it’s increasingly important that researchers begin to shift towards a more accurate and effective framework for error control. So, I want you to think carefully about 0.05, and even more carefully about how you go about controlling your error rates.