A large-sample crisis or not? Exaggerated false positives by popular differential expression methods on human population RNA-seq samples

Sanity check by permutation should be done for large samples

Jingyi Jessica Li

Published in Towards Data Science · 4 min read · Mar 16, 2022

Written by Dr. Yumei Li (UC Irvine) and Jingyi Jessica Li (UCLA)

Image from https://unsplash.com/photos/I23WeOTsA8M

In biological studies, identifying features that differ significantly between experimental conditions or disease states is key to understanding the mechanisms behind phenotypes. Gene expression is the most commonly studied of these features, and RNA-seq technology has made it easier and faster to identify differentially expressed genes (DEGs) genome-wide. However, RNA-seq datasets have traditionally been small (usually only 2–4 biological replicates per condition), which makes it difficult to identify DEGs accurately. To improve statistical power in small samples, researchers developed methods based on parametric distributional assumptions and the empirical Bayes approach, including the two popular methods DESeq2 [1] and edgeR [2]. As the cost of sequencing has declined, RNA-seq sample sizes have gradually increased, reaching hundreds or even thousands in some population-level studies. This raises a natural question: are methods like DESeq2 and edgeR, which were designed for small samples, still applicable to population-level RNA-seq datasets?

To answer this question, researchers from UCLA and UCI recently published a paper entitled “Exaggerated false positives by popular differential expression methods when analyzing human population samples” in Genome Biology. Through permutation analysis, the researchers found that DESeq2 and edgeR have extremely high false discovery rates (FDRs), far exceeding the target FDR threshold. They then evaluated four more DEG analysis methods: limma-voom, another popular method; NOISeq, the nonparametric method adopted by the GTEx consortium; dearseq, a recent nonparametric method; and the classic nonparametric Wilcoxon rank-sum test (also known as the Mann-Whitney test). Of the six methods, only the Wilcoxon rank-sum test controlled the FDR while achieving good power. Therefore, for population-level RNA-seq studies, the researchers recommend the Wilcoxon rank-sum test.
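
For readers who want to try this baseline, here is a minimal Python sketch of a per-gene Wilcoxon rank-sum test followed by multiple-testing adjustment. It is an illustration under stated assumptions, not the code used in the paper: the function names (`rank_sum_pvalue`, `benjamini_hochberg`, `wilcoxon_degs`) are hypothetical, the p-value uses the large-sample normal approximation with a tie correction and no continuity correction, the Benjamini-Hochberg procedure is assumed as the FDR adjustment, and the input is assumed to be expression values that have already been normalized across samples.

```python
import math
from itertools import groupby


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0  # sum of midranks of the samples in x
    tie_term = 0.0    # sum of t^3 - t over groups of tied values
    pos = 0           # number of pooled values already ranked
    for _, grp in groupby(pooled, key=lambda pair: pair[0]):
        grp = list(grp)
        t = len(grp)
        midrank = pos + (t + 1) / 2  # average of ranks pos+1 .. pos+t
        rank_sum_x += midrank * sum(1 for _, g in grp if g == 0)
        tie_term += t**3 - t
        pos += t
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # every value is tied, so there is no evidence either way
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def wilcoxon_degs(expr, labels, fdr=0.05):
    """Return indices of genes called as DEGs at the target FDR.

    expr   : list of genes, each a list of normalized expression values
    labels : list of 0/1 condition labels, one per sample
    """
    pvals = []
    for gene in expr:
        x = [v for v, g in zip(gene, labels) if g == 0]
        y = [v for v, g in zip(gene, labels) if g == 1]
        pvals.append(rank_sum_pvalue(x, y))
    qvals = benjamini_hochberg(pvals)
    return [i for i, q in enumerate(qvals) if q <= fdr]
```

Calling `wilcoxon_degs(expr, labels)` returns the indices of the genes whose adjusted p-value is at or below the target FDR. The normal approximation is only reasonable when both conditions have many samples, which is exactly the population-level setting discussed here; for a handful of replicates an exact test would be needed, and the test would have little power anyway.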

Image from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

In this study, the researchers first generated negative-control datasets by randomly permuting the two-condition labels of all samples, and found that DESeq2 and edgeR have unexpectedly high FDRs on these datasets. Surprisingly, when analyzing the RNA-seq dataset from an immunotherapy study [3], they found that DESeq2 and edgeR identified even more DEGs from the permuted data than from the original data. In particular, some genes were mistakenly identified as DEGs in multiple permuted datasets and were enriched in immune-related biological pathways, which could easily mislead researchers. The researchers then benchmarked DESeq2 and edgeR along with four other DEG identification methods on more datasets (GTEx [4] and TCGA [5]) through semi-synthetic data analysis. The results showed that the classic nonparametric Wilcoxon rank-sum test consistently controlled the FDR.
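
The permutation sanity check described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' pipeline: `permutation_check` and `toy_mean_gap_caller` are hypothetical names, the number of permutations is an arbitrary default, and the toy caller is a deliberately crude stand-in for whatever DEG method is being checked (in practice it would wrap a real method such as DESeq2, edgeR, or a rank-sum test).

```python
import random
from collections import Counter
from statistics import mean


def permutation_check(expr, labels, call_degs, n_perm=100, seed=0):
    """Compare DEG calls on the real labels with calls on permuted labels.

    expr      : list of genes, each a list of expression values per sample
    labels    : list of 0/1 condition labels, one per sample
    call_degs : any function (expr, labels) -> iterable of DEG gene indices
    Returns (number of DEGs on the original labels,
             list with the number of DEGs in each permuted dataset,
             Counter of how often each gene was called across permutations).
    """
    rng = random.Random(seed)
    n_original = len(list(call_degs(expr, labels)))
    n_permuted = []
    gene_hits = Counter()
    for _ in range(n_perm):
        shuffled = list(labels)  # copy, so the caller's labels stay untouched
        rng.shuffle(shuffled)    # breaks any real link between label and expression
        called = list(call_degs(expr, shuffled))
        n_permuted.append(len(called))
        gene_hits.update(called)
    return n_original, n_permuted, gene_hits


def toy_mean_gap_caller(expr, labels, min_gap=8.0):
    """Stand-in DEG caller (NOT a real method): flags large gaps in group means."""
    called = []
    for i, gene in enumerate(expr):
        group0 = [v for v, g in zip(gene, labels) if g == 0]
        group1 = [v for v, g in zip(gene, labels) if g == 1]
        if abs(mean(group1) - mean(group0)) > min_gap:
            called.append(i)
    return called
```

Because permuting the labels removes any true association between condition and expression, every DEG called on a permuted dataset is a false positive. If the counts in `n_permuted` are routinely comparable to, or larger than, `n_original`, or if the same genes keep appearing in `gene_hits`, the method is not controlling false discoveries on that dataset.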

**Results of permutation analysis based on RNA-seq data from an immunotherapy study.** Image from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4

Through a more in-depth analysis, the authors found that one reason why methods based on parametric distributional assumptions (e.g., DESeq2 and edgeR) cannot control the FDR is that the data violate the parametric distributions these methods assume. When the sample size is large enough, parametric assumptions are no longer necessary, which is why the nonparametric Wilcoxon rank-sum test performed best in the evaluation. The researchers therefore conclude that, for population-level studies with large sample sizes, classic nonparametric statistical methods should be considered the baseline for both data analysis and new method benchmarking.

Professor Jingyi Jessica Li from the Department of Statistics at UCLA and Professor Wei Li from the Department of Biological Chemistry at UCI are the corresponding authors of this paper. The work was mainly done by Dr. Yumei Li (postdoc in Wei Li’s lab) and Dr. Xinzhou Ge (postdoc in Jingyi Jessica Li’s lab).

Link to the original paper: https://doi.org/10.1186/s13059-022-02648-4

For small-sample-size data, users who want valid FDR control but find it difficult to obtain high-resolution p-values can use Clipper [6], a statistical method developed by Jingyi Jessica Li’s lab, for which Dr. Xinzhou Ge was also a leading author: https://www.physicalsciences.ucla.edu/new-statistical-framework-to-increase-the-reliability-of-data-driven-biomedical-research-on-biomolecules/

References

1. Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014, 15:550.

2. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26:139–140.

3. Riaz N, Havel JJ, Makarov V, Desrichard A, Urba WJ, Sims JS, Hodi FS, Martin-Algarra S, Mandal R, Sharfman WH, et al: Tumor and Microenvironment Evolution during Immunotherapy with Nivolumab. Cell 2017, 171:934–949.e16.

4. GTEx Consortium: The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 2020, 369:1318–1330.

5. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM: The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013, 45:1113–1120.

6. Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, Wang N, Li W, Wang LD, Li JJ: Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 2021, 22:288.
