Validation in Statistics and Machine Learning - Abstract
In high-dimensional problems such as differential gene expression analysis, it is common practice to fit a separate statistical model to each gene studied and then correct the many resulting p-values for multiple testing. Such approaches take little advantage of the large number of genes studied, and the multiple testing correction needed to control the error rate reduces the power to detect differential expression. To reduce the multiple testing penalty and improve power, researchers often apply filtering methods such as fold-change or variance filters. However, filtering may bias the multiple testing correction. The size of this bias depends on many quantities, such as the fraction of probes filtered out, the filter statistic, the multiple testing method and the test statistic used, and therefore cannot be estimated a priori.
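To make the correction step concrete, the following is a minimal sketch of one widely used multiple testing correction, the Benjamini-Hochberg step-up procedure, applied to a vector of per-gene p-values. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions, not taken from the study.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that p_(k) <= k * alpha / m,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that satisfies the bound
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Illustrative p-values for ten hypothetical genes.
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_pvalues, alpha=0.05)
print(sum(example_rejected), "of", len(example_pvalues), "hypotheses rejected")
```

Because the rejection bound k * alpha / m depends on the number of tests m, removing probes before the correction relaxes the bound for those that remain, which is the source of the power gain sought by filtering.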
We show that the multiple testing correction is biased if non-differentially expressed probes are not filtered out with equal probability across the entire range of p-values, assuming that one of a class of FDR methods is used; this class includes the FDR step-up procedure of Benjamini & Hochberg (JRSSB, 1995). We then illustrate our results using both a simulation study and an experimental dataset, in which the FDR is biased mostly by filters that are associated with the hypothesis being tested, such as the fold change. These are, however, the filters that yield the largest power gain. Indeed, filters that induce little bias in the FDR also yield little or no additional power to detect differentially expressed genes. The results are extended to FDR methods that estimate the null distribution of the p-values empirically, such as those of Yekutieli & Benjamini (J Stat Plan Inf, 1999) and Dudoit et al. (Biomet J, 2008). Bearing these issues in mind, we propose a statistical test that can be used in practice to determine whether a chosen filter biases the FDR estimate used, given a general experimental setup.
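The condition on equal filtering probability can be illustrated with a small toy simulation. The sketch below is not the simulation study or the test described here; it is a simplified setting chosen for illustration. It assumes that all genes are non-differentially expressed, that each gene has a known standard deviation (so a z-test can be used), and that half of the probes are filtered out. All function and variable names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    # Two-sided p-value for a standard normal test statistic.
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_null_genes(n_genes=20000, n_per_group=5, seed=1):
    """Simulate (|log fold change|, p-value) pairs for genes under the null."""
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        sigma = rng.uniform(0.5, 2.0)  # assumed known per-gene SD
        se = sigma * math.sqrt(2.0 / n_per_group)  # SE of the mean difference
        diff = rng.gauss(0.0, se)  # difference in group means under the null
        genes.append((abs(diff), two_sided_p(diff / se)))
    return genes


def fraction_below(pvals, cutoff=0.05):
    # Share of p-values at or below the cutoff; about `cutoff` if uniform.
    return sum(p <= cutoff for p in pvals) / len(pvals)


genes = simulate_null_genes()
n_keep = len(genes) // 2

# Fold-change filter: keep the half with the largest absolute difference.
by_fold_change = sorted(genes, key=lambda g: g[0], reverse=True)
kept_fc = [p for _, p in by_fold_change[:n_keep]]

# Random filter: keep half of the genes regardless of the data.
kept_random = [p for _, p in random.Random(2).sample(genes, n_keep)]

frac_all = fraction_below([p for _, p in genes])
frac_fc = fraction_below(kept_fc)
frac_random = fraction_below(kept_random)

print(f"unfiltered:         {frac_all:.3f}")
print(f"fold-change filter: {frac_fc:.3f}")
print(f"random filter:      {frac_random:.3f}")
```

In this toy setting, the random filter removes null probes with equal probability at every p-value, so the surviving null p-values remain approximately uniform. The fold-change filter preferentially removes probes with large p-values, so small p-values are over-represented among the surviving nulls, and an FDR method that assumes uniform null p-values would understate the false discovery rate.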
Filtering of probes must be used with care, as it may bias the multiple testing correction. Researchers can use our test for FDR bias to guide their choice of filter and amount of filtering in practice. The problem arises because a separate statistical model is used for each gene. It can be avoided altogether, for example by choosing models that make use of all genes, such as gene-set based models, thus dispensing with the need for multiple testing correction.