Validation in Statistics and Machine Learning - Abstract

Boulesteix, Anne-Laure

Over-optimism in statistical bioinformatics: an illustration

In statistical bioinformatics research, different optimization mechanisms potentially lead to "over-optimism" in published papers. In this talk, I present an empirical study on over-optimism using high-dimensional classification based on the KEGG database as an example.

Specifically, I consider a "promising" new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, I quantitatively demonstrate that it can artificially seem superior to existing approaches if we "fish for significance". The investigated sources of over-optimism include the optimization of data sets, of settings, of competing methods and, most importantly, of the method's characteristics. I conclude that, if the improvement of a quantitative criterion such as the error rate is an important aspect of the study, the superiority of the new algorithm should be demonstrated on independent validation data.
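
The selection effect behind "fishing for significance" can be illustrated with a toy simulation. The sketch below is not the study's experiment: it uses no discriminant analysis and no KEGG data, and the function name, the number of variants, the sample size and the true error rate of 0.5 are all assumptions made for illustration. It assumes that every variant of a method is in truth no better than chance, reports the smallest error rate observed among the variants, and compares it with the error rate of the selected variant on independent validation data.

```python
import random


def simulate_over_optimism(n_variants=20, n_samples=50, n_repeats=2000, seed=0):
    """Toy simulation (hypothetical setup): all variants have the same true error rate."""
    rng = random.Random(seed)
    true_error = 0.5  # assumption: no variant is better than chance

    def observed_error():
        # Fraction of misclassified samples when each one is wrong with prob. true_error.
        errors = sum(rng.random() < true_error for _ in range(n_samples))
        return errors / n_samples

    selected_errors = []    # best error rate found by trying all variants
    validation_errors = []  # error rate of the selected variant on fresh data

    for _ in range(n_repeats):
        # Evaluate every variant and keep only the most favourable result.
        observed = [observed_error() for _ in range(n_variants)]
        selected_errors.append(min(observed))
        # Independent validation: the selected variant is evaluated on new data,
        # so its error is a fresh draw around the true error rate.
        validation_errors.append(observed_error())

    def mean(values):
        return sum(values) / len(values)

    return mean(selected_errors), mean(validation_errors)


if __name__ == "__main__":
    optimized, validated = simulate_over_optimism()
    print(f"mean error after selecting the best variant: {optimized:.3f}")
    print(f"mean error on independent validation data:   {validated:.3f}")
```

Under these assumptions, the mean of the selected error rates falls clearly below 0.5 although no variant is truly better than chance, while the error on independent validation data stays close to 0.5.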