Validation in Statistics and Machine Learning - Abstract
Benchmark experiments draw B independent and identically distributed learning samples from a data generating process to estimate the (empirical) performance distribution of candidate algorithms. Formal inference procedures are used to compare these performance distributions and to investigate hypotheses of interest. In most benchmark experiments, B is a "freely chosen" number, often specified depending on the algorithms' runtime so that the experiment finishes in reasonable time. In this presentation we discuss how to control B and remove its "arbitrary" aftertaste. General sequential designs make it possible, among other things, to control the sample size, i.e., B. A benchmark experiment can be seen as a sequential experiment, since each run (drawing a learning sample and estimating the candidates' performances) is executed one after another. Currently, no benefit is taken from this sequential procedure: the experiment is treated as a fixed-sample experiment with B observations, and the hypothesis of interest is tested with a test T after all B runs are complete. We propose to take the sequential nature of benchmark experiments into account and to execute a test T successively on the accumulating data. In a first step, this makes it possible to monitor the benchmark experiment, i.e., to observe the p-value and test statistic of the test while the benchmark experiment is running. In a second step, this information can be used to make a decision: to stop or to continue the benchmark experiment. We present group sequential and adaptive methods and discuss the advantages obtained by sequential benchmark experiments.
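The sequential procedure described above can be illustrated with a minimal sketch. The following Python example is purely hypothetical and not part of the presented work: the data generating process, the two candidate "algorithms" (predicting the training mean vs. predicting zero), the squared-error loss, and the constant stopping boundary (standing in for a proper group sequential critical value) are all assumptions chosen for illustration. It shows the core idea only: runs are executed one by one, a paired t-type statistic is recomputed on the accumulating performance differences, and the experiment stops as soon as the statistic crosses the boundary.

```python
import math
import random
import statistics

def draw_sample(n, rng):
    """Hypothetical data generating process: Gaussian with true mean 0.5."""
    return [rng.gauss(0.5, 1.0) for _ in range(n)]

def run_sequential_benchmark(b_max=100, b_min=5, boundary=2.41, seed=1):
    """Execute benchmark runs sequentially; after each run, test the
    accumulated performance differences and stop early if the statistic
    crosses the (illustrative, not calibrated) boundary."""
    rng = random.Random(seed)
    diffs = []  # per-run performance difference, candidate A minus candidate B
    for b in range(1, b_max + 1):
        # One run: draw a learning sample and a fresh evaluation sample.
        train = draw_sample(50, rng)
        test = draw_sample(200, rng)
        # Candidate A predicts the training mean; candidate B always predicts 0.
        pred_a = statistics.mean(train)
        pred_b = 0.0
        loss_a = statistics.mean((y - pred_a) ** 2 for y in test)
        loss_b = statistics.mean((y - pred_b) ** 2 for y in test)
        diffs.append(loss_a - loss_b)
        # Monitor: recompute the test statistic on the accumulating data.
        if b >= b_min:
            t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(b))
            if abs(t) > boundary:
                return b, t  # early stop: evidence of a performance difference
    return b_max, t  # boundary never crossed: used the full budget b_max

b_stop, t_stat = run_sequential_benchmark()
print(f"stopped after {b_stop} runs, t = {t_stat:.2f}")
```

Because candidate A has a genuine advantage under this toy data generating process, the statistic typically crosses the boundary long before the nominal budget b_max is exhausted, which is exactly the saving a sequential design aims for. A real group sequential or adaptive design would replace the constant boundary with critical values that control the overall error level across the repeated looks.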