Validation in Statistics and Machine Learning - Abstract

Okita, Tsuyoshi

Statistical Significance Test in Machine Translation

(joint work with Andy Way)

In the context of Machine Translation, there are two popular statistical significance tests: one based on bootstrap resampling [Koehn, 2004; Zhang and Vogel, 2004] and one based on approximate randomization [Riezler and Maxwell III, 2005]. The latter is more conservative, since it is less prone to type-I errors than the former.
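As a concrete illustration, the following is a minimal sketch of paired bootstrap resampling in the spirit of Koehn [2004]; the function name and the corpus_metric callable (e.g. a corpus-level BLEU implementation) are placeholders of our own, not part of any cited toolkit.

import random

def paired_bootstrap(hyps_a, hyps_b, refs, corpus_metric,
                     n_resamples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004).

    hyps_a, hyps_b: sentence-level outputs of the two systems.
    refs:           the corresponding reference translations.
    corpus_metric:  callable(hypotheses, references) -> float,
                    e.g. a corpus-level BLEU implementation.
    Returns the fraction of resamples on which system A scores
    at least as high as system B.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_resamples):
        # Draw a test set of the same size, with replacement,
        # keeping the pairing between the two systems and the references.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if corpus_metric(sample_a, sample_r) >= corpus_metric(sample_b, sample_r):
            wins_a += 1
    return wins_a / n_resamples

Following Koehn [2004], if system A wins on at least 95% of the resamples, the difference is taken to be significant at the 0.05 level.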

This poster examines several conditions under which these methods can be applied to Machine Translation. Firstly, for a given test set an MT system produces only one set of translation outputs, so the overall score, often measured by BLEU [Papineni et al., 2002], is a single fixed number. The idea behind the two methods above is to randomly draw paired test sets at the sentence level so that resampling and permutation tests become possible, which seems supported by the stratification of the output [Yeh, 2000; Noreen, 1989]. The test statistic is typically examined 1,000 times. Our question is whether this is adequate.
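For comparison with the bootstrap sketch above, the following is a minimal sketch of the sentence-level permutation idea, roughly following the approximate randomization test of Riezler and Maxwell III [2005], with the same placeholder corpus_metric; the add-one smoothing of the p-value is a common convention (cf. Noreen [1989]).

import random

def approximate_randomization(hyps_a, hyps_b, refs, corpus_metric,
                              n_shuffles=1000, seed=0):
    """Approximate randomization test, roughly following
    Riezler and Maxwell III (2005).

    Under the null hypothesis the two systems are exchangeable, so the
    sentence-level outputs are swapped with probability 0.5 and the
    corpus-level score difference is recomputed for each shuffle.
    Returns an estimated p-value for the observed difference.
    """
    rng = random.Random(seed)
    observed = abs(corpus_metric(hyps_a, refs) - corpus_metric(hyps_b, refs))
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        shuf_a, shuf_b = [], []
        for a, b in zip(hyps_a, hyps_b):
            if rng.random() < 0.5:
                a, b = b, a          # swap the paired outputs for this sentence
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(corpus_metric(shuf_a, refs) - corpus_metric(shuf_b, refs))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing of the estimated p-value (a common convention).
    return (at_least_as_extreme + 1) / (n_shuffles + 1)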

Secondly, the test statistic is the difference between the evaluated translation outputs, where the evaluation measure is fixed in advance: BLEU [Papineni et al., 2002], NIST [Doddington, 2002], METEOR [Banerjee and Lavie, 2005], TER [Snover et al., 2006], or others. Riezler and Maxwell III [2005] argue that NIST is more appropriate than BLEU for approximate randomization. Does the outcome of the statistical significance test depend on the evaluation measure?
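Whether the conclusion depends on the measure can at least be probed empirically, because the resampling routines above take the metric as a parameter. The sketch below is only illustrative: corpus_unigram_precision and corpus_word_error_rate are crude stand-ins of our own for BLEU-like and TER-like measures, and approximate_randomization refers to the sketch given earlier.

import difflib

def corpus_unigram_precision(hyps, refs):
    # Crude BLEU-like stand-in: fraction of hypothesis tokens found in the reference.
    matched = total = 0
    for hyp, ref in zip(hyps, refs):
        ref_tokens = ref.split()
        for tok in hyp.split():
            total += 1
            if tok in ref_tokens:
                matched += 1
    return matched / max(total, 1)

def corpus_word_error_rate(hyps, refs):
    # Crude TER-like stand-in: unmatched words per reference word.
    edits = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        matches = sum(b.size for b in difflib.SequenceMatcher(a=r, b=h).get_matching_blocks())
        edits += len(r) + len(h) - 2 * matches
        ref_len += len(r)
    return edits / max(ref_len, 1)

# hyps_a, hyps_b, refs as in the sketches above.  Run the same test once per
# measure; disagreement between the resulting p-values would indicate that the
# significance conclusion is metric-dependent.
for name, metric in [("unigram precision", corpus_unigram_precision),
                     ("word error rate", corpus_word_error_rate)]:
    p = approximate_randomization(hyps_a, hyps_b, refs, metric)
    print(name, p)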

Thirdly, when algorithms A and B are compared, the two systems often share most of their underlying components, i.e. there are many dependencies between them. Church and Mercer [1993] give examples of dependence between test set instances in natural language. Although the expected value of the instance results stays the same, the chance of getting an unusual result may change. Hence, computing the chance of an unusual result under the null hypothesis requires these dependencies to be taken into account. What, then, is a good way to quantify them?
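A toy simulation can illustrate the point about dependence: a component shared by all sentence-level score differences leaves the expected corpus-level difference unchanged but widens its spread, so "unusual" results become more likely than an independence assumption would suggest. The shared-component model and the numbers below are purely hypothetical.

import random
import statistics

def corpus_score_spread(shared_sd, n_sentences=1000, n_trials=2000, seed=0):
    # Each trial draws sentence-level score differences; shared_sd controls a
    # component common to every sentence in the trial (the dependence).
    rng = random.Random(seed)
    corpus_diffs = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, shared_sd)
        diffs = [shared + rng.gauss(0.0, 1.0) for _ in range(n_sentences)]
        corpus_diffs.append(sum(diffs) / n_sentences)
    return statistics.mean(corpus_diffs), statistics.stdev(corpus_diffs)

print(corpus_score_spread(shared_sd=0.0))  # independent sentences
print(corpus_score_spread(shared_sd=0.5))  # dependent sentences: same mean, larger spread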

Fourthly, even when algorithm A is shown to be significantly better than algorithm B on one corpus, the result frequently does not carry over to another corpus with a different language pair or a different size.