Friday, June 14, 2013

"ironic" comments on cross-validation

There is an interesting set of responses to "Ten ironic rules for non-statistical reviewers" (Friston, 2012) in press at NeuroImage. Many interesting issues are raised, but I want to touch on the discussion of cross-validation in the reply Sample size and the fallacies of classical inference (Friston, 2013).

from Friston, 2013:
"Is cross validation useful?
[...] I think that the role of cross validation in neuroimaging deserves further discussion. As articulated clearly by Lindquist et al., the goals of inference and cross validation are very distinct. Cross validation is generally used to validate a model (or its parameters) that predicts or classifies using new observations. Technically, this is usually cast in terms of a posterior predictive probability density. Cross validation is used to validate this posterior predictive density for diagnostic or other purposes (such as out-of-sample estimation or variable selection). However, these purposes do not include model comparison or hypothesis testing. In other words, cross validation is not designed for testing competing hypotheses or comparing different models. This would be no problem, were it not for the fact that cross validation is used to test hypotheses in brain mapping. For example, do the voxels in my hippocampal volume of interest encode the novelty of a particular stimulus? To answer this question one has to convert the cross validation scheme into a hypothesis testing scheme - generally by testing the point null hypothesis that the classification accuracy is at chance levels. It is this particular application that is suboptimal. The proof is straightforward: if a test of classification accuracy gives a different p-value from the standard log likelihood ratio test then it is – by the Neyman–Pearson Lemma – suboptimal. In short, a significant classification accuracy based upon cross validation is not an appropriate proxy for hypothesis testing. It is in this (restricted) sense that the Neyman–Pearson Lemma comes into play.

I have mixed feelings about cross validation, particularly in the setting of multivariate pattern classification procedures. On the one hand, these procedures speak to the important issue of multivariate characterisation of functional brain architectures. On the other hand, their application to hypothesis testing and model selection could be viewed as a non-rigorous and slightly lamentable development."
I fully agree that "cross validation is not designed for testing competing hypotheses or comparing different models", but do not see how that is a problem for (typical) MVPA hypothesis testing.

This image is my illustration of a common way cross-validation is used in MVPA to generate a classification accuracy (see this post for an explanation). As Friston says in the extract, we often want to test whether the classification accuracy is at chance. My preferred approach in these sorts of cases is usually dataset-wise permutation testing, in which the task labels are changed, then the classification (and averaging the accuracies obtained on the cross-validation folds) is repeated, resulting in a null distribution for mean accuracy to which the true mean accuracy can be compared (and a significance level obtained, if desired).

In this scenario the cross-validation does not play a direct role in the hypothesis testing: it is one step of the procedure that led from the data to the statistic we're hypothesizing about. The classification accuracy is not significant "based upon cross validation", but rather significant based on the permutation testing; cross-validation is one stage of the procedure that led to the classification accuracy. Friston, K. (2013). Sample size and the fallacies of classical inference NeuroImage DOI: 10.1016/j.neuroimage.2013.02.057

No comments:

Post a Comment