Neuroskeptic made me aware of a new paper by Combrisson & Jerbi entitled "Exceeding
chance level by chance: The caveat of theoretical chance levels in
brain signal classification and statistical assessment of decoding
accuracy"; full citation below. Neuroskeptic's post has comments and a summary of the article, which I suggest you check out, along with its comment thread.

My first reaction reading the article was confusion: are they suggesting we shouldn't test against chance (0.5 for two classes), but some other value? But no, they are arguing that it

What about the results of the paper's analyses? Basically, they strike me as unsurprising. For example, the authors note that smaller datasets are less stable (eg quite easy to get accuracies above 0.7 in noise data when only 5 examples of each class), and that smaller test set sizes (eg leave-1-out vs. leave-20-out cross validation when 100 examples) tend to have higher variance across the cross-validation folds (and so harder to reach significance). At right is Figure 1e, showing the accuracies they obtained from classifying many (Gaussian random) noise datasets of different sizes. What I immediately noticed is how nice and symmetrical around chance the spread of dots appears: this is the sort of figure we expect to see when doing a permutation test. Eyeballing the graph (and assuming the permutation test was done properly), we'd probably end up with accuracies above 0.7 being significant at small sample sizes, and around 0.6 for larger datasets, which strikes me as reasonable.

I'm not a particular fan of using the binomial for significance in neuroimaging datasets, especially when the datasets have any sort of complex structure (eg multiple fMRI scanning runs, cross-validation, more than one person), which they almost always have. Unless your data is structured exactly like Combrisson & Jerbi's (and they did the permutation test properly, which they might not have, see Martin Hebart's comments), Table 1 strikes me as inadequate for establishing significance: I'd want to see a test taking into account the variance in your actual dataset (and claims being made).

Perhaps my concluding comment should be that proper statistical testing can be hard, and is usually time consuming, but is absolutely necessary. Neuroimaging datasets are nearly always structured (eg sources of variance and patterns of dependency and interaction) far differently from the assumptions of quick statistical tests, and we are asking questions of them not covered by one-line descriptions. Don't look for a quick fix, but rather focus on your dataset and claims, and a method for establishing significance levels is nearly always possible.

My first reaction reading the article was confusion: are they suggesting we shouldn't test against chance (0.5 for two classes), but some other value? But no, they are arguing that it

*is necessary to do a test against chance*... to which I say, yes,**of course**it is necessary to do a statistical test to see if the accuracy you obtained is significantly above chance. The authors are arguing against a claim ("the accuracy is 0.6! 0.6 is higher than 0.5, so it's significant!") that I don't think I've seen in an MVPA paper, and would certainly question if I did. Those of us doing MVPA debate about how exactly to best do a permutation test (a favorite topic of mine!), and if the binomial or t-test is appropriate in particular situations, but everyone agrees that a statistical test is needed to support a claim that an accuracy is significant. In short, I agree with Nikolaus Kriegeskorte's comment that this is a strawman argument.What about the results of the paper's analyses? Basically, they strike me as unsurprising. For example, the authors note that smaller datasets are less stable (eg quite easy to get accuracies above 0.7 in noise data when only 5 examples of each class), and that smaller test set sizes (eg leave-1-out vs. leave-20-out cross validation when 100 examples) tend to have higher variance across the cross-validation folds (and so harder to reach significance). At right is Figure 1e, showing the accuracies they obtained from classifying many (Gaussian random) noise datasets of different sizes. What I immediately noticed is how nice and symmetrical around chance the spread of dots appears: this is the sort of figure we expect to see when doing a permutation test. Eyeballing the graph (and assuming the permutation test was done properly), we'd probably end up with accuracies above 0.7 being significant at small sample sizes, and around 0.6 for larger datasets, which strikes me as reasonable.

I'm not a particular fan of using the binomial for significance in neuroimaging datasets, especially when the datasets have any sort of complex structure (eg multiple fMRI scanning runs, cross-validation, more than one person), which they almost always have. Unless your data is structured exactly like Combrisson & Jerbi's (and they did the permutation test properly, which they might not have, see Martin Hebart's comments), Table 1 strikes me as inadequate for establishing significance: I'd want to see a test taking into account the variance in your actual dataset (and claims being made).

Perhaps my concluding comment should be that proper statistical testing can be hard, and is usually time consuming, but is absolutely necessary. Neuroimaging datasets are nearly always structured (eg sources of variance and patterns of dependency and interaction) far differently from the assumptions of quick statistical tests, and we are asking questions of them not covered by one-line descriptions. Don't look for a quick fix, but rather focus on your dataset and claims, and a method for establishing significance levels is nearly always possible.

Combrisson, E., & Jerbi, K. (2015). Exceeding chance level by chance: The caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy Journal of Neuroscience Methods DOI: 10.1016/j.jneumeth.2015.01.010

## No comments:

## Post a Comment