Monday, April 25, 2016

"Classification Based Hypothesis Testing in Neuroscience"

There's a lot of interesting MVPA methodology in a recent paper by Jamalabadi et al., with the long (but descriptive) title "Classification Based Hypothesis Testing in Neuroscience: Below-Chance Level Classification Rates and Overlooked Statistical Properties of Linear Parametric Classifiers". I'll focus on the below-chance classification part here, and hopefully get to the permutation testing parts in detail in another post. For a very short version: I have no problem at all with their advice to report p-values and null distributions from permutation tests to evaluate significance, and I agree that accuracy alone is not sufficient, but they have some very oddly-shaped null distributions, which make me wonder about their permutation scheme.
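
To make "report p-values and null distributions from permutation tests" concrete, here is roughly the sort of thing I mean (a minimal sketch using a toy one-dimensional nearest-mean classifier, not the permutation scheme from the paper or any particular real analysis):

 # toy leave-one-out classifier: assign each example to the class whose
 # training-set mean it falls closest to, then return the overall accuracy
 nearest.mean.acc <- function(x, labels) {
   correct <- sapply(seq_along(x), function(i) {
     m1 <- mean(x[-i][labels[-i] == 1]);   # class-1 mean, without example i
     m2 <- mean(x[-i][labels[-i] == 2]);   # class-2 mean, without example i
     guess <- ifelse(abs(x[i] - m1) < abs(x[i] - m2), 1, 2);
     guess == labels[i];
   });
   mean(correct);
 }

 set.seed(42);   # arbitrary seed for this sketch
 x <- c(rnorm(12, mean=0.1), rnorm(12, mean=-0.1));   # weak two-class signal
 labels <- rep(1:2, each=12);

 true.acc <- nearest.mean.acc(x, labels);
 null.accs <- replicate(1000, nearest.mean.acc(x, sample(labels)));   # label-permutation null distribution
 p.value <- (sum(null.accs >= true.acc) + 1) / (length(null.accs) + 1);   # one-sided permutation p-value
 hist(null.accs, main="permutation null distribution"); abline(v=true.acc, col='red');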

Anyway, the below-chance discussion is mostly in the section "Classification Rates Below the Level Expected for Chance" and Figure 3, with proofs in the appendices. Jamalabadi et al. set up a series of artificial datasets, designed to vary in the amount of signal and the number of examples. They get many below-chance accuracies when "sample size and estimated effect size is low", which they attribute to "dependence on the subsample means":
 "Thus, if the test mean is a little above the sample mean, the training mean must be a little below and vice versa. If the means of both classes are very similar, the difference of the training means must necessarily have a different sign than the difference of the test means. This effect does not average out across folds, ....."
They use Figure 3 to illustrate this dependence in a toy dataset. That figure is really too small to see online, so here's a version I made (R code after the jump if you want to experiment).
This is a toy dataset with two classes (red and blue), 12 examples of each class. The red class is drawn from a normal distribution with mean 0.1; the blue, from a normal distribution with mean -0.1. The full dataset (at left) shows a very small difference between the classes: the mean of the blue class is a bit to the left of the mean of the red class (top row triangles); the vertical line marks the midpoint between the two class means.

Following Jamalabadi et al.'s Figure 3, I then did a three-fold cross-validation, leaving out four examples of each class per fold. One of the folds is shown in the right image above; the four left-out examples in each class are marked with a black x. The diamonds are the means of the training set (the eight not-crossed-out examples in each class). The crossed diamonds are the means of the test set (the four crossed-out examples in each class), and they are flipped: the blue test mean is on the red side of the line, and the red test mean on the blue side. Looking at the position of the examples, all four of the blue test examples will be classified wrong, and three of the four red: an accuracy of 1/8, well below chance.
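
If you'd rather check that tally in code than by eye, the nearest-training-mean rule implied by the figure can be written as a short function. This is just a sketch, and it assumes ex1 and ex2 from the R code after the jump are already in your workspace:

 fold.accuracy <- function(train1, train2, test1, test2) {
   m1 <- mean(train1);   # training-set mean, class 1 (blue)
   m2 <- mean(train2);   # training-set mean, class 2 (red)
   correct1 <- abs(test1 - m1) < abs(test1 - m2);   # class-1 test examples nearer their own training mean?
   correct2 <- abs(test2 - m2) < abs(test2 - m1);   # class-2 test examples nearer their own training mean?
   (sum(correct1) + sum(correct2)) / (length(test1) + length(test2));
 }
 fold.accuracy(ex1[1:8], ex2[1:8], ex1[9:12], ex2[9:12]);   # the fold with examples 9:12 left out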

This is the "dependence on subsample means": pulling out the test set shifts the mean of the remaining examples (the training set) in the other direction, making performance worse (in the example above, the training-set means are further from zero than the full-dataset means). This won't matter much if the two classes are very distinct, but it can have a strong impact when they're similar (small effect size), as in this example (and many neuroimaging datasets).
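
A quick way to see this dependence numerically is to draw a single sample, repeatedly split it into training and test subsets, and compare how the two subset means shift relative to the full-sample mean (a minimal sketch, not code from the paper):

 set.seed(1);   # arbitrary seed for this demonstration
 x <- rnorm(12, mean=0.1);   # one class's worth of examples
 shifts <- t(replicate(1000, {
   test.ids <- sample(1:12, 4);                    # leave out 4 examples
   c(train=mean(x[-test.ids]) - mean(x),           # training-mean shift from the full mean
     test=mean(x[test.ids]) - mean(x));            # test-mean shift from the full mean
 }));
 cor(shifts[,"train"], shifts[,"test"]);   # exactly -1: 8*(train shift) + 4*(test shift) always sums to zero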

Is this an explanation for below-chance classification? Yes, I think it could be. It certainly fits well with my observation that below-chance results tend to occur when power is low, and that they should not be interpreted as anti-learning, but rather as a sign of poor performance. My advice for now remains the same: if you see below-chance classification, troubleshoot and try to boost power; but I think we now have a better understanding of how below-chance performance can happen.


Jamalabadi H, Alizadeh S, Schönauer M, Leibold C, & Gais S (2016). Classification based hypothesis testing in neuroscience: Below-chance level classification rates and overlooked statistical properties of linear parametric classifiers. Human Brain Mapping, 37(5), 1842-1855. PMID: 27015748

follow the jump for the R code to create the image above



 do.plot <- function(s1, s2, test1, test2) {  # plot one cv fold: s1, s2 are the training examples; test1, test2 the left-out (test) examples. ex1, ex2, col1, col2 are taken from the global environment.
  plot(x=0,y=0, xlim=c(-2.5,2.5), ylim=c(0.5,2), col='white', xlab='', ylab='', yaxt='n', main="a cv fold")  
  points(x=ex1, y=rep(1.25,length(ex1)), col=col1, cex=1.5);  
  points(x=ex2, y=rep(1.55,length(ex2)), col=col2, cex=1.5);  
  points(x=test1, y=rep(1.25, length(test1)), pch=4, cex=1.75); # x out omitted points  
  points(x=test2, y=rep(1.55, length(test2)), pch=4, cex=1.75);  
   
  points(x=mean(ex1), y=1.75, col=col1, cex=2, pch=2); # means for full dataset  
  points(x=mean(ex2), y=1.75, col=col2, cex=2, pch=2);  
  lines(x=rep((mean(ex1)+mean(ex2))/2, 2), y=c(-1,4))  
   
  points(x=mean(s1), y=1, col=col1, cex=2, pch=5);  # means of training dataset  
  points(x=mean(s2), y=1, col=col2, cex=2, pch=5);  
    
  points(x=mean(test1), y=0.75, col=col1, pch=9, cex=2) # means of testing dataset  
  points(x=mean(test2), y=0.75, col=col2, pch=9, cex=2)  
 }  
   
 set.seed(3463);  # seed for the example shown in the post
 ex1 <- rnorm(12, mean=-0.1);  # blue class examples
 ex2 <- rnorm(12, mean=0.1);   # red class examples
 col1 <- 'blue';
 col2 <- 'red';

 layout(matrix(1:4, nrow=2, ncol=2, byrow=TRUE));  # 2x2 grid: full dataset plus the three cv folds
 plot(x=0,y=0, xlim=c(-2.5,2.5), ylim=c(0.5,2), col='white', xlab='', ylab='', yaxt='n', main='full dataset')  
 points(x=ex1, y=rep(1.25,length(ex1)), col=col1, cex=1.5);  
 points(x=ex2, y=rep(1.55,length(ex2)), col=col2, cex=1.5);  
 points(x=mean(ex1), y=1.75, col=col1, cex=2, pch=2);  
 points(x=mean(ex2), y=1.75, col=col2, cex=2, pch=2);  
 lines(x=rep((mean(ex1)+mean(ex2))/2, 2), y=c(-1,4))  
   
 # the three cross-validation folds: four examples of each class left out per fold
 do.plot(ex1[1:8], ex2[1:8], ex1[9:12], ex2[9:12]);   # first fold (examples 9:12 left out)
 do.plot(ex1[c(1:4,9:12)], ex2[c(1:4,9:12)], ex1[5:8], ex2[5:8]);
 do.plot(ex1[5:12], ex2[5:12], ex1[1:4], ex2[1:4]);

11 comments:

  1. Hi Jo,

    "Anti-learning" is a very interesting topic as it occurs really often in fMRI MVPA, thanks for sharing you insights.

    I am dealing with a below-chance case, but in a different context: searchlights. There are a lot of voxels where we expect chance-level accuracy, so the distribution of the map should be centered around chance with a positive tail for the significant results. However, the distribution is centered more around 40-45%, with values extending down to 15%.
    We have 2 runs of the same task, but the resulting searchlight maps are not very similar, even when averaging across subjects.
    However, the far-below-chance accuracy blobs make sense given our task, so this seems to be information.
    As described in the paper, spurious above-chance accuracies also occur: for a few subjects the distribution is centered above chance.
    Also, permutations (on the full dataset) are similarly biased, so "above permutation chance" group mean accuracy can be lower than 50% and rarely exceeds 60%.

    Following the Jamalabadi paper, I tested both leave-one-out and 2-fold cross-validation, which reduces the variance of the distribution, but the map is quite similar: the same regions "of interest" show below-chance accuracy, and it is still centered around 45-48%.

    Our design has 4 classes, pseudo-randomized temporally using de Bruijn cycles, so each pair of successive stimuli is equally represented.
    However, the 4 classes are not equal, so we tested them 2 by 2; this gives two 2-class classification problems, but the balancing of temporally neighboring classes remains.
    The Gaussian Naive Bayes classifier is really sensitive to the XOR configuration, but it is one of the "fastest" options, especially when it comes to computing permutations.

    Do you think the design can be the problem?
    Or maybe the analysis is wrong?
    Is it really necessary to do "cross-run-validation"?
    If we only have 2 runs, can we repeat cross-validation by subsampling events from the 2 runs?

    Thanks.
    basile

    1. Sorry for the slow reply; this is not a simple question! It might be better to email me directly (jetzel@wustl.edu) instead of going through the comments. There are several issues here: biggish areas of very below-chance accuracy in the searchlight analysis, the analysis design, and permutation testing.

      Briefly, I view seriously below-chance MVPA (e.g., less than 45% when chance is 50%) as a sign that something is not right. Below-chance searchlight blobs would worry me as well; maybe it'd be ok if there are sensible positive areas (e.g., motor in a motor task) and the negative regions are confined to non-grey-matter areas (e.g., ventricles).

      It is not critically necessary to do leave-one-run-out cross-validation, but temporal dependence matters, so something like splitting the runs in half by time (first half vs. second half, not randomly) is generally a fairly safe alternative.
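
      For example, that kind of temporal split can be set up along these lines (just a sketch, with made-up run and trial-order vectors, not code from any particular analysis):

      run   <- rep(1:2, each=20);    # 2 runs of 20 trials each, in temporal order within run
      trial <- rep(1:20, times=2);   # trial position within its run
      partition <- ifelse(trial <= 10, "first.half", "second.half");  # split by time, not randomly
      table(run, partition);   # check: each run contributes trials to both partitions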

      I'm still fairly convinced that permutation-test null distributions should be approximately normal. I've emailed a bit with Steffen Gais, and the permutation test they did in Jamalabadi et al. is different from what I generally recommend. Hopefully we can make some of those details public soon.

  2. Hi
    I have been encountering similar problems.
    I am struggling to understand why data with low effect size can produce these heavily skewed distributions whereas the label-permuted null distributions tend to be well centered on 0. This is what I'm observing in data using an LDA.
    Could you offer some insight on this?
    Thanks
    Sophie

    1. Not sure what you're asking: are your null distributions well-centered?

  3. This comment has been removed by the author.

  4. Yes, the null distributions are centered on the expected chance level.

    1. I think I need a concrete example of the problem; if the null distributions are centered on chance and approximately normal, that's good. If they're on chance but highly skewed, perhaps check your cross-validation and label permutation schemes - is your relabeling respecting all of the dependencies in the dataset (e.g., scanner runs, individuals)?
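
      For example, one way to keep the relabeling within runs (just a sketch, with made-up run and label vectors):

      run    <- rep(1:2, each=10);     # made-up: 2 runs of 10 trials each
      labels <- rep(c("a","b"), 10);   # true class labels
      permuted <- labels;
      for (r in unique(run)) {         # shuffle labels separately within each run
        ids <- which(run == r);
        permuted[ids] <- sample(labels[ids]);
      }
      table(run, permuted);   # the label counts within each run are unchanged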

  5. I don't seem to be making myself clear!
    My data are neural responses to two sounds, and the decoder classifies, from the neural responses, whether sound A or B was played.
    My problem is the following: when I perform decoding with cross-validation on my original data set, I observe below-chance average accuracy - around 30%.
    If I instead randomize the sound labels (A or B) across all the trials and perform the decoding again, which gives me my null distribution, the result is centered on chance and not skewed, so the average accuracy is 50% as expected.
    I am finding it hard to understand how the original data gives below-chance accuracy but the randomized data does not.

    1. Ok, that helped. :) This is definitely not normal behavior; something is wrong. I assume you've already seen http://mvpa.blogspot.com/2013/04/below-chance-classification-accuracy.html. While that post is getting old now, the advice is still current: it seems likely that something is mislabeled or otherwise went wrong in the processing. I'd focus on processing first: if you decode sound-or-not-sound, do you get good signal in auditory areas (i.e., a positive control)? If so, I'd then look closely at balance and cross-validation. For example, do you have the same number of A and B sounds in each cross-validation partition? Do the cross-validation partitions represent a sensible data grouping (e.g., scanner run, experimental session, person)? Good luck!
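
      A balance check of that kind is just a cross-tabulation; for example (a sketch with made-up label and partition vectors):

      sound     <- sample(rep(c("A","B"), 20));   # made-up labels for 40 trials
      partition <- rep(1:4, each=10);             # made-up 4-fold partition assignment
      table(sound, partition);                    # want equal (or near-equal) A/B counts in every partition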

  6. Dear Jo, just wanted to thank you for this very clear explanation and demonstration. This is a problem I'm encountering with certain SVM training/testing schemes but not others. Any ideas why, say, polynomial (as opposed to linear) classifiers may be more susceptible to this problem?

    1. For the polynomial vs. linear question, I'd guess the extra complexity (and room for overfitting) in the polynomial kernel. We're so close to the edge of what SVMs can handle with neuroimaging data (very similar features, more features than examples, etc.) that every extra bit hurts. It may also be that the "shape" of fMRI datasets in hyperspace is often more amenable to linear decision boundaries, though I have never investigated that.
