Monday, April 25, 2016

"Classification Based Hypothesis Testing in Neuroscience"

There's a lot of interesting MVPA methodology in a recent paper by Jamalabadi et al., with the long (but descriptive) title "Classification Based Hypothesis Testing in Neuroscience: Below-Chance Level Classification Rates and Overlooked Statistical Properties of Linear Parametric Classifiers". I'll focus on the below-chance classification part here, and hopefully get to the permutation testing parts in detail in another post. For a very short version: I have no problem at all with their advice to report p-values and null distributions from permutation tests to evaluate significance, and I agree that accuracy alone is not sufficient, but they have some very oddly-shaped null distributions, which make me wonder about their permutation scheme.
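
To make "report p-values and null distributions from permutation tests" concrete, here is roughly the sort of thing I mean (a minimal sketch using a toy one-dimensional nearest-mean classifier, not the permutation scheme from the paper or any particular real analysis):

 # toy leave-one-out classifier: assign each example to the class whose
 # training-set mean it falls closest to, then return the overall accuracy
 nearest.mean.acc <- function(x, labels) {
   correct <- sapply(seq_along(x), function(i) {
     m1 <- mean(x[-i][labels[-i] == 1]);   # class-1 mean, without example i
     m2 <- mean(x[-i][labels[-i] == 2]);   # class-2 mean, without example i
     guess <- ifelse(abs(x[i] - m1) < abs(x[i] - m2), 1, 2);
     guess == labels[i];
   });
   mean(correct);
 }

 set.seed(42);   # arbitrary seed for this sketch
 x <- c(rnorm(12, mean=0.1), rnorm(12, mean=-0.1));   # weak two-class signal
 labels <- rep(1:2, each=12);

 true.acc <- nearest.mean.acc(x, labels);
 null.accs <- replicate(1000, nearest.mean.acc(x, sample(labels)));   # label-permutation null distribution
 p.value <- (sum(null.accs >= true.acc) + 1) / (length(null.accs) + 1);   # one-sided permutation p-value
 hist(null.accs, main="permutation null distribution"); abline(v=true.acc, col='red');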

Anyway, the below-chance discussion is mostly in the section "Classification Rates Below the Level Expected for Chance" and Figure 3, with proofs in the appendices. Jamalabadi et al. set up a series of artificial datasets, designed to vary in the amount of signal and the number of examples. They get many below-chance accuracies when "sample size and estimated effect size is low", which they attribute to "dependence on the subsample means":
 "Thus, if the test mean is a little above the sample mean, the training mean must be a little below and vice versa. If the means of both classes are very similar, the difference of the training means must necessarily have a different sign than the difference of the test means. This effect does not average out across folds, ....."
They use Figure 3 to illustrate this dependence in a toy dataset. That figure is really too small to see online, so here's a version I made (R code after the jump if you want to experiment).
This is a toy dataset with two classes (red and blue), 12 examples of each class. The red class is drawn from a normal distribution with mean 0.1; the blue, from a normal distribution with mean -0.1. The full dataset (at left) shows a very small difference between the classes: the mean of the blue class is a bit to the left of the mean of the red class (top row triangles); the vertical line marks the midpoint between the two class means.

Following Jamalabadi et al.'s Figure 3, I then did a three-fold cross-validation, leaving out four examples of each class per fold. One of the folds is shown in the right image above; the four left-out examples in each class are marked with a black x. The diamonds are the means of the training set (the eight not-crossed-out examples in each class). The crossed diamonds are the means of the test set (the four crossed-out examples in each class), and they are flipped: the blue test mean is on the red side of the line, and the red test mean on the blue side. Looking at the position of the examples, all four of the blue test examples will be classified wrong, and three of the four red: an accuracy of 1/8, well below chance.
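
If you'd rather check that tally in code than by eye, the nearest-training-mean rule implied by the figure can be written as a short function. This is just a sketch, and it assumes ex1 and ex2 from the R code after the jump are already in your workspace:

 fold.accuracy <- function(train1, train2, test1, test2) {
   m1 <- mean(train1);   # training-set mean, class 1 (blue)
   m2 <- mean(train2);   # training-set mean, class 2 (red)
   correct1 <- abs(test1 - m1) < abs(test1 - m2);   # class-1 test examples nearer their own training mean?
   correct2 <- abs(test2 - m2) < abs(test2 - m1);   # class-2 test examples nearer their own training mean?
   (sum(correct1) + sum(correct2)) / (length(test1) + length(test2));
 }
 fold.accuracy(ex1[1:8], ex2[1:8], ex1[9:12], ex2[9:12]);   # the fold with examples 9:12 left out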

This is the "dependence on subsample means": pulling out the test set shifts the mean of the remaining examples (the training set) in the other direction, making performance worse (in the example above, the training-set means are further from zero than the full-dataset means). This won't matter much if the two classes are very distinct, but it can have a strong impact when they're similar (small effect size), as in this example (and many neuroimaging datasets).
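
A quick way to see this dependence numerically is to draw a single sample, repeatedly split it into training and test subsets, and compare how the two subset means shift relative to the full-sample mean (a minimal sketch, not code from the paper):

 set.seed(1);   # arbitrary seed for this demonstration
 x <- rnorm(12, mean=0.1);   # one class's worth of examples
 shifts <- t(replicate(1000, {
   test.ids <- sample(1:12, 4);                    # leave out 4 examples
   c(train=mean(x[-test.ids]) - mean(x),           # training-mean shift from the full mean
     test=mean(x[test.ids]) - mean(x));            # test-mean shift from the full mean
 }));
 cor(shifts[,"train"], shifts[,"test"]);   # exactly -1: 8*(train shift) + 4*(test shift) always sums to zero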

Is this an explanation for below-chance classification? Yes, I think it could be. It certainly fits well with my observation that below-chance results tend to occur when power is low, and that they should not be interpreted as anti-learning, but rather as a sign of poor performance. My advice for now remains the same: if you see below-chance classification, troubleshoot and try to boost power; but I think we now have a better understanding of how below-chance performance can happen.


Jamalabadi H, Alizadeh S, Schönauer M, Leibold C, & Gais S (2016). Classification based hypothesis testing in neuroscience: Below-chance level classification rates and overlooked statistical properties of linear parametric classifiers. Human Brain Mapping, 37(5), 1842-1855. PMID: 27015748

follow the jump for the R code to create the image above



 do.plot <- function(s1, s2, test1, test2) {  # plot one cv fold: s1, s2 are the training examples; test1, test2 the left-out (test) examples. ex1, ex2, col1, col2 are taken from the global environment.
  plot(x=0,y=0, xlim=c(-2.5,2.5), ylim=c(0.5,2), col='white', xlab='', ylab='', yaxt='n', main="a cv fold")  
  points(x=ex1, y=rep(1.25,length(ex1)), col=col1, cex=1.5);  
  points(x=ex2, y=rep(1.55,length(ex2)), col=col2, cex=1.5);  
  points(x=test1, y=rep(1.25, length(test1)), pch=4, cex=1.75); # x out omitted points  
  points(x=test2, y=rep(1.55, length(test2)), pch=4, cex=1.75);  
   
  points(x=mean(ex1), y=1.75, col=col1, cex=2, pch=2); # means for full dataset  
  points(x=mean(ex2), y=1.75, col=col2, cex=2, pch=2);  
  lines(x=rep((mean(ex1)+mean(ex2))/2, 2), y=c(-1,4))  
   
  points(x=mean(s1), y=1, col=col1, cex=2, pch=5);  # means of training dataset  
  points(x=mean(s2), y=1, col=col2, cex=2, pch=5);  
    
  points(x=mean(test1), y=0.75, col=col1, pch=9, cex=2) # means of testing dataset  
  points(x=mean(test2), y=0.75, col=col2, pch=9, cex=2)  
 }  
   
 set.seed(3463);  # seed for the example shown in the post
 ex1 <- rnorm(12, mean=-0.1);  # blue class examples
 ex2 <- rnorm(12, mean=0.1);   # red class examples
 col1 <- 'blue';
 col2 <- 'red';

 layout(matrix(1:4, nrow=2, ncol=2, byrow=TRUE));  # 2x2 grid: full dataset plus the three cv folds
 plot(x=0,y=0, xlim=c(-2.5,2.5), ylim=c(0.5,2), col='white', xlab='', ylab='', yaxt='n', main='full dataset')  
 points(x=ex1, y=rep(1.25,length(ex1)), col=col1, cex=1.5);  
 points(x=ex2, y=rep(1.55,length(ex2)), col=col2, cex=1.5);  
 points(x=mean(ex1), y=1.75, col=col1, cex=2, pch=2);  
 points(x=mean(ex2), y=1.75, col=col2, cex=2, pch=2);  
 lines(x=rep((mean(ex1)+mean(ex2))/2, 2), y=c(-1,4))  
   
 # the three cross-validation folds: four examples of each class left out per fold
 do.plot(ex1[1:8], ex2[1:8], ex1[9:12], ex2[9:12]);   # first fold (examples 9:12 left out)
 do.plot(ex1[c(1:4,9:12)], ex2[c(1:4,9:12)], ex1[5:8], ex2[5:8]);
 do.plot(ex1[5:12], ex2[5:12], ex1[1:4], ex2[1:4]);

11 comments:

  1. Hi Jo,

    "Anti-learning" is a very interesting topic as it occurs really often in fMRI MVPA, thanks for sharing you insights.

    I am dealing with a below-chance case, but in a different context: searchlights. There are a lot of voxels where we expect chance-level accuracy, so the distribution of the map should be centered around chance with a positive tail for the significant results. However, the distribution is centered more around 40-45%, with values extending down to 15%.
    We have 2 runs of the same task, but the resulting searchlight maps are not very similar, even when averaging across subjects.
    However, the far-below-chance accuracy blobs make sense given our task, so this seems to be information.
    As described in the paper, spurious above-chance accuracies also occur: for a few subjects the distribution is centered above chance.
    Also, permutations (on the full dataset) are similarly biased, so "above permutation chance" group mean accuracy can be lower than 50% and rarely exceeds 60%.

    Following the Jamalabadi paper, I tested both leave-one-out and 2-fold cross-validation, which reduces the variance of the distribution, but the map is quite similar: the same regions "of interest" show below-chance accuracy, and it is still centered around 45-48%.

    Our design has 4 classes, pseudo-randomized temporally using de Bruijn cycles, so each pair of successive stimuli is equally represented.
    However, the 4 classes are not equal, so we tested them 2 by 2; this gives two 2-class classification problems, but the balancing of temporally neighboring classes remains.
    The Gaussian Naive Bayes classifier is really sensitive to the XOR configuration, but it is one of the "fastest" options, especially when it comes to computing permutations.

    Do you think the design can be the problem?
    Or maybe the analysis is wrong?
    Is it really necessary to do "cross-run-validation"?
    If we only have 2 runs, can we repeat cross-validation by subsampling events from the 2 runs?

    Thanks.
    basile

    1. Sorry for the slow reply; this is not a simple question! It might be better to email me directly (jetzel@wustl.edu) instead of going through the comments. There are several issues here: biggish areas of very below-chance accuracy in the searchlight analysis, the analysis design, and permutation testing.

      Briefly, I view seriously below-chance MVPA (e.g., less than 45% when chance is 50%) as a sign that something is not right. Below-chance searchlight blobs would worry me as well; maybe it'd be ok if there are sensible positive areas (e.g., motor in a motor task) and the negative regions are confined to non-grey-matter areas (e.g., ventricles).

      It is not critically necessary to do leave-one-run-out cross-validation, but temporal dependence matters, so something like splitting the runs in half by time (first half vs. second half, not randomly) is generally a fairly safe alternative.
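
      For example, that kind of temporal split can be set up along these lines (just a sketch, with made-up run and trial-order vectors, not code from any particular analysis):

      run   <- rep(1:2, each=20);    # 2 runs of 20 trials each, in temporal order within run
      trial <- rep(1:20, times=2);   # trial position within its run
      partition <- ifelse(trial <= 10, "first.half", "second.half");  # split by time, not randomly
      table(run, partition);   # check: each run contributes trials to both partitions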

      I'm still fairly convinced that permutation-test null distributions should be approximately normal. I've emailed a bit with Steffen Gais, and the permutation test they did in Jamalabadi et al. is different from what I generally recommend. Hopefully we can make some of those details public soon.

  2. Hi
    I have been encountering similar problems.
    I am struggling to understand why data with low effect size can produce these heavily skewed distributions whereas the label-permuted null distributions tend to be well centered on 0. This is what I'm observing in data using an LDA.
    Could you offer some insight on this?
    Thanks
    Sophie

    1. Not sure what you're asking: are your null distributions well-centered?

  3. This comment has been removed by the author.

  4. Yes, the null distributions are centered on the expected chance level.

    1. I think I need a concrete example of the problem; if the null distributions are centered on chance and approximately normal, that's good. If they're on chance but highly skewed, perhaps check your cross-validation and label permutation schemes - is your relabeling respecting all of the dependencies in the dataset (e.g., scanner runs, individuals)?
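
      For example, one way to keep the relabeling within runs (just a sketch, with made-up run and label vectors):

      run    <- rep(1:2, each=10);     # made-up: 2 runs of 10 trials each
      labels <- rep(c("a","b"), 10);   # true class labels
      permuted <- labels;
      for (r in unique(run)) {         # shuffle labels separately within each run
        ids <- which(run == r);
        permuted[ids] <- sample(labels[ids]);
      }
      table(run, permuted);   # the label counts within each run are unchanged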

  5. I don't seem to be making myself clear!
    My data are neural responses to two sounds, and the decoder classifies, from the neural responses, whether sound A or B was played.
    My problem is the following: when I perform decoding with cross-validation on my original data set, I observe below-chance average accuracy - around 30%.
    If I instead randomize the sound labels (A or B) across all the trials and perform the decoding again, which gives me my null distribution, the result is centered on chance and not skewed, so the average accuracy is 50% as expected.
    I am finding it hard to understand how the original data gives below-chance accuracy but the randomized data does not.

    1. Ok, that helped. :) This is definitely not normal behavior; something is wrong. I assume you've already seen http://mvpa.blogspot.com/2013/04/below-chance-classification-accuracy.html. While that post is getting old now, the advice is still current: it seems likely that something is mislabeled or otherwise went wrong in the processing. I'd focus on processing first: if you decode sound-or-not-sound, do you get good signal in auditory areas (i.e., a positive control)? If so, I'd then look closely at balance and cross-validation. For example, do you have the same number of A and B sounds in each cross-validation partition? Do the cross-validation partitions represent a sensible data grouping (e.g., scanner run, experimental session, person)? Good luck!
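
      A balance check of that kind is just a cross-tabulation; for example (a sketch with made-up label and partition vectors):

      sound     <- sample(rep(c("A","B"), 20));   # made-up labels for 40 trials
      partition <- rep(1:4, each=10);             # made-up 4-fold partition assignment
      table(sound, partition);                    # want equal (or near-equal) A/B counts in every partition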

  6. Dear Jo, just wanted to thank you for this very clear explanation and demonstration. This is a problem I'm encountering with certain SVM training/testing schemes but not others. Any ideas why, say, polynomial (as opposed to linear) classifiers may be more susceptible to this problem?

    1. For the polynomial vs. linear question, I'd guess the extra complexity (and room for overfitting) in the polynomial kernel. We're so close to the edge of what SVMs can handle with neuroimaging data (very similar features, more features than examples, etc.) that every extra bit hurts. It may also be that the "shape" of fMRI datasets in hyperspace is often more amenable to linear decision boundaries, though I have never investigated that.
