Thursday, April 18, 2013

below-chance classification accuracy

This topic comes up repeatedly with MVPA, so I thought I'd start collecting some of my thoughts on the blog. As an introduction, what follows started as a version of something I posted to the pyMVPA mailing list some time ago. This topic has been discussed multiple times on mailing lists, and googling "below-chance accuracy" brings up some useful links.

My opinion in short: Sorting out below-chance accuracy is really vexing, but I am highly skeptical that it represents anti-learning (or other task-related information) in the case of MVPA. Impossible? No, but I'd want to see very impressive documentation before I'd accept such a conclusion.

"below-chance accuracy"?

By "below-chance accuracy" I specifically mean classification accuracy that is reliably worse than it should be, such as some subjects classifying at 0.3 accuracy when chance is 0.5. It can even turn up in permutation tests: the permutation null distribution is nicely centered on chance and reasonably normal, but the true-labeled accuracy falls far into the left tail (and so is significantly below chance).
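As a concrete illustration, here is a toy permutation test in plain Python. The nearest-mean classifier and the 20-example noise dataset are invented for the sketch (not real fMRI data): shuffle the labels many times to build a null distribution centered on chance, then ask where the true-labeled accuracy falls relative to it.

```python
import random

random.seed(0)

def nearest_mean_accuracy(xs, ys):
    """Leave-one-out accuracy of a toy nearest-class-mean classifier."""
    correct = 0
    for i in range(len(xs)):
        tr_x = xs[:i] + xs[i + 1:]
        tr_y = ys[:i] + ys[i + 1:]
        means = {}
        for lab in set(tr_y):
            vals = [x for x, y in zip(tr_x, tr_y) if y == lab]
            means[lab] = sum(vals) / len(vals)
        pred = min(means, key=lambda lab: abs(xs[i] - means[lab]))
        correct += (pred == ys[i])
    return correct / len(xs)

# toy data: 20 noise examples, two balanced classes
xs = [random.gauss(0, 1) for _ in range(20)]
ys = [0] * 10 + [1] * 10

true_acc = nearest_mean_accuracy(xs, ys)

# permutation null: re-run the whole analysis with shuffled labels
null = []
for _ in range(500):
    perm = ys[:]
    random.shuffle(perm)
    null.append(nearest_mean_accuracy(xs, perm))

# the null should center on chance (0.5); a true accuracy far into the
# left tail is the "significantly below chance" pattern described above
p_left = sum(a <= true_acc for a in null) / len(null)
print(true_acc, sum(null) / len(null), p_left)
```

The point is only the mechanics: the null distribution centers on chance even when the true-labeled accuracy ends up in either tail.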

I don't have a complete explanation for this (and would be very interested to see one), but I tend to think it has to do with data that doesn't form a linear-SVM-friendly shape in hyperspace. We often don't have a huge number of examples in MVPA datasets, which can also make classification results unstable. We're so often at the edge of what the algorithms can handle (tens of examples and thousands of voxels, complex correlation structures, etc.) that we see a lot of strange things.

what to do?

FIRST: Check the dataset. Are there mislabeled examples (it's happened to me!) or faulty preprocessing? Look at the preprocessed data: anything strange in the voxel timecourses? Basically, go through the data processing stream, spot-checking as much as possible to make sure everything going into the classifier seems sensible.

Then look at stability, such as performance over cross-validation folds and subjects. How much variation is there across the cross-validation folds? Things often seem to go better when all the folds have fairly similar accuracies (0.55, 0.6, 0.59, ...) rather than widely variable ones (0.5, 0.75, 0.6, ...).

Often, I've found the worst below-chance difficulties in cases where stability is poor: a great deal of variation in accuracy over cross-validation folds, or large changes in accuracy with small changes in the classification (or preprocessing) procedure. If this seems to be happening, it can be sensible to try reasonable changes to the classification procedure to see if things improve. For example, try leaving two or three runs out instead of just one for the cross-validation, since a small testing set can introduce a lot of variance across the folds. Or try a different temporal compression scheme (e.g. average more timepoints or repetitions together).
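A sketch of what leaving two runs out looks like in code. The six runs of ten examples each are hypothetical, and the fold accuracies at the end are example numbers (not real data), just to show checking the spread:

```python
from itertools import combinations
from statistics import mean, stdev

# hypothetical dataset: 6 runs, each holding 10 example indices
runs = {r: list(range(r * 10, (r + 1) * 10)) for r in range(6)}

def leave_p_runs_out(runs, p):
    """Yield (train_idx, test_idx) pairs with p whole runs held out per fold."""
    for held_out in combinations(runs, p):
        test_idx = [i for r in held_out for i in runs[r]]
        train_idx = [i for r in runs if r not in held_out for i in runs[r]]
        yield train_idx, test_idx

folds = list(leave_p_runs_out(runs, 2))
print(len(folds))  # → 15  (6 choose 2 folds, each with a 20-example test set)

# after classifying each fold, look at the spread of the fold accuracies;
# a large stdev is the instability warning sign discussed above
fold_accs = [0.55, 0.60, 0.59, 0.52, 0.58]  # made-up accuracies for illustration
print(round(mean(fold_accs), 3), round(stdev(fold_accs), 3))
```

Holding out runs (rather than individual trials) also respects the temporal structure of fMRI data, since examples within a run are not independent.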


I am very aware that this type of "trying" can be dangerous: the experimenter degrees of freedom start exploding. But there often seems to be little choice. In practice, a particular MVPA should start with a set of well-considered choices (which classifier? which cross-validation scheme? how to do scaling? how to do temporal compression?), but these choices are usually based far more on guesswork than on objective criteria. Until such objective criteria exist, I have the most faith in a classification result when it appears across several reasonable analysis schemes.

In practice, we should think carefully and attempt to design the best possible analysis scheme before looking at the data. (And this assumes a very concrete experimental hypothesis is in place! Make sure you have one!) If we start seeing signs that the scheme is not good (like high variability or below-chance accuracy), we should try several other reasonable schemes, attempting to find something that yields better consistency.

Ideally, we do this "optimization" on a separate part of the dataset (e.g. a subset of the subjects), or on something other than the actual hypothesis. For example, suppose the experiment was designed to test some cognitive hypothesis, and participants made responses with a button box. We can then use the button-pushing to optimize some aspects of the procedure: can we classify which button was pushed (or when the buttons were pushed) in motor and somatosensory areas? If not, something is seriously wrong with the dataset (labeling, preprocessing, etc.), and it should be fixed (so that button classification is possible) before attempting the classification we actually care about. This is of course not a perfect procedure, but it can catch some serious problems, and it reduces peeking at the real data a little (since we don't actually care about the button-pushes or motor cortex).

what not to do

You should certainly not simply reverse the accuracy of below-chance subjects (so that 0.4 becomes 0.6 if chance is 0.5), nor omit below-chance subjects. Either is likely to make nearly any dataset (even pure noise) appear to classify well, particularly since fMRI data is usually so variable. No double-dipping! There may be cases where such steps are proper, but they need to be very, very tightly controlled, and certainly not taken at the first sign of below-chance accuracy.
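A quick simulation shows why: even on pure noise, flipping or omitting below-chance subjects pushes the group mean above chance. The subject and trial counts here are arbitrary:

```python
import random
from statistics import mean

random.seed(1)

n_subjects, n_trials = 20, 40

# simulate pure-noise "subjects": each accuracy is binomial around chance (0.5)
accs = [sum(random.random() < 0.5 for _ in range(n_trials)) / n_trials
        for _ in range(n_subjects)]

flipped = [max(a, 1 - a) for a in accs]   # reverse every below-chance subject
omitted = [a for a in accs if a >= 0.5]   # drop every below-chance subject

print(round(mean(accs), 3))      # near 0.5, as it should be for noise
print(round(mean(flipped), 3))   # at or above 0.5, despite pure noise
print(round(mean(omitted), 3))   # also at or above 0.5
```

Both "corrections" can only move the group mean upward, so they manufacture apparent classification out of nothing.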

UPDATE [6 May 2016]: see this post about Jamalabadi et al. (2016) for a possible explanation of how below-chance accuracy occurs.


  1. Thanks for the thoughtful post on the topic. I've also noticed occasional very-significant below-chance results; I assume it has something to do with the order of the trials or something similar. I also like your suggestion on how to deal with the dangers of double-dipping: I too start by trying to decode something trivial and unrelated to the hypothesis, to check that my regressor timings etc. are ok without having to look at the real hypothesis.

    1. Thanks! It'd be great to have proper theory-based guidance, but when we don't, sharing and trying to improve our best guesses/standard practices seems sensible.

  2. Why can't you reverse the output? If you view the decoder as a "black box" that is trying its best to make use of whatever structure might be in the data, why can one not make a new classifier that takes the existing learned classifier and flips its output? This would still be a perfectly valid learning algorithm to use on the data, and would result in flipped accuracy values.

    1. Yes, you could decide before conducting an analysis to reverse the accuracy anytime you get below-chance accuracy, which would be like having an additional classifier. But I suspect that would really increase the number of false positive findings (and we already have way too many false positives) - you'd have to convince me that it doesn't.

      Also, *when* would you decide whether a person is "flip-label" or "real-label"? Deciding *after* seeing that the test-set accuracy is below-chance seems way too much like cheating to me. I suppose you could look at the training-set accuracy, but that is usually above-chance.

  3. Hey, I have come across a crazy scenario. I had 90 datapoints, 45 from class A and 45 from class B. I was doing leave-one-out cross-validation, and I was getting 50% training accuracy and 0% test accuracy. I was puzzled like crazy. I started googling and came across your nice post. Great to see that I am not the only one confused about such things. But hey, I figured it out! If the leave-one-out strategy selected the test point from group A, then the training set would consist of 44 points from group A and 45 from group B. Since the labels were completely unpredictable from the data (at least using the classifier I used), the next best thing it could converge to was to vote for the most prevalent class. So we would get 0% test accuracy, because for every test point the remaining training set would be imbalanced in the opposite way! I was so shocked. Anyway, I have read that there are ways of combating this behaviour, namely, oversampling the training set to make the classes equal before training. Hope this is useful to somebody.
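This scenario is easy to reproduce in a few lines. When the labels are unpredictable from the data, the classifier degenerates to a majority vote over the training labels, and leave-one-out on a balanced dataset then gets every test point wrong:

```python
# leave-one-out with a majority-vote classifier on unpredictable data:
# every training fold is imbalanced *against* the held-out class
labels = ['A'] * 45 + ['B'] * 45

correct = 0
for i, true_label in enumerate(labels):
    train = labels[:i] + labels[i + 1:]        # 44 of one class, 45 of the other
    majority = max(set(train), key=train.count)
    correct += (majority == true_label)

print(correct / len(labels))  # → 0.0
```

Each held-out example leaves its own class in the minority, so the majority vote is wrong on every single fold: 0% test accuracy on data that contains no signal at all.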

    1. Thanks for the comment! Balance in the training set (equal numbers of each class) is *so* important. For fMRI datasets I often suggest subsetting the larger class rather than duplicating members of the smaller class, but I believe both can be appropriate.