My opinion in short: Sorting out below-chance accuracy is really vexing, but I am highly skeptical that it represents anti-learning (or other task-related information) in the case of MVPA. Impossible? No, but I'd want to see very impressive documentation before I'd accept such a conclusion.
"below-chance accuracy"?
By "below-chance accuracy" I specifically mean classification accuracy that is worse than it should be, such as some subjects classifying at 0.3 accuracy when chance is 0.5. It can even turn up in permutation tests: the null distribution is nicely centered on chance and reasonably normal, but the true-labeled accuracy falls far into the left tail (and so is significantly below chance).
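To make the permutation-test case concrete, here is a minimal sketch of how a true-labeled accuracy can land in the left tail. The null accuracies are simulated from a normal distribution purely for illustration (they are not from any real dataset), the function name `permutation_p_values` is hypothetical, and the "+1" correction is one common convention for permutation p-values, not necessarily what any given MVPA package does.

```python
import numpy as np

def permutation_p_values(true_acc, null_accs):
    """Return (lower-tail, upper-tail) p-values for a true-labeled accuracy.

    null_accs: accuracies obtained after permuting the labels.
    The +1 in numerator and denominator counts the true labeling
    as one of the possible relabelings (an assumed convention).
    """
    null_accs = np.asarray(null_accs, dtype=float)
    n = null_accs.size
    p_lower = (np.sum(null_accs <= true_acc) + 1) / (n + 1)
    p_upper = (np.sum(null_accs >= true_acc) + 1) / (n + 1)
    return p_lower, p_upper

# Illustrative null distribution: centered on chance (0.5), roughly normal.
rng = np.random.default_rng(0)
null_accs = rng.normal(loc=0.5, scale=0.05, size=1000)

# A true-labeled accuracy of 0.3, as in the example in the text.
p_lower, p_upper = permutation_p_values(0.3, null_accs)
print(f"lower-tail p = {p_lower:.4f}, upper-tail p = {p_upper:.4f}")
```

A small lower-tail p-value together with a large upper-tail p-value is the "significantly below chance" pattern described above.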
I don't have a complete explanation for this (and would be very interested to see one), but I tend to think it has to do with data that doesn't make a linear-SVM-friendly shape in hyperspace. Often we don't have a huge number of examples in MVPA datasets, which can also make classification results unstable. We're so often on the edge of what the algorithms can handle (tens of examples and thousands of voxels, complex correlation structures, etc.) that we see a lot of strange things.
what to do?
FIRST: Check the dataset. Are there mislabeled examples (it's happened to me!) or faulty preprocessing? Look at the preprocessed data. Anything strange in the voxel timecourses? Basically, go through the data processing streams, spot-checking as much as possible to make sure everything going into the classifier seems sensible.
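One simple spot-check along these lines is to tabulate how many examples of each class appear in each run. The sketch below assumes a design that is supposed to be balanced within every run, which may not hold for your experiment; the function name `unbalanced_runs` and the toy labels are hypothetical. A flagged run is not proof of mislabeling, only a place to look first.

```python
from collections import Counter

def unbalanced_runs(labels, runs):
    """List runs whose class counts are not all equal.

    labels: class label for each example.
    runs:   run number for each example (same length as labels).
    Assumes the design is meant to be balanced within each run.
    """
    counts = Counter(zip(runs, labels))
    classes = sorted(set(labels))
    flagged = []
    for run in sorted(set(runs)):
        per_class = [counts[(run, c)] for c in classes]
        if len(set(per_class)) > 1:  # counts differ between classes
            flagged.append((run, dict(zip(classes, per_class))))
    return flagged

# Toy example: run 2 has three "a" examples and only one "b".
runs = [1, 1, 1, 1, 2, 2, 2, 2]
labels = ["a", "b", "a", "b", "a", "a", "a", "b"]
print(unbalanced_runs(labels, runs))
```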
Then look at stability, such as the performance over cross-validation folds and subjects. How much variation is there across the cross-validation folds? Things often seem to go better when all the cross-validation folds have fairly similar accuracies (0.55, 0.6, 0.59, ...) rather than widely variable ones (0.5, 0.75, 0.6, ...).
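A quick way to put numbers on fold stability is to summarize the spread of the per-fold accuracies. This is only a sketch: the function name `fold_stability` is hypothetical, and it uses just the three accuracies listed for each case above (the originals trail off with "..."), so the values are illustrative. There is no agreed cutoff for "too variable"; the point is to compare schemes or subjects against each other.

```python
from statistics import mean, stdev

def fold_stability(fold_accuracies):
    """Summarize how much accuracy varies across cross-validation folds."""
    accs = list(fold_accuracies)
    return {
        "mean": mean(accs),
        "sd": stdev(accs),               # sample standard deviation over folds
        "range": max(accs) - min(accs),  # best fold minus worst fold
    }

stable = fold_stability([0.55, 0.60, 0.59])
variable = fold_stability([0.50, 0.75, 0.60])
print("stable:  ", stable)
print("variable:", variable)
```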
I've often found the most below-chance accuracy difficulties in cases where stability is poor: a great deal of variation in accuracy over cross-validation folds, or large changes in accuracy with small changes in the classification (or preprocessing) procedure. If this seems to be happening, it can be sensible to try reasonable changes to the classification procedure to see if things improve. For example, try leaving two or three runs out instead of just one for the cross-validation, since a small testing set can produce a lot of variance across the cross-validation folds. Or, perhaps try a different temporal compression scheme (e.g. average more timepoints or repetitions together).
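As a sketch of the leave-several-runs-out idea, the generator below builds train/test index lists for every combination of k held-out runs. The name `leave_k_runs_out` is hypothetical and the run layout is a toy example. Note that the number of folds grows quickly with the number of runs, and the folds overlap more than in leave-one-run-out, so this is a partitioning illustration rather than a recommendation for any particular dataset.

```python
from itertools import combinations

def leave_k_runs_out(run_labels, k=2):
    """Yield (train_idx, test_idx) pairs, holding out k runs per fold.

    run_labels: run number for each example, in example order.
    """
    runs = sorted(set(run_labels))
    for test_runs in combinations(runs, k):
        test_idx = [i for i, r in enumerate(run_labels) if r in test_runs]
        train_idx = [i for i, r in enumerate(run_labels) if r not in test_runs]
        yield train_idx, test_idx

# Toy layout: 4 runs with 2 examples each.
run_labels = [1, 1, 2, 2, 3, 3, 4, 4]
folds = list(leave_k_runs_out(run_labels, k=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With four runs and k=2 there are six folds, and each test set holds four examples instead of the two that leave-one-run-out would give.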
yikes!
I am very aware that this type of "trying" can be dangerous - the experimenter degrees of freedom start exploding. But there often seems to be little choice. In practice, a particular MVPA should start with a set of well-considered choices (which classifier? which cross-validation scheme? how to do scaling? how to do temporal compression?). But these choices are usually based far more on guesswork than objective criteria. Until such objective criteria exist, I have the most faith in a classification result when it appears in several reasonable analysis schemes.
In practice, we should think carefully and attempt to design the best-possible analysis scheme before looking at the data. (And this assumes a very concrete experimental hypothesis is in place! Make sure you have one!) If we start seeing signs that the scheme is not good (like high variability or below-chance accuracy) we should try several other reasonable schemes, attempting to find something that yields better consistency.
Ideally, we do this "optimization" on a separate part of the dataset (e.g. a subset of the subjects), or on something other than the actual hypothesis. For example, suppose the experiment was done to test some cognitive hypothesis, and the people made responses with a button box. We can then use the button-pushing for optimizing some aspects of the procedure: Can we classify which button was pushed (or when the buttons were pushed) in motor and somatosensory areas? If not, something is seriously wrong with the dataset (labeling, preprocessing, etc.) and that should be fixed (so button classification is possible) before attempting the actual classification we care about. This is of course not a perfect procedure, but it can catch some serious problems, and it reduces how much we look at the real data a little bit (since we don't actually care about the button-pushes or motor cortex).
what not to do
You should certainly not simply reverse the accuracy on below-chance subjects (i.e. so 0.4 becomes 0.6 if chance is 0.5), nor omit below-chance subjects. That is likely to cause nearly any dataset (even just noise) to classify well, particularly since fMRI data is usually so variable. No double-dipping! There may be cases when such actions are proper, but they need to be very, very tightly controlled, and certainly not done at the first sign of below-chance accuracy.
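A small simulation shows why reversing or dropping below-chance subjects is so misleading. Everything here is an assumption made for illustration: 20 subjects, 40 test examples each, and accuracies drawn as pure binomial noise with chance at 0.5, so there is no real signal to find.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_test = 20, 40  # illustrative sizes, not from any real study

# Pure noise: each subject's accuracy is a binomial draw with p = 0.5.
acc = rng.binomial(n_test, 0.5, size=n_subjects) / n_test

# "Reversing" below-chance subjects: 0.4 becomes 0.6, and so on.
flipped = np.where(acc < 0.5, 1.0 - acc, acc)

# Omitting below-chance subjects instead.
kept = acc[acc >= 0.5]

print(f"honest group mean:        {acc.mean():.3f}")
print(f"mean after reversing:     {flipped.mean():.3f}")
print(f"mean after omitting:      {kept.mean():.3f} ({kept.size} subjects kept)")
```

After reversing, every subject is at or above 0.5 by construction, so the group mean is pushed above chance even though the data contain nothing but noise. Omitting subjects biases the mean in the same direction.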
UPDATE [6 May 2016]: see this post about Jamalabadi et al. (2016) for a possible explanation of how below-chance accuracy occurs.