The paper describes a few ways in which permutation testing is commonly done in MVPA, focusing on how the cross-validation folds (aka data partitions) are accounted for. If you're not familiar with why cross-validation is relevant for MVPA, I think this is a pretty readable statistically-oriented introduction; you could also try a paper I wrote a few years ago (Etzel, J.A., Gazzola, V., Keysers, C., 2009. An introduction to anatomical ROI-based fMRI classification analysis. Brain Research 1282, 114-125.), or just google.
two permutation schemes
One goal of the paper is to point out that "we did a permutation test" is not a sufficient description for MVPA, since there are many reasonable ways to set up permutation tests. We use the terms "dataset-wise" and "fold-wise" to describe two common schemes. Since these terms aren't standard, we illustrate the two schemes with a running example.
introduction: running example
"This person completed three runs of fMRI scanning, each of which contained three blocks each of two different tasks. These task blocks were presented with sufficient rest intervals to allow the task-related BOLD signal to return to baseline, making it reasonable to assume that the task labels can be permuted [2, 3]. We further assume that the image preprocessing (motion correction, etc.) was adequate to remove most linear trends and uninteresting signals. Temporal compression  was performed, so that each task block is represented in the final dataset as a single labeled vector of voxel values (Fig. 1). There are n entries in each vector, corresponding to the voxels falling within an anatomically-defined region of interest (ROI). We assume that n is small enough (e.g. 100) that further feature selection is not necessary.
We wish to use a classification algorithm (e.g. linear support vector machines) to distinguish the two tasks, using all n voxels listed in the dataset. For simplicity, we will partition the data on the runs (three-fold CV): leave out one run, train on the two remaining runs, and repeat, leaving out each run in turn. The three test set accuracies are then averaged to obtain the overall classification accuracy (Fig. 2), which, if greater than chance, we interpret as indicating that the voxels’ BOLD varied with task."
I carry this way of illustrating cross-validation and classification through the later figures. The white-on-black color indicates that these examples have the true task labels: the numbers in the circles (which are 'seen' by the classifier) match those of the third (task) column in the dataset.
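To make the running example concrete, here is a minimal Python sketch of the leave-one-run-out cross-validation with the true labels. It is not the code from the demo. The data are simulated, the `signal` strength is an arbitrary choice, and a nearest-centroid rule stands in for the linear SVM purely to keep the example dependency-light; all function names are made up for illustration.

```python
import numpy as np

def make_dataset(n_runs=3, blocks_per_task=3, n_voxels=100, signal=0.5, seed=0):
    """Simulated stand-in for the running example: one row per task block."""
    rng = np.random.default_rng(seed)
    runs = np.repeat(np.arange(n_runs), 2 * blocks_per_task)
    tasks = np.tile(np.repeat([0, 1], blocks_per_task), n_runs)
    data = rng.normal(size=(runs.size, n_voxels))
    data[tasks == 1] += signal  # assumption: task 1 shifts every voxel a bit
    return data, tasks, runs

def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test row to the class whose training mean is closer."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def leave_one_run_out(data, labels, runs):
    """Three-fold CV on the runs; returns the per-fold test accuracies."""
    accs = []
    for held_out in np.unique(runs):
        train, test = runs != held_out, runs == held_out
        pred = nearest_centroid_predict(data[train], labels[train], data[test])
        accs.append(float((pred == labels[test]).mean()))
    return accs

data, tasks, runs = make_dataset()
fold_acc = leave_one_run_out(data, tasks, runs)
overall_acc = float(np.mean(fold_acc))  # average of the three test accuracies
print(fold_acc, overall_acc)
```

Each fold tests on the six examples of one run, so every fold accuracy is a multiple of 1/6.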
permutation schemes
Now, the permutation testing. We need to put new task labels on, but where? There are 20 ways of reordering the task labels within a run (three blocks each of two tasks), shown in the figures as colored circles on a light-grey background.
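The count of 20 is what you get if the relabeling is done within a single run of the running example: choosing which three of the six blocks get the first task label gives "6 choose 3" possibilities. A short Python sketch, assuming that within-run reading (the label names "a" and "b" are placeholders):

```python
from itertools import combinations

blocks_per_run = 6   # three blocks each of two tasks
blocks_per_task = 3

relabelings = []
for task_a_positions in combinations(range(blocks_per_run), blocks_per_task):
    labels = ["b"] * blocks_per_run
    for pos in task_a_positions:
        labels[pos] = "a"
    relabelings.append(tuple(labels))

print(len(relabelings))  # 20, counting the true ordering as one of them
```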
Under the dataset-wise scheme we put the new task labels on before the cross-validation, carrying the new labels through the cross-validation folds. Figure 4 shows how this works when both the training and testing sets are relabeled, while Figure 5 shows how it works when only the training sets are relabeled.
Note that the dataset's structure is maintained under the dataset-wise permutation scheme when both training and testing sets are relabeled (Figure 4 has the same pattern of arrows as Figure 2). Some of the arrows are shared between Figure 5 and Figure 2, but the property (in the real data) that each labeling is used in a test set is lost.
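The bookkeeping for one dataset-wise permutation can be sketched in Python as follows. This only tracks which labels each cross-validation fold would see (no classifier), assumes labels are shuffled within each run, and uses made-up names (`permute_run`, `dataset_wise_folds`, `relabel_test`); it is an illustration, not the demo's implementation.

```python
import random

TRUE_RUN_LABELS = ["a", "a", "a", "b", "b", "b"]  # three blocks each of two tasks
RUNS = [1, 2, 3]

def permute_run(rng):
    """Return one shuffled copy of a run's task labels."""
    labels = list(TRUE_RUN_LABELS)
    rng.shuffle(labels)
    return labels

def dataset_wise_folds(rng, relabel_test=True):
    """Labels seen by each fold for ONE dataset-wise permutation."""
    # Relabel every run once, before cross-validation starts.
    permuted = {run: permute_run(rng) for run in RUNS}
    folds = []
    for test_run in RUNS:
        train = {run: permuted[run] for run in RUNS if run != test_run}
        # Figure 4 style: the test run is relabeled too.
        # Figure 5 style: the test run keeps its true labels.
        test = permuted[test_run] if relabel_test else list(TRUE_RUN_LABELS)
        folds.append({"test_run": test_run, "train": train, "test": test})
    return folds

both = dataset_wise_folds(random.Random(0), relabel_test=True)
train_only = dataset_wise_folds(random.Random(0), relabel_test=False)

# Run 1 is a training run in the folds that test on run 2 and on run 3,
# and it carries the same permuted labels in both.
print(both[1]["train"][1] == both[2]["train"][1])  # True
```

With `relabel_test=True`, the labels run 1 has as a test set are also the labels it has as a training run in the other folds, which is the structure shared with the true-label analysis.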
Under the fold-wise permutation scheme we put the new task labels on during the cross-validation. Figure 6 shows this for relabeling the training data only, as suggested in the pyMVPA documentation. Figure 6 has a similar structure to Figure 5, but the coloring is different: under the dataset-wise scheme a run is given the same set of permuted labels in all training sets in a permutation, while under the fold-wise scheme each run gets a different set of permuted labels (i.e. in Figure 5 run 1 is purple in both the first and second cross-validation folds, while in Figure 6 run 1 is purple in the first and red in the second).
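For comparison, a minimal Python sketch of the label bookkeeping for one fold-wise permutation, relabeling the training data only. As before, this tracks labels rather than running a classifier, assumes within-run shuffling, and uses hypothetical names; it is not PyMVPA's code or the demo's.

```python
import random

TRUE_RUN_LABELS = ["a", "a", "a", "b", "b", "b"]  # three blocks each of two tasks
RUNS = [1, 2, 3]

def permute_run(rng):
    """Return one shuffled copy of a run's task labels."""
    labels = list(TRUE_RUN_LABELS)
    rng.shuffle(labels)
    return labels

def fold_wise_folds(rng):
    """Labels seen by each fold for ONE fold-wise permutation (training only)."""
    folds = []
    for test_run in RUNS:
        # Fresh relabeling for each training run, drawn inside this fold.
        train = {run: permute_run(rng) for run in RUNS if run != test_run}
        folds.append({"test_run": test_run,
                      "train": train,
                      "test": list(TRUE_RUN_LABELS)})  # test labels untouched
    return folds

folds = fold_wise_folds(random.Random(0))
for fold in folds:
    print(fold["test_run"], fold["train"])
```

Here run 1's labels in the fold testing run 2 and in the fold testing run 3 come from separate draws, so they will usually differ (two independent shuffles coincide only 1 time in 20).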
does the permutation scheme matter?
Yes, at least sometimes. We are often dealing with such small, messy datasets in MVPA that shifting the rank by just a few values can really matter. The simulations in the little demo show a few (tweakable) cases.
Here are the first two repetitions of the demo (this is in a more readable format in the pdf in the demo). In the first, the true-labeled accuracy was 0.64 (vertical line). The p-values were: both.dset (permuting training and testing labels dataset-wise, à la Figure 4) = 0.013; train.dset (dataset-wise, permuting training only, à la Figure 5) = 0.002; both.fold (permuting both, fold-wise) = 0.001; and train.fold (fold-wise, permuting training only, à la Figure 6) = 0.005. On repetition 2, the p-values were: both.dset = 0.061, train.dset = 0.036, both.fold = 0.032, and train.fold = 0.028. That gives us p-values above and below the 'magical' value of 0.05, depending on how we do the permutation test.
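As a reminder of why the null distribution's spread changes the p-value, here is a small Python sketch of turning a null distribution into a p-value. The convention used (counting the true labeling as one of the permutations, so the p-value is never exactly zero) is a common one but is an assumption here; the demo may compute it differently. The null accuracies are invented for illustration and are not from the demo; only the 0.64 comes from the first repetition.

```python
def permutation_p_value(true_accuracy, null_accuracies):
    """Proportion of labelings (true one included) scoring at least as well."""
    n_at_least = sum(acc >= true_accuracy for acc in null_accuracies)
    return (n_at_least + 1) / (len(null_accuracies) + 1)

# Made-up null accuracies, one per permuted labeling (illustration only).
null_accuracies = [0.44, 0.50, 0.56, 0.61, 0.67, 0.39, 0.50, 0.47, 0.53]

p = permutation_p_value(0.64, null_accuracies)
print(p)  # 0.2: one null value (0.67) reaches 0.64, so (1 + 1) / (9 + 1)
```

A wider null distribution puts more permuted accuracies at or above the true accuracy, which raises the p-value for the same true result.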
final thoughts
Which permutation scheme should we use for MVPA? Well, I don't know a universally-applicable answer. As we know, how simulated datasets are created can really, really matter, and I certainly don't claim this little demo is representative of true fMRI data. That said, the pattern in null distributions above - larger variance on dataset-wise than fold-wise schemes (and so a more stringent test) - is common in my experience, and unsurprising. It seems clear that more of the dataset structure is kept under dataset-wise schemes, which is consistent with the goal of matching the permutation test as closely as possible to the true data analysis.
My feeling is that dataset-wise permutation schemes, particularly permuting both the training and testing sets (Fig. 4), are the most rigorous test for a dataset like the little running example here. Dataset-wise permuting of either just the training or just the testing set labels may be preferable in some cases, such as 'cross' classification, when the training and testing datasets are not runs but rather independent datasets (e.g. different experimental conditions or acquisition days).
I don't think that fold-wise schemes should be recommended for general use (datasets like the running example), since some of the dataset structure (similarity of training data across cross-validation folds) present in the true data is lost.
UPDATE (20 June 2013): Here is a link to a copy of my poster, plus expanded simulation code.
UPDATE (22 July 2013): Here is a link to a copy of the paper, as well.
UPDATE (6 November 2013): The paper is now up at IEEE and has the DOI 10.1109/PRNI.2013.44.