This reminds me of the situation with searchlight shape: many different implementations of the same idea are possible, and we really need to be more specific when we report results: often papers don't specify the scheme they used.
Which permutation scheme is best? As usual, I doubt there is a single, all-purpose answer. I put together this little simulation to explore one facet of the choice: what do the null distributions look like under each label-permutation scheme? The short answer is that the null distributions look quite similar (roughly normal and centered on chance), but there is a strong relationship between the proportion of labels matching the true labeling and accuracy when only the training-set or only the testing-set labels are permuted; the relationship disappears when both are permuted.
simulation

These simulations use a simple MVPA-like dataset for one person: two classes (so chance is 0.5), 10 examples of each class in each run, two runs, and full balance (the same number of trials of each class in each run, no missings). I made the data by sampling from a normal distribution for each class: 50 voxels, standard deviation 1, mean 0.15 for one class and -0.15 for the other. I classified with a linear SVM (c=1), partitioning on the runs (so 2-fold cross-validation). I used R; email me for a copy of the code. I ran the simulation 10 times (10 different datasets), with the same datasets used for each permutation scheme.
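The setup could be sketched like this (the original code was in R and used a linear SVM; to keep this sketch self-contained I substitute a nearest-mean classifier, so the exact accuracies will differ, and all variable names are my own):

```python
import numpy as np

# Sketch of the simulated dataset: two runs, 10 examples per class per run,
# 50 voxels, sd 1, class means +0.15 and -0.15. Nearest-mean stands in for
# the linear SVM used in the original R code.
rng = np.random.default_rng(0)
N_PER_CLASS, N_RUNS, N_VOXELS = 10, 2, 50

def make_run():
    """One run: 10 examples of class 0 (mean 0.15), 10 of class 1 (mean -0.15)."""
    X = np.vstack([rng.normal(0.15, 1.0, (N_PER_CLASS, N_VOXELS)),
                   rng.normal(-0.15, 1.0, (N_PER_CLASS, N_VOXELS))])
    y = np.array([0] * N_PER_CLASS + [1] * N_PER_CLASS)
    return X, y

runs = [make_run() for _ in range(N_RUNS)]

def classify(train_X, train_y, test_X):
    """Assign each test example to the class with the nearer training centroid."""
    centroids = np.vstack([train_X[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def cv_accuracy(labels_by_run):
    """Partition on the runs: train on one run, test on the other, average."""
    accs = []
    for test_i in range(N_RUNS):
        train_i = 1 - test_i
        preds = classify(runs[train_i][0], labels_by_run[train_i],
                         runs[test_i][0])
        accs.append((preds == labels_by_run[test_i]).mean())
    return float(np.mean(accs))

acc = cv_accuracy([y for _, y in runs])   # accuracy with the true labels
print(acc)   # typically well above chance (0.5) for this signal strength
```

Running the whole simulation then amounts to repeating this for 10 datasets and, for each, calling `cv_accuracy` on many relabelings.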
I ran 1500 label permutations of each sort (training-only, testing-only, both), chosen at random from all those possible. I coded it so that the same relabeling was used for each of the cross-validation folds when only the training or only the testing labels were permuted (e.g. the classifier was trained on the 1st run of the real data, then the permuted labeling was applied to the 2nd run and the classifier tested; then the classifier was trained on the 2nd run of the real data, and the SAME permuted labeling applied to the 1st run for testing). This was simply for coding convenience, but it restricts the number of possible relabelings; another example of how the same idea can be implemented in multiple ways.
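The three relabeling schemes could be set up along these lines (a sketch with assumed names, not the original R code):

```python
import numpy as np

# The three schemes described above. When only the training or only the
# testing labels are permuted, ONE relabeling is drawn and reused in both
# cross-validation folds; in the "both" scheme each run gets its own.
rng = np.random.default_rng(1)
true_y = np.array([0] * 10 + [1] * 10)   # one run's labels, two classes

# "both" scheme: an independent relabeling for each run
both = [rng.permutation(true_y) for _ in range(2)]

# "testing-only" scheme: one relabeling, reused across folds.
# Fold 1: train on run 1 (true labels), test on run 2 with `relab`;
# fold 2: train on run 2 (true labels), test on run 1 with the SAME `relab`.
# (Swap the train/test roles of `relab` for the training-only scheme.)
relab = rng.permutation(true_y)
```

Permuting within a run keeps the design balanced: each relabeling still has 10 examples of each class.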
This is a moderately difficult classification: the average accuracy of the true-labeled (i.e. not permuted) data was 0.695, ranging from 0.55 to 0.775 over the 10 datasets. The accuracy of each dataset is given in the plot titles and marked by a reddish dotted line.
For each permutation I recorded both the accuracy and the proportion of labels matching between that permutation and the true labels. (When both training and testing labels are permuted, this is an average over the two cross-validation folds.) I plotted each permutation's classification accuracy against its proportion of matching labels and calculated the correlation; histograms of each variable appear along the axes. These graphs are complicated, but click to enlarge.
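The proportion-matching measure is simple to compute; a sketch (names assumed):

```python
import numpy as np

# The fraction of a permutation's labels that agree with the true labeling,
# over 1500 random relabelings of one run (10 examples per class).
rng = np.random.default_rng(2)
true_y = np.array([0] * 10 + [1] * 10)

perms = [rng.permutation(true_y) for _ in range(1500)]
prop_match = np.array([(p == true_y).mean() for p in perms])

# With 10 examples of each class, agreements come in pairs, so prop_match
# takes values 0, 0.1, ..., 1.0, centered near 0.5 for random relabelings.
# Given each permutation's accuracy, the reported correlation would be
# np.corrcoef(prop_match, accuracies)[0, 1].
print(prop_match.mean())
```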
Training set labels only permuted:
Here are density plots of all 30 null distributions, overplotted. These are the curves that appear along the y-axis in the above graphs.
observations

There is a strong linear relationship between the proportion of labels changed in a permutation and its accuracy when either the training- or the testing-set labels alone are shuffled: the more the relabeling resembles the true labels, the better the accuracy. When both sets of labels are permuted, there isn't much of a relationship.
Despite the strong correlation, the null distributions resulting from each permutation scheme are quite similar (density plot overlap graph). This makes sense: since the relabelings are chosen at random, relabelings both quite similar and quite dissimilar to the true labeling are included. The null distributions would be skewed if the relabelings were not chosen at random (e.g. centered above chance if only mostly-matching relabelings were used).
comments

Intuitively, I prefer the permute-both scheme: more permutations are possible, and the strong correlation is absent. But since the resulting null distributions are so similar, I can't say that permuting just the training- or just the testing-set labels is really worse, much less invalid. This is a quite simplified simulation; it would be prudent to check null distributions and relabeling schemes in actual use, since non-random label samplings may turn up.
- Al-Rawi, M.S., Cunha, J.P.S., 2012. On using permutation tests to estimate the classification significance of functional magnetic resonance imaging data. Neurocomputing 82, 224-233. http://dx.doi.org/10.1016/j.neucom.2011.11.007
- Ojala, M., Garriga, G.C., 2010. Permutation tests for studying classifier performance. Journal of Machine Learning Research 11, 1833-1863.
- Pereira, F., Botvinick, M., 2011. Information mapping with pattern classifiers: A comparative study. Neuroimage 56, 476-496.
- Stelzer, J., Chen, Y., Turner, R., 2013. Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. Neuroimage 65, 69-82.