There has been an interesting thread about permutation testing on the pyMVPA mailing list recently. In a previous blog post about permutation testing I used two runs, partitioning on the runs, for two-fold cross-validation. In that simulation the null distributions were quite similar regardless of whether the training set only, the testing set only, or the entire dataset (training and testing, as a unit) was relabeled.
The pyMVPA discussion made me suspect that the overlapping null distributions are a special case: when there are only two runs (used for the cross-validation), permuting either the training set only or the testing set only is permuting half of the data. When there are more than two runs, permuting the training set only changes the labels on more examples than permuting the testing set only.
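To make that counting argument concrete, here is a minimal sketch assuming leave-one-run-out cross-validation with equally sized runs. The function name `relabeled_fraction` is hypothetical, made up for this illustration, and is not part of the simulation code.

```python
def relabeled_fraction(n_runs):
    """Fraction of all examples that get permuted labels in one
    leave-one-run-out fold, assuming every run has the same size."""
    return {
        "train_only": (n_runs - 1) / n_runs,  # every run except the left-out one
        "test_only": 1 / n_runs,              # only the left-out run
        "both": 1.0,                          # the entire dataset
    }

for n_runs in (2, 4):
    print(n_runs, relabeled_fraction(n_runs))
```

With two runs the training-only and testing-only schemes each permute half of the data; with four runs the training-only scheme permutes three quarters of it and the testing-only scheme one quarter.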
I repeated that simulation, creating four runs of data instead of just two. This makes the classification easier, since the classifier is trained on (in this case) 60 examples (3 runs * 2 classes * 10 examples of each class in each run) instead of just 20 examples. As before, I ran each simulation ten times, and did 1000 label rearrangements (chosen at random).
I plan more posts describing permutation schemes, but I'll summarize here. I always permute the class labels within each run separately (a stratified scheme). In this example, partitioning is also done using the runs, so the number of trials of each class in each cross-validation fold is always the same. I precompute the relabelings and try to keep as much constant across the folds as possible. For example, when permuting the testing set only I use the same set of relabelings (e.g. permutation #1 = aabbbaab when the true labeling is aaaabbbb) for all test sets (when run 1 is left out for permutation 1, when run 2 is left out for permutation 1, etc.). This creates (maintains) some dependency between the cross-validation folds.
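A rough sketch of the testing-set-only case, under the assumption that each run has the eight-trial labeling used in the example above: the relabelings are generated once, and permutation number i is reused whichever run is left out. The names `precompute_relabelings` and `permuted_test_labels` are hypothetical and are not taken from my simulation code.

```python
import random

def precompute_relabelings(run_labels, n_perms, seed=0):
    """Shuffle one run's labels n_perms times. Shuffling within a run
    keeps the number of trials of each class fixed (stratified)."""
    rng = random.Random(seed)
    relabelings = []
    for _ in range(n_perms):
        shuffled = list(run_labels)
        rng.shuffle(shuffled)
        relabelings.append("".join(shuffled))
    return relabelings

TRUE_RUN_LABELS = "aaaabbbb"  # true labeling of one run, as in the example
RELABELINGS = precompute_relabelings(TRUE_RUN_LABELS, n_perms=1000)

def permuted_test_labels(perm_index, left_out_run):
    """Labels for the left-out run when permuting the testing set only.
    left_out_run is deliberately ignored: the same relabeling is used
    for permutation perm_index no matter which run is the test set."""
    return RELABELINGS[perm_index]

# Permutation 0 relabels the test set identically in every fold.
print([permuted_test_labels(0, run) for run in (1, 2, 3, 4)])
```

Because the relabelings are chosen at random, some may repeat or match the true labeling; the sketch makes no attempt to filter those out.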
Here are the null distributions that resulted from running the simulation 10 times. Only the accuracy > chance side is shown (they're reasonably symmetrical), with the dotted vertical line showing the accuracy of the true-labeled dataset. The lines are histogram-style: the number of samples falling into the 0.05-wide accuracy bins. This is a fairly easy classification, and would turn out highly significant under all permutation schemes in all repetitions.
Permuting the training sets only tends to result in the lowest-variance null distributions and permuting the testing sets only the highest, with permuting both often between. This is easier to see when density plots are overplotted:
This fits with Yaroslav's suggestion that permuting the training set only may provide better power: a narrower null distribution means you're more likely to find your real accuracy in the extreme right tail, and so significant. But I don't know which scheme is better in terms of error rates, etc.: which will get us closer to the truth?
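To illustrate why the width of the null distribution matters, here is a small sketch of a one-sided permutation p-value. It uses the common (count + 1) / (permutations + 1) convention, which is an assumption for this illustration rather than a description of how the simulation code computes significance. The two null distributions are made-up toy values, not simulation results.

```python
def permutation_p_value(true_accuracy, null_accuracies):
    """One-sided p-value: proportion of permuted-label accuracies at
    least as large as the true-label accuracy, counting the true
    labeling itself as one of the rearrangements."""
    n_extreme = sum(acc >= true_accuracy for acc in null_accuracies)
    return (n_extreme + 1) / (len(null_accuracies) + 1)

# Toy null distributions (invented for illustration), both centered
# near chance (0.5) but with different spreads.
narrow_null = [0.45, 0.50, 0.50, 0.55, 0.50, 0.45, 0.55, 0.50, 0.50]
wide_null = [0.30, 0.40, 0.50, 0.60, 0.70, 0.35, 0.65, 0.50, 0.55]

true_accuracy = 0.65
p_narrow = permutation_p_value(true_accuracy, narrow_null)
p_wide = permutation_p_value(true_accuracy, wide_null)
print(p_narrow, p_wide)
```

The same true accuracy gets a smaller p-value against the narrower toy null, which is the sense in which a lower-variance null distribution gives more power. It says nothing about which scheme has the correct error rate.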
While very much still in progress, the code I used to make these graphs is here. It lets you vary the number of runs, trials, bias (how different the classes are), and whether the relabelings are done within each run or not.
UPDATE: see the post describing simulating data pyMVPA-style for dramatically different curves.