This R code demonstrates how to carry out a group-level permutation test when using within-subjects cross-validation. I describe permutation schemes in a pair of PRNI conference papers: see DOI:10.1109/PRNI.2013.44 for an introduction and the single-subject case, and this one for the group level, including the fully balanced within-subjects approach used in the demo. A blog post from a few years ago also describes some of the issues, using examples structured quite similarly to this demo.
For this demo I used part of the dataset from doi:10.1093/cercor/bhu327, which can be downloaded from the OSF site. I did a lot of temporal compression with this dataset, which is convenient for the demo, since only about 18 MB of files need to be downloaded. Unfortunately, none of the analyses we did for the paper are quite suitable to demonstrate simple permutation testing with within-subjects cross-validation, so this demo performs a new analysis. The demo analysis is valid, just not really sensible for the paper's hypotheses (so don't be confused when you can't find it in the paper!).
The above figure is generated by the demo code and shows the results of the test. The demo uses 18 subjects' data; their null distributions are shown as blue histograms. Each person's true-labeled accuracy is plotted as a red line and listed in the panel title, along with its p-value, calculated from the shown null distribution (the best-possible p-value, 1/2906, rounds to 0).
The dataset used for the demo has no missing data: each person has six runs, with four examples (two of each class) in each run. Thus I can use a single set of relabelings for the permutation test, carrying out the relabelings and classifications within each person individually (since it's within-subjects cross-validation), but building every person's null distribution from the same relabelings. Using the same relabelings in each person allows the group-level null distribution (green, in the image above) to be built from the across-subjects average accuracy for each relabeling. In a previous post I called this fully-balanced strategy "single corresponding stripes", illustrated with the image below; see that post (or the demo code) for more detail.
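The scheme can be sketched in a few lines of R. This is a toy version, not the demo code: the subject count, permutation count, simulated one-voxel data, and nearest-class-mean classifier are all stand-ins I made up for illustration; only the run/example structure (six runs, four examples, two of each class) matches the demo.

```r
set.seed(42)

n.subj <- 3      # toy value; the real demo has 18 subjects
n.runs <- 6      # six runs per person, as in the demo
n.per.run <- 4   # four examples (two of each class) per run
n.perms <- 50    # toy value; the demo runs all 2906 possible relabelings

# the single true labeling, identical for every person (no missing data)
true.labels <- rep(rep(c("a", "b"), each = 2), n.runs)
run.id <- rep(1:n.runs, each = n.per.run)

# one shared set of relabelings: each permutes the labels within every
# run, and the SAME relabeling will be applied to every subject
make.relabeling <- function() {
  unlist(lapply(1:n.runs, function(r) sample(c("a", "a", "b", "b"))))
}
relabelings <- replicate(n.perms, make.relabeling())  # one column per relabeling

# toy one-voxel data per subject: class "a" examples shifted upward
subj.data <- lapply(1:n.subj, function(s)
  rnorm(n.runs * n.per.run, mean = ifelse(true.labels == "a", 0.7, 0)))

# leave-one-run-out cross-validation with a nearest-class-mean
# classifier (a stand-in for the demo's actual classifier)
cv.accuracy <- function(x, labels) {
  correct <- sapply(1:n.runs, function(test.run) {
    train <- run.id != test.run
    mean.a <- mean(x[train & labels == "a"])
    mean.b <- mean(x[train & labels == "b"])
    pred <- ifelse(abs(x[!train] - mean.a) < abs(x[!train] - mean.b), "a", "b")
    pred == labels[!train]
  })
  mean(correct)
}

# true-labeled accuracy for each person, and the group-level null:
# average the accuracies across subjects within each relabeling
true.acc <- sapply(subj.data, cv.accuracy, labels = true.labels)
null.acc <- apply(relabelings, 2, function(labs)
  mean(sapply(subj.data, cv.accuracy, labels = labs)))
```

The key move is in the last line: each column of `relabelings` is classified in every subject, and those accuracies are averaged, so each relabeling contributes exactly one value to the group-level null distribution.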
The histogram for the across-subjects means (green histogram; black column) is narrower than the individual subjects' histograms. This is sensible: for any particular permutation relabeling, one person might have a high accuracy and another a low accuracy, so averaging the values together gives a value closer to chance. Rephrased: every individual has at least one permutation with very low (0.2) accuracy, as can be seen in the blue histograms, but a different relabeling produced that low accuracy in each person, so the lowest group-mean accuracy was 0.4.
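A tiny made-up example shows the effect. Here three hypothetical subjects each hit 0.2 accuracy on some relabeling, but never on the same one, so the group means never get that low and their spread is smaller (these numbers are invented for illustration, not taken from the demo):

```r
# rows = relabelings, columns = subjects; each subject's worst
# relabeling (0.2) falls on a DIFFERENT row
null.by.subject <- cbind(s1 = c(0.2, 0.6, 0.5, 0.5, 0.7),
                         s2 = c(0.5, 0.2, 0.6, 0.7, 0.5),
                         s3 = c(0.5, 0.6, 0.2, 0.5, 0.7))

# every subject's minimum is 0.2 ...
apply(null.by.subject, 2, min)   # 0.2 0.2 0.2

# ... but the group means (one per relabeling) never drop that low,
# and their spread is narrower than any individual subject's
group.means <- rowMeans(null.by.subject)
min(group.means)                 # 0.4
sd(group.means) < min(apply(null.by.subject, 2, sd))  # TRUE
```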
The true-labeled group mean of 0.69 was higher than all the permuted-label group means, giving a p-value of 1/2906, approximately 0.0003 (all 2906 possible relabelings were run). The equivalent t-test is shown in the last panel, and also produces a very small p-value.
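The two tests can be sketched as follows. The null distribution and per-subject accuracies below are invented stand-ins (only the group mean of 0.69 and the count of 2906 come from the demo); since all possible relabelings were enumerated, the true labeling is itself one of them, which is why the smallest attainable p-value is 1/2906.

```r
set.seed(1)
# stand-in null distribution of group-mean accuracies; the first entry
# is the true labeling, which is one of the 2906 enumerated relabelings
true.group.mean <- 0.69
null.group.means <- c(true.group.mean, runif(2905, 0.35, 0.65))

# permutation p-value: proportion of the null at or above the true
# statistic; the true labeling matches itself, so p is at least 1/2906
p.perm <- mean(null.group.means >= true.group.mean)
p.perm   # 1/2906, about 0.0003

# the equivalent parametric test: one-sample t-test of (made-up)
# per-subject accuracies against chance (0.5)
subj.acc <- c(0.71, 0.65, 0.70, 0.72, 0.67, 0.68)
t.test(subj.acc, mu = 0.5, alternative = "greater")$p.value
```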