First, the dataset.I have data for n people. Assume there are no missings, the data is balanced (equal numbers of both cases), and that the number of examples of each case is the same in all people (e.g. 10 "a" and 10 "b" in each of three runs for every person).
Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person. In other words suppose relabeling 1 = abaabba ... aabbb. The "perm 1" box for each person is the accuracy obtained when relabeling 1 is used for that person's data; permutations of the same number for each person use the same relabeling.
Next, we generate the null distribution.Here are the three schemes for doing a group permutation test. In the first, the linking between permutations described above is maintained: p group means are calculated, one for each of the p unique relabelings. I've used this scheme before.
Alternatively, we could disregard the linking between permutations for each person, selecting one accuracy from each person to go into each group mean. Now we are not restricted to p group means; we can generate as many as we wish (well, restricted by the number of subjects and permutations, but this is usually a very, very large number). In my little sketch I use the idea of a basket: for each of the r group means we pick one permutation accuracy from each person at random, then calculate the average. We sample with replacement: some permutations could be in multiple baskets, others in none. This is the scheme used in (part of) Stelzer, Chen, & Turner, 2012, for example.
Or, we could fill our baskets more randomly: ignoring the separation into subjects. In other words, we draw n permutation values each of r time, disregarding the subject identities. This is closer to the common idea of bootstrapping, but I don't know of any examples of using this with neuroimaging data.
Finally, the null distribution.Once we have the collection of p or r group means we can generate the null distribution, and find the rank (and thus the p-value) of the real accuracy, as usual in permutation testing. To be complete, I usually calculate the p-value as ((number of permutation means) > (real mean)) / (r + 1). I actually don't think this exact equation is universal; it's also possible to use greater-than-or-equal-to for the rank, or r for the denominator.
Thoughts.I feel like the first scheme ("stripes") is the best, when it is possible. Feelings aren't proof. But it feels cleaner to use the same relabeling scheme in each person: when we don't, we are putting a source of variability into the null distribution (the label scheme used for each person) that isn't in the real data, and I always prefer to have the null distribution as similar to the real data as possible (in terms of structure and sources of variance).
A massive problem with the first scheme is that it is only possible when the data is very well-behaved: the same relabeling scheme can be applied to all subjects. As soon as even one person has some missing data, the entire scheme breaks. I've used the second scheme ("balanced baskets") in this case, usually by keeping the "striping" whenever possible, then bootstrapping the subjects with missing (This is sort of a hybrid of the first two schemes: I end up with p group means, but the striping isn't perfect).
A small change to the diagram gets us from the second to the third scheme (losing the subject identities). This of course steps the null distribution even further from the real data, but that is not necessarily a bad thing. Stelzer, Chen, & Turner (2012) describe their scheme 2 as a "fixed-effects analysis" (line 388 of the proof). Would scheme 3 be like a random-effects analysis?
So, which is best? Should the first scheme be preferred when it is possible? Or should we always use the second? Or something else? I'll post some actual examples (eventually).