"1- The title: 'groups of people'. Are you trying to perform permutation testing for a group of subjects (people), so, is it 'groups of people', or group analysis?"
I meant group analysis (aka 2nd level analysis): "is this effect present in these subjects?" The goal is to generate a significance level for the group-level statistic, such as the accuracy averaged across subjects.
"2- You also wrote; 'Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person'. So, how do you get this set of 'p unique relabelings', or you don't have to find them, you just do find p times relabelings?"Sorry for the confusion; I was struggling a bit with terminology. What I mean is that there is a single set of label permutations which are used for every person, rather than generating unique label permutations for everyone. The set of label permutations is created in the 'usual' way. If the number of labels is small all possible permutations should be used, but when there are too many for that to be practical a random subset can be generated.
For example, suppose there are two classes (a and b) and three examples of each class. If the true labeling (i.e. row order of the data table) is aaabbb, possible permutations include aabbba, ababab, etc.
"3- 'applied such that there is a linking between the permutations in each person'. I cannot figure out what that is? I can tell that it gives a biased distribution, either all yielding best accuracy(s) that are closer to the real one, or all yielding closer to chance level accuracy."What I meant was that, if all subjects have the same number and type of examples, the same (single) permutation scheme can be used for everyone. To continue the example, we would calculate the accuracy with the aabbba relabeling in every subject, then the accuracy with the ababab relabeling, etc.
I guess this gives a biased distribution in the sense that fewer relabelings are included ... When a random subset of the permutations has to be used (because there are too many to calculate them all), under scheme 2 you could generate a separate set of permutations for each person (e.g. aabbba might be used with subject 3 but not subjects 1, 2, or 4). The single set of permutations used for everyone is not biased (assuming it was generated properly), but does sample a smaller number of relabelings than if you generate unique relabelings for each person.
"4- Since each person/subject (data) has its own classifier (or a correlated set of classifiers due to using k-fold cross validation), is it legitimate to take the mean of each of the 'p unique relabelings' (as you showed in scheme 1)?"
This is one of the reasons why I made the post: to ask for opinions on its legitimacy! It feels more fair to me to take the across-subjects mean when the same labeling is used for everyone than averaging values from different labelings. But a feeling isn't proof, and it's possible that scheme 1 is too stringent.
"5- The formula; ((number of permutation means) > (real mean)) / (r + 1)Yes, I think this is right: if the true labeling is better than all 1000 permutations we want the resulting p-value to come out as 0.001, not 0.
I think using ((number of permutation means) > (real mean) +1 ) / (r + 1) would prevent getting p=0 value when the classification accuracy is highly above change. We shouldn't be getting 0 in any case, but that is most probably to happen because the number of permutations is limited by the computational power, and the randomization labeling might not be perfectly tailored enough to generate an idealistic distribution (e.g. at one permutation should capture the original labeling or one that is highly close to it and thus give high classification accuracy)."