Combining the first six runs (the first training set) this person has 19 usable examples of class a and 31 of class b. In the last six runs (the first testing set) there are 22 usable examples of class a and 18 of class b. The data is thus quite out of balance (many more b in the training set and many more a in the testing). Balancing the dataset requires selecting 19 of the class b examples from the training set, and 18 of class a from the testing set. I did this selection randomly 10 times, which I will call "splits".
The permutation test's aim is to evaluate the classification significance: is the accuracy by which we classified the stimuli as a or b meaningful? I do this by permuting the class labels in the dataset, so in any particular permutation some class a examples are still labeled a, while others are labeled b. But in this case, even though it's just one person, we have ten datasets: the splits.
I handle this by precalculating the label permutations, then running the permutation test on each of the splits separately and averaging the results. To make it more concrete, this person has (after balancing) 19 examples of each class in the training set. I sort the rows so that they are grouped by class: the real labeling is 19 a followed by 19 b. I store 1000 random label rearrangements, each containing 19 a and 19 b. For example, here's the new labeling for permutation #1: "a" "b" "b" "a" "b" "a" "b" "a" "a" "a" "b" "b" "b" "a" "a" "b" "a" "b" "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "a" "b" "a" "a" "b" "b" "b" "b" "a" "b". I replace the real class labels with this set, then run the usual classification code - for each of the 10 splits.
When everything is done I end up with 1000 permuted-labels * 10 splits accuracies. I then make a set of 1000 accuracies by averaging the accuracy obtained with each label permutation on each split. In other words, I take the average of the accuracies I got by using the abbababaaa... labeling on each of my 10 "split" datasets. I then compute the significance by taking the rank of the real accuracy over the number of permutations. In this case, that's 2/1001 = 0.002.
Here's how it actually looks. These are histograms of the null distributions obtained on each of the split datasets separately, plus the averaged one. The vertical blue lines give the accuracy of the true labeling on each split, while the vertical light-grey line is at 0.5 (chance). The first graph is overplotting density plot versions of the 10 splits' histograms.
This method obviously requires a lot of computations (I often use a cluster). But I like that it enables comparing apples to apples: I'm comparing the accuracy I got by averaging 10 numbers to a distribution of accuracies obtained by averaging 10 numbers. Since the same set of splits and relabelings is used in all cases, the averaging is meaningful.
In this case, as often, the averaged distribution has less variance than the individual splits. This is not surprising, of course. It's also clear that the true-labeling accuracies vary more than the permutation distributions for each split, which is probably also not surprising (since these are permutation distributions), but not as clean as might be hoped. I do like that the true accuracy is at the right of all of the distributions; I consider this amount of variability pretty normal. I'd be worried if the distributions varied a great deal, or if the true-label accuracies are more variable (e.g. at chance on some splits, > 0.7 on others).