This is a dense methodological paper, laying out a way (and rationale) to carry out permutation tests for group-level classifier-based searchlight analysis (linear SVM). This is a much-needed topic; as pointed out in the article, the assumptions behind t-tests are violated in searchlight analysis, and using the binomial is also problematic (they suggest that it is too lenient, which strikes me as plausible).
Here's my interpretation of what they propose:
1. Generate 100 permuted searchlight maps for each person. You could think of all the possible label (i.e. class, stimulus type, whatever you're classifying) rearrangements as forming a very large pool. Pick 100 different rearrangements for each person and run the searchlight analysis once with each rearrangement. (The permuted searchlight analysis must be done exactly as the real one was: same cross-validation scheme, etc.)
2. Generate 100,000 averaged group searchlight maps. Each group map is made by picking one permuted map from each person (out of the 100 made for each person in step 1) and averaging the values voxel-wise. In other words, stratified sampling with replacement.
3. Do a permutation test at each voxel, calculating the accuracy corresponding to a p = 0.001 threshold. In other words, at each voxel you record the 100th biggest accuracy after sorting the 100,000 accuracies generated in step 2. (100/100,000 = 0.001)
4. Threshold the 100,000 permuted group maps and the one real-labeled group map using the voxel-wise thresholds calculated in step 3. Now the group maps are binary (pass the threshold or not).
5. Apply a clustering algorithm to all the group maps. They clustered voxels only if they shared a face. I don't think they used a minimum cluster size, but rather called unconnected voxels clusters of size 1 voxel. (This isn't really clear to me.)
6. Count the number of clusters by size in each of the 100,000 permuted maps and the 1 real map. (This gives counts like 10 clusters with 30 voxels in map #2004, etc.)
7. Generate the significance of the real map's clusters using the counts made in step 6. I think they calculated the significance for each cluster size separately then did FDR, but it's not obvious to me ("Cluster-size statistics" section towards end of "Materials and Methods").
8. Done! The voxels passing step 7 are significant at the cluster level, corrected for multiple comparisons (Figure 3F of paper). The step 4 threshold map can be used for uncorrected p-values (Figure 3E of paper).
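To make my reading of the averaging and thresholding steps (generating the group maps, finding the voxel-wise thresholds, binarizing) concrete, here is a minimal sketch in plain Python. It is my interpretation, not the authors' code. The function names, the flat-list map layout, the fake Gaussian "accuracies", and the scaled-down sizes (2,000 group maps and p = 0.01 instead of 100,000 and p = 0.001) are all assumptions made for illustration, as is the choice of `>=` for "passes the threshold".

```python
import random

def draw_group_maps(perm_maps, n_group, rng):
    """Average one randomly chosen permuted map per subject, n_group times.

    perm_maps[s][p] is a flat list of voxel accuracies (subject s, permutation p).
    """
    n_subj = len(perm_maps)
    group_maps = []
    for _ in range(n_group):
        # one permuted map per subject, drawn with replacement across group maps
        picks = [rng.choice(subject_maps) for subject_maps in perm_maps]
        group_maps.append([sum(vals) / n_subj for vals in zip(*picks)])
    return group_maps

def voxel_thresholds(group_maps, p):
    """At each voxel, return the k-th largest null accuracy, with k = p * n_group."""
    k = max(1, round(p * len(group_maps)))
    return [sorted(column, reverse=True)[k - 1] for column in zip(*group_maps)]

def binarize(group_map, thresholds):
    """1 where a voxel reaches its threshold, else 0 (>= is an assumption)."""
    return [int(a >= t) for a, t in zip(group_map, thresholds)]

# Toy data: fake null accuracies around chance, NOT real searchlight output.
rng = random.Random(0)
N_SUBJ, N_PERM, N_VOX = 12, 100, 20
perm_maps = [[[rng.gauss(0.5, 0.1) for _ in range(N_VOX)]
              for _ in range(N_PERM)]
             for _ in range(N_SUBJ)]

null_group = draw_group_maps(perm_maps, n_group=2000, rng=rng)
thresholds = voxel_thresholds(null_group, p=0.01)
null_binary = [binarize(g, thresholds) for g in null_group]
```

The real-labeled group map would go through `binarize` with the same `thresholds`; the clustering and cluster-size counting would then run on the binary maps.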
Most of this strikes me as quite reasonable. I've actually previously implemented almost this exact procedure (minus the cluster thresholding) on a searchlight dataset (not linear SVMs).
The part that makes me twitch the most is step 2: turning the 100 maps for each person into 100,000 group-average maps. I've been wanting to post about this anyway in the context of my ROI-based permutation testing example. But in brief, what makes me uncomfortable is the way 100 maps turn into 100,000. Why not just calculate 5 for each person? 5^12 >> 100,000 (they had 12 subjects in some of the examples). Somehow 100 for each person feels more properly random than 5 for each person, but how many are really needed to properly estimate the variation? I will expand on this more (and give a few alternatives), hopefully somewhat soon.
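The arithmetic behind that question is easy to check. This little sketch (the function name is mine) just counts how many distinct one-map-per-person combinations exist for 12 subjects:

```python
def n_possible_group_maps(n_perms_per_subject, n_subjects):
    """Number of distinct ways to pick one permuted map from each subject."""
    return n_perms_per_subject ** n_subjects

for n_perms in (5, 100):
    print(n_perms, n_possible_group_maps(n_perms, 12))
# 5 244140625
# 100 1000000000000000000000000
```

So even 5 maps per person gives far more than 100,000 possible combinations. The count of combinations says nothing, though, about how well each person's handful of maps represents that person's own null distribution, which is the part I'm unsure about.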
The other thing that makes me wonder is the leniency. They show (e.g. Figure 11) that many more voxels are called significant in their method than with a t-test, claiming that as closer to the truth. This relates to my concern about how to combine over subjects: using 100,000 group maps allows very small p-values. But if the 100,000 aren't as variable as they should be, the null distribution will be too narrow and the p-values will be too small (i.e. the significance will be overstated).
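To show why the number of group maps sets how small a p-value can get, here is a minimal sketch of a one-sided permutation p-value at a single voxel. The add-one form (counting the real map as one of the rearrangements) is a common convention that I'm assuming here; I don't know that the paper computes it exactly this way.

```python
def permutation_p_value(observed, null_values):
    """One-sided permutation p-value with the add-one convention (an assumption)."""
    n_as_extreme = sum(1 for v in null_values if v >= observed)
    return (n_as_extreme + 1) / (len(null_values) + 1)

# If no null value reaches the observed accuracy, p is 1 / (n_null + 1).
print(permutation_p_value(0.9, [0.5] * 99))  # 0.01
```

With 100,000 null group maps the floor is 1/100,001, so p-values near 0.00001 are possible. Whether they are trustworthy depends on the null values being as variable as they should be.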
Stelzer, J., Chen, Y., & Turner, R. (2012). Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. NeuroImage. DOI: 10.1016/j.neuroimage.2012.09.063
UPDATE (30 October): We discussed this paper in a journal club and a coworker pointed out that the authors do explain the choice of 100 permutations per person in Figure 8 and the section "Undersampling of the permutation space". They made a dataset with one searchlight and many examples (80, 120, 160), then varied the number of permutations they calculated for each individual (10, 100, 1000, 10,000). They then made 100,000 group "maps" as before (my step 2), drawing from each group of single-subject permutations. Figure 8 shows the resulting histograms: the curves for 100, 1000, and 10,000 individual permutations are quite similar, which they use as rationale for running 100 permutations for each person (my step 1).
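A rough way to play with this kind of undersampling check is sketched below. It is not the authors' simulation: I'm assuming each permuted single-searchlight accuracy can be faked as a coin-flip proportion over 80 examples (no real classifier), and I've scaled things down to 5,000 group values and a p = 0.01 threshold. All names are mine.

```python
import random

def null_accuracy(n_examples, rng):
    """Fake permuted accuracy: proportion of n_examples fair coin flips (assumption)."""
    return sum(rng.random() < 0.5 for _ in range(n_examples)) / n_examples

def group_threshold(n_perms, n_subj, n_examples, n_group, p, rng):
    """Group-level threshold when each subject has only n_perms permuted values."""
    pools = [[null_accuracy(n_examples, rng) for _ in range(n_perms)]
             for _ in range(n_subj)]
    # each group value averages one randomly drawn permuted accuracy per subject
    group = sorted((sum(rng.choice(pool) for pool in pools) / n_subj
                    for _ in range(n_group)), reverse=True)
    k = max(1, round(p * n_group))
    return group[k - 1]

rng = random.Random(1)
thr_by_n_perms = {n: group_threshold(n, 12, 80, 5000, 0.01, rng)
                  for n in (10, 100, 1000)}
for n, thr in thr_by_n_perms.items():
    print(n, round(thr, 4))
```

Rerunning with different seeds and comparing how much the threshold moves for each number of per-person permutations gives a feel for the stability they describe, at least for this toy null.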
I agree that this is a reasonable way to choose the number of permutations per person, but I'm still not entirely comfortable with the way different permutation maps are combined. I'll explain and show this more in a separate post.