Thursday, November 29, 2012

surface or volume searchlighting ... not a mixture

I saw a report with an analysis that used volumetric searchlights (spheres) within a grey matter mask (like the one at the right, though theirs was somewhat more dilated). The grey matter strip in the mask was usually around 10 voxels thick, and they ran a 3-voxel radius searchlight.

I do not think this is a good idea.

My short, general suggestion is to do either volumetric searchlighting of the whole brain (or a large 3D subset of it, like the frontal lobes) or surface-based searchlighting (e.g. Chen, Oosterhof), but not to mix the two.

A fundamental problem is that a fairly small proportion of the searchlights will be fully in the brain using a volumetric grey matter mask.
In this little cartoon the yellow circles represent the spherical searchlights, and the two grey lines the extent of the grey matter mask. Many searchlights do not fall fully within the mask, so the edges of the mask are sampled differently than its center.

This cartoon tries to convey the same idea: only the strip in the middle of the mask is mapped by searchlights completely in the brain. If informative voxels are evenly distributed in the grey matter strip, searchlights in the middle (fully in the brain) could be more likely to be significant than those near the edges (depending on the exact classifier, information distribution, etc.), giving a distorted impression.
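To put a rough number on it, here is a back-of-the-envelope sketch using the dimensions from the report I mentioned above (a strip around 10 voxels thick, 3-voxel radius searchlights); the calculation is mine, not theirs:

```python
# with a grey matter strip about 10 voxels thick and a 3-voxel-radius searchlight,
# only centers at least 3 voxels from both edges of the strip have searchlights
# that fall fully inside it
thickness, radius = 10, 3
fully_inside = max(0, thickness - 2 * radius)   # 4 layers of possible centers
print(fully_inside / thickness)                 # 0.4: well under half of the strip
```

So even in this generous case, most searchlight centers have searchlights that spill out of the mask.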

I've heard it suggested that it's better to use a grey matter mask because that's where the information should be. I don't think that's a good motivation. For one thing, running the searchlight analysis on the whole brain can serve as a control: if the "most informative areas" turn out to be the ventricles, something went wrong. For another, spatial normalization is not perfect. Depending on how things are run (searchlights on spatially-normalized brains or not, etc.), there is likely to be some blurring of white and grey matter in the functional images. Letting the searchlights span both may capture more of the information actually present.

One final point. Volumetric searchlighting will put non-strip-wise-adjacent areas (areas touching across a sulcus) into the same searchlight. This might be acceptable, given the spatial uncertainty accepted in much fMRI analysis. But if you want to avoid it, surface-based methods are the way to go, not a volumetric grey matter mask.

Monday, November 19, 2012

postdoc position: Susanne Quadflieg's lab

Susanne Quadflieg is looking for a postdoc. The position is for fMRI and social psychology research, not MVPA methods, but it might interest some of you.

permutation testing: feedback

MS Al-Rawi sent some feedback about my last post and kindly agreed that I could post it here. His comments/questions are in quote blocks, my responses in regular type.

"1- The title: 'groups of people'. Are you trying to perform permutation testing for a group of subjects (people), so, is it 'groups of people', or group analysis?"

I meant group analysis (aka 2nd level analysis): "is this effect present in these subjects?" The goal is to generate a significance level for the group-level statistic, such as the accuracy averaged across subjects.

"2- You also wrote; 'Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person'. So, how do you get this set of  'p unique relabelings', or you don't have to find them, you just do find p times relabelings?"
Sorry for the confusion; I was struggling a bit with terminology. What I mean is that there is a single set of label permutations which is used for every person, rather than generating unique label permutations for everyone. The set of label permutations is created in the 'usual' way: if the number of possible relabelings is small, all of them should be used; when there are too many for that to be practical, a random subset can be generated.

For example, suppose there are two classes (a and b) and three examples of each class. If the true labeling (i.e. row order of the data table) is aaabbb, possible permutations include aabbba, ababab, etc.
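In code, generating that set of unique relabelings might look like this (a minimal sketch; note that the true labeling itself is one of the 20):

```python
from itertools import permutations

# enumerate the unique relabelings of the 6-trial, two-class (a/b) example above
true_labels = ("a", "a", "a", "b", "b", "b")
relabelings = sorted(set(permutations(true_labels)))
print(len(relabelings))    # 20 = choose(6, 3) unique orderings
print(relabelings[:2])     # ('a','a','a','b','b','b'), ('a','a','b','a','b','b'), ...
```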

"3- 'applied such that there is a linking between the permutations in each person'. I cannot figure out what that is? I can tell that it gives a biased distribution, either all yielding best accuracy(s) that are closer to the real one, or all yielding closer to chance level accuracy."
What I meant was that, if all subjects have the same number and type of examples, the same (single) permutation scheme can be used for everyone. To continue the example, we would calculate the accuracy with the aabbba relabeling in every subject, then the accuracy with the ababab relabeling, etc.

I guess this gives a 'biased' distribution only in the sense that fewer relabelings are included. When a random subset of the permutations has to be used (because there are too many to calculate them all), under scheme 2 you could generate a separate set of permutations for each person (e.g. aabbba might be used with subject 3 but not subjects 1, 2, or 4). The single set of permutations used for everyone is not biased (assuming it was generated properly), but it does sample fewer relabelings than if you generated a unique set for each person.
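Here is a minimal sketch of the two options, continuing the aaabbb example (the subject count, subset size, and variable names are my own toy choices):

```python
import random
from itertools import permutations

# toy numbers: 4 subjects, and random subsets of 10 of the 20 unique relabelings
true_labels = ("a",) * 3 + ("b",) * 3
all_relabelings = sorted(set(permutations(true_labels)))   # 20 unique orderings
n_subjects, n_perms = 4, 10

# single shared set: the same 10 relabelings are applied to every subject
shared = random.sample(all_relabelings, n_perms)
shared_sets = {subject: shared for subject in range(n_subjects)}

# separate sets: each subject gets their own random 10 relabelings, so a given
# relabeling (e.g. aabbba) might be used for some subjects but not others
separate_sets = {subject: random.sample(all_relabelings, n_perms)
                 for subject in range(n_subjects)}
```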

"4- Since each person/subject (data) has its own classifier (or a correlated set of classifiers due to using k-fold cross validation), is it legitimate to take the mean of each of the 'p unique relabelings' (as you showed in scheme 1)?"

This is one of the reasons why I made the post: to ask for opinions on its legitimacy! It feels fairer to me to take the across-subjects mean when the same labeling is used for everyone than to average values from different labelings. But a feeling isn't proof, and it's possible that scheme 1 is too stringent.

"5- The formula; ((number of permutation means) > (real mean)) / (r + 1)
I think using ((number of permutation means) > (real mean) +1 ) / (r + 1) would prevent getting p=0 value when the classification accuracy is highly above change. We shouldn't be getting 0 in any case, but that is most probably to happen because the number of permutations is limited by the computational power, and the randomization labeling might not be perfectly tailored enough to generate an idealistic distribution (e.g. at one permutation should capture the original labeling or one that is highly close to it and thus give high classification accuracy)."
Yes, I think this is right: if the true labeling is better than all 1000 permutations we want the resulting p-value to come out as 0.001, not 0.
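In code, that correction is just a +1 in the numerator; a minimal sketch with fake accuracies (this is one of the variant formulas discussed above, not the only one in use):

```python
import numpy as np

def permutation_p(real_mean, perm_means):
    # rank-based p-value with the +1 correction: never returns 0
    perm_means = np.asarray(perm_means)
    return (np.sum(perm_means > real_mean) + 1) / (len(perm_means) + 1)

rng = np.random.default_rng(0)
perm_means = rng.normal(0.5, 0.03, size=1000)   # fake null distribution of group means
print(permutation_p(0.75, perm_means))          # true labeling beats them all: 1/1001, about 0.001
```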

Tuesday, November 13, 2012

permutation testing: groups of people

Here I describe three ways to do a group permutation test. The terminology and phrasing are my own invention; I'd love to hear other terms for these ideas.

First, the dataset.

I have data for n people. Assume there are no missing examples, the data is balanced (equal numbers of both classes), and the number of examples of each class is the same in all people (e.g. 10 "a" and 10 "b" in each of three runs for every person).
For each person there is a single "real accuracy": the classification of their data with the true labeling. We can average these n values for the real group average accuracy.

Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person. In other words, suppose relabeling 1 = abaabba ... aabbb. The "perm 1" box for each person is the accuracy obtained when relabeling 1 is applied to that person's data; permutations with the same number use the same relabeling in every person.

Next, we generate the null distribution.

Here are the three schemes for doing a group permutation test. In the first, the linking between permutations described above is maintained: p group means are calculated, one for each of the p unique relabelings. I've used this scheme before.

Alternatively, we could disregard the linking between permutations for each person, selecting one accuracy from each person to go into each group mean. Now we are not restricted to p group means; we can generate as many as we wish (well, restricted by the number of subjects and permutations, but this is usually a very, very large number). In my little sketch I use the idea of a basket: for each of the r group means we pick one permutation accuracy from each person at random, then calculate the average. We sample with replacement: some permutations could be in multiple baskets, others in none. This is the scheme used in (part of) Stelzer, Chen, & Turner, 2012, for example.

Or, we could fill our baskets more randomly, ignoring the separation into subjects: we draw n permutation accuracies for each of the r group means, disregarding subject identity. This is closer to the common idea of bootstrapping, but I don't know of any examples of using this with neuroimaging data.
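Here is a minimal sketch of how I picture the three schemes, assuming the permuted accuracies have already been computed and arranged as an n-subjects by p-relabelings matrix (the "linked" layout described above); the numbers are fake stand-ins and the names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 12, 100, 10000
perm_acc = rng.normal(0.5, 0.05, size=(n, p))     # fake permuted accuracies

# scheme 1 ("stripes"): one group mean per relabeling, linking preserved
stripes = perm_acc.mean(axis=0)                                   # p group means

# scheme 2 ("baskets"): each of the r group means averages one randomly-chosen
# permutation accuracy from each person, sampling with replacement
cols = rng.integers(0, p, size=(r, n))
baskets = perm_acc[np.arange(n), cols].mean(axis=1)               # r group means

# scheme 3: fill the baskets ignoring subject identity entirely
pooled = rng.choice(perm_acc.ravel(), size=(r, n)).mean(axis=1)   # r group means
```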

Finally, the null distribution.

Once we have the collection of p or r group means we can generate the null distribution, and find the rank (and thus the p-value) of the real accuracy, as usual in permutation testing. To be complete, I usually calculate the p-value as ((number of permutation means) > (real mean)) / (r + 1). I actually don't think this exact equation is universal; it's also possible to use greater-than-or-equal-to for the rank, or r for the denominator.

Thoughts.

I feel like the first scheme ("stripes") is the best, when it is possible. Feelings aren't proof. But it feels cleaner to use the same relabeling scheme in each person: when we don't, we are putting a source of variability into the null distribution (the label scheme used for each person) that isn't in the real data, and I always prefer to have the null distribution as similar to the real data as possible (in terms of structure and sources of variance).

A massive problem with the first scheme is that it is only possible when the data is very well-behaved, so that the same relabeling scheme can be applied to all subjects. As soon as even one person has some missing data, the entire scheme breaks. In that case I've used the second scheme ("balanced baskets"), usually by keeping the "striping" whenever possible, then bootstrapping the subjects with missing data. (This is sort of a hybrid of the first two schemes: I end up with p group means, but the striping isn't perfect.)

A small change to the diagram gets us from the second to the third scheme (losing the subject identities). This of course moves the null distribution even further from the real data, but that is not necessarily a bad thing. Stelzer, Chen, & Turner (2012) describe scheme 2 as a "fixed-effects analysis" (line 388 of the proof). Would scheme 3 be like a random-effects analysis?

So, which is best? Should the first scheme be preferred when it is possible? Or should we always use the second? Or something else? I'll post some actual examples (eventually).

Wednesday, October 31, 2012

needles and haystacks: information mapping quirks

I really like this image and analogy for describing some of the distortions that can arise from searchlight analysis: a very small informative area ("the needle") can turn into a large informative area in the information map ("the haystack"), but the reverse is also possible: a large informative area can turn into a small area in the information map ("haystack in the needle").

I copied this image from the poster Matthew Cieslak, Shivakumar Viswanathan, and Scott T. Grafton presented last year at SfN (poster 626.16, Fitting and Overfitting in Searchlights, SfN 2011). The current article covers some of the same issues as the poster, providing a mathematical foundation and detailed explanation.

They step through several proofs of information map properties, using reasonable assumptions. One result I'll highlight here is that the information map's representation of a fixed-size informative area will grow as the searchlight radius increases (my phrasing, not theirs). Note that this (and the entire paper) describes the single-subject, not the group, level of analysis.

This fundamental 'growing' property is responsible for many of the strange things that can appear in searchlight maps, such as the edge effects I posted about here. As Viswanathan et al. point out in the paper, it also means that interpreting the number of voxels found significant in a searchlight analysis is fraught with danger: that number is affected by many factors other than the amount and location of informative voxels. They also show that just 430 properly-spaced informative voxels can cause the entire brain to be marked as informative in the information map, using 8 mm radius searchlights (not a particularly large radius in the literature).
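Here is a toy one-dimensional illustration of the growing property (my own sketch, not from the paper, and assuming the classifier reliably detects the lone informative voxel): a single informative voxel makes every searchlight that contains it look informative, so its footprint in the information map is 2r + 1 voxels wide.

```python
import numpy as np

n_vox = 41
truth = np.zeros(n_vox, dtype=bool)
truth[20] = True                       # one informative voxel

for radius in (1, 2, 3, 4):
    # a searchlight is "informative" if any voxel it covers is informative
    info_map = np.array([truth[max(0, c - radius):c + radius + 1].any()
                         for c in range(n_vox)])
    print(radius, info_map.sum())      # 3, 5, 7, 9 voxels marked informative
```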

I recommend taking a look at this paper if you generate or interpret information maps via searchlight analysis, particularly if you have a mathematical bent. It nicely complements diagram- and description-based explanations of searchlight analysis (including, hopefully soon, my own). It certainly does not include all the aspects of information mapping, but provides a solid foundation for those it does include.


Shivakumar Viswanathan, Matthew Cieslak, & Scott T. Grafton (2012). On the geometric structure of fMRI searchlight-based information maps. arXiv:1210.6317v1

Thursday, October 25, 2012

permuting searchlight maps: Stelzer

Now to the proposals in Stelzer, not just their searchlight shape!

This is a dense methodological paper, laying out a way (and rationale) to carry out permutation tests for group-level, classifier-based searchlight analysis (linear SVMs). This is certainly a needed topic; as pointed out in the article, the assumptions behind t-tests are violated in searchlight analysis, and using the binomial is also problematic (they suggest that it is too lenient, which strikes me as plausible).

Here's my interpretation of what they propose:
  1. Generate 100 permuted searchlight maps for each person. You could think of all the possible label (i.e. class, stimulus type, whatever you're classifying) rearrangements as forming a very large pool; pick 100 different rearrangements for each person and do the searchlight analysis with each of them. (The permuted searchlight analyses must be done exactly as the real one was - same cross-validation scheme, etc.)
  2. Generate 100,000 averaged group searchlight maps. Each group map is made by picking one permuted map from each person (out of the 100 made for each person in step 1) and averaging the values voxel-wise. In other words, stratified sampling with replacement (see the code sketch after this list).
  3. Do a permutation test at each voxel, calculating the accuracy corresponding to a p = 0.001 threshold. In other words, at each voxel you record the 100th biggest accuracy after sorting the 100,000 accuracies generated in step 2. (100/100000 = 0.001)
  4. Threshold the 100,000 permuted group maps and the one real-labeled group map using the voxel-wise thresholds calculated in step 3. Now the group maps are binary (pass the threshold or not).
  5. Apply a clustering algorithm to all the group maps. They clustered voxels only if they shared a face. I don't think they used a minimum cluster size, but rather called un-connected voxels clusters of size 1 voxel. (This isn't really clear to me.)
  6. Count the number of clusters by size in each of the 100,000 permuted maps and 1 real map. (this gives counts like 10 clusters with 30 voxels in map #2004, etc.)
  7. Generate the significance of the real map's clusters using the counts made in step 6. I think they calculated the significance for each cluster size separately then did FDR, but it's not obvious to me ("Cluster-size statistics" section towards end of "Materials and Methods").
  8. Done! The voxels passing step 7 are significant at the cluster level, corrected for multiple comparisons (Figure 3F of paper). The step 4 threshold map can be used for uncorrected p-values (Figure 3E of paper).
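To make steps 2 through 4 concrete, here's a rough sketch of how I read them, with toy numbers scaled far down from the paper's (they used 100 permutations per person and 100,000 group maps); this is my interpretation, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_perm, n_vox, n_group = 12, 20, 500, 1000
perm_maps = rng.normal(0.5, 0.05, size=(n_subj, n_perm, n_vox))   # stand-in for step 1

# step 2: group-average maps, drawing one permuted map per subject (with replacement)
picks = rng.integers(0, n_perm, size=(n_group, n_subj))
group_maps = perm_maps[np.arange(n_subj), picks, :].mean(axis=1)  # (n_group, n_vox)

# step 3: voxel-wise accuracy threshold at p = 0.001
# (the 100th-largest of 100,000 in the paper; a quantile here)
thresh = np.quantile(group_maps, 0.999, axis=0)                   # (n_vox,)

# step 4: binarize the real and permuted group maps with the voxel-wise thresholds
real_map = rng.normal(0.52, 0.05, size=n_vox)   # stand-in for the real group average
real_binary = real_map > thresh                  # these binary maps feed the clustering in step 5
perm_binary = group_maps > thresh
```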

Most of this strikes me as quite reasonable. I've actually previously implemented almost this exact procedure (minus the cluster thresholding) on a searchlight dataset (though not with linear SVMs).

The part that makes me twitch the most is step 2: turning the 100 maps for each person into 100,000 group-average maps. I've been wanting to post about this anyway in the context of my ROI-based permutation testing example. But in brief, what makes me uncomfortable is the way 100 maps per person turn into 100,000 group maps. Why not calculate just 5 for each person? 5^12 = 244,140,625, far more than 100,000 possible combinations (they had 12 subjects in some of the examples). Somehow 100 for each person feels more properly random than 5 for each person, but how many are really needed to properly estimate the variation? I will expand on this more (and give a few alternatives), hopefully somewhat soon.

The other thing that makes me wonder is the leniency. They show (e.g. Figure 11) that many more voxels are called significant with their method than with a t-test, claiming that this is closer to the truth. This relates to my concern about how to combine over subjects: using 100,000 group maps allows very small p-values. But if the 100,000 aren't as variable as they should be, the p-values will come out smaller than they should.


Stelzer, J., Chen, Y., & Turner, R. (2012). Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. NeuroImage. DOI: 10.1016/j.neuroimage.2012.09.063



UPDATE (30 October): We discussed this paper in a journal club and a coworker pointed out that the authors do justify the choice of 100 permutations per person, in Figure 8 and the section "Undersampling of the permutation space". They made a dataset with one searchlight and many examples (80, 120, 160), then varied the number of permutations they calculated for each individual (10, 100, 1000, 10,000). They then made 100,000 group "maps" as before (my step 2), drawing from each set of single-subject permutations. Figure 8 shows the resulting histograms: the curves for 100, 1000, and 10,000 individual permutations are quite similar, which they use as the rationale for running 100 permutations for each person (my step 1).

I agree that this is a reasonable way to choose the number of per-person permutations, but I'm still not entirely comfortable with the way the different permutation maps are combined. I'll explain and show this more in a separate post.

searchlight shapes: Stelzer

This is the first of what will likely be a series of posts on a paper in press at NeuroImage:

Stelzer, J., et al., Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA). NeuroImage (2012), http://dx.doi.org/10.1016/j.neuroimage.2012.09.063

There is a lot in this paper, touching some of my favorite topics (permutation testing, using the binomial, searchlight analysis, Malin's 'random' searchlights).
But in this post I'll just highlight the searchlight shapes used in the paper. They're given in this sentence: "The searchlight volumes to these diameters were 19 (D=3), 57 (D=5), 171 (D=7), 365 (D=9), and 691 (D=11) voxels, respectively." The authors don't list the software they used; I suspect it was custom MATLAB code.

Here I'll translate the first two sizes into the convention I used in the other searchlight shape posts (radius, plus the number of surrounding voxels, i.e. not counting the center voxel):

diameter 3: radius 1, 18 surrounding voxels. This looks like my 'edges or faces touch' searchlight.
diameter 5: radius 2, 56 surrounding voxels. This has more voxels than the 'default' searchlight, but fewer than my two-voxel radius searchlight. Squinting at Figure 1 in the text, I came up with the shape below.


Here's the searchlight from Figure 1, and my blown-up version for a two-voxel radius searchlight.
It looks like they added the plus signs to the outer faces of a three-by-three-by-three cube (which would give 27 + 6x5 = 57 voxels, matching the count in the paper). This doesn't follow any of my iterative rules, but perhaps it would result from fitting a particular sphere-type rule.
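For anyone who wants to play with sphere-type rules themselves, here's a little counting sketch; the radii are my own guesses, since the paper doesn't state how the diameters were translated into voxel sets:

```python
from itertools import product

# count the voxels whose centers fall within a given squared Euclidean distance
# of the searchlight center (the center voxel is included in the count)
def searchlight_size(max_sq_dist, extent=6):
    return sum(1 for i, j, k in product(range(-extent, extent + 1), repeat=3)
               if i**2 + j**2 + k**2 <= max_sq_dist)

print(searchlight_size(2))   # 19: distance <= sqrt(2), matching the D=3 count
print(searchlight_size(4))   # 33: distance <= 2
print(searchlight_size(5))   # 57: distance <= sqrt(5), matching the D=5 count
```

A distance of sqrt(5) happens to reproduce the 57-voxel D=5 searchlight, but that's just one rule that fits the number; I don't know what the authors actually did.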