Thursday, November 29, 2012

surface or volume searchlighting ... not a mixture

I saw a report with an analysis that used volumetric searchlights (spheres) within a grey matter mask (like the one at the right, though theirs was somewhat more dilated). The cortical thickness in the mask was usually around 10 voxels, and they ran a 3-voxel radius searchlight.

I do not think this is a good idea.

My short general suggestion is to do either volumetric searchlighting of the whole brain (or a large 3D subset of it - like the frontal lobes), or do surface-based searchlighting (e.g. Chen, Oosterhof), but not mix the two.

A fundamental problem is that a fairly small proportion of the searchlights will be fully in the brain using a volumetric grey matter mask.
In this little cartoon the yellow circles represent the spherical searchlights, and the two grey lines the extent of the grey matter mask. Many searchlights do not fully fall into the mask; the edges will be sampled differently than the center.

This cartoon tries to convey the same idea: only the strip in the middle of the mask is mapped by searchlights completely in the brain. If informative voxels are evenly distributed in the grey matter strip, searchlights in the middle (fully in the brain) could be more likely to be significant (depending on the exact classifier, information distribution, etc.), which would give a distorted impression of where the information is.
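To put a rough number on this, here is a minimal Python sketch. It is not from the report: the flat slab, the array sizes, and the function name `fraction_fully_inside` are all made up for illustration. It treats the grey matter mask as a flat slab 10 voxels thick and counts how many in-mask voxels are the center of a 3-voxel-radius sphere that stays entirely inside the mask. The volume is treated as periodic so that only the slab thickness matters, which a real cortical ribbon does not do.

```python
import numpy as np

def fraction_fully_inside(mask, radius):
    """Fraction of in-mask voxels whose spherical searchlight stays inside the mask.

    Toy simplification (an assumption of this sketch): np.roll wraps around the
    array edges, so the volume is periodic. Only sensible for the slab below.
    """
    r = int(np.floor(radius))
    fully_inside = mask.copy()
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dz in range(-r, r + 1):
                if dx * dx + dy * dy + dz * dz > radius * radius:
                    continue  # this offset lies outside the sphere
                # Keep a center only if its neighbor at this offset is in the mask
                # (the sphere is symmetric, so the sign of the shift does not matter).
                fully_inside &= np.roll(mask, shift=(dx, dy, dz), axis=(0, 1, 2))
    return fully_inside.sum() / mask.sum()

# Hypothetical "grey matter" mask: a slab 10 voxels thick along the third axis.
mask = np.zeros((20, 20, 30), dtype=bool)
mask[:, :, 10:20] = True

print(fraction_fully_inside(mask, radius=3))
```

For this idealized slab only the 4 middle layers out of 10 qualify, a fraction of 0.4. A curved, irregular mask will give a different number, but the same logic applies.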

I've heard it suggested that it's better to use a grey matter mask because that's where the information should be. I don't think that's a good motivation. For one thing, running the searchlight analysis on the whole brain can serve as a control: if the "most informative areas" turn out to be the ventricles, something went wrong. For another, spatial normalization is not perfect. Depending on how things are run (searchlight on spatially-normalized brains or not, etc), there is likely to be some blurring of white and grey matter in the functional images. Letting the searchlights span both may capture more of the information actually present.

One final point. Volumetric searchlighting will put non-strip-wise-adjacent areas (areas touching across a sulcus) into the same searchlight. This might be ok, given the spatial uncertainties accepted in some fMRI. But if you want to avoid this, surface-based methods are the way to go, not a volumetric grey matter mask.

Monday, November 19, 2012

postdoc position: Susanne Quadflieg's lab

Susanne Quadflieg is looking for a postdoc. The position is fMRI and social psychology research, not MVPA methods, but might interest some of you.

permutation testing: feedback

MS Al-Rawi sent some feedback about my last post and kindly agreed that I could post it here. His comments/questions are in quote blocks, my responses in regular type.

"1- The title: 'groups of people'. Are you trying to perform permutation testing for a group of subjects (people), so, is it 'groups of people', or group analysis?"

I meant group analysis (aka 2nd level analysis): "is this effect present in these subjects?" The goal is to generate a significance level for the group-level statistic, such as the accuracy averaged across subjects.

"2- You also wrote; 'Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person'. So, how do you get this set of  'p unique relabelings', or you don't have to find them, you just do find p times relabelings?"

Sorry for the confusion; I was struggling a bit with terminology. What I mean is that there is a single set of label permutations which are used for every person, rather than generating unique label permutations for everyone. The set of label permutations is created in the 'usual' way. If the number of labels is small, all possible permutations should be used, but when there are too many for that to be practical, a random subset can be generated.

For example, suppose there are two classes (a and b) and three examples of each class. If the true labeling (i.e. row order of the data table) is aaabbb, possible permutations include aabbba, ababab, etc.
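As a small illustration of listing the relabelings for this toy case, here is a Python sketch. The helper name `unique_relabelings` is hypothetical, and this is just one way to do it: choose which positions get label a, and give the rest label b.

```python
from itertools import combinations

def unique_relabelings(n_a, n_b):
    """List every distinct ordering of n_a 'a' labels and n_b 'b' labels."""
    n = n_a + n_b
    relabelings = []
    # Each choice of positions for the 'a' labels gives one distinct relabeling.
    for a_positions in combinations(range(n), n_a):
        labels = ["b"] * n
        for i in a_positions:
            labels[i] = "a"
        relabelings.append("".join(labels))
    return relabelings

relabelings = unique_relabelings(3, 3)
print(len(relabelings))
print(relabelings[:4])
```

With three examples of each class there are only 20 distinct relabelings (the true labeling aaabbb among them), so all of them can be used. For larger designs a random subset could be drawn from such a list, or generated directly by shuffling.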

"3- 'applied such that there is a linking between the permutations in each person'. I cannot figure out what that is? I can tell that it gives a biased distribution, either all yielding best accuracy(s) that are closer to the real one, or all yielding closer to chance level accuracy."

What I meant was that, if all subjects have the same number and type of examples, the same (single) permutation scheme can be used for everyone. To continue the example, we would calculate the accuracy with the aabbba relabeling in every subject, then the accuracy with the ababab relabeling, etc.

I guess this gives a biased distribution in the sense that fewer relabelings are included ... When a random subset of the permutations has to be used (because there are too many to calculate them all), under scheme 2 you could generate a separate set of permutations for each person (e.g. aabbba might be used with subject 3 but not subjects 1, 2, or 4). The single set of permutations used for everyone is not biased (assuming it was generated properly), but does sample a smaller number of relabelings than if you generate unique relabelings for each person.

"4- Since each person/subject (data) has its own classifier (or a correlated set of classifiers due to using k-fold cross validation), is it legitimate to take the mean of each of the 'p unique relabelings' (as you showed in scheme 1)?"

This is one of the reasons why I made the post: to ask for opinions on its legitimacy! It feels more fair to me to take the across-subjects mean when the same labeling is used for everyone than averaging values from different labelings. But a feeling isn't proof, and it's possible that scheme 1 is too stringent.

"5- The formula; ((number of permutation means) > (real mean)) / (r + 1)
I think using ((number of permutation means) > (real mean) +1 ) / (r + 1) would prevent getting p=0 value when the classification accuracy is highly above chance. We shouldn't be getting 0 in any case, but that is most probably to happen because the number of permutations is limited by the computational power, and the randomization labeling might not be perfectly tailored enough to generate an idealistic distribution (e.g. at one permutation should capture the original labeling or one that is highly close to it and thus give high classification accuracy)."

Yes, I think this is right: if the true labeling is better than all 1000 permutations, we want the resulting p-value to come out as about 0.001 (1/1001), not 0.
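A minimal Python sketch of the corrected calculation, assuming the group means from the permutations are already in a list. The function name `permutation_p_value` and the example numbers are made up.

```python
def permutation_p_value(real_mean, perm_means):
    """p-value with +1 in both numerator and denominator, so it is never 0."""
    r = len(perm_means)
    # Count the permutation means that beat the real mean (strictly greater).
    n_greater = sum(1 for m in perm_means if m > real_mean)
    return (n_greater + 1) / (r + 1)

# Hypothetical case: the true labeling beats all 1000 permutations.
perm_means = [0.5] * 1000
print(permutation_p_value(0.8, perm_means))
```

The +1 terms can be read as counting the true labeling as one member of the null distribution. Using greater-than-or-equal-to instead of strictly greater would be a different (slightly more conservative) convention.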

Tuesday, November 13, 2012

permutation testing: groups of people

Here I describe three ways to do a group permutation test. The terminology and phrasing is my own invention; I'd love to hear other terms for these ideas.

First, the dataset.

I have data for n people. Assume there is no missing data, the data is balanced (equal numbers of both cases), and the number of examples of each case is the same in all people (e.g. 10 "a" and 10 "b" in each of three runs for every person).
For each person there is a single "real accuracy": the classification of their data with the true labeling. We can average these n values for the real group average accuracy.

Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person. In other words suppose relabeling 1 = abaabba ... aabbb. The "perm 1" box for each person is the accuracy obtained when relabeling 1 is used for that person's data; permutations of the same number for each person use the same relabeling.

Next, we generate the null distribution.

Here are the three schemes for doing a group permutation test. In the first, the linking between permutations described above is maintained: p group means are calculated, one for each of the p unique relabelings. I've used this scheme before.
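As a sketch of the first scheme in Python: assume the permutation accuracies are stored in an n-by-p array, where column j holds every person's accuracy under the same relabeling j. The array layout, the function name `stripes_null`, and the simulated accuracies are assumptions for illustration, not real data.

```python
import numpy as np

def stripes_null(perm_acc):
    """Scheme 1: one group mean per shared relabeling.

    perm_acc[i, j] is the accuracy of subject i under relabeling j (n x p).
    """
    # Averaging down each column keeps the linking between permutations.
    return perm_acc.mean(axis=0)

rng = np.random.default_rng(0)
n, p = 12, 1000
perm_acc = rng.normal(0.5, 0.05, size=(n, p))  # made-up accuracies near chance

null_means = stripes_null(perm_acc)
print(null_means.shape)
```

The result has exactly p group means, one for each relabeling.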

Alternatively, we could disregard the linking between permutations for each person, selecting one accuracy from each person to go into each group mean. Now we are not restricted to p group means; we can generate as many as we wish (well, restricted by the number of subjects and permutations, but this is usually a very, very large number). In my little sketch I use the idea of a basket: for each of the r group means we pick one permutation accuracy from each person at random, then calculate the average. We sample with replacement: some permutations could be in multiple baskets, others in none. This is the scheme used in (part of) Stelzer, Chen, & Turner, 2012, for example.
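A Python sketch of this basket idea, again assuming an n-by-p array of permutation accuracies (rows are people, columns are permutations). The function name `balanced_baskets_null` and the simulated numbers are made up, and this is not necessarily how Stelzer, Chen, & Turner implemented it.

```python
import numpy as np

def balanced_baskets_null(perm_acc, r, rng):
    """Scheme 2: each basket holds one randomly chosen permutation per subject."""
    n, p = perm_acc.shape
    # For each of the r baskets, pick a permutation index for every subject,
    # sampling with replacement.
    picks = rng.integers(0, p, size=(r, n))
    # baskets[k, i] is subject i's accuracy for the permutation picked in basket k.
    baskets = perm_acc[np.arange(n), picks]
    return baskets.mean(axis=1)

rng = np.random.default_rng(0)
perm_acc = rng.normal(0.5, 0.05, size=(12, 1000))  # made-up accuracies near chance

null_means = balanced_baskets_null(perm_acc, r=5000, rng=rng)
print(null_means.shape)
```

Here r can be larger than p, since the baskets are no longer tied to the columns.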

Or, we could fill our baskets more randomly: ignoring the separation into subjects. In other words, for each of the r group means we draw n permutation values, disregarding the subject identities. This is closer to the common idea of bootstrapping, but I don't know of any examples of using this with neuroimaging data.
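A Python sketch of this third scheme, assuming the same n-by-p array of permutation accuracies and assuming the draws are made with replacement (as in bootstrapping). The function name `pooled_null` and the simulated numbers are made up.

```python
import numpy as np

def pooled_null(perm_acc, r, rng):
    """Scheme 3: fill each basket from the pooled values, ignoring subjects."""
    n = perm_acc.shape[0]
    pooled = perm_acc.ravel()  # all n * p permutation accuracies in one pile
    # Draw n values for each of the r baskets, with replacement (an assumption).
    baskets = rng.choice(pooled, size=(r, n), replace=True)
    return baskets.mean(axis=1)

rng = np.random.default_rng(0)
perm_acc = rng.normal(0.5, 0.05, size=(12, 1000))  # made-up accuracies near chance

null_means = pooled_null(perm_acc, r=5000, rng=rng)
print(null_means.shape)
```

Unlike the previous scheme, a single basket can here hold several values from one person and none from another.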

Finally, the null distribution.

Once we have the collection of p or r group means we can generate the null distribution, and find the rank (and thus the p-value) of the real accuracy, as usual in permutation testing. To be complete, I usually calculate the p-value as ((number of permutation means) > (real mean)) / (r + 1). I actually don't think this exact equation is universal; it's also possible to use greater-than-or-equal-to for the rank, or r for the denominator.


I feel like the first scheme ("stripes") is the best, when it is possible. Feelings aren't proof. But it feels cleaner to use the same relabeling scheme in each person: when we don't, we are putting a source of variability into the null distribution (the label scheme used for each person) that isn't in the real data, and I always prefer to have the null distribution as similar to the real data as possible (in terms of structure and sources of variance).

A massive problem with the first scheme is that it is only possible when the data is very well-behaved: the same relabeling scheme can be applied to all subjects. As soon as even one person has some missing data, the entire scheme breaks. I've used the second scheme ("balanced baskets") in this case, usually by keeping the "striping" whenever possible, then bootstrapping the subjects with missing data. (This is sort of a hybrid of the first two schemes: I end up with p group means, but the striping isn't perfect.)

A small change to the diagram gets us from the second to the third scheme (losing the subject identities). This of course steps the null distribution even further from the real data, but that is not necessarily a bad thing. Stelzer, Chen, & Turner (2012) describe their scheme 2 as a "fixed-effects analysis" (line 388 of the proof). Would scheme 3 be like a random-effects analysis?

So, which is best? Should the first scheme be preferred when it is possible? Or should we always use the second? Or something else? I'll post some actual examples (eventually).