Friday, December 28, 2012

which labels to permute?

Which labels should be permuted for a permutation test of a single person's classification accuracy? A quick look found examples of MVPA methods papers using all three possibilities: relabel training set only (Pereira 2011), relabel testing set only (Al-Rawi 2012), or relabel both training and testing sets (Stelzer 2012). Note that I'm considering class label permutations only, "Test 1" in the parlance of Ojala 2010.

This reminds me of the situation with searchlight shape: many different implementations of the same idea are possible, and we really need to be more specific when we report results: often papers don't specify the scheme they used.

Which permutation scheme is best? As usual, I doubt there is a single, all-purpose answer. I put together this little simulation to explore one part of the effect of the choice: what do the null distributions look like under each label permutation scheme? The short answer is that the null distributions look quite similar (normal and centered on chance). However, when only the testing or only the training set labels are permuted there is a strong relationship between the proportion of labels permuted and accuracy; this relationship is absent when both are permuted.

simulation

These simulations use a simple mvpa-like dataset for one person, two classes (chance is 0.5), 10 examples of each class in each run, two runs, and full balance (same number of trials of each class in each run, no missings). I made the data by sampling from a normal distribution for each class, standard deviation 1, mean 0.15 for one class and -0.15 for the other, 50 voxels. I classified with a linear svm, c=1, partitioning on the runs (so 2-fold cross-validation). I used R; email me for a copy of the code. I ran the simulation 10 times (10 different datasets), with the same dataset used for each permutation scheme.
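A dataset with the structure described above can be sketched in a few lines. This is a Python illustration, not the R code I actually ran; the function name `make_dataset`, its parameters, and the seed are all made up for the example.

```python
import random

def make_dataset(n_runs=2, n_per_class=10, n_voxels=50, mean=0.15, sd=1.0, seed=0):
    """Return (data, labels, runs), with one row of voxel values per example.

    Class "a" is sampled from N(+mean, sd) and class "b" from N(-mean, sd),
    independently for every voxel.
    """
    rng = random.Random(seed)
    data, labels, runs = [], [], []
    for run in range(n_runs):
        for label, mu in (("a", mean), ("b", -mean)):
            for _ in range(n_per_class):
                data.append([rng.gauss(mu, sd) for _ in range(n_voxels)])
                labels.append(label)
                runs.append(run)
    return data, labels, runs

data, labels, runs = make_dataset()
# 40 examples (2 runs x 2 classes x 10 examples), 50 voxels each, fully balanced
print(len(data), len(data[0]), labels.count("a"), labels.count("b"))
```

The `runs` vector is what the cross-validation partitions on: train on the examples from one run, test on the other.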

1500 label permutations of each sort (training-only, testing-only, both) were run, chosen at random from all those possible. I coded it up such that the same relabeling was used for each of the cross-validation folds when only the training or testing data labels were permuted. For example, when permuting the testing labels only, the classifier was trained on the 1st run of the real data, then the permuted label scheme was applied to the 2nd run and the classifier tested; then the classifier was trained on the 2nd run of the real data, and the SAME permuted label scheme applied to the 1st run and the classifier tested. This was simply for convenience when coding, but it restricts the number of possible relabelings; another example of how the same idea can be implemented multiple ways.
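To make the three relabeling schemes concrete, here is a rough Python sketch of a single permutation under each. It is not my R code: a nearest-class-mean classifier stands in for the linear SVM (an assumption made only to keep the example dependency-free), and every function and variable name is invented for the illustration.

```python
import random

def make_run(rng, n_per_class=10, n_voxels=50, mean=0.15):
    """One run: n_per_class examples of class +1, then n_per_class of class -1."""
    labels = [1] * n_per_class + [-1] * n_per_class
    data = [[rng.gauss(mean * y, 1.0) for _ in range(n_voxels)] for y in labels]
    return data, labels

def class_means(data, labels):
    """'Train' the stand-in classifier: the mean pattern of each class."""
    means = {}
    for y in set(labels):
        rows = [x for x, lab in zip(data, labels) if lab == y]
        means[y] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def accuracy(means, data, labels):
    """Proportion of examples whose nearest class mean matches the given label."""
    def dist(x, m):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    hits = sum(min(means, key=lambda y: dist(x, means[y])) == lab
               for x, lab in zip(data, labels))
    return hits / len(labels)

def cv_accuracy(runs, train_labels, test_labels):
    """2-fold cross-validation, partitioning on the runs."""
    accs = []
    for test in (0, 1):
        train = 1 - test
        means = class_means(runs[train][0], train_labels[train])
        accs.append(accuracy(means, runs[test][0], test_labels[test]))
    return sum(accs) / 2

rng = random.Random(1)
runs = [make_run(rng), make_run(rng)]
true = [runs[0][1], runs[1][1]]          # true labels of run 1 and run 2

def shuffled(labels):
    return rng.sample(labels, len(labels))

perm = shuffled(true[0])                  # one relabeling, reused in both folds
both = [shuffled(true[0]), shuffled(true[1])]
schemes = {
    "real labels": (true, true),
    "training only": ([perm, perm], true),
    "testing only": (true, [perm, perm]),
    "both": (both, both),
}
for name, (train_labels, test_labels) in schemes.items():
    print(name, cv_accuracy(runs, train_labels, test_labels))
```

Repeating the last part 1500 times (drawing a fresh `perm` and `both` each time) gives one null distribution per scheme.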

This is a moderately difficult classification: the average accuracy of the true-labeled data (i.e. not permuted) was 0.695, ranging from 0.55 to 0.775 over the 10 repetitions. The accuracy of each dataset is given in the plot titles, and by a reddish dotted line.

For each permutation I recorded both the accuracy and the proportion of labels matching between that permutation and the real labels. When both training and testing labels are permuted this is an average over the two cross-validation folds. I plotted the classification accuracy of each permutation against the proportion of labels matching in the permutation and calculated the correlation. Histograms of each variable appear along the axes.
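The two quantities recorded for each permutation, and their correlation, can be sketched as below. This is a toy Python version of the testing-only case, not the simulation itself: the classifier's predictions are replaced by a hypothetical fixed vector that is right for 7 of the 10 examples in each class, and all names are made up.

```python
import random

def proportion_matching(relabeling, true_labels):
    """Proportion of positions where the relabeling agrees with the true labels."""
    return sum(r == t for r, t in zip(relabeling, true_labels)) / len(true_labels)

def pearson(xs, ys):
    """Pearson correlation, written out to avoid depending on any library."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
true_labels = ["a"] * 10 + ["b"] * 10
# Hypothetical fixed predictions: correct for 7 of the 10 examples in each class.
predictions = ["b"] * 3 + ["a"] * 7 + ["a"] * 3 + ["b"] * 7

matching, accuracies = [], []
for _ in range(1500):
    relabeling = rng.sample(true_labels, len(true_labels))
    matching.append(proportion_matching(relabeling, true_labels))
    # "Accuracy" when the test labels are replaced by the relabeling.
    accuracies.append(proportion_matching(predictions, relabeling))

print(round(pearson(matching, accuracies), 2))
```

Because the predictions are fixed and mostly agree with the true labels, a relabeling that resembles the true labels should also tend to agree with the predictions, so the correlation should come out positive in this toy case.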

Training set labels only permuted:

Testing set labels only permuted:

Both training and testing set labels permuted:

Here are density plots of all 30 null distributions, overplotted. These are the curves that appear along the y-axis in the above graphs.

observations

There is a strong linear relationship between the number of labels changed in a permutation and its accuracy when either the training or testing set labels alone are shuffled: the more the relabeling resembles the true data labels, the better the accuracy. When all labels are permuted there isn't much of a relationship.

Despite the strong correlation, the null distributions resulting from each permutation scheme are quite similar (density plot overlap graph). This makes sense, since the relabelings are chosen at random, so relabelings quite similar and quite dissimilar to the true labeling are included. The null distributions would be skewed if the labels for the permutations were not chosen at random (e.g. centered above chance if only mostly-matching relabelings were used).

comments

Intuitively, I prefer the permute-both scheme: more permutations are possible, and the strong correlation is absent. But since the resulting null distributions are so similar, I can't say that permuting just one set of labels or the other is really worse, much less invalid. This is quite a simplified simulation; I think it would be prudent to check null distributions and relabeling schemes in actual use, since non-random label samplings may turn up.

references

  • Al-Rawi, M.S., Cunha, J.P.S., 2012. On using permutation tests to estimate the classification significance of functional magnetic resonance imaging data. Neurocomputing 82, 224-233. http://dx.doi.org/10.1016/j.neucom.2011.11.007
  • Ojala, M., Garriga, G.C., 2010. Permutation tests for studying classifier performance. J. Mach. Learn. Res. 11, 1833-1863.
  • Pereira, F., Botvinick, M., 2011. Information mapping with pattern classifiers: A comparative study. Neuroimage 56, 476-496.
  • Stelzer, J., Chen, Y., Turner, R., 2013. Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. Neuroimage 65, 69-82.

Friday, December 7, 2012

Try R tutorial

R can have a steep learning curve if you're new to programming. Try R is a set of online tutorials that are visually attractive and light-hearted, but cover the basics. Worth a "try"!

Thursday, November 29, 2012

surface or volume searchlighting ... not a mixture

I saw a report with an analysis that used volumetric searchlights (spheres) within a grey matter mask (like the one at the right, though theirs was somewhat more dilated). The cortical thickness in the mask was usually around 10 voxels, and they ran a 3-voxel radius searchlight.

I do not think this is a good idea.

My short general suggestion is to do either volumetric searchlighting of the whole brain (or a large 3D subset of it - like the frontal lobes), or do surface-based searchlighting (e.g. Chen, Oosterhof), but not mix the two.

A fundamental problem is that a fairly small proportion of the searchlights will be fully in the brain using a volumetric grey matter mask.
In this little cartoon the yellow circles represent the spherical searchlights, and the two grey lines the extent of the grey matter mask. Many searchlights do not fully fall into the mask; the edges will be sampled differently than the center.

This cartoon tries to convey the same idea: only the strip in the middle of the mask is mapped by searchlights completely in the brain. If informative voxels are evenly distributed in the grey matter strip, searchlights in the middle (fully in the brain) could be more likely to be significant (depending on the exact classifier, information distribution, etc.), which gives a distorted impression.
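A back-of-the-envelope calculation shows how narrow that middle strip can be. The Python sketch below assumes a highly idealized mask: a flat slab a fixed number of voxels thick, with a searchlight counted as "fully inside" only if it does not extend past either face. Real cortex is curved and variable in thickness, so treat the number as a rough guide only; the function name is made up.

```python
def fraction_fully_inside(thickness, radius):
    """Fraction of depths in a flat slab where a searchlight fits entirely.

    Depths are voxel indices 0 .. thickness-1 across the slab; a searchlight
    centered at depth d spans d-radius .. d+radius.
    """
    inside = [d for d in range(thickness)
              if d - radius >= 0 and d + radius <= thickness - 1]
    return len(inside) / thickness

# Mask about 10 voxels thick, 3-voxel radius searchlights, as in the report.
print(fraction_fully_inside(10, 3))   # 0.4
```

Under these assumptions fewer than half of the searchlight centers have a complete searchlight; the rest are truncated by the mask edge.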

I've heard it suggested that it's better to use a grey matter mask because that's where the information should be. I don't think that's a good motivation. For one thing, running the searchlight analysis on the whole brain can serve as a control: if the "most informative areas" turn out to be the ventricles, something went wrong. For another, spatial normalization is not perfect. Depending on how things are run (searchlight on spatially-normalized brains or not, etc), there is likely to be some blurring of white and grey matter in the functional images. Letting the searchlights span both may capture more of the information actually present.

One final point. Volumetric searchlighting will put non-strip-wise-adjacent areas (areas touching across a sulcus) into the same searchlight. This might be ok, given the spatial uncertainties accepted in some fMRI. But if you want to avoid this, surface-based methods are the way to go, not a volumetric grey matter mask.

Monday, November 19, 2012

postdoc position: Susanne Quadflieg's lab

Susanne Quadflieg is looking for a postdoc. The position is fMRI and social psychology research, not MVPA methods, but might interest some of you.

permutation testing: feedback

MS Al-Rawi sent some feedback about my last post and kindly agreed that I could post it here. His comments/questions are in quote blocks, my responses in regular type.

"1- The title: 'groups of people'. Are you trying to perform permutation testing for a group of subjects (people), so, is it 'groups of people', or group analysis?"

I meant group analysis (aka 2nd level analysis): "is this effect present in these subjects?" The goal is to generate a significance level for the group-level statistic, such as the accuracy averaged across subjects.

"2- You also wrote; 'Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person'. So, how do you get this set of  'p unique relabelings', or you don't have to find them, you just do find p times relabelings?"
Sorry for the confusion; I was struggling a bit with terminology. What I mean is that there is a single set of label permutations which are used for every person, rather than generating unique label permutations for everyone. The set of label permutations is created in the 'usual' way. If the number of labels is small all possible permutations should be used, but when there are too many for that to be practical a random subset can be generated.

For example, suppose there are two classes (a and b) and three examples of each class. If the true labeling (i.e. row order of the data table) is aaabbb, possible permutations include aabbba, ababab, etc.
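For a case this small the full set of relabelings can be listed directly. A quick Python sketch (variable names are mine):

```python
from itertools import permutations
import random

true_labels = "aaabbb"
# permutations() treats the three a's (and b's) as distinct, so collapse
# duplicates with a set to keep only the unique relabelings.
relabelings = sorted({"".join(p) for p in permutations(true_labels)})
print(len(relabelings))   # 20, including the true labeling aaabbb itself

# When there are too many to run them all, use a random subset instead.
rng = random.Random(0)
subset = rng.sample(relabelings, 5)
print(subset)
```

The same list (or the same random subset) is then applied to every person.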

"3- 'applied such that there is a linking between the permutations in each person'. I cannot figure out what that is? I can tell that it gives a biased distribution, either all yielding best accuracy(s) that are closer to the real one, or all yielding closer to chance level accuracy."
What I meant was that, if all subjects have the same number and type of examples, the same (single) permutation scheme can be used for everyone. To continue the example, we would calculate the accuracy with the aabbba relabeling in every subject, then the accuracy with the ababab relabeling, etc.

I guess this gives a biased distribution in the sense that fewer relabelings are included ... When a random subset of the permutations has to be used (because there are too many to calculate them all), under scheme 2 you could generate a separate set of permutations for each person (e.g. aabbba might be used with subject 3 but not subjects 1, 2, or 4). The single set of permutations used for everyone is not biased (assuming it was generated properly), but does sample a smaller number of relabelings than if you generate unique relabelings for each person.

"4- Since each person/subject (data) has its own classifier (or a correlated set of classifiers due to using k-fold cross validation), is it legitimate to take the mean of each of the 'p unique relabelings' (as you showed in scheme 1)?"

This is one of the reasons why I made the post: to ask for opinions on its legitimacy! It feels more fair to me to take the across-subjects mean when the same labeling is used for everyone than averaging values from different labelings. But a feeling isn't proof, and it's possible that scheme 1 is too stringent.

"5- The formula; ((number of permutation means) > (real mean)) / (r + 1)
I think using ((number of permutation means) > (real mean) +1 ) / (r + 1) would prevent getting p=0 value when the classification accuracy is highly above change. We shouldn't be getting 0 in any case, but that is most probably to happen because the number of permutations is limited by the computational power, and the randomization labeling might not be perfectly tailored enough to generate an idealistic distribution (e.g. at one permutation should capture the original labeling or one that is highly close to it and thus give high classification accuracy)."
Yes, I think this is right: if the true labeling is better than all 1000 permutations we want the resulting p-value to come out as about 0.001 (1/1001), not 0.
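The difference between the two formulas is easy to see in a small Python sketch. The function name and the `add_one` switch are invented for the illustration, and the permutation means are made-up values.

```python
def perm_p_value(real_mean, perm_means, add_one=True):
    """Proportion of permutation means greater than the real mean.

    With add_one=True the numerator gets the extra +1 suggested above,
    so the result can never be exactly 0.
    """
    r = len(perm_means)
    count = sum(m > real_mean for m in perm_means)
    return (count + (1 if add_one else 0)) / (r + 1)

# Made-up case: the real mean beats all 1000 permutation means.
perm_means = [0.5] * 1000
print(perm_p_value(0.7, perm_means, add_one=False))   # 0.0
print(perm_p_value(0.7, perm_means, add_one=True))    # 1/1001, just under 0.001
```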

Tuesday, November 13, 2012

permutation testing: groups of people

Here I describe three ways to do a group permutation test. The terminology and phrasing is my own invention; I'd love to hear other terms for these ideas.

First, the dataset.

I have data for n people. Assume there are no missings, the data is balanced (equal numbers of both cases), and that the number of examples of each case is the same in all people (e.g. 10 "a" and 10 "b" in each of three runs for every person).
For each person there is a single "real accuracy": the classification of their data with the true labeling. We can average these n values for the real group average accuracy.

Assume further that there is a single set of p unique relabelings (label permutations), applied such that there is a linking between the permutations in each person. In other words suppose relabeling 1 = abaabba ... aabbb. The "perm 1" box for each person is the accuracy obtained when relabeling 1 is used for that person's data; permutations of the same number for each person use the same relabeling.

Next, we generate the null distribution.

Here are the three schemes for doing a group permutation test. In the first, the linking between permutations described above is maintained: p group means are calculated, one for each of the p unique relabelings. I've used this scheme before.

Alternatively, we could disregard the linking between permutations for each person, selecting one accuracy from each person to go into each group mean. Now we are not restricted to p group means; we can generate as many as we wish (well, restricted by the number of subjects and permutations, but this is usually a very, very large number). In my little sketch I use the idea of a basket: for each of the r group means we pick one permutation accuracy from each person at random, then calculate the average. We sample with replacement: some permutations could be in multiple baskets, others in none. This is the scheme used in (part of) Stelzer, Chen, & Turner, 2012, for example.

Or, we could fill our baskets more randomly: ignoring the separation into subjects. In other words, we draw n permutation values each of r times, disregarding the subject identities. This is closer to the common idea of bootstrapping, but I don't know of any examples of using this with neuroimaging data.
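The three schemes can be written down compactly. In this Python sketch `acc[s][k]` holds the accuracy of subject s under relabeling k; the function names follow my own nicknames for the schemes, and the accuracies are random made-up numbers rather than classifier output.

```python
import random

def scheme1_stripes(acc):
    """One group mean per relabeling: the linking across subjects is kept."""
    n, p = len(acc), len(acc[0])
    return [sum(acc[s][k] for s in range(n)) / n for k in range(p)]

def scheme2_baskets(acc, r, rng):
    """Each group mean takes one randomly chosen permutation accuracy per subject."""
    n = len(acc)
    return [sum(rng.choice(acc[s]) for s in range(n)) / n for _ in range(r)]

def scheme3_pooled(acc, r, rng):
    """Each group mean averages n values drawn from all subjects pooled together."""
    n = len(acc)
    pooled = [a for subject in acc for a in subject]
    return [sum(rng.choices(pooled, k=n)) / n for _ in range(r)]

rng = random.Random(0)
n, p, r = 4, 100, 1000
# Made-up permutation accuracies, for illustration only.
acc = [[rng.gauss(0.5, 0.05) for _ in range(p)] for _ in range(n)]

null1 = scheme1_stripes(acc)           # p group means
null2 = scheme2_baskets(acc, r, rng)   # r group means
null3 = scheme3_pooled(acc, r, rng)    # r group means
print(len(null1), len(null2), len(null3))   # 100 1000 1000
```

Both `rng.choice` and `rng.choices` sample with replacement, so a given permutation accuracy can land in several baskets or in none.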

Finally, the null distribution.

Once we have the collection of p or r group means we can generate the null distribution, and find the rank (and thus the p-value) of the real accuracy, as usual in permutation testing. To be complete, I usually calculate the p-value as ((number of permutation means) > (real mean)) / (r + 1). I actually don't think this exact equation is universal; it's also possible to use greater-than-or-equal-to for the rank, or r for the denominator.

Thoughts.

I feel like the first scheme ("stripes") is the best, when it is possible. Feelings aren't proof. But it feels cleaner to use the same relabeling scheme in each person: when we don't, we are putting a source of variability into the null distribution (the label scheme used for each person) that isn't in the real data, and I always prefer to have the null distribution as similar to the real data as possible (in terms of structure and sources of variance).

A massive problem with the first scheme is that it is only possible when the data is very well-behaved: the same relabeling scheme can be applied to all subjects. As soon as even one person has some missing data, the entire scheme breaks. I've used the second scheme ("balanced baskets") in this case, usually by keeping the "striping" whenever possible, then bootstrapping the subjects with missing data. (This is sort of a hybrid of the first two schemes: I end up with p group means, but the striping isn't perfect.)

A small change to the diagram gets us from the second to the third scheme (losing the subject identities). This of course steps the null distribution even further from the real data, but that is not necessarily a bad thing. Stelzer, Chen, & Turner (2012) describe their scheme 2 as a "fixed-effects analysis" (line 388 of the proof). Would scheme 3 be like a random-effects analysis?

So, which is best? Should the first scheme be preferred when it is possible? Or should we always use the second? Or something else? I'll post some actual examples (eventually).

Wednesday, October 31, 2012

needles and haystacks: information mapping quirks

I really like this image and analogy for describing some of the distortions that can arise from searchlight analysis: a very small informative area ("the needle") can turn into a large informative area in the information map ("the haystack"), but the reverse is also possible: a large informative area can turn into a small area in the information map ("haystack in the needle").

I copied this image from the poster Matthew Cieslak, Shivakumar Viswanathan, and Scott T. Grafton presented last year at SfN (poster 626.16, Fitting and Overfitting in Searchlights, SfN2011). The current article covers some of the same issues as the poster, providing a mathematical foundation and detailed explanation.

They step through several proofs of information map properties, using reasonable assumptions. One result I'll highlight here is that the information map's representation of a fixed-size informative area will grow as searchlight radius increases (my phrasing, not theirs). Note that this (and the entire paper) is describing the single-subject, not group, level of analysis.
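A crude way to get a feel for the growing property is to count voxels. The Python sketch below is my own simplification, not one of the paper's proofs: it assumes a single informative voxel, spherical searchlights with the radius given in voxels, and that every searchlight containing the informative voxel gets marked as informative. Under those assumptions the number of marked searchlight centers equals the number of voxels in one searchlight.

```python
def searchlight_size(radius):
    """Number of voxels whose centers lie within `radius` voxels of the center."""
    r = int(radius)
    return sum(1
               for x in range(-r, r + 1)
               for y in range(-r, r + 1)
               for z in range(-r, r + 1)
               if x * x + y * y + z * z <= radius * radius)

for radius in (1, 2, 3, 4):
    print(radius, searchlight_size(radius))
# 1 7
# 2 33
# 3 123
# 4 257
```

So in this idealized case one "needle" voxel turns into a "haystack" of 7 voxels at radius 1 and 257 voxels at radius 4.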

This fundamental 'growing' property is responsible for many of the strange things that can appear in searchlight maps, such as the edge effects I posted about here. As Viswanathan et al. point out in the paper, it also means that interpreting the number of voxels found significant in a searchlight analysis is fraught with danger: it is affected by many factors other than the amount and location of informative voxels. They also show that it is possible for just 430 properly-spaced informative voxels to cause the entire brain to be marked as informative in the information map, using just 8 mm radius searchlights (that's not particularly large in the literature).

I recommend taking a look at this paper if you generate or interpret information maps via searchlight analysis, particularly if you have a mathematical bent. It nicely complements diagram- and description-based explanations of searchlight analysis (including, hopefully soon, my own). It certainly does not include all the aspects of information mapping, but provides a solid foundation for those it does include.


Shivakumar Viswanathan, Matthew Cieslak, & Scott T. Grafton (2012). On the geometric structure of fMRI searchlight-based information maps. arXiv: 1210.6317v1