Sunday, February 10, 2013

where to start with MVPA?

I was recently asked for suggestions about starting with MVPA, so here are some ideas. This is certainly not an exhaustive list; feel free to submit others.

literature

  • Pereira, F., Mitchell, T., Botvinick, M., 2009. Machine learning classifiers and fMRI: A tutorial overview. Neuroimage 45, S199-S209.
  • Etzel, J.A., Gazzola, V., Keysers, C., 2009. An introduction to anatomical ROI-based fMRI classification analysis. Brain Research 1282, 114-125.
  • Mur, M., Bandettini, P.A., Kriegeskorte, N., 2009. Revealing representational content with pattern-information fMRI--an introductory guide. Soc Cogn Affect Neurosci, nsn044.
  • Haynes, J.D., 2015. A primer on pattern-based approaches to fMRI: Principles, pitfalls, and perspectives. Neuron 87, 257-270. doi: 10.1016/j.neuron.2015.05.025
  • Mitchell, T.M., Hutchinson, R., Niculescu, R.S., Pereira, F., Wang, X., 2004. Learning to Decode Cognitive States from Brain Images. Machine Learning 57, 145-175.
  • Norman, K.A., Polyn, S.M., Detre, G.J., Haxby, J.V., 2006. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends In Cognitive Sciences 10, 424-430.

software

There are a few software packages specifically for MVPA, though many people (myself included) use quite a bit of their own code. A nice comparison of packages is towards the end of Hebart 2015.

workshops/conferences

publicly-available fMRI datasets

other things


updated 15 October 2013: added a link to PRoNTO; added the workshops section
updated 16 January 2015: added the link to The Decoding Toolbox and Hebart 2015.
updated 4 February 2015: added the public datasets section.
updated 11 September 2015: changed the reference from Haynes 2006 to Haynes 2015.
updated 21 March 2017: added a link to CCN.
updated 25 September 2017: added BrainIAK

Friday, February 8, 2013

comparing null distributions: changing the bias

Here is another example in this series of simulations exploring the null distributions resulting from various permutation tests. In this case I changed the "bias": a higher bias makes the samples easier to classify since the random numbers making up each class are drawn from a normal distribution with standard deviation 1 and mean either bias or (-1 * bias). The random seeds are the same here as in the previous examples, so the distributions are comparable.
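The way the data are generated can be sketched in a few lines. This is only an illustration in Python, not the R code behind these posts; the function name `make_dataset`, its defaults (10 examples per class per run, 50 voxels, as in the earlier permutation post), and the seed handling are assumptions made for the example.

```python
import random

def make_dataset(n_runs=2, n_per_class=10, n_voxels=50, bias=0.15, seed=0):
    """Return a list of (run, label, voxels) examples.

    Class "a" voxels are drawn from N(+bias, 1) and class "b" voxels from
    N(-bias, 1), so a larger bias makes the two classes easier to separate.
    """
    rng = random.Random(seed)  # fixed seed keeps repetitions comparable
    examples = []
    for run in range(n_runs):
        for label, mean in (("a", bias), ("b", -bias)):
            for _ in range(n_per_class):
                voxels = [rng.gauss(mean, 1.0) for _ in range(n_voxels)]
                examples.append((run, label, voxels))
    return examples

# Same seed, two bias levels: only the class means differ.
weak = make_dataset(n_runs=2, bias=0.05, seed=1)
strong = make_dataset(n_runs=2, bias=0.15, seed=1)
print(len(weak), len(strong))
```

Reusing the seed while changing only `bias` mirrors the idea of keeping the random seeds constant so that the resulting distributions can be compared.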

As before, here are the null distributions resulting from ten simulations using either a bias of 0.05 or 0.15.

These are from using two runs:

and these from using four runs:

The null distributions within each pane pretty much overlap: the curves don't change as much with the different biases as they do with changing the number of runs or the permutation scheme.

The true-labeled accuracies, and so the p-values, change quite a lot, though:
with two runs, bias = 0.05
with four runs, bias = 0.05
As in the simulations with more signal (bias = 0.15), there is more variability in true-labeled accuracy in the dataset with two runs than the one with four runs. Some of the two-run simulations (#5 and #8) have accuracy below chance, and so p-values > 0.9. But two other two-run simulations (#3 and #7) have accuracy above 0.7 and p-values below 0.05. The best four-run simulation accuracy is 0.6, and none have p-values below 0.05 (looking at permuteBoth).

So, in these simulations, changing the amount of difference between the classes did not substantially change the null distributions (particularly when permuting the entire dataset together). Is this sensible? I'm still thinking, but it does strike me as reasonable that if the relabeling fully disrupts the class structure then the amount of signal actually in the data should have less of an impact on the null distribution than other aspects of the data such as number of examples.

comparing null distributions: 2 or 4 runs

Carrying on the previous example, this post shows the null distributions resulting from running the simulation with two or four runs. Since the bias (difference between the classes) and number of examples per run per class (10) are kept constant, increasing the number of runs makes the classification easier since there are more training examples.

The null distributions are narrower when four runs are included:

Since the null distributions are wider and the classification accuracy is worse for two runs, the p-values are less significant for two runs than for four (tables below). For example, repetition #9 had an accuracy of 0.72 with two runs, which resulted (when permuting the training data only) in a rank of 10. Repetition #6 with four runs also had an accuracy of 0.72, but this time had a rank of 0 (the true-labeled data was more accurate than all permutations).
with two runs
with four runs
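To make the link between rank and p-value concrete, here is a small Python sketch. The function name is hypothetical, and two details are assumptions rather than something stated in the post: ties count against the true labeling (rank = number of permutations at least as accurate), and the p-value uses the common (rank + 1) / (number of permutations + 1) convention.

```python
def permutation_p_value(true_accuracy, null_accuracies):
    """Return (rank, p) for a one-sided test that accuracy exceeds chance.

    rank: how many permuted-label accuracies are >= the true-labeled one
          (assumed tie handling).
    p:    (rank + 1) / (n_permutations + 1), counting the true labeling as
          one possible relabeling (assumed convention).
    """
    rank = sum(acc >= true_accuracy for acc in null_accuracies)
    p = (rank + 1) / (len(null_accuracies) + 1)
    return rank, p

# Toy null distributions (made-up values, 1000 permutations each).
wide_null = [0.5] * 990 + [0.75] * 10   # ten permutations beat 0.72
narrow_null = [0.5] * 1000              # none do

print(permutation_p_value(0.72, wide_null))
print(permutation_p_value(0.72, narrow_null))
```

The same true-labeled accuracy gives a smaller p-value against the narrower null distribution, which is the pattern in the two repetitions described above.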


The true-labeled data accuracy (given as "real" in the tables) varies quite a bit more over the ten repetitions with only two runs compared to four runs (.57 to .9 with two runs, .69 to .84 with four runs). This strikes me as expected: the classification with only two runs is much more difficult - we have much less statistical power - and so is less stable. The permutation distributions should also be wider (have more variance) when we have less power.

which labels to permute: with 4 runs

There has been an interesting thread about permutation testing on the pyMVPA mailing list recently. In a previous blog post about permutation testing I used two runs with partitioning on the runs, for two-fold cross-validation. In that simulation the null distributions were quite similar regardless of whether the training set only, testing set only, or entire dataset (training and testing, as a unit) were relabeled.

The pyMVPA discussion made me suspect that the overlapping null distributions are a special case: when there are only two runs (used for the cross-validation), permuting either the training set only or the testing set only is permuting half of the data. When there are more than two runs, permuting the training set only changes the labels on more examples than permuting the testing set only.

I repeated that simulation creating four runs of data instead of just two. This makes the classification easier, since it is trained on (in this case) 60 examples (3 runs * 10 examples of each class in each run) instead of just 20 examples. As before, I ran each simulation ten times, and did 1000 label rearrangements (chosen at random).

I plan more posts describing permutation schemes, but I'll summarize here. I always permute the class labels within each run separately (a stratified scheme). In this example, partitioning is also done using the runs, so the number of trials of each class in each cross-validation fold is always the same. I precomputed the relabelings and tried to keep as much constant across the folds as possible. For example, when permuting the testing set only I use the same set of relabelings (e.g. permutation #1 = aabbbaab when the true labeling is aaaabbbb) for all test sets (when run 1 is left out for permutation 1, when run 2 is left out for permutation 1, etc.). This creates (maintains) some dependency between the cross-validation folds.
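A within-run (stratified) relabeling can be sketched as follows. This is an illustrative Python version, not the code used for these simulations; `permute_within_runs` and the toy run layout (two runs of aaaabbbb) are assumptions for the example.

```python
import random

def permute_within_runs(labels, runs, rng):
    """Shuffle class labels separately inside each run (stratified scheme).

    Labels never move between runs, so every run keeps the same number of
    examples of each class as in the true labeling.
    """
    permuted = list(labels)
    for run in sorted(set(runs)):
        idx = [i for i, r in enumerate(runs) if r == run]
        shuffled = [labels[i] for i in idx]
        rng.shuffle(shuffled)
        for i, lab in zip(idx, shuffled):
            permuted[i] = lab
    return permuted

runs = [0] * 8 + [1] * 8            # two runs, eight examples each
labels = list("aaaabbbb") * 2       # true labeling within each run

# Precompute the relabelings once so the same set can be reused
# for every permutation scheme and cross-validation fold.
rng = random.Random(1)
relabelings = [permute_within_runs(labels, runs, rng) for _ in range(1000)]
print("".join(relabelings[0][:8]), "".join(relabelings[0][8:]))
```

Precomputing the list is what makes it possible to apply permutation #1 to every left-out run, at the cost of the dependency between folds mentioned above.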

Here are the null distributions that resulted from running the simulation 10 times. Only the accuracy > chance side is shown (they're reasonably symmetrical), with the dotted vertical line showing the accuracy of the true-labeled dataset. The lines are histogram-style: the number of samples falling into the 0.05-wide accuracy bins. This is a fairly easy classification, and would turn out highly significant under all permutation schemes in all repetitions.


Permuting the training sets only tends to result in the lowest-variance null distributions and permuting the testing sets only the highest, with permuting both often between. This is easier to see when density plots are overplotted:
This pattern is consistent with Yaroslav's suggestion that permuting the training set only may provide better power - a narrower null distribution means you're more likely to find your real accuracy in the extreme right tail, and so significant. But I don't know which scheme is better in terms of error rates, etc. - which will get us closer to the truth?

While very much still in progress, the code I used to make these graphs is here. It lets you vary the number of runs, trials, bias (how different the classes are), and whether the relabelings are done within each run or not.

UPDATE: see the post describing simulating data pyMVPA-style for dramatically different curves.

Friday, January 18, 2013

low-accuracy subjects' influence on group statistics

background

This post comes from something I saw in some of my actual searchlight results. The searchlight analysis was performed within a large anatomical area (~1500 voxels, SVM within fairly small searchlights), separately in ~15 people. In most subjects, almost all searchlights classified quite well, but in two subjects most searchlights classified quite poorly. As usual, there were a few low-accuracy searchlights in the "good" subjects (people in which most searchlights classified well), and a few high-accuracy searchlights in the "bad" subjects (people in which most searchlights did not classify accurately).

I made the group-level results by performing a t-test at each voxel: are the subjects' accuracies greater than chance? (I did not smooth the individual searchlight maps first.) I saw a worrisome pattern in the group maps: the peak (best t-value) areas tended to coincide with where the "bad" subjects had their best (most accurate) searchlights. This outcome is not surprising (see below), but is not what we want. Applying a strict threshold to the group map will identify voxels, but those particular voxels came out as peaks because the "bad" subjects had accurate searchlights in those locations, not because the voxels were more accurate across subjects: the group map was overly influenced by the low-accuracy subjects.

simulation

Here's a little simulation to show the effect (R code is here). The code creates 15 searchlights, each of which contains 10 voxels. There are 12 people in the dataset, 10 of whom have signal in all searchlights, and 2 of whom have signal in only the first two searchlights. It's a two-class classification, using a linear SVM, partitioning on the two "runs". Voxel values were sampled from a normal distribution, with different class means for the voxels with signal but the same mean for voxels with no signal.

The classification accuracies of the 12 people in each of the 15 searchlights are summarized in these boxplots (the figure should get larger if you click on it; run the code to get the underlying numbers). The accuracies vary a bit over the people and searchlights, as expected, given how the data was simulated. All of the "boxes" are well above chance, and all of the t-values (see figure labels) are above 5. So this seems reasonable.

The searchlights with the highest t-values are 1 and 2: the two searchlights which I assigned to have signal in the two "bad" subjects. You can see why in the boxplots: the bottom whisker in searchlights 1 and 2 only reaches 0.6, while all the others have a whisker or outlier closer to chance. Some searchlights (like 8) have two outliers: the two "bad" subjects.

So the t-test didn't make an error: the first two searchlights should have the highest t-values, since they have the most individual accuracies the furthest above chance. But this could have led to an improper conclusion: if we had applied a high threshold (t > 8 in this simulation) we might think the first two searchlights are where the information falls, whereas in actuality it is evenly distributed.
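The effect is easy to reproduce in a stripped-down form. The Python sketch below is not the R simulation linked above: instead of classifying simulated voxels it draws accuracies directly, with assumed values (about 0.8 where a subject has signal, about 0.5 where not, standard deviation 0.05), and computes a one-sample t-value against chance in each searchlight.

```python
import math
import random
import statistics

def t_vs_chance(accuracies, chance=0.5):
    """One-sample t-statistic testing mean accuracy against chance."""
    n = len(accuracies)
    se = statistics.stdev(accuracies) / math.sqrt(n)
    return (statistics.mean(accuracies) - chance) / se

rng = random.Random(0)
n_searchlights, n_good, n_bad = 15, 10, 2

t_values = []
for s in range(n_searchlights):
    # "Good" subjects have signal (assumed accuracy ~0.8) everywhere.
    accs = [rng.gauss(0.8, 0.05) for _ in range(n_good)]
    # "Bad" subjects have signal only in the first two searchlights.
    bad_mean = 0.8 if s < 2 else 0.5
    accs += [rng.gauss(bad_mean, 0.05) for _ in range(n_bad)]
    t_values.append(t_vs_chance(accs))

for s, t in enumerate(t_values, start=1):
    print(f"searchlight {s:2d}: t = {t:5.1f}")
```

Every searchlight carries the same information in the ten "good" subjects, yet the first two stand out in the t-values simply because the two "bad" subjects no longer inflate the variance there.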


what to do?

Follow-up testing or different group-level statistics can help to catch this type of situation. As is often the case, precise hypotheses are better (for example, if you want to find searchlights with significant classification in most subjects individually, test for that directly - don't assume that the best searchlights in a t-test will also have that property).

Here are a few suggestions for follow-up testing:
  • Look at the individual subjects' searchlight maps: Are a few subjects quite different than the rest (such as here, where two people had quite low accuracies compared to the others)? 
  • Sensitivity testing can help: how much does the group-level map change when individual subjects are left out? 
  • Do the group-level results align more closely with some subjects than others?
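One simple form of sensitivity testing is to recompute the group statistic with each subject left out in turn. The Python sketch below does this for a single searchlight; the accuracies are made-up values (ten "good" subjects, two "bad"), and the function names are hypothetical.

```python
import math
import statistics

def t_vs_chance(accuracies, chance=0.5):
    """One-sample t-statistic testing mean accuracy against chance."""
    n = len(accuracies)
    se = statistics.stdev(accuracies) / math.sqrt(n)
    return (statistics.mean(accuracies) - chance) / se

def leave_one_out_t(accuracies, chance=0.5):
    """t-value recomputed with each subject left out in turn."""
    return [t_vs_chance(accuracies[:i] + accuracies[i + 1:], chance)
            for i in range(len(accuracies))]

# Hypothetical accuracies for one searchlight; the last two are "bad" subjects.
accs = [0.80, 0.78, 0.83, 0.76, 0.81, 0.79, 0.84, 0.77, 0.82, 0.80, 0.52, 0.49]

full_t = t_vs_chance(accs)
loo_t = leave_one_out_t(accs)
print(f"all subjects: t = {full_t:.2f}")
for i, t in enumerate(loo_t, start=1):
    print(f"without subject {i:2d}: t = {t:.2f} (change {t - full_t:+.2f})")
```

Subjects whose removal shifts the t-value much more than the others are the ones to look at more closely; repeating this over all searchlights shows how much the group map depends on them.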

Friday, December 28, 2012

which labels to permute?

Which labels should be permuted for a permutation test of a single person's classification accuracy? A quick look found examples of MVPA methods papers using all three possibilities: relabel training set only (Pereira 2011), relabel testing set only (Al-Rawi 2012), or relabel both training and testing sets (Stelzer 2013). Note that I'm considering class label permutations only, "Test 1" in the parlance of Ojala 2010.

This reminds me of the situation with searchlight shape: many different implementations of the same idea are possible, and we really need to be more specific when we report results: often papers don't specify the scheme they used.

Which permutation scheme is best? As usual, I doubt there is a single, all-purpose answer. I put together this little simulation to explore one part of the effect of the choice: what do the null distributions look like under each label permutation scheme? The short answer is that the null distributions look quite similar (normal and centered on chance). However, there is a strong relationship between the proportion of labels permuted and accuracy when only the test or training set labels are permuted, and no such relationship when both are permuted.

simulation

These simulations use a simple MVPA-like dataset for one person: two classes (chance is 0.5), 10 examples of each class in each run, two runs, and full balance (same number of trials of each class in each run, no missing examples). I made the data by sampling from a normal distribution for each class, standard deviation 1, mean 0.15 for one class and -0.15 for the other, 50 voxels. I classified with a linear SVM, c=1, partitioning on the runs (so 2-fold cross-validation). I used R; email me for a copy of the code. I ran the simulation 10 times (10 different datasets), with the same dataset used for each permutation scheme.

1000 label permutations of each sort (training-only, testing-only, both) were run, chosen at random from all those possible. I coded it up such that the same relabeling was used for each of the cross-validation folds when only the training or testing data labels were permuted. For example, the classifier was trained on the 1st run of the real data, then the permuted label scheme was applied to the 2nd run and the classifier tested; then the classifier was trained on the 2nd run of the real data, and the SAME permuted label scheme applied to the 1st run and the classifier tested. This was simply for convenience when coding, but restricts the number of possibilities; another example of how the same idea can be implemented multiple ways.
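The three schemes can be sketched in Python as below. This is a simplified illustration, not the R code used for the post: it swaps in a nearest-class-mean classifier for the linear SVM, runs only 100 permutations to keep pure-Python run time short, and gives each run its own shuffled labels (reused across folds and schemes) rather than applying one relabeling to every run. All function names are hypothetical.

```python
import random

def make_run(rng, n_per_class=10, n_voxels=50, bias=0.15):
    """One run of data: class "a" ~ N(+bias, 1), class "b" ~ N(-bias, 1)."""
    labels, rows = [], []
    for label, mean in (("a", bias), ("b", -bias)):
        for _ in range(n_per_class):
            labels.append(label)
            rows.append([rng.gauss(mean, 1.0) for _ in range(n_voxels)])
    return labels, rows

def class_means(labels, rows):
    """Mean voxel pattern of each class (the "training" step)."""
    means = {}
    for cls in sorted(set(labels)):
        members = [row for lab, row in zip(labels, rows) if lab == cls]
        means[cls] = [sum(col) / len(members) for col in zip(*members)]
    return means

def accuracy(means, labels, rows):
    """Proportion of rows whose nearest class mean matches their label."""
    correct = 0
    for lab, row in zip(labels, rows):
        dist = {cls: sum((x - m) ** 2 for x, m in zip(row, mu))
                for cls, mu in means.items()}
        correct += min(dist, key=dist.get) == lab
    return correct / len(labels)

def cv_accuracy(runs, relabeled=None, scheme=None):
    """Leave-one-run-out accuracy; scheme is None, "train", "test" or "both"."""
    fold_accs = []
    for test_i in range(len(runs)):
        train_labels, train_rows = [], []
        for i, (labels, rows) in enumerate(runs):
            is_test = i == test_i
            permute = relabeled is not None and (
                scheme == "both"
                or (scheme == "test" and is_test)
                or (scheme == "train" and not is_test))
            labs = relabeled[i] if permute else labels
            if is_test:
                test_labels, test_rows = labs, rows
            else:
                train_labels.extend(labs)
                train_rows.extend(rows)
        means = class_means(train_labels, train_rows)
        fold_accs.append(accuracy(means, test_labels, test_rows))
    return sum(fold_accs) / len(fold_accs)

rng = random.Random(0)
runs = [make_run(rng) for _ in range(2)]
true_acc = cv_accuracy(runs)

null = {"train": [], "test": [], "both": []}
for _ in range(100):
    # Shuffle labels within each run; the same relabeling serves all schemes.
    relabeled = [rng.sample(labels, len(labels)) for labels, _ in runs]
    for scheme in null:
        null[scheme].append(cv_accuracy(runs, relabeled, scheme))

print("true-labeled accuracy:", true_acc)
for scheme, accs in null.items():
    print(scheme, "null mean:", round(sum(accs) / len(accs), 3))
```

Because the classifier differs from the one in the post, the accuracies will not match the reported numbers; the point is only how the three schemes differ in which labels get replaced.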

This is a moderately difficult classification: the average accuracy of the true-labeled data (i.e. not permuted) was 0.695, ranging from 0.55 to 0.775 over the 10 repetitions. The accuracy of each dataset is given in the plot titles, and by a reddish dotted line.

For each permutation I recorded both the accuracy and the proportion of labels matching between that permutation and the real labels. When both training and testing labels are permuted this is an average over the two cross-validation folds. I plotted the classification accuracy of each permutation against the proportion of labels matching in the permutation and calculated the correlation. Histograms of each variable appear along the axes. These graphs are complicated, but enlarge if you click on them.

Training set labels only permuted:

Testing set labels only permuted:

both Training and Testing set labels permuted:
  

Here are density plots of all 30 null distributions, overplotted. These are the curves that appear along the y-axis in the above graphs.

observations

There is a strong linear relationship between the number of labels changed in a permutation and its accuracy when either the training or testing set labels alone are shuffled: the more the relabeling resembles the true data labels, the better the accuracy. When all labels are permuted there isn't much of a relationship.
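One way to see why the relationship appears under testing-only permutation: the classifier is trained on the true labels, so its predictions are fixed, and a relabeling's accuracy is just its agreement with those predictions. The Python sketch below uses a hypothetical set of predictions (14 of 20 correct) rather than a trained classifier; the function names are illustrative.

```python
import random

def proportion_matching(a, b):
    """Proportion of positions at which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

true_labels = list("a" * 10 + "b" * 10)
# Hypothetical fixed predictions of a classifier trained on the true labels:
# 7 of 10 correct in each class, so 14 of 20 overall.
predictions = list("a" * 7 + "b" * 3 + "b" * 7 + "a" * 3)

rng = random.Random(0)
matches, accs = [], []
for _ in range(500):
    permuted = rng.sample(true_labels, len(true_labels))  # shuffled test labels
    matches.append(proportion_matching(permuted, true_labels))
    accs.append(proportion_matching(permuted, predictions))

print("correlation:", round(pearson(matches, accs), 2))
```

The better the fixed predictions agree with the true labels, the more a relabeling's accuracy tracks its resemblance to the true labeling, so the correlation should grow with the true-labeled accuracy.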

Despite the strong correlation, the null distributions resulting from each permutation scheme are quite similar (density plot overlap graph). This makes sense, since the relabelings are chosen at random, so relabelings quite similar and quite dissimilar to the true labeling are included. The null distributions would be skewed if the labels for the permutations were not chosen at random (e.g. centered above chance if only mostly-matching relabelings were used).

comments

Intuitively, I prefer the permute-both scheme: more permutations are possible, and the strong correlation is absent. But since the resulting null distributions are so similar, I can't say that permuting just one set of labels or the other is really worse, much less invalid. This is quite a simplified simulation; I think it would be prudent to check null distributions and relabeling schemes in actual use, since non-random label samplings may turn up.

references

  • Al-Rawi, M.S., Cunha, J.P.S., 2012. On using permutation tests to estimate the classification significance of functional magnetic resonance imaging data. Neurocomputing 82, 224-233. http://dx.doi.org/10.1016/j.neucom.2011.11.007
  • Ojala, M., Garriga, G.C., 2010. Permutation tests for studying classifier performance. Journal of Machine Learning Research 11, 1833-1863.
  • Pereira, F., Botvinick, M., 2011. Information mapping with pattern classifiers: A comparative study. Neuroimage 56, 476-496.
  • Stelzer, J., Chen, Y., Turner, R., 2013. Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. Neuroimage 65, 69-82.

Friday, December 7, 2012

Try R tutorial

R can have a steep learning curve if you're new to programming. Try R is a set of online tutorials that are visually attractive and light-hearted, but cover the basics. Worth a "try"!