The searchmight toolbox was developed by Francisco Pereira and the Botvinick Lab and is introduced in "Information mapping with pattern classifiers: A comparative study" (full reference below). The toolbox is specialized for searchlight analysis, streamlining and speeding up the process of trying different classification algorithms on the same dataset.
Francisco said that, by default, the toolbox uses cubical searchlights (other shapes can be manually defined). In the image of different one-voxel-radius searchlights, this is the shape I labeled "edge, face, or corner": 26 surrounding voxels for a one-voxel-radius searchlight. Searchlights with larger radii are also cubes.
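For concreteness, here's a minimal sketch (in Python with numpy; not searchmight's own code, and the function name is mine) of how the offsets of such a cubical searchlight could be enumerated:

```python
import numpy as np

def cubical_offsets(radius):
    """Offsets (dx, dy, dz) of a cube-shaped searchlight, excluding the center voxel."""
    r = np.arange(-radius, radius + 1)
    dx, dy, dz = np.meshgrid(r, r, r, indexing="ij")
    offsets = np.column_stack([dx.ravel(), dy.ravel(), dz.ravel()])
    return offsets[np.any(offsets != 0, axis=1)]   # drop the (0, 0, 0) center

print(len(cubical_offsets(1)))   # 26 surrounding voxels: the "edge, face, or corner" shape
print(len(cubical_offsets(2)))   # 124 surrounding voxels (5 * 5 * 5 - 1)
```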
Pereira, Francisco, & Botvinick, Matthew (2011). Information mapping with pattern classifiers: A comparative study. NeuroImage. DOI: 10.1016/j.neuroimage.2010.05.026
Monday, September 24, 2012
Wednesday, September 19, 2012
pyMVPA searchlight shape: solved
Michael Hanke posted a further description of the pyMVPA searchlight: it is the same as the Princeton MVPA toolbox searchlight I described earlier.
As Michael points out, it is possible to define other searchlight shapes in pyMVPA; this is the default for a sphere, not the sole option.
searchlight shapes: a collection
The topic of searchlight shapes is more complex than I'd originally expected. I'd like to compile a collection of what people are using. I added a "searchlight shapes" label to the relevant posts: clicking that label should bring up all of the entries.
I have contacted some people directly in the hopes of obtaining sufficient details to add their searchlight shapes to the collection. If you have details of a searchlight not included, please send them and I'll add it (or arrange for you to post them). My initial goal is to describe the main (i.e. default) spherical searchlights produced in the most widely used code/programs, but I also plan to include other types of searchlights, like the random ones created by Malin Björnsdotter.
Tuesday, September 18, 2012
permutation testing: one person only
Here are the results and specifics of the permutation testing for one (real) person from the dataset I introduced in the previous post.
Combining the first six runs (the first training set), this person has 19 usable examples of class a and 31 of class b. In the last six runs (the first testing set) there are 22 usable examples of class a and 18 of class b. The data is thus quite out of balance (many more b in the training set and many more a in the testing set). Balancing the dataset requires selecting 19 of the class b examples from the training set and 18 of the class a examples from the testing set. I did this selection randomly 10 times, which I will call "splits".
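Here's a minimal sketch of this balancing step (Python with numpy; not the actual analysis code, and the names and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the 10 splits are reproducible

def balanced_subset(labels, n_per_class, rng):
    """Randomly keep n_per_class examples of each class; returns the row indices kept."""
    keep = []
    for cls in np.unique(labels):
        rows = np.flatnonzero(labels == cls)
        keep.extend(rng.choice(rows, size=n_per_class, replace=False))
    return np.sort(np.array(keep))

# e.g. a training set with 19 a and 31 b, subset 10 times to 19 of each class
train_labels = np.array(["a"] * 19 + ["b"] * 31)
splits = [balanced_subset(train_labels, 19, rng) for _ in range(10)]
```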
The permutation test's aim is to evaluate the classification's significance: is the accuracy with which we classified the stimuli as a or b meaningful? I do this by permuting the class labels in the dataset, so that in any particular permutation some class a examples are still labeled a while others are labeled b. But in this case, even though it's just one person, we have ten datasets: the splits.
I handle this by precalculating the label permutations, then running the permutation test on each of the splits separately and averaging the results. To make it more concrete, this person has (after balancing) 19 examples of each class in the training set. I sort the rows so that they are grouped by class: the real labeling is 19 a followed by 19 b. I store 1000 random label rearrangements, each containing 19 a and 19 b. For example, here's the new labeling for permutation #1: "a" "b" "b" "a" "b" "a" "b" "a" "a" "a" "b" "b" "b" "a" "a" "b" "a" "b" "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "a" "b" "a" "a" "b" "b" "b" "b" "a" "b". I replace the real class labels with this set, then run the usual classification code, on each of the 10 splits.
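A minimal sketch of precomputing these relabelings (Python with numpy; not the actual code, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
true_labels = np.array(["a"] * 19 + ["b"] * 19)   # rows sorted by class, as described
n_perms = 1000
perms = np.array([rng.permutation(true_labels) for _ in range(n_perms)])
# perms[0] is the relabeling for permutation #1, reused on every one of the 10 splits
```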
When everything is done I end up with 1000 label permutations x 10 splits = 10,000 accuracies. I then reduce these to a set of 1000 accuracies by averaging, for each label permutation, the accuracies obtained across the 10 splits. In other words, I take the average of the accuracies I got by using the abbababaaa... labeling on each of my 10 "split" datasets. I then compute the significance as the rank of the real accuracy within the permutation distribution, divided by the number of permutations plus one. In this case, that's 2/1001, approximately 0.002.
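In code, the final averaging and ranking could look something like this (a sketch assuming numpy; perm_accs and true_accs are stand-ins for the accuracies produced by the real classification code, and the rank convention shown is just one consistent with 2/1001):

```python
import numpy as np

# stand-ins for the real results: 1000 permutations x 10 splits, plus 10 true-label accuracies
perm_accs = np.random.rand(1000, 10) * 0.2 + 0.4
true_accs = np.random.rand(10) * 0.1 + 0.6

null_dist = perm_accs.mean(axis=1)    # 1000 split-averaged permutation accuracies
true_acc = true_accs.mean()           # split-averaged true-label accuracy
p = (np.sum(null_dist >= true_acc) + 1) / (len(null_dist) + 1)   # rank out of 1001
```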
Here's how it actually looks. These are histograms of the null distributions obtained on each of the split datasets separately, plus the averaged one. The vertical blue lines give the accuracy of the true labeling on each split, while the vertical light-grey line is at 0.5 (chance). The first graph overplots density-plot versions of the 10 splits' histograms.
This method obviously requires a lot of computation (I often use a cluster). But I like that it enables comparing apples to apples: I'm comparing the accuracy I got by averaging 10 numbers to a distribution of accuracies obtained by averaging 10 numbers. Since the same set of splits and relabelings is used in all cases, the averaging is meaningful.
In this case, as often, the averaged distribution has less variance than the individual splits' distributions. This is not surprising, of course. It's also clear that the true-labeling accuracies vary more across splits than the permutation distributions do, which is probably also not surprising (since these are permutation distributions), but it is not as clean as might be hoped. I do like that the true accuracy falls at the right of all of the distributions; I consider this amount of variability pretty normal. I'd be worried if the distributions varied a great deal, or if the true-label accuracies were more variable (e.g. at chance on some splits, > 0.7 on others).
permutation testing: example dataset
I currently have a dataset that, while mostly straightforward to analyze, has some complications that arise frequently. In this post I'll sketch some critical aspects to hopefully clarify the permutation testing issues. Here are the relevant attributes of the dataset and analysis:
- The analysis is within-subjects (each person classified alone) with anatomical ROIs. The ROIs are reasonably small and independently defined; we will not do additional feature selection.
- The subjects completed 12 scanning runs, with the stimuli reasonably separated in time and randomly ordered.
- There are various types of stimuli, but we are only interested in certain two-way classifications.
- We only want to include examples where the subject gave the correct response.
- We will use linear support vector machines (c=1).
There are various ways to set up the cross-validation. Partitioning on the runs is not an option, since some runs don't have all the stimulus types. My goal was to find a partitioning scheme that is as simple as possible, but separates the training and testing sets in time while getting the number of examples of each class in the training and testing sets as close as possible.
I calculated the number of training and testing examples for several schemes, such as leave-four-runs-out and leave-three-runs-out, settling on leave-six-runs-out. Specifically, I use the first six runs as the training set and the last six runs (in presentation order) as the testing set, then the reverse. Using this scheme, the smallest number of examples in a partition for any person is 6; not great, but I guessed it would be workable. The largest minimum for any person is 25; most people's minimums are in the teens.
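For concreteness, a minimal sketch of building the two leave-six-runs-out folds (Python with numpy; the run vector here is a dummy, not the real dataset):

```python
import numpy as np

runs = np.repeat(np.arange(1, 13), 8)         # dummy: 12 runs with 8 usable examples each
fold1_train = np.flatnonzero(runs <= 6)       # train on runs 1-6, test on runs 7-12
fold1_test = np.flatnonzero(runs >= 7)
fold2_train, fold2_test = fold1_test, fold1_train   # then the reverse
```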
Even with leave-six-runs-out partitioning, the number of examples of the two classes in the training and testing sets is usually unequal, sometimes strikingly so. My usual work-around is to subset the larger class to equalize the number of examples in each case (e.g. if there are 12 of class a and 14 of class b in the training set, take out two of the class b). There are of course many, many ways to do the subsetting. My usual practice is to randomly generate 10 of these subsets, then average the resulting accuracies (e.g. calculate the accuracy after taking out the 1st and 3rd class b examples in the training set, then after taking out the 4th and 13th, etc.).
Monday, September 17, 2012
permutation testing: opening thoughts
This is the first of what will probably be a continuing series of musing, meandering posts about permutation testing for group fMRI MVPA.
I'm a big fan of permutation testing, especially for MVPA: our questions, analyses, sample sizes, distributions, etc. are so far from what was envisioned for parametric stats that using a "standard" parametric test "out of the box" gives me pause (or, in practice, lots of assumption-checking). I like that with permutation tests it's possible to design a test that gets at exactly the hypothesis we're asking, while keeping the relevant variance structures (e.g. how many runs, type of classifier, number of examples in each cross-validation fold) intact.
But the exact way two different people design a permutation test for any particular question will probably differ. And this gets worse in group analyses.
I'll be writing a series of posts illustrating some of the common ways permutation tests are done, both single subject and group analyses. My intent is that these posts will highlight and explain how to carry out what I think are the most useful and common methods, but also share my perception of where some of the open questions lie. As always, please feel free to comment, especially pointing out where your practice/observation/experience/conclusions differ from mine.
Wednesday, September 12, 2012
more searchlights: the Princeton MVPA toolbox
MS Al-Rawi kindly sent me coordinates of the searchlights created by the Princeton MVPA toolbox, which I made into this image. As before, the center voxel is black, the one-voxel-radius searchlight red, the two-voxel-radius searchlight purple, and the three-voxel-radius searchlight blue.
The number of voxels in each is:
1-voxel radius: 6 surrounding voxels (7 total)
2-voxel radius: 32 surrounding voxels (33 total)
3-voxel radius: 122 surrounding voxels (123 total)
4-voxel radius: 256 surrounding voxels (257 total)
From a quick look at the code generating the coordinates, it appears to "draw" a sphere around the center voxel, then retrieve the voxels falling within the sphere. (Does anyone disagree with this characterization?)
This "draw a sphere first" strategy explains why I couldn't come up with this number of voxels for a three-voxel-radius searchlight if the shape follows additive rules (e.g. "add voxels that share a face with the next-size-smaller radius"). It's another example of how much it can vary when different people actually implement a description: to me, it was more natural to think of searchlights of different radii as growing iteratively, a rather different solution than the one used in the Princeton MVPA toolbox.