Monday, July 9, 2012

many, many options (continued)

Thanks for the comments on the previous post; here are a few thoughts and responses.

optimizing

Jonas Kaplan mentioned optimizing the methodology on a pilot subject or two, who are then not included in the main analysis. That can be a great solution, especially if the pilot subjects are collected ahead of time and need to be collected regardless.

Something that's worked for me is figuring out the analysis using a positive control classification, then applying it to the real classification. For example, classifying the button presses (did the person push the first-finger or second-finger response button?). Which button was pressed is of course not of interest, but if I can't classify button pressing using motor voxels, I know something is not right, and I can use button-pressing classification accuracy as a sort of yardstick (if button presses are classified at around 80% accuracy and the task I actually care about at 75%, that's probably a strong signal).

Neither strategy (testing out the methodology on different subjects or on a control classification) is perfect (the pilots could be quite different from the other subjects, the control classification could have a radically different signal than the classification of interest, etc.), but either is better than nothing, or than trying out many different analyses on the real data.

how big an impact?

An anonymous person commented that they've found that often a variety of methodological choices produces similar results, mentioning that sometimes a change will increase accuracy but also variability, resulting in similar conclusions/significance.

My experience is similar: sometimes changing a factor has a surprisingly small impact on the results. Classifier type comes to mind immediately; I've tried both linear svms and random forests (very different classifiers) on the same datasets with very similar results. Changing the number of cross-validation folds doesn't always make a big difference either, though changing the scheme entirely can (e.g. switching from partitioning on the runs to ignoring the runs when partitioning).

I've found that some choices have made very large differences in particular datasets; enough to change the interpretation completely.
  • temporal compression. I've seen massive differences in results when compressing to one example per run instead of one example per block; much more accurate with more averaging. I interpret this as the increase in signal/noise (from averaging more volumes together) outweighing the decrease in the number of datapoints (fewer examples with more compression).
  • scaling and detrending. Some datasets, particularly when classifying across subjects, have very different results depending on which scaling (i.e. normalizing) technique was used: across volumes within a single voxel, or across voxels within a single image.
  • balancing. This one's a bit different; I mean choosing a subset of examples if necessary to ensure that equal numbers of each class are sent to the classifier. For example, suppose we've compressed to one example per block, and are partitioning on the runs, but are only including blocks the person answered correctly. We might then have different numbers of examples in the classes in each cross-validation fold. I usually handle this by subsetting the bigger class at random - some of the examples are left out. Since there are many ways to choose which examples are left out, I usually do ten different random subsets and average the results (see the sketch after this list). Sometimes which examples are included can make a very large difference in accuracy. In these cases I often look at the processing and analysis stream to see if there's a way to make the results more stable (such as by increasing the number of examples in the training set).
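
Here's a minimal sketch of that subsetting-and-averaging idea in R. It's only an illustration: the train.data data frame, its "class" column (with made-up levels "a" and "b"), and the classify.fold() function are hypothetical placeholders for whatever objects and classifier call you actually use.

# balance the training set by randomly subsetting the bigger class,
# repeat several times, and average the resulting accuracies
balanced.accuracy <- function(train.data, test.data, n.subsets=10) {
  accs <- rep(NA, n.subsets)
  for (i in 1:n.subsets) {
    a.rows <- which(train.data$class == "a")
    b.rows <- which(train.data$class == "b")
    n.keep <- min(length(a.rows), length(b.rows))   # size of the smaller class
    keep <- c(sample(a.rows)[1:n.keep], sample(b.rows)[1:n.keep])   # random subset of each class
    accs[i] <- classify.fold(train.data[keep,], test.data)   # placeholder classifier call
  }
  mean(accs)   # average over the random subsets
}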

Monday, July 2, 2012

many, many options

Let me join the chorus of people singing the praises of the recent article "False-positive psychology". If you haven't read the paper yet, go read it, especially if you don't particularly like statistics. One of the problems they highlight is "researcher degrees of freedom." In the case of MVPA I'd summarize this as having so many ways to do an analysis that, if you know what sort of result you'd like, you can "play around" a bit until you find one that yields it. This isn't cheating in the sense of making up data, but something more insidious: how do you determine which analysis you should perform, and when to accept the results you found?

Neuroskeptic has a great post on this topic, listing some of the researcher degrees of freedom in analyzing a hypothetical fMRI experiment:
"Let's assume a very simple fMRI experiment. The task is a facial emotion visual response. Volunteers are shown 30 second blocks of Neutral, Fearful and Happy faces during a standard functional EPI scanning. We also collect a standard structural MRI as required to analyze that data."
What are some of the options for analyzing this with MVPA? This is not an exhaustive list by any stretch, just the first few that came to mind.
 temporal compression
  • Average the volumes to one per block. Which volumes to include in the average (i.e. to account for the hemodynamic lag)?
  • Create parameter estimate images (PEIs) (i.e. fit a linear model and do MVPA on the beta weights), one per block. The linear model could be canonical or individualized.
  • Average the volumes to one per run. Calculate the averages from the block files or all at once from the raw images.
  • Create one PEI for each run.
  • Analyze individual volumes (first volume in each block, second volume in each block, etc).
classifier
  • the "default": linear svm, c=1.
  • a linear svm, but fit the c.
  • a nonlinear svm (which type?).
  • a different classifier (random forest, naive bayes, ....).
  • correlation-based
  • linear discriminants (multiple options)
cross-validation
  • on the runs
  • on a combination of runs (first two runs out, next two out, etc)
  • ignoring the runs (ten-fold, leave-three-examples-out, etc)
  • on the subjects (leave-one-subject-out)
  • on the runs, but including multiple subjects
voxels
  • whole-brain
  • ROI (anatomical, functional, hybrid)
  • searchlight (which radius? which shape? how to combine across subjects?)
  • resize the voxels?
  • scale (normalize) the data? (across voxels within an example, or across examples within a voxel?) Center, normalize the variance, take out linear trends, take out nonlinear trends? (A small sketch of the two scaling directions follows this list.)
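
To make the two scaling directions concrete, here's a minimal sketch on a made-up examples-by-voxels matrix; it only illustrates the options, and isn't a recommendation for any particular choice.

# toy data: 20 examples (rows) by 100 voxels (columns) of random numbers
data.mat <- matrix(rnorm(20*100), nrow=20, ncol=100)

# scale each voxel across examples: each column ends up with mean 0, sd 1
scaled.by.voxel <- scale(data.mat)

# scale each example across voxels: each row ends up with mean 0, sd 1
scaled.by.example <- t(scale(t(data.mat)))

# removing a linear trend from each voxel is yet another option
# (only sensible if the rows are in temporal order)
detrended <- apply(data.mat, 2, function(v) residuals(lm(v ~ seq_along(v))))
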
Simmons, Nelson, and Simonsohn argue that we (as authors and reviewers) need to be clear about why we chose particular combinations, and about how sensitive the results were to those choices. For example, there is not always a clear best cross-validation scheme for a particular analysis; several may be equally valid (leave-three-out, leave-four-out, leave-five-out, ...). If you only try one scheme and report it, that's fine. But if you try several and only report the one that "worked", you've greatly increased the odds of finding a false positive. We need to be honest about how stable the results are to these sorts of (sometimes arbitrary) analytical choices.



Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22 (11), 1359-66. PMID: 22006061

Tuesday, June 26, 2012

attending SfN

I'll be attending SfN this year, and will have a poster Monday morning. It's not a method poster, but uses some nice MVPA (if I say so myself!). I'm still putting my schedule together; contact me if you'd like to get together and chat about MVPA.

Wednesday, June 20, 2012

linear SVM weight behavior

Lately I've been thinking about how linear svm classification performance and weight maps look when the informative voxels are highly similar, since voxels can of course be highly correlated in fMRI. I made a few little examples that I want to share; R code for these examples is here.

These examples are similar to the ones I posted last month: one person, two experiment conditions (classes), four runs, four examples of each class in each run. Classifying with a linear svm, c=1, partitioning on the runs, 100 voxels. Images are of the weights from the fitted svm, averaged over the four cross-validation folds.

In each case I generated random numbers for one class for each voxel. If the voxel is "uninformative" I copied that set of random numbers for the other class; if the voxel is "informative" I added a small number (the "bias") to the random numbers to form the other class. In other words, an uninformative voxel's value on the first class A example in run 1 is the same as on the first class B example in run 1, while an informative voxel's first class B example in run 1 equals the value of its first class A example plus the bias.

I ran these three ways: with all the informative voxels being identical (i.e. I generated one "informative" voxel then copied it the necessary number of times), with all informative voxels equally informative (equal bias) but not identical, and with varying bias in the informative voxels (so they were neither identical nor equally informative).
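
Here's a minimal stand-in sketch of this sort of data generation (it is not the linked R code); the sizes and the 0.2 bias are arbitrary choices, and as written it produces the second variant (equal bias, not identical).

# 32 examples (4 runs x 4 examples x 2 classes) by 100 voxels;
# the first n.inf voxels are informative (class b = class a + bias), the rest are not
n.voxels <- 100; n.inf <- 5; bias <- 0.2;
n.a <- 16;   # 16 class-a examples (4 runs x 4 examples per run)

class.a <- matrix(rnorm(n.a * n.voxels), nrow=n.a, ncol=n.voxels)
class.b <- class.a                                  # uninformative voxels: exact copies
class.b[, 1:n.inf] <- class.a[, 1:n.inf] + bias     # informative voxels: copy plus the bias
# (for the identical-voxels variant, first copy column 1 into columns 2:n.inf of class.a)

data.mat <- rbind(class.a, class.b)
labels <- rep(c("a", "b"), each=n.a)
runs <- rep(rep(1:4, each=4), 2)   # run labels, for partitioning on the runs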

Running the linked code will let you generate graphs for each cross-validation fold and for however many informative voxels you wish; I'll show just a few here.

In the graph for 5 identical informative voxels the informative voxels have by far the strongest weights; when there are 50 identical informative voxels they 'fade': their weights are smaller than those of the uninformative voxels.

A linear svm computes a weighted sum of the voxel values, and when many voxels carry identical information that sum can be spread across all of them, so each individual voxel needs only a small weight.
This does not happen when the voxels are equally informative but not identical: the weights are largest (most negative, in this case) for the informative voxels (left side of the image).

The accuracy is higher than with the 50 identical informative voxels, though the bias is the same in both cases.

When the informative voxels are more variable the weight map is also more variable, with larger weights on the voxels with more bias.
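
For anyone who wants to look at the weights themselves, here's a hedged sketch of one way to extract them, assuming the e1071 package and the data.mat, labels, and runs objects from the generation sketch above; it fits a single cross-validation fold rather than averaging over all four as in the images.

library(e1071)

# train on runs 1-3, leaving run 4 out
train.rows <- which(runs != 4)
fit <- svm(data.mat[train.rows, ], factor(labels[train.rows]),
           kernel="linear", cost=1, scale=FALSE)

# recover the hyperplane weights: one weight per voxel
w <- as.vector(t(fit$coefs) %*% fit$SV)
plot(w, xlab="voxel", ylab="weight")   # the informative voxels are columns 1 to n.inf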

recap

The most striking thing I noticed in these images is the way the weights of the informative voxels get closer to zero as the number of informative voxels increases. This could cause problems when voxels have highly similar timecourses: they won't be weighted according to the information in each voxel alone, but rather as a function of both that information and the number of voxels carrying a similar amount of it.

Wednesday, June 6, 2012

temporal compression for different image acquisition schemes

A question was posted on the mvpa-toolbox mailing list about how to do temporal compression: which volumes should you pick to correspond to an event, particularly if the timing of the events is jittered? What follows is a version of my reply.

The case when stimulus onset is time-locked to image acquisition is the easiest. In this case I generally guess which images (acquired volumes) should correspond to peak HRF and average those. This is straightforward if the TR is short compared to the time period you want to temporally compress (e.g. a twenty-second event and two-second TR) but can get quite dodgy if the events and TR are close in time (e.g. events that last a second). In these cases I generally think of analyzing single timepoints or generating PEIs.
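
For instance, here's a minimal sketch of that volume-picking arithmetic, assuming a 2 second TR, 20 second blocks time-locked to acquisition, and a 4 second hemodynamic delay; all of these numbers are adjustable assumptions, not recommendations.

TR <- 2            # seconds per volume
block.dur <- 20    # block duration, seconds
hrf.delay <- 4     # assumed lag before the response reaches the block 'plateau'
block.onset <- 60  # block starts 60 seconds into the run, at a volume boundary

# volumes are numbered from 1, with volume i covering seconds (i-1)*TR to i*TR
first.vol <- floor((block.onset + hrf.delay) / TR) + 1
last.vol <- floor((block.onset + block.dur) / TR)    # let the response run to block offset
first.vol:last.vol   # average volumes 33 to 40 for these numbers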

If stimulus onset is jittered in relation to image acquisition I follow a similar logic: if the jitter is minimal compared to the TR (e.g. events start either half or three-quarters of the way through a 1.5 second TR) or to the number of volumes being averaged (e.g. a block design and 12 volumes are being averaged each time) I'll probably just ignore the jitter. But if the jitter is large (e.g. 4 sec TR and completely randomized stimulus onset) I'll think of PEIs again.

By PEIs I mean "parameter estimate images" - fitting a linear model assuming the standard HRF and doing MVPA with the beta weights. I described some of this and presented a comparison of doing averaging and PEIs on the same datasets in "The impact of certain methodological choices on multivariate analysis of fMRI data with support vector machines".
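
As a rough sketch of what generating the PEI value for one voxel and one block could look like (the double-gamma HRF here is one common parameterization, not necessarily what any particular software package uses, and the data are made up):

TR <- 2; n.vols <- 100;             # toy run: 100 volumes
vol.times <- (0:(n.vols-1)) * TR    # acquisition times, in seconds

# boxcar for one 20-second block starting at 60 seconds
boxcar <- ifelse(vol.times >= 60 & vol.times < 80, 1, 0)

# a double-gamma HRF sampled at the volume acquisition times
hrf <- function(t) { dgamma(t, shape=6, rate=1) - dgamma(t, shape=16, rate=1)/6 }
regressor <- convolve(boxcar, rev(hrf(vol.times)), type="open")[1:n.vols]

# the fitted beta for 'regressor' is this voxel's value in the block's PEI
voxel.timecourse <- rnorm(n.vols)   # random numbers, so this particular beta is meaningless
beta <- coef(lm(voxel.timecourse ~ regressor))["regressor"]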

As a general strategy I look at the TR, stimulus timing, and event duration for each particular experiment and question, then think about which volumes the BOLD response we're looking for probably falls in. If there's a clear answer, I pick those volumes; if not, I turn to PEIs or reformulate the question. None of this is a substitute for proper experimental design and randomization, of course, and fitting PEIs is not a cure-all.

Tuesday, June 5, 2012

MVPA significance with the binomial test: update 1

A clarification is in order: in the previous post I implied that Francisco advocates using the binomial to test for significance across subjects as I illustrated in the R code. He doesn't. The code showing how to calculate the binomial for each person separately is the approach he describes, though.

Francisco gives a few ideas in the "Group level analysis" section of his "Information Mapping" paper (citation below), mostly advocating descriptions or count-based techniques. This can work fairly well for searchlight analyses (e.g. map the proportion of subjects with each voxel significant), but not as well for ROI-based analyses like the toy example in the previous post. He'll be sharing some thoughts in a future post.


Pereira F & Botvinick M (2011). Information mapping with pattern classifiers: a comparative study. NeuroImage, 56 (2), 476-96. PMID: 20488249

Monday, June 4, 2012

MVPA significance with the binomial test

Using the binomial test to evaluate significance is advocated by Francisco Pereira and Tom Mitchell (and others!); reference below. The method is easy to explain, but the exact implementation is not transparent to me: how to count the "number of correctly labeled test set examples" when I have multiple cross-validation folds, multiple subjects, and perhaps multiple repeats?

The common practice (as far as I can tell; I'll post your reply/comment if you disagree!) is to handle these problems by 1) counting the number of examples in the entire test set (summing over cross-validation folds) and 2) using the average over the multiple subjects and repeats as the number correct.

Here's an example in R:
# 3 runs, 10 examples in each (5 of each of two classes), 3 people
# here's how each person did:
(8 + 6 + 5)/30   # 19 correct, 0.6333333
(10 + 5 + 5)/30  # 20 correct, 0.6666667
(10 + 8 + 7)/30  # 25 correct, 0.8333333

# calculate the average accuracy and the corresponding number correct
mean(c(0.6333333, 0.6666667, 0.8333333))   # 0.7111111
0.7111111 * 30   # 21.3 examples correct on average

# calculate the binomial test of the average;
# have to round to a whole number correct (21.3 rounds to 21).
binom.test(21, 30, 0.5)   # p-value = 0.04277

# calculating the binomial summing over the number of examples in
# all the subjects makes a much smaller p-value:
(8 + 6 + 5 + 10 + 5 + 5 + 10 + 8 + 7) # 64
binom.test(64, 90, 0.5);  # p-value = 7.657e-05

# t-test version (also common, though not directly comparable)
t.test(c(0.6333333, 0.6666667, 0.8333333), mu=0.5, alternative='greater'); # p-value = 0.03809

# binomial tests for each person.
binom.test(20, 30, 0.5)   # p-value = 0.09874
binom.test(19, 30, 0.5)   # p-value = 0.2005
binom.test(25, 30, 0.5)   # p-value = 0.0003249

While not the case in this toy example (if we do the binomial on the average), I often find that the binomial produces smaller group p-values than the t-test or permutation testing. Nevertheless, I do not often use it, because so much information gets lost: variation between people, between test sets, between replications (if done).

It seems odd to use the across-subjects average and then have the exact same test (21 correct out of 30) regardless of whether there were 3 subjects or 300, but worse to sum over the number of examples in all the subjects (64 correct out of 90), because then the difference required for significance is so small and the actual test set size is lost completely. It also 'feels' improper to me to use the total number of test set examples in all the cross-validation folds; even if all of the examples are fully independent, the performance on the cross-validation folds certainly is not.

I am certainly not advocating that we ignore all research that used the binomial! But I do tend to interpret the significance levels that result with even more skepticism than usual.

Pereira F, Mitchell T, & Botvinick M (2009). Machine learning classifiers and fMRI: a tutorial overview. NeuroImage, 45 (1). DOI: 10.1016/j.neuroimage.2008.11.007



