Monday, July 30, 2012

Nipype impressions

I was asked to take a look at Nipype (great logo, Arno!). Here are some of my impressions; I haven't tried coding or installing it. If you have, I'd be curious to hear how it's gone and it you think any of this is off the mark.

Nipype aims to connect different fMRI software packages (spm, fsl, etc) so that you can program a reproducible set of steps. Your preprocessing could mix programs (e.g. smooth in one program, realign in another), and it should be easier to switch a step from one program to another (e.g. smooth in fsl instead of spm) with relatively few changes to the program.

I like that Nipype makes it possible to reproduce and share analysis steps; the current procedures are far too messy and error-prone. Nipype looks to require a fair bit of programming skill (python), however, as you're coding at a step above the underlying routines/programs: to do it properly you'd need a decent understanding of both nipype and the programs it's calling (from fsl or whatever).

Nipype strikes me as most useful to a skilled programmer wanting to run several versions of an analysis (e.g. with different motion-correction algorithms) or the same analysis many, many times (lots of subjects). It makes it more straightforward to make an established set of analysis steps (such as preprocessing) available to other people. It's hard to tell without trying it out, but it looks like a nipype program could be set up to be run by someone that doesn't know the underlying routines (e.g. an RA running the same preprocessing steps on a new group of subjects).

I have concerns about missing errors, specifying options, and stability. It's often necessary to check initial output before moving to the next step in the pipeline; these errors would need to be caught somehow (go back and check intermediate files?). Most routines have many options (e.g. FWHM, interpolation algorithm). Does Nipype set the default values? Making every single option visible through Nipype seems daunting, which is one of my concerns about stability: Nipype will need to be updated any time one of the underlying programs is updated. And installation would likely be non-trivial (all the necessary dependencies and paths), unless it's run in NeuroDebian.

A few final thoughts: First, I've had terrible troubles before with left/right image flipping and shifting centerpoints when changing from one software package to another (and that's not even thinking about changing from one atlas to another). Trying to make packages that have different standard spaces play nicely together strikes me as very non-trivial (I don't mean code-wise, I mean keeping track of alignment). Second, would there be a benefit to using Nipype if the pipeline only includes one program (e.g. why Nipype instead of spm8 batching)? It seems in this case it might just be adding extra complication: the need to learn the program options, plus the Nipype code.

So will I recommend we try Nipype right now? No. But I'll keep an eye on it and am certainly willing to revise my opinion.


ResearchBlogging.org Gorgolewski K, Burns CD, Madison C, Clark D, Halchenko YO, Waskom ML, & Ghosh SS (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Frontiers in neuroinformatics, 5 PMID: 21897815

Thursday, July 12, 2012

but replications are possible

Considering the posts on all the reasons that fMRI findings can be dodgy and replications unlikely, I'll share some recent experiences I've had in which proper replications succeeded.

I can't go into any detail (yet), but this image is a few slices showing the results with the first dataset in red and the second in blue. We definitely tweaked the analysis of the first dataset a bit; I did as much of the tweaking as possible with a control classification rather than the real one, but I can't claim that all the choices were made on a priori principle. But I would claim that the choices were made without any data peeking for the second dataset: the blue blobs are the very first set of across-subjects results; just the analysis we used for the first dataset. And I'm extra-pleased by these results because they are properly independent: different scanner, different people, even different stimuli (but the same task and experimental questions, of course).

Within the last week I've actually had another case of replicating results in properly independent datasets; a completely different set of analyses than the blobs above. It's an anatomical ROI-based analysis, so I don't have pretty pictures to show, but the tables are very lovely.

So do not give up hope: replications are possible. But we must be very vigilant in our methods: we are all easy to fool.

Monday, July 9, 2012

many, many options (continued)

Thanks for the comments on the previous post; here's a few thoughts and responses.

optimizing

Jonas Kaplan mentioned optimizing the methodology on a pilot subject or two, which are then not included in the main analysis. That can be a great solution, especially if the pilot subjects are collected ahead of time and need to be collected regardless.

Something that's worked for me is figuring out the analysis using a positive control classification, then applying it to the real classification. For example, classifying the button presses (did the person push the first-finger or second-finger response button?). Which button was pressed is of course not of interest, but if I can't classify button pressing using motor voxels, I know something is not right, and I can use button-pressing classification accuracy as a sort of yardstick (if button-presses are classified around 80% accuracy and the task I actually care about at 75%, that's probably a strong signal).

Neither strategy (testing out methodology on different subjects or classifications) is perfect (the pilots could be much different than the other subjects, the control classification could have a radically different signal than the classification of interest, etc.), but they're better than nothing, or trying out many different analyses with the real data.

how big an impact?

An anonymous person commented that they've found that often a variety of methodological choices produces similar results, mentioning that sometimes a change will increase accuracy but also variability, resulting in similar conclusions/significance.

My experience is similar: sometimes changing a factor has a surprisingly small impact on the results. Classifier type comes to mind immediately; I've tried both linear svms and random forests (very different classifiers) on the same datasets with very similar results. Changing the number of cross-validation folds doesn't always make a big difference, though changing schemes entirely can (e.g. partitioning on the runs, then ignoring the runs in the partitioning schemes).

I've found that some choices have made very large differences in particular datasets; enough to change the interpretation completely.
  • temporal compression. I've seen massive differences in results when compressing to one example per run instead of one example per block; much more accurate with more averaging. I interpret this as the increase in signal/noise (from averaging more volumes together) outweighing the decrease in the number of datapoints (fewer examples with more compression).
  • scaling and detrending. Some datasets, particularly when classifying across subjects, have very different results depending on which scaling (i.e. normalizing, across-volumes within a single voxel, or across voxels within a single image) technique was used.
  • balancing. This one's a bit different; I mean choosing a subset of examples if necessary to ensure that equal numbers of each class are sent to the classifier. For example, suppose we've compressed to one example per block, and are partitioning on the runs, but are only including blocks the person answered correctly. We might then have different numbers of examples in the classes in each cross-validation fold. I usually handle this by subsetting the bigger class at random - some of the examples are left out. Since there are many ways to choose which examples are left out, I usually do ten different random subsets and average the results. Sometimes which examples are included can have a very large difference in accuracy. In these cases I often look at the processing and analysis stream to see if there's a way to make the results more stable (such as by increasing the number of examples in the training set).

Monday, July 2, 2012

many, many options

Let me join the chorus of people singing the praises of the recent article "False-positive psychology". If you haven't read the paper yet, go read it, especially if you don't particularly like statistics. One of the problems they highlight is "researcher degrees of freedom." In the case of MVPA I'd summarize this as having so many ways to do an analysis that if you know what sort of result you'd like you can "play around" a bit until you find one that yields what you'd like. This isn't cheating in the sense of making up data, but more insidious: how do you determine what analysis you should perform, and when to accept the results you found?

Neuroskeptic has a great post on this topic, listing some of the researcher degrees of freedom in analyzing a hypothetical fMRI experiment:
"Let's assume a very simple fMRI experiment. The task is a facial emotion visual response. Volunteers are shown 30 second blocks of Neutral, Fearful and Happy faces during a standard functional EPI scanning. We also collect a standard structural MRI as required to analyze that data."
What are some of the options for analyzing this with MVPA? This is not an exhaustive list by any stretch, just the first few that came to mind.
 temporal compression
  • Average the volumes to one per block. Which volumes to include in the average (i.e. to account for the hemodynamic lag)?
  • Create parameter estimate images (PEIs) (i.e. fit a linear model and do MVPA on the beta weights), one per block. The linear model could be canonical or individualized.
  • Average the volumes to one per run. Calculate the averages from the block files or all at once from the raw images.
  • Create one PEI for each run.
  • Analyze individual volumes (first volume in each block, second volume in each block, etc).
classifier
  • the "default": linear svm, c=1.
  • a linear svm, but fit the c.
  • a nonlinear svm (which type?).
  • a different classifier (random forest, naive bayes, ....).
  • correlation-based
  • linear discriminants (multiple options)
cross-validation
  • on the runs
  • on a combination of runs (first two runs out, next two out, etc)
  • ignoring the runs (ten-fold, leave-three-examples-out, etc)
  • on the subjects (leave-one-subject-out)
  • on the runs, but including multiple subjects
voxels
  • whole-brain
  • ROI (anatomical, functional, hybrid)
  • searchlight (which radius? which shape? how to combine across subjects?)
  • resize the voxels?
  • scale (normalize) the data? (across voxels within an example, across examples?). Center, normalize the variance, take out linear trends, take out nonlinear trends?
Simmons, Nelson, and Simonsohn argue that we (as authors and reviewers) need to be clear about why we chose particular combinations, and how sensitive the results were to those choices. For example, there is not always a clear choice as to the best cross-validation scheme to use in a particular analysis; several may be equally valid (leave-three-out, leave-four-out, leave-five-out, ...). If you only try one scheme and report it, that's fine. But if you try multiple, then only report the one that "worked", you've really increased the odds of finding a false positive. We need to be honest about how stable the results are to these types of (sometimes arbitrary) analytical choices.



ResearchBlogging.org Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science, 22 (11), 1359-66 PMID: 22006061

Tuesday, June 26, 2012

attending SfN

I'll be attending SfN this year, and will have a poster Monday morning. It's not a method poster, but uses some nice MVPA (if I say so myself!). I'm still putting my schedule together; contact me if you'd like to get together and chat about MVPA.

Wednesday, June 20, 2012

linear SVM weight behavior

Lately I've been thinking about how linear svm classification performance and weight maps look when the informative voxels are highly similar, since voxels can of course be highly correlated in fMRI. I made a few little examples that I want to share; R code for these examples is here.

These examples are similar to the ones I posted last month: one person, two experiment conditions (classes), four runs, four examples of each class in each run. Classifying with linear svm, c=1, partitioning on the runs, 100 voxels.Images are of the weights from the fitted svm; averaged over the four cross-validation folds.

In each case I generated random numbers for one class for each voxel. If the voxel is "uninformative" I copied the set of random numbers for the other class, if the voxel is "informative" I added a small number (the "bias") to the random numbers to form the other class. In other words, a non-informative voxel's value on the first class A example in run 1 is the same as the first class B example in run 1. If the voxel is informative, the first class B example in run 1 will be equal to the value of the first class A example plus the bias.

I ran these three ways: with all the informative voxels being identical (i.e. I generated one "informative" voxel than copied it the necessary number of times), with all informative voxels equally informative (equal bias) but not identical, and with varying bias in the informative voxels (so they were not identical or equally informative).

Running the code will let you generate graphs for each cross-validation fold and however many informative voxels you wish; I'll show just a few here.

In the graph for 5 identical informative voxels the informative voxels have by far the strongest weights, when there are 50 identical informative voxels they 'fade': their weights are less than the uninformative voxels.





Linear svms produce a weighted sum of the voxel values; a small weight on each is needed when there are so many identically informative voxels.
This does not happen when the voxels are equally informative, but not identical: the weights are largest (most negative, in this case) for the informative voxels (left side of the image).

The accuracy is higher than with the 50 identical informative voxels, though the bias is the same in both cases.

When the informative voxels are more variable the weight map is also more variable, with voxels with more bias having higher weights.










recap

The most striking thing I noticed in these images is the way the weights of the informative voxels get closer to zero as the number of informative voxels increases. This could cause problems when voxels have highly similar timecourses - they won't be weighted in terms of the information in each, but rather as a function of the information in each and the number of voxels with a similar amount of information.

Wednesday, June 6, 2012

temporal compression for different image acquisition schemes

A question was posted on the mvpa-toolbox mailing list about how do temporal compression: which volumes should you pick to correspond to an event, particularly if the timing of the events is jittered. What follows is a version of my reply.

The case when stimulus onset is time-locked to image acquisition is the easiest. In this case I generally guess which images (acquired volumes) should correspond to peak HRF and average those. This is straightforward if the TR is short compared to the time period you want to temporally compress (e.g. a twenty-second event and two-second TR) but can get quite dodgy if the events and TR are close in time (e.g. events that last a second). In these cases I generally think of analyzing single timepoints or generating PEIs.

If stimulus onset is jittered in relation to image acquisition I follow a similar logic: if the jitter is minimal compared to the TR (e.g. events start either half or three-quarters of the way through a 1.5 second TR) or to the number of volumes being averaged (e.g. a block design and 12 volumes are being averaged each time) I'll probably just ignore the jitter. But if the jitter is large (e.g. 4 sec TR and completely randomized stimulus onset) I'll think of PEIs again.

By PEIs I mean "parameter estimate images" - fitting a linear model assuming the standard HRF and doing MVPA with the beta weights. I described some of this and presented a comparison of doing averaging and PEIs on the same datasets in "The impact of certain methodological choices on multivariate analysis of fMRI data with support vector machines".

As a general strategy I look at the TR, stimulus timing, and event duration for each particular experiment and question then think about in which volumes the BOLD response we're looking for probably falls. If it's a clear answer, I pick those volumes. If not, I design PEIs or reformulate the question. None of this is a substitute for proper experimental design and randomization, of course, and fitting PEIs is not a cure-all.