Monday, July 2, 2012

many, many options

Let me join the chorus of people singing the praises of the recent article "False-positive psychology". If you haven't read the paper yet, go read it, especially if you don't particularly like statistics. One of the problems they highlight is "researcher degrees of freedom." In the case of MVPA I'd summarize this as having so many ways to do an analysis that, if you know what sort of result you'd like, you can "play around" a bit until you find an approach that yields it. This isn't cheating in the sense of making up data, but it's more insidious: how do you decide which analysis to perform, and when to accept the results you found?

Neuroskeptic has a great post on this topic, listing some of the researcher degrees of freedom in analyzing a hypothetical fMRI experiment:
"Let's assume a very simple fMRI experiment. The task is a facial emotion visual response. Volunteers are shown 30 second blocks of Neutral, Fearful and Happy faces during a standard functional EPI scanning. We also collect a standard structural MRI as required to analyze that data."
What are some of the options for analyzing this with MVPA? This is not an exhaustive list by any stretch, just the first few that came to mind.
 temporal compression
  • Average the volumes to one per block. Which volumes to include in the average (i.e. to account for the hemodynamic lag)?
  • Create parameter estimate images (PEIs) (i.e. fit a linear model and do MVPA on the beta weights), one per block. The linear model could be canonical or individualized.
  • Average the volumes to one per run. Calculate the averages from the block files or all at once from the raw images.
  • Create one PEI for each run.
  • Analyze individual volumes (first volume in each block, second volume in each block, etc.).
 classification algorithm
  • the "default": linear SVM, c=1.
  • a linear SVM, but fitting c.
  • a nonlinear SVM (which kernel?).
  • a different classifier (random forest, naive bayes, ...).
  • correlation-based methods
  • linear discriminants (multiple options)
 cross-validation scheme
  • on the runs (leave-one-run-out)
  • on combinations of runs (first two runs out, next two out, etc.)
  • ignoring the runs (ten-fold, leave-three-examples-out, etc.)
  • on the subjects (leave-one-subject-out)
  • on the runs, but including multiple subjects
 which voxels
  • whole-brain
  • ROI (anatomical, functional, hybrid)
  • searchlight (which radius? which shape? how to combine across subjects?)
 preprocessing
  • resize the voxels?
  • scale (normalize) the data? (across voxels within an example? across examples?) Center, normalize the variance, remove linear trends, remove nonlinear trends?
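Even this partial list multiplies alarmingly fast. As a rough sketch (the option sets and counts below are illustrative, not a complete enumeration of the list above), enumerating the combinations gives thousands of distinct analyses of the same dataset:

```python
from itertools import product

# Hypothetical option sets drawn loosely from the list above;
# the exact entries and counts are illustrative only.
temporal_compression = ["average per block", "PEI per block", "average per run",
                        "PEI per run", "individual volumes"]
classifier = ["linear SVM c=1", "linear SVM fitted c", "nonlinear SVM",
              "random forest", "naive bayes", "correlation", "LDA"]
cross_validation = ["leave-one-run-out", "leave-two-runs-out", "ten-fold",
                    "leave-one-subject-out"]
voxels = ["whole-brain", "anatomical ROI", "functional ROI", "searchlight"]
scaling = ["none", "row-scaled", "column-scaled", "detrended"]

# Every combination of one choice per dimension is a defensible pipeline.
pipelines = list(product(temporal_compression, classifier,
                         cross_validation, voxels, scaling))
print(len(pipelines))  # 5 * 7 * 4 * 4 * 4 = 2240 distinct analyses
```

If even a handful of those 2240 pipelines get tried and only the best-looking one reported, the nominal p-values no longer mean what they claim to.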
Simmons, Nelson, and Simonsohn argue that we (as authors and reviewers) need to be clear about why we chose particular combinations, and how sensitive the results were to those choices. For example, there is not always a clear best cross-validation scheme for a particular analysis; several may be equally valid (leave-three-out, leave-four-out, leave-five-out, ...). If you try one scheme and report it, that's fine. But if you try several and report only the one that "worked", you've greatly increased the odds of reporting a false positive. We need to be honest about how stable the results are to these types of (sometimes arbitrary) analytical choices.

Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22 (11), 1359-66. PMID: 22006061


  1. Thanks that is a great list! You should also add feature selection/extraction into this.

    The good news is that at least in my experience most of these options usually have very little influence on the results. Averaging volumes within a block tends to produced higher accuracy but the variance scales as well so the pattern of results is usually similar. I think the biggest effect comes from feature selection and perhaps also the choice of classifier (although that is less of a problem).

  2. Good post. I'd also add preprocessing steps to the list like spatial smoothing with various kernel sizes.

    We often collect 1 or 2 pilot subjects that we know will not be part of the final dataset that we can use to play around with many of these variables. Then we will make choices about all of the analysis options before collecting the "real" data.

  3. Very nice indeed, as always. Perhaps there should be some standard template that contains flags to what methods a researcher had used in the analysis, i.e. the protocol of the analysis. This would save the reader the time needed to decode what was done.

    As to the false positive(s) issue, I have seen a study (Pub. in Neuroimaging) in which the authors used few feature selection and few classifiers (for each subject), then, they selected the results that gave the highest accuracy. Their excuse is that they used a 7T MRI to obtain high resolution fMRI, which means higher levels of noise. Would anyone consider this case as "false positive?"

    1. I've seen versions of that as well (I'm not sure which exact paper you mean, which is not a good sign). I suppose it could be done properly, if you were very careful about hold-out sets and data peeking, but it seems a very risky strategy. And 7T data is no excuse. :)