optimizingJonas Kaplan mentioned optimizing the methodology on a pilot subject or two, which are then not included in the main analysis. That can be a great solution, especially if the pilot subjects are collected ahead of time and need to be collected regardless.
Something that's worked for me is figuring out the analysis using a positive control classification, then applying it to the real classification. For example, classifying the button presses (did the person push the first-finger or second-finger response button?). Which button was pressed is of course not of interest, but if I can't classify button pressing using motor voxels, I know something is not right, and I can use button-pressing classification accuracy as a sort of yardstick (if button-presses are classified around 80% accuracy and the task I actually care about at 75%, that's probably a strong signal).
Neither strategy (testing out methodology on different subjects or classifications) is perfect (the pilots could be much different than the other subjects, the control classification could have a radically different signal than the classification of interest, etc.), but they're better than nothing, or trying out many different analyses with the real data.
how big an impact?An anonymous person commented that they've found that often a variety of methodological choices produces similar results, mentioning that sometimes a change will increase accuracy but also variability, resulting in similar conclusions/significance.
My experience is similar: sometimes changing a factor has a surprisingly small impact on the results. Classifier type comes to mind immediately; I've tried both linear svms and random forests (very different classifiers) on the same datasets with very similar results. Changing the number of cross-validation folds doesn't always make a big difference, though changing schemes entirely can (e.g. partitioning on the runs, then ignoring the runs in the partitioning schemes).
I've found that some choices have made very large differences in particular datasets; enough to change the interpretation completely.
- temporal compression. I've seen massive differences in results when compressing to one example per run instead of one example per block; much more accurate with more averaging. I interpret this as the increase in signal/noise (from averaging more volumes together) outweighing the decrease in the number of datapoints (fewer examples with more compression).
- scaling and detrending. Some datasets, particularly when classifying across subjects, have very different results depending on which scaling (i.e. normalizing, across-volumes within a single voxel, or across voxels within a single image) technique was used.
- balancing. This one's a bit different; I mean choosing a subset of examples if necessary to ensure that equal numbers of each class are sent to the classifier. For example, suppose we've compressed to one example per block, and are partitioning on the runs, but are only including blocks the person answered correctly. We might then have different numbers of examples in the classes in each cross-validation fold. I usually handle this by subsetting the bigger class at random - some of the examples are left out. Since there are many ways to choose which examples are left out, I usually do ten different random subsets and average the results. Sometimes which examples are included can have a very large difference in accuracy. In these cases I often look at the processing and analysis stream to see if there's a way to make the results more stable (such as by increasing the number of examples in the training set).