In the previous post I shared R code to perform a searchlight analysis on a sample dataset; this post describes R code to perform a ROI-based analysis on the dataset. In addition to the code and dataset files linked in the previous sentence, the code requires a ROI mask, shown at left.
This code is essentially a simpler version of the searchlight code: instead of reading out which voxels to classify from the lookup and 3d files it gets the voxel coordinates from the ROI mask. The key part of this code is that the ROI mask and functional data must be fully aligned - as the images are read into R. This can easily go awry; I strongly suggest you spot-check several voxels in R and an outside program (such as MRIcroN) to make sure the laterality of all images agree.
Wednesday, January 29, 2014
demo: R code to perform a searchlight analysis
Here is an R script for performing a single-subject searchlight analysis with R. You are welcome to adapt the code for your own use so long as I am cited; the code is a demo that performs an entire single-subject searchlight analysis on the sample dataset, but will need to be edited for other datasets. This is an R script, not a package, because it is not "plug-and-play", though I hope it is straightforward and clear.
The demo takes a 4d nifti as input and creates a 3d nifti output image in which the value in each voxel is the classification accuracy of the voxel's searchlight. The code creates iterative-shell shaped searchlights, but can be altered for other shapes.
The code performs two preliminary steps before doing the searchlight analysis. These just need to be done once for each dataset (e.g. if multiple subjects have been spatially normalized, all use the same lookup and 3d files). The first creates a 3d.rds file, which assigns an integer to each brain voxel and serves to go between each voxel's 3d coordinates and the list of searchlight surrounds, which is stored in a lookup table. Each row of the lookup table corresponds to the brain voxel integer labels stored in 3d.rds; the columns list the voxel-name integers making up the searchlight for that row's voxel. The searchlight analysis proceeds by going through each voxel in turn, looking up its searchlight voxels in the lookup table, pulling the values for those voxels from the 4d input image, then writing the classification accuracy into the voxel's place in the 3d output image.
I hope that this code and/or logic is useful for people setting up their own searchlight analyses, whether in R or another language. I believe that most searchlight analysis implementations use a similar logic, though some others are probably more optimized for speed! If you give it a try, please let me know how it goes.
I created the demo dataset from a little bit of a real dataset, labeling volumes to have strong information in motor and visual areas. This is what my final searchlight accuracy map looked like, plotted on the ch2bet template in MRIcroN and scaled for accuracies of 0.5 to 1.
This shows the accuracy maps produced from this little demo dataset using 2-voxel radius searchlights, shaped as iterative shells (right) and "spheres" (left). The areas with the highest accuracies are the same in both, as we'd hoped, but the accuracies tend to be a bit higher on the right (shells), likely because 2-voxel radius spherical searchlights are smaller (have fewer voxels) than 2-voxel radius iterative shell searchlights. (Note that this doesn't mean one shape is better than the other, they're simply different shapes, and it's probably best for consistency that we all use the same shape, which means spheres.)
The demo takes a 4d nifti as input and creates a 3d nifti output image in which the value in each voxel is the classification accuracy of the voxel's searchlight. The code creates iterative-shell shaped searchlights, but can be altered for other shapes.
The code performs two preliminary steps before doing the searchlight analysis. These just need to be done once for each dataset (e.g. if multiple subjects have been spatially normalized, all use the same lookup and 3d files). The first creates a 3d.rds file, which assigns an integer to each brain voxel and serves to go between each voxel's 3d coordinates and the list of searchlight surrounds, which is stored in a lookup table. Each row of the lookup table corresponds to the brain voxel integer labels stored in 3d.rds; the columns list the voxel-name integers making up the searchlight for that row's voxel. The searchlight analysis proceeds by going through each voxel in turn, looking up its searchlight voxels in the lookup table, pulling the values for those voxels from the 4d input image, then writing the classification accuracy into the voxel's place in the 3d output image.
I hope that this code and/or logic is useful for people setting up their own searchlight analyses, whether in R or another language. I believe that most searchlight analysis implementations use a similar logic, though some others are probably more optimized for speed! If you give it a try, please let me know how it goes.I created the demo dataset from a little bit of a real dataset, labeling volumes to have strong information in motor and visual areas. This is what my final searchlight accuracy map looked like, plotted on the ch2bet template in MRIcroN and scaled for accuracies of 0.5 to 1.
UPDATE 13 May 2015
The searchlight demo code above has code for creating searchlights shaped as iterative shells. It is now perhaps most common to make "spherical" searchlights, so this updated searchlight demo has code to create searchlights of the same shape as used by pyMVPA, BrainVoyager, and the Princeton toolbox. I also cleaned up a few comments, and changed the lookup table and 3d image to plain text and NIfTI, instead of .rds, for access in multiple programs, not just R.This shows the accuracy maps produced from this little demo dataset using 2-voxel radius searchlights, shaped as iterative shells (right) and "spheres" (left). The areas with the highest accuracies are the same in both, as we'd hoped, but the accuracies tend to be a bit higher on the right (shells), likely because 2-voxel radius spherical searchlights are smaller (have fewer voxels) than 2-voxel radius iterative shell searchlights. (Note that this doesn't mean one shape is better than the other, they're simply different shapes, and it's probably best for consistency that we all use the same shape, which means spheres.)
Saturday, January 25, 2014
"tough love for fMRI"
I lifted the title from a great post on Chris Chambers' blog NeuroChambers: "Tough love for fMRI: questions and possible solutions". He has a very succinct summary of a lot of the problems the field has been struggling with: "MRI = v expensive method + chronically under-powered designs + intense publication pressure + lack of data sharing = huge fraud incentive." And not just an incentive to outright fraud, but also to practices that fall on the spectrum of questionable research practices.
I happened to read Chris Chambers' post the same day I read "Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations", a new NeuroImage paper by Woo, Krishnan, and Wager, and I must say the combination is disheartening.
The paper describes mass-univariate analysis interpretation, but there are a lot of parallels between the problems in spatial specificity they point out and those that can occur in searchlight or ROI-based MVPA. In short: it is not appropriate to interpret sub-regions when an analysis produces a large cluster that spans several regions; some of the cluster voxels are significant, but you can't say which ones. You can't say that an area is "activated" simply because it is part of a larger activated area that passed a cluster threshold - it could be that only adjacent regions (other parts of the cluster) are actually activated.
The authors make several practical suggestions, one of which is to individually color-code each cluster that passed the cluster-extent thresholding (so you'd plot a group of blue voxels, another of red voxels, etc), rather than the common practice of plotting all surviving voxels with the same heatmap color scale. They further recommend "that figure legends and captions explicitly state that the true activation location and extent within each significant cluster cannot be determined." This pushes interpretation (properly) to the cluster level, paralleling ROI-based MVPA, which presents descriptions of the properties of ROIs, not the voxels they contain.
This (and other advice in the paper) is very practical, but I suspect people will resist: it is a lot harder to come up with a compelling story (and get a blob in the "right spot") when clusters are properly interpreted. But this is the sort of "tough love" that is needed for fMRI.
Woo CW, Krishnan A, & Wager TD (2014). Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations. NeuroImage PMID: 24412399
I happened to read Chris Chambers' post the same day I read "Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations", a new NeuroImage paper by Woo, Krishnan, and Wager, and I must say the combination is disheartening.
The paper describes mass-univariate analysis interpretation, but there are a lot of parallels between the problems in spatial specificity they point out and those that can occur in searchlight or ROI-based MVPA. In short: it is not appropriate to interpret sub-regions when an analysis produces a large cluster that spans several regions; some of the cluster voxels are significant, but you can't say which ones. You can't say that an area is "activated" simply because it is part of a larger activated area that passed a cluster threshold - it could be that only adjacent regions (other parts of the cluster) are actually activated.
The authors make several practical suggestions, one of which is to individually color-code each cluster that passed the cluster-extent thresholding (so you'd plot a group of blue voxels, another of red voxels, etc), rather than the common practice of plotting all surviving voxels with the same heatmap color scale. They further recommend "that figure legends and captions explicitly state that the true activation location and extent within each significant cluster cannot be determined." This pushes interpretation (properly) to the cluster level, paralleling ROI-based MVPA, which presents descriptions of the properties of ROIs, not the voxels they contain.
This (and other advice in the paper) is very practical, but I suspect people will resist: it is a lot harder to come up with a compelling story (and get a blob in the "right spot") when clusters are properly interpreted. But this is the sort of "tough love" that is needed for fMRI.
Tuesday, December 31, 2013
Akama 2012: Decoding semantics across fMRI sessions
I want to highlight just a few methodology-type bits of Akama 2012 ("Decoding semantics across fMRI sessions with different stimulus modalities"); I encourage you to read the full paper for discussion of why they actually did the experiment (and what they made of it!).
Figure 4 shows the single-volume classification accuracy for each of the five participants (P1, P2, etc) and two stimulus modalities ("audio" and "ortho"), classifying whether the stimulus (in both modalities) was of a tool or an animal. They used a 1-second TR, and (I think) leave-one-run-out cross-validation, with six runs. They label the dashed horizontal line as "Chance Level" from a binomial test, which I think is rather lenient, but the accuracies are high enough that I won't quibble about significance (I'd just label 50% as chance, since it's two-class classification).
As the authors point out, the peak classification accuracy is closer to 7 seconds after onset in four of the participants, instead of 5, as in the default SPM HRF (thick red line). The lines are hard to see, but P5 (darkish purple) is the person with the most similarity between the canonical HRF and classification accuracy curves, for both modalities. The accuracy curves tend to be similar across modalities within participants, or, as they put it: "The main regularities were specific to participants, with correlations between the profiles of both conditions of 0.88 for P1, 0.72 for P2, 0.81 for P3, 0.91 for P4, and 0.96 for P5."
This is the nicest figure I've seen showing how accuracy increases over time and varies over people; the scanning parameters and experimental design let the authors calculate this at a higher resolution than usual. This figure will be going into my introduction-to-MVPA slides.
A new experiment could be run, in which both audio and ortho stimuli are presented in each session (e.g. three runs of each on each day). Is cross-modal classification less accurate than within-modal, or is cross-session accuracy less than within-session accuracy? In other words, I expect analyses involving images collected on separate days to be less accurate than analyses with images collected within a single session, simply because time has passed (the scanner properties could change a bit, the person could have different hydration or movement, etc). It's not possible to sort out the session and modality effects in Akama 2012, since they're confounded.
Also, I wonder if the feature selection method (picking top 500 voxels by ANOVA ranking in training set) hurt the likelihood of finding strong cross-modal classification. A ROI-based analysis might be an appropriate alternative, particularly since Tool vs. Animal classification has been studied before.
Hiroyuki Akama, Brian Murphy, Li Na, Yumiko Shimizu, & Massimo Poesio (2012). Decoding semantics across fMRI sessions with different stimulus modalities: a practical MVPA study Frontiers in NeuroInformatics DOI: 10.3389/fninf.2012.00024
accuracy over time
The main bit I want to highlight is Figure 4 ("Comparison between the model accuracy function and the canonical HRF in the range of 0–20 s after stimulus onset."):Figure 4 shows the single-volume classification accuracy for each of the five participants (P1, P2, etc) and two stimulus modalities ("audio" and "ortho"), classifying whether the stimulus (in both modalities) was of a tool or an animal. They used a 1-second TR, and (I think) leave-one-run-out cross-validation, with six runs. They label the dashed horizontal line as "Chance Level" from a binomial test, which I think is rather lenient, but the accuracies are high enough that I won't quibble about significance (I'd just label 50% as chance, since it's two-class classification).
As the authors point out, the peak classification accuracy is closer to 7 seconds after onset in four of the participants, instead of 5, as in the default SPM HRF (thick red line). The lines are hard to see, but P5 (darkish purple) is the person with the most similarity between the canonical HRF and classification accuracy curves, for both modalities. The accuracy curves tend to be similar across modalities within participants, or, as they put it: "The main regularities were specific to participants, with correlations between the profiles of both conditions of 0.88 for P1, 0.72 for P2, 0.81 for P3, 0.91 for P4, and 0.96 for P5."
This is the nicest figure I've seen showing how accuracy increases over time and varies over people; the scanning parameters and experimental design let the authors calculate this at a higher resolution than usual. This figure will be going into my introduction-to-MVPA slides.
cross-modal comments
The experiment had two sessions (I think on separate days, though this isn't clear): one in which images of hand tools and animals (same each modality) were accompanied by auditory labels ("audio") and the other, with written labels ("ortho"). Classification was of image category (tool or animal), and was done either with a single session's data (all audio or all ortho), or cross-session (cross-modal), training with audio and testing with ortho, or training with ortho and testing with audio. Classification accuracies were pretty high both times (figures 2 and 3), but higher within- than cross-modality, and the authors spend some time exploring and discussing possible reasons why. Here are a few more ideas.A new experiment could be run, in which both audio and ortho stimuli are presented in each session (e.g. three runs of each on each day). Is cross-modal classification less accurate than within-modal, or is cross-session accuracy less than within-session accuracy? In other words, I expect analyses involving images collected on separate days to be less accurate than analyses with images collected within a single session, simply because time has passed (the scanner properties could change a bit, the person could have different hydration or movement, etc). It's not possible to sort out the session and modality effects in Akama 2012, since they're confounded.
Also, I wonder if the feature selection method (picking top 500 voxels by ANOVA ranking in training set) hurt the likelihood of finding strong cross-modal classification. A ROI-based analysis might be an appropriate alternative, particularly since Tool vs. Animal classification has been studied before.
Tuesday, December 3, 2013
permutation schemes: leave-one-subject-out
I've posted before about permutation schemes for the group level; here's an explicit illustration of the case of leave-one-subject-out cross-validation, using a version of these conventions. I'm using "leave-one-subject-out" as shorthand to refer to cross-validation in which partitioning is done on the participants, as illustrated here.
This diagram illustrates a fMRI dataset from two tasks and four participants, with one example of each task per person. Since there are four participants there are four cross-validation folds; here, the arrows indicate that for the first fold the classifier is trained on data from subjects 2, 3, and 4, while subject 1's data makes up the test set. The overall mean accuracy results from averaging the four accuracies (see this for more explanation).
How to get a significance value for the overall mean accuracy (grey oval)? We can't just do a t-test (or binomial) over the accuracies we got by leaving out each person (as often done after within-subjects classification), because the accuracies are not at all independent: we need a test for the final mean accuracy itself.
My strategy is to do dataset-wise permutation testing by label-flipping within one or more participants each fold. In this case, there are only four participants, so we could flip labels in just one or two of the participants, for 10 possible permutations (choose(4,1) + choose(4,2)) (flipping all the subjects' labels would recreate the original dataset, since this is dataset-wise permutation testing.) Here, the task labels for participant 1 only are flipped (the first example becomes task 2, the second, task 1). The entire cross-validation then proceeds with each flipped-labels dataset.
I follow the same strategy when there are more than one example of each class per person: when the person's labels are to be 'flipped' for a permutation I relabel all the class 1 examples as class 2 (and class 2 as class 1). While the number of possible permutations is limited by number of participants in this scheme, it increases rapidly with the number of participants.
Anyone do anything different?
This diagram illustrates a fMRI dataset from two tasks and four participants, with one example of each task per person. Since there are four participants there are four cross-validation folds; here, the arrows indicate that for the first fold the classifier is trained on data from subjects 2, 3, and 4, while subject 1's data makes up the test set. The overall mean accuracy results from averaging the four accuracies (see this for more explanation).
How to get a significance value for the overall mean accuracy (grey oval)? We can't just do a t-test (or binomial) over the accuracies we got by leaving out each person (as often done after within-subjects classification), because the accuracies are not at all independent: we need a test for the final mean accuracy itself.
My strategy is to do dataset-wise permutation testing by label-flipping within one or more participants each fold. In this case, there are only four participants, so we could flip labels in just one or two of the participants, for 10 possible permutations (choose(4,1) + choose(4,2)) (flipping all the subjects' labels would recreate the original dataset, since this is dataset-wise permutation testing.) Here, the task labels for participant 1 only are flipped (the first example becomes task 2, the second, task 1). The entire cross-validation then proceeds with each flipped-labels dataset.
I follow the same strategy when there are more than one example of each class per person: when the person's labels are to be 'flipped' for a permutation I relabel all the class 1 examples as class 2 (and class 2 as class 1). While the number of possible permutations is limited by number of participants in this scheme, it increases rapidly with the number of participants.
Anyone do anything different?
Thursday, November 14, 2013
"Supplemental or detrimental?"
The title of this post is copied from an article in the Scientist back in 2011, but is just as relevant now. I'm a big fan of sensitivity analyses: demonstrating that the findings are not due to a particular choice of parameters (preprocessing, classifier, searchlight diameter, etc), but hold over a range of reasonable choices. Such sensitivity analyses strike me as a practical way to guard against false-positive or over-fished findings; and perhaps the only way to demonstrate robustness for highly complex studies like most MVPA. I really don't think there's going to be a single statistical approach that's going to solve all our problems - multiple approaches are necessary.
Which brings me back to "supplemental or detrimental": there's no succinct way to describe and show the outcome of things like sensitivity analyses in a few sentences, or incorporate them into a short manuscript. Many, many methodological details just can't be explained briefly, aren't critical for the main discussion, but are very interesting for specialists and necessary for replication. For example, papers often simply state that "permutation testing was used", without explaining exactly how the relabeling was carried out. The precise permutation scheme sometimes makes a big difference in what's considered significant. But even I don't necessarily want to go through a half-page explanation of the permutation scheme in the main text; I'd rather see a sentence-or-two description, accompanied by a pointer to a complete description (maybe even with figures and code) in the Supplemental. Similarly, I might just want to see a sentence in the main Results mentioning that they found a similar pattern of results with a different group-level statistic, with the actual details in the Supplemental.
But what's an author to do when journals have extremely restrictive policies like those for Neuron ("Each main figure or table can be associated with up to one supplemental figure. Therefore, the total number of supplemental figures should not exceed the number of main figures and tables.") or Nature Neuroscience ("Please note that Supplementary methods are not allowed. A supplementary note should be used only in consultation with the editors...")? None of the options are great: Authors can submit in journals without such restrictions (like NeuroImage), but other concerns sometimes make submitting to particular journals necessary. Authors can strip out the "extra" information to fit space requirements, perhaps leaving oblique references to it ("single-subject information maps were consistent"), but that makes it impossible to judge what the hidden analyses actually were. Supplemental analyses could perhaps be hosted elsewhere (e.g. lab websites, personal blogs), in the hopes that readers can find the material via google, but these links are not permanent, nor tied to the article.
It strikes me as inherently contradictory to lament replicability problems while simultaneously having highly restrictive publication rules for supplementary information. Complex science is complex, and requires complex description. Good science can't always be properly conveyed by a few words, even by the best authors; and MVPA studies should be accompanied by additional details, even when such details make a publication "confusing and unwieldy".
Which brings me back to "supplemental or detrimental": there's no succinct way to describe and show the outcome of things like sensitivity analyses in a few sentences, or incorporate them into a short manuscript. Many, many methodological details just can't be explained briefly, aren't critical for the main discussion, but are very interesting for specialists and necessary for replication. For example, papers often simply state that "permutation testing was used", without explaining exactly how the relabeling was carried out. The precise permutation scheme sometimes makes a big difference in what's considered significant. But even I don't necessarily want to go through a half-page explanation of the permutation scheme in the main text; I'd rather see a sentence-or-two description, accompanied by a pointer to a complete description (maybe even with figures and code) in the Supplemental. Similarly, I might just want to see a sentence in the main Results mentioning that they found a similar pattern of results with a different group-level statistic, with the actual details in the Supplemental.
But what's an author to do when journals have extremely restrictive policies like those for Neuron ("Each main figure or table can be associated with up to one supplemental figure. Therefore, the total number of supplemental figures should not exceed the number of main figures and tables.") or Nature Neuroscience ("Please note that Supplementary methods are not allowed. A supplementary note should be used only in consultation with the editors...")? None of the options are great: Authors can submit in journals without such restrictions (like NeuroImage), but other concerns sometimes make submitting to particular journals necessary. Authors can strip out the "extra" information to fit space requirements, perhaps leaving oblique references to it ("single-subject information maps were consistent"), but that makes it impossible to judge what the hidden analyses actually were. Supplemental analyses could perhaps be hosted elsewhere (e.g. lab websites, personal blogs), in the hopes that readers can find the material via google, but these links are not permanent, nor tied to the article.
It strikes me as inherently contradictory to lament replicability problems while simultaneously having highly restrictive publication rules for supplementary information. Complex science is complex, and requires complex description. Good science can't always be properly conveyed by a few words, even by the best authors; and MVPA studies should be accompanied by additional details, even when such details make a publication "confusing and unwieldy".
Wednesday, November 6, 2013
travels: NIN and HDM
I'll be traveling a fair bit in the next month: I'll be at the Netherlands Institute of Neuroscience the last two weeks of November, then giving a talk at the 1st International Workshop on High Dimensional Data Mining (HDM) in Dallas, TX on 7 December. Drop me a line if you'll be around either and like to meet.
Subscribe to:
Posts (Atom)




