Tuesday, December 31, 2013

Akama 2012: Decoding semantics across fMRI sessions

I want to highlight just a few methodology-type bits of Akama 2012 ("Decoding semantics across fMRI sessions with different stimulus modalities"); I encourage you to read the full paper for discussion of why they actually did the experiment (and what they made of it!).

accuracy over time

The main bit I want to highlight is Figure 4 ("Comparison between the model accuracy function and the canonical HRF in the range of 0–20 s after stimulus onset."):

Figure 4 shows the single-volume classification accuracy for each of the five participants (P1, P2, etc.) and two stimulus modalities ("audio" and "ortho"), classifying whether the stimulus (in each modality) was a tool or an animal. They used a 1-second TR, and (I think) leave-one-run-out cross-validation with six runs. They label the dashed horizontal line as "Chance Level" from a binomial test, which I think is rather lenient, but the accuracies are high enough that I won't quibble about significance (I'd just label 50% as chance, since this is two-class classification).
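For reference, a one-sided binomial significance threshold of the sort behind their "Chance Level" line can be computed in a few lines. This is just a sketch in Python; the trial count below is made up, since the paper's actual test-set sizes determine where their dashed line falls:

```python
from math import comb

def binomial_threshold(n_trials, alpha=0.05):
    """Smallest accuracy whose one-sided binomial tail probability,
    under 50% (two-class) chance, falls below alpha."""
    for k in range(n_trials + 1):
        # probability of k or more correct out of n_trials by coin-flipping
        tail = sum(comb(n_trials, i) for i in range(k, n_trials + 1)) / 2 ** n_trials
        if tail < alpha:
            return k / n_trials
    return 1.0

# with 60 test trials (a made-up count), "significantly above chance"
# needs roughly 62% accuracy, noticeably above the true 50% chance rate
print(binomial_threshold(60))
```

This is why a binomial "chance level" line sits above 50%: it marks the accuracy needed for significance at the chosen alpha, not chance itself.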

As the authors point out, the peak classification accuracy falls closer to 7 seconds after stimulus onset in four of the participants, rather than at 5 seconds as in the default SPM HRF (thick red line). The lines are hard to see, but P5 (darkish purple) is the participant with the most similarity between the canonical HRF and the classification accuracy curves, for both modalities. The accuracy curves tend to be similar across modalities within participants, or, as they put it: "The main regularities were specific to participants, with correlations between the profiles of both conditions of 0.88 for P1, 0.72 for P2, 0.81 for P3, 0.91 for P4, and 0.96 for P5."

This is the nicest figure I've seen showing how accuracy increases over time and varies over people; the scanning parameters and experimental design let the authors calculate this at a higher resolution than usual. This figure will be going into my introduction-to-MVPA slides.

cross-modal comments

The experiment had two sessions (I think on separate days, though this isn't clear): one in which images of hand tools and animals (the same images in each modality) were accompanied by auditory labels ("audio"), and the other with written labels ("ortho"). Classification was of image category (tool or animal), and was done either within a single session's data (all audio or all ortho) or cross-session (cross-modal): training with audio and testing with ortho, or training with ortho and testing with audio. Classification accuracies were fairly high in both cases (Figures 2 and 3), but higher within- than cross-modality, and the authors spend some time exploring and discussing possible reasons why. Here are a few more ideas.

A new experiment could be run in which both audio and ortho stimuli are presented in each session (e.g. three runs of each on each day). That would make it possible to ask: is cross-modal classification less accurate than within-modal, or is cross-session accuracy less than within-session accuracy? In other words, I expect analyses involving images collected on separate days to be less accurate than analyses with images collected within a single session, simply because time has passed (the scanner properties could change a bit, the person could have different hydration or movement, etc.). It's not possible to sort out the session and modality effects in Akama 2012, since they're confounded.

Also, I wonder if the feature selection method (picking the top 500 voxels by ANOVA ranking in the training set) hurt the likelihood of finding strong cross-modal classification. An ROI-based analysis might be an appropriate alternative, particularly since tool vs. animal classification has been studied before.

Hiroyuki Akama, Brian Murphy, Li Na, Yumiko Shimizu, & Massimo Poesio (2012). Decoding semantics across fMRI sessions with different stimulus modalities: a practical MVPA study. Frontiers in Neuroinformatics. DOI: 10.3389/fninf.2012.00024

Tuesday, December 3, 2013

permutation schemes: leave-one-subject-out

I've posted before about permutation schemes for the group level; here's an explicit illustration of the case of leave-one-subject-out cross-validation, using a version of these conventions. I'm using "leave-one-subject-out" as shorthand to refer to cross-validation in which partitioning is done on the participants, as illustrated here.
This diagram illustrates an fMRI dataset from two tasks and four participants, with one example of each task per person. Since there are four participants, there are four cross-validation folds; here, the arrows indicate that for the first fold the classifier is trained on data from subjects 2, 3, and 4, while subject 1's data makes up the test set. The overall mean accuracy results from averaging the four fold accuracies (see this for more explanation).
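The fold structure can be sketched in a few lines of Python (toy stand-in data; the subject numbering and one-example-per-task setup follow the diagram):

```python
subjects = [1, 2, 3, 4]
# one example of each of the two tasks per person: (subject, task) pairs
examples = [(s, task) for s in subjects for task in (1, 2)]

# one cross-validation fold per subject: their examples form the test
# set, everyone else's form the training set
folds = []
for test_subject in subjects:
    train = [ex for ex in examples if ex[0] != test_subject]
    test = [ex for ex in examples if ex[0] == test_subject]
    folds.append((train, test))

# a classifier would be trained and scored on each fold; the overall
# accuracy is the mean of the four fold accuracies
for train, test in folds:
    print(len(train), len(test))  # 6 2, on every fold
```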

How to get a significance value for the overall mean accuracy (grey oval)? We can't just do a t-test (or binomial) over the accuracies we got by leaving out each person (as often done after within-subjects classification), because the accuracies are not at all independent: we need a test for the final mean accuracy itself.

My strategy is to do dataset-wise permutation testing, flipping the task labels within one or more participants. In this case there are only four participants, so we can flip labels in just one or two of them, for 10 possible permutations (choose(4,1) + choose(4,2); flipping all four subjects' labels would recreate the original dataset, since this is dataset-wise permutation testing). Here, the task labels for participant 1 only are flipped (the first example becomes task 2; the second, task 1). The entire cross-validation then proceeds with each flipped-labels dataset.

I follow the same strategy when there is more than one example of each class per person: when a person's labels are to be 'flipped' for a permutation, I relabel all of their class 1 examples as class 2 (and class 2 as class 1). While the number of possible permutations is limited by the number of participants in this scheme, it increases rapidly with the number of participants.
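A minimal sketch of the scheme in Python (with toy labels, not real data): enumerate the subsets of participants whose labels get flipped, then relabel every example belonging to a flipped participant.

```python
from itertools import combinations

subjects = [1, 2, 3, 4]

# all subsets of one or two participants to flip:
# choose(4,1) + choose(4,2) = 10 permutations (flipping everyone
# would recreate the original dataset)
flip_sets = [set(c) for r in (1, 2) for c in combinations(subjects, r)]
print(len(flip_sets))  # 10

def relabel(examples, flip_set):
    """Swap class 1 and class 2 for every example from a flipped subject."""
    swap = {1: 2, 2: 1}
    return [(s, swap[lab] if s in flip_set else lab) for (s, lab) in examples]

# e.g. flipping participant 1 only, with one example of each class per person:
examples = [(s, lab) for s in subjects for lab in (1, 2)]
print(relabel(examples, {1}))  # participant 1's labels swap; the rest stay
```

The full cross-validation is then re-run on each of the 10 relabeled datasets to build the permutation distribution for the overall mean accuracy.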

Anyone do anything different?