Tuesday, December 31, 2013

Akama 2012: Decoding semantics across fMRI sessions

I want to highlight just a few methodology-type bits of Akama 2012 ("Decoding semantics across fMRI sessions with different stimulus modalities"); I encourage you to read the full paper for discussion of why they actually did the experiment (and what they made of it!).

accuracy over time

The main bit I want to highlight is Figure 4 ("Comparison between the model accuracy function and the canonical HRF in the range of 0–20 s after stimulus onset."):

Figure 4 shows the single-volume classification accuracy for each of the five participants (P1, P2, etc) and two stimulus modalities ("audio" and "ortho"), classifying whether the stimulus (in both modalities) was of a tool or an animal. They used a 1-second TR, and (I think) leave-one-run-out cross-validation, with six runs. They label the dashed horizontal line as "Chance Level" from a binomial test, which I think is rather lenient, but the accuracies are high enough that I won't quibble about significance (I'd just label 50% as chance, since it's two-class classification).

As the authors point out, the peak classification accuracy is closer to 7 seconds after onset in four of the participants, instead of 5, as in the default SPM HRF (thick red line). The lines are hard to see, but P5 (darkish purple) is the person with the most similarity between the canonical HRF and classification accuracy curves, for both modalities. The accuracy curves tend to be similar across modalities within participants, or, as they put it: "The main regularities were specific to participants, with correlations between the profiles of both conditions of 0.88 for P1, 0.72 for P2, 0.81 for P3, 0.91 for P4, and 0.96 for P5."

This is the nicest figure I've seen showing how accuracy increases over time and varies over people; the scanning parameters and experimental design let the authors calculate this at a higher resolution than usual. This figure will be going into my introduction-to-MVPA slides.

cross-modal comments

The experiment had two sessions (I think on separate days, though this isn't clear): one in which images of hand tools and animals (same each modality) were accompanied by auditory labels ("audio") and the other, with written labels ("ortho"). Classification was of image category (tool or animal), and was done either with a single session's data (all audio or all ortho), or cross-session (cross-modal), training with audio and testing with ortho, or training with ortho and testing with audio. Classification accuracies were pretty high both times (figures 2 and 3), but higher within- than cross-modality, and the authors spend some time exploring and  discussing possible reasons why. Here are a few more ideas.

A new experiment could be run, in which both audio and ortho stimuli are presented in each session (e.g. three runs of each on each day). Is cross-modal classification less accurate than within-modal, or is cross-session accuracy less than within-session accuracy? In other words, I expect analyses involving images collected on separate days to be less accurate than analyses with images collected within a single session, simply because time has passed (the scanner properties could change a bit, the person could have different hydration or movement, etc). It's not possible to sort out the session and modality effects in Akama 2012, since they're confounded.

Also, I wonder if the feature selection method (picking top 500 voxels by ANOVA ranking in training set) hurt the likelihood of finding strong cross-modal classification. A ROI-based analysis might be an appropriate alternative, particularly since Tool vs. Animal classification has been studied before.

ResearchBlogging.org Hiroyuki Akama, Brian Murphy, Li Na, Yumiko Shimizu, & Massimo Poesio (2012). Decoding semantics across fMRI sessions with different stimulus modalities: a practical MVPA study Frontiers in NeuroInformatics DOI: 10.3389/fninf.2012.00024

Tuesday, December 3, 2013

permutation schemes: leave-one-subject-out

I've posted before about permutation schemes for the group level; here's an explicit illustration of the case of leave-one-subject-out cross-validation, using a version of these conventions. I'm using "leave-one-subject-out" as shorthand to refer to cross-validation in which partitioning is done on the participants, as illustrated here.
This diagram illustrates a fMRI dataset from two tasks and four participants, with one example of each task per person. Since there are four participants there are four cross-validation folds; here, the arrows indicate that for the first fold the classifier is trained on data from subjects 2, 3, and 4, while subject 1's data makes up the test set. The overall mean accuracy results from averaging the four accuracies (see this for more explanation).

How to get a significance value for the overall mean accuracy (grey oval)? We can't just do a t-test (or binomial) over the accuracies we got by leaving out each person (as often done after within-subjects classification), because the accuracies are not at all independent: we need a test for the final mean accuracy itself.

My strategy is to do dataset-wise permutation testing by label-flipping within one or more participants each fold. In this case, there are only four participants, so we could flip labels in just one or two of the participants, for 10 possible permutations (choose(4,1) + choose(4,2)) (flipping all the subjects' labels would recreate the original dataset, since this is dataset-wise permutation testing.) Here, the task labels for participant 1 only are flipped (the first example becomes task 2, the second, task 1). The entire cross-validation then proceeds with each flipped-labels dataset.

 I follow the same strategy when there are more than one example of each class per person: when the person's labels are to be 'flipped' for a permutation I relabel all the class 1 examples as class 2 (and class 2 as class 1). While the number of possible permutations is limited by number of participants in this scheme, it increases rapidly with the number of participants.

Anyone do anything different?

Thursday, November 14, 2013

"Supplemental or detrimental?"

The title of this post is copied from an article in the Scientist back in 2011, but is just as relevant now. I'm a big fan of sensitivity analyses: demonstrating that the findings are not due to a particular choice of parameters (preprocessing, classifier, searchlight diameter, etc), but hold over a range of reasonable choices. Such sensitivity analyses strike me as a practical way to guard against false-positive or over-fished findings; and perhaps the only way to demonstrate robustness for highly complex studies like most MVPA. I really don't think there's going to be a single statistical approach that's going to solve all our problems - multiple approaches are necessary.

Which brings me back to "supplemental or detrimental": there's no succinct way to describe and show the outcome of things like sensitivity analyses in a few sentences, or incorporate them into a short manuscript. Many, many methodological details just can't be explained briefly, aren't critical for the main discussion, but are very interesting for specialists and necessary for replication. For example, papers often simply state that "permutation testing was used", without explaining exactly how the relabeling was carried out. The precise permutation scheme sometimes makes a big difference in what's considered significant. But even I don't necessarily want to go through a half-page explanation of the permutation scheme in the main text; I'd rather see a sentence-or-two description, accompanied by a pointer to a complete description (maybe even with figures and code) in the Supplemental. Similarly, I might just want to see a sentence in the main Results mentioning that they found a similar pattern of results with a different group-level statistic, with the actual details in the Supplemental.

But what's an author to do when journals have extremely restrictive policies like those for Neuron ("Each main figure or table can be associated with up to one supplemental figure. Therefore, the total number of supplemental figures should not exceed the number of main figures and tables.") or Nature Neuroscience ("Please note that Supplementary methods are not allowed. A supplementary note should be used only in consultation with the editors...")? None of the options are great: Authors can submit in journals without such restrictions (like NeuroImage), but other concerns sometimes make submitting to particular journals necessary. Authors can strip out the "extra" information to fit space requirements, perhaps leaving oblique references to it ("single-subject information maps were consistent"), but that makes it impossible to judge what the hidden analyses actually were. Supplemental analyses could perhaps be hosted elsewhere (e.g. lab websites, personal blogs), in the hopes that readers can find the material via google, but these links are not permanent, nor tied to the article.

It strikes me as inherently contradictory to lament replicability problems while simultaneously having highly restrictive publication rules for supplementary information. Complex science is complex, and requires complex description. Good science can't always be properly conveyed by a few words, even by the best authors; and MVPA studies should be accompanied by additional details, even when such details make a publication "confusing and unwieldy".

Wednesday, November 6, 2013

travels: NIN and HDM

I'll be traveling a fair bit in the next month: I'll be at the Netherlands Institute of Neuroscience the last two weeks of November, then giving a talk at the 1st International Workshop on High Dimensional Data Mining (HDM) in Dallas, TX on 7 December. Drop me a line if you'll be around either and like to meet.

Saturday, October 19, 2013

nice doughnuts: pretty searchlight quirks

More than a year ago I posted an explanation about how strange ring-type edge effects can show up in searchlight analysis information maps, but this morning I saw the most vivid example of the phenomenon I've ever run across in real data.

These images show slices from two leave-one-subject-out searchlight analyses (linear SVM). The subjects were viewing all sorts of movie stimuli, and for a positive control I classified whether the people were viewing stimuli showing people moving, or whether the stimuli showed still people. This analysis should work very well, reporting massive information (high classification accuracy, in this case) in motor, somatosensory, and visual areas.

At right are a few slices of the information map that results when the examples are normalized within each searchlight (i.e. to a mean of 0, standard deviation of 1). The highest-classifying searchlights are in the proper brain areas (and with properly high accuracies), but in strange, hollow, shapes.

And here's what it looks like when the images are not normalized within each searchlight. The same areas have accurate searchlights, but now they make solid "blobs", with the most highly accurate searchlights (yellow) in the centers - where normalized searchlights had chance accuracy.

Read that previous post for a full explanation. This is the sort of data we'd expect to have the strongest edge/center effects: focal areas with high information, areas large enough to allow searchlights to fit inside the middle of the informative areas. It's also likely that the activations in these areas are all in the same direction (higher activity during the moving stimuli than the non-moving stimuli), creating consistent differences in the amount of BOLD throughout the areas. Normalizing each example within each searchlight removes this consistent BOLD difference, and so classification fails.

Thursday, September 26, 2013

connectome workbench: view subjects' data

In the previous post I described how to import a nifti image into the Connectome Workbench, as well as how to install and get started with the Workbench. In this post I'll describe how to get and view the Human Connectome Project data, in and out of the Workbench. If you know of easier/more reliable/faster ways to do anything I describe here, please let me know.

getting the data

I found downloading the data from the Connectome database quite straightforward. You need to get an account (and agree to usage terms), but it's all quite fast. The datasets are large, so I suggest you just try one person's data to start. To do so, click the "Download Now" button under the "1 subject" box. It won't actually download right away, but rather take you to a screen where you can specify what exactly you want to download. On the second screen, click the "queue for download" buttons next to each sort of file you want. I suggest you pick the preprocessed structural data (e.g. "HCP Q3 Data Release Structural Preprocessed") and some functional data (e.g. "HCP Q3 Data Release Task Motor Skills fMRI Preprocessed"). The click the "Download Packages" button at the bottom of the page to actually start the download.

I was confused the first time because the files went into Aspera Connect's default directory; on my Windows box I right-click the Aspera icon and select Preferences to see (and change) where it puts the files. This thread on the HCP-Users mailing list (and this documentation) talks about their use of Aspera Connect for the downloading.

figuring out the files: task labels and volumetric data

Once you've got the files downloaded, unzip them to somewhere handy. They'll create a big directory structure, with the top directory called the subject's ID code (e.g. 100307_Q3). If you followed my suggestions above, the functional data will be in /MNINonLinear/Results/tfMRI_MOTOR_LR/ and /MNINonLinear/Results/tfMRI_MOTOR_RL/. Two motor (and most task-type) runs are in the HCP protocol, with _LR and _RL referring to the scanner encoding direction used for each run.

The tasks are described in Barch et al. 2013 (that issue of NeuroImage has several articles with useful details about the HCP protocols and data), and the onset times (in seconds) are given in text files in the EVs subdirectories for each run's data, as well as in larger files (eprime output?) in each run directory (e.g. MOTOR_run2_TAB.txt).

I couldn't find a concise key for the task files, but working back and forth between Barch 2013 and the files I figured out that the first column of each EVs file is the onset time of a block, and the second column the block duration, also in seconds (the TR is 720 msec, so it's easy enough to calculate between TR and seconds). rh.txt is the right hand (actually finger) movement, rf.txt the right foot (toe) movement, t.txt the tongue movement, etc.
UPDATE (19 May 2014): The "tfMRI scripts and data files" section Q3_Release_Reference_Manual.pdf describes this a bit better.

Now that we know the onset times for each task block we can do a volumetric analysis in our usual favorite manner, using the preprocessed .nii.gz files in each run directory (e.g. tfMRI_MOTOR_LR.nii.gz). A full description of the preprocessing is in Glasser et al. 2013; motion correction and spatial normalization (to the MNI template) is included.

These are pretty functional images; this is one of the functional runs.

figuring out the files: surface-based

The HCP data is not just available in volumes, but surfaces as well (actually, a big focus is on surface-type data, see Glasser 2013), and these files need to be opened with Workbench programs. I certainly don't have all of these figured out, so please send along additional tips or corrections if you have them.

The surface-type functional data is in .dtseries.nii files in each runs' directory (e.g. tfMRI_MOTOR_LR_Atlas.dtseries.nii). As far as I can tell these are not "normal" NIfTI images (despite the .nii suffix) but rather a "CIFTI dense timeseries file". Various wb_command programs can read these files, as well as the Workbench.

So, let's view the functional data in Workbench. My approach is similar to what I described for viewing a NIfTI: first get a blank brain (in this case, the downloaded subject's anatomy), then load in the functional data (in this case, the .dtseries.nii for a motor run). I believe the proper 'blank brain' to use is the one downloaded into the subject's /MNINonLinear/fsaverage_LR32k/ directory; there's only one .spec file in that directory. Opening that .spec file (see the first post for what that means) will bring up a set of labels and surfaces, shown at right.

Now we can open up the functional data. Choose File then Open File in the top Workbench menu, and change the selection to Files of type: "Connectivity - Dense Data Series Files (*.dtseries.nii)". Navigate to the directory for a run and open up the .dtseries.nii (there's just one per run). (aside, the "Connectivity" in the File Type box confuses me; this isn't connectivity data as far as I can tell)

As before, the Workbench screen won't change, but the data should be loaded. Switch to the Charting tab on the bottom of the screen (Overlay ToolBox section), and click the little Graph button to the left of the .dtseries.nii file name (marked with an orange arrow below). Now, if you click somewhere on the brain two windows will appear: one showing the timecourse of the grayordinate you clicked on (read out of the .dtseries.nii) and the usual Information window giving the label information for the grayordinate.


Clicking the Export ... button in the Time Course window will let you save a text file (it says "Dvars Files" but it's plain text) of the time course, but I haven't yet worked out how to extract sets of timecourses, though people are quite helpful on the mailing list. The various .gii and .*.nii files are still rather a black box for me; access appears to be from various wb_command programs.

Friday, September 20, 2013

connectome workbench: plot a NIfTI image

NOTICE (8 February 2018): The material in this post is outdated. I'm leaving it after the jump for archiving, but if you're trying to learn Workbench you should consult this new tutorial for a general introduction, and this one for instructions about creating a surface from a volumetric image.

Monday, September 2, 2013

linear svm behavior: discontinuous information detection

Since starting with MVPA I've struggled with connecting classification accuracy to the properties of the voxels that (as a group) have the accuracy: does the high accuracy mean that all the voxels are informative? Just some of them? How to tell? Even with a linear svm these are not easy questions to answer.

Some of my efforts to figure out what linear SVMs are doing with fMRI data contributed to the examples in my recent paper about searchlight analysis. In this post I'll describe figure 4 in the paper (below); this is in the supplemental as Example 1 (code is example1.R).

The simulated dataset is for just one person, and has a single ROI of 500 voxels, two classes (balanced, so chance is 0.5), and four runs. Classification is with a linear SVM (c=1), averaged over a four-fold cross-validation (partitioning on the runs).

A key for this example is that the 500 voxels were created to be equally informative: the values for each voxel were created by choosing random numbers (from a uniform distribution) for one class' examples, and then adding 0.1 to each value for the examples of the other class. For concreteness, here's Figure 1 from the Supplemental, showing part of the input data:
So, we have 500 voxels, the data for each constructed so that each voxel is equally informative (in the sense of amount of activation difference between the classes). Now, let's classify. As expected, if we classify with all 500 voxels the accuracy is perfect (expected, since we know the classes vary). But what if we use less than all the voxels? I set up the example to classify sequentially larger voxel subsets: each larger subset is made by adding a voxel to the previous subset, so that each subset includes the voxels from all smaller subsets. For example, the two-voxel subset includes voxels 1 and 2, the three-voxel subset has voxels 1, 2, and 3, the four-voxel subset voxels 1, 2, 3, and 4, etc.

Figure 4 in the manuscript (below) shows one happens for one run of the simulation: the number of voxels increases from 1 to 500 along the x-axis, and accuracy from 0.5 (chance) to 1 (perfect) along the y.

In the manuscript I call this an example of discontinuous information detection: the accuracy does not increase smoothly from 0.5 to 1, but rather jumps around. For example, the 42-voxel subset classifies at 0.81. Adding voxels one-at-a-time, we get accuracies of 0.72, 0.75, 0.78, 0.94 ... the accuracy went down abruptly, then back up again. Why? My best guess is that it has to do with the shape of the points in hyperspace; each time we add a voxel we also add a dimension, and so change the point clouds, making it easier (or not) for the linear SVM to separate.

What does this mean in practice? I guess it makes me think of classification accuracies as "slippery": somewhat difficult to grasp and unpredictable, so you need to use extreme caution before thinking that you understand their behavior.

Discontinuous information detection is perhaps also an argument for stability testing. In the Figure 4 run (graph above), the accuracy didn't change much after around 210 voxels: it was consistently very high. It is much easier to interpret our results if we're pretty confident that we're in one of these stable zones. For example, suppose we have an anatomic ROI that classifies well. Now, suppose we add in the border voxels, one at a time, classifying  each "ROI + 1" subset. I'd be a lot more confident that the ROI captured a truly informative set of voxels if the accuracy stayed more-or-less the same as individual voxels were added, than if it had huge variance.

We want to make sure we're on a plateau of high accuracy (robust to minor changes in voxels or processing), not that we happened to land upon a pointy mountain top, such that any tiny variation will send us back to chance. And this example shows that linear SVMs can easily make "pointy mountains" with fMRI data.

Friday, August 30, 2013

What is a support vector machine?

This is one of the most readable, non-mathematical introductions to SVMs I've ever run across in the literature. He uses microarray data for the example, but replace "measured mRNA expression level" with "BOLD activation" and the situation is pretty much the same as in MVPA. 

He includes a description of what it means to have a soft (or hard) margin; I'll add that the conventional c=1 used with linear classifiers for MVPA corresponds to a fairly soft margin: many training points are allowed to be on the 'wrong' side of the decision boundary.

ResearchBlogging.orgWilliam S Noble (2006). What is a support vector machine? Nature Biotechnology, 24, 1565-1567 DOI: 10.1038/nbt1206-1565

Thursday, August 22, 2013

Zhang 2013: response from Jiaxiang Zhang

Jiaxiang Zhang kindly responded to the questions and comments in my previous post, and gave permission to excerpt and summarize our conversation here. Quotes from Jiaxiang are in courier font.

ROI creation

Quite a bit of my commentary was about the ROI creation process. Jiaxiang said that my outline was essentially correct, and provided some additional details and clarification. To make it a bit more readable, here's a new version of the "ROI creation" section from the previous post, answering my questions and adding in his corrections and responses.

  1. Do a mass-univariate analysis, finding voxels with significantly different activity during the chosen and specified contexts. (Two-sided, not "greater activity during the chosen than the specified context" as in my original version.)
  2. Threshold the group map from #1 at p < 0.001 uncorrected, cluster size 35 voxels.
  3. The main univariate analysis (Figure 3A and Table 1) used a FWE-corrected threshold (p<0.05). For ROI definition, a common uncorrected voxelwise threshold (p<0.001) was used, which is more liberal for subsequent MVPA. The cluster extent had very limited effects here, because all the ROIs were much larger than the extent threshold.
  4. Use Freesurfer to make cortical surface maps of the group ROIs, then transform back into each person's 3d space.
  5. We defined the grey-matter ROIs on a cortical surface (in Freesurfer), and then transformed back to the native space in NifTI format (which can be loaded into the Princeton MVPA toolbox).
    I asked if they'd had better results using Freesurfer than SPM for the spatial transformation, or if there was some other reason for their procedure.
    This is an interesting issue. I have not directly compared the two, but I would guess the two methods will generate similar results.
  6. Feature selection: pick the 120 voxels from each person, ROI, and training set with the largest "difference between conditions" from a one-way ANOVA. Concretely, this ANOVA was of the three rules (motion, color, size), performed in each person and voxel separately. I asked why they chose 120 voxels (rather than all voxels or some other subset), and if they'd tried other subset sizes.
  7. We have not tried to use all ROI voxels, because ROIs have different sizes across regions and subjects (e.g., V1). I think it is more appropriate (debatable) to fix the data dimensions. A fixed number of voxels after the feature selection for each participant and each ROI was used, to enable comparisons across regions and participants. 120 voxels were chosen because it was (roughly) the lower limit of all 190 individual ROI’s sizes (19 participants * 10 ROIs). 2 out of the 190 individual ROIs were smaller than 120 voxels (108 and 113 voxels, respectively), for which we used all voxels in that region.
  8. Normalize each example across the 120 voxels (for each person, ROI, and training set).

ROI stability

Several of my previous comments were about what, for lack of a better term, I'll call "ROI stability": were (pretty much) the same 120 voxels chosen across the cross-validation folds within each person, classification, ROI)? Here is our email exchange:
The mean ROI size is 309 ± 134 voxels (±s.d. across ROIs and participants). I did a simple test on our data. Across ROIs and participants, 26.60% ± 19.00% of voxels were not selected in any cross-validations (i.e., “bad” voxels), 10.56% ± 5.36% of voxels were selected only in one fold of cross-validation (i.e., “unstable” voxels), and 48.07% ± 25.34% of voxels were selected in 4 or more folds of cross-validations (i.e., “more stable” voxels). Although this is not a precise calculation of stability, the selected voxels do not change dramatically between cross-validations. It'd be good to know if there are other more established methods to quantify stability.
(my reply:) That is interesting! Unfortunately, I don't know of any "established" methods to evaluate stability in ROI definition procedures like this; yours is the only one I can think of that used a fold-wise selection procedure. Did you look at the location of any of those voxel categories - e.g. are the "bad" voxels scattered around the ROIs or clustered?
I do not see consistent spatial patterns on voxel categories.
I think this idea of stability, both how to measure it and how to interpret different degrees (e.g. what proportion of the voxels should be selected in each fold for a ROI to be considered stable?), is an interesting area and worthy of more attention.

significance testing

My questions about significance testing were more general, trying to understand exactly what was done.

First, about the group-level permutation testing for the ROI analyses:
For each of the 5000 permutations (in each participant and each ROI), classes in the training and test sets were relabelled, and the permuted data was analysed in the same procedure as the observed data. The order of the relabeling in each permutation was kept the same across participants. All the permuted mean accuracies formed the null distributions and were used to estimate the statistical significance of the observed classification accuracy. 
If I understand, is this similar to the "scheme 1 stripes" in http://mvpa.blogspot.com/2012/11/permutation-testing-groups-of-people.html? In other words, you didn't bootstrap for the group null distribution, but rather kept the linking across subjects. Concretely, you calculated 5000 group-level averages for the null distribution, each of the 5000 the across-subjects average after applying the same relabling scheme to each participant. This method of doing the group-level permutation test seems quite sensible to me (when possible), but I haven't run across many examples in the literature.
Yes. Your blog post is clearer on describing this method! 
How to concisely (but clearly) describe group-level permutation schemes (not to mention the impact of choosing various schemes) is another area worthy of more attention.

Next, I'd wondered how the comparisons were made between the searchlight and ROI-based results.
The peak coordinates from searchlight analysis fall within the functional ROIs in PMd, PMv and IPS. I think the similarity between the two analyses, as claimed in our paper, should not be interpreted as “equality” in terms of the blob’s shapes and extents, but their approximate anatomical locations. The two types of analysis are complementary and are based on different assumptions. The searchlight assumes that information is represented in multiple adjacent voxels (as shown in searchlight “blobs”), while the ROI analysis assumes that information is distributed within a spatially bounded region. Therefore, unless a ROI is directly defined from searchlight results, significant classification accuracy in one ROI does not necessary imply that a searchlight blob would perfectly match the spatial distribution of that ROI. 

I absolutely agree that searchlight and ROI-based analyses are complementary, based on different assumptions (and detect different types of information), and shouldn't perfectly overlap. Having the 'peak' searchlight coordinate fall into particular ROIs is a reasonable way to quantify the similarity. This is another area in which there is unfortunately no standard comparison technique (that I know of); I could imagine looking at the percent overlap or something as well.

and the questions

I'd asked three questions at the end of the previous post; here are answers from Jiaxiang for the first two (the third was answered above).

Question 1:
The PEIs for both chosen and specified trials were used in that ROI analysis (no new PEIs were created). 
So there were twice as many examples (all the chosen and all the specified) in the pooled analysis as either context alone.

Question 2:
For the cross-stage classification, chosen and specified trials at the decision stage were modeled separately (chosen and specified trials at the decision stage were averaged together in the univariate analysis and the main MVPA). 
So new PEIs were made, including just the decision time stage or just the maintenance stage?

The new first level model for cross-stage classification include both decision and maintenance stages, as in the main model.

A fascinating study (and methods); thanks again to Jiaxiang for generously answering my questions and letting me share our correspondence here!

Friday, August 9, 2013

indexing woes: FSL and caret

Back in June I shared the story of some headaches caused by inconsistent indexing: R (oro.nifti) and MRIcroN are 1-based, while python is 0-based. I later added SPM (1-based) and afni (0-based) to the collection; now, a kind person contributed images from doing the same test in FSL and caret: they are both 0-based.

As the person who sent the images wrote (I'm assuming wryly), ".... 'proper' programming languages (python, C) start counting/indexing from 0, while maths based applications like R and Matlab start from 1."

In FSL, the voxel at 17,15,20 is number 13 (0-based, like afni and pyMVPA), with voxel 25 at 18,16,21 (click images to enlarge).

And the same (0-based) is the case in caret, as shown in these images:

Thursday, August 8, 2013

Zhang 2013: "Choosing the Rules: Distinct and Overlapping Frontoparietal Representations of Task Rules for Perceptual Decisions"

"Choosing the Rules: Distinct and Overlapping Frontoparietal Representations of Task Rules for Perceptual Decisions" has interesting combinations of ROI-based and searchlight analyses. I think I figured out a lot of the methods, but all of the steps and choices were not clear. Frustratingly, why some of the thresholds were set as they were (e.g. for ROI definition) was not explained. Were these the only values tried? I tend to keep an eye out for explanations of why particular choices were made, especially when not "standard", aware of the ongoing discussions of problems caused by too many experimenter degrees of freedom. Few choices can really be considered standard, but I must say I look for more of an explanation when reading something like "the top 120 voxels were selected in each ROI" than "we used a linear SVM, c=1, libsvm package".


As usual, I'm going to focus on analysis details rather than the theoretical motivation and interpretation, but
here's a bit of introduction.

The task design is shown at right (Figure 1 from the paper). The top shows what the subject saw (and when): first, a screen with two identical (or not) task cues, then a delay, then a moving-dot stimulus, then they responded and indicated which cue (rule) they'd used to make the response.

The moving-dot stimuli always were similar, but the response the subjects were supposed to make (which button to push) depended on the rule associated with each  of the three cues (respond to dot size, color, or movement direction). Additionally, two cues were always presented simultaneously, but sometimes ("specified context") the same cue was shown twice (in which case the participant was to perform that rule), while other times (the "chosen context") two different cues were shown (in which case the participant was to pick which cue of the two they wanted to follow).

To summarize, we're interested in distinguishing three rule cues (motion, color, size), across two contexts (specified and chosen). The tasks were in eight runs, with 30 trials in each run. The cue and contexts were in random order, but with full balance within each run (5 repeats of each of the 2 contexts * 3 cues).


  • SPM8 for motion and slice-time correction. MVPA in subject space (normalized afterwards for group-level analysis).
  • They created PEIs in SPM for the MVPA, one per trial (so 240, 8 runs * 30 trials/run), with the timing matched to the length of the delay periods. { note: "PEIs" are parameter estimate images, made by fitting a HRF-aware linear model to the voxel timecourses do then doing MVPA on the beta weights produced by the model. }
  • Classification was with linear SVMs (c=1), Princeton MVPA toolbox (and libsvm), partitioning on the runs, within-subjects (each person classified separately, then results combined across subjects to get the group-level results).

The classifications were of three main types, each of which was carried out both in ROIs and searchlights:
  1. Rule classification, using either the chosen context only, the specified context only, or both (question #1). "Representation of chosen and specified rules: multivariate analysis" results section.
  2. Rule cross-classification, in which classifiers were trained to distinguish rules with trials from one context, then tested with trials from the other context. "Generalization and consistency of rule representation across contexts" results section.
  3. Stage cross-classification, in which classifiers were trained to distinguish rules using the "maintenance stage" and tested with images from the "execution stage", each context separately. This is classifying across parts of each trial, but it is not clear to me which images were classified for this analysis (question #2 below). "Invariant rule representation during maintenance and execution stages" results section.

ROI creation

NOTE: see update to this section here.

I was intrigued by how they defined the ROIs (other than BA17, which was from an anatomical atlas). Here's my guess at what they did; I'm not sure I figured out all the steps.
  1. Do a mass-univariate analysis, finding voxels with greater activity during the chosen than the specified context. ("Establishing and maintaining the neural representations of rules under different contexts: univariate analysis" section)
  2. Threshold the group map from #1 at p < 0.001 uncorrected, cluster size 35 voxels. 9 blobs pass this threshold, listed in Table 1 and shown in Figure 3A. (I think; two different thresholds seem to be used when describing the ROIs.)
  3. Use Freesurfer to make cortical surface maps of the group ROIs, then transform back into each person's 3d space. (I think; is this to eliminate non-grey-matter voxels? Why back and forth from cortical-surface to volumetric space?)
  4. Feature selection: pick the 120 voxels from each person, ROI, and training set with the largest "difference between conditions" from a one-way ANOVA. It's not clear what is meant by "conditions" in this paragraph; context, rule, both?
  5. Normalize each example across the 120 voxels (for each person, ROI, and training set).
I could find no mention of the stability of these final 120 voxels, nor of what proportion of starting ROI voxels 120 is, nor of why these particular thresholds were chosen (and if any others were tried).

It is proper to do the feature selection on the training set only, but interpretability might be affected if stability is low. In this case, they did leave-one-run-out cross-validation with eight runs, making eight versions of each ROI for each person. If these eight versions are nearly identical, it makes sense to consider them the same ROI. But if 120 voxels is something like 1/10th of the starting voxels, and a different tenth is chosen in each of the eight versions, shouldn't we think of them as eight different ROIs? Unfortunately, stability is not addressed in the manuscript, as far as I can tell.

a few other comments

  • I liked that they discussed both searchlight and ROI-based results (I've argued that searchlight analyses can be more powerful combined with ROI-type analyses), but found the combination here quite difficult to interpret. I would have liked to have seen direct comparisons between the two analyses; how much do the searchlight 'blobs' (such as in Figure 5B) overlap with the 9 ROIs? Each section of the results had a phrase along the lines of "The searchlight analysis largely confirmed the ROI results (Fig. 5A), with significant rule coding in the PMv, ..." I don't see how they quantified these claims of similarity, however.
  • Significance testing for the ROI-based analyses was with permutation testing, following what I'd call a fold-wise scheme with both training and testing sets relabeled. How the group-level distribution was calculated is not described.
  • Significance testing for the searchlight analyses was by spatially normalizing then smoothing each individual's accuracy map, then doing a voxel-wise t-test for accuracy greater than chance.


  1. in the "Representation of chosen and specified rules: multivariate analysis" section it says that one analysis was after "... pooling data from the chosen and specified conditions ...". Were separate PEIs made for this analysis (i.e. fitting all trials of each rule, regardless of context), or were PEIs made for the specified and chosen trials separately, then both used for this analysis (so that there were twice as many examples in the pooled analysis as either context alone)?
  2. Were new PEIs made for the cross-stage classification, were individual volumes classified, or something else? "Therefore, we constructed fMRI patterns of rule execution from activations at the onset of the RDK stimuli, and examined whether classifiers trained on patterns from maintenance stage could discriminate patterns from execution stage."
  3. Are my guesses for the steps of the ROI creation technique correct? Why were these thresholds chosen? Any other techniques for ROI creation tried? How stable were the ROIs after feature selection?

Update (22 August 2013): Jiaxiang Zhang answered my questions and corrected a few parts of this post here.

Zhang J, Kriegeskorte N, Carlin JD, & Rowe JB (2013). Choosing the rules: distinct and overlapping frontoparietal representations of task rules for perceptual decisions. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33 (29), 11852-62 PMID: 23864675

FYI: 1st International Workshop on High Dimensional Data Mining (HDM 2013)

The call for papers for the 1st International Workshop on High Dimensional Data Mining (HDM 2013) ends on 17 August 2013. Not directly about MVPA, but we certainly should be interested in how to work with high dimensional data!

Tuesday, July 23, 2013

PRoNTo (Pattern Recognition for Neuroimaging Toolbox) 1.1 released

I haven't yet tried running PRoNTo, but have heard some positive things. Definitely worth a look if you're looking for MVPA software, particularly if you're using SPM and MATLAB.

Sunday, July 7, 2013

"jumping stars": a side effect of low power

In brief, statistical power refers to the long-run frequency of rejecting H0 and accepting H1 (if the experiment was repeated), when the truth is that the effect exists - the null hypothesis should be rejected. Unfortunately, psychological and neuroimaging studies tend to be underpowered. There are plenty of practical and historical reasons for this, see the references listed below for readable introductions.

In this post I'll demonstrate one side effect of low-powered studies: reproducing results can be very difficult, even when the effects actually exist. I sometimes refer to this effect as "jumping stars", for reasons that will become apparent. This example is adapted from one presented in Maxwell (2004).


Suppose that we want to test if depression level is related to “self-perceived competence” in academic, appearance, athletic, behavioral, and social domains. To answer this question, we measure the depression level and those five domains of self-perceived competence in 100 people.

Once we have the data, we do a multiple linear regression to see which competence areas are most related to the depression score. (The variables are scaled in such a way that positive regression coefficients imply that higher levels of perceived competence correspond to lower levels of depression.)

For concreteness, here's the data for the first few (pretend) people:

Now we do the linear regression, and get the following:
What can we conclude? Well, it looks like higher levels of athletic and social perceived competence are associated with lower levels of depression. These p-values are quite good, surviving even a Bonferroni correction for multiple comparisons.

Now, someone decides to replicate our study, and measures depression and self-perceived competence in 100 different people. They got this result from the linear regression:

And another study's results:

Oh no! Things are not replicating very well .... to summarize:

What happened? As you probably guessed from my introduction about power, these three studies are not "bad" or finding non-existent effects, but rather are underpowered.

These three datasets were created when I ran the same simulation (R code here) three times. The simulation creates an underpowered study in which the factors are equally important: all five measures are correlated at 0.3 with the depression score.

Thus, all factors should be found equally important. But we don’t have enough power to find them all, so we only find a few in each individual study. The results of each study are thus not exactly incorrect, but rather incomplete, and unstable: which effects were detected in each varied randomly.

And that is why I refer to this effect as "jumping stars": which row of the results table gets the stars (is found significant) varies as it is repeated; the significance stars seem to jump around unpredictably.


There is no easy fix for the jumping stars problem, since it's due to the hard-to-solve problem of low statistical power. And it can be quite difficult to distinguish this situation - true effects but low power - from a situation in which replication fails because the effects are nonexistent (due to random chance).

As Maxwell (2004) explains (and you really should read it if you haven't), jumping stars cause much more problems when many analyses are performed, since, when the truth is that the effects are real, the likelihood that a low-powered study will find one of the effects significant is much, much larger than the likelihood that the study will find all of the effects significant (which is actually the truth).

All of this can add up to a messy, unstable, contradictory literature in which replication is very difficult - even if the effects actually are present!

But I am not counseling despair, rather awareness and caution. We must be very, very cautious in interpreting negative results; aware that many true effects could have been missed. And we should keep power in mind when designing experiments and analyses: more is generally better.

some references

Maxwell, S.E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163.

Sedlmeier & Gigerenzer, 1989. Do Studies of Statistical Power have an Effect on the Power of Studies?

Yarkoni, T. (2009). Big Correlations in Little Studies: Inflated Correlations Reflect Low Statistical Power. Perspectives on Psychological Science, 4, 294 - 298.

Friday, June 14, 2013

"ironic" comments on cross-validation

There is an interesting set of responses to "Ten ironic rules for non-statistical reviewers" (Friston, 2012) in press at NeuroImage. Many interesting issues are raised, but I want to touch on the discussion of cross-validation in the reply Sample size and the fallacies of classical inference (Friston, 2013).

from Friston, 2013:
"Is cross validation useful?
[...] I think that the role of cross validation in neuroimaging deserves further discussion. As articulated clearly by Lindquist et al., the goals of inference and cross validation are very distinct. Cross validation is generally used to validate a model (or its parameters) that predicts or classifies using new observations. Technically, this is usually cast in terms of a posterior predictive probability density. Cross validation is used to validate this posterior predictive density for diagnostic or other purposes (such as out-of-sample estimation or variable selection). However, these purposes do not include model comparison or hypothesis testing. In other words, cross validation is not designed for testing competing hypotheses or comparing different models. This would be no problem, were it not for the fact that cross validation is used to test hypotheses in brain mapping. For example, do the voxels in my hippocampal volume of interest encode the novelty of a particular stimulus? To answer this question one has to convert the cross validation scheme into a hypothesis testing scheme - generally by testing the point null hypothesis that the classification accuracy is at chance levels. It is this particular application that is suboptimal. The proof is straightforward: if a test of classification accuracy gives a different p-value from the standard log likelihood ratio test then it is – by the Neyman–Pearson Lemma – suboptimal. In short, a significant classification accuracy based upon cross validation is not an appropriate proxy for hypothesis testing. It is in this (restricted) sense that the Neyman–Pearson Lemma comes into play.

I have mixed feelings about cross validation, particularly in the setting of multivariate pattern classification procedures. On the one hand, these procedures speak to the important issue of multivariate characterisation of functional brain architectures. On the other hand, their application to hypothesis testing and model selection could be viewed as a non-rigorous and slightly lamentable development."
I fully agree that "cross validation is not designed for testing competing hypotheses or comparing different models", but do not see how that is a problem for (typical) MVPA hypothesis testing.

This image is my illustration of a common way cross-validation is used in MVPA to generate a classification accuracy (see this post for an explanation). As Friston says in the extract, we often want to test whether the classification accuracy is at chance. My preferred approach in these sorts of cases is usually dataset-wise permutation testing, in which the task labels are changed, then the classification (and averaging the accuracies obtained on the cross-validation folds) is repeated, resulting in a null distribution for mean accuracy to which the true mean accuracy can be compared (and a significance level obtained, if desired).

In this scenario the cross-validation does not play a direct role in the hypothesis testing: it is one step of the procedure that led from the data to the statistic we're hypothesizing about. The classification accuracy is not significant "based upon cross validation", but rather significant based on the permutation testing; cross-validation is one stage of the procedure that led to the classification accuracy.

ResearchBlogging.org Friston, K. (2013). Sample size and the fallacies of classical inference NeuroImage DOI: 10.1016/j.neuroimage.2013.02.057

Tuesday, June 11, 2013

indexing woes: spm and afni

For "fun" I decided to add spm and afni to my set of voxel indexes started with MRIcroN, R, and pyMVPA.

SPM8 follows the 1-based style of MRIcroN and R:

while afni follows the 0-based style of pyMVPA:

So, it's probably not fair to think of either the one or zero-based styles as "correct", but this needs to be kept in mind when referring to voxel by i,j,k coordinates programmatically.

I don't use fsl or brain voyager; send me how these (or other) programs show the voxel coordinates and I'll add them to the collection.

Update 9 August 2013: fsl and caret are both 0-based

MVPA Permutation Schemes: Permutation Testing in the Land of Cross-Validation

Next week I will be attending PRNI 2013, presenting "MVPA Permutation Schemes: Permutation Testing in the Land of Cross-Validation". The papers will be published, but are not online yet. I'll add a link once one's online, but let me know if you'd like a copy now and I'll send you a pdf. The little demo accompanying the paper is already online, here (it's a zip, containing a pdf explanation plus knitr R code and files). I'd like to expand the paper to include more examples and firmer advice, but the little paper is a (hopefully) a clear overview of the issues.

The paper describes a few ways in which permutation testing is commonly done in MVPA, focusing on how the cross-validation folds (aka data partitions) are accounted for. If you're not familiar with why cross-validation is relevant for MVPA, I think this is a pretty readable statistically-oriented introduction, you could try a paper I wrote a few years ago (Etzel, J.A., Gazzola, V., Keysers, C., 2009. An introduction to anatomical ROI-based fMRI classification analysis. Brain Research 1282, 114-125.), or just google.

two permutation schemes

One goal of the paper is to point out that "we did a permutation test" is not a sufficient description for MVPA, since there are many reasonable ways to set up permutation tests. We use the terms "dataset-wise" and "fold-wise" to describe two common schemes. Since these terms aren't standard, we illustrate the two schemes with a running example.

introduction: running example

The running example is based on what I think is pretty much the simplest situation possible for task-based classification analysis for a single subject:

"This person completed three runs of fMRI scanning, each of which contained three blocks each of two different tasks. These task blocks were presented with sufficient rest intervals to allow the task-related BOLD signal to return to baseline, making it reasonable to assume that the task labels can be permuted [2, 3]. We further assume that the image preprocessing (motion correction, etc.) was adequate to remove most linear trends and uninteresting signals. Temporal compression [4] was performed, so that each task block is represented in the final dataset as a single labeled vector of voxel values (Fig. 1). There are n entries in each vector, corresponding to the voxels falling within an anatomically-defined region of interest (ROI). We assume that n is small enough (e.g. 100) that further feature selection is not necessary.

We wish to use a classification algorithm (e.g. linear support vector machines) to distinguish the two tasks, using all n voxels listed in the dataset. For simplicity, we will partition the data on the runs (three-fold CV): leave out one run, train on the two remaining runs, and repeat, leaving out each run in turn. The three test set accuracies are then averaged to obtain the overall classification accuracy (Fig. 2), which, if greater than chance, we interpret as indicating that the voxels’ BOLD varied with task."

I carry this way of illustrating cross-validation and classification through the later figures. The white-on-black color indicates that these examples have the true task labels: the numbers in the circles (which are 'seen' by the classifier) match those of the third (task) column in the dataset.

permutation schemes

Now, the permutation testing. We need to put new task labels on, but where? There are 20 ways of reordering the task labels; shown in the figures as colored circles on a light-grey background.

Under the dataset-wise scheme we put the new task labels on before the cross-validation, carrying the new labels through the cross-validation folds. Figure 4 shows how this works when both the training and testing sets are relabeled, while Figure 5 shows how it works when only the training sets are relabeled.

Note that the dataset's structure is maintained under the dataset-wise permutation scheme when both training and testing sets are relabeled (Figure 4 has the same pattern of arrows as Figure 2). Some of the arrows are shared between Figure 5 and Figure 2, but the property (in the real data) that each labeling is used in a test set is lost.

Under the fold-wise permutation scheme we put the new task labels on during the cross-validation. Figure 6 shows this for relabeling the training data only, as suggested in the pyMVPA documentation. Figure 6 has a similar structure to Figure 5, but the coloring is different: under the dataset-wise scheme a run is given the same set of permuted labels in all training sets in a permutation, while under the fold-wise scheme each run gets a different set of permuted labels (i.e. in Figure 5 run 1 is purple in both the first and second cross-validation folds, while in Figure 6 run 1 is purple in the first and red in the second).

does the permutation scheme matter?

Yes, at least sometimes. We are often dealing with such small, messy datasets in MVPA that shifting the rank by just a few values can really matter. The simulations in the little demo show a few (tweakable) cases.

Here are the first two repetitions of the demo (this is in a more readable format in the pdf in the demo). In the first, the true-labeled accuracy was 0.64 (vertical line), the p-value for both.dset (permuting training and testing labels dataset-wise, ala figure 4) was 0.013, train.dset (dataset-wise, permuting training only, ala figure 5) was 0.002; both.fold (permuting both, fold-wise) was 0.001, and train.fold (fold-wise permuting training only, ala figure 6) was 0.005. On repetition 2, the p-values were: both.dset=0.061, train.dset=0.036, both.fold=0.032, and train.fold=0.028. That gives us p-values above and below the 'magical' value of 0.05, depending on how we do the permutation test.

final thoughts

Which permutation scheme should we use for MVPA? Well, I don't know a universally-applicable answer. As we know, how simulated datasets are created can really, really matter, and I certainly don't claim this little demo is representative of true fMRI data. That said, the pattern in null distributions above - larger variance on dataset-wise than fold-wise schemes (and so a more stringent test) - is common in my experience, and unsurprising. It seems clear that more of the dataset structure is kept under dataset-wise schemes, which is consistent with the goal of matching the permutation test as closely as possible to the true data analysis.

My feeling is that dataset-wise permutation schemes, particularly permuting both the training and testing sets (Fig. 4) is the most rigorous test for a dataset like the little running example here. Dataset-wise permuting of either just the training or just the testing set labels may be preferable in some cases, such as 'cross' classification, when the training and testing datasets are not runs but rather independent datasets (e.g. different experimental conditions or acquisition days).

I don't think that fold-wise schemes should be recommended for general use (datasets like the running example), since some of the dataset structure (similarity of training data across cross-validation folds) present in the true data is lost.

UPDATE (20 June 2013): Here is a link to a copy of my poster, plus expanded simulation code.

UPDATE (22 July 2013): Here is a link to a copy of the paper, as well.

UPDATE (6 November 2013): The paper is now up at IEEE and has the DOI 10.1109/PRNI.2013.44.

Wednesday, June 5, 2013

indexing woes: a story of pyMVPA, R, and mricron

In the hope that someone else will avoid the headaches we've experienced, here is a brief note about indexing in MRIcroN, R (specifically, oro.nifti), and pyMVPA. The python code - and finding the source of the discrepancy - is thanks to Ben Acland.

First, MRIcroN and R. In the image above I show the fake brain NIfTI file I've used before opened in MRIcroN. In both, the value of the voxel at coordinates 18, 16, 21 is 13.

Now, pyMVPA. Here's the code (thanks, Ben!), with a screenshot from running it within NeuroDebian:
from mvpa2.suite import *
ds = fmri_dataset('/home/brain/host/fakeBrain.nii.gz')
def ijkIdx(dataset, i, j, k):
  return np.where((dataset.fa['voxel_indices'][:,0] == i)
    & (dataset.fa['voxel_indices'][:,1] == j)
    & (dataset.fa['voxel_indices'][:,2] == k))[0][0]
ds.samples[0,ijkIdx(ds, 18,16,21)]
ds.samples[0,ijkIdx(ds, 17,15,20)]

Notice the lines marked in green: the voxel at 18, 16, 21 does not have the value 13, but 25: the voxel at 17, 15, 20 has the value 13 (orange). Python is zero-based and R is one-based, which may be the source of this discrepancy.

Ben wrote the little function ijkIdx to return a voxel value by i,j,k coordinates since pyMVPA stores the values internally as vectors; see the pyMVPA tutorial for more details.

Update 11 June 2013: spm is also 1-based while afni is 0-based
Update 9 August 2013: fsl and caret are both 0-based.

Wednesday, May 29, 2013

nice methods!

I don't have a long commentary to write about Multi-voxel coding of stimuli, rules, and responses in human frontoparietal cortex except to say that reading the methods (and the rest of the paper) was very pleasant: the software, algorithms, preprocessing, etc. are all clearly and thoroughly described, without becoming bogged down. Next time I'm asked for an example of how to write up an MVPA study, this is the paper I'll recommend.

ResearchBlogging.orgWoolgar A, Thompson R, Bor D, & Duncan J (2011). Multi-voxel coding of stimuli, rules, and responses in human frontoparietal cortex. NeuroImage, 56 (2), 744-52 PMID: 20406690

Friday, May 24, 2013

Schapiro 2013: response from Anna Schapiro

Anna Schapiro, first author of "Neural representations of events arise from temporal community structure", kindly provided additional details (even a diagram!), addressing the questions I asked about her research in my previous post. She gave me permission to share extracts from her reply here, which we hope will be useful to others as well.

She said that my descriptions and guesses were basically accurate, including that they used the searchmight toolbox to perform the searchlight analysis, and so used cubical searchlights by default.

My first question on the previous post was, "Were the item pairs matched for time as well as number of steps?" Anna replied that:
"We did at one point try to balance time as well as the number of steps between items. So we were averaging correlations between items that were both the same number of steps and the same amount of time apart. But I found that some of the time/step bins had very few items and that that made the estimates significantly more noisy. So I opted for the simpler step balance approach. Although it doesn't perfectly address the time issue, it also doesn't introduce any bias, so we thought that was a reasonable way to go."

about the node pairs

My second question was about how many correlation differences went into each average, or, more generally, which within-cluster and between-cluster pairs were correlated.

"Regarding choices of pairs, I think the confusion is that we chose one Hamiltonian path for each subject and used the forwards and backwards versions of that path throughout the scan (see the beginning of Exp 2 for an explanation of why we did this). Let's assume that that path is the outermost path through the graph, as you labeled in your post [top figure]. Then the attached figure [second figure] shows the within and between cluster comparisons that we used, with nodes indexed by rows and columns, and color representing the distance on the graph between the two nodes being compared. We performed the correlations for all node pairs that have the same color (i.e., were the same distance away on the graph) and averaged those values before averaging all the within or between cluster correlations."

Here's the node-numbered version of Figure 1 I posted previously, followed by the figure Anna created to show which node pairs would be included for this graph (the images should enlarge if you click on them).

Concretely, then, if the Hamiltonian path for a person was that shown in Figure 1 (around the outside of the graph, as the nodes are numbered), correlations would be calculated for the node pairs marked in the second figure on each of the approximately 20 path traversals. This is about 60 correlations for each path traversal, sorted into 30 within-cluster comparisons and 30 between-cluster comparisons ("about" 60 since some pairs on each traversal might be omitted depending on where the path started/ended).

The 30 pairs (correlations) for the within-cluster comparisons would be collapsed into a single number by averaging in two steps. First, the correlations of the same length would be averaged (e.g. three for length-4-within-cluster: 11-15, 6-10, 1-5), giving four averages (one for length-1 pairs, one for length-2 pairs, one for length-3 pairs, and one for length-4 pairs). Second, these four averages would be averaged, giving a single average for within- and between-cluster comparisons. This two-step averaging somewhat reduces the influence of path length imbalance (averaging all pairs together would include 12 length-1 pairs in the within-cluster comparison but only 3 length-1 pairs in the between-cluster comparison), though may not eliminate it completely. I wonder if picking three pairs of each length to average (i.e. all length-4 within-cluster but only a third of the length-1 within-cluster) would change the outcome?

group-level analysis

My third question was about how exactly the group analysis was performed. Anna replied that

"yes, we used a one-sample t-test on our within-between statistic values. In this scenario, randomise permutes the sign of the volumes for each subject. On each permutation, the entire volume for a particular subject is left alone or multiplied by -1. Then it looks for clusters of a certain size in this nonsense dataset. In the end it reports clusters in the true dataset are significantly more likely to be found than in these shuffled versions. I like using randomise because it preserves the spatial smoothness in every region of the brain in every subject. Searchlights may create smoothness that have a different character than other analyses, but we don't have to worry about it, since that smoothness is preserved in the null distribution in this permutation test."

I then asked, "Do you mean that it [randomise] permutes the sign of the differences for each person? So it is changing signs on the difference maps, not changing labels on the (processed) BOLD data then recalculating the correlations?", to which she replied that, "I feed randomise within-between maps, so yes, it's permuting the sign of the differences."

Thank you again, Anna Schapiro, for the helpful and detailed replies, and for allowing me to share our correspondence here!