Wednesday, May 29, 2013

nice methods!

I don't have a long commentary to write about Multi-voxel coding of stimuli, rules, and responses in human frontoparietal cortex except to say that reading the methods (and the rest of the paper) was very pleasant: the software, algorithms, preprocessing, etc. are all clearly and thoroughly described, without becoming bogged down. Next time I'm asked for an example of how to write up an MVPA study, this is the paper I'll recommend.

Woolgar A, Thompson R, Bor D, & Duncan J (2011). Multi-voxel coding of stimuli, rules, and responses in human frontoparietal cortex. NeuroImage, 56 (2), 744-52. PMID: 20406690

Friday, May 24, 2013

Schapiro 2013: response from Anna Schapiro

Anna Schapiro, first author of "Neural representations of events arise from temporal community structure", kindly provided additional details (even a diagram!), addressing the questions I asked about her research in my previous post. She gave me permission to share extracts from her reply here, which we hope will be useful to others as well.

She said that my descriptions and guesses were basically accurate, including that they used the searchmight toolbox to perform the searchlight analysis, and so used cubical searchlights by default.

My first question on the previous post was, "Were the item pairs matched for time as well as number of steps?" Anna replied that:
"We did at one point try to balance time as well as the number of steps between items. So we were averaging correlations between items that were both the same number of steps and the same amount of time apart. But I found that some of the time/step bins had very few items and that that made the estimates significantly more noisy. So I opted for the simpler step balance approach. Although it doesn't perfectly address the time issue, it also doesn't introduce any bias, so we thought that was a reasonable way to go."

about the node pairs

My second question was about how many correlation differences went into each average, or, more generally, which within-cluster and between-cluster pairs were correlated.

"Regarding choices of pairs, I think the confusion is that we chose one Hamiltonian path for each subject and used the forwards and backwards versions of that path throughout the scan (see the beginning of Exp 2 for an explanation of why we did this). Let's assume that that path is the outermost path through the graph, as you labeled in your post [top figure]. Then the attached figure [second figure] shows the within and between cluster comparisons that we used, with nodes indexed by rows and columns, and color representing the distance on the graph between the two nodes being compared. We performed the correlations for all node pairs that have the same color (i.e., were the same distance away on the graph) and averaged those values before averaging all the within or between cluster correlations."

Here's the node-numbered version of Figure 1 I posted previously, followed by the figure Anna created to show which node pairs would be included for this graph.

Concretely, then, if the Hamiltonian path for a person was that shown in Figure 1 (around the outside of the graph, as the nodes are numbered), correlations would be calculated for the node pairs marked in the second figure on each of the approximately 20 path traversals. This is about 60 correlations for each path traversal, sorted into 30 within-cluster comparisons and 30 between-cluster comparisons ("about" 60 since some pairs on each traversal might be omitted depending on where the path started/ended).

The 30 pairs (correlations) for the within-cluster comparisons would be collapsed into a single number by averaging in two steps. First, the correlations of the same length would be averaged (e.g. three for length-4-within-cluster: 11-15, 6-10, 1-5), giving four averages (one each for the length-1, length-2, length-3, and length-4 pairs). Second, these four averages would be averaged, giving a single within-cluster value; the 30 between-cluster pairs would be collapsed in the same way. This two-step averaging somewhat reduces the influence of path length imbalance (averaging all pairs together would include 12 length-1 pairs in the within-cluster comparison but only 3 length-1 pairs in the between-cluster comparison), though it may not eliminate it completely. I wonder if picking three pairs of each length to average (i.e. all length-4 within-cluster but only a quarter of the length-1 within-cluster) would change the outcome?
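To make the pair counts and the two-step averaging concrete, here is a minimal Python sketch. It assumes the Hamiltonian path is the outer one (nodes 1 to 15 in order, wrapping from 15 back to 1) and that pairs up to four steps apart are compared; the function names are my own, not from the paper or from Anna's scripts.

```python
from collections import defaultdict
from statistics import mean

N_NODES = 15    # nodes numbered 1..15 around the outer path
MAX_STEPS = 4   # longest path distance compared


def cluster_of(node):
    """Cluster index (0, 1, or 2) for nodes 1-5, 6-10, 11-15."""
    return (node - 1) // 5


def pairs_by_distance():
    """Sort node pairs into within/between cluster groups, keyed by steps apart."""
    groups = {"within": defaultdict(list), "between": defaultdict(list)}
    for start in range(1, N_NODES + 1):
        for steps in range(1, MAX_STEPS + 1):
            end = (start - 1 + steps) % N_NODES + 1  # wrap 15 -> 1
            same = cluster_of(start) == cluster_of(end)
            groups["within" if same else "between"][steps].append((start, end))
    return groups


def two_step_average(values_by_distance):
    """Average within each path length first, then average those averages."""
    return mean(mean(values) for values in values_by_distance.values())


groups = pairs_by_distance()
counts = {kind: {steps: len(pairs) for steps, pairs in sorted(by_steps.items())}
          for kind, by_steps in groups.items()}
print(counts)
```

Under these assumptions the counts come out as 12, 9, 6, and 3 within-cluster pairs for lengths 1 to 4, and 3, 6, 9, and 12 between-cluster pairs, which is the imbalance that the two-step averaging is meant to soften. `two_step_average` would be called on a dictionary of correlations keyed by path length, once for the within-cluster pairs and once for the between-cluster pairs.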

group-level analysis

My third question was about how exactly the group analysis was performed. Anna replied that

"yes, we used a one-sample t-test on our within-between statistic values. In this scenario, randomise permutes the sign of the volumes for each subject. On each permutation, the entire volume for a particular subject is left alone or multiplied by -1. Then it looks for clusters of a certain size in this nonsense dataset. In the end it reports clusters in the true dataset are significantly more likely to be found than in these shuffled versions. I like using randomise because it preserves the spatial smoothness in every region of the brain in every subject. Searchlights may create smoothness that have a different character than other analyses, but we don't have to worry about it, since that smoothness is preserved in the null distribution in this permutation test."

I then asked, "Do you mean that it [randomise] permutes the sign of the differences for each person? So it is changing signs on the difference maps, not changing labels on the (processed) BOLD data then recalculating the correlations?", to which she replied that, "I feed randomise within-between maps, so yes, it's permuting the sign of the differences."
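The sign-flipping idea can be sketched in a few lines of Python. This is not randomise itself: for brevity it uses a one-sided voxel-wise maximum-statistic correction instead of the cluster-based inference Anna describes, and the function name, array layout, and simulated data are all my own assumptions.

```python
import numpy as np


def sign_flip_test(diff_maps, n_perm=1000, seed=0):
    """One-sample sign-flipping permutation test with max-statistic correction.

    diff_maps: array of shape (n_subjects, n_voxels) holding each subject's
    within-minus-between values. Returns observed t-values and corrected p-values.
    """
    rng = np.random.default_rng(seed)
    n_sub = diff_maps.shape[0]

    def t_stat(x):
        return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n_sub))

    observed = t_stat(diff_maps)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        # Keep or flip the sign of each subject's entire map, so the spatial
        # structure within every subject is left untouched.
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        max_null[i] = t_stat(diff_maps * signs).max()
    # Corrected p-value: share of permutations whose maximum t is at least
    # as large as the observed t at that voxel.
    exceed = (max_null[None, :] >= observed[:, None]).sum(axis=1)
    return observed, (1 + exceed) / (n_perm + 1)


# Simulated example: 20 subjects, 50 voxels, a true effect only in voxel 0.
rng = np.random.default_rng(1)
maps = rng.normal(0.0, 1.0, size=(20, 50))
maps[:, 0] += 2.0
t_obs, p_corr = sign_flip_test(maps)
```

The key point matches the description above: the labels on the BOLD data are never touched, only the sign of each subject's finished difference map.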

Thank you again, Anna Schapiro, for the helpful and detailed replies, and for allowing me to share our correspondence here!

Friday, May 10, 2013

SA:PPP in one sentence

A one-sentence summary: If your interpretation involves more than searchlights, you need more than a searchlight analysis.

searchlight analysis interpretation: worst case scenario

I like searchlight analysis. It avoids some curse-of-dimensionality headaches, covers the whole brain, makes pretty pictures (brain blob maps!), and can be easier for people accustomed to mass-univariate analyses.

But if I like searchlight analysis, why write a paper about it with "pitfalls" in the title? Well, because things can go badly wrong. I do not at all want to imply that searchlight analysis should be abandoned! Instead, I think that searchlight analyses need to be interpreted cautiously; some common interpretations do not always hold. Am I just picking nits, or does this actually matter in applications?

Here's an example of how searchlight analysis interpretation often is written up in application papers.

Suppose that the image at left is a slice of a group-level searchlight analysis results map for some task, showing the voxels surviving significance testing. These voxels form two large clusters, one posterior in region Y and the other a bit more lateral in region X (I'll just call them X and Y because I'm not great at anatomy and it really doesn't matter for the example - this isn't actually even searchlight data). We write up a paper, describing how we found significant clusters in the left X and Y which could correctly classify our task. Our interpretation is focused on possible contributions of areas X and Y to our task, drawing parallels to other studies talking about X and Y.

This is the sort of interpretation that I think is not supported by the searchlight analysis alone, and should not be made without additional lines of evidence.

Why? Because, while the X and Y regions could be informative for our task, the searchlight analysis itself does not demonstrate that these are more informative than other regions, or even that the voxel clusters themselves are informative (as implied in the interpretation). It is possible that the voxel clusters found significant in the searchlight analysis may not actually be informative outside the context of the particular searchlight analysis we ran (i.e. dependent upon our choice of classifier, distance metric, and group-level statistics).

The problem is the way my hypothetical interpretation shifts from the searchlight analysis itself to regions and clusters: The analysis found voxels which are at the center of significant searchlights, but the interpretation is about the voxels and regions, not the searchlights. Unfortunately, it is not guaranteed that if information is significantly present at one spatial scale (the searchlights) it will be present at smaller ones (the voxels) or larger (the regions).
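A toy simulation (not taken from the paper; all numbers are made up) shows how information can be present in a group of voxels without being detectable in any one of them. Two simulated voxels share a large noise component and carry a small class signal of opposite sign:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200
labels = np.repeat([-1, 1], n_trials // 2)

# Large noise shared by both voxels, plus a small opposite-signed class signal.
shared = rng.normal(0.0, 5.0, size=n_trials)
voxel_a = shared + 0.5 * labels
voxel_b = shared - 0.5 * labels


def accuracy(feature):
    """Accuracy of the fixed rule 'predict class +1 when the feature is positive'."""
    return float(np.mean(np.sign(feature) == labels))


acc_a = accuracy(voxel_a)               # voxel a alone
acc_b = accuracy(-voxel_b)              # voxel b alone (its signal is negated)
acc_pair = accuracy(voxel_a - voxel_b)  # both voxels: shared noise cancels
print(acc_a, acc_b, acc_pair)
```

Each voxel alone should classify only slightly better than chance, while the difference of the two classifies every trial correctly because the shared noise cancels. A "searchlight" containing both voxels is informative even though neither voxel is informative in isolation.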

Back to the hypothetical paper, it is possible that classifying the task using only the voxels making up cluster X could fail (i.e. instead of a searchlight analysis we make a ROI from a significant cluster and classify the task with just those voxels). This is one of my worst-case scenarios: the interpretation that regions X and/or Y have task information is wrong.

But that's not the only direction in which the interpretation could be wrong: X and Y could be informative, but a whole lot of other regions could also be informative, even more informative than X and Y. This is again a problem of shifting the scale of the interpretation away from the searchlights themselves: our searchlight analysis did not show that X and Y are the most significant clusters. One way of picturing this is if we did another searchlight analysis, this time using searchlights with the same number of voxels as Y: we could end up with a very different map (the center of Y will be informative, but many other voxels could also be informative, perhaps more than Y itself).

These are complex claims, and this post includes none of the supporting details and demonstrations found in the paper. My goal here is rather to highlight the sort of searchlight analysis interpretations that the paper describes: the sorts of interpretations that are potentially problematic. But note the "potentially problematic", not "impossible"! The paper (and future posts) describe follow-up tests that can support interpretations like the one in my scenario, ways to show we're not in one of the worst cases.

Thursday, May 9, 2013

Schapiro 2013: "Neural representations of events arise from temporal community structure"

While not a methods-focused paper, this intriguing and well-written paper includes an interesting application of searchlight analysis which I'll explore a bit here. I'm only going to describe a bit of the searchlight-related analyses here; you really should take a look at the full paper.

First, though, they used cubical searchlights! I have an informal collection of searchlight shapes, and suspect that the authors used cubical searchlights from Francisco's legacy, though I couldn't find a mention of which software/scripts they used for the MVPA. (I don't mean to imply cubes are bad, just a less-common choice.)

a bit of background

Here's a little bit about the design relevant for the searchlight analysis; check the paper for the theoretical motivation and protocols. Briefly, the design is summarized in their Figure 1: Subjects watched long sequences of images (c). There were 15 images, not shown in random order, but rather in orders chosen by either random walks or Hamiltonian paths on the network in (a). I superimposed unique numbers on the nodes to make them easier to refer to later; my node "1" was not necessarily associated with image "1" (though it could have been).

Subjects didn't see the graph structure (a), just long (1,400 images) sequences of images (c). When each image appeared they indicated whether it was rotated from its 'proper' orientation. The experiment wasn't about the orientation, however, but rather about the sequences: would subjects learn the underlying community structure?

The searchlight analysis was not a classification analysis, but was instead rather similar to RSA (representational similarity analysis), though they didn't mention RSA. In their words,
"Thus, another way to test our prediction that items in the same community are represented more similarly is to examine whether the multivoxel response patterns evoked by each item come to be clustered by community. We examined these patterns over local searchlights throughout the entire brain, using Pearson correlation to determine whether activation patterns were more similar for pairs of items from the same community than for pairs from different communities."
Using Figure 1, the analysis is asking whether a green node (e.g. number 2) is more similar to other green nodes than to purple or orange nodes. It's not just a matter of taking all of the images and sorting them by node color, though - there are quite a few complications.

setting up the searchlight analysis

The fMRI session had 5 runs, each of which had 160 image presentations, during which the image orders alternated between random walks and Hamiltonian paths. They only wanted to include the Hamiltonian paths in the searchlight analysis (for theoretical reasons, see the paper), which I think would work out to around 5 eligible path-traversals per run (160/15 ≈ 10.7 traversals, about half of them Hamiltonian, so roughly 5); each node/image would have about 5 presentations per run. They didn't include images appearing at the beginning of a path-traversal, so I think there would be something less than 25 total possible presentations of each image to include in the analyses.

Hamiltonian paths in the graph mean that not all node orderings are possible: nodes of the same color will necessarily be visited sequentially (with the starting point's color potentially visited at the beginning and end of the path). For example, one path is to follow the nodes in the order of the numbers I gave them above: starting at 1 and ending at 15. Another path could be (1:5, 6,8,7,9,10, 11:15). But (1:5, 6,8,10,7,9, 11:15) is NOT possible - we'd have to go through 10 again to get out of the purple nodes, and Hamiltonian paths only visit each node once. Rephrased, once we reach one of the light-colored boundary nodes (1,5,6,10,11,15) we need to visit all the dark-colored nodes of that color before visiting the other boundary node of the same color.
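The path examples can be checked with a short Python sketch. The edge list is my reading of the graph in their Figure 1 with my node numbering (an assumption, not something taken from the authors' code): within each cluster every pair of nodes is linked except the two boundary nodes, and the clusters are joined by the boundary links 5-6, 10-11, and 15-1.

```python
from itertools import combinations


def build_graph():
    """Adjacency sets for the 15-node graph, numbered as in this post (assumed edges)."""
    adjacency = {node: set() for node in range(1, 16)}
    for first in (1, 6, 11):
        members = range(first, first + 5)
        boundary = {first, first + 4}
        for a, b in combinations(members, 2):
            if {a, b} != boundary:  # the two boundary nodes are not linked
                adjacency[a].add(b)
                adjacency[b].add(a)
    for a, b in ((5, 6), (10, 11), (15, 1)):  # links between clusters
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def is_hamiltonian_path(order, adjacency):
    """True if order visits every node exactly once, moving only along edges."""
    if sorted(order) != sorted(adjacency):
        return False
    return all(b in adjacency[a] for a, b in zip(order, order[1:]))


graph = build_graph()
outer = list(range(1, 16))
valid = [1, 2, 3, 4, 5, 6, 8, 7, 9, 10, 11, 12, 13, 14, 15]
invalid = [1, 2, 3, 4, 5, 6, 8, 10, 7, 9, 11, 12, 13, 14, 15]
print(is_hamiltonian_path(outer, graph),
      is_hamiltonian_path(valid, graph),
      is_hamiltonian_path(invalid, graph))
```

With these edges every node has four neighbors, the first two orderings pass, and the third fails because node 9 has no direct link to node 11.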

This linking of order and group makes the searchlight analysis more difficult: they only want to capture same-cluster/different-cluster similarity differences due to cluster, not that the different-cluster images appeared separated by more time than the same-cluster images (since fMRI volumes collected closer together in time will generally be more similar to each other than fMRI volumes collected farther apart in time). They tried to compensate by calculating similarities for pairs of images within each path that were separated by the same number of steps (but see question 1 below). 

For example, there are three possible step-length-1 pairs for node 1: 15-1-2; 15-1-3; 15-1-4. The dark-colored nodes (2,3,4; 7,8,9; 12,13,14) can't be the "center" for any step-length-1 pairs since it takes at least 2 steps to reach the next cluster. Every node could be the "center" for a step-length-2 pair, but there are many more valid pairings for the dark-colored nodes than the light-colored ones.

The authors say that "Across these steps, each item participated in exactly four within-cluster pair correlations and exactly four across-cluster pair correlations.", but it's not clear to me whether this count means one correlation of each step-length or if only four pairings went into each average. It seems like there would be many more possible pairings at each step-length than four.

Once the pairings for each person have been defined, calculating the statistic for each pairing on each searchlight would be relatively straightforward: get the three 27-voxel vectors corresponding to the item presentation, its same-cluster item presentation, and its different-cluster item presentation. Then, calculate the correlation between the item and its same-cluster and different-cluster items, Fisher-transform, and subtract. We'd then have a set of differences for each searchlight (one for each of the pairings), which are averaged and the average assigned to the center voxel.
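As a sketch of that per-searchlight statistic (my guess at the computation, with hypothetical function names and simulated data, not the authors' code), in Python:

```python
import numpy as np


def similarity_difference(item, same_cluster, other_cluster):
    """Fisher-transformed same-cluster correlation minus different-cluster correlation."""
    r_same = np.corrcoef(item, same_cluster)[0, 1]
    r_other = np.corrcoef(item, other_cluster)[0, 1]
    return np.arctanh(r_same) - np.arctanh(r_other)


def searchlight_value(triples):
    """Average the differences over all pairings; this goes to the center voxel."""
    return float(np.mean([similarity_difference(*triple) for triple in triples]))


# Simulated 27-voxel patterns for one pairing in one 3x3x3 searchlight.
rng = np.random.default_rng(0)
item = rng.normal(size=27)
same = item + 0.1 * rng.normal(size=27)   # resembles the item
other = rng.normal(size=27)               # unrelated pattern
value = searchlight_value([(item, same, other)])
print(value)
```

A positive value means the item's pattern was more similar to its same-cluster partner than to its different-cluster partner in that searchlight.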

I think this is an interesting analysis, and hopefully if you've read this far you found my description useful. I think my description is accurate, but had to guess in a few places, and still have a few questions:
  1. Were the item pairs matched for time as well as number of steps? In other words, with the interstimulus interval jittering, the first and third images of a two-step pair could be separated by as few as 3 seconds (1iti + 1img + 1iti) or as many as 11 seconds (5iti + 1img + 5iti).
  2. How many correlation differences went into each average? Were these counts equal for step-lengths or in every subject?
  3. How was the group analysis done? The authors describe using fsl's randomise on the difference maps; I guess a voxel-wise one-sample t-test, for difference != 0? What was permuted?
I'll gladly post any answers, comments, or clarifications I receive.

UPDATE, 24 May: Comments from Anna Schapiro are here.

Schapiro, A., Rogers, T., Cordova, N., Turk-Browne, N., & Botvinick, M. (2013). Neural representations of events arise from temporal community structure. Nature Neuroscience, 16 (4), 486-492. DOI: 10.1038/nn.3331

Monday, May 6, 2013

musings and meanderings on messes

I've been thinking lately about the "messiness" of science: what is the best course of action when the results do not tell a clear story but are a "mess" ... some results clearly supporting the hypothesis, others not, statistical tests not quite demonstrating the interpretation we want to make, etc.

In neuroimaging the practice is often to keep trying different analyses until "something sensible" turns up. Perhaps a different preprocessing strategy will make a 'blob' appear where we expected it to; perhaps using a conjunction analysis will 'get rid of' the strange activations appearing during this condition; perhaps changing the cross-validation scheme will eliminate the below-chance classification. Such practices are partly unavoidable: when we do not know the proper choice to make at each step of the analysis, it does not make sense to stop when the first set of guesses fails. But these practices are very dangerous: exploding the experimenter degrees of freedom can make it possible to call nearly anything significant, and scientific progress depends upon robust results.

In my experience, very clean results - those that support a simple story - tend to be written up quickly and sent to higher-impact journals. This practice is also understandable, and a good idea, if the results came out so cleanly because the experiment was so powerful and definitive. But the practice is highly dangerous if the results only appear to be clean, whether because the "messy" parts were not mentioned (e.g. omitting the 100 analyses that didn't show the effect), or from outright fraud.

Is science best served by fetishizing clean results to the point that some degree of plastic surgery is required for publication? Clearly not.

Diederik Stapel pointed to some of these forces in the fascinating New York Times Magazine article:

"Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that, he told me. He insisted that he loved social psychology but had been frustrated by the messiness of experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty — instead of the truth,” he said. He described his behavior as an addiction that drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high.
In his early years of research — when he supposedly collected real experimental data — Stapel wrote papers laying out complicated and messy relationships between multiple variables. He soon realized that journal editors preferred simplicity. “They are actually telling you: ‘Leave out this stuff. Make it simpler,’ ” Stapel told me. Before long, he was striving to write elegant articles."
I agree that real datasets, particularly in fields like psychology and neuroimaging, are very messy. But that does not mean we should give up, surgically altering our results to appear clean (or fabricating them). I think we should instead embrace unavoidable mess, expanding analysis techniques capable of locating true islands of stability in a sea of mess.

I'm rambling and waxing poetic this morning, but want to convey a few ideas:

First, if the true situation is likely "mess" we should not think "send to Science" when we see an exceptionally clean set of results, but rather "what went wrong?" or "should we believe this?" Extraordinary claims - and a clean result in a neuroimaging study often is extraordinary - require extraordinary evidence.

Second, we should be more tolerant of messy results, allowing a few loose ends or unexpected patterns in an otherwise-solid experiment. Including descriptions of the actual analysis (not just those parts that "turned out") should be encouraged, not penalized.

Third, we should aim for stability as well as significance in analysis results. In MVPA, this could be a degree of resistance to arbitrary analysis choices (e.g. "this ROI classifies well, this ROI does not" appearing over a range of reasonable cross-validation schemes and temporal compression methods). I trust a result far more if it appears consistently across analyses than if it only appears in one particular scheme, even if the p-value is more significant in that one scheme. We should perhaps even insist upon demonstrations of stability, particularly if claiming something occurs generally.

UPDATE 17 May 2013: Andrew Gelman has some interesting comments on how statisticians can help psychologists do their research better.

supplemental information for "Searchlight analysis: Promise, pitfalls, and potential"

My commentary on searchlight analysis, "Searchlight analysis: Promise, pitfalls, and potential", is now up at NeuroImage. All of the supplemental information is there as well, but renamed ... which is rather annoying, since I refer to the R scripts by file name. I'd asked them to fix it (and make the .R files downloadable as text instead of zipped), but evidently they couldn't.

So, here is a mapping of the names, and a link to download all the files (named correctly) at once, saving some headaches. (FYI: .R files are plain text (.txt), suffixed .R to indicate that they contain R code.)
  • "Supplementary material 1" is the supplementary text, downloaded as "mmc1.pdf".
  • "Supplementary material 2", which downloads as "mmc2.txt", should be dataset.txt, the input data for the example dataset used in many of the examples in the supplementary material.
  • "Supplementary material 3" downloads as a zip archive containing the single file "NIM10275-mmc3.R", which is example1.R
  • "Supplementary material 4" is dataset_exploration.R
  • "Supplementary material 5" is example3.R
  • "Supplementary material 6" is rareInformative.R
  • "Supplementary material 7" is searchlight_group.R
  • "Supplementary material 8" is searchlight.R
  • "Supplementary material 9" is sharedFunctions.R
I will post about some of the examples and suggestions in the paper; I added the new label SA:PPP to make it easier to find all the relevant posts. Feel free to submit questions or comments as well; anonymously or not as you prefer.