Wednesday, April 24, 2019

comparing fMRIPrep and HCP Pipelines: with version 1.3.2

We've been working on preparing to switch our DMCC preprocessing pipelines over to fMRIPrep. For the resting state component we decided to use the XCP system, which works well with fMRIPrep-preprocessed datasets ... but (as of April 2019) requires fMRIPrep version 1.3.2, not the version 1.1.7 we'd used in our preprocessing comparisons (since that was the current version when we began). A quick check showed that fMRIPrep version 1.3.2 and 1.1.7 produced similar - but not identical - images, so we reran the comparisons, using fMRIPrep version 1.3.2.

As we'd hope, the differences in the preprocessed images and GLM results between fMRIPrep version 1.1.7 and 1.3.2 are very small, much smaller than between either fMRIPrep version and the HCP pipeline, at both the single subject and group level. The conclusions in my previous summary post apply to the 1.3.2 results as well. Here I'll show some group results to parallel the previous summary; contact me if you'd like more of the results, more detail, or the underlying data.

First, the same 20 Schaefer parcels passed the threshold of having t > 1 (uncorrected) in all four tasks and preprocessing combinations when the fp132 (fMRIPrep version 1.3.2) results were substituted for the previous 1.1.7 results (compare with here and here):


This is not surprising, given the extremely similar GLM results for each person between the two fMRIPrep versions. Using these 20 parcels, I made some scatterplots to show the comparison in coefficients between the various preprocessing styles (the ones for HCP and fMRIPrep ("fp") are the same as in the previous summary, just with tasks overplotted; see that post for more explanation). Note that in these plots "fp132" is fMRIPrep version 1.3.2 and "fp" or "fMRIPrep" is version 1.1.7.



It is clear that the tightest correlations are between the two versions of fMRIPrep preprocessing, surface to surface and volume to volume ("fp132 vs. fMRIPrep"). The plots comparing "HCP vs. fMRIPrep" (1.1.7) and "HCP vs. fp132" are similar, as are the "Volume vs. Surface" plots within each fMRIPrep version.

I also set up mixed models as before, adding the fMRIPrep 1.3.2 results. The contrasts below are from a model including all four tasks, the 20 parcels, and the 13 test people, only estimating the effect of preprocessing combination on the parameter difference. I highlighted the interesting contrasts: red are volume vs. surface within each pipeline (surface better for both fMRIPrep versions); blue show that the two fMRIPrep versions were not different; green shows that the surface estimates were higher with either fMRIPrep than HCP, and volume estimates marginally higher.


So, we're still "full steam ahead" with fMRIPrep. I don't know which part of the fMRIPrep pipeline changed between 1.1.7 and 1.3.2 to make our results slightly different, but the differences don't change the interpretation of the comparisons.

Tuesday, January 15, 2019

RSA: how to describe with a single number? - update 2

This post is another entry in the occasional series about RSA matrix quantification; the last one described two common methods: one based on differences (mean subtraction; the "contrast" method) and the other on Kendall's tau-a. Another common method is to use Pearson correlation.

I've thought of correlation as a very different quantification metric than mean subtraction. However, Michael Freund, a graduate student in the CCP lab, pointed out that there are connections between them: if you normalize the RSA matrix in the right way, quantification by mean subtraction is equivalent to correlation. This post has examples to illustrate this equivalence, as well as how the two methods (mean subtraction without scaling and correlation-types) vary in what they measure. Code for the examples and figures is at the end of this post.

Here are the 10 example RSA matrices and the reference matrix used for quantification. In the reference matrix, we expect the grey cells (1, the "target" cells) to have higher correlation than the white (0).  The number in the cells of each of the 10 example RSA matrices (e.g., from different participants) are Pearson correlations. Color scaling ranges from dark blue (1) to dark red (-1), with white for 0.

And here are the quantification scores for each matrix, calculated by each of the four methods. First, notice that the ordering of the quantification scores for the 10 example matrices is the same for difference and correlation quantification after vnorm scaling (diff.vnorm and corr.vnorm), and the same as the no-scaling correlation quantification (corr). Calculating the differences without scaling (diff, black) gives a different ordering. This demonstrates the property that Michael Freund pointed out: the distinction between the difference and correlation quantification isn't the metric but whether the RSA matrix is vector-normalized before quantification (see code line 127).

So the quantification scores (and associated interpretation - which example matrix is "best"?) vary between diff and corr (along with diff.vnorm and corr.vnorm, but I'll just use corr as shorthand); but where do the differences come from?

The two methods agree that example 4 is worst (the lowest quantification score of these ten). This strikes me as reasonable: neither of the two "target" cells (with 1 in the reference matrix) are the cells with the highest correlation - example 4 doesn't match the reference at all.

More interesting are examples 1 and 9. diff considers 9 the best by far, with 1 tied for the almost-worst, while corr considers 9 the fourth-best and 1 similar at fifth-best. Looking at the example matrices, in both 9 and 1 the two target cells have higher correlation than all the other cells, but the range of values is much larger in 9 (the not-target cells have negative correlation) than 1 (where all cells have correlation between 0.705 and 0.873). This variance difference contributes strongly to the diff method (so the two matrices have very different quantification scores), but is "undone" by the vector normalization, so corr gives 1 and 9 similar quantification scores. Examples 2 and 7 also illustrate this property.

I'll also point out examples 1 and 2, which are given the same quantification score by the diff method but 2 is better than 1 with corr. Why? 1 and 2 are identical except for the two target cells, which have  different values in 1 but the same value in example 2 - the average (via Fisher's r-to-z). 1 and 2 are identical with the diff quantification because the same number results in when the target cells are averaged. Example 2 is much better than 1 with corr, however, because having the same number in the target cells is a better match to the reference matrix, in which the same number (1) is also in the target cells.

So, which to use? If you want the size of the correlations to matter (9 better than 1), you should use diff (i.e., difference method without scaling). If you want the best quantification scores to be when all of the target cells have the same correlation (2 better than 1), you should use corr (or either of the methods after scaling). But if you just want higher correlation in the target cells, without needing equality, you should use diff.

code below the fold

Friday, January 11, 2019

comparing fMRIPrep and HCP Pipelines: 20 parcels and summary

This post continues the previous; a summary for fMRIPrep 1.3.2 is here; start with the introduction to this series.

The previous post showed the t-values for each parcel with the different preprocessings, but it's still a lot to absorb, and not all parcels are equally interesting. I thought it'd be useful to concentrate on the parcels with the largest high > low cognitive control effects in all four tasks, so tried a simple threshold: which parcels have t > 1 (uncorrected) in all four of the tasks, and all four of the preprocessing combinations? The twenty parcels above pass this test, and their anatomic locations are quite sensible for high > low cognitive control. This parcel-picking strategy is somewhat arbitrary, but seems reasonably unbiased.


The coefficients for each participant were shown in the previous post. To summarize those distributions, these scatterplots show the t-values for the 20 parcels (each plotting symbol a unique parcel). The number of parcels on each side of the diagonal are listed in the corners. When analyzing surfaces, more parcels had higher t-values when preprocessing was with fMRIPrep in all four tasks, most prominently Cuedts. When analyzing volumes the story was mixed (equal split of parcels in Axcpt, higher t-values with fMRIPrep in Cuedts and Sternberg; higher t-values with HCP on Stroop). Comparing surface and volume within each preprocessing (second set of scatterplots), there were higher t-values in volumes in three of the four tasks for HCP; two of the four for fMRIPrep.

The t-values are a rough measure of effect size, but don't consider the entire distribution; another strategy is to fit mixed models, which allows the coefficients for each person to be included. These really can't be sensibly summarized in a few sentences; see the last few pages of this knitr (source here) for the output. But very briefly, I used this setup for the first model (R nlme code): lme(fixed=diff~surf.vol*hcp.fp*task.id, random=list(sub.id=~1, parcel.id=~1, task.id=~1), data=mm.tbl); where surf.vol was "surface" or "volume" and hcp.fp was "HCP" or "fMRIPrep"; sub.id, parcel.id, and task.id labeled the subjects, parcels (20 shown above) and tasks. Consistent with the graphs above, this model had significant interactions of surf.vol:hcp.fp and surf.vol:task.id. Looking only at volumes, the hcp.fp effect was significant, with fMRIPrep > HCP. Within only surfaces there was still an interaction of hcp.fp and task.id, so the dataset had to be subsetted further. In these smaller models, fMRIPrep surfaces > HCP surfaces in all tasks; fMRIPrep surfaces > fMRIPrep volumes in all but Cuedts. Here is the output from this reduced model for Axcpt; the other tasks are here.


summary thoughts

This has been a long series of posts, and I hope interesting and useful! I've included quite a few details in these posts, but not the full dataset; we do plan to make it available, but it is obviously quite a lot and not simple to share. It seems most useful to release the preprocessed 4d images along with the afni code and key model output (most of the post-afni model code is in the knitrs linked in these posts); please contact me if you'd like something specific.

My overall impression? fMRIPrep looks preferable for preprocessing, and surface analysis looks good. I was honestly hoping that surfaces wouldn't turn out so well, since I find the formats bothersome, interpolation problematic, and the preprocessing time consuming. Volumes are obviously required for subcortical areas, but for now, we will continue to run cortical surface GLMs.

There are of course many other comparisons that can be made, and some other analyses that I did that aren't in these posts. I made a good faith effort to set up the comparisons to have the final GLM statistics as equivalent and unbiased as possible, but of course not everything can be made equal (e.g., there are more vertices per parcel with the HCP than the fMRIPrep preprocessing because of the different surface meshes). 

It's hard to say how well these results will hold for other datasets; for example, I did not fit simple GLMs since the aim was to compare the DMCC's GLMs. Different acquisition parameters may influence the results quite a bit, particularly voxel size for surface analysis (at some larger sizes I would expect surface analysis to fail). I am very curious to hear about the results if anyone else tries comparisons like these, to see how typical they are. But for now, we're using fMRIPrep for new task fMRI experiments.

comparing fMRIPrep and HCP Pipelines: parcelwise GLM statistics

As explained in the previous post (the introduction is here) with the target knots and contrasts for each task defined, and with the Schaefer parcellation to identify comparable sets of voxels and vertices, we're now able to see if there's a difference in the GLM between the four preprocessing combinations (surface, volume, fMRIPrep, HCP pipelines).


In the previous post I showed some of the group-average TENT curves. Above are the coefficients going into those curves, but for all 13 test people - the four green dots for AX-CPT parcel 90 above are (almost) the same means as the four lines at knot 4 (shaded) in the previous post ("almost" since the green dots here are normal means, while the previous ones were robust; the individual participant's coefficients are the same). A unique plotting symbol is used for each participant, and the four columns of points at each parcel are the four preprocessings. Plots with the distributions for all 400 parcels are here (source here).

For a simple way to describe the distribution at each parcel I used t-tests: is the across-subjects mean different than zero? These plots have each parcel colored by t, HCP in the first row, then fMRIPrep, then their difference. The differences are HCP - fMRIPrep, so cool colors indicate parcels with larger ts when preprocessed with fMRIPrep. There are a lot of cool colors in the difference images, but it's still a lot of data to absorb; the next (final?) post in this series will summarize these statistics for just 20 of these parcels.









comparing fMRIPrep and HCP Pipelines: GLM TENT curves

As described in the introduction to this series of posts, the primary goal of these analyses is to compare the GLM estimates: is there a difference in the high vs. low cognitive control statistics produced from the different preprocessing pipelines (HCP, fMRIPrep) and spaces (volume, surface)? The previous step of the analysis confirmed that the whole-brain GLMs produced sensible results, so here I'll start showing the statistics averaged within parcels. As noted before, averaging the statistics within Schaefer parcels is especially useful here, since they are defined in all three results spaces (MNI volume; surface fsaverage5, surface HCP). We've been using the 400-parcel resolution in our lab, so I adopted it here as well; I haven't tried with other resolutions.

Below is a detailed explanation of how to read these figures. They're complex, but the best way I've found (so far!) to sensibly compare sets of GLM results for this dataset. Surface-derived curves are shown in the top pair of figures (with the surface brains), and volume-derived in the lower (with the volume brains). These are group (over the 13 test people) estimates for two informative parcels (90 and 91 in the left hemisphere). Files with all parcels can be downloaded here; this zip contains the compiled pdfs (that I took these images from) as well as the source knitr (.rnw).


There are five panes in each row. The first has a picture of the parcel and its name, surface for surface statistics (though I need to fix the text label spacing) and volume for volume statistics. The bar graphs by the parcel image show the sustained (block) regressors (robust mean and SEM) for each of the four tasks, in alphabetical order (same as the line graphs). Aside: the volume bar graph has space for "P" and "R" - these are the proactive and reactive DMCC sessions. I adapted this knitr template from one that included all three sessions; please ignore the empty spaces and "PRO" "REA" labels (fMRIPrep preprocessing was only run on Baseline session images).

The other four panes show the event-related part of the model for each task; the two regressors subtracted to make the high-low cognitive control contrast for each task are listed in the title after the task name. In the curve figures, the thick line at each knot (x-axis) is the across-subjects (robust) mean coefficient difference; the thin lines are the (robust) SEMs. Trimming of 0.1 was used in both cases; trimse from the WRS2 package was used for the SEM and base R for the trimmed mean. The solid line is calculated from the HCP-preprocessed images, dashed with fMRIPrep (abbreviated fp). The small bars below each set of lines gives the difference at the corresponding knot; HCP-fMRIPrep (so if the HCP-derived mean coefficient is larger the difference will be positive and the little bar will be above the line, on the side marked with "hcp").

The number of knots (x-axis) in each graph varies, since the tasks vary in duration: Sternberg trials are longest, Stroop trials are shortest. Each knot is 2 TRs (2.4 ms) in all tasks. The Stroop curve looks a bit HRF-ish; since each trial is just a few seconds, as is reasonable. The others are a bit delayed or doubled (in the case of Sternberg) due to the trial dynamics; there's lots more detail on the DMCC website, if you're curious.

For the purpose of the preprocessing comparisons, it's most relevant that we identified one knot for each task that should be most relevant for these high vs. low cognitive control contrasts, which is shaded in grey on the plots. Focusing on a single knot in each task gives a way to summarize the curves - we're not equally interested in all knots' estimates - which will be shown in the next post. Note that which knots are "most relevant" and which contrasts (and GLMs) would be used for high vs. low was determined before these preprocessing comparisons were conceived, primarily on the basis of task design, but also checking against the volumetric HCP-preprocessed GLM results.

I described these two parcels (90 and 91) as "informative" because in all four cases (surface, volume, HCP, fMRIPrep) the TENT curves have a fairly sensible shape (given the task timing) in all tasks, with a positive (high > low cognitive control) mean at the target knot (shaded). These two parcels are also located in areas that are sensible for having more activity in high cognitive control conditions. In the next post I'll show whole-brain results again, but summarized within parcel and at the target knot.

comparing fMRIPrep and HCP Pipelines: full-brain GLM images

As described in the introduction to this series of posts, the primary goal of these analyses is to compare the GLM estimates: is there a difference in the high vs. low cognitive control statistics produced from the different preprocessing pipelines (HCP, fMRIPrep) and spaces (volume, surface)? As explained in the introductory post, I've found comparing the statistics within Schaefer parcels most useful, since these are defined in all four results spaces.

But the parcel-averaged statistics were calculated from mass-univariate (every voxel and vertex separately) GLMs, and it's dangerous to make parcel averages without first looking at some of the full-brain GLM results to make sure everything seems sensible. So, here I'll show a few single-subject GLM results, before moving on to group and parcel-averaged in the next post(s).

The "buttons" GLMs are not of direct interest - they only model the button presses, not the cognitive tasks. But they're useful for a positive control: activity should be in motor areas and possibly visual (given the linking of response and visual change in some of our tasks). Below are the "button1" F statistics from the Buttons GLM for the same participant used in the previous posts. (Since these are TENT models, the Fstat provides a single summary of activity across the set of knots.)




First, the volume results. Since they're both in the same output space I can subtract the statistics voxel-wise, for the third row of images for each task. I subtracted HCP-fMRIPrep, so cool colors are voxels where the fMRIPrep-volume-preprocessed statistic was larger than the HCP-volume-preprocessed. As expected, the peaks are generally in grey matter, and pretty similar between the two pipelines: the two AX-CPT images look more similar to each other than do the HCP-preprocessed AX-CPT with the HCP-preprocessed Cued task-switching image. This is very good: I expect some variation between the three tasks (though all should have some motor), but the two pipelines ought to make similar statistics, given that they started with the same DICOMs, event timings, etc. While the two pipelines' images are visually similar, the difference images have quite a bit of blue in sensible brain areas, indicating that the fMRIPrep-preprocessed images had larger Fs in those voxels.




And here are the same statistics for the HCP and fMRIPrep-preprocessed surfaces. I couldn't subtract the two surfaces, since the two pipelines use different surface targets (so there isn't vertex-to-vertex correspondence; there are about 3x as many vertices in the HCP surface than fsaverage5). The inflation of the two underlays doesn't match exactly, but it's possible to tell that the statistics sort much more by task than preprocessing, which, as noted before, is very reassuring. There's also clear motor activity in all images, less in AX-CPT than the others, consistent with what is seen in the volume images for this person.

This person is quite typical, as are the conclusions drawn from this button1 F statistic: the full-brain statistical images tend to sort much more by statistic (which contrast, which task) than by preprocessing. But there are differences in the statistic magnitudes and cluster boundaries, and these are hard to evaluate qualitatively, particularly on the surfaces. Averaging the statistics within parcels makes it possible to quantify these differences, and so directly compare the four sets of results.

Thursday, January 10, 2019

comparing fMRIPrep and HCP Pipelines: tSNR images

See this for the introduction to this series of posts. Yesterday I showed that the motion regressors produced by the two programs are nearly identical (as we'd hope, since they started with the same DICOMs!). But how do the preprocessed images compare? Short version: the preprocessed images are qualitatively similar in appearance (and standard QC images), with differences between people more obvious than differences between the pipeline. But only "similar", not "nearly identical" (which is how I described the motion regressors).

(Much) Longer version: The main goal of these preprocessing tests is to compare the GLM estimates (described a bit in the introduction and more later), not to perform QC on the images from the pipelines. This adds a complication, however, because the images that go into the GLMs are not exactly the same as the images that come out of the pipelines. It strikes me as most useful to look at the images as ready to go into the GLMs, since that's the final step at which things vary - identical GLMs were performed at every voxel and vertex.

We wanted to make the GLM estimate comparisons as direct and fair as possible, so aimed to do equivalent transformations of the surface and volume preprocessed images from each pipeline. If you think we missed that target, please let me know. We use afni for the after-preprocessing steps, as summarized here. First, both fMRIPrep and the HCP pipelines produce MNI-normalized volumes, but the bounding box is a bit different, so we used 3dresample to match the fMRIPrep images to the HCP, as well as to reorient to LPI. Both the fMRIPrep and HCP pipeline volumes are then smoothed (3dBlurToFWHM, -FWHM 4); surfaces are NOT smoothed with afni commands, since they were smoothed in the preprocessing pipelines. Finally, the timeseries at each voxel (or vertex) was scaled (in the usual way for afni, see this or this), using 3dcalc -a full.timeseries -b mean.timeseries -expr 'min(200, a/b*100)*step(a)*step(b)', where mean.timeseries is an image of the voxel (or vertex)-wise means, calculated with 3dTstat.


Single frames (this is #100 from the same run as in the previous post) from the ready-for-afni images look similar; an obvious difference is that the HCP pipelines mask the brain, while fMRIPrep does not. I have the color scaling the same in both images (90 to 100) and the intensity in the brain is similar (as it should be, given the scaling); the voxel under the crosshairs has the value 99.8 after fMRIPrep and 98.6 after the HCP pipelines.The frontal dropout (yellow arrow) is obvious in both, but looks like weird striping in the HCP image because of the masking; without masking it blends into the background. Spot-checking frames of the images I can't say that I've found one preprocessing consistently "better" than the other; by eye, sometimes one looks a bit better, sometimes the other.


Here's the same frame, from the surfaces. The HCP pipelines project to a different underlay surface anatomy than fMRIPrep, so the brain shapes are a bit different (fMRIPrep to fsaverage5, 10242 vertices/hemisphere; HCP with 32492 vertices/hemisphere). To be honest, I find these harder to interpret ... swirly colors.



If the single frames look similar, what about QC images? Well, by eye they also strike me as "different but similar", without an overwhelming preference for one being better than the other. The above are tSNR images of the same test run, all with the same color scaling.


Is the story in this whole series of comparison posts going to be "similar"? No, differences turn up in the GLM results, especially when summarizing within parcels ... stay tuned.

Note: the movement graphs and tSNR images were made with R knitr code. The full pdf (four tasks for this person) is here, and its source here. The paths/etc. won't work to compile the knitr (unless you're on our network!), but the plotting code and afni commands may prove useful; the sourced gifti plotting functions are here.