Wednesday, August 14, 2019

comparing fMRIPrep and HCP Pipelines: Resting state benchmarks?

While I'm not a particular fan of resting state, each DMCC session includes a pair of short resting state runs, so we need to include them in the comparisons of fMRIPrep and the HCP Pipelines. This post collects some of my notes and "meanderings" on how such comparisons have been done by others and what we plan to do.

As previously described, for the task runs we decided to use well-understood task effects as benchmarks to measure quality: larger GLM statistics in the expected direction and in the expected brain regions are better. Specifically, a contrast of high - low cognitive control conditions (e.g., in Stroop, "high" is a color word printed in a different ink color, "low" is the color word printed in its matching ink color) should be positive (more BOLD in high than low) in frontoparietal regions. Other positive control tests could contrast pushing a button versus not (targeting M1) or having a visual stimulus on screen versus not (targeting visual regions).

These task benchmarks are appealing because the ground truth is known: high - low cognitive control contrast results should look a particular way. If they look like they should, then I know that everything worked properly, and so can move to comparing the strength of the results under different preprocessing schemes.

But what is a benchmark test for resting state? How can I tell if the preprocessing and analysis was successful, so that it's valid to compare the statistical outcomes?

My first thought was that the focus would be on resting state connectivity matrices: that these matrices are the analysis target in the same way that GLM statistics are (often) the outcome of interest in task fMRI. This still seems sensible to me: if we have the same set of nodes/parcels in the same person with the same rsfMRI runs, shouldn't larger correlations in the cells expected to be correlated (e.g., pairs of parcels assigned to the same "community" in the Gordon parcellation) be better? It looks like this is sometimes done (e.g., Aquino et al. (2019)), and we will try it, but most resting state processing comparison papers I found use a different strategy, as succinctly stated on a poster at OHBM 2019 (W349, Kayvanrad, Strother, & Chen):
In the absence of a "ground truth" for FC, we can but rely on alternative reference measures such as high-frequency noise contributions.
There seems to be a fair amount of consensus on the type of "alternative reference measures" that should be used: ones aimed at measuring the degree to which effects that we think should not be present (e.g., motion correlations) are indeed not present in the data after preprocessing.

So, what are these alternative reference measures? Table 2 of the useful review/protocol Ciric et al. (2018) summarizes the most commonly used ones.

It seems that using the relationship between Quality Control and Functional Connectivity ("QC-FC") to evaluate signal quality and denoising efficacy has been around since the huge effect of movement on functional connectivity estimates was described in several papers in 2012 (Power et al.; Satterthwaite, et al.; Van Dijk, Sabuncu, & Buckner).

How exactly these QC-FC indices are calculated appears to vary a bit between groups and over time. For example, Burgess (2016) Figure 3 shows "QC-rsFC" plots from different denoising procedures; the QC measure was "quantified by proportion of time points censored using the combined FD and DVARS criteria", a different quantification than the "mean framewise displacement" in Ciric et al. Table 2 above (and Aquino 2019).
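
To make the version we plan to use concrete, here is a minimal sketch (Python/numpy; the names fc, mean_fd, and qcfc_correlations are mine, and details that vary between papers, such as Fisher z-transformation or partialling out covariates like age, are omitted): for each edge of the connectivity matrix, correlate that edge's value across participants with the participants' mean FD.

```python
import numpy as np
from scipy import stats

def qcfc_correlations(fc, mean_fd):
    """QC-FC sketch: correlate each connectivity edge with mean FD across participants.

    fc      : participants x parcels x parcels array of correlation matrices
              (hypothetical input, e.g., Gordon-parcel matrices per participant)
    mean_fd : length-participants vector of mean framewise displacement
    """
    n_parcels = fc.shape[1]
    rows, cols = np.triu_indices(n_parcels, k=1)   # unique edges (upper triangle)
    edges = fc[:, rows, cols]                      # participants x edges
    qcfc = np.array([stats.pearsonr(edges[:, e], mean_fd)[0]
                     for e in range(edges.shape[1])])
    return qcfc   # one correlation per edge; a distribution near 0 is better
```

Variants use rank or partial correlations instead of plain Pearson, but the overall logic is the same: edges whose strength tracks participants' motion are suspect.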

The aim for our preprocessing comparisons is much more modest than that of most of the papers I've mentioned: we're not developing a new pipeline or validating a new denoising algorithm, just trying to confirm that the reasonable resting state results obtained with HCP Pipeline preprocessing are also present after fMRIPrep and XCP; that we're not seeing a big drop in quality with the shift in pipeline. I don't want to attempt to identify the "best possible" versions of the QC-FC indices (there probably isn't an ideal version, regardless), but rather to use ones that are in wide recent use and are easy to understand and calculate.

Finally, the plan for the DMCC comparisons (a rough code sketch of steps 1 and 3 follows below):

1. Make functional correlation matrices for the two pipelines for each participant, using the Gordon et al. (2016) parcellation (akin to Aquino et al. (2019) Figure 10), in the hopes of identifying known structure (i.e., the defined communities) in each (clearer structure and higher correlations are better).
2. Compute the QC-FC correlations for each pipeline (using mean FD), comparing the pipelines as in Aquino et al. (2019) Figure 7b (a distribution closer to 0 is better).
3. Compare the QC-FC distance dependence for each pipeline, as in Aquino et al. Figure 8 (flatter is better).
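
And a rough sketch of steps 1 and 3, assuming the Gordon community labels, parcel centroid coordinates, a group-mean correlation matrix, and the per-edge qcfc vector from the sketch above are already in hand (communities, centroids, and fc_mean are hypothetical variable names):

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import pdist

def community_contrast(fc_mean, communities):
    """Step 1 summary: mean within- vs. between-community correlation.

    fc_mean     : parcels x parcels group-mean correlation matrix
    communities : length-parcels array of community labels
    A larger within/between difference suggests clearer community structure.
    """
    rows, cols = np.triu_indices(fc_mean.shape[0], k=1)
    same = communities[rows] == communities[cols]
    edges = fc_mean[rows, cols]
    return edges[same].mean(), edges[~same].mean()

def qcfc_distance_dependence(qcfc, centroids):
    """Step 3: Spearman correlation between each edge's QC-FC value and the
    Euclidean distance between its parcels' centroids (centroids: parcels x 3).
    pdist orders pairs the same way as np.triu_indices, so the edges line up.
    A correlation closer to 0 (a flatter relationship) is better.
    """
    distances = pdist(centroids)
    rho, _ = stats.spearmanr(qcfc, distances)
    return rho
```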

Especially those of you more experienced with resting state analyses: does this seem like a sensible set of analyses to compare preprocessing pipelines? Anything I should add (or subtract)?

As a (relative) outsider, the idea of evaluating on the basis of fewer artifacts (for lack of a better word - effects we don't want to be present) is rather unsatisfying; analyses and acquisitions can go wrong in so many ways that I find positive controls (e.g., high-low, button-pressing) more convincing. Perhaps an equivalent would be the strength of correlation between pairs of brain regions that are accepted as being highly functionally connected (or not)? Is there an accepted set of such regions and relationships?

5 comments:

  1. This is a complicated topic, and is not settled. There is real neural signal that correlates with nuisances like motion (the brain is controlling the motion, after all), there is real neural signal in the "global" signal that GSR and related methods remove, etc. Many of the measures implicitly assume that neural signals don't correlate with these things, because even with a hypothetical ideal cleanup, removing the global neural signal, motion-correlated neural signal, etc., would still improve the value the measure gives.

    Our temporal ICA paper presents a new method to do what GSR was effectively intended to do, without removing the neural signal that GSR removes. We also point out flaws in many of the existing measures of cleanup quality, see page 29 of the supplemental information PDF on https://www.sciencedirect.com/science/article/pii/S1053811918303963 .

    1. Thanks for the pointer - that is a very relevant discussion and reinforces my discomfort with the QC-FC family of measures.

  2. We recently published a paper addressing some of these same concerns. I started working on this project almost three years ago now (not constantly working on it, moving around, etc.), and read this blog post early on in the project. It was really helpful to see your dissatisfaction with the QC-FC metrics; it pushed us to dig deeper into that with the data we had, and to suggest better positive control measures such as Default Mode Network replicability (compared within the same subject across 2 sessions) and identifiability (compared to a template). Thank you for your blog!


    Neuroinformatics journal: https://link.springer.com/article/10.1007/s12021-022-09565-8
    biorxiv: https://www.biorxiv.org/content/10.1101/2021.09.18.460908v1.abstract

    1. Sounds interesting, I hope to look at your work more closely!

    2. Thank you! - Vyom (sorry not sure why my name didn't show up in the earlier comment)
