A quick recommendation: I strongly suggest pre-calculating the relabeling schemes before running a permutation test. In other words, before running the code that does all the calculations needed to generate the null distribution, determine which relabeling will be used for each iteration of the permutation test, and store those labels so they can be read back in later. To be clear, the alternative to pre-calculating the relabeling schemes is to generate them at run time, such as by randomly resampling the labels during each iteration of the permutation test; that is the approach I'm recommending against.
There are a couple of reasons I think this is a good principle to follow for any "serious" permutation test (e.g. one that might end up in a publication):
Safety and reproducibility. It's a lot easier to confirm that the relabeling scheme is operating as expected when it can be checked outside of debug mode or run time. At minimum, I check that there are no duplicate entries and that the randomization looks reasonable (e.g. are the labels chosen at approximately equal frequencies?). Having the relabelings stored also means that the same permutation test can be run at a later time, even if the software or machines have changed (built-in randomization functions are not always guaranteed to produce the same output on different machines or with different versions of the software).
Ease of separating the jobs. I am fortunate to have access to an excellent supercomputing cluster. Since my permutations are pre-calculated, I can run a permutation test quickly by sending many separate, non-interacting jobs to different cluster computers. For example, I might start one job that runs permutations 1 to 20, another that runs permutations 21 to 40, and so on. In the past I've tried splitting jobs like this by setting random seeds, but it was much more bug-prone than explicitly pre-calculating the labelings. Relatedly, if a job crashes for some reason, it's a lot easier to restart after the last-completed permutation when the relabelings have been pre-calculated.
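To make this concrete, here is a minimal sketch of the idea in MATLAB (not code from any particular package; the 20-example, two-class setup, the variable names, and the file name are all just for illustration): generate all the relabelings up front, run the sanity checks, and write the labels to a file that each cluster job can read.
nPerms=1000; % how many relabelings to pre-calculate
trueLabels=[ones(1,10) 2*ones(1,10)]; % 20 examples: ten of class 1, then ten of class 2
permLabels=zeros(nPerms, numel(trueLabels)); % one relabeling per row
for i=1:nPerms
    permLabels(i,:)=trueLabels(randperm(numel(trueLabels)));
end
if size(unique(permLabels, 'rows'),1) < nPerms % safety check: no duplicate relabelings
    warning('duplicate relabelings were generated');
end
mean(permLabels == 1) % sanity check: each column should be close to 0.5
dlmwrite('permLabels.txt', permLabels); % store the labels; each job then reads just the rows (permutations) it needs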
Friday, August 22, 2014
Thursday, August 21, 2014
more on the LD-t
In a previous post I wrote a bit about A Toolbox for Representational Similarity Analysis, and my efforts at figuring out the LD-t statistic. Nikolaus Kriegeskorte kindly pointed me to some additional information and cleared up some of my confusion. Note that I'll be using the term "LD-t" in this post for consistency (and since it's short); it's a "cross-validated, normalized variation on the Mahalanobis distance", as phrased in Nili et al. (2014).
First, the LD-t has been described and used previously (before Nili et al. 2014), though (as far as I can tell) not in the context of representational similarity analysis (RSA). It is summarized in Figure S8a of Kriegeskorte et al. (2007, PNAS); the paper's supplemental text has additional explanation (the "Significance testing of ROI response-pattern differences" section). A bit more background is also on pages 69-71 of Niko's thesis.
To give a high-level picture, calculating the LD-t involves deriving a t-value from the test-set output of a Fisher linear discriminant. The discriminant is fit (weights calculated from the training data) with the standard algorithms, but using an error covariance matrix estimated from the residuals of fitting a GLM to the entire training dataset (a time-by-voxel data matrix); that covariance matrix is then run through the Ledoit-Wolf covariance shrinkage estimator function.
This isn't a procedure that can be dropped into an arbitrary MVPA workflow. For example, for my little classification demo I provided a single-subject 4D dataset in which each of the 20 volumes is the temporally-compressed version (averaging, in this case) of an experimental block; the first ten are class A, the second ten class B. The demo then uses cross-validation and an SVM to get an accuracy for distinguishing class A from class B. It is trivial to replace the SVM in that demo with LDA, but that will not produce the LD-t. One reason is that the LD-t procedure requires splitting the dataset into two parts (a single training set and a single testing set), rather than an arbitrary cross-validation scheme, but there are additional differences.
Here's a description of the procedure; many thanks to Carolina Ramirez for guiding me through the MATLAB! We're assuming a task-based fMRI dataset for a single person.
- Perform "standard" fMRI preprocessing: motion correction, perhaps also slice-timing correction or spatial normalization. The values resulting from this could be output as a 4d data matrix of the same length as the raw data: one (preprocessed) image for each collected TR. We can also think of this as a (very large) time-by-voxel data matrix, where time is the TR.
- Create design matrices (X) for each half of the dataset (A and B, using the terms from the Toolbox file fisherDiscrTRDM.m). These are as usual for fMRI mass-univariate analysis: the columns generally include the experimental-condition predictors (convolved with a hemodynamic response function) as well as predictors for motion and linear trends.
- Create data matrices (Y) for each half of the dataset (training (A) and testing (B)), structured as usual for fMRI mass-univariate analysis, except including just the voxels making up the current ROI.
- Now, we can do the LD-t calculations; this is function fishAtestB_optShrinkageCov_C, also from the Toolbox file fisherDiscrTRDM.m. First, fit the linear model to dataset A (training data):
eBa=inv(Xa'*Xa)*Xa'*Ya; % calculate betas
eEa=Ya-Xa*eBa; % calculate error (residuals) matrix
- Use the training-set residuals matrix (eEa) to estimate the error covariance, and apply the Ledoit-Wolf covariance shrinkage estimator to that covariance matrix. This is Toolbox file covdiag.m, which returns the shrinkage estimate of the covariance matrix, Sa (see the sketch after this list for what I think the corresponding calls look like).
- Use the inverse of that covariance matrix (invSa) and the training-set parameter-estimate images (the betas, eBa) to calculate the Fisher linear discriminant (C is a contrast matrix, made up of -1, 0, and 1 entries, corresponding to the columns of the design matrix). Fitting the discriminant produces a set of weights, one for each voxel.
was=C'*eBa*invSa; % calculate linear discriminant weights
- Now, project the test dataset (Yb) onto this discriminant: yb_was=Yb*was';
- The final step is to calculate a t-value describing how well the discriminant separated the test set: a t-value is an estimate divided by its standard error, which is what the final lines of the fishAtestB_optShrinkageCov_C function compute. The values are adjusted before the t-value calculation; it looks like this is intended to compensate for differing array dimensions (degrees of freedom). I don't fully understand every choice in this code, but here it is for completeness, with my reading of each line as a comment:
invXTXb=inv(Xb'*Xb); % inverse of the test-set design matrix cross-product
ebb_was=invXTXb*(Xb'*yb_was); % betas from fitting the test-set design matrix (Xb) to the discriminant time course
eeb_was=yb_was-Xb*ebb_was; % residuals of that fit
nDFb=size(yb_was,1)-size(Xb,2); % residual degrees of freedom: time points minus predictors
esb_was=diag(eeb_was'*eeb_was)/nDFb; % residual error variance
C_new=C(1:min([size(ebb_was,1),size(C,1)]),:); % trim the contrast matrix to match the number of betas
ctb_was2=diag(C_new'*ebb_was); % contrast of the test-set betas
se_ctb_was2=sqrt(esb_was.*diag(C_new'*invXTXb*C_new)); % standard error of that contrast
ts=ctb_was2./se_ctb_was2; % the t-value(s): contrast divided by its standard error
- Once the t-value is calculated it can be converted to a p-value if desired, using the standard t-distribution.
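For completeness, here's a rough sketch of the two pieces whose code isn't quoted above (this is my guess at the corresponding calls, not lines copied from the Toolbox; the variable names match the snippets above, and I'm assuming nDFb is the appropriate degrees of freedom for the p-value):
Sa=covdiag(eEa); % Ledoit-Wolf shrinkage estimate of the error covariance, from the training-set residuals (Toolbox covdiag.m)
invSa=inv(Sa); % its inverse, used when calculating the discriminant weights (was)
p=1-tcdf(ts,nDFb); % one-sided p-value for the LD-t, from the standard t-distribution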
This procedure is for a single subject. For group analysis, Kriegeskorte et al. (2007) proceeded by "concatenating the discriminant time courses of all subjects and fitting a composite design matrix with separate predictors for each subject." Nili et al. (2014, supplemental) suggest a different approach: averaging the LD-t RDMs cell-wise across people, then dividing by the square root of the number of subjects.
The statistic produced by these calculations is related to the cross-validated MANOVA described by Carsten Allefeld; I hope to compare them thoroughly in the future.
Wednesday, August 6, 2014
King 2014: MEG MVPA and temporal generalization matrices
Last week at ICON I attended an interesting talk by Stanislas Dehaene, in which he described some work in recent papers by Jean-Rémi King; I'll highlight a few aspects (and muse a bit) here.
First, MVPA of MEG data. I've never worked with MEG data, but this paper (citation below) describes classifying trial type within-subjects (linear SVM, c=1!), using the amplitude in each of the 306 MEG sensors instead of BOLD in voxels (using all the MEG sensors makes it a whole-brain analysis: the features span the entire brain). I should go back to referring to "MVPA" as an acronym for "multivariate pattern analysis" instead of "multi-voxel pattern analysis" (as the authors do here), so as not to exclude MEG analyses!
Second, I was quite taken with what they call "temporal generalization matrices", illustrated in their Figure 1. These matrices succinctly summarize the results of what I'd call "cross-timepoint classification": rather than doing some sort of temporal compression to get one summary example per trial, they classify each timepoint separately (e.g. all the images at onset - 0.1 seconds; all the images at onset + 0.1 seconds). Usually I've seen this sort of analysis plotted as accuracy at each timepoint, like Figure 3 in Bode & Haynes 2009.
"Temporal generalization matrices" take timepoint-by-timepoint analyses a step further: instead of training and testing on images from the same timepoint, they train on images from one timepoint, then test on images from every other timepoint, systematically. The axes are thus (within-trial) timepoint, training-set timepoint on the y-axis and testing-set timepoint on the x-axis. The diagonal is from training and testing on the same timepoint, same as the Bode & Haynes 2009-style line graphs.
Since this is MEG data there are a LOT of timepoints, giving pretty matrices like those in their Figure 3. In the left-side matrix the signal starts around 0.1 seconds (the bottom-left corner of the red blotch) and lasts until around 0.4 seconds, without much generalization: a classifier trained at 0.2 seconds won't accurately classify 0.4 seconds. The right-side matrix shows much more generalization: a classifier trained at 0.2 seconds classified accurately all the way to the end of the trial. Altogether, these matrices are nice summaries of a huge number of classifications.
Finally, note the dark blue blotch in the left-side matrix: they found clear below-chance accuracy for certain combinations of timepoints, which they discuss quite a bit in the manuscript. I'm pointing this out since below-chance accuracies are a persistent issue, and they (properly) didn't over-interpret (or hide) it in this case.
King JR, Gramfort A, Schurger A, Naccache L, & Dehaene S (2014). Two distinct dynamic modes subtend the detection of unexpected sounds. PLoS One, 9 (1) PMID: 24475052