Tuesday, June 26, 2012
attending SfN
I'll be attending SfN this year, and will have a poster Monday morning. It's not a method poster, but uses some nice MVPA (if I say so myself!). I'm still putting my schedule together; contact me if you'd like to get together and chat about MVPA.
Wednesday, June 20, 2012
linear SVM weight behavior
Lately I've been thinking about how linear SVM classification performance and weight maps look when the informative voxels are highly similar, since voxels can of course be highly correlated in fMRI. I made a few little examples that I want to share; R code for these examples is here.
These examples are similar to the ones I posted last month: one person, two experimental conditions (classes), four runs, and four examples of each class in each run. I classified with a linear SVM (c=1), partitioning on the runs, with 100 voxels. The images show the weights from the fitted SVM, averaged over the four cross-validation folds.
In each case I generated random numbers for one class for each voxel. If the voxel is "uninformative" I copied that set of random numbers to the other class; if the voxel is "informative" I added a small number (the "bias") to the random numbers to form the other class. In other words, an uninformative voxel's value on the first class A example in run 1 is the same as on the first class B example in run 1, while for an informative voxel the first class B example in run 1 equals the value of the first class A example plus the bias.
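Here is a minimal sketch of that generation scheme (the linked code is the real version); the number of examples, the five informative voxels, and the bias of 0.4 are illustrative choices, not necessarily the values behind the figures below.
set.seed(42);
n.examples <- 16;     # 4 runs * 4 examples of each class per run
n.voxels <- 100; n.informative <- 5; bias <- 0.4;

# class A: random numbers for every voxel and example
class.A <- matrix(rnorm(n.examples * n.voxels), nrow=n.examples);

# class B: uninformative voxels are exact copies of class A;
# informative voxels (here, the first five columns) are class A plus the bias
class.B <- class.A;
class.B[, 1:n.informative] <- class.A[, 1:n.informative] + bias;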
I ran these three ways: with all the informative voxels being identical (i.e. I generated one "informative" voxel then copied it the necessary number of times), with all informative voxels equally informative (equal bias) but not identical, and with varying bias in the informative voxels (so they were not identical or equally informative).
Running the code will let you generate graphs for each cross-validation fold and however many informative voxels you wish; I'll show just a few here.
In the graph for 5 identical informative voxels the informative voxels have by far the strongest weights, but when there are 50 identical informative voxels they 'fade': their weights are smaller than those of the uninformative voxels.
Linear SVMs classify with a weighted sum of the voxel values, so when there are that many identically informative voxels only a small weight on each is needed.
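Here's a hedged little demonstration of that weight-splitting using the e1071 package (my assumption; the figures above were not necessarily made with it): the weight on a single informative voxel is compared to the weights when ten identical copies of that voxel are classified together.
library(e1071);   # assumed available; provides a linear-kernel svm
set.seed(1);

one.voxel <- rnorm(32);    # one informative voxel, 16 examples per class
labels <- factor(rep(c("a", "b"), each=16));
one.voxel[labels == "b"] <- one.voxel[labels == "b"] + 0.4;   # add the bias to class b

# weight when the voxel appears once: recover w as t(coefs) %*% SV
fit.1 <- svm(x=matrix(one.voxel, ncol=1), y=labels, kernel="linear", cost=1, scale=FALSE);
w.1 <- t(fit.1$coefs) %*% fit.1$SV;

# weights when ten identical copies of the voxel are included
fit.10 <- svm(x=matrix(rep(one.voxel, 10), ncol=10), y=labels, kernel="linear", cost=1, scale=FALSE);
w.10 <- t(fit.10$coefs) %*% fit.10$SV;

round(w.1, 3); round(w.10, 3);   # each of the ten weights is far smaller in magnitude than w.1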
This does not happen when the voxels are equally informative, but not identical: the weights are largest (most negative, in this case) for the informative voxels (left side of the image).
The accuracy is higher than with the 50 identical informative voxels, though the bias is the same in both cases.
When the informative voxels are more variable the weight map is also more variable, with the voxels carrying more bias tending to receive larger weights.
recap
The most striking thing I noticed in these images is the way the weights of the informative voxels get closer to zero as the number of informative voxels increases. This could cause problems when voxels have highly similar timecourses: they won't be weighted in terms of the information in each, but rather as a function of both the information in each and the number of voxels with a similar amount of information.
Wednesday, June 6, 2012
temporal compression for different image acquisition schemes
A question was posted on the mvpa-toolbox mailing list about how to do temporal compression: which volumes should you pick to correspond to an event, particularly if the timing of the events is jittered? What follows is a version of my reply.
The case when stimulus onset is time-locked to image acquisition is the easiest. In this case I generally guess which images (acquired volumes) should correspond to peak HRF and average those. This is straightforward if the TR is short compared to the time period you want to temporally compress (e.g. a twenty-second event and two-second TR) but can get quite dodgy if the events and TR are close in time (e.g. events that last a second). In these cases I generally think of analyzing single timepoints or generating PEIs.
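For concreteness, here is a minimal sketch of that pick-and-average strategy with made-up data; the 2-second TR, the onsets, and the 4-to-8-second peak window are illustrative assumptions, not recommendations.
TR <- 2;    # seconds
bold.mat <- matrix(rnorm(100 * 50), nrow=100);   # fake data: 100 volumes by 50 voxels
onsets <- c(10, 50, 90);      # event onsets, in seconds, time-locked to acquisition
peak.window <- c(4, 6, 8);    # seconds after onset expected to be near the HRF peak

# average the volumes falling in the peak window to get one example per event
compress.event <- function(onset.sec) {
  vol.ids <- (onset.sec + peak.window) / TR + 1;   # 1-based volume indices
  colMeans(bold.mat[vol.ids, , drop=FALSE]);
}
event.examples <- t(sapply(onsets, compress.event));  # one row per event, one column per voxel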
If stimulus onset is jittered in relation to image acquisition I follow a similar logic: if the jitter is minimal compared to the TR (e.g. events start either half or three-quarters of the way through a 1.5 second TR) or to the number of volumes being averaged (e.g. a block design and 12 volumes are being averaged each time) I'll probably just ignore the jitter. But if the jitter is large (e.g. 4 sec TR and completely randomized stimulus onset) I'll think of PEIs again.
By PEIs I mean "parameter estimate images" - fitting a linear model assuming the standard HRF and doing MVPA with the beta weights. I described some of this and presented a comparison of doing averaging and PEIs on the same datasets in "The impact of certain methodological choices on multivariate analysis of fMRI data with support vector machines".
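To make the PEI idea concrete, here is a rough base-R sketch with fake data and an approximate double-gamma HRF; the parameter values are illustrative, not those from the paper.
TR <- 2; n.vols <- 100;
bold.mat <- matrix(rnorm(n.vols * 50), nrow=n.vols);  # fake data: 100 volumes by 50 voxels

# approximate double-gamma HRF, sampled at the TR
t.sec <- seq(0, 30, by=TR);
hrf <- dgamma(t.sec, shape=6) - dgamma(t.sec, shape=16)/6;

# convolve the condition's onsets with the HRF to get the predicted timecourse
boxcar <- rep(0, n.vols);
boxcar[c(6, 26, 46)] <- 1;     # onset volumes for this condition
reg <- convolve(boxcar, rev(hrf), type="open")[1:n.vols];

# fit the model in each voxel; the beta for reg is that voxel's PEI value
betas <- apply(bold.mat, 2, function(v) coef(lm(v ~ reg))["reg"]);
A real design would of course have a regressor for each condition (plus nuisance terms), giving one beta image per condition to feed into the MVPA.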
As a general strategy I look at the TR, stimulus timing, and event duration for each particular experiment and question, then think about which volumes the BOLD response we're looking for probably falls into. If the answer is clear, I pick those volumes; if not, I use PEIs or reformulate the question. None of this is a substitute for proper experimental design and randomization, of course, and fitting PEIs is not a cure-all.
Tuesday, June 5, 2012
MVPA significance with the binomial test: update 1
A clarification is in order: in the previous post I implied that Francisco advocates using the binomial to test for significance across subjects, as I illustrated in the R code. He doesn't. The per-person binomial tests at the end of that code are, however, calculated the way he describes.
Francisco gives a few ideas in the "Group level analysis" section of his "Information Mapping" paper (citation below), mostly advocating descriptions or count-based techniques. This can work fairly well for searchlight analyses (e.g. map the proportion of subjects with each voxel significant), but not as well for ROI-based analyses like the toy example in the previous post. He'll be sharing some thoughts in a future post.
Pereira F, & Botvinick M (2011). Information mapping with pattern classifiers: a comparative study. NeuroImage, 56 (2), 476-96 PMID: 20488249
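As a tiny illustration of that count-based group summary (my sketch, not Francisco's code), with fake searchlight p-values and an arbitrary 0.05 threshold:
set.seed(2);
p.vals <- matrix(runif(10 * 1000), nrow=10);   # fake data: 10 subjects by 1000 searchlight centers
prop.sig <- colMeans(p.vals < 0.05);           # proportion of subjects significant at each center
# prop.sig can then be written back into the brain to make the group map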
Monday, June 4, 2012
MVPA significance with the binomial test
Using the binomial test to evaluate significance is advocated by Francisco Pereira and Tom Mitchell (and others!); reference below. The method is easy to explain, but the exact implementation is not transparent to me: how should I count the "number of correctly labeled test set examples" when I have multiple cross-validation folds, multiple subjects, and perhaps multiple repeats?
The common practice (as far as I can tell; I'll post your reply/comment if you disagree!) is to handle these problems by 1) counting the number of examples in the entire test set (summing over cross-validation folds) and 2) using the average over the multiple subjects and repeats as the number correct.
Here's an example in R:
# 3 runs, 10 examples in each (5 of two classes), 3 people
# here's how each person did:
(8 + 6 + 5)/30 # 19 correct, 0.6333333
(10 + 5 + 5)/30 # 20 correct, 0.6666667
(10 + 8 + 7)/30 # 25 correct, 0.8333333
# calculate the average accuracy for number correct
mean(c(0.6333333, 0.6666667, 0.8333333)); # 0.7111111
0.71 * 30 # 21.3 examples correct on average
# calculate the binomial test of the average.
# have to round to a whole number correct.
binom.test(20, 30, 0.5) # p-value = 0.09874
# calculating the binomial summing over the number of examples in
# all the subjects makes a much smaller p-value:
(8 + 6 + 5 + 10 + 5 + 5 + 10 + 8 + 7) # 64
binom.test(64, 90, 0.5); # p-value = 7.657e-05
# t-test version (also common, though not directly comparable)
t.test(c(0.6333333, 0.6666667, 0.8333333), mu=0.5, alternative='greater'); # p-value = 0.03809
# binomial tests for each person.
binom.test(20, 30, 0.5) # p-value = 0.09874
binom.test(19, 30, 0.5) # p-value = 0.2005
binom.test(25, 30, 0.5) # p-value = 0.0003249
While not the case in this toy example (if we do the binomial on the average), I often find that the binomial produces smaller group p-values than the t-test or permutation testing. Nevertheless, I do not often use it, because so much information gets lost: variation between people, between test sets, between replications (if done).
It seems odd to use the across-subjects average and then have the exact same test (20 correct out of 30) regardless of whether there were 3 subjects or 300, but it is worse to sum over the number of examples in all the subjects (64 correct out of 90), because then the difference required for significance is so small and the actual test set size is lost completely. It also 'feels' improper to me to use the total number of test set examples across all the cross-validation folds; even if all of the examples are fully independent, performance on the cross-validation folds certainly is not.
I am certainly not advocating that we ignore all research that used the binomial! But I do tend to interpret the significance levels that result with even more skepticism than usual.
Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview NeuroImage, 45 (1) DOI: 10.1016/j.neuroimage.2008.11.007
see additional posts on this topic: 1