The common practice (as far as I can tell; I'll post your reply/comment if you disagree!) is to handle these problems by 1) using the number of examples in the entire test set (summed over the cross-validation folds) as the number of trials, and 2) using the average over the multiple subjects and repeats as the number correct.
Here's an example in R:
# 3 runs, 10 examples in each (5 of each of two classes), 3 people
# here's how each person did:
(8 + 6 + 5)/30 # person 1: 19 correct, 0.6333333
(10 + 5 + 5)/30 # person 2: 20 correct, 0.6666667
(10 + 8 + 7)/30 # person 3: 25 correct, 0.8333333
# calculate the average accuracy across people, and the number correct it implies
mean(c(0.6333333, 0.6666667, 0.8333333)); # 0.7111111
0.7111111 * 30 # about 21.3 examples correct on average
# calculate the binomial test of the average.
# have to round to a whole number correct: 21.3 rounds to 21.
binom.test(21, 30, 0.5) # p-value = 0.04277
# calculating the binomial summing over the number of examples in
# all the subjects makes a much smaller p-value:
(8 + 6 + 5 + 10 + 5 + 5 + 10 + 8 + 7) # 64
binom.test(64, 90, 0.5); # p-value = 7.657e-05
# t-test version (also common, though not directly comparable)
t.test(c(0.6333333, 0.6666667, 0.8333333), mu=0.5, alternative='greater'); # p-value = 0.03809
# binomial tests for each person separately.
binom.test(19, 30, 0.5) # person 1, p-value = 0.2005
binom.test(20, 30, 0.5) # person 2, p-value = 0.09874
binom.test(25, 30, 0.5) # person 3, p-value = 0.0003249
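For comparison, here is a rough sketch of one group-level permutation test on the same three accuracies: each person's accuracy is reflected around chance (0.5) in every possible combination, and the observed mean is compared to that null distribution. This sign-flipping scheme is just my illustration, not something from the paper.

accs <- c(0.6333333, 0.6666667, 0.8333333)  # the three people's accuracies
flips <- expand.grid(rep(list(c(1, -1)), length(accs)))  # all 2^3 = 8 sign patterns
null.means <- apply(flips, 1, function(s) mean(0.5 + s * (accs - 0.5)))
mean(null.means >= mean(accs))  # 0.125

With only 3 people the exact sign-flipping distribution has just 8 entries, so this version can never give a p-value below 1/8 = 0.125; a fuller permutation scheme would relabel the examples and re-run the classification, which isn't possible from the summary accuracies alone.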
While not the case in this toy example (if we do the binomial on the average, its p-value is larger than the t-test's), I often find that the binomial produces smaller group p-values than the t-test or permutation testing. Nevertheless, I do not often use it, because so much information gets lost: the variation between people, between test sets, and between replications (if done).
It seems odd to use the across-subjects average and then run the exact same test (21 correct out of 30) regardless of whether there were 3 subjects or 300, but it is worse to sum over the number of examples in all the subjects (64 correct out of 90), because then the difference required for significance is so small and the actual test set size is lost completely. It also 'feels' improper to me to use the total number of test set examples in all the cross-validation folds: even if all of the examples are fully independent, the performance on the cross-validation folds certainly is not.
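To make that concrete, here is a toy simulation (entirely made-up numbers, mirroring the 3-runs-of-10 layout above): each run gets its own random bias, so the 30 test examples are no longer independent, and the null distribution of the total number correct comes out wider than the Binomial(30, 0.5) that the summed test assumes.

set.seed(42)
n.runs <- 3; n.per.run <- 10; n.test <- n.runs * n.per.run
sim.correct <- replicate(5000, {
  run.p <- pmin(pmax(rnorm(n.runs, mean=0.5, sd=0.1), 0), 1)  # each run gets its own bias
  sum(rbinom(n.runs, size=n.per.run, prob=run.p))  # examples within a run share that bias
})
var(sim.correct)            # noticeably wider than ...
n.test * 0.5 * 0.5          # ... the 7.5 the binomial assumes
mean(sim.correct >= 21)     # simulated chance of 21+ correct under this dependent null
pbinom(20, n.test, 0.5, lower.tail=FALSE)  # binomial version, about 0.021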
I am certainly not advocating that we ignore all research that used the binomial! But I do tend to interpret the significance levels that result with even more skepticism than usual.
Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview. NeuroImage, 45(1). DOI: 10.1016/j.neuroimage.2008.11.007
see additional posts on this topic: 1
Nice topic
>I often find that the binomial produces smaller group p-values than the t-test or permutation testing.
This could be related to a narrower binomial distribution.
In that paper, Pereira et al. assume (which could be quite legitimate) that for an MVPA classifier "The probability of achieving k successes out of n independent trials is given by the binomial distribution." Therefore, if the binomial distribution is narrower than the permutation distribution, its p-value will be smaller.
To investigate this issue, one can derive several permutation distributions and then quantify how close they truly are to binomial, i.e. whether they follow a binomial distribution with n trials and probability of success 0.5 (two classes), 0.25 (four classes), etc., as in the sketch below.
-Rawi
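Something along the lines Rawi describes might look like the sketch below; the perm.correct values here are just a binomial stand-in so the code runs, and would really come from re-classifying with permuted labels.

n.test <- 30
perm.correct <- rbinom(1000, n.test, 0.5)  # stand-in only; use the real permutation results
mean(perm.correct); var(perm.correct)      # the binomial predicts mean 15 and variance 7.5
qqplot(qbinom(ppoints(length(perm.correct)), n.test, 0.5), perm.correct,
  xlab="Binomial(30, 0.5) quantiles", ylab="permutation-null quantiles")
abline(0, 1)   # points hugging this line mean the permutation null really is close to binomial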