Monday, September 2, 2013

linear svm behavior: discontinuous information detection

Since starting with MVPA I've struggled with connecting classification accuracy to the properties of the voxels that (as a group) produce that accuracy: does high accuracy mean that all the voxels are informative? Just some of them? How can we tell? Even with a linear SVM these are not easy questions to answer.

Some of my efforts to figure out what linear SVMs are doing with fMRI data contributed to the examples in my recent paper about searchlight analysis. In this post I'll describe figure 4 in the paper (below); this is in the supplemental as Example 1 (code is example1.R).

The simulated dataset is for just one person, and has a single ROI of 500 voxels, two classes (balanced, so chance is 0.5), and four runs. Classification is with a linear SVM (c=1), averaged over a four-fold cross-validation (partitioning on the runs).

A key for this example is that the 500 voxels were created to be equally informative: the values for each voxel were created by choosing random numbers (from a uniform distribution) for one class's examples, and then adding 0.1 to each value for the examples of the other class. For concreteness, here's Figure 1 from the Supplemental, showing part of the input data:

So, we have 500 voxels, the data for each constructed so that each voxel is equally informative (in the sense of the amount of activation difference between the classes). Now, let's classify. If we classify with all 500 voxels the accuracy is perfect, as expected, since we know the classes differ. But what if we use fewer than all the voxels? I set up the example to classify sequentially larger voxel subsets: each larger subset is made by adding a voxel to the previous subset, so that each subset includes the voxels from all smaller subsets. For example, the two-voxel subset includes voxels 1 and 2, the three-voxel subset has voxels 1, 2, and 3, the four-voxel subset voxels 1, 2, 3, and 4, etc.
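To make the setup concrete, here is a rough Python sketch of how such a dataset, the nested voxel subsets, and the partition-on-runs folds could be built. It is not the paper's example1.R: the function names, the number of examples per class per run, the [0, 1) range of the uniform distribution, and the choice to draw independent random values for both classes before shifting one of them are all my assumptions for illustration. The classifier itself is left out.

```python
import random


def simulate_roi(n_voxels=500, n_runs=4, per_class_per_run=10, shift=0.1, seed=0):
    """Return (data, labels, runs); data[i][v] is example i's value at voxel v.

    Assumptions (not from the paper): uniform values in [0, 1), and
    10 examples per class per run.
    """
    rng = random.Random(seed)
    data, labels, runs = [], [], []
    for run in range(n_runs):
        for label in (0, 1):
            for _ in range(per_class_per_run):
                # Class 0: uniform noise. Class 1: uniform noise plus the same
                # constant shift at every voxel, so all voxels are equally informative.
                row = [rng.random() + shift * label for _ in range(n_voxels)]
                data.append(row)
                labels.append(label)
                runs.append(run)
    return data, labels, runs


def nested_subset(data, k):
    """Keep only the first k voxels of every example (voxels 1..k in the post)."""
    return [row[:k] for row in data]


def leave_one_run_out(runs):
    """Yield (train_indices, test_indices), holding out one run at a time."""
    for held_out in sorted(set(runs)):
        train = [i for i, r in enumerate(runs) if r != held_out]
        test = [i for i, r in enumerate(runs) if r == held_out]
        yield train, test


data, labels, runs = simulate_roi()
three_voxels = nested_subset(data, 3)   # voxels 1, 2, and 3
four_voxels = nested_subset(data, 4)    # the same three, plus voxel 4
folds = list(leave_one_run_out(runs))   # four folds, one per run
```

Each k-voxel subset would then be handed to a linear SVM inside the four folds, and the mean accuracy recorded for that k.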

Figure 4 in the manuscript (below) shows what happens for one run of the simulation: the number of voxels increases from 1 to 500 along the x-axis, and accuracy from 0.5 (chance) to 1 (perfect) along the y-axis.

In the manuscript I call this an example of discontinuous information detection: the accuracy does not increase smoothly from 0.5 to 1, but rather jumps around. For example, the 42-voxel subset classifies at 0.81. Adding voxels one at a time, we get accuracies of 0.72, 0.75, 0.78, 0.94 ... the accuracy went down abruptly, then back up again. Why? My best guess is that it has to do with the shape of the point clouds in hyperspace: each time we add a voxel we also add a dimension, which changes the point clouds and makes it easier (or not) for the linear SVM to separate the classes.

What does this mean in practice? I guess it makes me think of classification accuracies as "slippery": somewhat difficult to grasp and unpredictable, so you need to use extreme caution before thinking that you understand their behavior.

Discontinuous information detection is perhaps also an argument for stability testing. In the Figure 4 run (graph above), the accuracy didn't change much after around 210 voxels: it was consistently very high. It is much easier to interpret our results if we're pretty confident that we're in one of these stable zones. For example, suppose we have an anatomic ROI that classifies well. Now, suppose we add in the border voxels, one at a time, classifying each "ROI + 1" subset. I'd be a lot more confident that the ROI captured a truly informative set of voxels if the accuracy stayed more-or-less the same as individual voxels were added than if it swung widely.
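One simple way to put a number on that kind of stability check is sketched below in Python. The function names and the 0.05 tolerance are hypothetical choices of mine, not anything from the paper; the accuracies are the ones quoted above for the 42-voxel subset and the next few subsets, standing in for a base ROI and its "ROI + 1" neighbors.

```python
def accuracy_spread(accuracies):
    """Range (max minus min) of a set of accuracies."""
    return max(accuracies) - min(accuracies)


def is_stable(base_accuracy, neighbor_accuracies, tolerance=0.05):
    """True if every neighboring subset classifies within `tolerance`
    of the base accuracy. The default tolerance is an arbitrary choice."""
    return all(abs(a - base_accuracy) <= tolerance for a in neighbor_accuracies)


base = 0.81                          # the 42-voxel subset
neighbors = [0.72, 0.75, 0.78, 0.94]  # adding voxels one at a time

print(is_stable(base, neighbors))                      # False
print(round(accuracy_spread([base] + neighbors), 2))   # 0.22
```

A spread of about 0.22 across five adjacent subsets is nowhere near a plateau; in a stable zone the same check would give a spread close to zero.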

We want to make sure we're on a plateau of high accuracy (robust to minor changes in voxels or processing), not that we happened to land upon a pointy mountain top, such that any tiny variation will send us back to chance. And this example shows that linear SVMs can easily make "pointy mountains" with fMRI data.