I was intrigued by the use of searchlight analysis by Dirk Walther et al. in "Natural Scene Categories Revealed in Distributed Patterns of Activity in the Human Brain." The searchlight analysis was not the primary interest but rather a control.

The main analysis was ROI-based, with both functional and anatomical ROIs. Much of the paper describes and interprets the accuracy of these ROIs, but they also wanted to "explore brain regions outside of our predefined ROIs," which they did with the searchlight analysis.

The authors ran a searchlight analysis in each subject, with the searchlight sized to approximate the number of voxels in the ROIs. Matching the searchlight and ROI size improves interpretability, given how much spatial structure there is in fMRI data. It's obviously an imprecise matching, since the ROIs followed anatomical and/or functional boundaries and the searchlights did not, but it is a start.

Reading the paper, I got the impression that it could have stood on the ROI-based analysis alone, but the authors (or reviewers) asked if information (as they measured it) was present in other areas. The searchlight analysis was useful for answering that question; it would certainly have been worrying if voxels throughout the brain showed the same classification characteristics as the ROIs. The searchlight analysis also served as a sort of sensitivity test: accurate searchlights were found near the ROIs, despite none of the searchlights exactly matching the shape of the ROIs. This indicates that the classifications found in those areas were not some fluke of the exact voxels picked but rather present in multiple subsets of voxels in those areas.

Walther, D., Caddigan, E., Fei-Fei, L., & Beck, D. (2009). Natural Scene Categories Revealed in Distributed Patterns of Activity in the Human Brain Journal of Neuroscience, 29 (34), 10573-10581 DOI: 10.1523/JNEUROSCI.0559-09.2009

## Saturday, May 26, 2012

## Monday, May 14, 2012

### high dimension, low sample size

MVPA datasets usually fall into the "high dimension, low sample size" (HDLSS) category: tens of examples, hundreds of voxels. Strange things can happen in hyperspace ...

I'm accustomed to thinking of my images as points in hyperspace (number of dimensions == number of voxels). Currently I'm working a dataset that classifies well (two classes, linear svm, ROI-based). There is one training set and two testing sets, one of which is classified more accurately than the other. We want to describe the characteristics of the testing sets that underlie this difference.

A first idea was to compute the distance between the points in each of the two classes and datasets: are the points in the more-accurate dataset closer together than the less-accurate dataset? Or the centroids further apart? But calculating the distances hasn't been very helpful: both datasets seem extremely similar.

Perhaps I'm running into the phenomenon described by Hall et al. (2005) and Durrant & Kaban (2009): with many dimensions all pairwise distances between points can asymptotically approach the same value: all the points are equidistant. Durrant & Kaban also discuss "distance concentration": some datasets have all points quite evenly spaced while others (also HDLSS) do not, and the datasets with less concentration are easier to classify and characterize. Perhaps looking at the distance concentration of fMRI datasets can be useful for explaining why some datasets are extremely difficult to classify, or show very low classification accuracy ("anti-learning").

One other discussion that struck me in Hall, Marron, & Neeman was that of how many classifiers can perform similarly on HDLSS datasets. Many people use linear svms as a sort of default for fMRI MVPA, partly because comparisons don't usually show a large difference in performance with other classifiers. Perhaps these HDLSS phenomena are part of the reason why performance can be so similar.

Hall, Peter, Marron, J. S., Neeman, Amnon. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

http://dx.doi.org/10.1111/j.1467-9868.2005.00510.x

Durrant, Robert J. and Ata Kaban. 2009. When is 'nearest neighbour' meaningful: A converse theorem and implications. Journal of Complexity.

http://dx.doi.org/10.1016/j.jco.2009.02.011

I'm accustomed to thinking of my images as points in hyperspace (number of dimensions == number of voxels). Currently I'm working a dataset that classifies well (two classes, linear svm, ROI-based). There is one training set and two testing sets, one of which is classified more accurately than the other. We want to describe the characteristics of the testing sets that underlie this difference.

A first idea was to compute the distance between the points in each of the two classes and datasets: are the points in the more-accurate dataset closer together than the less-accurate dataset? Or the centroids further apart? But calculating the distances hasn't been very helpful: both datasets seem extremely similar.

Perhaps I'm running into the phenomenon described by Hall et al. (2005) and Durrant & Kaban (2009): with many dimensions all pairwise distances between points can asymptotically approach the same value: all the points are equidistant. Durrant & Kaban also discuss "distance concentration": some datasets have all points quite evenly spaced while others (also HDLSS) do not, and the datasets with less concentration are easier to classify and characterize. Perhaps looking at the distance concentration of fMRI datasets can be useful for explaining why some datasets are extremely difficult to classify, or show very low classification accuracy ("anti-learning").

One other discussion that struck me in Hall, Marron, & Neeman was that of how many classifiers can perform similarly on HDLSS datasets. Many people use linear svms as a sort of default for fMRI MVPA, partly because comparisons don't usually show a large difference in performance with other classifiers. Perhaps these HDLSS phenomena are part of the reason why performance can be so similar.

**references**:Hall, Peter, Marron, J. S., Neeman, Amnon. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

http://dx.doi.org/10.1111/j.1467-9868.2005.00510.x

Durrant, Robert J. and Ata Kaban. 2009. When is 'nearest neighbour' meaningful: A converse theorem and implications. Journal of Complexity.

http://dx.doi.org/10.1016/j.jco.2009.02.011

## Thursday, May 10, 2012

### numerical explanation of scaling effects

MS Al-Rawi posted proofs of the effects of scaling in the comments, which I moved here.

x ϵ a…..(1)

y ϵ b…..(2)

According to the given example, y=x+1, or let’s discuss the general case, y = x+ k such that k≠0.

####

Now, to perform scaling according to:

x_normalized=(x-μ_x)/σ_x, …..(3)

where, μ_x and σ_x denote the mean of x and the standard deviation of x, respectively.

y_normalized =(y-μ_y)/σ_y. …..(4)

Now, by using y=x+k, when finding the mean we will have:

μ _y = μ _x +k, ….(5)

which shows that the mean is also shifted by k, so good so far? Probably not. To find the standard deviation we use,

σ_y=| E[y- μ _y]|. …..(6)

I will neglect the norm |..| to simplify the notation. Now, by substituting; y=x+k, equations (6) and (5) into (4) we get:

y_normalized =( x + k -(μ_x + k) )/E[ x + k –(μ_x + k)], ….(7)

y_normalized =(x-μ_x) )/E[ x –μ_x)], ….(8)

y_normalized =(x-μ_x)/ σ_x …(9)

which proves that, y_normalized= x_normalized.

Which means that for the above case, we will have exactly the same values (after normalization, or, scaling) in both classes, thus, it would be impossible for SVM or any other classifier to separate these examples after the so called row-scaling (normalizing volumewise, across all voxels within each example).

[x_i] =[x_i, y_i] …values from (run# something) ....(10)

[x_i] =[x_i, x_i+k] .....(11)

We can easily show that no matter how much μ_x_i was, the value k≠0 will always make sure that the normalization will give separable values, e.g.,

(x_i- μ_x_i)/ σ_x_i ≠ (x_i+k - μ_x_i)/ σ_x_i ....(12)

which shows that x_i’s in class a will have a shift difference from y_i’s in class b by a value of k/σ_x_i

x=[x_1, x_2,…, x_d] ....(13)

y=[p, x_2,…, x_d]....(14)

We can easily show that

μ_y= μ_x +(p-x_1)/d...(15)

we know that, y- μ _y gives:

[p, x_2,…x_d] - μ_x +(p-x_1)/d],

= [p, x_2,…x_d]- μ_x +(p-x_1)/d

= x - μ_x + (p-x_1)(1+1/d)

= x- μ_x +Q..................(16)

y_normalized= (x- μ_x +Q)/E[x- μ_x +Q] …..(16)

y_normalized ≠ x_normalized

So, for x to be equal to y after row-wise normalization, the following condition should be valid

Q=0 …..(18)

(p-x_1)(1+1/d)=0, only if p=x_1. ....(19)

Therefore, our classifier will be able to classify these examples even shifting one voxel.

Note: A similar proof can be constructed if we change 10 voxels, or any number of voxels. Similar proofs could also be constructed for the other two cases.

## for the case when the entire ROI is affected:

Let an instance (i.e., example) belonging to the first class be denoted by the vector x (e.g., x=[x_1,x_2,…,x_d], which has a d dimension), and let the one belonging to the second class be denoted as y. Or, more formally,x ϵ a…..(1)

y ϵ b…..(2)

According to the given example, y=x+1, or let’s discuss the general case, y = x+ k such that k≠0.

####
**"row-scaling" (normalizing volumewise, across all voxels within each example)**

Now, to perform scaling according to:x_normalized=(x-μ_x)/σ_x, …..(3)

where, μ_x and σ_x denote the mean of x and the standard deviation of x, respectively.

y_normalized =(y-μ_y)/σ_y. …..(4)

Now, by using y=x+k, when finding the mean we will have:

μ _y = μ _x +k, ….(5)

which shows that the mean is also shifted by k, so good so far? Probably not. To find the standard deviation we use,

σ_y=| E[y- μ _y]|. …..(6)

I will neglect the norm |..| to simplify the notation. Now, by substituting; y=x+k, equations (6) and (5) into (4) we get:

y_normalized =( x + k -(μ_x + k) )/E[ x + k –(μ_x + k)], ….(7)

y_normalized =(x-μ_x) )/E[ x –μ_x)], ….(8)

y_normalized =(x-μ_x)/ σ_x …(9)

which proves that, y_normalized= x_normalized.

Which means that for the above case, we will have exactly the same values (after normalization, or, scaling) in both classes, thus, it would be impossible for SVM or any other classifier to separate these examples after the so called row-scaling (normalizing volumewise, across all voxels within each example).

#### "run-column scaling" (normalizing voxelwise, all examples within each run separately)

In this case, we will have to normalize x’s and y’s in the first run, thus the normalization will contain finding the mean and the standard deviation of this group, such that y=x+k. I don’t want the notations to be quite messy, so I will give an example assuming only one example per run. Let me use the symbol [x_i] to distinguish a vector.[x_i] =[x_i, y_i] …values from (run# something) ....(10)

[x_i] =[x_i, x_i+k] .....(11)

We can easily show that no matter how much μ_x_i was, the value k≠0 will always make sure that the normalization will give separable values, e.g.,

(x_i- μ_x_i)/ σ_x_i ≠ (x_i+k - μ_x_i)/ σ_x_i ....(12)

which shows that x_i’s in class a will have a shift difference from y_i’s in class b by a value of k/σ_x_i

## for the case when only part of the ROI is affected:

#### The row-wise case again.

In this case let me be more extreme by claiming that only one voxel (let it be the first voxel, having a value p) in examples from class b differs from examples in class a, thus,x=[x_1, x_2,…, x_d] ....(13)

y=[p, x_2,…, x_d]....(14)

We can easily show that

μ_y= μ_x +(p-x_1)/d...(15)

we know that, y- μ _y gives:

[p, x_2,…x_d] - μ_x +(p-x_1)/d],

= [p, x_2,…x_d]- μ_x +(p-x_1)/d

= x - μ_x + (p-x_1)(1+1/d)

= x- μ_x +Q..................(16)

y_normalized= (x- μ_x +Q)/E[x- μ_x +Q] …..(16)

y_normalized ≠ x_normalized

So, for x to be equal to y after row-wise normalization, the following condition should be valid

Q=0 …..(18)

(p-x_1)(1+1/d)=0, only if p=x_1. ....(19)

Therefore, our classifier will be able to classify these examples even shifting one voxel.

Note: A similar proof can be constructed if we change 10 voxels, or any number of voxels. Similar proofs could also be constructed for the other two cases.

## Tuesday, May 8, 2012

### NIfTI image handedness

The handedness of NIfTI images has given me some major headaches. Here's my take on the situation, in the hope others can avoid some of the problems I ran into. If you have other workarounds, solutions, or alternative explanations I'd love to hear them! Some of this is specific to R, some isn't.

Here is my illustration of the left- and right-handed orientation for brain imaging data. The difference is where the origin (voxel at 0,0,0) is located: does the i axis increase left-to-right or right-to-left?

Both orientations are valid in NIfTI files. There isn’t a flag in the NIfTI header to indicate whether the data in any particular image is in one handedness or the other; the qform and sform matrices are supposed to make this apparent.

Here is my illustration of the left- and right-handed orientation for brain imaging data. The difference is where the origin (voxel at 0,0,0) is located: does the i axis increase left-to-right or right-to-left?

Both orientations are valid in NIfTI files. There isn’t a flag in the NIfTI header to indicate whether the data in any particular image is in one handedness or the other; the qform and sform matrices are supposed to make this apparent.

My problems arose because I was reading left-handed images with oro.nifti and AnalyzeFMRI: the files are always read in as if they were right-handed, causing left-right flipping. This isn’t a matter of neurological vs.
radiological convention: the flipped images are mirror-reversed, causing the
left side of an overlay to appear on the right side of the anatomically template
(i.e. the template and overlay do not agree on which side is left) if the images are output then viewed in a program such as MRIcron.

In my testing, image files can be read and written repeatedly from
R without flipping:

**IF**they are right-handed arrays. oro.nifti writes NIfTI files assuming the source R array is right-handed. NIfTI files are also read assuming they are right-handed, producing a flipped R array if they are not.
My work-around has been to check the handedness of images whenever I read an image from a software package (spm, afni, etc) into R, and "manually" change the array to a right-handed orientation if it's not already.

Here is R code and images showing the flipping.

## Monday, May 7, 2012

### scaling and searchlight analyses: edge effects

The previous two posts demonstrated the effect of different types of data scaling (normalizing/detrending) on ROI-based analysis when a constant mass-univariate effect is present is either all of the voxels equally or just some of them. Here I'll talk about a couple of implications for searchlight analysis.

This "doughnut" effect can be trouble if we

Each searchlight is a small ROI. Doing
mean-subtraction or row-scaling within each searchlight (or using a metric
insensitive to magnitude, like correlation) can maybe reduce the likelihood that uniform differences are not responsible for the classification (but can cause edge effects). Performing the scaling
on the entire brain (or anything bigger than the searchlight) does not
eliminate this possibility (and could potentially introduce artifacts, as described in the previous post). Things are never clean in real situations ...

This example illustrates some edge effects that can happen with scaling in searchlights.

Say this is a 2d set of voxels in which we're doing a searchlight analysis (the yellow square). The reddish squares are voxels that differ across the conditions, say by having activation in class 'b' equal to activation in class 'a' + 1 (a uniform difference like in the scaling examples).

Here the "informative" voxels from the searchlight analysis are shown in light green, if we don't do scaling within each searchlight. (I colored all voxels for which the searchlight contains at least one of the reddish voxels).

And here are the "informative" voxels from the searchlight analysis if we do row-scaling (or mean-subtraction) within each searchlight: the left-side blob is now a doughnut: the center reddish voxels are not included as informative voxels. This happens because the activation difference is removed by the scaling in searchlights completely contained within the blob, but not in ones that contain only some of the blob.

This "doughnut" effect can be trouble if we

**want**to detect all of the reddish voxels: we're missing the ones in the center of the blob, which presumably would have the strongest effect and be most likely to spatially overlap across subjects. But it can also be trouble if we**don't want**to detect voxels with a mass-univariate difference, as pointed out in an example by Jesse Rissman on the mvpa-toolbox list.## Sunday, May 6, 2012

### scaling, part 2: only part of the ROI affected

**baseline**

This example is like the one in the previous post, except that only the
first 10 voxels in class b are 1 + (a value); the others are the same as class
a. Here is the baseline data:

This toy dataset is 25 voxels in two classes (a and b) and two runs, with two examples of each. The first two columns of the class b images are darker than the corresponding class a images (since 1 was added to the voxel values) while the other 15 voxels are identical in both classes.These baseline images are classified perfectly by a linear svm: the classifiers had no problem finding the ten informative voxels.

**run-column scaling**

Next I normalize the voxels individually, separately in each run.

The colors changed since the voxel values changed but the informative voxels are still distinguishable: the first two columns of voxels are different in the two rows while the last three are identical. This set of images is still classified perfectly by the linear svm.

**row-scaling**

Here I normalize each image separately (across all the voxels in each image).

The key thing to notice here is that the first two columns are no longer the only informative voxels: all of the voxels are different in the two classes, even though I created the dataset with only the first ten voxels having information. The images are still classified perfectly.

**row-wise mean-subtraction**

Subtracting the mean row-wise (across all voxels in each image separately) also adds information to the uninformative voxels, and the images are yet again classified perfectly.

**discussion**

The images were classified perfectly in all cases: no type of scaling used here (the most common methods for MVPA) was able to remove the signal. In this example I added a constant to some of the voxels, while in the previous I added a constant to all of the voxels. Generally, a constant difference (e.g. uniformly more BOLD in one condition than another) can be removed (by row-scaling or mean-subtraction) when present equally in all voxels, but not if only some of the voxels are affected.

Notice the "leaking" of information across voxels during row-scaling and row-wise mean-subtraction: voxels that were NOT informative before scaling WERE informative after scaling. This is not a problem if you will only make inferences about the group of voxels as a whole - you don't try to see which voxels within the ROI were contributing to the classification. But it would be possible to introduce distortions if a ROI was row-scaled and then a searchlight analysis was performed: some voxels could look informative that were not before the scaling.

Code for this example is in the same R file as the previous example.

## Saturday, May 5, 2012

### scaling, part 1: entire ROI affected

This is the first of several posts demonstrating the impact of scaling of MVPA. This first case is of a ROI-based analysis in which every voxel has higher BOLD for one condition than the other. This is obviously a toy situation, but analogous to a uniform mass-univariate within the ROI. In later posts I'll show cases where only some of the voxels are affected and the impact on searchlight analyses.

I'll take a flat ROI of 25 voxels, with two conditions ("a" and "b), two runs, and two examples in each run. I filled the voxels for condition "a" with random numbers, then added 1 to each voxel value to get the corresponding image for condition "b".

Here are the values for the first 10 voxels of the dataset:

The difference between the datasets is obvious to the eye (class "b" is more blue than class "a") and it is classified perfectly by a linear svm.

Next, I perform run-column scaling (normalizing voxelwise, all examples within each run separately)).

This does not remove the difference between class a and b in each voxel, though the values are changed. The dataset is still classified perfectly by a linear svm.

But row-scaling (normalizing volumewise, across all voxels within each example) does remove the difference between the a and b classes, so classification fails.

Likewise, row-subtraction (removing the mean from all voxels in each example) will remove the difference between the classes and cause classification to fail.

To recap: if you're performing a ROI-based MVPA and have a uniform effect (e.g. all voxels have a higher BOLD for one type of stimuli than the other) row-scaling and row-subtraction will eliminate this information, but column-scaling will not.

R code for these analyses is available here.

I'll take a flat ROI of 25 voxels, with two conditions ("a" and "b), two runs, and two examples in each run. I filled the voxels for condition "a" with random numbers, then added 1 to each voxel value to get the corresponding image for condition "b".

Here are the values for the first 10 voxels of the dataset:

The difference between the datasets is obvious to the eye (class "b" is more blue than class "a") and it is classified perfectly by a linear svm.

Next, I perform run-column scaling (normalizing voxelwise, all examples within each run separately)).

This does not remove the difference between class a and b in each voxel, though the values are changed. The dataset is still classified perfectly by a linear svm.

But row-scaling (normalizing volumewise, across all voxels within each example) does remove the difference between the a and b classes, so classification fails.

Likewise, row-subtraction (removing the mean from all voxels in each example) will remove the difference between the classes and cause classification to fail.

To recap: if you're performing a ROI-based MVPA and have a uniform effect (e.g. all voxels have a higher BOLD for one type of stimuli than the other) row-scaling and row-subtraction will eliminate this information, but column-scaling will not.

R code for these analyses is available here.

### introducing MVPA Meanderings

Hello! This is a blog about fMRI MVPA methodology. I plan to post occasionally about methodological issues that strike my fancy, such as potential pitfalls, debates about best practices, guidelines, and interesting papers.

Subscribe to:
Posts (Atom)