Showing posts with label scaling.

Monday, May 14, 2018

mean pattern subtraction confusion

A grad student brought to my attention several papers warning against subtracting the mean pattern (and other types of normalization) before correlational analyses (including, but not only, RSA). I was puzzled: Pearson correlation is invariant to additive and multiplicative transformations, so how could it be affected by subtracting the mean? Reading closely, my confusion came from what exactly was meant by "subtracting the mean pattern"; it is still the case that not all forms of "normalization" before correlational MVPA are problematic.


The key is where and how the mean-subtraction and/or normalization is done. Using the row- and column-scaling terminology from Pereira, Mitchell, and Botvinick (2009) (datasets arranged with voxels in columns, examples (trials, frames, whatever) in rows): the mean pattern subtraction warned against in Walther et al. (2016) and Garrido et al. (2013) is a particular form of column-scaling, not row-scaling. (A different presentation of some of these same ideas is in Hebart and Baker (2018), Figure 5.)

Here's my illustration of two different types of subtracting the mean; code to generate these figures is below the jump. The original pane shows the "activity patterns" for each of five trials in a four-voxel ROI. The appearance of these five trials after each has been individually normalized (row-scaled) is in the "trial normalized" pane, while the appearance after the mean pattern was subtracted is at right (cf. Figure 1D in Walther et al. (2016) and Figure 1 in Garrido et al. (2013)).

Note that Trial2 is Trial1+15; Trial3 is Trial1*2; Trial4 is the mirror image of Trial1; Trial5 has a shape similar to Trial4 but positioned near Trial1. (Example adapted from Figure 8.1 of Cluster analysis for researchers (1984) by H. Charles Romesburg.)
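As a minimal sketch of this construction (the voxel values here are hypothetical; only the between-trial relationships follow the text):

```python
import numpy as np

# Hypothetical voxel values for Trial1; the figures' actual numbers may differ.
trial1 = np.array([10., 40., 20., 30.])

data = np.vstack([
    trial1,             # Trial1
    trial1 + 15,        # Trial2 = Trial1 + 15
    trial1 * 2,         # Trial3 = Trial1 * 2
    trial1[::-1],       # Trial4: mirror image of Trial1
    trial1[::-1] - 5,   # Trial5: Trial4's shape, positioned near Trial1
])

# "trial normalized" (row-scaling): z-score each trial across its voxels.
row_scaled = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

# "mean pattern subtracted" (a form of column-scaling): subtract the
# across-trial mean pattern from every trial.
mean_subtracted = data - data.mean(axis=0, keepdims=True)

# After row-scaling, trials 1, 2, and 3 are identical...
print(np.allclose(row_scaled[0], row_scaled[1]))  # True
print(np.allclose(row_scaled[0], row_scaled[2]))  # True
# ...but mean pattern subtraction leaves them distinct.
print(np.allclose(mean_subtracted[0], mean_subtracted[1]))  # False
```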


And here are matrix views of the Pearson correlation and Euclidean distance between each pair of trials in each version of the dataset:

The mean pattern subtraction did not change the patterns' appearance as much as the trial normalization did: the patterns are centered on 0, but the extents are the same (e.g., voxel2 ranges from 0 to 80 in the original dataset, -40 to 40 after mean pattern subtraction). The row (trial) normalized image looks very different: centered on zero, but the patterns for trials 1, 2, and 3 are now identical, and the range is much smaller (e.g., voxel2 now runs from a bit less than -1 to a bit more than 1). Accordingly, the trial-normalized similarity matrix is identical to the original when calculated with correlation but different when calculated with Euclidean distance; the mean-pattern-subtracted similarity matrix is identical to the original when calculated with Euclidean distance but different when calculated with Pearson correlation.
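These invariances can be checked numerically; a small sketch, using random data in place of the figures' values:

```python
import numpy as np

def corr_matrix(d):
    # trial-by-trial Pearson correlations (rows = trials)
    return np.corrcoef(d)

def euclid_matrix(d):
    # trial-by-trial Euclidean distances
    diff = d[:, None, :] - d[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 4))  # 5 trials, 4 voxels

row_scaled = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
mean_subtracted = data - data.mean(axis=0, keepdims=True)

# Correlation is unchanged by row-scaling but changed by mean pattern subtraction:
print(np.allclose(corr_matrix(data), corr_matrix(row_scaled)))       # True
print(np.allclose(corr_matrix(data), corr_matrix(mean_subtracted)))  # False (in general)

# Euclidean distance is unchanged by mean pattern subtraction but changed by row-scaling:
print(np.allclose(euclid_matrix(data), euclid_matrix(mean_subtracted)))  # True
print(np.allclose(euclid_matrix(data), euclid_matrix(row_scaled)))       # False (in general)
```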

This is another case where lack of clear terminology can lead to confusion; "mean pattern subtraction" and "subtracting the mean from each pattern" are quite different procedures and have quite different effects, depending on your distance metric (correlation or Euclidean).

Thursday, May 10, 2012

numerical explanation of scaling effects

MS Al-Rawi posted proofs of the effects of scaling in the comments, which I moved here.

for the case when the entire ROI is affected:

Let an instance (i.e., example) belonging to the first class be denoted by the vector x (e.g., x = [x_1, x_2, …, x_d], which has dimension d), and let the one belonging to the second class be denoted by y. Or, more formally,
x ∈ a ..... (1)
y ∈ b ..... (2)
According to the given example, y = x + 1; or, to take the general case, y = x + k with k ≠ 0.

"row-scaling" (normalizing volumewise, across all voxels within each example)

Now, perform scaling according to:
x_normalized = (x - μ_x)/σ_x, ..... (3)
where μ_x and σ_x denote the mean and standard deviation of x, respectively; similarly,
y_normalized = (y - μ_y)/σ_y. ..... (4)

Now, using y = x + k, the mean is:
μ_y = μ_x + k, ..... (5)

which shows that the mean is also shifted by k; so far so good. For the standard deviation we use
σ_y = E[|y - μ_y|]. ..... (6)

I will neglect the norm |..| to simplify the notation. Substituting y = x + k and equations (5) and (6) into (4), we get:
y_normalized = (x + k - (μ_x + k)) / E[x + k - (μ_x + k)], ..... (7)
y_normalized = (x - μ_x) / E[x - μ_x], ..... (8)
y_normalized = (x - μ_x) / σ_x, ..... (9)

which proves that, y_normalized= x_normalized.

This means that, for the above case, we will have exactly the same values (after normalization, or scaling) in both classes; thus it would be impossible for an SVM or any other classifier to separate these examples after the so-called row-scaling (normalizing volumewise, across all voxels within each example).
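A quick numeric check of this conclusion (example values arbitrary):

```python
import numpy as np

def row_scale(v):
    # z-score one example across its voxels (row-scaling)
    return (v - v.mean()) / v.std()

x = np.array([3., 1., 4., 1., 5.])  # an example from class a
k = 1.0
y = x + k                           # class b: uniform shift across the whole ROI

# Row-scaling erases the uniform shift: nothing is left to classify.
print(np.allclose(row_scale(x), row_scale(y)))  # True
```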

"run-column scaling" (normalizing voxelwise, all examples within each run separately)

In this case we have to normalize the x's and y's within the first run, so the normalization uses the mean and standard deviation of that group, with y = x + k. To keep the notation from getting messy, I will give an example assuming only one example per class per run. Let me use the symbol [x_i] to distinguish a vector:
[x_i] = [x_i, y_i] … values from (run # something) ..... (10)
[x_i] = [x_i, x_i + k] ..... (11)

We can easily show that no matter what μ_x_i is, the value k ≠ 0 guarantees that the normalization gives separable values, e.g.,
(x_i - μ_x_i)/σ_x_i ≠ (x_i + k - μ_x_i)/σ_x_i ..... (12)

which shows that the x_i's in class a will differ from the y_i's in class b by a shift of k/σ_x_i.
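This case can also be checked numerically; a sketch with one example per class in the run (arbitrary values):

```python
import numpy as np

x = np.array([3., 1., 4., 1., 5.])  # class a example (voxels as columns)
k = 1.0
y = x + k                           # class b example

run = np.vstack([x, y])             # rows = examples, columns = voxels

# Voxel-wise (column) scaling within the run:
col_scaled = (run - run.mean(axis=0)) / run.std(axis=0)

# Every voxel still separates the classes by a constant amount
# (here class a becomes -1 everywhere and class b becomes +1).
print(col_scaled)
print(np.all(col_scaled[0] < col_scaled[1]))  # True
```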

for the case when only part of the ROI is affected:

The row-wise case again.

In this case let me be more extreme by claiming that only one voxel (let it be the first voxel, with value p) in examples from class b differs from examples in class a; thus,
x = [x_1, x_2, …, x_d] ..... (13)
y = [p, x_2, …, x_d] ..... (14)

We can easily show that
μ_y = μ_x + (p - x_1)/d ..... (15)

Then y - μ_y gives:
[p, x_2, …, x_d] - (μ_x + (p - x_1)/d)
= x + (p - x_1) e_1 - μ_x - (p - x_1)/d
= x - μ_x + Q, ..... (16)
where e_1 = [1, 0, …, 0] and Q = (p - x_1)(e_1 - 1/d), with 1/d subtracted from every component. So
y_normalized = (x - μ_x + Q)/E[x - μ_x + Q], ..... (17)
y_normalized ≠ x_normalized.

So, for y_normalized to equal x_normalized, the following condition must hold:
Q = 0, ..... (18)
(p - x_1)(e_1 - 1/d) = 0, which (for d > 1) holds only if p = x_1. ..... (19)

Therefore, our classifier will still be able to classify these examples even when only one voxel is shifted.
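A numeric check of the single-voxel case (arbitrary values, with p = x_1 + 2):

```python
import numpy as np

def row_scale(v):
    # z-score one example across its voxels (row-scaling)
    return (v - v.mean()) / v.std()

x = np.array([3., 1., 4., 1., 5.])  # class a
y = x.copy()
y[0] = x[0] + 2.0                   # class b: only the first voxel differs

# Row-scaling does NOT equalize the classes, so information survives.
print(np.allclose(row_scale(x), row_scale(y)))  # False
```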

Note: A similar proof can be constructed if we change 10 voxels, or any number of voxels. Similar proofs could also be constructed for the other two cases.

Monday, May 7, 2012

scaling and searchlight analyses: edge effects

The previous two posts demonstrated the effect of different types of data scaling (normalizing/detrending) on ROI-based analysis when a constant mass-univariate effect is present in either all of the voxels equally or just some of them. Here I'll talk about a couple of implications for searchlight analysis.


Each searchlight is a small ROI. Doing mean-subtraction or row-scaling within each searchlight (or using a metric insensitive to magnitude, like correlation) can reduce the likelihood that uniform differences are responsible for the classification (but can cause edge effects). Performing the scaling on the entire brain (or anything bigger than the searchlight) does not eliminate this possibility (and could potentially introduce artifacts, as described in the previous post). Things are never clean in real situations ...

This example illustrates some edge effects that can happen with scaling in searchlights.

Say this is a 2d set of voxels in which we're doing a searchlight analysis (the yellow square). The reddish squares are  voxels that differ across the conditions, say by having activation  in class 'b' equal to activation in class 'a' + 1 (a uniform difference like in the scaling examples).
Here the "informative" voxels from the searchlight analysis are shown in light green, if we don't do scaling within each searchlight. (I colored all voxels for which the searchlight contains at least one of the reddish voxels).
And here are the "informative" voxels from the searchlight analysis if we do row-scaling (or mean-subtraction) within each searchlight: the left-side blob is now a doughnut, since the central reddish voxels are no longer marked as informative. This happens because the activation difference is removed by the scaling in searchlights completely contained within the blob, but not in ones that contain only some of the blob.
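The doughnut effect can be sketched numerically; here is a toy one-dimensional analogue of the figure (blob location and sizes made up for illustration):

```python
import numpy as np

n_vox, radius = 13, 1        # 13 voxels in a row; searchlights are 3 voxels wide
blob = list(range(4, 9))     # voxels 4-8 form the "blob"

rng = np.random.default_rng(1)
a = rng.normal(size=n_vox)   # class a pattern
b = a.copy()
b[blob] += 1.0               # class b: uniform +1 difference inside the blob

def informative(center, scale_within):
    # does any class difference remain within this searchlight?
    lo, hi = center - radius, center + radius + 1
    pa, pb = a[lo:hi], b[lo:hi]
    if scale_within:         # mean-subtract within the searchlight
        pa = pa - pa.mean()
        pb = pb - pb.mean()
    return not np.allclose(pa, pb)

centers = range(radius, n_vox - radius)
no_scaling   = [c for c in centers if informative(c, False)]
with_scaling = [c for c in centers if informative(c, True)]

print(no_scaling)    # [3, 4, 5, 6, 7, 8, 9]: every searchlight touching the blob
print(with_scaling)  # [3, 4, 8, 9]: the doughnut; centers 5-7 (fully inside) drop out
```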

This "doughnut" effect can be trouble if we want to detect all of the reddish voxels: we're missing the ones in the center of the blob, which presumably would have the strongest effect and be most likely to spatially overlap across subjects. But it can also be trouble if we don't want to detect voxels with a mass-univariate difference, as pointed out in an example by Jesse Rissman on the mvpa-toolbox list.