Thursday, May 10, 2012

numerical explanation of scaling effects

MS Al-Rawi posted proofs of the effects of scaling in the comments, which I moved here.

for the case when the entire ROI is affected:

Let an instance (i.e., example) belonging to the first class be denoted by the vector x (e.g., x = [x_1, x_2, …, x_d], which has dimension d), and let the one belonging to the second class be denoted by y. Or, more formally,
x ∈ a …..(1)
y ∈ b …..(2)
According to the given example, y = x + 1; or, to discuss the general case, y = x + k such that k ≠ 0.

"row-scaling" (normalizing volumewise, across all voxels within each example)

Now, to perform scaling according to:
x_normalized = (x - μ_x)/σ_x, …..(3)
where μ_x and σ_x denote the mean of x and the standard deviation of x, respectively, and
y_normalized = (y - μ_y)/σ_y. …..(4)

Now, by using y = x + k, when finding the mean we will have:
μ_y = μ_x + k, ….(5)

which shows that the mean is also shifted by k. So far so good? Probably not. To find the standard deviation we use
σ_y = sqrt( E[(y - μ_y)^2] ). …..(6)

Now, by substituting y = x + k and equation (5) into (6), the shift cancels inside the expectation:
σ_y = sqrt( E[(x + k - (μ_x + k))^2] ) = sqrt( E[(x - μ_x)^2] ) = σ_x, ….(7)

so the standard deviation is unchanged by the shift. Substituting y = x + k, (5) and (7) into (4) we get:
y_normalized = (x + k - (μ_x + k))/σ_x, ….(8)
y_normalized = (x - μ_x)/σ_x, …(9)

which proves that y_normalized = x_normalized.

This means that, for the above case, we will have exactly the same values (after normalization, or scaling) in both classes; thus, it would be impossible for an SVM or any other classifier to separate these examples after the so-called row-scaling (normalizing volumewise, across all voxels within each example).
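The row-scaling result above is easy to check numerically. Here is a minimal sketch (my own illustration, not from the original comment): two examples that differ only by a constant shift k become identical once each is z-scored across its own voxels.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                      # number of voxels per example
x = rng.normal(size=d)      # an example from class a
k = 1.0
y = x + k                   # the corresponding example from class b

def row_scale(v):
    """Normalize one example across all of its voxels (row-scaling)."""
    return (v - v.mean()) / v.std()

# The shift k is removed entirely: the normalized examples coincide.
print(np.allclose(row_scale(x), row_scale(y)))  # True
```

The mean absorbs the shift and the standard deviation ignores it, so the classifier sees two identical vectors.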

"run-column scaling" (normalizing voxelwise, all examples within each run separately)

In this case, we will have to normalize the x's and y's within each run, so the normalization involves finding the mean and the standard deviation of each voxel across the examples of that run, with y = x + k. I don't want the notation to get messy, so I will give an example assuming only one example per class per run. Let me use the symbol [v_i] to denote the vector of values that voxel i takes across the examples of a run:
[v_i] = [x_i, y_i] …values from (run# something) ....(10)
[v_i] = [x_i, x_i + k] .....(11)

We can easily show that no matter what μ_v_i is, the value k ≠ 0 will always make sure that the normalization gives separable values, e.g.,
(x_i - μ_v_i)/σ_v_i ≠ (x_i + k - μ_v_i)/σ_v_i ....(12)

which shows that the x_i's in class a will differ from the y_i's in class b by a shift of k/σ_v_i.
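A quick numerical sketch of this case (again my own illustration): with one example per class in a run and y = x + k, every voxel sees the pair [x_i, x_i + k], so after voxelwise normalization the two classes sit at a fixed, nonzero distance and remain separable.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x = rng.normal(size=d)          # class a example in this run
k = 1.0
y = x + k                       # class b example in this run
run = np.vstack([x, y])         # shape (2 examples, d voxels)

# "Run-column scaling": z-score each voxel (column) across the
# examples within the run.
normalized = (run - run.mean(axis=0)) / run.std(axis=0)

# Each voxel's pair [x_i, x_i + k] has mean x_i + k/2 and std k/2,
# so the class-a example maps to -1 everywhere and class b to +1.
print(normalized[0])   # all -1
print(normalized[1])   # all +1
```

In this extreme two-example case the normalization actually makes the classes trivially separable, which is the point of the proof: the shift k survives voxelwise scaling.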

for the case when only part of the ROI is affected:

The row-wise case again.

In this case let me be more extreme by claiming that only one voxel (let it be the first voxel, having a value p) differs between examples from class b and examples from class a; thus,
x=[x_1, x_2,…, x_d] ....(13)
y=[p, x_2,…, x_d]....(14)

We can easily show that
μ_y = μ_x + (p - x_1)/d ...(15)

Writing y = x + (p - x_1)·e_1, where e_1 = [1, 0, …, 0] and 1 = [1, 1, …, 1], we find that y - μ_y gives:
y - μ_y = x + (p - x_1)·e_1 - μ_x - ((p - x_1)/d)·1
= x - μ_x + (p - x_1)·(e_1 - (1/d)·1)
= x - μ_x + Q ..................(16)
y_normalized = (x - μ_x + Q)/σ_y …..(17)
y_normalized ≠ x_normalized

So, for y to equal x after row-wise normalization, the following condition should hold:
Q = 0, …..(18)
(p - x_1)·(e_1 - (1/d)·1) = 0, which (for d > 1) holds only if p = x_1. ....(19)

Therefore, our classifier will still be able to separate these examples even when only one voxel is shifted.
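This case, too, can be checked numerically. A minimal sketch (my own illustration): shift a single voxel in one example, and row-scaling does not erase the difference, so the normalized examples stay distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x = rng.normal(size=d)       # class a example
y = x.copy()
y[0] = x[0] + 2.0            # p = x_1 + 2: only the first voxel differs

def row_scale(v):
    """Normalize one example across all of its voxels (row-scaling)."""
    return (v - v.mean()) / v.std()

# Unlike the whole-ROI shift, a single-voxel change survives
# row-scaling: the normalized examples are no longer identical.
print(np.allclose(row_scale(x), row_scale(y)))  # False
```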

Note: A similar proof can be constructed if we change 10 voxels, or any number of voxels. Similar proofs could also be constructed for the other two cases.