The null distributions are narrower when four runs are included:
Since the null distributions are wider with two runs and the classification accuracy is worse, the p-values are less significant for two runs than for four (tables below). For example, repetition #9 had an accuracy of 0.72 with two runs, which resulted (when permuting the training data only) in a rank of 10. Repetition #6 with four runs also had an accuracy of 0.72, but this time had a rank of 0 (the true-labeled data was more accurate than all of the permutations).
[table: p-values with two runs]
[table: p-values with four runs]
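To make the rank-to-p-value relationship concrete, here is a minimal sketch of how a permutation rank is usually converted to a p-value. The permutation count (1,000) is an assumption for illustration, not a number from this analysis; the rank counts how many permutation accuracies matched or beat the true-labeled accuracy.

```python
def perm_p_value(rank, n_perms=1000):
    """Permutation p-value from a rank.

    rank: number of permutation accuracies >= the true-labeled accuracy.
    n_perms: total permutations run (1000 is an assumed count).
    The +1 terms count the true labeling itself as one permutation,
    which keeps the p-value from ever being exactly zero.
    """
    return (rank + 1) / (n_perms + 1)

print(perm_p_value(10))  # repetition #9, two runs: rank 10
print(perm_p_value(0))   # repetition #6, four runs: rank 0 (best possible)
```

So even with identical accuracies (0.72), the rank-0 repetition gets the smallest p-value the permutation scheme can produce, while the rank-10 repetition does not.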
The true-labeled data accuracy (given as "real" in the tables) varies quite a bit more over the ten repetitions with only two runs than with four (0.57 to 0.9 with two runs, 0.69 to 0.84 with four runs). This strikes me as expected: classification with only two runs is much more difficult - we have much less statistical power - and so is less stable. The permutation distributions should also be wider (have more variance) when we have less power.
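A quick simulation shows why the null distributions should widen with fewer examples. This is only a sketch, not the actual analysis: the per-condition example counts (20 and 40) are made-up stand-ins for the two-run and four-run cases, and it treats each permuted-label test example as an independent coin flip.

```python
import numpy as np

rng = np.random.default_rng(0)
n_perms = 1000
null_sd = {}

# Hypothetical test-set sizes; the real per-run counts aren't given here.
for n_examples in (20, 40):  # stand-ins for "two runs" vs. "four runs"
    # Under permuted labels the classifier is at chance, so accuracy is
    # roughly Binomial(n_examples, 0.5) / n_examples.
    null_acc = rng.binomial(n_examples, 0.5, size=n_perms) / n_examples
    null_sd[n_examples] = null_acc.std()
    print(n_examples, round(null_sd[n_examples], 3))
```

The standard deviation of a chance-level accuracy scales as sqrt(0.25 / n), so halving the number of examples widens the null distribution by a factor of about sqrt(2) - consistent with the wider two-run distributions above.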