As before, here are the null distributions resulting from ten simulations using either a bias of 0.05 or 0.15.
These are from using two runs:
and these from using four runs:
The null distributions within each pane largely overlap: the curves change much less with the different biases than they do with the number of runs or the permutation scheme.
The true-labeled accuracy, and thus the p-values, change quite a lot, though:
[figure: with two runs, bias = 0.05]

[figure: with four runs, bias = 0.05]
So, in these simulations, changing the amount of difference between the classes did not substantially change the null distributions (particularly when permuting the entire dataset together). Is this sensible? I'm still thinking it through, but it strikes me as reasonable: if the relabeling fully disrupts the class structure, then the amount of signal actually present in the data should have less impact on the null distribution than other properties of the data, such as the number of examples.
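That intuition can be checked with a minimal sketch (this is a hypothetical setup with a one-dimensional nearest-mean classifier, not the simulation code used above): generate data with a small or larger class bias, permute the labels over the whole dataset, and compare the resulting null distributions of accuracy.

```python
import numpy as np


def nearest_mean_accuracy(train_x, train_y, test_x, test_y):
    """Classify each test point by the closer training-class mean."""
    m0 = train_x[train_y == 0].mean()
    m1 = train_x[train_y == 1].mean()
    pred = (np.abs(test_x - m1) < np.abs(test_x - m0)).astype(int)
    return (pred == test_y).mean()


def null_distribution(bias, n_per_class=20, n_perms=500, seed=0):
    """Permutation null: accuracies after relabeling the entire dataset."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    # class 0 centered at -bias, class 1 at +bias, unit-variance noise
    x = rng.normal(bias * (2 * y - 1), 1.0)
    train, test = slice(0, None, 2), slice(1, None, 2)  # alternate split
    null = np.empty(n_perms)
    for i in range(n_perms):
        py = rng.permutation(y)  # permute labels over the whole dataset
        null[i] = nearest_mean_accuracy(x[train], py[train],
                                        x[test], py[test])
    return null


if __name__ == "__main__":
    for bias in (0.05, 0.15):
        null = null_distribution(bias)
        print(f"bias={bias}: null mean={null.mean():.3f}, sd={null.std():.3f}")
```

Because the permutation breaks the label-data pairing completely, both null distributions should center near chance (0.5) with similar spread, regardless of how much signal the true labeling carries.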