UPDATE 13 May 2020: I'm now quite sure that the high variability wasn't due to the multiple fmriprep runs, but rather the way I picked the training/testing examples. I'm leaving this post here, but put most of it "below the fold" since my initial interpretation was incorrect. I will eventually release code and more details.
UPDATE 24 April 2020: I realized that there was at least one other possible big contributing factor to the difference between the runs: randomness in how I chose the not-button examples. I'll fix this and see what happens to the SVM results. I previously found extremely similar GLM results between fmriprep versions 1.1.2 and 1.3.2, so the variability here is very possibly only from my randomness in setting up the classification.