Monday, May 6, 2013

musings and meanderings on messes

I've been thinking lately about the "messiness" of science: what is the best course of action when the results do not tell a clear story but are a "mess" ... some results clearly supporting the hypothesis, others not, statistical tests not quite demonstrating the interpretation we want to make, etc.

In neuroimaging the practice is often to keep trying different analyses until "something sensible" turns up. Perhaps a different preprocessing strategy will make a 'blob' appear where we expected it to; perhaps using a conjunction analysis will 'get rid of' the strange activations appearing during this condition; perhaps changing the cross-validation scheme will eliminate the below-chance classification. Such practices are partly unavoidable: when we do not know the proper choice that should be made at each step of the analysis it does not make sense to stop when the first set of guesses fails. But these practices are very dangerous: exploding the experimenter degrees of freedom can make it possible to call nearly anything significant; and scientific progress depends upon robust results.

In my experience, very clean results - those that support a simple story - tend to be written up quickly and sent to higher-impact journals. This practice is also understandable, and a good idea, if the results came out so cleanly because the experiment was so powerful and definitive. But the practice is highly dangerous if the results only appear to be clean, whether because the "messy" parts were not mentioned (e.g. omitting the 100 analyses that didn't show the effect), or from outright fraud.

Is science best served by fetishizing clean results to the point that some degree of plastic surgery is required for publication? Clearly not.

Diederik Stapel pointed to some of these forces in a fascinating New York Times Magazine article:

"Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that, he told me. He insisted that he loved social psychology but had been frustrated by the messiness of experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty — instead of the truth,” he said. He described his behavior as an addiction that drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high.
In his early years of research — when he supposedly collected real experimental data — Stapel wrote papers laying out complicated and messy relationships between multiple variables. He soon realized that journal editors preferred simplicity. “They are actually telling you: ‘Leave out this stuff. Make it simpler,’ ” Stapel told me. Before long, he was striving to write elegant articles."
I agree that real datasets, particularly in fields like psychology and neuroimaging, are very messy. But that does not mean we should give up, surgically altering our results to appear clean (or fabricating them). I think we should instead embrace the unavoidable mess, developing analysis techniques capable of locating true islands of stability in a sea of mess.

I'm rambling and waxing poetic this morning, but want to convey a few ideas:

First, if the true situation is likely "mess" we should not think "send to Science" when we see an exceptionally clean set of results, but rather "what went wrong?" or "should we believe this?" Extraordinary claims - and a clean result in a neuroimaging study often is extraordinary - require extraordinary evidence.

Second, we should be more tolerant of messy results, allowing a few loose ends or unexpected patterns in an otherwise-solid experiment. Including descriptions of the actual analysis (not just those parts that "turned out") should be encouraged, not penalized.

Third, we should aim for stability as well as significance in analysis results. In MVPA, this could be a degree of resistance to arbitrary analysis choices (e.g. "this ROI classifies well, this ROI does not" appearing over a range of reasonable cross-validation schemes and temporal compression methods). I trust a result far more if it appears consistently across analyses than if it only appears in one particular scheme, even if the p-value is more significant in that one scheme. We should perhaps even insist upon demonstrations of stability, particularly if claiming something occurs generally.
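To make this concrete, here is a minimal sketch of such a stability check, using simulated data and a simple nearest-centroid classifier written from scratch (not any particular MVPA toolbox; the data, classifier, and fold counts are all hypothetical stand-ins for a real analysis pipeline). The idea is just to run the same classification under several reasonable cross-validation schemes and see whether the accuracy holds up:

```python
import numpy as np

def nearest_centroid_accuracy(X, y, n_folds, seed=0):
    """Cross-validated accuracy of a nearest-centroid classifier.

    X: (n_examples, n_voxels) data matrix; y: class labels.
    Folds are formed by randomly permuting examples and splitting
    into n_folds roughly equal groups.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    correct = 0
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # class centroids computed from training examples only
        centroids = {c: X[train_mask & (y == c)].mean(axis=0)
                     for c in np.unique(y)}
        for i in test_idx:
            pred = min(centroids,
                       key=lambda c: np.linalg.norm(X[i] - centroids[c]))
            correct += pred == y[i]
    return correct / len(y)

# toy "ROI" data: two conditions, 30 examples each, 50 voxels,
# with a modest mean offset between the classes
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)),
               rng.normal(0.5, 1.0, (30, 50))])
y = np.array([0] * 30 + [1] * 30)

# stability check: the accuracy should hold up across several
# reasonable fold counts, not appear under only one scheme
accs = {k: nearest_centroid_accuracy(X, y, k) for k in (3, 5, 10)}
print(accs)
```

If the accuracies agree across the fold counts, that is (weak) evidence the result is not an artifact of one particular cross-validation scheme; a real stability analysis would vary other choices too (temporal compression, ROI definition, classifier).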

UPDATE 17 May 2013: Andrew Gelman has some interesting comments on how statisticians can help psychologists do their research better.


  1. If you are interested in reproducibility as well as prediction, you might be interested in the NPAIRS framework by Dr Stephen Strother.

    In short, the framework works by randomly splitting your group of subjects into halves.
    On each split it tries all possible pipeline choices.
    This gives you a brain map from the analysis of one half of the group that you can compare with the brain map from the other half, checking the reproducibility between them. The brain map is also used to predict the design matrix from the other half's data, so you can compare the predicted design matrix with the real one and check whether the analysis was able to predict accurately for a different set of subjects (the other half).

    This is done repeatedly with different random splits of the group, and a statistic is built.
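    A minimal sketch of the split-half reproducibility idea (only one of the two NPAIRS axes; the prediction step and pipeline comparison are omitted, and the data here are simulated per-subject effect maps, not real brain maps):

    ```python
    import numpy as np

    def split_half_reproducibility(data, n_splits=50, seed=0):
        """Mean correlation between group maps from random half-splits.

        data: (n_subjects, n_voxels) array of per-subject effect maps.
        Each iteration randomly assigns subjects to two halves, averages
        each half into a group map, and correlates the two maps.
        """
        rng = np.random.default_rng(seed)
        n_subjects = data.shape[0]
        corrs = []
        for _ in range(n_splits):
            order = rng.permutation(n_subjects)
            half1 = data[order[: n_subjects // 2]].mean(axis=0)
            half2 = data[order[n_subjects // 2 :]].mean(axis=0)
            corrs.append(np.corrcoef(half1, half2)[0, 1])
        return float(np.mean(corrs))

    # toy example: 20 subjects, 100 voxels, with a shared "activation"
    # in the first 10 voxels plus subject-level noise
    rng = np.random.default_rng(1)
    signal = np.zeros(100)
    signal[:10] = 1.0
    data = signal + rng.normal(scale=1.0, size=(20, 100))
    print(split_half_reproducibility(data))
    ```

    A reproducibility near 1 suggests a stable group map; near 0, a map dominated by noise. The real framework pairs this with the prediction accuracy on the held-out half to score each pipeline choice.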


    1. Your comment makes me think I should put NPAIRS back onto my 'to do' list; I found it rather impenetrable in the past but haven't looked at it lately.