In this post I'll demonstrate one side effect of low-powered studies: reproducing results can be very difficult, even when the effects actually exist. I sometimes refer to this effect as "jumping stars", for reasons that will become apparent. This example is adapted from one presented in Maxwell (2004).
example
Suppose that we want to test if depression level is related to “self-perceived competence” in academic, appearance, athletic, behavioral, and social domains. To answer this question, we measure the depression level and those five domains of self-perceived competence in 100 people.
Once we have the data, we do a multiple linear regression to see which competence areas are most related to the depression score. (The variables are scaled in such a way that positive regression coefficients imply that higher levels of perceived competence correspond to lower levels of depression.)
For concreteness, here's the data for the first few (pretend) people:
Now we do the linear regression, and get the following:
Now, someone decides to replicate our study, and measures depression and self-perceived competence in 100 different people. They get this result from the linear regression:
And another study's results:
Oh no! Things are not replicating very well. To summarize:
What happened? As you probably guessed from my introduction about power, these three studies are not "bad" or finding non-existent effects, but rather are underpowered.
These three datasets were created when I ran the same simulation (R code here) three times. The simulation creates an underpowered study in which the factors are equally important: all five measures are correlated at 0.3 with the depression score.
Thus, all factors should be found equally important. But we don’t have enough power to find them all, so we only find a few in each individual study. The results of each study are thus not exactly incorrect, but rather incomplete and unstable: which effects were detected varied randomly from study to study.
And that is why I refer to this effect as "jumping stars": which row of the results table gets the stars (is found significant) varies each time the study is repeated; the significance stars seem to jump around unpredictably.
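The simulation described above can be sketched in a few lines. This is a Python sketch rather than the original R code, and it rests on one assumption the post does not state: the five competence scores are taken to correlate at 0.3 with each other as well as with depression. The function names, the seed, and the hard-coded critical value are illustrative choices, not the author's.

```python
import numpy as np

# Two-sided 5% critical value of Student's t with 94 degrees of freedom
# (n - k - 1 = 100 - 5 - 1). Hard-coded so that only NumPy is needed;
# it is only valid for the default n and k used below.
T_CRIT = 1.9855

DOMAINS = ["academic", "appearance", "athletic", "behavioral", "social"]


def simulate_study(rng, n=100, k=5, rho=0.3):
    """Draw one study: a depression score and k competence scores.

    Assumption: every pair of variables (predictors included) correlates
    at rho. Mixing one shared standard-normal factor into each column
    gives unit variances and pairwise correlation rho.
    """
    shared = rng.standard_normal((n, 1))
    unique = rng.standard_normal((n, k + 1))
    data = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * unique
    return data[:, 0], data[:, 1:]  # outcome, predictors


def slope_t_statistics(y, X):
    """Fit OLS with an intercept and return the t statistic of each slope."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k - 1)
    cov_beta = sigma2 * np.linalg.inv(design.T @ design)
    return (beta / np.sqrt(np.diag(cov_beta)))[1:]  # drop the intercept


rng = np.random.default_rng(2004)  # arbitrary seed
for study in range(1, 4):
    y, X = simulate_study(rng)
    t_values = slope_t_statistics(y, X)
    # A star marks a slope that is significant at the 5% level.
    labels = [d + ("*" if abs(t) > T_CRIT else "") for d, t in zip(DOMAINS, t_values)]
    print(f"study {study}: " + ", ".join(labels))
```

Each pass through the loop plays the role of one 100-person study drawn from the same population, so any differences in which domains earn a star come from sampling variation alone.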
comments
There is no easy fix for the jumping stars problem, since it's due to the hard-to-solve problem of low statistical power. And it can be quite difficult to distinguish this situation - true effects but low power - from a situation in which replication fails because the original effects were nonexistent (false positives that arose by random chance).
As Maxwell (2004) explains (and you really should read it if you haven't), jumping stars cause many more problems when many analyses are performed. When all of the effects are real, the likelihood that a low-powered study will find at least one of them significant is much, much larger than the likelihood that it will find all of them significant (which is what the truth would call for).
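To get a feel for the size of that gap, the study can be simulated many times and the outcomes tallied. The Python sketch below is an illustration under assumptions, not Maxwell's or the original R code: it uses 100 people, five predictors, and a correlation of 0.3 between every pair of variables (the correlation among the predictors themselves is assumed), and the function names and replication count are made up for the example.

```python
import numpy as np

# Two-sided 5% critical value of Student's t with 94 degrees of freedom
# (100 people, 5 predictors, 1 intercept); only valid for those defaults.
T_CRIT = 1.9855
K = 5  # number of competence domains


def count_significant(rng, n=100, rho=0.3):
    """Simulate one study and count the slopes with |t| > T_CRIT."""
    # Assumption: all K + 1 variables are equicorrelated at rho,
    # produced by mixing in one shared standard-normal factor.
    shared = rng.standard_normal((n, 1))
    unique = rng.standard_normal((n, K + 1))
    data = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * unique
    y = data[:, 0]
    design = np.column_stack([np.ones(n), data[:, 1:]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - K - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    return int(np.sum(np.abs(beta[1:] / se[1:]) > T_CRIT))


def detection_rates(n_studies=2000, seed=0):
    """Estimate how often a study detects each, any, or all of the effects."""
    rng = np.random.default_rng(seed)
    counts = np.array([count_significant(rng) for _ in range(n_studies)])
    return {
        "per_effect": float(counts.mean() / K),      # average power per slope
        "at_least_one": float(np.mean(counts >= 1)),
        "all": float(np.mean(counts == K)),
    }


for name, value in detection_rates().items():
    print(f"{name}: {value:.3f}")
```

The three printed rates are Monte Carlo estimates, so they shift slightly with the seed and the number of simulated studies.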
All of this can add up to a messy, unstable, contradictory literature in which replication is very difficult - even if the effects actually are present!
But I am not counseling despair, but rather awareness and caution. We must be very, very cautious in interpreting negative results, aware that many true effects could have been missed. And we should keep power in mind when designing experiments and analyses: more is generally better.
some references
Maxwell, S.E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163.
Sedlmeier, P., & Gigerenzer, G. (1989). Do Studies of Statistical Power Have an Effect on the Power of Studies? Psychological Bulletin, 105, 309-316.
Yarkoni, T. (2009). Big Correlations in Little Studies: Inflated Correlations Reflect Low Statistical Power. Perspectives on Psychological Science, 4, 294-298.