When you have a research project — the Open Science Collaboration (OSC) — that includes 270 scientists working on breakthrough science, you would hope they would get some of the basics correct. Like designing a randomized study that was methodologically sound and could stand up to scrutiny from their peers.
But the ground-breaking article that collaboration published in August 2015, “Estimating the reproducibility of psychological science” (Nosek et al., 2015), appears to have had some significant flaws. A new article suggests there actually is no ‘replicability crisis’ in psychology after all.
Four researchers from Harvard University and the University of Virginia (Gilbert et al., 2016) published their findings in Science (their psychology replications website hosts all the data and material). They believe they found three major statistical errors in the original study that call into serious question its findings. The new researchers claim, “Indeed, the evidence is consistent with the opposite conclusion — that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%.”
The original study (Nosek et al., 2015) tried to reproduce the findings from 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. The first criticism of the study is that this was not a randomized selection of psychology studies. Instead, the Nosek group limited its selection of studies to only three journals representing a paltry two disciplines of psychology, leaving out major areas like developmental and clinical psychology. Then Nosek et al. employed a complex set of arbitrary rules and criteria that actually disqualified more than 77 percent of the studies from the three journals they examined.
Research that starts with a biased sample is bound to have problems. By not starting with a randomized sample, the researchers already helped set the stage for their disappointing findings.
Let’s (Significantly) Change the Studies We Replicate
Even worse than starting off with a biased, non-randomized sample was how the researchers actually conducted the replications. First, researchers invited “particular teams to replicate particular studies or they allowed the teams to select the studies they wished to replicate.” Rather than randomly assigning replication teams to studies, they let the teams choose, inviting each team’s biases: perhaps a preference for the studies they suspected were least likely to replicate.
The new studies sometimes differed significantly from the old studies they were trying to replicate. Here’s just one (of at least a dozen) examples of how the replicated study introduced significant complications:
In another study, White students at Stanford University watched a video of four other Stanford students discussing admissions policies at their university (Crosby, Monin, & Richardson, 2008). Three of the discussants were White and one was Black. During the discussion, one of the White students made offensive comments about affirmative action, and the researchers found that the observers looked significantly longer at the Black student when they believed he could hear the others’ comments than when he could not. Although the participants in the replication study were students at the University of Amsterdam, they watched the same video of Stanford students talking (in English!) about Stanford’s admissions policies.
Could students at a university in Amsterdam really be expected to understand what affirmative action in America even is, given the significant cultural differences between American and Dutch society? Astoundingly, the researchers who conducted the replication called the two studies “virtually identical” (and naturally, they have an incentive to say so, since it is their study). Yet the original researchers, recognizing the significant cultural differences between the two populations, did not endorse the new replication study.
Gilbert and his colleagues found this sort of problem in not just one, but many of the replication studies. It seems odd that Nosek et al. believed these sorts of inconsistencies wouldn’t affect the studies’ quality (or “fidelity,” as the researchers term it). Yet these are significant qualitative differences that would surely impact whether a study replicates.
We Need More Power!
A study can stand or fall on its design. And a key part of a research study’s design is its statistical power. The replication study used a design that was likely doomed to fail from the start. Low-powered designs cannot detect effects that higher-powered designs can. By choosing a low-powered design, Nosek and colleagues virtually ensured their negative findings before they collected a single data point.
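To make the power point concrete, here is a rough back-of-the-envelope sketch. The effect size (Cohen’s d = 0.4) and sample sizes are my illustrative assumptions, not figures from either paper; the calculation uses a simple normal approximation for a two-sample comparison of means:

```python
from statistics import NormalDist

def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means.

    Normal approximation: power = P(Z > z_crit - d * sqrt(n/2)),
    where d is Cohen's d and n is the per-group sample size.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # 1.96 for alpha = .05
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return nd.cdf(noncentrality - z_crit)

# Hypothetical numbers: a modest single-lab replication vs. a large one.
small = power_two_sample(0.4, 25)    # ≈ 0.29: misses a real effect ~7 times in 10
large = power_two_sample(0.4, 200)   # ≈ 0.98: almost always detects it
```

With 25 participants per group, a genuine medium-small effect goes undetected most of the time, so a “failed replication” says little about whether the original effect was real.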
Nosek and colleagues floated a few straw-man arguments for the choice in design, which Gilbert et al. shot down one by one in their reply. The conclusion of Gilbert and his colleagues?
In summary, none of the arguments made [by the replication researchers] disputes the fact that the authors of [the new study] used a low powered design, and that (as our analyses of the ML2014 data demonstrate) this likely led to a gross underestimation of the true replication rate in their data.
Other psychology researchers ran a similar replication project back in 2014 (Klein et al., 2014). Using a high-powered design, they found that most of the psychology studies they examined did replicate: 11 of the 13 experiments rerun. To test the impact of Nosek et al.’s lower-powered design, Gilbert et al. estimated that, had the 2014 project used it, its replication rate would have dropped from 85 percent to 34 percent. A significant and telling difference.
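A quick Monte Carlo sketch shows why a low-powered design understates the replication rate even when every original effect is real. The effect size, sample sizes, and trial counts below are my illustrative assumptions, not figures from the papers:

```python
import random
from math import sqrt
from statistics import mean, stdev

def replication_passes(d, n, rng):
    """Simulate one replication of a TRUE effect of size d (Cohen's d)
    with n subjects per group; return True if a two-sample z test
    (normal approximation) reaches p < .05, i.e. |z| > 1.96."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(d, 1.0) for _ in range(n)]
    pooled = sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    z = (mean(b) - mean(a)) / (pooled * sqrt(2 / n))
    return abs(z) > 1.96

rng = random.Random(0)  # fixed seed for reproducibility
# Every simulated effect is real, yet the "replication rate" differs:
low_power = mean(replication_passes(0.4, 25, rng) for _ in range(2000))
high_power = mean(replication_passes(0.4, 200, rng) for _ in range(2000))
```

Even though 100 percent of the simulated effects are genuine, the low-powered design “replicates” only around 30 percent of them, while the high-powered design replicates nearly all. A low observed replication rate can therefore reflect the design, not the science.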
So What Do We Really Know About Reproducibility of Psychological Science?
More than we thought. Given Gilbert et al.’s critique and the weak response from the original researchers, it looks more likely that the Nosek et al. study was critically flawed.
It appears that psychological science is more reproducible than we thought, which is good news for both science and psychology.
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on ‘Estimating the reproducibility of psychological science’. Science, 351, 1037.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., et al. (2014). Investigating variation in replicability: A “Many Labs” replication project. Social Psychology, 45, 142-152.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. DOI: 10.1126/science.aac4716
Nosek et al. (2016). Response to comment on ‘Estimating the reproducibility of psychological science’. Science, 351, 1037. DOI: 10.1126/science.aad9163