A large-scale attempt to reproduce psychology studies last year suggested big problems for the field when many of the results could not be replicated. However, a paper in Science argues that the fault lies with the reproduction efforts rather than with the original research.
More than a million scientific papers are published each year, and inevitably many contain errors not picked up in peer review. The Center for Open Science is attempting to quantify how many problematic papers slip through the review process. Yet Professor Daniel Gilbert of Harvard University thinks the Center should start by scrutinizing its own techniques.
Last year, the Center's Open Science Collaboration (OSC) published a study reporting that attempts to reproduce 100 psychology papers from 2008 had been unsuccessful in 64 cases. At the time, collaboration member Dr. Patrick Goodbourn of the University of Sydney told IFLScience that many of the effects investigated were probably real, but smaller than the initial papers indicated. Nevertheless, the outcome was seen as discrediting psychological research.
Gilbert argues that the OSC made multiple methodological errors, each serious enough on its own to undermine the conclusion, and cumulatively devastating.
"If you want to estimate a parameter of a population, then you either have to randomly sample from that population or make statistical corrections for the fact that you didn't," said co-author Professor Gary King in a statement. "The OSC did neither." Consequently, Gilbert and King argue, the study could, at most, tell us that certain subfields of psychology might have a problem, rather than the whole field.
Moreover, the attempts to reproduce the initial studies didn't use identical populations or even sample sizes. "Readers surely assumed that if a group of scientists did a hundred replications, then they must have used the same methods to study the same populations. In this case, that assumption would be quite wrong," said Gilbert.
At the time, Goodbourn said the OSC team "chose samples that should have had a 95 percent probability of achieving statistical significance." However, Gilbert and King claim the OSC used inappropriate statistical techniques, and so underestimated the sample sizes required.
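To see why sample size is the crux of this dispute, here is a hedged sketch using a standard two-sample power calculation (Python's statsmodels). The effect sizes are hypothetical and chosen only to illustrate the general argument, not taken from the OSC study: a replication sized for 95 percent power under the original, possibly inflated effect needs far fewer participants than one sized for a smaller true effect.

```python
# Standard power calculation for a two-sample t-test (statsmodels).
# Effect sizes below are hypothetical, chosen only to illustrate the argument.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed for 95% power at alpha = 0.05,
# assuming the effect size reported by an original study (Cohen's d = 0.5)...
n_assuming_original = analysis.solve_power(effect_size=0.5, power=0.95, alpha=0.05)

# ...versus the sample size needed if the true effect is smaller (d = 0.3).
n_if_effect_smaller = analysis.solve_power(effect_size=0.3, power=0.95, alpha=0.05)

print(f"n per group assuming d = 0.5: {n_assuming_original:.0f}")  # roughly 105
print(f"n per group if true d = 0.3: {n_if_effect_smaller:.0f}")   # roughly 290
```

If, as Goodbourn suggested, the true effects were real but smaller than first reported, replications powered against the original estimates would come up short of the 95 percent target.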
The statistical debate may be confusing, but one conclusion is that location matters. Image credit: Pathdoc/Shutterstock
Most replications were carried out at different locations from the originals, usually considered similar enough not to make much difference. Gilbert argues, however, that some were highly unsuitable: a study of race and affirmative action conducted at Stanford University in the United States was replicated at the University of Amsterdam, producing very different results. "Some of the replications were quite faithful to the originals, but anyone who carefully reads all the replication reports will find many more examples like this one," Gilbert said.
Gilbert and King divided the replication attempts into what they call high- and low-fidelity studies. The low-fidelity replications were four times as likely as the high-fidelity ones to fail to match the original outcomes, suggesting that imprecision in matching the original studies mattered more than any flaws in the original research.
IFLScience attempted to contact a member of the OSC team for a response, but was unsuccessful. However, many of the OSC authors have published a technical comment defending their statistical methodology, without addressing the other criticisms.