See also Scientific publishing.
Peer review is our vaunted bulwark against non-science nonsense. We say not to trust work until it’s peer reviewed. It’s a stamp of Seriousness and Authority. It’s also basically a lottery.
Bornmann, L., Mutz, R., & Daniel, H.-D. (2010). A Reliability-Generalization Study of Journal Peer Reviews: A Multilevel Meta-Analysis of Inter-Rater Reliability and Its Determinants. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331
Meta-analysis of peer review experiments. Shows fairly low inter-rater reliability. This isn’t necessarily bad – if everyone always agrees, why use multiple reviewers, and shouldn’t your reviewers be more diverse? – but it does raise the question “How can we say peer review filters for quality research when nobody can agree on what constitutes quality?”
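Inter-rater reliability is usually reported with chance-corrected statistics like Cohen’s kappa rather than raw agreement, because two reviewers who both accept most papers will “agree” often just by chance. A minimal sketch with invented accept/reject verdicts (purely illustrative, not data from the paper) shows how 80% raw agreement can hide a much lower kappa:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' binary verdicts (lists of 0/1)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement implied by each rater's marginal accept rate
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

# Invented verdicts for 10 manuscripts (1 = accept, 0 = reject)
rater1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
rater2 = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

raw = sum(x == y for x, y in zip(rater1, rater2)) / 10
print(raw)                          # 0.8 raw agreement
print(cohens_kappa(rater1, rater2)) # 0.375: only modest agreement beyond chance
```

Since both raters accept 80% of papers, most of their agreement is expected by chance alone, and kappa drops accordingly – which is roughly the situation the meta-analysis documents across journals.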
Jefferson, T., Rudin, M., et al. (2007). Editorial peer review for improving the quality of reports of biomedical studies. Cochrane Database of Systematic Reviews. doi:10.1002/14651858.MR000016.pub3
Emerson, G. B. et al. (2010). Testing for the presence of positive-outcome bias in peer review: a randomized controlled trial. Archives of Internal Medicine, 170(21), 1934–1939. doi:10.1001/archinternmed.2010.406
“Reviewers were more likely to recommend the positive version of the test manuscript for publication than the no-difference version (97.3% vs 80.0%, P < .001). Reviewers detected more errors in the no-difference version than in the positive version (0.85 vs 0.41, P < .001). Reviewers awarded higher methods scores to the positive manuscript than to the no-difference manuscript (8.24 vs 7.53, P = .005), although the ‘Methods’ sections in the 2 versions were identical.”
This is a pretty good argument for Registered Reports.
I get the impression from the following studies that reviewers don’t have a clear idea how to review papers, don’t consistently improve the quality of papers they review, and frequently miss errors. Are there ways of doing better? Are there super-reviewers who consistently give really good reviews?
Goodman, S. N., Altman, D. G., & George, S. L. (1998). Statistical reviewing policies of medical journals. Journal of General Internal Medicine, 13(11), 753–756. doi:10.1046/j.1525-1497.1998.00227.x
Most medical papers don’t get formal statistical review before publication.
Schroter, S. et al. (2008). What errors do peer reviewers detect, and does training improve their ability to detect them? JRSM, 101(10), 507–514. doi:10.1258/jrsm.2008.080062
Reviewers, even after training, don’t detect most methodological errors deliberately inserted into papers. Admittedly, some of the errors were things like “Poor justification for conducting the study” (categorized as “major”) and the Hawthorne effect (minor), which do not seem huge to me; but “unjustified conclusions” were detected by fewer than half of the reviewers, as was a “discrepancy between abstract & results”.
van Rooyen, S., Delamothe, T., & Evans, S. J. W. (2010). Effect on peer review of telling reviewers that their signed reviews might be posted on the web: Randomised controlled trial. BMJ, 341, c5729. doi:10.1136/bmj.c5729
Telling reviewers that their reviews may be published did not significantly affect the quality of the reviews, as assessed by editors, but did make reviewers spend more time on them.
Callaham, M., & McCulloch, C. (2011). Longitudinal Trends in the Performance of Scientific Peer Reviewers. Annals of Emergency Medicine, 57(2), 141–148. doi:10.1016/j.annemergmed.2010.07.027
Review quality, as rated by editors, declines gradually as reviewers age, either because their reviews genuinely get worse or because editors come to expect more of them.
Hopewell, S. et al. (2014). Impact of peer review on reports of randomised trials published in open peer review journals: retrospective before and after study. BMJ, 349, g4145. doi:10.1136/bmj.g4145
Compares reports of randomised trials before and after peer review, to see whether the changes reviewers suggested were improvements. Mostly there were small gains in reporting quality, but reviewers often suggested unplanned analyses and failed to notice many other reporting defects.