Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review.

*Journal of Abnormal and Social Psychology*, *65*(3), 145–153. doi:10.1037/h0045186

The seminal article. Studies in psychology had, at the time, less than 50% power to detect "medium" effect sizes.
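To make the "less than 50% power" figure concrete, here is a quick sketch of my own (not from Cohen's paper) of the normal-approximation power of a two-sided two-sample t-test at Cohen's "medium" effect size, d = 0.5. The choice of 30 subjects per group is my illustrative assumption, not a number from the article.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sample_power(d, n_per_group, z_crit=1.959964):
    """Approximate power of a two-sided two-sample t-test
    (normal approximation; z_crit defaults to alpha = 0.05)."""
    delta = d * sqrt(n_per_group / 2.0)  # noncentrality parameter
    return normal_cdf(delta - z_crit)

# Cohen's "medium" effect (d = 0.5) with an assumed 30 subjects per group:
print(round(two_sample_power(0.5, 30), 2))  # just under 0.5 -- a coin flip
```

Even 30 per group, a respectable sample for the era, leaves you with roughly coin-flip odds of detecting a genuine medium-sized effect.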

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies?

*Psychological Bulletin*, *105*(2), 309–316. doi:10.1037/0033-2909.105.2.309

Follow-up on Cohen: the situation hasn't gotten any better.

Maxwell, S. E. (2004). The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies.

*Psychological Methods*, *9*(2), 147–163. doi:10.1037/1082-989X.9.2.147

Points out that the studies Cohen surveyed had, on average, 70 statistical tests each. Underpowered studies persist because there's so much rampant multiple testing that you're bound to find *something* significant, regardless of power, so nobody feels compelled to increase their sample sizes.

Moher, D., Dulberg, C. S., & Wells, G. A. (1994). Statistical power, sample size, and their reporting in randomized controlled trials.

*JAMA*, *272*(2), 122–124. doi:10.1001/jama.1994.03520020048013

Surveyed large RCTs and found that, of those with negative results, only a third had 80% power to detect a 50% difference between groups.

Button, K. S. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience.

*Nature Reviews Neuroscience*, *14*(5), 365–376. doi:10.1038/nrn3475

By comparing individual neuroscience studies to meta-analyses, and assuming the meta-analytic effect sizes were correct, found that the median power of individual studies is 21%. Publication bias may make that assumption dubious, but in the direction of overestimating effect sizes, which means the true power is possibly even worse.

Dumas-Mallet, E., Button, K. S., Boraud, T., Gonon, F., & Munafò, M. R. (2017). Low statistical power in biomedical science: a review of three human research domains.

*Royal Society Open Science*, *4*(2), 160254. doi:10.1098/rsos.160254

Repeated the above analysis in neurology and psychiatry, with similarly poor results.

Szucs, D., & Ioannidis, John P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature.

*PLOS Biology*, *15*(3), e2000797. doi:10.1371/journal.pbio.2000797

Analyzed cognitive neuroscience and psychology papers to show that the average study has 44% power to detect medium effect sizes, and that the problem is worse in high-impact journals. This leads to an estimate that "false report probability is likely to exceed 50%." Also calculates post-hoc power from the observed effect sizes, but see Hoenig and Heisey (2001) below.

Charles, P., Giraudeau, B., Dechartres, A., Baron, G., & Ravaud, P. (2009). Reporting of sample size calculation in randomised controlled trials: review.

*BMJ*, *338*, b1732. doi:10.1136/bmj.b1732

RCTs report sample size calculations with insufficient detail, and their calculations often seem to be wrong: only 34% of trials reported all necessary data, calculated accurately, and used appropriate assumptions.

Vankov, I., Bowers, J., & Munafò, M. R. (2014). On the persistence of low power in psychological science.

*The Quarterly Journal of Experimental Psychology*, *67*(5), 1037–1040. doi:10.1080/17470218.2014.885986

Found that authors of psychology papers often justify their sample sizes by what previous studies typically used (usually not enough), rather than by an actual power analysis.

Gelman, A., & Weakliem, D. (2009). Of beauty, sex, and power: statistical challenges in estimating small effects.

*American Scientist*, *97*, 310–316. http://www.stat.columbia.edu/~gelman/research/unpublished/power4r.pdf

Points out that, beyond Type I and Type II error rates, we also have to worry about overestimation of effect sizes: the significance filter combined with low power means only overestimates will be statistically significant. Gelman calls these "Type M" errors; I call it "truth inflation."

Hoenig, J., & Heisey, D. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis.

*The American Statistician*, *55*(1), 19–24. doi:10.1198/000313001300339897

It used to be common, after failing to obtain a statistically significant result, to calculate one's statistical power to see whether the failure to reject the null was definitive or not. Hoenig and Heisey show this is fallacious.
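One part of their argument can be demonstrated in a few lines: "observed power," computed by plugging the observed effect back into the power calculation, is a deterministic function of the p value alone, so it can't tell you anything the p value didn't already. A sketch of my own for a two-sided z-test (ignoring the negligible far-tail term in the power formula):

```python
from statistics import NormalDist

std = NormalDist()

def observed_power(p_value, alpha=0.05):
    """'Post-hoc power' obtained by treating the observed effect as the
    true effect. For a two-sided z-test this depends only on p and alpha."""
    z_obs = std.inv_cdf(1 - p_value / 2)   # |z| implied by the p value
    z_crit = std.inv_cdf(1 - alpha / 2)
    return 1 - std.cdf(z_crit - z_obs)

# A result at exactly p = 0.05 always has observed power 0.5, no matter
# the sample size or the effect; larger p just means lower observed power.
for p in (0.05, 0.2, 0.5):
    print(round(observed_power(p), 2))
```

So "we failed to reject, but observed power was low, therefore the study was inconclusive" is circular: a nonsignificant result guarantees low observed power.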

[To read] Machery, E. (2012). Power and Negative Results.

*Philosophy of Science*, *79*(5), 808–820. doi:10.1086/667877

Claims to prove Hoenig and Heisey wrong?