Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. doi:10.1037/h0045186
The seminal article. Studies in psychology had, at the time, less than 50% power to detect “medium” effect sizes.
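Cohen’s “medium” effect corresponds to a standardized mean difference of d = 0.5. As a sketch of how little power small samples buy, one can compute the exact power of a two-sided two-sample t-test via the noncentral t distribution (the n = 20 per group here is an illustrative choice, not a figure from the paper):

```python
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test for standardized
    effect size d with n_per_group observations per group."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    return (1 - stats.nct.cdf(t_crit, df, ncp)
            + stats.nct.cdf(-t_crit, df, ncp))

print(round(two_sample_power(0.5, 20), 2))   # ≈ 0.34, well under 50%
```

For comparison, reaching the conventional 80% power for d = 0.5 takes roughly 64 subjects per group.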
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309–316. doi:10.1037/0033-2909.105.2.309
Follow-up on Cohen: the situation hasn’t gotten any better.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147–163. doi:10.1037/1082-989X.9.2.147
Points out that the studies Cohen surveyed contained, on average, 70 statistical tests each. Underpowered studies persist because multiple testing is so rampant that you’re bound to find something significant regardless of power, so nobody feels compelled to increase their sample sizes.
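A back-of-the-envelope calculation shows why 70 tests nearly guarantee something significant, even with no true effects anywhere (assuming, for simplicity, independent tests):

```python
alpha, m = 0.05, 70   # per-test significance level, number of tests

# Probability of at least one false positive across m independent
# tests when every null hypothesis is true.
p_any_significant = 1 - (1 - alpha) ** m
print(f"{p_any_significant:.3f}")   # ≈ 0.972
```

Correlated tests lower this somewhat, but the qualitative point stands: with that much testing, “something significant” is almost certain.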
Moher, D., Dulberg, C. S., & Wells, G. A. (1994). Statistical power, sample size, and their reporting in randomized controlled trials. JAMA, 272(2), 122–124. doi:10.1001/jama.1994.03520020048013
Surveyed large RCTs and found that, of those with negative results, only about a third had 80% power to detect a 50% relative difference between groups.
Button, K. S. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi:10.1038/nrn3475
By comparing individual neuroscience studies to meta-analyses, and assuming the meta-analytic effect sizes were correct, found that the median power of individual studies is 21%. Publication bias makes that assumption dubious, but only in the direction of overestimating effect sizes, which means the true power is possibly even worse.
Dumas-Mallet, E., Button, K. S., Boraud, T., Gonon, F., & Munafò, M. R. (2017). Low statistical power in biomedical science: a review of three human research domains. Royal Society Open Science, 4(2), 160254. doi:10.1098/rsos.160254
Repeated the above analysis in neurology and psychiatry, with similarly poor results.
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology, 15(3), e2000797. doi:10.1371/journal.pbio.2000797
Analyzed cognitive neuroscience and psychology papers to show that the average study has 44% power to detect medium effect sizes, and the problem is worse for high-impact journals. Leads to an estimate that “false report probability is likely to exceed 50%”. Also calculates post-hoc power from the observed effect sizes, but see Hoenig and Heisey (2001) below.
Charles, P., Giraudeau, B., Dechartres, A., Baron, G., & Ravaud, P. (2009). Reporting of sample size calculation in randomised controlled trials: review. BMJ, 338, b1732. doi:10.1136/bmj.b1732
RCTs report sample size calculations with insufficient detail, and their calculations often seem to be wrong. Only 34% of trials reported all necessary data, calculated accurately, and used appropriate assumptions.
Vankov, I., Bowers, J., & Munafò, M. R. (2014). On the persistence of low power in psychological science. The Quarterly Journal of Experimental Psychology, 67(5), 1037–1040. doi:10.1080/17470218.2014.885986
Found that authors of psychology papers often justify their sample size by what previous studies typically used (usually not enough), rather than by actual power analysis.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex, and power: statistical challenges in estimating small effects. American Scientist, 97, 310–316. http://www.stat.columbia.edu/~gelman/research/unpublished/power4r.pdf
Points out that, beyond Type I and Type II error rates, we also have to worry about overestimation of effect sizes: the significance filter combined with low power means only overestimates will be statistically significant. Gelman calls these “Type M” errors; I call this “truth inflation.”
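A small simulation illustrates the mechanism: with a small true effect and an underpowered design, the estimates that survive the significance filter are, on average, far larger than the truth (the effect size and sample size below are illustrative choices, not values from the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n, reps = 0.2, 25, 20_000   # small true effect, underpowered design

significant_estimates = []
for _ in range(reps):
    a = rng.normal(true_d, 1.0, n)   # treatment group
    b = rng.normal(0.0, 1.0, n)      # control group
    t, p = stats.ttest_ind(a, b)
    if p < 0.05 and t > 0:           # the significance filter
        significant_estimates.append(a.mean() - b.mean())

inflation = np.mean(significant_estimates) / true_d
print(f"significant estimates overstate the truth {inflation:.1f}x")
```

In this setup power is only around 10%, and the effect estimates that reach significance exaggerate the true effect severalfold.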
Hoenig, J., & Heisey, D. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19–24. doi:10.1198/000313001300339897
It used to be common, after failing to obtain a statistically significant result, to calculate one’s statistical power to see if failure to reject the null is definitive or not. Hoenig and Heisey show this is fallacious.
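The core of the fallacy is that “observed power” is a one-to-one function of the p-value, so it adds nothing the p-value doesn’t already say. A simplified sketch for a two-sided z-test:

```python
from scipy import stats

def observed_power(p, alpha=0.05):
    """Post-hoc power computed by plugging the observed effect back
    into the power formula: it depends on the data only through p."""
    z_obs = stats.norm.ppf(1 - p / 2)        # |z| implied by the p-value
    z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    return (1 - stats.norm.cdf(z_crit - z_obs)
            + stats.norm.cdf(-z_crit - z_obs))

# A result at exactly p = 0.05 always has "observed power" of about
# 50%, regardless of sample size or subject matter.
print(round(observed_power(0.05), 2))   # → 0.5
```

Since smaller p-values mechanically yield higher observed power, a nonsignificant result always “looks” underpowered, telling you nothing about whether the null is plausibly true.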
[To read] Machery, E. (2012). Power and Negative Results. Philosophy of Science, 79(5), 808–820. doi:10.1086/667877
Claims to prove Hoenig and Heisey wrong?