No, average statistical power is not as high as you think: Tracing a statistical error as it spreads through the literature

https://statmodeling.stat.columbia.edu/2020/05/18/no-average-statistical-power-is-not-as-high-as-you-think-tracing-a-statistical-error-as-it-spreads-through-the-literature/

I was reading this recently published article by Sakaluk et al. and came across a striking claim:

Despite recommendations that studies be conducted with 80% power for the expected effect size, recent reviews have found that the average social science study possesses only a 44% chance of detecting an existing medium-sized true effect (Szucs & Ioannidis, 2017).

I noticed this not because the claimed 44% was so low but because it was so high! I strongly doubt that the average social science study possesses a power of anything close to 44%. Why? Because 44% is close to 50%, and a study has 50% power when the true effect is about 2 standard errors away from zero: in that case the estimate is centered right at the significance threshold, so it exceeds the threshold about half the time. I doubt that typical studies have such large effects.
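To see the arithmetic, here's a minimal sketch of the textbook power calculation for a two-sided z-test at the 5% level, with the true effect expressed in standard-error units (the function name power_z is mine, just for illustration):

```python
from scipy.stats import norm

def power_z(effect_in_se, alpha=0.05):
    """Power of a two-sided z-test when the true effect sits
    `effect_in_se` standard errors away from zero."""
    z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    return norm.cdf(effect_in_se - z_crit) + norm.cdf(-effect_in_se - z_crit)

for d in [0.5, 1.0, 2.0, 2.8]:
    print(f"effect = {d} SE from zero -> power = {power_z(d):.2f}")
# effect = 2.0 SE -> power ~ 0.52; you need ~2.8 SE to reach 80% power
```

With the true effect 2 standard errors out, power is just over 50%; to hit the recommended 80%, the true effect has to be about 2.8 standard errors from zero.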

I was curious where the 44% came from, so I looked up the Szucs and Ioannidis article, “Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature.” Here’s the relevant passage from the abstract:

We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century.

This is fine—but I don’t think most effect sizes are so large. To put it another way, what they call a “medium effect,” I would call a huge effect. So, realistically, power will be much much less than 44%.
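To make that concrete, here's a rough continuation of the sketch above, using a hypothetical two-group comparison with 30 observations per group (a sample size I've picked for illustration, not one taken from either paper). The standard error of a standardized mean difference is approximately sqrt(2/n), so a "medium" effect of d = 0.5 sits nearly 2 standard errors from zero, while effects of d = 0.1 or 0.2 sit far closer:

```python
import math
from scipy.stats import norm

n = 30                     # hypothetical per-group sample size
se = math.sqrt(2 / n)      # approx. SE of a standardized mean difference
for d in [0.5, 0.2, 0.1]:  # "medium", small, and smaller standardized effects
    z = d / se             # true effect in standard-error units
    power = norm.cdf(z - 1.96) + norm.cdf(-z - 1.96)  # two-sided test at 5%
    print(f"d = {d}: {z:.2f} SE from zero -> power = {power:.2f}")
# d = 0.5 -> ~0.49; d = 0.2 -> ~0.12; d = 0.1 -> ~0.07
```

So the 44% figure holds only if true effects really are "medium" by Cohen's convention; if they're more like d = 0.1 or 0.2, power is in the single digits or low teens.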

This is important, because if researchers come into a study with the seemingly humble expectation of 44% power, they’ll expect to get “p less than 0.05” about half the time, and if they don’t, they’ll think that something went wrong. Actually, though, the only way researchers have achieved such a high apparent success rate in the past is through forking paths. The expectation of 44% power has bad consequences.
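One way to see how forking paths can produce a high apparent success rate even when true power is low: here's a deliberately stylized simulation (my own construction, not from either paper) in which the analyst can pick among several analyses. I'm modeling the paths crudely as independent tests; in the forking-paths story the choices are data-dependent rather than literal multiple tests, but the arithmetic points the same way:

```python
import numpy as np

rng = np.random.default_rng(0)

def apparent_success_rate(effect_in_se, n_paths, n_studies=100_000):
    """Fraction of simulated studies reporting p < 0.05 when the analyst
    can pick the best of `n_paths` analyses, modeled crudely here as
    independent two-sided z-tests of the same small true effect."""
    z = rng.normal(effect_in_se, 1.0, size=(n_studies, n_paths))
    return np.any(np.abs(z) > 1.96, axis=1).mean()

print(apparent_success_rate(0.5, n_paths=1))   # one pre-specified test: ~0.08
print(apparent_success_rate(0.5, n_paths=10))  # ten forking paths: ~0.56
```

Even with only about 8% power for any single pre-specified test, the apparent success rate climbs past 50% once the analyst can choose the best of ten paths.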