Why “bigger sample size” is not usually where it’s at.

Aidan O’Gara writes:

I realized when reading your JAMA chocolate study post that I don’t understand a very fundamental claim made by people who want better social science: Why do we need bigger sample sizes?

The significance threshold is always going to be 0.05, so a sample of 10 people is going to turn up a false positive for purely random reasons exactly as often as a sample of 1000: precisely 5% of the time. That rate will be higher if you have forking paths, bad experiment design, etc., but is there any reason to believe that those factors weigh more heavily in a small sample?

Let’s take the JAMA chocolate example. If this study is purely capturing noise, you’d need to run 20 experiments to get a statistically significant result like this. If they studied a million people, they’d also need only 20 experiments to get a false positive from noise alone. Let’s say they’re capturing not only noise but also bad or malicious statistical design: researcher degrees of freedom, manipulating the experiment. Is this any less common in studies of a million people? Why?

“We need bigger sample sizes” is something I’ve heard a million times, but I just realized I don’t get it. Thanks in advance for the explanation.

My reply:

Sure, more data always helps, but I don’t typically argue that larger sample size is the most important thing. What I like to say is that we need better measurement.

If you’re measuring the wrong thing (as in those studies of ovulation and clothing and voting that got the dates of peak fertility wrong) or if your measurements are super noisy, then a large sample size won’t really help you: Increasing N will reduce variance but it won’t do anything about bias.
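Here’s a minimal simulation sketch of that point (the bias and noise numbers are invented, not from any real study): if every measurement carries a fixed bias, a larger N just gives you a tighter interval around the wrong answer.

```python
# Toy sketch with made-up numbers: larger N shrinks the standard error
# but does nothing about a fixed measurement bias.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.0   # the real effect is zero
bias = 0.3          # hypothetical systematic measurement bias
noise_sd = 1.0

for n in (10, 1000):
    # each observation = true effect + bias + noise; 10,000 simulated studies per N
    draws = true_effect + bias + rng.normal(0, noise_sd, size=(10_000, n))
    estimates = draws.mean(axis=1)   # one estimate per simulated study
    print(f"N={n:5d}  mean estimate={estimates.mean():.3f}  "
          f"sd of estimate={estimates.std():.3f}")

# Typical output: the sd of the estimate drops from about 0.32 to about 0.03,
# but the mean estimate stays near 0.3, i.e., near the bias rather than the truth.
```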

Regarding your question above: First, I doubt the study “is purely capturing noise.” There’s lots of variation out there, coming from many sources. My concern is not that these researchers are studying pure noise; rather, my concern is that the effects they’re studying are highly variable and context-dependent, and all this variation will make it hard to find any consistent patterns.

Also, in statistics we often talk about estimating the average treatment effect, but if the treatment effect depends on context, then there’s no universally defined average to be estimated.
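Here’s a toy illustration of that (the contexts and effect sizes are made up): with two contexts whose effects have opposite signs, the “average effect” you estimate tracks whatever mix of contexts happens to be in your sample, no matter how large the sample is.

```python
# Sketch with hypothetical numbers: when the effect differs by context,
# the estimated "average treatment effect" depends on the mix of contexts
# in the sample, even at very large N.
import numpy as np

rng = np.random.default_rng(1)
effects = {"context A": 0.5, "context B": -0.5}   # invented, opposite-signed effects

for share_a in (0.8, 0.5, 0.2):   # fraction of the sample coming from context A
    n = 100_000
    n_a = int(n * share_a)
    y = np.concatenate([
        effects["context A"] + rng.normal(0, 1, n_a),
        effects["context B"] + rng.normal(0, 1, n - n_a),
    ])
    print(f"{share_a:.0%} from context A -> estimated average effect = {y.mean():+.3f}")

# Even with N = 100,000 the estimate swings from about +0.3 to about -0.3
# as the mix of contexts changes: there is no single average to converge to.
```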

Finally, you write, “they’d also need only 20 experiments to get a false positive from noise alone.” Sure, but I don’t think anybody runs 20 experiments and publishes just the one that happened to come out statistically significant. What you should really do is publish all 20 experiments, or, better still, analyze the data from all 20 together. But, again, if your measurements are too variable, it won’t matter anyway.
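To put rough numbers on that (a toy simulation of my own, not the actual chocolate study): with pure noise and a 0.05 threshold, roughly 1 experiment in 20 comes up “significant” whether each experiment has 10 people or 1000, and analyzing all 20 experiments together keeps the false-positive rate near the nominal 5%, instead of the much higher rate you get by reporting only the one that “worked.”

```python
# Assumed setup, not the JAMA study: 20 experiments measuring a pure-noise effect
# at the 0.05 threshold. Cherry-picking any one "significant" experiment inflates
# the false-positive rate; pooling all 20 keeps it near 5%, regardless of N.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_experiments = 2_000, 20

for n_per_experiment in (10, 1000):
    any_hit = pooled_hit = 0
    for _ in range(n_sims):
        data = rng.normal(0, 1, size=(n_experiments, n_per_experiment))  # noise only
        pvals = stats.ttest_1samp(data, 0, axis=1).pvalue       # one p-value per experiment
        any_hit += pvals.min() < 0.05                            # report only the "best" one
        pooled_hit += stats.ttest_1samp(data.ravel(), 0).pvalue < 0.05  # all 20 together
    print(f"N per experiment = {n_per_experiment}")
    print(f"  P(at least one of 20 is 'significant') ~ {any_hit / n_sims:.2f}")    # ~0.64
    print(f"  P(pooled analysis is 'significant')    ~ {pooled_hit / n_sims:.2f}")  # ~0.05
```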