Is “abandon statistical significance” like organically fed, free-range chicken?

The question: is good statistics scalable?

This comes up a lot in discussions on abandoning statistical significance, null-hypothesis significance testing, p-value thresholding, etc. I recommend accepting uncertainty, but what if it’s decision time—what to do?

How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically fed, free-range chicken. This might be a good idea for any of us individually or in small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)

Even if you agree with me that null-hypothesis significance testing is almost always a bad idea, that it would be better to accept uncertainty and propagate it through our decision-making process rather than collapsing the wavefunction with every little experiment, even if you agree that current practices of reporting statistically significant comparisons as real and non-significant comparisons as zero are harmful and impede our scientific understanding, even if you’d rather use prior information in the steps of inference and reporting of results, even if you don’t believe in ESP, himmicanes, ages ending in 9, embodied cognition, and all the other silly and unreplicated results that were originally sold on the basis of statistical significance, even if you don’t think it’s correct to say that stents don’t work just because p was 0.20, even if . . . etc. . . . even if all that, you might still feel that our proposal to abandon statistical significance is unrealistic.
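To make that “significant comparisons as real, non-significant comparisons as zero” point concrete, here’s a toy simulation (the numbers are invented, not from any real study): when the true effect is small relative to the noise, the estimates that happen to clear the p < 0.05 bar systematically exaggerate the effect and sometimes even get its sign wrong.

```python
import numpy as np

# Toy simulation with made-up numbers: a small true effect measured with
# lots of noise, replicated many times.
rng = np.random.default_rng(0)
true_effect, se, n_sims = 2.0, 10.0, 100_000

estimates = rng.normal(true_effect, se, n_sims)  # noisy estimates of the effect
significant = np.abs(estimates / se) > 1.96      # the p < 0.05 screen

print("share of studies reaching significance:", significant.mean())
print("true effect:", true_effect)
print("mean |estimate| among significant studies:",
      np.abs(estimates[significant]).mean())
print("share of significant estimates with the wrong sign:",
      (estimates[significant] < 0).mean())
```

The significance filter keeps only the most extreme noise, which is exactly the sense in which thresholding impedes our scientific understanding rather than protecting it.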

Sure, sure, you might say, if researchers have the luxury to propagate their uncertainty, fine, but what if they need to make a decision right now about what ideas to pursue next? Sure, sure, null hypothesis significance testing is a joke, and Psychological Science has published a lot of bad papers, but journals have to do something, they need some rule, right? And there aren’t enough statisticians out there to carefully evaluate each claim. It’s not like every paper sent to a psychology journal can be sent to Uri Simonsohn, Greg Francis, etc., for review.

So, the argument goes, yes, there’s a place for context-appropriate statistical inference and decision making, but such analyses have to be done one at a time. Artisanal statistics may be something for all researchers to aspire to, but in the here and now they need effective, mass-produced tools, and p-values and statistical significance are what we’ve got.

My response

McShane, Gal, Robert, Tackett, and I wrote:

One might object here and call our position naive: do not editors and reviewers require some bright-line threshold to decide whether the data supporting a claim is far enough from pure noise to support publication? Do not statistical thresholds provide objective standards for what constitutes evidence, and does this not in turn provide a valuable brake on the subjectivity and personal biases of editors and reviewers?

We responded to this concern in two ways.

First:

Even were such a threshold needed, it would not make sense to set it based on the p-value given that it seldom makes sense to calibrate evidence as a function of this statistic and given that the costs and benefits of publishing noisy results vary by field. Additionally, the p-value is not a purely objective standard: different model specifications and statistical tests for the same data and null hypothesis yield different p-values; to complicate matters further, many subjective decisions regarding data protocols and analysis procedures such as coding and exclusion are required in practice and these often strongly impact the p-value ultimately reported.
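As an aside, before getting to the second point: here’s a toy illustration of that “not purely objective” claim (made-up data, my own sketch, not from the paper). The same two samples and the same null hypothesis of no difference between groups give different p-values depending on which standard test one happens to run.

```python
import numpy as np
from scipy import stats

# Made-up data: two small groups with different spreads.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 20)
b = rng.normal(0.5, 2.0, 20)

# Same data, same null hypothesis of "no difference," three reasonable tests:
print("pooled-variance t-test:", stats.ttest_ind(a, b, equal_var=True).pvalue)
print("Welch t-test:          ", stats.ttest_ind(a, b, equal_var=False).pvalue)
print("Mann-Whitney U test:   ", stats.mannwhitneyu(a, b).pvalue)
```

None of these choices is wrong on its face, and that’s the point: the reported p-value already reflects a chain of subjective analysis decisions.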

Second:

We fail to see why such a threshold screening rule is needed: editors and reviewers already make publication decisions one at a time based on qualitative factors, and this could continue to happen if the p-value were demoted from its threshold screening rule to just one among many pieces of evidence.

To say it again:

Journals, regulatory agencies, and other decision-making bodies already use qualitative processes to make their decisions. Journals are already evaluating papers one at a time using a labor-intensive process. I don’t see that removing a de facto “p less than 0.05” rule would make this process any more difficult.