“The Generalizability Crisis” in the human sciences

In an article called The Generalizability Crisis, Tal Yarkoni writes:

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned—that is, that the two must refer to roughly the same set of hypothetical observations. Here I argue that most inferential statistical tests in psychology fail to meet this basic condition. I demonstrate how foundational assumptions of the “random effects” model used pervasively in psychology impose far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints dramatically inflates false positive rates and routinely leads researchers to draw sweeping verbal generalizations that lack any meaningful connection to the statistical quantities they are putatively based on. I argue that failure to consider generalizability from a statistical perspective lies at the root of many of psychology’s ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

I pretty much agree 100% with everything he writes in this article. These are issues we’ve been talking about for a while, and Yarkoni offers a clear and coherent perspective. I only have two comments, and these are more a matter of emphasis than anything else.

1. Near the beginning of the article, Yarkoni writes of two ways of drawing scientific conclusions from statistical evidence:

The “fast” approach is liberal and incautious; it makes the default assumption that every observation can be safely generalized to other similar-seeming situations until such time as those generalizations are contradicted by new evidence. . . .

The “slow” approach is conservative, and adheres to the opposite default: an observed relationship is assumed to hold only in situations identical, or very similar to, the one in which it has already been observed. . . .

Yarkoni goes on to say that in modern psychology it is standard to use the fast approach, and that the fast approach gets attention and rewards, but that in general the fast approach is wrong: instead, we should be using the fast approach to generate conjectures and the slow approach when trying to understand what we know.

I agree, and I also agree with Yarkoni’s technical argument that the slow approach corresponds to a multilevel model in which there are varying intercepts and slopes corresponding to experimental conditions, populations, etc. That is, if we are fitting the model y = a + b*x + error to data (x_i, y_i), i=1,…,n, we should think of this entire experiment as study j, with the model y = a_j + b_j*x + error, and different a_j, b_j for each potential study. To put it another way, a_j and b_j can be considered as functions of the experimental conditions and the mix of people in the experiment.

Or, to put it another way, we have an implicit multilevel model with predictors x at the individual level and other predictors at the group level that are implicit in the model for a, b. And we should be thinking about this multilevel model even when we only have data from a single experiment.
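Here is a minimal simulation sketch of that implicit multilevel model, using only the Python standard library. Everything numerical in it is an assumption made for illustration and not something from Yarkoni's paper: the normal distributions for a_j and b_j, the values of `mu_b`, `tau_b`, and `sigma`, and the function names. It compares how much the estimated slope moves from study to study when b_j varies across studies versus when it is held constant.

```python
import random
import statistics


def simulate_study(a_j, b_j, n, sigma, rng):
    """Draw one study's data from y = a_j + b_j*x + error."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [a_j + b_j * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single study."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_across_studies(n_studies, n, mu_b, tau_b, sigma, seed=0):
    """Estimated slope from each of n_studies hypothetical studies.

    Assumption: a_j ~ Normal(0, 1) and b_j ~ Normal(mu_b, tau_b);
    tau_b = 0 means the slope is the same in every study.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_studies):
        a_j = rng.gauss(0, 1)
        b_j = rng.gauss(mu_b, tau_b)
        xs, ys = simulate_study(a_j, b_j, n, sigma, rng)
        slopes.append(ols_slope(xs, ys))
    return slopes


varying = slopes_across_studies(200, 500, mu_b=0.5, tau_b=0.5, sigma=1.0)
constant = slopes_across_studies(200, 500, mu_b=0.5, tau_b=0.0, sigma=1.0)

sd_varying = statistics.stdev(varying)
sd_constant = statistics.stdev(constant)
print(f"sd of slope estimates, b_j varying:  {sd_varying:.2f}")
print(f"sd of slope estimates, b_j constant: {sd_constant:.2f}")
```

With these made-up settings, the spread of slope estimates across studies is much larger when b_j varies than when it is constant. The within-study standard error only reflects the second kind of spread, which is why a single experiment can look far more conclusive than it is about what would happen under other conditions.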

This is all related to the argument I’ve been making for a while about “transportability” in inference, which in turn is related to an argument that Rubin and others have been making for decades about thinking of meta-analysis in terms of response surfaces.

To put it another way, all replications are conceptual replications.

So, yeah, these ideas have been around for a while. On the other hand, as Yarkoni notes, standard practice is to not think about these issues at all and to just make absurdly general claims from absurdly specific experiments. Sometimes it seems that the only thing that makes researchers aware of the “slow” approach is when someone fails to replicate one of their studies, at which point the authors suddenly remember all the conditions on generality that they somehow forgot to mention in their originally published work. (See here for an extreme case that really irritated me.) So Yarkoni’s paper could be serving a useful role even if all it did was remind us of the challenges of generalization. But the paper does more than that, in that it links this statistical idea with many different aspects of practice in psychology research.

That all said, there’s one way in which I disagree with Yarkoni’s characterization of scientific inferences as “fast” or “slow.” I agree with him that the “fast” approach is mistaken. But I think that even his “slow” approach can be too strong!

Here’s my concern. Yarkoni writes, “The ‘slow’ approach is conservative, and adheres to the opposite default: an observed relationship is assumed to hold only in situations identical, or very similar to, the one in which it has already been observed.”

But my problem is that, in many cases, I don’t even think the observed relationship holds in the situations in which it has been observed.

To put it more statistically: Claims in the sample do not necessarily generalize to the population. Or, to put it another way, correlation does not even imply correlation.

Here’s a simple example: I go to the store, buy a die, roll it 10 times and get 3 sixes, and I conclude that the probability of getting a six from this die is 0.3. That’s a bad inference! The result from 10 die rolls gives me just about no useful information about the probability of rolling a six.
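To put numbers on the die example, here is a small sketch using the binomial distribution. The helper name `prob_at_least` is mine, and the calculation assumes independent rolls; it asks how often a perfectly fair die would give 3 or more sixes in 10 rolls, and how uncertain the naive estimate of 0.3 is.

```python
from math import comb, sqrt


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_rolls, n_sixes = 10, 3
p_fair = 1 / 6

# Naive estimate from the data, with its binomial standard error.
p_hat = n_sixes / n_rolls
se_hat = sqrt(p_hat * (1 - p_hat) / n_rolls)

# How often a fair die produces a result at least this extreme.
p_three_or_more = prob_at_least(n_sixes, n_rolls, p_fair)

print(f"estimate: {p_hat:.2f}, standard error: {se_hat:.2f}")
print(f"P(3 or more sixes | fair die): {p_three_or_more:.2f}")
```

A fair die gives 3 or more sixes in 10 rolls about 22% of the time, and the standard error on the estimate of 0.3 is about 0.14, so the data are entirely consistent with an ordinary die.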

Here’s another example, just as bad but not so obviously bad: I find a survey of 3000 parents, and among those people, the rate of girl births was 8% higher among the most attractive parents than among the other parents, and I conclude that more attractive parents are more likely to have girls. That’s a bad inference! The result from 3000 births gives me just about no useful information about the probability of a girl birth.
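A rough sketch of how noisy that comparison is. The survey's actual group sizes are not given here, so the split of 300 "most attractive" parents versus 2700 others is a purely hypothetical assumption, as is the round-number baseline of 0.5 for the proportion of girl births; the function name is also mine.

```python
from math import sqrt


def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical split of the 3000 parents (NOT taken from the survey).
n_attractive = 300
n_other = 2700

# Round-number baseline proportion of girl births, assumed for illustration.
p_girl = 0.5

se = se_diff_proportions(p_girl, n_attractive, p_girl, n_other)
print(f"standard error of the difference: {se:.3f}")
```

Under that assumed split, the standard error of the difference is about 3 percentage points. A comparison this noisy cannot tell a difference of a few percentage points apart from no difference at all, and a more lopsided split would make it noisier still.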

So, in those examples, even a “slow” inference (e.g., “This particular die is biased,” or “More attractive parents from the United States in this particular year are more likely to have girls”) is incorrect.

This point doesn’t invalidate any of Yarkoni’s article; I’m just bringing it up because I’ve sometimes seen a tendency in open-science discourse for people to give too much of the benefit of the doubt to bad science. I remember this with that ESP paper from 2011: people would say that this paper wasn’t so bad, it just demonstrated general problems in science. Or they’d accept that the experiments in the paper offered strong evidence for ESP, it was just that the evidence overwhelmed their prior. But no, the ESP paper was bad science, and it didn’t offer strong evidence. (Yes, that’s just my opinion. You can have your own opinion, and I think it’s fine if people want to argue (mistakenly, in my view) that the ESP studies are high-quality science. My point is that if you want to argue that, argue it, but don’t take that position by default.)

That was my point when I argued against over-politeness in scientific discourse. The point is not to be rude to people. We can be as polite as we want to individual people. The point is that there are costs, serious costs, to being overly polite to scientific claims. Every time you “bend over backward” to give the benefit of the doubt to scientific claim A, you’re rigging things against the claim not-A. And, in doing so, you could be doing your part to lead science astray (if the claims A and not-A are of scientific importance) or to hurt people (if the claims A and not-A have applied impact). And by “hurt people,” I’m not talking about authors of published papers, or even about hardworking researchers who didn’t get papers published because they couldn’t compete with the fluff that gets published by PNAS etc.; I’m talking about the potential consumers of this research.

Here I’m echoing the points made by Alexey Guzey in his recent post on sleep research. I do not believe in giving a claim the benefit of the doubt, just cos it’s published in a big-name journal or by a big-name professor.

In retrospect, instead of saying “Against politeness,” I should’ve said “Against deference.”

Anyway, I don’t think Yarkoni’s article is too deferential to dodgy published claims. I just wanted to emphasize that even his proposed “slow” approach to inference can let a bunch of iffy claims sneak in.

Later on, Yarkoni writes:

Researchers must be willing to look critically at previous studies and flatly reject—on logical and statistical, rather than empirical, grounds—assertions that were never supported by the data in the first place, even under the most charitable methodological assumptions.

I agree. Or, to put it slightly more carefully, we don’t have to reject the scientific claim; rather, we have to reject the claim that the experimental data at hand provide strong evidence for the attached scientific claim (rather than merely evidence consistent with the claim). Recall the distinction between truth and evidence.

Yarkoni also writes:

The mere fact that a previous study has had a large influence on the literature is not a sufficient reason to expend additional resources on replication. On the contrary, the recent movement to replicate influential studies using more robust methods risks making the situation worse, because in cases where such efforts superficially “succeed” (in the sense that they obtain a statistical result congruent with the original), researchers then often draw the incorrect conclusion that the new data corroborate the original claim . . . when in fact the original claim was never supported by the data in the first place.

I agree. This is the sort of impoliteness, or lack of deference, that I think is valuable going forward.

Or, conversely, if we want to be polite and deferential to embodied cognition and himmicanes and air rage and ESP and ages ending in 9 and the critical positivity ratio and all the rest . . . then let’s be just as polite and deferential to all the zillions of unpublished preprints, all the papers that didn’t get into JPSP and Psychological Science and PNAS, etc. Vaccine denial, N rays, spoon bending, whatever. The whole deal. But that way lies madness.

Let me again yield the floor to Yarkoni:

There is an unfortunate cultural norm within psychology (and, to be fair, many other fields) to demand that every research contribution end on a wholly positive or “constructive” note. This is an indefensible expectation that I won’t bother to indulge.

Thank you. I thank Yarkoni for his directness, as earlier I’ve thanked Alexey Guzey, Carol Nickerson, and others for expressing negative attitudes that are sometimes socially shunned.

2. I recommend that Yarkoni avoid the use of the terms fixed and random effects as this could confuse people. He uses “fixed” to imply non-varying, which makes a lot of sense, but in economics they use “fixed” to imply unmodeled. In the notation of this 2005 post, he’s using definition 1, and economists are using definition 5. The funny thing is that everyone who uses these terms thinks they’re being clear. But the terms have different meanings for different people. Later on page 7 Yarkoni alludes to definitions 2 and 3. The whole fixed and random thing is a mess.

Conclusion

Let me conclude with the list of recommendations with which Yarkoni concludes:

Draw more conservative inferences

Take descriptive research more seriously

Fit more expansive statistical models

Design with variation in mind

Emphasize variance estimates

Make riskier predictions

Focus on practical predictive utility

I agree. These issues come up not just in psychology but also in political science, pharmacology, and I’m sure lots of other fields as well.