Their findings don’t replicate, but they refuse to admit they might’ve messed up. (We’ve seen this movie before.)

Ricardo Vieira writes:

I have been reading the replication efforts by the datacolada team (in particular Leif Nelson and Joe Simmons). You have already mentioned some of their work here and here.

They have just published the #7 installment in the series, and I felt it was a good time to summarize the results for myself, especially to test whether my gut feelings were actually in line with reality. I thought you and your readers might like to discuss it.

Most of these articles describe a spectacular failure to replicate the original effects (5-6 out of 7 studies), even with much larger sample sizes. However, what I find most interesting are not the results, but the replies from the original authors when asked to comment on the replication results.

The most obvious point is that nobody seems willing to say: ‘the proposed effect is probably not real’, even in the face of some overwhelming evidence from the replication studies.

Although most authors seem to agree (in a very measured manner) with the replication results, a substantial portion (4 out of 7) emphasize that they are happy/glad to see a trend (or subtrend) in the same direction as the original study, even when this is clearly not significant (either statistically or practically). A partially overlapping portion (4 out of 7) state a strong belief that their effect is real.

I think the public nature of this work, and the fact that it is released slowly, pose an interesting dilemma for the authors of future failed replications. Will they start to acknowledge the strong possibility that their effects are probably as illusory as the n-1 (give or take) that came before? Or will they increasingly believe that they are the exception to the norm? A third hypothesis is that the authors involved will interpret the past collection of replications as generally more positive than outsiders like me, whose work is not on the line, do.

A fourth hypothesis is that the results are actually much more positive than I think they are. To keep myself in check, I compiled a short summary of the replication results and authors’ replies, which hopefully is not too detached from the source (I welcome any corrections to my classification). Here it is:

Effect: “Consumers’ views of scenery from a high physical elevation induce an illusory source of control, which in turn intensifies risk taking”
Replication: Failed
Reply: Accepts results
Explanation: Things have changed

Effect: “Independent consumers who choose on behalf of large groups tend to choose more selfishly, whereas everyone else tends to choose less selfishly”
Replication: Succeeded but then failed (after removing a typo)
Reply: Effect is real. Happy for directional trend
Explanation: Study conducted in a different time of the year

Effect: “Consumers with ‘low self-concept clarity’ are more motivated to keep their identities stable by (1) retaining products that are relevant to their identities, and (2) choosing not to acquire new products that are relevant to their identities”
Replication: Failed
Reply: Happy for directional trend
Explanation: Treatment may have been less effective. Decline in MTurk quality

Effect: “Consumers induced to feel curious are more likely to choose the indulgent options (gym)”
Replication: Succeeded but then failed (after removing a confound)
Reply: Effect is real. Intrigued by suggested confound

Effect: “Consumers are more likely to use a holistic process (vs. an attribute-by-attribute comparison) to judge anthropomorphized products”
Replication: Failed
Reply: Happy for directional trend
Explanation: Things have changed. Decline in MTurk quality

Effect: “Scarcity decreases consumers’ tendency to use price to judge product quality”
Replication: Inconclusive (differential attrition)
Reply: Effect is real. No problem in original data

Effect: “Presenting multiple product replicates as a group (vs. presenting a single item) increases product efficacy perceptions because it leads consumers to perceive products as more homogenous and unified around a shared goal”
Replication: Failed
Reply: Effect is real. Happy for directional trend
Explanation: Things have changed

Along similar lines, Fritz Strack writes:

Attached is a recent paper by Fabrigar, Wegener & Petty (2020) that discusses the “replication crisis” in psychology within the framework of different types of validity (Cook & Campbell, 1979). As it is very critical of the current movement focusing myopically on the statistical variant, I thought you might be interested in commenting on this publication.

My reaction to all this:

1. I’m impressed at how much effort Nelson and Simmons put into each of these replications. They didn’t just push buttons; they looked at each study in detail. This is a lot of work, and they deserve credit for it.

Some leaders of the academic establishment have said that people who do original studies deserve more credit than mere critics, as an original study requires creativity and can advance science, whereas a criticism is at best a footnote on existing work. But I disagree with that stance. Or, I should say, it depends on the original study and it depends on the criticism. Some original studies do advance science, while others are empty cargo-cult exercises that at best waste people’s time and at worst can send entire subfields into blind alleys, as well as burning up millions of dollars and promoting a sort of quick-fix TED-talk thinking that can distract from real efforts to solve important problems. From the other direction, some critical work is thoughtless formal replication that sidesteps the scientific questions at hand, but other efforts—such as those of Nelson and Simmons linked above—are deeply engaged.

Remember Jordan Anaya’s statement, “I know Wansink’s work better than he does, it’s depressing really.” That’s a not uncommon experience we have when doing science criticism and replication: The original study “worked,” so nobody looked at it very carefully. It’s stunning how many mistakes and un-thought-through decisions can be sitting in published papers. I know—it’s true of some of my published work too.

In the case of the datacolada posts, I’ve not read most of the original articles or the blogs, so I’ll refrain from commenting on the details. But just in general terms, I’ve seen lots of examples where a scientific criticism has more value than the work being criticized.

Sometimes. Not always. And often it’s debatable. For example, is Alexey Guzey’s criticism of Why We Sleep more valuable than Matthew Walker’s book? I don’t know. I really don’t. Yes, Walker makes errors and misrepresents data, and Guzey is contributing a lot by tracking down these details. Any future researcher wanting to follow up on Walker’s work should definitely read Guzey before going on, just to get a sense of what the evidence really is. On the other hand, Walker put together lots of things in one place, and, even though his book is fatally flawed, it arguably is still making an important contribution. Sleep—unlike beauty-and-sex ratio, ovulation and voting, embodied cognition, himmicanes, etc.—is an important topic, and even though Why We Sleep misfires on many occasions, it may be making a real contribution to our understanding.

Anyway, I don’t know that the datacolada work will get “enough credit,” whatever that means, but in any case I appreciate it, and I say that even though I have at times expressed annoyance at their blogging style.

2. The big thing is that I agree with Vieira. At the very least, researchers should admit the possibility that they might have been mistaken in their interpretation of earlier results.

Look at it this way. Sometimes—many times—researchers go into a project strongly believing that their substantive hypothesis is true. In that case, fine, do a small between-person study and it’s very unlikely that the results will actually contradict your hypothesis. The mistake in the original paper is subtle: it’s the claim of strong evidence when there is no strong evidence. Then when the replication finds no strong evidence, the researchers remain where they started, believing in their original hypothesis. It’s hard for them to pinpoint what they did wrong, because they haven’t been thinking about the distinction between evidence and truth. From their point of view, they’ve broken some arbitrary rule—they’ve “p-hacked,” which is about as silly as the other arbitrary rule of “p less than 0.05” that they had to follow earlier. They see methodologists as like cops (or, as our nudgelords would say, Stasi), and they care less about silly statistical rules and more about real science.
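To make the “small study rarely contradicts your hypothesis” point concrete, here’s a minimal simulation sketch (mine, not from the post; the true effect size and per-group sample size are illustrative assumptions). With a small true effect and a small sample, statistically significant results in the wrong direction are rare, but the significant estimates that do appear wildly exaggerate the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1   # small true effect in SD units (assumption for illustration)
n = 20              # per-group sample size of a small between-person study
sims = 10_000

contradict = 0      # significant results in the *wrong* direction
sig_effects = []    # estimated effects among "significant" studies

for _ in range(sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if abs(diff / se) > 1.96:           # "p < 0.05", two-sided z-test
        sig_effects.append(diff)
        if diff < 0:
            contradict += 1

print(f"power: {len(sig_effects) / sims:.2f}")
print(f"share of all studies significant in the wrong direction: {contradict / sims:.3f}")
print(f"mean |estimate| among significant results: {np.mean(np.abs(sig_effects)):.2f} "
      f"(true effect {true_effect})")
```

Under these assumptions the mean significant estimate is several times the true effect, which is Gelman’s type-M (magnitude) error: selecting on significance all but guarantees exaggeration.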

Other times researchers are surprised by their results. The data go against the researchers’ initial hypothesis. In this case, learning that the data analysis was flawed and that the result doesn’t replicate should cause a rethink. But typically it doesn’t, I think because researchers are all too good at taking an unexpected result and convincing themselves that it makes perfect sense.

This is a particularly insidious chain of reasoning: Data purportedly provide evidence supporting a scientific theory. But the finding doesn’t replicate and the data analysis was flawed. No worries: at this point the theory has already been established as truth. “You have no choice but to accept,” etc.

The theory has climbed a ladder into acceptance. The ladder’s kicked away, but the theory’s still there.

3. The Fabrigar et al. paper seems fine in a general sense, but I don’t think they wrestle enough with the idea that effects and comparisons are much smaller and less consistent than traditionally imagined by many social researchers. To bring up some old examples, it’s a mistake to come into the analysis of an experiment with the expectation that women are 20 percentage points more likely to vote for Barack Obama during a certain time of the month, or that a small intervention on four-year-olds will increase later adult income by 40%. Statistics-based science is quantitative. Effect sizes matter.
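To see why effect sizes matter here, consider a quick power calculation (a sketch using a standard normal approximation; the proportions and sample size are illustrative assumptions, not from any of the studies above). A fanciful 20-percentage-point swing is easy to detect at modest sample sizes, while a realistic 2-point swing leaves power barely above the false-positive rate:

```python
from math import sqrt
from statistics import NormalDist

def power_two_prop(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sample test of proportions (normal approximation)."""
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p1 - p2) / se
    # probability of rejecting in either direction
    return (1 - NormalDist().cdf(z_crit - z_effect)) + NormalDist().cdf(-z_crit - z_effect)

# Implausibly large effect (50% vs. 70%) vs. a realistic one (50% vs. 52%),
# each with 100 respondents per group:
print(f"power for a 20-point effect: {power_two_prop(0.50, 0.70, 100):.2f}")
print(f"power for a 2-point effect:  {power_two_prop(0.50, 0.52, 100):.2f}")
```

If researchers design and interpret studies as though the 20-point effect were plausible, nearly every “significant” result they see for a true 2-point effect will be a fluke or a gross overestimate.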