Let’s do preregistered replication studies of the cognitive effects of air pollution—not because we think existing studies are bad, but because we think the topic is important and we want to understand it better.

In the replication crisis in science, replications have often targeted controversial studies on silly topics such as embodied cognition, extra-sensory perception, and power pose.

We’ve been talking recently about replication being something we do for high-quality studies on important topics. That is, the point of replication is not the hopeless endeavor of convincing ESP scholars etc. that they’re barking up the wrong tree, but rather to learn about something we really care about.

With that in mind, I suggest that researchers perform some careful preregistered replications of some studies of air pollution and cognitive function.

I thought about this after receiving the following in an email from Russ Roberts:

You may have seen this. It is getting a huge amount of play with people amazed and scared by how big the effects are on cognition.

THIRTY subjects. Most results not significant.

If you feel inspired to write on it, please do…

What I found surprising was how many smart people I know have been taken by how large the effects are. Mostly economists. I have now become sensitized to be highly skeptical of these kinds of findings. (Perhaps too skeptical, but put that to the side…)

Patrick Collison, not an economist, but CEO of Stripe and a very smart person, posted this, which has been picked up and spread by The Browser which has a wide and thoughtful audience and by Collison on twitter. Collison’s piece is a brief list of other studies that “confirm” the cognitive losses due to air pollution.

My general reaction (using one of the studies) is that if umpires do a dramatically worse job on high pollution days because their brains are muddled by pollution, there must have been a massive (and noticeable) improvement in accuracy over the last 40 years as particulate matter has fallen in the US. Same with chess players—another study—there should be many more grandmasters and the quality of chess play overall in the US should be dramatically improved.

The big picture

There are some places where I agree with Roberts and some places where I disagree. I’ll go through all this in a moment, but first I want to set out the larger challenges that we face in this sort of problem.

I agree on the general point that we should be skeptical of large claims. A difficulty here is that the claims come in so fast: There’s a large industry of academic research producing millions of scientific papers a year, and on the other side there are about 5 of us who might occasionally look at a paper critically. A complicating factor here is that some of these papers are good, some of the bad papers can have useful data, and even the really bad papers have some evidence pointing in the direction of their hypotheses. So the practice of reading the cited papers is just not scalable.

Even in the above little example, Collison links to 9 articles, and it’s not like I have time to read all of them. I skimmed through the first one (The Impact of Indoor Climate on Human Cognition: Evidence from Chess Tournaments, by Steffen Künn, Juan Palacios, and Nico Pestel) and it seemed reasonable to me.

Speaking generally, another challenge is that if we see serious problems with a paper (as with the first article Roberts sent, discussed at the end of this post), we can set it aside. The underlying effect might be real, but that particular study provides no evidence. But when a paper seems reasonable (as with the article on chess performance), it could just be that we haven’t noticed the problems yet. Recall that the editors of JPSP didn’t see the huge (in retrospect) problems with Bem’s ESP study, and recall that Arthur Conan Doyle didn’t realize that the Cottingley fairy photos were faked.

To get back to Roberts’s concerns: I have no idea what the effects of air pollution on cognitive function are. I really just don’t know what to think. I guess the way that researchers are moving forward on this is to look at various intermediate outcomes such as blood flow to the brain.

To step back: on one hand, the theory here seems plausible; on the other hand, I know about all the social and statistical reasons why we should expect effect size estimates to be biased upward. There’s a naive view that associates large type S and type M errors with crappy science of the Wansink variety, but even carefully reviewed studies published in top journals by respected researchers have these problems.
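
To make the point concrete, here is a minimal simulation in the spirit of the design calculations that John Carlin and I have recommended. All numbers are hypothetical: suppose a short pollution exposure truly shifts a test score by 1 point, but the study’s estimate has a standard error of 4 points. A type S error is getting the sign wrong conditional on statistical significance; a type M error is the factor by which a significant estimate exaggerates the true effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical design: true effect of 1 point, standard error of 4 points
# (small samples and noisy cognitive tests).
true_effect, se = 1.0, 4.0

est = rng.normal(true_effect, se, size=100_000)  # sampling distribution of estimates
sig = np.abs(est) > stats.norm.ppf(0.975) * se   # estimates reaching p < 0.05

print(f"power: {sig.mean():.2f}")                          # about 0.06
print(f"type S error rate: {(est[sig] < 0).mean():.2f}")   # about 0.24
print(f"type M exaggeration: {np.abs(est[sig]).mean():.1f}x")  # about 9x
```

Under these assumptions, the rare “significant” estimate has the wrong sign about a quarter of the time and overstates the true effect roughly ninefold, and none of this requires any p-hacking or bad faith.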

Preregistered replication to the rescue

So we’re at an impasse. Plausible theories, some solid research articles with clear conclusions, but this is all happening in a system with systematic biases.

This is where careful preregistered replication studies can come in. The point of such studies is not to say that the originally published findings “replicated” or “didn’t replicate,” but rather to provide new estimates that we can use, following the time-reversal heuristic.
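
To be concrete about “new estimates that we can use”: one simple option, and certainly not the only one, is to pool the original and replication estimates with inverse-variance weights, as in a fixed-effect meta-analysis. The numbers here are invented for illustration:

```python
# Invented numbers: original study and preregistered replication,
# each summarized by an estimate and a standard error.
orig_est, orig_se = 8.0, 3.5   # published estimate, plausibly inflated by selection
rep_est, rep_se = 1.5, 2.0     # replication estimate

# Fixed-effect (inverse-variance) pooling.
w_orig, w_rep = 1 / orig_se**2, 1 / rep_se**2
pooled_est = (w_orig * orig_est + w_rep * rep_est) / (w_orig + w_rep)
pooled_se = (w_orig + w_rep) ** -0.5

print(f"pooled estimate: {pooled_est:.1f} (se {pooled_se:.1f})")  # 3.1 (se 1.7)
```

If you think the original estimate is inflated by selection on statistical significance, you can down-weight it or lean on the replication alone; the point is that the replication yields an estimate, not just a binary replicated/not-replicated verdict.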

Again, the choice to perform the replication should be considered as a sign of respect for the original studies: that they are high enough quality, and on an important enough topic, to motivate the cost and effort of a replication.

Getting into the details

1. I agree with Roberts that the first study he links to has serious problems. I’ll discuss these below the fold, but the short story is that I see no reason to believe any of it. I mean, sure, the substantive claims might be true, but if the estimates in the article are correct, it’s really just by accident. I can’t see the empirical analysis adding anything to our understanding. It’s not as bad as that beauty-and-sex-ratio study which, for reasons of statistical power, was doomed from the start—but given what’s reported in the published paper, the data are too noisy to be useful.

2. As noted above, I looked quickly at the first paper on Collison’s list and I saw no obvious problems. Sure, the evidence is only statistical—but we sometimes can learn from statistical evidence. For reasons of scalability (see above discussion), I did not read the other articles on the list.

3. I’d like to push against a couple of Roberts’s arguments. Roberts writes:

If umpires do a dramatically worse job on high pollution days because their brains are muddled by pollution, there must have been a massive (and noticeable) improvement in accuracy over the last 40 years as particulate matter has fallen in the US.

Actually, I expect that baseball umpires have been getting much more accurate over the past 40 years, indeed over the past century. In this case, though, I’d think that economics (baseball decisions are worth more money), sociology (the increasing professionalization of all aspects of sports), and technology (umpires’ mistakes are clear on TV) would all push in that direction. I’d guess that air pollution is minor compared to these large social effects. In addition, the findings of these studies are relative, comparing people on days with more or less pollution. A rise or decline in the overall level of pollution, that’s different: it’s perfectly plausible that umps do worse on polluted days than on clear days because their bodies are reacting to an unexpected level of strain, and the same effect would not arise from higher pollution levels every day.

Roberts continues:

Same with chess players . . . there should be many more grandmasters and the quality of chess play overall in the US should be dramatically improved.

Again, I think it’s pretty clear that the quality of chess play overall has improved, at least at the top level. But, again, any effects of pollution would seem to be minor compared to social and technological changes.

So I feel that Roberts is throwing around a bit too much free-floating skepticism.

P.S. As promised, here are my comments on the first paper that Roberts linked to, which I do think has problems.

Here’s the abstract:

This paper assesses the effect of short-term exposure to particulate matter (PM) air pollution on human cognitive performance via a double cross over experimental design. Two distinct experiments were performed, both of which exposed subjects to low and high concentrations of PM. Firstly, subjects completed a series of cognitive tests after being exposed to low ambient indoor PM concentrations and elevated PM concentrations generated via candle burning, which is a well-known source of PM. Secondly, a different cohort underwent cognitive tests after being exposed to low ambient indoor PM concentrations and elevated ambient outdoor PM concentrations via commuting on or next to roads. Three tests were used to assess cognitive performance: Mini-Mental State Examination (MMSE), the Stroop Color and Word test, and Ruff 2 & 7 test. The results from the MMSE test showed a statistically robust decline in cognitive function after exposure to both the candle burning and outdoor commuting compared to ambient indoor conditions. The similarity in the results between the two experiments suggests that PM exposure is the cause of the short-term cognitive decline observed in both. The outdoor commuting experiment also showed a statistically significant short-term cognitive decline in automatic detection speed from the Ruff 2 and 7 selective attention test. The other cognitive tests, for both the candle and commuting experiments, showed no statistically significant difference between the high and low PM exposure conditions. The findings from this study are potentially far reaching; they suggest that elevated PM pollution levels significantly affect short term cognition. This implies average human cognitive ability will vary from city to city and country to country as a function of PM air pollution exposure.

And here are the key results:

Also this:

A mean of 41.4 with a standard deviation of 46.1 . . . That implies that much of the time the concentration was pretty low. So let’s look at the result as a function of concentration:

Nothing much going on. And this is their best result—none of the other outcomes are even statistically significant!
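
As a check on that “pretty low much of the time” reading: the paper doesn’t report the shape of the exposure distribution, but a lognormal, a standard model for concentration data, matched to the reported mean of 41.4 and standard deviation of 46.1 puts close to half of the exposures below the WHO daily guideline of 25 µg/m³. A quick sketch:

```python
import numpy as np
from scipy import stats

# Reported moments of the PM2.5 exposures (µg/m³); the lognormal shape
# is my assumption, not something given in the paper.
mean, sd = 41.4, 46.1
sigma2 = np.log(1 + (sd / mean) ** 2)   # moment-matching a lognormal
mu = np.log(mean) - sigma2 / 2

dist = stats.lognorm(s=np.sqrt(sigma2), scale=np.exp(mu))
print(f"median exposure: {dist.median():.0f} µg/m³")              # about 28
print(f"share below the WHO guideline (25): {dist.cdf(25):.0%}")  # about 45%
```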

The paper includes a desperate attempt to put a positive spin on things:

There appears to be a tendency for subjects exposed to the highest PM2.5 mass concentrations during the candle burning test to have a greater reduction in test performance after exposure to candle burning, however, a linear regression does not provide a statistically robust gradient value (p = 0.610).

That’s a good one: indeed, I don’t think anyone would call p=0.610 a statistically robust finding!

The paper continues:

To further assess the apparent tendency, we compared the differential T-score to subjects who were exposed to a PM2.5 concentration above or below the daily WHO recommendation (25 µg/m³), see Fig. 3. The average differential T-score was significantly lower when the PM2.5 concentration was greater than the WHO recommendation compared to when it was less than the recommendation. The two distributions, with PM2.5 greater or less than the WHO recommendation, were not normally distributed, as assessed by the Kolmogorov-Smirnov normality test. Hence, the Mann-Whitney test was performed to compare the medians of the two groups’ differential T-scores; the results showed that the p-value not adjusted for ties was 0.045, and adjusted for ties was 0.041. When the PM2.5 concentration was less than the WHO recommendation, the median differential T-score (=50) was significantly higher than the value obtained (=42) when the PM2.5 concentration was greater than the recommendation. This finding suggests that higher exposures to PM2.5 lead to a greater decline in short term cognitive performance. The seemingly non-linear relationship between cognition and PM2.5 concentration, see Fig. 2, suggests a threshold mass concentration of PM2.5 is required before cognitive decline is observed.

Wow. I mean, just wow. I don’t think I’ve ever seen this many forking paths in a single paragraph. Daryl Bem, Brian Wansink, step aside and watch how the pros do it!
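
To see how much these forking paths can buy you, here is a stylized simulation: data with no true effect at all, analyzed with a few of the paths taken in the paper (dichotomize exposure at a threshold, choose between a t-test and a Mann-Whitney test, also try the continuous association, and report whichever p-value comes out best). Sample size and exposure distribution are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Null world: cognitive scores are generated independently of exposure.
n_sims, n, hits = 10_000, 30, 0
for _ in range(n_sims):
    pm25 = rng.lognormal(3.3, 0.9, n)   # PM2.5-like exposures (invented)
    score = rng.normal(50, 10, n)       # T-scores, unrelated to exposure
    hi = pm25 > 25                      # fork 1: dichotomize at a threshold
    if not 1 < hi.sum() < n - 1:        # need a few subjects in each group
        continue
    p = [
        stats.ttest_ind(score[hi], score[~hi]).pvalue,       # fork 2a: t-test
        stats.mannwhitneyu(score[hi], score[~hi],
                           alternative="two-sided").pvalue,  # fork 2b: Mann-Whitney
        stats.pearsonr(pm25, score)[1],                      # fork 3: continuous version
    ]
    hits += min(p) < 0.05               # report whichever analysis "works"

print(f"false-positive rate: {hits / n_sims:.3f}")  # noticeably above the nominal 0.05
```

And this understates the problem, because the real garden includes forks we can’t see: other outcomes, other thresholds, other subgroups.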

Seriously, though, I suppose that the authors of this paper, as with so many other researchers, are trying their best and just have the idea that the goal of quantitative research is to find something, somewhere, that’s “statistically significant”—and then move to story time.

In some sense, though, the paper was a success, as it was featured in the London Times:

Traffic pollution damages commuters’ brains

Going to work can make you stupid, scientists have found, with your brain capacity falling sharply because of exposure to traffic pollution during the daily commute.

The researchers tested people in Birmingham before and after they travelled along busy roads during rush hour, and found their performance in cognitive tests was significantly lower after their journey. . . .

To be fair, I’m guessing that the Science Editor for this newspaper lives in London, so maybe he had a tough commute and his cognitive abilities were diminished at the time he was editing this piece.

Also, Ray Keene writes for the London Times, right? So it’s not like their standards are so damn high.