How to get out of the credulity rut (regression discontinuity edition): Getting beyond whack-a-mole

This one’s buggin me.

We’re in a situation now with forking paths in applied-statistics-being-done-by-economists where we were, about ten years ago, in applied-statistics-being-done-by-psychologists. (I was going to use the terms “econometrics” and “psychometrics” here, but that’s not quite right, because I think these mistakes are mostly being made by applied researchers in economics and psychology, but not so much by actual econometricians and psychometricians.)

It goes like this. There’s a natural experiment, where some people get the treatment or exposure and some people don’t. At this point, you can do an observational study: start by comparing the average outcomes in the treated and control group, then do statistical adjustment for pre-treatment differences between groups. This is all fine. Resulting inferences will be model-dependent, but there’s no way around it. You report your results, recognize your uncertainty, and go forward.

That’s what should happen. Instead, what often happens is that researchers push that big button on their computer labeled REGRESSION DISCONTINUITY ANALYSIS, which does two bad things: First, it points them toward an analysis that focuses obsessively on adjusting for just one pre-treatment variable, often a relatively unimportant variable, while insufficiently adjusting for other differences between treatment and control groups. Second, it leads to an overconfidence born of the slogan “causal identification,” which leads researchers, reviewers, and outsiders to think that the analysis has some special truth value.

What we typically have is a noisy, untrustworthy estimate of a causal effect, presented with little to no sense of the statistical challenges of observational research. And, for the usual “garden of forking paths” reason, the result will typically be “statistically significant,” and, for the usual “statistical significance filter” reason, the resulting estimate will be large and newsworthy.
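To see the significance filter at work in a toy setting, here’s a quick simulation sketch (my own, not from any of the papers discussed): many hypothetical studies of a small true effect measured with a lot of noise, where we look only at the estimates that clear the usual 5% threshold. All the numbers (the true effect, the standard error, the 1.96 cutoff) are arbitrary assumptions for illustration.

```python
# Toy simulation of the "statistical significance filter" (my own sketch):
# a small true effect, noisy measurements, many hypothetical studies.
# All numbers here are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1        # small true effect
se = 1.0                 # standard error, large relative to the effect
n_studies = 100_000

estimates = rng.normal(true_effect, se, size=n_studies)
significant = np.abs(estimates / se) > 1.96   # the usual 5% threshold

print(f"share of studies reaching significance: {significant.mean():.3f}")
print(f"mean estimate, all studies:             {estimates.mean():.2f}")
print(f"mean |estimate|, significant ones only: {np.abs(estimates[significant]).mean():.2f}")
# Conditional on significance, the "findings" are roughly twenty times the
# true effect: exactly the large-and-newsworthy estimates described above.
```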

Then the result appears in the news media, often reported entirely uncritically or with minimal caveats (“while it’s too hasty to draw sweeping conclusions on the basis of one study,” etc.).

And then someone points me with alarm to the news report, and I read the study, and sometimes it’s just fine but often it has the major problems listed above. And then I post something on the study, and sometime between then and six months in the future there is a discussion, where most of the commenters agree with me (selection bias!) and some commenters ask some questions such as, But doesn’t the paper have a robustness study? (Yes, but this doesn’t address the real issues, because all the studies in the robustness analysis are flawed in the same ways as the original study) and, But regression discontinuity analysis is OK, right? (Sometimes, but ultimately you have to think of such problems as observational studies, and all the RD in the world won’t solve your problem if there are systematic differences between treatment and control groups that are not explained by the forcing variable) and, But didn’t they do a placebo control analysis that found no effect? (Yes, but this doesn’t address the concern that the statistically-significant main finding arose from forking paths, and there are forking paths in the choice of placebo study too, also the difference between statistically significant and non-significant is not itself . . . ok, I guess you know where I’m heading here), and so on.

These questions are ok. I mean, it’s a little exhausting seeing them every time, but it’s good practice for me to give the answers.

No, the problem I see is outside this blog, where journalists and, unfortunately, many economists have the inclination to accept these analyses as correct by default.

It’s whack-a-mole. What’s happening is that researchers are using a fundamentally flawed statistical approach, and if you look carefully you’ll find the problems, but the specific problem can look different in each case.

With the air-pollution-in-China example, the warning signs were the fifth-degree polynomial (obviously ridiculous from a numerical analysis perspective—Neumann is spinning in his grave!—but it took us a few years to explain this to the economics profession) and the city with the 91-year life expectancy (which apparently would’ve been 96 years had it been in the control group). With the air-filters-in-schools example, the warning sign was that there was apparently no difference between treatment and control groups in the raw data; the only way that any result could be obtained was through some questionable analysis. With the unions-and-stock-prices example, uh, yeah, just about everything there was bad, but it got some publicity nonetheless because it told a political story that people wanted to hear. Other examples show other problems. But one problem with whack-a-mole is that the mole keeps popping up in different places. For example, if example #1 teaches you to avoid high-degree polynomials, you might think that example #2 is OK because it uses a straight-line adjustment. But it’s not.

So what’s happening is that, first, we get lost in the details and, second, default-credulous economists and economics journalists need to be convinced, each time, of the problems in each particular robustness study, placebo check, etc.

One thing that all those examples have in common is that if you just look at the RD plot straight, removing all econometric ideology, it’s pretty clear that overfitting is going on:

In every case, the discontinuity jumps out only because it’s been set against an artifactual trend going the other direction. In short: an observed difference close to zero is magnified into something big by means of a spurious adjustment. It can go the other way too—an overfitted adjustment used to knock out a real difference—but I guess we’d be less likely to see that, as researchers are motivated to find large and statistically significant effects. Again, all things are possible, but it is striking that if you just look at the raw data you don’t see anything: this particular statistical analysis is required to make the gap appear.

And, the true sign of ideological blinders: the authors put these graphs in their own articles without seeing the problems.
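If you want to see how an overfitted adjustment can manufacture a jump out of nothing, here’s a little simulation sketch (my own toy example, not a reanalysis of any of these papers): data with a smooth trend and no discontinuity at all, fit with separate polynomials on each side of a cutoff. The sample size, noise level, and polynomial degrees are arbitrary assumptions.

```python
# Toy simulation (my own, not a reanalysis of any study discussed): there is
# NO discontinuity in these data, just a smooth trend plus noise. Fitting
# separate high-degree polynomials on each side of the cutoff still "finds"
# jumps, and much bigger ones than a simple linear fit does.
import numpy as np

rng = np.random.default_rng(1)

def estimated_jump(x, y, degree):
    """Difference of separate polynomial fits, evaluated at the cutoff x = 0."""
    left, right = x < 0, x >= 0
    fit_left = np.polyval(np.polyfit(x[left], y[left], degree), 0.0)
    fit_right = np.polyval(np.polyfit(x[right], y[right], degree), 0.0)
    return fit_right - fit_left

n, n_sims = 200, 2000
jumps = {1: [], 5: []}
for _ in range(n_sims):
    x = rng.uniform(-1, 1, size=n)           # forcing variable, cutoff at 0
    y = 0.5 * x + rng.normal(0, 1, size=n)   # smooth trend + noise, true jump = 0
    for degree in jumps:
        jumps[degree].append(estimated_jump(x, y, degree))

for degree, js in jumps.items():
    print(f"degree {degree}: sd of estimated 'discontinuity' = {np.std(js):.2f}")
# The degree-5 fits chase noise near the boundary, so their estimated jumps are
# several times more variable than the linear fits' estimates: plenty of room
# for a large, spurious, "statistically significant" discontinuity to appear.
```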

Good design, bad estimate

Let me be clear here. There’s good and bad.

The good is “regression discontinuity,” in the sense of a natural experiment that allows comparison of exposed and control groups, where there is a sharp rule for who gets exposed and who gets the control: That’s great. It gives you causal identification in the sense of not having to worry about selection bias: you know the treatment assignment rule.

The bad is “regression discontinuity,” in the sense of a statistical analysis that focuses on modeling of the forcing variable with no serious struggle with the underlying observational study problem.

So, yes, it’s reasonable that economists, policy analysts, and journalists like to analyze and write about natural experiments: this really can be a good way of learning about the world. But this learning is not automatic. It requires adjustment for systematic differences between exposed and control groups—which cannot in general be done by monkeying with the forcing variable. Monkeying with the forcing variable can, however, facilitate the task of coming up with a statistically significant coefficient on the discontinuity, so there’s that. The sketch below gives a sense of what I mean.
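Here’s a toy sketch of that last point (again my own invented example, with made-up numbers, not any specific study): treatment is assigned by a sharp cutoff in the forcing variable, but the exposed group also differs on another pre-treatment variable, and the outcome is driven entirely by that other variable. The standard RD regression, linear in the forcing variable on each side of the cutoff, attributes the difference to the treatment; adjusting for the imbalanced variable makes it go away.

```python
# Toy sketch of the underlying observational-study problem (my own invented
# example, with made-up numbers): treatment is assigned by a sharp cutoff in
# the forcing variable x, but the exposed group also differs on another
# pre-treatment variable z, and the outcome depends only on z.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(-1, 1, size=n)                 # forcing variable, cutoff at 0
treated = (x >= 0).astype(float)
z = rng.normal(0, 1, size=n) + 0.5 * treated   # imbalance unrelated to x
y = z + rng.normal(0, 1, size=n)               # outcome driven by z; NO treatment effect

def ols(design, y):
    """Least-squares coefficients for a given design matrix."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

ones = np.ones(n)
# Standard RD regression: separate linear trends in x on each side of the
# cutoff; the coefficient on `treated` is the estimated jump at x = 0.
rd_only = ols(np.column_stack([ones, treated, x, treated * x]), y)
# Same regression, also adjusting for the imbalanced pre-treatment variable z.
rd_plus_z = ols(np.column_stack([ones, treated, x, treated * x, z]), y)

print(f"RD estimate, modeling the forcing variable only: {rd_only[1]:+.2f}")   # ~ +0.5, spurious
print(f"RD estimate, also adjusting for z:               {rd_plus_z[1]:+.2f}") # ~ 0, as it should be
```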

But there’s hope

But there’s hope. Why do I say this? Because where we are now in applied economics—well-meaning researchers performing fatally flawed studies, well-meaning economists and journalists amplifying these claims and promoting quick-fix solutions, skeptics needing to do the unpaid work of point-by-point rebuttals and being characterized as “vehement statistics nerds”—this is exactly where psychology was, five or ten years ago.

Remember that ESP study? When it came out, various psychologists popped out to tell us that it was conducted just fine, that it was solid science. It took us years to realize how bad that study was. (And, no, this is not a moral statement, I’m not saying the researcher who did the study was a bad person. I don’t really know anything about him beyond what I’ve read in the press. I’m saying that he is a person who was doing bad science, following the bad-science norms in his field.) Similarly with beauty-and-sex-ratio, power pose, that dude who claimed he could predict divorces with 90% accuracy, etc.: each study had its own problems, which had to be patiently explained, over and over again, to scientists as well as to influential figures in the news media. (Indeed, I don’t think the Freakonomics team ever retracted their endorsement of the beauty-and-sex-ratio claim, which was statistically and scientifically ridiculous but fit in well with a popular gender-essentialist view of the world.)

But things are improving. Sure, the himmicanes claim will always be with us—that combination of media exposure, PNAS endorsement, and researcher chutzpah can go a long way—but, if you step away from some narrow but influential precincts such as the Harvard and Princeton psychology departments, NPR, and Ted World HQ, you’ll see something approaching skepticism. More and more researchers and journalists are realizing that randomized experiment plus statistical significance does not necessarily equal scientific discovery, that, in fact, “randomized experiment” can motivate researchers to turn off their brains, “statistical significance” can occur all by itself via forking paths, and the paradigm of routine “scientific discovery” can mislead.

And it’s an encouraging sign that, when you criticize a study that happens to have been performed by a psychologist, psychologists and journalists on the web do not immediately pop up with, But what about the robustness study?, or Don’t you know that they have causal identification?, etc. Sure, there are some diehards who will call you a Stasi terrorist because you’re threatening the status quo of backscratching comfort, but it’s my impression that the mainstream of academic psychology recognizes that randomized experiment plus statistical significance does not necessarily equal scientific discovery. They’re no longer taking a published claim as default truth.

My message to economists

Savvy psychologists have realized that just because a paper has a bunch of experiments, each with a statistically significant result, it doesn’t mean we should trust any of the claims in the paper. It took psychologists (and statisticians such as myself) a long time to grasp this. But now we have.

So, to you economists: Make the transition that savvy psychologists have already made. In your case, my advice is: don’t accept a claim by default just because it contains an identification strategy, statistical significance, and robustness checks. Don’t think that a claim should stand just cos nobody’s pointed out any obvious flaws. And when non-economists do come along and point out some flaws, don’t immediately jump to the defense.

Psychologists have made the conceptual leap: so can you.

My message to journalists

I’ll repeat this from before:

When you see a report of an interesting study, contact the authors and push them with hard questions: not just “Can you elaborate on the importance of this result?” but also “How might this result be criticized?”, “What’s the shakiest thing you’re claiming?”, “Who are the people who won’t be convinced by this paper?”, etc. Ask these questions in a polite way, not in any attempt to shoot the study down—your job, after all, is to promote this sort of work—but rather in the spirit of fuller understanding of the study.

Science journalists have made the conceptual leap: so can you.

P.S. You (an economist, or a journalist, or a general reader) might read all the above and say, Sure, I get your point, robustness studies aren’t what they’re claimed to be, forking paths are a thing, you can’t believe a lot of these claims, etc., BUT . . . air pollution is important! evolutionary psychology is important! power pose could help people! And, if it doesn’t help, at least it won’t hurt much. Same with air filters: who could be against air filters?? To which I reply: Sure, that’s fine. I got no problem with air filters or power pose or whatever (I guess I do have a problem with those beauty-and-sex-ratio claims as they reinforce sexist attitudes, but that’s another story, to be taken up with Freakonomics, not with Vox): If you want to write a news story promoting air filters in schools, or evolutionary psychology, or whatever, go for it: just don’t overstate the evidence you have. In the case of the regression discontinuity analyses, I see the overstatement of evidence as coming from a culture of credulity within academia and journalism, a combination of methodological credulity within academic social science (the idea that identification strategy + statistical significance = discovery until it’s been proved otherwise) and credulity in science reporting (the scientist-as-hero narrative).

P.P.S. I’m not trying to pick on econ here, or on Vox. Economists are like psychologists, and Vox reporters are like science reporters in general: they all care about the truth, they all want to use the best science, and they all want to help people. I sincerely think that if psychologists and science reporters can realize what’s been going on and do better, so can economists and Vox reporters. I know it’s taken me a while (see here and here) to move away from default credulity. It’s not easy, and I respect that.

P.P.P.S. Yes, I know that more important things are going on in the world right now. I just have to make my contributions where I can.