No, I don’t believe that claim based on regression discontinuity analysis that . . .

https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/

Despite the p-less-than-0.05 statistical significance of the discontinuity in the above graph, no, I do not believe that losing a close election causes U.S. governors to lose 5-10 years of life, as was claimed in this recently published article.

Or, to put it another way: Despite the p-less-than-0.05 statistical significance of the discontinuity of the above graph, and despite the assurance of the authors that this finding is robust, no, I do not believe that winning a close election causes U.S. governors to live 5-10 years longer.

How can I say this? I will answer in five ways:
1. Common sense
2. Our experience as consumers of research
3. Statistical analysis
4. Statistical design
5. Sociology of science.

1. Common sense: 5-10 years is a huge amount of lifetime. Even if you imagine dramatic causal sequences (for example, you lose an election and become depressed and slide into alcoholism and then drive your car into a tree), it’s hard to imagine such a huge effect.

2. Our experience as consumers of research: The recent history of social and behavioral science is littered with published papers that made claims of implausibly large effects, supported by statistically significant comparisons and seemingly solid research methods, which did not hold up under replication or methodological scrutiny. Some familiar examples to readers of this blog include the claim that beautiful parents were 8 percentage points more likely to have girls, the claim that students at Cornell had extra-sensory perception, the claim that women were three times more likely to wear red or pink clothing during certain times of the month, the claim that single women were 20 percentage points more likely to support Barack Obama during certain times of the month, and the claim that political moderates perceived the shades of gray more accurately than extremists on the left and right. We’ve been burned before. We’ve been burned enough times that we realize we don’t have to follow Kahneman’s now-retired dictum that “you have no choice but to accept that the major conclusions of these studies are true.”

At this point, I recommend that you pause your reading now and go to this post, “50 shades of gray: A research story,” and read all of it. It won’t take long: it’s just two long paragraphs from an article by Nosek, Spies, and Motyl and then a few sentences of remarks from me.

OK, did you read “50 shades of gray: A research story”? Great. Now let’s continue.

3. Statistical analysis: If losing an election truly does not cause you to lose 5 to 10 years of life, then how could these researchers have found a comparison that was (a) statistically significant, and (b) robust to model perturbations? My quick answer is forking paths. I know some of you are gonna hate to hear it, but there it is. The above graph might look compelling, but here’s what the raw data look like:

For ease of interpretation I’ve plotted the data in years rather than days.

Now let’s throw a loess on there and see what we get:

And here are two loess fits, one for negative x and one for positive x:

Yup. Same old story. Fit a negative slope right near the discontinuity, and then a positive jump is needed to make everything work out. The point is not that loess is the right thing to do here; the point is that this is what’s in these data.

The fit is noisy, and finding the discontinuity all depends on there being this strong negative relation between future lifespan and vote margin in this one election—but just for vote margins in this +/- 5 percentage point range. Without that negative slope, the discontinuity goes away. It’s just like the three examples discussed here.

At this point you might say, no, the authors actually fit a local linear regression, so we can’t blame the curve, and their results were robust . . . OK, we’ll get to that. My first point here is that the data are super-noisy, and fitting different models to these data will give you very different results. Again, remember that it makes sense that the data are noisy—there’s no good reason to expect a strong relationship between vote margin in an election and the number of years that someone will live afterward. Indeed, from any usual way of looking at things, it’s ludicrous to think that a candidate’s life expectancy is:
30 years if he loses an election by 5 percentage points
25 years if he loses narrowly
35 years if he wins narrowly
30 years if he wins by 5 percentage points.
It’s a lot more believable that this variation is just noise, some artifact of the few hundred cases in this dataset, than that it represents some general truth about elections, or even about elections for governor.

And here’s a regression. I show a bunch of regressions (with code) at the end of the post; here’s one including all the elections between 1945 and 2012, excluding all missing data, and counting only the first race for each candidate who ran for governor multiple times:

lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset)
                   coef.est coef.se
(Intercept)        78.60     4.05  
won                 2.39     2.44  
age                -0.98     0.08  
decades_since_1950 -0.21     0.51  
margin             -0.11     0.22  
---
n = 311, k = 5
residual sd = 10.73, R-Squared = 0.35

The estimated effect is 2.4 years with a standard error of 2.4 years, i.e., consistent with noise.
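To see what “consistent with noise” means here, take the usual plus-or-minus-two-standard-errors interval from that fit:

2.39 + c(-2, 2) * 2.44   # roughly -2.5 to +7.3 years

That interval comfortably includes zero, and it also includes effects far larger than anything plausible, which is another way of saying that these data just aren’t very informative about the effect.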

But what about the robustness as reported in the published article? My answer to that is, first, the result is not so robust, as is indicated by the above graph and regression and demonstrated further in my P.S. below (I wasn’t trying to make the result go away; I was just trying to replicate what they were doing in that paper); and, second, let’s see what Uri Simonsohn had to say about this:

To demonstrate the problem I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness,” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p < .01.

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.

4. Statistical design: Another way to think about this is to consider the design of such a study. A priori we might consider an effect size of one additional year of life to be large, and on the border of plausibility. But this study has essentially no power to detect effects this small! You can see that from the standard errors on the regression. If an estimate of 5-10 years is two or three standard errors away from zero, then an effect of 1 year, or even 2 years, is statistically undetectable. So the study is really set up only to catch artifacts or noise. That’s the point of the “Power failure” paper by Button et al. (2013).
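To make that concrete, here’s a back-of-the-envelope power calculation, treating the estimate as normally distributed with the 2.4-year standard error from the regression above:

se <- 2.4
effect <- c(1, 2, 5)                                   # hypothetical true effects, in years
power <- pnorm(effect/se - 1.96) + pnorm(-effect/se - 1.96)
round(power, 2)                                        # roughly 0.07, 0.13, 0.55

So even if the true effect were a (huge) 5 years, this design would detect it only about half the time; for effects of a plausible size, the power is barely above the false-positive rate.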

5. Sociology of science: If you are a social scientist, statistical methods should be your servant, not your master. It’s very tempting to say that the researchers who made the above graph followed the rules and that it’s not fair to pick on them. But the point of social science is not to follow the rules, it’s to gain understanding and make decisions. When following the rules gives silly results, it’s time to look more carefully at the rules and think hard about what went wrong. That’s what (many) psychologists did after the Bem ESP debacle and the rest of the replication crisis.

Yes, it’s possible that I’m wrong. Perhaps losing a close election really did kill these candidates for governor. It’s possible. But I don’t think so. I think this paper is just another example of statistical rule-following that’s out of control.

No smoking gun

It’s easy to criticize research when the forking paths are all out in the open (as with the ESP study) or when statistics show that your sample size is too low to detect anything by a factor of 100 (as in the beauty-and-sex-ratio example) or when there are obvious forking paths and a failed replication (as in the ovulation-and-clothing example) or when almost all the data have been excluded from the analysis (as in the union elections and stock price example) or when there’s flat-out research misconduct (as with pizzagate).

This example we’re discussing here is a bit different. It’s a clean analysis with clean data. The data are even publicly available (which allowed me to make the above graphs)! But, remember, honesty and transparency are not enough. If you do a study of an effect that is small and highly variable (which this one is: to the extent that winning or losing can have large effects on your lifespan, the effect will surely vary a lot from person to person), you’ve set yourself up for scientific failure: you’re working with noise.

I’m not happy about this, but that’s just how quantitative science works. So let me emphasize again that a study can be fatally flawed just by being underpowered, even if there’s no other obvious flaw in the study.

Or, to put it another way, there’s an attitude that causal identification + statistical significance = discovery, or that causal identification + robust statistical significance = discovery. But that attitude is mistaken. Even if you’re an honest and well-meaning researcher who has followed principles of open science.

P.S. R code is below. I followed the link in the published article and downloaded the data from here. The code didn’t quite run as is: the script expected a .csv file, but all I could find was a .tab file, so I changed the line that loads the data (the data frame df_rdd) to read the .tab file instead.
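In base R, that swap would look something like this (the exact reader call in the replication script may differ, and the file name here is an assumption):

df_rdd <- read.csv("longevity.csv")      # original: expects a comma-separated file
df_rdd <- read.delim("longevity.tab")    # changed: read the tab-delimited file from the archive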

Then when I ran the next bit of the code:

if (sha1(df_rdd) != "300cc29bbecd2b630016c9bd2c8ef958dcc1b45d"){
  error("Wrong data file loaded or data has been changed!") }

I got an error:

Error in error("Wrong data file loaded or data has been changed!") : 
  could not find function "error"

So I checked this:

sha1(df_rdd)

and got this:

"e26cc6acbed7a7144cd2886eb997d4ae262cf400"

So maybe the data did get changed! I have no idea. But I was able to reproduce the graphs so things were probably OK.

I also noticed some data problems, such as cases where the politician's death date came decades before his election. There were a bunch of these; for example, see here.

I'm not trying to say that this is a horrible dataset. This is just what happens when you try to replicate someone's analyses. Problems come up.

Once I started to go through the data, I realized there are a lot of analysis choices. The article in question focuses on details of the discontinuity analysis, but that's really the least of it, considering that we have no real reason to think that the vote margin should be predictive of longevity. The most obvious thing is that the best predictor of how long you'll live in the future is . . . how long you've lived so far. So I included current age as a linear predictor in the model. It then would be natural to interact the win/loss indicator with age. That to me makes a lot more sense than interacting with the vote margin. In general, it makes sense to interact the treatment with the most important predictors in your data.

Another issue is what to do with candidates who are still living. The published analysis just discarded them. It would be more natural, I think, to include such data using survival analysis. I didn't do this, though, as it would take additional work. I guess that was the motivation of the authors, too. No shame in that---I cut corners with missing data all the time---but it is yet another researcher degree of freedom.
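For reference, the survival-analysis version would look something like this (just a sketch: it assumes a 0/1 dead indicator, and a data frame like the one used in the regressions below but keeping the still-living candidates, with more_years for them measured up to the present day):

library("survival")
# right-censor the still-living candidates instead of dropping them
fit_surv <- coxph(Surv(more_years, dead) ~ won + age + decades_since_1950 + margin, data = data)
summary(fit_surv)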

Another big concern is candidates who run in multiple elections. It's not appropriate to use a regression model assuming independent errors when you have the same outcome for many cases. To keep things simple, I just kept the first election in the dataset for each candidate, but of course that's not the only choice; instead you might, for example, use the average vote margin for all the elections where the candidate ran.
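In code, this is just a choice of how to collapse to one row per candidate. Two versions, sketched with assumed candidate_id and margin_pct columns (the replication data also carries a first_race flag, which is what I use below):

library("dplyr")
# option 1: keep each candidate's first election
df_first <- df_rdd %>% arrange(year) %>% distinct(candidate_id, .keep_all = TRUE)
# option 2: keep one row per candidate but attach the average margin across all their races
df_avg <- df_rdd %>% arrange(year) %>% group_by(candidate_id) %>%
  mutate(margin_avg = mean(margin_pct)) %>% slice(1) %>% ungroup()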

Finally, for the analysis itself: it seems that the published regressions were performed using an automatic regression discontinuity package, rdrobust. I'm sure this is fine for what it is, but, again, I think that in this observational study the vote margin is not the most important thing to adjust for. Sure, you want to adjust for it, as there's no overlap on this variable, but we simply don't have enough cases right near the margin (the zone where we might legitimately approximate win or loss as a randomly applied treatment) to rely on the RD. 400 or so cases is just not enough to estimate a realistic effect size here. To keep some control over the analyses, I just fit some simple regressions. It would be possible to include local bandwidths etc. if that were desired, but, again, that's not really the point. The discontinuity assignment is relevant to our analysis, but it's not the only thing going on, and we have to keep our eye on the big picture if we want to seriously estimate the treatment effect.

So now some code, which I ran after first running the script longevity.R that was included at the replication site. First I hacked the dates:

death_date 

It turned out I didn't really need to do this, because the authors had made those date conversions, but I wanted to do it myself to make sure.

Next I cleaned the data, counting each candidate just once:

n  0, 1, 0)
lifetime = 1945 & election_year 

Again, I know my code is ugly. I did check the results a bit, but I wouldn't be surprised if I introduced a few bugs. I created the decades_since_1950 variable as I had the idea that longevity might be increasing, and I put it in decades rather than years to get a more interpretable coefficient. I restricted the data to 1945-2012 and to candidates who were no longer alive at this time because that's what was done in the paper, and I considered election margins of less than 10 percentage points because that's what they showed in their graph, and also this did seem like a reasonable boundary for close elections that could've gone either way (so that we could consider it as a randomly assigned treatment).
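Spelled out, the cleaning step amounts to something like the following. This is a sketch: the raw column names margin_pct and age_at_election are assumptions (year, living_day_imp_post, and first_race do appear in the replication data), and the handling of still-living candidates may differ, but the point is to end up with the data frame and the logical subset used in the regressions below.

data <- data.frame(
  won                = ifelse(df_rdd$margin_pct > 0, 1, 0),    # win indicator
  more_years         = df_rdd$living_day_imp_post / 365.24,    # days lived after the election, converted to years
  age                = df_rdd$age_at_election,                 # age at the time of the election (assumed name)
  decades_since_1950 = (df_rdd$year - 1950) / 10,
  margin             = df_rdd$margin_pct,                      # signed vote margin, in percentage points (assumed name)
  election_year      = df_rdd$year,
  first_race         = df_rdd$first_race
)
subset <- with(data, first_race & election_year >= 1945 & election_year <= 2012 &
                 !is.na(more_years) & more_years > 0 &         # drop candidates still alive (encoding assumed)
                 abs(margin) < 10)                             # margins of less than 10 percentage points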

Then some regressions. I first did them using stan_glm but just for convenience I'll give you straight lm code here, including the arm package to get the cleaner display:

library("arm")
fit_1a <- lm(more_years ~ won + age + decades_since_1950 + margin, data = data, subset = subset)
fit_1b <- lm(more_years ~ won + age + won:age + decades_since_1950 + margin, data = data, subset = subset)
display(fit_1a); display(fit_1b)

The first model includes the most natural predictors; the second includes the interaction with age, as discussed above. And here are the results:

lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset)
                   coef.est coef.se
(Intercept)        78.60     4.05  
won                 2.39     2.44  
age                -0.98     0.08  
decades_since_1950 -0.21     0.51  
margin             -0.11     0.22  
---
n = 311, k = 5
residual sd = 10.73, R-Squared = 0.35

lm(formula = more_years ~ won + age + won:age + decades_since_1950 + 
    margin, data = data, subset = subset)
                   coef.est coef.se
(Intercept)        78.81     5.41  
won                 1.92     8.25  
age                -0.98     0.10  
decades_since_1950 -0.21     0.51  
margin             -0.11     0.22  
won:age             0.01     0.16  
---
n = 311, k = 6
residual sd = 10.75, R-Squared = 0.35

I was gonna start talking about how hard it would be to interpret the coefficient of the treatment ("won" = winning the election) in the second model, because of the interaction with age. But the coefficient for that interaction is so small and so noisy that we can just forget about it and go back up to the earlier regression. The estimated treatment effect is 2.4 years (not 5-10 years) and its standard error is also 2.4.

What about excluding decades_since_1950?

lm(formula = more_years ~ won + age + margin, data = data, subset = subset)
            coef.est coef.se
(Intercept) 78.62     4.05  
won          2.36     2.44  
age         -0.99     0.08  
margin      -0.11     0.22  
---
n = 311, k = 4
residual sd = 10.72, R-Squared = 0.35

Nahhh, it doesn't do much. We could exclude age also:

lm(formula = more_years ~ won + margin, data = data, subset = subset)
            coef.est coef.se
(Intercept) 30.14     1.75  
won          1.70     3.01  
margin       0.10     0.27  
---
n = 311, k = 3
residual sd = 13.25, R-Squared = 0.01

Now the estimate's even smaller and noisier! We should've kept age in the model in any case. We could up the power by including more elections:

subset2 = 1945 & election_year 

Which gives us:

lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset2)
                   coef.est coef.se
(Intercept)        76.24     3.29  
won                 1.00     1.83  
age                -0.93     0.06  
decades_since_1950 -0.18     0.39  
margin              0.02     0.09  
---
n = 497, k = 5
residual sd = 10.83, R-Squared = 0.33

Now we have almost 500 cases, but we're still not seeing that large and statistically significant effect.

Look. I'm not saying that my regressions are better than the ones in the published paper. I'm pretty sure they're better in some ways and worse in others. I just found it surprisingly difficult to reproduce their results using conventional approaches. So, sure, I believe they found what they found . . . but I call the result fragile, not robust.

Just for laffs, I re-ran the analysis including the duplicate cases:

subset3 = 1945 & election_year 

Which yields:

lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset3)
                   coef.est coef.se
(Intercept)        74.65     3.18  
won                 3.15     1.89  
age                -0.91     0.06  
decades_since_1950 -0.03     0.41  
margin             -0.19     0.17  
---
n = 499, k = 5
residual sd = 10.93, R-Squared = 0.33

Almost statistically significant (whatever that's supposed to mean) with a z-score of 1.7, but still not in that 5-10 year range. Also it doesn't make sense to count a politician multiple times.

I thought I could get cute and remove age and decades_since_1950 and maybe something like the published paper's result would appear, but no luck:

lm(formula = more_years ~ won + margin, data = data, subset = subset3)
            coef.est coef.se
(Intercept) 28.45     1.32  
won          2.84     2.30  
margin      -0.12     0.20  
---
n = 499, k = 3
residual sd = 13.32, R-Squared = 0.00

I tried a few other things but I couldn't figure out how to get that huge and statistically significant coefficient in the 5-10 year range. Until . . . I ran their regression discontinuity model:

df_m <- df_rdd %>%   # data-frame name assumed from the load step above
  filter(year >= 1945, living_day_imp_post > 0)
main_1 <- rdrobust(df_m$living_day_imp_post, df_m$margin_pct)   # running-variable column name assumed

And this is the result:

Number of Obs.                 1092
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                 516         576
Eff. Number of Obs.            236         243
Order est. (p)                   1           1
Order bias  (q)                  2           2
BW est. (h)                  9.541       9.541
BW bias (b)                 19.017      19.017
rho (h/b)                    0.502       0.502
Unique Obs.                    516         555

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional  2749.283   873.601     3.147     0.002  [1037.057 , 4461.509]  
        Robust         -         -     3.188     0.001  [1197.646 , 5020.823]  
=============================================================================

You have to divide by 365.24, but you get the picture. Then I looked above, and I saw the effective number of observations was only 236 or 243, not even as many as the 311 I had earlier!
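For reference, converting the conventional estimate and its confidence interval from days to years:

c(2749.283, 1037.057, 4461.509) / 365.24   # about 7.5 years, with a 95% interval of roughly 2.8 to 12.2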

So let me try re-running the basic linear model but just in the zone where the candidates were within 5 percentage points of winning or losing:

subset4 = 1945 & election_year 

Here's the result:

lm(formula = more_years ~ won + age + decades_since_1950 + margin, 
    data = data, subset = subset4)
                   coef.est coef.se
(Intercept)        71.03     6.87  
won                 7.54     3.76  
age                -0.90     0.12  
decades_since_1950 -0.36     0.78  
margin             -1.14     0.68  
---
n = 153, k = 5
residual sd = 11.46, R-Squared = 0.31

Now we're getting somewhere! The estimate is 5-10 years, and it's 2 standard errors away from zero. We can juice it up a bit more by removing decades_since_1950 (a reasonable choice to remove) and age. We should really keep age in the model, no question about it (not including age in a remaining-length-of-life model is about as bad as not including smoking in a cancer model), but let's remove it just to see what happens:

lm(formula = more_years ~ won + margin, data = data, subset = subset4)
            coef.est coef.se
(Intercept) 22.70     2.52  
won         11.92     4.31  
margin      -2.00     0.77  
---
n = 153, k = 3
residual sd = 13.34, R-Squared = 0.05

Wowza! An estimated effect of more than 10 years of life, and it's 2.8 standard errors away from zero. Got it!

So now we can reverse engineer, starting with the RD model from the paper and keeping only the first race for each candidate:

df_m_first <- df_rdd %>%   # again, data-frame name assumed
  filter(year >= 1945, living_day_imp_post > 0, first_race)
main_1a <- rdrobust(df_m_first$living_day_imp_post, df_m_first$margin_pct)   # running-variable column name assumed

which yields:

        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional  1482.670  1021.784     1.451     0.147  [-519.991 , 3485.330]  
        Robust         -         -     1.516     0.130  [-525.718 , 4115.893]  

And then including age as a predictor:

main_1b <- rdrobust(df_m_first$living_day_imp_post, df_m_first$margin_pct, covs = df_m_first$age)   # age-at-election column name assumed

which yields:

        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional   819.480   708.968     1.156     0.248  [-570.072 , 2209.032]  
        Robust         -         -     1.051     0.293  [-740.342 , 2453.552]  

So, yeah, the robust setting in this package doesn't matter much in this example---but the analysis is very sensitive to the bandwidth (yes, the bandwidth is estimated from the data, but that just tells us it can be noisy; the fact that something is calculated automatically using some theory and a computer program doesn't mean it's correct in any particular example) and to the decision of how to handle candidates with multiple elections in the dataset, and to the decision of how to handle the age predictor.
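One way to see the bandwidth sensitivity directly is to refit at a few fixed bandwidths rather than the MSE-optimal one (a sketch, with the same assumed column names as in the rdrobust calls above):

for (h in c(5, 7.5, 10, 15)) {
  fit <- rdrobust(df_m$living_day_imp_post, df_m$margin_pct, h = h)
  print(round(c(bandwidth = h, years = fit$coef[1] / 365.24), 1))   # conventional estimate, converted to years
}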

And here's the (ugly) code for the graphs I made:


loess_lost <- loess(more_years ~ margin, data = data, subset = subset & won == 0)
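That's the fit for the losers; the fit for the winners and the overlay on the raw data would go something like this (a sketch, using the same variables as in the regressions above; the original plot settings surely differed):

loess_won <- loess(more_years ~ margin, data = data, subset = subset & won == 1)
plot(data$margin[subset], data$more_years[subset],
     xlab = "Vote margin (percentage points)", ylab = "Years lived after the election")
x_lost <- seq(-10, 0, 0.1)
x_won  <- seq(0, 10, 0.1)
lines(x_lost, predict(loess_lost, data.frame(margin = x_lost)))
lines(x_won,  predict(loess_won,  data.frame(margin = x_won)))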

P.P.S. This was all a lot of work. I did it because it's worth doing some work now and then---but don't forget the larger point, which is that we knew ahead of time not to trust this result because of the unrealistically large effect sizes, and we knew ahead of time not to think that causal identification + statistical significance + robustness tests = discovery, because we've been burned on that combination many times before.

To put it another way, don't think that you should give the claims of a published or preprinted social science study the benefit of the doubt, just because someone like me didn't bother to go to all the trouble to explain its problems. The point here is not to pick on the authors of this particular study---not at all. They're doing what they've been trained to do.

P.P.P.S. Again, nothing special about regression discontinuity here. All the same concerns would arise with any observational study, with specific concerns arising from regression, difference in differences, nonparametric modeling, instrumental variables, synthetic controls, etc. The same problems arise in experiments too, what with issues of missing data and extrapolation. The particular problems with regression discontinuity arise when (a) people naively think that causal identification + statistical significance = discovery, and (b) people take robustness tests, placebo controls, etc., too seriously. Alternative analyses are part of any good statistical study, but generally the point should be to explore and understand the limitations of one's conclusions, not to rule out alternative explanations. And, with all these kinds of study, if the underlying effect size is small and your measurements are noisy, you're drawing dead (as they say in poker). Again, this can happen in any study, ranging from a clean randomized experiment at one extreme, to a pure observational study on the other.

You might say that I'm shooting a rabbit with a cannon here. The research and writing of this post took me something like 10 hours! That's a lot of time to spend on an effect that I think is small, highly variable, and essentially undetectable.

My reason for writing this post is that I'm concerned about the general attitude that causal identification + statistical significance + robustness study = discovery. For my own purposes, I didn't feel the need to reanalyze the data---the problems were all clear to me from the beginning---but I put in the time to do all this to make it clear what's going on, to people who haven't been steeped in thinking about low power, forking paths, and all the rest. I'm hoping that the effort I put into taking apart the claims in this particular paper can be abstracted in some way so that my colleagues and I can write something more general that can help many researchers in the future. This is a path we often take in applied statistics: working on a single example in detail to develop more general methodological conclusions. Usually we think of this as doing an exemplary analysis that can serve as a template for future workflow and an inspiration for future methods; in this case it's more of an exploration of what can go wrong. But that can be important too.

My goal is not to "debunk" the claims in the paper under discussion; rather, my goal is to understand what went wrong, as a way of figuring out how we can all do better in the future.

P.P.P.P.S. OK, one more thing. This is for those of you who (a) have read this far, and (b) want to say something like this to me:

Hey, destructive statistician. Stop with your nihilism! The authors of this peer-reviewed paper did a very standard, reasonable analysis and found a large, statistically significant, and robust effect. All you did was perform a bunch of bad analyses. Even your bad analyses uniformly found positive effects. They just weren't statistically significant, but that's just cos you threw away data and used crude methods. It's well known that if you throw away data and use inefficient statistical analysis, your estimates become noisier, your standard errors become larger, and your power goes down.

tl;dr. You, Gelman, maligned a competent, high-quality paper---one that passed peer review---by the simple expedient of re-running their analysis with fewer data points and a noisier statistical method and then concluding that their results were not statistically significant. Ironic that you, Gelman, used this tactic, given how disparaging you usually are about using statistical significance to make decisions.

The above response sounds kinda reasonable---indeed, some of you might agree with it!---and it has the form of a logical argument, but I think it's wrong, for several reasons.

First, regarding the point that the coefficient shows up positive in all my analyses, hence making this nothing more than a dispute over statistical significance:

This is where it's helpful to have a Bayesian perspective---or, if you'd prefer, a design-based perspective. It goes like this: any reasonable prior for this effect is concentrated near zero or near small values (point 4 above put even a one-year effect at the border of plausibility), while the standard errors here are on the order of 2 to 4 years, so the data barely move us away from that prior. And from the design standpoint, a study with essentially no power produces estimates whose sign is close to a coin flip, so a run of positive but noisy coefficients is not evidence of a real effect; it's pretty much what pure noise would look like.

Second, regarding the idea that I've replaced their state-of-the-art RD analysis with various low-power linear regressions:

Actually, no. The robust adaptive RD analysis does not have higher power or more statistical efficiency than the linear model. These are just different models. Indeed, the decision not to include age as a predictor lowers the power of the analysis. Regarding sample size: I took out the duplicate elections featuring the same candidate because it's inappropriate to consider these as independent data points; indeed, including them in that way can just give you artificially low standard errors. (I think it would be easy enough to show this using a bootstrap analysis or something like that.) The real point, though, is that you can't tell much about the power of a study by looking at the z-score or statistical significance of a coefficient estimate.
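To spell out the bootstrap idea (a sketch only: candidate_id is an assumed identifier column, and you'd run this on the version of the data that keeps all of each candidate's elections):

boot_won <- replicate(1000, {
  ids <- sample(unique(data$candidate_id), replace = TRUE)   # resample candidates, not candidate-elections
  d <- do.call(rbind, lapply(ids, function(i) data[data$candidate_id == i, , drop = FALSE]))
  coef(lm(more_years ~ won + age + decades_since_1950 + margin, data = d))["won"]
})
sd(boot_won)   # compare to the nominal standard error from the pooled regression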

P.P.P.P.P.S. Wow---the above post is really long. And I don't remember writing it at all. That's what blog-lag will do to you!