Linear or logistic regression with binary outcomes

Gio Circo writes:

There is a paper currently floating around which suggests that, when estimating causal effects, OLS is better than any kind of generalized linear model (i.e. binomial). The author draws a sharp distinction between causal inference and prediction. Having gotten most of my statistical learning using Bayesian methods, I find this distinction difficult to understand. As part of my analysis I am always evaluating model fit, posterior predictive checks, etc. In what cases are we estimating a causal effect and not interested in what could happen later?

I am wondering if you have any insight into this. In fact, it seems like economists have a much different view of statistics than other fields I am more familiar with.

The above link is to a preprint, by Robin Gomila, “Logistic or linear? Estimating causal effects of treatments on binary outcomes using regression analysis,” which begins:

When the outcome is binary, psychologists often use nonlinear modeling strategies such as logit or probit. These strategies are often neither optimal nor justified when the objective is to estimate causal effects of experimental treatments. . . . I [Gomila] draw on econometric theory and established statistical findings to demonstrate that linear regression is generally the best strategy to estimate causal effects of treatments on binary outcomes. . . . I recommend that psychologists use linear regression to estimate treatment effects on binary outcomes.

I don’t agree with this recommendation, but I can see where it’s coming from. So for researchers who are themselves uncomfortable with logistic regression, or who work with colleagues who get confused by the logistic transformation, I could modify the above advice, as follows:

1. Forget about the data being binary. Just run a linear regression and interpret the coefficients directly.

2. Also fit a logistic regression, if for no other reason than that many reviewers will demand it!

3. From the logistic regression, compute average predictive comparisons. We discuss the full theory here, but there are also simpler versions available automatically in Stata and other regression packages.

4. Check that the estimates and standard errors from the linear regression in step 1 are similar to the average predictive comparisons and corresponding standard errors in step 3. If they differ appreciably, then take a look at your data more carefully—OK, you already should’ve taken a look at your data!—because your results might well be sensitive to various reasonable modeling choices. (See the sketch after this list for the four steps in code.)
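Here's a minimal sketch of steps 1 through 4 in Python with statsmodels, on a made-up dataset; the variable names treat, x, and y are placeholders, not anything from the paper:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: a binary treatment, one covariate, and a binary outcome.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"treat": rng.integers(0, 2, size=n), "x": rng.normal(size=n)})
p_true = 1 / (1 + np.exp(-(-0.5 + 0.8 * df["treat"] + 0.6 * df["x"])))
df["y"] = rng.binomial(1, p_true)

X = sm.add_constant(df[["treat", "x"]])

# Step 1: linear regression; the coefficient on treat is read directly as the
# estimated change in Pr(y = 1) under treatment.
ols_fit = sm.OLS(df["y"], X).fit()

# Step 2: logistic regression on the same data.
logit_fit = sm.Logit(df["y"], X).fit(disp=False)

# Step 3: average predictive comparisons. get_margeff() averages the change in
# predicted probability over the observed data; dummy=True treats the binary
# treatment as a discrete 0-to-1 comparison rather than a derivative.
apc = logit_fit.get_margeff(at="overall", method="dydx", dummy=True)

# Step 4: compare estimates and standard errors from the two approaches.
print(ols_fit.params["treat"], ols_fit.bse["treat"])
print(apc.summary())
```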

Don’t get me wrong—when working with binary data, there are reasons for preferring logistic regression to linear. Logistic should give more accurate estimates and make better use of the data, especially when data are sparse. But in many cases, it won’t make much of a difference.

To put it another way: in my work, I’ll typically just do steps 3 and 4 above. But, arguably, if you’re only willing to do one step, then step 1 could be preferable to step 2, because the coefficients in step 1 are more directly interpretable.

Another advantage of linear regression, compared to logistic, is that linear regression doesn’t require binary data. Believe it or not, I’ve seen people discretize perfectly good data, throwing away tons of information, just because that’s what they needed to do to run a chi-squared test or logistic regression.

So, from that standpoint, the net effect of logistic regression on the world might well be negative, in that there’s a “moral hazard” by which the very existence of logistic regression encourages people to turn their outcomes into binary variables. I have the impression this happens all the time in biomedical research.

A few other things

I’ll use this opportunity to remind you of a few related things. My focus here is not on the particular paper linked above but rather on some of these general questions on regression modeling.

First, if the goal of regression is estimating an average treatment effect, and the data are well behaved, then linear regression might well behave just fine, if a bit inefficiently. The time when it’s important to get the distribution right is when you’re making individual predictions. Again, even if you only care about averages, I’d still generally recommend logistic rather than linear for binary data, but it might not be such a big deal.
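As a quick check of that claim, here's a small simulation sketch (made-up data-generating process, nothing from the paper) comparing the linear-regression coefficient with the logistic-regression average predictive comparison as estimates of the average treatment effect:

```python
import numpy as np
import statsmodels.api as sm

# With a randomized binary treatment, both estimators target the same
# average treatment effect on Pr(y = 1); the empirical SDs show how much
# (if any) efficiency is lost by the linear fit.
rng = np.random.default_rng(2)
n, n_sims = 500, 200
lin_est, apc_est = [], []

for _ in range(n_sims):
    treat = rng.integers(0, 2, size=n)
    x = rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 1.0 * treat + 0.5 * x))))
    X = sm.add_constant(np.column_stack([treat, x]))  # columns: const, treat, x
    lin_est.append(sm.OLS(y, X).fit().params[1])
    logit_fit = sm.Logit(y, X).fit(disp=False)
    apc_est.append(logit_fit.get_margeff(dummy=True).margeff[0])

print(np.mean(lin_est), np.std(lin_est))
print(np.mean(apc_est), np.std(apc_est))
```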

Second, any of these methods can be a disaster if the model is far off. Both linear and logistic regression assume a monotonic relation between E(y) and x. If E(y) is a U-shaped function of x, then linear and logistic could both fail (unless you include x^2 as a predictor or something like that, and then this could introduce new problems at the extremes of the data). In addition, logistic regression assumes the probabilities approach 0 and 1 at the extremes of the linear predictor, and if the probabilities instead level off at intermediate values, you’ll want to include that in your model too, which is no problem in Stan but can be more difficult with default procedures and canned routines.
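For what it's worth, here's a rough sketch of that kind of model outside Stan, using scipy and made-up data: a logistic curve whose lower and upper asymptotes are estimated rather than fixed at 0 and 1.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik(params, x, y):
    a, b, lower, upper = params
    # Success probability levels off at `lower` and `upper` instead of 0 and 1.
    p = lower + (upper - lower) * expit(a + b * x)
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data where the probability levels off around 0.1 and 0.8.
rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y = rng.binomial(1, 0.1 + 0.7 * expit(0.5 + 1.5 * x))

fit = minimize(
    neg_log_lik, x0=[0.0, 1.0, 0.05, 0.95], args=(x, y),
    bounds=[(None, None), (None, None), (0.0, 0.5), (0.5, 1.0)],
    method="L-BFGS-B",
)
print(fit.x)  # intercept, slope, lower asymptote, upper asymptote
```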

Third, don’t forget the assumptions of linear regression, ranked in decreasing order of importance. The assumptions that first come to mind (normality of the errors, for example) are, for many purposes, the least important assumptions of the model. (See here for more on assumptions.)

Finally, the causal inference thing mentioned in the linked paper is a complete red herring. Regression models make predictions, regression coefficients correspond to average predictions over the data, and you can use poststratification or other tools to use regression models to make predictions for other populations. Causal inference using regression is a particular sort of prediction having to do with potential outcomes. There’s no reason that linear modeling is better or worse for causal inference than for other applications.
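To make that last point concrete, poststratification is just averaging the fitted model's predictions over the covariate distribution of whatever population you care about. Here's a small sketch with a made-up sample and target population:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Fit a logistic regression on a made-up sample.
rng = np.random.default_rng(4)
n = 800
sample = pd.DataFrame({"treat": rng.integers(0, 2, size=n),
                       "x": rng.normal(0.0, 1.0, size=n)})
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * sample["treat"] + 0.6 * sample["x"])))
sample["y"] = rng.binomial(1, p)
fit = sm.Logit(sample["y"], sm.add_constant(sample[["treat", "x"]])).fit(disp=False)

# A target population whose covariate distribution differs from the sample's.
x_target = rng.normal(1.0, 1.0, size=5000)
ones = np.ones(len(x_target))

# Design matrices in the same column order as the fit: const, treat, x.
X_treated = np.column_stack([ones, ones, x_target])
X_control = np.column_stack([ones, np.zeros(len(x_target)), x_target])

# Poststratified estimate: average the predicted treated-minus-control
# difference in Pr(y = 1) over the target population.
print(np.mean(fit.predict(X_treated) - fit.predict(X_control)))
```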