Update on social science debate about measurement of discrimination

Dean Knox writes:

Following up on our earlier conversation, we write to share a new, detailed examination of the article, Deconstructing Claims of Post-Treatment Bias in Observational Studies of Discrimination, by Johann Gaebler, William Cai, Guillaume Basse, Ravi Shroff, Sharad Goel, and Jennifer Hill (GCBSGH).

Here’s our new paper, Using Data Contaminated by Post-Treatment Selection?, with Will Lowe and Jonathan Mummolo. This is an important debate and the methods discussed are being used to study serious policy issues. We think these new derivations are valuable for those producing and consuming discrimination research.

Examining GCBSGH’s proposed approach was a very useful exercise for us. In the paper, we clear up some confusion about estimands, in particular showing that given post-treatment selection, analysts do not even recover the controlled direct effect (CDE) among the people included in the experiment, unless observations that respond differently to treatment are nonetheless exactly the same. This is conceptually the same as arguing that IV recovers the ATE; I [Knox] think few reasonable analysts would argue that compliers are somehow exactly the same as the full sample. In our paper, we prove that the following is logically equivalent to GCBSGH’s proposal: in an ideal experimental setting where civilians of different racial groups are randomly assigned to police encounters pre-stop, acknowledging that biased police may stop minority civilians for as little as jaywalking but white civilians only for assault, yet arguing that both sets of stops are somehow identical in their potential for police violence.
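To make the jaywalking-versus-assault example concrete, here is a minimal simulated sketch. The numbers and variable names are invented purely for illustration and come from neither paper: race has a fixed direct effect on the use of force, but because minority civilians are stopped at a much lower severity threshold, the naive comparison among stopped civilians does not recover that effect.

```python
# Toy illustration (not from either paper): post-treatment selection on stops
# distorts the naive comparison of force rates among stopped civilians.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

minority = rng.binomial(1, 0.5, n)       # 1 = minority civilian, 0 = white civilian
severity = rng.uniform(0.0, 1.0, n)      # latent seriousness of the civilian's conduct

# Biased stopping rule: minorities are stopped for minor conduct (severity > 0.2),
# whites only for serious conduct (severity > 0.8).
stopped = np.where(minority == 1, severity > 0.2, severity > 0.8)

# Probability of force: a constant direct effect of race (0.10 for everyone)
# plus a dependence on severity.
p_force = 0.05 + 0.10 * minority + 0.30 * severity
force = rng.binomial(1, p_force)

# Naive comparison among stopped civilians vs. the true direct effect of 0.10.
naive = (force[stopped & (minority == 1)].mean()
         - force[stopped & (minority == 0)].mean())
print(f"naive difference among the stopped: {naive:.3f} (true direct effect: 0.100)")
```

On these made-up numbers the naive comparison comes out near zero even though the direct effect of race is 0.10 for every civilian, simply because white civilians enter the stopped sample only when their conduct is most serious.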

But there is a more important point that we hadn’t appreciated on first read: GCBSGH’s proposal is described as working even when treatment ignorability doesn’t hold. We now examine that aspect closely and find that the proposed approach recovers the estimand in this more general setting only if post-treatment selection bias exactly cancels out omitted-variable bias. Of course, analysts are free to assume whatever they want, but we think federal judges and civil rights organizations are unlikely to find this argument compelling.
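To see what that cancellation requires, it may help to write the problem schematically; the notation below is illustrative and is not taken from either paper. Letting D be the treatment, S = 1 indicate selection into the data, and Y the outcome, the naive treatment-control contrast among selected observations can be split into the target plus two bias terms:

\[
\underbrace{E[Y \mid D=1, S=1] - E[Y \mid D=0, S=1]}_{\text{naive contrast among selected observations}}
\;=\;
\underbrace{\mathrm{CDE}}_{\text{target}}
\;+\;
\underbrace{\Delta_{\text{selection}}}_{\text{post-treatment selection bias}}
\;+\;
\underbrace{\Delta_{\text{omitted}}}_{\text{omitted-variable bias}}.
\]

The proposal recovers the CDE only when \(\Delta_{\text{selection}} = -\Delta_{\text{omitted}}\), that is, when the two biases are exactly equal in magnitude and opposite in sign. That equality is the knife-edge condition.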

We realize this exchange got heated, so we’ve tried hard to dial back the tone and focus on the intellectual arguments. The back-and-forth has been relegated to a short section that runs through the claimed “counterexamples” (all of which mirror parenthetical asides and appendices in our original paper) and notes that we couldn’t find a textual basis for their critique anywhere in our paper. Ultimately this is a pretty minor point, but we said we needed assumptions 1-4 for X under conditions Y, and they attacked us for assumptions 2/4/5 not being necessary for Z. Frustrating, but I guess we could’ve been clearer.

I’ve not tried to follow all the details here so I’ll let people go at it in the comments section. I’ll just say that I’m not sure what to make of the “knife-edge” or “measure zero” issue. Almost all statistical methods are correct only under precise assumptions that will never hold exactly in observational data. But that doesn’t mean the methods are useless. When we estimated incumbency advantage in congressional elections (here and here), our estimates were only strictly kosher assuming random assignment or linearity and additivity, none of which are actually correct. This is not to say that Knox et al. are wrong in their arguments, just that biases will always be with us. Regarding the point about post-treatment selection bias exactly canceling out omitted variable bias: I think that must depend on what you’re estimating. This gets back to my comment in that earlier post that some of the disagreement between Knox et al. and Gaebler et al. is that they’re estimating different things.

I sent the above paragraph to Knox, who replied:

We must point out that your statement about estimands is simply inaccurate. Our new paper considers the exact same estimand as GCBSGH and we are very explicit about that. Our original paper also considered this estimand, among several others.

We also think the problem is pretty intuitive. You cannot, in general, pick up halfway through a selective process and run a standard analysis without bias, because the selection almost always breaks the apples-to-apples comparison between treatment and control.

The only time you can ignore selection is when various differences happen to accidentally cancel (see our paper), and there are simply no substantive reasons to believe this is happening. That is the real problem. As we say in the paper, this is not an as-if-randomness assumption; it is an assumption about observations being the same despite responding to treatment differently. The knife-edgedness isn’t the root of the issue; it just makes it worse.

I think that even if the estimands are the same mathematically, they can correspond to different applied goals. I still think the two sides of this debate are closer than they think. But in any case, at this point you can read it all yourself.