I’m still struggling to understand hypothesis testing . . . leading to a more general discussion of the role of assumptions in statistics

I’m sitting at this talk where Thomas Richardson is talking about testing the hypothesis regarding a joint distribution of three variables, X1, X2, X3. The hypothesis being tested is that X1 and X2 are conditionally independent given X3. I don’t have a copy of Richardson’s slides, but here’s a paper that I think is related, just to give you a general sense of his theoretical framework.

The thing that’s bugging me is that I can’t see why anyone would want to do this, test the hypothesis that X1 and X2 are conditionally independent given X3. My problem is that in any situation where these two variables could be conditionally dependent, I think they will be conditionally dependent. It’s the no-true-zeroes thing; see the discussion starting on page 960 here. I’m not really interested in testing a hypothesis that I know is false, that I know would be rejected if I could just gather enough data.
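To make the "would be rejected if I could just gather enough data" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from Richardson's talk: the linear-Gaussian model, the tiny dependence coefficient `beta = 0.02`, the sample sizes, and the use of a partial-correlation test with Fisher's z-transform as the conditional independence test.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def partial_corr_z(x1, x2, x3):
    """Partial correlation of x1 and x2 given x3, plus its Fisher z-statistic."""
    r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
    r = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    # With one conditioning variable the standard error of atanh(r)
    # is approximately 1 / sqrt(n - 4).
    z = math.atanh(r) * math.sqrt(len(x1) - 4)
    return r, z


def simulate(n, beta, rng):
    """Hypothetical model: x3 drives both x1 and x2; x1 also nudges x2 by beta."""
    x3 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [c + rng.gauss(0, 1) for c in x3]
    x2 = [c + beta * a + rng.gauss(0, 1) for a, c in zip(x1, x3)]
    return x1, x2, x3


rng = random.Random(0)
results = {}
for n in (200, 20_000, 200_000):
    r, z = partial_corr_z(*simulate(n, beta=0.02, rng=rng))
    results[n] = (r, z)
    print(f"n={n:>7}  partial corr={r:+.4f}  z={z:+.2f}")
```

Under these assumptions the true partial correlation is about 0.02 at every sample size. What changes with n is only whether the test has enough data to notice it, so the z-statistic at the largest n ends up far beyond any conventional threshold.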

That said, Thomas Richardson is a very reasonable person, so even though his talk is full of things that I think make no sense—he even brought up type 1 and type 2 errors!—I expect there’s something reasonable in all this research, I just have to figure out what.

I can think of a couple of possibilities.

First, maybe we’re not really trying to see if X1 and X2 are conditionally independent; rather, we’re trying to see whether we have enough data to reject the hypothesis of conditional independence. That is, the goal of the hypothesis test is not to accept or reject a hypothesis, but rather to make a statement about the strength of the data.

But I don’t think this is what Richardson was getting at, as he introduced the problem by saying that the goal would be to choose among models. I don’t like this. I think it would be a mistake to use the non-rejection of a hypothesis test to choose the model of conditional independence.

Second, maybe the hypothesis test is really being used as a sort of estimation. For example: I said that if two variables can be dependent, then they will. But what about a problem such as genetics, where two genes could be on the same or different chromosomes? If they’re on different chromosomes, you’ll have conditional independence. Well, not completely—there are always real-world complications—but close enough. I guess that’s the point, that “close enough” could be what you’re testing.

I think this might be what Richardson is getting at, because later in his presentation, he talked about strong and weak regimes of dependence. So I think the purpose of the hypothesis test is to choose the conditional independence model when the conditional dependence is weak enough.

OK, fine. But, given all that, I think it makes sense to estimate all these dependences directly rather than to test hypotheses. When I see all the contortions that are being done to estimate type 1 and type 2 errors . . . I just don’t see why bother. And I’m concerned that the application of these results can lead to bad science, in the same way that reasoning based on statistical significance can lead to bad science more generally.

That said, I can’t say that my above arguments are airtight. After all, my colleagues and I go around making inferences based on normal distributions, logistic regressions, and all sorts of other assumptions that we know are false.

Assuming something false and then working from there to draw inferences: it’s a ridiculous way to proceed but it’s what we all do. Except in some very rare cases (for example, working out the distribution of profits from a casino), there’s really no alternative.

True, I get a bit annoyed when statisticians, computer scientists, and others talk about “assumption-free methods” and “theoretical guarantees,” but that’s all just rhetoric. Once we accept that all methods and theorems are based on assumptions, we can all proceed on an equal basis.

At this point it would be tempting to say that assumptions are fine, we just need to evaluate our assumptions. But that won’t quite work either, as what does it mean to “evaluate” an assumption? The evaluation has to be along the lines of, How wrong is the assumption? But how to compare, for example, the wrongness of the assumption of a normal distribution for state-level error terms, the wrongness of the assumption of a logistic link mapping these to probability of Republican vote choice, and the wrongness of a conditional independence assumption?

I guess one problem I have with work such as Richardson’s on conditional independence is that I fear that the ultimate purpose of these methods is often to give researchers an excuse to exclude potentially important interactions from their models, just because these interactions are not statistically significant. The trouble here is that (a) whether something is statistically significant is itself a very random feature of data, so in this case you’re essentially outsourcing your modeling decision to a random number, and (b) if lack of statistical significance is a concern, which it can be, then I think the ultimate concern is not whether the interaction in question is zero, but rather that the uncertainty in that interaction is large. In which case I think the right approach is to recognize that uncertainty, both through partial pooling of the estimate and through propagation of that uncertainty in subsequent inferences.
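As a rough illustration of the contrast between dropping a non-significant interaction and partially pooling it, here is a small sketch. It assumes the simplest possible setup, which is not spelled out in the post: a single interaction estimate with a known standard error and a normal prior centered at zero with standard deviation `prior_sd`. The function names and the numbers are hypothetical.

```python
import math


def significance_filter(estimate, se, z_crit=1.96):
    """Keep the estimate only if it is 'statistically significant'; else set it to zero."""
    return estimate if abs(estimate / se) >= z_crit else 0.0


def partial_pool(estimate, se, prior_sd):
    """Posterior mean and sd under an assumed Normal(0, prior_sd^2) prior
    and a normal likelihood with known standard error."""
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_mean = weight * estimate
    post_sd = math.sqrt(weight) * se
    return post_mean, post_sd


# Two nearly identical noisy estimates on either side of the 1.96 cutoff.
for est in (1.9, 2.0):
    kept = significance_filter(est, se=1.0)
    mean, sd = partial_pool(est, se=1.0, prior_sd=1.0)
    print(f"estimate={est:.1f}  filtered={kept:.2f}  pooled={mean:.2f} +/- {sd:.2f}")
```

In this sketch the filter jumps from 0 to 2.0 when the estimate moves from 1.9 to 2.0, while the pooled estimate moves smoothly from 0.95 to 1.0 and carries its uncertainty along with it. The pooled answer depends on the assumed prior, which is one more assumption to be argued about.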

But then again you could say something similar about the statistical methods that my colleagues and I use, in that we’re riding on strong assumptions—just a different set of assumptions, that’s all.

So I’m not sure what to think. Different methods can work well on different applied problems, and all the methods discussed above are general frameworks, not specific algorithms or models, which means that effectiveness can come in the details—recall the principle that the most important aspect of a statistical method is not what it does with the data but rather what data it uses—so I can well imagine that, in the right hands, modeling the world in terms of conditional independence and estimating this structure through hypothesis testing could solve real problems. Still, that model seems awkward to me. It bothers me, and I’d need to be convinced that it really does anything useful.