How should those Lancet/Surgisphere/Harvard data have been analyzed?

As you will recall, the original criticism of the recent Lancet/Surgisphere/Harvard paper on hydro-oxy-whatever was not that the data came from a Theranos-like company that employs more adult-content models than statisticians, but rather that the data, being observational, required some adjustment to yield strong causal conclusions—and the causal adjustment reported in that article did not seem to be enough.

As James “not the racist dude who assured us that cancer would be cured by 2000” Watson wrote:

This is a retrospective study using data from 600+ hospitals in the US and elsewhere with over 96,000 patients, of whom about 15,000 received hydroxychloroquine/chloroquine (HCQ/CQ) with or without an antibiotic. The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people. . . .

The most obvious confounder is disease severity . . . The authors say that they adjust for disease severity but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. . . . This temporal aspect cannot be picked up by a single severity measurement.

In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. . . .
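
To make Watson's point concrete, here is a minimal simulation of confounding by severity. All the numbers are invented for illustration (they are not the paper's data); the setup just assumes that sicker patients were more likely to receive the drug, and that the drug itself does nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 96_000  # sample size in the same ballpark as the paper's

# Latent disease severity, only partly captured by a couple of
# binary measurements like oxygen saturation and qSOFA.
severity = rng.normal(0, 1, n)

# Sicker patients are more likely to be given the drug...
treated = rng.random(n) < 1 / (1 + np.exp(-(-2.0 + 1.2 * severity)))

# ...and mortality depends on severity only; the true drug effect is zero.
died = rng.random(n) < 1 / (1 + np.exp(-(-2.5 + 1.5 * severity)))

print(f"treated mortality: {died[treated].mean():.1%}")
print(f"control mortality: {died[~treated].mean():.1%}")
# Treated mortality comes out roughly double the control mortality,
# entirely from confounding by severity: a null drug looks like a killer.
```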

Setting aside data problems of the sort that caused the article to be retracted, what should they have done?

I don’t have a specific answer for this particular study, but the general ideas are discussed in various causal inference textbooks, for example chapter 20 of Regression and Other Stories. The basic idea is to start with the comparison of treated and control groups in the observed data, then compare the two groups with respect to demographics and pre-treatment health status, then do some combination of matching and regression to estimate the treatment effect among the subset of people who had a realistic chance of receiving either option, then see how the estimate changes as you adjust for more things, and finally consider the effects of adjusting for important but unmeasured pre-treatment predictors. It’s all assumption-based, but if you do it carefully and make your assumptions clear, you can learn something. It’s not a button that can be pushed in a statistics package.
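
As a rough sketch of that workflow, here is what the steps might look like in code, on simulated data. The covariates, the overlap rule, the crude nearest-neighbor matching, and the E-value heuristic at the end are all illustrative choices of mine, not a recipe from the paper or from Regression and Other Stories:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000

# Simulated pre-treatment covariates: age, sex, and a severity proxy.
age = rng.normal(60, 15, n)
male = rng.integers(0, 2, n)
severity = rng.normal(0, 1, n)
X = np.column_stack([age, male, severity])

# Treatment depends on severity; mortality depends on age and severity.
# The true treatment effect is zero by construction.
treated = rng.random(n) < 1 / (1 + np.exp(-(-2.0 + 1.2 * severity)))
died = rng.random(n) < 1 / (1 + np.exp(-(-3.0 + 0.03 * (age - 60) + 1.5 * severity)))

# Step 1: the raw comparison of treated and controls, which is confounded.
print("raw difference:      ", died[treated].mean() - died[~treated].mean())

# Step 2: estimate a propensity score from the pre-treatment covariates.
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Step 3: restrict to the region of overlap -- people who had a realistic
# chance of receiving either option.
keep = (ps >= ps[treated].min()) & (ps <= ps[~treated].max())

# Step 4: 1:1 nearest-neighbor matching on the propensity score, with
# replacement (a deliberately crude version of matching).
t_idx = np.where(treated & keep)[0]
c_idx = np.where(~treated & keep)[0]
matches = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]
print("matched difference:  ", died[t_idx].mean() - died[matches].mean())

# Step 5: within the matched sample, also adjust for the covariates by
# regression (a linear probability model, for simplicity).
idx = np.concatenate([t_idx, matches])
Z = np.column_stack([np.ones(idx.size), treated[idx], X[idx]])
coef, *_ = np.linalg.lstsq(Z, died[idx].astype(float), rcond=None)
print("matched + regression:", coef[1])

# Step 6: a sensitivity heuristic for unmeasured confounding. The E-value
# (VanderWeele and Ding, 2017) is the minimum strength of association, on
# the risk-ratio scale, that an unmeasured confounder would need with both
# treatment and outcome to fully explain away an observed risk ratio.
rr = died[treated].mean() / died[~treated].mean()
print("E-value for raw risk ratio:", rr + np.sqrt(rr * (rr - 1)))
```

On the simulated data, the matched and regression-adjusted estimates shrink toward the true value of zero, and the point of Step 6 is the one made above: a raw risk ratio near 2 would require quite a strong unmeasured confounder to explain away, which is exactly why the adequacy of the severity adjustment matters so much.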

See here for further discussion of the challenges of adjustment for causal inference in observational studies.