Is JAMA potentially guilty of manslaughter?

No, of course not. I would never say such a thing. Sander Greenland, though, he’s a real bomb-thrower. He writes:

JAMA doubles down on distortion – and potential manslaughter if on the basis of this article anyone prescribes HCQ in the belief it is unrelated to cardiac mortality:

– “compared with patients receiving neither drug cardiac arrest was significantly more likely in patients receiving hydroxychloroquine+azithromycin (adjusted OR, 2.13 [95% CI, 1.12-4.05]), but not hydroxychloroquine alone (adjusted OR, 1.91 [95% CI, 0.96-3.81]).”

– never mind that the null is already not credible… see “Do You Believe in HCQ for COVID-19? It Will Break Your Heart” and “Hydroxychloroquine-Triggered QTc-Interval Prolongations in COVID-19 Patients.”
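
It’s worth pausing on how close those two odds ratios actually are. A quick back-of-the-envelope check in R (recovering approximate standard errors from the published intervals, and ignoring the covariance from the shared reference group, so treat it as rough) shows that the difference between the “significant” and “not significant” estimates is itself nowhere near significant:

```r
# Recover approximate log-OR standard errors from the published 95% CIs
se_from_ci <- function(lo, hi) (log(hi) - log(lo)) / (2 * qnorm(0.975))

logor_combo <- log(2.13); se_combo <- se_from_ci(1.12, 4.05)  # HCQ + azithromycin
logor_hcq   <- log(1.91); se_hcq   <- se_from_ci(0.96, 3.81)  # HCQ alone

# z-statistic for the difference between the two log odds ratios
z <- (logor_combo - logor_hcq) / sqrt(se_combo^2 + se_hcq^2)
round(z, 2)  # about 0.23, nowhere near the 1.96 cutoff
```

The difference between “significant” and “not significant” is not itself statistically significant.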

I’m not so used to reading medical papers, but I thought this would be a good chance to learn something, so I took a look. The JAMA article in question is “Association of Treatment With Hydroxychloroquine or Azithromycin With In-Hospital Mortality in Patients With COVID-19 in New York State,” and here are its key findings:

In a retrospective cohort study of 1438 patients hospitalized in metropolitan New York, compared with treatment with neither drug, the adjusted hazard ratio for in-hospital mortality for treatment with hydroxychloroquine alone was 1.08, for azithromycin alone was 0.56, and for combined hydroxychloroquine and azithromycin was 1.35. None of these hazard ratios were statistically significant.
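
For those of us who don’t read these papers every day: adjusted hazard ratios like these typically come out of a Cox proportional hazards regression. Here’s a minimal sketch of that kind of model in R, using the standard survival package; the data frame and variable names (covid, time_to_death, died, treatment, age, sex, comorbidities) are hypothetical stand-ins for the paper’s actual data and covariates:

```r
library(survival)

# Cox proportional hazards model: time to in-hospital death, adjusted for
# covariates. All names below are hypothetical placeholders.
fit <- coxph(Surv(time_to_death, died) ~ treatment + age + sex + comorbidities,
             data = covid)

exp(coef(fit))     # adjusted hazard ratios
exp(confint(fit))  # 95% intervals on the hazard-ratio scale
```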

I sent along some quick thoughts, and Sander responded to each of them! Below I’ll copy my remarks and Sander’s reactions. Medical statistics is Sander’s expertise, not mine, so you’ll see that my thoughts are more speculative and his are more definitive.

Andrew: The study is observational, not experimental, but maybe that’s not such a big deal, given that they adjusted for so many variables? It was interesting to me that they didn’t mention the observational nature of the data in their Limitations section. Maybe they don’t bother mentioning it in the Limitations because they mention it in the Conclusions.

Sander: In med there is automatic downgrading of observational studies below randomized (no matter how fine the former or irrelevant the latter – med RCTs are notorious for patient selectivity). So I’d guess they didn’t feel any pressure to emphasize the obvious. But I’d not have let them get away with spinning it as “may be limited” – that should be “is limited.”

Andrew: I didn’t quite get why they analyzed time to death rather than just survive / not survive. Did they look at time to death because it’s a way of better adjusting for length of hospital stay?

Sander: I’d just guess they could say they chose to focus on death because that’s the bottom line – if you are doomed to die within this setting, it might be arguably better for both the patient in suffering (often semi-comatose) and terminal care costs to go early (few would dare say that in a research article).

[This doesn’t quite answer my question. I understand why they are focusing on death as an outcome. My question is why don’t they just take survive/death in hospital as a binary outcome? Why do the survival analysis? I don’t see that dying after 2 days is so much worse than dying after 5 days. I’m not saying the survival analysis is a bad idea; I just want to understand why they did it, rather than a more simple binary-outcome model. — AG]
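
For concreteness, here’s what the simpler analysis would look like, reusing the hypothetical variable names from the sketch above. (One standard argument for the survival model over the binary one, for what it’s worth, is censoring: patients still in the hospital at the data cutoff have a survival time so far but no final binary outcome.)

```r
# The simpler alternative: in-hospital death as a binary outcome, no time axis.
# Variable names are hypothetical, carried over from the earlier sketch.
fit_binary <- glm(died ~ treatment + age + sex + comorbidities,
                  data = covid, family = binomial)
exp(coef(fit_binary))  # adjusted odds ratios rather than hazard ratios
```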

Andrew: The power analysis seems like a joke: the study is powered to detect a hazard ratio of 0.65 (i.e., about 1.5 if you take the ratio in the other direction). That’s a huge assumed effect, no?

Sander: I view all power commentary for database studies like this one as a joke, period, part of the mindless ritualization of statistics that is passed off as needed for “objective standards.” (It has a rationale in RCTs to show that the study was planned to that level of detail, but still has no place in the analysis.)
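
To see just how big an assumed effect a hazard ratio of 0.65 is, here’s a rough version of the usual calculation, via Schoenfeld’s approximation for the number of events needed in a Cox model (assuming 80% power, two-sided alpha of 0.05, and 1:1 allocation; the point is the order of magnitude, not the exact arithmetic):

```r
# Schoenfeld's approximation: required number of deaths for a Cox model,
# 1:1 allocation, 80% power, alpha = 0.05 (two-sided)
events_needed <- function(hr) 4 * (qnorm(0.975) + qnorm(0.80))^2 / log(hr)^2

round(events_needed(0.65))  # about 170 deaths, under the assumed huge effect
round(events_needed(0.85))  # about 1200 deaths, for a more modest effect
```

For comparison, the mortality numbers quoted below (243 + 49) come to 292 deaths in total.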

Andrew: I can’t figure out why they include p-values in their balance table (Table 1). It’s not a randomized assignment so the null hypothesis is of no interest. What’s of interest is the size and direction of the imbalance, not a p-value.

Sander: Agreed. I once long ago argued with Ben Hansen about that in the context of confounder scoring, to no resolution. But at least he tried his best to give a rationale; I’m sure here it’s just another example of ritualized reflexes.
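
What a balance table could report instead of p-values is standardized mean differences, which speak directly to the size and direction of imbalance. A minimal sketch, again with hypothetical names (covid data frame, binary treated flag, numeric covariate):

```r
# Standardized mean difference for one covariate: difference in group means
# in pooled-standard-deviation units, sign giving the direction of imbalance
smd <- function(x, treated) {
  m1 <- mean(x[treated]); m0 <- mean(x[!treated])
  s  <- sqrt((var(x[treated]) + var(x[!treated])) / 2)
  (m1 - m0) / s
}

smd(covid$age, covid$treatment == "hcq")  # e.g., imbalance in age, in SD units
```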

Andrew: Figure 2 is kinda weird. It has those steps, but it looks like a continuous curve. It should be possible to make a better graph using raw data. With some care, you should be able to construct such a graph to incorporate the regression adjustments. This is an obvious idea; I’m sure there are 50 biostatistics papers on the topic of how to make such graphs.

Sander: Proposals for using splines to develop such curves go back at least to the 1980s and are interesting in that their big advantage comes in rate comparisons in very finite samples, e.g., most med studies. (Technically the curves in Fig. 3 are splines too – zero-order piecewise constant splines).

[But I don’t think that’s what’s happening here. I don’t think those curves are fits to data; I’m guessing these are just curves from a fitted model that have been meaninglessly discretized. They look like Kaplan-Meier curves but they’re not. — AG]
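
The distinction in that last note, in code: a Kaplan-Meier curve is a genuine step-function summary of the raw data, while a curve pulled out of a fitted Cox model is a model prediction that merely gets discretized for plotting. A sketch, reusing the hypothetical fit from above:

```r
library(survival)

# Kaplan-Meier: step functions computed directly from the raw data
km <- survfit(Surv(time_to_death, died) ~ treatment, data = covid)
plot(km, xlab = "Days", ylab = "Survival")

# Model-based curve: a prediction from the fitted Cox model, evaluated at a
# hypothetical reference patient; smooth in spirit even if drawn with steps
adj <- survfit(fit, newdata = data.frame(treatment = "neither", age = 65,
                                         sex = "male", comorbidities = 2))
lines(adj, col = "red")
```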

Andrew: Table 3 bothers me. I’d like to see the unadjusted and adjusted rates of death and other outcomes for the 4 groups, rather than all these comparisons.

Sander: Isn’t what you want in Fig. 3 and Table 4? Fig. 3 is very suspect for me as the HCQ-alone and neither groups look identical there. I must have missed something in the text (well, I missed a lot). Anyway I do want comparisons in the end, but Table 3 is in my view bad because the comparisons I’d want would be differences and ratios of the probabilities, not odds ratios (unless in all categories the outcomes were uncommon, which is not the case here). But common software (they used SAS) does not offer my preferred option easily, at least not with clustered data like theirs. That problem arises again in their use of the E-value with their odds ratios, but the E-value in their citation is for risk ratios. By the way, Ioannidis has vociferously criticized the E-value in print from his usual nullistic position, and I have a comment in press criticizing the E-value from my anti-nullistic position!
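
To make the odds-ratio-versus-risk-ratio point concrete: when the outcome is common, the odds ratio overstates the risk ratio, and the E-value formula in the paper they cite is defined for risk ratios. A sketch with assumed numbers, taking the quoted adjusted OR of 2.13 and a hypothetical 10% baseline risk in the reference group:

```r
or <- 2.13  # published adjusted odds ratio (cardiac arrest, HCQ + azithromycin)
p0 <- 0.10  # hypothetical baseline risk in the reference group

rr     <- or / (1 - p0 + p0 * or)  # Zhang-Yu approximation: about 1.91
evalue <- rr + sqrt(rr * (rr - 1)) # E-value formula for a risk ratio > 1
c(rr = rr, evalue = evalue)
```

At a 10% baseline risk the discrepancy is modest, but it grows quickly as the outcome gets more common.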

Andrew: Their conclusion is that the treatment “was not significantly associated with differences in in-hospital mortality.” I’d like to see a clearer disentangling. In the main results section, it says that 24% of patients receiving HCQ died (243 out of 1006), compared to 11% of patients not receiving HCQ (49 out of 432). The statistical adjustment reduced this difference. I guess I’d like to see a graph with estimated difference on the y-axis and the amount of adjustment on the x-axis.

Sander: That’s going way beyond anything I normally see in the med lit. And I’m sure this was a rush job given the topic.

[Yeah, I see this. What I really want to do is to make this graph in some real example, then write it up, then put it in a textbook and an R package, and then maybe in 10 years it will be standard practice. You laugh, but 10 years ago nobody in political science made coefficient plots from fitted regressions, and now everyone’s doing it. And they all laughed at posterior predictive checks, but now people do that too. It was all kinds of hell to get our R-hat paper published back in 1991/1992, and now people use it all the time. And MRP is a thing. Weakly informative priors too! We can change defaults; it just takes work. I’ve been thinking about this particular plot for at least 15 years, and at some point I think it will happen. It took me about 15 years to write up my thoughts about Popperian Bayes, but that happened, eventually! — AG]
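
Here is roughly what that graph would take, as a sketch: refit the model adding one adjustment variable at a time and plot the treatment estimate against the amount of adjustment. All names below (covariates, factor levels, the coefficient label) are hypothetical:

```r
library(survival)

adjusters <- c("age", "sex", "comorbidities", "oxygen_saturation")

# Hazard ratio for HCQ vs. neither, after adjusting for 0, 1, ..., 4 covariates
hrs <- sapply(0:length(adjusters), function(k) {
  rhs <- paste(c("treatment", adjusters[seq_len(k)]), collapse = " + ")
  f   <- coxph(as.formula(paste("Surv(time_to_death, died) ~", rhs)),
               data = covid)
  exp(coef(f)["treatmenthcq"])
})

plot(0:length(adjusters), hrs, type = "b",
     xlab = "Number of adjustment variables", ylab = "Estimated hazard ratio")
```

A fancier version would order the adjusters by importance, or trace out a continuous path, but even the crude version shows how much of the raw difference the adjustment is absorbing.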

Andrew: This would be a good example to study further. But I’m guessing I already know the answer to the question, Are the data available?

Sander: Good luck! Open data is anathema to much of the med community, aggravated by the massive confidentiality requirements imposed by funders, IRBs, and institutional legal offices. Prophetically, Rothman wrote a 1981 NEJM editorial lamenting the growing problem of these requirements and how they would strangulate epidemiology; a few decades later he was sued by an ambulance-chasing lawyer representing someone in a database Rothman had published a study from, on grounds of potentially violating the patient’s privacy.

[Jeez. — AG]

Andrew: I assume these concerns are not anything special with this particular study; it’s just the standard way that medical research is reported.

Sander: A standard way, yes. JAMA ed may well have forced everything bad above on this team’s write-up – I’ve heard of several cases where that is exactly what the authors reported upon critical inquiries from me or colleagues about their statistical infelicities. JAMA and the journals that model themselves on it are the worst that I know of in this regard. Thanks, I think, in part to the good influences of Steve Goodman, AIM and some other more progressive journals are less rigid; and most epidemiology journals (which would often publish studies like this one, were it not for the urgency) are completely open to alternative approaches. One, Epidemiology, actively opposes and forbids the JAMA approach (just as JAMA forbids our approach), much to the ire of biostatisticians who built their careers around 0.05.

[Two curmudgeons curmudging . . . but I think this is good stuff! Too bad there isn’t more of this in scientific journals. The trouble is, if we want to get this published, we’d need to explain everything in detail, and then you lose the spontaneity. — AG]

P.S. Regarding that clickbait title . . . OK, sure, JAMA’s not killing anybody. But, if you accept that medical research can and should have life-and-death implications, then mistakes in medical research could kill people, right? If you want to claim that your work is high-stakes and important, then you have to take responsibility for it. And it is a statistical fallacy to take a non-statistically-significant result from a low-power study and use this as a motivation to default to the null hypothesis.
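
That last point is easy to check by simulation: hand a low-power study a real effect, and most of the time it will come back “not significant.” A sketch with assumed numbers (a true odds ratio of about 2, 100 patients per group, 2000 simulated studies):

```r
set.seed(1)

# Each simulated study: 10% risk in controls, 18% in treated (true OR about 2)
pvals <- replicate(2000, {
  y0 <- rbinom(100, 1, 0.10)  # control group
  y1 <- rbinom(100, 1, 0.18)  # treated group
  fisher.test(table(rep(0:1, each = 100), c(y0, y1)))$p.value
})

mean(pvals < 0.05)  # power well under 50%: most runs miss a real effect
```

Defaulting to the null after each of those “failures” would mean getting it wrong in the majority of studies.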