Various people have been contacting me lately about recommendations for design and analysis of clinical trials, with application to coronavirus. Below are some quick thoughts, or you can scroll down to the Summary Recommendations at the end. I’m sure there’s lots more to say on this topic but I’ll get my quick thoughts down here.
Basic analysis of one experiment
Jonathan Falk points us to this WSJ op-ed by Jeff Colyer and Daniel Hinthorn that reports:
But researchers in France treated a small number of patients with both hydroxychloroquine and a Z-Pak, and 100% of them were cured by day six of treatment. Compare that with 57.1% of patients treated with hydroxychloroquine alone, and 12.5% of patients who received neither. What’s more, most patients cleared the virus in three to six days rather than the 20 days observed in China.
Hmmm, 57.1% . . . that reminds me of that famous number 142857. 57.1% is exactly 4/7!
And 12.5%, of course that’s 1 out of 8.
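The fractions behind those percentages can be checked by brute force. Here is a minimal sketch in base R; the helper name `consistent_n` and the cap of 30 patients per arm are assumptions made for illustration, not anything from the study.

```r
# Group sizes n (up to max_n) for which some count k out of n matches the
# reported percentage once k/n is rounded to one decimal place.
consistent_n <- function(pct, max_n = 30) {
  n <- 1:max_n
  ok <- sapply(n, function(m) any(abs(100 * (0:m) / m - pct) < 0.05))
  n[ok]
}

consistent_n(57.1)   # 7 14 21 28
consistent_n(12.5)   # 8 16 24
```

For a small study, only multiples of 7 and of 8 reproduce the reported figures.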
First off, for 57.1% of patients treated with hydroxychloroquine alone to be helped, the number given hydroxychloroquine alone must be a multiple of 7, and the number getting neither must be a multiple of 8. With 100% success for both, we don’t know how many were given both. But let’s say that the trial had three arms of 8 (with one of the patients in the hydroxychloroquine-alone arm having dropped out for some reason).
That seems like a reasonable guess, so let’s go with it. In evaluating the new treatment (hydroxychloroquine and a Z-Pak), it seems that the relevant comparison is to hydroxychloroquine alone. The comparison is simple enough to do. In R:
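library("rstanarm")
library("arm")
# x = 0: hydroxychloroquine alone (7 patients); x = 1: combination (8 patients)
x <- rep(c(0, 1), c(7, 8))
# y = 1 if cured by day six: 4 of 7 with x = 0, 8 of 8 with x = 1
y <- rep(c(0, 1, 0, 1), c(3, 4, 0, 8))
fit <- stan_glm(y ~ x, family=binomial(link="logit"), refresh=0)
print(fit)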
Here's what we get:
            Median MAD_SD
(Intercept) 0.7    0.8
x           2.7    1.5
As you might expect, you get a big fat uncertainty interval, with the data roughly consistent with no effect on one extreme, or an effect of 5 on the other. Not too many treatments have an effect of 5 on the logit scale (that will take you from 50% to 99% on the probability scale), so I wouldn't take that high end too seriously.
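To see what a shift of 5 on the logit scale means in probability terms, here is a quick check in base R, where `qlogis()` is the logit and `plogis()` its inverse. The variable names are just for illustration.

```r
# Convert a baseline probability to log-odds, add the effect, convert back.
baseline <- 0.5
effect <- 5
plogis(qlogis(baseline) + effect)   # about 0.993

# The same shift applied to the 4/7 cure rate on hydroxychloroquine alone
plogis(qlogis(4/7) + effect)
```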
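You can't do the analysis using regular glm as there's complete separation. If you want to use classical, non-Bayesian inference, you can do the Agresti and Coull approach and add 2 successes and 2 failures to each group, thus comparing 6/11 to 10/12:
# Adjusted proportions: (4+2)/(7+4) and (8+2)/(8+4)
y <- c(6/11, 10/12)
n <- c(11, 12)
diff <- y[2] - y[1]
se <- sqrt(sum(y*(1 - y)/n))
print(round(c(diff, se), 2))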
The result:
0.29 0.18
OK, this is on the probability scale: the estimated increase in cure rate could be anywhere from zero (or slightly negative) to huge.
Yet another possible analysis is to do a hypothesis test, but I'm not particularly interested in a hypothesis test because what I really want to know is the increase in cure rate.
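Alternatively, we could pool the two active treatments; then we have 12/15 cures in the treated group, compared to 1/8 in the control group:
# x = 0: neither drug (8 patients); x = 1: either active treatment (15 patients)
x <- rep(c(0, 1), c(8, 15))
# y = 1 if cured by day six: 1 of 8 with x = 0, 12 of 15 with x = 1
y <- rep(c(0, 1, 0, 1), c(7, 1, 3, 12))
fit <- stan_glm(y ~ x, family=binomial(link="logit"), refresh=0)
print(fit)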
Here's what we get:
            Median MAD_SD
(Intercept) -1.7   0.9
x            3.0   1.0
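Or the Agresti-Coull:
# Adjusted proportions: (1+2)/(8+4) and (12+2)/(15+4)
y <- c(3/12, 14/19)
n <- c(12, 19)
diff <- y[2] - y[1]
se <- sqrt(sum(y*(1 - y)/n))
print(round(c(diff, se), 2))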
The result:
0.49 0.16
Still a lot of uncertainty but it seems like a clear improvement. Or maybe it's some other aspect of the treatment; I don't know if the study was blinded.
In any case, no matter what analysis is done, the obvious recommendation here is to test the treatment on more than 24 people!
Doing more with the data
You're potentially throwing away a lot of information by summarizing each person's result by a binary, cured-or-not-cured-after-6-days variable:
- "Cured" is measured by some biomarkers? You could have a continuous measure, no?
- Why just 6 days? What's happening after 2 days, 4 days, 8 days, 10 days, etc?
Getting more granular data will also help resolve the difficulty of that 8-out-of-8 thing, moving the data off the boundary.
Analyzing many experiments
There's not just one therapy being tested! Lots of things are being tried. I think it makes sense to embed all these studies in a hierarchical model with treatment-level predictors, partial pooling, the whole deal.
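As a rough illustration of what partial pooling does, here is a minimal sketch in base R. It is not the full hierarchical model with treatment-level predictors: it assumes a normal approximation and a fixed between-study standard deviation `tau`, and every number in it is invented for illustration.

```r
# Hypothetical inputs: estimated treatment effects (log-odds scale) and
# standard errors from 4 small studies. All of these numbers are made up.
est <- c(2.7, 0.4, 1.1, -0.2)
se  <- c(1.5, 0.6, 0.9, 0.7)
tau <- 0.5   # assumed between-study sd; a full model would estimate this

# Precision-weighted mean across studies, with weights 1 / (se^2 + tau^2)
w  <- 1 / (se^2 + tau^2)
mu <- sum(w * est) / sum(w)

# Partial pooling: each estimate is pulled toward mu, more so when it is noisy
shrink <- tau^2 / (tau^2 + se^2)   # weight kept on the study's own estimate
pooled <- mu + shrink * (est - mu)

round(cbind(est, se, shrink, pooled), 2)
```

The noisiest study keeps the least weight on its own estimate, which is the behavior you want when a tiny trial reports a huge effect. In practice one would fit the multilevel model directly (for example in Stan) rather than fix `tau` by hand.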
Let's get real here. We're not trying to get a paper published, we're trying to save lives and save time.
When considering design for a clinical trial I'd recommend assigning cost and benefits and balancing the following:
- Benefit (or cost) of possible reduced (or increased) mortality and morbidity from COVID in the trial itself.
- Cost of toxicity or side effects in the trial itself.
- Public health benefits of learning that the therapy works, as soon as possible.
- Economic / public confidence benefits of learning that the therapy works, as soon as possible.
- Benefits of learning that the therapy doesn't work, as soon as possible, if it really doesn't work.
- Scientific insights gained from intermediate measurements or secondary data analysis.
- Dollar cost of the study itself, as well as the opportunity cost if it reduces your effort to test something else.
This may look like a mess, but if you're not addressing these issues explicitly, you're addressing them implicitly. The problem's important, so you want your sample size to be as large as possible. So first test everyone in the U.S., then take all the people with coronavirus and divide them into 2 groups, etc. OK, we can't do this because we can't test everyone, so we don't have infinite resources . . . Also, maybe don't do it on 100,000 people because maybe the regimen has some side effects . . . etc.
And, as always, I don't think "statistically significant" should be the goal. Suppose that the treatment increases recovery rate from, say, 80% to 85%. That's pretty good. But you'd like to know who those other 15% are. Maybe the treatment helps among some groups and not others, etc.
That said, if the goal is a quick statistics answer, then, sure, you can do some simulations: for example, suppose the recovery rate after 3 days is X without the therapy and Y with the therapy, and you do a study with N people in each group, etc.
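A fake-data simulation along those lines might look like the following sketch in base R. The function name `simulate_trials`, the sample sizes, and the choice of X = 0.80 and Y = 0.85 (borrowed from the 80%-to-85% example above) are assumptions for illustration.

```r
# Simulate many two-arm trials with n patients per arm and true recovery
# rates p_control and p_treated, then summarize the estimated difference.
simulate_trials <- function(p_control, p_treated, n, n_sims = 10000) {
  y0 <- rbinom(n_sims, n, p_control)   # recoveries in the control arm
  y1 <- rbinom(n_sims, n, p_treated)   # recoveries in the treated arm
  diff <- y1 / n - y0 / n              # estimated difference in each trial
  c(mean_diff = mean(diff),
    sd_diff = sd(diff),
    prop_not_positive = mean(diff <= 0))
}

set.seed(123)
for (n in c(8, 50, 500)) {
  print(round(simulate_trials(0.80, 0.85, n), 3))
}
```

Here `sd_diff` shows how noisy the estimate is at each sample size, and `prop_not_positive` is the share of simulated trials in which a therapy that truly helps looks no better than control.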
Whatever therapies are being tried should be monitored. Doctors should have some freedom to experiment, and they should be recording what happens. To put it another way, they're trying different therapies anyway, so let's try to get something useful out of all that.
It's also not just about "what works" or "does a particular drug work," but how to do it. For example, Colyer and Hinthorn write:
On March 9 a team of researchers in China published results showing hydroxychloroquine was effective against the 2019 coronavirus in a test tube. The authors suggested a five-day, 12-pill treatment for Covid-19: two 200-milligram tablets twice a day on the first day followed by one tablet twice a day for four more days.
You want to get something like optimal dosing, which could depend on individuals. But you're not gonna get good discrimination on this from a standard clinical trial or set of clinical trials. So we have to go beyond the learning-from-clinical-trial paradigm, designing large studies that mix experiment and observation to get insight into dosing etc.
Also, lots of the relevant decisions will be made at the system level, not the individual level. For example, Colyer and Hinthorn write:
Emergency rooms run the risk of one patient exposing a dozen nurses and doctors. Instead of exposed health workers getting placed on 14-day quarantine, they could receive hydroxychloroquine for five days, then test for the virus. That would allow health-care workers to return to work sooner if they test negative.
These sorts of issues are super important and go beyond the standard clinical-trial paradigm.
Summary recommendations
- Bayesian inference for treatment effect, not hypothesis test.
- Include more information from each patient, not just cured or not.
- Design and analyze multiple studies together using multilevel model.
- Use fake-data simulation when designing a study.
- Formal decision analysis using numbers for costs and benefits.
- Relevant decisions and outcomes are at the system level, not just the individual level.
- Continue gathering data after the treatment is released into the wild.
- Analyze clinical trial and subsequent data to get recommendations for dosing, drug combinations, etc., beyond simple yes/no on a single treatment plan.