Challenges to the Reproducibility of Machine Learning Models in Health Care; also a brief discussion about not overrating randomized clinical trials

Mark Tuttle pointed me to this article by Andrew Beam, Arjun Manrai, and Marzyeh Ghassemi, Challenges to the Reproducibility of Machine Learning Models in Health Care, which appeared in the Journal of the American Medical Association. Beam et al. write:

Reproducibility has been an important and intensely debated topic in science and medicine for the past few decades. . . . Against this backdrop, high-capacity machine learning models are beginning to demonstrate early successes in clinical applications . . . This new class of clinical prediction tools presents unique challenges and obstacles to reproducibility, which must be carefully considered to ensure that these techniques are valid and deployed safely and effectively.

Reproducibility is a minimal prerequisite for the creation of new knowledge and scientific progress, but defining precisely what it means for a scientific study to be “reproducible” is complex and has been the subject of considerable effort by both individual researchers and organizations like the National Academies of Sciences, Engineering, and Medicine. . . .

Replication is especially important for studies that use observational data (which is almost always the case for machine learning studies) because these data are often biased, and models could operationalize this bias if not replicated. The challenges of reproducing a machine learning model trained by another research team can be difficult, perhaps even prohibitively so, even with unfettered access to raw data and code. . . .

Machine learning models have an enormous number of parameters that must be either learned using data or set manually by the analyst. In some instances, simple documentation of the exact configuration (which may involve millions of parameters) is difficult, as many decisions are made “silently” through default parameters that a given software library has preselected. These defaults may differ between libraries and may even differ from version to version of the same library. . . .

Even if these concerns are addressed, the cost to reproduce a state-of-the-art deep learning model from the beginning can be immense. For example, in natural language processing a deep learning model known as the “transformer” has led to a revolution in capabilities across a wide range of tasks, including automatic question answering, machine translation, and algorithms that can write complex and nuanced pieces of descriptive text. Perhaps unsurprisingly, transformers require a staggering amount of data and computational power and can have in excess of 1 billion trainable parameters. . . . A recent study estimated that the cost to reproduce 1 of these models ranged from approximately $1 million to $3.2 million using publicly available cloud computing resources. Thus, simply reproducing this model would require the equivalent of approximately 3 R01 grants from the National Institutes of Health and would rival the cost of some large randomized clinical trials. . . .
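
About those defaults that get set “silently”: here’s a minimal sketch (mine, not the authors’; scikit-learn and a plain logistic regression are just stand-ins) of the kind of bookkeeping that helps, spelling out every hyperparameter the library filled in, together with the library version, so that a change in defaults between versions can’t slip by unnoticed.

```python
# A minimal sketch of making "silent" defaults explicit: record the full
# hyperparameter configuration and the library version alongside the model.
# (scikit-learn is just a stand-in here, not what the article's authors used.)
import json

import sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # every argument left at the library's default

config = {
    "library": "scikit-learn",
    "version": sklearn.__version__,
    "params": model.get_params(),  # all the defaults, spelled out explicitly
}
print(json.dumps(config, indent=2, default=str))
```

Saving that dictionary next to the trained model is cheap, and it is exactly the configuration that a replication attempt would otherwise have to guess at.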

I sent this to Bob Carpenter, who’s been thinking a lot about the replication crisis in machine learning, and who knows all about natural language processing too.

Here was Bob’s reaction:

None of what they list is unique to ML: lots of algorithm parameters with default settings, randomness in the algorithms, differing results between library versions. They didn’t mention different results from floating point due to hardware or software settings or compilers. About random seeds, they say, “One study found that changing this single, apparently innocuous number could inflate the estimated model performance by as much as 2-fold relative to what a different set of random seeds would yield.” Variation from different seeds isn’t innocuous, it’s fundamental, and it should be required in reporting results. There’s nothing different about deep belief nets in this regard compared to, say, MCMC or even k-means clustering via EM (algorithms that have been around since the 1950s and 1970s). All too often, multiple runs are done, the best one is reported, and the variance is ignored.

The costs do seem larger in ML. One to three megabucks to reproduce the NLP transformers is indeed daunting. I also liked how they used (U.S. National Institutes of Health) R01 grants as the scale instead of megabucks.

Can’t wait to see the blog responses after six months of cave aging.

We’ll see if we get any comments at all. The post doesn’t involve racism, p-values, or regression discontinuity, so it might not interest our readers so much!
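
To make Bob’s point about seeds concrete, here’s a toy sketch (mine, on simulated data, nothing to do with any particular clinical model): run the same pipeline on the same data under several random seeds and report the spread across runs rather than the single best one.

```python
# A toy illustration of seed sensitivity: same data, same model, ten seeds.
# Reporting only max(scores) hides the seed-to-seed variation Bob is talking about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = MLPClassifier(max_iter=1000, random_state=seed).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print(f"best score:   {max(scores):.3f}")  # what too often gets reported
print(f"across seeds: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")  # what should be
```

The gap between the best run and the average won’t be anywhere near 2-fold on a toy problem like this, but the reporting habit is the same one the article is describing.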

The Beam et al. article concludes:

Determining if machine learning improves patient outcomes remains the most important test, and currently there is scant evidence of downstream benefit. For this, there is likely no substitute for randomized clinical trials. In the meantime, as machine learning begins to influence more health care decisions, ensuring that the foundation on which these tools are built is sound becomes increasingly pressing. In a lesson that is continuously learned, machine learning does not absolve researchers from traditional statistical and reproducibility considerations but simply casts modern light on these historical challenges. At a minimum, a machine learning model should be reproduced, and ideally replicated, before it is deployed in a clinical setting.

I pretty much agree. And, as Bob notes, the above is not empty Mom-and-apple-pie advice, as in the real world we often do see people running iterative algorithms (whether machine learning or Bayesian or whatever) without checks and validation. So, yeah.
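
For a flavor of what “checks and validation” can look like for an iterative algorithm, here’s a bare-bones sketch (mine, not Bob’s or the article’s): run the algorithm several times from different starting points and compare the runs before trusting any one of them, in this case via the split-R-hat convergence diagnostic applied to a few simulated chains.

```python
# Bare-bones split-R-hat: compare several independent runs (chains) of an
# iterative sampler; values near 1.0 suggest the runs agree with each other.
import numpy as np

def split_rhat(chains):
    """chains: array of shape (n_chains, n_draws) for one scalar quantity."""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split each chain in half so within-chain trends show up as disagreement.
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    m, n = splits.shape
    B = n * splits.mean(axis=1).var(ddof=1)  # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
chains = rng.normal(size=(4, 1000))  # stand-in for 4 independent sampler runs
print(f"split-Rhat: {split_rhat(chains):.3f}")  # near 1.0 here; the fake chains agree
```

The same habit carries over to the machine-learning side: multiple runs, compared, with the disagreement reported rather than discarded.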

But one place where I will push back is their claim that, to get evidence of downstream benefit, “there is likely no substitute for randomized clinical trials.” Randomized clinical trials are great for what they are, but they have limitations in realism, timeliness, and sample size. There are other measures of downstream benefit that can be used. These other measures are imperfect, but, then again, inferences from randomized experiments are imperfect too.