Hey, you. Yeah, you! Stop what you’re doing RIGHT NOW and read this Stigler article on the history of robust statistics

I originally gave this post the title, “Stigler: The Changing History of Robustness,” but then I was afraid nobody would read it. In the current environment of Move Fast and Break Things, not so many people care about robustness. Also, the widespread use of robustness checks to paper over brittle conclusions has given robustness a bad name.

This 2010 article by Stigler is excellent. I came across it while doing reading for a research project, and then I got to see all these cool bits:

[In a paper from 1953, George] Box wrote of the “remarkable property of ‘robustness’ to non-normality which [tests for comparing means] possess,” a property that he found was not shared by tests comparing variances. He directed his fire particularly toward Bartlett’s test, which some had suggested as a preliminary step, to check the assumption of equal variances before performing an ANOVA test of means. He summarized the results this way:

To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!

After this dissection, Bartlett’s test, much like a frog in a high school biology laboratory, was never the same again.

You go, George! I then went and read Box (1953), and it's good stuff. It's a funny thing, though: reading it, I get the feeling that Box was constrained—boxed in, as it were—by having to work within a hypothesis-testing framework. It's all about testing for equality of means, testing for equality of variances, testing for normality—even though the ultimate purpose of these methods is not to test hypotheses that we know ahead of time are false, but to learn from data.

The idea of robustness is central to modern statistics, and it's all about the idea that we can use models even when their assumptions are not true—indeed, an important part of statistical theory is to develop models that work well under realistic violations of these assumptions. But, due to historical circumstances, Box was forced to develop some of those ideas within a more constricted theoretical framework.

Back to Stigler, who continues:

[In his 1960 article,] Tukey called attention to the fact that in estimating the scale parameter of a normal distribution, the sample standard deviation ceases to be more efficient than the mean deviation if you contaminate the distribution with as little as 8-tenths of a percent from a normal component with three times the standard deviation. This took most statisticians of that era as a surprise . . .

0.008—that's interesting. Also, this seems like an excellent homework assignment for a theoretical statistics class: ask the students to perform a simulation study evaluating the performance of these two estimators (and others) as a function of the size of the second component and its scale.
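Here's a minimal sketch of what such a simulation could look like, in Python. This is my code, not anything from Stigler or Tukey; the sample size, replication count, and the use of squared coefficients of variation to put the two scale estimators on a common footing are all choices I'm making:

```python
# Minimal simulation sketch (not from the post): compare the sample standard
# deviation and the mean absolute deviation as scale estimators under the
# contaminated normal (1 - eps) * N(0, 1) + eps * N(0, 3^2).
import numpy as np

rng = np.random.default_rng(0)

def relative_efficiency(eps, n=100, reps=20_000):
    """Efficiency of the mean deviation relative to the sample SD,
    measured by the ratio of squared coefficients of variation
    (a scale-free variance, so the two estimators are comparable)."""
    scales = np.where(rng.random((reps, n)) < eps, 3.0, 1.0)
    x = rng.normal(0.0, scales)
    s = x.std(axis=1, ddof=1)                                    # sample SD
    d = np.abs(x - x.mean(axis=1, keepdims=True)).mean(axis=1)   # mean deviation
    cv2_s = s.var() / s.mean() ** 2
    cv2_d = d.var() / d.mean() ** 2
    return cv2_s / cv2_d   # > 1 means the mean deviation wins

for eps in [0.0, 0.002, 0.008, 0.02, 0.05]:
    print(f"eps = {eps:.3f}: efficiency of mean deviation = {relative_efficiency(eps):.3f}")
```

At eps = 0 this should land near the familiar asymptotic efficiency of about 0.88 for the mean deviation, and the ratio should cross 1 somewhere around Tukey's 0.008, though with finite n and simulation noise the crossover is only approximate.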

Stigler continues:

Any history is a product of its time; it must necessarily take the present view of the subject and look back, as if to ask, how did we get here? My 1973 account was just such a history, and it took the 1972 world of robustness as a given. Huber had brought attention to Simon Newcomb and his use in the 1880s of scale mixtures of normal distributions as ways of representing heteroscedasticity; I enlarged and extended that to other works. I noted that Newcomb had used an early version of Tukey’s sensitivity function, itself a forerunner of Hampel’s influence curve. I reviewed a series of early works to cope with outliers, and I trumpeted my discovery of Percy Daniell’s 1920 presentation of optimal and efficient weighting functions for linear functions of order statistics, and of Laplace’s 1818 asymptotic theory for least deviation estimators. I found M estimates in 1844 and trimmed means (called “discard averages”) in 1920.

I wonder where he found those trimmed means? I only ask because sometimes I’ve found gems in old psychometrics articles. But maybe there’s nothing special about psychometrics; maybe just about any applied field has great statistical ideas in the old literature, if you just know where to look.

And then Stigler usefully rounds out his discussion:

None of this, I [Stigler] hasten to say, is recounted to undercut the striking originality of Tukey and Huber and Hampel—to the contrary. I mean it in the spirit of Alfred North Whitehead’s famous statement that, “Everything of importance has been said before by somebody who did not discover it”; that is, to provide historical context, where one might now see that, for example, it was not the M estimates that were new in 1964, it was what Huber proved about them that was revolutionary.

But also this:

[L]east squares will remain the tool of choice unless someone concocts a robust methodology that can perform the same magic, a step that would require the suspension of the laws of mathematics.

I don't think so! In the ten years since Stigler's article appeared, regularization has taken over. Instead of least squares, we do lasso, or regularized logistic regression, or deep learning. Even little rstanarm uses weak but proper priors by default. Lasso and Bayes and machine learning and modern computing have moved regularization toward default status. Sure, lots of people use least squares, and I'm sure they always will. But at this point I'd call it a legacy method more than a tool of choice. And no suspension of mathematical law was required.
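To make the contrast concrete, here's a minimal sketch comparing least squares and the lasso on a sparse regression problem. This is my own toy example, not anything from the post; the data-generating process and the penalty alpha=0.1 are arbitrary choices, and in practice you'd pick the penalty by cross-validation:

```python
# Toy comparison (my example): OLS fits all 50 coefficients, while the
# lasso's L1 penalty shrinks most of the irrelevant ones exactly to zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]          # only three predictors actually matter
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # alpha = regularization strength (arbitrary here)

print("nonzero OLS coefficients:  ", np.count_nonzero(np.abs(ols.coef_) > 1e-8))
print("nonzero lasso coefficients:", np.count_nonzero(np.abs(lasso.coef_) > 1e-8))
```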