What can be our goals, and what is too much to hope for, regarding robust statistical procedures?

Gael Varoquaux writes:

Even for science and medical applications, I am becoming weary of fine statistical modeling efforts, and believe that we should standardize on a handful of powerful and robust methods.

First, analytic variability is a killer, e.g. in “standard” analyses for brain mapping, for machine learning in brain imaging, or more generally in “hypothesis-driven” statistical testing.

We need weakly-parametric models that can fit data as raw as possible, without relying on non-testable assumptions.

Machine learning provides these, and tree-based models need little in the way of data transformation.

We need non-parametric model selection and testing that do not break if the model is wrong.

Cross-validation and permutation importance provide these, once we have chosen the input (exogenous) and output (endogenous) variables.
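A minimal sketch of such a pipeline, using scikit-learn’s cross_val_score and permutation_importance on simulated data (the estimator and the dataset here are purely illustrative choices, not anything from the original discussion):

```python
# Illustrative sketch: cross-validated performance plus permutation
# importance as non-parametric model assessment (scikit-learn,
# simulated data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

# Simulated stand-in for a real cohort: X are the inputs, y the output.
X, y = make_regression(n_samples=2000, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Predictive performance estimated by 5-fold cross-validation:
# no likelihood, no distributional assumption on the residuals.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f} +/- {scores.std():.2f}")

# Permutation importance on held-out data: how much does shuffling
# each input column degrade the predictions?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print("most important inputs:", np.argsort(imp.importances_mean)[::-1][:5])
```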

If there are fewer than a thousand data points, all but the simplest statistical questions can and will be gamed (sometimes unconsciously), partly for lack of model selection. Here’s an example in neuroimaging.

I [Varoquaux] no longer trust such endeavors, including mine.

For thousands of data points and moderate dimensionality (99% of cases), gradient-boosted trees provide the necessary regression model.

They are robust to the data distribution and support missing values (even outside missing-at-random (MAR) settings).
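A minimal sketch of this, assuming scikit-learn’s HistGradientBoostingRegressor as the gradient-boosted tree implementation (the simulated data and the random missingness pattern below are illustrative only):

```python
# Illustrative sketch: HistGradientBoostingRegressor accepts NaN in the
# inputs directly, so no imputation step is needed (simulated data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=5000, n_features=15, noise=5.0, random_state=0)

# Knock out 20% of the entries to mimic missing data.
X[rng.random(X.shape) < 0.2] = np.nan

model = HistGradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2 with 20% missing entries: {scores.mean():.2f}")
```

At training time the model learns, for each split, which branch samples with missing values should be routed to, which is why it can exploit informative missingness rather than leaning on the MAR assumption that imputation-based pipelines typically require.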

For thousands of data points and large dimensionality, linear models (ridge) are needed.

But applying them without thousands of data points (as I tried to do for many years) is hazardous. Get more data, or change the question (e.g., analyze across cohorts).
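As an illustration of the high-dimensional case, here is a sketch of a ridge fit with the penalty chosen by internal cross-validation (scikit-learn’s RidgeCV on simulated data; the problem sizes are made up for the example):

```python
# Illustrative sketch: high-dimensional regression with ridge, the
# penalty strength chosen by internal cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# n = 2000 observations, p = 5000 features, only 50 of them informative.
X, y = make_regression(n_samples=2000, n_features=5000, n_informative=50,
                       noise=10.0, random_state=0)

# Standardize the inputs, then let RidgeCV pick the penalty over a grid.
model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13)))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f}")
```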

Most questions are not about “prediction”. But machine learning is about estimating functions that approximate conditional expectations / probabilities. We need to get better at integrating it into our scientific inference pipelines.

My reply:

There are problems where automatic methods will work well, and problems where they don’t work so well. For example, logistic regression is great, but you wouldn’t want to use logistic regression to model Pr(correct answer) given ability, for a multiple-choice test question where you have a 1/4 chance of getting the correct answer just by guessing. Here it would make more sense to use a model such as Pr(y=1) = 0.25 + 0.75*invlogit(a + bx). Of course you could generalize and then say, perhaps correctly, that nobody should ever do logistic regression; we should always fit the model Pr(y=1) = delta_1 + (1 - delta_1 - delta_2)*invlogit(a + bx). The trouble is that we don’t usually fit such models!
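To make that concrete, here is a minimal sketch of fitting the guessing-adjusted model Pr(y=1) = 0.25 + 0.75*invlogit(a + bx) by maximum likelihood on simulated data (the optimizer choice and the “true” parameter values are illustrative assumptions, not anything from the post):

```python
# Illustrative sketch: maximum likelihood for the guessing-adjusted
# logistic model Pr(y=1) = 0.25 + 0.75 * invlogit(a + b*x),
# appropriate for a 4-option multiple-choice item.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # expit is the inverse logit

rng = np.random.default_rng(0)

# Simulate abilities x and correct/incorrect responses y from the model.
a_true, b_true = -1.0, 1.5  # made-up "true" values for the simulation
x = rng.normal(size=2000)
p = 0.25 + 0.75 * expit(a_true + b_true * x)
y = rng.binomial(1, p)

def neg_log_lik(theta):
    a, b = theta
    p = 0.25 + 0.75 * expit(a + b * x)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=np.array([0.0, 1.0]), method="BFGS")
print("estimated (a, b):", fit.x)  # should land near (-1.0, 1.5)
```

Roughly speaking, an ordinary logistic regression fit to these same data would have to absorb the 25% guessing floor into its intercept and slope, attenuating the estimated discrimination; the adjusted model builds the floor in directly.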

So I guess the point is that we should keep pushing to make our models more general. What this often means in practice is that we should be regularizing our fits. One big reason we don’t always fit general models is that it’s hard to estimate a lot of parameters using least squares or maximum likelihood or whatever.

I agree with your statement that “we should standardize on a handful of powerful and robust methods.” Defaults are not only useful; they are also in practice necessary. This also suggests that we need default methods for assessing the performance of these methods (fit to existing data and predictive power on new data). If users are given only a handful of defaults, then these users—if they are serious about doing their science, engineering, policy analysis, etc.—will need to do lots of checking and evaluation.

I disagree with your statement that we can avoid “relying on non-testable assumptions.” It’s turtles all the way down, dude. Cross-validation is fine for what it is, but we’re almost always using models to extrapolate, not to just keep on replicating our corpus.

Finally, it’s great to have thousands, or millions, or zillions of data points. But in the meantime we need to learn and make decisions from the information we have.