More limitations of cross-validation and actionable recommendations

This post is by Aki.

Tuomas Sivula, Måns Magnusson, and I (Aki) have a new preprint that analyzes one of the limitations of cross-validation: Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison.

The normal distribution has been used to present the uncertainty in cross-validation for a single model and in model comparison at least since the 1980s (Breiman et al., 1984, Ch. 11). Surprisingly, there hasn’t been much theoretical analysis of the validity of the normal approximation (or it’s very well hidden).

We (Vehtari and Lampinen, 2002; Vehtari, Gelman and Gabry, 2017) have also recommended using normal approximation and our loo package reports elpd_loo SE and elpd_diff SE, but we have been cautious about making strong claims about their accuracy.
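To make the normal approximation concrete, here is a minimal numpy sketch of how such SEs can be computed from pointwise elpd values: the total elpd is the sum of the pointwise terms, and its SE is the pointwise standard deviation scaled by sqrt(N). The numbers and model names are synthetic stand-ins, not output from loo; for the comparison SE, the key point is that it is computed from the *paired* pointwise differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic pointwise elpd contributions for two hypothetical models A and B;
# in practice these would come from fitted models (e.g. pointwise LOO output).
elpd_a = rng.normal(-1.0, 0.7, size=n)
elpd_b = elpd_a + rng.normal(0.05, 0.1, size=n)  # B is slightly better

def elpd_sum_and_se(pointwise):
    """Total elpd and its normal-approximation SE:
    SE(sum) = sqrt(N) * sd(pointwise terms)."""
    pointwise = np.asarray(pointwise)
    return pointwise.sum(), np.sqrt(len(pointwise)) * pointwise.std(ddof=1)

elpd_loo_a, se_a = elpd_sum_and_se(elpd_a)
# elpd_diff SE uses the paired pointwise differences, not the two model SEs
elpd_diff, diff_se = elpd_sum_and_se(elpd_b - elpd_a)
```

Because the two models are evaluated on the same observations, the paired-difference SE here is much smaller than either model's own SE; this is why comparisons are summarized with elpd_diff SE rather than the separate elpd_loo SEs.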

Starting points for writing this paper were

  1. Shao (1993) showed that, in a (non-Bayesian) context of model selection with nested linear models where the true model is included, model selection based on LOO-CV with squared prediction error as the cost function is asymptotically inconsistent. In other words, if we compare models A and B, where B has an additional predictor whose true coefficient is zero, then the difference in predictive performance and the associated uncertainty have similar magnitude asymptotically.
  2. Bengio and Grandvalet (2004) showed that there is no generally unbiased estimator for the variance used in the normal approximation, and that the variance tends to be underestimated. This is due to the dependency between cross-validation folds, as each observation is used once for testing and K−1 times (where in the case of LOO, K = N) for training (or, in the Bayesian approach, for conditioning the posterior).
  3. Bengio and Grandvalet also demonstrated that in the case of small N or severe model misspecification the estimate tends to be worse. Varoquaux et al. (2017) and Varoquaux (2018) provide additional demonstrations that the variance is underestimated when N is small.
  4. The normal approximation is based on the central limit theorem, but in finite cases the distribution of the individual predictive utilities/losses can be very skewed, which can make the normal approximation badly calibrated. Vehtari and Lampinen (2002) proposed using the Bayesian bootstrap to take the skewness into account, but they did not provide a thorough analysis of whether it actually works.
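The Bayesian bootstrap mentioned in point 4 can be sketched in a few lines: instead of a single normal summary, draw Dirichlet(1, …, 1) weights over the observations and form weighted totals, which yields a whole (possibly skewed) distribution for the total elpd. The pointwise values below are synthetic and skewed on purpose, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Skewed synthetic pointwise utilities (e.g. a model with a few badly
# predicted observations); illustrative numbers, not real model output.
elpd_i = rng.gumbel(loc=-1.0, scale=0.5, size=n)

# Normal approximation: SE(total) = sqrt(N) * sd(pointwise terms)
normal_se = np.sqrt(n) * elpd_i.std(ddof=1)

# Bayesian bootstrap: draw uniform Dirichlet weights over observations and
# form weighted totals; the resulting draws retain the skewness of the data.
weights = rng.dirichlet(np.ones(n), size=4000)  # shape (4000, n)
bb_draws = n * weights @ elpd_i                 # shape (4000,)
```

The Bayesian bootstrap draws are centered near the pointwise sum with spread comparable to the normal SE, but their shape can be asymmetric; as discussed below, this alone turns out not to fix the calibration problem in model comparison.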

What we knew before we started writing this paper

  1. Shao’s inconsistency result is not that worrying, as asymptotically models A and B are indistinguishable: the posterior of the extra coefficient concentrates on zero and the predictions from the two models become indistinguishable. Shao’s result does, however, hint that if the relative variance (compared to the difference) doesn’t go to zero, then the central limit theorem does not kick in as usual and the distribution is not necessarily asymptotically normal. We wanted to learn more about this.
  2. Bengio and Grandvalet focused on variance estimators and didn’t consider skewness; also, when demonstrating with outliers, they did not look at the possible finite-case bias and asymptotic behavior. We wanted to learn more about this.
  3. We wanted to learn more about what counts as a small N in the case of well-specified models. Comparing the small-N case and the outlier case, we can think of the outliers as dominating the sum, so a case with a small number of outliers is similar to a small-N well-specified case, except that we can also get significant bias. We wanted to learn more about this.
  4. Vehtari and Lampinen proposed using the Bayesian bootstrap, but in later experiments there didn’t seem to be much benefit compared to the normal approximation. We wanted to learn more about this.

There are many papers discussing predictive performance estimates for single models, but it turned out that the uncertainty in model comparison behaves quite differently.

Thanks to hard work by Tuomas and Måns, we learned that the uncertainty estimates in model comparison can perform badly, namely when:

  1. the models make very similar predictions,
  2. the models are misspecified with outliers in the data, and
  3. the number of observations is small.

We also learned that the problematic skewness of the distribution of the error of the approximation occurs for models which make similar predictions, and it is possible that the skewness does not fade away as N grows. We show that accounting for the skewness of the sampling distribution is not sufficient to improve the uncertainty estimate, as it has only a weak connection to the skewness of the distribution of the estimator’s error. This explains why the Bayesian bootstrap can’t improve calibration much compared to the normal approximation.

On Twitter someone considered our results pessimistic, since we mention misspecified models and in real life we can assume that none of the models is the true data generating mechanism. By a misspecified model we mean the opposite of a well-specified model, which need not be the true data generating mechanism, and naturally the amount of misspecification matters. The discussion about well-specified and misspecified models holds for any modeling approach and is not unique to cross-validation. Bengio and Grandvalet used just the term outlier, but we wanted to emphasize that an outlier is not necessarily a property of the data generating mechanism, but rather something that is not well modeled by a given model.

We are happy that we now know better than ever before when we can trust CV uncertainty estimates. The consequences of the above points are

  1. The bad calibration when models are very similar makes LOO-CV less useful for separating very small effect sizes from zero effect sizes. When the models make similar predictions there is not much difference in predictive performance, and thus for making predictions it doesn’t matter which model we choose. The bad calibration of the uncertainty estimate then doesn’t matter, as the possible error is small anyway. Separating very small effect sizes from zero effect sizes is a very difficult problem in any case, and whatever approach is used probably needs very well specified and well identifiable models (e.g., posterior probabilities of models also suffer from overconfidence) and large N.
  2. Model misspecification in model comparison should be avoided by proper model checking and expansion before using LOO-CV. But this is something we should do anyway (and posterior probabilities of models also suffer from overconfidence in the case of model misspecification).
  3. Small differences in predictive performance cannot reliably be detected by LOO-CV if the number of observations is small. What is small? We write in the paper “small data (say less than 100 observations)”, but of course that is not a change-point in the behavior; the calibration improves gradually as N gets larger.

Cross-validation is often advocated for the M-open case, where we assume that none of the compared models represents the true data generating mechanism. Point 2 doesn’t invalidate the M-open case. If the model misspecification is bad and N is not very big, then the calibration in comparison gets worse, but cross-validation is still useful for detecting big differences; it is only when trying to detect small differences that we need well-behaving models. This is true for any modeling approach.

We don’t have the following in the paper, so you can consider this my personal opinion based on what we learned. Based on the paper we could add to the loo package documentation that

  1. If
    • the compared models are well specified
    • N is not too small (say > 100)
    • and elpd_diff > 4

    then elpd_diff SE is likely to be a good presentation of the related uncertainty.

  2. If
    • the compared models are well specified
    • N is not too small (say > 100)
    • and elpd_diff < 4

      then elpd_diff SE is not a good presentation of the related uncertainty, but the error is likely to be small. Stacking can provide additional insight, as it takes into account not only the average difference but also the shape of the predictive distributions, and a combination of models can perform better than a single model.

  3. If
    • the compared models are not well specified

      then elpd_diff and the related SE can still be useful, but you should improve your models anyway.

  4. If
    • N is small (say < 100)

      then proceed with caution and think harder, as with any statistical modeling approach in the case of small data (or get more observations). (There can’t be an exact rule for when N is too small to make inference, as sometimes just N = 1 can be sufficient to say that what was observed is possible, etc.)
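These rules of thumb could be sketched as a small helper function. This is hypothetical, not part of the loo package, and the thresholds 4 and 100 are the rough guidelines from above, not sharp cutoffs:

```python
def interpret_elpd_diff(elpd_diff, n, well_specified):
    """Hypothetical helper encoding the rough guidelines above.

    elpd_diff: LOO-CV elpd difference between two models
    n: number of observations
    well_specified: whether model checking suggests the models are OK
    """
    if not well_specified:
        # elpd_diff and its SE can still be useful, but fix the models first
        return "misspecified models: improve the models before trusting SE"
    if n < 100:
        # no sharp cutoff; calibration improves gradually with N
        return "small N: proceed with caution; SE may be badly calibrated"
    if abs(elpd_diff) > 4:
        return "the normal approximation (elpd_diff +/- SE) is likely reliable"
    # models predict similarly: SE is poorly calibrated but the error is small
    return "difference is small; SE unreliable but harmless; consider stacking"
```

The function returns an advisory string rather than a yes/no decision, matching the paper's preference for describing the error distribution over dichotomous testing.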

All this is supported by plenty of proofs and experiments (a 24-page article and a 64-page appendix), but naturally we can’t claim that these recommendations are bulletproof, and we are happy to see counterexamples.

In the paper we intentionally avoid dichotomous testing and focus on what we can say about the error distribution, as it carries more information than a yes/no answer.


We also have another new (much shorter, just 22 pages) paper, Unbiased estimator for the variance of the leave-one-out cross-validation estimator for a Bayesian normal model with fixed variance, showing that although there is no generally unbiased estimator (as shown by Bengio and Grandvalet, 2004), there can be an unbiased estimator for a specific model. Unbiasedness is not the goal in itself, but this paper shows that it may be possible to derive model-specific estimators with better calibration and smaller error than the naive estimator discussed above.