How good is the Bayes posterior for prediction really?

It might not be common practice on this blog to comment on a very recently arXiv-ed paper. But I have seen two copies of this paper, entitled “how good is the Bayes posterior in deep neural networks really”, left on the tray of the department printer during the past weekend, so I cannot underestimate the popularity of the work.

So, how good is the Bayes posterior in deep neural networks really, especially when it is inaccurate or bogus?

The paper argues that in a deep neural network, for prediction purposes, the full posterior yields worse accuracy and cross-entropy than point estimation procedures such as stochastic gradient descent (SGD). Even more strikingly, it claims that a quick remedy, however ad hoc it may look, is to reshape the posterior via a power transformation, which the authors call a “cold posterior”: a density proportional to (p(theta|y))^{1/T} for temperature T<1.  Effectively, the cold temperature concentrates the posterior density more tightly around the MAP. Based on their empirical evaluation, the authors claim that the cold posterior is superior to both the exact posterior and the point estimate in terms of predictive performance.
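To make the power transformation concrete, here is a minimal sketch on a one-dimensional toy density. Everything in it (the N(1, 0.5^2) "posterior", the grid, and the helper names temper_on_grid and grid_sd) is a made-up illustration and not anything from the paper. For a Gaussian, raising the density to the power 1/T simply multiplies the variance by T, so the standard deviation shrinks by sqrt(T).

```python
import numpy as np

def temper_on_grid(log_post, grid, T):
    """Normalized density proportional to exp(log_post / T) on a uniform grid."""
    log_p = log_post(grid) / T
    log_p = log_p - log_p.max()          # stabilize before exponentiating
    dens = np.exp(log_p)
    dx = grid[1] - grid[0]
    return dens / (dens.sum() * dx)      # Riemann-sum normalization

def grid_sd(dens, grid):
    """Standard deviation of a gridded density."""
    dx = grid[1] - grid[0]
    mean = np.sum(grid * dens) * dx
    return np.sqrt(np.sum((grid - mean) ** 2 * dens) * dx)

# Hypothetical toy "posterior": N(1, 0.5^2), unnormalized log density.
def log_post(theta):
    return -0.5 * ((theta - 1.0) / 0.5) ** 2

grid = np.linspace(-5.0, 7.0, 20001)
sd_by_T = {T: grid_sd(temper_on_grid(log_post, grid, T), grid)
           for T in (1.0, 0.25, 0.01)}

for T, sd in sd_by_T.items():
    print(f"T={T}: sd of tempered density is about {sd:.3f}")
```

As T goes to zero the tempered density collapses onto the mode, which is the sense in which a cold posterior interpolates between the full posterior (T=1) and the MAP.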

First of all, I should congratulate the authors on their new empirical results on Bayesian deep learning, a field that has been advocated for a few decades but has been neither sophisticatedly defended nor comprehensively applied, at least for modern-day deep models. Presumably the largest barrier is computation, as exact sampling from the posterior of a deep net is infeasible on today’s computers.

Indeed, even in this paper, it is hard for me to tell whether the undesirable performance of the “exact” Bayesian posterior prediction should be attributed to Thomas Bayes or to Paul Langevin: the authors use overdamped Langevin dynamics to sample from the posterior, but omit the Metropolis adjustment. They do report some diagnostics, but these are not exhaustive; a more obvious and arguably more powerful test is to run multiple chains and see if they have mixed. Later, in their section 4.4, the authors seem not to distinguish between the Monte Carlo error and the sampling error, and suggest that even the first term can be too large in their method, which makes me even more curious about how reliable the “posterior” is. Further, according to figure 4, there is an obvious discrepancy between the samples (over all temperatures) drawn from HMC and from the Langevin dynamics used here, even in a toy example with a network depth of 2 or 3. I don’t think HMC itself is necessarily the gold standard either:  in a complicated model such as ResNet, HMC suffers from multimodality and non-log-concavity, and would hardly mix. So are we just blaming the Bayesian posterior on the basis of some points drawn from a theoretically biased and practically hardly converged sampler?
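To illustrate the two concerns above (bias from dropping the Metropolis adjustment, and checking mixing across multiple chains), here is a toy sketch. It is not the sampler used in the paper: the target is a standard normal, the step size 0.5 is deliberately large, and the function names ula_chains and rhat are made up for this example. For this target, the unadjusted Langevin update is an AR(1) process whose stationary variance is 1/(1 - step/2) rather than 1, so the chains can mix perfectly well and still target the wrong distribution.

```python
import numpy as np

def ula_chains(grad_log_post, theta0, step, n_iter, rng):
    """Unadjusted Langevin: theta + step*grad + sqrt(2*step)*noise, no accept/reject."""
    theta = np.array(theta0, dtype=float)          # one scalar state per chain
    draws = np.empty((theta.size, n_iter))
    for i in range(n_iter):
        noise = rng.standard_normal(theta.size)
        theta = theta + step * grad_log_post(theta) + np.sqrt(2.0 * step) * noise
        draws[:, i] = theta
    return draws

def rhat(chains):
    """Basic (non-split) potential scale reduction for shape (n_chains, n_draws)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)        # between-chain variance
    return np.sqrt(((n - 1) / n * W + B / n) / W)

rng = np.random.default_rng(1)
step = 0.5                                          # assumed, deliberately coarse
draws = ula_chains(lambda th: -th,                  # grad log N(0, 1)
                   theta0=[-10.0, -3.0, 3.0, 10.0], # overdispersed starts
                   step=step, n_iter=20000, rng=rng)
kept = draws[:, 1000:]                              # drop warmup
pooled_var = kept.var()

print("R-hat:", rhat(kept))
print("pooled variance:", pooled_var, "ULA theory:", 1.0 / (1.0 - step / 2.0), "target: 1.0")
```

The point of the toy is that a clean R-hat only says the chains agree with each other, not that they agree with the posterior; in a deep net neither condition is easy to verify.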

In section 5, the authors conjecture that deep learning practice violates the likelihood principle because of computational techniques such as dropout. To me, the likelihood principle is not particularly relevant here. Rather, this is another reason why the posterior from the proposed sampler is computationally concerning.

To be fair, for the purpose of point estimation, even SGD is guaranteed neither to converge theoretically to the global optimum nor to approximate it well in practice, yet in most empirical studies it still yields reasonable predictions.  This reminds me of a relevant paragraph by Gelman and Robert (2013):

In any case, no serious scientist can be interested in bogus arguments (except, perhaps, as a teaching tool or as a way to understand how intelligent and well-informed people can make evident mistakes, as discussed in chapter 3 of Gelman et al. 2008). What is perhaps more interesting is the presumed association between Bayes and bogosity. We suspect that it is Bayesians’ openness to making assumptions that makes their work a particular target, along with (some) Bayesians’ intemperate rhetoric about optimality.

Sure, I guess it might also be unfair that we are kinda rewarding Bayes because it is more likely to produce fragile and bogus computational results. That said, the discussion here does not dismiss the value of this new paper, part of whose merit is to alert Bayesian deep learning researchers that many otherwise fine samplers may produce inaccurate or bogus posteriors in these deep models, and that all these computational errors should be taken into account when evaluating predictions.

But the intemperate rhetoric about optimality is real

The discussion below is not directly related to that paper. But an even more alarming and less understood question is: how good is the predictive performance of the Bayes posterior in a general model when it is computationally accurate?

As far as I know, Bayes procedures do not automatically improve prediction or calibration over a point estimate such as the MAP.

I collected a few paradoxical examples when I wrote our old paper on variational inference diagnostics. Without the need for a residual net, even in a linear regression I can find examples in which the exact Bayesian posterior leads to worse predictive performance than ADVI (reproducible code is available upon request). This is not a pathological edge case. The data are simulated from the correctly specified regression model with n=100 and d=20, which is exactly the kind of data one would simulate for linear regressions. The posterior is sampled using Stan and is exact as measured by all diagnostics. The predictive performance is evaluated using the log predictive density on independent test data, averaged over a large number of replications of both data and sampling to eliminate all other noise. But still, the log predictive density from ADVI is higher than that from the exact posterior.
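The evaluation described above can be sketched as follows. This is not the original experiment and makes no claim about which method wins. To keep it self-contained it assumes a known noise scale sigma=1 and a N(0, tau^2) prior with tau=2, so that the exact posterior is Gaussian in closed form, and it replaces ADVI with the closed-form mean-field Gaussian that minimizes KL(q||p) for a Gaussian target (same mean, variances equal to the inverse diagonal of the precision matrix). It also uses a single replication, whereas the comparison in the text averages over many.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 20, 1000
sigma, tau = 1.0, 2.0                      # assumed noise sd and prior sd

beta_true = rng.standard_normal(d)         # hypothetical true coefficients
X = rng.standard_normal((n, d))
X_test = rng.standard_normal((n_test, d))
y = X @ beta_true + sigma * rng.standard_normal(n)
y_test = X_test @ beta_true + sigma * rng.standard_normal(n_test)

# Exact conjugate posterior: N(mean, cov_exact).
precision = X.T @ X / sigma**2 + np.eye(d) / tau**2
cov_exact = np.linalg.inv(precision)
mean = cov_exact @ X.T @ y / sigma**2

# Mean-field stand-in for ADVI: same mean, underdispersed diagonal covariance.
cov_mf = np.diag(1.0 / np.diag(precision))

def mean_lpd(cov):
    """Average log predictive density on the test set under N(mean, cov)."""
    pred_mean = X_test @ mean
    pred_var = sigma**2 + np.einsum("ij,jk,ik->i", X_test, cov, X_test)
    log_dens = -0.5 * np.log(2 * np.pi * pred_var) \
               - 0.5 * (y_test - pred_mean) ** 2 / pred_var
    return log_dens.mean()

lpd_exact, lpd_mf = mean_lpd(cov_exact), mean_lpd(cov_mf)
print("mean test lpd, exact posterior:", lpd_exact)
print("mean test lpd, mean-field approximation:", lpd_mf)
```

Because the two predictive distributions share the same mean here, any difference in log predictive density comes purely from the narrower predictive variance of the approximation, which is the underdispersion effect discussed in the post.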

In this experiment, I also verified that ADVI has a large discrepancy from the exact posterior, as revealed by a large k-hat from our PSIS diagnostics. So basically it is something of a “cold posterior” in terms of underdispersion.

At the immediate level, the underdispersion from variational inference can serve as an implicit prior, which might give the model more regularization and therefore improve the prediction. For the record, I already put an N(0,2) prior on all regression coefficients in that example, with all inputs on unit scale. In general, however, many complicated models used in practice lack informative priors, and this could be the reason why stronger regularization, for instance via a “colder” posterior, could help. That said, it is more advisable to come up with a reasonable informative prior directly than to tune the “temperature”, for the sake of both interpretation and a coherent workflow.

Secondly, a model or regularization that is good for point estimation is not necessarily good for Bayesian posteriors. Recall that the Bayesian lasso gives inferior performance to the horseshoe, though the lasso is often what is needed for point estimation. This is also an example in which the regularization effect of the prior cannot be reduced to a temperature rescaling, as the horseshoe has both thicker tails and more mass near zero than the Laplace prior. In the deep learning context, are the network architecture and all the implicit regularizations such as dropout, which were motivated by, designed for, and in many cases optimally tuned towards MAP estimates, necessarily good enough for posteriors? We do not know.

Coincidentally,  Andrew and I recently wrote a paper on Holes in Bayesian Statistics:

It is a fundamental principle of Bayesian inference that statistical procedures do not apply universally; rather, they are optimal only when averaging over the prior distribution. This implies a proof-by-contradiction sort of logic … it should be possible to deduce properties of the prior based on understanding of the range of applicability of a method.

In short, it is not too surprising that the exact Bayesian posterior can give worse predictions than MAP or VI, if we are merely using a black-box model and treating it as it is. But,

This does not mean that we think Bayesian inference is a bad idea, but it does mean that there is a tension between Bayesian logic and Bayesian workflow which we believe can only be resolved by considering Bayesian logic as a tool, a way of revealing inevitable misfits and incoherences in our model assumptions, rather than as an end in itself.


P.S. (from Andrew): Yuling pointed me to the above post, and I just wanted to add that, yes, I do sometimes encounter problems where the posterior mode estimate makes more sense than the full posterior. See, for example, section 3.3 of Bayesian model-building by pure thought, from 1996, which is one of my favorite articles.

As Yuling says, the full Bayes posterior is the right answer if the model is correct—but the model isn’t ever correct. So it’s an interesting general question: when is the posterior dominated by a mode-based approximation? I don’t have a great answer to this one.

Another good point made by Yuling is that “the posterior” isn’t always so clearly defined, in that, in a multimodal setting, the computed posterior is not necessarily the same as the mathematical posterior from the model. Similarly, in a multimodal distribution, “the mode” isn’t so clearly defined either. We should finish our paper on stacking for multimodal posteriors.

Lots of good questions here, all of which are worth thinking about in an open-minded way, rather than as a Bayes-is-good / Bayes-is-bad battle. We’re trying to do our part here!