“Machine Learning Under a Modern Optimization Lens” Under a Bayesian Lens

I (Yuling) read this new book Machine Learning Under a Modern Optimization Lens (by Dimitris Bertsimas and Jack Dunn) after I grabbed it from Andrew’s desk. Apparently machine learning is now such a wide-ranging area that we have to access it through some sub-manifold so as to evade the curse of dimensionality, and for the same reason I would like to discuss this comprehensive and clearly-structured book from a Bayesian perspective.

Regularization and robustness, and what it means for priors

The first part of the book focuses on the interpretation of regularization as robustness (Bertsimas and Copenhaver, 2017). In a linear regression with data (X, Y), consider a small perturbation within the uncertainty set \mathcal{U}(q,r) = \{\Delta \in \mathbb{R}^{n\times p} : \max_{\|\delta\|_q = 1} \|\Delta\delta\|_r \le \lambda\}. Then the l_q-regularized regression is precisely equivalent to the minimax robust problem:
\displaystyle \min_{\beta}\max_{\Delta\in \mathcal{U}(q,r)} \|y-(X+\Delta)\beta\|_{r} = \min_{\beta} \|y-X\beta\|_{r} + \lambda\|\beta\|_{q},
and such equivalence extends to other norms as well.
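As a sanity check of this equivalence in the (q, r) = (2, 2) case, here is a small numerical sketch (a hypothetical toy setup, not code from the book): the worst-case perturbation is rank one, aligned with the residual and with beta, and attaining it makes the robust loss coincide with the penalized loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

r = y - X @ beta                              # residual at this beta
u = r / np.linalg.norm(r)
v = beta / np.linalg.norm(beta)
Delta = -lam * np.outer(u, v)                 # worst-case rank-one Delta with induced norm lam

lhs = np.linalg.norm(y - (X + Delta) @ beta)  # robust worst-case loss at beta
rhs = np.linalg.norm(y - X @ beta) + lam * np.linalg.norm(beta)  # penalized loss
print(np.isclose(lhs, rhs))  # prints True
```

The perturbation pushes the residual further in its own direction by exactly \lambda\|\beta\|_2, which is why the two sides agree term by term.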

The discussion in this book is mostly about point estimation in linear regression. As a Bayesian, it is natural to ask whether we could also encode the robustness constraints into the prior for a general model. For example, can we establish something like the following (I suppress the obvious dependence on X):
\displaystyle \min_{p^{\mathrm{post}}} \max_{p^*: D(p^* \| p^{\mathrm{sample}}) < \epsilon} -\int_{\tilde y} \log\Big( \int_{\theta} p(\tilde y \mid \theta)\, p^{\mathrm{post}}(\theta)\, d\theta \Big)\, p^*(\tilde y \mid y)\, d\tilde y = -\int_{\tilde y} \log\Big( \int_{\theta} p(\tilde y \mid \theta)\, p(\theta \mid y)\, d\theta \Big)\, p^{\mathrm{sample}}(\tilde y \mid y)\, d\tilde y,
where p(\theta \mid y) \propto p(\theta) p(y \mid \theta) is the Bayesian posterior distribution under a certain prior that is potentially induced by this equation, and y_1, \dots, y_n are iid samples from p^{\mathrm{sample}}, which differs from the distribution of future samples \tilde y \sim p^*(\tilde y) by up to an \epsilon perturbation under some divergence D(\cdot \| \cdot).

In other words, we might wish to construct the prior in such a way that the model could still give minimax optimal prediction even if the sampling distribution is slightly corrupted.

To be clear, the equivalence between minimax (point) estimation and the least-favorable prior is well studied. Here, however, the minimax is over the out-of-sample risk of predicting the data, rather than the classic risk of parameter estimation, where the only bridge is through the likelihood.

It also reminds me of our PSIS leave-one-out (LOO) cross-validation. In particular, we know where the model behaves badly (e.g., points with large \hat k_i > \epsilon). It then makes sense, for example, to encode these bad points into the prior adaptively,
\displaystyle \log p^{\mathrm{prior}}(\theta) \gets \log p^{\mathrm{prior}}(\theta) + \gamma \sum_{i: \hat k_i > \epsilon} \log p(y_i \mid \theta),
and refit the model, either explicitly or through importance sampling. This is of course nothing but adversarial training in deep learning.
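A minimal sketch of this adaptive reweighting, with invented data and imagined diagnostics (the `khat` array, the threshold `eps`, and the weight `gamma` are all hypothetical stand-ins, not actual PSIS output):

```python
import numpy as np

# Toy normal model with unknown mean theta; two "hard" points at the right.
y = np.array([0.1, -0.3, 0.2, 4.5, 5.1])
khat = np.array([0.2, 0.3, 0.1, 0.9, 1.2])   # imagined PSIS k-hat diagnostics
eps, gamma = 0.7, 0.5                         # threshold and upweighting factor

def log_lik(theta, yi):
    # Gaussian log likelihood up to a constant, y_i | theta ~ Normal(theta, 1)
    return -0.5 * (yi - theta) ** 2

def adjusted_log_prior(theta, base_log_prior=lambda t: -0.5 * t ** 2):
    # Upweight the likelihood of the badly-predicted points inside the prior.
    bad = y[khat > eps]
    return base_log_prior(theta) + gamma * sum(log_lik(theta, yi) for yi in bad)
```

The adjustment pulls prior mass toward the region where the hard points live, which is exactly the adversarial-training flavor described above.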

Nevertheless, it is time to rethink what a prior is aimed at:

  1. In terms of prior predictive checks, we implicitly ask for a large or even the largest (in empirical Bayes) marginal likelihood \int p(y \mid \theta) p(\theta)\, d\theta.
  2. But a prior is also just regularization, and regularization is just robustness, by the previous argument. It is not because we or anyone else has enough reason to believe the regression coefficients perfectly follow a Laplace distribution that we use the lasso; it is rather because we want our model to be more robust under some l_1 perturbation of the data. Adversarial training weighs “bad” data points more heavily and therefore deliberately decreases the prior predictive power, in exchange for robustness.
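To make point 1 concrete, here is a small Monte Carlo sketch of the marginal likelihood for a toy normal model (the data, the prior family, and the prior scales compared are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=20)   # toy data centered near 1

def log_marginal(prior_sd, draws=20000):
    # Monte Carlo estimate of log int p(y|theta) p(theta) dtheta,
    # with theta ~ Normal(0, prior_sd^2) and y_i | theta ~ Normal(theta, 1).
    theta = rng.normal(0.0, prior_sd, size=draws)
    loglik = -0.5 * np.sum((y[None, :] - theta[:, None]) ** 2, axis=1)
    m = loglik.max()
    return m + np.log(np.mean(np.exp(loglik - m)))   # stable log-mean-exp
```

Comparing, say, `log_marginal(1.0)` against `log_marginal(100.0)` shows the absurdly diffuse prior losing marginal likelihood, which is the currency that the robustness-motivated regularization of point 2 deliberately trades away.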

Let me recall Andrew’s old blog post, What is the “true prior distribution”?:

(Consider) the prior for a single parameter in a model that is only being used once. For example, we’re doing an experiment to measure the speed of light in a vacuum, where the prior for the speed of light is the prior for the speed of light; there is no larger set of problems for which this is a single example. My short answer is: for a model that is only used once, there is no true prior.

Now, from the robustness standpoint, we might have a longer answer: it does not even matter whether \theta itself lives in any population. There is an optimal prior in the sense of giving the appropriate amount of regularization such that prediction from the model is robust under small noise, which is precisely defined by the minimax problem above (in case someone hates minimax, I wonder whether the average risk in lieu of the minimax risk is also valid).

To conclude, the robustness framework (in the data space) gives us a more interpretable way to explain how strong the regularization should be (in the parameter space), or equivalently how (weakly) informative the prior is supposed to be.

Trees, discrete optimization, and multimodality

A large share of the book is devoted to optimal classification and regression trees. The authors prove that a deep enough optimal classification tree can achieve the same prediction ability as a deep neural network, when the tree makes splits exactly according to the same network, while the tree has much better interpretability. (Evidently we could prove that a multilevel model in the same setting achieves prediction ability no worse than that deep net.)

At first glance, solving the high-dimensional discrete optimization problem that a tree ultimately requires might seem computationally prohibitive. Fortunately, the authors give a detailed introduction to the mixed-integer optimization algorithm they use, and it is shown to be both fast and scalable, although I cannot fit a discrete tree in Stan.

Nevertheless, some aspects of this discreteness may not be desirable. For instance, when the data are perfectly separable by multiple classification trees, whatever the objective function, the optimizer can only report one tree among all plausible answers. Ideally we would want to average over all the possibilities, and I suppose that can be approximated by a bootstrapped random forest, but even then we are effectively using no pooling among leaves. For example, if an online survey has very few samples in certain groups of the population, the best a classification tree can do is to group all of them into some nearby leaves, while the leaf to which they are assigned can only be decided with large variation.

To be fair, all methods have limitations, and it is certainly useful to include tree-based methods in the toolbox as an alternative to more black-box deep learning models.

We love decision theories, too

I remember at this year’s JSM, Xiao-Li Meng made a joke that even though we statisticians have to struggle to define our territory relative to AI and ML, it seems even more unfair for operations research: it is OR that created all the optimization tools upon which modern deep learning relies, but how many outsiders would directly link AI to OR?

The authors give an argument in Chapter 13: although OR pays less attention to data collection and prediction compared with ML, OR is predominantly focused on the process of making optimal decisions.

Of course we (Bayesian statisticians) love decision theories, too. Using the notation of this book, given a loss function c, a space of decisions z \in C, and observed data (x, y), we want to minimize the expected loss:
\displaystyle z^*(x) = \arg\min_{z \in C} \mathrm{E}[c(z, y) \mid x].

From a (parametric) Bayesian perspective, such a problem is easy to solve: given a parametric model p(y \mid x, \theta), we have the explicit form \mathrm{E}[c(z, y) \mid x_0] = \int c(z, y)\, p(y \mid x_0, \theta)\, p(\theta \mid y_{1:n}, x_{1:n})\, dy\, d\theta, which enables a straightforward minimization over z (possibly through stochastic approximation, since we can draw \theta from the posterior directly).
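As a sketch of that Monte Carlo minimization (everything here is a made-up toy: the posterior draws are simulated directly rather than coming from an actual fit):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in posterior draws of theta (in practice these come from MCMC),
# with predictive model y | theta ~ Normal(theta, 1) and loss c(z, y) = (z - y)^2.
theta_draws = rng.normal(loc=2.0, scale=0.5, size=4000)
y_draws = rng.normal(theta_draws, 1.0)       # posterior predictive draws of y

def expected_loss(z):
    return np.mean((z - y_draws) ** 2)       # Monte Carlo E[c(z, y) | x]

z_grid = np.linspace(-5, 5, 201)
z_star = z_grid[np.argmin([expected_loss(z) for z in z_grid])]
# under squared loss, z_star lands near the posterior predictive mean, about 2
```

For squared-error loss the argmin has a closed form (the predictive mean), but the grid search works unchanged for any loss c.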

The authors essentially consider a nonparametric model of Y \mid X in an RKHS. That is, given a kernel K(\cdot,\cdot) on the covariates X, we can rewrite the expected risk as a weighted average of sample risks:
\displaystyle \mathrm{E}[c(z, y) \mid x_0] \approx K(x_0, X)\, K(X, X)^{-1} c(z, Y).
Not surprisingly, we can also construct the kernel through trees.

Indeed, we can rewrite any multilevel model by defining an equivalent kernel K(\cdot,\cdot). So the kernel representation above amounts to a multilevel model with fixed group-level variance \tau (often a fixed hyperparameter in the kernel).
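A sketch of the kernel-weighting idea, using simple self-normalized Nadaraya-Watson weights rather than the K(x_0, X) K(X, X)^{-1} form above; the data, bandwidth, and cost function are all invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented toy data: scalar covariate X, outcome Y, cost c(z, y) = (z - y)^2.
X = rng.uniform(-2, 2, size=400)
Y = X ** 2 + rng.normal(scale=0.1, size=400)

def gaussian_kernel(a, b, h=0.3):
    return np.exp(-0.5 * ((a - b) / h) ** 2)

def expected_cost(z, x0):
    w = gaussian_kernel(X, x0)
    w = w / w.sum()                      # self-normalized sample weights
    return np.sum(w * (z - Y) ** 2)      # weighted average of sample risks

x0 = 1.0
z_grid = np.linspace(-1, 3, 401)
z_star = z_grid[np.argmin([expected_cost(z, x0) for z in z_grid])]
# z_star is the kernel-weighted mean of Y near x0, close to x0**2 = 1
```

The expected risk at x_0 is approximated entirely by reweighting the observed sample, which is exactly what makes the approach nonparametric.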

And we love causal inference, too

Chapter 14 of the book is on prescriptive trees. The problem is motivated by observational studies with multiple treatments, and the goal is to assign an optimal treatment to a new patient. Given observed outcomes y (say, blood sugar level), covariates X, and (potentially nonrandom) treatments Z, what remains to be optimized is the policy for future patients with covariate x. We denote the optimal assignment policy z^* = \tau(x) as a function of x.

It is a causal inference problem. If we have a parametric model, we (Bayesian statisticians) would directly write down
\displaystyle z^* = \tau(x_0) = \arg\min_z \mathrm{E}[y \mid x_0, z].
Under the usual unconfoundedness conditions we have \mathrm{E}[y \mid x_0, z] = \int y\, p(y \mid x_0, z)\, dy, and p(y \mid x_0, z) can be learned from the sampling distribution in observational studies by weighting, matching, or regression.

Returning to this book: effectively, the prescriptive trees model p(y \mid x, z) by substratification: y \mid x_0, z_0 is given the empirical distribution of all observed outcomes with treatment z = z_0 in the leaf node where x_0 lives.
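In code, the substratified estimate is just a per-treatment average within the leaf; the leaf data below are invented for illustration:

```python
import numpy as np

# Hypothetical leaf: outcomes and binary treatments of the units that fall
# into the same leaf node as the new patient x0.
leaf_y = np.array([7.1, 6.4, 8.0, 5.2, 5.9])
leaf_z = np.array([0,   0,   1,   1,   1  ])

def leaf_effect(z0):
    # Substratified estimate of E[y | x0, z = z0] within the leaf.
    return leaf_y[leaf_z == z0].mean()

# Pick the treatment minimizing the (e.g., blood sugar) outcome.
best_treatment = min([0, 1], key=leaf_effect)
```

This is exactly the empirical-distribution step described above, and it makes clear how sparse leaves translate into noisy treatment estimates.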

Making decisions in the space of all trees

The prescriptive trees go one step further by using the following loss function when training the regression tree:
\displaystyle \mathrm{E}[y \mid x, \tau(x)] + \sigma \sum_{i=1}^n (y_i - \hat y_i)^2,
where \hat y_i is the prediction of y_i using the same tree. (As a footnote, it is always better to use the leave-one-out error, but I suppose it is hard to cross-validate a tree due to its discrete nature.)

The objective function is interpreted as a weighted combination of “making an accurate prediction” (\sigma = \infty) and “making the optimal decision” (\sigma = 0). Evidently this is not the same as first fitting the tree by minimizing prediction error only and then doing causal inference using the post-fit tree and substratification.

I have conflicting feelings about this approach. On the one hand, it addresses model uncertainty directly. A parametric Bayesian might always tend to take the model as given and ignore all the other uncertainty in the forking paths, so it is recommended to work in a more general model space; the trees do have merit in intuitively manifesting model complexity.

On the other hand, to me a more principled way to deal with model uncertainty seems to be to consider
\displaystyle \mathrm{E}[y \mid x, \tau(x)] = \mathrm{E}\big[\mathrm{E}[y \mid x, \tau(x), M]\big],
where M is any given tree.

Further, under a normal approximation, the expectation can be expanded as
\displaystyle \mathrm{E}\big[\mathrm{E}[y \mid x, \tau(x), M]\big] = \sum_{k=1}^{\infty} \mathrm{E}[y \mid x, \tau(x), M_k] \exp\Big( -\tfrac{1}{2}\sum_{i=1}^n (y_{ik} - \hat y_{ik})^2 \Big) \Big/ \sum_{k=1}^{\infty} \exp\Big( -\tfrac{1}{2}\sum_{i=1}^n (y_{ik} - \hat y_{ik})^2 \Big),
as long as I am allowed to abuse the notation on infinite sums and self-normalization; each tree is weighted by its approximate marginal likelihood, so a larger in-sample error yields a smaller weight.

The linear mixture (the first expression in this section) can then be viewed as an approximation to this full-Bayes objective: it replaces the infinite sum by its single largest term, which is nearly exact if the weights of all other trees vanish quickly.
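A sketch of this self-normalized weighting with three hypothetical candidate trees (the per-tree values and errors are made up); the single-term approximation corresponds to keeping only the tree with the smallest error:

```python
import numpy as np

# Hypothetical per-tree predictions and training errors for three candidate trees.
tree_values = np.array([1.2, 0.9, 3.0])   # E[y | x, tau(x), M_k] under each tree
sse = np.array([4.0, 4.2, 25.0])          # sum of squared errors of each tree

log_w = -0.5 * sse                        # normal-approximation log weights
log_w -= log_w.max()                      # stabilize before exponentiating
w = np.exp(log_w) / np.exp(log_w).sum()   # self-normalized weights

full_bayes = np.sum(w * tree_values)      # model-averaged expectation
best_tree = tree_values[np.argmin(sse)]   # single-largest-term approximation
```

Here the first two trees have nearly equal errors, so the full-Bayes average sits between their values while the single-term approximation ignores the second tree entirely; when one tree dominates, the two answers coincide.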

To be clear, here different trees are compared only in the context of prediction. There is an analogy in terms of causal assumptions, where different causal models imply different conditional ignorability, which cannot be learned from the data in the first place.

More generally, there remains a lot to be done in decision theory in the presence of model uncertainty, and everything becomes even more complicated with extra considerations of an infinite model space, causality, and the looming concern that all models are still wrong even in this infinite model space. We have almost abandoned the posterior probability of models in model averaging, but what about decision making with a series of working models?

Overall, I enjoyed reading this book. It provides various novel insights for rethinking modern machine learning and a rich set of practical tools for fitting models in the real world. Most constructively, it inspires many interesting open problems to work on after stacking multiple lenses.