A normalizing flow by any other name

Another week, another nice survey paper from Google. This time:

What’s a normalizing flow?

A normalizing flow is a change of variables. Just like you learned way back in calculus and linear algebra.

Normalizing flows for sampling

Suppose you have a random variable $\Theta$ with a gnarly posterior density $p_{\Theta}(\theta)$ that makes it challenging to sample. It can sometimes be easier to sample a simpler variable $\Phi$ and come up with a smooth function $f$ such that $\Theta = f(\Phi)$. The implied distribution on $\Theta$ can be derived from the density of $\Phi$ and the appropriate Jacobian adjustment for change in volume,

$\displaystyle p_{\Theta}(\theta) = p_{\Phi}(f^{-1}(\theta)) \cdot \left|\, \textrm{det}\, \textrm{J}_{f^{-1}}(\theta) \,\right|,$

where $\textrm{J}_{f^{-1}}(\theta)$ is the Jacobian of the inverse transform evaluated at the parameter value. This is always possible in theory: a uniform distribution on the unit hypercube is a sufficient base for any multivariate distribution, with the function being the inverse cumulative distribution function.
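To make the change-of-variables formula concrete, here is a minimal one-dimensional sketch. The choices are mine, not from the survey: the base variable is standard normal and the transform is exp, so the implied density should match the textbook standard lognormal density.

```python
import math

def std_normal_pdf(x):
    """Density of the base variable Phi ~ Normal(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f(phi):
    """Forward transform Theta = f(Phi); here f = exp, so Theta > 0."""
    return math.exp(phi)

def f_inv(theta):
    """Inverse transform Phi = f^{-1}(Theta) = log(Theta)."""
    return math.log(theta)

def abs_det_jacobian_f_inv(theta):
    """|d/dtheta log(theta)| = 1 / theta; in 1-D the determinant is just the derivative."""
    return 1.0 / theta

def p_theta(theta):
    """Implied density: base density at f^{-1}(theta) times the Jacobian adjustment."""
    return std_normal_pdf(f_inv(theta)) * abs_det_jacobian_f_inv(theta)

def lognormal_pdf(theta):
    """Textbook standard lognormal density, written out directly for comparison."""
    return math.exp(-0.5 * math.log(theta) ** 2) / (theta * math.sqrt(2.0 * math.pi))
```

Evaluating `p_theta` and `lognormal_pdf` at the same positive value gives the same number up to floating point, which is the whole content of the formula in this toy case.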

Of course, we don’t know the inverse CDFs for our posteriors or we wouldn’t need to do sampling in the first place. The hope is that we can estimate an approximate but tractable normalizing flow, which when combined with a standard Metropolis accept/reject step will be better than working in the original geometry.
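One way to read "approximate flow plus accept/reject" is to run Metropolis in the base space, targeting the posterior pulled back through the flow. The sketch below does that under assumptions of my own: the "posterior" is an unnormalized Gamma(3, 1) density, the flow is a fixed exp of an affine map with made-up parameters `M` and `S` standing in for something fitted, and the sampler is plain random-walk Metropolis rather than HMC. The flow doesn't need to be exact, because the accept/reject step targets the right density either way.

```python
import math
import random

def log_target(theta):
    """Unnormalized log density of a Gamma(shape=3, rate=1) stand-in posterior."""
    return 2.0 * math.log(theta) - theta

# Hypothetical flow parameters; in practice these would be estimated.
M, S = math.log(3.0), 0.6

def flow(phi):
    """Approximate flow theta = f(phi) = exp(M + S * phi)."""
    return math.exp(M + S * phi)

def log_abs_det_jacobian(phi):
    """log |f'(phi)| = log(S) + M + S * phi for the flow above."""
    return math.log(S) + M + S * phi

def log_pullback(phi):
    """Log density on phi implied by the target on theta = f(phi)."""
    return log_target(flow(phi)) + log_abs_det_jacobian(phi)

def metropolis_in_base_space(n_draws, step=1.0, seed=0):
    """Random-walk Metropolis on phi; returns the draws mapped to theta."""
    rng = random.Random(seed)
    phi = 0.0
    lp = log_pullback(phi)
    draws = []
    for _ in range(n_draws):
        proposal = phi + step * rng.gauss(0.0, 1.0)
        lp_prop = log_pullback(proposal)
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            phi, lp = proposal, lp_prop
        draws.append(flow(phi))
    return draws
```

With a decent flow the pulled-back density is close to unit scale, so a step size near 1 works without tuning; the mean of the returned draws should land near 3, the mean of the Gamma(3, 1) target.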

Normalizing flows in Stan

Stan uses changes of variables, aka normalizing flows, in many ways.

First, Stan’s Hamiltonian Monte Carlo algorithm learns (aka estimates) a metric during warmup that is used to provide an affine transform, either just to scale (mean field metric, aka diagonal) or to scale and rotate (dense metric). If Ben Bales’s work pans out, we’ll also have low rank metric estimation soon.
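The difference between scaling and scaling plus rotating can be sketched in two dimensions. This is only an illustration of the affine idea, not Stan's adaptation code, which also regularizes the estimate and adapts in windows; the function names and the fake "warmup draws" are invented for the example.

```python
import math
import random

def fake_warmup_draws(n, seed=0):
    """Correlated, badly scaled 2-D draws standing in for warmup output."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        draws.append((10.0 * z1, 0.8 * z1 + 0.6 * z2))
    return draws

def sample_cov(draws):
    """Sample mean and covariance entries (sxx, sxy, syy)."""
    n = len(draws)
    mx = sum(x for x, _ in draws) / n
    my = sum(y for _, y in draws) / n
    sxx = sum((x - mx) ** 2 for x, _ in draws) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in draws) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in draws) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def whiten_diag(draws, mean, cov):
    """Diagonal metric: rescale each coordinate, leave correlation alone."""
    (mx, my), (sxx, _, syy) = mean, cov
    return [((x - mx) / math.sqrt(sxx), (y - my) / math.sqrt(syy))
            for x, y in draws]

def whiten_dense(draws, mean, cov):
    """Dense metric: solve L z = (draw - mean), where cov = L L^T (Cholesky)."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    out = []
    for x, y in draws:
        z1 = (x - mx) / l11
        z2 = ((y - my) - l21 * z1) / l22
        out.append((z1, z2))
    return out
```

After `whiten_diag` both coordinates have unit variance but the correlation of about 0.8 is still there; after `whiten_dense` the sample covariance is the identity, which is the geometry the sampler would like to see.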

Second, Stan’s constrained variables are implemented via changes of variables with efficient, differentiable Jacobians. Thank Ben Goodrich for all the hard ones: covariance matrices, correlation matrices, Cholesky factors of these, and unit vectors. TensorFlow Probability calls these transforms “bijectors.” These constrained-variable transforms allow Stan’s algorithms to work on unconstrained spaces. In the case of variational inference, Stan fits a multivariate normal approximation to the posterior, then samples from the multivariate normal and transforms the draws back to the constrained space to get an approximate sample from the model.
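The two easiest constrained-variable transforms can be sketched in a few lines: exp for a positive variable and inverse logit for a variable in (0, 1), each returning the log absolute Jacobian that gets added to the log density. This is a Python illustration with function names I made up, not Stan's C++ implementation, and the Exponential(1) density is just a convenient example.

```python
import math

def positive_transform(u):
    """Map unconstrained u to sigma = exp(u) > 0.

    Returns (sigma, log |d sigma / d u|); the log Jacobian is just u.
    """
    return math.exp(u), u

def unit_interval_transform(u):
    """Map unconstrained u to p = inv_logit(u) in (0, 1).

    Returns (p, log |d p / d u|), where d p / d u = p * (1 - p).
    """
    if u >= 0.0:
        p = 1.0 / (1.0 + math.exp(-u))
    else:
        e = math.exp(u)
        p = e / (1.0 + e)
    # log(p * (1 - p)) = -|u| - 2 * log(1 + exp(-|u|)), stable for large |u|.
    log_jac = -abs(u) - 2.0 * math.log1p(math.exp(-abs(u)))
    return p, log_jac

def log_exponential_density(sigma, rate=1.0):
    """Log density of Exponential(rate) on the constrained scale."""
    return math.log(rate) - rate * sigma

def unconstrained_log_density(u):
    """Log density on u: constrained log density plus log Jacobian."""
    sigma, log_jac = positive_transform(u)
    return log_exponential_density(sigma) + log_jac
```

The check that the Jacobian term is right is that `exp(unconstrained_log_density(u))` still integrates to one over the whole real line, so an algorithm can work on `u` with no boundary to worry about.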

Third, we widely recommend reparameterizations, such as the non-centered parameterization of hierarchical models. We used to call that specific transform the “Matt trick” until we realized it already had a name. The point of a reparameterization is to apply the appropriate normalizing flow to make the posterior closer to an isotropic Gaussian. Then there’s a deterministic transform back to the variables we really care about.
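As a sketch of the non-centered parameterization, the code below samples a unit-scale `theta_raw` and deterministically maps it back to `theta = mu + tau * theta_raw`. The names, the lognormal draw for `tau`, and fixing `mu` at zero are illustrative assumptions, not a recommended model.

```python
import math
import random

def normal_lpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

def to_centered(theta_raw, mu, tau):
    """Deterministic transform back to the parameter we care about."""
    return mu + tau * theta_raw

def draw_funnel(n, seed=0):
    """Draw (tau, theta_raw, theta) triples from a funnel-shaped prior."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        tau = math.exp(rng.gauss(0.0, 1.5))   # scale varies over orders of magnitude
        theta_raw = rng.gauss(0.0, 1.0)       # always unit scale, whatever tau is
        out.append((tau, theta_raw, to_centered(theta_raw, 0.0, tau)))
    return out
```

The two parameterizations describe the same distribution: `normal_lpdf(theta, mu, tau)` equals `normal_lpdf(theta_raw, 0.0, 1.0) - math.log(tau)`, with the `log(tau)` term being the Jacobian adjustment. The difference is that `theta_raw` has the same scale for every `tau`, while the scale of `theta` depends on `tau`, which is what produces the funnel.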

What’s next?

The real trick is automating the construction of these transforms. People hold out a lot of hope for neural networks or other non-parametric function fitters there. It remains to be seen whether anything practical will come out of this that we can use for Stan. I talked to Matt and crew at Google about their work on normalizing flows for HMC (which they call “neural transport”), but they told me it was too finicky to work as a black box in Stan.

Another related idea is Riemannian Hamiltonian Monte Carlo (RHMC), which uses a second-order Hessian-based approximation to normalize the posterior geometry. It’s just very expensive on a per-iteration basis because it requires a differentiable positive-definite conditioning phase involving an eigendecomposition.