It’s “a single arena-based heap allocation” . . . whatever that is!

After getting 80 zillion comments on that last post with all that political content, I wanted to share something that’s purely technical.

It’s something Bob Carpenter wrote in a conversation regarding implementing algorithms in Stan:

One thing we are doing is having the matrix library return more expression templates rather than copying on return as it does now. This is huge in that it avoids a lot of intermediate copies. Some of these don’t look so big when running a single chain, but they stand out more when running multiple chains in parallel, where there’s more overall memory pressure.
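
For readers who, like me, aren’t sure what an expression template is, here’s a minimal sketch of the idea in C++. It’s not Stan’s actual code, and the Vec and Sum types are made up for illustration; the point is just that the sum isn’t computed when it’s written, only when it’s assigned to a destination, so no intermediate vector is ever allocated for the subexpression.

```cpp
#include <cstddef>
#include <vector>

// Lazy node representing lhs[i] + rhs[i]; holds references and does no
// work until it is indexed.
template <typename L, typename R>
struct Sum {
  const L& lhs;
  const R& rhs;
  double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  std::size_t size() const { return lhs.size(); }
};

struct Vec {
  std::vector<double> data;
  explicit Vec(std::size_t n) : data(n) {}
  double operator[](std::size_t i) const { return data[i]; }
  double& operator[](std::size_t i) { return data[i]; }
  std::size_t size() const { return data.size(); }

  // Evaluation happens here, in a single pass, with no temporaries.
  template <typename L, typename R>
  Vec& operator=(const Sum<L, R>& e) {
    for (std::size_t i = 0; i < e.size(); ++i) data[i] = e[i];
    return *this;
  }
};

// Building an expression is free; a + b + c just nests two Sum nodes.
// (Unconstrained template for brevity; real libraries restrict the operands.)
template <typename L, typename R>
Sum<L, R> operator+(const L& a, const R& b) {
  return {a, b};
}

int main() {
  Vec a(3), b(3), c(3), out(3);
  out = a + b + c;  // one loop over the data, zero intermediate vectors
}
```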

Another current focus for fusing operations is the GPU so that we don’t need to move data on and off GPU between operations.
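
The “fusing” here means combining several operations into a single pass so intermediate results never have to be materialized, and, on the GPU, never have to be shipped between device and host memory between kernels. Here’s the flavor of it in plain CPU code; this is only an illustration of the general idea, not anything from Stan’s GPU back end.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: exponentiate, then scale. Two passes over the data and one
// temporary buffer for the intermediate result.
std::vector<double> exp_then_scale(const std::vector<double>& x, double c) {
  std::vector<double> tmp(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = std::exp(x[i]);
  std::vector<double> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = c * tmp[i];
  return out;
}

// Fused: one pass, no intermediate. On a GPU, the analogous fused kernel
// also spares the intermediate a round trip through host memory.
std::vector<double> exp_scale_fused(const std::vector<double>& x, double c) {
  std::vector<double> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = c * std::exp(x[i]);
  return out;
}
```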

Stan only does a single arena-based heap allocation other than for local vector and Eigen::Matrix objects (which are standard RAII). Actually, it’s more of an expanding thing, since it’s not pre-sized. But each bit only gets allocated once in exponentially increasing chunk sizes, so there’ll be at most log(N) chunks. It then reuses that heap memory across iterations and only frees at the end.
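
And here is a bare-bones sketch of the arena idea: memory is handed out by bumping a pointer within exponentially growing chunks, “recovered” between iterations without being returned to the operating system, and freed only at the very end. This is just an illustration of the concept, not Stan’s code, and it ignores alignment and allocation failure.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

class Arena {
  std::vector<char*> chunks_;        // every chunk ever allocated
  std::vector<std::size_t> sizes_;   // size of each chunk
  std::size_t cur_ = 0;              // index of the chunk being filled
  std::size_t used_ = 0;             // bytes used in the current chunk
  std::size_t next_size_ = 1 << 16;  // first chunk 64 KB, doubling afterwards

  void add_chunk(std::size_t min_size) {
    while (next_size_ < min_size) next_size_ *= 2;
    chunks_.push_back(static_cast<char*>(std::malloc(next_size_)));
    sizes_.push_back(next_size_);
    next_size_ *= 2;  // exponential growth, so at most ~log(N) chunks
  }

 public:
  // Hand out n bytes by bumping a pointer (alignment ignored for brevity).
  void* allocate(std::size_t n) {
    while (chunks_.empty() || used_ + n > sizes_[cur_]) {
      if (cur_ + 1 < chunks_.size()) {
        ++cur_;        // reuse a chunk kept around from a previous iteration
        used_ = 0;
      } else {
        add_chunk(n);  // no room left anywhere: grab a bigger chunk
        cur_ = chunks_.size() - 1;
        used_ = 0;
        break;
      }
    }
    void* p = chunks_[cur_] + used_;
    used_ += n;
    return p;
  }

  // Called between iterations: nothing is freed, the memory is just marked
  // reusable, so later iterations allocate without touching malloc again.
  void recover() {
    cur_ = 0;
    used_ = 0;
  }

  // Only at the very end does the memory go back to the system.
  ~Arena() {
    for (char* c : chunks_) std::free(c);
  }
};
```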

We’ve found it helps to use a standard functional map operation internally, one that partially evaluates reverse-mode autodiff for each block over which the function is mapped. This reduces overall memory size and keeps the partial evaluations more memory local, both of which speed things up at the expense of clunkier Stan code.
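
In sketch form, the map trick looks like this: the data are split into blocks, the mapped function is evaluated block by block, and each block’s gradient contribution is computed and accumulated right away, so a block’s intermediate autodiff state can be thrown away before the next block is touched. The code below only shows that shape; the names are made up, and the fn_val_grad argument stands in for the nested reverse-mode autodiff pass that a real implementation would run over each block.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct BlockResult {
  double value;              // this block's contribution to the target
  std::vector<double> grad;  // d(value)/d(theta) for this block
};

// fn_val_grad(theta, block) returns one block's value and its gradient with
// respect to theta; in an autodiff library this would be a nested reverse
// sweep whose tape is discarded as soon as the block's gradient is extracted.
double mapped_value_and_grad(
    const std::vector<double>& theta,
    const std::vector<std::vector<double>>& blocks,
    const std::function<BlockResult(const std::vector<double>&,
                                    const std::vector<double>&)>& fn_val_grad,
    std::vector<double>& grad_out) {
  grad_out.assign(theta.size(), 0.0);
  double total = 0.0;
  for (const auto& block : blocks) {
    BlockResult r = fn_val_grad(theta, block);  // partial evaluation here
    total += r.value;
    for (std::size_t i = 0; i < theta.size(); ++i) grad_out[i] += r.grad[i];
    // r (and any tape behind it) goes out of scope before the next block,
    // so peak memory is one block's worth rather than the whole dataset's.
  }
  return total;
}
```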

The other big thing we’re doing now is looking at static matrix types, of the sort used by Autograd and JAX. Stan lets you assign into a matrix, which destroys any hope of memory locality. If matrices never have their entries modified after creation (for example, when they’re created by a comprehension or other operation), then the values and adjoints can be kept as separate double matrices. Our early experiments are showing a 5–50-fold speedup, depending on how the cost of the copying that gets eliminated compares to the cost of the operations themselves. Addition benefits a lot, since it currently incurs O(N^2) copies for only O(N^2) operations (on N x N matrices). Multiplication, with an O(N^2) copy cost but an O(N^3) operation cost, is less optimizable when matrices get big.
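
The “separate double matrices” point is about memory layout. Here’s a sketch of the two layouts; the type names are made up (only Eigen is real), and the Var struct is a simplification of what a reverse-mode scalar actually holds.

```cpp
#include <Eigen/Dense>

// Simplified stand-in for a reverse-mode autodiff scalar. (Real
// implementations usually hold a pointer into tape or arena storage, which
// scatters the values across memory even more than this does.)
struct Var {
  double val;
  double adj;
};

// A matrix you can assign into entry by entry: every entry is its own Var,
// so values are interleaved with adjoints and a pass over the values alone
// is not memory-local.
using DynamicMatrixVar = Eigen::Matrix<Var, Eigen::Dynamic, Eigen::Dynamic>;

// A "static" matrix in the sense above: its entries are never modified after
// creation, so one contiguous matrix of values plus one contiguous matrix of
// adjoints is enough, and double-only kernels (BLAS, GPU) can use them
// directly without copying entry by entry.
struct StaticMatrixVar {
  Eigen::MatrixXd val;
  Eigen::MatrixXd adj;
  explicit StaticMatrixVar(const Eigen::MatrixXd& v)
      : val(v), adj(Eigen::MatrixXd::Zero(v.rows(), v.cols())) {}
};
```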

I love this because I don’t understand half of what Bob’s talking about, but I know it’s important. To make decisions under uncertainty, we want to fit hierarchical models, and that can get computationally expensive. In short: technical progress on computing allows us to fit better models, so we can learn more.

Recall the problems with this study, which could’ve been avoided by a Bayesian analysis of the uncertainty in specificity and sensitivity, along with multilevel regression and poststratification to adjust for differences between the sample and the population. In that particular example, no new computational developments are required: Stan will work just fine as it is, and to the extent that Stan improvements would help, they would be improvements in documentation and interfaces. But, moving forward, computational improvements will help us fit bigger and better models. This stuff can make a real difference in our lives, so I wanted to highlight just how technical it can all get.