Fitting big multilevel regressions in Stan?

Joe Hoover writes:

I am a social psychology PhD student, and I have some questions about applying MrP to estimation problems involving very large datasets or many sub-national units.

I use MrP to obtain sub-national estimates for low-level geographic units (e.g. counties) derived from large data (e.g. 300k-1 million+). In addition to being large, my data is bad, such that it’s generated via self-selection mechanisms like voluntary internet surveys.

In my applications of MrP, I usually have at least 288 post-strata cells (not counting sub-national units), which means that for county-level procedures, I am generating predictions for close to or more than 1 million post-strata. Also, due to the prevalent county-level sparsity in my data, I’ve found that including a CAR prior offers modest improvements for units with low ns.

Put together, this all winds up being a costly computational process. I have been working primarily in R-INLA, but even with INLA’s efficiency, I find myself having to do batch predictions for the post-strata, which requires estimating the same model multiple times or manually generating predictions outside of INLA via posterior draws.

Ultimately, I would prefer to do this in Stan, but the huge number of parameters required for the entire estimation procedure makes this impractical (for me, given my timeline, modest expectations for the precision of the estimates, and access to computational resources).

Accordingly, I am wondering if you might be able to offer any advice or insight to how I can more efficiently apply MrP in this kind of situation. More specifically, do you have any programmatic approach to conducting MrP in a way that reduces computational cost and complexity?

I’ve considered doing MrP for each of some set of larger sub-national units (e.g. regions or states) and then using estimates from these MrPs as inputs to a final model. However, I have not fully thought this through and I haven’t been able to find a precedent for it, so I am a little worried that there might be a known problem with that kind of approach. I have also thought about applying approach similar to the 8 schools model, where I estimate post-strata means and SEs and use these as input to a model, rather than individual-level observations (as in your recent paper with Professor Yajuan Si). However, with 1 million post-strata cells, this is still a large problem with many parameters.

Finally, I have also considered switching to a machine learning framework, but I am kind of partial to the control that Bayesian HLM gives me, particularly given the nature of my data.

So, again, I would love to know how you would approach this problem. Or, even if you could recommend a few relevant papers, that would be immensely helpful. I am quite familiar with the MrP literature at this point, but I have not seen any applications to datasets this large.

My reply: This sort of question comes up a lot, and it’s an area of active research. Here are a few ways to go:

1. You can indeed work with cell means; see this 2013 paper with Ghitza, or the method is described more clearly in this 2017 replication paper with Lei and Ghitza. At some point, though, the number of cells becomes too large for this to work.

2. Another approach is to set the hyperparameters in the model to fixed values, just pre-determining the amount of partial pooling that the model will do. With fixed hyperaparameters, this just becomes a regression problem with some pre-set regularization, and you can compute the estimated coefficients using simple optimization.

3. If you want to go full Stan (and we’d like to do that), you could do the computations using parallel computing: not just one processor per chain, but multiple processors per chain, using our new MPI (message processing interface) implementation. I’m not quite sure how it would work for your model, but I think something should be possible.

4. Going in more speculative direction, you have a lot of data so it should be possible to use approximate methods such as EP, VB, and GMO. I think that’s related to what you’re talking about with “switching to a machine learning framework”: using computational shortcuts to approximate a posterior distribution.

Unfortunately we don’t have anything plug-and-play yet. It would be good to have some case studies of big-data MRP that people could use as templates for their analyses.