Fugitive and cloistered virtues

There’s a new revolution, a loud evolution that I saw. Born of confusion and quiet collusion of which mostly I’ve known. — Lana Del Rey

While an unexamined life seems like an excellent idea, an unexamined prior distribution probably isn’t. And, I mean, I write and talk and think and talk and write and talk and talk and talk and talk about this a lot. I am pretty much obsessed with the various ways in which we need to examine how our assumptions map to our priors (and vice versa).

But also I’m teaching a fairly low-level course on Bayesian stats (think right after a linear regression course) right now, which means I am constantly desperate for straightforward but illuminating questions that give some insight into what can and can’t go wrong with Bayesian models.

So I asked myself (and subsequently my students) to work out exactly what a prior that leads to a pre-determined posterior distribution looks like. Such a prior, which is necessarily data-dependent, is at the heart of some serious technical criticisms of Bayesian methods.

So this was my question: Consider the normal-normal model $y_i \mid \mu \sim N(\mu, 1)$, $i = 1, \ldots, n$. What prior $\mu \sim N(m, s^2)$ will ensure that the posterior expectation is equal to $\mu^*$?

The answer will probably not shock you. In the end, it’s any prior that satisfies $m = ns^2(\mu^* - \bar{y}) + \mu^*$. It’s pretty easy to see what this looks like asymptotically: if we fix $s$, then $m \rightarrow \mathcal{O}\left(\operatorname{sign}(\mu^* - \mu^{\text{true}})\, n\right)$, which is to say the mean gets HUGE.

Alternatively, we can set the prior as $\mu \sim N(2\mu^* - \bar{y},\, n^{-1})$, which is probably cleaner.
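A quick numerical sanity check of both rigged priors, using the standard conjugate normal-normal update (all the specific numbers here are placeholders I made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu_true, mu_star, s = 100, 0.0, 2.0, 1.5  # placeholder values
y = rng.normal(mu_true, 1.0, size=n)
ybar = y.mean()

def post_mean(m, s2):
    # Conjugate normal-normal posterior mean with known unit data variance
    return (m / s2 + n * ybar) / (1 / s2 + n)

# Prior 1: fix s and rig the mean via m = n s^2 (mu* - ybar) + mu*
m1 = n * s**2 * (mu_star - ybar) + mu_star
# Prior 2: mu ~ N(2 mu* - ybar, 1/n)
m2 = 2 * mu_star - ybar

print(post_mean(m1, s**2))   # both equal mu* = 2 up to floating point
print(post_mean(m2, 1 / n))
```

Either way, the posterior mean lands exactly on $\mu^*$ no matter what the data say.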

One thing that you notice if you simulate from this is that when the number of observations is small, these priors look kinda ordinary. It’s not until you get a large sample that either the mean buggers off to infinity or the variance squishes down to zero.

And that should be a bit of a concern. Because if you’ve got a model with a lot of parameters (or, heaven forfend, a non-parametric component like a Gaussian process or BART prior), there is likely to be very little data “per parameter”.

So it’s probably not enough to just look at the prior and say “well the mean and variance aren’t weird, so it’s probably safe”.

(BTW, we know this! This is just another version of the “concentration of measure” phenomenon that makes perfectly ok marginal priors turn into hideous monster joint priors. In that sense, priors are a lot like rat kings: rats are perfectly cute by themselves, but disgusting and terrifying when joined together.)

So what can you do to prevent people tricking you with priors? Follow my simple three step plan.

1. Be very wary of magic numbers. Ask why the prior standard deviation is 2.43! There may be good substantive reasons. But it also might just give the answer people wanted.
2. Simulate. Parameters. From. Your. Prior. Then. Use. Them. To. Simulate. Data.
3. Investigate the stability of your posterior by re-computing it on bootstrapped or subsetted data.
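Step 2 is what's usually called a prior predictive check. A minimal sketch for the normal-normal model above, with made-up hyperparameters standing in for whatever suspicious magic numbers you've been handed:

```python
import numpy as np

rng = np.random.default_rng(2)
m, s, n_obs, n_sims = 0.0, 2.43, 50, 1000  # placeholder hyperparameters

# Step 2: simulate parameters from the prior...
mu_draws = rng.normal(m, s, size=n_sims)
# ...then use them to simulate data from the likelihood
y_rep = rng.normal(mu_draws[:, None], 1.0, size=(n_sims, n_obs))

# Compare summaries of the fake data against what's substantively plausible
ybar_rep = y_rep.mean(axis=1)
print("90% prior predictive interval for ybar:",
      np.quantile(ybar_rep, [0.05, 0.95]).round(2))
```

If the simulated datasets look nothing like data you'd ever expect to see, the prior is doing something it shouldn't.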

The third thing is the one that is actually really hard to fool. But bootstraps can be complex when the data is complicated, so don’t just do iid splits. Make your subsampling scheme respect the structure of your data (and, ideally, the structure of your model). There is, as always, a fascinating literature on this.
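Here's the basic idea behind the stability check, in the simplest possible (iid) setting. The rigged prior is tuned to the full data's $\bar{y}$, so recomputing the posterior on half-samples exposes it; an honestly chosen prior would give roughly the same answer on every subset. (All the specific numbers are placeholders.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu_star = 200, 2.0  # placeholder values; true mean is 0
y = rng.normal(0.0, 1.0, size=n)

def post_mean(data, m, s2):
    # Conjugate normal-normal posterior mean, unit data variance
    return (m / s2 + data.size * data.mean()) / (1 / s2 + data.size)

# Rigged prior mean tuned to the *full* data (s = 1), targeting mu* = 2
m_rigged = n * (mu_star - y.mean()) + mu_star

full_post = post_mean(y, m_rigged, 1.0)
print("full data:", full_post)  # hits mu* = 2 by design
half_posts = [post_mean(rng.choice(y, size=n // 2, replace=False),
                        m_rigged, 1.0) for _ in range(3)]
print("half data:", half_posts)  # drifts well away from mu*
```

For real problems the subsampling has to respect the data's dependence structure, as the paragraph above says; iid half-samples are only appropriate for iid data like this toy example.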

Part of why we’re all obsessed with developing workflows—rather than just isolated, bespoke results—is that there’s no single procedure that can guard against bad luck or malicious intent. But if we expose more of the working parts of the statistical pipeline, there are fewer and fewer places to hide.

But also sometimes we just need a fast assignment for week 2 of the course.