I just wrote up a bunch of chapters for the Stan user’s guide on prior predictive checks, posterior predictive checks, cross-validation, decision analysis, poststratification (with the obligatory multilevel regression up front), and even bootstrap (which has a surprisingly elegant formulation in Stan now that we have RNGs in trnasformed data).
Andrew then urged me to look at figure 1 of Gelman, Meng, and Stern’s 1996 paper on posterior predictive assessement (I would’ve included a diagram if the journal didn’t own the copyright). I looked at figure 1 and got really confused about why the y and y_rep were on the same side of theta. Then Andrew said it was like a probabilistic program. Which I took to mean you could write in BUGS directed graphical modeling language. Below, I write down how I’d code prior and posterior predictive checks and cross-validation in a graphical modeling language.
Instead of arrows and boesLet me present all this in a unified way the way I think it’s clearest, which does away with arrows and boxes and just uses graphical modeling notation (like BUGS or Stan supports). Then Andrew can fill all of us in on what I’m missing
A simple regression example
Nothing interesting here, just a simple regression of a univariate outcome given a univariate predict. I’ll use explicit indexing to make it clear where there are multivariate quantities. I’m also just throwing down an arbitrary prior for completeness.
a ~ normal(0, 2) b ~ normal(0, 2) s ~ lognormal(0, 1) y[1:N] ~ normal(a + b * x[1:N], s)
The variables a, b, and s are parameters, whereas y and x are the observed outcomes and predictors. y and x will be known, and we’ll run something like Stan to get posterior draws
a, b, s ~ p(a, b, s | x, y)
Posterior predictive checks
To get posterior predictive draws, we add a single line to the graphical model,
a ~ normal(0, 2) b ~ normal(0, 2) s ~ lognormal(0, 1) y[1:N] ~ normal(a + b * x[1:N], s) y_rep[1:N] ~ normal(a + b * x[1:N], s)
Here y_rep is declared as a parameter in Stan, because it’s not observed. Also note that the same x values are used for both y and y_rep.
We observe y and x as before, but now get posterior draws for y_rep in addition to the regression parameters,
a, b, s, y_rep ~ p(a, b, s, y_rep | x, y)
We just throw away the draws for the parameters and get draws from the posterior predictive distribution
y_rep ~ p(y_rep | x, y)
Monte Carlo methods are so much easier than calculus.
Prior predictive checks
This one just drops the line with the data, but continues to use the same predictor vector x for the replications. The graphical model is
a ~ normal(0, 2) b ~ normal(0, 2) s ~ lognormal(0, 1) y_rep[1:N] ~ normal(a + b * x[1:N], s)
Our posterior draws in a system like Stan now look like
a, b, s, y_rep ~ p(a, b, s, y_rep | x)
and we throw away the parameters again to get prior predictive draws,
y_rep ~ p(y_rep | x)
Held-out evaluation and cross-validation
Suppose we divide our N data items up into a training set of size M and test set of size N – M. We’ll train on the training set then predict for the test set.
a ~ normal(0, 2) b ~ normal(0, 2) s ~ lognormal(0, 1) y[1:M] ~ normal(a + b * x[1:M], s) y[M + 1:N] ~ normal(a + b * x[M + 1:N], s)
We’ll provide y[1:M] and x[1:N] as data (that is, the training set of y and all of x). Then we get draws from the posterior predictive distribution:
a, b, s, y[M + 1:N] ~ p(a, b, s, y[M + 1:N] | x[1:N], y[1:M])
and we again just drop the parameters to get posterior predictive draws for evaluation
y[M + 1:N] ~ p(y[M + 1:N] | x[1:N], y[1:M])
For cross-validation, you just provide different slices. Or random slices. I show how to do that in the forthcoming user’s guide chapters. I also show how to use the generated quantities block to make the predictive draws pure Monte Carlo draws and also cut down on computation time compared to using MCMC. But that’s just an implementation efficiency detail.
What about Gelman, Meng, and Stern’s diagram?
I’m still confused. But now my confusion is more about why there are multiple y_rep. I only use one in my approach. Then we get simulations for it which characterize the relevant predictive distribution. Also, I don’t understand why there’s only one theta in the posterior predictive diagram (1a), whereas there are multiples in the prior predictive diagram. To me, the only difference is that the edge for y doesn’t show up in the prior predictive check. It’s something you have, but not something that’s in the model. I think what Gelman, Meng, and Stern are doing here is trying to include y in the prior predictive model. My guess is that Andrew’s going to say that they know y at the point the prior predictive check is performed and all data should be on the table. Unfortunately, that requires what’s essentially a BUGS-style cut in the graphical model where you don’t let information from y bleed into theta. The multiple theta is an attempt to render it without cut. At least that’s my guess. Let’s see what Andrew says. (I could just ask via email, but it’s more fun to do my homework in front of a live audience.)