A neuroscience graduate student named James writes in with a question regarding validating Bayesian model comparison using synthetic data:

I [James] perform an experiment and collect real data. I want to determine which of 2 candidate models best accounts for the data. I perform (approximate) Bayesian model comparison (e.g., using BIC – not ideal I know, but hopefully we can suspend our disbelief about this metric) and select the best model accordingly.

I have been told that we can’t entirely trust the results of this model comparison because 1) we make approximations when performing inference (exact inference is intractable) and 2) there may be code bugs. It has been recommended that I should validate this model selection process by applying it on synthetic data generated from the 2 candidate models; the rationale is that if the true model is recovered in each case I can rely on the results of model comparison on the real data.

My question is: is this model recovery process something a Bayesian would do/do you think it is necessary? I am wondering if it is not appropriate because it is conditional on the data having been generated from one of the 2 candidate models, both of which are presumably wrong; the true data that we collected in the experiment was presumably generated from a different model/process. I am wondering if it is sufficient to perform model comparison on the real data (without using recovery for validation) due to the likelihood principle – model comparison will tell us under which model the data is most probable.

I would love to hear your views on the appropriateness of model recovery and whether it is something a Bayesian would do (also taking practical considerations into account such as bugs/approximations).

I replied that my quick recommendation is to compare the models using leave-one-out cross validation as discussed in this article from a few years ago.

James responded with some further questions:

1. In our experiment, we assume that each participant performs Bayesian inference (in a Bayesian non-parametric switching state-space model), however we fit the (hyper)parameters of the model using maximum likelihood estimation given the behaviour of each participant. Hence, we obtain point estimates of the model parameters. Therefore, I don’t think the methods in the paper you sent are applicable as they require access to the posterior over parameters? We currently perform model comparison using metrics such as AIC/BIC, which can be computed even though we fit parameters using maximum likelihood estimation.

2. Can I ask why you are suggesting a method that assesses out-of-sample predictive accuracy? My (perhaps naive) understanding is that we want to determine which model among a set of candidate models best explains the data we have from our current experiment, not data that we could obtain in a future experiment. Or would you argue that we always want to use our model in the future so we really care about predictive accuracy?

3. The model recovery process I mentioned has been advocated for purely practical reasons as far as I can tell (e.g., the optimiser used to fit the models could be lousy, approximations are typically made to marginal likelihoods/predictive accuracies, there could be bugs in one’s code). So even if I performed model comparison using PSIS-LOO as you suggest, I could imagine that one could still advocate doing model recovery to check that the result of model comparison based on PSIS-LOO is reliable and can be trusted. The assumption of model recovery is that you really should be able to recover the true model when it is in the set of models you are comparing – if you can’t recover the true model with reasonable accuracy, then you can’t trust the results of your model comparison on real data. Do you have any thoughts on this?

My brief replies:

1a. No need to perform maximum likelihood estimation for each participant. Just do full Bayes: that should give you better inferences. And, if it doesn’t, it should reveal problems with your model, and you’ll want to know that anyway.

1b. Don’t do AIC and definitely don’t do BIC. I say this for reasons discussed in the above-linked paper and also this paper, Understanding predictive information criteria for Bayesian models.

2. You always care about out-of-sample predictive accuracy. The reason for fitting the model is that it might be applicable in the future. As described in the above-linked papers, AIC can be understood as an estimate of out-of-sample predictive accuracy. If you really only cared about within-sample prediction, you wouldn’t be using AIC at all; you’d just do least squares or maximum likelihood and never look back. The very fact that you were thinking about using AIC tells me that you care about out-of-sample predictive accuracy. And then you might as well do LOO and cut out the middleman.

3. Sure, yes, I’m a big fan of checking your fitting and computing process using fake-data simulation. So go for it!
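
To make the fake-data check concrete, here is a minimal sketch of a model recovery loop in Python. Everything in it is an assumption chosen for illustration: the two candidate models are a hypothetical linear and quadratic regression (not James's switching state-space model), the parameter values are made up, and the helper names (bic_polynomial, simulate, model_recovery) are invented for this example. It uses BIC only because that is the criterion in James's current pipeline; the same loop would work with any other comparison score, such as a LOO estimate from a full Bayesian fit.

```python
import numpy as np


def bic_polynomial(x, y, degree):
    """Fit a polynomial regression by maximum likelihood and return its BIC."""
    n = len(y)
    X = np.vander(x, degree + 1)  # columns: x^degree, ..., x^0
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = np.mean(resid ** 2)  # ML estimate of the noise variance
    # Gaussian log-likelihood evaluated at the ML estimates
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2  # polynomial coefficients plus the noise variance
    return k * np.log(n) - 2 * loglik


def simulate(rng, x, model):
    """Draw fake data from one of the two hypothetical candidate models."""
    if model == "linear":
        mean = 1.0 + 0.5 * x
    else:  # "quadratic"
        mean = 1.0 + 0.5 * x + 0.8 * x ** 2
    return mean + rng.normal(scale=1.0, size=x.shape)


def model_recovery(n_sims=200, n_obs=100, seed=0):
    """Return a 2x2 table: rows = true model, columns = model picked by BIC."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-2, 2, n_obs)
    models = ["linear", "quadratic"]
    degrees = {"linear": 1, "quadratic": 2}
    confusion = np.zeros((2, 2), dtype=int)
    for i, true_model in enumerate(models):
        for _ in range(n_sims):
            y = simulate(rng, x, true_model)
            bics = [bic_polynomial(x, y, degrees[m]) for m in models]
            confusion[i, int(np.argmin(bics))] += 1  # lower BIC wins
    return confusion


if __name__ == "__main__":
    print(model_recovery())
```

If the fitting and comparison code is working, most of the counts should land on the diagonal of the table. Lots of off-diagonal counts could point to a bug, a poor optimizer, or simply two models that are hard to tell apart at this sample size. To be more useful for a real analysis, the fake data should be simulated with parameter values and sample sizes close to those of the actual experiment.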