Some things do not seem to spread easily – the role of simulation in statistical practice and perhaps theory.

Unlike Covid-19, some things don’t seem to spread easily, and the role of simulation in statistical practice (and perhaps theory) may well be one of them.

In a recent comment, Andrew provided a link to an interview about the new book Regression and Other Stories by Aki Vehtari, Andrew Gelman, and Jennifer Hill. The interview covered many aspects of the book, but the comments on the role of fake data simulation caught my interest the most.

In fact, I was surprised by the comments, in that the recommended role of simulation seemed much more substantial than I would have expected from participating on this blog. For at least the last 10 years I have been promoting the use of simulation in teaching and statistical practice with seemingly little uptake from other statisticians. For instance, my intro to Bayes seminar and some recent material here (downloadable HTML from Google Drive).

My sense was that those who eat, drink and dream in formulas see simulation as awkward and tedious. David Spiegelhalter actually said so in his thoughtfully communicated book The Art of Statistics – “[simulation is a] rather clumsy and brute-force way of carrying out statistical analysis”.  But Aki, Andrew and Jennifer seem to increasingly disagree.

For instance, at 29:30 in the interview there are about three minutes from Andrew arguing that all of statistical theory is a kind of shortcut to fake data simulation, and that you don’t need to know any statistical theory as long as you are willing to do fake data simulation on everything. However, it is hard work to do fake data simulation well [building a credible fake world and specifying how it is sampled from]. Soon after, Aki commented that it is only with fake data simulation that you have access to the truth in addition to the estimates from the data. That to me is the most important aspect – you know the truth.
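To make that concrete, here is a minimal Python sketch of what “having access to the truth” buys you. It is my own illustration, not an example from the book or the interview; the fake world is just a simple linear regression with true parameters I chose arbitrarily. Because we set the truth ourselves, we can check how often the conventional 95% interval for the slope actually covers it.

```python
import numpy as np

rng = np.random.default_rng(2020)

# The "fake world": a simple linear regression with known (true) parameters.
true_intercept, true_slope, true_sigma = 1.0, 2.0, 3.0
n = 100

def simulate_and_fit():
    """Draw one fake data set from the fake world and fit it by least squares."""
    x = rng.uniform(0, 10, size=n)
    y = true_intercept + true_slope * x + rng.normal(0, true_sigma, size=n)
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - 2))
    # Standard error of the slope estimate
    se_slope = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
    return beta_hat[1], se_slope

# Because we built the fake world, we know the truth and can check how often
# a conventional (normal-approximation) 95% interval for the slope covers it.
n_sims = 2000
covered = 0
for _ in range(n_sims):
    slope_hat, se = simulate_and_fit()
    covered += (slope_hat - 1.96 * se) <= true_slope <= (slope_hat + 1.96 * se)

print(f"Estimated coverage of the nominal 95% interval: {covered / n_sims:.3f}")
```

With real data we never get to run this check, because the true slope is unknown; in the fake world it takes a dozen lines.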
Also at 49:25 Jennifer disclosed that she recently changed her teaching to be based largely on fake data simulation, and is finding that having the students construct the fake world and understand how the analysis works there provides a better educational experience.
Now, in a short email exchange, Andrew did let me know that the role of simulation increased as they worked on the book, and Jennifer let me know that there are simulation exercises in the causal inference topics.
I think the vocabulary they and others have developed (fake data, fake world, Bayesian reference set generated by sampling from the prior, etc.) will help more people see why statistical theory is a kind of shortcut to simulation. I especially like this vocabulary and recently switched from fake universe to fake world in my own work.
However, when I initially tried using simulation in webinars and seminars, many did not seem to get it at all.

My sense was that, in today’s vocabulary, many simply did not realize that statistical thinking and modelling always takes place in fake worlds (mathematically) and is then transported to our reality to make sense of the data we have in hand, the data we are trying to learn from. They thought it was directly and literally about reality. That is, they were not trying to distinguish just learning what happened in the data (description) from what to make of it to guide action in the future (inference, prediction or causal understanding).
That is what we need fake worlds (abstractions) for – to discern this. Only in possible fake worlds can we see what would repeatedly happen. What happened in a data set is just a particular dead past, arid of insight into the future possibilities that, in statistical inference, we are primarily interested in. Christian Hennig makes related points here.
My own attempts to overcome that misconception have used metaphors: a shadow metaphor of seeing just shadows but needing to discern what cast them, and an analytical chemistry metaphor of spiking known amounts of a chemical into test tubes and seeing what noisy measurements repeatedly occur. The discerned distribution of measurements given a known amount is then transported to assess unknown amounts in real samples. However, in many problems in statistical inference, known amounts cannot be spiked, so we need fake worlds built with probability distributions to discern what would repeatedly be observed, given known truths.
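Here is a small Python sketch of the spiking metaphor, again my own illustration with a made-up measurement model (proportional bias plus Gaussian noise) and made-up numbers: simulate the distribution of noisy measurements at each known spiked amount, then transport those distributions to judge which known amount a new measurement of an unknown sample is most consistent with.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known spiked amounts (the "truths" we control).
known_amounts = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
n_reps = 5000

def measure(amount, size):
    """Noisy measurement of a given true amount (assumed measurement model)."""
    return 1.1 * amount + rng.normal(0.0, 2.0, size=size)

# Discern, by brute force, the distribution of measurements given each known amount.
simulated = {a: measure(a, n_reps) for a in known_amounts}

# "Transport" that to an unknown sample: given one new measurement, score each
# candidate amount by how typical the measurement is of its simulated distribution.
unknown_measurement = 11.8
scores = {}
for a, sims in simulated.items():
    # Crude simulation-based density estimate: fraction of simulated
    # measurements within +/- 0.5 of the observed value.
    scores[a] = np.mean(np.abs(sims - unknown_measurement) < 0.5)

best = max(scores, key=scores.get)
print("Scores by candidate amount:", {a: round(s, 3) for a, s in scores.items()})
print("Most consistent known amount:", best)
```

When we cannot physically spike the test tubes, the `measure` function is exactly what a fake world built from probability distributions replaces.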
Until about 2000, these had to be discerned mathematically, but over the last 10 years or so this can be done more and more conveniently using simulation. The relative advantage of the mathematical shortcut is shrinking and hence of decreasing value. I expect some pushback here.
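As one small example of the shortcut and the brute-force route agreeing (my own toy illustration): the textbook formula sigma/sqrt(n) for the standard error of a sample mean can be recovered by simply sampling the fake world many times.

```python
import numpy as np

rng = np.random.default_rng(7)

# Textbook shortcut: the standard error of the mean of n draws from a
# distribution with standard deviation sigma is sigma / sqrt(n).
sigma, n = 4.0, 25
analytic_se = sigma / np.sqrt(n)

# Brute force: repeatedly sample the fake world and look at what happens.
n_sims = 100_000
means = rng.normal(0.0, sigma, size=(n_sims, n)).mean(axis=1)
simulated_se = means.std()

print(f"Analytic shortcut: {analytic_se:.4f}")
print(f"Simulation:        {simulated_se:.4f}")
```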
Most understand that statistics is hard. That, of course, was statistics done with what Andrew called the mathematical shortcuts to fake data simulation. I think it would be a mistake to think that what was hard was just obtaining those shortcuts. The answers themselves are hard to fully make sense of.
I’ll close by speculating that in 10 years, statistical theory will be mostly about gaining a deep understanding of simulation as a profitable abstraction of counterfactually repeatable phenomena.