Model building is Lego, not Playmobil. (toward understanding statistical workflow)

John Seabrook writes:

Socrates . . . called writing “visible speech” . . . A more contemporary definition, developed by the linguist Linda Flower and the psychologist John Hayes, is “cognitive rhetoric”—thinking in words.

In 1981, Flower and Hayes devised a theoretical model for the brain as it is engaged in writing, which they called the cognitive-process theory. It has endured as the paradigm of literary composition for almost forty years. The previous, “stage model” theory had posited that there were three distinct stages involved in writing—planning, composing, and revising—and that a writer moved through each in order. To test that theory, the researchers asked people to speak aloud any stray thoughts that popped into their heads while they were in the composing phase, and recorded the hilariously chaotic results. They concluded that, far from being a stately progression through distinct stages, writing is a much messier situation, in which all three stages interact with one another simultaneously, loosely overseen by a mental entity that Flower and Hayes called “the monitor.” Insights derived from the work of composing continually undermine assumptions made in the planning part, requiring more research; the monitor is a kind of triage doctor in an emergency room.

This all makes sense to me. It reminds me of something I tell my students: “writing is non-algorithmic.” That isn’t literally true (everything is algorithmic, if you define “algorithm” broadly enough), but it is intended to capture the idea that when writing, we go back and forth between structure and detail.

Writing is not simply three sequential steps of planning, composing, and revising, but I still think that it’s useful when writing to consider these steps, and to think of Planning/Composing/Revising as a template. You don’t have to literally start with a plan—your starting point could be composing (writing a few words, or a few sentences, or a few paragraphs) or revising (working off something written by someone else, or something written earlier by you)—but at some point near the beginning of the project, an outline can be helpful. Plan with composition in mind, and then, when it’s time to compose, compose being mindful of your plan and also of your future revision process. (To understand the past, we must first know the future.)

But what I really wanted to talk about today is statistical analysis, not writing. My colleagues and I have been thinking a lot about workflow. On the first page of BDA (Bayesian Data Analysis), we discuss these three steps:
1. Model building.
2. Model fitting.
3. Model checking.
And then you go back to step 1.

That’s all fine, and it’s a starting point for workflow, but it’s not the whole story.

As we’ve discussed here and elsewhere, we don’t just fit a single model: workflow is about fitting multiple models. So there’s a lot more to workflow; it includes model building, model fitting, and model checking as dynamic processes in which each model is built, fit, and checked in relation to the others.

Here are some ways this happens:

– We don’t just build one model, we build a sequence of models. This fits into the way that statistical modeling is a language with a generative grammar. To use toy terminology, model building is Lego, not Playmobil: we assemble models from interchangeable pieces rather than taking them off the shelf as finished figures.

– When fitting a model, it can be helpful to use fits from other models as scaffolding. The simplest idea here is “warm start”: take the solution from a simple model as a starting point for new computation. More generally, we can use ideas such as importance sampling, probabilistic approximation, variational inference, expectation propagation, etc., to leverage solutions from simple models to help compute for more complicated models.

– Model checking is, again, relative to other models that interest us. Sometimes we talk about comparing model fit to raw data, but in many settings any “raw data” we see have already been mediated by some computation or model. So, more generally, we check models by comparing them to inferences from other, typically simpler, models.
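
To make the “warm start” idea from the second item above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the synthetic noise-free data, the helper names (mse, fit_line, gradient_descent), the step size, and the choice of plain gradient descent as the fitting algorithm. It fits a simple straight-line model in closed form and then uses that solution as the starting point for a larger model that adds a quadratic term.

```python
# Synthetic, noise-free data (an assumption for illustration only).
xs = [i / 10 for i in range(-10, 11)]
ys = [1.0 + 2.0 * x + 0.5 * x * x for x in xs]


def mse(params, xs, ys):
    """Mean squared error of the larger model y = a + b*x + c*x^2."""
    a, b, c = params
    return sum((a + b * x + c * x * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Closed-form least squares for the simple model y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def gradient_descent(start, xs, ys, lr=0.5, steps=2000):
    """Plain gradient descent on the mean squared error of the larger model."""
    a, b, c = start
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = grad_c = 0.0
        for x, y in zip(xs, ys):
            resid = a + b * x + c * x * x - y
            grad_a += 2.0 * resid / n
            grad_b += 2.0 * resid * x / n
            grad_c += 2.0 * resid * x * x / n
        a -= lr * grad_a
        b -= lr * grad_b
        c -= lr * grad_c
    return a, b, c


# Warm start: reuse the simple model's solution, with the new coefficient at 0.
a0, b0 = fit_line(xs, ys)
warm_start = (a0, b0, 0.0)
cold_start = (0.0, 0.0, 0.0)

warm_fit = gradient_descent(warm_start, xs, ys)
cold_fit = gradient_descent(cold_start, xs, ys)

print("loss at cold start:", mse(cold_start, xs, ys))
print("loss at warm start:", mse(warm_start, xs, ys))
print("warm-start fit:", warm_fit)
```

Because the straight line is nested inside the quadratic model, the warm start is guaranteed to begin at a loss no higher than the all-zeros start. Whether it also saves iterations depends on the problem and the algorithm, so this should be read as a sketch of the idea and not as a claim about speed.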

Another key part of statistical workflow is model understanding, which is sometimes discussed under the heading of interpretable AI. Again, we can often best understand a fitted model by seeing its similarities and differences as compared to other models.

Putting this together, we can think of a sequence of models going from simple to complex—or maybe a network of models—and then the steps of model building, inference, and evaluation can be performed on this network.
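
One way to picture such a network is the following Python sketch. It is only an illustration under stated assumptions: the toy grouped data, the three models (complete pooling, no pooling, and a partial-pooling model with a fixed, made-up shrinkage constant k), and all function and variable names are hypothetical and not taken from any particular example. Each node in the network names the simpler models it should be compared against, and evaluation is done on the edges.

```python
from statistics import mean

# Toy grouped data (hypothetical, for illustration only).
groups = {"a": [2.0, 4.0, 3.0, 3.0], "b": [10.0], "c": [5.0, 7.0]}


def complete_pooling(groups):
    """Simplest model: every group is estimated by the grand mean."""
    grand = mean(y for ys in groups.values() for y in ys)
    return {g: grand for g in groups}


def no_pooling(groups):
    """Each group is estimated by its own mean."""
    return {g: mean(ys) for g, ys in groups.items()}


def partial_pooling(groups, k=2.0):
    """Shrink each group mean toward the grand mean.

    The weight n / (n + k) uses a fixed k, which is an assumption here;
    in a real analysis the amount of shrinkage would be estimated.
    """
    grand = complete_pooling(groups)
    own = no_pooling(groups)
    estimates = {}
    for g, ys in groups.items():
        weight = len(ys) / (len(ys) + k)
        estimates[g] = weight * own[g] + (1.0 - weight) * grand[g]
    return estimates


# The network: each model lists the simpler models it is checked against.
network = {
    "complete_pooling": {"fit": complete_pooling, "compare_to": []},
    "no_pooling": {"fit": no_pooling, "compare_to": ["complete_pooling"]},
    "partial_pooling": {
        "fit": partial_pooling,
        "compare_to": ["complete_pooling", "no_pooling"],
    },
}

# Fit every model in the network.
fits = {name: node["fit"](groups) for name, node in network.items()}


def largest_disagreement(fit_a, fit_b):
    """Return the group where two fitted models differ most, and by how much."""
    worst = max(fit_a, key=lambda g: abs(fit_a[g] - fit_b[g]))
    return worst, abs(fit_a[worst] - fit_b[worst])


# Evaluate along the edges of the network.
report = {
    (name, other): largest_disagreement(fits[name], fits[other])
    for name, node in network.items()
    for other in node["compare_to"]
}

for (name, other), (group, gap) in report.items():
    print(f"{name} vs {other}: largest gap {gap:.2f} in group {group}")
```

In this toy setup the models disagree most about group "b", which has a single observation. That is the kind of thing the comparison is meant to surface: where adding structure to a model actually changes the inferences.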

This has come up before—here’s a post with some links, including one that goes back to 2011—so the challenge here is to actually do something already!

Our current plan is to work through workflow in some specific examples and some narrow classes of models and then use that as a springboard toward more general workflow ideas.

P.S. Thanks to Zad Chow for the adorable picture of workflow shown above.