“But when we apply statistical models, do we need to care about whether a model can retrieve the relationship between variables?”

Tongxi Hu writes:

Could you please answer a question about the application of statistical models? Let’s take regression models as an example.

In the real world, we use statistical models to find out relationships between different variables because we do not know the true relationship. For example, consider crop yield, temperature, and precipitation. But when we apply statistical models, do we need to care about whether a model can retrieve the relationship between variables?

Examples:
Suppose the true relationship between crop yield (Y), temperature (T), and precipitation (P) is:
Y = T + sin(T/6) + P + exp{-(P-160)/4}
Suppose we also simulated some observations of Y, T, and P. Then, we use a linear regression model to fit these simulated observations. I am sure we can fit them and fit them well using a certain statistical model. Let’s say the fitted model is:
Y = a*T + b*T^2 + c*P + d*P^2 + e.
Apparently, the fitted model can’t retrieve the real relationships between Y, T, and P. Can we really use the fitted model to do some inference?

Many researchers predict future crop yield using statistical models fitted to historical observations. Some of their work is published in top-level journals such as Science and Nature. I doubt their conclusions. My argument is that if we are unable to make sure a model is capable of retrieving the true relationships, inference from these models can be misleading.

My reply:

If you simulate fake data, there’s a true model. But in real life there’s just about never a true model, for two reasons:

First, to go back to your example: whatever the actual population regression function E(y|t,p) is, it will not have any exact parametric form. E(y|t,p) could be approximated by a linear model or a model with sin and exp or whatever, but there will be no true parametric form.
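
To make the idea of approximation concrete, here is a minimal simulation sketch in Python built on the example from the question. It treats the correspondent's formula as the truth, fits the quadratic model by least squares, and compares the fitted curve with the true mean inside and outside the range of the simulated data. The ranges for T and P, the noise level, the sample sizes, and the helper names (`true_mean`, `design`, `rmse_vs_truth`) are all assumptions made for illustration; none of them come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_mean(t, p):
    # The correspondent's hypothetical "true" relationship.
    return t + np.sin(t / 6) + p + np.exp(-(p - 160) / 4)

def design(t, p):
    # Columns for the quadratic model a*T + b*T^2 + c*P + d*P^2 + e.
    return np.column_stack([t, t**2, p, p**2, np.ones_like(t)])

# Assumed ranges and noise level, chosen only for illustration.
n = 500
t = rng.uniform(0, 40, n)
p = rng.uniform(150, 250, n)
y = true_mean(t, p) + rng.normal(0, 1, n)

# Ordinary least squares fit of the quadratic model.
coef, *_ = np.linalg.lstsq(design(t, p), y, rcond=None)

def rmse_vs_truth(t_new, p_new):
    # Root mean squared gap between the fitted curve and the true mean.
    gap = design(t_new, p_new) @ coef - true_mean(t_new, p_new)
    return float(np.sqrt(np.mean(gap**2)))

# New points inside the observed precipitation range, and new points
# in a drier range (120 to 150) that the fitted data never covered.
m = 2000
t_new = rng.uniform(0, 40, m)
rmse_inside = rmse_vs_truth(t_new, rng.uniform(150, 250, m))
rmse_outside = rmse_vs_truth(t_new, rng.uniform(120, 150, m))

print(f"RMSE vs. truth, inside observed range:  {rmse_inside:.2f}")
print(f"RMSE vs. truth, outside observed range: {rmse_outside:.2f}")
```

Under these assumed settings the quadratic tracks the true mean closely where there are data and much less closely where there are none. That is one way to read the concern in the question: the issue is less whether the fitted form is "true" and more where the approximation is being used.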

Second, there is no single E(y|t,p), as this expectation or regression function will vary over space, time, different types of crop, etc. Just as when estimating a treatment effect there is really no single “treatment effect” to estimate, when estimating a predictive relationship there is no single relationship to estimate.
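
A similarly small sketch can illustrate the second point. Suppose, purely hypothetically, that two crop types respond to temperature with different slopes. A single pooled regression then returns one slope that describes neither crop. The crop names, slopes, temperature range, and noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical slopes of yield on temperature for two crop types.
slopes = {"crop_a": 0.5, "crop_b": 2.0}

n = 1000
t = rng.uniform(10, 30, n)
crop = rng.choice(list(slopes), size=n)
slope = np.array([slopes[c] for c in crop])
y = slope * t + rng.normal(0, 1, n)

def fit_slope(t_sub, y_sub):
    # Least-squares slope of y on t, with an intercept.
    return float(np.polyfit(t_sub, y_sub, 1)[0])

# One regression ignoring crop type, then one regression per crop.
pooled = fit_slope(t, y)
by_crop = {c: fit_slope(t[crop == c], y[crop == c]) for c in slopes}

print("pooled slope:", round(pooled, 2))
print("slopes by crop:", {c: round(s, 2) for c, s in by_crop.items()})
```

In this setup the pooled slope lands between the two crop-specific slopes. Splitting by crop helps here, but only because crop type is the one source of variation that was simulated.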

In practice, all models are approximations, both in their functional form and in their implicit assumption of stability. (Yes, you could extend your model to allow variation in space, time, and type of crop—and that could be a good idea—but there’d still be variation according to other factors you did not account for.)

You write, “if we are unable to make sure a model is capable of retrieving the true relationships, inference from these models can be misleading.” This is a legitimate concern. Just remember that there are no “true relationships” to recover.

Also consider that models can be substantively motivated—that is, justified in part from the underlying science of the problem being modeled. There are strong motivations such as with compartmental models in toxicology, or our golf putting model, and weaker motivations such as with models predicting elections from the economy. Substantive motivation can be seen as a kind of regularization. One advantage of a substantive model is that there can be natural ways to extend it, as in the golf example.