The default prior for logistic regression coefficients in Scikit-learn

Someone pointed me to this post by W. D., reporting that, in Python's popular Scikit-learn package, the default prior for logistic regression coefficients is normal(0,1), or, as W. D. puts it, L2 penalization with a lambda of 1.
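For readers who want to see the correspondence in code, here is a minimal sketch. Scikit-learn's C is the inverse regularization strength, so its penalized log-loss can be read as a negative log-posterior under independent normal(0, C) priors on the coefficients (the intercept is left unpenalized):

```python
from sklearn.linear_model import LogisticRegression

# scikit-learn minimizes  0.5 * ||w||^2 + C * sum_i log_loss_i,
# i.e., up to a constant factor,  sum_i log_loss_i + (1/(2C)) * ||w||^2.
# That is the negative log-posterior under independent normal(0, C)
# priors on the coefficients, so the default C = 1.0 corresponds to
# normal(0,1): L2 penalization with lambda = 1.
clf = LogisticRegression()        # defaults: penalty='l2', C=1.0
print(clf.penalty, clf.C)         # -> l2 1.0
```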

In the post, W. D. makes three arguments. I agree with two of them.

1. I agree with W. D. that it makes sense to scale predictors before regularization. (There are various ways to do this scaling, but I think that scaling by 2*observed sd is a reasonable default for non-binary predictors; see the sketch after this list.)

2. I agree with W. D. that default settings should be made as clear as possible at all times.

3. I disagree with the author that a default regularization prior is a bad idea. As a general point, I think it makes sense to regularize, and when it comes to this specific problem, I think that a normal(0,1) prior is a reasonable default option (assuming the predictors have been scaled). I think that rstanarm is currently using normal(0,2.5) as a default, but if I had to choose right now, I think I’d go with normal(0,1), actually.
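Here is a minimal sketch of the scaling from point 1. The scale_by_two_sd helper is hypothetical, written just for illustration; scikit-learn's own StandardScaler divides by one sd, not two:

```python
import numpy as np

def scale_by_two_sd(X, binary_cols=()):
    """Center each non-binary column and divide by twice its observed sd,
    so its coefficient is comparable to that of a binary predictor.
    Hypothetical helper, not part of scikit-learn."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        if j in binary_cols:
            continue                      # leave binary predictors alone
        X[:, j] = (X[:, j] - X[:, j].mean()) / (2.0 * X[:, j].std())
    return X
```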

Apparently some of the discussion of this default choice revolved around whether the routine should be considered “statistics” (where the primary goal is typically parameter estimation) or “machine learning” (where the primary goal is typically prediction). As far as I’m concerned, it doesn’t matter: I’d prefer a reasonably strong default prior such as normal(0,1) both for parameter estimation and for prediction.

Again, I’ll repeat points 1 and 2 above: you do want to standardize the predictors before using this default prior, and in any case the user should be made aware of the defaults and of how to override them.
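In scikit-learn terms, overriding the default is one keyword argument. Under the MAP reading above, a normal(0, s^2) prior corresponds to C = s^2. (Sketch only; penalty=None requires a recent scikit-learn, and older versions spell it penalty='none'.)

```python
from sklearn.linear_model import LogisticRegression

prior_sd = 2.5                               # e.g., rstanarm's default scale
clf = LogisticRegression(C=prior_sd ** 2)    # roughly a normal(0, 2.5^2) prior

no_reg = LogisticRegression(penalty=None)    # no regularization at all
```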

P.S. Sander Greenland and I had a discussion of this. Sander disagreed with me, so I think it will be valuable to share both perspectives.

Sander wrote:

The following concerns arise in risk-factor epidemiology, my area, and in related comparative causal research, not in the formulation of classifiers or other pure prediction tasks of the sort machine learners focus on…

As you may already know, in my settings I don’t think scaling by 2*SD makes any sense as a default; instead, it makes the resulting estimates dependent on arbitrary aspects of the sample that have nothing to do with the causal effects under study or the effects one is attempting to control with the model. Decontextualized defaults are bound to create distortions sooner or later, alpha = 0.05 being of course the poster child for that.

So it seems here: regularizing by a prior with variance 1 after rescaling by 2*SD means extending the arbitrariness to made-up prior information, and can be pretty strong for a default, adding a substantial amount of pseudo-information centered on the null without any connection to an appropriate loss function. It is then capable of introducing considerable confounding (e.g., shrinking age and sex effects toward zero and thus reducing control of distortions produced by their imbalances). Weirdest of all is that rescaling everything by 2*SD and then regularizing with variance 1 means the strength of the implied confounder adjustment will depend on whether you chose to restrict the confounder range or not.

Thus I advise that any default prior introduce only a small absolute amount of information (e.g., two observations’ worth) and that the program allow the user to increase that if there is real background information to support more shrinkage.

Of course high-dimensional exploratory settings may call for quite a bit of shrinkage, but then there is a huge volume of literature on that and none I’ve seen supports anything resembling assigning a prior based on 2*SD rescaling, so if you have citations showing it is superior to other approaches in comparative studies, please send them along!
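To spell out the dependence Sander is describing: a normal(0,1) prior on a coefficient scaled by 2*sd(x) is a normal(0, 1/(2*sd(x))^2) prior on the raw-scale coefficient, so restricting the covariate range (which shrinks sd(x)) changes the implied prior. A quick sketch with made-up age data:

```python
import numpy as np

rng = np.random.default_rng(0)
age_full = rng.uniform(18, 80, size=1000)                    # made-up ages
age_restricted = age_full[(age_full >= 40) & (age_full <= 60)]

for label, age in [("full range", age_full), ("restricted", age_restricted)]:
    sd = age.std()
    # normal(0,1) on the 2*sd-scaled coefficient is
    # normal(0, (1/(2*sd))^2) on the per-year log-odds effect.
    print(f"{label}: sd(age) = {sd:5.1f}, implied per-year prior sd = {1/(2*sd):.3f}")
```

The restricted sample has a smaller sd, so the implied per-year prior is several times wider: the same default shrinks the same confounder’s effect by different amounts depending on an arbitrary design choice.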

I replied that I think that scaling by the population sd is better than scaling by the sample sd, and the way I think about scaling by the sample sd is as an approximation to scaling by the population sd. I also think the default I recommend, or other similar defaults, are safer than a default of no regularization, as no regularization leads to problems with separation. It sounds like you would prefer weaker default priors. I think that weaker default priors will lead to poorer parameter estimates and poorer predictions, but estimation and prediction are not everything, and I could imagine that for some users, including those in epidemiology, weaker priors could be considered more acceptable.
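A toy illustration of the separation problem (sketch; penalty=None requires a recent scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Complete separation: x < 0 always gives y = 0, x > 0 always gives y = 1.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Without a penalty the MLE does not exist: the likelihood keeps
# improving as the coefficient grows, so the solver returns something huge.
mle = LogisticRegression(penalty=None, max_iter=10_000).fit(X, y)

# The default penalty (normal(0,1) in the MAP reading) keeps it finite.
map_est = LogisticRegression().fit(X, y)

print(mle.coef_, map_est.coef_)   # very large vs. moderate coefficient
```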

Sander responded:

A serious question would be: what is “the” population SD? The county? The state? The nation? The world? All humans who ever lived?

Maybe you are thinking of descriptive surveys with precisely pre-specified sampling frames. But no comparative cohort study or randomized clinical trial I have seen had an identified or sharply defined population to refer to beyond the particular groups they happened to get due to clinic enrollment, physician recruitment, and patient cooperation.

That aside, do we use “the” population restricted by the age restriction used in the study? That would mean the prior SD for the per-year age effect would vary with peculiarities like the age restriction, even if the per-year increment in outcome was identical across years of age and populations. All that seems very weird, more along the lines of statistical numerology than empirical science (as if there were some magic in the SD; why not the interquartile or interquintile or intertertile range? all of which could be equally bad, but aren’t necessarily worse).

I don’t recommend no regularization over weak regularization, but problems like separation are fixed by even the weakest priors in use. The weak priors I favor have a direct interpretation in terms of information being supplied about the parameter in whatever SI units make sense in context (e.g., mg of a medication given in mg doses).

As for “poorer parameter estimates,” that is extremely dependent on the performance criteria one uses to gauge “poorer” (bias is often minimized by the Jeffreys prior, which is too weak even for me, even though it is not as weak as a Cauchy prior). And “poor” is highly dependent on context.

In comparative studies (which I have seen you involved in too), I’m fine with a prior that pulls estimates toward the range in which debate takes place among stakeholders, so they can all be comfortable with the results. But no stronger than that, because a too-strong default prior will exert too strong a pull within that range and thus meaningfully favor some stakeholders over others, as well as start to damage confounding control as I described before. Worse, most users won’t even know when that happens; they will instead just defend their results circularly with the argument that they followed acceptable defaults. Again, 0.05 is the poster child for that kind of abuse, and at this point I can imagine parallel strong (if even more opaque) distortions from scaling of priors being driven by a 2*SD covariate scaling.

My reply regarding Sander’s first paragraph is that, yes, different goals will correspond to different models, and that can make sense. In practice with rstanarm we set priors that correspond to the scale of 2*sd of the data, and I interpret these as representing a hypothetical population for which the observed data are a sample, which is a standard way to interpret regression inferences.

Regarding Sander’s concern that users will “just defend their results circularly with the argument that they followed acceptable defaults”: Sure, that’s a problem. But in any case I’d like to have better defaults, and I think extremely weak priors are not such a good default, as they lead to noisy estimates (or, conversely, to users not including potentially important predictors in the model, out of concern over the resulting noisy estimates). Informative priors, that is, regularization, make regression a more powerful tool.