“To Change the World, Behavioral Intervention Research Will Need to Get Serious About Heterogeneity”

Beth Tipton, Chris Bryan, and David Yeager write:

The increasing influence of behavioral science in policy has been a hallmark of the past decade, but so has a crisis of confidence in the replicability of behavioral science findings. In this essay, we describe a nascent paradigm shift in behavioral intervention research—a heterogeneity revolution—that we believe these two historical trends have already set in motion. The emerging paradigm recognizes that the unscientific samples that currently dominate behavioral intervention research cannot produce reliable estimates of an intervention’s real-world impact. Similarly, unqualified references to an intervention’s “true effect” are rarely warranted. Rather, the variation in effect estimates across studies that defines the current replication crisis is to be expected, even in the absence of false positives, as long as heterogeneous effects are studied without a systematic approach to sampling.

I agree! I’ve been ranting about this for a long time—hey, here’s a post from 2005, not long after we started this blog, and here’s another from 2009 . . . I guess there’s a division of labor on this one: I rant and Tipton et al. do something about it.

From one standpoint, the idea of varying treatment effects is obvious. But, when you look at what people do, this sort of variation is typically ignored. When I did my PhD training in the 1980s, we were taught all about causal inference. We learned randomization inference, we learned Bayesian inference, but it was always a model with constant treatment effect. Statistics textbooks—including my own!—always start with the model of constant treatment effect, bringing in interactions only as an optional extra.

And the problem’s not just with statisticians. Behavioral scientists have also been stunningly unreflective regarding the relevance of varying treatment effects to their experimental studies. For example, here’s an email I received a few years ago from a prominent psychology researcher: not someone I know personally, but a prominent, very well connected professor at a leading East Coast private university that’s not Cornell. In response to a criticism I gave regarding a paper that relied entirely on data from a self-selected sample of 100 women from the Internet and 24 undergraduates, the prominent professor wrote:

Complaining that subjects in an experiment were not randomly sampled is what freshmen do before they take their first psychology class. I really *hope* you [know] why that is an absurd criticism – especially of authors who never claimed that their study generalized to all humans.

The paper in question did not attempt to generalize to “all humans,” just to women of childbearing age. The title and abstract of the paper simply refer to “women” with no qualifications, and there is no doubt in my mind that the authors (and anyone else who found this study to be worth noting) are interested in some generalization to a larger population.

The point is that this leading psychology researcher who wrote me that email was so deep into the constant-treatment-effect mindset that he didn’t just think that particular study was OK, he also thought it was “absurd” to be concerned about the non-representativeness of a sample in a psychology experiment.

So that was a long digression. The point is that the message sent by Tipton, Bryan, and Yeager, while commonsensical and clear once stated, is not so widely appreciated. For whatever reason, it’s taken people a while to come around to this point.

Why? For one thing, interactions are hard to estimate. Remember 16: you need 16 times the sample size to estimate an interaction as you need to estimate a main effect. So, for a long time we’ve had this attitude that, since interactions are hard—sometimes essentially impossible—to identify from data, we might as well just pretend they don’t exist. It’s a kind of Pascal’s wager or bet-on-sparsity principle.
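Here’s the arithmetic behind that 16, written out as a quick sketch. The setup (an equal-split two-arm experiment, an interaction defined across two equal halves of the sample, and an interaction half the size of the main effect) is the usual version of this argument, not anything specific to the Tipton et al. paper:

```latex
% se of the main effect: a difference of two means, n/2 per arm.
\mathrm{se}_{\text{main}} = \sqrt{\tfrac{\sigma^2}{n/2} + \tfrac{\sigma^2}{n/2}} = \frac{2\sigma}{\sqrt{n}}
% se of the interaction: a difference of differences, n/4 per cell.
\mathrm{se}_{\text{int}} = \sqrt{4 \cdot \tfrac{\sigma^2}{n/4}} = \frac{4\sigma}{\sqrt{n}}
% With the interaction half the main effect's size and twice its se,
% its z-ratio is 1/4 as large; since se scales as 1/sqrt(n), matching
% the main effect's precision requires 4^2 = 16 times the sample size.
```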

More recently, though, I’ve been thinking we need to swallow our pride and routinely model these interactions, structuring our models so that the interactions we estimate make sense. Some of this structuring can be done using informative priors, some of it can be done using careful choices of functional forms and transformations (as in my effects-of-survey-incentives paper with Lauren). But, even if we can’t accurately estimate these interactions or even reliably identify their signs, it can be a mistake to just exclude them, which is equivalent to assuming they’re zero.
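To fix ideas, here’s a minimal sketch in Python of that shrink-rather-than-exclude approach. Everything in it (the simulated data, the penalty value) is an illustrative assumption of mine, not anyone’s actual analysis; the ridge penalty on the interaction coefficient acts as the closed-form analogue of a zero-centered informative prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated experiment (numbers are illustrative): randomized
# treatment z, standardized pre-treatment covariate x, and a true
# interaction half the size of the main effect.
n = 500
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n).astype(float)
y = 0.5 * z + 0.25 * z * x + rng.normal(size=n)

# Design matrix: intercept, treatment, covariate, interaction.
X = np.column_stack([np.ones(n), z, x, z * x])

# Penalized least squares with a ridge penalty on the interaction
# coefficient only. This shrinks the interaction toward zero, which
# is what a zero-centered Gaussian prior does, rather than dropping
# the term, which would fix the interaction at exactly zero.
penalty = np.diag([0.0, 0.0, 0.0, 10.0])
beta_hat = np.linalg.solve(X.T @ X + penalty, X.T @ y)
print("intercept, treatment, covariate, interaction:", beta_hat)
```

The penalty value plays the role of a prior scale: in a real analysis you’d choose it from substantive knowledge of how big such interactions plausibly are, which is exactly the kind of structuring I have in mind.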

Also, let’s move from the overplayed topic of analysis to the still-fertile topic of design. If certain interactions or aspects of varying treatment effects are important, let’s design studies to specifically estimate these!

To put it another way: We’re already considering treatment interactions, all the time.

Why do I say that? Consider the following two pieces of advice we always give to researchers seeking to test out a new intervention:

1. Make the intervention as effective as possible. In statistics terms, multiplying the effect size by X is equivalent to multiplying the sample size by X^2 (see the sketch after this list). So it makes sense to do what you can to increase that effect size.

2. Apply the intervention to people who will be most receptive to the treatment, and in settings where the treatment will be most effective.
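The X-squared equivalence in item 1 is just the usual power arithmetic, sketched here (the notation is mine, just to fix ideas):

```latex
% The z-ratio of an estimated effect Delta with per-observation sd sigma:
z = \frac{\Delta}{\mathrm{se}(\hat{\Delta})}, \qquad \mathrm{se}(\hat{\Delta}) \propto \frac{\sigma}{\sqrt{n}}
% Scaling the effect, Delta -> X*Delta, multiplies z by X.
% Scaling the sample size, n -> X^2*n, divides the se by X and so
% multiplies z by the same factor X. Hence the equivalence.
```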

OK, fine. So how do you do 1 and 2? You can only do these if you have some sense of how the treatment effect can vary based on manipulable conditions (that’s item 1) and based on observed settings (that’s item 2). It’s a Serenity Prayer kind of thing.

So, yeah, understanding interactions is crucial, not just for interpreting experimental results, but for designing effective experiments that can yield conclusive findings.

Big changes coming

In our recent discussion of growth mindset interventions, Diana Senechal wrote:

We not only have a mixture of mindsets but actually benefit from the mixture—that we need a sense of limitation as well as of possibility. It is fine to know that one is better at certain things than at others. This allows for focus. Yes, it’s important to know that one can improve in areas of weakness. And one’s talents also contain weaknesses, so it’s helpful, overall, to know how to improve and to believe that it can happen. But it does not have to be an all-encompassing ideology, nor does it have to replace all belief in fixity or limitation. One day, someone will write a “revelatory” book about how the great geniuses actually knew they were bad at certain things–and how this knowledge allowed them to focus. That will then turn into some “big idea” and go to extremes of its own.

I agree. Just speaking qualitatively, as a student, teacher, sibling, and parent, I’d say the following:

– When I first heard about growth mindset as an idea, 20 or 30 years ago, it was a bit of a revelation to me: one of those ideas that is obvious and that we knew all along (yes, you can progress more if you don’t think of your abilities as fixed) but where hearing the idea stated in this way could change how we think.

– It seems clear that growth mindset can help some kids, but not all or even most, as these have to be kids who (a) haven’t already internalized growth mindset, and (b) are open and receptive to the idea. This is an issue in learning and persuasion and change more generally: For anything, the only people who will change are those who have not already changed and are willing to change. Hence a key to any intervention is to target the right people.

– If growth mindset becomes a dominant ideology, then it could be that fixed-mindset interventions would be helpful to some students. Indeed, maybe this is already the case.

The interesting thing is how well the above principles would seem to apply to so many psychological and social interventions. But when we talk about causal inference, we typically focus on the average treatment effect, and we often fit simple regression models in which the treatment effect is constant.
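In model terms, the contrast is between the default constant-effect regression and even the simplest varying-effect alternative (the notation here is mine, just to fix ideas, with x_i standing in for receptivity):

```latex
% Constant treatment effect: one theta for everyone.
y_i = \alpha + \theta z_i + \epsilon_i
% Varying treatment effect: theta depends on who is treated,
% for example through a pre-treatment covariate x_i.
y_i = \alpha + (\theta_0 + \theta_1 x_i) z_i + \epsilon_i
```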

This suggests, in a God-is-in-every-leaf-of-every-tree way, that we’ve been thinking about everything all wrong for all these decades, focusing on causal identification and estimating “the treatment effect” rather than on these issues of receptivity to treatment.