Why do a within-person rather than a between-person experiment?

Zach Horne writes:

A student of mine was presenting at the annual meeting of the Law and Society Association. She sent me this note after she gave her talk:

I presented some research at LSA which used a within subject design. I got attacked during the Q&A session for using a within subjects design and a few people said it doesn’t mean much unless I can replicate it with a between subjects design and it has no ecological validity.

She asked me if I had any thoughts on this and whether I had previously had problems defending a within subjects design. She also wondered what one should say when people take issue with within subjects designs.

I sent her a note with my initial thoughts but I thought it would be worth bringing up (again) on your blog because I’ve run into this criticism a lot. I don’t think she should just buckle and start running between subjects designs just to appease reviewers. We need people to understand the value of within designs, but the mantra “measurement error is important to think about” doesn’t seem to be doing the trick.

For background, here are two old posts of mine that I found on this topic from 2016 and 2017:

Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

Poisoning the well with a within-person design? What’s the risk?

Now, to quickly answer the questions above:

First off, the “ecological validity” thing is a red herring. Whoever said that was either misunderstood or didn’t know what they were talking about. Ecological validity refers to generalization from the lab to the real world, and it’s an important concern, but it has nothing to do with whether your measurements are within or between people.

Second, I think within-person designs are generally the best option when studying within-person effects. But there are settings where a between-person design is better.

To understand why I prefer the within-person design, it’s helpful to see the key advantage of the between-person design, which is that, by giving each person only one treatment, the effects of the treatment are pure. No crossover effects to worry about.

The disadvantage of the between-person design is that it does not control for variation among people, which can be huge.

In short, the between-person design is often cleaner, but at the cost of being so variable as to be essentially useless.
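To put rough numbers on "so variable," here is a back-of-the-envelope calculation. It is a sketch under assumptions that are mine, not part of any particular study: an additive model with a person-level term, independent measurement noise, a constant treatment effect, and no crossover effects. The symbols (N for the total number of people, sigma_alpha for the between-person standard deviation, sigma_epsilon for the within-person noise) are introduced only for this illustration.

```latex
% Assumed model: person i, measurement j, treatment indicator z_{ij} \in \{0, 1\}
y_{ij} = \alpha_i + \theta z_{ij} + \epsilon_{ij},
\qquad \mathrm{sd}(\alpha_i) = \sigma_\alpha,
\qquad \mathrm{sd}(\epsilon_{ij}) = \sigma_\epsilon .

% Between-person: N/2 people per condition, one measurement each
\mathrm{Var}\bigl(\hat{\theta}_{\mathrm{between}}\bigr)
  = \frac{\sigma_\alpha^2 + \sigma_\epsilon^2}{N/2}
  + \frac{\sigma_\alpha^2 + \sigma_\epsilon^2}{N/2}
  = \frac{4\,(\sigma_\alpha^2 + \sigma_\epsilon^2)}{N}

% Within-person: all N people measured under both conditions;
% \alpha_i cancels in each person's difference
\mathrm{Var}\bigl(\hat{\theta}_{\mathrm{within}}\bigr)
  = \frac{2\,\sigma_\epsilon^2}{N}

% Ratio of the two variances
\frac{\mathrm{Var}\bigl(\hat{\theta}_{\mathrm{between}}\bigr)}
     {\mathrm{Var}\bigl(\hat{\theta}_{\mathrm{within}}\bigr)}
  = 2\left(1 + \frac{\sigma_\alpha^2}{\sigma_\epsilon^2}\right)
```

Under these assumptions, if people differ from each other much more than any one person's responses bounce around (sigma_alpha large relative to sigma_epsilon), the between-person estimate is many times noisier for the same number of participants. Crossover effects would add bias to the within-person estimate, which this calculation ignores.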

OK, at this point you might say, Fine, just do the between-person design with a really large N. But this approach has two problems. First, people don’t always get a really large N. One reason for that is the naive view that, if you have statistical significance, then your sample size was large enough. Second, all studies have bias (for example, in a psychology experiment there will be information leakage and demand effects), and ramping up N won’t solve that problem.

Here’s what I wrote a few years ago:

The clean simplicity of [between-person] designs has led researchers to neglect important issues of measurement . . .

Why use between-subject designs for studying within-subject phenomena? I see a bunch of reasons. In no particular order:

1. The between-subject design is easier, both for the experimenter and for any participant in the study. You just perform one measurement per person. No need to ask people a question twice, or follow them up, or ask them to keep a diary.

2. Analysis is simpler for the between-subject design. No need to worry about longitudinal data analysis or within-subject correlation or anything like that.

3. Concerns about poisoning the well. Ask the same question twice and you might be concerned that people are remembering their earlier responses. This can be an issue, and it’s worth testing for such possibilities and doing your measurements in a way to limit these concerns. But it should not be the deciding factor. Better a within-subject study with some measurement issues than a between-subject study that’s basically pure noise.

4. The confirmation fallacy. Lots of researchers think that if they’ve rejected a null hypothesis at a 5% level with some data, that they’ve proved the truth of their preferred alternative hypothesis. Statistically significant, so case closed, is the thinking. Then all concerns about measurements get swept aside: After all, who cares if the measurements are noisy, if you got significance? Such reasoning is wrong wrong wrong but lots of people don’t understand.

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement.

And this framing of questions of experimental design and analysis in terms of risks and benefits:

In a typical psychology experiment, the risks and benefits are indirect. No patients’ lives are in jeopardy, nor will any be saved. There could be benefits in the form of improved educational methods, or better psychotherapies, or simply a better understanding of science. On the other side, the risk is that people’s time could be wasted with spurious theories or ineffective treatments. Useless interventions could be costly in themselves and could do further harm by crowding out more effective treatments that might otherwise have been tried.

The point is that “bias” per se is not the risk. The risks and benefits come later on when someone tries to do something with the published results, such as to change national policy on child nutrition based on claims that are quite possibly spurious.

Now let’s apply these ideas to the between/within question. I’ll take one example, the notorious ovulation-and-voting study, which had a between-person design: a bunch of women were asked about their vote preference, the dates of their cycle, and some other questions, and then women in a certain phase of their cycle were compared to women in other phases. Instead, I think this should’ve been studied (if at all) using a within-person design: survey these women multiple times at different times of the month, each time asking a bunch of questions including vote intention. Under the within-person design, there’d be some concern that some respondents would be motivated to keep their answers consistent, but in what sense does that constitute a risk? What would happen is that changes would be underestimated, but when this propagates down to inferences about day-of-cycle effects, I’m pretty sure this is a small problem compared to all the variation that tangles up the between-person design. One could do a more formal version of this analysis; the point is that such comparisons can be done.

So, to get back to the question from my correspondent: what to do if someone hassles you to conduct a between-person design?

First, you can do a simulation study or design calculation and show the huge N that you would need to get a precise enough estimate of your effect of interest.

Second, you can point out that inferences from the between-person design are entirely indirect and only of averages, even though for substantive reasons you almost certainly are interested in individual effects.

Third, you can throw the “ecological validity” thing back at them and point out that, in real life, people are exposed to all sorts of different stimuli. Real life is a within-person design. In psychology experiments, we’re not talking about lifetime exposures to some treatment. In real life, people do different things all the time.
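For the first of these suggestions, the simulation study or design calculation, here is a minimal sketch of what it could look like. Everything in it is an assumption made for illustration: the additive model with a person-level term and independent noise, the absence of crossover effects, the function names, and the default values for the effect size and the two standard deviations. You would replace these with values that are plausible for your own measurements.

```python
import math
import random
import statistics


def n_needed_between(target_se, sd_person, sd_noise):
    """Total people needed in a between-person design (half per condition).

    Assumed model: SE^2 = 4 * (sd_person^2 + sd_noise^2) / N.
    """
    return math.ceil(4 * (sd_person**2 + sd_noise**2) / target_se**2)


def n_needed_within(target_se, sd_noise):
    """Total people needed in a within-person design (each measured twice).

    Assumed model: SE^2 = 2 * sd_noise^2 / N, with no crossover effects.
    """
    return math.ceil(2 * sd_noise**2 / target_se**2)


def simulate_se(n, effect=0.1, sd_person=1.0, sd_noise=0.3, reps=500, seed=1):
    """Simulate both designs with n people each; return (se_between, se_within).

    Each SE is the standard deviation of the estimated effect across reps.
    """
    rng = random.Random(seed)
    half = n // 2
    between, within = [], []
    for _ in range(reps):
        # Between-person: different people in each condition, one measurement each.
        control = [rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
                   for _ in range(half)]
        treated = [rng.gauss(0, sd_person) + effect + rng.gauss(0, sd_noise)
                   for _ in range(half)]
        between.append(statistics.fmean(treated) - statistics.fmean(control))

        # Within-person: same person under both conditions, so the
        # person-level term cancels in the difference.
        diffs = []
        for _ in range(n):
            person = rng.gauss(0, sd_person)
            y0 = person + rng.gauss(0, sd_noise)
            y1 = person + effect + rng.gauss(0, sd_noise)
            diffs.append(y1 - y0)
        within.append(statistics.fmean(diffs))
    return statistics.stdev(between), statistics.stdev(within)


if __name__ == "__main__":
    # Hypothetical inputs: people vary a lot, repeated measurements vary less.
    print("N needed, between:", n_needed_between(0.05, sd_person=1.0, sd_noise=0.3))
    print("N needed, within: ", n_needed_within(0.05, sd_noise=0.3))
    print("Simulated SEs (between, within):", simulate_se(n=40))
```

The analytic functions give the quick answer, and the simulation is there as a check and as a starting point for adding the complications that a formula can't handle, such as respondents who try to keep their answers consistent. Note that the within-person design here takes two measurements per person, so the comparison is per participant, not per measurement.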