Priors on effect size in A/B testing

https://statmodeling.stat.columbia.edu/2020/07/04/priors-on-effect-size-in-a-b-testing/

I just saw this interesting applied-focused post by Kaiser Fung on non-significance in A/B testing. Kaiser was responding to a post by Ron Kohavi. I can’t find Kohavi’s note anywhere, but you can read Kaiser’s post to get the picture.

Here I want to pick out a few sentences from Kaiser’s post:

Kohavi correctly points out that for a variety of reasons, people push for “shipping flat”, i.e. adopting the Test treatment even though it did not outperform Control in the A/B test. His note carefully lays out these reasons and debunks most of them.

The first section deals with situations in which Kohavi would accept “shipping flat”. He calls these “non-inferiority” scenarios. My response to those scenarios were posted last week. I’d prefer to call several of these quantification scenarios, in which the expected effect is negative, and the purpose of A/B testing is to estimate the magnitude. . . .

The “ship flat or not” decision should be based on a cost-benefit analysis. . . .

This all rang a bell because I’ve been thinking about priors for effect sizes in A/B tests.

– If you think the new treatment will probably work, why test it? Why not just do it? Because gathering data makes you more likely to make the right decision.

– But if you have partial information (characterized in the above discussion by a non-significant p-value) and you have to decide, then you should use decision analysis.

– Often it makes sense to consider negative-expectation scenarios. Most proposed innovations are bad ideas, right?
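To make these three points concrete, here is a minimal sketch of what such a decision analysis could look like. It is not anything from Kaiser's or Kohavi's posts. It assumes a normal prior on the effect size with a slightly negative mean (the "most innovations are bad ideas" scenario), a normally distributed estimate from the A/B test, and a linear utility with a fixed switching cost. All the numbers, and the names `value_per_unit` and `switching_cost`, are made up for illustration.

```python
import math

def posterior_normal(prior_mean, prior_sd, estimate, std_error):
    """Normal-normal conjugate update: combine prior and estimate by precision."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / std_error**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)

def prob_positive(mean, sd):
    """P(effect > 0) when the effect is normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))

def expected_net_gain(post_mean, value_per_unit, switching_cost):
    """Expected gain from shipping, under an assumed linear utility."""
    return value_per_unit * post_mean - switching_cost

# Hypothetical inputs, not taken from any real experiment.
prior_mean, prior_sd = -0.1, 0.5   # negative-expectation prior on the effect
estimate, std_error = 0.3, 0.4     # A/B result: z = 0.75, not significant

post_mean, post_sd = posterior_normal(prior_mean, prior_sd, estimate, std_error)
gain = expected_net_gain(post_mean, value_per_unit=100.0, switching_cost=5.0)

print(f"posterior mean = {post_mean:.3f}, posterior sd = {post_sd:.3f}")
print(f"P(effect > 0)  = {prob_positive(post_mean, post_sd):.3f}")
print(f"expected net gain from shipping = {gain:.2f}")
print("ship" if gain > 0 else "do not ship")
```

The point of the sketch is that the decision depends on the posterior and on the costs and benefits, not on whether the p-value crossed a threshold. Changing the prior mean or the switching cost can flip the decision with exactly the same data.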

Also this bit from Kohavi:

The problem is exacerbated when the first iteration shows stat-sig negative, tweaks are made, and the treatment is iterated a few times until we have a non-stat sig iteration. In such cases, a replication run should be made with high power to at least confirm that the org is not p-hacking.

– This is related to the idea that A/B tests typically don’t occur in a vacuum; we have a series of innovations, experiments, and decisions.
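Kohavi's concern about iterating until the result is no longer stat-sig negative can be illustrated with a small simulation. This is only a sketch under strong assumptions: the treatment is truly harmful, the tweaks do not change the true effect at all, each iteration is an independent test with a normally distributed estimate, and "stat-sig negative" means z < -1.96. The function names and all the numbers are hypothetical.

```python
import random

Z_CRIT = 1.96  # conventional two-sided 5% threshold

def is_sig_negative(estimate, std_error):
    """True if the estimate is statistically significantly below zero."""
    return estimate / std_error < -Z_CRIT

def iterate_until_flat(true_effect, std_error, max_iterations, rng):
    """Rerun the test up to max_iterations times; stop at the first 'flat' result."""
    for _ in range(max_iterations):
        estimate = rng.gauss(true_effect, std_error)
        if not is_sig_negative(estimate, std_error):
            return True
    return False

def fraction_reaching_flat(true_effect, std_error, max_iterations,
                           n_sims=10000, seed=0):
    """Share of simulated orgs that end up with a non-stat-sig iteration."""
    rng = random.Random(seed)
    hits = sum(
        iterate_until_flat(true_effect, std_error, max_iterations, rng)
        for _ in range(n_sims)
    )
    return hits / n_sims

true_effect = -1.0  # assumed: the treatment really does hurt

single = fraction_reaching_flat(true_effect, std_error=0.4, max_iterations=1)
iterated = fraction_reaching_flat(true_effect, std_error=0.4, max_iterations=4)
# High-powered replication: much smaller standard error, one attempt only.
replication = fraction_reaching_flat(true_effect, std_error=0.1, max_iterations=1)

print(f"flat after 1 low-power test:        {single:.2f}")
print(f"flat within 4 low-power iterations: {iterated:.2f}")
print(f"flat in high-power replication:     {replication:.2f}")
```

Under these assumptions, repeated low-power tests of a harmful treatment reach a "flat" result far more often than a single test does, while a high-powered replication almost never comes out flat. That is the sense in which the replication guards against p-hacking.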

I want to think more about all this.