Will decentralised collaboration increase the robustness of scientific findings in biomedical research? Some data and some causal questions.

Mark Tuttle points to this press release, “Decentralising science may lead to more reliable results: Analysis of data on tens of thousands of drug-gene interactions suggests that decentralised collaboration will increase the robustness of scientific findings in biomedical research,” and writes:

In my [Tuttle’s] opinion, the explanation is more likely to be sociological – group think and theory-driven observation – rather than methodological. Also, independent groups tend to be rivals, not collaborators, and thus will be more inclined to be critical; I have seen this kind of thing in action . . .

I replied that I suspect it could also be that it is the more generalizable results that outside labs are more interested in replicating in the first place. So the descriptive correlation could be valid without the causal conclusion being warranted.

The research article in question is called, “Meta-Research: Centralized scientific communities are less likely to generate replicable results,” by Valentin Danchev, Andrey Rzhetsky, and James Evans, and it reports:

Here we identify a large sample of published drug-gene interaction claims curated in the Comparative Toxicogenomics Database (for example, benzo(a)pyrene decreases expression of SLC22A3) and evaluate these claims by connecting them with high-throughput experiments from the LINCS L1000 program. Our sample included 60,159 supporting findings and 4253 opposing findings about 51,292 drug-gene interaction claims in 3363 scientific articles. We show that claims reported in a single paper replicate 19.0% (95% confidence interval [CI], 16.9–21.2%) more frequently than expected, while claims reported in multiple papers replicate 45.5% (95% CI, 21.8–74.2%) more frequently than expected. We also analyze the subsample of interactions with two or more published findings (2493 claims; 6272 supporting findings; 339 opposing findings; 1282 research articles), and show that centralized scientific communities, which use similar methods and involve shared authors who contribute to many articles, propagate less replicable claims than decentralized communities, which use more diverse methods and contain more independent teams. Our findings suggest how policies that foster decentralized collaboration will increase the robustness of scientific findings in biomedical research.

Seeing this, I’d like to separate the descriptive from the causal claims.

The first descriptive statement is that claims reported in multiple papers replicate more often than claims reported in single papers. I’ll buy that (at least in theory; I have not gone through the article in enough detail to understand what is meant by the “expected” rate of replication. Also, I can’t see how the numbers add up: If some claims replicate 19% more frequently than expected, and others replicate 45% more frequently than expected (let’s pass over the extra decimal place in “45.5%” etc. in polite silence), then some claims must replicate less frequently than expected, no? But every claim is reported in a single paper or in multiple papers, so I feel like I’m missing something. But, again, that must be explained in the Danchev et al. article, and it’s not my main focus here).
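The arithmetic behind that puzzle can be made concrete with a minimal sketch. It assumes, purely for illustration, that "expected" means the pooled replication rate across all claims; the group sizes and rates below are hypothetical and are not taken from the Danchev et al. article, which may well define "expected" differently.

```python
# Hypothetical illustration: if "expected" is the pooled rate over all claims,
# the claim-weighted average of the relative excess must be zero, so the
# single-paper and multiple-paper groups cannot both exceed it.

def pooled_rate(groups):
    """groups maps a label to (number of claims, replication rate)."""
    total = sum(n for n, _ in groups.values())
    return sum(n * rate for n, rate in groups.values()) / total

def relative_excess(rate, expected):
    """How much more often a group replicates than expected, as a fraction."""
    return rate / expected - 1.0

# Made-up numbers, chosen only to show the mechanics.
groups = {"single_paper": (900, 0.50), "multiple_papers": (100, 0.60)}

expected = pooled_rate(groups)
excess = {label: relative_excess(rate, expected)
          for label, (_, rate) in groups.items()}

total_claims = sum(n for n, _ in groups.values())
weighted_excess = sum(n * excess[label]
                      for label, (n, _) in groups.items()) / total_claims

# Against an external baseline (here an arbitrary 0.42), both groups can exceed it.
external_baseline = 0.42
excess_vs_baseline = {label: relative_excess(rate, external_baseline)
                      for label, (_, rate) in groups.items()}

print(excess, weighted_excess, excess_vs_baseline)
```

Under the pooled-rate reading, one group has to fall below expectation. If the baseline is instead something external to the sample, such as a chance rate of agreement, both groups can sit above it, which may be how the reported 19% and 45% fit together.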

The second descriptive statement is that claims produced by centralized scientific communities replicate less well than claims produced by decentralized communities. Again, I’ll assume the researchers did a good analysis here and that this descriptive statement is valid for their data and will generalize to other research in biomedicine.

Finally, the causal statement is that “policies that foster decentralized collaboration will increase the robustness of scientific findings in biomedical research.” I can see that this causal statement is consistent with the descriptive findings, but I don’t see it as implied by them. It seems to me that if you want to make this sort of causal statement, you need to do an experimental or observational study. I’m assuming experimental data on this question aren’t available, so you’ll want to do an observational study, comparing results under different practices, and adjusting for pre-treatment differences between exposed and unexposed groups. But it seems that they just did a straight comparison, and that seems subject to selection bias.
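To see how a straight comparison could mislead here, consider a small simulation, offered as a sketch under stated assumptions rather than a model of the actual data. Each hypothetical claim has a latent "generalizability" that drives both whether outside labs pick it up (making it look decentralized) and whether it replicates; by construction, decentralization has no causal effect on replication. All names and numbers are invented for illustration.

```python
import random

def simulate(n_claims=200_000, seed=0):
    """Return (generalizability, decentralized, replicated) for each claim.

    Assumption: replication depends only on latent generalizability g;
    decentralization is selected on g but has zero causal effect.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_claims):
        g = rng.random()                   # latent generalizability in [0, 1)
        decentralized = rng.random() < g   # outside labs favor generalizable claims
        replicated = rng.random() < g      # outcome ignores decentralization
        rows.append((g, decentralized, replicated))
    return rows

def rate_gap(rows):
    """Replication rate of decentralized claims minus centralized claims."""
    dec = [rep for _, d, rep in rows if d]
    cen = [rep for _, d, rep in rows if not d]
    return sum(dec) / len(dec) - sum(cen) / len(cen)

rows = simulate()

# Straight comparison, with no adjustment.
raw_gap = rate_gap(rows)

# Adjusted comparison: compare within narrow strata of g, then average
# the within-stratum gaps weighted by stratum size.
n_bins = 20
strata = [[] for _ in range(n_bins)]
for row in rows:
    strata[min(int(row[0] * n_bins), n_bins - 1)].append(row)
adjusted_gap = sum(rate_gap(s) * len(s) for s in strata) / len(rows)

print(f"raw gap: {raw_gap:.3f}, adjusted gap: {adjusted_gap:.3f}")
```

In this toy setup the raw gap is large and positive while the stratified gap is close to zero, so the descriptive pattern appears with no causal effect at all. In a real observational study the confounder is not directly observed, which is what makes the adjustment hard.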

I contacted the authors of the article regarding my concern that the descriptive correlation could be valid without the causal conclusion being warranted, and they replied as follows:

James Evans:

It is most certainly the case that more generalizable findings are more likely to be repeated and published by someone (within or without author-cluster), which we detail in our most recent version of the paper, but I do not believe that it is the case that outside labs publish on those that generalize; but that inside labs tend to agree with former findings even if they are wrong (e.g., even if the finding is in the opposite direction). It seems much more likely that the same labs refuse to publish contrary findings than that outside labs magically know before they have begun their studies what the right claims are to study.

Valentin Danchev:

How generalizable results are defined seems to be key here. In the paper, we defined generalizability across experimental settings in L1000, but if the view is that outside labs select results known to be generalizable at the time of study, then generalizability should come from the research literature (Gen1). But from this definition, clustered publications with shared authors appear to provide the most generalizable results, which are virtually always confirmed (0.989). However, these may not be the kind of generalizable results we can rely on, because when we add another layer of generalizability (Gen2), namely confirmation of published results (i.e., matching effect direction) in L1000 experiments, results from those clustered publications are less likely to be confirmed. Note that if results remain in clustered publications simply because they were not selected by outside labs due to non-generalizability, without any relation to author centralization or overlap, then we should expect conflicting papers about those non-generalized results, but, again, we found this not to be the case: results in clustered papers are more confirmatory while less likely to replicate (i.e., match the direction) in L1000.

As James mentioned, the more likely a published result is confirmed in L1000 experiments (Gen2), the more likely this result is to be published multiple times, whereas non-confirmed results are more likely to be published once (and probably tested further but put in the ‘file drawer’). This does support a view that generalizable results are learned over time. But without locating centralized or overlapping groups, some results would likely turn out to be false positives. Hence, for outside labs to establish which results are actually generalizable, a few independent publications are needed in the first place; once established, many more outside labs are indeed likely to further consider those results. I have no firm knowledge about the relationship between generalizable results and outside labs, but I would not be surprised if they correlate in the long run when results are globally known to be generalizable, with the caveat that independent labs appear to initially be a condition for establishing generalizability.

Overall, I think, some of our findings – results generalizable in L1000 as well as results supported in multiple publications are both more likely to replicate – do suggest a process of self-correction in which the community learns what works and what does not work, not necessarily orthogonal to the observation that outside labs would select generalizable results (if/when known), but connectivity also plays a role as it can foster or impede self-correction and learning. Of course, as one of the reviewers suggested, the findings are of associational rather than of causal nature.

Interesting. My concerns were generic, involving the statistics of causal inference. The responses are focused on the specific questions they are studying, and it would be hard for me to evaluate these arguments without putting in some extra effort to understand these details.