Paul Simon says there are 50 ways to leave your lover, and a similar abundance characterizes the contemporary literature on how to analyse a replication study. The recent SCORE replication project in Nature makes this explicit by reporting “13 replication success metrics along with the number of papers to which each metric could be applied” (Tyner et al., 2026). The reason for this, as the authors note, is that “No singular success metric has been accepted as optimal and universally applicable in the literature.” With so many approaches to choose from, it should not be a surprise that the inferences based on the studies vary substantially. As the authors report: “Thirteen methods for evaluating replication success provided estimates ranging from 28.6% to 74.8% (median of 49.3%).”
Although it is
understandable for such a massive project to refrain from strongly opinionated
choices, I think we can all agree that we do not want to compute a median for
different approaches to analyse replication success, and just like we don’t
want 13 ways to analyse an original study, we don’t want 13 ways to analyse a
replication study. The current state of how to analyse replication studies is not so much an embarrassment of riches as an embarrassment caused by the lack of a coherent epistemology. In science, we don’t compute statistics just because we can. We compute statistics because they help us to generate knowledge. In my
experience, no “singular” anything in science is ever “accepted” – and especially not an approach to statistical inference (as the never-ending disagreement between Bayesians and frequentists illustrates). Instead of waiting for
agreement, we should think about which approach to the analysis of a
replication study coherently follows from our epistemology.
Two metrics do the
bulk of the epistemic work in the replication project. The authors are explicit
that “Primary reporting emphasizes the two most reported replication metrics:
statistical significance and effect size comparisons.” The authors are right to
focus on these two measures, as these two inferences – whether the p-value is significant,
and whether the effect sizes are similar – reflect what researchers treat as
substantive claims in studies. In our recent paper ‘How to analyze a replication study’ we point out that these two inferences are directly related
to basic statements that scientists make: whether the same ordinal decision can
be made in the replication, and whether the magnitude of the effect
meaningfully corresponds to the original estimate (Lakens et al.,
2026).
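To make these two inferences concrete, here is a minimal sketch in Python. This is not the SCORE analysis pipeline: the data are simulated, and the 0.3 similarity margin for the effect sizes is an arbitrary placeholder that a researcher would have to justify.

```python
# Minimal sketch of the two inferences: (1) can the same ordinal claim be made
# at the original alpha level, and (2) does the replication effect size
# meaningfully correspond to the original estimate? All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2026)
alpha = 0.05  # the prespecified alpha level of the original study

# Hypothetical two-group data for an original study and a larger replication.
orig_a, orig_b = rng.normal(0.5, 1, 50), rng.normal(0.0, 1, 50)
rep_a, rep_b = rng.normal(0.5, 1, 200), rng.normal(0.0, 1, 200)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(x) - 1) * np.var(x, ddof=1) +
                  (len(y) - 1) * np.var(y, ddof=1)) / (len(x) + len(y) - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Inference 1: is the replication significant at the same alpha level, in the
# same direction as the original result?
t_orig = stats.ttest_ind(orig_a, orig_b)
t_rep = stats.ttest_ind(rep_a, rep_b)
same_claim = (t_rep.pvalue < alpha and
              np.sign(t_rep.statistic) == np.sign(t_orig.statistic))

# Inference 2: a crude magnitude check; 0.3 is an arbitrary placeholder margin.
d_orig, d_rep = cohens_d(orig_a, orig_b), cohens_d(rep_a, rep_b)
similar_magnitude = abs(d_orig - d_rep) < 0.3

print(f"Same ordinal claim at alpha = {alpha}: {same_claim}")
print(f"d_orig = {d_orig:.2f}, d_rep = {d_rep:.2f}, similar: {similar_magnitude}")
```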
While the authors understandably
do not make too strong a point about their focus on two of the 13 metrics, their
decision is not just based on which measures are commonly reported but also
based on an epistemically coherent approach to the analysis of a replication
study. Replication projects like SCORE are unmistakably designed within a
methodological falsificationist framework, even when this philosophical
background is not mentioned explicitly. In a falsificationist framework, which
is implemented through the Neyman-Pearson approach to statistical inference, error
rates are tightly controlled, and claims are made based on the results of a hypothesis
test. It is clear the SCORE authors followed a methodological falsificationist
framework and the Neyman-Pearson approach to statistics as they preregistered
(allowing others to evaluate how severely they controlled their error rates), designed
studies with high statistical power, and set a prespecified alpha level.
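As an illustration of what such a Neyman-Pearson design decision looks like in practice, the sketch below solves for the sample size of a two-group replication given a prespecified alpha level and a power target. The effect size and the targets are assumptions chosen for the example, not values used by SCORE.

```python
# Sketch of a Neyman-Pearson design decision: fix alpha and the desired power
# in advance, then solve for the per-group sample size of a replication that
# compares two independent groups. The smallest effect size of interest (0.4)
# is an assumption for this example, not a value taken from SCORE.
import math
from statsmodels.stats.power import TTestIndPower

alpha = 0.05         # prespecified Type I error rate
target_power = 0.90  # desired power, i.e., a 10% Type II error rate

n_per_group = TTestIndPower().solve_power(
    effect_size=0.4,           # smallest effect size of interest (Cohen's d)
    alpha=alpha,
    power=target_power,
    alternative="two-sided",
)
print(f"Required sample size per group: {math.ceil(n_per_group)}")
```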
From our perspective,
the central goal of a replication study is not to ask a new statistical
question, but to revisit the same question as in the original study, after
collecting new data. As we argue, “the goal of a replication study is to
examine whether the same basic statement that was made in an original study can
also be made when new data are collected in a faithful empirical realization of
the original test.” Within a methodological falsificationist framework,
original findings enter the literature as tentative basic statements, justified
by a statistical decision rule with known error rates. Replication studies
function as a critical check on whether such statements were merely the result
of a Type I error. As we note, “one of the main functions of a replication
study is to determine the stability of a finding, which should not be present
if the original finding was a Type I error.” Even though replications
inevitably change many auxiliary conditions—because, as Popper already
observed, experiments can never be repeated exactly—this does not undermine
their epistemic role. Instead, replications test whether the original claim
survives variation in factors that were implicitly relegated to the ceteris
paribus clause, thereby probing whether the original inference was robust.
This view has direct
implications for how replication studies should be analysed. If original claims
were made by rejecting a null hypothesis at a prespecified alpha level, then “the
replication study should evaluate whether the same claim can be made using the
same inferential criterion.” And if researchers also treated the original
effect size estimate as part of the claim (which they seem to do in practice), then
replication studies must additionally examine whether effect sizes differ
meaningfully. While the other eleven approaches to assessing replication
success are statistically legitimate and can certainly be reported, they answer
different questions and rest on different epistemic commitments. The coherent
approach is the one that aligns epistemological aims, philosophy of science,
and statistical procedure. Statistics follow from the questions we ask, the
questions we ask follow from our philosophy of science, and without such
coherence we cannot have a principled approach to the analysis of replication
studies. Maybe it is time to admit that there are 13 statistical quantities that can be computed for a replication study, but that there is only one way to analyse such a study coherently from a specific philosophy of scientific knowledge generation.
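To make the effect-size comparison above concrete: one way to operationalize “do the effect sizes differ meaningfully” is an equivalence test on the difference between the two estimates, which replaces the crude margin check in the earlier sketch with an actual test. The sketch below uses a common large-sample approximation for the sampling variance of Cohen’s d and two one-sided z-tests; the equivalence margin of 0.3 is an assumption a researcher would need to justify, not a value from either paper.

```python
# Sketch of an equivalence test (TOST) on the difference between an original
# and a replication Cohen's d, using a large-sample variance approximation.
# All inputs are illustrative; the margin is an assumption.
import numpy as np
from scipy import stats

def var_d(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

d_orig, n1_o, n2_o = 0.45, 50, 50    # hypothetical original study
d_rep, n1_r, n2_r = 0.30, 200, 200   # hypothetical replication
margin = 0.3                         # equivalence margin (assumed)

diff = d_orig - d_rep
se_diff = np.sqrt(var_d(d_orig, n1_o, n2_o) + var_d(d_rep, n1_r, n2_r))

# Two one-sided tests: the difference is "not meaningful" only if it is
# significantly above -margin AND significantly below +margin.
p_lower = stats.norm.sf((diff + margin) / se_diff)   # H0: diff <= -margin
p_upper = stats.norm.cdf((diff - margin) / se_diff)  # H0: diff >= +margin
p_tost = max(p_lower, p_upper)

print(f"difference in d = {diff:.2f}, TOST p = {p_tost:.3f}")
print("effect sizes equivalent within margin" if p_tost < 0.05
      else "cannot conclude equivalence")
```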
References
Lakens, D., Leach, N. M., Haans, A., Uygun Tunç, D., & Tunç, M. N. (2026). How to analyze a replication study. PsyArXiv. https://osf.io/preprints/psyarxiv/7ydgu_v1/

Tyner, A. H., Abatayo, A. L., Daley, M., Field, S., Fox, N., Haber, N. A., Hahn, K. M., Struhl, M. K., Mawhinney, B., Miske, O., Silverstein, P., Soderberg, C. K., Stankov, T., Abbasi, A., Aberson, C. L., Aczel, B., Adamkovič, M., Albayrak, N., Allen, P. J., … Errington, T. M. (2026). Investigating the replicability of the social and behavioural sciences. Nature, 652(8108), 143–150. https://doi.org/10.1038/s41586-025-10078-y