Paul Simon says there are 50 ways to leave your lover, and a similar abundance characterizes the contemporary literature on how to analyse a replication study. The recent SCORE replication project in Nature makes this explicit by reporting “13 replication success metrics along with the number of papers to which each metric could be applied” (Tyner et al., 2026). The reason for this, as the authors note, is that “No singular success metric has been accepted as optimal and universally applicable in the literature.” With so many approaches to choose from, it should not be a surprise that the inferences based on the studies vary substantially. As the authors report: “Thirteen methods for evaluating replication success provided estimates ranging from 28.6% to 74.8% (median of 49.3%).”
Although it is
understandable for such a massive project to refrain from strongly opinionated
choices, I think we can all agree that we do not want to compute a median for
different approaches to analyse replication success, and just like we don’t
want 13 ways to analyse an original study, we don’t want 13 ways to analyse a
replication study. The current state of how to analyse replication studies is not so much an embarrassment of riches as an embarrassment caused by the lack of a coherent epistemology. In science, we don’t compute statistics just because we can. We compute statistics because they help us to generate knowledge. In my
experience, no “singular” anything in science is ever “accepted” – and especially not an approach to statistical inference (as the never-ending disagreement between Bayesians and frequentists illustrates). Instead of waiting for
agreement, we should think about which approach to the analysis of a
replication study coherently follows from our epistemology.
Two metrics do the
bulk of the epistemic work in the replication project. The authors are explicit
that “Primary reporting emphasizes the two most reported replication metrics:
statistical significance and effect size comparisons.” The authors are right to
focus on these two measures, as these two inferences – whether the p-value is significant,
and whether the effect sizes are similar – reflect what researchers treat as
substantive claims in studies. In our recent paper ‘How to analyze a replication study’ we point out that these two inferences are directly related
to basic statements that scientists make: whether the same ordinal decision can
be made in the replication, and whether the magnitude of the effect
meaningfully corresponds to the original estimate (Lakens et al.,
2026).
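To make these two inferences concrete, here is a minimal sketch in Python. This is not the SCORE analysis pipeline: the data are simulated, and the 0.3 similarity margin for the effect sizes is an arbitrary placeholder that a researcher would have to justify.

```python
# Minimal sketch of the two inferences: (1) can the same ordinal claim be made
# at the original alpha level, and (2) does the replication effect size
# meaningfully correspond to the original estimate? All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2026)
alpha = 0.05  # the prespecified alpha level of the original study

# Hypothetical two-group data for an original study and a larger replication.
orig_a, orig_b = rng.normal(0.5, 1, 50), rng.normal(0.0, 1, 50)
rep_a, rep_b = rng.normal(0.5, 1, 200), rng.normal(0.0, 1, 200)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(x) - 1) * np.var(x, ddof=1) +
                  (len(y) - 1) * np.var(y, ddof=1)) / (len(x) + len(y) - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Inference 1: is the replication significant at the same alpha level, in the
# same direction as the original result?
t_orig = stats.ttest_ind(orig_a, orig_b)
t_rep = stats.ttest_ind(rep_a, rep_b)
same_claim = (t_rep.pvalue < alpha and
              np.sign(t_rep.statistic) == np.sign(t_orig.statistic))

# Inference 2: a crude magnitude check; 0.3 is an arbitrary placeholder margin.
d_orig, d_rep = cohens_d(orig_a, orig_b), cohens_d(rep_a, rep_b)
similar_magnitude = abs(d_orig - d_rep) < 0.3

print(f"Same ordinal claim at alpha = {alpha}: {same_claim}")
print(f"d_orig = {d_orig:.2f}, d_rep = {d_rep:.2f}, similar: {similar_magnitude}")
```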
While the authors understandably
do not make too strong a point about their focus on two of the 13 metrics, their
decision is not just based on which measures are commonly reported but also
based on an epistemically coherent approach to the analysis of a replication
study. Replication projects like SCORE are unmistakably designed within a
methodological falsificationist framework, even when this philosophical
background is not mentioned explicitly. In a falsificationist framework, which
is implemented through the Neyman-Pearson approach to statistical inference, error
rates are tightly controlled, and claims are made based on the results of a hypothesis
test. It is clear the SCORE authors followed a methodological falsificationist
framework and the Neyman-Pearson approach to statistics as they preregistered
(allowing others to evaluate how severely they controlled their error rates), designed
studies with high statistical power, and set a prespecified alpha level.
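As an illustration of what such a Neyman-Pearson design decision looks like in practice, the sketch below solves for the sample size of a two-group replication given a prespecified alpha level and a power target. The effect size and the targets are assumptions chosen for the example, not values used by SCORE.

```python
# Sketch of a Neyman-Pearson design decision: fix alpha and the desired power
# in advance, then solve for the per-group sample size of a replication that
# compares two independent groups. The smallest effect size of interest (0.4)
# is an assumption for this example, not a value taken from SCORE.
import math
from statsmodels.stats.power import TTestIndPower

alpha = 0.05         # prespecified Type I error rate
target_power = 0.90  # desired power, i.e., a 10% Type II error rate

n_per_group = TTestIndPower().solve_power(
    effect_size=0.4,           # smallest effect size of interest (Cohen's d)
    alpha=alpha,
    power=target_power,
    alternative="two-sided",
)
print(f"Required sample size per group: {math.ceil(n_per_group)}")
```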
From our perspective,
the central goal of a replication study is not to ask a new statistical
question, but to revisit the same question as in the original study, after
collecting new data. As we argue, “the goal of a replication study is to
examine whether the same basic statement that was made in an original study can
also be made when new data are collected in a faithful empirical realization of
the original test.” Within a methodological falsificationist framework,
original findings enter the literature as tentative basic statements, justified
by a statistical decision rule with known error rates. Replication studies
function as a critical check on whether such statements were merely the result
of a Type I error. As we note, “one of the main functions of a replication
study is to determine the stability of a finding, which should not be present
if the original finding was a Type I error.” Even though replications
inevitably change many auxiliary conditions—because, as Popper already
observed, experiments can never be repeated exactly—this does not undermine
their epistemic role. Instead, replications test whether the original claim
survives variation in factors that were implicitly relegated to the ceteris
paribus clause, thereby probing whether the original inference was robust.
This view has direct
implications for how replication studies should be analysed. If original claims
were made by rejecting a null hypothesis at a prespecified alpha level, then “the
replication study should evaluate whether the same claim can be made using the
same inferential criterion.” And if researchers also treated the original
effect size estimate as part of the claim (which they seem to do in practice), then
replication studies must additionally examine whether effect sizes differ
meaningfully. While the other eleven approaches to assessing replication
success are statistically legitimate and can certainly be reported, they answer
different questions and rest on different epistemic commitments. The coherent
approach is the one that aligns epistemological aims, philosophy of science,
and statistical procedure. Statistics follow from the questions we ask, the
questions we ask follow from our philosophy of science, and without such
coherence we cannot have a principled approach to the analysis of replication
studies. Maybe it is time to admit that there are 13 statistical quantities that can be computed for a replication study, but that there is only one way to analyse such a study coherently from a specific philosophy of scientific knowledge generation.
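To make the effect-size comparison above concrete: one way to operationalize “do the effect sizes differ meaningfully” is an equivalence test on the difference between the two estimates, which replaces the crude margin check in the earlier sketch with an actual test. The sketch below uses a common large-sample approximation for the sampling variance of Cohen’s d and two one-sided z-tests; the equivalence margin of 0.3 is an assumption a researcher would need to justify, not a value from either paper.

```python
# Sketch of an equivalence test (TOST) on the difference between an original
# and a replication Cohen's d, using a large-sample variance approximation.
# All inputs are illustrative; the margin is an assumption.
import numpy as np
from scipy import stats

def var_d(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

d_orig, n1_o, n2_o = 0.45, 50, 50    # hypothetical original study
d_rep, n1_r, n2_r = 0.30, 200, 200   # hypothetical replication
margin = 0.3                         # equivalence margin (assumed)

diff = d_orig - d_rep
se_diff = np.sqrt(var_d(d_orig, n1_o, n2_o) + var_d(d_rep, n1_r, n2_r))

# Two one-sided tests: the difference is "not meaningful" only if it is
# significantly above -margin AND significantly below +margin.
p_lower = stats.norm.sf((diff + margin) / se_diff)   # H0: diff <= -margin
p_upper = stats.norm.cdf((diff - margin) / se_diff)  # H0: diff >= +margin
p_tost = max(p_lower, p_upper)

print(f"difference in d = {diff:.2f}, TOST p = {p_tost:.3f}")
print("effect sizes equivalent within margin" if p_tost < 0.05
      else "cannot conclude equivalence")
```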
References
Lakens, D., Leach, N. M., Haans, A., Uygun Tunç, D., & Tunç, M. N. (2026). How to analyze a replication study. PsyArXiv. https://osf.io/preprints/psyarxiv/7ydgu_v1/

Tyner, A. H., Abatayo, A. L., Daley, M., Field, S., Fox, N., Haber, N. A., Hahn, K. M., Struhl, M. K., Mawhinney, B., Miske, O., Silverstein, P., Soderberg, C. K., Stankov, T., Abbasi, A., Aberson, C. L., Aczel, B., Adamkovič, M., Albayrak, N., Allen, P. J., … Errington, T. M. (2026). Investigating the replicability of the social and behavioural sciences. Nature, 652(8108), 143–150. https://doi.org/10.1038/s41586-025-10078-y