A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Wednesday, January 13, 2016

The Replication Value: What should be replicated?

Researchers are often reminded that replications are a cornerstone of empirical science (e.g., Koole & Lakens, 2012). However, we don’t need to regard every replication as equally valuable. Although most researchers will agree that a journal editor who rejects a manuscript reporting 20 high-powered direct replications of the Stroop effect (Stroop, 1935) is making the right decision, they also know that some replications are worth performing and publishing. Cumulative scientific knowledge requires a balance between original research and close replications of important findings.

When is a close replication of an empirical finding valuable enough to the scientific community to justify performing and publishing it? This is an important question for any science that operates within financial and time constraints. Some years ago, I started a project on The Replication Value. The goal was to create a quantitative and objective index of the value and importance of a close replication. The Replication Value can guide decisions about what to replicate directly. It can serve as a tool both for researchers, to assess whether time and resources should be spent on replicating a finding, and for journal editors, to help determine whether close replications should be considered for publication.

Developing a formula that can quantify the value of a replication is an interesting challenge. I realized I needed more knowledge of statistics before I could contribute, and even though we were working with a pretty large team, I think it would be better still if more people contributed suggestions.

Courtney Soderberg and Charlie Ebersole have now taken over the coordination of this project, and from now on, anyone who feels like contributing to this important question can generate candidate formulas. Read more about how to contribute here. Want to demonstrate that the replication value can only be computed using Bayesian statistics? Convinced we need to rely on estimation instead? Show us the best way to quantify the value of replications, and earn authorship on what will no doubt be a nice paper in the end.

My approach

I’m not going to give away my approach completely – I don’t want to limit the creativity of others – but I want to give some pointers to get people started.

I think at least two components determine the Replication Value of empirical findings: the impact of the effect and the precision of the effect size estimate. Quantifying the impact of studies is notoriously difficult, but I think citation counts are an easy-to-use proxy. Because more data yields a better estimate of the population effect size, sample size is a dominant factor in precision (Borenstein, Hedges, Higgins, & Rothstein, 2009). The larger the sample, the lower the variance of the effect size estimate, which leads to a narrower confidence interval around the effect size estimate. Take a correlation r as an example of how to quantify precision. The confidence interval for r is calculated by first transforming r to Fisher’s z:

z = 0.5 × ln((1 + r)/(1 - r))

A very good approximation of the variance of z is:

Vz = 1/(n - 3)

The confidence interval can then be calculated in the usual way, on the z scale:

95% CI = z ± 1.96 × √(Vz)

The values acquired through this procedure can be transformed back to r using:

r= (e^(2 × z)-1)/(e^(2 × z)+1)

where z is the upper or lower boundary of the 95% CI on the Fisher z scale.
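The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the language, the function name fisher_ci, and the example values r = 0.3 and n = 50 are hypothetical choices, not part of the Replication Value project. The interval is built on the z scale, and only its boundaries are transformed back to r.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation r from n observations.

    Hypothetical helper for illustration; uses the 1/(n - 3)
    approximation of the variance of Fisher's z.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be larger than 3")

    # Step 1: transform r to Fisher's z.
    z = 0.5 * math.log((1 + r) / (1 - r))

    # Step 2: approximate variance of z, and its standard error.
    v_z = 1.0 / (n - 3)
    se_z = math.sqrt(v_z)

    # Step 3: confidence interval on the z scale.
    z_lower = z - z_crit * se_z
    z_upper = z + z_crit * se_z

    # Step 4: transform both boundaries back to the r scale.
    def z_to_r(value):
        return (math.exp(2 * value) - 1) / (math.exp(2 * value) + 1)

    return z_to_r(z_lower), z_to_r(z_upper)


# Example values chosen only for illustration.
lower, upper = fisher_ci(0.3, 50)
print(f"95% CI for r = 0.3, n = 50: [{lower:.2f}, {upper:.2f}]")
```

With these example values the interval runs from roughly 0.02 to 0.53, and it is not symmetric around r because the back-transformation is non-linear.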

By expressing the width of the confidence interval around an effect size estimate as a percentage of the total possible width of the confidence interval, we have an index of the precision of the effect size estimate. I call this index the ‘Spielraum’, or the playing field, because it is conceptually similar to the precision of a theoretical prediction in Meehl’s (1990) work on appraising theories.
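One possible reading of this index, sketched in Python, is shown below. It rests on assumptions that the post does not spell out: that the effect size is a correlation r, that the "total possible width" is therefore 2 (from -1 to 1), and that the confidence interval comes from the Fisher z procedure described earlier. The function name spielraum_percent is hypothetical.

```python
import math


def spielraum_percent(r, n, z_crit=1.96):
    """CI width for r as a percentage of the full range of r.

    Assumption: the total possible width is 2, because a
    correlation can range from -1 to 1.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be larger than 3")

    # math.atanh(r) equals 0.5 * ln((1 + r) / (1 - r)), Fisher's z.
    z = math.atanh(r)
    se_z = math.sqrt(1.0 / (n - 3))

    # math.tanh is the inverse transformation back to r.
    lower = math.tanh(z - z_crit * se_z)
    upper = math.tanh(z + z_crit * se_z)

    total_possible_width = 2.0  # assumed range of r: -1 to 1
    return 100.0 * (upper - lower) / total_possible_width


# Illustrative values only: the same r estimated with two sample sizes.
for n in (50, 500):
    print(n, round(spielraum_percent(0.3, n), 1))
```

Under these assumptions a smaller percentage means a more precise estimate, so the index shrinks as the sample size grows.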

Now the tricky thing is how these two factors interact and determine the replication value. While I go back to solving that question, perhaps you want to propose a completely different approach. I mean, really, this is a question that requires Bayesian statistics, right? Are citation counts the absolute worst way to quantify impact?

See how to contribute here: https://docs.google.com/document/d/1ufO7gwwI2rI7PnESn4wDA-pLcns46NyE7Pp-3zG3PN8/edit

I really look forward to your suggestions.


  1. In Bayesian statistics you would calculate the Kullback-Leibler (KL) divergence between the prior and the posterior. To cite Wikipedia, KL is "a measure of the information gain in moving from a prior distribution to a posterior distribution". You can compare KL across replication studies and select the study with the highest information gain. The problem is that you can arbitrarily inflate the index by selecting an uninformative prior for your favorite study. This problem also affects your variance measure (why take the "total possible width" of the CI instead of a density that corresponds to prior knowledge? Which density expresses the prior knowledge best?). The whole idea can work, but it requires a serious effort from the researcher to express and justify the prior knowledge. This only underscores that without a concept of prior knowledge, concepts such as surprise, information value, or impact of scientific work are meaningless.

    1. I agree it's a challenge, and one that would really be worth applying your expertise to. You'll gain co-authorship, but more importantly, you'll help develop a tool that could really improve cumulative science.

    2. Alright, if I find the time I will make a submission :)

    3. Cool! I look forward to it and I'm sure you will bring a very important perspective to the table!

  2. Hi Daniel,
    I think that a precise estimate of the available evidence related to a specific phenomenon or line of research is not difficult to achieve, and your proposal and/or a Bayesian analogue is at hand.
    By contrast, I cannot figure out how to derive a mathematical formula for the "worth replicating" side of the problem without considering all the sociological, applied, economic, ideological, etc., components intrinsic to scientific research.
    I will try to list some of the criteria, in no hierarchical order:
    - phenomena described in the main textbooks for undergraduate and graduate students, which are usually taken for granted;
    - phenomena which could change some mainstream paradigms, e.g. cognitive neuroscience will replace cognitive psychology; the human mind shows quantum-like phenomena, etc.;
    - applications which can reduce terrorist attacks or energy waste, improve physical and mental health while reducing the state resources they require, etc.

    What do you think?