> Less-popularised findings from the "estimating the reproducibility" paper @Eli_Finkel #SPSP2016 pic.twitter.com/8CFJMbRhi8
> — Jessie Sun (@JessieSunPsych) January 28, 2016

I don’t think we should be interpreting this correlation at all, because it might very well be completely spurious. One important reason correlations can be spurious is the presence of different subgroups, as introductory statistics textbooks explain.

When we consider the Reproducibility Project (note: I’m a co-author of the paper), we can assume there are two subsets: one subgroup of experiments that examine true effects, and one subgroup of experiments that examine effects that do not exist. This logically implies that for one subgroup the true effect size is 0, while for the other the true effect size is some unknown larger value. Subgroups with different means are a classic situation in which spurious correlations emerge.

I find the best way to learn to understand statistics is through simulations. So let’s simulate 100 normally distributed effect sizes from original studies that are comparable to the 100 studies included in the Reproducibility Project, simulate 100 effect sizes for their replications, and correlate these. We create two subgroups. Forty studies will have true effects (e.g., d = 0.4), and for these the original and replication effect sizes will be correlated (e.g., r = 0.5). The remaining sixty studies will have a true effect size of d = 0, and a correlation between original and replication studies of r = 0. I’m not suggesting this reflects the truth of the studies in the Reproducibility Project – there’s no way to know. The parameters look sort of reasonable to me, but feel free to explore different parameter choices by running the code yourself.
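A minimal sketch of such a simulation (the original post used R; this Python/numpy version, with illustrative parameter names and an assumed sampling-error SD of 0.2, is my own re-creation, not the original code):

```python
import numpy as np

rng = np.random.default_rng(2016)

def simulate_studies(n_true=40, n_null=60, d=0.4, r=0.5, sd=0.2):
    """Simulate original/replication effect-size pairs for two subgroups.

    sd is an assumed standard error of an observed effect size.
    """
    # True effects: original and replication effect sizes drawn from a
    # bivariate normal centred on d, with correlation r between them.
    cov_true = sd**2 * np.array([[1.0, r], [r, 1.0]])
    true_pairs = rng.multivariate_normal([d, d], cov_true, size=n_true)
    # Null effects: both effect sizes centred on 0 and uncorrelated (r = 0).
    null_pairs = rng.multivariate_normal([0.0, 0.0], sd**2 * np.eye(2), size=n_null)
    pairs = np.vstack([true_pairs, null_pairs])
    return pairs[:, 0], pairs[:, 1]  # original, replication

original, replication = simulate_studies()
r_obs = np.corrcoef(original, replication)[0, 1]
print(f"correlation across all 100 studies: {r_obs:.2f}")
```

Rerunning this without a fixed seed shows how much the pooled correlation bounces around with only 100 studies.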

As you see, the pattern is perfectly expected, under reasonable assumptions, when 60% of the studies are simulated to have no true effect. With a small N (100 studies gives a pretty unreliable correlation; see for yourself by running the code a few times), the spuriousness of the correlation might not be clear. So let’s simulate 100 times more studies.

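Scaling the same sketch up (again with my own illustrative parameters, not the original code) makes the point stable across runs: the pooled correlation stays large even though the correlation within the null subgroup hovers around zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n_true, n_null, d, r, sd = 4000, 6000, 0.4, 0.5, 0.2  # 100x more studies

cov_true = sd**2 * np.array([[1.0, r], [r, 1.0]])
true_pairs = rng.multivariate_normal([d, d], cov_true, size=n_true)
null_pairs = rng.multivariate_normal([0.0, 0.0], sd**2 * np.eye(2), size=n_null)

orig = np.concatenate([true_pairs[:, 0], null_pairs[:, 0]])
rep = np.concatenate([true_pairs[:, 1], null_pairs[:, 1]])

# The pooled correlation mixes two subgroups with different means...
r_all = np.corrcoef(orig, rep)[0, 1]
# ...while the correlation within the null subgroup is near zero by design.
r_null = np.corrcoef(null_pairs[:, 0], null_pairs[:, 1])[0, 1]
print(f"pooled r = {r_all:.2f}, r within null subgroup = {r_null:.2f}")
```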

Now the spuriousness becomes clear. The two groups differ in their means, and if we calculate the correlation over the entire sample, the *r* = 0.51 we get is not very meaningful (I cut off original studies at d = 0, to simulate publication bias and make the graph more similar to Figure 1 in the paper, but it doesn’t matter for the current point).

So: be careful when interpreting correlations in the presence of different subgroups. There’s no way to know what is going on. The correlation of 0.51 between effect sizes in original and replication studies might not mean anything.