The Open
Science Collaboration has completed 100 replication studies of findings
published in the scientific literature, and the results
are available. The studies that replicated have become much more likely to be true, but we are left with some questions about what it means that many studies did not replicate. This is a very rich dataset, and although there can be many reasons a finding does not replicate, I wanted to examine one concern. Studies in the Reproducibility Project were well powered for the effect sizes observed in the original studies. But we know effect sizes in the published literature are often overestimated. So is it possible that most of the replication studies that did not yield significant results actually examined much smaller effects, and thus lacked power?
The table
below (from the article in Science) summarizes some of the results. There is a nice range of interpretations (even though I'll focus a lot on the p < 0.05 criterion in this post). The probability
of observing a statistically significant effect, if there is an effect to be
found, depends on the statistical power of a study. The ‘average replication
power’ provides estimates of the statistical power of the replication studies, assuming the effect size estimate in the
original study was exactly the true effect size.
As the Open
Science Collaboration (including myself) write: “On the basis of only the
average replication power of the 97 original, significant effects [M = 0.92,
median (Mdn) = 0.95], we would expect approximately 89 positive results in the replications
if all original effects were true and accurately estimated.”
With 35 significant effects out of the 89 expected, we get an approximately 40% replication rate. But we have very
good reasons to believe that not all original effect sizes were accurately
estimated, and that the average power of the replications was therefore lower (Shravan Vasishth called this 'power inflation' earlier today). And when the
average power is lower, fewer findings are expected to replicate, which means
the replication success is relatively higher (i.e., instead of 35 out of 89, we got 35 out of some number lower than 89).
When there
is severe publication bias, effect sizes are overestimated. We can examine whether
there is publication bias in the original studies in a meta-analysis (below, I follow
one meta-analysis by the data analysis team and look at the studies that reported t-tests and F-tests, 73 out of the 100). Effect sizes observed in studies
should be independent of their standard errors, but when there is publication bias,
they are not. There is a funnel plot of these 73
original studies on the OSF, but I prefer contour-enhanced funnel plots,
which I made by first running the (absolutely amazing - I'm serious, check out the work they put into this R script!) masterscript
for the data analysis, and then running:
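(The exact call is not reproduced here; the following is a minimal sketch of how such a contour-enhanced funnel plot can be drawn with metafor, assuming the masterscript leaves you with a fitted random-effects model, hypothetically named rma_fit, for the 73 effects.)

library(metafor)
# Contour-enhanced funnel plot: pseudo-confidence contours at 90%, 95%, and 99%,
# centered at zero (refline = 0) rather than at the meta-analytic estimate.
# 'rma_fit' is a hypothetical name for the rma() model object fitted to the 73 effects.
funnel(rma_fit, level = c(90, 95, 99), shade = c("white", "gray55", "gray75"), refline = 0)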
A contour-enhanced funnel plot makes it even more strikingly clear that almost all original studies observed a statistically significant effect. This is surprising, given that the original sample sizes were much smaller than those in the replication attempts (and the replication studies had 92% power, based on the original effect sizes). This is also clear from the distribution of the effects: small studies (with large standard errors, at the bottom of the plot) have large effect sizes (because otherwise they would not be statistically significant), while larger studies (at the top) have smaller effect sizes (but still just large enough to be statistically significant, or fall outside of the white triangle).
A trim and fill analysis is often used to examine whether there are missing studies. Now we are grouping together 73 completely different and highly heterogeneous effects, so the following numbers should be interpreted in light of huge heterogeneity, but we can perform this analysis using:
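(Again a sketch rather than the exact call: metafor's trimfill() applied to the same hypothetical rma_fit object.)

library(metafor)
# Trim-and-fill as a sensitivity analysis: impute the studies assumed to be missing
# due to publication bias and re-estimate the meta-analytic effect.
taf <- trimfill(rma_fit)
summary(taf)  # number of imputed studies and the adjusted estimate
funnel(taf)   # funnel plot in which the imputed studies appear as open (white) points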
Trim-and-fill
analysis can only be used as a sensitivity analysis (it does not provide
accurate effect sizes or estimates of the actual number of missing studies),
but it clearly shows studies are missing (there are 29 white dots in the
trim-and-fill funnel plot, which represent the studies assumed to be missing),
and reports a meta-analytic effect size estimate of r = 0.28 (instead of r = 0.42)
based on these hypothetical missing studies. This does not mean r = 0.28 is the true effect size, but it’s
probably close (a meta-analysis of meta-analyses estimated the average effect
size in psychology at r = 0.21, so we might be in the ballpark).
The
difference between the biased and unbiased effect size is substantial, and this
means power could very reasonably be somewhat lower than 0.92. There’s not much
the Reproducibility Project could do about publication bias (e.g., there is no
fool-proof statistical technique to estimate unbiased effect sizes).
The solution should come from us: We should publish all our effects, regardless
of their significance level. If we don’t, we are sabotaging cumulative science.
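To get a feel for how much such bias can matter for power, here is a minimal sketch using the pwr package, with a purely hypothetical sample size of n = 50 (not the actual replication sample sizes), comparing power for the meta-analytic estimate of r = 0.42 and the trim-and-fill adjusted estimate of r = 0.28:

library(pwr)
# Power of a two-sided correlation test at a hypothetical n = 50,
# based on the original (possibly inflated) meta-analytic estimate:
pwr.r.test(n = 50, r = 0.42, sig.level = 0.05)
# Power at the same n if the trim-and-fill adjusted estimate is closer to the truth:
pwr.r.test(n = 50, r = 0.28, sig.level = 0.05)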
However: power only matters when there is a true effect. An unknown percentage of studies did not replicate because they were originally false positives, and there simply is no true effect to be found (i.e., the true effect size is 0). It is difficult to tease apart failed replications due to low power from failed replications of original studies that were false positives, and again, this is a very heterogeneous set of studies. But a look at the p-value distribution is interesting, and we can plot it with:
# Select the replication p-values for study pairs where both the original and the
# replication p-value are available (99 studies)
pdist <- MASTER$T_pval_USE..R.[!is.na(MASTER$T_pval_USE..O.) & !is.na(MASTER$T_pval_USE..R.)]
# Plot the distribution in 20 bins of width 0.05, so all p < 0.05 fall in the left-most bin
hist(pdist, breaks = 20)
# Dotted reference line at the 3.4 p-values per bin expected if all non-significant
# results came from null effects (64 non-significant p-values over 19 bins)
abline(h = 3.4, lty = 3, col = "gray60")
The
histogram is divided into 20 bins, and the frequency of p-values in each bin is plotted. This means all significant results (p < 0.05) fall in the left-most bin.
If all non-significant studies examined no true effects, their p-values would be uniformly distributed, with about 3.4 studies in each bin (99 p-values are plotted, so 99 - 35 = 64 non-significant p-values spread over the 19 remaining bins). If we think of this p-value distribution as a mix of null
effects (uniformly distributed) and true effects (a skewed distribution highest
at low p-values), the distribution is
not a shallow curve (which would be a sign of low power; see p-value distributions as a function of power here). Instead, the distribution looks more like a sharp angle, which mirrors the p-value distribution of a set of highly powered experiments. It really looks like our power was very high (but we should remember we only have 100 data points). There will certainly be some replication studies that, with a much larger sample size, will reveal an effect. In general, it is extremely difficult (and requires huge sample sizes) to distinguish between a real but very small effect and no effect. But at least the distribution of p-values takes away the concern I had when I started this blog post: that the biased effect size estimates in the original studies undermined the power of the replication studies.
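To see why a sharp spike near zero signals high power rather than a pile of underpowered true effects, here is a small simulation sketch (again illustrative, not part of the project's scripts) comparing the p-value distribution for null effects with that for a large, highly powered true effect:

set.seed(42)
nsim <- 10000
# p-values from t-tests when the null is true (d = 0): uniformly distributed
p_null <- replicate(nsim, t.test(rnorm(30), rnorm(30))$p.value)
# p-values for a large true effect (d = 1, n = 30 per group, power > .95): piled up near zero
p_true <- replicate(nsim, t.test(rnorm(30, mean = 1), rnorm(30))$p.value)
par(mfrow = c(1, 2))
hist(p_null, breaks = 20, main = "Null effect (d = 0)", xlab = "p-value")
hist(p_true, breaks = 20, main = "True effect (d = 1)", xlab = "p-value")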
For now, it
means that the 35 effects (out of 97) that replicated have become quite a bit more likely to be true.
We have learned something about what predicts replicability. For example, at
least for some indicators of replication success, “Surprising
effects were less reproducible” (take note, journalists and editors of Psychological
Science!). For the studies that did not replicate, we have more data, which can
inform not just our statistical inferences, but also our theoretical
inferences. The Reproducibility Project demonstrates that large-scale collaborative
efforts can work, so if you still believe in an effect that did not replicate, get
some people together, collect enough data, and let me know what you find.
Daniel, this is interesting. One advantage of my analysis at the Splintered Mind, which you commented on, is that some of these concerns about power don't apply, since 38% of the attempted replications yielded statistically significantly lower effect sizes. If there were any power problems (as surely there were, see, e.g. my paragraph about statistically marginal trends), those power problems would cause this 38% to be an underestimate of the failure to replicate effect size, rather than an overestimate, right?
Hi Eric, to provide a quantifiable answer, I would need to make an estimate of the number of true effects and null effects. It is obvious many replications will yield lower effect sizes (whether significantly lower or not, I don't think it matters). There are two reasons: 1) some original studies were false positives, so the true effect size is 0, and 2) the original studies were on average biased (due to publication bias) and the replications show smaller effects (which are more accurate, but still variable). I think your analysis distinguishing the file-drawer problem from the invisibility problem is trying to do something that is not possible. The analysis gives no meaningful answers - but maybe I am missing something.