A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, August 27, 2015

Power of replications in the Reproducibility Project

The Open Science Collaboration has completed 100 replication studies of findings published in the scientific literature, and the results are available. The replicated studies have become much more likely to be true, but we are left with some questions about what it means that many studies did not replicate. This is a very rich dataset, and although there can be many reasons a finding does not replicate, I wanted to examine one concern. Studies in the Reproducibility Project were well powered for the effect sizes observed in the original studies. But we know effect sizes in the published literature are often overestimated. So is it possible that most of the replication studies that did not yield significant results actually examined much smaller effects, and thus lacked power?

The table below (from the article in Science) summarizes some of the results. There is a nice range of interpretations (even though I'll focus a lot on the p < 0.05 criterion in this post). The probability of observing a statistically significant effect, if there is an effect to be found, depends on the statistical power of a study. The ‘average replication power’ provides estimates of the statistical power of the studies, assuming the effect size estimate in the original study was exactly the true effect size.


As the Open Science Collaboration (including myself) write: “On the basis of only the average replication power of the 97 original, significant effects [M = 0.92, median (Mdn) = 0.95], we would expect approximately 89 positive results in the replications if all original effects were true and accurately estimated.”

With 35 significant effects out of 89, we get a 39% replication rate. But we have very good reasons to believe that not all original effect sizes were accurately estimated, and that the average power of replications was lower (Shravan Vasishth called this 'power inflation' earlier today). And when the average power is lower, fewer findings are expected to replicate, which means the replication success is relatively higher (i.e., instead of 35 out of 89, 35 out of some number lower than 89 replicated).
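The arithmetic behind this benchmark can be sketched in a few lines of R. The 0.92 is the average replication power quoted above; the two lower power values are purely hypothetical and only illustrate how the benchmark shifts when the true average power is lower.

```r
# Number of original, significant effects and of significant replications
n_original <- 97
n_replicated <- 35

# Average power: 0.92 is reported in the paper, 0.80 and 0.60 are hypothetical
power <- c(0.92, 0.80, 0.60)

# Expected number of significant replications if all original effects are true
expected <- n_original * power

# Observed successes relative to the number we would expect
rate <- n_replicated / expected

print(data.frame(power, expected = round(expected, 1), rate = round(rate, 2)))
```

The lower the assumed power, the smaller the expected number of significant replications, and the better 35 successes look in comparison.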

When there is severe publication bias, effect sizes are overestimated. We can examine whether there is publication bias in the original studies in a meta-analysis (below, I follow one meta-analysis of the data analysis team and look at studies which reported t-tests and F-tests, 73 out of the 100). Effect sizes observed in studies should be independent of standard errors, but when there is publication bias, they are not. There is a funnel plot of these 73 original studies on the OSF, but I prefer contour-enhanced funnel plots, which I made by first running the (absolutely amazing - I'm serious, check out the work they put into this R script!) masterscript for the data analysis, and then running:

funnel(res, level=c(90, 95, 99), shade=c("white", "gray", "darkgray"), refline=0, main = "Funnel plot based on original studies")

A contour-enhanced funnel plot makes it more strikingly clear that almost all original studies observed a statistically significant effect. This is surprising, given that sample sizes were much smaller than in replication attempts (and the replication studies had 92% power, based on the original effect sizes). This is also clear from the distribution of the effects – small studies (with large standard errors, on the bottom of the plot) have large effect sizes (because otherwise they would not be statistically significant), while larger studies (at the top) have smaller effect sizes (but still just large enough to be statistically significant, or fall outside of the white triangle).
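Why selection on significance ties effect sizes to standard errors can be illustrated with a small simulation in base R. This is a minimal sketch under assumptions of my own choosing (a single true effect of d = 0.2, per-group sample sizes between 10 and 100, a z-test approximation); none of these numbers come from the Reproducibility Project data.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Hypothetical setup: 5000 two-group studies of the same true effect
k <- 5000
n <- sample(10:100, k, replace = TRUE)  # per-group sample size
d_true <- 0.2

# Approximate standard error of d for two groups of size n
se <- sqrt(2 / n)

# Observed effect sizes scatter around the true effect
d_obs <- rnorm(k, mean = d_true, sd = se)

# Two-sided p-values from a z-test approximation
p <- 2 * pnorm(-abs(d_obs / se))

# 'Publication bias': only significant, positive results get published
published <- p < 0.05 & d_obs > 0

# Correlation between effect size and standard error, before and after selection
cor_all <- cor(d_obs, se)
cor_published <- cor(d_obs[published], se[published])
print(c(all = cor_all, published = cor_published))
```

Without selection the correlation is close to zero; among the 'published' studies, small studies only make it through when their observed effect is large, so effect size and standard error become positively correlated.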

A trim and fill analysis is often used to examine whether there are missing studies. Now we are grouping together 73 completely different and highly heterogeneous effects, so the following numbers should be interpreted in light of huge heterogeneity, but we can perform this analysis using:

taf <- trimfill(res)
funnel(taf, level=c(90, 95, 99), shade=c("white", "gray", "darkgray"), refline=0, main = "Trim and Fill funnel plot based on original studies")

Trim-and-fill analysis can only be used as a sensitivity analysis (it does not provide accurate effect sizes or estimates of the actual number of missing studies), but it clearly shows studies are missing (there are 29 white dots in the trim-and-fill funnel plot, which represent the studies assumed to be missing), and reports a meta-analytic effect size estimate of r = 0.28 (instead of r = 0.42) based on these hypothetical missing studies. This does not mean r = 0.28 is the true effect size, but it’s probably close (a meta-analysis of meta-analyses estimated the average effect size in psychology at r = 0.21, so we might be in the ballpark).

The difference between the biased and unbiased effect size is substantial, and this means power could very reasonably be somewhat lower than 0.92. There’s not much the Reproducibility Project could do about publication bias (e.g., there is no foolproof statistical technique for obtaining unbiased effect size estimates). The solution should come from us: We should publish all our effects, regardless of their significance level. If we don’t, we are sabotaging cumulative science.
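To get a feeling for how much an overestimated effect size can cost in power, here is a rough sketch for a single hypothetical replication. It assumes a two-sided test of a correlation, uses the Fisher z approximation, and picks the sample size that gives at least 92% power for r = 0.42. The helper power_r is my own illustration, not part of the masterscript, and one study of course says little about an average over 97 heterogeneous studies.

```r
# Approximate power of a two-sided test of a correlation (Fisher z approximation)
power_r <- function(r, n, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  ncp <- atanh(r) * sqrt(n - 3)
  pnorm(ncp - z_crit) + pnorm(-ncp - z_crit)
}

# Smallest n with at least 92% power if the original estimate r = 0.42 is true
n_values <- 4:1000
n_planned <- n_values[which(power_r(0.42, n_values) >= 0.92)[1]]

# Power at that n under the original and the trim-and-fill estimate
power_if_original <- power_r(0.42, n_planned)
power_if_smaller <- power_r(0.28, n_planned)

print(c(n = n_planned, original = power_if_original, smaller = power_if_smaller))
```

Under these assumptions, a study planned for 92% power at r = 0.42 has clearly lower power if the true effect is r = 0.28.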

However: power only matters when there is a true effect. An unknown percentage of studies did not replicate, because they were originally a false positive, and there simply is no true effect to be found (i.e., the true effect size is 0). It is difficult to tease apart failed replications due to low power, and failed replications because the original studies were false positives, and again, this is a very heterogeneous set of studies. But a look at the p-value distribution is interesting, which we can plot with:

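# Replication p-values for studies where both the original and the replication p-value are available
pdist <- MASTER$T_pval_USE..R.[!is.na(MASTER$T_pval_USE..O.) & !is.na(MASTER$T_pval_USE..R.)]
# 20 bins of width 0.05, so all significant results fall in the left-most bin
hist(pdist, breaks = seq(0, 1, by = 0.05))
# Expected count per bin if the 64 non-significant p-values were uniform (64 / 19 = 3.4)
abline(h = 3.4, lty = 3, col = "gray60")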

The histogram is divided into 20 bins, and the frequency of p-values in each bin is plotted. This means all significant results (p < 0.05) fall in the left-most bin. If all non-significant studies examined no true effects, the p-values would be uniformly distributed, with 3.4 studies in each bin (64 non-significant studies (there are 99 p-values plotted, so 99-35=64) in 19 remaining bins). If we think of this p-value distribution as a mix of null effects (uniformly distributed) and true effects (a skewed distribution highest at low p-values), the distribution is not a shallow curve (which would be a sign of low power, see p-value distributions as a function of power here). Instead, the distribution looks more like a sharp angle, which mirrors a p-value distribution from a set of highly powered experiments. It really looks like our power was very high (but we should remember we only have 100 datapoints). There will certainly be some replication studies that, with a much larger sample size, will reveal an effect. In general, it is extremely difficult (and requires huge sample sizes) to distinguish between a real but very small effect, and no effect. But at least the distribution of p-values takes away the concern I had when I started this blog post that the biased effect size estimates in the original studies affected the power in the replication studies.
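The 'sharp angle' versus 'shallow curve' contrast can be reproduced with a simulated mix of null and true effects. This is a minimal sketch: the proportion of null effects (0.6), the two power levels (0.92 and 0.35), and the number of simulated studies are all hypothetical choices, and the helper sim_p is my own, based on a z-test approximation.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulate two-sided p-values for a mix of null effects and true effects
sim_p <- function(n_studies, prop_null, power) {
  z_crit <- qnorm(0.975)
  # Noncentrality that yields the requested power (ignoring the opposite tail)
  ncp <- z_crit + qnorm(power)
  is_null <- runif(n_studies) < prop_null
  z <- rnorm(n_studies, mean = ifelse(is_null, 0, ncp))
  2 * pnorm(-abs(z))
}

n_sim <- 10000
breaks <- seq(0, 1, by = 0.05)  # 20 bins, as in the histogram above
p_high <- sim_p(n_sim, prop_null = 0.6, power = 0.92)
p_low <- sim_p(n_sim, prop_null = 0.6, power = 0.35)

counts_high <- hist(p_high, breaks = breaks, plot = FALSE)$counts
counts_low <- hist(p_low, breaks = breaks, plot = FALSE)$counts

# Share of p-values just above the significance threshold (second bin)
prop_second_high <- counts_high[2] / n_sim
prop_second_low <- counts_low[2] / n_sim

# Reference line used above: 64 non-significant p-values over 19 bins
per_bin <- (99 - 35) / 19

print(round(c(high = prop_second_high, low = prop_second_low, per_bin = per_bin), 3))
```

With high power, the true effects pile up in the first bin and the remaining bins are almost flat; with low power, many true effects spill into the bins just above 0.05, which produces the shallow curve. Replacing plot = FALSE with plot = TRUE draws the two histograms.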

For now, it means 35 out of 97 replicated effects have become quite a bit more likely to be true. We have learned something about what predicts replicability. For example, at least for some indicators of replication success, “Surprising effects were less reproducible” (take note, journalists and editors of Psychological Science!). For the studies that did not replicate, we have more data, which can inform not just our statistical inferences, but also our theoretical inferences. The Reproducibility Project demonstrates that large-scale collaborative efforts can work, so if you still believe in an effect that did not replicate, get some people together, collect enough data, and let me know what you find.


  1. Daniel, this is interesting. One advantage of my analysis at the Splintered Mind, which you commented on, is that some of these concerns about power don't apply, since 38% of the attempted replications yielded statistically significantly lower effect sizes. If there were any power problems (as surely there were, see, e.g. my paragraph about statistically marginal trends), those power problems would cause this 38% to be an underestimate of the failure to replicate effect size, rather than an overestimate, right?

    1. Hi Eric, to provide a quantifiable answer, I would need to make an estimate of the number of true effects and null effects. It is obvious many replications will yield lower effect sizes (whether significant or not, I don't think it matters). There are 2 reasons: 1) original studies were false positives, so the true effect size is 0, and 2) the original studies on average were biased (due to publication bias) and the replications show smaller effects (which are more accurate, but still variable). I think your analysis distinguishing the file-drawer problem from the invisibility problem is trying to do something that is not possible. The analysis gives no meaningful answers - but maybe I am missing something.