Comments on The 20% Statistician: The statistical conclusions in Gilbert et al (2016) are completely invalid

The challenge for evaluating replication studies i...

2016-03-08T18:40:13.008+01:00

The challenge for evaluating replication studies is that effect size, p-value, and statistical power are all relevant. No one parameter alone provides the whole story. My inclination is to focus on studies that have a statistical power of .90 or greater. For the RPP data (if I used the correct fields), 72 studies had a power of .90 or greater and 31 (43%) were significant (had p ≤ .05). Of the 24 studies with lower power, only 2 (8%) were significant. These do not quite match the published totals, probably due to 3 records with an unexplained X in the power field that I excluded.

I’ve been thinking recently that we can go farther than just rejecting the null hypothesis. Studies with a power of .90 or greater and a p-value of something like .25 or greater could be interpreted as supporting the null hypothesis and rejecting the alternative hypothesis. For the RPP data, this is 32 studies or 44% of the 72 studies with power of .90 or greater.

This strategy would be best implemented by calculating a p-value for the alternative hypothesis (p-value-H1) that would be the tail area for the observed outcome under the model for the alternative hypothesis that was used in the power analysis. This would be compared to the usual p-value-H0 for the null hypothesis.

P-value-H0 ≤ .05 and p-value-H1 ≥ .25 would be strong evidence supporting the alternative hypothesis.
P-value-H0 ≥ .25 and p-value-H1 ≤ .05 would be strong evidence supporting the null hypothesis.
P-value-H0 ≤ .05 and p-value-H1 < .25 would be tentative support for the alternative hypothesis, but an intermediate model or the null hypothesis may be true.
P-value-H0 < .25 and p-value-H1 ≤ .05 would be tentative support for the null hypothesis, but an intermediate model or the alternative hypothesis may be true.
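
A minimal sketch of the four rules above for a one-sided binomial test may make the scheme concrete. It assumes the alternative proportion is larger than the null proportion, so p-value-H0 is the upper tail under the null model and p-value-H1 is the lower tail under the alternative model. The function names and the example numbers are hypothetical. The labels "conflicting" (the two tentative rules overlap when both p-values are ≤ .05) and "inconclusive" (no rule applies) are additions not covered by the rules as stated.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_value_h0(k, n, p0):
    """Upper-tail area under the null model: P(X >= k | p0)."""
    return sum(binom_pmf(i, n, p0) for i in range(k, n + 1))


def p_value_h1(k, n, p1):
    """Lower-tail area under the alternative model: P(X <= k | p1)."""
    return sum(binom_pmf(i, n, p1) for i in range(0, k + 1))


def classify(p_h0, p_h1, strict=0.05, lenient=0.25):
    """Apply the four rules; thresholds default to .05 and .25."""
    if p_h0 <= strict and p_h1 >= lenient:
        return "strong support for H1"
    if p_h0 >= lenient and p_h1 <= strict:
        return "strong support for H0"
    h1_tentative = p_h0 <= strict and p_h1 < lenient
    h0_tentative = p_h0 < lenient and p_h1 <= strict
    if h1_tentative and h0_tentative:
        # Assumption: the rules do not say how to resolve this overlap.
        return "conflicting"
    if h1_tentative:
        return "tentative support for H1"
    if h0_tentative:
        return "tentative support for H0"
    # Assumption: outcomes matched by none of the four rules.
    return "inconclusive"


# Hypothetical example: 62 successes in 100 trials, H0: p = .50, H1: p = .65.
k, n = 62, 100
print(classify(p_value_h0(k, n, 0.50), p_value_h1(k, n, 0.65)))
```

The H1 proportion should be the same one used in the power analysis, otherwise p-value-H1 is not comparable with the stated power.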

This approach explicitly and symmetrically compares the study outcome with both the null and alternative models that were used in the power analysis. For the binomial case I have been exploring with power of .90, the .25 criterion puts p-values in the range of .05 to .015 in the tentative category. Also, studies with low power can never strongly support the null hypothesis and thus are biased. It may be appropriate to consider that the effect size that gives a power of .90 (perhaps .95) is the useful degree of resolution for producing strong evidence with a hypothesis test.

I am not sure whether my initial optimistic impressions of this strategy will hold up. Any thoughts?

Sam (the anonymous one) I'm always short to an...

2016-03-07T19:02:12.823+01:00

Sam (the anonymous one) I'm always short with anonymous commenters here, and your post wasn't very clear. Your comments about false positives are still unclear. Can you give a numerical example of what you mean? I'm not hostile, but it's difficult to address unclear questions and my time is limited.

Thanks for your reply. You seemed to take my comme...

2016-03-07T16:53:36.385+01:00

Thanks for your reply. You seemed to take my comment as a hostile challenge rather than just honest questions following your interesting post. I'm very interested in these issues and only want to understand them better/as objectively as possible. I read the RPP when it first came out and yes, probably due for a reread, but was really wondering if you had updated thoughts on that particular question in light of this new round of discussion/analysis. You don't and that's fine. I'll reread the paper. My understanding of stats is fine - I was just suggesting that some of the original findings could have been 'honest' false positives. I could have been clearer above but what I meant was that even if 70 studies were high fidelity replication attempts, some of those were attempting to replicate honest false positives.

I agree that the subjective author assessment is quite informative. I think my main concern is that the denominator should be adjusted for this percentage to be meaningful as an estimate of reproducibility (e.g., taking into account false positives, low fidelity, etc.).

(Not the same Sam... :P) As I said elsewhere alre...

2016-03-07T12:03:33.097+01:00

(Not the same Sam... :P)

As I said elsewhere already, I think the judgement whether or not there is a crisis is completely subjective. Also the use of the word 'crisis' isn't particularly helpful. And you already know my view on calling it 'reproducibility'... ;)

Either way, the RPP did a great service to the community and I can't see how anyone can in their right mind take the findings and say "Move along! Nothing to see here!" Just look at that scatter plot of effect sizes. That's pretty depressing.

I think the idea behind Gilbert et al.'s criticism is a good one. It would be useful to know how much replicability there is for typical psychology studies. I don't think it makes sense to do this across the whole field though - given how dissimilar the subfields are and how they vary in methodology (between-subject vs within-subject designs is a big difference) it would be more useful to see some subdisciplines here.

Anyway, the idea of estimating what level of replicability to expect by chance is a good one but I don't think their article really answers that question. Perhaps the correlation of effect sizes is a good measure - and how anyone can think that r = 0.51 is a strong indication that all is well is beyond me.

"The use of the confidence interval interpret...

2016-03-07T11:39:59.479+01:00

"The use of the confidence interval interpretation of replicability in the OSC article was probably a mistake, too much based on the 'New Statistics' hype two years ago."
Totally agree. In his desire to claim the magic of CIs, Cumming uses language that could lead to such erroneous readings.

Thanks for pointing this out! If only Gilbert et a...

2016-03-07T07:30:16.343+01:00

Thanks for pointing this out! If only Gilbert et al had read the supplement, it would have saved them a huge error.

And when you say 'lower than 83.4%' you mean the probability that the original study's effect size is covered by the replication effect size. I think it's more intuitive to talk about the probability that the replication effect size is captured by the CI of the original study, and this percentage is higher, right?

Hi Sam, maybe you should read the RP:P article. It...

2016-03-07T06:46:58.381+01:00

Hi Sam, maybe you should read the RP:P article. It uses at least 5 replication measures. My favorite is the authors' subjective assessment - 39 out of 97 studies replicated, according to this. I don't understand how this number 'doesn't provide evidence supporting a reproducibility "crisis"'. You also seem to be assuming Type 1 errors happen when all effects are true - which suggests you need to brush up on your stats.

I don't need to explain the contribution of the RP:P here - we wrote this down very nicely in the Science article.

A couple of genuine questions for you. What should...

2016-03-07T06:20:55.037+01:00

A couple of genuine questions for you. What should the RP:P have used instead of CIs? And if Gilbert et al's commentary is meaningless, I'm left wondering, as many probably are, what the answer is to the very question you ended your post with. What replication rate would have been respectable? And what would have been consistent with a crisis?

I'm thinking that we could expect at the very least 5% of the high-fidelity studies to fail to be significant at p < .05. And maybe some subset of the remaining studies did not measure true effects, but that wouldn't necessarily be indicative of a crisis. Type I errors happen and we could expect at the very least 5% (assuming Type I error rates are not inflated). So if say ~70 studies were high fidelity and say 4 of them were Type I errors, then on a very good day ~63 of them would replicate. On a worse day (let's say not bad but not great), a fair number of the studies were likely underpowered and a few of the original studies reported effects that were not true (due to some p-hacking), so that number would drop. This suggests to me the RP:P doesn't provide evidence supporting a reproducibility "crisis". That said, that doesn't mean I think Gilbert et al.'s suggestion that the field at large is healthy is supported by the data either.
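
The arithmetic behind the ~63 figure can be sketched as follows. The power of .95 for the replications of true effects is an assumption chosen here to match the "at the very least 5% fail" reasoning; it is not a number from the RP:P. The function name is hypothetical.

```python
def expected_replications(n_true, n_false, power, alpha=0.05):
    """Expected number of significant replications.

    True effects replicate at the rate given by `power`; original false
    positives come out significant again only at the rate `alpha`.
    """
    return n_true * power + n_false * alpha


n_high_fidelity = 70      # from the comment above
n_false_positives = 4     # from the comment above
n_true = n_high_fidelity - n_false_positives

# "Very good day": assumed power of .95 for every true effect.
good_day = expected_replications(n_true, n_false_positives, power=0.95)

# "Worse day": assumed power of .50, purely for illustration.
worse_day = expected_replications(n_true, n_false_positives, power=0.50)

print(round(good_day, 1), round(worse_day, 1))
```

Under these assumptions the good-day value is about 62.9 of 70, and the expected count falls roughly in proportion to the power of the replications.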

If you disagree, I would love to hear why, because I, like many people, am genuinely just trying to think through the implications of this. I think your writings on this issue have been a really important contribution (thank you) but I think you also could be clearer here about what you think the contribution of the RP:P was to answering this question. Gilbert et al's reasoning and use of statistics were faulty, but their exercise highlights the problems inherent in interpreting the results of the RP:P. Sure, the limitations were acknowledged in the RP:P, but still the claim was that the evidence supported a crisis, no?

It's not about taking sides but about dispassionately making sense of the evidence we have.

(And to be clear, I think the RP:P was an admirable effort and is also just one piece of evidence re: the question of whether or not there is a crisis.)

I completely agree. See here for a paper I co-auth...

2016-03-06T22:09:14.034+01:00

I completely agree. See here for a paper I co-authored on Rewarding Replications: http://pps.sagepub.com/content/7/6/608.abstract

Disregarding the controversy for a brief moment, t...

2016-03-06T22:03:46.026+01:00

Disregarding the controversy for a brief moment, the true problem in psychological science still is the almost complete lack of reputation associated with replication. We appear to be quite different from a discipline such as physics in that sense. Attempts to raise the appreciation of research targeted at replication are few and far between. This needs to change, e.g., by authors committing to cite at least one replication study together with the original citation, or by having database engines return replicative studies together with the original works.

The capture percentage and its properties was alre...

2016-03-06T20:06:46.282+01:00

The capture percentage and its properties were already discussed in the Supplement of the original Science article, in Appendix A4 on page 76: https://osf.io/k9rnd/.

Marcel van Assen, of the "Analysis Team" of RPP

It explains that the capture probability can vary between 0 and 1 (actually, the upper limit is .95), depending on the sample sizes of the original and replication study, even if they estimate the same true effect size. It also gives the code to estimate the expected capture probability of the RPP, which is below 0.834 because the replication sample size is often larger than the original sample size.
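
A rough sketch of why the capture probability depends on the two sample sizes, under a simple normal approximation: both studies estimate the same true effect, and each standard error is proportional to 1/sqrt(n). This is not the code from the Supplement, and the function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def capture_probability(n_ci, n_point, z=1.959964):
    """Probability that the 95% CI of the study with n_ci observations
    captures the point estimate of the study with n_point observations.

    The difference between the two estimates has variance
    se_ci**2 + se_point**2, so the CI half-width z * se_ci corresponds to
    z * sqrt(n_point / (n_ci + n_point)) standard deviations of that
    difference.
    """
    ratio = n_point / (n_ci + n_point)
    return 2.0 * normal_cdf(z * sqrt(ratio)) - 1.0


# Equal sample sizes give the familiar value of about .834.
print(round(capture_probability(50, 50), 3))

# RPP-style measure: replication CI (larger n) capturing the original estimate.
print(round(capture_probability(n_ci=100, n_point=50), 3))
```

With a very large point-estimate study the value approaches .95, and with a very large CI study it approaches 0, which matches the range described above.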

Thanks, JP. I hate to toot my own horn, but the lo...

2016-03-06T19:26:02.230+01:00

Thanks, JP. I hate to toot my own horn, but the low standard of evidence really seems to me the more important conclusion from the RPP. The fact that in almost 100 studies you can make the difference between the observed effect size and zero disappear by adding a heterogeneity assumption is just a reflection of that.

I think the real lesson to be learned from the RP:...

2016-03-06T15:54:12.108+01:00

I think the real lesson to be learned from the RP:P and the attack on the RP:P by GKPW is that typical studies in psychology, both original studies and (most) replications, simply don't collect sufficient amounts of data, as Etz & Vandekerckhove have shown in their reanalysis in http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149794