A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Tuesday, May 3, 2022

Collaborative Author Involved Replication Studies

Recently, a new category of studies has started to appear in the psychological literature that provides the strongest support to date for a replication crisis in psychology: large-scale collaborative replication studies in which the authors of the original study are directly involved. These replication studies have often provided conclusive demonstrations of the absence of any effect large enough to matter. Despite considerable attention to these extremely interesting projects, I don’t think the scientific community has fully appreciated what we have learned from these studies.

 

Three examples of Collaborative Author Involved Replication Studies

 

Vohs and colleagues (2021) performed a multi-lab replication study of the ego-depletion effect, which (deservedly) has become a poster child of non-replicable effects in psychology. The teams used different combinations of protocols, allowing a test of whether the unsuccessful prediction generalizes across minor variations in how the experiment was operationalized. Across these conditions, a non-significant effect of d = 0.06, 95% CI [-0.02, 0.14], was observed. Although the authors regrettably did not specify a smallest effect size of interest in their frequentist analyses, for their Bayesian analyses they mention that “we pitted a point-null hypothesis, which states that the effect is absent, against an informed one-sided alternative hypothesis centered on a depletion effect (δ) of 0.30 with a standard deviation of 0.15”. Based on the confidence interval, we can reject effects of d = 0.3, and even d = 0.2, suggesting that we have extremely informative data concerning the absence of an effect most ego-depletion researchers would consider large enough to matter.
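To make this Bayesian comparison concrete, here is a minimal sketch in Python of how such a Bayes factor can be computed: a point null at δ = 0 against the informed prior the authors describe, δ ~ Normal(0.30, 0.15), which I truncate at zero to make it one-sided. The truncation point, and the t-value and group sizes below, are my assumptions for illustration, not values from the paper.

```python
import numpy as np
from scipy import stats, integrate

def informed_bf01(t_obs, n1, n2, prior_mean=0.30, prior_sd=0.15):
    """BF01 for a two-sample t-test: a point null (delta = 0) against an
    informed one-sided H1 with delta ~ Normal(prior_mean, prior_sd),
    truncated at zero (the truncation at zero is an assumption)."""
    df = n1 + n2 - 2
    mult = np.sqrt(n1 * n2 / (n1 + n2))  # maps delta to the noncentrality parameter
    a = (0 - prior_mean) / prior_sd      # lower truncation point in standardized units
    prior = stats.truncnorm(a, np.inf, loc=prior_mean, scale=prior_sd)
    # Marginal likelihood under H1: the likelihood averaged over the prior on delta.
    # The upper integration limit covers all non-negligible prior mass.
    m1, _ = integrate.quad(
        lambda d: stats.nct.pdf(t_obs, df, mult * d) * prior.pdf(d),
        0, prior_mean + 10 * prior_sd,
    )
    m0 = stats.t.pdf(t_obs, df)          # likelihood under H0 (delta = 0)
    return m0 / m1                       # BF01 > 1 means the data favor the null

# Hypothetical input: a small t-value from a large two-group study
print(informed_bf01(t_obs=1.2, n1=1600, n2=1600))
```

With a small observed t-value and a prior that expects δ around 0.30, the marginal likelihood under the alternative is far lower than under the null, so the function returns a BF01 well above 1: the data support the absence of the predicted effect.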

 

Morey et al. (2021) performed a multi-lab replication study of the Action-Sentence Compatibility Effect (Glenberg & Kaschak, 2002). I cited the original paper in my PhD thesis, and it was an important finding that I built on, so I was happy to join this project. As written in the replication study, the replication team, together with the original authors, “established and pre-registered ranges of effects on RT that we would deem (a) uninteresting and inconsistent with the ACE theory: less than 50 ms.” An effect between 50 ms and 100 ms was seen as inconsistent with the previous literature, but in line with predictions of the ACE theory. The replication study included (after exclusions) 903 native English speakers and 375 non-native English speakers; the original study had used 44, 70, and 72 participants across three studies. The conclusion in the replication study was that “the median ACE interactions were close to 0 and all within the range that we pre-specified as negligible and inconsistent with the existing ACE literature.” There was little heterogeneity.
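The value of the pre-registered 50 ms bound is that it turns the replication into a falsifiable test: if effects of 50 ms or larger can be statistically rejected, the result is inconsistent with the ACE theory. Below is a minimal sketch of that decision rule as a one-sided inferiority test, assuming we have an estimate of the ACE interaction in milliseconds together with its standard error and degrees of freedom. The numbers are hypothetical; the actual analysis in the paper was more involved.

```python
from scipy import stats

def reject_sesoi(estimate_ms, se_ms, df, sesoi_ms=50, alpha=0.05):
    """One-sided test of H0: effect >= sesoi_ms against H1: effect < sesoi_ms.
    Rejecting H0 means the effect is significantly smaller than the
    smallest effect size of interest (an inferiority test)."""
    t = (estimate_ms - sesoi_ms) / se_ms
    p = stats.t.cdf(t, df)  # probability in the lower tail
    return p, p < alpha

# Hypothetical numbers: an ACE interaction estimate of 4 ms with SE of 9 ms
p, rejected = reject_sesoi(estimate_ms=4, se_ms=9, df=900)
print(f"p = {p:.2g}, effect significantly below 50 ms: {rejected}")
```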

 

Last week, Many Labs 4 was published (Klein et al., 2022). This study was designed to examine the mortality salience effect (which I think deserves the same poster-child status of a non-replicable effect in psychology, but which seems to have gotten less attention so far). Data from 1550 participants were collected across 17 labs, some of which performed the study with involvement of the original author, and some of which did not. Several variations of the analyses were preregistered, but none revealed the predicted effect, Hedges’ g = 0.07, 95% CI [-0.03, 0.17] (for exclusion set 1). The authors did not provide a formal sample size justification based on a smallest effect size of interest, but a sensitivity power analysis indicated they had 95% power for effect sizes of d = 0.18 to d = 0.21. If we assume all authors considered effect sizes around d = 0.2 too small to support their predictions, we can see based on the confidence intervals that we can indeed exclude effect sizes large enough to matter. The mortality salience effect, even with involvement of the original authors, seems to be too small to matter. There was little heterogeneity in effect sizes (in part because of the absence of an effect).
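A sensitivity power analysis answers the question: given the sample that was collected, which effect size could be detected with high power? The sketch below, using statsmodels, recovers a detectable effect size of roughly d = 0.18 under the simplifying assumption of about 775 participants per condition; the per-group n is my assumption, as the exact numbers differ per exclusion set in the paper.

```python
from statsmodels.stats.power import TTestIndPower

# Which effect size d could be detected with 95% power, alpha = .05,
# in a two-group design with ~775 participants per group? (The per-group
# n here is an assumption; ML4 used several exclusion sets.)
d = TTestIndPower().solve_power(
    effect_size=None, nobs1=775, alpha=0.05, power=0.95, ratio=1.0,
    alternative="two-sided",
)
print(f"Detectable effect size with 95% power: d = {d:.2f}")  # approx. 0.18
```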

 

These are just three examples (there are more, of which the multi-lab test of the facial feedback hypothesis by Coles et al., 2022, is worth highlighting), but they illustrate some interesting properties of collaborative author involved replication studies. Below, I discuss four strengths of these studies.

 

Four strengths of Collaborative Author Involved Replication Studies

 

1) The original authors are extensively involved in the design of the study. They sign off on the final design, and agree that the study is, with the knowledge they currently have, the best test of their prediction. This means the studies tell us something about the predictive validity of state-of-the-art knowledge in a specific field. If the predictions these researchers make are not corroborated, the knowledge we have accumulated in these research areas is not reliable enough to make successful predictions.

2) The studies are not always direct replications, but the best possible test of the hypothesis, in the eyes of the researchers involved. A criticism of past replication studies has been that directly replicating a study performed many years ago is not always insightful, as the context has changed (even though Many Labs 5 found no support for this criticism). In this new category of collaborative author involved replication studies, the original authors are free to design the best possible test of their prediction. If these tests fail, we cannot attribute the failure to replicate to a ‘protective belt’ of auxiliary hypotheses that no longer hold. Of course, it is possible that the theory can be adjusted in a constructive manner after this unsuccessful prediction. But at this moment, these original authors do not have a solid enough understanding of their research topic to be able to predict whether an effect will be observed.

3) The other researchers involved in these projects often have extensive expertise in the content area. They are not just researchers interested in mechanistically performing a replication study on a topic in which they have little expertise. Instead, many of the researchers are peers who have worked in the specific research area and published on the topic of the replication study, but who have collectively developed some doubts about the reliability of past claims, and have decided to spend some of their time replicating a previous finding.

4) The statistical analyses in these studies yield informative conclusions. The studies typically do not conclude the prediction was unsuccessful based on p > 0.05 in a small sample. In the most informative studies, original authors have explicitly specified a smallest effect size of interest, which makes it possible to perform an equivalence test, and statistically reject the presence of any effect deemed large enough to matter. In other cases, Bayesian hypothesis tests are performed, which provide support for the null model compared to the alternative model. This makes these replication studies severe tests of the predicted effect. In cases where original authors did not specify a smallest effect size of interest, the very large sample sizes allow readers to examine which effect sizes can be rejected based on the observed confidence interval, and in all the studies discussed here, we can reject the presence of effects large enough to be considered meaningful (a sketch of this approach follows below). There is most likely not a PhD student in the world who would be willing to examine these effects, given the effect sizes that remain possible after these collaborative author involved replication studies. We can never conclude an effect is exactly zero, but that hardly matters: the effects are clearly too small to study.
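When only the meta-analytic effect size and its confidence interval are reported, readers can apply this logic themselves: recover the standard error from the 95% CI and run two one-sided tests (TOST) against a smallest effect size of interest. Here is a minimal sketch, using a normal approximation and the summary numbers reported in the studies above, with d = 0.2 as an assumed smallest effect size of interest.

```python
from scipy import stats

def tost_from_ci(est, ci_low, ci_high, sesoi, alpha=0.05):
    """Two one-sided tests (TOST) against +/- sesoi, with the standard
    error recovered from a 95% CI under a normal approximation."""
    se = (ci_high - ci_low) / (2 * stats.norm.ppf(0.975))
    p_upper = stats.norm.cdf((est - sesoi) / se)  # H0: effect >= +sesoi
    p_lower = stats.norm.sf((est + sesoi) / se)   # H0: effect <= -sesoi
    p_tost = max(p_upper, p_lower)                # TOST p-value
    return p_tost, p_tost < alpha

# Ego depletion (Vohs et al., 2021): d = 0.06, 95% CI [-0.02, 0.14]
print(tost_from_ci(0.06, -0.02, 0.14, sesoi=0.2))
# Mortality salience (Klein et al., 2022): g = 0.07, 95% CI [-0.03, 0.17]
print(tost_from_ci(0.07, -0.03, 0.17, sesoi=0.2))
```

In both cases the TOST p-value is well below .05, so effects as large as d = 0.2 can be statistically rejected, consistent with the conclusions above.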

 

The Steel Man for Replication Crisis Deniers

 

Given the reward structures in science, it is extremely rewarding for individual researchers to speak out against the status quo. Currently, the status quo is that the scientific community has accepted there is a replication crisis. Some people attempt to criticize this belief. This is important. All established beliefs in science should be open to criticism.

Most papers that aim to challenge the fact that many scientific domains find it surprisingly difficult to successfully replicate findings once believed to be reliable focus on the 100 studies in the Reproducibility Project: Psychology (RP:P), which started a decade ago and was published in 2015. This project was incredibly successful in creating awareness of concerns around replicability, but it was not incredibly informative about how big the problem was.

In the conclusion of the RP:P, the authors wrote: “After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. Humans desire certainty, and science infrequently provides it. As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation.” The RP:P was an important project, but it is no longer the project to criticize if you want to provide evidence against the presence of a replication crisis.

Since the start of the RP:P, other projects have aimed to complement our insights about replicability. Registered Replication Reports focused on single studies, replicated in much larger samples, to reduce the probability of a Type 2 error. These studies often quite conclusively showed original studies did not replicate, and a surprisingly large number yielded findings not statistically different from 0, despite sample sizes much larger than psychologists would be able to collect in normal research lines. Many Labs studies focused on a smaller set of studies, each replicated many times, sometimes with minor variations to examine the role of possible moderators proposed to explain failures to replicate; such moderation was typically absent.

The collaborative author involved replications are the latest addition to this expanding literature that consistently shows great difficulties in replicating findings. I believe they currently make up the steel man for researchers motivated to cast doubt on the presence of a replication crisis. The fact that these large projects with direct involvement of the original authors cannot find support for predicted effects is, I believe, the strongest evidence to date that we have a problem replicating findings. Of course, these studies are complemented by Registered Replication Reports and Many Labs studies, and together they make up the steel man to argue against if you are a Replication Crisis Denier.

 

References

Coles, N. A., March, D. S., Marmolejo-Ramos, F., Larsen, J., Arinze, N. C., Ndukaihe, I., Willis, M., Francesco, F., Reggev, N., Mokady, A., Forscher, P. S., Hunter, J., Gwenaël, K., Yuvruk, E., Kapucu, A., Nagy, T., Hajdu, N., Tejada, J., Freitag, R., … Marozzi, M. (2022). A Multi-Lab Test of the Facial Feedback Hypothesis by The Many Smiles Collaboration. PsyArXiv. https://doi.org/10.31234/osf.io/cvpuw

 

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Hilgard, J., Ahn, P. H., Brady, A. J., Chartier, C. R., Christopherson, C. D., Clay, S., Collisson, B., Crawford, J. T., Cromar, R., Gardiner, G., Gosnell, C. L., Grahe, J., Hall, C., Howard, I., … Ratliff, K. A. (2022). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement. Collabra: Psychology, 8(1), 35271. https://doi.org/10.1525/collabra.35271

 

Morey, R. D., Kaschak, M. P., Díez-Álamo, A. M., Glenberg, A. M., Zwaan, R. A., Lakens, D., Ibáñez, A., García, A., Gianelli, C., Jones, J. L., Madden, J., Alifano, F., Bergen, B., Bloxsom, N. G., Bub, D. N., Cai, Z. G., Chartier, C. R., Chatterjee, A., Conwell, E., … Ziv-Crispel, N. (2021). A pre-registered, multi-lab non-replication of the action-sentence compatibility effect (ACE). Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-021-01927-8

 

Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J. L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi, J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay, S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733

 
