A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Wednesday, October 29, 2025

Why we should stop using statistical techniques that have not been adequately vetted by experts in psychology

In a recent post on Bluesky, reflecting on a paper he published with Clintin Davis-Stober that points out concerns with the p-curve method (Morey & Davis-Stober, 2025), Richard Morey writes:

 


Also, I think people should stop using forensic meta-analytic techniques that have not been adequately vetted by experts in statistics. The p-curve papers have very little statistical detail, and were published in psych journals. They did not get the scrutiny appropriate to their popularity.

 

Although I understand this post as an affective response, I also think this kind of thought is extremely dangerous and undermines science. In this blog post I want to unpack some of the consequences of thoughts like this, and suggest how we should deal with quality control instead.

 

Adequately vetted by experts

 

I am a big fan of better vetting of scientific work by experts. I would like expert statisticians to vet the power analyses and statistical analyses in all your papers. But there are some problems. The first is identifying expert statisticians. There are many statisticians, but some get things wrong. Of course, those are not the experts we want to do the vetting. So how do we identify expert statisticians?

Let’s see if we can identify expert statisticians by looking at Sue Duval and Richard Tweedie. A look at their CVs might convince you they are experts in statistics. But wait! They developed the ‘trim-and-fill’ method. The abstract of their classic 2000 paper is below:

[Screenshot of the abstract of Duval and Tweedie’s (2000) trim-and-fill paper, which claims the adjusted point estimate is approximately correct.]

It turns out that, contrary to what they write in their abstract, the point estimate for the meta-analytic effect size after adjusting for missing studies is not approximately correct at all (Peters et al., 2007; Terrin et al., 2003). So clearly, Duval and Tweedie are statisticians, but not the expert statisticians that we want to do the vetting. They got things wrong, and more problematically, they got things wrong in the Journal of the American Statistical Association.

 

In some cases, the problems in the work by statisticians are so easy to spot that even a lowly psychologist like myself can point them out. When a team of biostatisticians proposed a ‘second generation p-value’ without mentioning equivalence tests anywhere in their paper, two psychologists (Marie Delacre and I) had to point out that the statistic they had invented was very similar to an equivalence test, except that it had a number of undesirable properties (Lakens & Delacre, 2020). I guess, based on this anecdotal experience, there is nothing left to do but create the rule that we should stop using statistical tests that have not been adequately vetted by experts in psychology.
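
For readers unfamiliar with equivalence testing: the logic is simply to run two one-sided tests against a lower and an upper equivalence bound, and to declare equivalence only when both are significant. Below is a minimal sketch in Python, with hypothetical data and hypothetical bounds, meant only to illustrate that TOST logic; it is not the analysis from Lakens and Delacre (2020).

```python
# A minimal sketch of a TOST equivalence test (two one-sided tests).
# The data and the equivalence bounds below are hypothetical illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.1, scale=1.0, size=50)  # hypothetical one-sample data
low, high = -0.5, 0.5                        # hypothetical equivalence bounds (raw units)

n, m, sd = len(x), x.mean(), x.std(ddof=1)
se = sd / np.sqrt(n)

# One-sided test 1: is the mean significantly larger than the lower bound?
t_lower = (m - low) / se
p_lower = stats.t.sf(t_lower, df=n - 1)

# One-sided test 2: is the mean significantly smaller than the upper bound?
t_upper = (m - high) / se
p_upper = stats.t.cdf(t_upper, df=n - 1)

# Equivalence is concluded only when both one-sided tests are significant,
# so the TOST p-value is the larger of the two.
p_tost = max(p_lower, p_upper)
print(f"mean = {m:.3f}, TOST p-value = {p_tost:.4f}")
```

When both one-sided tests are significant, effects at least as extreme as the bounds can be rejected, which is the conclusion the ‘second generation p-value’ effectively tries to reach by other means.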

 

Although it greatly helps to have expertise in the topics you want to scrutinize, sometimes the most fatal criticism comes from elsewhere. Experts make mistakes; overconfidence is a thing. I recently made a very confident statement in a (signed) peer review that I might have been wrong about (I am still examining the topic). I don’t want to be the expert who ‘vets’ a method and allows it to be used based on my authority. More importantly, I think no one should want a science where authorities tell us which methods are vetted and which are not. It would undermine the very core of what science is to me: a fallible system of knowledge generation that relies on open mutual criticism.

 

Scrutiny appropriate to their popularity

 

I am a big fan of increasing our scrutiny based on how popular something is. Indeed, this is exactly what Peder Isager, our collaborators, and I propose in our work on the Replication Value: the more popular and the less certain a finding is, the more it deserves an independent direct replication (Isager et al., 2023, 2024).

 

There are two challenges. The first is that at the moment a method is first published we do not know how popular it will become. So there is a period during which methods exist, and are used, without being criticized, because their popularity takes time to become clear. The first paper on p-curve analysis was published in 2014 (Simonsohn et al., 2014), with an update in 2015 (Simonsohn et al., 2015). A very compelling criticism of p-curve that pointed out strong limitations appeared as a preprint in 2017 and in print two years later (Carter et al., 2019). It convincingly showed that p-curve does not work well under heterogeneity, and there often is heterogeneity. Other methods, such as z-curve analysis, were developed and showed better performance under heterogeneity (Brunner & Schimmack, 2020).
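
To build some intuition for why heterogeneity matters: p-curve-style methods draw their conclusions from the distribution of significant p-values, and under heterogeneity that distribution reflects a mixture of true effects rather than a single one. The sketch below simulates that input distribution under hypothetical choices for sample size, average effect size, and heterogeneity; it illustrates the setting in which Carter et al. (2019) found p-curve to perform poorly, and is not a reimplementation of p-curve itself.

```python
# A minimal sketch of how heterogeneity changes the distribution of significant
# p-values that p-curve-style methods take as input. The sample size, average
# effect size, and amount of heterogeneity (tau) are hypothetical choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def significant_p_values(mean_d, tau, n_per_group=30, n_studies=5000):
    """Simulate two-sample t-tests with true effects d ~ N(mean_d, tau)
    and return only the p-values significant at alpha = .05."""
    d = rng.normal(mean_d, tau, size=n_studies)
    p = np.empty(n_studies)
    for i in range(n_studies):
        g1 = rng.normal(0.0, 1.0, n_per_group)
        g2 = rng.normal(d[i], 1.0, n_per_group)
        p[i] = stats.ttest_ind(g1, g2).pvalue
    return p[p < 0.05]

homogeneous = significant_p_values(mean_d=0.4, tau=0.0)    # single true effect
heterogeneous = significant_p_values(mean_d=0.4, tau=0.4)  # mixture of true effects

# Same average effect size, but the shape of the significant p-value
# distribution differs once true effects vary across studies.
for label, p in [("tau = 0.0", homogeneous), ("tau = 0.4", heterogeneous)]:
    share = np.mean((p > 0.01) & (p < 0.05))
    print(f"{label}: {p.size} significant studies, "
          f"median p = {np.median(p):.4f}, share with .01 < p < .05 = {share:.2f}")
```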

 

It seems a bit of a stretch to say the p-curve method did not get scrutiny appropriate to its popularity when many papers criticized it, relatively quickly (Aert et al., 2016; Bishop & Thompson, 2016; Ulrich & Miller, 2018). What is fair to say is that statisticians failed to engage with an incredibly important topic (tests for publication bias) that addressed a clear need in many scientific communities, as most of the criticism came from psychological methodologists. I fully agree that statisticians should have engaged more with this technique.

I believe the reason they didn’t is that there is a real problem in the reward structure of statistics: statisticians get greater rewards inside their field by proposing a twelfth approach to computing confidence intervals around a non-parametric effect size estimate for a test that no one uses than by helping psychologists solve a problem they really need a solution for. Indeed, for a statistician, publication bias is a very messy business, and there will never be a test for publication bias rigorous enough to earn credit from fellow statisticians. There are no beautiful mathematical solutions, no creative insights; there is only the messy reality of a literature biased by human actions that we can never adequately capture in a model. The fact that empirical researchers often don’t know where to begin when evaluating the reliability of claims in their publication-bias-ridden field is not something statisticians care about. But they should care about it.

 

I hope statisticians will start to scrutinize techniques in proportion to their popularity. If a statistical technique is cited 500 times, 3 statisticians need to drop whatever they are doing and scrutinize the hell out of it. We can randomly select them for ‘statistician duty’.

 

Quality control in science

 

It might come as a surprise, but I don’t actually think we should stop using methods that have not been adequately vetted by psychologists, or statisticians for that matter, because I don’t want a science where authorities tell others which methods they can use. Scrutiny is important, but we can’t know how extensively methods should be vetted, we don’t know how to identify experts, and everyone, including ‘experts’, is fallible. It is naïve to think ‘expert vetting’ will lead to clear answers about which methods we should and should not use. If we can’t even reach agreement about the use of p-values, no one should believe we will ever reach agreement about the use of methods to detect publication bias, which will always be messy at best.

 

I like my science free from arguments from authority. Everyone should do their best to criticize everyone else and, if they are able, themselves. Some people will be in a better position to criticize some papers than others, but it is difficult to predict where the most fatal criticism will come from. Treating statistics papers published in a statistics journal as superior to papers in a psychology journal is too crude a heuristic to be a good idea, and boils down to a form of elitism that I can’t condone. Sometimes even a lowly 20% statistician can point out flaws in methods proposed by card-carrying statisticians.

 

What we can do better is implementing actual quality control. Journal peer review will not suffice, because it is only as good as the two or three peers that happen to be willing and available to review a paper. But it is a start. We should enable researchers to see how well papers are peer reviewed by journals. Without transparency, we can’t calibrate our trust (Vazire, 2017). Peer reviews should be open, for all papers, including papers proposing new statistical methods.

 

If we want our statistical methods to be of high quality, we need to specify quality standards. Morey and Davis-Stober point out the limitations of simulation-based tests of a method and convincingly argue for the value of evaluating the mathematical properties of a testing procedure. If as a field we agree that an evaluation of the mathematical properties of a test is desirable, we should track whether this evaluation has been performed, or not. We could have a long checklist of desirable quality control standards – e.g., a method has been tested on real datasets, it has been compared to similar methods, those comparisons have been performed objectively based on a well-justified set of criteria, etc.

 

One could create a database that lists, for each method, which quality standards have been met and which have not. If considered useful, the database could also track how often a method is used, by tracking citations and by listing papers that have implemented the method (as opposed to those merely discussing it). When statistical methods become widely used, the database would point researchers to the methods that deserve more scrutiny. The case of magnitude-based inference in sport science reveals the importance of a public call for scrutiny when a method becomes widely popular, especially when this popularity is limited to a single field.
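
As a rough illustration, a single record in such a database might look like the following sketch; every field name, check, and value here is hypothetical, and the checks marked as met simply mirror papers discussed in this post.

```python
# A minimal sketch of one record in the kind of methods-quality database the
# post imagines. All field names, checks, and values are hypothetical; the
# checks marked as met simply mirror papers discussed in this post.
from dataclasses import dataclass, field

@dataclass
class MethodRecord:
    name: str
    proposing_paper: str
    citations: int                                      # rough popularity signal
    checks: dict[str, bool] = field(default_factory=dict)
    implementing_papers: list[str] = field(default_factory=list)

p_curve = MethodRecord(
    name="p-curve",
    proposing_paper="Simonsohn, Nelson, & Simmons (2014)",
    citations=500,  # placeholder count
    checks={
        "mathematical properties evaluated": True,         # Morey & Davis-Stober (2025)
        "performance under heterogeneity examined": True,  # Carter et al. (2019)
        "compared to alternative methods": True,           # e.g., z-curve comparisons
        "quality standards checklist completed": False,    # hypothetical status
    },
)

# Flag methods that are widely used but still have unmet quality checks.
unmet = [check for check, met in p_curve.checks.items() if not met]
if p_curve.citations > 100 and unmet:
    print(f"{p_curve.name}: popular but unmet checks -> {unmet}")
```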

 

The more complex methods are, the more limitations they have. This will be true for all methods that aim to deal with publication bias, because the way scientists bias the literature is difficult to quantify. Maybe as a field we will come to agree that tests for bias are never accurate enough, and we will recommend that people just look at the distribution of p-values without performing a test. Alternatively, we might believe that it is useful to have a testing procedure that too often suggests a literature contains at least some non-zero effects, because we feel we need some way to intersubjectively point out that there is bias in a literature, even if this is based on an imperfect test. Such discussions require a wide range of stakeholders, and the opinion of statisticians about the statistical properties of a test is only one source of input. Imperfect procedures are implemented all the time when they are the best we have, and doing nothing is not working either.

 

Statistical methods are rarely perfect from their inception, and all have limitations. Although I understand the feeling, banning all tests that have not been adequately vetted by an expert is inherently unscientific. Such a suggestion would destroy the very core of science – an institution that promotes mutual criticism, while accepting our fallibility. As Popper (1962) reminds us: “if we respect truth, we must search for it by persistently searching for our errors: by indefatigable rational criticism, and self-criticism.”

 




References

Aert, R. C. M. van, Wicherts, J. M., & Assen, M. A. L. M. van. (2016). Conducting Meta-Analyses Based on p Values: Reservations and Recommendations for Applying p-Uniform and p-Curve. Perspectives on Psychological Science, 11(5), 713–729. https://doi.org/10.1177/1745691616650874

Bishop, D. V., & Thompson, P. A. (2016). Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value. PeerJ, 4, e1715.

Brunner, J., & Schimmack, U. (2020). Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance. Meta-Psychology, 4. https://doi.org/10.15626/MP.2018.874

Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for Bias in Psychology: A Comparison of Meta-Analytic Methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196

Isager, P. M., Lakens, D., van Leeuwen, T., & van ’t Veer, A. E. (2024). Exploring a formal approach to selecting studies for replication: A feasibility study in social neuroscience. Cortex, 171, 330–346. https://doi.org/10.1016/j.cortex.2023.10.012

Isager, P. M., van Aert, R. C. M., Bahník, Š., Brandt, M. J., DeSoto, K. A., Giner-Sorolla, R., Krueger, J. I., Perugini, M., Ropovik, I., van ’t Veer, A. E., Vranka, M., & Lakens, D. (2023). Deciding what to replicate: A decision model for replication study selection under resource and knowledge constraints. Psychological Methods, 28(2), 438–451. https://doi.org/10.1037/met0000438

Lakens, D., & Delacre, M. (2020). Equivalence Testing and the Second Generation P-Value. Meta-Psychology, 4, 1–11. https://doi.org/10.15626/MP.2018.933

Morey, R. D., & Davis-Stober, C. P. (2025). On the poor statistical properties of the P-curve meta-analytic procedure. Journal of the American Statistical Association. Advance online publication. https://doi.org/10.1080/01621459.2025.2544397

Peters, J. L., Sutton, A. J., Jones, D. R., Abrams, K. R., & Rushton, L. (2007). Performance of the trim and fill method in the presence of publication bias and between-study heterogeneity. Statistics in Medicine, 26(25), 4544–4562. https://doi.org/10.1002/sim.2889

Popper, K. R. (1962). Conjectures and refutations: The growth of scientific knowledge. Routledge.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. Perspectives on Psychological Science, 9(6), 666–681.

Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015). Journal of Experimental Psychology. General, 144(6), 1146–1152. https://doi.org/10.1037/xge0000104

Terrin, N., Schmid, C. H., Lau, J., & Olkin, I. (2003). Adjusting for publication bias in the presence of heterogeneity. Statistics in Medicine, 22(13), 2113–2126. https://doi.org/10.1002/sim.1461

Ulrich, R., & Miller, J. (2018). Some properties of p-curves, with an application to gradual publication bias. Psychological Methods, 23(3), 546–560. https://doi.org/10.1037/met0000125

Vazire, S. (2017). Quality Uncertainty Erodes Trust in Science. Collabra: Psychology, 3(1), 1. https://doi.org/10.1525/collabra.74
