In a recent post on Bluesky, Richard Morey reflects on a paper he published with Clintin Davis-Stober that points out concerns with the p-curve method (Morey & Davis-Stober, 2025). He writes:
Also, I think people should stop using forensic meta-analytic techniques that have not been adequately vetted by experts in statistics. The p-curve papers have very little statistical detail, and were published in psych journals. They did not get the scrutiny appropriate to their popularity.
Although I understand this post as an affective response, I also think this kind of thought is extremely dangerous and undermines science. In this blog post I want to unpack some of the consequences of thoughts like this, and suggest how to approach quality control instead.
Adequately vetted by experts
I am a big
fan of better vetting of scientific work by experts. I would like expert
statisticians to vet the power analysis and statistical analyses in all your
papers. But there are some problems. The first is in identifying expert
statisticians. There are many statisticians, but some get things wrong. Of
course, those are not the experts that we want to do the vetting. So how do we
identify expert statisticians?
Let’s see
if we can identify expert statisticians by looking at Sue Duval and Richard
Tweedie. A look at their CV might convince you they are experts in statistics.
But wait! They developed the ‘trim-and-fill’ method, and the abstract of their classic 2000 paper states that the point estimate of the meta-analytic effect size after adjusting for missing studies is approximately correct.
It turns out that, contrary to what they write in their abstract, this point estimate is not approximately correct at all (Peters et al., 2007; Terrin et al., 2003). So clearly, Duval and Tweedie are statisticians, but not the expert statisticians that we want vetting the work of others. They got things wrong, and, more problematically, they got things wrong in the Journal of the American Statistical Association.
In some cases, the problems in the work of statisticians are so easy to spot that even a lowly psychologist like myself can point them out. When a team of biostatisticians proposed a ‘second generation p-value’ without mentioning equivalence tests anywhere in their paper, two psychologists (myself and Marie Delacre) had to point out that the statistic they had invented was very similar to an equivalence test, except that it had a number of undesirable properties (Lakens & Delacre, 2020). I guess that, based on this anecdotal experience, there is nothing left to do but create the rule that we should stop using statistical tests that have not been adequately vetted by experts in psychology.
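To make the similarity concrete: below is a minimal sketch (my own toy code, not taken from either paper) that computes a TOST equivalence test p-value and a second-generation p-value (SGPV) for a single normally distributed estimate with a known standard error. The estimate, standard error, and equivalence bound are made-up numbers, purely for illustration.

```python
# Toy comparison of a TOST equivalence test and the second-generation p-value
# (SGPV) for one estimate; all inputs below are hypothetical.
from scipy import stats

def tost_p(estimate, se, delta):
    """Two one-sided z-tests against the bounds -delta and +delta; the
    equivalence test p-value is the larger of the two one-sided p-values."""
    p_lower = 1 - stats.norm.cdf((estimate + delta) / se)  # H0: effect <= -delta
    p_upper = stats.norm.cdf((estimate - delta) / se)      # H0: effect >= +delta
    return max(p_lower, p_upper)

def sgpv(estimate, se, delta, conf=0.95):
    """SGPV: the fraction of the confidence interval that overlaps the
    interval null (-delta, delta), with the correction for very wide intervals."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    lo, hi = estimate - z * se, estimate + z * se
    overlap = max(0.0, min(hi, delta) - max(lo, -delta))
    ci_width, null_width = hi - lo, 2 * delta
    return (overlap / ci_width) * max(ci_width / (2 * null_width), 1.0)

est, se, delta = 0.05, 0.10, 0.30  # hypothetical estimate, SE, and bound
print(f"TOST p-value: {tost_p(est, se, delta):.3f}")  # ~0.006: statistical equivalence
print(f"SGPV:         {sgpv(est, se, delta):.3f}")    # 1.0: the CI lies inside the bounds
```

The SGPV equals 1 exactly when the confidence interval falls entirely inside the equivalence range, which mirrors declaring equivalence with the confidence-interval version of the TOST procedure; that overlap is the similarity we pointed out.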
Although it greatly helps to have expertise in the topics you want to scrutinize, sometimes the most fatal criticism comes from elsewhere. Experts make mistakes; overconfidence is a thing. I recently made a very confident statement in a (signed) peer review that, as I am still examining the topic, I might have been wrong about. I don’t want to be the expert who ‘vets’ a method and allows it to be used based on my authority. More importantly, I think no one should want a science where authorities tell us which methods are vetted and which are not. It would undermine the very core of what science is to me: a fallible system of knowledge generation that relies on open mutual criticism.
Scrutiny appropriate to their popularity
I am a big fan of increasing our scrutiny based on how popular something is. Indeed, this is exactly what Peder Isager, myself, and our collaborators propose in our work on the Replication Value: the more popular a finding is, and the less certain, the more the study deserves an independent direct replication (Isager et al., 2023, 2024).
There are two challenges. The first is that, at the moment a method is first published, we do not know how popular it will become. So there is a period during which methods exist, and are used, without being criticized, as their popularity takes some time to become clear. The first paper on p-curve analysis was
published in 2014 (Simonsohn
et al., 2014), with an update in 2015 (Simonsohn
et al., 2015). A very compelling criticism of
p-curve that pointed out strong limitations was published in a preprint in
2017, and appeared in print 2 years later (Carter
et al., 2019). It convincingly showed that
p-curve does not work well under heterogeneity, and there often is
heterogeneity. Other methods, such as z-curve analysis, were developed and
showed better performance under heterogeneity (Brunner
& Schimmack, 2020).
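To give a rough sense of how such evaluations work, here is a toy simulation. It is not the actual p-curve estimator; it only mimics p-curve’s homogeneity assumption by fitting a single noncentrality parameter to significant z-statistics by maximum likelihood and converting that fit into an ‘average power’ estimate. All settings (the number of studies, the mean noncentrality, and the amount of heterogeneity) are arbitrary illustrative choices.

```python
# Toy simulation: fit a single (homogeneous) noncentrality parameter to
# significant z-statistics and compare its implied "average power" with the
# true average power of the selected studies. Not the real p-curve estimator.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
k, mean_ncp, tau = 20_000, 1.0, 1.5          # tau > 0 creates heterogeneity
ncp = rng.normal(mean_ncp, tau, size=k)      # true per-study noncentrality
z = rng.normal(ncp, 1.0)                     # observed z-statistics
crit = stats.norm.ppf(0.975)                 # 1.96 for two-sided alpha = .05
sel = z > crit                               # selection for significance (expected direction)

def neg_loglik(m):
    # truncated-normal log-likelihood assuming one shared noncentrality m
    return -np.sum(stats.norm.logpdf(z[sel] - m) - stats.norm.logsf(crit - m))

m_hat = minimize_scalar(neg_loglik, bounds=(0, 6), method="bounded").x
est_power = stats.norm.sf(crit - m_hat)             # implied by the homogeneous fit
true_power = stats.norm.sf(crit - ncp[sel]).mean()  # actual average for the selected studies
print(f"homogeneous fit: {est_power:.2f}, true average power: {true_power:.2f}")
# With tau = 0 the two numbers agree closely; with heterogeneity they drift
# apart, which is the kind of failure Carter et al. (2019) and Brunner &
# Schimmack (2020) document for estimators that assume a single effect.
```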
It seems a
bit of a stretch to say the p-curve method did not get scrutiny appropriate to
its popularity when there were many papers that criticized it, relatively
quickly (Aert et
al., 2016; Bishop & Thompson, 2016; Ulrich & Miller, 2018). What is fair to say is that
statisticians failed to engage with an incredibly important topic (tests for publication bias) that addressed a clear need in many scientific communities, as most of the criticism came from psychological methodologists. I fully agree that statisticians should have engaged more with this technique. I believe the reason they didn’t is that there is a real problem with the reward structure in statistics: statisticians get greater rewards inside their field by proposing a twelfth approach to computing confidence intervals around a non-parametric effect size estimate for a test that no one uses than by helping psychologists solve a problem they really need a solution for. Indeed,
for a statistician, publication bias is a very messy business, and there will
never be a sufficiently rigorous test for publication bias to get credit from
fellow statisticians. There are no beautiful mathematical solutions, no
creative insights, there is only the messy reality of a literature that is
biased by human actions that we can never adequately capture in a model. The
fact that empirical researchers often don’t know where to begin to evaluate the
reliability of claims in their publication-bias-ridden field is not something
statisticians care about. But they should care about it.
I hope statisticians will start to scrutinize things in a manner appropriate to their popularity. If a statistical technique is cited 500 times, three statisticians need to drop whatever they are doing and scrutinize the hell out of this technique. We can randomly select them for ‘statistician duty’.
Quality control in science
It might
come as a surprise, but I don’t actually think we should stop using methods
that are not adequately vetted by psychologists, or statisticians for that
matter, because I don’t want a science where authorities tell others which
methods they can use. Scrutiny is important, but we can’t know how extensively methods should be vetted, we don’t know how to identify experts, and everyone, including ‘experts’, is fallible. It is naïve to think ‘expert vetting’ will lead to clear answers about which methods we should and should not use. If we can’t even reach
agreement about the use of p-values, no one should believe we will ever reach
agreement about the use of methods to detect publication bias, which will
always be messy at best.
I like my
science free from authority arguments. Everyone should do their best to
criticize everyone else, and if they are able, themselves. Some people will be
in a better position to criticize some papers than others, but it is difficult
to predict where the most fatal criticism will come from. Treating statistics
papers published in a statistics journal as superior to papers in a psychology
journal is too messy a heuristic to be a good idea, and boils down to a form of elitism that
I can’t condone. Sometimes even a lowly 20% statistician can point out flaws in
methods proposed by card-carrying statisticians.
What we can
do better is to implement actual quality control. Journal peer review will not
suffice, because it is only as good as the two or three peers that happen to be
willing and available to review a paper. But it is a start. We should enable
researchers to see how well papers are peer reviewed by journals. Without
transparency, we can’t calibrate our trust (Vazire,
2017). Peer reviews should be open, for
all papers, including papers proposing new statistical methods.
If we want
our statistical methods to be of high quality, we need to specify quality
standards. Morey and Davis-Stober point out the limitations of simulation-based
tests of a method and convincingly argue for the value of evaluating the
mathematical properties of a testing procedure. If as a field we agree that an
evaluation of the mathematical properties of a test is desirable, we should
track whether this evaluation has been performed, or not. We could have a long
checklist of desirable quality control standards – e.g., a method has been
tested on real datasets, it has been compared to similar methods, those
comparisons have been performed objectively based on a well-justified set of
criteria, etc.
One could create a database that lists, for each method, which quality standards have and have not been met. If considered useful, the database
could also track how often a method is used, by tracking citations, and listing
papers that have implemented the method (as opposed to those merely discussing
the method). When statistical methods become widely used, the database would point
researchers to which methods deserve more scrutiny. The case of magnitude-based
inference in sport science reveals the importance of a public call for scrutiny
when a method becomes widely popular, especially when this popularity is
limited to a single field.
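As a purely hypothetical sketch (the field names, the method name, and all values below are invented for illustration; no such system is implied to exist), an entry in such a database could look something like this:

```python
# Hypothetical sketch of a database entry for tracking method quality standards.
from dataclasses import dataclass, field

@dataclass
class MethodRecord:
    name: str
    citations: int                                # rough proxy for popularity
    implementing_papers: list[str] = field(default_factory=list)
    checks: dict[str, bool] = field(default_factory=dict)  # quality standard -> met?

    def needs_scrutiny(self, citation_threshold: int = 500) -> bool:
        """Flag widely used methods for which not every tracked standard is met."""
        return self.citations >= citation_threshold and not all(self.checks.values())

record = MethodRecord(
    name="some-bias-detection-method",            # placeholder, not a real entry
    citations=1200,                                # made-up number
    checks={
        "mathematical properties evaluated": False,
        "tested on real datasets": True,
        "compared to similar methods": True,
    },
)
print(record.needs_scrutiny())  # True: widely cited, but not all standards are met
```

Who would maintain such a registry (journals, societies, or the community) is a separate question; the point is only that quality standards, and whether they have been checked, can be tracked explicitly.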
The more
complex methods are, the more limitations they have. This will be true for all
methods that aim to deal with publication bias, because the way scientists bias
the literature is difficult to quantify. Maybe as a field we will come to agree
that tests for bias are never accurate enough, and we will recommend that people just look at the distribution of p-values without performing a test. Alternatively,
we might believe that it is useful to have a testing procedure that too often
suggests a literature contains at least some non-zero effects, because we feel
we need some way to intersubjectively point out that there is bias in a
literature, even if this is based on an imperfect test. Such discussions
require a wide range of stakeholders, and the opinion of statisticians about
the statistical properties of a test is only one source of input in this
discussion. Imperfect procedures are implemented all the time when they are the best we have and doing nothing is not an option.
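As a sketch of what ‘just looking at the distribution of p-values’ could amount to, the toy simulation below bins the significant p-values from simulated two-sample t-tests; the sample size, effect size, and number of studies are made-up numbers.

```python
# Toy illustration of inspecting the distribution of significant p-values;
# all simulation settings are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def significant_pvalues(true_d, n_per_group=30, n_studies=2000):
    """Simulate two-sample t-tests and keep the p-values below .05."""
    pvals = []
    for _ in range(n_studies):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            pvals.append(p)
    return np.array(pvals)

bins = np.arange(0, 0.051, 0.01)  # .00-.01, .01-.02, ..., .04-.05
for d in (0.0, 0.5):
    counts, _ = np.histogram(significant_pvalues(d), bins=bins)
    print(f"true d = {d}: significant p-values per .01-wide bin: {counts.tolist()}")
# A right-skewed pattern (most mass in the lowest bin) is what a set of studies
# with true effects produces; a roughly flat pattern is what chance alone produces.
```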
Statistical
methods are rarely perfect from their inception, and all have limitations. Although
I understand the feeling, banning all tests that have not been adequately vetted
by an expert is inherently unscientific. Such a suggestion would destroy the
very core of science – an institution that promotes mutual criticism, while
accepting our fallibility. As Popper (1962) reminds us: “if we respect truth, we must search for it by
persistently searching for our errors: by indefatigable rational criticism, and
self-criticism.”
References
Aert, R. C. M. van, Wicherts, J. M., &
Assen, M. A. L. M. van. (2016). Conducting Meta-Analyses Based on p Values: Reservations and Recommendations for Applying p-Uniform and p-Curve. Perspectives
on Psychological Science, 11(5), 713–729.
https://doi.org/10.1177/1745691616650874
Bishop, D. V., &
Thompson, P. A. (2016). Problems in using p-curve analysis and text-mining to
detect rate of p-hacking and evidential value. PeerJ, 4, e1715.
Brunner, J., &
Schimmack, U. (2020). Estimating Population Mean Power Under Conditions of
Heterogeneity and Selection for Significance. Meta-Psychology, 4.
https://doi.org/10.15626/MP.2018.874
Carter, E. C., Schönbrodt, F. D., Gervais,
W. M., & Hilgard, J. (2019). Correcting for Bias in Psychology: A Comparison
of Meta-Analytic Methods. Advances in Methods and Practices in Psychological
Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196
Isager, P. M., Lakens, D., van Leeuwen, T.,
& van ’t Veer, A. E. (2024). Exploring a formal approach to selecting studies
for replication: A feasibility study in social neuroscience. Cortex, 171, 330–346.
https://doi.org/10.1016/j.cortex.2023.10.012
Isager, P. M., van Aert, R. C. M., Bahník,
Š., Brandt, M. J., DeSoto, K. A., Giner-Sorolla, R., Krueger, J. I., Perugini,
M., Ropovik, I., van ’t Veer, A. E., Vranka, M., & Lakens, D. (2023). Deciding
what to replicate: A decision model for replication study selection under
resource and knowledge constraints. Psychological Methods, 28(2),
438–451. https://doi.org/10.1037/met0000438
Lakens, D., &
Delacre, M. (2020). Equivalence Testing and the Second Generation P-Value. Meta-Psychology,
4, 1–11. https://doi.org/10.15626/MP.2018.933
Morey, R. D., & Davis-Stober, C. P. (2025). On the poor statistical properties of the P-curve meta-analytic procedure. Journal of the American Statistical Association, 1–19. https://doi.org/10.1080/01621459.2025.2544397
Peters, J. L.,
Sutton, A. J., Jones, D. R., Abrams, K. R., & Rushton, L. (2007).
Performance of the trim and fill method in the presence of publication bias and
between-study heterogeneity. Statistics in Medicine, 26(25),
4544–4562. https://doi.org/10.1002/sim.2889
Popper, K. R.
(1962). Conjectures and refutations: The growth of scientific knowledge.
Routledge.
Simonsohn, U.,
Nelson, L. D., & Simmons, J. P. (2014). P-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. Perspectives on
Psychological Science, 9(6), 666–681.
Simonsohn, U.,
Simmons, J. P., & Nelson, L. D. (2015). Better P-curves: Making P-curve
analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to
Ulrich and Miller (2015). Journal of Experimental Psychology. General, 144(6),
1146–1152. https://doi.org/10.1037/xge0000104
Terrin, N., Schmid,
C. H., Lau, J., & Olkin, I. (2003). Adjusting for publication bias in the
presence of heterogeneity. Statistics in Medicine, 22(13),
2113–2126. https://doi.org/10.1002/sim.1461
Ulrich, R., &
Miller, J. (2018). Some properties of p-curves, with an application to gradual
publication bias. Psychological Methods, 23(3), 546–560.
https://doi.org/10.1037/met0000125
Vazire, S. (2017).
Quality Uncertainty Erodes Trust in Science. Collabra: Psychology, 3(1), 1.
https://doi.org/10.1525/collabra.74