The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, October 3, 2024

Andrew Huberman vs. Decoding the Gurus (Metascience course assignment)

Below I am providing the first assignment in a new Metascience course I am teaching at Eindhoven University of Technology. The goal of the assignment is to teach students to critically evaluate claims scientists make. 


Assignment 1: Andrew Huberman vs. Decoding the Gurus

 

Andrew Huberman is an associate professor at Stanford University who hosts a popular podcast called ‘Huberman Lab’. His podcast is one of the most listened-to podcasts in the world, and he has more than 5 million subscribers on YouTube and more than 6 million followers on Instagram. He discusses science and science-based tools for everyday life, focusing on physical and mental health. Before starting the main part of this assignment, answer the following two questions.

 

Question 1:

a) Which factors increase your trust in Andrew Huberman as a reliable source of information on topics surrounding physical and mental health?

b) Which factors decrease your trust?

Feel free to use the internet to form an opinion.

 

Question 2:

On a scale from 1 (not at all reliable) to 10 (extremely reliable), how reliable do you consider Andrew Huberman to be as a source of information on topics surrounding physical and mental health?

 

As indicated on Wikipedia, Andrew Huberman’s podcast “has attracted criticism for promoting poorly supported health claims”. In this assignment, you will reflect on whether and why Andrew Huberman promotes poorly supported health claims. More generally, you will reflect on a number of factors that can help you to evaluate if information people provide about scientific findings is reliable.

 

The study material for this assignment is podcast episode 85 of “Decoding the Gurus” by Christopher Kavanagh and Matthew Browne called “Andrew Huberman and Peter Attia: Self-enhancement, supplements & doughnuts?” released on the 9th of November 2023.

You can listen to the episode here: https://decoding-the-gurus.captivate.fm/episode/andrew-huberman-and-peter-attia-optimising-your-pizza-binges. Note that most Decoding the Gurus episodes are very long. The section you need to listen to for this episode starts at 1 hour, 46 minutes, and 50 seconds. If you listen to the end, it will take 1 hour and 26 minutes. Before you listen, read through the questions you will have to answer about the podcast below (especially question 5).  

Although it is not necessary to read this information, the paper Huberman discusses is: https://www.biorxiv.org/content/10.1101/2022.07.15.500226v2. The paper was published two years later, but it is in a journal we do not have access to because the subscription fees are too high, so we cannot read the final version of the scientific research these authors did (a good reminder of why open access publication is important).

 

Question 3:

Which criteria for the quality of scientific research does Andrew Huberman rely on? In the episode he remarks how the study is not peer reviewed, and in other episodes he often discusses whether a study appeared in a peer reviewed journal (and sometimes if the journal is considered prestigious). Do you think this is a good criterion of scientific quality? Which aspects make this a good criterion? Which aspects do not make this a good criterion?

a) I believe the following aspects make this a good criterion:

b) I believe the following aspects do not make this a good criterion:

c) My overall evaluation about whether a study being peer reviewed or not is a good criterion for scientific quality is:

 

Question 4:

Another criterion Andrew Huberman uses to evaluate whether a finding can be trusted is if there are multiple published articles that show a similar effect. Which aspects make this a good criterion? Which aspects do not make this a good criterion? The section in the textbook on publication bias might help to reflect on this question: https://lakens.github.io/statistical_inferences/12-bias.html#sec-publicationbias

a) I believe the following aspects make this a good criterion:

b) I believe the following aspects do not make this a good criterion:

c) My overall evaluation about whether the presence of multiple studies in the literature is a good criterion for scientific quality is:

 

Question 5:

a) Which criticisms do Christopher Kavanagh and Matthew Browne raise of the study Huberman discusses?

b) Which criticisms do the podcast hosts raise about how Huberman presents the study?

c) Which warning signs about past studies by the same lab do the podcast hosts raise?

 

Question 6:

The podcast hosts discuss the ‘dead salmon’ study. I agree with podcast host Christopher Kavanagh that people interested in metascience should know about this study. It led to lasting changes in the data analysis of fMRI studies. A similar point was made in a full paper, which you can read here. The title of the paper is “Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition”. The original title of this paper when submitted to the journal was “Voodoo Correlations in Social Neuroscience”. The peer reviewers did not like this title, and the authors had to change it before publication, but it is still often referred to as the ‘voodoo correlations’ paper, together with the ‘dead salmon’ poster. Read through the study (which was presented as a poster at a conference, not as a full paper). It is not intended as a serious paper. What is the main point of the poster? A high-resolution version is available here.






Question 7:

Huberman discusses the power analysis of the study, but does not criticize it. Below, you can find the power analysis in the original study. The authors plan to detect an effect of d = 0.69, which is as large as the effect of reward learning observed in an earlier study (a small sketch of what such a power analysis implies for the required sample size follows after question 7b). The following two questions are difficult, and there is not a lot of accessible reading material in the literature yet to help you. Some information to help you can be found in https://lakens.github.io/statistical_inferences/06-effectsize.html#interpreting-effect-sizes and the references in that section.

a) How plausible do you think it is that the placebo effect would have an effect size as large as the effect for reward learning?

b) How large should an effect be for an individual to be aware of it?
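
For reference, the sample size implied by such a power analysis can be computed with base R’s power.t.test function. This is only a sketch, assuming a two-sided independent-samples t-test with an alpha level of 0.05 and 80% power – the study’s actual design and power target may have differed.

# Sketch: sample size needed to detect d = 0.69 (assumed: two-sided
# independent-samples t-test, alpha = .05, 80% power).
power.t.test(delta = 0.69, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# With sd = 1, 'delta' equals the standardized difference (Cohen's d);
# the output lists the required sample size per group.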

 

Question 8:

a) Do you think Andrew Huberman is overclaiming at the end of the podcast about possible applications of this effect? Is he overhyping it?

b) How do you think the studies should have been communicated to a general audience?

 

Question 9:

It is not possible to ask the following question in any other way than to make it a loaded question. It is clear what I think about this topic, as I chose to make this assignment. Nevertheless, feel free to disagree with my beliefs.

a) Is Andrew Huberman’s understanding of statistics (and of red flags when reading the results of a study) strong enough to adequately weigh the evidence in studies?

b) How well should science communicators be able to interpret the evidence underlying scientific claims in the literature, for example through adequate training in research methods and statistics?

c) How well should you be trained in research methods and statistics to be able to weigh the evidence in research yourself?

 

Question 10:

After completing the assignment, we will revisit question 2 by asking you once more: On a scale from 1 (not at all reliable) to 10 (extremely reliable), how reliable do you consider Andrew Huberman to be as a source of information on topics surrounding physical and mental health?

 

 

 

Further reading and listening

Additional episodes by Decoding the Gurus on Andrew Huberman:

Episode 81: Andrew Huberman: Forest Bathing in Negative Ions https://decoding-the-gurus.captivate.fm/episode/andrew-huberman-forest-bathing-in-negative-ions (This starts with some kind words about our own podcast, Nullius in Verba).

Episode 90: Mini-Decoding: Huberman on the Vaccine-Autism Controversy https://decoding-the-gurus.captivate.fm/episode/mini-decoding-huberman-on-vaccine-autism-controversy

In Dutch, see https://www.youtube.com/watch?v=KHnQK6wliJU for extra information.

 


Saturday, September 21, 2024

The distinction between logical justifications and empirical justifications: A reply to Prof. Hullman

In my previous blog post I explained why debates about whether or not we should preregister will not be solved through empirical means. This blog was inspired by a preprint by my good friend Leonie Dudda, which contains a scoping review of interventions to improve replicability and reproducibility. The scoping review finds that authors concluded interventions had positive effects in 60 out of 104 research conclusions (and negative effects in only 1). This is good news. At the same time, evidence for interventions was often lacking. I argued in my blog that this is not problematic for all interventions, as it would probably be better if my peers could provide a logical argument for why they preregister. I strongly stressed the importance of coherence – scientists should work in a way where their methods, aims of science, and the theories they work on are coherently aligned. I've taken this idea from Larry Laudan (see the figure below, from Laudan, 1986), who provides what I think is the best take on how we deal with disagreements in science. Scientists disagree about many things. Should we use Bayesian statistics, or frequentist statistics? The main thing I learned from discussing this on Twitter for a decade is that there is no universal answer. The only answer is conditional: if your aim is X, then given a valid and logical justification, you use method Y. Laudan refers to this as the triadic network of justification. And if you know me, you know I would like scientists to #JustifyEverything.



Prof Hullman read my blog, and in a blog post writes that she is left “with more questions than answers”. Because the point I was making is so important, I will explain it in more detail. Hullman titles her blog post ‘Getting a pass on evaluating ways to improve science’. One of the main points in my blog was to delineate when proposed improvements do not get a ‘free pass’ (i.e., when they need to be empirically justified). These are the improvements that are not ‘principled’ – they are not directly tied to an aim in science. I wrote “After the 2010 crisis in psychology, scientists did make changes to how they work. Some of these changes were principled, others less so. For example, badges were introduced for certain open science practices, and researchers implementing these open science practices would get a badge presented alongside their article. This was not a principled change, but a nudge to change behavior.” And then I said “And for some changes to science, such as the introduction of Open Science Badges, there might not be any logical justifications (or if they exist, I have not seen them). For those changes, empirical justifications are the only possibility.”

 

But some discussions we have in science are not empirical disagreements. They are disagreements in philosophy of science. Prof Hullman agrees with me that “Logic is obviously an important part of rigor, and I can certainly relate to being annoyed with the undervaluing of logic in fields where evidence is conventionally empirical” but thinks that my arguments for preregistration were not just logical arguments based on a philosophy of science. She writes “Beyond your philosophy of scientific progress, it comes down to the extent to which you think that scientists owe it to others to “prove” that they followed the method they said they did. It’s about how much transparency (versus trust) we feel we owe our fellow scientists, not to mention how committed we are to the idea that lying or bad behavior on the part of scientists are the big limiter of scientific progress.” The point Prof Hullman makes does not go ‘beyond’ philosophy of science, because whether and how much we should trust scientists is a core topic in social epistemology. It is part of your philosophy of science. As I wrote “This in itself is not a sufficient argument for preregistration, because there are many procedures that we could rely on. For example, we can trust scientists. If they do not say anything about flexibly analyzing their data, we can trust that they did not flexibly analyze their data. You can also believe that science should not be based on trust. Instead, you might believe that scientists should be able to scrutinize claims by peers, and that they should not have to take their word for it: Nullius in Verba. If so, then science should be transparent. You do not need to agree with this, of course”. So the point really is one about philosophy of science, and we can make certain logical arguments about which practices follow from certain philosophies, as I did in my blog post.

 

Prof Hullman writes “It reads a bit as if it’s a defense of preregistration, delivered with an assurance that this logical argument could not possibly be paralleled by empirical evidence: “A little bit of logic is worth more than two centuries of cliometric metatheory.” I am not providing a ‘defense’ of preregistration - that is not the right way to think about this topic. I simply pointed out that your aims can logically justify your methods. For example, if my aim is to generate knowledge by withstanding criticism, then I need to be transparent about what I have done. Note the ‘if-then’ relationship. One of my main points was to get empirical scientists to realize the difference between a logical justification and an empirical justification.

 

Then Prof Hullman makes a big, but very insightful, mistake. She writes “He argues that all rational individuals who agree with the premise (i.e., share his philosophical commitments) should accept the logical view, whereas empirical evidence has to be “strong enough” to convince and may still be critiqued. And so while he seems to start out by admitting that we’ll never know if science would be better if preregistration was ubiquitous, he ends up concluding that if one shares his views on science, it’s logically necessary to preregister for science to improve.” She confuses the two things my post set out to educate scientists about: there is a difference between implementing a change because you claim it will improve science, and implementing a change because it logically follows from assumptions. I guess I did not do a good job explaining the distinction.

 

As I wrote: “There are two ways to respond to the question why scientific practices need to change. The first justification is ‘because science will improve’. This is an empirical justification. The world is currently in a certain observable state, and if we change things about our world, it will be in a different, but better, observable state. The second justification is ‘because it logically follows’.” Hullman’s statement that “he ends up concluding that if one shares his views on science, it’s logically necessary to preregister for science to improve” is exactly what I was *not* saying. Let me explain this again, because if Prof Hullman did not understand this, others might be confused as well.

 

It is essential to distinguish a coherent way of working from a better way of working. Is it better to be a subjective Bayesian, ignore error control, and aim to update beliefs until all scientists rationally believe the same thing? Or is it better to be a frequentist, make error-controlled claims that you do or do not believe, and create a collection of severely tested claims? As we have seen in the last century, there will never be evidence about the question which of these two approaches is an improvement. An all-knowing entity could tell us, but as mere mortals, we are unable to answer the question of which of these approaches is an improvement.

 

If a subjective Bayesian changes their research practice from using a ‘default’ prior in their work, to actually quantifying their prior beliefs, their research practice becomes more *coherent*. After all, the aim is to update *your* beliefs, not some generic default belief. Maybe the use of subjective priors will slow down knowledge generation compared to the use of default priors. It might not be an improvement. But it is more coherent, and in the absence of having an empirical guarantee about the single best way to do science, my argument is that we should at least be coherent as scientists. We preregister because it makes our approach to science more *coherent*, and we evaluate coherence based on logical arguments, not based on empirical data.

 

In my blog, I wrote that empirical evidence can be useful to convince some people to implement policies. I think this section was too short to clearly explain my point. I say this because Prof Hullman writes “It strikes me as contradictory to say that it is a flaw that “Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments” while at the same time saying it’s ok to produce weak empirical evidence to convince some people.” In the comments to the blog post Prof Hullman writes “I suspect he knows that his logical argument is conditional on a lot of assumptions but he wants to sell it as something more universal. That would be one explanation for why he then seems to walk it back by adding the part about how empirical evidence sometimes has value.” Prof Hullman’s main worry about my blog post seems to be: “For example, is the implication that logical justification should be enough for journals to require preregistration to publish, or that lack of preregistration should be valid ground for rejecting a paper that makes claims requiring error control?” Because this last point is exactly what I argued against, I must not have explained myself clearly enough. Let’s try again.

 

A logical justification can never lead to a policy such as ‘require preregistration to publish’ or ‘a lack of preregistration is grounds for rejecting a paper’. Logical arguments as I discussed have a premise: ‘if you aim to do X’. All studies that do not aim to do X do not have to use method Y. My blog is not just a reminder of the importance of a coherent approach to science, but also a reminder for the people who do not want to preregister to develop a logically coherent ‘if-then’. What are your aims, if not to make error-controlled claims? Which methods are logically coherent with those aims? Write this down clearly, address criticism you will get from peers, sharpen your argument, implement your ideas formally in your papers, and you are all set and never have to worry about not preregistering. Just as I have developed a coherent argument for preregistration, tied to a specific philosophy of science over the last decade, you should – if you want to be taken seriously – have a well-developed alternative philosophy of why preregistration is not in line with your aims.

 

If policy makers were smart and rational they would create policies based on logical justifications where possible. Regrettably, policy makers are typically not very smart and rational. Here is the kind of policy I want to see: “If preregistration is a logically coherent step in your scientific method, we want you to implement it.” This is the same logically principled justification of a research practice as ‘if we think scientists should discover the truth, they should not lie’. The policy requires scientists to act in a logically coherent manner. In practice, this means that if you set an alpha level, control your type 2 error rate through a power analysis, and make claims based on statistical tests that have sufficiently low error rates, you have decided to adopt Mayo’s error-statistical philosophy of science. As I explained in my blog, if we add a second assumption to the aims of science, namely that the aim is to make claims that can withstand scrutiny by peers, then it logically follows that we adopt a procedure that enables scrutiny. Of course, as Laudan’s figure above illustrates, the methods we choose should ‘exhibit the realizability’ of our aims. If we believe it is important to scrutinize claims, but the only way to achieve it would be to have every scientist in the world wear a body-cam, and we all watch all footage related to a study before believing the claim, the aim of scrutiny might not be ‘realizable’. But preregistration can be implemented in practice, so the method and aim can be aligned in practice.

 

I would hope that if scientists embrace my view that there is a distinction between logical justifications for preregistration and empirical justifications for preregistration, they will actually gain a very strong argument to push back against the universal implementation of preregistration. All you need to do is pursue different aims than error-controlled claims, or develop a different coherent approach to scientific knowledge generation than the dominant approach we now see in psychology based on Mayo’s error-statistical framework, and any rational editor should accept your arguments.

 

Now, I did not want to completely dismiss empirical research on the consequences of interventions to improve science. I said it could be useful for implementing policies. I wrote: “I think [empirical] work can be valuable, and it might convince some people, and it might even lead to a sufficient evidence base to warrant policy change by some organizations. After all, policies need to be set anyway, and the evidence base for most of the policies in science are based on weak evidence, at best.” But this short section led to confusion.

 

Let me make some things clearer. First, in this example, I am not talking about the specific intervention to adopt preregistration. A policy about preregistration can be implemented based on logical arguments, and if it is implemented, it should be implemented as I stated above: “If you aim to do X, and you believe a principle in science is Y, you need to preregister”. But there are many policies that need to be set for everyone, regardless of their philosophy of science. An example would be the implementation of badges, which, as I mentioned in my blog, cannot be justified logically. Furthermore, badges apply to every article in a journal. You get a preregistration badge, or not. Although in principle we could have a badge for preregistration, a badge for a logically coherent argument why you do not need preregistration, and no badge, this would go beyond the simple nudge idea behind badges. Empirical data can be useful if researchers want to convince editors to implement badges. Prof Hullman writes “It strikes me as contradictory to say that it is a flaw that “Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments” while at the same time saying it’s ok to produce weak empirical evidence to convince some people.” She does not summarize what I wrote correctly. I do not say it is ‘ok to produce weak empirical evidence to convince some people’. I am simply saying this is how some people choose to go about things, and in the absence of strong empirical evidence, and given the political interests that some scientists have, they will use empirical arguments to convince others, and that can work. I much prefer a logical basis for policies, and I prefer not to engage in policies that do not have a logical basis (for that reason, I also do not like open science badges). But often, such a logical basis is not available, strong evidence is not available, and there are people who want to change the status quo.

 

My blog had the goal of making scientists aware of the possibility of developing logical arguments – given some premises – for preregistration. I think convincing logical arguments exist, and I have developed them for one (arguably the dominant) error-statistical philosophy in my own discipline. A lack of evidence for preregistration is not problematic, and if you ask me, realistically we should not expect it to emerge. Anyone who carefully reads my blog will see it provides ammunition for scientists to fight back against exactly the overgeneralized policies Prof Hullman is worried about (i.e., that you need to preregister to get published). The ‘free pass’ we should be worried about in science is not the absence of empirical data, but the absence of a logical argument.

Wednesday, September 4, 2024

Why I don’t expect to be convinced by evidence that scientific reform is improving science (and why that is not a problem)

For roughly a decade there has been sufficient momentum in science to not just complain about things scientists do wrong, but to actually do something about it. When social psychologists declared a replication crisis in the 1960s and 1970s, not much changed (Lakens, 2023). They also complained about bad methodology, flexibility in the data analysis, and a lack of generalizability and applicability, but no concrete actions to improve things emerged from this crisis.

 

After the 2010 crisis in psychology, scientists did make changes to how they work. Some of these changes were principled, others less so. For example, badges were introduced for certain open science practices, and researchers implementing these open science practices would get a badge presented alongside their article. This was not a principled change, but a nudge to change behavior. There were also more principled changes. For example, if researchers say they make error-controlled claims at a 5% alpha level, they should make error controlled claims at a 5% alpha level, and they should not engage in research practices that untransparently inflate the Type 1 error rate. The introduction of a practice such as preregistration had the goal to prevent untransparently inflating Type 1 error rates, by making any possible inflation transparent. This is a principled change because it increases the coherence of research practices.
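
To make concrete what ‘untransparently inflating the Type 1 error rate’ means, here is a minimal simulation sketch in R with made-up numbers (two groups of 30, no true effect, three independent outcome measures, and only the smallest p value is reported). The nominal 5% error rate roughly triples:

# Sketch: flexibly picking the 'best' of three outcomes inflates the
# Type 1 error rate (assumed: two groups of n = 30, no true effect).
set.seed(1)
false_positive <- replicate(10000, {
  p <- replicate(3, t.test(rnorm(30), rnorm(30))$p.value)
  min(p) < 0.05
})
mean(false_positive)  # approximately 0.14 instead of the nominal 0.05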

 

As these changes in practices became more widely adopted, a large group of researchers was confronted with requirements such as having to justify their sample size, indicate whether they deserved an open science badge, or make explicit that a claim was exploratory (i.e., not error controlled). As more people were confronted with these changes, the absolute number of people critical of these changes increased. A very reasonable question to ask as a scientist is ‘Why?’, and so people asked: ‘Why should I do this new thing?’.

 

There are two ways to respond to the question why scientific practices need to change. The first justification is ‘because science will improve’. This is an empirical justification. The world is currently in a certain observable state, and if we change things about our world, it will be in a different, but better, observable state. The second justification is ‘because it logically follows’. This is, not surprisingly, a logical argument. There is a certain way of working that is internally inconsistent, and there is a way of working that is consistent.

 

An empirical justification requires evidence. A logical justification requires agreement with a principle. If we want to justify preregistration empirically, we need to provide evidence that it improved science. If you want to disagree with the claim that preregistration is a good idea, you need to disagree with the evidence. If we want to justify preregistration logically, we need people to agree with the principle that researchers should be able to transparently evaluate how coherently their peers are acting (e.g., that they are not saying they are making an error-controlled claim, when in actuality they did not control their error rate).

 

Why evidence for better science is practically impossible.

Although it is always difficult to provide strong evidence for a claim, some things are more difficult to study than others. Providing evidence that a change in practice improves science is so difficult, it might be practically impossible. Paul Meehl, one of the first meta-scientists, developed the idea of cliometric meta-theory, or the empirical investigation of which theories are doing better than others. He proposes to follow different theories for something like 50 years, and see which one leads to greater scientific progress. If we want to provide evidence that a change in practice improves science, we need something similar. So, the time scale we are talking about makes the empirical study of what makes science ‘better’ difficult.

But we also need to collect evidence for a causal claim, which requires excluding confounders. A good start would be to randomly assign half of the scientists to preregister all their research for the next fifty years, and order the other half not to. This is the second difficulty: such an experiment is practically impossible, so we cannot go beyond observational data, which will always have confounds. But even if we were able to manipulate something, the assumption that the control condition is not affected by the manipulation is too likely to be violated. The people who preregister will – if they preregister well – have no flexibility in the data analysis, and their alpha levels are controlled. But the people in the control condition know about preregistration as well. After p-hacking their way to a p = 0.03 in Study 1, p = 0.02 in Study 2, and p = 0.06 (marginally significant) in Study 3, they will look at their studies and wonder if their peers will take this set of studies seriously. Probably not. So, they develop new techniques to publish evidence for what they want to be true – for example by performing large studies with unreliable measures and a tiny sprinkle of confounds, which consistently yield low p-values.

So after running such studies for 50 years each, we end up with evidence that is not particularly difficult to poke holes in. We would have invested a huge amount of effort in something we should know from the outset will yield very little gain.

 

As we wrote in our recent paper “The benefits of preregistration and Registered Reports” (Lakens et al., 2024):

 

It is difficult to provide empirical support for the hypothesis that preregistration and Registered Reports will lead to studies of higher quality. To test such a hypothesis, scientists should be randomly assigned to a control condition where studies are not preregistered, a condition where researchers are instructed to preregister all their research, and a condition where researchers have to publish all their work as a Registered Report. We would then follow the success of theories examined in each of these three conditions in an approach Meehl (2004) calls cliometric metatheory by empirically examining which theories become ensconced, or sufficiently established that most scientists consider the theory as no longer in doubt. Because such a study is not feasible, causal claims about the effects of preregistration and Registered Reports on the quality of research are practically out of reach.

 

At this time, I do not believe there will ever be sufficiently conclusive empirical evidence for causal claims that a change in scientific practice makes science better. You might argue that my bar for evidence is too high. That conclusive empirical evidence in science is rarely possible, but that we can provide evidence from observational studies – perhaps by attempting to control for the most important confounds, measuring decent proxies of ‘better science’ on a shorter time scale. I think this work can be valuable, and it might convince some people, and it might even lead to a sufficient evidence base to warrant policy change by some organizations. After all, policies need to be set anyway, and the evidence base for most of the policies in science are based on weak evidence, at best.

 

A little bit of logic is worth more than two centuries of cliometric metatheory.

 

Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments. We published the nine studies on precognition by Daryl Bem because they followed standard empirical methods and yielded significant p values, even when one of the reviewers pointed out that the paper should be rejected because it logically violated the laws of physics. Psychologists too often assign more weight to a p value than to logical consistency.

And yet, a little bit of logic will often yield much greater returns, with much less effort. A logical justification of preregistration does not require empirical evidence. It just needs to point out that it is logically coherent to preregister. Logical propositions have premises and a conclusion: If X, then Y.

In meta-science, logical arguments are of the form ‘if we have the goal to generate knowledge following a certain philosophy of science, then we need to follow certain methodological procedures’. For example, if you think it is a fun idea to take Feyerabend seriously and believe that science progresses in a system that cannot be captured by any rules, then anything goes. Now let’s try a premise that is not as stupid as the one proposed by Feyerabend, and entertain the idea that some ways of doing science are better than others. For example, you might believe that scientists generate knowledge by making statistical claims (e.g., ‘we reject the presence of a correlation larger than r = 0.1’) that are not too often wrong. If this aligns with your philosophy of science, you might think the following proposition is valid: ‘If a scientist wants to generate knowledge by making statistical claims that are not too often wrong, then they need to control their statistical error rates’. This puts us in Mayo’s error-statistical philosophy. We can change the previous proposition, which was written at the level of the individual scientist, if we believe that science is not an individual process, but a social one. A proposition that is more in line with a social epistemological perspective would be: “If the scientific community wants to generate knowledge by making statistical claims that are not too often wrong, then they need to have procedures in place to evaluate which claims were made by statistically controlling error rates”.

 

This in itself is not a sufficient argument for preregistration, because there are many procedures that we could rely on. For example, we can trust scientists. If they do not say anything about flexibly analyzing their data, we can trust that they did not flexibly analyze their data. You can also believe that science should not be based on trust. Instead, you might believe that scientists should be able to scrutinize claims by peers, and that they should not have to take their word for it: Nullius in Verba. If so, then science should be transparent. You do not need to agree with this, of course, just as you did not have to agree with the premise that the goal of science is to generate claims that are not too often wrong. If we include this premise, we get the following proposition: “If the scientific community wants to generate knowledge by making statistical claims that are not too often wrong, and if scientists should be able to scrutinize claims by peers, then they need to have procedures in place for peers to transparently evaluate which claims were made by statistically controlling error rates”.

Now we have a logical argument for preregistration as one change in the way scientists work, because it makes it more coherent. Preregistration is not the only possible change to make science coherent. For example, we could also test all hypotheses in the presence of the entire scientific community, for example by live-streaming and recording all research that is being done. This would also be a coherent improvement to how scientists work, but it would also be more cumbersome. The hope is that preregistration, when implemented well, is a more efficient change to make science more coherent.

 

Should logic or evidence be the basis of change in science?

 

Which of the two justifications for changes in scientific practice is more desirable? A benefit of evidence is that it can convince all rational individuals, as long as it is strong enough. But evidence can be challenged, especially when it is weak. This is an important feature of science, but when disagreements about the evidence base cannot be resolved, it quickly leads to ‘even the experts do not agree about what the data show’. A benefit of logic is also that it should convince rational individuals, as long as they agree with the premise. But not everyone will agree with the premise. Again, this is an important feature of science. It might be a personal preference, but I actually like disagreements about the premises of what the goals of science are. Where disagreements about evidence are temporarily acceptable, but in the long run undesirable, disagreements about the goals of science are good for the diversity in science. Or at least that is a premise I accept.

 

As I see it, the goal should not be to convince people to implement certain changes to scientific practice per se, but to get scientists to behave in a coherent manner, and to implement changes to their practice if this makes their practice more coherent. Whether practices are coherent or not is unrelated to whether you believe practices are good, or desirable. Those value judgments are part of your decision to accept or reject a premise. You might think it is undesirable that scientists make claims, as this will introduce all sorts of undesirable consequences, such as confirmation bias. Then, you would choose a different philosophy of science. That is fine, as long as you then implement research practices that logically follow from the premises. Empirical research can guide you towards or away from accepting certain premises. For example, meta-scientists might describe facts that make you believe scientists are extremely trustworthy, and transparency is not needed. Meta-scientists might also point out ways in which research practices are not coherent with certain premises. For example, if we believe transparency is important, but most researchers selectively publish results, then we have identified an incoherency that we might need to educate people about, or we need to develop ways for researchers to resolve this incoherency (such as developing preprint servers that allow researchers to share all results with peers). And for some changes to science, such as the introduction of Open Science Badges, there might not be any logical justifications (or if they exist, I have not seen them). For those changes, empirical justifications are the only possibility.

 

Conclusion

 

As changes to scientific practice become more institutionalized, it is only fair that researchers ask why these changes are needed. There are two possible justifications: One based on empirical evidence, and one on logically coherent procedures that follow from a premise. Psychologists might intuitively believe that empirical evidence is the better justification for a practice. I personally doubt it. I think logical arguments will often provide a stronger foundation, especially when scientific evidence is practically difficult to collect.

Tuesday, July 23, 2024

New paper: The benefits of preregistration and Registered Reports.

With my PhD students Cristian Mesquida and Sajedeh Rasti, and former lab visitor Max Ditroilo, we published a new paper on preregistration and Registered Reports. We aim to provide a state-of-the-art overview of the ideas behind, and the metascience on, preregistration and Registered Reports. https://www.tandfonline.com/doi/full/10.1080/2833373X.2024.2376046

We explain the link between preregistration and severe testing, and how systematic bias might reduce the severity of tests. Preregistration is a tool to allow others to evaluate the severity of tests.

We provide and defend a narrower use case of preregistration. In essence, we argue you can only preregister level 6 and level 5 studies from the table in the Peer Community In guide for authors: https://rr.peercommunityin.org/help/guide_for_authors



We deviate from the current consensus, but in the conviction that our use of the term preregistration is more principled, and will become the default in the future (just as the Preregistration+ badge would be seen as the only valid form of preregistration today). As our understanding changes, so do our definitions.


We summarize 18 surveys on research practices that reduce the severity of tests. You might have seen a previous version of this figure – this is the final published version, in case you want to re-use or cite it. More details on the studies in this figure are available from https://osf.io/sxg7q.



We carefully point out: “It is important to point out that the percentages presented here do not directly translate into the percentage of researchers who are engaging in these practices.” We wish we knew, but we just don’t know. 

We discuss cost-benefit analyses of preregistration, and conclude there are too many unknowns to determine if preregistration is beneficial. We also say it does not really matter, because the main reason to preregister is based on a normative argument.

We say: “researchers who test hypotheses from a methodological falsificationist approach to science should preregister their studies if they want a science that has intersubjectively established severely tested claims.” As always, we believe it is essential to be clear about your philosophy on scientific knowledge generation - not being clear about it can lead to a lot of discussion that will go nowhere (see Lakens, 2019).  

That means we also do not expect people who have different epistemological philosophies to preregister – nor is it a logical solution for exploratory research, or certain types of secondary data analysis. We feel it is important to point this out, because there are alternative approaches to argue a test is severe that are better suited for those studies: open lab notebooks, sensitivity analyses, robustness checks, independent replication. It is always important to use the right tool for the job - we do not want preregistration to be mindlessly overused. 

We discuss meta-scientific evidence that shows preregistration makes it possible to evaluate the severity of tests (and we cite some anecdotal examples). Of course, not all preregistrations are equally good yet – people need more training. 

We also engage with the most important criticisms of preregistration. Beyond the valid concern that the mere presence of a preregistration may be mindlessly used as a proxy for high quality, we identify conflicting viewpoints, several misunderstandings, and a general lack of empirical support for the criticisms that have been raised. I personally feel critics need to raise the bar if they want to be taken seriously. They should at the very least resolve the contradictory criticisms among themselves. They should also collect empirical data to test their claims.

I strongly expect this fourth paper (following Nosek & Lakens, 2014, Lakens, 2019, and Lakens, 2024) to be my last contribution to this topic. I have said all I want to say, and contributed all I can with this final paper.

 

Friday, February 9, 2024

Why Effect Sizes Selected for Significance are Inflated

Estimates based on samples from the population will show variability. The larger the sample, the closer our estimates will be to the true population values. Sometimes we will observe larger estimates than the population value, and sometimes we will observe smaller values. As long as we have an unbiased collection of effect size estimates, combining effect size estimates through a meta-analysis can increase the accuracy of the estimate. Regrettably, the scientific literature is often biased. Specifically, it is common that statistically significant studies (e.g., studies with p values smaller than 0.05) are published, while studies with p values larger than 0.05 remain unpublished (Ensinck & Lakens, 2023; Franco et al., 2014; Sterling, 1959). Instead of having access to all effect sizes, anyone reading the literature only has access to effects that passed a significance filter. This will introduce systematic bias in our effect size estimates.

To explain how selection for significance introduces bias, it is useful to understand the concept of a truncated or censored distribution. If we want to measure the average height of people in The Netherlands, we would collect a representative sample of individuals, measure how tall they are, and compute the average score. If we collect sufficient data, the estimate will be close to the true value in the population. However, if we collect data from participants who are on a theme park ride where people need to be at least 150 centimeters tall to enter, the mean we compute is based on a truncated distribution where only individuals taller than 150 cm are included. Shorter individuals are missing. Imagine we have measured the height of two individuals on the theme park ride, and they are 164 and 184 cm tall. Their average height is (164+184)/2 = 174 cm. Outside the entrance of the theme park ride is one individual who is 144 cm tall. Had we measured this individual as well, our estimate of the average height would be (144+164+184)/3 = 164 cm. Removing low values from a distribution leads to overestimation of the true value; removing high values leads to underestimation.
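
The same point can be shown with a short simulation. The numbers below are made up purely for illustration (a true mean height of 160 cm with a standard deviation of 15 cm, truncated at 150 cm):

# Sketch: removing all values below a cutoff inflates the estimated mean.
# Illustrative values: true mean 160 cm, SD 15 cm, cutoff 150 cm.
set.seed(1)
height <- rnorm(100000, mean = 160, sd = 15)
mean(height)                 # close to the true value of 160
mean(height[height >= 150])  # clearly higher: the truncated mean is inflated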

The scientific literature suffers from publication bias. Non-significant test results – based on whether a p value is smaller than 0.05 or not – are often less likely to be published. When an effect size estimate is 0 the p value is 1. The further removed effect sizes are from 0, the smaller the p value. All else being equal (e.g., studies have the same sample size, and measures have the same distribution and variability), if results are selected for statistical significance (e.g., p < .05) they are also selected for larger effect sizes. As small effect sizes will be observed with their corresponding probabilities, their absence will inflate effect size estimates. Every study in the scientific literature provides its own estimate of the true effect size, just as every individual provides their own estimate of the average height of people in a country. When these estimates are combined – as happens in meta-analyses in the scientific literature – the meta-analytic effect size estimate will be biased (or systematically different from the true population value) whenever the distribution is truncated. To achieve unbiased estimates of population values when combining individual studies in meta-analyses, researchers need access to the complete distribution of values – that is, all studies that are performed, regardless of whether they yielded a p value above or below 0.05.
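
A small simulation sketch shows the same mechanism for effect sizes. It assumes the scenario used in the figure discussed below (a true effect of d = 0.5, a two-sided t-test, and 50 observations per group), and compares the average observed effect size across all simulated studies with the average across only the statistically significant studies:

# Sketch: selecting studies with p < .05 inflates the average effect size.
# Assumed scenario: true d = 0.5, two-sided t-test, n = 50 per group.
set.seed(1)
sim <- replicate(10000, {
  x <- rnorm(50, mean = 0.5); y <- rnorm(50, mean = 0)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d
  c(d = d, sig = t.test(x, y)$p.value < 0.05)
})
mean(sim["d", ])                   # close to the true value of 0.5
mean(sim["d", sim["sig", ] == 1])  # noticeably larger: only significant studies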

In the figure below we see a distribution centered at an effect size of Cohen’s d = 0.5 for a two-sided t-test with 50 observations in each independent condition. Given an alpha level of 0.05, in this test only effect sizes larger than d = 0.4 will be statistically significant (i.e., all observed effect sizes in the grey area). The threshold at which observed effect sizes become statistically significant is determined by the sample size and the alpha level (and is not influenced by the true effect size). The white area under the curve illustrates Type 2 errors – non-significant results that will be observed if the alternative hypothesis is true. If researchers only have access to the effect size estimates in the grey area – a truncated distribution where non-significant results are removed – a weighted average effect size from only these studies will be upwardly biased.


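The d = 0.4 threshold in this example can be reconstructed from the critical t value; a minimal sketch for this design (two-sided test, alpha = 0.05, n = 50 per group):

# Sketch: the smallest observed Cohen's d that reaches p < .05 depends only
# on the alpha level and the sample size, not on the true effect size.
n <- 50
t_crit <- qt(0.975, df = 2 * n - 2)  # critical t value, two-sided alpha = .05
d_crit <- t_crit * sqrt(2 / n)       # convert the critical t to Cohen's d
d_crit                               # approximately 0.40
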
We can see this in the two forest plots visualizing meta-analyses below. In the top meta-analysis all 5 studies are included, even though studies C and D yield non-significant results (as can be seen from the fact that their 95% CIs overlap with 0). The estimated effect size based on all 5 studies is d = 0.4. In the bottom meta-analysis the two non-significant studies are removed, as would happen when there is publication bias. Without these two studies the estimated effect size in the meta-analysis, d = 0.5, is inflated. The extent to which meta-analytic estimates are inflated depends on the true effect size and the sample size of the studies.

 


The inflation will be greater the larger the truncated part of the distribution is, and the closer the true population effect size is to 0. In our example about the height of individuals, the inflation would be greater had we truncated the distribution by removing everyone shorter than 170 cm instead of 150 cm. If the true average height of individuals was 194 cm, removing the few people that are expected to be shorter than 150 cm (based on the assumption of normally distributed data) would have less of an effect on how much our estimate is inflated than when the true average height was 150 cm, in which case we would remove 50% of individuals. In statistical tests where results are selected for significance at a 5% alpha level, more data will be removed if the true effect size is smaller, but also when the sample size is smaller. If the sample size is smaller, statistical power is lower, and more of the values in the distribution (those closest to 0) will be non-significant.
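
A simulation sketch can make this dependence concrete by computing the average statistically significant effect size for a few illustrative combinations of true effect size and per-group sample size (the grid values are made up):

# Sketch: the average significant effect size for several true effect sizes
# and per-group sample sizes (illustrative grid; two-sided alpha = .05).
set.seed(1)
mean_sig_d <- function(true_d, n, nsim = 5000) {
  d_sig <- replicate(nsim, {
    x <- rnorm(n, mean = true_d); y <- rnorm(n, mean = 0)
    d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
    if (t.test(x, y)$p.value < 0.05) d else NA
  })
  mean(d_sig, na.rm = TRUE)
}
mean_sig_d(0.2, 20); mean_sig_d(0.2, 100)  # small true effect: large inflation
mean_sig_d(0.5, 20); mean_sig_d(0.5, 100)  # inflation shrinks as power grows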

Any single estimate of a population value will vary around the true population value. The effect size estimate from a single study can be smaller than the true effect size, even if studies have been selected for significance. For example, it is possible that the true effect size is 0.5, you have observed an effect size of 0.45, and only effect sizes smaller than 0.4 are truncated when selecting studies based on statistical significance (as in the figure above). At the same time, this single effect size estimate of 0.45 is inflated. What inflates the effect size is the long-run procedure used to generate the value. In the long run, effect size estimates based on a procedure where estimates are selected for significance will be upwardly biased. This means that a single observed effect size of d = 0.45 will be inflated if it is generated by a procedure where all non-significant effects are truncated, but it will be unbiased if it is generated from a distribution where all observed effect sizes are reported, regardless of whether they are significant or not. This also means that a single researcher cannot guarantee that the effect sizes they contribute to a literature will contribute to an unbiased effect size estimate: there needs to be a system in place where all researchers report all observed effect sizes to prevent bias. An alternative is to not have to rely on other researchers, and collect sufficient data in a single study to have a highly accurate effect size estimate. Multi-lab replication studies are an example of such an approach, where dozens of researchers together collect a large number (up to thousands) of observations.

The most extreme consequence of the inflation of effect size estimates occurs when the true effect size in the population is 0, but due to the selection of statistically significant results, only significant effects in the expected direction are published. Note that if all significant results are published (and not only effect sizes in the expected direction), 2.5% of studies will yield a Type 1 error in the positive direction, 2.5% will yield a Type 1 error in the negative direction, and the average effect size would actually be 0. Thus, as long as the true effect size is exactly 0, and all Type 1 errors are published, the effect size estimate would be unbiased. In practice, we see that scientists often do not simply publish all results, but only statistically significant results in the desired direction. An example of this is the literature on ego depletion, where hundreds of studies were published, most showing statistically significant effects, but unbiased large-scale replication studies revealed effect sizes of 0 (Hagger et al., 2015; Vohs et al., 2021).
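
A sketch of this scenario: with a true effect of zero and n = 50 per group (assumed values), averaging all significant results gives an estimate near zero, whereas averaging only the significant results in the positive direction yields a substantial effect where none exists:

# Sketch: true effect of 0. Publishing all significant results averages out;
# publishing only positive significant results creates a spurious effect.
set.seed(1)
sim <- replicate(20000, {
  x <- rnorm(50, mean = 0); y <- rnorm(50, mean = 0)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  c(d = d, sig = t.test(x, y)$p.value < 0.05)
})
d <- sim["d", ]; sig <- sim["sig", ] == 1
mean(d[sig])          # close to 0: positive and negative Type 1 errors cancel
mean(d[sig & d > 0])  # roughly 0.45 to 0.5: a sizeable effect that is not there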

What can be done about the problem of biased effect size estimates if we mainly have access to the studies that passed a significance filter? Statisticians have developed approaches to adjust biased effect size estimates by taking the truncated distribution into account (Taylor & Muller, 1996). This approach has recently been implemented in R (Anderson et al., 2017). Implementing this approach in practice is difficult, because we never know for sure if an effect size estimate is biased, and if it is biased, how much bias there is. Furthermore, selection based on significance is only one source of bias; researchers who selectively report significant results may also engage in additional problematic research practices, such as selectively reporting outcomes and analyses, which are not accounted for in the adjustment. Other researchers have referred to this problem as a Type M error (Gelman & Carlin, 2014; Gelman & Tuerlinckx, 2000) and have suggested that researchers always report the average inflation factor of effect sizes. I do not believe this approach is useful. The Type M error is not an error, but a bias in estimation, and it is more informative to compute the adjusted estimate based on a truncated distribution, as proposed by Taylor and Muller in 1996, than to compute the average inflation for a specific study design. If effects are on average inflated by a factor of 1.3 (the Type M error), it does not mean that the observed effect size is inflated by this factor, and the truncated effect size estimator by Taylor and Muller will provide researchers with an actual estimate based on their observed effect size. Type M errors might have a function in education, but they are not useful for scientists (I will publish a paper on Type S and M errors later this year, explaining in more detail why I think neither is a useful concept).
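
For completeness, the average inflation factor for a given design can be approximated by simulation. The sketch below uses illustrative values (a true d = 0.3 and n = 50 per group) and computes the ratio between the average significant effect size and the true effect size; it is meant to illustrate the concept, not to reproduce the exact procedures proposed by Gelman and Carlin or by Taylor and Muller:

# Sketch: average inflation of significant effect sizes for one design
# (illustrative values: true d = 0.3, n = 50 per group, two-sided alpha = .05).
set.seed(1)
d_sig <- replicate(20000, {
  x <- rnorm(50, mean = 0.3); y <- rnorm(50, mean = 0)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  if (t.test(x, y)$p.value < 0.05) abs(d) else NA
})
mean(d_sig, na.rm = TRUE) / 0.3  # average inflation factor, roughly 1.7 here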

Of course, the real solution to bias in effect size estimates due to significance filters that lead to truncated or censored distributions is to stop selectively reporting results. Designing highly informative studies that have high power both to reject the null hypothesis and to reject the smallest effect size of interest in an equivalence test is a good starting point. Publishing research as Registered Reports is even better. Eventually, if we do not solve this problem ourselves, it is likely that we will face external regulatory actions that force us to add all studies that have received ethical review board approval to a public registry, and to update the registration with the effect size estimate, as is done for clinical trials.


References:

Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Ensinck, E., & Lakens, D. (2023). An Inception Cohort Study Quantifying How Many Registered Studies are Published. PsyArXiv. https://doi.org/10.31234/osf.io/5hkjz

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. https://doi.org/10.1126/SCIENCE.1255484

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651.

Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15(3), 373–390. https://doi.org/10.1007/s001800000040

Hagger, M. S., Chatzisarantis, N. L., Alberts, H., Anggono, C. O., Batailler, C., Birt, A., & Zwienenberg, M. (2015). A multi-lab pre-registered replication of the ego-depletion effect. Perspectives on Psychological Science, 2.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—Or vice versa. Journal of the American Statistical Association, 54(285), 30–34. JSTOR. https://doi.org/10.2307/2282137

Taylor, D. J., & Muller, K. E. (1996). Bias in linear model power and sample size calculation due to estimating noncentrality. Communications in Statistics-Theory and Methods, 25(7), 1595–1610. https://doi.org/10.1080/03610929608831787

Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J. L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi, J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay, S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733