The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, September 28, 2025

Type S and M errors as a “rhetorical tool”

Update 30/09/2025: I have added a reply by Andrew Gelman below my original blog post. 

We recently posted a preprint criticizing the idea of Type S and M errors (https://osf.io/2phzb_v1). From our abstract: “While these concepts have been proposed to be useful both when designing a study (prospective) and when evaluating results (retroactive), we argue that these statistics do not facilitate the proper design of studies, nor the meaningful interpretation of results.”

In a recent blog post that is mainly on p-curve analysis, Gelman writes briefly about Type S and M errors, stating that he does not see them as tools that should be used regularly, but that they mainly function as a ‘rhetorical tool’:

I offer three well-known examples of statistical ideas arising in the field of science criticism, three methods whose main value is rhetorical:

[…]

2. The concepts of Type M and Type S errors, which I developed with Francis Tuerlinckx in 2000 and John Carlin in 2014. This has been an influential idea–ok, not as influential as Ioannidis’s paper!–and I like it a lot, but it doesn’t correspond to a method that I will typically use in practice. To me, the value of the concepts of Type M and Type S errors is they help us understand certain existing statistical procedures, such as selection on statistical significance, that have serious problems. There’s mathematical content here for sure, but I fundamentally think of these error calculations as having rhetorical value for the design of studies and interpretation of reported results.

The main sentence of interest here is that Gelman says this is not a method he would use in practice. I was surprised, because in their article Gelman and Carlin (2014) recommend the calculation of Type S and M errors more forcefully: “We suggest that design calculations be performed after as well as before data collection and analysis.” Throughout their article, they compare design calculations where Type S and M errors are calculated to power analyses, which are widely seen as a requirement before data collection of any hypothesis testing study. For example, in the abstract they write “power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations”.

They also say design calculations are useful when interpreting results, and that they add something to p-values and effect sizes, which again seems to suggest they can complement ordinary data analysis: “Our retrospective analysis provided useful insight, beyond what was revealed by the estimate, confidence interval, and p value that came from the original data summary.” (Gelman & Carlin, 2014, p. 646). In general, they seem to suggest design analyses are done before or after data analysis: “First, it is indeed preferable to do a design analysis ahead of time, but a researcher can analyze data in many different ways—indeed, an important part of data analysis is the discovery of unanticipated patterns (Tukey, 1977) so that it is unreasonable to suppose that all potential analyses could have been determined ahead of time. The second reason for performing postdata design calculations is that they can be a useful way to interpret the results from a data analysis, as we next demonstrate in two examples.” (Gelman & Carlin, 2014, p. 643).

On the other hand, in a single sentence in the discussion, they also write: “Our goal in developing this software is not so much to provide a tool for routine use but rather to demonstrate that such calculations are possible and to allow researchers to play around and get a sense of the sizes of Type S errors and Type M errors in realistic data settings.”

Maybe I have always misinterpreted Gelman and Carlin, 2014, in that I took it as a paper that recommended the regular use of Type S and M errors, and I should have understood that the sentence in the discussion made it clear that this was never their intention. If the idea is to replace Type 1 and 2 errors, and hence, replace power analysis and the interpretation of data, design analysis should be part of every hypothesis testing study. Sentences such as “the requirement of design analysis can stimulate engagement with the existing literature in the subject-matter field” seemed to suggest to me that design analyses could be a requirement for all studies. But maybe I was wrong.

 

Or maybe I wasn’t.

 

In this blog post, Gelman writes: “Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself! Because I am not wanting to summarize inference by statistical significance.” So here, the suggestion seems to be that others should routinely use Type S and M errors. And in a very early version of the paper with Carlin, available here, the opening sentence also suggests routine use: “The present article proposes an ideal that every statistical analysis be followed up with a power calculation to better understand the inference from the data. As the quotations above illustrate, however, our suggestion contradicts the advice of many respected statisticians. Our resolution of this apparent disagreement is that we perform retrospective power analysis in a different way and for a different purpose than is typically recommended in the literature.”

Of course, one good thing about science is that people change their beliefs about things. Maybe Gelman at one time thought Type S and M errors should be part of ‘every statistical analysis’ but now sees the tool mainly as a ‘rhetorical device’. And that is perfectly fine. It is also good to know, because I regularly see people who suggest that Type S and M errors should routinely be used in practice. I guess I can now point them to a blog post where Gelman himself disagrees with that suggestion.

As we explain in our preprint, the idea of Type S errors is conceptually incoherent: any probabilities calculated will be identical to the Type 1 error in directional tests, or to the false discovery rate, because all that Type S errors do is remove the possibility of an effect being exactly 0 from the distribution, and that possibility itself has probability 0. We also explain how other tools are better suited to educate researchers about effect size inflation in studies selected for significance (for which Gelman would recommend Type M errors), and we actually recommend p-uniform for this, or just teaching people about critical effect sizes.
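For readers who want to see what the quantities under discussion look like numerically, here is a minimal R sketch of the design-calculation logic from Gelman and Carlin (2014), assuming a normally distributed estimate with known standard error and a two-sided test. The function name retrodesign_sketch is just an illustrative label; this is an illustration of their calculation, not the analysis from our preprint.

# A minimal sketch of the Type S and Type M calculations described by
# Gelman and Carlin (2014), assuming a normally distributed estimate
# with known standard error and a two-sided test at level alpha.
retrodesign_sketch <- function(true_effect, se, alpha = 0.05, n_sim = 1e5) {
  z <- qnorm(1 - alpha / 2)
  lambda <- true_effect / se
  p_pos <- 1 - pnorm(z - lambda)   # P(significant and positive estimate)
  p_neg <- pnorm(-z - lambda)      # P(significant and negative estimate)
  power <- p_pos + p_neg
  type_s <- p_neg / power          # P(wrong sign | significant), for true_effect > 0
  est <- rnorm(n_sim, mean = true_effect, sd = se)
  sig <- abs(est) > z * se
  type_m <- mean(abs(est[sig])) / true_effect  # exaggeration ratio, given significance
  list(power = power, type_s = type_s, type_m = type_m)
}
# A noisy study: true effect of 0.1 with a standard error of 1
retrodesign_sketch(true_effect = 0.1, se = 1)

In this low-power setting, a significant estimate has a substantial probability of having the wrong sign and will on average overestimate the true effect many times over; these are the quantities our preprint argues about.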

Personally, I don’t like rhetorical tools. Although in our preprint we agree that teaching the idea of Type S and M errors can be useful in education, there are also conceptually coherent and practically useful statistical ideas that we can teach instead to achieve the same understanding. Rhetorical tools might be useful to convince people who do not think logically about a topic, but I prefer to have a slightly higher bar for the scientists that I aim to educate about good research practices, and I think they are able to understand the problem of low statistical power and selection bias without rhetorical tools.


--Reply by Andrew Gelman--


Hi, Daniel.  Thanks for your comments.  It's always good to see that people are reading our articles and blog posts.  I think you are a little bit confused about what we wrote, but ultimately that's our fault for not being clear, so I appreciate the opportunity to clarify.

So you don't need to consider this comment as a "rebuttal" to your post.  For convenience I'll go through several of your statements one by one, but my goal is to clarify.

First, I guess I should've avoided the word "rhetorical."  In my post, I characterized Ioannidis's 2005 claim, type M and S errors, and multiverse analysis as "rhetorical tools" that have been useful in the field of science criticism but which I would not use in my own analyses.  I could've added to this many other statistical methods, including p-values and Bayes factors.

When I describe a statistical method as "rhetorical" in this context, I'm not saying it's mathematically invalid or that it's conceptually incoherent (to use your term), nor am I saying these methods should not be used!  All these tools can be useful; they just rely on very strong assumptions.  P-values and Bayes factors are measures of evidence relative to a null hypothesis (not just an assumption that a particular parameter equals zero, but an entire set of assumptions about the data-generating process) that is irrelevant in the science and decision problems I've seen--but these methods are clearly defined and theoretically justified, and many practitioners get a lot out of them.  I very rarely would use p-values or Bayes factors in my work because I'm very rarely interested in this sort of discrepancy from a null hypothesis.

A related point comes up in my paper with Hill and Yajima, "Why we (usually) don't have to worry about multiple comparisons" (https://sites.stat.columbia.edu/gelman/research/published/multiple2f.pdf).  Multiple comparisons corrections can be important, indeed I've criticized some published work for misinterpreting evidence by not accounting for multiple comparisons or multiple potential comparisons--but it doesn't come up so much in the context of multilevel modeling.

Ioannidis (2005) is a provocative paper that I think has a lot of value--but you have to be really careful when trying to directly apply such an analysis to real data.  He's making some really strong assumptions!  The logic of his paper is clear, though.  O'Rourke and I discuss the challenges of moving from that sort of model to larger conclusions in our 2013 paper (https://sites.stat.columbia.edu/gelman/research/published/GelmanORourkeBiostatistics.pdf).

The multiverse is a cool idea, and researchers have found it to be useful.  The sociologists Cristobal Young and Erin Cumberworth recently published a book on it (https://www.cambridge.org/core/books/multiverse-analysis/D53C3AB449F6747B4A319174E5C95FA1).  I don't think I'd apply the method in my own applied research, though, because the whole idea of the multiverse is to consider all the possible analyses you might have done on a dataset, and if I get to that point I'm more inclined to fit a multilevel model that subsumes all these analyses.  I have found multiverse analysis to be useful in understanding research published by others, and maybe it would be useful for my own work too, given that my final published analyses never really include all the possibilities of what I might have done.  The point is that this is yet another useful method that can have conceptual value even if I might not apply it to my own work.  Again, the term "rhetorical" might be misleading, as these are real methods that, like all statistical methods, are appropriate in some settings and not in others.

So please don't let your personal dislike of the term "rhetorical tools" dissuade you from taking seriously the tools that I happen to have characterized as "rhetorical," as these include p-values, multiple comparisons corrections, Bayesian analysis with point priors, and all sorts of other methods that are rigorously defined and can be useful in many applied settings, including some of yours!

OK, now on to Type M and Type S errors.  You seem to imply that at some time I thought that these "should be part of ‘every statistical analysis,'" but I can assure you that I have never believed or written such a thing.  You put the phrase "every statistical analysis" in quotes, but this is your phrase, not mine.

One very obvious way to see that I never thought Type M and Type S errors "should be part of ‘every statistical analysis'" is that, since the appearance of that article in 2014, I've published dozens of applied papers, and in only very few of these did I look at Type M and Type S errors.

Why is that?  Why is it that my colleagues and I came up with this idea that has been influential, and which I indeed think can be very useful and which I do think should often be used by practitioners, but which I only rarely use myself?

The reason is that the focus of our work on Type M and Type S errors has been to understand selection on statistical significance (as in that notorious estimate that early childhood intervention increases adult earnings by 42% on average, but with that being the result of an inferential procedure that, under any reasonable assumptions, greatly overestimates the magnitude of any real effect; that is, Type M error).  In my applied work it's very rare that I condition on statistical significance, and so this sort of use of Type M and S errors is not so relevant.  So it's perfectly coherent for me to say that Type M and S error analysis is valuable in a wide range of settings and that I think these tools should be applied very widely, without believing that they should be part of "every statistical analysis" or that I should necessarily use them for my own analyses.

That said, more recently I've been thinking that Type M and S errors are a useful approach to understanding statistical estimates more generally, not just for estimates that are conditioned on statistical significance.  I'm working with Erik van Zwet and Witold Więcek on applying these ideas to Bayesian inferences as well.  So I'm actually finding these methods to be more, not less, valuable for statistical understanding, and not just for "people who do not think logically about a topic" (in your phrasing).  Our papers on these topics are published in real journals and of course they're intended for people who *do* think logically about the topic!  And, just to be clear, I believe that you're thinking logically in your post too; I just think you've been misled by my terminology (again, I accept the blame for that), and also you work on different sorts of problems than I do, so it makes sense that a method that I find useful might not be so helpful to you.  There are many ways to Rome, which is another point I was making in that blog post.

Finally, a few things in your post that I did not address above:

1.  You quote from my blog post, where I wrote, “Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself! Because I am not wanting to summarize inference by statistical significance.”  That's exactly my point above!  You had it right there.

2.  You wrote, "Maybe I have always misinterpreted Gelman and Carlin, 2014, in that I took it as a paper that recommended the regular use of Type S and M errors, and I should have understood that the sentence in the discussion made it clear that this was never their intention."  So, just to clarify, yes in our paper we recommended the regular use of Type M and S errors, and we still recommend that!

3.  You write that our "sentences such as 'the requirement of design analysis can stimulate engagement with the existing literature in the subject-matter field' seemed to suggest to me that design analyses could be a requirement for all studies."  That's right--I actually do think that design analysis should be done for all studies!

OK, nothing is done all the time.  I guess that some studies are so cheap that there's no need for a design analysis--or maybe we could say that in such studies the design analysis is implicit.  For example, if I'm doing A/B testing in a company, and they've done lots of A/B tests before, and I think the new effect will be comparable to previous things being studied, then maybe I just go with the same design as in previous experiments, without performing a formal design analysis.  But one could argue that this corresponds to some implicit calculation.

In any case, yeah, in general I think that a design analysis should come before any study.  Indeed, that is what I tell students and colleagues:  never collect data before doing a simulation study first.  Often we do fake-data simulation after the data come in, to validate our model-fitting strategies, but for a while I've been thinking it's best to do it before.

This is not controversial advice in statistics, to recommend a design analysis before gathering data!  Indeed, in medical research it's basically a requirement.  In our paper, Carlin and I argue--and I still believe--that a design analysis using Type M and S errors is more valuable than the traditional Type 1 and 2 errors.  But in any case I consider "design analysis" to be the general term, with "power analysis" being a special case (design analysis looking at the probability of attaining statistical significance).  I don't think traditional power analysis is useless--one way you can see this is that we demonstrate power calculations in chapter 16 of Regression and Other Stories, a book that came out several years after my paper with Carlin--I just think it can be misleading, especially if it is done without consideration of Type M and S errors.

Thanks again for your comments.  It's good to have an opportunity to clarify my thinking, and these are important issues in statistics.

P.S.  If you see something on our blog that you disagree with, feel free to comment there directly, as that way you can also reach readers of the original post.

--

References: 

Lakens, D., Cristian, Xavier-Quintais, G., Rasti, S., Toffalini, E., & Altoè, G. (2025). Rethinking Type S and M Errors. OSF. https://doi.org/10.31234/osf.io/2phzb_v1

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642

Tuesday, July 22, 2025

Easily download files from the Open Science Framework with Papercheck

Researchers increasingly use the Open Science Framework (OSF) to share files, such as data and code underlying scientific publications, or presentations and materials for scientific workshops. The OSF is an amazing service that has contributed immensely to a changed research culture where psychologists share data, code, and materials. We are very grateful it exists.

 

But it is not always the most user-friendly. Specifically, downloading files from the OSF is a bigger hassle than we (Lisa DeBruine and Daniel Lakens, the developers of Papercheck) would like it to be. Downloading individual files is so complex that Malte Elson recently posted this meme on Bluesky.

 

 

 

Not only is the download button for files difficult to find, but downloading all files related to a project can be surprisingly effortful. It is possible to download all files as a zip archive, which will be called ‘osfstorage-archive.zip’. But because the OSF supports a nested folder structure, you might miss a folder, and you will quickly end up with ‘osfstorage-archive (1).zip’, ‘osfstorage-archive (2).zip’, etc. Unzipping these archives creates a lot of files without the organized folder structure, in folders with meaningless names, making it difficult to understand where the files belong.

 

The osf_file_download function in Papercheck

We have added a new function to our R package ‘Papercheck’ that will download all files and folders in an OSF repository. It saves all files by recreating the folder structure from the OSF in the folder you download them to (by default, your working directory). Just install Papercheck, load the library, and use the osf_file_download function to grab all files on the OSF:


# Install the development version of Papercheck from GitHub
devtools::install_github("scienceverse/papercheck")
library(papercheck)
# Download all files and folders from the OSF project with ID "6nt4v"
osf_file_download("6nt4v")

 

All files will be downloaded to your working directory.

 

Are you feeling FOMO because you are missing the King Open Research Summer School that is going on these days, where Sajedeh Rasti talked about preregistration, and Cristian Mesquida will give a workshop on using Papercheck? Well, at least it is very easy to download all the files they have shared on the OSF and look at the presentations:

 

osf_file_download("b7es8")

 

In the output, we see that by default large files (more than 10 MB) are omitted.


 

If you want to download all files regardless of their size, set the max_file_size parameter to NULL to ignore the maximum file size:

osf_file_download("b7es8", max_file_size = NULL)

 

Sometimes you might want to download all files but ignore the file structure on the OSF, to just have all the files in one folder. Setting the parameter ignore_folder_structure = TRUE will give you all the files on the OSF in a single folder. By default, files will be downloaded into your working directory, but you can also specify where you want the files to be saved.

 

osf_file_download("6nt4v", ignore_folder_structure = TRUE, download_to = "C:\\test_download")

 

We hope this function will make it easier for reviewers to access all supplementary files stored on the OSF during peer review, and for researchers to easily download all the data, code, or materials shared on the OSF that they want to re-use. Make sure to install the latest version of Papercheck (0.0.0.9050) to get access to this new function. Papercheck is still in active development, so report any bugs on GitHub.
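If you are not sure whether your installed version is recent enough, a quick check with base R’s packageVersion function can tell you; the small sketch below simply reruns the install command from above when the installed version is older than the version mentioned in this post.

# Update Papercheck if the installed version is older than 0.0.0.9050
if (packageVersion("papercheck") < "0.0.0.9050") {
  devtools::install_github("scienceverse/papercheck")
}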

 

 

Friday, July 4, 2025

Are meta-scientists ignoring philosophy of science?

Are meta-scientists ignoring philosophy of science (PoS)? Are they re-inventing the wheel? A recent panel at the Metascience conference engaged with this question, and the first sentence of the abstract states “Critics argue that metascience merely reinvents the wheel of other academic fields.” It’s a topic I have been thinking about for a while, so I will share my thoughts on this question. In this blog post I will only speak for myself, and not for any other metascientists. I studied philosophy for a year, read quite a lot about philosophy of science, regularly review for philosophy journals, have co-organized a conference that brings together philosophers and metascientists (and am co-organizing the next meeting), and currently have 3 ongoing collaborations with philosophers of science. I would say it seems a bit far-fetched to claim that I am ignoring philosophy of science and just reinventing what that field has already done. But I am ignoring a lot of it. That is not something PoS should take personally. I am also ignoring a lot of the metascientific work that is done. That seems perfectly normal to me – there is only so much work I need to engage with to do my work better.

 

I read a lot of the work philosophers of science have written that is relevant for metascientists, and a lot of what they are currently writing. Too often, I find work in philosophy of science on the replication crisis and related topics to be of quite low quality, and mostly it turns out to be rather irrelevant for my work. It is very common that philosophers seem to have thought about a topic very little, and that the limited time they did spend thinking about it involved no engagement with actual scientists. This is especially true of the philosophical work on the replication crisis. Having lived through it, and having thought about it every day for the last 15 years, I find most of the work by philosophers quite superficial. If you spent only 3 years full-time on your paper (and I know people who spent no more than a year full-time on an entire book!), it is just not going to be insightful enough for me to learn anything I didn’t know. Instead, I will notice a lot of mistakes, faulty assumptions, and incorrect conclusions.

 

I recently read a book by a philosopher of science that I had been looking forward to, hoping I would learn new things. Instead, I just thought ‘but psychologists themselves have done so much work on this that you are not discussing!’ for 200 pages. I found that quite frustrating. Maybe we should also talk about philosophers ignoring the work by psychologists.

 

While I was feeling frustrated, a thought popped up. There are very few philosophers of science, compared to the number of psychologists. Let’s say I read the literature on metascientific topics in psychology and philosophy, and only the top X% of papers make my work better. What is the probability that a psychologist has performed work that is relevant for me as a metascientist, compared to work by a philosopher? Because there are so many more psychologists than philosophers, all else equal, there will be many more important papers by psychologists than by philosophers that I should read.

 

Of course, all else is not equal. Psychologists have thought about all their crises for more than 60 years, ever since the first crisis in the 1970s (Lakens, 2025a, 2025b). Psychologists have a much better understanding of how psychological research is done than philosophers. Philosophers are on average smarter than psychologists (but again, there are many more psychologists), and have better training in conceptual analysis. Psychologists are more motivated to work on challenges in their field than philosophers. There are many other differences. So, we need to weigh all these factors in a model that predicts how many papers by philosophers of science I find useful to read, compared to the number of papers by psychologists I find useful to read. I don’t have those weights, but I have the outcome of the model for my own research: most often, papers by psychologists on metascience are better and more relevant for my work. Remember: I still ignore a lot of the papers on metascience by psychologists, and I find a lot of those papers low quality as well! But psychologists write a lot more on the topic, and I also think that, combining both fields, the best papers on metascience are more often written by psychologists.

 

I think this alternative explanation for why we engage very little with philosophers of science is worth taking into account. I personally consider it a strong alternative to an explanation that posits that we intentionally do not engage with the literature in philosophy of science.

 

There are additional reasons why I end up reading less work by philosophers of science. One is that I often do not agree with certain assumptions they make. The ideas that guide my work have been out of fashion in philosophy of science for half a century. Most of the ideas that are in fashion turn me off. I just do not enjoy reading papers that say ‘Let’s assume scientists are rational Bayesian updaters’ or ‘There is not one way to do science’. My working model of science is something very different, best summarized by this cartoon:

[cartoon]

 

When I say I ignore most of the work by philosophers of science on metascience, I still engage with quite a lot of philosophy of science. If you browse through the reference list of my online textbook – which is about statistics! – I am confident that more philosophers are cited there than metascientists are cited in a typical book by a philosopher of science. If you browse through the reading notes of the podcast ‘Nullius in Verba’ that I record with Smriti Mehta, I am confident that there are more papers by philosophers there than there are papers by metascientists in podcasts by philosophers of science.

 

I just wanted to share these thoughts to provide some more diversity to the ideas that were shared in the panel at the Metascience conference. When “this panel asks what is new about metascience, and why it may have captured greater attention than previous research and reform initiatives” maybe one reason is that on average this literature is better and more relevant for researchers interested in how science works, and how it could work. I know that is not going to be a very popular viewpoint for philosophers to read, but it is my viewpoint, and I think no one can criticize me for not engaging with philosophy of science enough. I have at least a moderately informed opinion on this matter. 




P.S. The Metascience symposium also discussed why work in the field of Science and Technology Studies is not receiving much love from metascientists. I also have thoughts about this topic, but those thoughts are a bit too provocative to share on a blog.


References

Lakens, D. (2025a). Concerns About Replicability Across Two Crises in Social Psychology. International Review of Social Psychology, 38(1). https://doi.org/10.5334/irsp.1036

Lakens, D. (2025b). Concerns About Theorizing, Relevance, Generalizability, and Methodology Across Two Crises in Social Psychology. International Review of Social Psychology, 38(1). https://doi.org/10.5334/irsp.1038