A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, September 28, 2025

Type S and M errors as a “rhetorical tool”

We recently posted a preprint criticizing the idea of Type S and M errors (https://osf.io/2phzb_v1). From our abstract: “While these concepts have been proposed to be useful both when designing a study (prospective) and when evaluating results (retroactive), we argue that these statistics do not facilitate the proper design of studies, nor the meaningful interpretation of results.”

In a recent blog post that is mainly on p-curve analysis, Gelman writes briefly about Type S and M errors, stating that he does not see them as tools that should be used regularly, but that they mainly function as a ‘rhetorical tool’:

I offer three well-known examples of statistical ideas arising in the field of science criticism, three methods whose main value is rhetorical:

[…]

2. The concepts of Type M and Type S errors, which I developed with Francis Tuerlinckx in 2000 and John Carlin in 2014. This has been an influential idea–ok, not as influential as Ioannidis’s paper!–and I like it a lot, but it doesn’t correspond to a method that I will typically use in practice. To me, the value of the concepts of Type M and Type S errors is they help us understand certain existing statistical procedures, such as selection on statistical significance, that have serious problems. There’s mathematical content here for sure, but I fundamentally think of these error calculations as having rhetorical value for the design of studies and interpretation of reported results.

The main sentence of interest here is that Gelman says this is not a method he would use in practice. I was surprised, because in their article Gelman and Carlin (2014) recommend the calculation of Type S and M errors more forcefully: “We suggest that design calculations be performed after as well as before data collection and analysis.” Throughout their article, they compare design calculations, in which Type S and M errors are calculated, to power analyses, which are widely seen as a requirement before data collection in any hypothesis testing study. For example, in the abstract they write “power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations”.
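To make concrete what such a design calculation involves, here is a minimal simulation sketch in Python of the calculation Gelman and Carlin describe (their paper provides an R function for this; the function name, implementation, and example numbers below are my own). Given an assumed true effect D and the standard error s of the estimate, it returns the power, the Type S error rate (the probability that a statistically significant estimate has the wrong sign), and the Type M error or exaggeration ratio (the expected absolute size of a significant estimate divided by the assumed true effect).

import numpy as np
from scipy import stats

def design_calculation(D, s, alpha=0.05, n_sims=1_000_000, seed=1):
    # Simulate hypothetical replications of a study whose estimate is
    # approximately normal with mean D (assumed true effect) and SD s.
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)        # two-sided critical value
    estimates = rng.normal(D, s, n_sims)
    significant = np.abs(estimates) > z_crit * s  # replications with p < alpha
    power = significant.mean()
    type_s = np.mean(np.sign(estimates[significant]) != np.sign(D))
    type_m = np.mean(np.abs(estimates[significant])) / abs(D)
    return power, type_s, type_m

# With a tiny assumed true effect relative to the standard error (numbers in the
# spirit of Gelman and Carlin's sex-ratio example: D = 0.1, s = 3.3), power is
# about .05, the Type S error is about .46, and significant estimates
# overestimate the assumed true effect by a factor of roughly 77.
print(design_calculation(D=0.1, s=3.3))

The calculation itself is cheap and mechanical, which is part of why I read Gelman and Carlin (2014) as recommending it for routine use alongside, or instead of, a standard power analysis.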

They also say design calculations are useful when interpreting results, and that they add something to p-values and effect sizes, which again seems to suggest they can complement ordinary data analysis: “Our retrospective analysis provided useful insight, beyond what was revealed by the estimate, confidence interval, and p value that came from the original data summary.” (Gelman & Carlin, 2014, p. 646). In general, they seem to suggest design analyses are done before or after data analysis: “First, it is indeed preferable to do a design analysis ahead of time, but a researcher can analyze data in many different ways—indeed, an important part of data analysis is the discovery of unanticipated patterns (Tukey, 1977) so that it is unreasonable to suppose that all potential analyses could have been determined ahead of time. The second reason for performing postdata design calculations is that they can be a useful way to interpret the results from a data analysis, as we next demonstrate in two examples.” (Gelman & Carlin, 2014, p. 643).

On the other hand, in a single sentence in the discussion, they also write: “Our goal in developing this software is not so much to provide a tool for routine use but rather to demonstrate that such calculations are possible and to allow researchers to play around and get a sense of the sizes of Type S errors and Type M errors in realistic data settings.”

Maybe I have always misinterpreted Gelman and Carlin (2014), in that I took it as a paper that recommended the regular use of Type S and M errors, and I should have understood that the sentence in the discussion made it clear that this was never their intention. If the idea is to replace Type 1 and Type 2 errors, and hence power analysis and the interpretation of data, then design analysis should be part of every hypothesis testing study. Sentences such as “the requirement of design analysis can stimulate engagement with the existing literature in the subject-matter field” seemed to suggest to me that design analyses could be a requirement for all studies. But maybe I was wrong.


Or maybe I wasn’t.


In this blog post, Gelman writes: “Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself! Because I am not wanting to summarize inference by statistical significance.” So here the idea does seem to be that others should routinely use Type S and M errors. And in a very early version of the paper with Carlin, available here, the opening sentence also suggests routine use: “The present article proposes an ideal that every statistical analysis be followed up with a power calculation to better understand the inference from the data. As the quotations above illustrate, however, our suggestion contradicts the advice of many respected statisticians. Our resolution of this apparent disagreement is that we perform retrospective power analysis in a different way and for a different purpose than is typically recommended in the literature.”

Of course, one good thing about science is that people change their beliefs about things. Maybe Gelman at one time thought Type S and M errors should be part of ‘every statistical analysis’ but now sees the tool mainly as a ‘rhetorical device’. And that is perfectly fine. It is also good to know, because I regularly see people who suggest that Type S and M errors should routinely be used in practice. I guess I can now point them to a blog post where Gelman himself disagrees with that suggestion.

As we explain in our preprint, the idea of Type S errors is conceptually incoherent, and any probabilities that are calculated will be identical to the Type 1 error in directional tests, or to the false discovery rate, because all that Type S errors do is remove the possibility of an effect being exactly 0 from the distribution, and that possibility itself has probability 0. We also explain that other tools are better suited to educate researchers about effect size inflation in studies selected for significance (the problem for which Gelman would recommend Type M errors); we recommend p-uniform for this, or simply teaching people about critical effect sizes.
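To illustrate what I mean by teaching critical effect sizes (a sketch of my own, not code from the preprint): only estimates larger than the critical value, z_crit × SE, can reach statistical significance, so whenever the true effect is smaller than this critical effect size, every statistically significant estimate necessarily overestimates it. No Type M machinery is needed to see the inflation.

import numpy as np
from scipy import stats

def critical_effect_size(se, alpha=0.05):
    # Smallest absolute estimate that reaches two-sided significance at alpha.
    return stats.norm.ppf(1 - alpha / 2) * se

rng = np.random.default_rng(1)
true_effect, se = 0.1, 3.3                 # same illustrative numbers as above
crit = critical_effect_size(se)            # about 6.5, far larger than the true effect
estimates = rng.normal(true_effect, se, 100_000)
significant_estimates = np.abs(estimates[np.abs(estimates) > crit])
print(f"critical effect size: {crit:.2f}")
print(f"smallest significant estimate (absolute): {significant_estimates.min():.2f}")
print(f"mean significant estimate (absolute): {significant_estimates.mean():.2f}")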

Personally, I don’t like rhetorical tools. Although in our preprint we agree that teaching the idea of Type S and M errors can be useful in education, there are also conceptually coherent and practically useful statistical ideas that we can teach instead to achieve the same understanding. Rhetorical tools might be useful to convince people who do not think logically about a topic, but I prefer to have a slightly higher bar for the scientists that I aim to educate about good research practices, and I think they are able to understand the problem of low statistical power and selection bias without rhetorical tools.


References: 

Lakens, D., Cristian, Xavier-Quintais, G., Rasti, S., Toffalini, E., & Altoè, G. (2025). Rethinking Type S and M Errors. OSF. https://doi.org/10.31234/osf.io/2phzb_v1

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
