Update 30/09/2025: I have added a reply by Andrew Gelman below my original blog post.
We recently
posted a preprint criticizing the idea of Type S and M errors (https://osf.io/2phzb_v1). From our abstract:
“While these concepts have been proposed to be useful both when designing a
study (prospective) and when evaluating results (retroactive), we argue that
these statistics do not facilitate the proper design of studies, nor the
meaningful interpretation of results.”
In a recent
blog post that is mainly on p-curve analysis, Gelman writes briefly about Type
S and M errors, stating that he does not see them as tools that should be used
regularly, but that they mainly function as a ‘rhetorical tool’:
I offer three well-known examples of statistical ideas arising in the field of science criticism, three methods whose main value is rhetorical:
[…]
2. The
concepts of Type M and Type S errors, which I developed with Francis Tuerlinckx
in 2000 and John Carlin in 2014. This has been an influential idea–ok, not as
influential as Ioannidis’s paper!–and I like it a lot, but it doesn’t
correspond to a method that I will typically use in practice. To me, the value
of the concepts of Type M and Type S errors is they help us understand certain
existing statistical procedures, such as selection on statistical significance,
that have serious problems. There’s mathematical content here for sure, but I
fundamentally think of these error calculations as having rhetorical value for
the design of studies and interpretation of reported results.
The main
sentence of interest here is that Gelman says this is not a method he would use
in practice. I was surprised, because in their article Gelman and Carlin (2014)
recommend the calculation of Type S and M errors more forcefully: “We suggest
that design calculations be performed after as well as before data collection
and analysis.” Throughout their article, they compare design calculations, in which Type S and M errors are calculated, to power analyses, which are widely seen as a requirement before data collection in any hypothesis-testing study. For
example, in the abstract they write “power analysis is flawed in that a narrow
emphasis on statistical significance is placed as the primary focus of study
design. In noisy, small-sample settings, statistically significant results can
often be misleading. To help researchers address this problem in the context of
their own studies, we recommend design calculations”.
They also say
design calculations are useful when interpreting results, and that they add
something to p-values and effect sizes, which again seems to suggest they can complement
ordinary data analysis: “Our retrospective analysis provided useful insight, beyond
what was revealed by the estimate, confidence interval, and p value that came
from the original data summary.” (Gelman & Carlin, 2014, p. 646). In general, they suggest design analyses can be done either before or after data analysis:
“First, it is indeed preferable to do a design analysis ahead of time, but a
researcher can analyze data in many different ways—indeed, an important part of
data analysis is the discovery of unanticipated patterns (Tukey, 1977) so that it
is unreasonable to suppose that all potential analyses could have been
determined ahead of time. The second reason for performing postdata design
calculations is that they can be a useful way to interpret the results from a
data analysis, as we next demonstrate in two examples.” (Gelman & Carlin,
2014, p. 643).
On the other
hand, in a single sentence in the discussion, they also write: “Our goal in
developing this software is not so much to provide a tool for routine use but
rather to demonstrate that such calculations are possible and to allow
researchers to play around and get a sense of the sizes of Type S errors and Type
M errors in realistic data settings.”
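To get a sense of what such a design calculation involves, here is a minimal simulation sketch in Python (my own illustration, not the authors' software, and the numbers are purely illustrative): assuming a normally distributed estimate with a known standard error and an assumed true effect, it computes the power, the probability that a statistically significant estimate has the wrong sign (Type S), and the average factor by which significant estimates overstate the true effect (the exaggeration ratio, or Type M error).

import numpy as np
from scipy import stats

def design_calculation(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Simulate power, Type S error, and exaggeration ratio for an assumed true effect."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)           # two-sided critical value
    estimates = rng.normal(true_effect, se, n_sims)  # hypothetical replications of the study
    significant = np.abs(estimates) > z_crit * se    # which replications reach p < alpha
    power = significant.mean()
    # Type S: among significant estimates, the proportion with the wrong sign
    type_s = np.mean(np.sign(estimates[significant]) != np.sign(true_effect))
    # Exaggeration ratio (Type M): average |significant estimate| relative to the true effect
    type_m = np.mean(np.abs(estimates[significant])) / abs(true_effect)
    return power, type_s, type_m

# A low-powered scenario: a true effect of 0.1 estimated with standard error 0.2
print(design_calculation(true_effect=0.1, se=0.2))

In a low-powered scenario like this, power is below 10%, a noticeable share of significant estimates has the wrong sign, and the significant estimates overstate the assumed true effect several-fold, which is the pattern Gelman and Carlin use to argue that statistically significant results in noisy, small-sample settings can be misleading.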
Maybe I have always misinterpreted Gelman and Carlin (2014), in that I took it as a paper that recommended the regular use of Type S and M errors, and I should have understood that the sentence in the discussion made it clear that this was never their intention. If the idea is to replace Type 1 and 2 errors, and hence to replace power analysis and to inform the interpretation of data, then design analysis should be part of every hypothesis-testing study. Sentences such as “the requirement of design analysis can stimulate engagement with the existing literature in the subject-matter field” seemed to suggest to me that design analyses could be a requirement for all studies. But maybe I was wrong.
Or maybe I
wasn’t.
In this blog
post, Gelman writes: “Now, one odd thing about my paper with Carlin is that
it gives some tools that I recommend others use when designing and evaluating
their research, but I would not typically use these tools directly myself!
Because I am not wanting to summarize inference by statistical significance.” So,
here there seems to be the idea that others routinely use Type S and M errors. And
in a very early version of the paper with Carlin, available here,
the opening sentence also suggests routine use: “The present article proposes
an ideal that every statistical analysis be followed up with a power
calculation to better understand the inference from the data. As the quotations
above illustrate, however, our suggestion contradicts the advice of many
respected statisticians. Our resolution of this apparent disagreement is that
we perform retrospective power analysis in a different way and for a different
purpose than is typically recommended in the literature.”
Of course,
one good thing about science is that people change their beliefs about things.
Maybe Gelman at one time thought Type S and M errors should be part of ‘every statistical analysis’ but now sees the tool mainly as a ‘rhetorical device’. And that is perfectly fine. It is also good to know, because I regularly see people who suggest that Type S and M errors should routinely be used in practice. I
guess I can now point them to a blog post where Gelman himself disagrees with
that suggestion.
As we explain
in our preprint, the idea of Type S errors is conceptually incoherent, and any
probabilities calculated will be identical to the Type 1 error in directional
tests, or the false discovery rate, as all that Type S errors do is remove the possibility that the effect is exactly 0 from the distribution, and this possibility itself has probability 0. We also explain why other tools are better suited to educate researchers
about effect size inflation in studies selected for significance (for which
Gelman would recommend Type M errors), and we actually recommend p-uniform for
this, or just teaching people about critical effect sizes.
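To make concrete what I mean by a critical effect size, here is a small Python sketch (my own illustration, not taken from the preprint): for a two-sided two-sample t-test with equal group sizes, it computes the smallest absolute Cohen's d that can reach p < alpha. Because any significant estimate must exceed this value, it shows directly why selection for significance inflates effect sizes in small samples.

import numpy as np
from scipy import stats

def critical_d(n_per_group, alpha=0.05):
    """Smallest absolute Cohen's d that reaches p < alpha in a two-sided two-sample t-test."""
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit * np.sqrt(2 / n_per_group)  # d = t * sqrt(2 / n) for two equal groups

for n in (10, 20, 50, 200):
    print(n, round(critical_d(n), 2))

With 20 participants per group the critical d is about 0.64, so if the true effect is small, every statistically significant estimate from such a study is necessarily a substantial overestimate, and no separate Type M calculation is needed to see this.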
Personally,
I don’t like rhetorical tools. Although in our preprint we agree that the idea of Type S and M errors can be useful in education, there are also conceptually
coherent and practically useful statistical ideas that we can teach instead to
achieve the same understanding. Rhetorical tools might be useful to convince
people who do not think logically about a topic, but I prefer to have a
slightly higher bar for the scientists that I aim to educate about good
research practices, and I think they are able to understand the problem of low
statistical power and selection bias without rhetorical tools.
--Reply by Andrew Gelman--
Hi, Daniel. Thanks for your comments. It's always good to see that people are reading our articles and blog posts. I think you are a little bit confused about what we wrote, but ultimately that's our fault for not being clear, so I appreciate the opportunity to clarify.
So you don't need to consider this comment as a "rebuttal" to your post. For convenience I'll go through several of your statements one by one, but my goal is to clarify.
First, I guess I should've avoided the word "rhetorical." In my post, I characterized Ioannidis's 2005 claim, type M and S errors, and multiverse analysis as "rhetorical tools" that have been useful in the field of science criticism but which I would not use in my own analyses. I could've added to this many other statistical methods including p-values and Bayes factors.
When I describe a statistical method as "rhetorical" in this context, I'm not saying it's mathematically invalid or that it's conceptually incoherent (to use your term), nor am I saying these methods should not be used! All these tools can be useful; they just rely on very strong assumptions. P-values and Bayes factors are measures of evidence relative to a null hypothesis (not just an assumption that a particular parameter equals zero, but an entire set of assumptions about the data-generating process) that is irrelevant in the science and decision problems I've seen--but these methods are clearly defined and theoretically justified, and many practitioners get a lot out of them. I very rarely would use p-values or Bayes factors in my work because I'm very rarely interested in this sort of discrepancy from a null hypothesis.
A related point comes up in my paper with Hill and Yajima, "Why we (usually) don't have to worry about multiple comparisons" (https://sites.stat.columbia.edu/gelman/research/published/multiple2f.pdf). Multiple comparisons corrections can be important, indeed I've criticized some published work for misinterpreting evidence by not accounting for multiple comparisons or multiple potential comparisons--but it doesn't come up so much in the context of multilevel modeling.
Ioannidis (2005) is a provocative paper that I think has a lot of value--but you have to be really careful to try to directly apply such an analysis to real data. He's making some really strong assumptions! The logic of his paper is clear, though. O'Rourke and I discuss the challenges of moving from that sort of model to larger conclusions in our 2013 paper (https://sites.stat.columbia.edu/gelman/research/published/GelmanORourkeBiostatistics.pdf).
The multiverse is a cool idea, and researchers have found it to be useful. The sociologists Cristobal Young and Erin Cumberworth recently published a book on it (https://www.cambridge.org/core/books/multiverse-analysis/D53C3AB449F6747B4A319174E5C95FA1). I don't think I'd apply the method in my own applied research, though, because the whole idea of the multiverse is to consider all the possible analyses you might have done on a dataset, and if I get to that point I'm more inclined to fit a multilevel model that subsumes all these analyses. I have found multiverse analysis to be useful in understanding research published by others, and maybe it would be useful for my own work too, given that my final published analyses never really include all the possibilities of what I might have done. The point is that this is yet another useful method that can have conceptual value even if I might not apply it to my own work. Again, the term "rhetorical" might be misleading, as these are real methods that, like all statistical methods, are appropriate in some settings and not in others.
So please don't let your personal dislike of the term "rhetorical tools" dissuade you from taking seriously the tools that I happen to have characterized as "rhetorical," as these include p-values, multiple comparisons corrections, Bayesian analysis with point priors, and all sorts of other methods that are rigorously defined and can be useful in many applied settings, including some of yours!
OK, now on to Type M and Type S errors. You seem to imply that at some time I thought that these "should be part of ‘every statistical analysis,'" but I can assure you that I have never believed or written such a thing. You put the phrase "every statistical analysis" in quotes, but this is your phrase, not mine.
One very obvious way to see that I never thought Type M and Type S errors "should be part of ‘every statistical analysis'" is that, since the appearance of that article in 2014, I've published dozens of applied papers, and in only very few of these did I look at Type M and Type S errors.
Why is that? Why is it that my colleagues and I came up with this idea that has been influential, which I indeed think can be very useful and which I do think should often be used by practitioners, but which I rarely use myself?
The reason is that the focus of our work on Type M and Type S errors has been to understand selection on statistical significance (as in that notorious estimate that early childhood intervention increases adult earnings by 42% on average, but with that being the result of an inferential procedure that, under any reasonable assumptions, greatly overestimates the magnitude of any real effect; that is, Type M error). In my applied work it's very rare that I condition on statistical significance, and so this sort of use of Type M and S errors is not so relevant. So it's perfectly coherent for me to say that Type M and S error analysis is valuable in a wide range of settings and that I think these tools should be applied very widely, without believing that they should be part of "every statistical analysis" or that I should necessarily use them for my own analyses.
That said, more recently I've been thinking that Type M and S errors are a useful approach to understanding statistical estimates more generally, not just for estimates that are conditioned on statistical significance. I'm working with Erik van Zwet and Witold Więcek on applying these ideas to Bayesian inferences as well. So I'm actually finding these methods to be more, not less, valuable for statistical understanding, and not just for "people who do not think logically about a topic" (in your phrasing). Our papers on these topics are published in real journals and of course they're intended for people who do think logically about the topic! And, just to be clear, I believe that you're thinking logically in your post too; I just think you've been misled by my terminology (again, I accept the blame for that), and also you work on different sorts of problems than I do, so it makes sense that a method that I find useful might not be so helpful to you. There are many ways to Rome, which is another point I was making in that blog post.
Finally, a few things in your post that I did not address above:
1. You quote from my blog post, where I wrote, “Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself! Because I am not wanting to summarize inference by statistical significance.” That's exactly my point above! You had it right there.
2. You wrote, "Maybe I have always misinterpreted Gelman and Carlin, 2014, in that I took it as a paper that recommended the regular use of Type S and M errors, and I should have understood that the sentence in the discussion made it clear that this was never their intention." So, just to clarify, yes in our paper we recommended the regular use of Type M and S errors, and we still recommend that!
3. You write that our "sentences such as 'the requirement of design analysis can stimulate engagement with the existing literature in the subject-matter field' seemed to suggest to me that design analyses could be a requirement for all studies." That's right--I actually do think that design analysis should be done for all studies!
OK, nothing is done all the time. I guess that some studies are so cheap that there's no need for a design analysis--or maybe we could say that in such studies the design analysis is implicit. For example, if I'm doing A/B testing in a company, and they've done lots of A/B tests before, and I think the new effect will be comparable to previous things being studied, then maybe I just go with the same design as in previous experiments, without performing a formal design analysis. But one could argue that this corresponds to some implicit calculation.
In any case, yeah, in general I think that a design analysis should come before any study. Indeed, that is what I tell students and colleagues: never collect data before doing a simulation study first. Often we do fake-data simulation after the data come in, to validate our model-fitting strategies, but for a while I've been thinking it's best to do it before.
This is not controversial advice in statistics, to recommend a design analysis before gathering data! Indeed, in medical research it's basically a requirement. In our paper, Carlin and I argue--and I still believe--that a design analysis using Type M and S errors is more valuable than the traditional Type 1 and 2 errors. But in any case I consider "design analysis" to be the general term, with "power analysis" being a special case (design analysis looking at the probability of attaining statistical significance). I don't think traditional power analysis is useless--one way you can see this is that we demonstrate power calculations in chapter 16 of Regression and Other Stories, a book that came out several years after my paper with Carlin--I just think it can be misleading, especially if it is done without consideration of Type M and S errors.
Thanks again for your comments. It's good to have an opportunity to clarify my thinking, and these are important issues in statistics.
P.S. If you see something on our blog that you disagree with, feel free to comment there directly, as that way you can also reach readers of the original post.
--
References:
Lakens, D., Cristian, Xavier-Quintais, G., Rasti, S., Toffalini, E., & Altoè, G. (2025). Rethinking Type S and M Errors. OSF. https://doi.org/10.31234/osf.io/2phzb_v1
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642