A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, September 28, 2025

Type S and M errors as a “rhetorical tool”

We recently posted a preprint criticizing the idea of Type S and M errors (https://osf.io/2phzb_v1). From our abstract: “While these concepts have been proposed to be useful both when designing a study (prospective) and when evaluating results (retroactive), we argue that these statistics do not facilitate the proper design of studies, nor the meaningful interpretation of results.”

In a recent blog post that is mainly on p-curve analysis, Gelman writes briefly about Type S and M errors, stating that he does not see them as tools that should be used regularly, but that they mainly function as a ‘rhetorical tool’:

I offer three well-known examples of statistical ideas arising in the field of science criticism, three methods whose main value is rhetorical:

[…]

2. The concepts of Type M and Type S errors, which I developed with Francis Tuerlinckx in 2000 and John Carlin in 2014. This has been an influential idea–ok, not as influential as Ioannidis’s paper!–and I like it a lot, but it doesn’t correspond to a method that I will typically use in practice. To me, the value of the concepts of Type M and Type S errors is they help us understand certain existing statistical procedures, such as selection on statistical significance, that have serious problems. There’s mathematical content here for sure, but I fundamentally think of these error calculations as having rhetorical value for the design of studies and interpretation of reported results.

The main sentence of interest here is that Gelman says this is not a method he would use in practice. I was surprised, because in their article Gelman and Carlin (2014) recommend the calculation of Type S and M errors more forcefully: “We suggest that design calculations be performed after as well as before data collection and analysis.” Throughout their article, they compare design calculations, in which Type S and M errors are calculated, to power analyses, which are widely seen as a requirement before data collection in any hypothesis testing study. For example, in the abstract they write “power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations”.
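To make concrete what such a design calculation involves, here is a minimal simulation sketch in Python of the calculation Gelman and Carlin describe (their paper provides an R function for this; the function name, implementation, and example numbers below are my own). Given an assumed true effect D and the standard error s of the estimate, it returns the power, the Type S error rate (the probability that a statistically significant estimate has the wrong sign), and the Type M error or exaggeration ratio (the expected absolute size of a significant estimate divided by the assumed true effect).

import numpy as np
from scipy import stats

def design_calculation(D, s, alpha=0.05, n_sims=1_000_000, seed=1):
    # Simulate hypothetical replications of a study whose estimate is
    # approximately normal with mean D (assumed true effect) and SD s.
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)        # two-sided critical value
    estimates = rng.normal(D, s, n_sims)
    significant = np.abs(estimates) > z_crit * s  # replications with p < alpha
    power = significant.mean()
    type_s = np.mean(np.sign(estimates[significant]) != np.sign(D))
    type_m = np.mean(np.abs(estimates[significant])) / abs(D)
    return power, type_s, type_m

# With a tiny assumed true effect relative to the standard error (numbers in the
# spirit of Gelman and Carlin's sex-ratio example: D = 0.1, s = 3.3), power is
# about .05, the Type S error is about .46, and significant estimates
# overestimate the assumed true effect by a factor of roughly 77.
print(design_calculation(D=0.1, s=3.3))

The calculation itself is cheap and mechanical, which is part of why I read Gelman and Carlin (2014) as recommending it for routine use alongside, or instead of, a standard power analysis.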

They also say design calculations are useful when interpreting results, and that they add something to p-values and effect sizes, which again seems to suggest they can complement ordinary data analysis: “Our retrospective analysis provided useful insight, beyond what was revealed by the estimate, confidence interval, and p value that came from the original data summary.” (Gelman & Carlin, 2014, p. 646). In general, they seem to suggest design analyses are done before or after data analysis: “First, it is indeed preferable to do a design analysis ahead of time, but a researcher can analyze data in many different ways—indeed, an important part of data analysis is the discovery of unanticipated patterns (Tukey, 1977) so that it is unreasonable to suppose that all potential analyses could have been determined ahead of time. The second reason for performing postdata design calculations is that they can be a useful way to interpret the results from a data analysis, as we next demonstrate in two examples.” (Gelman & Carlin, 2014, p. 643).

On the other hand, in a single sentence in the discussion, they also write: “Our goal in developing this software is not so much to provide a tool for routine use but rather to demonstrate that such calculations are possible and to allow researchers to play around and get a sense of the sizes of Type S errors and Type M errors in realistic data settings.”

Maybe I have always misinterpreted Gelman and Carlin (2014), in that I took it as a paper that recommended the regular use of Type S and M errors, and I should have understood that the sentence in the discussion made it clear that this was never their intention. If the idea is to replace Type 1 and Type 2 errors, and hence power analysis and the interpretation of data, then design analysis should be part of every hypothesis testing study. Sentences such as “the requirement of design analysis can stimulate engagement with the existing literature in the subject-matter field” seemed to suggest to me that design analyses could be a requirement for all studies. But maybe I was wrong.


Or maybe I wasn’t.


In this blog post, Gelman writes: “Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself! Because I am not wanting to summarize inference by statistical significance.” So here the idea does seem to be that others should routinely use Type S and M errors. And in a very early version of the paper with Carlin, available here, the opening sentence also suggests routine use: “The present article proposes an ideal that every statistical analysis be followed up with a power calculation to better understand the inference from the data. As the quotations above illustrate, however, our suggestion contradicts the advice of many respected statisticians. Our resolution of this apparent disagreement is that we perform retrospective power analysis in a different way and for a different purpose than is typically recommended in the literature.”

Of course, one good thing about science is that people change their beliefs about things. Maybe Gelman at one time thought Type S and M errors should be part of ‘every statistical analysis’ but now sees the tool mainly as a ‘rhetorical device’. And that is perfectly fine. It is also good to know, because I regularly see people who suggest that Type S and M errors should routinely be used in practice. I guess I can now point them to a blog post where Gelman himself disagrees with that suggestion.

As we explain in our preprint, the idea of Type S errors is conceptually incoherent, and any probabilities that are calculated will be identical to the Type 1 error in directional tests, or to the false discovery rate, because all that Type S errors do is remove the possibility of an effect being exactly 0 from the distribution, and that possibility itself has probability 0. We also explain that other tools are better suited to educate researchers about effect size inflation in studies selected for significance (the problem for which Gelman would recommend Type M errors); we recommend p-uniform for this, or simply teaching people about critical effect sizes.
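To illustrate what I mean by teaching critical effect sizes (a sketch of my own, not code from the preprint): only estimates larger than the critical value, z_crit × SE, can reach statistical significance, so whenever the true effect is smaller than this critical effect size, every statistically significant estimate necessarily overestimates it. No Type M machinery is needed to see the inflation.

import numpy as np
from scipy import stats

def critical_effect_size(se, alpha=0.05):
    # Smallest absolute estimate that reaches two-sided significance at alpha.
    return stats.norm.ppf(1 - alpha / 2) * se

rng = np.random.default_rng(1)
true_effect, se = 0.1, 3.3                 # same illustrative numbers as above
crit = critical_effect_size(se)            # about 6.5, far larger than the true effect
estimates = rng.normal(true_effect, se, 100_000)
significant_estimates = np.abs(estimates[np.abs(estimates) > crit])
print(f"critical effect size: {crit:.2f}")
print(f"smallest significant estimate (absolute): {significant_estimates.min():.2f}")
print(f"mean significant estimate (absolute): {significant_estimates.mean():.2f}")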

Personally, I don’t like rhetorical tools. Although in our preprint we agree that teaching the idea of Type S and M errors can be useful in education, there are also conceptually coherent and practically useful statistical ideas that we can teach instead to achieve the same understanding. Rhetorical tools might be useful to convince people who do not think logically about a topic, but I prefer to have a slightly higher bar for the scientists that I aim to educate about good research practices, and I think they are able to understand the problem of low statistical power and selection bias without rhetorical tools.


References: 

Lakens, D., Cristian, Xavier-Quintais, G., Rasti, S., Toffalini, E., & Altoè, G. (2025). Rethinking Type S and M Errors. OSF. https://doi.org/10.31234/osf.io/2phzb_v1

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
