The 20% Statistician: Irwin Bross justifying the 0.05 alpha level as a means of communication between scientists

This blog post is based on the chapter “Critical Levels, Statistical Language, and Scientific Inference” by Irwin D. J. Bross (1971) in the proceedings of the symposium on the foundations of statistical inference held in 1970. Because the conference proceedings might be difficult to access, I am citing extensively from the original source. Irwin D. J. Bross [1921-2004] was a biostatistician at Roswell Park Cancer Institute in Buffalo up to 1983.

Irwin D. J. Bross

Criticizing the use of thresholds such as an alpha level of 0.05 to make dichotomous inferences is timeless. Bross writes in 1971: “Of late the attacks on critical levels (and the statistical methods based on these levels) have become more frequent and more vehement.” I feel the same way, but it seems unlikely that the vehemence of criticisms has been increasing for half a century. A more likely explanation is perhaps that some people, like Bross and myself, become increasingly annoyed by such criticisms.

Bross reflects on how very few justifications of the use of alpha levels exist in the literature, because “Elementary statistics texts are not equipped to go into the matter; advanced texts are too preoccupied with the latest and fanciest statistical techniques to have space for anything so elementary. Thus the justifications for critical levels that are commonly offered are flimsy, superficial, and badly outdated.” He notes how the use of critical values emerged at a time when statisticians had more practical experience, but that “Unfortunately, many of the theorists nowadays have lost touch with statistical practice and as a consequence, their work is mathematically sophisticated but scientifically very naive.”

Bross sets out to consider which justification can be given for the use of critical alpha levels. He would like such a justification to convince those who use statistical methods in their research, and statisticians who are familiar with statistical practice. He argues that the purpose of a medical researcher “is to reach scientific conclusions concerning the relative efficacy (or safety) of the drugs under test. From the investigator's standpoint, he would like to make statements which are reliable and informative. From a somewhat broader standpoint, we can consider the larger communication network that exists in a given research area - the network which would connect a clinical pharmacologist with his colleagues and with the practicing physicians who might make use of his findings. Any realistic picture of scientific inference must take some account of the communication networks that exist in the sciences.”

It is rare to see biostatisticians explicitly embrace the pragmatic and social aspects of scientific inference. Bross highlights three main points about communication networks. “First, messages generate messages.” Colleagues might replicate a study, or build on it, or apply the knowledge. “A second key point is: discordant messages produce noise in the network”. “A third key point is: statistical methods are useful in controlling the noise in the network”. The critical level set by researchers controls the noise in the network. Too much noise in a network impedes scientific progress, because communication breaks down. He writes “Thus the specification of the critical levels […] has proved in practice to be an effective method for controlling the noise in communication networks.” Bross also notes that the critical alpha level in itself is not enough to reduce noise – it is just one component of a well-designed experiment that reduces noise in the network. Setting a sufficiently low alpha level is therefore one aspect that contributes to a system where people in the network can place some reliance on claims that are made because noise levels are not too high.

“This simple example serves to bring out several features of the usual critical level techniques which are often overlooked although they are important in practice. Clearly, if each investigator holds the proportion of false positive reports in his studies at under 5%, then the proportion of false positive reports from all of the studies carried out by the participating members of the network will be held at less than 5%. This property does not sound very impressive – it sounds like the sort of property one would expect any sensible statistical method to have. But it might be noted that most of the methods advocated by the theoreticians who object to critical levels lack this and other important properties which facilitate control of the noise level in the network.”
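The property Bross describes can be illustrated with a small simulation. This is a minimal sketch under my own assumptions, not something from the chapter: every null hypothesis is true, every study is a two-sided z-test with known standard deviation and 30 observations, and the function names (`z_test_p_value`, `simulate_network`) are made up for illustration. Because the pooled false positive rate is a weighted average of the investigators' own rates, it cannot exceed the long-run rate each investigator holds themselves to.

```python
import math
import random


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_network(n_investigators=200, alpha=0.05, seed=1):
    """Simulate a network in which every tested null hypothesis is true.

    Returns the observed false positive rate of each investigator and the
    pooled rate across all studies in the network.
    """
    rng = random.Random(seed)
    rates = []
    total_hits = 0
    total_studies = 0
    for _ in range(n_investigators):
        n_studies = rng.randint(20, 100)  # investigators differ in output
        hits = 0
        for _ in range(n_studies):
            sample = [rng.gauss(0.0, 1.0) for _ in range(30)]  # no true effect
            if z_test_p_value(sample) < alpha:
                hits += 1
        rates.append(hits / n_studies)
        total_hits += hits
        total_studies += n_studies
    return rates, total_hits / total_studies


if __name__ == "__main__":
    rates, pooled = simulate_network()
    print(f"pooled false positive rate: {pooled:.3f}")
```

Individual investigators will show observed rates somewhat above or below 5% because each runs a limited number of studies, but the pooled rate across the network should sit close to the alpha level.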

This point, common among all error statisticians, has repeatedly raised its head in response to suggestions to abandon statistical significance, or to stop interpreting p-values dichotomously. Of course, one can choose not to care about the rate at which researchers make erroneous claims, but it is important to realize the consequences of not caring about error rates. One can also work towards a science where scientists no longer make claims, but generate knowledge through some other mechanism. But recent proposals to treat p-values as continuous measures of evidence (Muff et al., 2021; but see Lakens, 2022a) or to use estimation instead of hypothesis tests (Elkins et al., 2021; but see Lakens, 2022b) do not outline what such an alternative mode of knowledge generation would look like, or how researchers will be prevented from making claims about the presence or absence of effects.

Bross proposes an intriguing view on statistical inference where “statistics is used as a means of communication between the members of the network.” He views statistics not as a way to learn whether idealized probability distributions accurately reflect the empirical reality in infinitely repeated samples, but as a way to communicate assertions that are limited by the facts. He argues that specific ways of communicating only become widely used if they are effective. Here, I believe he fails to acknowledge that ineffective communication systems can also evolve, and it is possible that scientists en masse use techniques, not because they are efficient ways of communicating facts, but because they will lead to scientific publications. The idea that statistical inferences are a ‘mindless ritual’ has been proposed (Gigerenzer, 2018), and there is no doubt that many scientists simply imitate the practices they see. Furthermore, the replication crisis has shown that huge error rates in subfields of the scientific literature can exist for decades. The problems associated with these large error rates (e.g., failures to replicate findings, inaccurate effect size estimates) can sometimes only very slowly lead to a change in practice. So, arguing that a practice survives because it works is risky. Whether current research practices are effective – or whether other practices would be more effective – requires empirical evidence. Randomized controlled trials seem a bridge too far to compare statistical approaches, but natural experiments by journals that abandon p-values support Bross’s argument to some extent. When the journal Basic and Applied Social Psychology abandoned p-values, the consequence was that researchers claimed effects were present at a higher error rate than if claims had been limited by typical alpha level thresholds (Fricker et al., 2019).

Whether efficient or not, statistical language surrounding critical thresholds is in widespread use. Bross discusses how an alpha level of 5% or 1% is a convention. Like many linguistic conventions, the threshold of 5% is somewhat arbitrary, and reflects the influence of statisticians like Karl Pearson and Ronald Fisher. However, the threshold is not completely arbitrary. Bross asks us to imagine what would have happened had the alpha level of 0.001 been proposed, or an alpha level of 0.20. In both cases, he believes the convention would not have spread – in the first case because in many fields there are not sufficient resources to make claims at such a low error rate, and in the second case because few researchers would have found that alpha level a satisfactory quantification of ‘rare’ events. He, I think correctly, observes this means that the alpha level of 0.05 is not completely arbitrary, but that it reflects a quantification of ‘rare’ that researchers believe has sufficient practical value to be used in communication. Bross argues that the convention of a 5% alpha level spread because it sufficiently matched what most scientists considered an appropriate probability level to define ‘rare’ events, as for example when Fisher (1956) writes “Either an exceptionally rare chance has occurred, or the theory of random distribution is not true.”
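To get a feel for the resource argument, one can compute how the required sample size changes with the alpha level. The sketch below uses the standard normal approximation for a two-sided comparison of two independent group means. The effect size (d = 0.5), the desired power (80%), and the function name `n_per_group` are my own illustrative assumptions, not values taken from Bross.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, power=0.80, effect_size=0.5):
    """Approximate sample size per group for a two-sided two-sample test.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2,
    where d is the standardized mean difference (an assumed value here).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for alpha in (0.20, 0.05, 0.001):
        print(f"alpha = {alpha}: about {n_per_group(alpha)} participants per group")
```

Under these assumptions, moving from an alpha level of 0.05 to 0.001 roughly doubles the number of participants needed per group, which gives some sense of why such a strict convention might not have spread in fields with limited resources.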

Of course, one might follow Bross and ask “But is there any reason to single out one particular value, 5%, instead of some other value such as 3.98% or 7.13%?”. He writes: “Again, in conformity with the linguistic patterns in setting conventions it is natural to use a round number like 5%. Such round numbers serve to avoid any suggestion that the critical value has been gerrymandered or otherwise picked to prove a point in a particular study. From an abstract standpoint, it might seem more logical to allow a range of critical values rather than to choose one number but to do so would be to go contrary to the linguistic habits of fact-limited languages. Such languages tend to minimize the freedom of choice of the speaker in order to insure that a statement results from factual evidence and not from a little language-game played by the speaker.”

In a recent paper we made a similar point (Uygun Tunç et al., 2021): “The conventional use of an alpha level of 0.05 can also be explained by the requirement in methodological falsificationism that statistical decision procedures are specified before the data is analyzed (see Popper, 2002, sections 19-20; Lakatos, 1978, p. 23-28). If researchers are allowed to set the alpha level after looking at the data there is a possibility that confirmation bias (or more intentional falsification-deflecting strategies) influences the choice of an alpha level. An additional reason for a conventional alpha level of 0.05 is that before the rise of the internet it was difficult to transparently communicate the pre-specified alpha level for any individual test to peers. The use of a default alpha level therefore effectively functioned as a pre-specification. For a convention to work as a universal pre-specification, it must be accepted by nearly everyone, and be extremely resistant to change. If more than a single conventional alpha level exists, this introduces the risk that confirmation bias influences the choice of an alpha level.” Our thoughts seem to be very much aligned with those of Bross.

Bross continues and writes “Anyone familiar with certain areas of the scientific literature will be well aware of the need for curtailing language-games. Thus if there were no 5% level firmly established, then some persons would stretch the level to 6% or 7% to prove their point. Soon others would be stretching to 10% and 15% and the jargon would become meaningless. Whereas nowadays a phrase such as statistically significant difference provides some assurance that the results are not merely a manifestation of sampling variation, the phrase would mean very little if everyone played language-games. To be sure, there are always a few folks who fiddle with significance levels - who will switch from two-tailed to one-tailed tests or from one significance test to another in an effort to get positive results. However such gamesmanship is severely frowned upon and is rarely practiced by persons who are native speakers of fact-limited scientific languages - it is the mark of an amateur.”
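The cost of the gamesmanship Bross mentions is easy to show in a simulation. The sketch below is my own illustration, with hypothetical function names and the simplifying assumption that each study yields a single standard normal test statistic under a true null hypothesis. It compares a researcher who commits to a two-sided test with one who, after seeing the data, switches to a one-tailed test in whichever direction the effect happened to point.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def one_sided_p_in_observed_direction(z):
    """One-sided p-value when the tail is chosen after seeing the sign of z."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))


def rejection_rates(n_studies=100_000, alpha=0.05, seed=1):
    """Return (fixed, flexible) false positive rates when the null is true."""
    rng = random.Random(seed)
    fixed = 0
    flexible = 0
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)  # test statistic under a true null hypothesis
        if two_sided_p(z) < alpha:
            fixed += 1
        if one_sided_p_in_observed_direction(z) < alpha:
            flexible += 1
    return fixed / n_studies, flexible / n_studies


if __name__ == "__main__":
    fixed, flexible = rejection_rates()
    print(f"pre-specified two-sided test: {fixed:.3f}")
    print(f"tail chosen after the fact:   {flexible:.3f}")
```

Choosing the tail after looking at the data amounts to testing at an alpha level of 10% while reporting 5%, so the phrase ‘statistically significant’ no longer communicates the error rate readers assume it does.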

In our recent paper on justifying alpha levels (Maier & Lakens, 2022), we struggled with the idea that changing the alpha level (and especially increasing it) might confuse readers when ‘statistically significant’ no longer means ‘rejected with a 5% alpha level’. We wrote “Finally, the use of a high alpha level might be missed if readers skim an article. We believe this can be avoided by having each scientific claim accompanied by the alpha level under which it was made. Scientists should be required to report their alpha levels prominently, usually in the abstract of an article alongside a summary of the main claim.” It might in general be an improvement if people write ‘we reject an effect size of X at an alpha level of 5%’, but this is especially true if researchers choose to deviate from the conventional 5% alpha level.

I like how Bross has an extremely pragmatic but still principled view on statistical inferences. He writes: “This means that we have to abandon the traditional prescriptive attitude and adopt the descriptive approach which is characteristic of the empirical sciences. If we do so, then we get a very different picture of what statistical and scientific inference is all about. It is very difficult, I believe, to get such a picture unless you have had some first hand experience as an independent investigator in a scientific study. You then learn that drawing conclusions from statistical data can be a traumatic experience. A statistical consultant who takes a detached view of things just does not feel this pain.” This point is made by other applied statisticians, and it is a really important one. There are consequences of statistical inferences that you only experience when you spend several years trying to answer a substantive research question. Without that experience, it is difficult to give practical recommendations about what researchers should want when using statistics.

He continues “What you want - and want desperately - is all the protection you can get against the "slings and arrows of outrageous fortune". You want to say something informative and useful about the origins and nature of a disease or health hazard. But you do not want your statement to come back and haunt you for the rest of your life.” Of course, Bross should have said ‘One thing you might want’ (because now his statement is just another example of The Statistician’s Fallacy (Lakens, 2021)). But with this small amendment, I think there are quite a few scientists who want this from their statistical inferences. He writes “When you announce a new finding, you put your scientific reputation on the line. Your colleagues probably cannot remember all your achievements, but they will never forget any of your mistakes! Second thoughts like these produce an acute sense of insecurity.” Not all scientists might feel like this. I hope we are willing to forget some mistakes people make, and I fear the consequences of making too many incorrect claims for the reputation of a researcher are not as severe as Bross suggests[i]. But I think many fellow researchers will experience some fear that their findings do not hold up (at least until they have been replicated several times), and that hearing researchers failed to replicate a finding yields some negative affect.

I think Bross hits the nail on the head when it comes to thinking about justifications of the use of alpha levels as thresholds to make claims. The justification for this practice is social and pragmatic in nature, not statistical (cf. Uygun Tunç et al., 2021). If we want to evaluate if current practices are useful or not, we have to abandon a prescriptive approach, and rely on a descriptive approach (Bross, p. 511). Anyone proposing an alternative to the use of alpha levels should not make prescriptive arguments, but provide descriptive data (or at least predictions) that highlight how their preferred approach to statistical inferences will improve communication between scientists.

References

Bross, I. D. (1971). Critical levels, statistical language and scientific inference. In Foundations of statistical inference (pp. 500–513). Holt, Rinehart and Winston.

Elkins, M. R., Pinto, R. Z., Verhagen, A., Grygorowicz, M., Söderlund, A., Guemann, M., Gómez-Conesa, A., Blanton, S., Brismée, J.-M., Ardern, C., Agarwal, S., Jette, A., Karstens, S., Harms, M., Verheyden, G., & Sheikh, U. (2021). Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. Journal of Physiotherapy. https://doi.org/10.1016/j.jphys.2021.12.001

Fisher, R. A. (1956). Statistical methods and scientific inference (Vol. viii). Hafner Publishing Co.

Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban. The American Statistician, 73(sup1), 374–384. https://doi.org/10.1080/00031305.2018.1537892

Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 1(2), 198–218. https://doi.org/10.1177/2515245918771329

Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

Lakens, D. (2022a). Why P values are not measures of evidence. Trends in Ecology & Evolution. https://doi.org/10.1016/j.tree.2021.12.006

Lakens, D. (2022b). Correspondence: Reward, but do not yet require, interval hypothesis tests. Journal of Physiotherapy, 68(3), 213–214. https://doi.org/10.1016/j.jphys.2022.06.004

Maier, M., & Lakens, D. (2022). Justify Your Alpha: A Primer on Two Practical Approaches. Advances in Methods and Practices in Psychological Science, 5(2), 25152459221080396. https://doi.org/10.1177/25152459221080396

Muff, S., Nilsen, E. B., O’Hara, R. B., & Nater, C. R. (2021). Rewriting results sections in the language of evidence. Trends in Ecology & Evolution. https://doi.org/10.1016/j.tree.2021.10.009

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by

[i] I recently saw the bio of a social psychologist who has produced a depressingly large number of incorrect claims in the literature. His bio made no mention of this fact, but proudly boasted about the thousands of times he was cited, even though most citations were for work that did not survive the replication crisis. How much reputations should suffer is an intriguing question that I think too many scientists will never feel comfortable addressing.


Monday, July 18, 2022
