The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, April 13, 2023

Preventing common misconceptions about Bayes Factors

As more people have started to use Bayes Factors, we should not be surprised that misconceptions about Bayes Factors have become common. A recent study shows that the percentage of scientific articles that draw incorrect inferences based on observed Bayes Factors is distressingly high (Wong et al., 2022), with 92% of articles demonstrating at least one misconception of Bayes Factors. Here I will review some of the most common misconceptions, and how to prevent them.

Misunderstanding 1: Confusing Bayes Factors with Posterior Odds.

One common criticism by Bayesians of null hypothesis significance testing (NHST) is that NHST quantifies the probability of the data (or more extreme data), given that the null hypothesis is true, but that scientists should be interested in the probability that the hypothesis is true, given the data. Cohen (1994) wrote:

What’s wrong with NHST? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is “Given these data, what is the probability that Ho is true?”

One might therefore believe that Bayes factors tell us something about the probability that a hypothesis is true, but this is incorrect. A Bayes factor quantifies how much we should update our belief in one hypothesis relative to another. If a hypothesis was extremely unlikely to begin with (e.g., the hypothesis that people have telepathy), it might still be very unlikely, even after computing a large Bayes factor in a single study demonstrating telepathy. If we believed the hypothesis that people have telepathy was unlikely to be true (e.g., we thought it was 99.9% certain telepathy was not true), evidence for telepathy might only increase our belief in telepathy to the extent that we now believe it is 98% unlikely. The Bayes factor only corresponds to our posterior odds if we were perfectly uncertain about the hypothesis being true or not. Only if both hypotheses were equally likely, and a Bayes factor indicates we should update our belief in such a way that the alternative hypothesis is three times more likely than the null hypothesis, would we end up believing the alternative hypothesis is exactly three times more likely than the null hypothesis. One should therefore not conclude that, for example, given a BF of 10, the alternative hypothesis is more likely to be true than the null hypothesis. The correct claim is that people should update their belief in the alternative hypothesis by a factor of 10.
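To make the updating arithmetic concrete, here is a minimal sketch in Python (the Bayes factor of 20 and the prior probabilities are illustrative numbers chosen to roughly match the telepathy example above): the posterior odds are simply the prior odds multiplied by the Bayes factor.

```python
# Posterior odds = prior odds x Bayes factor (illustrative numbers).

def update_belief(prior_prob_h1, bf10):
    """Return the posterior probability of H1 after observing a Bayes factor BF10."""
    prior_odds = prior_prob_h1 / (1 - prior_prob_h1)   # odds of H1 versus H0
    posterior_odds = prior_odds * bf10                 # Bayes' rule in odds form
    return posterior_odds / (1 + posterior_odds)       # back to a probability

# Skeptical prior: we are 99.9% certain telepathy does not exist.
print(update_belief(prior_prob_h1=0.001, bf10=20))  # ~0.02: still about 98% unlikely
# Perfectly uncertain prior: only now do posterior odds equal the Bayes factor.
print(update_belief(prior_prob_h1=0.5, bf10=3))     # 0.75, i.e., 3:1 odds for H1
```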

Misunderstanding 2: Failing to interpret Bayes Factors as relative evidence.

One benefit of Bayes factors that is often mentioned by Bayesians is that, unlike NHST, Bayes factors can provide support for the null hypothesis, and thereby falsify predictions. It is true that NHST can only reject the null hypothesis, although it is important to add that in frequentist statistics equivalence tests can be used to reject the alternative hypothesis, and therefore there is no need to switch to Bayes factors to meaningfully interpret the results of non-significant null hypothesis tests.

Bayes factors quantify support for one hypothesis relative to another hypothesis. As with likelihood ratios, it is possible that one hypothesis is supported more than another hypothesis, while both hypotheses are actually false. It is incorrect to interpret Bayes factors in an absolute manner, for example by stating that a Bayes factor of 0.09 provides support for the null hypothesis. The correct interpretation is that the Bayes factor provides relative support for H0 compared to H1. With a different alternative model, the Bayes factor would change. As with a significant equivalence test, even a Bayes factor strongly supporting H0 does not mean there is no effect at all - there could be a true, but small, effect.

For example, after Daryl Bem (2011) published 9 studies demonstrating support for pre-cognition (conscious cognitive awareness of a future event that could not otherwise be known), a team of Bayesian statisticians re-analyzed the studies, and concluded “Out of the 10 critical tests, only one yields “substantial” evidence for H1, whereas three yield “substantial” evidence in favor of H0. The results of the remaining six tests provide evidence that is only “anecdotal”” (Wagenmakers et al., 2011). In their reply, Bem, Utts, and Johnson (2011) argue that the set of studies provides convincing evidence for the alternative hypothesis if the Bayes factors are computed as relative evidence between the null hypothesis and a more realistically specified alternative hypothesis, in which the effects of pre-cognition are expected to be small. This back and forth illustrates how Bayes factors are relative evidence, and a change in the alternative model specification changes whether the null or the alternative hypothesis receives relatively more support given the data.
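To see how strongly the specification of the alternative model matters, consider a simplified normal-approximation Bayes factor (this is not the default JZS t-test used in the Bem debate, and all numbers below are made up for illustration): the same observed effect is compared against a wide ‘default-like’ alternative and against an alternative that expects only small effects.

```python
# Simplified normal-approximation Bayes factor: H0: delta = 0 versus
# H1: delta ~ Normal(0, tau^2). The observed effect, its standard error,
# and the prior scales are made-up numbers for illustration.
from scipy.stats import norm

def bf10_normal(d_obs, se, tau):
    """BF10 for an observed effect d_obs with standard error se,
    against a zero-centered normal prior with scale tau under H1."""
    marginal_h1 = norm.pdf(d_obs, loc=0, scale=(se**2 + tau**2) ** 0.5)
    likelihood_h0 = norm.pdf(d_obs, loc=0, scale=se)
    return marginal_h1 / likelihood_h0

d_obs, se = 0.10, 0.05  # a small observed effect, estimated fairly precisely

print(bf10_normal(d_obs, se, tau=1.0))  # ~0.37: the data favor H0 over a wide alternative
print(bf10_normal(d_obs, se, tau=0.1))  # ~2.2: the same data favor an alternative expecting small effects
```

Whether the data ‘support the null’ thus depends entirely on which alternative model the null model is compared against.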

Misunderstanding 3: Not specifying the null and/or alternative model.

Given that Bayes factors are relative evidence for or against one model compared to another model, it might be surprising that many researchers fail to specify the alternative model when reporting their analysis. And yet, in a systematic review of how psychologists use Bayes factors, van de Schoot et al. (2017) found that “31.1% of the articles did not even discuss the priors implemented”. Whereas in a null hypothesis significance test researchers do not need to specify the model that the test is based on, as the test is by definition a test against an effect of 0, and the alternative model consists of any non-zero effect size (in a two-sided test), this is not true when computing Bayes factors. The null model when computing Bayes factors is often (but not necessarily) a point null as in NHST, but the alternative model is only one of many possible alternative hypotheses that a researcher could test against. It has become common to use ‘default’ priors, but as with any heuristic, defaults will most often give an answer to a nonsensical question, and quickly become a form of mindless statistics. When introducing Bayes factors as an alternative to frequentist t-tests, Rouder et al. (2009) write:

This commitment to specify judicious and reasoned alternatives places a burden on the analyst. We have provided default settings appropriate to generic situations. Nonetheless, these recommendations are just that and should not be used blindly. Moreover, analysts can and should consider their goals and expectations when specifying priors. Simply put, principled inference is a thoughtful process that cannot be performed by rigid adherence to defaults.

The priors used when computing a Bayes factor should therefore be both specified and justified.

Misunderstanding 4: Claims based on Bayes Factors do not require error control.

In a paper with the provocative title “Optional stopping: No problem for Bayesians”, Rouder (2014) argues that “Researchers using Bayesian methods may employ optional stopping in their own research and may provide Bayesian analysis of secondary data regardless of the employed stopping rule.” If one were to merely read the title and abstract, one might come to the conclusion that Bayes factors are a wonderful solution to the error inflation due to optional stopping in the frequentist framework, but this is not correct (de Heide & Grünwald, 2017).

There is a big caveat about the type of statistical inference that is unaffected by optional stopping. Optional stopping is no problem for Bayesians if they refrain from making a dichotomous claim about the presence or absence of an effect, or when they refrain from drawing conclusions about a prediction being supported or falsified. Rouder notes how “Even with optional stopping, a researcher can interpret the posterior odds as updated beliefs about hypotheses in light of data.” In other words, even after optional stopping, a Bayes factor tells researchers how much they should update their belief in a hypothesis. Importantly, when researchers make dichotomous claims based on Bayes factors (e.g., “The effect did not differ significantly between the conditions, BF10 = 0.17”), this claim can be correct, or an error, and error rates become a relevant consideration, unlike when researchers simply present the Bayes factor for readers to update their personal beliefs.

Bayesians disagree among each other about whether Bayes factors should be the basis of dichotomous claims, or not. Those who promote the use of Bayes factors to make claims often refer to thresholds proposed by Jeffreys (1939), where a BF > 3 is “substantial evidence”, and a BF > 10 is considered “strong evidence”. Some journals, such as Nature Human Behaviour, have the following requirement for researchers who submit a Registered Report: “For inference by Bayes factors, authors must be able to guarantee data collection until the Bayes factor is at least 10 times in favour of the experimental hypothesis over the null hypothesis (or vice versa).” When researchers decide to collect data until a specific threshold is crossed to make a claim about a test, their claim can be correct, or wrong, just as when p-values are the statistical quantity a claim is based on. As both the Bayes factor and the p-value can be computed based on the sample size and the t-value (Francis, 2016; Rouder et al., 2009), there is nothing special about using Bayes factors as the basis of an ordinal claim. The exact long run error rates cannot be directly controlled when computing Bayes factors, and the Type 1 and Type 2 error rate depends on the choice of the prior and the choice of the cut-off used to decide to make a claim. Simulation studies show that for commonly used priors and a BF > 3 cut-off to make claims, the Type 1 error rate is somewhat smaller, but the Type 2 error rate is considerably larger, than in the corresponding frequentist test (Kelter, 2021).
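To make the error rate point concrete, here is a minimal simulation sketch (my own illustration, not the analyses reported by Kelter, 2021): it repeatedly samples two groups from the same population, computes the simplified normal-approximation Bayes factor introduced earlier, and counts how often a BF10 > 3 threshold would lead to the erroneous claim that an effect is present. The sample size, prior scale, and cut-off are arbitrary choices.

```python
# Monte Carlo sketch: long-run Type 1 error rate of claims based on BF10 > 3,
# using a simplified normal-approximation Bayes factor (not a default JZS test).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def bf10_normal(d_obs, se, tau=0.5):
    """BF10 for H1: delta ~ Normal(0, tau^2) versus H0: delta = 0."""
    return norm.pdf(d_obs, 0, np.sqrt(se**2 + tau**2)) / norm.pdf(d_obs, 0, se)

n, n_sims, false_claims = 50, 10_000, 0
for _ in range(n_sims):
    x, y = rng.normal(0, 1, n), rng.normal(0, 1, n)  # H0 is true: no difference
    d_obs = (x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    se = np.sqrt(2 / n)                        # approximate standard error of Cohen's d
    false_claims += bf10_normal(d_obs, se) > 3  # claiming an effect here is an error

print(false_claims / n_sims)  # Type 1 error rate for this prior, n, and cut-off
```

Changing the prior scale, the sample size, or the cut-off changes this error rate, which is exactly why error control has to be considered explicitly whenever Bayes factors are used to make claims.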

To conclude this section, whenever researchers make claims, they can make erroneous claims, and error control should be a worthy goal. Error control is not a consideration when researchers do not make ordinal claims (e.g., X is larger than Y, there is a non-zero correlation between X and Y, etc). If Bayes factors are used to quantify how much researchers should update personal beliefs in a hypothesis, there is no need to consider error control, but researchers should also refrain from making any ordinal claims based on Bayes factors in the results section or the discussion section. Giving up error control also means giving up claims about the presence or absence of effects.

Misunderstanding 5: Interpreting Bayes Factors as effect sizes.

Bayes factors are not statements about the size of an effect. It is therefore not appropriate to conclude that the effect size is small or large purely based on the Bayes factor. Depending on the priors used when specifying the alternative and null model, the same Bayes factor can be observed for very different effect size estimates. The reverse is also true: the same effect size can correspond to Bayes factors supporting the null or the alternative hypothesis, depending on how the null model and the alternative model are specified. Researchers should therefore always report and interpret effect size measures. Statements about the size of effects should only be based on these effect size measures, and not on Bayes factors.
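Using the same simplified normal-approximation Bayes factor as in the sketches above (again with made-up numbers, not a default JZS analysis), it is easy to see that the same observed effect size can yield Bayes factors favoring H0 or H1, and that very different effect sizes can yield roughly the same Bayes factor:

```python
# Bayes factors are not effect sizes: the same observed effect size can favor
# H0 or H1 depending on its precision, and different effect sizes can produce
# nearly the same Bayes factor. Illustrative numbers only.
from scipy.stats import norm

def bf10_normal(d_obs, se, tau=0.5):
    """BF10 for H1: delta ~ Normal(0, tau^2) versus H0: delta = 0."""
    return norm.pdf(d_obs, 0, (se**2 + tau**2) ** 0.5) / norm.pdf(d_obs, 0, se)

print(bf10_normal(0.15, se=0.20))  # ~0.5: d = 0.15, imprecisely estimated, favors H0
print(bf10_normal(0.15, se=0.05))  # ~8.6: the same d = 0.15, precisely estimated, favors H1
print(bf10_normal(0.24, se=0.10))  # ~3.1: very different effect sizes...
print(bf10_normal(0.66, se=0.30))  # ~3.0: ...but nearly the same Bayes factor
```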

Any tool for statistical inferences will be mis-used, and the greater the adoption, the more people will use a tool without proper training. Simplistic sales pitches for Bayes factors (e.g., Bayes factors tell you the probability that your hypothesis is true, Bayes factors do not require error control, you can use ‘default’ Bayes factors and do not have to think about your priors) contribute to this misuse. When reviewing papers that report Bayes factors, check if the authors use Bayes factors to draw correct inferences.

 

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524

Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719. https://doi.org/10.1037/a0024777

Cohen, J. (1994). The earth is round (p .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

de Heide, R., & Grünwald, P. D. (2017). Why optional stopping is a problem for Bayesians. arXiv:1708.08278 [Math, Stat]. https://arxiv.org/abs/1708.08278

Francis, G. (2016). Equivalent statistics and data interpretation. Behavior Research Methods, 1–15. https://doi.org/10.3758/s13428-016-0812-3

Jeffreys, H. (1939). Theory of probability (1st ed). Oxford University Press.

Kelter, R. (2021). Analysis of type I and II error rates of Bayesian and frequentist parametric and nonparametric two-sample hypothesis tests under preliminary assessment of normality. Computational Statistics, 36(2), 1263–1288. https://doi.org/10.1007/s00180-020-01034-7

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301–308.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225

van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A systematic review of Bayesian articles in psychology: The last 25 years. Psychological Methods, 22(2), 217–239. https://doi.org/10.1037/met0000100

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432. https://doi.org/10.1037/a0022790

Wong, T. K., Kiers, H., & Tendeiro, J. (2022). On the Potential Mismatch Between the Function of the Bayes Factor and Researchers’ Expectations. Collabra: Psychology, 8(1), 36357. https://doi.org/10.1525/collabra.36357

Monday, February 20, 2023

New Podcast: Nullius in Verba

Together with my co-host Smriti Mehta, we've started a new podcast: Nullius in Verba. It's a podcast about science - what it is, and what it could be. The introduction episode is up now, and new episodes will be released every other week starting this Friday!

You can subscribe by clicking the links below:

Apple Podcast:

https://podcasts.apple.com/us/podcast/nullius-in-verba/id1672861665

Spotify:

https://open.spotify.com/show/0tw0yVZxuf2X4iN6kFay5A 

or by adding this RSS feed to your podcast player:

https://feed.podbean.com/nulliusinverba/feed.xml

 


 

We will release episodes on Friday every other week (starting this Friday). Topics in the first episodes are inspired by aphorisms in Francis Bacon's 'Novum Organum'. The galleon in our logo comes from the title page of his 1620 book, where it is passing between the mythical Pillars of Hercules that stand on either side of the Strait of Gibraltar. The pillars, which mark the exit from the well-charted waters of the Mediterranean into the Atlantic Ocean, have been smashed through by Iberian sailors, opening a new world for exploration. Bacon hoped that empirical investigation would similarly smash the old scientific ideas and lead to a greater understanding of nature and the world.

As we explain in the introduction, the title of the podcast comes from the motto of the Royal Society. 

Our logo is set in typeface Kepler by Robert Slimbach. Our theme song is Newton’s Cradle by Grandbrothers. You see we are going all in on subtle science references.

We hope you'll enjoy listening along as we discuss themes like confirmation bias, skepticism, eminence, the 'itch to publish' and in the first episode, the motivations to do science.

Monday, July 18, 2022

Irwin Bross justifying the 0.05 alpha level as a means of communication between scientists

This blog post is based on the chapter “Critical Levels, Statistical Language, and Scientific Inference” by Irwin D. J. Bross (1971) in the proceedings of the symposium on the foundations of statistical inference held in 1970. Because the conference proceedings might be difficult to access, I am citing extensively from the original source. Irwin D. J. Bross [1921-2004] was a biostatistician at Roswell Park Cancer Institute in Buffalo up to 1983.

 

Irwin D. J. Bross

 

Criticizing the use of thresholds such as an alpha level of 0.05 to make dichotomous inferences is nothing new. Bross wrote in 1971: “Of late the attacks on critical levels (and the statistical methods based on these levels) have become more frequent and more vehement.” I feel the same way, but it seems unlikely the vehemence of criticisms has been increasing for half a century. A more likely explanation is perhaps that some people, like Bross and myself, become increasingly annoyed by such criticisms.

 

Bross reflects on how very few justifications of the use of alpha levels exist in the literature, because “Elementary statistics texts are not equipped to go into the matter; advanced texts are too preoccupied with the latest and fanciest statistical techniques to have space for anything so elementary. Thus the justifications for critical levels that are commonly offered are flimsy, superficial, and badly outdated.” He notes how the use of critical values emerged in a time when statisticians had more practical experience, but that “Unfortunately, many of the theorists nowadays have lost touch with statistical practice and as a consequence, their work is mathematically sophisticated but scientifically very naive.”

 

Bross sets out to consider which justification can be given for the use of critical alpha levels. He would like such a justification to convince those who use statistical methods in their research, and statisticians who are familiar with statistical practice. He argues that the purpose of a medical researcher “is to reach scientific conclusions concerning the relative efficacy (or safety) of the drugs under test. From the investigator's standpoint, he would like to make statements which are reliable and informative. From a somewhat broader standpoint, we can consider the larger communication network that exists in a given research area - the network which would connect a clinical pharmacologist with his colleagues and with the practicing physicians who might make use of his findings. Any realistic picture of scientific inference must take some account of the communication networks that exist in the sciences.”

 

It is rare to see biostatisticians explicitly embrace the pragmatic and social aspect of scientific inference. He highlights three main points about communication networks. “First, messages generate messages.” Colleagues might replicate a study, or build on it, or apply the knowledge. “A second key point is: discordant messages produce noise in the network”. “A third key point is: statistical methods are useful in controlling the noise in the network”. The critical level set by researchers controls the noise in the network. Too much noise in a network impedes scientific progress, because communication breaks down. He writes “Thus the specification of the critical levels […] has proved in practice to be an effective method for controlling the noise in communication networks.” Bross also notes that the critical alpha level in itself is not enough to reduce noise – it is just one component of a well-designed experiment that reduces noise in the network. Setting a sufficiently low alpha level is therefore one aspect that contributes to a system where people in the network can place some reliance on claims that are made because noise levels are not too high.

 

“This simple example serves to bring out several features of the usual critical level techniques which are often overlooked although they are important in practice. Clearly, if each investigator holds the proportion of false positive reports in his studies at under 5%, then the proportion of false positive reports from all of the studies carried out by the participating members of the network will be held at less than 5%. This property does not sound very impressive – it sounds like the sort of property one would expect any sensible statistical method to have. But it might be noted that most of the methods advocated by the theoreticians who object to critical levels lack this and other important properties which facilitate control of the noise level in the network.”
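Bross's claim about noise control can be illustrated with a minimal simulation sketch (my own illustration, not an analysis from Bross, 1971; all numbers are arbitrary): if every lab in a network tests true-null hypotheses at an alpha level of 5%, the pooled proportion of false positive claims across the whole network also stays at roughly 5%.

```python
# Sketch: pooled false positive rate across a network of labs that each test
# true-null hypotheses at alpha = 0.05 (arbitrary numbers for illustration).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha, n_labs, studies_per_lab, n = 0.05, 20, 100, 30
false_positives, total = 0, 0

for lab in range(n_labs):
    for study in range(studies_per_lab):
        x, y = rng.normal(0, 1, n), rng.normal(0, 1, n)  # no true effect
        false_positives += ttest_ind(x, y).pvalue < alpha
        total += 1

print(false_positives / total)  # close to 0.05 across the whole network
```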

 

This point, common among all error statisticians, has repeatedly raised its head in response to suggestions to abandon statistical significance, or to stop interpreting p-values dichotomously. Of course, one can choose not to care about the rate at which researchers make erroneous claims, but it is important to realize the consequences of not caring about error rates. Of course, one can work towards a science where scientists no longer make claims, but generate knowledge through some other mechanism. But recent proposals to treat p-values as continuous measures of evidence (Muff et al., 2021; but see Lakens, 2022a) or to use estimation instead of hypothesis tests (Elkins et al., 2021; but see Lakens, 2022b) do not outline what such an alternative mode of knowledge generation would look like, or how researchers will be prevented from making claims about the presence or absence of effects.

 

Bross proposes an intriguing view on statistical inference where “statistics is used as a means of communication between the members of the network.” He views statistics not as a way to learn whether idealized probability distributions accurately reflect the empirical reality in infinitely repeated samples, but as a way to communicate assertions that are limited by the facts. He argues that specific ways of communicating only become widely used if they are effective. Here, I believe he fails to acknowledge that ineffective communication systems can also evolve, and it is possible that scientists en masse use techniques, not because they are efficient ways of communicating facts, but because they will lead to scientific publications. The idea that statistical inferences are a ‘mindless ritual’ has been proposed (Gigerenzer, 2018), and there is no doubt that many scientists simply imitate the practices they see. Furthermore, the replication crisis has shown that huge error rates in subfields of the scientific literature can exist for decades. The problems associated with these large error rates (e.g., failures to replicate findings, inaccurate effect size estimates) can sometimes only very slowly lead to a change in practice. So, arguing that a practice survives because it works is risky. Whether current research practices are effective – or whether other practices would be more effective – requires empirical evidence. Randomized controlled trials seem a bridge too far to compare statistical approaches, but natural experiments by journals that abandon p-values support Bross’s argument to some extent. When the journal Basic and Applied Social Psychology abandoned p-values, the consequence was that researchers claimed effects were present at a higher error rate than if claims had been limited by typical alpha level thresholds (Fricker et al., 2019).

 

Whether efficient or not, statistical language surrounding critical thresholds is in widespread use. Bross discusses how an alpha level of 5% or 1% is a convention. Like many linguistic conventions, the threshold of 5% is somewhat arbitrary, and reflects the influence of statisticians like Karl Pearson and Ronald Fisher. However, the threshold is not completely arbitrary. Bross asks us to imagine what would have happened had the alpha level of 0.001 been proposed, or an alpha level of 0.20. In both cases, he believes the convention would not have spread – in the first case because in many fields there are not sufficient resources to make claims at such a low error rate, and in the second case because few researchers would have found that alpha level a satisfactory quantification of ‘rare’ events. He, I think correctly, observes this means that the alpha level of 0.05 is not completely arbitrary, but that it reflects a quantification of ‘rare’ that researchers believe has sufficient practical value to be used in communication. Bross argues that the convention of a 5% alpha level spread because it sufficiently matched with what most scientists considered as an appropriate probability level to define ‘rare’ events, for example as when Fisher (1956) writes “Either an exceptionally rare chance has occurred, or the theory of random distribution is not true.”

 

Of course, one might follow Bross and ask “But is there any reason to single out one particular value, 5%, instead of some other value such as 3.98% or 7.13%?”. He writes: “Again, in conformity with the linguistic patterns in setting conventions it is natural to use a round number like 5%. Such round numbers serve to avoid any suggestion that the critical value has been gerrymandered or otherwise picked to prove a point in a particular study. From an abstract standpoint, it might seem more logical to allow a range of critical values rather than to choose one number but to do so would be to go contrary to the linguistic habits of fact-limited languages. Such languages tend to minimize the freedom of choice of the speaker in order to insure that a statement results from factual evidence and not from a little language-game played by the speaker.”

 

In a recent paper we made a similar point (Uygun Tunç et al., 2021): “The conventional use of an alpha level of 0.05 can also be explained by the requirement in methodological falsificationism that statistical decision procedures are specified before the data is analyzed (see Popper, 2002, sections 19-20; Lakatos, 1978, p. 23-28). If researchers are allowed to set the alpha level after looking at the data there is a possibility that confirmation bias (or more intentional falsification-deflecting strategies) influences the choice of an alpha level. An additional reason for a conventional alpha level of 0.05 is that before the rise of the internet it was difficult to transparently communicate the pre-specified alpha level for any individual test to peers. The use of a default alpha level therefore effectively functioned as a pre-specification. For a convention to work as a universal pre-specification, it must be accepted by nearly everyone, and be extremely resistant to change. If more than a single conventional alpha level exists, this introduces the risk that confirmation bias influences the choice of an alpha level.” Our thoughts seem to be very much aligned with those of Bross.

 

Bross continues and writes “Anyone familiar with certain areas of the scientific literature will be well aware of the need for curtailing language-games. Thus if there were no 5% level firmly established, then some persons would stretch the level to 6% or 7% to prove their point. Soon others would be stretching to 10% and 15% and the jargon would become meaningless. Whereas nowadays a phrase such as statistically significant difference provides some assurance that the results are not merely a manifestation of sampling variation, the phrase would mean very little if everyone played language-games. To be sure, there are always a few folks who fiddle with significance levels-who will switch from two-tailed to one-tailed tests or from one significance test to another in an effort to get positive results. However such gamesmanship is severely frowned upon and is rarely practiced by persons who are native speakers of fact-limited scientific languages - it is the mark of an amateur.”

 

We struggled with the idea that changing the alpha level (and especially increasing the alpha level) might confuse readers when ‘statistically significant’ no longer means ‘rejected with a 5% alpha level’ in our recent paper on justifying alpha levels (Maier & Lakens, 2022). We wrote “Finally, the use of a high alpha level might be missed if readers skim an article. We believe this can be avoided by having each scientific claim accompanied by the alpha level under which it was made. Scientists should be required to report their alpha levels prominently, usually in the abstract of an article alongside a summary of the main claim.” It might in general be an improvement if people write ‘we reject an effect size of X at an alpha level of 5%’, but this is especially true if researchers choose to deviate from the conventional 5% alpha level.

 

I like how Bross has an extremely pragmatic but still principled view on statistical inferences. He writes: “This means that we have to abandon the traditional prescriptive attitude and adopt the descriptive approach which is characteristic of the empirical sciences. If we do so, then we get a very different picture of what statistical and scientific inference is all about. It is very difficult, I believe, to get such a picture unless you have had some first hand experience as an independent investigator in a scientific study. You then learn that drawing conclusions from statistical data can be a traumatic experience. A statistical consultant who takes a detached view of things just does not feel this pain.” This point is made by other applied statisticians, and it is a really important one. There are consequences of statistical inferences that you only experience when you spend several years trying to answer a substantive research question. Without that experience, it is difficult to give practical recommendations about what researchers should want when using statistics.

 

He continues “What you want - and want desperately - is all the protection you can get against the "slings and arrows of outrageous fortune". You want to say something informative and useful about the origins and nature of a disease or health hazard. But you do not want your statement to come back and haunt you for the rest of your life.” Of course, Bross should have said ‘One thing you might want’ (because now his statement is just another example of The Statistician’s Fallacy (Lakens, 2021)). But with this small amendment, I think there are quite some scientists who want this from their statistical inferences. He writes “When you announce a new finding, you put your scientific reputation on the line. Your colleagues probably cannot remember all your achievements, but they will never forget any of your mistakes! Second thoughts like these produce an acute sense of insecurity.” Not all scientists might feel like this, I hope we are willing to forget some mistakes people make, and I fear the consequences of making too many incorrect claims for the reputation of a researcher are not as severe as Bross suggests[i]. But I think many fellow researchers will experience some fear that their findings do not hold up (at least until they have been replicated several times), and that hearing researchers failed to replicate a finding yields some negative affect.

 

I think Bross hits the nail on the head when it comes to thinking about justifications of the use of alpha levels as thresholds to make claims. The justification for this practice is social and pragmatic in nature, not statistical (cf. Uygun Tunç et al., 2021). If we want to evaluate if current practices are useful or not, we have to abandon a prescriptive approach, and rely on a descriptive approach (Bross, p. 511). Anyone proposing an alternative to the use of alpha levels should not make prescriptive arguments, but provide descriptive data (or at least predictions) that highlight how their preferred approach to statistical inferences will improve communication between scientists.

 

 

References

 

Bross, I. D. (1971). Critical levels, statistical language and scientific inference. In Foundations of statistical inference (pp. 500–513). Holt, Rinehart and Winston.

Elkins, M. R., Pinto, R. Z., Verhagen, A., Grygorowicz, M., Söderlund, A., Guemann, M., Gómez-Conesa, A., Blanton, S., Brismée, J.-M., Ardern, C., Agarwal, S., Jette, A., Karstens, S., Harms, M., Verheyden, G., & Sheikh, U. (2021). Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. Journal of Physiotherapy. https://doi.org/10.1016/j.jphys.2021.12.001

Fisher, R. A. (1956). Statistical methods and scientific inference (Vol. viii). Hafner Publishing Co.

Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban. The American Statistician, 73(sup1), 374–384. https://doi.org/10.1080/00031305.2018.1537892

Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 1(2), 198–218. https://doi.org/10.1177/2515245918771329

Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

Lakens, D. (2022a). Why P values are not measures of evidence. Trends in Ecology & Evolution. https://doi.org/10.1016/j.tree.2021.12.006

Lakens, D. (2022b). Correspondence: Reward, but do not yet require, interval hypothesis tests. Journal of Physiotherapy, 68(3), 213–214. https://doi.org/10.1016/j.jphys.2022.06.004

Maier, M., & Lakens, D. (2022). Justify Your Alpha: A Primer on Two Practical Approaches. Advances in Methods and Practices in Psychological Science, 5(2), 25152459221080396. https://doi.org/10.1177/25152459221080396

Muff, S., Nilsen, E. B., O’Hara, R. B., & Nater, C. R. (2021). Rewriting results sections in the language of evidence. Trends in Ecology & Evolution. https://doi.org/10.1016/j.tree.2021.10.009

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by

 



[i] I recently saw the bio of a social psychologist who has produced a depressingly large number of incorrect claims in the literature. His bio made no mention of this fact, but proudly boasted about the thousands of times he was cited, even though most citations were for work that did not survive the replication crisis. How much reputations should suffer is an intriguing question that I think too many scientists will never feel comfortable addressing.

Monday, May 9, 2022

Tukey on Decisions and Conclusions

In 1955 Tukey gave a dinner talk about the difference between decisions and conclusions at a meeting of the Section of Physical and Engineering Science of the American Statistical Association. The talk was published in 1960. The distinction relates directly to different goals researchers might have when they collect data. This blog is largely a summary of his paper.

 


Tukey was concerned about the ‘tendency of decision theory to attempt to conquest all of statistics’. In hindsight, he needn’t have worried. In the social sciences, most statistics textbooks do not even discuss decision theory. His goal was to distinguish decisions from conclusions, and to carve out a space for ‘conclusion theory’ to complement decision theory.

 

In practice, making a decision means to ‘decide to act for the present as if’. Possible actions are defined, possible states of nature identified, and we make an inference about each state of nature. Decisions can be made even when we remain extremely uncertain about any ‘truth’. Indeed, in extreme cases we can even make decisions without access to any data. We might even decide to act as if two mutually exclusive states of nature are true! For example, we might buy a train ticket for a holiday three months from now, but also take out life insurance in case we die tomorrow.   

 

Conclusions differ from decisions. First, conclusions are established without taking consequences into consideration. Second, conclusions are used to build up a ‘fairly well-established body of knowledge’. As Tukey writes: “A conclusion is a statement which is to be accepted as applicable to the conditions of an experiment or observation unless and until unusually strong evidence to the contrary arises.” A conclusion is not a decision on how to act in the present. Conclusions are to be accepted, and thereby incorporated into what Frick (1996) calls a ‘corpus of findings’. According to Tukey, conclusions are used to narrow down the number of working hypotheses still considered consistent with observations. Conclusions should be reached, not based on their consequences, but because of their lasting (but not everlasting, as conclusions can now and then be overturned by new evidence) contribution to scientific knowledge.

 

Tests of hypotheses

 

According to Tukey, a test of hypotheses can have two functions. The first function is as a decision procedure, and the second function is to reach a conclusion. In a decision procedure the goal is to choose a course of action given an acceptable risk. This risk can be high. For example, a researcher might decide not to pursue a research idea after a first study, designed to have 80% power for a smallest effect size of interest, yields a non-significant result. The error rate is at most 20%, but the researcher might have enough good research ideas to not care.

 

The second function is to reach a conclusion. This is done, according to Tukey, by controlling the Type 1 and Type 2 error rate at ‘suitably low levels’ (Note: Tukey’s discussion of concluding that an effect is absent is hindered somewhat by the fact that equivalence tests were not yet widely established in 1955 – Hodges & Lehmann’s paper appeared in 1954). Low error rates, such as the conventional 5% or 1% alpha level, are needed to draw conclusions that can enter the corpus of findings (even though some of these conclusions will turn out to be wrong, in the long run).

 

Why would we need conclusions?

 

One might reasonably wonder if we need conclusions in science. Tukey also ponders this question in Appendix 2. He writes “Science, in the broadest sense, is both one of the most successful of human affairs, and one of the most decentralized. In principle, each of us puts his evidence (his observations, experimental or not, and their discussion) before all the others, and in due course an adequate consensus of opinion develops.” He argues not for an epistemological reason, nor for a statistical reason, but for a sociological reason. Tukey writes: “There are four types of difficulty, then, ranging from communication through assessment to mathematical treatment, each of which by itself will be sufficient, for a long time, to prevent the replacement, in science, of the system of conclusions by a system based more closely on today’s decision theory.” He notes how scientists can no longer get together in a single room (as was somewhat possible in the early decades of the Royal Society of London) to reach consensus about decisions. Therefore, they need to communicate conclusions, as “In order to replace conclusions as the basic means of communication, it would be necessary to rearrange and replan the entire fabric of science.”

 

I hadn’t read Tukey’s paper when we wrote our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests”. In this preprint, we also discuss a sociological reason for the presence of dichotomous claims in science. We also ask: “Would it be possible to organize science in a way that relies less on tests of competing theories to arrive at intersubjectively established facts about phenomena?” and similarly conclude: “Such alternative approaches seem feasible if stakeholders agree on the research questions that need to be investigated, and methods to be utilized, and coordinate their research efforts”.  We should add a citation to Tukey's 1960 paper.

 

Is the goal of a study a conclusion, a decision, or both?

 

Tukey writes he “looks forward to the day when the history and status of tests of hypotheses will have been disentangled.” I think that in 2022 that day has not yet come. At the same time, Tukey admits in Appendix 1 that the two are sometimes intertwined.

 

A situation Tukey does not discuss, but that I think is especially difficult to disentangle, is a cumulative line of research. Although I would prefer to only build on an established corpus of findings, this is simply not possible. Not all conclusions in the current literature are reached with low error rates. This is true both for claims about the absence of an effect (which are rarely based on an equivalence test against a smallest effect size of interest with a low error rate) and for claims about the presence of an effect, not just because of p-hacking, but also because I might want to build on an exploratory finding from a previous study. In such cases, I would like to be able to conclude the effects I build on are established findings, but more often than not, I have to decide these effects are worth building on. The same holds for choices about the design of a set of studies in a research line. I might decide to include a factor in a subsequent study, or drop it. These decisions would be based on conclusions with low error rates if I had the resources to collect large samples and perform replication studies, but at other times they involve decisions about how to act in my next study that carry quite considerable risk.

 

We allow researchers to publish feasibility studies, pilot studies, and exploratory studies. We don’t require every study to be a Registered Report or Phase 3 trial. Not all information in the literature that we build on has been established with the rigor Tukey associates with conclusions. And the replication crisis has taught us that more conclusions from the past are later rejected than we might have thought based on the alpha levels reported in the original articles. And in some research areas, where data is scarce, we might need to accept that, if we want to learn anything, the conclusions will always be more tentative (and the error rates accepted in individual studies will be higher) than in research areas where data is abundant.

 

Even if decisions and conclusions can not be completely disentangled, reflecting on their relative differences is very useful, as I think it can help us to clarify the goal we have when we collect data. 

 

For a 2013 blog post by Justin Esarey, who found the distinction a bit less useful than I found it, see https://polmeth.org/blog/scientific-conclusions-versus-scientific-decisions-or-we%E2%80%99re-having-tukey-thanksgiving

 

References

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Tukey, J. W. (1960). Conclusions vs decisions. Technometrics, 2(4), 423–433.

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by

 

 

 

 

 

Tuesday, May 3, 2022

Collaborative Author Involved Replication Studies

Recently a new category of studies has started to appear in the psychological literature that provides the strongest support to date for a replication crisis in psychology: large-scale collaborative replication studies in which the authors of the original study are directly involved. These replication studies have often provided conclusive demonstrations of the absence of any effect large enough to matter. Despite considerable attention for these extremely interesting projects, I don’t think the scientific community has fully appreciated what we have learned from these studies.

 

Three examples of Collaborative Author Involved Replication Studies

 

Vohs and colleagues (2021) performed a multi-lab replication study of the ego-depletion effect, which (deservedly) has become a poster child of non-replicable effects in psychology. The teams used different combinations of protocols, allowing an unsuccessful prediction to generalize across minor variations in how the experiment was operationalized. Across these conditions, a non-significant effect was observed of d = 0.06, 95% CI [-0.02; 0.14]. Although the authors regrettably did not specify a smallest effect size of interest in their frequentist analyses, they mention “we pitted a point-null hypothesis, which states that the effect is absent, against an informed one-sided alternative hypothesis centered on a depletion effect (δ) of 0.30 with a standard deviation of 0.15” in their Bayesian analyses. Based on the confidence interval, we can reject effects of d = 0.3, and even d = 0.2, suggesting that we have extremely informative data concerning the absence of an effect most ego-depletion researchers would consider large enough to matter.

 

Morey et al. (2021) performed a multi-lab replication study of the Action-Sentence Compatibility effect (Glenberg & Kaschak, 2002). I cited the original paper in my PhD thesis, and it was an important finding that I built on, so I was happy to join this project. As written in the replication study, the replication team, together with the original authors, “established and pre-registered ranges of effects on RT that we would deem (a) uninteresting and inconsistent with the ACE theory: less than 50 ms.” An effect between 50 ms and 100 ms was seen as inconsistent with the previous literature, but in line with predictions of the ACE effect. The replication study consisted (after exclusions) of 903 native English speakers and 375 non-native English speakers. The original study had used 44, 70, and 72 participants across 3 studies. The conclusion in the replication study was that “the median ACE interactions were close to 0 and all within the range that we pre-specified as negligible and inconsistent with the existing ACE literature.” There was little heterogeneity.

 

Last week, Many Labs 4 was published (Klein et al., 2022). This study was designed to examine the mortality salience effect (which I think deserves the same poster child status of a non-replicable effect in psychology, but which seems to have gotten less attention so far). Data from 1550 participants was collected across 17 labs, some of which performed the study with involvement of the original author, and some of which did not. Several variations of the analyses were preregistered, but none revealed the predicted effect, Hedges’ g = 0.07, 95% CI = [-0.03, 0.17] (for exclusion set 1). The authors did not provide a formal sample size justification based on a smallest effect size of interest, but in a sensitivity power analysis indicate they had 95% power for effect sizes of d = 0.18 to d = 0.21. If we assume all authors found effect sizes around d = 0.2 small enough to no longer support their predictions, we can see based on the confidence intervals that we can indeed exclude effect sizes large enough to matter. The mortality salience effect, even with involvement of the original authors, seems to be too small to matter. There was little heterogeneity in effect sizes (in part because of the absence of an effect).

 

These are just three examples (there are more, of which the multi-lab test of the facial feedback hypothesis by Coles et al., 2022, is worth mentioning), but they illustrate some interesting properties of collaborative author involved replication studies. I will highlight four strengths of these studies.

 

Four strengths of Collaborative Author Involved Replication Studies

 

1) The original authors are extensively involved in the design of the study. They sign off on the final design, and agree that the study is, with the knowledge they currently have, the best test of their prediction. This means the studies tell us something about the predictive validity of state of the art knowledge in a specific field. If the predictions these researchers make are not corroborated, the knowledge we have accumulated in these research areas is not reliable enough to make successful predictions.

2) The studies are not always direct replications, but the best possible test of the hypothesis, in the eyes of the researchers involved. Criticism of past replication studies has been that directly replicating a study performed many years ago is not always insightful, as the context has changed (even though Many Labs 5 found no support for this criticism). In this new category of collaborative author involved replication studies, the original authors are free to design the best possible test of their prediction. If these tests fail, we cannot attribute the failure to replicate to the ‘protective belt’ of auxiliary hypotheses that no longer hold. Of course, it is possible that the theory can be adjusted in a constructive manner after this unsuccessful prediction. But at this moment, these original authors do not have a sufficiently solid understanding of their research topic to be able to predict whether an effect will be observed.

3) The other researchers involved in these projects often have extensive expertise in the content area. They are not just researchers interested in mechanistically performing a replication study on a topic they have little expertise with. Instead, many of the researchers are peers who have worked in a specific research area, published on the topic of the replication study, but have collectively developed some doubts about the reliability of past claims, and have decided to spend some of their time replicating a previous finding.

4) The statistical analyses in these studies yield informative conclusions. The studies typically do not conclude the prediction was unsuccessful based on p > 0.05 in a small sample. In the most informative studies, original authors have explicitly specified a smallest effect size of interest, which makes it possible to perform an equivalence test, and statistically reject the presence of any effect deemed large enough to matter. In other cases, Bayesian hypothesis tests are performed which provide support for the null model, compared to the alternative model. This makes these replication studies severe tests of the predicted effect. In cases where original authors did not specify a smallest effect size of interest, the very large sample sizes allow readers to examine which effects can be rejected based on the observed confidence interval, and in all the studies discussed here, we can reject the presence of effects large enough to be considered meaningful. There is most likely not a PhD student in the world who would be willing to examine these effects, given the effect sizes that remain possible after these collaborative author involved replication studies. We can never conclude an effect is exactly zero, but that hardly matters – the effects are clearly too small to study.
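For readers unfamiliar with equivalence testing, the sketch below illustrates the logic with two one-sided tests (TOST) against equivalence bounds of d = ±0.2. The summary numbers are made up for illustration and this is not the analysis code from any of the studies discussed above; it simply shows how a large sample allows effects at least as large as the smallest effect size of interest to be rejected, while a small original-sized study remains inconclusive.

```python
# TOST equivalence test sketch against bounds of +/- 0.2 (illustrative numbers).
import numpy as np
from scipy.stats import norm

def tost_equivalent(d_obs, n1, n2, bound=0.2, alpha=0.05):
    """Two one-sided tests for a standardized mean difference, using a normal
    approximation for the standard error of Cohen's d."""
    se = np.sqrt((n1 + n2) / (n1 * n2) + d_obs**2 / (2 * (n1 + n2)))
    p_lower = 1 - norm.cdf((d_obs + bound) / se)  # H0: d <= -bound
    p_upper = norm.cdf((d_obs - bound) / se)      # H0: d >= +bound
    return max(p_lower, p_upper) < alpha          # reject both: statistically equivalent

# A large multi-lab sample with a tiny observed effect: equivalence is declared.
print(tost_equivalent(d_obs=0.06, n1=1500, n2=1500))  # True
# The same observed effect in a small original-sized study: inconclusive.
print(tost_equivalent(d_obs=0.06, n1=40, n2=40))      # False
```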

 

The Steel Man for Replication Crisis Deniers

 

Given the reward structures in science, it is extremely rewarding for individual researchers to speak out against the status quo. Currently, the status quo is that the scientific community has accepted there is a replication crisis. Some people attempt to criticize this belief. This is important. All established beliefs in science should be open to criticism.

Most papers that aim to challenge the fact that many scientific domains have surprising difficulty successfully replicating findings once believed to be reliable focus on the 100 studies in the Reproducibility Project: Psychology, which was started a decade ago and published in 2015. This project was incredibly successful in creating awareness of concerns around replicability, but it was not incredibly informative about how big the problem was.

In the conclusion of the RP:P, the authors wrote: “After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. Humans desire certainty, and science infrequently provides it. As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation.” The RP:P was an important project, but it is no longer the project to criticize if you want to provide evidence against the presence of a replication crisis.

Since the start of the RP:P, other projects have aimed to complement our insights about replicability. Registered Replication Reports focused on single studies, replicated in much larger sample sizes, to reduce the probability of a Type 2 error. These studies often quite conclusively showed original studies did not replicate, and a surprisingly large number yielded findings not statistically different from 0, despite sample sizes much larger than psychologists would be able to collect in normal research lines. Many Labs studies focused on a smaller set of studies, replicated many times, sometimes with minor variations to examine the role of possible moderators proposed to explain failures to replicate, which were typically absent.

The collaborative author involved replications are the latest addition to this expanding literature that consistently shows great difficulties in replicating findings. I believe they currently make up the steel man for researchers motivated to cast doubt on the presence of a replication crisis. The fact that these large projects with direct involvement of the original authors cannot find support for predicted effects is the strongest evidence to date that we have a problem replicating findings. Of course, these studies are complemented by Registered Replication Reports and Many Labs studies, and together they make up the Steel Man to argue against if you are a Replication Crisis Denier.

 

References

Coles, N. A., March, D. S., Marmolejo-Ramos, F., Larsen, J., Arinze, N. C., Ndukaihe, I., Willis, M., Francesco, F., Reggev, N., Mokady, A., Forscher, P. S., Hunter, J., Gwenaël, K., Yuvruk, E., Kapucu, A., Nagy, T., Hajdu, N., Tejada, J., Freitag, R., … Marozzi, M. (2022). A Multi-Lab Test of the Facial Feedback Hypothesis by The Many Smiles Collaboration. PsyArXiv. https://doi.org/10.31234/osf.io/cvpuw

 

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Hilgard, J., Ahn, P. H., Brady, A. J., Chartier, C. R., Christopherson, C. D., Clay, S., Collisson, B., Crawford, J. T., Cromar, R., Gardiner, G., Gosnell, C. L., Grahe, J., Hall, C., Howard, I., … Ratliff, K. A. (2022). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement. Collabra: Psychology, 8(1), 35271. https://doi.org/10.1525/collabra.35271

 

Morey, R. D., Kaschak, M. P., Díez-Álamo, A. M., Glenberg, A. M., Zwaan, R. A., Lakens, D., Ibáñez, A., García, A., Gianelli, C., Jones, J. L., Madden, J., Alifano, F., Bergen, B., Bloxsom, N. G., Bub, D. N., Cai, Z. G., Chartier, C. R., Chatterjee, A., Conwell, E., … Ziv-Crispel, N. (2021). A pre-registered, multi-lab non-replication of the action-sentence compatibility effect (ACE). Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-021-01927-8

 

Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J. L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi, J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay, S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733

 

Saturday, November 20, 2021

Why p-values should be interpreted as p-values and not as measures of evidence

Update: Florian Hartig has also published a blog post criticizing the paper by Muff et al (2021). 

In a recent paper Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence". This is a surprising recommendation, given that p-values are not valid measures of evidence (Royall, 1997). The authors follow Bland (2015) who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis” and proposed verbal labels for p-values in specific ranges (i.e., p-values above 0.1 are ‘little to no evidence’, p-values between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman & Royall, 1988). If the null-hypothesis is true, p-values are uniformly distributed. This means it is just as likely to observe a p-value of 0.001 as it is to observe a p-value of 0.999. This indicates that the interpretation of p = 0.001 as ‘strong evidence’ cannot be defended just because the probability to observe this p-value is very small. After all, if the null hypothesis is true, the probability of observing p = 0.999 is exactly as small.
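The uniformity of p-values under the null hypothesis is easy to verify with a small simulation (a sketch with an arbitrary sample size of 50 per group): any two equally wide bins of p-values contain roughly the same proportion of results, so a p-value just below 0.01 is observed about as often as a p-value just above 0.99.

```python
# Simulate p-values for a two-sample t-test when the null hypothesis is true:
# the p-values are (approximately) uniformly distributed between 0 and 1.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2023)
pvals = np.array([
    ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
    for _ in range(20_000)
])

# Both 1%-wide bins contain about 1% of the p-values.
print(np.mean(pvals < 0.01), np.mean(pvals > 0.99))
```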

The reason that small p-values can be used to guide us in the direction of true effects is not that they are rarely observed when the null hypothesis is true, but that they are relatively less likely to be observed when the null hypothesis is true than when the alternative hypothesis is true. For this reason, statisticians have argued that the concept of evidence is necessarily ‘relative’. We can quantify evidence in favor of one hypothesis over another hypothesis, based on the probability of observing the data when the null hypothesis is true, compared to the probability of observing the data when an alternative hypothesis is true. As Royall (1997, p. 8) explains: “The law of likelihood applies to pairs of hypotheses, telling when a given set of observations is evidence for one versus the other: hypothesis A is better supported than B if A implies a greater probability for the observations than B does. This law represents a concept of evidence that is essentially relative, one that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569) write, “The p-value is not adequate for inference because the measurement of evidence requires at least three components: the observations, and two competing explanations for how they were produced.”
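As a toy illustration of this relative notion of evidence (my own made-up example, not taken from Royall or Goodman), the same observations can be compared under two competing hypotheses about a coin, and the likelihood ratio quantifies which hypothesis they support more, without saying anything about either hypothesis in isolation.

```python
# Toy likelihood-ratio example (hypothetical numbers): evidence is relative,
# a comparison of how well two hypotheses predict the observed data.
from scipy.stats import binom

heads, flips = 7, 10
p_under_A = binom.pmf(heads, flips, 0.5)  # hypothesis A: fair coin
p_under_B = binom.pmf(heads, flips, 0.8)  # hypothesis B: coin biased towards heads

# A ratio above 1 means the data support B over A; below 1, A over B.
print(p_under_B / p_under_A)  # ~1.7: modest relative support for B over A
```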

In practice, the problem with interpreting p-values as evidence in the absence of a clearly defined alternative hypothesis is that they at best serve as proxies for evidence, not as a useful measure in which a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. When researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will be correlated with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and in general it is too weak for p-values to function as a valid measure of evidence in which p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.


Why single p-values cannot be interpreted as the strength of evidence


The evidential value of a single p-value depends on the statistical power of the test (i.e., on the sample size in combination with the effect size of the alternative hypothesis). Statistical power expresses the probability of observing a p-value smaller than the alpha level if the alternative hypothesis is true. When the null hypothesis is true, statistical power is formally undefined, but in practice, for a two-sided test, a proportion of α of the observed p-values will fall below the alpha level, as p-values are uniformly distributed under the null hypothesis. The horizontal grey line in Figure 1 illustrates the expected p-value distribution for a two-sided independent t-test if the null hypothesis is true (i.e., when the true effect size Cohen’s d is 0). As every p-value is equally likely, p-values cannot quantify the strength of evidence against the null hypothesis.
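The shape of these distributions is easy to reproduce in a quick simulation. The sketch below (in Python; the per-group sample size of 64 and the true effect sizes of 0, 0.35, and 0.76 are my own assumptions, chosen to give roughly the power levels shown in Figure 1, not necessarily the values used to create the figure) simulates independent t-tests and tabulates how often p-values land below the alpha level and in a band around p = 0.168.

```python
# Rough simulation sketch of p-value distributions, in the spirit of Figure 1.
# Sample size and true effect sizes are assumptions chosen so that roughly
# 5% (null), 50%, and 99% of p-values fall below alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n = 50_000, 64  # 64 observations per group

def simulate_p(true_d):
    x = rng.normal(0, 1, size=(n_sims, n))
    y = rng.normal(true_d, 1, size=(n_sims, n))
    return stats.ttest_ind(y, x, axis=1).pvalue

for true_d in (0.0, 0.35, 0.76):  # null (grey line), ~50% power, ~99% power
    p = simulate_p(true_d)
    below_alpha = np.mean(p < 0.05)
    near_168 = np.mean((p > 0.15) & (p < 0.20))
    print(f"d = {true_d:.2f}: P(p < .05) = {below_alpha:.2f}, "
          f"P(.15 < p < .20) = {near_168:.3f}")
```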


Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve) and 99% (dotted black curve). 



If the alternative hypothesis is true the strength of evidence that corresponds to a p-value depends on the statistical power of the test. If power is 50%, we should expect that 50% of the observed p-values fall below the alpha level. The remaining p-values fall above the alpha level. The black curve in Figure 1 illustrates the p-value distribution for a test with a statistical power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when there is a true effect that is examined in a statistical test with 50% power than when the null hypothesis is true (as illustrated by the black curve being above the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence for an alternative hypothesis examined with 50% power, compared to the null hypothesis.

If an effect is examined in a test with 99% power (the dotted line in Figure 1) we would draw a different conclusion. With such high power p-values larger than the alpha level of 5% are rare (they occur only 1% of the time) and a p-value of 0.168 is much more likely to be observed when the null-hypothesis is true than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is evidence against an alternative hypothesis examined with 99% power, compared to the null hypothesis.

Figure 1 illustrates that with 99% power even a ‘statistically significant’ p-value of 0.04 is evidence in favor of the null hypothesis. The reason is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the grey horizontal line at p = 0.04 is above the dotted black curve). This fact, which is often counterintuitive when first encountered, is known as the Lindley paradox, or the Jeffreys-Lindley paradox (for a discussion, see Spanos, 2013).
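These claims can be checked numerically. The sketch below uses a two-sided z-test approximation (my own simplification, not the exact t-test behind Figure 1) in which the density of a p-value under the null hypothesis equals 1, so the printed ratio directly expresses how much more (or less) likely a given p-value is under an alternative with roughly 50% or 99% power.

```python
# Z-test approximation (a simplification of the t-test behind Figure 1):
# how likely is a given two-sided p-value under H1, relative to under H0?
# Ratios above 1 are relative evidence for H1; ratios below 1 favor H0.
from scipy.stats import norm

def density_ratio(p, delta):
    """Density of a two-sided p-value when the test statistic is N(delta, 1),
    divided by its density under H0 (which is uniform, i.e., equal to 1)."""
    z = norm.isf(p / 2)  # |z| value corresponding to the two-sided p-value
    return (norm.pdf(z - delta) + norm.pdf(z + delta)) / (2 * norm.pdf(z))

delta_50 = norm.isf(0.025)                   # ~1.96: roughly 50% power at alpha = .05
delta_99 = norm.isf(0.025) + norm.isf(0.01)  # ~4.29: roughly 99% power at alpha = .05

print(density_ratio(0.168, delta_50))  # ~1.1  -> weak evidence for H1
print(density_ratio(0.168, delta_99))  # ~0.02 -> strong evidence for H0
print(density_ratio(0.04, delta_99))   # ~0.3  -> evidence for H0, despite p < .05
```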

Figure 1 illustrates that different p-values can correspond to the same relative evidence in favor of a specific alternative hypothesis, and that the same p-value can correspond to different levels of relative evidence. This is obviously undesirable if we want to use p-values as a measure of the strength of evidence. Therefore, it is incorrect to verbally label any p-value as providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis, as depending on the alternative hypothesis a researcher is interested in, the level of evidence will differ (and the p-value could even correspond to evidence in favor of the null hypothesis).


All p-values smaller than 1 correspond to evidence for some non-zero effect


If the alternative hypothesis is not specified, any p-value smaller than 1 should be treated as at least some evidence (however small) for some alternative hypotheses. It is therefore not correct to follow the recommendation of the authors in their Table 2 to interpret p-values above 0.1 (e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against the argument by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies with a p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis based on p < 0.05 (see the forest plot in Figure 2, and the quick numerical check below it). It thus seems ill-advised to follow their recommendation to describe a single study with p = 0.168 as ‘no evidence’ for a relationship.


Figure 2: Forest plot for a meta-analysis of three identical studies yielding p = 0.168.
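As a rough numerical check of the claim above (using Stouffer’s method to combine the p-values, which is not necessarily the meta-analytic model behind the forest plot in Figure 2), three studies that each yield p = 0.168 with effects in the same direction jointly reject the null hypothesis at the 5% level:

```python
# Minimal sketch: combine three two-sided p-values of 0.168 (same direction)
# with Stouffer's method; the combined result is significant at alpha = .05.
from math import sqrt
from scipy.stats import norm

p_single = 0.168
z_single = norm.isf(p_single / 2)    # z for one study, effect in the predicted direction
z_combined = 3 * z_single / sqrt(3)  # Stouffer: sum of k z-scores divided by sqrt(k)
p_combined = 2 * norm.sf(z_combined) # two-sided combined p-value

print(z_single, z_combined, p_combined)  # ~1.38, ~2.39, ~0.017 < 0.05
```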


However, replacing the label of ‘no evidence’ with the label ‘at least some evidence for some hypotheses’ leads to practical problems when communicating the results of statistical tests. It seems generally undesirable to allow researchers to interpret any p-value smaller than 1 as ‘at least some evidence’ against the null hypothesis. This is the price one pays for not specifying an alternative hypothesis, and trying to interpret p-values from a null hypothesis significance test in an evidential manner. If we do not specify the alternative hypothesis, it becomes impossible to conclude there is evidence for the null hypothesis, and we cannot statistically falsify any hypothesis (Lakens, Scheel, et al., 2018). Some would argue that if you cannot falsify hypotheses, you have a bit of a problem (Popper, 1959).


Interpreting p-values as p-values


Instead of interpreting p-values as measures of the strength of evidence, we could consider a radical alternative: interpret p-values as p-values. This would, perhaps surprisingly, solve the main problem that Muff and colleagues aim to address, namely ‘black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff’. The idea to interpret p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative frequentist statistical philosophy was developed by Neyman and Pearson (1933a), who proposed to use p-values to guide decisions about the null and alternative hypothesis by, in the long run, controlling the Type I and Type II error rate. Researchers specify an alpha level, design a study with sufficiently high statistical power, and reject (or fail to reject) the null hypothesis.

Neyman and Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. First, Neyman and Pearson (1933b) leave open whether the states of the world are divided into two (‘accept’ and ‘reject’) or three regions, and write that a “region of doubt may be obtained by a further subdivision of the region of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics is to test range predictions instead of limiting oneself to a null hypothesis significance test (Lakens, 2021). This implements the idea of Neyman and Pearson to introduce a region of doubt, and distinguishes inconclusive results (where neither the null hypothesis nor the alternative hypothesis can be rejected, and more data needs to be collected to draw a conclusion) from conclusive results (where either the null hypothesis or the alternative hypothesis can be rejected).
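A minimal sketch of such a three-way verdict is given below, combining a standard null hypothesis test with an equivalence test (TOST) computed from summary statistics. The function name, the equivalence bounds, and the example numbers are my own illustrative assumptions, not the exact procedure described by Lakens (2021).

```python
# Minimal sketch (illustrative assumptions): a three-way verdict that adds a
# 'region of doubt' by combining a null hypothesis test with an equivalence
# test (TOST) against bounds marking the smallest effect size of interest.
from scipy.stats import t

def verdict(diff, se, df, low, high, alpha=0.05):
    # Null hypothesis significance test against a difference of zero.
    p_nhst = 2 * t.sf(abs(diff / se), df)
    # TOST: both one-sided tests against the bounds must be significant.
    p_lower = t.sf((diff - low) / se, df)    # H0: diff <= low
    p_upper = t.cdf((diff - high) / se, df)  # H0: diff >= high
    p_tost = max(p_lower, p_upper)
    if p_nhst < alpha and p_tost < alpha:
        return "reject H0, but the effect is too small to be meaningful"
    if p_nhst < alpha:
        return "reject the null hypothesis"
    if p_tost < alpha:
        return "reject all effects outside the equivalence bounds"
    return "inconclusive: neither hypothesis can be rejected, collect more data"

# Hypothetical summary statistics: mean difference 0.1, SE 0.2, df = 98,
# with effects smaller than |0.5| considered too small to matter.
print(verdict(diff=0.1, se=0.2, df=98, low=-0.5, high=0.5))
```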

In a Neyman-Pearson approach to hypothesis testing the act of rejecting a hypothesis comes with a maximum long-run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of rejections must be only tentative.” So when we reject the null model, we do so tentatively, aware of the fact that we might have done so in error, and without necessarily believing the null model is false. For Neyman (1957, p. 13) inferential behavior is an “act of will to behave in the future (perhaps until new experiments are performed) in a particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.

Furthermore, it is important to remember that hypothesis tests reject a statistical hypothesis, but not a theoretical hypothesis. As Neyman (1960, p. 290) writes: “the frequency of correct conclusions regarding the statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis in a specific experiment does not necessarily inform us about the truth of the theory. Decisions about the truthfulness of a theory require a careful evaluation of the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).

Neyman (1976) provides some reporting examples that reflect his philosophy on statistical inferences: “after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that ‘high’ scores on ‘potential’ and on ‘education’ are indicative of better chances of success in the drive to home ownership”. An example of a shorter statement that Neyman provides reads: “As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.”

A complete verbal description of the result of a Neyman-Pearson hypothesis test acknowledges two sources of uncertainty. First, the assumptions of the statistical test must be met (e.g., the data are normally distributed), or any deviations should be small enough to not have a substantial effect on the frequentist error rates. Second, conclusions are made “without hoping to know whether each separate hypothesis is true or false” (Neyman & Pearson, 1933a). Any single conclusion can be wrong, and assuming the test assumptions are met, we make claims under a known maximum error rate (which is never zero). Future replication studies are needed to provide further insights about whether the current conclusion was erroneous or not.

After observing a p-value smaller than the alpha level, one can therefore conclude: “Until new data emerges that proves us wrong, we decide to act as if there is an effect, while acknowledging that the methodological procedure we base this decision on has a maximum error rate of alpha% (assuming the statistical assumptions are met), which we find acceptably low.” One can follow such a statement about the observed data with a theoretical inference, such as “assuming our auxiliary hypotheses hold, the result of this statistical test corroborates our theoretical hypothesis”. If a conclusive equivalence test allows a researcher to reject the presence of any effect large enough to be meaningful, the conclusion would be that the test result does not corroborate the theoretical hypothesis.

It is true that the common application of null hypothesis significance testing in science is based on an arbitrary threshold of 0.05 (Lakens, Adolfi, et al., 2018). There are surprisingly few attempts to provide researchers with practical approaches to determine an alpha level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021; Miller & Ulrich, 2019; Mudge et al., 2012). This seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science in which the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which researchers are not expected to make a claim about a finding (e.g., a BF < 3, see Kass & Raftery, 1995, or a likelihood ratio lower than k = 8, see Royall, 2000). Although an alpha level of 0.05 is arbitrary, there are some pragmatic arguments in its favor (e.g., it is established, and it might be low enough for claims to be taken seriously, but not so strict that it prevents other researchers from attempting to refute the claim; see Uygun Tunç et al., 2021).


Is there really no agreement on best practices in sight?


One major impetus for the flawed proposal by Muff and colleagues to interpret p-values as evidence is that “no agreement on a way forward is in sight”. The statement that there is little agreement among statisticians is an oversimplification. I will go out on a limb and state some things I assume most statisticians agree on. First, there are multiple statistical tools one can use, and each tool has its own strengths and weaknesses. Second, there are different statistical philosophies, each with their own coherent logic, and researchers are free to analyze data from the perspective of one or multiple of these philosophies. Third, one should not misuse statistical tools, or apply them to attempt to answer questions the tool was not designed to answer.

It is true that there is variation in the preferences individuals have about which statistical tools should be used, and which statistical philosophies researchers should adopt. This should not be surprising. Individual researchers differ in which research questions they find interesting within a specific content domain, and similarly, they differ in which statistical questions they find interesting when analyzing data. Individual researchers differ in which approaches to science they adopt (e.g., a qualitative or a quantitative approach), and similarly, they differ in which approach to statistical inferences they adopt (e.g., a frequentist or a Bayesian approach). Luckily, there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences. It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is. Researchers can believe it is important for reliable knowledge generation to control error rates when making scientific claims, while at the same time believing that it is important to quantify relative evidence using likelihoods or Bayes factors (for example by presenting a Bayes factor alongside every p-value for a statistical test; Lakens et al., 2020).

Whatever approach to statistical inferences researchers choose to use, the approach should answer a meaningful statistical question (Hand, 1994), the approach to statistical inferences should be logically coherent, and the approach should be applied correctly. Despite the common statement in the literature that p-values can be interpreted as measures of evidence, the criticism against the coherence of this approach should make us pause. Given that coherent alternatives exist, such as likelihoods (Royall, 1997) or Bayes factors (Kass & Raftery, 1995), researchers should not follow the recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p = 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.


References

Bland, M. (2015). An introduction to medical statistics (Fourth edition). Oxford University Press.

Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568–1574.

Hand, D. J. (1994). Deconstructing Statistical Questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3), 317–356. https://doi.org/10.2307/2983526

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572

Kim, J. H., & Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic Approach. Abacus, 57(1), 27–71. https://doi.org/10.1111/abac.12172

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16–26. https://doi.org/10.1037//0003-066X.56.1.16

Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2020). Improving Inferences About Null Effects With Bayes Factors and Equivalence Tests. The Journals of Gerontology: Series B, 75(1), 45–57. https://doi.org/10.1093/geronb/gby065

Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963

Maier, M., & Lakens, D. (2021). Justify Your Alpha: A Primer on Two Practical Approaches. PsyArXiv. https://doi.org/10.31234/osf.io/ts4r6

Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734

Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7–22. https://doi.org/10.2307/1401671

Neyman, J. (1960). First course in probability and statistics. Holt, Rinehart and Winston.

Neyman, J. (1976). Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics - Theory and Methods, 5(8), 737–751. https://doi.org/10.1080/03610927608827392

Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009

Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29(04), 492–510. https://doi.org/10.1017/S030500410001152X

Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall/CRC.

Royall, R. (2000). On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95(451), 760–768.

Spanos, A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80(1), 73–93.

Uygun Tunç, D., & Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework. In Meta-Psychology. https://doi.org/10.31234/osf.io/pdm7y

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by