*Update: Florian Hartig has also published a blog post criticizing the paper by Muff et al. (2021).*

In a recent paper Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence”. This is a surprising recommendation, given that p-values are not valid measures of evidence (Royall, 1997). The authors follow Bland (2015), who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis” and proposes verbal labels for p-values in specific ranges (i.e., p-values above 0.1 are ‘little to no evidence’, p-values between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman & Royall, 1988). If the null hypothesis is true, p-values are uniformly distributed. This means it is just as likely to observe a p-value of 0.001 as it is to observe a p-value of 0.999. This indicates that the interpretation of p = 0.001 as ‘strong evidence’ cannot be defended just because the probability to observe this p-value is very small. After all, if the null hypothesis is true, the probability of observing p = 0.999 is exactly as small.
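The uniformity of p-values under the null hypothesis can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses a two-sample z-test with a known standard deviation of 1 (not the t-test used elsewhere in this post), and the function names, sample size, and seed are hypothetical choices.

```python
import random
from statistics import NormalDist


def simulate_null_p_values(n_sims=10000, n_per_group=20, seed=1):
    """Two-sided p-values from a two-sample z-test when the true difference is 0."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means, assuming a known sd of 1 per group.
    se = (2 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_sims):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def decile_proportions(p_values):
    """Share of p-values falling in each of the ten bins [0, 0.1), ..., [0.9, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return [count / len(p_values) for count in counts]


p_values = simulate_null_p_values()
proportions = decile_proportions(p_values)
share_below_alpha = sum(p < 0.05 for p in p_values) / len(p_values)

for i, prop in enumerate(proportions):
    print(f"p in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {prop:.3f}")
print(f"share of p < 0.05: {share_below_alpha:.3f}")
```

Each bin should contain roughly 10% of the simulated p-values, and roughly 5% should fall below 0.05, which is what a uniform distribution implies.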

The reason that small p-values can be used to guide us in the direction of true effects is not because they are rarely observed when the null-hypothesis is true, but because they are relatively less likely to be observed when the null hypothesis is true than when the alternative hypothesis is true. For this reason, statisticians have argued that the concept of evidence is necessarily ‘relative’. We can quantify evidence in favor of one hypothesis over another hypothesis, based on the likelihood of observing data when the null hypothesis is true, compared to this probability when an alternative hypothesis is true. As Royall (1997, p. 8) explains: “The law of likelihood applies to pairs of hypotheses, telling when a given set of observations is evidence for one versus the other: hypothesis *A* is better supported than *B* if *A* implies a greater probability for the observations than *B* does. This law represents a concept of evidence that is essentially relative, one that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569) write, “The p-value is not adequate for inference because the measurement of evidence requires at least three components: the observations, and two competing explanations for how they were produced.”

In practice, the problem of interpreting p-values as evidence in the absence of a clearly defined alternative hypothesis is that they at best serve as proxies for evidence, but not as a useful measure where a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. When researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will be *correlated* with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and in general it is too weak for p-values to function as a valid measure of evidence, where p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.

**Why single p-values cannot be interpreted as the strength of evidence**

The evidential value of a single p-value depends on the statistical power of the test (i.e., on the sample size in combination with the effect size of the alternative hypothesis). The statistical power expresses the probability of observing a p-value smaller than the alpha level if the alternative hypothesis is true. When the null hypothesis is true, statistical power is formally undefined, but in practice a proportion α of the observed p-values will fall below the alpha level (e.g., 5% for an alpha level of 0.05), as p-values are uniformly distributed under the null hypothesis. The horizontal grey line in Figure 1 illustrates the expected p-value distribution for a two-sided independent *t*-test if the null hypothesis is true (or when the observed effect size Cohen’s d is 0). As every p-value is equally likely under the null hypothesis, p-values cannot by themselves quantify the strength of evidence against the null hypothesis.

*Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve) and 99% (dotted black curve).*

If the alternative hypothesis is true the strength of evidence that corresponds to a p-value depends on the statistical power of the test. If power is 50%, we should expect that 50% of the observed p-values fall below the alpha level. The remaining p-values fall above the alpha level. The black curve in Figure 1 illustrates the p-value distribution for a test with a statistical power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when there is a true effect that is examined in a statistical test with 50% power than when the null hypothesis is true (as illustrated by the black curve being above the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence *for* an alternative hypothesis examined with 50% power, compared to the null hypothesis.

If an
effect is examined in a test with 99% power (the dotted line in Figure 1) we
would draw a different conclusion. With such high power p-values larger than
the alpha level of 5% are rare (they occur only 1% of the time) and a p-value
of 0.168 is much more likely to be observed when the null-hypothesis is true
than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is
evidence *against* an alternative hypothesis examined with 99% power,
compared to the null hypothesis.

Figure 1 illustrates that with 99% power even a ‘statistically significant’ p-value of 0.04 is evidence *for* the null hypothesis. The reason for this is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the grey horizontal line at p = 0.04 is above the dotted black curve). This fact, which is often counterintuitive when first encountered, is known as the Lindley paradox, or the Jeffreys-Lindley paradox (for a discussion, see Spanos, 2013).

Figure 1 illustrates that different p-values can correspond to the same relative evidence in favor of a specific alternative hypothesis, and that the same p-value can correspond to different levels of relative evidence. This is obviously undesirable if we want to use p-values as a measure of the strength of evidence. Therefore, it is incorrect to verbally label any p-value as providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis, as depending on the alternative hypothesis a researcher is interested in, the level of evidence will differ (and the p-value could even correspond to evidence in favor of the null hypothesis).
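The comparisons read off Figure 1 can be approximated numerically. The sketch below assumes a two-sided z-test with a normally distributed test statistic, as a large-sample stand-in for the independent *t*-test in the figure, so the numbers will be close to but not identical to the plotted curves. The function names are hypothetical. Under this assumption the density of the p-value equals 1 everywhere when the null hypothesis is true, so the density under the alternative is also the likelihood ratio of the alternative versus the null for that p-value.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(delta, alpha=0.05):
    """Power of a two-sided z-test when the test statistic is N(delta, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)


def delta_for_power(target_power, alpha=0.05):
    """Find the noncentrality delta that yields the target power, by bisection."""
    low, high = 0.0, 10.0
    for _ in range(100):
        mid = (low + high) / 2
        if power_two_sided(mid, alpha) < target_power:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def p_value_density(p, delta):
    """Density of the two-sided p-value when the test statistic is N(delta, 1).

    For delta = 0 this is 1 for every p (the uniform distribution), so the
    returned value is the likelihood ratio of the alternative versus the null.
    """
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    numerator = STD_NORMAL.pdf(z - delta) + STD_NORMAL.pdf(z + delta)
    return numerator / (2 * STD_NORMAL.pdf(z))


delta_50 = delta_for_power(0.50)
delta_99 = delta_for_power(0.99)

lr_168_power_50 = p_value_density(0.168, delta_50)
lr_168_power_99 = p_value_density(0.168, delta_99)
lr_040_power_99 = p_value_density(0.04, delta_99)

print(f"p = 0.168, 50% power: likelihood ratio = {lr_168_power_50:.3f}")
print(f"p = 0.168, 99% power: likelihood ratio = {lr_168_power_99:.3f}")
print(f"p = 0.040, 99% power: likelihood ratio = {lr_040_power_99:.3f}")
```

A ratio above 1 means the p-value is more likely under the alternative than under the null, and a ratio below 1 means the reverse. The same p-value of 0.168 lands on different sides of 1 depending on the power of the test.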

**All p-values smaller than 1 correspond to evidence for some non-zero effect**

If the alternative hypothesis is not specified, any p-value smaller than 1 should be treated as at least *some* evidence (however small) for *some* alternative hypotheses. It is therefore not correct to follow the recommendations of the authors in their Table 2 to interpret p-values above 0.1 (e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against the argument by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies with a p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis based on p < 0.05 (see the forest plot in Figure 2). It thus seems ill-advised to follow their recommendation to describe a single study with p = 0.168 as ‘no evidence’ for a relationship.

*Figure
2: Forest plot for a meta-analysis of three identical studies yielding p =
0.168.*
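The arithmetic behind this claim can be sketched with a fixed-effect, inverse-variance meta-analysis under a normal approximation. This is an assumption made for illustration: the model behind the forest plot in Figure 2 is not specified here, and the effect estimate and standard error below are hypothetical values chosen only so that a single study yields a two-sided p-value of 0.168.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance fixed-effect pooling with a two-sided z-test on the result."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    z = pooled / pooled_se
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return pooled, pooled_se, p_value


# Hypothetical study: standard error of 1, estimate chosen so that p = 0.168.
study_se = 1.0
study_estimate = STD_NORMAL.inv_cdf(1 - 0.168 / 2) * study_se
single_p = 2 * (1 - STD_NORMAL.cdf(study_estimate / study_se))

# Three identical studies pooled together.
pooled, pooled_se, meta_p = fixed_effect_meta([study_estimate] * 3, [study_se] * 3)

print(f"single study p = {single_p:.3f}")
print(f"meta-analytic p = {meta_p:.4f}")
```

With identical studies the pooled estimate stays the same while its standard error shrinks by a factor of the square root of three, which is why the combined p-value drops below 0.05.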

However, replacing the label of ‘no evidence’ with the label ‘at least some evidence for some hypotheses’ leads to practical problems when communicating the results of statistical tests. It seems generally undesirable to allow researchers to interpret any p-value smaller than 1 as ‘at least some evidence’ against the null hypothesis. This is the price one pays for not specifying an alternative hypothesis, and trying to interpret p-values from a null hypothesis significance test in an evidential manner. If we do not specify the alternative hypothesis, it becomes impossible to conclude there is evidence *for* the null hypothesis, and we cannot statistically falsify any hypothesis (Lakens, Scheel, et al., 2018). Some would argue that if you cannot falsify hypotheses, you have a bit of a problem (Popper, 1959).

**Interpreting p-values as p-values**

Instead of interpreting p-values as measures of the strength of evidence, we could consider a radical alternative: interpret p-values as p-values. This would, perhaps surprisingly, solve the main problems that Muff and colleagues aim to address, namely ‘black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff’. The idea to interpret p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative statistical frequentist philosophy was developed by Neyman and Pearson (1933a), who proposed to use p-values to guide decisions about the null and alternative hypothesis by, in the long run, controlling the Type I and Type II error rate. Researchers specify an alpha level and design a study with a sufficiently high statistical power, and reject (or fail to reject) the null hypothesis.

Neyman and Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. Neyman and Pearson (1933b) leave open whether the states of the world are divided in two (‘accept’ and ‘reject’) or three regions, and write that a “region of doubt may be obtained by a further subdivision of the region of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics is to test range predictions instead of limiting oneself to a null hypothesis significance test (Lakens, 2021). This implements the idea of Neyman and Pearson to introduce a region of doubt, and distinguishes inconclusive results (where neither the null hypothesis nor the alternative hypothesis can be rejected, and more data needs to be collected to draw a conclusion) from conclusive results (where either the null hypothesis or the alternative hypothesis can be rejected).
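One way such a region of doubt could be sketched is by combining a null hypothesis significance test with an equivalence test (two one-sided tests against a smallest effect of interest). The code below is a minimal illustration, not the procedure from Lakens (2021): it assumes z-tests on an estimate with a known standard error, and the function name, the equivalence bound, and the example numbers are all hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def classify_result(estimate, se, bound, alpha=0.05):
    """Classify a result as conclusive or inconclusive using NHST plus TOST.

    `bound` is an assumed smallest effect size of interest; effects inside
    (-bound, +bound) are treated as too small to be meaningful.
    """
    # Two-sided test of the null hypothesis that the effect is 0.
    p_null = 2 * (1 - STD_NORMAL.cdf(abs(estimate / se)))
    # Two one-sided tests: reject effects at or beyond each bound.
    p_lower = 1 - STD_NORMAL.cdf((estimate + bound) / se)  # H0: effect <= -bound
    p_upper = STD_NORMAL.cdf((estimate - bound) / se)      # H0: effect >= +bound
    p_equivalence = max(p_lower, p_upper)

    reject_null = p_null < alpha
    reject_meaningful = p_equivalence < alpha
    if reject_null and reject_meaningful:
        return "conclusive: effect differs from zero but is smaller than the bound"
    if reject_null:
        return "conclusive: reject the null hypothesis"
    if reject_meaningful:
        return "conclusive: reject effects as large as the bound"
    return "inconclusive: collect more data"


# Hypothetical examples with a standard error of 0.1 and a bound of 0.3.
large_effect = classify_result(estimate=0.50, se=0.1, bound=0.3)
tiny_effect = classify_result(estimate=0.02, se=0.1, bound=0.3)
in_between = classify_result(estimate=0.15, se=0.1, bound=0.3)

print(large_effect)
print(tiny_effect)
print(in_between)
```

The third example falls in the region of doubt: the estimate is not far enough from zero to reject the null hypothesis, and not far enough from the bound to reject a meaningful effect.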

In a Neyman-Pearson approach to hypothesis testing the act of rejecting a hypothesis comes with a maximum long run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of rejections must be only tentative.” So when we reject the null model, we do so tentatively, aware of the fact we might have done so in error, and without necessarily believing the null model is false. For Neyman (1957, p. 13) inferential behavior is an: “act of will to behave in the future (perhaps until new experiments are performed) in a particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.

Furthermore, it is important to remember that hypothesis tests reject a statistical hypothesis, but not a theoretical hypothesis. As Neyman (1960, p. 290) writes: “the frequency of correct conclusions regarding the statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis in a specific experiment does not necessarily inform us about the truth of the theory. Decisions about the truthfulness of a theory require a careful evaluation of the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).

Neyman (1976) provides some reporting examples that reflect his philosophy on statistical inferences: “after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that "high" scores on "potential" and on "education" are indicative of better chances of success in the drive to home ownership”. An example of a shorter statement that Neyman provides reads: “As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.”

A complete verbal description of the result of a Neyman-Pearson hypothesis test acknowledges two sources of uncertainty. First, the assumptions of the statistical test must be met (e.g., the data are normally distributed), or any deviations should be small enough to not have any substantial effect on the frequentist error rates. Second, conclusions are made “Without hoping to know whether each separate hypothesis is true or false” (Neyman & Pearson, 1933a). Any single conclusion can be wrong, and assuming the test assumptions are met, we make claims under a known maximum error rate (which is never zero). Future replication studies are needed to provide further insights about whether the current conclusion was erroneous or not.

After observing a p-value smaller than the alpha level, one can therefore conclude: “Until new data emerges that proves us wrong, we decide to act as if there is an effect, while acknowledging that the methodological procedure we base this decision on has a maximum error rate of alpha% (assuming the statistical assumptions are met), which we find acceptably low.” One can follow such a statement about the observed data with a theoretical inference, such as “assuming our auxiliary hypotheses hold, the result of this statistical test corroborates our theoretical hypothesis”. If a conclusive test result in an equivalence test is observed that allows a researcher to reject the presence of any effect large enough to be meaningful, the conclusion would be that the test result does not corroborate the theoretical hypothesis.

It is true that the common application of null hypothesis significance testing in science is based on an arbitrary threshold of 0.05 (Lakens, Adolfi, et al., 2018). There are surprisingly few attempts to provide researchers with practical approaches to determine an alpha level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021; Miller & Ulrich, 2019; Mudge et al., 2012). The problem seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science where the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which a researcher is not expected to make a claim about a finding (e.g., a BF < 3, see Kass & Raftery, 1995, or a likelihood ratio lower than k = 8, see Royall, 2000). Although it is true that an alpha level of 0.05 is arbitrary, there are some pragmatic arguments in its favor (e.g., it is established, and it might be low enough to yield claims that are taken seriously, but not high enough to prevent other researchers from attempting to refute the claim, see Uygun Tunç et al., 2021).

**Is there really no agreement on best practices in sight?**

One major impetus for the flawed proposal to interpret p-values as evidence by Muff and colleagues is that “no agreement on a way forward is in sight”. The statement that there is little agreement among statisticians is an oversimplification. I will go out on a limb and state some things I assume most statisticians agree on. First, there are multiple statistical tools one can use, and each tool has its own strengths and weaknesses. Second, there are different statistical philosophies, each with their own coherent logic, and researchers are free to analyze data from the perspective of one or multiple of these philosophies. Third, one should not misuse statistical tools, or apply them to attempt to answer questions the tool was not designed to answer.

It is true that there is variation in the preferences individuals have about which statistical tools should be used, and which philosophies of statistics researchers should adopt. This should not be surprising. Individual researchers differ in which research questions they find interesting within a specific content domain, and similarly, they differ in which statistical questions they find interesting when analyzing data. Individual researchers differ in which approaches to science they adopt (e.g., a qualitative or a quantitative approach), and similarly, they differ in which approach to statistical inferences they adopt (e.g., a frequentist or Bayesian approach). Luckily, there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences. It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is. Researchers can believe it is important for reliable knowledge generation to control error rates when making scientific claims, while at the same time believing that it is important to quantify relative evidence using likelihoods or Bayes factors (for example by presenting a Bayes factor alongside every p-value for a statistical test, Lakens et al., 2020).

Whatever approach to statistical inferences researchers choose to use, the approach should answer a meaningful statistical question (Hand, 1994), the approach to statistical inferences should be logically coherent, and the approach should be applied correctly. Despite the common statement in the literature that p-values can be interpreted as measures of evidence, the criticism against the coherence of this approach should make us pause. Given that coherent alternatives exist, such as likelihoods (Royall, 1997) or Bayes factors (Kass & Raftery, 1995), researchers should not follow the recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p = 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.

**References**

Bland, M.
(2015). *An introduction to medical statistics* (Fourth edition). Oxford
University Press.

Field, S. A., Tyre,
A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing
the cost of environmental management decisions by optimizing statistical
thresholds. *Ecology Letters*, *7*(8), 669–675.
https://doi.org/10.1111/j.1461-0248.2004.00625.x

Frick, R. W. (1996).
The appropriate use of null hypothesis testing. *Psychological Methods*, *1*(4),
379–390. https://doi.org/10.1037/1082-989X.1.4.379

Goodman, S. N.,
& Royall, R. (1988). Evidence and scientific research. *American Journal
of Public Health*, *78*(12), 1568–1574.

Hand, D. J. (1994).
Deconstructing Statistical Questions. *Journal of the Royal Statistical
Society. Series A (Statistics in Society)*, *157*(3), 317–356.
https://doi.org/10.2307/2983526

Kass, R. E., &
Raftery, A. E. (1995). Bayes factors. *Journal of the American Statistical
Association*, *90*(430), 773–795.
https://doi.org/10.1080/01621459.1995.10476572

Kim, J. H., &
Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic
Approach. *Abacus*, *57*(1), 27–71.
https://doi.org/10.1111/abac.12172

Krueger, J. (2001).
Null hypothesis significance testing: On the survival of a flawed method. *American
Psychologist*, *56*(1), 16–26.
https://doi.org/10.1037//0003-066X.56.1.16

Lakens, D. (2021).
The practical alternative to the p value is the correctly used p value. *Perspectives
on Psychological Science*, *16*(3), 639–648.
https://doi.org/10.1177/1745691620958012

Lakens, D., Adolfi,
F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T.,
Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A.
R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins,
G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. *Nature Human
Behaviour*, *2*, 168–171. https://doi.org/10.1038/s41562-018-0311-x

Lakens, D., McLatchie, N., Isager, P. M.,
Scheel, A. M., & Dienes, Z. (2020). Improving Inferences
About Null Effects With Bayes Factors and Equivalence Tests. *The Journals of
Gerontology: Series B*, *75*(1), 45–57.
https://doi.org/10.1093/geronb/gby065

Lakens, D., Scheel, A. M., & Isager, P.
M. (2018). Equivalence testing for psychological research: A
tutorial. *Advances in Methods and Practices in Psychological Science*, *1*(2),
259–269. https://doi.org/10.1177/2515245918770963

Maier, M., &
Lakens, D. (2021). *Justify Your Alpha: A Primer on Two Practical Approaches*.
PsyArXiv.
https://doi.org/10.31234/osf.io/ts4r6

Miller, J., & Ulrich, R. (2019). The quest
for an optimal alpha. *PLOS ONE*, *14*(1), e0208631.
https://doi.org/10.1371/journal.pone.0208631

Mudge, J. F., Baker,
L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes
Errors in Null Hypothesis Significance Tests. *PLOS ONE*, *7*(2),
e32734. https://doi.org/10.1371/journal.pone.0032734

Neyman, J. (1957).
“Inductive Behavior” as a Basic Concept of Philosophy of Science. *Revue de
l’Institut International de Statistique / Review of the International
Statistical Institute*, *25*(1/3), 7–22.
https://doi.org/10.2307/1401671

Neyman, J. (1960). *First
course in probability and statistics*. Holt, Rinehart and Winston.

Neyman, J. (1976).
Tests of statistical hypotheses and their use in studies of natural phenomena. *Communications
in Statistics - Theory and Methods*, *5*(8), 737–751.
https://doi.org/10.1080/03610927608827392

Neyman, J., &
Pearson, E. S. (1933a). On the problem of the most efficient tests of
statistical hypotheses. *Philosophical Transactions of the Royal Society of
London A: Mathematical, Physical and Engineering Sciences*, *231*(694–706),
289–337. https://doi.org/10.1098/rsta.1933.0009

Neyman, J., &
Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to
probabilities a priori. *Mathematical Proceedings of the Cambridge
Philosophical Society*, *29*(04), 492–510.
https://doi.org/10.1017/S030500410001152X

Royall, R. (1997). *Statistical
Evidence: A Likelihood Paradigm*. Chapman and Hall/CRC.

Royall, R. (2000).
On the probability of observing misleading statistical evidence. *Journal of
the American Statistical Association*, *95*(451), 760–768.

Spanos, A. (2013).
Who should be afraid of the Jeffreys-Lindley paradox? *Philosophy of Science*,
*80*(1), 73–93.

Uygun Tunç, D.,
& Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses
in Social and Behavioral Sciences: Systematic Replications Framework. In *Meta-Psychology*.
https://doi.org/10.31234/osf.io/pdm7y

Uygun Tunç, D., Tunç, M. N., & Lakens,
D. (2021). *The Epistemic and Pragmatic Function of Dichotomous
Claims Based on Statistical Hypothesis Tests*. PsyArXiv.
https://doi.org/10.31234/osf.io/af9by