Update: Florian Hartig has also published a blog post criticizing the paper by Muff et al. (2021).
In a recent paper, Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence”. This is a surprising
recommendation, given that p-values are not valid measures of evidence (Royall,
1997). The authors follow Bland (2015), who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis”, and propose verbal labels for p-values in
specific ranges (i.e., p-values above 0.1 are ‘little to no evidence’, p-values
between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the
idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman
& Royall, 1988). If the null-hypothesis is true,
p-values are uniformly distributed. This means it is just as likely to observe
a p-value of 0.001 as it is to observe a p-value of 0.999. This indicates that the
interpretation of p = 0.001 as ‘strong evidence’ cannot be defended just
because the probability of observing this p-value is very small. After all, if
the null hypothesis is true, the probability of observing p = 0.999 is exactly
as small.
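To see this in practice, here is a minimal simulation sketch in Python (using scipy, with an arbitrary sample size of 50 per group): when the null hypothesis is true, p-values close to 0 are observed just as often as p-values close to 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2022)

# Simulate many two-sided independent t-tests in which the null hypothesis is true:
# both groups are drawn from the same normal distribution.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(20000)
])

# Because p-values are uniformly distributed under the null hypothesis, roughly
# 1% of them fall in (0, 0.01] and roughly 1% fall in (0.99, 1].
print(np.mean(p_values <= 0.01))  # approximately 0.01
print(np.mean(p_values > 0.99))   # approximately 0.01
```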
The reason that
small p-values can be used to guide us in the direction of true effects is not
because they are rarely observed when the null-hypothesis is true, but because
they are relatively less likely to be observed when the null hypothesis is true than when the alternative hypothesis is true. For this reason,
statisticians have argued that the concept of evidence is necessarily ‘relative’.
We can quantify evidence in favor of one hypothesis over another hypothesis,
based on how likely the observed data are when the null hypothesis is true, compared to how likely they are when an alternative hypothesis is true. As Royall
(1997, p. 8) explains: “The law of
likelihood applies to pairs of hypotheses, telling when a given
set of observations is evidence for one versus the other: hypothesis
A is better supported than B if A implies a greater
probability for the observations than B does. This
law represents a concept of evidence that is essentially relative, one
that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569)
write, “The p-value is not adequate
for inference because the measurement of evidence requires at least three
components: the observations, and two competing explanations
for how they were produced.”
In practice, the problem with interpreting p-values as evidence in the absence of a clearly defined alternative hypothesis is that they at best serve as proxies for evidence, but not as a measure in which a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. When researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will in practice be correlated with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and it is in general too weak for p-values to function as a valid measure of evidence in which p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.
Why single p-values cannot be interpreted as the strength of evidence
The
evidential value of a single p-value depends on the statistical power of the
test (i.e., on the sample size in combination with the effect size of the
alternative hypothesis). The statistical power expresses the probability of
observing a p-value smaller than the alpha level if the alternative hypothesis
is true. When the null hypothesis is true, statistical power is formally undefined,
but in practice, in a two-sided test, the proportion of observed p-values that fall below the alpha level equals the alpha level itself (e.g., 5% when the alpha level is 0.05), as p-values are uniformly distributed under the null hypothesis. The horizontal grey line
in Figure 1 illustrates the expected p-value distribution for a two-sided
independent t-test if the null hypothesis is true (or when the true effect size Cohen’s d is 0). As every p-value is equally likely, p-values cannot quantify the strength of evidence against the null hypothesis.
Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve), and 99% (dotted black curve).
If the alternative hypothesis is true the strength of evidence that
corresponds to a p-value depends on the statistical power of the test. If power
is 50%, we should expect that 50% of the observed p-values fall below the alpha
level. The remaining p-values fall above the alpha level. The black curve in
Figure 1 illustrates the p-value distribution for a test with a statistical
power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when
there is a true effect that is examined in a statistical test with 50% power than
when the null hypothesis is true (as illustrated by the black curve being above
the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence for
an alternative hypothesis examined with 50% power, compared to the null
hypothesis.
If an
effect is examined in a test with 99% power (the dotted line in Figure 1) we
would draw a different conclusion. With such high power p-values larger than
the alpha level of 5% are rare (they occur only 1% of the time) and a p-value
of 0.168 is much more likely to be observed when the null-hypothesis is true
than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is
evidence against an alternative hypothesis examined with 99% power,
compared to the null hypothesis.
Figure 1
illustrates that with 99% power even a ‘statistically significant’ p-value of
0.04 is evidence in favor of the null hypothesis. The reason for this is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the
grey horizontal line at p = 0.04 is above the dotted black curve). This fact,
which is often counterintuitive when first encountered, is known as the Lindley
paradox, or the Jeffreys-Lindley paradox (for a
discussion, see Spanos, 2013).
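To make these comparisons concrete, the sketch below (my own illustration, using a two-sided z-test as an approximation of the t-test behind Figure 1) computes the density of a p-value under an alternative hypothesis with a given power. Because this density equals 1 everywhere when the null hypothesis is true, it is also the likelihood ratio of the alternative versus the null at that p-value.

```python
from scipy import stats

def p_value_density(p, delta):
    """Density of a two-sided z-test p-value when the true effect is delta standard
    errors away from zero. Under the null hypothesis (delta = 0) this density is 1
    for every p, so the value returned is also the likelihood ratio of the
    alternative (delta) versus the null at this p-value."""
    z = stats.norm.isf(p / 2)  # the |z| statistic that corresponds to this two-sided p
    return (stats.norm.pdf(z - delta) + stats.norm.pdf(z + delta)) / (2 * stats.norm.pdf(z))

def noncentrality_for_power(power, alpha=0.05):
    """Effect (in standard error units) that yields the desired power for a two-sided
    z-test (approximate: the negligible opposite tail is ignored)."""
    return stats.norm.isf(alpha / 2) - stats.norm.ppf(1 - power)

delta_50 = noncentrality_for_power(0.50)  # about 1.96
delta_99 = noncentrality_for_power(0.99)  # about 4.29

print(p_value_density(0.168, delta_50))  # about 1.1: slightly more likely under H1 with 50% power
print(p_value_density(0.168, delta_99))  # about 0.02: far more likely under H0 than under H1 with 99% power
print(p_value_density(0.04, delta_99))   # about 0.34: p = 0.04 favors H0 over an H1 tested with 99% power
```

Up to the z-test approximation, these ratios correspond to the height of the 50% and 99% power curves in Figure 1 relative to the grey line.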
Figure 1
illustrates that different p-values can correspond to the same relative
evidence in favor of a specific alternative hypothesis, and that the same
p-value can correspond to different levels of relative evidence. This is
obviously undesirable if we want to use p-values as a measure of the strength
of evidence. Therefore, it is incorrect to verbally label any p-value as
providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis,
as depending on the alternative hypothesis a researcher is interested in, the
level of evidence will differ (and the p-value could even correspond to
evidence in favor of the null hypothesis).
All p-values smaller than 1 correspond to evidence for some non-zero effect
If the
alternative hypothesis is not specified, any p-value smaller than 1 should be
treated as at least some evidence (however small) for some
alternative hypotheses. It is therefore not correct to follow the
recommendations of the authors in their Table 2 to interpret p-values above 0.1
(e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against
the argument by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies with a
p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis
based on p < 0.05 (see the forest plot in Figure 2). It thus seems
ill-advised to follow their recommendation to describe a single study with p =
0.168 as ‘no evidence’ for a relationship.
Figure 2: Forest plot for a meta-analysis of three identical studies yielding p = 0.168.
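The forest plot in Figure 2 is based on an inverse-variance meta-analysis. A simpler sketch that makes the same point (assuming all three effects are in the same direction) is to pool the three two-sided p-values with Stouffer’s method:

```python
import numpy as np
from scipy import stats

p_values = [0.168, 0.168, 0.168]  # three studies, all with effects in the same direction

# Convert each two-sided p-value to a z-score, combine the z-scores with
# Stouffer's method, and convert the result back to a two-sided p-value.
z_scores = stats.norm.isf(np.array(p_values) / 2)
z_combined = z_scores.sum() / np.sqrt(len(z_scores))
p_combined = 2 * stats.norm.sf(z_combined)

print(round(p_combined, 3))  # about 0.017, below the 0.05 threshold
```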
However,
replacing the label of ‘no evidence’ with the label ‘at least some evidence for
some hypotheses’ leads to practical problems when communicating the results of
statistical tests. It seems generally undesirable to allow researchers to
interpret any p-value smaller than 1 as ‘at least some evidence’ against the null
hypothesis. This is the price one pays for not specifying an alternative
hypothesis, and trying to interpret p-values from a null hypothesis significance
test in an evidential manner. If we do not specify the alternative hypothesis,
it becomes impossible to conclude there is evidence for the null
hypothesis, and we cannot statistically falsify any hypothesis (Lakens,
Scheel, et al., 2018). Some would argue that if you cannot falsify hypotheses, you have a bit of a problem (Popper, 1959).
Interpreting p-values as p-values
Instead of
interpreting p-values as measures of the strength of evidence, we could
consider a radical alternative: interpret p-values as p-values. This would,
perhaps surprisingly, solve the main problems that Muff and colleagues aim to
address, namely ‘black-or-white
null-hypothesis significance testing with an arbitrary P-value
cutoff’. The idea to interpret p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative frequentist statistical philosophy was developed by Neyman and Pearson (1933a), who proposed to use p-values to
guide decisions about the null and alternative hypothesis by, in the long run,
controlling the Type I and Type II error rate. Researchers specify an alpha
level and design a study with a sufficiently high statistical power, and reject
(or fail to reject) the null hypothesis.
Neyman and
Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. Neyman and Pearson (1933b) leave open whether the states of the world are divided into two (‘accept’ and ‘reject’) or three regions, and write
that a “region of doubt may be obtained by a further subdivision of the region
of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics
is to test range predictions instead of limiting oneself to a null hypothesis
significance test (Lakens,
2021). This implements the idea of Neyman
and Pearson to introduce a region of doubt, and distinguishes inconclusive results
(where neither the null hypothesis nor the alternative hypothesis can be
rejected, and more data needs to be collected to draw a conclusion) from
conclusive results (where either the null hypothesis or the alternative
hypothesis can be rejected).
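As a rough sketch of how such a three-way outcome could be implemented (my own simplified illustration, not the exact procedure from any existing package; the smallest effect size of interest is specified as a raw-score bound), an equivalence test can be combined with a traditional null hypothesis test:

```python
import numpy as np
from scipy import stats

def three_way_test(x, y, bound, alpha=0.05):
    """Combine a two-sided t-test with two one-sided equivalence tests (TOST) against
    the range (-bound, +bound). Returns a verbal outcome: some effect, no meaningful
    effect, or an inconclusive result."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # Pooled standard error, assuming equal variances (Student's t-test).
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    se = sp * np.sqrt(1 / nx + 1 / ny)
    df = nx + ny - 2

    p_nhst = 2 * stats.t.sf(abs(diff) / se, df)     # H0: difference = 0
    p_lower = stats.t.sf((diff + bound) / se, df)   # H0: difference <= -bound
    p_upper = stats.t.cdf((diff - bound) / se, df)  # H0: difference >= +bound
    p_tost = max(p_lower, p_upper)                  # both one-sided tests must be significant

    if p_tost < alpha:
        # This branch also captures effects that are significant but too small to matter.
        return "reject effects as large as the bound: no meaningful effect"
    if p_nhst < alpha:
        return "reject the null hypothesis: some effect"
    return "inconclusive: neither hypothesis can be rejected, collect more data"

rng = np.random.default_rng(1)
print(three_way_test(rng.normal(0.1, 1, 50), rng.normal(0.0, 1, 50), bound=0.5))
```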
In a
Neyman-Pearson approach to hypothesis testing the act of rejecting a hypothesis
comes with a maximum long run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of
rejections must be only tentative.” So when we reject the null model, we do so
tentatively, aware of the fact we might have done so in error, and without
necessarily believing the null model is false. For Neyman (1957, p.
13) inferential behavior is an “act of
will to behave in the future (perhaps until new experiments are performed) in a
particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.
Furthermore, it is important to remember that hypothesis
tests reject a statistical hypothesis, but not a theoretical hypothesis. As
Neyman (1960, p.
290) writes: “the frequency of correct conclusions regarding the
statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the
frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis
in a specific experiment does not necessarily inform us about the truth of the theory.
Decisions about the truthfulness of a theory require a careful evaluation of
the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).
Neyman (1976) provides some reporting examples that
reflect his philosophy on statistical inferences: “after considering the probability
of error (that is, after considering how frequently we
would be in error if in conditions of our data we rejected the hypotheses
tested), we decided to act on the assumption that "high" scores on
"potential” and on "education" are indicative of better
chances of success in the drive to home ownership”. An example of a shorter statement that
Neyman provides reads: “As a
result of the tests we applied, we decided to act on the assumption (or
concluded) that the two groups are not random samples from the same
population.”
A complete
verbal description of the result of a Neyman-Pearson hypothesis test acknowledges
two sources of uncertainty. First, the assumptions of the statistical test must
be met (e.g., the data are normally distributed), or any deviations should be small
enough to not have any substantial effect on the frequentist error rates. Second,
conclusions are made “Without
hoping to know whether each separate hypothesis is
true or false” (Neyman
& Pearson, 1933a). Any single conclusion can be
wrong, and assuming the test assumptions are met, we make claims under a known maximum
error rate (which is never zero). Future replication studies are needed to provide
further insights about whether the current conclusion was erroneous or not.
After observing
a p-value smaller than the alpha level, one can therefore conclude: “Until new
data emerges that proves us wrong, we decide to act as if there is an effect,
while acknowledging that the methodological procedure we base this decision on
has a maximum error rate of alpha% (assuming the statistical assumptions are
met), which we find acceptably low.” One can follow such a statement about the
observed data with a theoretical inference, such as “assuming our auxiliary
hypotheses hold, the result of this statistical test corroborates our
theoretical hypothesis”. If a conclusive test result in an equivalence test is
observed that allows a researcher to reject the presence of any effect large
enough to be meaningful, the conclusion would be that the test result does not
corroborate the theoretical hypothesis.
It is true that the common application of null hypothesis significance testing in science relies on an arbitrary threshold of 0.05 (Lakens, Adolfi, et al., 2018). There are surprisingly few
attempts to provide researchers with practical approaches to determine an alpha
level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021;
Miller & Ulrich, 2019; Mudge et al., 2012). This problem seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science where the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which researchers are not expected to make a claim about a finding (e.g., a BF < 3, see Kass
& Raftery, 1995, or a likelihood ratio lower than k
= 8, see Royall, 2000). Although it is true that an alpha
level of 0.05 is arbitrary, there are some pragmatic arguments in its favor
(e.g., it is established, and it might be low enough to yield claims that are
taken seriously, but not high enough to prevent other researchers from attempting
to refute the claim, see Uygun Tunç et al., 2021).
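As a rough sketch of what such a justification could look like (loosely in the spirit of Mudge et al., 2012, for a two-sided z-test, and not the exact procedure from any of the cited papers), one could choose the alpha level that minimizes the expected weighted cost of Type I and Type II errors:

```python
from scipy import stats, optimize

def type_2_error(alpha, delta):
    """Type II error rate of a two-sided z-test when the true effect is delta
    standard errors away from zero."""
    z_crit = stats.norm.isf(alpha / 2)
    power = stats.norm.sf(z_crit - delta) + stats.norm.cdf(-z_crit - delta)
    return 1 - power

def balanced_alpha(delta, cost_ratio=1.0, prior_h1=0.5):
    """Alpha level that minimizes the expected weighted error cost, where cost_ratio
    is the cost of a Type I error relative to a Type II error and prior_h1 is the
    assumed probability that the alternative hypothesis is true."""
    def expected_cost(alpha):
        return (1 - prior_h1) * cost_ratio * alpha + prior_h1 * type_2_error(alpha, delta)
    return optimize.minimize_scalar(expected_cost, bounds=(1e-6, 0.5), method="bounded").x

# Example: d = 0.5 with n = 64 per group gives delta = 0.5 * sqrt(64 / 2) = 2.83.
# With equal costs and priors, the cost-minimizing alpha is around 0.10 rather than 0.05.
print(balanced_alpha(delta=2.83))
```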
Is there really no agreement on best practices in sight?
One major
impetus for the flawed proposal to interpret p-values as evidence by Muff and
colleagues is that “no agreement
on a way forward is in sight”. The statement that there is little agreement among statisticians is
an oversimplification. I will go out on a limb and state some things I assume
most statisticians agree on. First, there are multiple statistical tools one
can use, and each tool has its own strengths and weaknesses. Second, there
are different statistical philosophies, each with their own coherent logic, and
researchers are free to analyze data from the perspective of one or multiple of
these philosophies. Third, one should not misuse statistical tools, or apply
them to attempt to answer questions the tool was not designed to answer.
It is true
that there is variation in the preferences individuals have about which
statistical tools should be used, and about the statistical philosophies researchers should adopt. This should not be surprising. Individual researchers
differ in which research questions they find interesting within a specific
content domain, and similarly, they differ in which statistical questions they
find interesting when analyzing data. Individual researchers differ in which
approaches to science they adopt (e.g., a qualitative or a quantitative
approach), and similarly, they differ in which approach to statistical
inferences they adopt (e.g., a frequentist or Bayesian approach). Luckily, there
is no reason to limit oneself to a single tool or philosophy, and if anything,
the recommendation is to use multiple approaches to statistical inferences. It
is not always interesting to ask what the p-value is when analyzing data, and
it is often interesting to ask what the effect size is. Researchers can believe
it is important for reliable knowledge generation to control error rates when
making scientific claims, while at the same time believing that it is important
to quantify relative evidence using likelihoods or Bayes factors (for example
by presenting a Bayes factor alongside every p-value for a statistical test, Lakens et
al., 2020).
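As a minimal sketch of what this could look like (a normal-approximation Bayes factor with a zero-centered normal prior on the effect, not the default Bayes factors discussed in Lakens et al., 2020, and with made-up numbers):

```python
from scipy import stats

def bf10_normal(estimate, se, prior_sd):
    """Bayes factor for H1 (effect ~ Normal(0, prior_sd**2)) over H0 (effect = 0),
    treating the observed estimate as normally distributed around the true effect."""
    marginal_h1 = stats.norm.pdf(estimate, loc=0, scale=(se**2 + prior_sd**2) ** 0.5)
    marginal_h0 = stats.norm.pdf(estimate, loc=0, scale=se)
    return marginal_h1 / marginal_h0

# A mean difference of 0.30 with a standard error of 0.15 gives z = 2.0 and p = 0.046,
# but under this (hypothetical) prior with SD = 0.5 a Bayes factor of only about 1.8.
estimate, se = 0.30, 0.15
p = 2 * stats.norm.sf(abs(estimate) / se)
print(p, bf10_normal(estimate, se, prior_sd=0.5))
```

Reporting both numbers side by side again shows that a just-significant p-value does not need to correspond to strong relative evidence.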
Whatever approach
to statistical inferences researchers choose to use, the approach should answer
a meaningful statistical question (Hand, 1994), it should be logically coherent, and it should be applied correctly. Despite the common statement in the literature that p-values can be
interpreted as measures of evidence, the criticism against the coherence of
this approach should make us pause. Given that coherent alternatives exist,
such as likelihoods (Royall,
1997) or Bayes factors (Kass
& Raftery, 1995), researchers should not follow the
recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p
= 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.
References
Bland, M.
(2015). An introduction to medical statistics (Fourth edition). Oxford
University Press.
Field, S. A., Tyre,
A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing
the cost of environmental management decisions by optimizing statistical
thresholds. Ecology Letters, 7(8), 669–675.
https://doi.org/10.1111/j.1461-0248.2004.00625.x
Frick, R. W. (1996).
The appropriate use of null hypothesis testing. Psychological Methods, 1(4),
379–390. https://doi.org/10.1037/1082-989X.1.4.379
Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568–1574.
Hacking, I. (1965). Logic of statistical inference. Cambridge University Press.
Hand, D. J. (1994).
Deconstructing Statistical Questions. Journal of the Royal Statistical
Society. Series A (Statistics in Society), 157(3), 317–356.
https://doi.org/10.2307/2983526
Kass, R. E., &
Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical
Association, 90(430), 773–795.
https://doi.org/10.1080/01621459.1995.10476572
Kim, J. H., &
Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic
Approach. Abacus, 57(1), 27–71.
https://doi.org/10.1111/abac.12172
Krueger, J. (2001).
Null hypothesis significance testing: On the survival of a flawed method. American
Psychologist, 56(1), 16–26.
https://doi.org/10.1037//0003-066X.56.1.16
Lakens, D. (2021).
The practical alternative to the p value is the correctly used p value. Perspectives
on Psychological Science, 16(3), 639–648.
https://doi.org/10.1177/1745691620958012
Lakens, D., Adolfi,
F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T.,
Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A.
R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins,
G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. Nature Human
Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Lakens, D., McLatchie, N., Isager, P. M.,
Scheel, A. M., & Dienes, Z. (2020). Improving Inferences
About Null Effects With Bayes Factors and Equivalence Tests. The Journals of
Gerontology: Series B, 75(1), 45–57.
https://doi.org/10.1093/geronb/gby065
Lakens, D., Scheel, A. M., & Isager, P.
M. (2018). Equivalence testing for psychological research: A
tutorial. Advances in Methods and Practices in Psychological Science, 1(2),
259–269. https://doi.org/10.1177/2515245918770963
Maier, M., &
Lakens, D. (2021). Justify Your Alpha: A Primer on Two Practical Approaches.
PsyArXiv.
https://doi.org/10.31234/osf.io/ts4r6
Miller, J., & Ulrich, R. (2019). The quest
for an optimal alpha. PLOS ONE, 14(1), e0208631.
https://doi.org/10.1371/journal.pone.0208631
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Muff, S., Nilsen, E. B., O’Hara, R. B., & Nater, C. R. (2021). Rewriting results sections in the language of evidence. Trends in Ecology & Evolution.
Neyman, J. (1957).
“Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de
l’Institut International de Statistique / Review of the International
Statistical Institute, 25(1/3), 7–22.
https://doi.org/10.2307/1401671
Neyman, J. (1960). First
course in probability and statistics. Holt, Rinehart and Winston.
Neyman, J. (1976).
Tests of statistical hypotheses and their use in studies of natural phenomena. Communications
in Statistics - Theory and Methods, 5(8), 737–751.
https://doi.org/10.1080/03610927608827392
Neyman, J., &
Pearson, E. S. (1933a). On the problem of the most efficient tests of
statistical hypotheses. Philosophical Transactions of the Royal Society of
London A: Mathematical, Physical and Engineering Sciences, 231(694–706),
289–337. https://doi.org/10.1098/rsta.1933.0009
Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29(04), 492–510. https://doi.org/10.1017/S030500410001152X
Popper, K. R. (1959). The logic of scientific discovery. Hutchinson.
Royall, R. (1997). Statistical
Evidence: A Likelihood Paradigm. Chapman and Hall/CRC.
Royall, R. (2000).
On the probability of observing misleading statistical evidence. Journal of
the American Statistical Association, 95(451), 760–768.
Spanos, A. (2013).
Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science,
80(1), 73–93.
Uygun Tunç, D.,
& Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses
in Social and Behavioral Sciences: Systematic Replications Framework. Meta-Psychology.
https://doi.org/10.31234/osf.io/pdm7y
Uygun Tunç, D., Tunç, M. N., & Lakens,
D. (2021). The Epistemic and Pragmatic Function of Dichotomous
Claims Based on Statistical Hypothesis Tests. PsyArXiv.
https://doi.org/10.31234/osf.io/af9by