A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, November 20, 2021

Why p-values should be interpreted as p-values and not as measures of evidence

In a recent paper, Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence”. This is a surprising recommendation, given that p-values are not valid measures of evidence (Royall, 1997). The authors follow Bland (2015), who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis” and proposes verbal labels for p-values in specific ranges (i.e., p-values above 0.1 are ‘little to no evidence’, p-values between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman & Royall, 1988). If the null hypothesis is true, p-values are uniformly distributed. This means it is just as likely to observe a p-value of 0.001 as it is to observe a p-value of 0.999. This indicates that the interpretation of p = 0.001 as ‘strong evidence’ cannot be defended just because the probability to observe this p-value is very small. After all, if the null hypothesis is true, the probability of observing p = 0.999 is exactly as small.
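The uniformity of p-values under the null hypothesis can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses a one-sample z-test with a known standard deviation of 1 (not the t-test discussed later in the post), and the function names, sample sizes, and seed are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist


def simulate_null_p_values(n_studies=20000, n_per_study=30, seed=1):
    """Simulate two-sided p-values from one-sample z-tests when the true mean is 0."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_studies):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_study)) / n_per_study
        # The SD is assumed known and equal to 1, so the standard error is 1 / sqrt(n).
        z = sample_mean * n_per_study ** 0.5
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def share_between(p_values, low, high):
    """Proportion of p-values that fall in the interval [low, high)."""
    return sum(low <= p < high for p in p_values) / len(p_values)


p_values = simulate_null_p_values()
# Three intervals of equal width (0.05) at different locations of the p-value scale.
for low, high in [(0.0, 0.05), (0.475, 0.525), (0.95, 1.0)]:
    print(f"share of p-values in [{low}, {high}): {share_between(p_values, low, high):.3f}")
```

Each interval has the same width, so under the null hypothesis each should contain roughly 5% of the simulated p-values, regardless of whether it sits near 0, near 0.5, or near 1.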

The reason that small p-values can be used to guide us in the direction of true effects is not because they are rarely observed when the null hypothesis is true, but because they are relatively less likely to be observed when the null hypothesis is true than when the alternative hypothesis is true. For this reason, statisticians have argued that the concept of evidence is necessarily ‘relative’. We can quantify evidence in favor of one hypothesis over another hypothesis, based on the likelihood of observing data when the null hypothesis is true, compared to this likelihood when an alternative hypothesis is true. As Royall (1997, p. 8) explains: “The law of likelihood applies to pairs of hypotheses, telling when a given set of observations is evidence for one versus the other: hypothesis A is better supported than B if A implies a greater probability for the observations than B does. This law represents a concept of evidence that is essentially relative, one that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569) write, “The p-value is not adequate for inference because the measurement of evidence requires at least three components: the observations, and two competing explanations for how they were produced.”

In practice, the problem of interpreting p-values as evidence in the absence of a clearly defined alternative hypothesis is that they at best serve as proxies for evidence, but not as a useful measure where a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. When researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will be correlated with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and in general it is too weak for p-values to function as a valid measure of evidence, where p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.

 

Why single p-values cannot be interpreted as the strength of evidence

 

The evidential value of a single p-value depends on the statistical power of the test (i.e., on the sample size in combination with the effect size of the alternative hypothesis). The statistical power expresses the probability of observing a p-value smaller than the alpha level if the alternative hypothesis is true. When the null hypothesis is true, statistical power is formally undefined, but in practice in a two-sided test α% of the observed p-values will fall below the alpha level, as p-values are uniformly distributed under the null hypothesis. The horizontal grey line in Figure 1 illustrates the expected p-value distribution for a two-sided independent t-test if the null hypothesis is true (that is, when the true effect size Cohen’s d is 0). As every p-value is equally likely under the null hypothesis, p-values cannot by themselves quantify the strength of evidence against the null hypothesis.

 

Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve) and 99% (dotted black curve). 



If the alternative hypothesis is true, the strength of evidence that corresponds to a p-value depends on the statistical power of the test. If power is 50%, we should expect that 50% of the observed p-values fall below the alpha level. The remaining p-values fall above the alpha level. The black curve in Figure 1 illustrates the p-value distribution for a test with a statistical power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when there is a true effect that is examined in a statistical test with 50% power than when the null hypothesis is true (as illustrated by the black curve being above the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence for an alternative hypothesis examined with 50% power, compared to the null hypothesis.

If an effect is examined in a test with 99% power (the dotted line in Figure 1), we would draw a different conclusion. With such high power, p-values larger than the alpha level of 5% are rare (they occur only 1% of the time), and a p-value of 0.168 is much more likely to be observed when the null hypothesis is true than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is evidence against an alternative hypothesis examined with 99% power, compared to the null hypothesis.

Figure 1 illustrates that with 99% power even a ‘statistically significant’ p-value of 0.04 is evidence for the null hypothesis. The reason for this is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the grey horizontal line at p = 0.04 is above the dotted black curve). This fact, which is often counterintuitive when first encountered, is known as the Lindley paradox, or the Jeffreys-Lindley paradox (for a discussion, see Spanos, 2013).

Figure 1 illustrates that different p-values can correspond to the same relative evidence in favor of a specific alternative hypothesis, and that the same p-value can correspond to different levels of relative evidence. This is obviously undesirable if we want to use p-values as a measure of the strength of evidence. Therefore, it is incorrect to verbally label any p-value as providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis, as depending on the alternative hypothesis a researcher is interested in, the level of evidence will differ (and the p-value could even correspond to evidence in favor of the null hypothesis).
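The comparisons read off Figure 1 can be approximated numerically. The sketch below rests on an assumption that is not in the original analysis: it uses a two-sided z-test as a stand-in for the independent t-test, so the values are approximate. Because the p-value density under the null hypothesis is 1 everywhere, the density under the alternative is also the likelihood ratio of the alternative against the null. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentrality_for_power(power, alpha=0.05):
    """Shift of the z-statistic that gives the requested power.

    Assumption: the negligible rejection probability in the opposite tail is ignored.
    """
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def p_value_density(p, noncentrality):
    """Density of a two-sided p-value under the alternative.

    The null density is uniform (equal to 1), so this value is also the
    likelihood ratio of the alternative hypothesis against the null hypothesis.
    """
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    under_alternative = STD_NORMAL.pdf(z - noncentrality) + STD_NORMAL.pdf(z + noncentrality)
    return under_alternative / (2 * STD_NORMAL.pdf(z))


for power in (0.50, 0.99):
    shift = noncentrality_for_power(power)
    for p in (0.168, 0.04):
        ratio = p_value_density(p, shift)
        favored = "alternative" if ratio > 1 else "null"
        print(f"power = {power:.0%}, p = {p}: likelihood ratio = {ratio:.3f} (favors the {favored})")
```

A ratio above 1 means the observed p-value is more likely under the alternative than under the null hypothesis, and a ratio below 1 means the reverse. Under the z-test approximation, the same p-value lands on different sides of 1 depending on the power of the test.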

 

All p-values smaller than 1 correspond to evidence for some non-zero effect

 

If the alternative hypothesis is not specified, any p-value smaller than 1 should be treated as at least some evidence (however small) for some alternative hypotheses. It is therefore not correct to follow the recommendations of the authors in their Table 2 to interpret p-values above 0.1 (e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against the argument by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies with a p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis based on p < 0.05 (see the forest plot in Figure 2). It thus seems ill-advised to follow their recommendation to describe a single study with p = 0.168 as ‘no evidence’ for a relationship.

 

Figure 2: Forest plot for a meta-analysis of three identical studies yielding p = 0.168.
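The claim that three studies with p = 0.168 jointly yield p < 0.05 can be checked with a rough calculation. The sketch below is not the meta-analysis behind Figure 2. It assumes three equally weighted studies with effects in the same direction and a normal approximation, in which case an inverse-variance fixed-effect analysis reduces to summing the z-scores and dividing by the square root of the number of studies (Stouffer's method). The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p):
    """Absolute z-score that corresponds to a two-sided p-value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def pooled_two_sided_p(p_values):
    """Combine equally weighted studies whose effects all point in the same direction."""
    z_scores = [z_from_two_sided_p(p) for p in p_values]
    pooled_z = sum(z_scores) / sqrt(len(z_scores))
    return 2 * (1 - STD_NORMAL.cdf(pooled_z))


single_study_p = 0.168
combined_p = pooled_two_sided_p([single_study_p] * 3)
print(f"single study: p = {single_study_p}; three studies combined: p = {combined_p:.4f}")
```

Under these assumptions the combined p-value drops below 0.05, even though no single study comes close to that threshold.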


However, replacing the label of ‘no evidence’ with the label ‘at least some evidence for some hypotheses’ leads to practical problems when communicating the results of statistical tests. It seems generally undesirable to allow researchers to interpret any p-value smaller than 1 as ‘at least some evidence’ against the null hypothesis. This is the price one pays for not specifying an alternative hypothesis, and trying to interpret p-values from a null hypothesis significance test in an evidential manner. If we do not specify the alternative hypothesis, it becomes impossible to conclude there is evidence for the null hypothesis, and we cannot statistically falsify any hypothesis (Lakens, Scheel, et al., 2018). Some would argue that if you cannot falsify hypotheses, you have a bit of a problem (Popper, 1959).

 

Interpreting p-values as p-values

 

Instead of interpreting p-values as measures of the strength of evidence, we could consider a radical alternative: interpret p-values as p-values. This would, perhaps surprisingly, solve the main problem that Muff and colleagues aim to address, namely ‘black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff’. The idea to interpret p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative frequentist statistical philosophy was developed by Neyman and Pearson (1933a), who proposed to use p-values to guide decisions about the null and alternative hypothesis by, in the long run, controlling the Type I and Type II error rates. Researchers specify an alpha level and design a study with a sufficiently high statistical power, and reject (or fail to reject) the null hypothesis.

Neyman and Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. Neyman and Pearson (1933b) leave open whether the states of the world are divided in two (‘accept’ and ‘reject’) or three regions, and write that a “region of doubt may be obtained by a further subdivision of the region of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics is to test range predictions instead of limiting oneself to a null hypothesis significance test (Lakens, 2021). This implements the idea of Neyman and Pearson to introduce a region of doubt, and distinguishes inconclusive results (where neither the null hypothesis nor the alternative hypothesis can be rejected, and more data needs to be collected to draw a conclusion) from conclusive results (where either the null hypothesis or the alternative hypothesis can be rejected).
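One way to make the distinction between conclusive and inconclusive results concrete is to combine a null hypothesis test with an equivalence test based on two one-sided tests. The sketch below is an illustration under assumptions that do not come from the post: a normal approximation for the test statistics, symmetric equivalence bounds, and hypothetical names for the function, its parameters, and the outcome labels.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def classify_result(estimate, standard_error, smallest_effect_of_interest, alpha=0.05):
    """Classify a result using a null hypothesis test and an equivalence test."""
    # Two-sided test against an effect of exactly zero.
    p_null = 2 * (1 - STD_NORMAL.cdf(abs(estimate / standard_error)))

    # Two one-sided tests against the lower and upper equivalence bounds.
    z_lower = (estimate + smallest_effect_of_interest) / standard_error
    z_upper = (estimate - smallest_effect_of_interest) / standard_error
    p_lower = 1 - STD_NORMAL.cdf(z_lower)  # tests: effect <= -bound
    p_upper = STD_NORMAL.cdf(z_upper)      # tests: effect >= +bound
    p_equivalence = max(p_lower, p_upper)

    reject_zero = p_null < alpha
    reject_meaningful = p_equivalence < alpha
    if reject_zero and reject_meaningful:
        return "non-zero effect, but too small to be meaningful"
    if reject_zero:
        return "reject the null hypothesis"
    if reject_meaningful:
        return "reject the presence of a meaningful effect"
    return "inconclusive: collect more data"


print(classify_result(estimate=0.5, standard_error=0.1, smallest_effect_of_interest=0.3))
print(classify_result(estimate=0.0, standard_error=0.1, smallest_effect_of_interest=0.3))
print(classify_result(estimate=0.1, standard_error=0.2, smallest_effect_of_interest=0.3))
```

The third call illustrates the region of doubt: the estimate is too imprecise to reject either an effect of zero or an effect as large as the smallest effect of interest.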

In a Neyman-Pearson approach to hypothesis testing the act of rejecting a hypothesis comes with a maximum long run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of rejections must be only tentative.” So when we reject the null model, we do so tentatively, aware of the fact we might have done so in error, and without necessarily believing the null model is false. For Neyman (1957, p. 13) inferential behavior is an: “act of will to behave in the future (perhaps until new experiments are performed) in a particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.

Furthermore, it is important to remember that hypothesis tests reject a statistical hypothesis, but not a theoretical hypothesis. As Neyman (1960, p. 290) writes: “the frequency of correct conclusions regarding the statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis in a specific experiment does not necessarily inform us about the truth of the theory. Decisions about the truthfulness of a theory require a careful evaluation of the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).

Neyman (1976) provides some reporting examples that reflect his philosophy on statistical inferences: “after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that "high" scores on "potential" and on "education" are indicative of better chances of success in the drive to home ownership”. An example of a shorter statement that Neyman provides reads: “As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.”

A complete verbal description of the result of a Neyman-Pearson hypothesis test acknowledges two sources of uncertainty. First, the assumptions of the statistical test must be met (e.g., data is normally distributed), or any deviations should be small enough to not have any substantial effect on the frequentist error rates. Second, conclusions are made “Without hoping to know whether each separate hypothesis is true or false” (Neyman & Pearson, 1933a). Any single conclusion can be wrong, and assuming the test assumptions are met, we make claims under a known maximum error rate (which is never zero). Future replication studies are needed to provide further insights about whether the current conclusion was erroneous or not.

After observing a p-value smaller than the alpha level, one can therefore conclude: “Until new data emerges that proves us wrong, we decide to act as if there is an effect, while acknowledging that the methodological procedure we base this decision on has a maximum error rate of alpha% (assuming the statistical assumptions are met), which we find acceptably low.” One can follow such a statement about the observed data with a theoretical inference, such as “assuming our auxiliary hypotheses hold, the result of this statistical test corroborates our theoretical hypothesis”. If a conclusive test result in an equivalence test is observed that allows a researcher to reject the presence of any effect large enough to be meaningful, the conclusion would be that the test result does not corroborate the theoretical hypothesis.

It is true that the common application of null hypothesis significance testing in science is based on an arbitrary threshold of 0.05 (Lakens, Adolfi, et al., 2018). There are surprisingly few attempts to provide researchers with practical approaches to determine an alpha level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021; Miller & Ulrich, 2019; Mudge et al., 2012). This problem seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science where the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which a researcher is not expected to make a claim about a finding (e.g., a BF < 3, see Kass & Raftery, 1995, or a likelihood ratio lower than k = 8, see Royall, 2000). Although it is true that an alpha level of 0.05 is arbitrary, there are some pragmatic arguments in its favor (e.g., it is established, and it might be low enough to yield claims that are taken seriously, but not high enough to prevent other researchers from attempting to refute the claim, see Uygun Tunç et al., 2021).

 

Is there really no agreement on best practices in sight?

 

One major impetus for the flawed proposal to interpret p-values as evidence by Muff and colleagues is that “no agreement on a way forward is in sight”. The statement that there is little agreement among statisticians is an oversimplification. I will go out on a limb and state some things I assume most statisticians agree on. First, there are multiple statistical tools one can use, and each tool has its own strengths and weaknesses. Second, there are different statistical philosophies, each with their own coherent logic, and researchers are free to analyze data from the perspective of one or multiple of these philosophies. Third, one should not misuse statistical tools, or apply them to attempt to answer questions the tool was not designed to answer.

It is true that there is variation in the preferences individuals have about which statistical tools should be used, and which philosophies of statistics researchers should adopt. This should not be surprising. Individual researchers differ in which research questions they find interesting within a specific content domain, and similarly, they differ in which statistical questions they find interesting when analyzing data. Individual researchers differ in which approaches to science they adopt (e.g., a qualitative or a quantitative approach), and similarly, they differ in which approach to statistical inferences they adopt (e.g., a frequentist or Bayesian approach). Luckily, there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences. It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is. Researchers can believe it is important for reliable knowledge generation to control error rates when making scientific claims, while at the same time believing that it is important to quantify relative evidence using likelihoods or Bayes factors (for example by presenting a Bayes factor alongside every p-value for a statistical test, Lakens et al., 2020).

Whatever approach to statistical inferences researchers choose to use, the approach should answer a meaningful statistical question (Hand, 1994), the approach to statistical inferences should be logically coherent, and the approach should be applied correctly. Despite the common statement in the literature that p-values can be interpreted as measures of evidence, the criticism against the coherence of this approach should make us pause. Given that coherent alternatives exist, such as likelihoods (Royall, 1997) or Bayes factors (Kass & Raftery, 1995), researchers should not follow the recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p = 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.

 

References

Bland, M. (2015). An introduction to medical statistics (Fourth edition). Oxford University Press.

Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568–1574.

Hand, D. J. (1994). Deconstructing Statistical Questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3), 317–356. https://doi.org/10.2307/2983526

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572

Kim, J. H., & Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic Approach. Abacus, 57(1), 27–71. https://doi.org/10.1111/abac.12172

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16–26. https://doi.org/10.1037//0003-066X.56.1.16

Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2020). Improving Inferences About Null Effects With Bayes Factors and Equivalence Tests. The Journals of Gerontology: Series B, 75(1), 45–57. https://doi.org/10.1093/geronb/gby065

Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963

Maier, M., & Lakens, D. (2021). Justify Your Alpha: A Primer on Two Practical Approaches. PsyArXiv. https://doi.org/10.31234/osf.io/ts4r6

Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734

Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7–22. https://doi.org/10.2307/1401671

Neyman, J. (1960). First course in probability and statistics. Holt, Rinehart and Winston.

Neyman, J. (1976). Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics - Theory and Methods, 5(8), 737–751. https://doi.org/10.1080/03610927608827392

Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009

Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29(04), 492–510. https://doi.org/10.1017/S030500410001152X

Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall/CRC.

Royall, R. (2000). On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95(451), 760–768.

Spanos, A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80(1), 73–93.

Uygun Tunç, D., & Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework. In Meta-Psychology. https://doi.org/10.31234/osf.io/pdm7y

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by

7 comments:

  1. "p-values should be interpreted as p-values" is no different to the former UK Prime Minister's comment "Brexit means Brexit". Since at the time there was no consensus as to the meaning of Brexit, the Brexit meme was meaningless. The same may be true for this p-value meme, if such it is to become, since the "value" in p-value is itself disputed.

  2. Who would dispute the definition of a p value? And who would dispute that it is in fact a value that is called p? The discussion is about what inferences to draw from a p value, and whether such inferences are consistent with its definition. But there are no different ways to calculate a p value.

    Replies
    1. The original definition of the "value of P" in Pearson 1900 and which became known as the P-value by the 1920s is an observed tail area of a divergence statistic, while in the Neyman-Pearsonian definition assumed above P is a random variable defined from a formal decision rule with known conditional error rates. The two concepts can come into conflict over proper extension beyond simple hypotheses in basic models, e.g. see Robins et al. JASA 2000.

  3. Part 1: I thought this post provided mostly good coverage under the Neyman-Pearson-Lehmann/decision-theory (NPL) concept of P-values as random variables whose single-trial realization is the smallest alpha-level at which the tested hypothesis H could be rejected (given all background assumptions hold). In this NPL vision, P-values are inessential add-ons that can be skipped if one wants to just check in what decision region the test statistic fell.

    But I object to the coverage above and in its cites for not recognizing how the Pearson-Fisher P-value concept (which is the original form of their "value of P") differs in a crucial fashion from the NPL version. Fisher strongly objected to the NP formalization of statistical testing, and I think his main reasons can be made precise when one considers alternative formalizations of how he described P-values. There is no agreed-upon formal definition of "evidence" or how to measure it, but in Fisher's conceptual framework P-values can indeed "measure evidence" in the sense of providing coherent summaries of the information against H contained in measures of divergence of data from models.

    The Pearson and Fisher definition started from divergence measures in single trials, such as chi-squared or Z-statistics; P is then the observed divergence quantile (tail area) in a reference distribution under H. No alpha or decision need be in the offing, so those become the add-ons. For some review material see
    Greenland S. 2019 http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
    Rafi & Greenland. https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9
    Greenland & Rafi. https://arxiv.org/abs/2008.12991
    Cole SR, Edwards J, Greenland S. (2021). https://academic.oup.com/aje/advance-article-abstract/doi/10.1093/aje/kwaa136/5869593
    Related views are in e.g.
    Perezgonzalez JD. P-values as percentiles. Commentary on: “Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations”. Front Psych 2015;6. https://doi.org/10.3389/fpsyg.2015.00341.
    Vos P, Holbert D. Frequentist inference without repeated sampling. ArXiv190608360 StatOT. 2019; https://arxiv.org/abs/1906.08360.

  4. Part 2: The original observed-quantile conceptualization of P-values can conflict with the NPL/decision conceptualization in e.g. Lehmann 1986 used for example in Schervish 1996. The latter paper showed how NPL P-values can be incoherent measures of support, with which I wholly agree. As I think both K. Pearson and Fisher saw, the value of P can only indicate compatibility of data with models, and many conflicting models may be highly compatible with data. But P-values can be transformed into measures of refutation, conflict, or countersupport, such as the binary S-value, Shannon or surprisal transform -log2(p) as reviewed in the Greenland et al. cites above.

    Schervish 1996 failed to recognize the Fisherian alternative derivation/definition of P-values and so wrote (as others do) as if the NPL formalization was the only one available or worth considering - a shortcoming quite at odds with sound advice like "there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences." And while I hope everyone agrees that "It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is", I think it important to recognize that most of the time our "best" (by the usual statistical criteria) point estimates of effect sizes can be represented as maxima of 2-sided P-value functions or crossing points of upper and lower P-value functions, and our "best" interval estimates can be read off the same P-functions.

    I must add that I am surprised that so many otherwise perceptive writers keep repeating the absurd statement that "P-values overstate evidence", which I view as a classic example of the mind-projection fallacy. The P-value is just a number that sits there; any overstatement of its meaning in any context has to be on the part of the viewer. I suspect the overstatement claim arises because some are still subconsciously sensing P-values as some sort of posterior probability (even if consciously they would deny that vehemently). This problem indicates that attention should also be given to the ways in which P-values can supply interesting bounds on posterior probabilities, as shown in Casella & R. Berger 1987ab and reviewed in Greenland & Poole 2013ab (all are cited in Greenland 2019 above), and how P-values can be rescaled as binary S-values -log2(p) to better perceive their information content (again as reviewed in the Greenland et al. citations above).

  5. There is one thing I keep asking and never get an answer to—which is kind of weird since it’s so obviously relevant and is a point that comes from one of the founders of significance testing. You say: “After observing a p-value smaller than the alpha level, one can therefore conclude…” How is that compatible with what Fisher said about significance tests: “A scientific fact should be regarded as experimentally established only if a properly designed experiment *rarely fails* to give this level of significance”?

    Do we all agree that Fisher can only have meant that after observing (obtaining, actually) a single p-value *we do not conclude anything*? But that we only conclude things after obtaining *many* p-values? (As many as we deem necessary to be able to speak of “rarely fails”.)

    Replies
    1. Fisher is not really the best source on how to interpret test results. It is a lot simpler (and better) from a Neyman-Pearson approach. You conclude something *with a known maximum error rate* - so, you draw a conclusion but at the same time accept that in the long run, you could be wrong at most e.g., 5% of the time. Conclusions are, as I write in the blog, always tentative.
