Comments on The 20% Statistician: Why p-values should be interpreted as p-values and not as measures of evidence

If a p-value is 0.01, it is unlikely to observe wh...

2024-05-29T17:23:05.904+02:00

If a p-value is 0.01, then what we observed is unlikely, given that the null hypothesis is true. That statement is based on the value of the p-value, not on the pdf of the p-value. I don't think the uniform distribution of the p-value under H0 contradicts the fact that the p-value itself indicates the strength of evidence against H0.
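To make the distinction between a single p-value and the distribution of p-values concrete, here is a minimal simulation sketch. It assumes a two-sided z-test of H0: mean = 0 with a known standard deviation of 1; the function names, sample size, and seed are hypothetical choices for illustration, not anything from the blog post.

```python
import random
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_null_p_values(n_studies=20000, n=10, seed=1):
    """Simulate z-tests of H0: mean = 0 when H0 is true (sd assumed known, = 1)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5  # sample mean divided by its standard error 1/sqrt(n)
        p_values.append(two_sided_p(z))
    return p_values


p_values = simulate_null_p_values()
# Under H0 the share of p-values below any cutoff should be close to that cutoff.
share_below_05 = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_50 = sum(p < 0.50 for p in p_values) / len(p_values)
print(round(share_below_05, 3), round(share_below_50, 3))
```

The simulation only describes how p-values behave across many hypothetical studies when H0 is true; it does not by itself settle how a single observed p = 0.01 should be read, which is the point under discussion.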

Part 2: Interpreting the p value of a statistical ...

2021-12-04T06:58:40.175+01:00

Part 2: Interpreting the p value of a statistical analysis depends on its mode of presentation. A verbal presentation can be derived from a statistical analysis but is a different presentation mode. Somehow this topic has not been discussed. A proposal based on alternative representations, with examples from pre-clinical and clinical research, can be found in https://www.dropbox.com/s/zfmuc81ho2yschm/Kenett%20Rubinstein%20Scientometrics%202021.pdf?dl=0

Part 1: Interpretation is the focus of this blog. ...

2021-12-04T06:22:55.760+01:00

Part 1: Interpretation is the focus of this blog. As is often the case, the comments on the blog are also interesting. Stating the goal of the analysis is an obvious step to clarify the interpretation of the analysis results. We expanded on Hand's deconstruction paper and propose a framework of information quality. It has four components and eight dimensions, "Goal" being the first component. http://infoq.galitshmueli.com/home

How about we first discuss what Fisher actually sa...

2021-11-29T20:25:41.430+01:00

How about we first discuss what Fisher actually said before dismissing it without engaging with it? In any case, I would have expected an actual argument for why “Fisher is not really the best source on how to interpret test results”…

Fisher is not really the best source on how to int...

2021-11-25T13:27:40.962+01:00

Fisher is not really the best source on how to interpret test results. It is a lot simpler (and better) under a Neyman-Pearson approach. You conclude something *with a known maximum error rate* - so, you draw a conclusion but at the same time accept that in the long run, you could be wrong at most, e.g., 5% of the time. Conclusions are, as I write in the blog, always tentative.
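The "known maximum error rate" can be illustrated with a small simulation. This is only a sketch under stated assumptions: a one-sided z-test of H0: mean <= 0 with known sd = 1 and alpha = 0.05, with hypothetical names, sample size, and seed. It draws the sample mean directly from its sampling distribution and compares the long-run rejection rate at the boundary of H0 with the rate at a true mean inside H0.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1.0 - ALPHA)  # one-sided critical value, about 1.645


def rejection_rate(true_mean, n=20, n_studies=20000, seed=7):
    """Long-run share of studies that reject H0: mean <= 0 (sd assumed known, = 1)."""
    rng = random.Random(seed)
    se = 1.0 / n ** 0.5  # standard error of the sample mean
    rejections = 0
    for _ in range(n_studies):
        sample_mean = rng.gauss(true_mean, se)  # draw from the sampling distribution
        if sample_mean / se > Z_CRIT:
            rejections += 1
    return rejections / n_studies


rate_at_boundary = rejection_rate(0.0)   # H0 true, at its boundary
rate_inside_null = rejection_rate(-0.2)  # H0 true, away from the boundary
print(rate_at_boundary, rate_inside_null)
```

In this setup the rejection rate is close to alpha at the boundary and smaller elsewhere in H0, which is the sense in which alpha acts as a maximum rather than an exact error rate.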

There is one thing I keep asking and never get an ...

2021-11-24T14:23:55.601+01:00

There is one thing I keep asking and never get an answer to—which is kind of weird since it’s so obviously relevant and is a point that comes from one of the founders of significance testing. You say: “After observing a p-value smaller than the alpha level, one can therefore conclude…” How is that compatible with what Fisher said about significance tests: “A scientific fact should be regarded as experimentally established only if a properly designed experiment *rarely fails* to give this level of significance”?

Do we all agree that Fisher can only have meant that after observing (obtaining, actually) a single p-value *we do not conclude anything*? But that we only conclude things after obtaining *many* p-values? (As many as we deem necessary to be able to speak of “rarely fails”.)

Part 2: The original observed-quantile conceptuali...

2021-11-22T11:31:00.315+01:00

Part 2: The original observed-quantile conceptualization of P-values can conflict with the NPL/decision conceptualization in e.g. Lehmann 1986 used for example in Schervish 1996. The latter paper showed how NPL P-values can be incoherent measures of support, with which I wholly agree. As I think both K. Pearson and Fisher saw, the value of P can only indicate compatibility of data with models, and many conflicting models may be highly compatible with data. But P-values can be transformed into measures of refutation, conflict, or countersupport, such as the binary S-value, Shannon or surprisal transform -log2(p) as reviewed in the Greenland et al. cites above.

Schervish 1996 failed to recognize the Fisherian alternative derivation/definition of P-values and so wrote (as others do) as if the NPL formalization was the only one available or worth considering - a shortcoming quite at odds with sound advice like "there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences." And while I hope everyone agrees that "It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is", I think it important to recognize that most of the time our "best" (by the usual statistical criteria) point estimates of effect sizes can be represented as maxima of 2-sided P-value functions or crossing points of upper and lower P-value functions, and our "best" interval estimates can be read off the same P-functions.
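As a rough illustration of reading estimates off a P-value function, the sketch here assumes a normally distributed estimator with a known standard error; the estimate (1.2), standard error (0.5), and grid are hypothetical numbers chosen only for the example. The point estimate appears as the maximum of the two-sided P-value function, and a 95% interval as the set of effect sizes with P >= 0.05.

```python
from statistics import NormalDist


def p_value_function(theta, estimate, se):
    """Two-sided P-value for the hypothesis that the true effect equals theta."""
    z = (estimate - theta) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


estimate, se = 1.2, 0.5  # hypothetical estimate and standard error
grid = [i / 100 for i in range(-100, 341)]  # candidate effect sizes from -1.00 to 3.40
p_curve = [p_value_function(theta, estimate, se) for theta in grid]

# The point estimate is where the two-sided P-value function peaks (P = 1).
best_theta = grid[p_curve.index(max(p_curve))]

# A 95% interval is the set of effect sizes with P >= 0.05, up to grid resolution.
compatible = [theta for theta, p in zip(grid, p_curve) if p >= 0.05]
interval = (min(compatible), max(compatible))
print(best_theta, interval)
```

Other cutoffs on the same curve give intervals at other levels, so one function carries both the point estimate and the whole family of interval estimates.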

I must add that I am surprised that so many otherwise perceptive writers keep repeating the absurd statement that "P-values overstate evidence", which I view as a classic example of the mind-projection fallacy. The P-value is just a number that sits there; any overstatement of its meaning in any context has to be on the part of the viewer. I suspect the overstatement claim arises because some are still subconsciously sensing P-values as some sort of posterior probability (even if consciously they would deny that vehemently). This problem indicates that attention should also be given to the ways in which P-values can supply interesting bounds on posterior probabilities, as shown in Casella & R. Berger 1987ab and reviewed in Greenland & Poole 2013ab (all are cited in Greenland 2019 above), and how P-values can be rescaled as binary S-values -log2(p) to better perceive their information content (again as reviewed in the Greenland et al. citations above).
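The binary S-value transform mentioned here is short enough to write out. A minimal sketch (the function name is hypothetical):

```python
import math


def s_value(p):
    """Binary S-value (surprisal) of a P-value, in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


for p in (0.5, 0.25, 0.05, 0.01):
    print(p, round(s_value(p), 2))
```

On this scale p = 0.05 is about 4.3 bits of information against the tested model, a little more than the surprise of seeing four heads in four tosses of a coin assumed to be fair (probability 0.0625, exactly 4 bits).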

Part 1: I thought this post provided mostly good c...

2021-11-22T11:29:52.219+01:00

Part 1: I thought this post provided mostly good coverage under the Neyman-Pearson-Lehmann/decision-theory (NPL) concept of P-values as random variables whose single-trial realization is the smallest alpha-level at which the tested hypothesis H could be rejected (given all background assumptions hold). In this NPL vision, P-values are inessential add-ons that can be skipped if one wants to just check in what decision region the test statistic fell.
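The "smallest alpha-level at which H could be rejected" reading can be checked numerically. This sketch assumes a two-sided z-test and a hypothetical observed statistic of 2.2; it scans a grid of alpha levels, applies the decision rule at each, and compares the smallest rejecting alpha with the usual tail-area calculation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rejects(z, alpha):
    """Decision rule: does |z| fall in the two-sided rejection region at level alpha?"""
    critical = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return abs(z) >= critical


def two_sided_p(z):
    """Two-sided tail area of the observed statistic under H."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


z_observed = 2.2  # hypothetical observed test statistic
p = two_sided_p(z_observed)

# Scan alpha = 0.001, 0.002, ..., 0.199 and keep the smallest level that rejects.
alphas = [a / 1000 for a in range(1, 200)]
smallest_rejecting_alpha = min(a for a in alphas if rejects(z_observed, a))
print(round(p, 4), smallest_rejecting_alpha)
```

Up to the grid resolution the two numbers coincide, which is why the P-value adds nothing to the decision in this framing once alpha is fixed in advance.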

But I object to the coverage above and in its cites for not recognizing how the Pearson-Fisher P-value concept (which is the original form of their "value of P") differs in a crucial fashion from the NPL version. Fisher strongly objected to the NP formalization of statistical testing, and I think his main reasons can be made precise when one considers alternative formalizations of how he described P-values. There is no agreed-upon formal definition of "evidence" or how to measure it, but in Fisher's conceptual framework P-values can indeed "measure evidence" in the sense of providing coherent summaries of the information against H contained in measures of divergence of data from models.

The Pearson and Fisher definition started from divergence measures in single trials, such as chi-squared or Z-statistics; P is then the observed divergence quantile (tail area) in a reference distribution under H. No alpha or decision need be in the offing, so those become the add-ons. For some review material see
Greenland S. 2019 http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
Rafi & Greenland. https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9
Greenland & Rafi. https://arxiv.org/abs/2008.12991
Cole SR, Edwards J, Greenland S. (2021). https://academic.oup.com/aje/advance-article-abstract/doi/10.1093/aje/kwaa136/5869593
Related views are in e.g.
Perezgonzalez JD. P-values as percentiles. Commentary on: “Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations”. Front Psych 2015;6. https://doi.org/10.3389/fpsyg.2015.00341.
Vos P, Holbert D. Frequentist inference without repeated sampling. ArXiv190608360 StatOT. 2019; https://arxiv.org/abs/1906.08360.
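The observed-tail-area definition described in this comment (P as the quantile of an observed divergence statistic in a reference distribution under H) can be sketched without any alpha or decision rule. The example assumes a standard-normal reference distribution for Z and, equivalently, a chi-squared reference distribution with 1 degree of freedom for Z squared; the observed value 1.5 is hypothetical.

```python
import math
from statistics import NormalDist


def z_tail_area(z):
    """Two-sided tail area of an observed Z-statistic under a standard-normal reference."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def chi2_1df_tail_area(x):
    """Upper tail area of a chi-squared statistic with 1 df: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


z_obs = 1.5  # hypothetical observed divergence of the data from the model
p_from_z = z_tail_area(z_obs)
p_from_chi2 = chi2_1df_tail_area(z_obs ** 2)  # Z squared is chi-squared with 1 df
print(round(p_from_z, 4), round(p_from_chi2, 4))
```

Both routes give the same number because they measure the same divergence on different scales; nothing in the calculation refers to a cutoff.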

The original definition of the "value of P"...

2021-11-22T10:54:54.333+01:00

The original definition of the "value of P" in Pearson 1900, which became known as the P-value by the 1920s, is an observed tail area of a divergence statistic, while in the Neyman-Pearsonian definition assumed above P is a random variable defined from a formal decision rule with known conditional error rates. The two concepts can come into conflict over proper extension beyond simple hypotheses in basic models, e.g. see Robins et al. JASA 2000.

Who would dispute the definition of a p value? And...

2021-11-21T23:45:04.624+01:00

Who would dispute the definition of a p value? And who would dispute that it is in fact a value that is called p? The discussion is about what inferences to draw from a p value, and whether such inferences are consistent with its definition. But there are no different ways to calculate a p value.

"p-values should be interpreted as p-values&q...

2021-11-20T14:38:40.974+01:00

"p-values should be interpreted as p-values" is no different to the former UK Prime Minister's comment "Brexit means Brexit". Since at the time there was no consensus as to the meaning of Brexit, the Brexit meme was meaningless. The same may be true for this p-value meme, if such it is to become, since the "value" in p-value is itself disputed.