The 20% Statistician

Could you recommend any materials (reading, online...

2025-01-25T06:06:25.989+01:00

Could you recommend any materials (reading, online courses, etc) on metascience for folks who'd love to learn more about the topic, but cannot attend your course?

If a p-value is 0.01, it is unlikely to observe wh...

2024-05-29T17:23:05.904+02:00

If a p-value is 0.01, it is unlikely to observe what we did, given the null hypothesis is true. The above statement is based on the value of the p-val, not on the pdf of the p-val. I don't think the uniform distribution of p-val contradicts the fact that p-val itself indicates the strength of evidence against H0

It is true of course that effect sizes are inflate...

2024-05-27T11:58:08.416+02:00

It is true of course that effect sizes are inflated if only a positive selection of significant results is considered. However, keep in mind that many scientific studies do not intend to estimate the effect size in the first place but to establish the existence of effects. To make this difference clear, just consider experimental studies. The very purpose of an experiment is to MAXIMIZE the size of the effect by controlling noise and confounding variables. Also, many important effects only exist in the laboratory, and even there in only a few experiments (this is why there is no such thing as the "true magnitude of the emotional Stroop effect with Taylor Swift faces"). Generally, a context of hypothesis testing (i.e., theory-testing) is different from a context of effect size estimation -- the latter would require large representative samples, while the former employs experimental control to bring out effects that would otherwise be tiny. There are actually very few studies in experimental psychology that are conducted with effect size estimation in mind.

I just wanted to comment and thank you for your si...

2024-04-13T08:25:36.290+02:00

I just wanted to comment and thank you for your site. It was recommended to me by my supervisor, Steve Lindsay, and your blog posts (and publications) have elevated my understanding of statistics well beyond what I thought I would ever know. Shout-out to your twitter for directing me to the best realization I've ever had, that p-values are uniformly distributed under the null-hypothesis.

The text was happy to point out a perverse effect ...

2024-02-12T13:50:55.095+01:00

The text was happy to point out a perverse effect of this problem: researchers will look for significant results. They will do this through processes that may include selective exclusion of outliers, treatment of variables, search for models that respond better in terms of results, and even gross manipulation of the data.

Its SatterTHwaite https://en.wikipedia.org/wiki/W...

2023-11-04T12:53:00.816+01:00

Its SatterTHwaite
https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation

Thanks Daniel. What do I do when the sample size i...

2023-09-10T21:37:56.366+02:00

Thanks Daniel. What do I do when the sample size is 20,000 (as when using data from AP Votecast)?

I fell for this. Apparently I assumed that someone...

2023-08-23T11:19:40.739+02:00

I fell for this. Apparently I assumed that someone with a PhD would not deliberately spread misinformation, which is pretty funny because (a) I didn't actually check that the author is entitled to put "Dr" in front of her name, and (b) I spend an inordinate amount of time looking at cases where people with PhDs are lying.

Dear anonymous, all non-inferiority (and most clin...

2023-08-04T14:25:30.415+02:00

Dear anonymous, all non-inferiority (and most clinical superiority) studies are single-sided just by definition of non-inferiority. Only one side of the corresponding confidence interval is of interest. What would you like to test here bi-directionally???

Thanks for writing this - after reading a recent t...

2023-04-13T17:51:13.710+02:00

Thanks for writing this - after reading a recent tweet of yours, I thought the arguments behind that brief statement must be somewhere in your posts and materials, ooor maybe even there's a new post in the making, and here it comes I guess :)

Two - hopefully not silly - questions came into mind.

1: "The correct claim is that people should update their belief in the alternative hypothesis by a factor of 10." - Doesn't this updating factor require that the personal prior belief of the reader was the same as the prior specified in the analysis? So when I don't agree with (or don't understand) the priors, then I'm probably gonna have a hard time drawing a conclusion for myself... (...which could also apply to a corresponding frequentist analysis as well)

2: "Giving up error control also means giving up claims about the presence or absence of effects." - This is a strong claim, and I think this was basically what the tweet I was referring to was about. You concentrated on Bayes Factors, but I wondered how this would apply to a Bayesian inferring, plotting and characterizing the full Bayesian posterior distribution and make claims and interpretations ba(ye)sed on that. For instance, seeing that moost of the posterior mass encompasses a range of effects of meaningful strength, then they would base their interpretation and further research or other decision based on that. Would you consider some formal "Bayesian error control" possible and necessary in this situation? I'm curious how you look at this from your point of view.

(I unfortunately don't have deep experience with Bayesian methods yet, but I'd think that a) based on the posterior distribution and the specific hypothesis, readers might be able to consider the weight of the evidence for themselves, b) maybe in critical situations, the downstream consequences of dichotomous decisions could even be explicitly modeled in a Bayesian framework.)

Hi Daniel. Thanks very much for the blog. How woul...

2023-01-31T12:45:22.945+01:00

Hi Daniel. Thanks very much for the blog. How would you suggest doing a sensitivity for a multi-level regression? I am under the impression that G*Power does not include this.

Thanks for this. I'm a bit confused. If epsilo...

2022-11-26T18:08:56.605+01:00

Thanks for this. I'm a bit confused. If epsilon-squared is, according to Okada, preferred over omega, shouldn't the title be 'Why you should use epsilon-squared instead of eta-squared' ?

I guess this is an older post but I hope you can g...

2022-08-26T13:11:33.804+02:00

I guess this is an older post but I hope you can get back to me. I found this (worryingly) convincing. Worryingly, because I use Bayes Factors so often in my work (reporting them alongside the frequentist indices). I do not like the subjective Bayesian methodology for the reasons you have argued and I always try and use the hopefully reasonable "default" priors Erik and co recommend (despite push back from elsewhere regarding them that is above my pay grade).

But there are some practical (rather than philosophical) reasons why I have to go Bayes sometimes. The models I tend to use most are mixed models (repeated measurements from each participant) and sometimes even on minimal structure (e.g. just random intercepts for subject) they can be singular fit and the only option I know from there is to use Bayesian mixed-models (which is fine for linear models but I have have binary outcomes I am bit stuck as there are no "defaults" yet.)

What would you do in these situations? Also did Erik write a response somewhere, even if it was just a twitter thread? I can't find one but it would be cool to see more discussion between my two favorite stats guys.

Lastly, you mention "Leonard Held brought up these Bayesian/Frequentist compromise methods." Do you have a link to anything on this point as I just do both analyses and report the indices but this seems much better.

Thanks!

P.S. I recently completed my PhD and your posts have always been a great help!

Part 2: Interpreting the p value of a statistical ...

2021-12-04T06:58:40.175+01:00

Part 2: Interpreting the p value of a statistical analysis is dependent on it mode of presentation. A verbal presentation can be derived on statistical analysis but is a different presentation mode. Somehow this topic has not been discussed. A proposal based on alternative representations with examples from pre-clinical and clinical research can be found in https://www.dropbox.com/s/zfmuc81ho2yschm/Kenett%20Rubinstein%20Scientometrics%202021.pdf?dl=0

Part 1: Interpretation is the focus of this blog. ...

2021-12-04T06:22:55.760+01:00

Part 1: Interpretation is the focus of this blog. As often found, the comments to the blog are also interesting. Stating the goal of the analysis is an obvious step to clarify the interpretation of the analysis results. We expanded on Hand's deconstruction paper and propose a framework of information quality. It has four components and 8 dimensions. "Goal" being the first component. http://infoq.galitshmueli.com/home

How about we first discuss what Fisher actually sa...

2021-11-29T20:25:41.430+01:00

How about we first discuss what Fisher actually said before dismissing it without engaging with it? In any case, I would have expected an actual argument for why “Fisher is not really the best source on how to interpret test results”…

Fisher is not really the best source on how to int...

2021-11-25T13:27:40.962+01:00

Fisher is not really the best source on how to interpret test result. It is a lot simpler (and better) from a Neyman-Pearson approach. You conclude something *with a known maximum error rate* - so, you draw a conclusion but at the same time accept that in the long run, you could be wrong at most e.g., 5% of the time. Conclusions are, as I write in the blog, always tentative.

There is one thing I keep asking and never get an ...

2021-11-24T14:23:55.601+01:00

There is one thing I keep asking and never get an answer to—which is kind of weird since it’s so obviously relevant and is a point that comes from one of the founders of significance testing. You say: “After observing a p-value smaller than the alpha level, one can therefore conclude…” How is that compatible with what Fisher said about significance tests: “A scientific fact should be regarded as experimentally established only if a properly designed experiment *rarely fails* to give this level of significance”?

Do we all agree that Fisher can only have meant that after observing (obtaining, actually) a single p-value *we do not conclude anything*? But that we only conclude things after obtaining *many* p-values? (As many as we deem necessary to be able to speak of “rarely fails”.)

Part 2: The original observed-quantile conceptuali...

2021-11-22T11:31:00.315+01:00

Part 2: The original observed-quantile conceptualization of P-values can conflict with the NPL/decision conceptualization in e.g. Lehmann 1986 used for example in Schervish 1996. The latter paper showed how NPL P-values can be incoherent measures of support, with which I wholly agree. As I think both K. Pearson and Fisher saw, the value of P can only indicate compatibility of data with models, and many conflicting models may be highly compatible with data. But P-values can be transformed into measures of refutation, conflict, or countersupport, such as the binary S-value, Shannon or surprisal transform -log2(p) as reviewed in the Greenland et al. cites above.

Schervish 1996 failed to recognize the Fisherian alternative derivation/definition of P-values and so wrote (as others do) as if the NPL formalization was the only one available or worth considering - a shortcoming quite at odds with sound advice like "there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences." And while I hope everyone agrees that "It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is", I think it important to recognize that most of the time our "best" (by the usual statistical criteria) point estimates of effect sizes can be represented as maxima of 2-sided P-value functions or crossing points of upper and lower P-value functions, and our "best" interval estimates can be read off the same P-functions.

I must add that I am surprised that so many otherwise perceptive writers keep repeating the absurd statement that "P-values overstate evidence", which I view as a classic example of the mind-projection fallacy. The P-value is just a number that sits there; any overstatement of its meaning in any context has to be on the part of the viewer. I suspect the overstatement claim arises because some are still subconsciously sensing P-values as some sort of posterior probability (even if consciously they would deny that vehemently). This problem indicates that attention should also be given to the ways in which P-values can supply interesting bounds on posterior probabilities, as shown in Casella & R. Berger 1987ab and reviewed in Greenland & Poole 2013ab (all are cited in Greenland 2019 above), and how P-values can be rescaled as binary S-values -log2(p) to better perceive their information content (again as reviewed in the Greenland et al. citations above).

Part 1: I thought this post provided mostly good c...

2021-11-22T11:29:52.219+01:00

Part 1: I thought this post provided mostly good coverage under the Neyman-Pearson-Lehmann/decision-theory (NPL) concept of P-values as random variables whose single-trial realization is the smallest alpha-level at which the tested hypothesis H could be rejected (given all background assumptions hold). In this NPL vision, P-values are inessential add-ons that can be skipped if one wants to just check in what decision region the test statistic fell.

But I object to the coverage above and in its cites for not recognizing how the Pearson-Fisher P-value concept (which is the original form of their "value of P") differs in a crucial fashion from the NPL version. Fisher strongly objected to the NP formalization of statistical testing, and I think his main reasons can be made precise when one considers alternative formalizations of how he described P-values. There is no agreed-upon formal definition of "evidence" or how to measure it, but in Fisher's conceptual framework P-values can indeed "measure evidence" in the sense of providing coherent summaries of the information against H contained in measure of divergence of data from models.

Pearson and Fisher defintion started from divergence measures in single trials, such as chi-squared or Z-statistics; P is then the observed divergence quantile (tail area) in a reference distribution under H. No alpha or decision need be in the offing, so those become the add-ons. For some review material see
Greenland S. 2019 http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
Rafi & Greenland. https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01105-9
Greenland & Rafi. https://arxiv.org/abs/2008.12991
Cole SR, Edwards J, Greenland S. (2021). https://academic.oup.com/aje/advance-article-abstract/doi/10.1093/aje/kwaa136/5869593
Related views are in e.g.
Perezgonzalez JD. P-values as percentiles. Commentary on: “Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations”. Front Psych 2015;6. https://doi.org/10.3389/fpsyg.2015.00341.
Vos P, Holbert D. Frequentist inference without repeated sampling. ArXiv190608360 StatOT. 2019; https://arxiv.org/abs/1906.08360.

The original definition of the "value of P&qu...

2021-11-22T10:54:54.333+01:00

The original definition of the "value of P" in Pearson 1900 and which became known as the P-value by the 1920s is an observed tail area of a divergence statistic, while in the Neyman-Pearsonian definition assumed above P is a random variable defined from a formal decision rule with known conditional error rates. The two concepts can come into conflict over proper extension beyond simple hypotheses in basic models, e.g. see Robins et al. JASA 2000.

Who would dispute the definition of a p value? And...

2021-11-21T23:45:04.624+01:00

Who would dispute the definition of a p value? And who would dispute that it it's in fact a value that is called p? The discussion is about what inferences to draw from a p value, and whether such inferences are consistent with it's definition. But there are no different ways to calculate a p value.

"p-values should be interpreted as p-values&q...

2021-11-20T14:38:40.974+01:00

"p-values should be interpreted as p-values" is no different to the former UK Prime Minister's comment "Brexit means Brexit". Since at the time there was no consensus as to the meaning of Brexit, the Brexit meme was meaningless. The same may be true for this p-value meme, if such it is to become, since the "value" in p-value is itself disputed.

Your first part about sequential analysis is not r...

2021-11-01T08:44:06.109+01:00

Your first part about sequential analysis is not really a solid analysis - the problems you mentioned are all easily solved, see https://psyarxiv.com/x4azm/.

About the second part: Preregistration is more than just being 'open' about a process. It is about allowing others to evaluate the severity of a test. This means you need to provide very specific information in the preregistration - a problem is many are now too vague to evaluate the severity of a test.

This is all fine except for the advice on testing ...

2021-10-31T16:51:42.298+01:00

This is all fine except for the advice on testing - running - testing cycle. While it's true that you can plan for this and avoid inflated Type I error rates, it still has problems. The first is that, in the long run, it is not more efficient. Sometimes you'll need to run more participants and sometimes fewer; but it doesn't make things more efficient in the long run. The second is that it generates a literature where all of the small studies have exaggerated effect sizes and the large ones underestimated effect sizes. Consider the situation, you're going to run until you find an effect. If your initial sample was an underestimates of that effect, even in the wrong direction, you'll need to run many participants in order to eventually find an effect, and the under estimate bias will never be eliminated. If you start with an over estimate in your initial sample you'll be done collecting data quickly and, again, not have eliminated the over estimation bias in your sample.

Every other bit of overregularization mentioned here is spot on though. I especially often run into the preregistration issue. I never explain it to my students as a way to avoid Type I errors. I only describe it as a way to be open about your process. With that mindset, of doing open science, they don't worry about being able to solve every analysis problem prior to do it.