The 20% Statistician: The probability of p-values as a function of the statistical power of a test

Friday, May 30, 2014

The probability of p-values as a function of the statistical power of a test

I used to be really happy with any p-value smaller than .05, and very disappointed when p-values turned out to be higher than .05. Looking back, I realize I was suffering from a bi-polar p-value disorder. Nowadays, I interpret p-values more evenly. Instead of a polar division between p-values above and below the .05 significance level, I use a gradual interpretation of p-values. As a consequence, I'm no longer very convinced something is going on by p-values between .02 and .05. Let me explain.

In my previous blogpost, I explained how p-values can be calibrated to provide best-case posterior probabilities that H0 is true. High p-values leave quite something to be desired, with a p = .05 yielding a best-case scenario with a 71% probability that H1 is true (assuming H0 and H1 are a-priori equally likely). Here, I want to move beyond best-case scenarios. Instead of only looking at p-values, we are going to look at the likelihood that a p-value represents a true effect, given the power of a statistical test.
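To make the 71% figure concrete, here is a minimal sketch of the calibration from Sellke, Bayarri, & Berger (2001), which bounds the Bayes factor in favor of H0 from below by -e × p × ln(p) for p < 1/e. The function names are my own, chosen for illustration, and the default prior of .5 for H0 matches the equal-priors assumption above.

```python
import math


def min_bayes_factor(p):
    """Lower bound on the Bayes factor for H0 versus H1: -e * p * ln(p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound only applies for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def max_posterior_h1(p, prior_h0=0.5):
    """Best-case posterior probability of H1, given the bound and a prior for H0."""
    prior_odds_h0 = prior_h0 / (1 - prior_h0)
    posterior_odds_h0 = prior_odds_h0 * min_bayes_factor(p)
    # Convert the odds in favor of H0 into a probability for H1.
    return 1 / (1 + posterior_odds_h0)


for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: best-case Pr(H1 | data) = {max_posterior_h1(p):.3f}")
```

For p = .05 this gives a best-case probability for H1 of about .71. Changing prior_h0 shows how quickly that best case erodes when H0 is more plausible a priori.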

This blog post is still based on the paper by Sellke, Bayarri, & Berger (2001). The power of a statistical test that yields a specific p-value is determined by the size of the effect, the significance level, and the size of the sample. The more observations, and the larger the effect size, the higher the statistical power. The higher the statistical power, the higher the likelihood of observing a small (e.g., p = .01) compared to a high (e.g., p = .04) p-value, assuming there is a true effect in the population. We can see this in the figure below. The top and bottom halves of the figure display the same information, but the scale showing the percentage of expected p-values differs (from 0-100 in the top, from 0-10 in the bottom, where the percentages for p-values between .00 and .01 are cut off at 10%). As the top pane illustrates, the probability of observing a p-value between .00 and .01 is more than twice as large if a test has 80% power, compared to when the test has only 50% power. In an extremely high powered experiment (e.g., 99% power) the p-value will be smaller than .01 in approximately 96% of the tests, and between .01 and .05 in only 3.5% of the tests.
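The percentages described above can be approximated with a few lines of code. This is a sketch under a simplifying assumption: it treats the test as a two-sided z-test (a normal approximation), so the numbers will differ slightly from an exact t-test. The function names are hypothetical and not taken from any script behind the figure.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentrality_for_power(power, alpha=0.05):
    """Shift of the z statistic that gives the requested power.

    Ignores the negligible chance of rejecting in the wrong tail.
    """
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def prob_p_below(p, delta):
    """Probability that a two-sided z-test yields a p-value below p."""
    z_crit = STD_NORMAL.inv_cdf(1 - p / 2)
    return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)


def prob_p_between(lower, upper, delta):
    """Probability that the p-value falls between lower and upper."""
    low = prob_p_below(lower, delta) if lower > 0 else 0.0
    return prob_p_below(upper, delta) - low


bin_edges = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
scenarios = [("no effect", 0.0)] + [
    (f"{int(100 * power)}% power", noncentrality_for_power(power))
    for power in (0.50, 0.80, 0.99)
]
for label, delta in scenarios:
    shares = [
        prob_p_between(lo, hi, delta)
        for lo, hi in zip(bin_edges, bin_edges[1:])
    ]
    print(f"{label:>10}: " + "  ".join(f"{100 * s:5.1f}%" for s in shares))
```

Each row lists the expected percentage of p-values in the bins .00-.01, .01-.02, and so on up to .05. The "no effect" row uses a shift of zero, which gives 1% per bin.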

In general, the higher the statistical power of a test, the less likely it is to observe relatively high p-values (e.g., p > .02). As can be seen in the lower pane in the figure, in extremely high powered statistical tests (i.e., 99% power), the probability of observing a p-value between .02 and .03 is less than 1%. If there is no real effect in the population, and the power of the statistical test is 0% (i.e., there is no chance to observe a real effect), p-values are uniformly distributed. This means that every p-value is equally likely to be observed, and thus that 1% of the p-values will fall within the .02 and .03 interval. As a consequence, when a test with extremely high statistical power returns a p = .024, this outcome is more likely when the null hypothesis is true than when the alternative hypothesis is true (the bar for a p-value between .02 and .03 is higher when power = 0% than when power = 99%). In other words, with such high power, a statistical difference at the p < .05 level is surprising, assuming the null-hypothesis is true, but should still be interpreted as support for the null-hypothesis (we also explain this in Lakens & Evers, 2014).
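The comparison for a single p-value such as .024 can be sketched as a ratio of densities. Under the null hypothesis the density of the p-value is 1 everywhere (uniform), so the density under the alternative is itself the likelihood ratio. As an assumption for illustration, the code again uses a two-sided z-test rather than a t-test, and the function name is my own.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value_density(p, delta):
    """Density of a two-sided p-value when the z statistic is shifted by delta.

    With delta = 0 (no effect) this equals 1 for every p, i.e. uniform.
    """
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    numerator = STD_NORMAL.pdf(z - delta) + STD_NORMAL.pdf(z + delta)
    return numerator / (2 * STD_NORMAL.pdf(z))


z_alpha = STD_NORMAL.inv_cdf(0.975)  # critical value for alpha = .05
for power in (0.50, 0.80, 0.99):
    delta = z_alpha + STD_NORMAL.inv_cdf(power)
    ratio = p_value_density(0.024, delta)
    print(f"power = {power:.2f}: density at p = .024 relative to H0 = {ratio:.2f}")
```

A value below 1 means that p = .024 is more likely under the null hypothesis than under the alternative. That is what happens at 99% power, while at 50% or 80% power the same p-value is more likely under the alternative.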

The fact that with increasing sample size, a result can at the same time be a statistical difference with p < .05, while also being stronger support for the null-hypothesis than for the alternative hypothesis, is known as Lindley’s paradox. This isn’t a true paradox; things just get more interesting to people if you call them a paradox. There are just two different questions that are asked. First, the probability of the data, assuming the null-hypothesis is true, or Pr(D|H0), is very low. Second, the probability of the alternative hypothesis is lower than the probability of the null-hypothesis, given the data, or Pr(H1|D)<Pr(H0|D). Although it is often interpreted by advocates of Bayesian statistics as a demonstration of the ‘illogical status of significance testing’ (Rouder, Morey, Verhagen, Province, & Wagemakers, in press), it is also an illustration of the consequences of using improper priors in Bayesian statistics (Robert, 2013).
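A small numerical sketch can show how the two questions come apart. The setup here is my own assumption and not taken from the papers cited above: a one-sample z-test with known standard deviation, and under H1 a normal prior on the true mean centered on zero with a standard deviation of 1. The z statistic is held at 1.96 (p of about .05) while the sample size grows.

```python
import math


def bayes_factor_01(z, n, prior_sd=1.0, sigma=1.0):
    """Bayes factor for H0 (mean = 0) versus H1 (mean ~ Normal(0, prior_sd^2)).

    Assumes a sample mean from n observations with known standard deviation
    sigma; z is the usual test statistic, sample mean / (sigma / sqrt(n)).
    """
    variance_ratio = n * prior_sd**2 / sigma**2
    shrink = variance_ratio / (variance_ratio + 1)
    return math.sqrt(1 + variance_ratio) * math.exp(-0.5 * z**2 * shrink)


z = 1.96  # two-sided p of about .05, whatever the sample size
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: BF01 = {bayes_factor_01(z, n):.2f}")
```

Values above 1 favor the null hypothesis. With the same p-value, the Bayes factor moves from favoring H1 in small samples to favoring H0 in large ones, and how fast it does so depends on the prior_sd you assume, which is the point about priors made above.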

An extension of these ideas is now more widely known in psychology as p-curve analysis (Simonsohn, Nelson, & Simmons, 2014, see www.p-curve.com). However, you can apply this logic (with care!) when subjectively evaluating single studies as well. In a well-powered study (with power = 80%) the odds of a statistical difference yielding a p-value smaller than .01 compared to a statistical difference between .01 and .05 are approximately 3 to 1. In general, the lower the p-value, the more the result supports the alternative hypothesis (but don't interpret p-values directly as support for H0 or H1, and always consider the prior probability of H0). Nevertheless, 'sensible p-values are related to weights of evidence' (Good, 1992), and the lower the p-value the better. A p-value for a true effect can be higher than .03, but it's relatively unlikely to happen a lot across multiple studies, especially when sample sizes are large. In small sample sizes, there is a lot of variation in the data, and a relatively high percentage of higher p-values is possible (see the figure for 50% power). Remember that if studies only have 50% power, there should also be 50% non-significant findings.
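The roughly 3 to 1 odds can be checked with a quick simulation of many studies of the same true effect. This is only a sketch: it simulates z statistics directly instead of raw data, assumes a two-sided z-test, and uses a fixed seed so the run is repeatable. The function name and the number of simulated studies are arbitrary choices of mine.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_p_values(power, n_studies, alpha=0.05, seed=1):
    """Simulate two-sided p-values from z-tests run at the given power."""
    rng = random.Random(seed)
    # Shift of the z statistic that yields the requested power.
    delta = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(delta, 1.0)
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return p_values


p_values = simulate_p_values(power=0.80, n_studies=100_000)
below_01 = sum(p < 0.01 for p in p_values)
between_01_05 = sum(0.01 <= p < 0.05 for p in p_values)
nonsignificant = sum(p >= 0.05 for p in p_values)

print(f"odds of p < .01 versus .01 <= p < .05: {below_01 / between_01_05:.2f} to 1")
print(f"share of non-significant results: {nonsignificant / len(p_values):.3f}")
```

Rerunning with power=0.50 shows the point about low-powered studies: about half of the simulated results are non-significant, and the significant ones are spread more evenly between .00 and .05.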

The statistical reality explained above also means that in high-powered studies (e.g., with a power of .99, which you reach when you collect 400 participants, divided over 2 conditions in an independent t-test, and the effect size is d = .43), setting the significance level to .05 is not very logical. After all, p-values > .02 are not even more likely under the alternative hypothesis than under the null-hypothesis. Unlike my previous blog, where subjective priors were involved, this blog post is focused on the objective probability of observing p-values under the null hypothesis and the alternative hypothesis, as a function of power. It means that we need to stop using fixed significance levels of α = .05 for all our statistical tests, especially now that we are starting to collect larger samples. As Good (1992) remarks:

‘The real objection to p-values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’
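The power of the high-powered example mentioned above (400 participants over 2 conditions, d = .43) can be checked with a short calculation. This sketch uses a normal approximation to the independent t-test, which is very close at this sample size but not exact. The function name is hypothetical.

```python
from statistics import NormalDist


def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation: the test statistic is shifted by
    d * sqrt(n_per_group / 2) when the true standardized effect is d.
    """
    std_normal = NormalDist()
    delta = d * (n_per_group / 2) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return (1 - std_normal.cdf(z_crit - delta)) + std_normal.cdf(-z_crit - delta)


power = approx_power_two_groups(d=0.43, n_per_group=200)
print(f"approximate power: {power:.3f}")
```

The result is close to .99. Lowering alpha in the call (for example to .01) shows how much power remains when a stricter significance level is used in a sample of this size.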

How we can decide which significance level we should use, depending on our sample size, is a topic for a future blog post. With which I mean to say I haven't completely figured out how it should be done. If you have, I'd appreciate a comment below.

22 comments:

Tony, May 30, 2014 at 3:21 PM
Here's a naive thought. Given what we have learned from the reproducibility project, that original findings with p-values between .02 and .05 replicated extremely rarely, would it be crazy to move away from .05 to a value such as .01?

I've begun to do this myself in my own work (e.g., I recently observed an anticipated effect with a p value of .04; my next step will be to try to replicate this effect, and make some procedural changes to hopefully make it larger, before thinking about trying to publish it).

Of course, this naive proposal assumes no p-hacking (otherwise, there would be a rash of findings just below p = .01 that would not replicate).
johann, May 30, 2014 at 4:07 PM
in addition to reporting simply a p value, also report a p* value which indicates the likelihood of observing a test statistic of the magnitude found under the alternative hypothesis (to be specified in advance). if the study is underpowered, power will be low and hence the likelihood of observing a t value (for example) with p=.049 is not very much larger than the likelihood of observing that same t/p value pair under H1. you may have found a significant effect, but with much uncertainty as to which distribution that value comes from. p/p* will be closer to zero in a study with higher power however. if the study is adequately powered for the specified effect, then p* will exceed p to a higher degree, and p/p* will become smaller.

if, however, the pre-specified effect is very large, then a test statistic with p = .049 - as you point out - might still be relatively more likely under the null than under the alternative hypothesis with specified huge effect size. then p/p* will become > 1 and the result - albeit associated with a small p value and huge power - still speaks in favor of the null rather than the alternative, but with much uncertainty. hence, the p value should be much smaller (and with that, the likelihood of the test statistic occurring under the alternative hypothesis becomes larger again), so that p/p* again decreases.

ideally p/p* should be zero (no chance of a significant finding being a false positive, but every chance to find the true effect).

i believe that this is, however, the same idea as that behind using the bayes factor instead of simple p values that are based on the assumption that the null is true. it is also represented in classical neyman pearson testing, where you are not to only look at the p value, but also specify a target effect size beforehand and after results are in, inspect the found effect size for consistency with the target effect size you planned with (not necessarily whether it is larger or smaller, but whether it is in the vicinity of the planned effect). only if the empirical effect is approximately the size of the effect size planned with, the p value is a strong indication of the alternative being true. if the found effect is smaller than the one planned or much larger, then alpha and beta errors are out of control and either not much can be said anymore about what the study tells you (e.g., p value is very small, but p* is much smaller than p, but on the left side of the noncentral distribution - with a much larger planned effect) or you are running the risk that the found statistic - even though its likelihood / p value under the null is quite small -, the likelihood under the alternative is also small (this time on the right side of the noncentral distribution). then the beta error is also large and power quite low, so that finding something significant is arguably a lucky coincidence much more than a stable phenomenon you would bet a lot of money to find again with a similar small sample.
Matus, May 30, 2014 at 7:23 PM
This idea is a dead-end. The power calculations depend on the effect size, which is unknown. One approach is to use the post-hoc estimate of the effect size. In general this estimator over-estimates the magnitude of the effect size, so we need some kind of correction. In Bayesian stats this is handled by a prior which locates most probability mass around zero. (Another approach is to use a hierarchical prior.) Another option is to derive the effect size (distribution) from the literature. Or, as a third option, you derive the effect size distribution based on some domain/topic-specific theoretical considerations. In any case, all of this has been attempted and is routinely done in the Bayesian literature, when researchers try to justify their priors. The question then is, shouldn't we use the Bayesian approach right away, instead of attempting to patch p-values so that they emulate Bayesian inference?
Unknown, June 1, 2015 at 2:35 PM
Hi,
Thanks for this great post. Could you provide an R script to reproduce those figures?