This blog
post is presented in collaboration with a new interactive visualization of the distribution of p-values created by Kristoffer Magnusson (@RPsychologist) based
on code by JP de Ruiter (@JPdeRuiter).
Question 1: Would you be inclined to interpret a p-value between 0.16 and 0.17 as support
for the presence of an effect, assuming the power of the study was 50%? Write
down your answer – we will come back to this question later.
Question 2: If you have 95% power, would you be inclined
to interpret a p-value between 0.04
and 0.05 as support for the presence of an effect? Write down your answer – we
will come back to this question later.
If you gave a different answer to Question 1 than to Question 2, you are over-relying on p-values, and you’ll want to read this
blog post. If you have been collecting larger sample sizes, and continue to rely on p < 0.05 to guide your statistical
inferences, you’ll also want to read on.
When we
have collected data, we often try to infer whether the observed effect is
random noise (the null hypothesis is true) or a signal (the alternative
hypothesis is true). It is useful to consider how much more likely it is that we observed a specific p-value when the alternative hypothesis is true than when the null hypothesis is true. We can do this by thinking about (or simulating, or using the great visualization by Kristoffer Magnusson) how often we can expect to observe a specific p-value when the alternative hypothesis is true, compared to when the null hypothesis is true.
The latter
is easy. When the null-hypothesis is true, every p-value is equally likely. So we can expect 1% of p-values to fall between 0.04 and 0.05.
When the
alternative hypothesis is true, we have a probability of finding a significant
effect, which is the statistical power of the test. As power increases, the p-value distribution changes (play
around with the visualization to see how it changes). Very high p-values (e.g., p = 0.99) become less likely, and low p-values (e.g., p = 0.01)
become more likely. This shift is gradual. If you have 20% power, a p-value between 0.26 and 0.27 is still just as likely under the alternative hypothesis (1%) as under the null hypothesis (only p-values higher than this are less likely under the alternative hypothesis than under the null hypothesis). When power is 50%, a p-value between 0.17 and 0.18 is just as likely when the alternative hypothesis is true as when the null hypothesis is true (both are again 1% likely to occur).
If the
power of the test is 50%, a p-value between 0.16 and 0.17 is 1.1% likely. That means it is slightly more likely under the alternative hypothesis than under the null hypothesis, but not by much (see Question 1).
If the
power of the test is 50% (not uncommon in psychology experiments), p-values between 0.04 and 0.05 can be
expected around 3.8% of the time, while under the null hypothesis, these p-values can only be expected around 1%
of the time. This means that such a p-value is 3.8 times more likely to be observed assuming the alternative hypothesis is true than assuming the null hypothesis is true (dear Bayesian friends: I am also assuming the null hypothesis and the alternative hypothesis are equally likely a priori). That’s not a lot, but it is something. Take a
moment to think about which ratio you would need before you would consider
something ‘support’ for the alternative hypothesis (there is no single correct
answer).
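A quick way to check these numbers in R (the design below is just an illustrative assumption: a two-sided two-sample t-test with 32 participants per group and d = 0.5, which has roughly 50% power):
###
# P(0.04 < p < 0.05 | H1) = power at alpha = 0.05 minus power at alpha = 0.04
p_H1 <- power.t.test(n = 32, delta = 0.5, sd = 1, sig.level = 0.05)$power -
        power.t.test(n = 32, delta = 0.5, sd = 1, sig.level = 0.04)$power
p_H0 <- 0.05 - 0.04   # under H0 the p-value is uniform, so 1%
p_H1              # roughly 0.038
p_H1 / p_H0       # roughly 3.8 times more likely under H1 than under H0
###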
As power increases
even more, most of the p-values from
statistical tests will be below 0.01, and there will be relatively few p-values between 0.01 and 0.05. For
example, if you have 99% power with an alpha of 0.05, you might at the same
time have 95% power for an alpha of 0.01. This means that 99% of the p-values can be expected to be lower
than 0.05, but 95% of the p-values will also be below 0.01. That
leaves only 4% of the p-values
between 0.01 and 0.05. I think you’ll see where I am going.
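A rough check of these round numbers (again an assumed design for illustration: d = 0.5 with 150 participants per group gives roughly 99% power at an alpha of 0.05):
###
power.t.test(n = 150, delta = 0.5, sd = 1, sig.level = 0.05)$power  # roughly 0.99
power.t.test(n = 150, delta = 0.5, sd = 1, sig.level = 0.01)$power  # roughly 0.96
# The small gap between these two numbers is all the p-values between 0.01 and 0.05.
###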
If you have
95% power (e.g., you have 484 participants, 242 in each of two conditions, and you expect
a small effect size of Cohen’s d = 0.3 in a between-participants design), and
you observe a p-value between 0.04
and 0.05, the probability of observing this p-value
is 1.1% when the alternative hypothesis is true. It is still 1% when the null
hypothesis is true (see Question 2).
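You can approximate these interval probabilities yourself. The function below is a sketch based on a normal approximation for a two-sided test (it ignores the tiny probability of finding a significant effect in the wrong direction), taking the power of the test at an alpha of 0.05 as input:
###
p_interval <- function(power, lo, hi, alpha = 0.05) {
  ncp <- qnorm(1 - alpha / 2) + qnorm(power)  # non-centrality implied by the power
  # P(lo < p < hi | H1): difference between the 'power' at alpha = hi and at alpha = lo
  pnorm(ncp - qnorm(1 - hi / 2)) - pnorm(ncp - qnorm(1 - lo / 2))
}
p_interval(0.50, 0.16, 0.17)   # roughly 0.011 (Question 1)
p_interval(0.50, 0.04, 0.05)   # roughly 0.038
p_interval(0.95, 0.04, 0.05)   # roughly 0.010-0.011 (Question 2)
###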
This
example shows how a p-value between 0.16 and 0.17 can give us exactly the same signal-to-noise ratio as a p-value between 0.04 and 0.05. It shows that
when interpreting p-values, it is
important to take the power of the study into consideration.
If your answers to Question 1 and Question 2 were not the same, you are relying too much on p-values when drawing statistical inferences. In both scenarios, the probability of observing such a p-value when the alternative hypothesis is true is 1.1%, and the probability of observing it when the null hypothesis is true is 1%. This difference is not enough
to distinguish between the signal and the noise.
When power
is higher than 96%, p-values between
0.04 and 0.05 become more likely
under the null-hypothesis than under the alternative hypothesis. In such
circumstances, it would make sense to say: “p
= 0.041, which does not provide support for our hypothesis”. If you collect
larger sample sizes, always calculate Bayes Factors for converging support (and
be careful if Bayes Factors do not provide support for your hypothesis). Try
out JASP for software that looks just like
SPSS, but also provides Bayes Factors, and is completely free.
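Using the p_interval() sketch from above, you can see roughly where this crossover happens (treat the exact boundary as an approximation; it depends on the test you use):
###
p_interval(0.95, 0.04, 0.05)   # roughly 0.010: about as likely as under H0
p_interval(0.98, 0.04, 0.05)   # well below 0.01: now less likely than under H0
###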
Even though
it is often difficult to know how much power you have (it depends on the true
effect size, which is unknown), for any study with 580 participants in each
condition, power is 96% for a pretty small true effect size of Cohen’s d = 0.2.
In very large samples, p-values between 0.04 and 0.05 should not be interpreted as support for the alternative hypothesis. Now
that we are seeing more and more large-scale collaborations, and people are making
use of big data, it’s important to keep this fact in mind. Obviously you won’t
determine whether a paper should be published or not based on the p-value (all interesting papers should
be published), but these papers should draw conclusions that follow from the
data.
One way to
think about this blog post, is that in large sample sizes, we might as well use
a stricter Type 1 error rate (e.g., 0.01 instead of 0.05). After all, there is
nothing magical about the use of 0.05 as a cut-off, and we should determine the desired Type 1 error rate and Type 2 error
rate when we design a study. Cohen suggests a ratio of Type 2 error rates to
Type 1 error rates of 4:1, which is reflected in the well-known ‘minimum’ recommendation to aim for 80% power (which would mean a 20% Type 2 error rate) when you have a 5% Type 1 error rate. If my Type 2 error rate is only 5% (because I can be pretty sure I have 95% power), then it makes sense to reduce my Type 1 error rate to 1.25% (or just 1%) to approximately maintain the ratio between Type 1 errors and Type 2 errors that Cohen recommended.
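The arithmetic behind that last step:
###
type2 <- 0.05   # Type 2 error rate when power is 95%
type2 / 4       # 0.0125: the Type 1 error rate that keeps Cohen's 4:1 ratio
###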
Relying too
much on p-values when you draw
statistical inferences is a big problem – everyone agrees on this, whether or
not they think p-values can be
useful. I hope that with this blog post, I’ve contributed a little bit to help
you think about p-values in a more
accurate manner. Below is some R code you can run to see the probability of
observing a p-value between two
limits (which I made before Kristoffer Magnusson created his awesome visualization of the original Mathematica code by JP de Ruiter).
Go ahead and play around with the online visualization to see how much more
likely p-values between 0.049 and
0.050 are when you have 50% power.
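A minimal version of such a simulation (this is a sketch, not the original script; it assumes a two-sample t-test with 32 participants per group and d = 0.5, which gives roughly 50% power):
###
nSims <- 10000
lowp  <- 0.04
highp <- 0.05
p <- numeric(nSims)
for (i in 1:nSims) {
  x <- rnorm(n = 32, mean = 100, sd = 20)   # group 1
  y <- rnorm(n = 32, mean = 110, sd = 20)   # group 2: d = 0.5, roughly 50% power
  p[i] <- t.test(x, y)$p.value
}
cat("Power in this simulation:", sum(p < 0.05) / nSims * 100, "%\n")
cat("P-values between", lowp, "and", highp, ":",
    length(p[p > lowp & p <= highp]) / nSims * 100, "%\n")
hist(p, main = "Histogram of p-values", xlab = "Observed p-value", breaks = 20)
###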
P.S.: But
Daniel, aren’t p-values ‘fickle’ and
don’t they dance around so that they are really useless to draw any statistical inferences?
Well, when 95% of them end up below p=0.05,
I think we should talk of the march of the p-values
instead of the dance of the p-values.
They will nicely line up and behave like good little soldiers. All they need is
a good commander that designs studies in a manner where p-values can be used to distinguish between signal and noise. If
you don’t create a good and healthy environment for our poor little p-values, we can’t blame the children
for their parent’s mistake.
P.P.S.: You
might think: “Daniel, be proud and say it loud: You are a Bayesian!” Indeed, I am
asking you to consider what the true effect size is, so that you can draw an
inference about your data, assuming the alternative hypothesis is true. As Beyoncé
would say: “'Cause if you liked it, then you should have put a ring on it.” As
a Bayesian would say: “'Cause if you liked it, you should have put a prior probability distribution on it.” Yes, I sometimes get a weird tingly feeling
when I incorporate prior information in my statistical inferences. But I am not
yet sure I always want to know the probability that a hypothesis is true, given
the data, or that I even can come close to knowing the probability a hypothesis
is true. But I am very sure I want to control my Type 1 error rate in the long
term. This blog is only 10 months old, so let's see what happens.
P.P.P.S. But Daniel, why would anyone "try to infer whether the observed effect is random noise"? We are not shuffling cards or throwing dice, we are testing human beings whose behavior (and hence your measurements) will always be a product of highly structured process. Your null hyp is always false. Either go back to studying card games or move on to parameter estimation.
Hi, luckily I have already answered that question in this post: The Null Is Always False (Except When It Is True) http://daniellakens.blogspot.nl/2014/06/the-null-is-always-false-except-when-it.html. But yes, after you have discovered something is a signal and not noise, move on to estimation (if that answers a question you are interested in).
Nah, you haven't answered anything. You have just shown that some tests in the Many Labs study fail to reject the null. We don't know whether the null is false. It may be that just your sample size is not large enough to detect that particular effect.
Delete"But yes, after you have discovered something is a signal and not noise, move on to estimation." Why after? Why do we need this two step procedure? Seems like a completely unnecessary research slow-down to me. The ES estimate gives you the ability to make any comparison including a comparison with your null.
The third sentence should read "We don't know whether the null is TRUE".
Which third sentence exactly? We can never know if the null is false or the alternative is false. But we can use methods that make it unlikely we make too many errors, in the long term.
My third sentence in my comment at 7:26 AM.
Your claim that "We can never know if the null is false" is either inconsistent with your previous post "The Null Is Always False (Except When It Is True)" (because what's the point of arguing whether the null is false if we can't know) or irrelevant to our discussion (since I'm using the same language as you did in your previous post, regardless of whether this language is super-accurate).
There's a typo in one paragraph: d = 0.03 would be a very small effect, and 484 participants would not be enough to detect it.
And feel free to delete the smartypants comment after fixing it.
Thanks, changed it to the 0.3 it should have been! I'll leave your comment, credit where credit is due, thanks for taking the effort to point this out!
Thanks for the food for thought. I'll need to think about this some more before actually commenting, but in the meantime, I wanted to give you a pointer about R code, if you don't mind.
The for-loop has the advantage that it's ubiquitous in basic programming courses and semantically fairly transparent. But it slows down your simulation (takes about 16 seconds on my machine). Below is an alternative that takes less than 2 seconds.
First, it defines the variables of interest.
Then, the code for a single simulation (one draw) is defined.
This code is then run 10000 times using the replicate function (see ?replicate or ?mapply for more complicated stuff). The 10000 p-values are stored in a vector.
The rest is the same as in your code (I prefer sum(pvalues > lowp & pvalues <= highp) to length(p[p > lowp & p <= highp]) because it's a neat R trick :)).
It's no big deal for this simulation, but in case you wanted to run more arduous simulations, it cuts down on waiting time.
HTH - Jan
###
nSims <- 10000
lowp <- 0.04
highp <- 0.05
# Simulate one draw
oneRun.fnc <- function(N = 32) {
x <- rnorm(n = N, mean = 100, sd = 20)
y <- rnorm(n = N, mean = 110, sd = 20)
z <- t.test(x,y)
return(z$p.value)
}
# Simulate multiple draws
pvalues <- replicate(nSims,
oneRun.fnc(N = 32))
# Calculate the power in the simulation
cat("The power is", (sum(pvalues < 0.05)/nSims*100), "%\n")
p2 <- sum(pvalues > lowp & pvalues <= highp)
cat ("The probability of finding a p-value between ",lowp," and ", highp," is ",
(p2/nSims*100),"%,\n which makes it ",
((p2/nSims*100)/(((highp-lowp)*100))),"
times more probable under the alternative hypothesis than the null-hypothesis \b
(numbers below 1 mean the observed p-value is more likely under the null-hypothesis than under the alternative hypothesis)\n")
# Now plot a histogram of the p-values (the leftmost bar contains all p-values between 0.00 and 0.05)
hist(pvalues, main="Histogram of p-values", xlab=("Observed p-value"), breaks = 20)
Thanks so much! I've seen the mapply function before, but never really understood how it should be used. I use variations of these simulations, and they indeed take a long time sometimes, so this will definitely be useful!
I've got an example here, with mapply embedded in another function that is then run 10000 times: http://janhove.github.io/analysis/2014/08/20/adjusted-pvalues-breakpoint-regression/
This is an easier example, though:
###
mapply(sum, c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
[1] 12 15 18
###
You apply the function over the first element of each argument (1+4+7), then over the second (2+5+8) etc.
Hey Daniel,
Regarding your P.P.S.:
The calculation of the posterior probability that a hypothesis is true is not always the goal of Bayesian inference. Hypothesis testing by Bayes factors, for example, does not tell you what the posterior probability of your hypothesis is, Pr(H|D). It also does not tell you the posterior odds if you are comparing two hypotheses, Pr(H1|D)/Pr(H2|D) (unless you start with 1:1 prior odds). Instead, hypothesis testing with Bayes factors tells you how you should update whatever prior odds you hold into posterior odds, now that you've seen the data. The *evaluation* of the evidential value of the data can stand alone from the prior or posterior probability/odds of a hypothesis.
Bayes Factors tell you the relative predictive success of the two hypotheses under consideration, and this is formulated as Pr(D|H1)/Pr(D|H2). The prior distributions on the parameters for your hypotheses can even be spike priors, like d=0 and d=.3 in this post, and in that case you just simplify your Bayes factor to the Plain Jane Likelihood ratio.
I think you would really enjoy Likelihoods based on this post and other conversations we've had. They control for probabilities of misleading evidence and they only use spike priors :) Here are two links if you are interested:
http://www.stat.fi/isi99/proceedings/arkisto/varasto/roya0578.pdf
http://www.sortie-nd.org/lme/Bayesian%20methods%20in%20ecology/Royall_2004_Likelihood_Paradigm.pdf
PLUS - Using likelihoods doesn't mean you are a Bayesian (Royall certainly was not), so you could stay on the NHST ship if you really wanted to.
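A minimal sketch of such a point-hypothesis likelihood ratio in R (the sample size and the d = 0.3 spike below are just illustrative assumptions):
###
# Likelihood ratio for two point hypotheses, d = 0.3 versus d = 0, given an
# observed t-value from a two-sample t-test with n participants per group.
lr_point <- function(t, n, d1 = 0.3) {
  df  <- 2 * n - 2
  ncp <- d1 * sqrt(n / 2)                       # non-centrality under H1: d = d1
  dt(t, df = df, ncp = ncp) / dt(t, df = df)    # Pr(t | H1) / Pr(t | H0)
}
# For example, a just-significant two-sided result with 242 participants per group:
lr_point(t = qt(0.975, df = 482), n = 242)
###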
Following Alexander's point about likelihood, it is fairly easy to show that the strongest evidence p = .05 can supply still leaves around a .128 chance that H0 is true (assuming both H1 and H0 are equally likely a priori). This is based on the likelihood ratio for a just-significant test with H1 being that the parameter is at the maximum of the likelihood (the hypothesis with the strongest evidence relative to the null).
This actually came up on twitter just today. https://twitter.com/AlexanderEtz/status/579421141694951424
Three experiments:
n1 = n2 = 30, p = .031, d = .57
n1 = n2 = 50, p = .029, d = .44
n1 = n2 = 100, p = .024, d = .32
Even if we set the scale on a Cauchy prior to be exactly the observed ES, which of course is unreasonable, BFs are still roughly 2. Even for the experiment with the smallest n (30 per group) and largest ES (d = .57).
Goes to show that p doesn't just fail when n is overly large!
PS- Thom, are you on twitter?
Sorry - not on twitter. Maybe one day!
So, one moral of this story is that, given the power we typically see in social psych studies (50% on a very good day), a p-value between .04 and .05 actually is providing evidence that's more supportive of the alternative than the null (given equal priors, blah blah blah).
I'll start worrying more about the pesky .04-.05 range as soon as I see more than 1 social psych paper in 10,000 with > 95% power. That'll be a great problem to have one day.
Expanding this to the .03-.05 range (which some have flagged as problematic), the alternative is more than 4x better than the null. And we don't see Ho = Ha until power exceeds 96%.
Close. The moral is that you should not think of p-values between 0.04-0.05 as evidence. You should think about those p-values as 'extremely weak evidence, assuming power is not extremely high'. Your estimate of 10000 for when you have power > .95 is off - I clearly mention you already achieve this with 242 participants in two conditions, which is rare, but occurs already.
Seems to me that your moral isn't right either. Under realistic conditions (looking at our woefully underpowered literature), this suggests that a p-value of .04-.05 is better support for the alternative than the null by a factor of at least 3-4. Bayesians would be willing to talk about one hypothesis being 3x more likely; why should we call it "extremely weak evidence?"
DeleteSure, a tiny minority of our literature might have power of 95%. But for the bulk of the literature where power is closer to 25% than 95%, p-values near .04 do suggest evidence, at least in a relative sense. Yet I've heard grad students and others naively apply the logic presented here to say that p=.04 actually supports the null. The logic in this post clearly shows that to be untrue.
My 1 in 10000 comment was tongue-in cheek. But given that 242 per is what it takes to get 95% power for a median effect size in social psych, and we rarely see 242 per, I think it's pretty safe to say that very very few social psych studies need to worry about what happens with 95% power. That's like worrying if 100% solar powered hovercars driven by robots will put taxi drivers out of business. Maybe some day we'll get there, but I'm not holding my breath ;)
Now, there are myriad problems with underpowered studies. But devaluing p=.04 doesn't seem to be one of the most pressing.
Again, close. Use the visualization. You want power close to 25%? Use d = 0.4, with n = 20. You see you drop below 3. Bayesians call anything below 3 'anecdotal evidence', which I perhaps overstated as extremely weak (but perhaps not). If you continue to interpret p between 0.04-0.05 as evidence (3X), you've missed the point. If grad students say p = 0.04 supports the null, they are not specifying their priors either, and missing the point. Remember very high power and very low power leads to p 0.04-0.05 being difficult to interpret as 'evidence'.
Sure, we set it at 25% and the ratio dips just below 3. I'm not missing the point here, because in many real-world-relevant conditions, .04-.05 can be interpreted as evidence, even if we decide to adopt the highly arbitrary ratio of 3 as a cutoff. This directly contradicts your claim that "you should not think of p-values between .04-.05 as evidence."
Perhaps you meant "you should not NECESSARILY think of p-values between .04 and .05 as evidence because the situation is complicated." Which nobody who knows stats would disagree with. But your statement implied the stronger step that we should ignore the .04-.05 range as evidence, full stop.
Fact is, the visualizations show that if power is between about 27% and 76%, Ha is at least 3 times as likely as Ho for p-values between .04-.05.
That's a rather large range of power for which the statement "you should not think of p-values between 0.04-0.05 as evidence" does not seem to be true. It spans about half of the logically possible values of power, and I'm guessing most of the real-world plausible values.
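Using the p_interval() approximation sketched earlier in the post, those endpoints do come out at a ratio of roughly 3 (a rough check; the exact boundaries depend on the test):
###
p_interval(0.27, 0.04, 0.05) / 0.01   # roughly 3
p_interval(0.76, 0.04, 0.05) / 0.01   # roughly 3
###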
Definitely - I think both points are important. It often is evidence, but sometimes it's not. No need to be overly critical of p-values (as you see, I'm one of the few people consistently saying they are useful and taking the time to explain to people when they are).
Makes sense. So perhaps a nuanced moral could be something like this:
Delete"Across a wide range of possible and highly plausible values for power, .04-.05 probably does in fact constitute evidence. If power is below 25% or over 75%, start getting skeptical of those pesky p = .042s."
Of course, given the conditions that would lead to power below 25%, there would probably be loads of reasons to be skeptical. I'm guessing you're looking at a tiny N, which is its own problem on numerous fronts. This equates to a d of .4, and an N under 20. My first concern with an N under 20 per isn't necessarily how I should interpret p = .042.
The potentially more interesting extension is that p = .042 wouldn't be evidence in a huge study. That case will be more rare, I'm guessing. But worth keeping in mind.
Good chat.