Comments on The 20% Statistician: How a p-value between 0.04-0.05 equals a p-value between 0.16-0.17

Makes sense. So perhaps a nuanced moral could be s...

2015-03-27T21:08:01.557+01:00

Makes sense. So perhaps a nuanced moral could be something like this:

"Across a wide range of possible and highly plausible values for power, .04-.05 probably does in fact constitute evidence. If power is below 25% or over 75%, start getting skeptical of those pesky p = .042s."

Of course, given the conditions that would lead to power below 25%, there would probably be loads of reasons to be skeptical. I'm guessing you're looking at a tiny N, which is its own problem on numerous fronts. This equates to a d of .4, and an N under 20. My first concern with an N under 20 per group isn't necessarily how I should interpret p = .042.

The potentially more interesting extension is that p = .042 wouldn't be evidence in a huge study. That case will be more rare, I'm guessing. But worth keeping in mind.

Good chat.

Definitely - I think both points are important. It...

2015-03-27T20:39:03.455+01:00

Definitely - I think both points are important. It often is evidence, but sometimes it's not. No need to be overly critical of p-values (as you see, I'm one of the few people consistently saying they are useful and taking the time to explain to people when they are).

Sure, we set it at 25% and the ratio dips just bel...

2015-03-27T20:31:44.844+01:00

Sure, we set it at 25% and the ratio dips just below 3. I'm not missing the point here, because in many real-world-relevant conditions, .04-.05 can be interpreted as evidence, even if we decide to adopt the highly arbitrary ratio of 3 as a cutoff. This directly contradicts your claim that "you should not think of p-values between .04-.05 as evidence."

Perhaps you meant "you should not NECESSARILY think of p-values between .04 and .05 as evidence because the situation is complicated." Which nobody who knows stats would disagree with. But your statement implied the stronger step that we should ignore the .04-.05 range as evidence, full stop.

Fact is, the visualizations show that if power is between about 27% and 76%, Ha is at least 3 times as likely as Ho for p-values between .04-.05.

That's a rather large range of power for which the statement "you should not think of p-values between 0.04-0.05 as evidence" does not seem to be true. It spans about half of the logically possible values of power, and I'm guessing most of the real-world plausible values.
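The power range quoted above can be sanity-checked without the visualization. What follows is only a rough sketch: it assumes a two-sided z-test (a normal approximation rather than the t-tests behind the post's simulations), and the function name `ratio_from_power` is made up for illustration.

```r
# Ratio of P(lowp < p <= highp | H1) to P(lowp < p <= highp | H0), given power.
# Assumption: two-sided z-test; delta is derived from power while ignoring the
# small chance of a significant result in the wrong direction, so the
# approximation is poor when power is close to alpha.
ratio_from_power <- function(power, lowp = 0.04, highp = 0.05, alpha = 0.05) {
  delta  <- qnorm(power) + qnorm(1 - alpha / 2)  # standardized true effect
  z_low  <- qnorm(1 - lowp / 2)                  # |z| that gives p = lowp
  z_high <- qnorm(1 - highp / 2)                 # |z| that gives p = highp
  # Probability under H1 that |z| falls between the two cutoffs (both tails)
  p_h1 <- (pnorm(z_low - delta) - pnorm(z_high - delta)) +
          (pnorm(-z_high - delta) - pnorm(-z_low - delta))
  # Under H0 p-values are uniform, so the same probability is highp - lowp
  p_h1 / (highp - lowp)
}

powers <- c(0.20, 0.27, 0.50, 0.76, 0.90)
print(round(sapply(powers, ratio_from_power), 2))
```

Passing `lowp = 0.03` gives the corresponding ratio for the wider .03-.05 range.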

Again, close. Use the visualization. You want powe...

2015-03-27T14:56:16.084+01:00

Again, close. Use the visualization. You want power close to 25%? Use d = 0.4, with n = 20. You see you drop below 3. Bayesians call anything below 3 'anecdotal evidence', which I perhaps overstated as extremely weak (but perhaps not). If you continue to interpret p between 0.04-0.05 as evidence (3X), you've missed the point. If grad students say p = 0.04 supports the null, they are not specifying their priors either, and missing the point. Remember very high power and very low power lead to p 0.04-0.05 being difficult to interpret as 'evidence'.
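For anyone without the visualization at hand, the d = 0.4, n = 20 case can be approximated directly from the noncentral t distribution. This is a sketch under stated assumptions: a two-sided Student t-test with equal variances and n participants per group (the post's simulations use `t.test`, which defaults to Welch's test), and `ratio_from_d` is a hypothetical helper name.

```r
# How much more likely is a p-value in (lowp, highp] under H1 than under H0?
# Assumption: two-sided, equal-variance t-test with n participants per group.
ratio_from_d <- function(d, n, lowp = 0.04, highp = 0.05) {
  df     <- 2 * n - 2
  ncp    <- d * sqrt(n / 2)        # noncentrality parameter under H1
  t_low  <- qt(1 - lowp / 2, df)   # |t| that gives p = lowp
  t_high <- qt(1 - highp / 2, df)  # |t| that gives p = highp
  # Probability under H1 that |t| falls between the two cutoffs (both tails)
  p_h1 <- (pt(t_low, df, ncp) - pt(t_high, df, ncp)) +
          (pt(-t_high, df, ncp) - pt(-t_low, df, ncp))
  p_h1 / (highp - lowp)            # p is uniform under H0
}

print(ratio_from_d(d = 0.4, n = 20))                     # ratio for p in .04-.05
print(power.t.test(n = 20, delta = 0.4, sd = 1)$power)   # power of that design
```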

Seems to me that your moral isn't right either...

2015-03-27T14:40:36.722+01:00

Seems to me that your moral isn't right either. Under realistic conditions (looking at our woefully underpowered literature), this suggests that a p-value of .04-.05 is better support for the alternative than the null by a factor of at least 3-4. Bayesians would be willing to talk about one hypothesis being 3x more likely; why should we call it "extremely weak evidence?"

Sure, a tiny minority of our literature might have power of 95%. But for the bulk of the literature where power is closer to 25% than 95%, p-values near .04 do suggest evidence, at least in a relative sense. Yet I've heard grad students and others naively apply the logic presented here to say that p=.04 actually supports the null. The logic in this post clearly shows that to be untrue.

My 1 in 10000 comment was tongue-in-cheek. But given that 242 per is what it takes to get 95% power for a median effect size in social psych, and we rarely see 242 per, I think it's pretty safe to say that very very few social psych studies need to worry about what happens with 95% power. That's like worrying if 100% solar powered hovercars driven by robots will put taxi drivers out of business. Maybe some day we'll get there, but I'm not holding my breath ;)

Now, there are myriad problems with underpowered studies. But devaluing p=.04 doesn't seem to be one of the most pressing.

Close. The moral is that you should not think of p...

2015-03-27T06:56:01.392+01:00

Close. The moral is that you should not think of p-values between 0.04-0.05 as evidence. You should think about those p-values as 'extremely weak evidence, assuming power is not extremely high'. Your estimate of 10000 for when you have power > .95 is off - I clearly mention you already achieve this with 242 participants in two conditions, which is rare, but occurs already.

So, one moral of this story is that, given the pow...

2015-03-27T03:53:02.394+01:00

So, one moral of this story is that, given the power we typically see in social psych studies (50% on a very good day), a p-value between .04 and .05 actually is providing evidence that's more supportive of the alternative than the null (given equal priors, blah blah blah).

I'll start worrying more about the pesky .04-.05 range as soon as I see more than 1 social psych paper in 10,000 with > 95% power. That'll be a great problem to have one day.

Expanding this to the .03-.05 range (which some have flagged as problematic), the alternative is more than 4x better than the null. And we don't see Ho = Ha until power exceeds 96%.

Sorry - not on twitter. Maybe one day!

2015-03-22T20:40:35.184+01:00

Sorry - not on twitter. Maybe one day!

This actually came up on twitter just today. https...

2015-03-22T01:11:38.700+01:00

This actually came up on twitter just today. https://twitter.com/AlexanderEtz/status/579421141694951424

Three experiments:
n1 = n2 = 30, p = .031, d = .57
n1 = n2 = 50, p = .029, d = .44
n1 = n2 = 100, p = .024, d = .32
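As a quick consistency check, the p-values listed above can be roughly recovered from n and d alone. The sketch assumes an equal-variance two-sample t-test and that d is the standardized mean difference, so t = d * sqrt(n / 2); because d is rounded, the recomputed values only match to about the second or third decimal.

```r
# Recompute two-sided p-values from per-group n and Cohen's d.
# Assumption: equal-variance two-sample t-test with n1 = n2 = n.
n <- c(30, 50, 100)
d <- c(0.57, 0.44, 0.32)

t_obs <- d * sqrt(n / 2)        # t statistic implied by d and n
df    <- 2 * n - 2
p     <- 2 * pt(-abs(t_obs), df)

print(round(p, 3))
```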

Even if we set the scale on a Cauchy prior to be exactly the observed ES, which of course is unreasonable, BFs are still roughly 2. Even for the experiment with the smallest n (30 per group) and largest ES (d = .57).

Goes to show that p doesn't just fail when n is overly large!

PS- Thom, are you on twitter?

Following Alexander's point about likelihood, ...

2015-03-21T17:06:16.313+01:00

Following Alexander's point about likelihood, it is fairly easy to show that the strongest evidence p = .05 can supply still leaves around a .128 chance that H0 is true (assuming both H1 and H0 are equally likely a priori). This is based on the likelihood ratio for a just significant test with H1 being that the parameter is at the maximum of the likelihood (the hypothesis with the strongest evidence relative to the null).
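The .128 figure can be reproduced in a few lines. This is a sketch assuming a two-sided z-test (normal approximation) and 1:1 prior odds; under those assumptions .128 comes out as the smallest posterior probability of H0, which puts H1 at about .872 at most.

```r
# Maximum likelihood ratio for a just-significant result (p = .05).
# Assumption: z-test, with H1 placed exactly at the observed effect.
z      <- qnorm(1 - 0.05 / 2)   # z for two-sided p = .05 (about 1.96)
max_lr <- exp(z^2 / 2)          # density at the peak divided by density under H0
post_h0 <- 1 / (1 + max_lr)     # posterior P(H0 | data) with equal prior odds

print(round(c(max_lr = max_lr, post_h0 = post_h0), 3))
```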

Hey Daniel, Regarding your pps: The calculation ...

2015-03-20T22:52:39.148+01:00

Hey Daniel,

Regarding your pps:

The calculation of the posterior probability that a hypothesis is true is not always the goal of Bayesian inference. Hypothesis testing by Bayes factors, for example, does not tell you what the posterior probability of your hypothesis is, Pr(H|D). It also does not tell you the posterior odds if you are comparing two hypotheses, Pr(H1|D)/Pr(H2|D) (unless you start with 1:1 prior odds). Instead, hypothesis testing with Bayes factors tells you how you should update whatever prior odds you hold into posterior odds, now that you've seen the data. The *evaluation* of the evidential value of the data can stand alone from the prior or posterior probability/odds of a hypothesis.

Bayes Factors tell you the relative predictive success of the two hypotheses under consideration, and this is formulated as Pr(D|H1)/Pr(D|H2). The prior distributions on the parameters for your hypotheses can even be spike priors, like d=0 and d=.3 in this post, and in that case you just simplify your Bayes factor to the Plain Jane Likelihood ratio.
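To make the spike-prior case concrete, here is a minimal sketch of a likelihood ratio for d = .3 against d = 0, followed by the update from prior to posterior odds. The function name `lr_spike`, the observed t of 2.2, and n = 50 per group are all made up for illustration, and the test is assumed to be an equal-variance two-sample t-test.

```r
# Likelihood ratio Pr(D | H1) / Pr(D | H2) for two point ("spike") hypotheses.
# Assumption: the data are summarized by a two-sample t statistic with n per
# group, so each hypothesis predicts a noncentral t distribution.
lr_spike <- function(t_obs, n, d1 = 0.3, d2 = 0) {
  df <- 2 * n - 2
  dt(t_obs, df, ncp = d1 * sqrt(n / 2)) / dt(t_obs, df, ncp = d2 * sqrt(n / 2))
}

bf <- lr_spike(t_obs = 2.2, n = 50)   # hypothetical observed t and sample size

# The same evidence updates different prior odds into different posterior odds
prior_odds     <- c(even = 1, skeptic = 1 / 4)
posterior_odds <- bf * prior_odds

print(round(bf, 2))
print(round(posterior_odds, 2))
```

The likelihood ratio is the same for both readers; only the prior odds they bring to it differ.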

I think you would really enjoy Likelihoods based on this post and other conversations we've had. They control for probabilities of misleading evidence and they only use spike priors :) Here are two links if you are interested:

http://www.stat.fi/isi99/proceedings/arkisto/varasto/roya0578.pdf

http://www.sortie-nd.org/lme/Bayesian%20methods%20in%20ecology/Royall_2004_Likelihood_Paradigm.pdf

PLUS- Using likelihoods doesn't mean you are a Bayesian (Royall certainly was not), so you could stay on the NHST ship if you really wanted to.

I've got an example here, with mapply embedded...

2015-03-20T18:50:49.819+01:00

I've got an example here, with mapply embedded in another function that is then run 10000 times: http://janhove.github.io/analysis/2014/08/20/adjusted-pvalues-breakpoint-regression/

This is an easier example, though:

###
mapply(sum, c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
[1] 12 15 18
###

You apply the function over the first element of each argument (1+4+7), then over the second (2+5+8) etc.

Thanks so much! I've seen the mapply function ...

2015-03-20T17:38:47.340+01:00

Thanks so much! I've seen the mapply function before, but never really understood how it should be used. I use variations of these simulations, and they indeed take a long time sometimes, so this will definitely be useful!

Thanks for the food for thought. I'll need to ...

2015-03-20T17:12:37.518+01:00

Thanks for the food for thought. I'll need to think about this some more before actually commenting, but in the meantime, I wanted to give you a pointer about R code, if you don't mind.

The for-loop has the advantage that it's ubiquitous in basic programming courses and semantically fairly transparent. But it slows down your simulation (takes about 16 seconds on my machine). Below is an alternative that takes less than 2 seconds.
First, it defines the variables of interest.
Then, the code for a single simulation (one draw) is defined.
This code is then run 10000 times using the replicate function (see ?replicate or ?mapply for more complicated stuff). The 10000 p-values are stored in a vector.
The rest is the same as in your code (I prefer sum(ps > lowp & ps <= highp) to length(p[p > lowp & p <= highp]) because it's a neat R trick :)).

It's no big deal for this simulation, but in case you wanted to run more arduous simulations, it cuts down on waiting time.

HTH - Jan

###
nSims <- 10000
lowp <- 0.04
highp <- 0.05

# Simulate one draw: return the p-value of a t-test on two groups of size N
oneRun.fnc <- function(N = 32) {
  x <- rnorm(n = N, mean = 100, sd = 20)
  y <- rnorm(n = N, mean = 110, sd = 20)
  z <- t.test(x, y)
  return(z$p.value)
}

# Simulate multiple draws; the p-values are stored in a vector
pvalues <- replicate(nSims, oneRun.fnc(N = 32))

# Calculate power in simulation
cat("The power is", (sum(pvalues < 0.05) / nSims * 100), "%\n")

# Count the p-values between lowp and highp (summing a logical vector counts the TRUEs)
p2 <- sum(pvalues > lowp & pvalues <= highp)

cat("The probability of finding a p-value between", lowp, "and", highp, "is",
    (p2 / nSims * 100), "%,\nwhich makes it",
    ((p2 / nSims * 100) / ((highp - lowp) * 100)),
    "times more probable under the alternative hypothesis than the null-hypothesis",
    "(numbers below 1 mean the observed p-value is more likely under the null-hypothesis than under the alternative hypothesis)\n")

# Now plot a histogram of p-values (the leftmost bar contains all p-values between 0.00 and 0.05)
hist(pvalues, main = "Histogram of p-values", xlab = "Observed p-value", breaks = 20)
###

My third sentence in my comment at 7:26 AM. Your...

2015-03-20T16:11:31.331+01:00

My third sentence in my comment at 7:26 AM.

Your claim that "We can never know if the null is false", is either inconsistent with your previous post "The Null Is Always False (Except When It Is True)" (because what's the point of arguing whether null is false if we can't know) or irrelevant to our discussion (since I'm using the same language as you did in your previous post, regardless of whether this language is super-accurate).

which third sentence exactly? We can never know if...

2015-03-20T15:36:57.443+01:00

which third sentence exactly? We can never know if the null is false or the alternative is false. But we can use methods that make it unlikely we make too many errors, in the long term.

The third sentence should read "We don't ...

2015-03-20T15:28:22.985+01:00

The third sentence should read "We don't know whether the null is TRUE".

Nah, you haven't answered anything. You have j...

2015-03-20T15:26:20.906+01:00

Nah, you haven't answered anything. You have just shown that some tests in the many labs study fail to reject null. We don't know whether the null is false. It may be that just your sample size is not large enough to detect that particular effect.

"But yes, after you have discovered something is a signal and not noise, move on to estimation." Why after? Why do we need this two step procedure? Seems like a completely unnecessary research slow-down to me. The ES estimate gives you the ability to make any comparison including a comparison with your null.

Thanks, changed it to the 0.3 it should have been!...

2015-03-20T15:23:40.347+01:00

Thanks, changed it to the 0.3 it should have been! I'll leave your comment, credit where credit is due, thanks for taking the effort to point this out!

and feel free to delete the smartypants comment af...

2015-03-20T15:21:09.761+01:00

and feel free to delete the smartypants comment after fixing it..

there's a typo in one paragraph, albeit d = 0....

2015-03-20T15:20:25.024+01:00

there's a typo in one paragraph, albeit d = 0.03 is a small effect, 484 participants would not be enough to detect it

Hi, luckily I have already answered that question ...

2015-03-20T14:55:00.262+01:00

Hi, luckily I have already answered that question in this post: The Null Is Always False (Except When It Is True) http://daniellakens.blogspot.nl/2014/06/the-null-is-always-false-except-when-it.html. But yes, after you have discovered something is a signal and not noise, move on to estimation (if that answers a question you are interested in).

P.P.P.S. But Daniel, why would anyone "try to...

2015-03-20T14:52:24.100+01:00

P.P.P.S. But Daniel, why would anyone "try to infer whether the observed effect is random noise"? We are not shuffling cards or throwing dice, we are testing human beings whose behavior (and hence your measurements) will always be a product of a highly structured process. Your null hyp is always false. Either go back to studying card games or move on to parameter estimation.