A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Tuesday, May 26, 2015

After how many p-values between 0.025-0.05 should one start getting concerned about robustness?

Like insects sensitive to the infrared spectrum, researchers have evolved a special sensitivity to p-values between 0.025 and 0.05. Sure, you’ve got your p < .001’s and n.s.’s, but your life as a scientist might depend on your special sensitivity to 0.025<p<0.05.

Our increased sensitivity to these p-values might make us forget what a small part of the p-value spectrum we are talking about here – just 2.5%. At the same time, we are slowly realizing that too many high p-values are rather unlikely. This led Michael Inzlicht to wonder:

Now if someone who is so serious about improving the way he works as Michael Inzlicht wants to know something, I’m more than happy to help.

I wanted to give the best possible probability. Using this handy interactive visualization it is easy to move some sliders around and see which power has the highest percentage of p-values between 0.025 and 0.05 (give it a try: it’s around 56% power, when approximately 11% of p-values will fall within this small section). If we increase or decrease the power, p-values are either spread out more uniformly, or most of them will be very small. Assuming we are examining a true effect, the probability of finding two p-values in a row between 0.025 and 0.05 is simply 11% times 11%, or 0.11*0.11=0.012. At the very best, published papers that simply report what they find will contain two p-values between 0.025 and 0.05 1.2% of the time.
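The slider search can be approximated in a few lines. The sketch below is not the interactive visualization itself: it assumes a two-sided z-test (a normal approximation to the t-test) with alpha = 0.05, and the helper names `prob_p_between` and `best_power` are made up for illustration. It scans power on a 1% grid and reports where the share of p-values between 0.025 and 0.05 peaks.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prob_p_between(power, lower=0.025, upper=0.05, alpha=0.05):
    """Probability that a two-sided z-test yields lower < p < upper.

    Assumption: the test statistic is N(delta, 1), with delta chosen so
    that the test has roughly the requested power at the given alpha.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Noncentrality for the requested power (ignores the tiny chance of a
    # significant result in the wrong direction).
    delta = z_alpha + STD_NORMAL.inv_cdf(power)
    # |z| thresholds that correspond to the two p-value boundaries.
    z_low = STD_NORMAL.inv_cdf(1 - upper / 2)
    z_high = STD_NORMAL.inv_cdf(1 - lower / 2)
    right_tail = STD_NORMAL.cdf(z_high - delta) - STD_NORMAL.cdf(z_low - delta)
    left_tail = STD_NORMAL.cdf(-z_low - delta) - STD_NORMAL.cdf(-z_high - delta)
    return right_tail + left_tail


def best_power(lower=0.025, upper=0.05):
    """Scan power from 5% to 99% and return (power, probability) at the peak."""
    grid = [i / 100 for i in range(5, 100)]
    return max(((pw, prob_p_between(pw, lower, upper)) for pw in grid),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    power, prob = best_power()
    print(f"peak near {power:.0%} power, P(0.025 < p < 0.05) = {prob:.3f}")
    print(f"two in a row: {prob ** 2:.4f}")
```

Under these assumptions the peak should land close to the 56% power and 11% figures quoted above, and squaring the peak probability gives a value near 1.2%. Exact numbers for a t-test will differ slightly with sample size.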

NOTE 1: Richard Morey noted on Twitter that this calculation ignores that researchers will typically not run two studies in a row regardless of the outcome of the first study. They will typically run Study 2 only if Study 1 was statistically significant. If so, we need the probability that Study 1 yielded a p-value between 0.025 and 0.05 given that it was significant (with 56% power): p(0.025<p<0.05|p<0.05, assuming 56% power). This probability is 21%, which makes the probability across two studies 0.21*0.11=0.023, or 2.3%.

We can also simulate independent t-tests with 56% power, and count how often we find two p-values between 0.025 and 0.05 in a row. The R script below gives us the same answer to our question.
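As a stand-in for that simulation, here is a minimal Python sketch. It is not the author's R script, and it swaps the t-tests for z-tests so that only the standard library is needed: each study's test statistic is drawn from N(delta, 1), with delta set for 56% power. The function name `simulate_pairs`, the seed, and the number of simulated pairs are arbitrary choices. It returns two proportions: pairs with both p-values between 0.025 and 0.05 among all pairs, and the same count among pairs whose first study was significant (the situation described in NOTE 1).

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_pairs(power=0.56, n_pairs=200_000, lower=0.025, upper=0.05,
                   alpha=0.05, seed=1):
    """Simulate pairs of studies and count p-value pairs inside (lower, upper)."""
    rng = random.Random(seed)
    # Noncentrality that gives roughly the requested power for a two-sided z-test.
    delta = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)

    def p_value():
        z = rng.gauss(delta, 1.0)                # one simulated test statistic
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))  # two-sided p-value

    both_in_range = 0
    first_significant = 0
    for _ in range(n_pairs):
        p1, p2 = p_value(), p_value()
        if p1 < alpha:
            first_significant += 1
        if lower < p1 < upper and lower < p2 < upper:
            both_in_range += 1

    unconditional = both_in_range / n_pairs
    given_first_significant = both_in_range / first_significant
    return unconditional, given_first_significant


if __name__ == "__main__":
    unconditional, conditional = simulate_pairs()
    print(f"both in range, all pairs:            {unconditional:.4f}")
    print(f"both in range, Study 1 significant:  {conditional:.4f}")
```

With these settings the first proportion should come out near 1.2% and the second a little above 2%, in line with the calculations above up to simulation noise and the z-test approximation.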

NOTE 2: Frederik Aust and Rogier Kievit remarked on Twitter that if you multiply enough probabilities, the number will always be small in the end. I agree. We can compare observing 2 p-values between 0.025-0.05 when we have 56% power with observing these two p-values when the null-hypothesis is true. If we again take the conditional probability, this is 0.5*0.025 = 0.0125. The conditional probability with 56% power was 0.023. This means the observed pattern across two studies is 1.84 times more likely under the alternative hypothesis than under the null-hypothesis. Even though we have 2 significant studies, the evidence for the alternative hypothesis is purely anecdotal.
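The comparison in this note is plain arithmetic, and a short sketch makes each step explicit. It only reuses the figures quoted in the post (21%, 11%, and the uniform distribution of p-values under the null); the function name `two_study_probability` is invented for illustration.

```python
def two_study_probability(p_range_given_significant, p_range):
    """P(both p-values fall in the range) when Study 2 is only run after
    a significant Study 1: conditional probability for Study 1 times the
    unconditional probability for Study 2."""
    return p_range_given_significant * p_range


# Alternative hypothesis with 56% power, using the post's figures.
under_h1 = two_study_probability(0.21, 0.11)

# Null hypothesis: p-values are uniform, so 0.025 < p < 0.05 has
# probability 0.025, and given p < 0.05 it has probability 0.025 / 0.05 = 0.5.
under_h0 = two_study_probability(0.025 / 0.05, 0.025)

likelihood_ratio = under_h1 / under_h0

if __name__ == "__main__":
    print(f"under H1: {under_h1:.4f}")
    print(f"under H0: {under_h0:.4f}")
    print(f"ratio:    {likelihood_ratio:.2f}")
```

The ratio comes out a little above 1.8. The post's 1.84 results from rounding the first product to 0.023 before dividing.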

If you want to focus on the probability of p-values between 0.01 and 0.05, the interactive visualization shows the optimal power is around 62%, when approximately 24% of the p-values will fall between 0.01 and 0.05. Finding two studies within these p-values is not improbable (it happens around 0.24*0.24 = 6% of the time), but a third study within this interval occurs only 1.3% of the time. Again, it can happen, but not very often. 
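For any p-value interval whose upper bound is the alpha level, the peak can also be worked out directly rather than by moving sliders. The sketch below again assumes a two-sided z-test and ignores the negligible wrong-direction tail, so it is an approximation, and `peak_for_interval` is a hypothetical helper name. The idea: the probability that the test statistic lands between the two critical z-values is largest when the true effect sits exactly midway between them.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def peak_for_interval(lower, upper):
    """Return (power, probability) at the point where P(lower < p < upper)
    is largest, for a two-sided z-test with alpha equal to `upper`."""
    z_low = STD_NORMAL.inv_cdf(1 - upper / 2)   # critical z for p = upper
    z_high = STD_NORMAL.inv_cdf(1 - lower / 2)  # critical z for p = lower
    half_width = (z_high - z_low) / 2
    # With the noncentrality midway between z_low and z_high:
    power = STD_NORMAL.cdf(half_width)            # P(z > z_low)
    probability = 2 * STD_NORMAL.cdf(half_width) - 1  # P(z_low < z < z_high)
    return power, probability


if __name__ == "__main__":
    power, prob = peak_for_interval(0.01, 0.05)
    print(f"peak near {power:.0%} power, P(0.01 < p < 0.05) = {prob:.3f}")
    for k in (2, 3):
        print(f"{k} studies in a row: {prob ** k:.4f}")
```

For the 0.01 to 0.05 interval this gives a peak near 62% power and 24%, about 6% for two studies in a row, and somewhere around 1.3% to 1.4% for three, depending on rounding and on the normal approximation.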

If you want to use the frequency of p-values as an indication of the robustness of results, don't decide when to use the rule or which boundaries you will use after looking at the data. But if you always use the rule to doubt the robustness of two p-values between 0.025 and 0.05 in two-study papers, and three p-values between 0.01 and 0.05 in three-study papers, you won't make too many errors in the long run. An obvious exception is when authors pre-registered all their studies.

What should you do as an editor when you encounter a set of studies with p-values that are relatively unlikely to occur? First, you can ask the authors to discuss the situation. For example, when you explicitly mention that the set of studies becomes more probable when a non-significant finding is added, the authors might be happy to oblige. Second, one of my favorite solutions is to decide upon an in-principle acceptance (assuming the article is fit for publication), but ask the authors to add one replication. The authors are guaranteed publication irrespective of the outcome of the replication, but we have a better knowledge of what is likely to be true.


  1. My interpretation of Mickey's question, slightly paraphrased, is: "Given that I have observed two p-values between .025 and .05, what is the probability of them coming from an unbiased report?"

    On the other hand, the calculations in this blog post (such as the 6%) are asking, "Given an unbiased report, what is the probability of observing two p-values between .025 and .05?"

    I'd just like to point out that these are not the same thing. It's a reversal of the conditional probabilities.

  2. Sanjay. To get the probability that the data are biased given the observation of a pair of p-values between .05 and .025, we have to make some assumptions about the probability of this event occurring when bias is present. Does 50% seem reasonable to you? In this case, the probability that bias is present when the red flag is raised would be 50 out of 51, or 98%, a little bit less than 99 out of 100 (99%).

    Maybe you want to be more conservative. With a 25% probability of bias producing the event, there are still 25 out of 26 events where bias produced the critical event (96% correct positive rate).

    Bayesians often trick us by using a medical analogy where the base rate of the event we are looking for is very low (brain cancer).

                Both p in .05-.025    One p not in .05-.025
    Bias                50                     50
    NoBias               1                     99

  3. Hi Daniel,

    Mickey's question and your answer suggest another way to examine bias.

    It is similar to TIVA.


    Both tests are based on the insight that test-statistics (whether they are presented as z-scores, p-values, post-hoc power, or other transformations) should vary considerably. Obtaining p-values that are too close to each other suggests that bias is present.

    The difference between TIVA and the critical region approach (Neyman-Pearson) is that TIVA does not require an a priori specification of the critical region. If Mickey would always use .05 to .025, the approach is fine. However, if the critical region is not fixed, the bias test itself is biased.

    The problem with .05 to .025 is that it is very narrow. This reduces the type-I error rate very much (even with k=2, p = .01), but the type-II error rate is high because p-hacking doesn't always produce p-values just below .05, as you posted on another blog.

    Thus, the trick is to find a good balance between type-I and type-II error. For two studies, I suggest a range from 50% to 80% power, which corresponds to z-scores of 1.96 to 2.8, and p-values from .05 to .005.

    The type-I error rate for this test with k = 2 is about 10%, which is considered acceptable to just raise awareness of bias. This test has more power for k = 2 than TIVA. This makes it appealing to use it for pairs of studies.