Our increased sensitivity to these p-values might make us forget what a small part of the p-value spectrum we are talking about here – just 2.5%. At the same time, we are slowly realizing that too many high p-values are rather unlikely. This led Michael Inzlicht to wonder:
Editor question: After how many p-values <.05 but >.025 should one start getting concerned about robustness? What's fair?
— Michael Inzlicht (@minzlicht) May 26, 2015
Now, if someone as serious about improving the way he works as Michael Inzlicht wants to know something, I'm more than happy to help.
I wanted to find the highest possible probability. Using this handy interactive visualization, it is easy to move some sliders around and see which level of power yields the highest percentage of p-values between 0.025 and 0.05 (give it a try: it's around 56% power, where approximately 11% of p-values fall within this narrow window). If we increase or decrease the power, p-values are either spread out more uniformly, or most of them become very small. Assuming we are examining a true effect, the probability of finding two p-values between 0.025 and 0.05 in a row is simply 11% times 11%, or 0.11*0.11 = 0.012. Even in this most favorable scenario, published papers that simply report what they find will contain two such p-values only 1.2% of the time.
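As a sanity check on these numbers (my own sketch, using a normal approximation to the test statistic rather than the visualization itself), the 11% and 1.2% figures can be reproduced in a few lines of R:

```r
# Probability that a two-sided p-value falls between .025 and .05
# for a test with 56% power (normal approximation; the tiny contribution
# of the opposite tail is ignored).
z_05  <- qnorm(1 - 0.05 / 2)    # critical z for p = .05
z_025 <- qnorm(1 - 0.025 / 2)   # critical z for p = .025

delta <- z_05 + qnorm(0.56)     # true effect (in z-units) giving 56% power

power_05  <- 1 - pnorm(z_05  - delta)   # ~0.56
power_025 <- 1 - pnorm(z_025 - delta)   # ~0.45

p_window <- power_05 - power_025        # ~0.11
p_window^2                              # ~0.012, i.e. about 1.2% for two in a row
```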
NOTE 1: Richard Morey noted on Twitter that this calculation ignores the fact that researchers typically do not run two studies in a row regardless of the outcome of the first study: they typically run Study 2 only if Study 1 was statistically significant. If so, we need the probability that a study yields a p-value between 0.025 and 0.05, conditional on it yielding a significant result at all (with 56% power). Thus: p(0.025 < p < 0.05 | p < 0.05, assuming 56% power). This probability is 21%, which makes the probability across two studies 0.21*0.11 = 0.023, or 2.3%.
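The same approximation gives the conditional version; this is my own sketch, not Richard Morey's calculation, and the small differences from the 21% and 2.3% above come from rounding and from using a z rather than a t distribution:

```r
# Conditional probability that a p-value falls in (.025, .05),
# given that it is significant at .05 (56% power, normal approximation)
delta     <- qnorm(0.975) + qnorm(0.56)
power_05  <- 1 - pnorm(qnorm(0.975)  - delta)   # ~0.56
power_025 <- 1 - pnorm(qnorm(0.9875) - delta)   # ~0.45
p_window  <- power_05 - power_025               # ~0.11

p_window / power_05                 # ~0.20, close to the 21% reported above
(p_window / power_05) * p_window    # ~0.022, roughly the 2.3% reported above
```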
We can also simulate independent t-tests with 56% power, and count how often we find two p-values between 0.025 and 0.05 in a row. The R script below gives us the same answer to our question.
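The original script is not reproduced here, but a minimal sketch of such a simulation (assuming two-sample t-tests with n = 50 per group and an effect size chosen to give 56% power) could look like this:

```r
set.seed(2015)
n_per_group <- 50
# effect size (in SD units) that gives 56% power for n = 50 per group
d <- power.t.test(n = n_per_group, power = 0.56, sd = 1)$delta

n_sim <- 50000
both_in_window <- replicate(n_sim, {
  p1 <- t.test(rnorm(n_per_group, d), rnorm(n_per_group, 0))$p.value
  p2 <- t.test(rnorm(n_per_group, d), rnorm(n_per_group, 0))$p.value
  p1 > 0.025 && p1 < 0.05 && p2 > 0.025 && p2 < 0.05
})
mean(both_in_window)   # ~0.012: two p-values between .025 and .05 in a row
```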
My interpretation of Mickey's question, slightly paraphrased, is: "Given that I have observed two p-values between .025 and .05, what is the probability of them coming from an unbiased report?"
On the other hand, the calculations in this blog post (such as the 6%) are asking, "Given an unbiased report, what is the probability of observing two p-values between .025 and .05?"
I'd just like to point out that these are not the same thing. It's a reversal of the conditional probabilities.
Sanjay, to get the probability that the data are biased given the observation of a pair of p-values between .025 and .05, we have to make an assumption about the probability of this event occurring when bias is present. Does 50% seem reasonable to you? In that case, the probability that bias is present when the red flag is raised would be 50 out of 51, or 98%, a little less than 99 out of 100 (99%).
Maybe you want to be more conservative. With a 25% probability of bias producing the event, there are still 25 out of 26 cases where bias produced the critical event (a 96% correct positive rate).
Bayesians often trick us by using a medical analogy in which the base rate of the event we are looking for is very low (e.g., brain cancer).
            Both p in .025-.05    One p not in .025-.05
Bias                50                      50
NoBias               1                      99
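As a sketch, the same arithmetic in R, using the counts from the table (and assuming, as the table implies, that biased and unbiased reports are equally common a priori):

```r
# Counts from the table above: per 100 biased and 100 unbiased reports,
# how many produce two p-values between .025 and .05?
bias_hit   <- 50   # biased reports that produce the event
nobias_hit <- 1    # unbiased reports that produce the event

bias_hit / (bias_hit + nobias_hit)   # ~0.98: P(bias | both p-values in window)

# The more conservative assumption: bias produces the event only 25% of the time
25 / (25 + 1)                        # ~0.96
```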
Hi Daniel,
Mickey's question and your answer suggest another way to examine bias.
It is similar to TIVA.
https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/
Both tests are based on the insight that test statistics (whether they are presented as z-scores, p-values, post-hoc power, or other transformations) should vary considerably. Obtaining p-values that are too close to each other suggests that bias is present.
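For illustration, a rough sketch of the TIVA idea as described in the linked post (my own sketch, not the original implementation): convert the p-values to z-scores and test whether their variance is suspiciously smaller than the expected value of 1.

```r
# Sketch of the TIVA logic: z-scores from unbiased studies should have
# variance ~1; a much smaller observed variance is suspicious.
tiva_sketch <- function(p) {
  z <- qnorm(1 - p / 2)          # two-sided p-values to z-scores
  k <- length(p)
  v <- var(z)                    # observed variance of the z-scores
  # left-tail chi-square test: small variances indicate insufficient variance
  list(variance = v, p_value = pchisq((k - 1) * v, df = k - 1))
}

tiva_sketch(c(0.049, 0.031))     # two p-values close together -> small variance
```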
The difference between TIVA and the critical-region approach (Neyman-Pearson) is that TIVA does not require an a priori specification of the critical region. If Mickey always used .025 to .05, the approach would be fine. However, if the critical region is not fixed in advance, the bias test itself is biased.
The problem with .025 to .05 is that it is a very narrow region. This makes the type-I error rate very low (even with k = 2, p = .01), but the type-II error rate is high, because p-hacking does not always produce p-values just below .05, as you showed on another blog.
Thus, the trick is to find a good balance between type-I and type-II errors. For two studies, I suggest a range from 50% to 80% power, which corresponds to z-scores of 1.96 to 2.8 and to p-values from .05 to .005.
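The z-to-p correspondence is easy to verify in R:

```r
# Two-sided p-values for z = 1.96 and z = 2.8
2 * pnorm(c(1.96, 2.8), lower.tail = FALSE)   # ~0.050 and ~0.005
```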
The type-I error rate for this test with k = 2 is about 10%, which seems acceptable for a test that merely raises awareness of possible bias. This test has more power for k = 2 than TIVA, which makes it appealing for pairs of studies.
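As a rough check of the 10% figure (again a sketch using a normal approximation): with an unbiased pair of studies run at 50% power, roughly 30% of p-values land between .005 and .05, so two in a row occur about 9-10% of the time.

```r
# Probability a two-sided p-value falls in (.005, .05) at 50% power
# (normal approximation, opposite tail ignored)
delta     <- qnorm(0.975)                        # noncentrality for 50% power
power_05  <- 1 - pnorm(qnorm(0.975)  - delta)    # 0.50
power_005 <- 1 - pnorm(qnorm(0.9975) - delta)    # ~0.20
p_window  <- power_05 - power_005                # ~0.30

p_window^2                                       # ~0.09, close to the 10% above
```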