# The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

## Tuesday, May 26, 2015

### After how many p-values between 0.025-0.05 should one start getting concerned about robustness?

Like insects sensitive to the infrared spectrum, researchers have evolved a special sensitivity to p-values between 0.025 and 0.05. Sure, you’ve got your p < .001’s and n.s.’s, but your life as a scientist might depend on your special sensitivity to 0.025<p<0.05.

Our increased sensitivity to these p-values might make us forget what a small part of the p-value spectrum we are talking about here: just 2.5%. At the same time, we are slowly realizing that sets of studies with too many high p-values (just below 0.05) are rather unlikely. This led Michael Inzlicht to wonder:

NOTE 1: Richard Morey noted on Twitter that this calculation ignores the fact that researchers typically do not run two studies in a row regardless of the outcome of the first study. They typically run Study 2 only if Study 1 was statistically significant. If so, we need to calculate the probability that Study 2 found a significant effect between 0.025 and 0.05, conditional on Study 1 having found a significant effect between 0.025 and 0.05 (with 56% power). Thus: p(0.025 < p < 0.05 | p < 0.05), assuming 56% power. This probability is 21%, which makes the probability across two studies 0.21 × 0.11 = 0.023, or 2.3%.
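The two probabilities in this note can be approximated in a few lines of Python. This is only a sketch, not the calculation used in the post: it assumes a two-sided z-test (a normal approximation) at α = 0.05 instead of a t-test, and the function name `prob_p_between` is made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_p_between(p_low, p_high, power, alpha=0.05):
    """P(p_low < p < p_high) for a two-sided z-test with the given power.

    Normal approximation (an assumption, the post uses t-tests). The tiny
    chance of a significant effect in the wrong direction is ignored.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected z-score implied by the power: power = P(Z + delta > z_crit)
    delta = z_crit - STD_NORMAL.inv_cdf(1 - power)
    z_for_p_high = STD_NORMAL.inv_cdf(1 - p_high / 2)  # larger p, smaller z
    z_for_p_low = STD_NORMAL.inv_cdf(1 - p_low / 2)    # smaller p, larger z
    return STD_NORMAL.cdf(z_for_p_low - delta) - STD_NORMAL.cdf(z_for_p_high - delta)

power = 0.56
unconditional = prob_p_between(0.025, 0.05, power)  # P(0.025 < p < 0.05)
conditional = unconditional / power                 # P(0.025 < p < 0.05 | p < 0.05)

print(f"unconditional: {unconditional:.3f}")
print(f"conditional on significance: {conditional:.3f}")
```

Under these assumptions the unconditional probability comes out near 0.11 and the conditional one near 0.20, close to the rounded 11% and 21% used in the note. A t-test with a small sample would shift both slightly.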

We can also simulate independent t-tests with 56% power, and count how often we find two p-values between 0.025 and 0.05 in a row. The R script below gives us the same answer to our question.
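For readers who want to try such a simulation, here is a rough Python sketch. It is not the author's R script. As a simplification it draws the test statistic of each study directly from a normal distribution (a z-test stand-in for the t-test), treats the two studies as independent, and does not condition on Study 1 being significant. The function name `simulate_pairs`, the seed, and the number of simulated pairs are arbitrary choices.

```python
import random
from statistics import NormalDist

def simulate_pairs(n_pairs=100_000, power=0.56, alpha=0.05, seed=1):
    """Share of simulated study pairs with both p-values in (0.025, 0.05)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    delta = z_crit - std_normal.inv_cdf(1 - power)  # expected z at this power
    rng = random.Random(seed)  # fixed seed so the run is reproducible

    def one_p_value():
        z = rng.gauss(delta, 1.0)                # observed z of one study
        return 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value

    hits = 0
    for _ in range(n_pairs):
        p1, p2 = one_p_value(), one_p_value()
        if 0.025 < p1 < 0.05 and 0.025 < p2 < 0.05:
            hits += 1
    return hits / n_pairs

proportion = simulate_pairs()
print(f"Both p-values between 0.025 and 0.05: {proportion:.4f}")
```

With 56% power the simulated proportion should land close to the square of the single-study probability (roughly 0.11 squared), give or take simulation noise.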

What should you do as an editor when you encounter a set of studies with p-values that are relatively unlikely to occur? First, you can ask the authors to discuss the situation. For example, when you explicitly mention that the set of studies becomes more probable when a non-significant finding is added, the authors might be happy to oblige. Second, one of my favorite solutions is to decide upon an in-principle acceptance (assuming the article is fit for publication), but ask the authors to add one replication. The authors are guaranteed publication irrespective of the outcome of the replication, but we gain better knowledge of what is likely to be true.

1. My interpretation of Mickey's question, slightly paraphrased, is: "Given that I have observed two p-values between .025 and .05, what is the probability of them coming from an unbiased report?"

On the other hand, the calculations in this blog post (such as the 6%) are asking, "Given an unbiased report, what is the probability of observing two p-values between .025 and .05?"

I'd just like to point out that these are not the same thing. It's a reversal of the conditional probabilities.

2. Sanjay. To get the probability that the data are biased given the observation of a pair of p-values between .05 and .025, we have to make some assumptions about the probability of this event occurring when bias is present. Does 50% seem reasonable to you? In this case, the probability that bias is present when the red flag is raised would be 50 out of 51, or 98%, a little bit less than 99 out of 100 (99%).

Maybe you want to be more conservative. With a 25% probability of bias producing the event, there are still 25 out of 26 events where bias produced the critical event (96% correct positive rate).

Bayesians often trick us by using a medical analogy where the base rate of the event we are looking for is very low (brain cancer).

| | Both p in .05-.025 | One p not in .05-.025 |
|---|---|---|
| Bias | 50 | 50 |
| NoBias | 1 | 99 |
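The arithmetic in this comment is an application of Bayes' rule, sketched below in Python. The equal prior for bias and no bias is an assumption read off the table (100 pairs in each row), and the function name `prob_bias_given_flag` is made up for illustration. The last call uses a hypothetical 5% prior, which is not a figure from the comment, only to show how much the answer depends on the base rate.

```python
def prob_bias_given_flag(p_flag_given_bias, p_flag_given_no_bias, prior_bias=0.5):
    """P(bias | both p-values in .025-.05) via Bayes' rule.

    prior_bias=0.5 mirrors the table (equal numbers of biased and unbiased
    pairs); it is an assumption, not an estimate.
    """
    flagged_biased = prior_bias * p_flag_given_bias
    flagged_unbiased = (1 - prior_bias) * p_flag_given_no_bias
    return flagged_biased / (flagged_biased + flagged_unbiased)

# 50% of biased pairs and 1% of unbiased pairs raise the flag: 50 out of 51
print(round(prob_bias_given_flag(0.50, 0.01), 2))

# More conservative: only 25% of biased pairs raise the flag: 25 out of 26
print(round(prob_bias_given_flag(0.25, 0.01), 2))

# Hypothetical low base rate of bias (5%), purely for illustration
print(round(prob_bias_given_flag(0.50, 0.01, prior_bias=0.05), 2))
```

The first two calls reproduce the 98% and 96% above. The third shows that the same red flag is less conclusive when bias is assumed to be rare.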

3. Hi Daniel,

It is similar to TIVA.

https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Both tests are based on the insight that test-statistics (whether they are presented as z-scores, p-values, post-hoc power, or other transformations) should vary considerably. Obtaining p-values that are too close to each other suggests that bias is present.

The difference between TIVA and the critical region approach (Neyman-Pearson) is that TIVA does not require an a priori specification of the critical region. If Mickey always used .05 to .025, the approach would be fine. However, if the critical region is not fixed, the bias test itself is biased.

The problem with .05 to .025 is that it is very narrow. This reduces the type-I error rate very much (even with k=2, p = .01), but the type-II error rate is high because p-hacking doesn't always produce p-values just below .05 as you posted on another blog.

Thus, the trick is to find a good balance between type-I and type-II error. For two studies, I suggest a range from 50% to 80% power, which corresponds to z-scores of 1.96 to 2.8, and p-values from .05 to .005.
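The mapping from power to z-scores and p-values can be checked with a short Python sketch. It assumes a two-sided test at α = 0.05 and a normal approximation; the helper names `expected_z` and `two_sided_p` are made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expected_z(power, alpha=0.05):
    """z-score at which a two-sided test with this alpha has the given power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return z_crit + STD_NORMAL.inv_cdf(power)

def two_sided_p(z):
    """Two-sided p-value for an observed z-score."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

for power in (0.50, 0.80):
    z = expected_z(power)
    print(f"power {power:.2f}: z = {z:.2f}, p = {two_sided_p(z):.4f}")
```

Under these assumptions 50% power corresponds to z = 1.96 (p = .05) and 80% power to z of about 2.80 (p of about .005), matching the range given above.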

The type-I error rate for this test with k = 2 is about 10%, which is considered acceptable to just raise awareness of bias. This test has more power for k = 2 than TIVA. This makes it appealing to use it for pairs of studies.
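One way to read the "about 10%" figure is as the largest probability, over all possible true effect sizes, that two unbiased studies both land in the window by chance. The Python sketch below follows that reading, which is an interpretation rather than something the comment spells out. It assumes normally distributed z-scores with unit variance and searches a simple grid; the function names and grid settings are made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_z_in_window(delta, z_low, z_high):
    """P(z_low < |z| < z_high) when the observed z ~ Normal(delta, 1)."""
    upper_tail = STD_NORMAL.cdf(z_high - delta) - STD_NORMAL.cdf(z_low - delta)
    lower_tail = STD_NORMAL.cdf(-z_low - delta) - STD_NORMAL.cdf(-z_high - delta)
    return upper_tail + lower_tail

def worst_case_flag_rate(z_low, z_high, k=2, steps=600, max_delta=6.0):
    """Largest chance, over a grid of true effects, that all k unbiased
    studies fall inside the window."""
    grid = (i * max_delta / steps for i in range(steps + 1))
    return max(prob_z_in_window(delta, z_low, z_high) ** k for delta in grid)

# Wide window: z from 1.96 to 2.80 (p from .05 to about .005)
wide = worst_case_flag_rate(1.96, 2.80)

# Narrow window: p from .05 to .025
z_for_p_025 = STD_NORMAL.inv_cdf(1 - 0.025 / 2)
narrow = worst_case_flag_rate(1.96, z_for_p_025)

print(f"wide window, k = 2: {wide:.3f}")
print(f"narrow window, k = 2: {narrow:.3f}")
```

On this reading the wide window gives a worst-case false alarm rate of roughly 10% for two studies, and the narrow .05 to .025 window gives roughly 1%, in line with the figures mentioned in this comment.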