Monday, March 6, 2017

How p-values solve 50% of the problems with p-values


Greenland and colleagues (Greenland et al., 2016) published a list of 25 common misinterpretations of statistical concepts such as power, confidence intervals, and, in points 1-10, p-values. Here I’ll explain how 50% of these p-value problems are resolved by using equivalence tests in addition to null-hypothesis significance tests.

First, let’s look through the 5 points we will resolve:

4. A nonsignificant test result (P > 0.05) means that the test hypothesis [NOTE DL: This is typically the null-hypothesis] is true or should be accepted.
5. A large P value is evidence in favor of the test hypothesis.
6. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated.
7. Statistical significance indicates a scientifically or substantively important relation has been detected.
8. Lack of statistical significance indicates that the effect size is small.

With an equivalence test, we specify which effect is scientifically or substantively important. We call this the ‘smallest effect size of interest’. If we find an effect that is surprisingly small, assuming any effect we would care about exists, we can conclude practical equivalence (and we’d not be wrong more than 5% of the time).
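To make this concrete, here is a minimal sketch of such a test in Python (this is not the procedure or the tools from the practical primer; the data, the sample sizes, and the ±0.5 raw-score bounds are made up for illustration). It computes an ordinary t-test p-value alongside the TOST (two one-sided tests) equivalence p-value against the smallest effect size of interest.

```python
# Minimal TOST (two one-sided tests) sketch for two independent groups.
# Everything here (data, bounds, sample sizes) is illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 100)  # group 1
b = rng.normal(0.1, 1.0, 100)  # group 2

# Ordinary NHST: is the mean difference different from zero?
t_nhst, p_nhst = stats.ttest_ind(a, b)

# Equivalence bounds on the raw scale; with SD near 1 this is roughly d = 0.5.
low, high = -0.5, 0.5

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2  # simple pooled df; Welch df would be more precise

# Two one-sided tests: is the difference above the lower bound,
# and below the upper bound?
p_low = 1 - stats.t.cdf((diff - low) / se, df)  # H0: diff <= low
p_high = stats.t.cdf((diff - high) / se, df)    # H0: diff >= high
p_equiv = max(p_low, p_high)                    # TOST p-value

print(f"NHST p = {p_nhst:.3f}, equivalence p = {p_equiv:.3f}")
```

If the equivalence p-value falls below .05, the observed difference is surprisingly small under the assumption that an effect of at least the smallest effect size of interest exists, and we can conclude practical equivalence.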

Let’s take a look at the figure below (adapted from Lakens, 2017). A mean difference of Cohen’s d = 0.5 (either positive or negative) is specified as the smallest effect size of interest. Data are collected, and one of four possible outcomes is observed.



Equivalence tests fix point 4 above, because a non-significant result no longer automatically means we can accept the null hypothesis. We can accept the null if we find the pattern indicated by A: the p-value from NHST is > 0.05, and the p-value for the equivalence test is ≤ 0.05. However, if the p-value for the equivalence test is also > 0.05, the outcome matches pattern D, and we cannot reject either hypothesis, and thus remain undecided. Equivalence tests similarly fix point 5: a large p-value is not evidence in favor of the test hypothesis. Ignoring that p-values don’t have an evidential interpretation to begin with, we can only conclude equivalence under pattern A, not under pattern D. Point 6 is just more of the same (empirical scientists inflate error rates, statisticians inflate lists criticizing p-values): we can only conclude the absence of a meaningful effect under pattern A, but not under pattern D.

Point 7 is solved because by using equivalence tests, we can also observe pattern C: an effect is statistically significant, but also smaller than anything we care about, or practically equivalent to zero. We can only conclude the effect is significant, and that the possibility that the effect is large enough to matter can not be rejected, under pattern B. Similarly, point 8 is solved because we can only conclude an effect is non-significant and small under pattern A, but not under pattern D. When there is no significant difference (P > 0.05), but also no statistical equivalence (P > 0.05), it is still possible there is an effect large enough to be interesting.
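Putting the two p-values together, the four patterns amount to a simple decision table. Here is a sketch of that logic (the threshold and the labels are just for illustration, continuing from the hypothetical p-values computed in the snippet above):

```python
# Classify the outcome of an NHST plus an equivalence test into patterns A-D.
def pattern(p_nhst, p_equiv, alpha=0.05):
    if p_nhst > alpha and p_equiv <= alpha:
        return "A: not significant, equivalent - accept the null"
    if p_nhst <= alpha and p_equiv > alpha:
        return "B: significant, not equivalent - possibly large enough to matter"
    if p_nhst <= alpha and p_equiv <= alpha:
        return "C: significant, but equivalent - too small to care about"
    return "D: not significant, not equivalent - undecided"

print(pattern(0.20, 0.01))  # pattern A
print(pattern(0.30, 0.40))  # pattern D
```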

We see that p-values (from equivalence tests) solve 50% of the misinterpretations of p-values (from NHST). They also allow you to publish null results: a statistically significant equivalence test rejects the presence of a meaningful effect. They are just as easy to calculate and report as a t-test. Read my practical primer (Lakens, 2017) if you want to get started.

P.S. In case you are wondering about the other 50% of the misinterpretations: these are all solved by remembering one simple thing. P-values tell you nothing about the probability a hypothesis is true (neither about the alternative hypothesis, nor about the null hypothesis). This fixes points 1, 2, 3, 9, and 10 below from Greenland and colleagues. So if you use equivalence tests together with null-hypothesis significance tests, and remember that p-values do not tell you anything about the probability a hypothesis is true, you’re good.

1. The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true.
2. The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that chance alone produced the association.
3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected.
9. The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5 % of the time under the test hypothesis.
10. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your “significant finding” is a false positive) is 5 %.

References

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

4 comments:

  1. I really like this approach to testing, since it does provide much more information than a single NHST result. I am wondering though about one conclusion. You write "We can only conclude the effect is significant, and large enough to matter, under pattern B." But in scenario B it still looks most likely that the effect size is below the d = 0.5 that's of interest. The effect *could* be large enough to be of interest, but I'd only feel confident concluding that it likely was that big if the mean was above 0.5 and the CI didn't include 0.5. Are we really interested in demonstrating that the effect size is likely as big or bigger than the one of interest, or simply that it's not statistically smaller than it?

    Replies
    1. Hi Alistair - glad you like the approach - so do I. It's simple and efficient! You are asking a question about confirmation vs. rejection. Indeed, in scenario B we can only reject the null, not accept a d = 0.5. But we cannot reject a d = 0.5 (or d = -0.5). So it would be better to say 'and possibly large enough to matter'.

    2. After an additional comment about this, I've changed the text to "and that the possibility that the effect is large enough to matter can not be rejected," - thanks for the feedback!
