Greenland and colleagues (Greenland et al., 2016) published a list of 25 common misinterpretations of statistical concepts such as power, confidence intervals, and, in points 1-10, p-values. Here I’ll explain how 50% of the p-value misinterpretations (five of points 1-10) are resolved by using equivalence tests in addition to null-hypothesis significance tests.

First, let’s look through the 5 points we
will resolve:

4. A nonsignificant test result (P >
0.05) means that the test hypothesis [NOTE DL: This is typically the null-hypothesis] is true or should be accepted.

5. A large P value is evidence in favor of
the test hypothesis.

6. A null-hypothesis P value greater than
0.05 means that no effect was observed, or that absence of an effect was shown
or demonstrated.

7. Statistical significance indicates a
scientifically or substantively important relation has been detected.

8. Lack of statistical significance indicates
that the effect size is small.

With an equivalence test, we specify which
effect is scientifically or substantively important. We call this the ‘smallest
effect size of interest’. If we find an effect that is surprisingly small,
assuming any effect we would care about exists, we can conclude practical
equivalence (and we’d not be wrong more than 5% of the time).
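To make this concrete, here is a minimal sketch of an equivalence test via two one-sided tests (TOST) for two independent means, written in Python with scipy. The function name is my own, and the sketch assumes equal variances and bounds given on the raw mean-difference scale (for standardized bounds like Cohen’s *d* = 0.5, multiply by the pooled standard deviation):

```python
import numpy as np
from scipy import stats

def tost_two_sample(x, y, low, high):
    """Two one-sided tests (TOST) for equivalence of two independent means.

    low/high are the equivalence bounds on the raw mean-difference scale.
    Returns the TOST p-value: the larger of the two one-sided p-values.
    """
    n1, n2 = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # Pooled standard deviation (Student's t, equal variances assumed)
    sp = np.sqrt(((n1 - 1) * np.var(x, ddof=1) +
                  (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2))
    se = sp * np.sqrt(1 / n1 + 1 / n2)
    df = n1 + n2 - 2
    # H0: diff <= low  (rejected when diff is convincingly above the lower bound)
    p_low = stats.t.sf((diff - low) / se, df)
    # H0: diff >= high (rejected when diff is convincingly below the upper bound)
    p_high = stats.t.cdf((diff - high) / se, df)
    return max(p_low, p_high)
```

If the returned p-value is at or below .05, both one-sided tests are significant, and we declare the difference practically equivalent to zero at an error rate of 5%.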

Let’s take a look at the figure below (adapted
from Lakens, 2017). A mean difference of Cohen’s *d* = 0.5 (either positive or negative) is specified as the smallest effect size of interest. Data is collected, and one of four possible outcomes is observed.
Equivalence tests fix point 4 above,
because a non-significant result no longer automatically means we can accept
the null hypothesis. We can accept the null if we find the pattern indicated by
A: the p-value from NHST is > 0.05, *and* the p-value for the equivalence test is ≤ 0.05. However, if the p-value for the equivalence test is also > 0.05, the outcome matches pattern D: we cannot reject either hypothesis, and thus remain undecided.

Equivalence tests similarly fix point 5: a large p-value is not evidence in favor of the test hypothesis. Setting aside the fact that p-values don’t have an evidential interpretation to begin with, we can only conclude equivalence under pattern A, not under pattern D. Point 6 is just more of the same (empirical scientists inflate error rates, statisticians inflate lists criticizing p-values): we can only conclude the absence of a meaningful effect under pattern A, but not under pattern D.
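The four patterns reduce to a simple decision rule on the two p-values. A sketch in Python (the function name and labels are mine, not from the figure):

```python
def classify_outcome(p_nhst, p_equiv, alpha=0.05):
    """Map the NHST p-value and the equivalence-test p-value
    onto the four patterns A-D."""
    significant = p_nhst <= alpha  # reject the null of no effect
    equivalent = p_equiv <= alpha  # reject effects as large as the SESOI
    if not significant and equivalent:
        return "A: not significant and equivalent -- accept the null"
    if significant and not equivalent:
        return "B: significant, possibly large enough to matter"
    if significant and equivalent:
        return "C: significant, but smaller than anything we care about"
    return "D: neither test significant -- remain undecided"
```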
Point 7 is solved because, by using
equivalence tests, we can also observe pattern C: an effect is statistically
significant, but *also* smaller than anything we care about, or practically equivalent to zero. We can only conclude the effect is significant, *and that the possibility that the effect is large enough to matter cannot be rejected*, under pattern B. Similarly, point 8 is solved because we can only conclude an effect is non-significant and small under pattern A, but not under pattern D. When there is no significant difference (P > 0.05), but also no statistical equivalence (P > 0.05), it is still possible there is an effect large enough to be interesting.
We see that p-values (from equivalence tests)
solve 50% of the misinterpretations of p-values (from NHST). They also allow
you to publish effects that are *statistically equivalent, because they reject the presence of a meaningful effect*. They are just as easy to calculate and report as a t-test. Read my practical primer if you want to get started.
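The same decision can also be reported with a confidence interval: TOST at α = .05 corresponds to checking whether the 90% CI for the mean difference falls entirely inside the equivalence bounds. A sketch under the same equal-variance assumptions (function name mine):

```python
import numpy as np
from scipy import stats

def equivalence_ci(x, y, low, high, alpha=0.05):
    """Confidence-interval approach to equivalence: declare the two means
    practically equivalent when the (1 - 2*alpha) CI for their difference
    lies entirely within (low, high). For alpha = .05 this is the 90% CI,
    and the decision matches TOST at the .05 level."""
    n1, n2 = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp = np.sqrt(((n1 - 1) * np.var(x, ddof=1) +
                  (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2))
    se = sp * np.sqrt(1 / n1 + 1 / n2)
    t_crit = stats.t.ppf(1 - alpha, n1 + n2 - 2)  # one-sided critical value
    ci = (diff - t_crit * se, diff + t_crit * se)
    return ci, (low < ci[0] and ci[1] < high)
```

Reporting the 90% CI alongside the equivalence bounds makes the decision transparent to readers who prefer estimation over testing.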
P.S. In case you are wondering about the
other 50% of the misinterpretations: these are all solved by remembering one
simple thing. P-values tell you nothing about the probability that a hypothesis is
true (neither the alternative hypothesis, nor the null hypothesis).
This fixes points 1, 2, 3, 9, and 10 below by Greenland and colleagues. So if
you use equivalence tests together with null-hypothesis significance tests, and
remember that p-values do not tell you anything about the probability a hypothesis is
true, you’re good.

1. The P value is the probability that the
test hypothesis is true; for example, if a test of the null hypothesis gave P =
0.01, the null hypothesis has only a 1 % chance of being true; if instead it
gave P = 0.40, the null hypothesis has a 40 % chance of being true.

2. The P value for the null hypothesis is
the probability that chance alone produced the observed association; for
example, if the P value for the null hypothesis is 0.08, there is an 8 %
probability that chance alone produced the association.

3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be
rejected.

9. The P value is the chance of our data
occurring if the test hypothesis is true; for example, P = 0.05 means that the
observed association would occur only 5 % of the time under the test
hypothesis.

10. If you reject the test hypothesis
because P ≤ 0.05, the chance you are in error (the chance your
‘‘significant finding’’ is a false positive) is 5%.

*References*

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. *European Journal of Epidemiology*, *31*(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. *Social Psychological and Personality Science*.
*Comments*

Alistair: I really like this approach to testing, since it provides much more information than a single NHST result. I am wondering, though, about one conclusion. You write "We can only conclude the effect is significant, and large enough to matter, under pattern B." But in scenario B it still looks most likely that the effect size is below the size of d = 0.5 that’s of interest. The effect *could* be large enough to be of interest, but I’d only feel confident concluding that it likely was that big if the mean was above 0.5 and the CI didn’t include 0.5. Are we really interested in demonstrating that the effect size is likely as big or bigger than the one of interest, or simply that it’s not statistically smaller than it?

Author reply: Hi Alistair - glad you like the approach - so do I. It’s simple and efficient! You are asking a question about confirmation vs. rejection. Indeed, in scenario B we can only reject the null, not accept a d = 0.5. But we also cannot reject a d = 0.5 (or d = -0.5). So it would be better to say ‘and possibly large enough to matter’.

Alistair: Thanks for the clarification.

Author reply: After an additional comment about this, I’ve changed the text to "and that the possibility that the effect is large enough to matter cannot be rejected" - thanks for the feedback!