In a previous post, I compared equivalence tests to Bayes factors, and pointed out several benefits of equivalence tests. But a much more logical comparison, and one I have not given enough attention to so far, is the ROPE procedure using Bayesian estimation. I'd like to thank John Kruschke for feedback on a draft of this blog post. Check out his own recent blog post comparing ROPE to Bayes factors here.
When we perform a study, we would like to conclude there is an effect, when there is an effect. But it is just as important to be able to conclude there is no effect, when there is no effect. I’ve recently published a paper that makes Frequentist equivalence tests (using the two-one-sided tests, or TOST, approach) as easy as possible (Lakens, 2017). Equivalence tests allow you to reject the presence of any effect you care about. In Bayesian estimation, one way to argue for the absence of a meaningful effect is the Region of Practical Equivalence (ROPE) procedure (Kruschke, 2014, chapter 12), which is “somewhat analogous to frequentist equivalence testing” (Kruschke & Liddell, 2017).
In the ROPE procedure, a 95% Highest Density Interval (HDI) is calculated from the posterior distribution (which is based on a prior and the data). Kruschke suggests that "if the 95% HDI falls entirely inside the ROPE then we decide to accept the ROPE'd value for practical purposes". Note that the same HDI can also be used to reject the null hypothesis, whereas in Frequentist statistics, even though the confidence interval plays a similar role, you would still perform both a traditional t-test and the TOST procedure.
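As a minimal illustration of this decision rule (a sketch of mine, not code from Kruschke), you can compute an HDI from posterior samples and check whether it falls entirely inside the ROPE. The hdi function from the HDInterval package (mentioned in the comments below) computes the interval; the posterior draws here are stand-ins for samples from a real fitted model:

# Sketch of the HDI-inside-ROPE decision rule; the posterior draws are
# stand-ins, in practice they come from your fitted Bayesian model
library(HDInterval)

posterior_diff <- rnorm(1e5, mean = 0.05, sd = 0.1)  # fake posterior draws
rope <- c(-0.5, 0.5)                                 # region of practical equivalence

interval <- hdi(posterior_diff, credMass = 0.95)     # 95% HDI
if (interval[["lower"]] > rope[1] && interval[["upper"]] < rope[2]) {
  message("95% HDI falls entirely inside the ROPE: accept the ROPE'd value")
} else {
  message("95% HDI is not entirely inside the ROPE: withhold that decision")
}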
The only real difference with equivalence testing is that instead of a confidence interval, a Bayesian Highest Density Interval is used. If the prior used by Kruschke were perfectly uniform, ROPE and equivalence testing would be identical, barring philosophical differences in how the numbers should be interpreted. The BEST package by default uses a 'broad' prior, so the 95% CI and 95% HDI are not exactly the same, but they are very close for single comparisons. When multiple comparisons are made (for example, when using sequential analyses; Lakens, 2014), the CI needs to be adjusted to maintain the desired error rate, but in Bayesian statistics, error rates are not directly controlled (they are limited due to 'shrinkage', but can be inflated beyond 5%, and often considerably so).
In the code below, I generate normally distributed data for two groups (both with a mean of 0 and an SD of 1) and perform both the ROPE procedure and the TOST. In this simulated dataset, the 95% HDI ranges from -0.10 to 0.42, and the 95% CI from -0.11 to 0.41, with estimated mean differences of 0.17 and 0.15, respectively.
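Something along these lines reproduces the comparison; the seed, the sample size of 100 per group, and the equivalence bounds of -0.5 and 0.5 (matching the ROPE used later in this post) are assumptions on my part, so the exact numbers will differ somewhat from those quoted above.

# Reconstruction of the ROPE vs. TOST comparison; seed and n are assumptions
# install.packages(c("BEST", "TOSTER"))  # BEST also requires JAGS
library(BEST)
library(TOSTER)

set.seed(1)                        # assumed seed; the original is not shown
x <- rnorm(100, mean = 0, sd = 1)  # group 1
y <- rnorm(100, mean = 0, sd = 1)  # group 2

# ROPE procedure: fit the BEST model; summary() reports the posterior mean
# and the 95% HDI of the difference in means (the muDiff row)
BESTout <- BESTmcmc(x, y)
summary(BESTout)

# The traditional 95% CI, for comparison with the 95% HDI
t.test(x, y)$conf.int

# TOST procedure: two one-sided tests against bounds of -0.5 and 0.5,
# which also reports the 90% CI around the mean difference
TOSTtwo.raw(m1 = mean(x), m2 = mean(y),
            sd1 = sd(x), sd2 = sd(y),
            n1 = length(x), n2 = length(y),
            low_eqbound = -0.5, high_eqbound = 0.5, alpha = 0.05)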
Indeed, if you will forgive me the pun, you might say these two approaches are practically equivalent. But there are some subtle differences between ROPE and TOST.
95% HDI vs 90% CI
Kruschke (2014, Chapter 5) writes: "How should we define "reasonably credible"? One way is by saying that any points within the 95% HDI are reasonably credible." There is no strong justification for the use of a 95% HDI over a 96% or 93% HDI, except that it mirrors the familiar use of a 95% CI in Frequentist statistics. In Frequentist statistics, the 95% confidence interval is directly related to the 5% alpha level that is commonly deemed acceptable as a maximum Type 1 error rate (even though this alpha level is itself a convention without strong justification).
But here's the catch: the TOST equivalence testing procedure does not use a 95% CI, but a 90% CI. The reason is that two one-sided tests are performed, each with a 5% error rate. You might intuitively think that doing two tests with a 5% error rate will inflate the overall Type 1 error rate, but in this case, that's not true. You could easily replace the two tests with a single test of the observed effect against whichever equivalence bound (upper or lower) is closest to it; if this test is statistically significant, so is the other, and thus there is no alpha inflation in this specific case. That's why the TOST procedure uses a 90% CI to maintain a 5% error rate, while the same researcher would use a 95% CI in a traditional two-sided t-test to examine whether the observed effect is statistically different from 0, also with a 5% error rate (see also Senn, 2007, section 22.2.4).
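A quick simulation illustrates this (a sketch I've added for this point, not code from the original analysis): place the true difference exactly on the upper equivalence bound, the worst case for TOST, and the procedure still declares equivalence in only about 5% of runs.

# Type 1 error rate of TOST when the true difference equals the upper bound
set.seed(2)
n <- 100; bound <- 0.5; reps <- 10000

false_equivalence <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = 1)
  y <- rnorm(n, mean = bound, sd = 1)    # true difference sits on the bound
  d  <- mean(y) - mean(x)
  se <- sqrt(var(x) / n + var(y) / n)
  df <- 2 * n - 2
  # two one-sided tests: is the difference above -bound, and below +bound?
  p_lower <- pt((d + bound) / se, df, lower.tail = FALSE)
  p_upper <- pt((d - bound) / se, df, lower.tail = TRUE)
  max(p_lower, p_upper) < 0.05           # equivalence only if both tests reject
})
mean(false_equivalence)  # close to 0.05, despite two tests at alpha = 0.05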
This nicely illustrates the difference between estimation (where you just want a certain level of accuracy, such as 95%) and Frequentist hypothesis testing, where you want to distinguish signal from noise and not be wrong more than 5% of the time when you declare there is a signal. ROPE keeps the accuracy the same across tests; Frequentist approaches keep the error rate constant. From a Frequentist perspective, ROPE is more conservative than TOST, just as an alpha of 0.025 is more conservative than an alpha of 0.05.
Power analysis
For an equivalence test, power analysis can be performed with closed-form calculations that take just a fraction of a second. I find that useful, for example in my role on our ethics board, where we evaluate proposals that have to justify their sample size, and we often check power calculations. Kruschke has an excellent R package (BEST) that can do power analyses for the ROPE procedure. This is great work, but the simulations take a while (a little over an hour for 1000 simulations).
Because the BESTpower function relies on simulations, you need to specify the sample size, and it will calculate the power. That's actually the reverse of what you typically want in a power analysis (you want to input the desired power and see which sample size you need). This means you will most likely need to run multiple simulations in BESTpower before you have determined the sample size that will yield good power. Furthermore, the software requires you to specify the expected means and standard deviations, rather than simply an expected effect size. Whereas in Frequentist power analysis the hypothesized effect size is a point value (e.g., d = 0.4), Bayesian power analysis models the alternative as a distribution, acknowledging that there is uncertainty about the true effect size.
In the end, however, the result of a power analysis for ROPE
and for TOST is actually remarkably similar. Using the code below to perform
the power analysis for ROPE, we see that 100 participants in each group give us
approximately 88.4% power (with 2000 simulations, this estimate is still a bit
uncertain) to get a 95% HDI that falls within our ROPE of -0.5 to 0.5, assuming
standard deviations of 1.
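The simulation script is not reproduced here, but a sketch following the BEST package's prospective power workflow would look roughly as follows. The makeData and BESTpower calls follow the package documentation, while the specific settings (the size of the idealized dataset, the number of saved steps, nRep) are my assumptions, and at nRep = 2000 this takes a long time to run.

# Prospective power sketch for the ROPE procedure with the BEST package
library(BEST)

# Idealized data expressing the hypothesis that the true difference is zero,
# with standard deviations of 1 in both groups
proData <- makeData(mu1 = 0, sd1 = 1, mu2 = 0, sd2 = 1,
                    nPerGrp = 1000, pcntOut = 0, showPlot = FALSE)
proMCMC <- BESTmcmc(proData$y1, proData$y2)

# Power for n = 100 per group: the proportion of simulated studies in which
# the 95% HDI of the mean difference falls inside the ROPE of -0.5 to 0.5
BESTpower(proMCMC, N1 = 100, N2 = 100,
          ROPEm = c(-0.5, 0.5), nRep = 2000)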
We can use the powerTOSTtwo.raw function in the TOSTER package (using an alpha of 0.025 instead of 0.05, to mirror the 95% HDI) to calculate the sample size we would need to achieve 88.4% power for an independent t-test (using equivalence bounds of -0.5 and 0.5, and standard deviations of 1):

powerTOSTtwo.raw(alpha = 0.025, statistical_power = 0.884, low_eqbound = -0.5, high_eqbound = 0.5, sdpooled = 1)
The outcome is 100 as well. So if you use a broad prior, it
seems you can save yourself some time by using the power analysis for
equivalence tests, without severe consequences.
Use of prior information
The biggest benefit of ROPE over TOST is that it allows you to incorporate prior information in your data analysis. If you have reliable prior information, ROPE can use this information, which is especially useful if you don't have a lot of data. If you use priors, it is typically advised to check the robustness of the posterior against reasonable changes in the prior (Kruschke, 2013).
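As an illustration, here is a sketch of how prior information could be supplied to BESTmcmc and then varied for a robustness check, reusing x and y from the example above. The priors argument with muM and muSD elements follows the BEST package documentation, but the values themselves are invented for illustration.

# Informative priors on the group means (prior values are made up)
informed <- BESTmcmc(x, y,
                     priors = list(muM = c(0, 0),        # prior means per group
                                   muSD = c(0.5, 0.5)))  # informative prior SDs
summary(informed)

# Robustness check (Kruschke, 2013): refit with much broader priors and check
# whether the posterior HDI changes enough to alter the conclusion
broad <- BESTmcmc(x, y, priors = list(muM = c(0, 0), muSD = c(5, 5)))
summary(broad)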
Conclusion
Using the ROPE procedure or the TOST procedure will most
likely lead to very similar inferences. For all practical purposes, the
differences are small. It’s quite a lot easier to perform a power analysis for
TOST, and by default, TOST has greater statistical power because it uses a 90% CI.
But power analysis is possible for ROPE (which is a rare pleasure to see for
Bayesian analyses), and you could choose to use a 90% HDI, or any other value
that matches your goals. TOST will be easier and more familiar because it is
just a twist on the classic t-test, but ROPE might be a great way to dip your
toes in Bayesian waters and explore the many more things you can do with
Bayesian posterior distributions.
References
Kruschke, J. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146

Kruschke, J. (2014). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.). Boston: Academic Press.

Kruschke, J., & Liddell, T. M. (2017). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-016-1221-4

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. https://doi.org/10.1002/ejsp.2023

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science.

Senn, S. (2007). Statistical issues in drug development (2nd ed.). Chichester, England; Hoboken, NJ: John Wiley & Sons.
Comments
So I would argue that a region of practical equivalence (ROPE) is both computationally and conceptually very different from equivalence testing.
A ROPE is a very simple concept; it's still a good concept, but it's also very simple. It's a range of differences between any two parameters where, if the "underlying" difference falls in that range, it isn't large enough to be of interest. This is a very general concept not tied to a specific model or specific parameters. You could use it for differences in means, scale parameters, or any other exotic parameters. You could use it for simple group models, or for more advanced models where it's not even clear how you would calculate a p-value. Even if it was originally introduced together with BEST, you can use it with *any* Bayesian model, and once you have fitted a Bayesian model it's straightforward to calculate how much probability is in or out of the ROPE (or use an HDI if you want to).
Equivalence testing is something different: it's a procedure that requires you to use a model and a parameter where you can calculate p-values. Using a ROPE can be seen as a way of summarizing a posterior distribution, while equivalence testing relies on p-values. And I would say that there is a big conceptual difference between posterior probabilities and p-values, even if they are, in a few select cases, numerically similar.
agreed...
It is important to emphasize that in one instance you have a measurement of belief, and in the other you can only make a yes-or-no decision that may or may not update your belief but in the end provides no measurement of that belief. That is not a trivial, pedantic distinction to ignore. People end up coming away thinking that the frequentist method provides a measure that it does not.
Rasmus, no need to calculate p-values - just the 90% CI around whatever estimate you have. Just as flexible as Bayesian approaches.
So I can easily come up with statistical models where it's kind of tricky to come up with a CI, but the Bayesian credible interval is easy to get. An example of such a model would be the statistical model behind BEST.
Ok - but I was only saying you don't need p-values.
And there we agree 100%! :)
Nice post, Daniel. Thank you. Does TOST or ROPE require including 0 in the interval, or can one or both be used to examine equivalence within nonzero ranges? (It's possible your Coursera course addressed this for TOST. If so, color me embarrassed that I can't remember.)
Hi Heather - you can use it for equivalence with a non-zero range as well. For example, using the TOST for one sample, you could test whether a score is equivalent to guessing average (e.g., 0.5).
I'm not sure this is addressed in the Coursera course - but I will be updating the equivalence assignment in the future now that my own paper on this is out, and will add a non-zero example!
Hi Daniel, glad you find BEST useful. If you just want an HDI, use HDInterval::hdi; same as BEST::hdi but faster for large objects and you don't need to install JAGS. The next version of BEST will 'Depend' on HDInterval.
Hi, that looks like it will be much easier to use in the future! Excellent!
Thanks for this interesting blog post, Daniel. I've created a follow-up that shows cases in which TOST+NHST yield conflicting decisions, which can never happen with the HDI+ROPE procedure. It's here: http://doingbayesiandataanalysis.blogspot.com/2017/02/equivalence-testing-two-one-sided-test.html