The 20% Statistician: Examining Non-Significant Results with Bayes Factors and Equivalence Tests

Monday, January 30, 2017

In this blog post, I’ll compare two ways of interpreting non-significant effects: Bayes factors and TOST equivalence tests. I’ll explain why reporting more than only Bayes factors makes sense, and highlight some benefits of equivalence testing over Bayes factors. I’d like to say a big thank you to Bill (Lihan) Chen and Victoria Savalei for helping me out super-quickly with my questions as I was re-analyzing their data.

Does volunteering improve well-being? A recent article by Ashley Whillans, Scott Seider, Lihan Chen, Ryan Dwyer, Sarah Novick, Kathryn Gramigna, Brittany Mitchell, Victoria Savalei, Sally Dickerson & Elizabeth W. Dunn suggests the answer is: Not so much. The study was published in Comprehensive Results in Social Psychology, one of the highest-quality journals in social psychology, which peer-reviews pre-registrations of studies before they are performed.

People were randomly assigned to a volunteering program for 6 months, or to a control condition. Before and after, a wide range of well-being measures were collected. Bayes factors support the null for all measures. The main results (and indeed, except for some manipulation checks, the only results – not even means or standard deviations are provided in the article) are communicated in the form of Bayes factors in Table 2.

The Bayes factors were calculated using the Bayes factor calculator by Zoltan Dienes, who has a great open access paper in Frontiers, cited more than 200 times since 2014, on how to use Bayes to get the most out of non-significant results. I won’t try to explain in detail how these Bayes factors are calculated – too many Bayesians on Twitter have told me I am too stupid to understand the math behind Bayes factors, and that I should have taken calculus in high school. They are right on both counts, so just read Dienes (2014) for an explanation.

As Dienes (2014) discusses, you can also interpret non-significant results using Frequentist statistics. In a TOST equivalence test, which consists of two simple one-sided t-tests, you determine whether an effect falls between equivalence bounds set to the smallest effect size you care about (for an introduction, see Lakens, 2017). Dienes (2014) says it can be difficult to determine what this smallest effect size of interest is, but for me, if anything, it is easier to determine a smallest effect size of interest than to specify an alternative model in Bayesian statistics.

The authors examined whether well-being was improved by volunteering, and specified an alternative model (what would a true effect of improved well-being look like?) as follows (page 9): “Because our goal was to contrast the null hypothesis to an alternative hypothesis that the effect is moderate in size, we used a normal distribution prior with a mean of 0.50 and a standard deviation of 0.15 for the standardized effect size (e.g. the difference score between standardized T2 and T1 measures).”
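For readers who want a feel for what such a prior implies, here is a minimal Python sketch. It is not the Dienes calculator: it assumes the observed standardized effect is approximately normally distributed around the true effect with a known standard error, in which case the Bayes factor for a normal prior has a closed form. The function name `bf01_normal_prior`, the observed effect, and the standard error are hypothetical illustrations, not values from the paper.

```python
import math
from statistics import NormalDist

def bf01_normal_prior(obs, se, prior_mean, prior_sd):
    """Bayes factor for H0 (effect = 0) over H1 (effect ~ Normal(prior_mean, prior_sd)).

    Assumes a normal likelihood: obs ~ Normal(true effect, se).
    """
    # Density of the observed effect if the true effect is exactly zero.
    like_h0 = NormalDist(0.0, se).pdf(obs)
    # Under H1 both the prior and the sampling error are normal, so the
    # marginal distribution of obs is normal with the two variances added.
    marginal_sd = math.sqrt(se**2 + prior_sd**2)
    like_h1 = NormalDist(prior_mean, marginal_sd).pdf(obs)
    return like_h0 / like_h1

# Hypothetical data: an observed effect of 0 with a standard error of 0.15.
# The prior SD of 0.15 is the one quoted from the paper; the prior means vary.
for prior_mean in (0.2, 0.35, 0.5):
    bf = bf01_normal_prior(obs=0.0, se=0.15, prior_mean=prior_mean, prior_sd=0.15)
    print(f"prior mean {prior_mean:.2f}: BF01 = {bf:.2f}")
```

With these made-up inputs, the support for the null grows as the prior mean moves away from zero, so the same data can look weakly or strongly supportive of the null depending on the alternative that was specified.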

It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis, and it is the reason my spreadsheet and R package “TOSTER” allow you to specify equivalence bounds in standardized effect sizes when performing an equivalence test.

In equivalence testing, we can test whether the observed data is surprisingly smaller than anything we would expect. The authors seem to find a true effect of d = 0.5 a realistic alternative model. So, a good start is to try to reject an effect of d = 0.5. We can just fill in the means, standard deviations, and sample sizes from both groups, and test against the equivalence bound of d = 0.5 (see the code at the bottom of the post). Note that the authors perform a two-sided test, even though they have a one-sided hypothesis (as indicated in the title “Does volunteering improve well-being?”). Following the authors, I will test whether the effect is statistically smaller than d = 0.5, and statistically larger than d = -0.5, instead of only testing whether the effect is smaller than d = 0.5. The most important results are summarized in the Figure below:

Testing the effect for WSB, one of the well-being measures, the standardized effect size of 0.5 equals a raw effect of 0.762 in scale points on the original measure. Because the 90% confidence interval around the mean difference does not contain -0.762 or 0.762, the observed data is surprising (a.k.a. statistically significant) if there were a true effect of d = -0.5 or d = 0.5 (see Lakens, 2017, for a detailed explanation). We can reject the hypothesis that d = -0.5 or d = 0.5, and if we do this, given our alpha of 0.05, we would be wrong a maximum of 5% of the time, in the long run. Other people might find smaller effects still of interest. They can collect more data, and perform an equivalence test in a meta-analysis.
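The arithmetic behind this paragraph can be sketched in a few lines of Python. The bound and the confidence interval are the values reported in this post; the standard deviation is only inferred from the statement that d = 0.5 corresponds to 0.762 scale points, and the helper name `ci_within_bounds` is made up for illustration.

```python
# Values reported in the post for this measure.
d_bound = 0.5                      # equivalence bound in standardized units
raw_bound = 0.762                  # the same bound in raw scale points
ci_low, ci_high = -0.483, 0.295    # 90% CI around the mean difference

# Standardizer implied by the two bounds (an inference, not a reported value).
implied_sd = raw_bound / d_bound

def ci_within_bounds(ci_low, ci_high, bound):
    """True when the whole interval lies strictly inside (-bound, +bound)."""
    return -bound < ci_low and ci_high < bound

print(f"implied SD: {implied_sd:.3f}")
print("equivalent at d = 0.5:", ci_within_bounds(ci_low, ci_high, raw_bound))
```

Checking the 90% interval against the bounds gives the same decision as running the two one-sided tests at an alpha of 0.05.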

We could write: Using a TOST procedure to test the data against equivalence bounds of d = -0.5 and d = 0.5, the observed results were statistically equivalent to zero, t(78.24) = -2.86, p = 0.003. The mean difference was -0.094, 90% CI[-0.483; 0.295].
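A test of this kind can be sketched from summary statistics alone. The Python below is a standard-library illustration of a Welch-based TOST, not the TOSTER implementation: the t distribution is integrated numerically rather than taken from a statistics library, and the means, standard deviations, and sample sizes in the example call are hypothetical, because the paper's descriptives are not reproduced in this post. Only the last check uses reported numbers (the t-value and degrees of freedom above).

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] plus symmetry around zero."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def welch_tost(m1, sd1, n1, m2, sd2, n2, low, high):
    """Two one-sided Welch t-tests against raw equivalence bounds low < high."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom (no equal-variance assumption).
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    diff = m1 - m2
    t_low = (diff - low) / se     # tests H0: diff <= low
    t_high = (diff - high) / se   # tests H0: diff >= high
    p_low = 1 - t_cdf(t_low, df)
    p_high = t_cdf(t_high, df)
    # Equivalence is claimed only if both tests reject, so report the larger p.
    return {"diff": diff, "df": df, "t_low": t_low, "t_high": t_high,
            "p": max(p_low, p_high)}

# Hypothetical summary statistics, for illustration only.
result = welch_tost(m1=5.0, sd1=1.4, n1=45, m2=5.1, sd2=1.6, n2=40,
                    low=-0.75, high=0.75)
print(result)

# Consistency check on the reported result: one-sided p for |t| = 2.86, df = 78.24.
p_reported = 1 - t_cdf(2.86, 78.24)
print(round(p_reported, 3))
```

The final line should round to the p = 0.003 reported above. The sign of the t-value depends only on which group mean is subtracted from which.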

Benefits of equivalence tests compared to Bayes factors

If we perform equivalence tests, we see that we can conclude statistical equivalence for all nine measures. You might wonder about whether we need to correct for the fact that we perform nine tests for all the different well-being measures. Would we conclude that volunteering has a positive effect on well-being, if any single one of these tests showed a significant effect? If so, we should indeed correct for multiple comparisons to control our overall Type 1 error rate, and you can do this in equivalence testing. There is no easy way to control error rates in Bayesian statistics. Some Bayesians simply don’t care about error control, and I don’t exactly know what Bayesians who care about error control do. I care about error control, and the attention p-hacking is getting suggests I am not alone. In equivalence testing, you can control the Type 1 error rate simply by adjusting the alpha level, which is one benefit of equivalence testing over Bayes factors.
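As a small illustration of adjusting the alpha level, the sketch below applies a Bonferroni correction to nine tests. Bonferroni is only one possible correction and is an assumption here, since the post does not name a method. The p-value is the rounded value reported above for one measure, so the comparison is indicative rather than exact.

```python
alpha = 0.05
n_tests = 9   # nine well-being measures

# Bonferroni: test each measure at alpha / n_tests.
alpha_per_test = alpha / n_tests

# A TOST at level a corresponds to a (1 - 2a) confidence interval.
ci_level = 1 - 2 * alpha_per_test

p_reported = 0.003   # rounded TOST p-value reported above for one measure

print(f"per-test alpha: {alpha_per_test:.4f}")
print(f"matching CI level: {ci_level:.1%}")
print("still equivalent after correction:", p_reported < alpha_per_test)
```

The corrected test is equivalent to checking the bounds against a wider confidence interval than the 90% interval used for a single test.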

To calculate a Bayes factor, you need to specify your prior by providing the mean and standard deviation of the alternative. Bayes factors are quite sensitive to how you specify these priors, and for this reason, not every Bayesian statistician would recommend the use of Bayes factors. Andrew Gelman, a widely known Bayesian statistician, recently co-authored a paper in which Bayes factors were used as one of three Bayesian approaches to re-analyze data. In footnote 3 it is written: “Andrew Gelman wishes to state that he hates Bayes factors” – mainly because of this sensitivity to priors. So not everyone likes Bayes factors (just like not everyone likes p-values!). You can discuss the sensitivity to priors in a sensitivity analysis, which would mean plotting Bayes factors for alternative models with a range of means and standard deviations and different distributions, but I rarely see this done in practice. Equivalence tests also depend on the choice of the equivalence bounds. But it is very easy to see the effect of different equivalence bounds on the test result – you can just check if the equivalence bound you would have chosen falls within the 90% confidence interval. So that is a second benefit of equivalence testing.
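The check described for equivalence bounds is easy to automate. The sketch below scans a few candidate bounds against the 90% confidence interval reported earlier in this post. The standard deviation used to convert d to raw units is inferred from the statement that d = 0.5 equals 0.762 scale points, and the candidate bounds and the helper name `equivalent_for` are arbitrary illustrations.

```python
ci_low, ci_high = -0.483, 0.295   # 90% CI reported in the post (raw scale points)
implied_sd = 0.762 / 0.5          # inferred from "d = 0.5 equals 0.762"

def equivalent_for(d_bound):
    """True if the 90% CI lies strictly inside (-d_bound, +d_bound) in raw units."""
    raw = d_bound * implied_sd
    return -raw < ci_low and ci_high < raw

# Which symmetric bounds would lead to a conclusion of equivalence?
results = {d: equivalent_for(d) for d in (0.2, 0.3, 0.4, 0.5)}
print(results)

# Smallest symmetric bound (in d) that the interval would still fit inside.
smallest_d = max(abs(ci_low), abs(ci_high)) / implied_sd
print(f"bounds wider than d = {smallest_d:.2f} give equivalence")
```

A reader who prefers a different smallest effect size of interest can read the answer off the interval in the same way, without re-running the analysis.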

The authors used a power analysis to determine the sample size they needed (page 7): "To achieve 80% power to detect an effect size of r = 0.21 (d = 0.40), we required at least 180 participants to detect significant effects of volunteering on our SWB measures of interest." But what was the power of the study to support the null? Although you can simulate everything in R, there is no software to perform power analysis for Bayes factors (indeed, 'power' is a Frequentist concept). When performing an equivalence test, you can easily perform a power analysis to make sure you have a well-powered study if there is an effect, and when there is no effect (and the spreadsheet and R package allow you to do this). When pre-registering a study, you need to justify your sample size, both for when the alternative hypothesis is true and for when the null hypothesis is true. The ease with which you can perform power calculations is another benefit of equivalence tests.
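A power analysis for an equivalence test can be sketched with the normal approximation. The Python below assumes two independent groups of equal size, symmetric bounds in Cohen's d, and a true effect of exactly zero; the function names are hypothetical and this is an approximation, not the spreadsheet or the R package.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def tost_power(n_per_group, d_bound, alpha=0.05):
    """Approximate power of a two-group TOST with bounds -d_bound and +d_bound,
    assuming the true effect is zero."""
    power = 2 * Z.cdf(d_bound * math.sqrt(n_per_group / 2) - Z.inv_cdf(1 - alpha)) - 1
    return max(power, 0.0)

def tost_n_per_group(d_bound, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to reach the desired power."""
    beta = 1 - power
    n = 2 * (Z.inv_cdf(1 - alpha) + Z.inv_cdf(1 - beta / 2)) ** 2 / d_bound**2
    return math.ceil(n)

n = tost_n_per_group(0.5)
print(n, round(tost_power(n, 0.5), 3))
```

Under these assumptions, bounds of d = ±0.5 with an alpha of 0.05 and 80% power require 69 participants per group. Narrower bounds increase the required sample size quickly, because n scales with 1/d².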

A final benefit I’d like to discuss concerns the assumptions of statistical tests. You should not perform tests when their assumptions are violated. The authors in the paper examining the effect of volunteering on well-being correctly report Welch’s t-tests, because they have unequal sample sizes in each group, and the equal variances assumption is violated. This is excellent practice. I don’t know how Bayes factors deal with unequal variances (I think they don’t, and simply assume equal variances, but I’m sure the answer will appear in the comments, if there is one). My TOST equivalence test spreadsheet and R code use Welch’s t-test by default (just as R does), so unequal variances are no longer a problem. The equal variances assumption is not very plausible in many research questions in psychology (Delacre, Lakens, & Leys, under review), so not having to assume equal variances is another benefit of equivalence testing compared to Bayes factors.

Conclusion

Only reporting Bayes factors seems, to me, an incomplete description of the data. I think it makes sense to report an effect size, the mean difference, and the confidence interval around it. And if you do that, and have determined a smallest effect size of interest, then performing the TOST equivalence testing procedure is nothing more than checking and reporting whether the p-value for the TOST procedure is smaller than your alpha level to conclude the effect is statistically equivalent. And you can still add a Bayes factor, if you want.

All approaches to statistical inference have strengths and weaknesses. In most situations, both Bayes factors and equivalence tests lead to conclusions that have the same practical consequences. Whenever they do not, it is never the case that one approach is correct, and one is wrong – the answers differ because the tests have different assumptions, and you will have to think about your data more, which is never a bad thing. In the end, as long as you share the data of your paper online, as the current authors did, anyone can calculate the statistics they like. But only reporting Bayes factors is not really enough to describe your data. You might want to at least report means and standard deviations, so that people who want to include the effect size in a meta-analysis don’t need to re-analyze your data. And you might want to try out equivalence tests next time you interpret null results.

26 comments:

Anonymous, January 30, 2017 at 12:23 PM
"The study was published in Comprehensive Results in Social Psychology, one of the highest quality journals in social psychology, which peer-reviews pre-registrations of studies before they are performed"

Can you find the pre-registration information anywhere?

There is a link to an OSF project in the article, but I can't find the pre-registration information. When I click on the tab "registrations" it states:

"There have been no completed registrations of this project."
Anonymous, January 30, 2017 at 12:57 PM
I agree wholeheartedly that just reporting Bayes Factors is a very poor way to report data. I'm pretty sure any Bayesian would also agree, even those who support Bayes factors. The BF itself is just a random variable. I also agree that reporting the descriptive statistics you're recommending is wise as well, except that the CI should be a Bayesian CI because you're confusing the meaning of the CI you're using and the Bayes Factor.

A Bayesian states a belief and updates that belief based on evidence. A Bayesian can say, "I believe this coin is fair," and flip the coin once to see if she is right. She will of course correctly take this one flip as very little evidence to modify her belief and update it as she adds more evidence. This is a continuous process without "tests" per se. A Bayes Factor is not a test but is a way to quantify belief. The Bayes CI is a statement of where one believes a measure is likely to be with certainty attached to various components of it and a belief that it's not just one value but something flexible related to the density of the CI.

A frequentist has no ability to quantify the belief. He can only generate statistics with long run probabilities and therefore the scenario above is ridiculous. That doesn't mean that he won't update his beliefs based on tests with good frequentist properties as any rational person would. It's just that the outcome of the test, or CI, doesn't actually measure the belief. A frequentist CI only gets its frequentist properties if the statement is that the true value is in the CI without equivocation and without any ability to say anything relative about values within the CI.

Therefore, the Bayes CI and frequentist CI are two very different things. Combining frequentist tests and Bayes Factors ends up producing a hodge podge that does nothing to further the field and ends up confusing the fundamental meaning of both measures. Cheerful articles about how we should all be able to just get along and let's integrate Bayes and frequentist stats show a lack of understanding of both... that whole 80% I suppose.
Andy Grieve, January 30, 2017 at 6:28 PM
I am sure there is more modern work on Bayes Factors in a Behrens context. The one I remember from the 1980s working on my PhD is:

Bayes Factors for Behrens-Fisher Problems
Hari H. Dayal and James M. Dickey
Sankhyā: The Indian Journal of Statistics, Series B (1960-2002)
Vol. 38, No. 4 (Nov., 1976), pp. 315-328
Jan, January 30, 2017 at 6:45 PM
Hello Daniël,

I realise this isn't the focus of your blog post, but would you like to elaborate on the following?

"It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis"

We've had our discussions on standardised ESs, and as you may recall, I think they're overused. The paper you discuss nicely illustrates why I think so: For the power analysis on p. 7, the authors didn't derive their effect size under H1 from theory, practical considerations or previous work on the topic at hand. Rather, they went for the typical mean difference-to-sample standard deviation ratio from a likely p-hacked literature (pre-2003 social psychology).

For the Bayes factor analyses, they similarly resorted to everyone's favourite standardised effect size ('moderate = d of 0.5') for all 10 variables in Table 2. (I don't really understand why the effect size would now be a different one, but I've only scanned the paper.) Is that really the authors' alternative hypothesis or just a convenient fiction?

It seems to me that rather than having an alternative hypothesis that, if the intervention 'worked', it would produce approximately a mean difference-to-sample standard deviation ratio of 0.4 (or 0.5+/-0.15), the authors didn't have an alternative hypothesis, but since they needed a power analysis and got a null result, they needed to come up with canned ones.

To be clear, I'm not blaming the authors here since few papers on power analysis and, presumably, Bayes factors provide guidance on working with genuine alternative hypotheses. But, to paraphrase John Tukey, are we really so uninterested in our hypotheses that we don't care about their units? So is using standardised effect sizes the easiest way to think about the alternative hypothesis or the most convenient way to avoid having to think about it?
matan, January 31, 2017 at 10:36 AM
I don't think the adoption of TOST for the purpose of "examining non-significant results" (as in the title) or "interpreting non-significant results" (as in the text body) is responsible from a frequentist perspective, for two reasons. First, performing a significance test conditioned on a null result in a previous test increases your type 1 error rate by definition. Second, even if preregistering the TOST together with the usual t-test, the two tests cannot be treated as two independent tests and so the desired alpha should be split between them according to the researcher's prior.

Once observing a null result, all alpha has been used in full, and error rates of any additional test on the same contrast cannot even be approximated (e.g., http://www.sciencedirect.com/science/article/pii/S0001691814000304). This is one reason why the adoption of Bayes factors as a tool for examining non-significant results can't be justified from an NP perspective, and only makes sense from a Bayesian one. I think using TOST after observing a null result is no different from doing a right-tailed t-test, getting a null, and then performing a left-tailed test on the same set of data. Alpha is guaranteed to inflate. TOST is exactly the same, except that instead of right or left tails, you spread your alpha also at the center of the distribution. This additional rejection area should somehow be paid for.

The second argument follows from the first one. Even if you perform the two tests (t and TOST) to test two separate hypotheses (one is that the effect is different from zero, the second that it is outside the equivalence region), you should split your alpha between the tests because there's a one to one mapping between the null distributions of the two tests. All you do is increase your rejection area, without paying for it.

Kruschke has a nice way to compensate for this additional rejection area (ROPE). The equivalent for our case would be that once you decided on your equivalence bounds, they should also be used for your original t-test (i.e., instead of showing the CI doesn't include 0, it should not intersect with the equivalence region). This way bigger equivalence regions don't only make "accepting the null" easier, but also make it harder to reject it. Not sure this has the desired frequentist properties (maintain alpha) - but would be interesting to examine.

Cheers,
- Matan
matan, January 31, 2017 at 10:47 AM
Hi Daniel,
Following our twitter conversation I wrote a comment and it now disappeared - have you erased it?
Thanks,
- Matan
Jay, March 4, 2017 at 6:11 AM
Performing both a test of the null hypothesis that the effect is 0 and an equivalence test that the effect is, say, < |.5| seems to suggest that the investigators are confused. If they believe that effects < |.5| are equivalent to 0 for all practical purposes, then why would they care about whether the null hypothesis that the effect is 0 is rejected? Clearly, rejecting that null would not imply that the effect size is large enough to be considered different from 0 for all practical purposes.

Instead, it seems to me what would matter is whether the confidence or credible interval were (a) entirely within the equivalence limits, (b) entirely outside the equivalence limits, or (c) straddling an equivalence limit. From case (a) we would infer equivalence; from case (b) we would infer superiority; and case (c) would be indeterminate.