A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, January 29, 2017

Examining Non-Significant Results with Bayes Factors and Equivalence Tests

In this blog, I’ll compare two ways of interpreting non-significant effects: Bayes factors and TOST equivalence tests. I’ll explain why reporting more than only Bayes factors makes sense, and highlight some benefits of equivalence testing over Bayes factors. I’d like to say a big thank you to Bill (Lihan) Chen and Victoria Savalei for helping me out super-quickly with my questions as I was re-analyzing their data.

Does volunteering improve well being? A recent article by Ashley Whillans, Scott Seider, Lihan Chen, Ryan Dwyer, Sarah Novick, Kathryn Gramigna, Brittany Mitchell, Victoria Savalei, Sally Dickerson & Elizabeth W. Dunn suggests the answer is: Not so much. The study was published in Comprehensive Results in Social Psychology, one of the highest quality journals in social psychology, which peer-reviews pre-registrations of studies before they are performed.

People were randomly assigned to a volunteering program for 6 months, or to a control condition. Before and after, a wide range of well-being measures were collected. Bayes factors support the null for all measures. The main results (and indeed, except for some manipulation checks, the only results – not even means or standard deviations are provided in the article) are communicated in the form of Bayes factors in Table 2.

The Bayes factors were calculated using the Bayes factor calculator by Zoltan Dienes, who has a great open access paper in Frontiers, cited more than 200 times since 2014, on how to use Bayes to get most out of non-significant results. I won’t try to explain in detail how these Bayes factors are calculated – too many Bayesians on Twitter have told me I am too stupid to understand the math behind Bayes factors, and how I should have taken calculus in high school. They are right on both accounts, so just read Dienes (2014) for an explanation.

As Dienes (2014) discusses, you can also interpret non-significant results using Frequentist statistics. In a TOST equivalence test, which consists of two simple one-sided t-tests, you determine whether an effect falls between equivalence bounds set to the smallest effect size you care about (for an introduction, see Lakens, 2017). Dienes (2014) says it can be difficult to determine what this smallest effect size of interest is, but for me, if anything, it is easier to determine a smallest effect size of interest than to specify an alternative model in Bayesian statistics.

The authors examined whether well-being was improved by volunteering, and specified an alternative model (what would a true effect of improved well-being look like?) as follows (page 9): “Because our goal was to contrast the null hypothesis to an alternative hypothesis that the effect is moderate in size, we used a normal distribution prior with a mean of 0.50 and a standard deviation of 0.15 for the standardized effect size (e.g. the difference score between standardized T2 and T1 measures).

It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis, and it is the reason my spreadsheet and R package “TOSTER” allow you to specify equivalence bounds in standardized effect sizes when performing an equivalence test.

In equivalence testing, we can test whether the observed data is surprisingly smaller than anything we would expect. The authors seem to find a true effect of d = 0.5 a realistic alternative model. So, a good start is to try to reject an effect of d = 0.5. We can just fill in the means, standard deviations, and sample sizes from both groups, and test against the equivalence bound of d = 0.5 (see the code at the bottom of the post). Note that the authors perform a two-sided test (even though they have a one-sided hypothesis, as indicated in the title “Does volunteering improve well-being?”, but following the authors, I will test whether the effect is statistically smaller than d = 0.5, and statistically larger than d = -0.5, instead of only testing whether the effect is smaller than d = 0.5). The most important results are summarized in the Figure below:

Testing the effect for WSB, one of the well-being measures, the standardized effect size of 0.5 equals a raw effect of 0.762 in scale points on the original measure. Because the 90% confidence interval around the mean difference does not contain -0.762 or 0.762, the observed data is surprising (a.k.a statistically significant), if there was a true effect of d = -0.5 or d = 0.5 (see Lakens, 2017, for a detailed explanation). We can reject the hypothesis that d = -0.5 or d = 0.5, and if we do this, given our alpha of 0.05, we would be wrong a maximum of 5% of the time, in the long run. Other people might find smaller effects still of interest. They can collect more data, and perform an equivalence test in a meta-analysis. 

We could write: Using a TOST procedure to test the data against equivalence bounds of d = -0.5 and d = 0.5, the observed results were statistically equivalent to zero, t(78.24) = -2.86, p = 0.003. The mean difference was -0.094, 90% CI[-0.483; 0.295].

Benefits of equivalence tests compared to Bayes factors.

If we perform equivalence tests, we see that we can conclude statistical equivalence for all nine measures. You might wonder about whether we need to correct for the fact that we perform nine tests for all the different well-being measures. Would we conclude that volunteering has a positive effect on well-being, if any single one of these tests showed a significant effect? If so, we should indeed correct for multiple comparisons to control our overall Type 1 error rate, and you can do this in equivalence testing. There is no easy way to control error rates in Bayesian statistics. Some Bayesians simply don’t care about error control, and I don’t exactly know what Bayesian who care about error control do. I care about error control, and the attention p-hacking is getting suggests I am not alone. In equivalence testing, you can control the Type 1 error rate simply by adjusting the alpha level, which is one benefit of equivalence testing over Bayes factors.

To calculate a Bayes factor, you need to specify your prior by providing the mean and standard deviation of the alternative. Bayes factors are quite sensitive to how you specify these priors, and for this reason, not every Bayesian statistician would recommend the use of Bayes factors. Andrew Gelman, a widely known Bayesian statistician, recently co-authored a paper in which Bayes factors were used as one of three Bayesian approaches to re-analyze data. In footnote 3 it is written: “Andrew Gelman wishes to state that he hates Bayes factors” – mainly because of this sensitivity to priors. So not everyone likes Bayes factors (just like not everyone likes p-values!). You can discuss the sensitivity to priors in a sensitivity analysis, which would mean plotting Bayes factors for alternative models with a range of means and standard deviations and different distributions, but I rarely see this done in practice. Equivalence tests also depend on the choice of the equivalence bounds. But it is very easy to see the effect of different equivalence bounds on the test result – you can just check if the equivalence bound you would have chosen falls within the 90% confidence interval. So that is a second benefit of equivalence testing.

The authors used a power analysis to determine the sample size they needed (page 7): "To achieve 80% power to detect an effect size of r = 0.21 (d = 0.40), we required at least 180 participants to detect significant effects of volunteering on our SWB measures of interest." But what was the power of the study to support the null? Although you can simulate everything in R, there is no software to perform power analysis for Bayes factors (indeed, 'power' is a Frequentist concept). When performing an equivalence test, you can easily perform a power analysis to make sure you have a well-powered study if there is an effect, and when there is no effect (and the spreadsheet and R package allow you to do this). When pre-registering a study, you need to justify your sample size, both with an eye for when the alternative hypothesis is true, as when the null hypothesis is true. The ease with which you can perform power calculations is another benefit of equivalence tests.

A final benefit I’d like to discuss concerns the assumptions of statistical tests. You should not perform tests when their assumptions are violated. The authors in the paper examining the effect of volunteering on well-being correctly report Welch’s t-tests, because they have unequal sample sizes in each group, and the equal variances assumption is violated. This is excellent practice. I don’t know how Bayes factors deal with unequal variances (I think they don’t, and simply assume equal variances, but I’m sure the answer will appear in the comments, if there is one). My TOST equivalence test spreadsheet and R code use Welch’s t-test by default (just as R does), so unequal variances is no longer a problem. The equal variances assumption is not very plausible in many research questions in psychology (Delacre, Lakens, & Leys, under review), so not having to assume equal variances is another benefit of equivalence testing compared to Bayes factors.


Only reporting Bayes factors seems, to me, an incomplete description of the data. I think it makes sense to report an effect size, the mean difference, and the confidence interval around it. And if you do that, and have determined a smallest effect size of interest, then performing the TOST equivalence testing procedure is nothing more than checking and reporting whether the p-value for the TOST procedure is smaller than your alpha level to conclude the effect is statistically equivalent. And you can still add a Bayes factor, if you want.

All approaches to statistical inferences have strengths and weaknesses. In most situations, both Bayes factors and equivalence tests lead to conclusions that have the same practical consequences. Whenever they do not, it is never the case that one approach is correct, and one is wrong – the answers differ because the tests have different assumptions, and you will have to think about your data more, which is never a bad thing. In the end, as long as you share the data of your paper online, as the current authors did, anyone can calculate the statistics they like. But only reporting Bayes factors is not really enough to describe your data. You might want to at least report means and standard deviations, so that people who want to include the effect size in a meta-analysis don’t need to re-analyze your data. And you might want to try out equivalence tests next time you interpret null results.