A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, January 30, 2017

Examining Non-Significant Results with Bayes Factors and Equivalence Tests

In this blog post, I’ll compare two ways of interpreting non-significant effects: Bayes factors and TOST equivalence tests. I’ll explain why reporting more than only Bayes factors makes sense, and highlight some benefits of equivalence testing over Bayes factors. I’d like to say a big thank you to Bill (Lihan) Chen and Victoria Savalei for helping me out super-quickly with my questions as I was re-analyzing their data.

Does volunteering improve well-being? A recent article by Ashley Whillans, Scott Seider, Lihan Chen, Ryan Dwyer, Sarah Novick, Kathryn Gramigna, Brittany Mitchell, Victoria Savalei, Sally Dickerson & Elizabeth W. Dunn suggests the answer is: Not so much. The study was published in Comprehensive Results in Social Psychology, one of the highest quality journals in social psychology, which peer-reviews pre-registrations of studies before they are performed.

People were randomly assigned to a volunteering program for 6 months, or to a control condition. Before and after, a wide range of well-being measures were collected. Bayes factors support the null for all measures. The main results (and indeed, except for some manipulation checks, the only results – not even means or standard deviations are provided in the article) are communicated in the form of Bayes factors in Table 2.

The Bayes factors were calculated using the Bayes factor calculator by Zoltan Dienes, who has a great open access paper in Frontiers, cited more than 200 times since 2014, on how to use Bayes to get the most out of non-significant results. I won’t try to explain in detail how these Bayes factors are calculated – too many Bayesians on Twitter have told me I am too stupid to understand the math behind Bayes factors, and how I should have taken calculus in high school. They are right on both counts, so just read Dienes (2014) for an explanation.

As Dienes (2014) discusses, you can also interpret non-significant results using Frequentist statistics. In a TOST equivalence test, which consists of two simple one-sided t-tests, you determine whether an effect falls between equivalence bounds set to the smallest effect size you care about (for an introduction, see Lakens, 2017). Dienes (2014) says it can be difficult to determine what this smallest effect size of interest is, but for me, if anything, it is easier to determine a smallest effect size of interest than to specify an alternative model in Bayesian statistics.

The authors examined whether well-being was improved by volunteering, and specified an alternative model (what would a true effect of improved well-being look like?) as follows (page 9): “Because our goal was to contrast the null hypothesis to an alternative hypothesis that the effect is moderate in size, we used a normal distribution prior with a mean of 0.50 and a standard deviation of 0.15 for the standardized effect size (e.g. the difference score between standardized T2 and T1 measures).”
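To make the role of this prior concrete, here is a minimal sketch of a Bayes factor for a normal prior on the effect, assuming the observed effect is approximately normally distributed around the true effect. Under that assumption the marginal distribution of the data under the alternative is again normal, so no numerical integration is needed. The observed effect and standard error below are hypothetical illustration values, not numbers from the article, and the Dienes calculator may differ in its details.

```python
from math import sqrt
from statistics import NormalDist


def bf10_normal_prior(obs, se, prior_mean, prior_sd):
    """Bayes factor (H1 over H0) for a normal likelihood and a normal prior.

    Under H1 the true effect ~ Normal(prior_mean, prior_sd), so the observed
    effect is marginally Normal(prior_mean, sqrt(se^2 + prior_sd^2)).
    Under H0 the true effect is exactly 0.
    """
    marginal_h1 = NormalDist(prior_mean, sqrt(se**2 + prior_sd**2)).pdf(obs)
    likelihood_h0 = NormalDist(0.0, se).pdf(obs)
    return marginal_h1 / likelihood_h0


# Hypothetical observed standardized effect and standard error (assumptions).
obs_d, se_d = -0.06, 0.15

# Prior from the quoted passage: mean 0.50, SD 0.15.
bf10 = bf10_normal_prior(obs_d, se_d, prior_mean=0.50, prior_sd=0.15)
bf01 = 1 / bf10  # evidence for the null relative to the alternative

# Same data, much wider (hypothetical) prior: the Bayes factor changes.
bf10_wide = bf10_normal_prior(obs_d, se_d, prior_mean=0.50, prior_sd=0.50)

print(f"BF01 with prior SD 0.15: {bf01:.1f}")
print(f"BF01 with prior SD 0.50: {1 / bf10_wide:.1f}")
```

Running the same data through two priors shows how much the resulting Bayes factor depends on the alternative model you specify.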

It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis, and it is the reason my spreadsheet and R package “TOSTER” allow you to specify equivalence bounds in standardized effect sizes when performing an equivalence test.

In equivalence testing, we can test whether the observed data is surprisingly smaller than anything we would expect. The authors seem to find a true effect of d = 0.5 a realistic alternative model. So, a good start is to try to reject an effect of d = 0.5. We can just fill in the means, standard deviations, and sample sizes from both groups, and test against the equivalence bound of d = 0.5 (see the code at the bottom of the post). Note that the authors perform a two-sided test, even though they have a one-sided hypothesis (as indicated in the title “Does volunteering improve well-being?”). Following the authors, I will test whether the effect is statistically smaller than d = 0.5 and statistically larger than d = -0.5, instead of only testing whether the effect is smaller than d = 0.5. The most important results are summarized in the Figure below:

Testing the effect for WSB, one of the well-being measures, the standardized effect size of 0.5 equals a raw effect of 0.762 scale points on the original measure. Because the 90% confidence interval around the mean difference does not contain -0.762 or 0.762, the observed data would be surprising (a.k.a. statistically significant) if there were a true effect of d = -0.5 or d = 0.5 (see Lakens, 2017, for a detailed explanation). We can reject the hypothesis that d = -0.5 or d = 0.5, and if we do this, given our alpha of 0.05, we would be wrong a maximum of 5% of the time, in the long run. Other people might find smaller effects still of interest. They can collect more data, and perform an equivalence test in a meta-analysis.

We could write: Using a TOST procedure to test the data against equivalence bounds of d = -0.5 and d = 0.5, the observed results were statistically equivalent to zero, t(78.24) = -2.86, p = 0.003. The mean difference was -0.094, 90% CI[-0.483; 0.295].
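The two one-sided tests can be sketched in a few lines. This is a rough approximation under stated assumptions, not the TOSTER implementation: it uses a normal distribution instead of Welch's t-distribution, and it recovers an approximate standard error from the reported 90% CI because the standard error itself is not given here. The function name is made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tost_p_normal(diff, se, bound):
    """TOST p-value under a normal approximation (an assumption).

    Test 1: H0 is 'true difference <= -bound'; reject if diff is far above -bound.
    Test 2: H0 is 'true difference >= +bound'; reject if diff is far below +bound.
    Equivalence is declared only if BOTH tests reject, so report the larger p.
    """
    p_lower = 1 - STD_NORMAL.cdf((diff + bound) / se)
    p_upper = STD_NORMAL.cdf((diff - bound) / se)
    return max(p_lower, p_upper)


# Values reported in the text for the WSB measure.
mean_diff = -0.094
ci_low, ci_high = -0.483, 0.295
raw_bound = 0.762  # d = 0.5 expressed in raw scale points

# Approximate SE from the 90% CI half-width (normal critical value, an assumption).
z_90 = STD_NORMAL.inv_cdf(0.95)
se_approx = (ci_high - ci_low) / (2 * z_90)

p_tost = tost_p_normal(mean_diff, se_approx, raw_bound)
print(f"approximate TOST p-value: {p_tost:.4f}")
```

Because of the normal approximation, the result should land close to, but not exactly on, the reported p = 0.003.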

Benefits of equivalence tests compared to Bayes factors.

If we perform equivalence tests, we see that we can conclude statistical equivalence for all nine measures. You might wonder about whether we need to correct for the fact that we perform nine tests for all the different well-being measures. Would we conclude that volunteering has a positive effect on well-being, if any single one of these tests showed a significant effect? If so, we should indeed correct for multiple comparisons to control our overall Type 1 error rate, and you can do this in equivalence testing. There is no easy way to control error rates in Bayesian statistics. Some Bayesians simply don’t care about error control, and I don’t exactly know what Bayesians who care about error control do. I care about error control, and the attention p-hacking is getting suggests I am not alone. In equivalence testing, you can control the Type 1 error rate simply by adjusting the alpha level, which is one benefit of equivalence testing over Bayes factors.
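As a small illustration of adjusting the alpha level, the sketch below applies a Bonferroni correction for nine tests. Bonferroni is only one possible correction and is chosen here as an assumption, since the text does not name a method. The only p-value used is the one reported above for the WSB measure.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


overall_alpha = 0.05
n_measures = 9  # nine well-being measures, as stated in the text

adjusted_alpha = bonferroni_alpha(overall_alpha, n_measures)

# TOST p-value reported in the text for WSB.
p_wsb = 0.003
still_equivalent = p_wsb < adjusted_alpha

print(f"adjusted alpha: {adjusted_alpha:.4f}")
print(f"WSB equivalent after correction: {still_equivalent}")
```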

To calculate a Bayes factor, you need to specify your prior by providing the mean and standard deviation of the alternative. Bayes factors are quite sensitive to how you specify these priors, and for this reason, not every Bayesian statistician would recommend the use of Bayes factors. Andrew Gelman, a widely known Bayesian statistician, recently co-authored a paper in which Bayes factors were used as one of three Bayesian approaches to re-analyze data. In footnote 3 it is written: “Andrew Gelman wishes to state that he hates Bayes factors” – mainly because of this sensitivity to priors. So not everyone likes Bayes factors (just like not everyone likes p-values!). You can discuss the sensitivity to priors in a sensitivity analysis, which would mean plotting Bayes factors for alternative models with a range of means and standard deviations and different distributions, but I rarely see this done in practice. Equivalence tests also depend on the choice of the equivalence bounds. But it is very easy to see the effect of different equivalence bounds on the test result – you can just check if the equivalence bound you would have chosen falls within the 90% confidence interval. So that is a second benefit of equivalence testing.
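The check against the confidence interval can be turned around: given a 90% CI, you can read off the smallest symmetric bound that would still lead to a conclusion of equivalence. The sketch below does this for the WSB interval reported above. The standard deviation used to convert back to d is the one implied by the text (0.762 raw points for d = 0.5), which is an inference rather than a reported value.

```python
def smallest_symmetric_bound(ci_low, ci_high):
    """Any symmetric bound strictly larger than this contains the whole CI."""
    return max(abs(ci_low), abs(ci_high))


# 90% CI for the WSB mean difference, as reported in the text.
ci_low, ci_high = -0.483, 0.295

min_raw_bound = smallest_symmetric_bound(ci_low, ci_high)

# SD implied by 'd = 0.5 equals 0.762 raw points' (an inference from the text).
implied_sd = 0.762 / 0.5
min_d_bound = min_raw_bound / implied_sd

print(f"smallest raw bound: {min_raw_bound:.3f}")
print(f"smallest bound in d: {min_d_bound:.2f}")
```

Someone who would have chosen a bound larger than this value reaches the same conclusion; someone who would have chosen a smaller one does not.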

The authors used a power analysis to determine the sample size they needed (page 7): "To achieve 80% power to detect an effect size of r = 0.21 (d = 0.40), we required at least 180 participants to detect significant effects of volunteering on our SWB measures of interest." But what was the power of the study to support the null? Although you can simulate everything in R, there is no software to perform power analysis for Bayes factors (indeed, 'power' is a Frequentist concept). When performing an equivalence test, you can easily perform a power analysis to make sure you have a well-powered study if there is an effect, and when there is no effect (and the spreadsheet and R package allow you to do this). When pre-registering a study, you need to justify your sample size, both for when the alternative hypothesis is true and for when the null hypothesis is true. The ease with which you can perform power calculations is another benefit of equivalence tests.
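A power calculation for an equivalence test can be approximated in closed form. The sketch below is not the calculation in the spreadsheet or R package: it assumes a true effect of exactly zero, two equally sized groups, and a normal approximation. The split of 180 participants into two groups of 90 is hypothetical (the actual groups were unequal).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tost_power_normal(d_bound, n_per_group, alpha=0.05):
    """Approximate power to declare equivalence within +/- d_bound.

    Assumptions: true effect is 0, equal group sizes, normal approximation
    with the standard error of d taken as sqrt(2 / n_per_group).
    """
    se = sqrt(2 / n_per_group)
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    power = 2 * STD_NORMAL.cdf(d_bound / se - z_alpha) - 1
    return max(power, 0.0)  # the formula goes negative when bounds are too tight


power_90 = tost_power_normal(d_bound=0.5, n_per_group=90)
power_50 = tost_power_normal(d_bound=0.5, n_per_group=50)

print(f"power with 90 per group: {power_90:.2f}")
print(f"power with 50 per group: {power_50:.2f}")
```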

A final benefit I’d like to discuss concerns the assumptions of statistical tests. You should not perform tests when their assumptions are violated. The authors in the paper examining the effect of volunteering on well-being correctly report Welch’s t-tests, because they have unequal sample sizes in each group, and the equal variances assumption is violated. This is excellent practice. I don’t know how Bayes factors deal with unequal variances (I think they don’t, and simply assume equal variances, but I’m sure the answer will appear in the comments, if there is one). My TOST equivalence test spreadsheet and R code use Welch’s t-test by default (just as R does), so unequal variances are no longer a problem. The equal variances assumption is not very plausible in many research questions in psychology (Delacre, Lakens, & Leys, under review), so not having to assume equal variances is another benefit of equivalence testing compared to Bayes factors.
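For readers who want to see what Welch's correction does, here is a minimal sketch of the Welch standard error and the Welch-Satterthwaite degrees of freedom, plus the two TOST t-statistics built on them. The summary statistics are hypothetical, since the article does not report means or standard deviations, and the function names are made up for illustration.

```python
from math import sqrt


def welch_se_and_df(s1, n1, s2, n2):
    """Standard error and Welch-Satterthwaite df for two independent groups."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df


def tost_t_statistics(m1, m2, se, raw_bound):
    """t-statistics for the lower-bound and upper-bound one-sided tests."""
    diff = m1 - m2
    return (diff + raw_bound) / se, (diff - raw_bound) / se


# Hypothetical summary statistics (NOT from the article).
m1, s1, n1 = 5.0, 1.2, 40
m2, s2, n2 = 5.1, 1.8, 45

se, df = welch_se_and_df(s1, n1, s2, n2)
t_lower, t_upper = tost_t_statistics(m1, m2, se, raw_bound=0.762)

print(f"Welch df: {df:.2f}")
print(f"t (lower bound): {t_lower:.2f}, t (upper bound): {t_upper:.2f}")
```

The fractional degrees of freedom are why a result such as t(78.24) appears in the write-up above.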


Only reporting Bayes factors seems, to me, an incomplete description of the data. I think it makes sense to report an effect size, the mean difference, and the confidence interval around it. And if you do that, and have determined a smallest effect size of interest, then performing the TOST equivalence testing procedure is nothing more than checking and reporting whether the p-value for the TOST procedure is smaller than your alpha level to conclude the effect is statistically equivalent. And you can still add a Bayes factor, if you want.

All approaches to statistical inference have strengths and weaknesses. In most situations, both Bayes factors and equivalence tests lead to conclusions that have the same practical consequences. Whenever they do not, it is never the case that one approach is correct, and one is wrong – the answers differ because the tests have different assumptions, and you will have to think about your data more, which is never a bad thing. In the end, as long as you share the data of your paper online, as the current authors did, anyone can calculate the statistics they like. But only reporting Bayes factors is not really enough to describe your data. You might want to at least report means and standard deviations, so that people who want to include the effect size in a meta-analysis don’t need to re-analyze your data. And you might want to try out equivalence tests next time you interpret null results.


  1. "The study was published in Comprehensive Results in Social Psychology, one of the highest quality journals in social psychology, which peer-reviews pre-registrations of studies before they are performed"

    Can you find the pre-registration information anywhere?

    There is a link to an OSF-project in the article, but i can't find the pre-registration information. When i click on the tab "registrations" it states:

    "There have been no completed registrations of this project."

    1. Hi, no, I don't have access to the pre-registration. The pre-registration is handled by the editors. You can contact the authors if you want details (they were very responsive to my questions, they might want to make it public). But the editors and reviewers at CRSP check the pre-registration, there is no requirement to make it public.

    2. What?!?

      That defeats the entire purpose of pre-registration?! Hiding pre-registration from the reader is the exact opposite of open science, in fact i would argue that it is pseudo open science.

      Registered Reports started out quite promising with your special issue (in which pre-registration information *was* included in the articles). As far as i am concerned, the Registered Report format has now been compromised already. Perhaps they should come up with a new format: SSRR's (Super Secret Registered Reports).

      As long as pre-registration information is not publicly accessible to the reader, e.g. via a link in the paper, "Comprehensive Results in Social Psychology" most definitely is not "one of the highest quality journals" in my reasoning. In fact, i think it could be reasoned that they have set a pseudo (open) scientific precedent...

    3. Hi, I think editors and reviewers are capable of checking whether the pre-registration is followed through (even though making the pre-registration public after publishing the article makes total sense to me). One important aspect is that this format prevents publication bias. So it is excellent that these journals have pre-registration. As a reviewer of some articles at CRSP and other registered report journals, I see no problems or super secret sneaky stuff happening, but by all means, e-mail the editors of the journal - they might be willing to change their policy.

  2. I agree wholeheartedly that just reporting Bayes Factors is a very poor way to report data. I'm pretty sure any Bayesian would also agree, even those who support Bayes factors. The BF itself is just a random variable. I also agree that reporting the descriptive statistics you're recommending is wise as well, except that the CI should be a Bayesian CI because you're confusing the meaning of the CI you're using and the Bayes Factor.

    A Bayesian states a belief and updates that belief based on evidence. A Bayesian can say, "I believe this coin is fair," and flip the coin once to see if she is right. She will of course correctly take this one flip as very little evidence to modify her belief and update it as she adds more evidence. This is a continuous process without "tests" per se. A Bayes Factor is not a test but is a way to quantify belief. The Bayes CI is a statement of where one believes a measure is likely to be with certainty attached to various components of it and a belief that it's not just one value but something flexible related to the density of the CI.

    A frequentist has no ability to quantify the belief. He can only generate statistics with long run probabilities and therefore the scenario above is ridiculous. That doesn't mean that he won't update his beliefs based on tests with good frequentist properties as any rational person would. It's just that the outcome of the test, or CI, doesn't actually measure the belief. A frequentist CI only gets its frequentist properties if the statement is that the true value is in the CI without equivocation and without any ability to say anything relative about values within the CI.

    Therefore, the Bayes CI and frequentist CI are two very different things. Combining frequentist tests and Bayes Factors ends up producing a hodge podge that does nothing to further the field and ends up confusing the fundamental meaning of both measures. Cheerful articles about how we should all be able to just get along and let's integrate Bayes and frequentist stats show a lack of understanding of both... that whole 80% I suppose.

    1. Hi,

      you can have perfectly correct interpretations of a Frequentist confidence interval without confusing them with post-data Bayesian interpretations. As a hard-core Bayesian, you might not like nuanced messages, but combining Frequentist and Bayesian inferences is my preferred approach, and I see strengths in both. If you don't like a nuanced message, I totally understand.

  3. I am sure there is more modern work on Bayes Factors in a Behrens context. The one I remember from the 1980s working on my PhD is:

    Bayes Factors for Behrens-Fisher Problems
    Hari H. Dayal and James M. Dickey
    Sankhyā: The Indian Journal of Statistics, Series B (1960-2002)
    Vol. 38, No. 4 (Nov., 1976), pp. 315-328

  4. Hello Daniël,

    I realise this isn't the focus of your blog post, but would you like to elaborate on the following?

    "It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis"

    We've had our discussions on standardised ESs, and as you may recall, I think they're overused. The paper you discuss nicely illustrates why I think so: For the power analysis on p. 7, the authors didn't derive their effect size under H1 from theory, practical considerations or previous work on the topic at hand. Rather, they went for the typical mean difference-to-sample standard deviation ratio from a likely p-hacked literature (pre-2003 social psychology).

    For the Bayes factor analyses, they similarly resorted to everyone's favourite standardised effect size ('moderate = d of 0.5') for all 10 variables in Table 2. (I don't really understand why the effect size would now be a different one, but I've only scanned the paper.) Is that really the authors' alternative hypothesis or just a convenient fiction?

    It seems to me that rather than having an alternative hypothesis that, if the intervention 'worked', it would produce approximately a mean difference-to-sample standard deviation ratio of 0.4 (or 0.5+/-0.15), the authors didn't have an alternative hypothesis, but since they needed a power analysis and got a null result, they needed to come up with canned ones.

    To be clear, I'm not blaming the authors here since few papers on power analysis and, presumably, Bayes factors provide guidance on working with genuine alternative hypotheses. But, to paraphrase John Tukey, are we really so uninterested in our hypotheses that we don't care about their units? So is using standardised effect sizes the easiest way to think about the alternative hypothesis or the most convenient way to avoid having to think about it?

    1. Hi - I discuss how using standardized ES will bootstrap the use of unstandardized effect sizes in my paper https://osf.io/preprints/psyarxiv/97gpc/

    2. Hi Jan,

      A lot of good points here. I'd like to make a quick clarification about the alternative hypothesis, RE:"It seems to me that... the authors didn't have an alternative hypothesis, but since they needed a power analysis and got a null result, they needed to come up with canned ones."

      While we did expect to see a null result due to prior research, we did not know we would get a null result on this particular data set at the time of the pre-registration. In accordance to the rules, we were explicitly forbidden to touch the data until the pre-registration was complete.

      The point alternative for power analysis and the expected value of the alternative prior distribution are different, because the former sets a minimum value, while the latter sets the centre of a symmetrical distribution. We went with a canned standardized effect for power analysis, because we were unable to find many analogous studies from which we could form a more precise alternative. The Bayesian prior was a somewhat subjective illustration of how someone who believed in the effect would describe that belief. While the alternative hypothesis may not be entirely ideal, we did have one, so to speak.


    3. Thanks for the clarification, Bill!

      "The Bayesian prior was a somewhat subjective illustration of how someone who believed in the effect would describe that belief."

      I appreciate that it's difficult to quantify subjective beliefs, particularly if you don't share them. But would someone holding this belief expect that the ratio of the treatment effect and the standard deviation was the same for all ten outcome variables? I.e., that while the variability of the data for DV1 may be larger than that for DV2, the treatment effect for DV1 would be correspondingly larger as well to produce the same ratio?

      This isn't a criticism of your study. But I don't understand, in general, why one would express predictions in standardised effect sizes. 'Because we don't know about the raw effect size' isn't really a strong argument, because you need the raw effect size to calculate the standardised effect size.

      @Daniël: Thanks for the link.

  5. I don't think the adoption of TOST for the purpose of "examining non-significant results" (as in the title) or "interpreting non-significant results" (as in the text body) is responsible from a frequentist perspective, for two reasons. First, performing a significance test conditioned on a null result in a previous test increases your type 1 error rate by definition. Second, even if preregistering the TOST together with the usual t-test, the two tests can not be treated as two independent tests and so the desired alpha should be split between them according to researcher's prior.

    Once observing a null result, all alpha has been used to full, and error rates of any additional test on the same contrast can not even be approximated (e.g., http://www.sciencedirect.com/science/article/pii/S0001691814000304). This is one reason why the adoption of Bayes factors as a tool for examining non-significant results can't be justified from an NP perspective, and only makes sense from a Bayesian one. I think using TOST after observing a null result is no different than doing a right-tailed t-test, getting a null, and then performing a left-tailed test on the same set of data. Alpha is guaranteed to inflate. TOST is exactly the same, except that instead of right or left tails, you spread your alpha also at the center of the distribution. This additional rejection area should be somehow paid for.

    The second argument follows from the first one. Even if you perform the two tests (t and TOST) to test two separate hypotheses (one is that the effect is different from zero, the second that it is outside the equivalence region), you should split your alpha between the tests because there's a one to one mapping between the null distributions of the two tests. All you do is increase your rejection area, without paying for it.

    Kruschke has a nice way to compensate for this additional rejection area (ROPE). The equivalent for our case would be that once you decided on your equivalence bounds, they should also be used for your original t-test (i.e., instead of showing the CI doesn't include 0, it should not intersect with the equivalence region). This way bigger equivalence regions don't only make "accepting the null" easier, but also make it harder to reject it. Not sure this has the desired frequentist properties (maintain alpha) - but would be interesting to examine.

    - Matan

    1. Hi Matan, when you have two distinct hypotheses, they each have their own alpha. In this case, the error rates you are controlling are the following: 1) I do not want to say there is a significant effect, when the null is true, more than 5% of the time (t-test), and 2) I do not want to say there is statistical equivalence, when there is actually a true effect that equals one of the equivalence bounds, more than 5% of the time (TOST).

      You are controlling both these error rates at 5%. Remember that you can have a finding that is statistically equivalent AND statistically significant. So there are really two different hypotheses. Each can be individually true or false. That's why you can perform both tests, and each has its own alpha level.

      Now, you might want to say: I want to control my error rate over BOTH these tests. Yes, then you would need to correct (although how much, given their dependencies, is a difficult calculation). But then you might as well correct for all tests you do in the article, or all tests in your lifetime, and I explain here why that is not how NP testing works: http://daniellakens.blogspot.nl/2016/02/why-you-dont-need-to-adjust-you-alpha.html

    2. Hi Daniel,
      Thanks for restoring my reply :-)

      The fact that you can have a finding that is both statistically equivalent and statistically significant is not evidence for the two tests being independent. My point is that the two tests are performed on the same statistic under some rearrangement of terms, and thus are actually the same test with different rejection areas, just like right-tailed and left-tailed t tests.

      Let's take a concrete example:
      Say I sample 100 samples, with mean x̄ and std σ. Before sampling, I decided to perform
      1. a t-test, to test whether μ==0
      2. an equivalence test, to test whether d>0.5 or d<-0.5

      Rejection areas for the first test are t>1.98 OR t<-1.98, i.e., x̄/(σ/sqrt(100))=10*x̄/σ>1.98 OR 10*x̄/σ<-1.98, i.e., x̄/σ>0.198 OR 10*x̄/σ<-0.198

      Rejection area for the second test is 0.5>d>-0.5, i.e., 0.5>x̄/σ>-0.5.

      The statistic x̄/σ has one null distribution. Once you know your sample size and the p value of test 1, you also know the p value of test 2. Also true the other way. It's not about family wise correction, it's about the maintaining the alphas for the same test, even when it's under disguise.

    3. This comment has been removed by the author.

    4. Corrections: should be:
      Rejection areas for the first test are t>1.98 OR t<-1.98, i.e., x̄/(σ/sqrt(100))=10*x̄/σ>1.98 OR 10*x̄/σ<-1.98, i.e., x̄/σ>0.198 OR x̄/σ<-0.198

      more importantly, rejection areas for the second test are 1.66/sqrt(100)-0.5<x̄/σ<0.5-1.66/sqrt(100), i.e., -0.33<x̄/σ<0.33
      The important point holds though - both have the same null distribution given the df, and thus should not be treated as independent tests.

    5. Hi, yes, they are performed on the same data, but either 1) the true effect is 0, so you can make a Type 1 error for the t-test, but not for the equivalence test, or 2) The true effect is <> 0, so you can make a Type 1 error for the equivalence test, but not for the t-test. Doesn't this solve the problem? It's an interesting question, and it is very well possible that I am missing something.

    6. Well, you can use the exact same argument to justify doing a right tailed t test and move on to left tailed t test only if not significant. Either you make type 1 error on the first test or on the second, can never be both. In the example from my previous comment, you will reject at least one hypothesis for every possible combination of x̄ and σ, so your alpha is 1! Each of the tests is legitimate, its the combination that's problematic.

    7. I think there is a difference with the 2-tail example, namely that in that case, you are testing: 1) d > 0, 2) d < 0, 3) d = 0. If d = 0 is true, but you test 1 and 2 with 5% alpha, your overall alpha is actually 10% when d = 0. But with equivalence tests, the t-test has a 5% error rate when d = 0, and the TOST test only has a 5% error rate when d <> 0. I think that's a difference.

    8. I don't understand. In both cases the error rate of each of the tests is kept under alpha, and in both cases it's meaningless to talk about alpha for single tests because the tests are the same, only with different rejection areas. The serious problem here is that at least one of the null hypotheses is always false: either d!=0, or d==0, and then it lies within the equivalence interval. This makes alpha completely meaningless, because there is no null distribution (so to correct my previous comment, alpha is not 1, it is just undefined).

    9. My full response here: https://medium.com/@mazormatan/cant-have-your-tost-and-eat-it-too-f55efff0c85e#.a2vl4umpq

    10. I finally got convinced there's no problem with doing TOST and t-test on the same set of data. What is still unintuitive to me is that this combination makes significant results more frequent without inflating alpha. I understand the reason is that both null hypotheses are mutually exclusive. Thanks for your patience :-)

  6. Hi Daniel,
    Following our twitter conversation I wrote a comment and it now disappeared - have you erased it?
    - Matan

    1. Hi, the very bad spam filter had flagged it as spam (and lets through many messages that are clearly spam on other posts!). I restored your message.

  7. Performing both a test of the null hypothesis that the effect is 0 and an equivalence test that the effect is, say, < |.5| seems to suggest that the investigators are confused. If they believe that effects < |.5| are equivalent to 0 for all practical purposes, then why would they care about whether the null hypothesis that the effect is 0 is rejected? Clearly, rejecting that null would not imply that the effect size is large enough to be considered different from 0 for all practical purposes.

    Instead, it seems to me what would matter is whether the confidence or credible interval were (a) entirely within the equivalence limits, (b) entirely outside the equivalence limits, or (c) straddling an equivalence limit. From case (a) we would infer equivalence; from case (b) we would infer superiority; and case (c) would be indeterminate.

    1. Hi Jay, what you are describing is statistically equivalent to combining TOST and NHST. See my explanation in the preprint.
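The three cases described in this last comment can be written as a small decision rule. This is a minimal sketch with made-up labels and a made-up function name. The first example uses the WSB interval and raw bound reported in the post; the other two intervals are hypothetical.

```python
def classify_interval(ci_low, ci_high, bound):
    """Classify a confidence (or credible) interval against +/- bound.

    (a) entirely inside the bounds  -> 'equivalent'
    (b) entirely outside the bounds -> 'different'
    (c) straddling a bound          -> 'indeterminate'
    """
    if -bound < ci_low and ci_high < bound:
        return "equivalent"
    if ci_low >= bound or ci_high <= -bound:
        return "different"
    return "indeterminate"


raw_bound = 0.762  # d = 0.5 in raw scale points, from the post

wsb_result = classify_interval(-0.483, 0.295, raw_bound)     # reported WSB 90% CI
outside_result = classify_interval(0.90, 1.40, raw_bound)    # hypothetical interval
straddle_result = classify_interval(0.50, 1.00, raw_bound)   # hypothetical interval

print(wsb_result, outside_result, straddle_result)
```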