A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, July 18, 2016

Dance of the Bayes factors

You might have seen the ‘Dance of the p-values’ video by Geoff Cumming (if not, watch it here). Because p-values and the default Bayes factors (Rouder, Speckman, Sun, Morey, & Iverson, 2009) are both calculated directly from t-values and sample sizes, we might expect there is also a Dance of the Bayes factors. And indeed, there is. Bayes factors can vary widely over identical studies, just due to random variation.

If people always interpreted Bayes factors correctly, that would not be a problem. Bayes factors tell you how well the data are in line with each of two models, and quantify the relative evidence in favor of one of these models. The data are what they are, even when they are misleading (i.e., supporting a hypothesis that is not true). So, you can conclude the null model is more likely than some other model, but purely based on a Bayes factor, you can’t draw a conclusion such as “This Bayes factor allows us to conclude that there are no differences between conditions”. Regrettably, researchers are starting to misinterpret Bayes factors on a massive scale (I won't provide references, though I have many). This is not surprising – people find statistical inferences difficult, whether these are about p-values, confidence intervals, or Bayes factors.

As a consequence, we see many dichotomous absolute interpretations (“we conclude there is no effect”) instead of continuous relative interpretations (“we conclude the data increase our belief in the null model compared to the alternative model”). As a side note: In my experience, some people who advocate Bayesian statistics over NHST live in a weird limbo. When they criticize Null-Hypothesis Significance Testing as a useless procedure, they argue that the null is never true, so we already know the answer; yet they love using Bayes factors to conclude that the null hypothesis is supported.

For me, there is one important difference between the dance of the p-values and the dance of the Bayes factors: When people draw dichotomous conclusions, p-values allow you to control your error rate in the long run, while error rates are ignored when people use Bayes factors. As a consequence, you can easily conclude there is ‘no effect’ when there is an effect, 25% of the time (see below). This is a direct consequence of the ‘Dance of the Bayes factors’.

Let’s take the following scenario: There is a true small effect, Cohen’s d = 0.3. You collect data and perform a default two-sided Bayesian t-test with 75 participants in each condition. Let’s repeat this 100,000 times, and plot the Bayes factors we can expect.

If you like a more dynamic version, check the ‘Dance of the Bayes factors’ R script at the bottom of this post. As output, it gives you a :D smiley when you have strong evidence for the null (BF < 0.1), a :) smiley when you have moderate evidence for the null, a (._.) when data is inconclusive, and a :( or :(( when data strongly support the alternative (smileys are coded based on the assumption researchers want to find support for the null). See the .gif below for the Dance of the Bayes factors if you don’t want to run the script.

I did not choose this example randomly (just as Geoff Cumming did not randomly choose to use 50% statistical power in his ‘Dance of the p-values’ video). In this situation, approximately 25% of Bayes factors are smaller than 1/3 (which can be interpreted as support for the null), 25% are higher than 3 (which can be interpreted as support for the alternative), and 50% are inconclusive. If you concluded, based on your Bayes factor, that there are no differences between groups, you’d be wrong 25% of the time, in the long run. That’s a lot.
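The scenario above can be sketched in a few lines of Python. This is a minimal sketch, not the R script mentioned in this post: it assumes the default Bayes factor is the JZS Bayes factor from Rouder et al. (2009) with a Cauchy prior scale of r = √2/2 (the post does not state the scale), and the function names `jzs_bf10` and `simulate_bf10` are made up for illustration. It uses 1,000 simulated studies instead of 100,000 to keep the run time short, so the shares it prints will wobble around the long-run values.

```python
import math
import random
import statistics


def jzs_bf10(t, n1, n2, r=math.sqrt(2) / 2, steps=400):
    """BF10 for a two-sample t-test with a Cauchy(0, r) prior on effect size.

    Follows the integral form in Rouder et al. (2009); the scale r is an
    assumption, since the post does not state which scale was used.
    """
    nu = n1 + n2 - 2                 # degrees of freedom
    n_eff = n1 * n2 / (n1 + n2)      # effective sample size for two groups
    like_h0 = (1 + t * t / nu) ** (-(nu + 1) / 2)

    # Integrate over g on a log scale (g = exp(u)) with the midpoint rule.
    lo, hi = -8.0, 40.0
    du = (hi - lo) / steps
    like_h1 = 0.0
    for i in range(steps):
        g = math.exp(lo + (i + 0.5) * du)
        k = 1 + n_eff * g * r * r
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        like = k ** -0.5 * (1 + t * t / (k * nu)) ** (-(nu + 1) / 2)
        like_h1 += like * prior * g * du   # the extra g is the Jacobian
    return like_h1 / like_h0


def simulate_bf10(n_sims=1000, n=75, d=0.3, seed=1):
    """Simulate identical two-group studies and return one BF10 per study."""
    rng = random.Random(seed)
    bfs = []
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(d, 1.0) for _ in range(n)]
        pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2
        se = math.sqrt(pooled_var * 2 / n)
        t = (statistics.fmean(y) - statistics.fmean(x)) / se
        bfs.append(jzs_bf10(t, n, n))
    return bfs


bfs = simulate_bf10()
share_null = sum(bf < 1 / 3 for bf in bfs) / len(bfs)
share_alt = sum(bf > 3 for bf in bfs) / len(bfs)
print(f"BF10 < 1/3: {share_null:.2f}")
print(f"BF10 > 3:   {share_alt:.2f}")
print(f"in between: {1 - share_null - share_alt:.2f}")
```

Changing `n` or `d` in `simulate_bf10` shows how quickly the three shares shift with sample size and true effect size.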

(You might feel more comfortable using a BF of 1/10 as a ‘strong evidence’ threshold: BF < 0.1 happens 12.5% of the time in this simulation. A BF > 10 never happens: We don't have a large enough sample size. If your true effect size is 0.3, you have decided to collect a maximum of 75 participants in each group, and you will look at the data repeatedly until you have ‘strong evidence’ (BF > 10 or BF < 0.1), you will never observe support for the alternative, and you can only observe strong evidence in favor of the null model, even though there is a true effect.)

Felix Schönbrodt gives some examples for the probability you will observe a misleading Bayes factor for different effect sizes and priors (Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2015). Here, I just want to note that you might want to take the Frequentist properties of Bayes factors into account if you want to make dichotomous conclusions such as ‘the data allow us to conclude there is no effect’. Just as the ‘Dance of the p-values’ can be turned into a ‘March of the p-values’ by increasing the statistical power, you can design studies that will yield informative Bayes factors, most of the time (Schönbrodt & Wagenmakers, 2016). But you can only design informative studies, in the long run, if you take Frequentist properties of tests into account. If you just look ‘at the data at hand’, your Bayes factors might be dancing around. You need to look at their Frequentist properties to design studies where Bayes factors march around. My main point in this blog is that this is something you might want to do.

What’s the alternative? First, never make incorrect dichotomous conclusions based on Bayes factors. I have the feeling I will be repeating this for the next 50 years. Bayes factors are relative evidence. If you want to make statements about how likely the null is, define a range of possible priors, use Bayes factors to update these priors, and report posterior probabilities as your explicit subjective belief in the null.
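To make that last step concrete, here is a minimal sketch of updating a range of priors with a Bayes factor. It assumes only two models are on the table (the null and one alternative), so their probabilities sum to 1; the function name `posterior_prob_null` and the three prior values are illustrative choices, not recommendations.

```python
def posterior_prob_null(bf10, prior_null):
    """Posterior probability of the null, given BF10 and a prior P(H0).

    Assumes H0 and H1 are the only two models under consideration.
    """
    prior_odds_alt = (1 - prior_null) / prior_null    # P(H1) / P(H0)
    posterior_odds_alt = bf10 * prior_odds_alt        # Bayes' rule in odds form
    return 1 / (1 + posterior_odds_alt)


# The same BF10 of 1/3, seen through three different prior beliefs in the null.
for prior_null in (0.2, 0.5, 0.8):
    posterior = posterior_prob_null(1 / 3, prior_null)
    print(f"prior P(H0) = {prior_null:.1f} -> posterior P(H0) = {posterior:.3f}")
```

The same Bayes factor moves every prior in the same direction, but where you end up depends on where you started, which is why the posterior is reported as an explicit subjective belief.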

Second, you might want to stay away from the default priors. Using default priors as a Bayesian is like eating a no-fat no-sugar no-salt chocolate-chip cookie: You might as well skip it. You will just get looks of sympathy as you try to swallow it down. Look at Jeff Rouder’s post on how to roll your own priors.

Third, if you just want to say the effect is smaller than anything you find worthwhile (without specifically concluding there is no effect), equivalence testing might be much more straightforward. It has error control, so you won’t incorrectly say the effect is smaller than anything you care about too often, in the long run.
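As a rough illustration of equivalence testing, the two one-sided tests (TOST) procedure can be sketched as follows. This is a simplified sketch under stated assumptions: it uses a large-sample normal approximation instead of the t distribution, the equivalence bounds are given in raw units, and the function name `tost_two_groups` and the summary statistics in the example are hypothetical.

```python
import math
from statistics import NormalDist


def tost_two_groups(mean1, mean2, sd1, sd2, n1, n2, low, high, alpha=0.05):
    """Two one-sided tests for a difference in means (normal approximation).

    low and high are the equivalence bounds in raw units, with low < high.
    """
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    std_normal = NormalDist()
    p_lower = 1 - std_normal.cdf((diff - low) / se)   # tests H0: diff <= low
    p_upper = std_normal.cdf((diff - high) / se)      # tests H0: diff >= high
    p_value = max(p_lower, p_upper)                   # both tests must reject
    return p_value, p_value < alpha


# Hypothetical summary statistics, with bounds of -0.3 and 0.3 raw units.
p_value, equivalent = tost_two_groups(0.05, 0.0, 1.0, 1.0, 200, 200, -0.3, 0.3)
print(f"TOST p = {p_value:.4f}, equivalent: {equivalent}")
```

Because the larger of the two one-sided p-values is compared to alpha, the rate of wrongly declaring equivalence is held at alpha in the long run, which is the error control referred to above.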

The final alternative is just to ignore error rates. State loudly and clearly that you don’t care about Frequentist properties. Personally, I hope Bayesians will not choose this option. I would not be happy with a literature where thousands of articles claim the null is true, when there is a true effect. And you might want to know how to design studies that are likely to give answers you find informative.

When using Bayes factors, remember they can vary a lot across identical studies. Also remember that Bayes factors give you relative evidence. The null model might be more likely than the alternative, but both models can be wrong. If the true effect size is 0.3, the data might be closer to a value of 0 than to a value of 0.7, but it does not mean the true value is 0. In Bayesian statistics, the same reasoning holds. Your data may be more likely under a null model than under an alternative model, but that does not mean there are no differences. If you nevertheless want to argue that the null-hypothesis is true based on just a Bayes factor, realize you might be fooling yourself 25% of the time. Or more. Or less.


  • Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. http://doi.org/10.3758/PBR.16.2.225
  • Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2015). Sequential Hypothesis Testing With Bayes Factors: Efficiently Testing Mean Differences. Psychological Methods. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/26651986
  • Schönbrodt, F. D., & Wagenmakers, E.-J. (2016). Bayes Factor Design Analysis: Planning for Compelling Evidence (SSRN Scholarly Paper No. ID 2722435). Rochester, NY: Social Science Research Network. Retrieved from http://papers.ssrn.com/abstract=2722435


  1. Interesting post, thanks Daniel! As you know my knowledge of Bayesian stats is very rough, as I'm still in the early stages of learning it. Nevertheless, some comments:

    You say that:
    25% of BF10 < 1/3 (which can be interpreted as support for the null)
    25% of BF10 > 3 (which can be interpreted as support for the alternative)
    50% of BF10 are between 1/3 and 3 (inconclusive)

    Eyeballing the graph, it seems the mean and median BF10 are between 2 and 3.

    So basically, what you've shown is that for a dataset with little evidence, you'll never get substantive evidence (BF10 > 10), and what you will find is fairly random and often inconclusive. While this is good to know, it does not seem surprising, nor an argument for or against BF?

    On the latter point: how would p-values do in this case?

    Like you said: don't interpret BF as absolute evidence, or rely exclusively on dichotomous interpretations of BF.

    As a researcher, I would more typically sample until I get a specific BF, say BF10 > 10 for example. I will try to simulate that myself, but I'm fairly certain that with such a design errors are extremely rare, albeit you might run into practical issues with the required sample size.

    All in all, I guess that given your focus on controlling error rates, either setting alpha very low or BF high are both perfectly fine. The alpha has the advantage of an explicit (max) level of long-run error, while the BF has the advantage of quantifying evidence.

    1. Hi Tim, indeed, we could say the dataset contains little evidence. P-values would vary as much (I explicitly say this example is chosen to show the variability, just as when Cumming uses 50% power).

      If you sample until you get a BF < 0.1, whether or not you will get misleading evidence depends on your max sample size, and when and where you look. It's simply not easy to know, because there are no formal error control mechanisms. But if 75 people in each cell is your max, and the effect size is small, you can never get strong evidence favoring HA. The real problem here is the bad prior. You should use a better prior to solve this problem.

  2. Very nice Daniel!
    For me, take home message 1 is that BFs are subject to sampling variability. Of course! But folks don't seem to recognise this sufficiently. I hope now they will. On repeated sampling they dance, as do p values (and means and CIs and values of sample SD and…).

    Take home message 2 is that, for yet one more reason, we should move on from dichotomous decision making and use estimation. DDM tempts with seductive but illusory certainty. It does so even when given the better-sounding label ‘Hypothesis Testing’. Estimation, by contrast (1) focuses on effect sizes, which are the outcomes of most research interest and what we should interpret; (2) is necessary for meta-analysis, which is essential in a world of Open Science and replication; and (3) is likely to encourage quantitative modelling and the development of a more quantitative discipline. For me it is much more important that folks move on from DDM to estimation than whether they choose to do estimation using confidence intervals or credible intervals.

    Indeed Bayesian estimation centred on credible intervals may be a major way of the future. Please someone develop some great materials for teaching beginners using this approach!

    You mention ‘March of the p values’ if power is high. With high power, yes, we more often get small p values. Of course. But the weird thing is that the extent of sampling variability of p is still very high. I quantify that with the p interval: the 80% prediction interval for p, given just the p value found in an initial study. For initial p = .001, the p interval (all p values two-tail) is (.0000003, .14). So there’s an 80% chance a replication gives p in that interval, a 10% chance of p < .0000003, and fully a 10% chance of p > .14! Hardly a march! Possibly more surprisingly, that is all true whatever the value of N! (Provided N is not very small.) Yes, that’s hard to believe, but note that the interval is conditional on that initial p value, not on a particular true effect size. There’s more in the paper:
    Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300.

    Geoff Cumming
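The p interval quoted in this comment can be approximated with a short sketch. It assumes a normal approximation in which the difference between the replication z and the initial z has variance 2 (both studies have the same design and sampling error); this reproduces the figures quoted for p = .001 but is not necessarily the exact procedure in Cumming (2008). The function name `p_interval` is made up for illustration.

```python
import math
from statistics import NormalDist


def p_interval(p_initial, coverage=0.80):
    """Approximate prediction interval for a replication's two-tailed p-value.

    Assumes z_replication - z_initial is normal with variance 2. The simple
    conversion back to p is only meaningful while the lower z bound is positive.
    """
    std_normal = NormalDist()
    z_obs = std_normal.inv_cdf(1 - p_initial / 2)
    half_width = std_normal.inv_cdf(0.5 + coverage / 2) * math.sqrt(2)
    z_low, z_high = z_obs - half_width, z_obs + half_width

    def two_tailed(z):
        return 2 * std_normal.cdf(-abs(z))

    return two_tailed(z_high), two_tailed(z_low)


p_low, p_high = p_interval(0.001)
print(f"80% p interval for initial p = .001: ({p_low:.7f}, {p_high:.2f})")
```

The sample size never enters the calculation, which is why the interval is the same whatever the value of N: it is conditional on the initial p-value only.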

    1. What you say about the March of p-values not being true also applies to the default Bayes Factors. I had similar simulations of this in my BSE preprint: http://biorxiv.org/content/early/2015/04/02/017327

      You can see that even the weight of evidence (log BF) varies quite dramatically when the true effect size is large.

      The more I learn about this, though, the more I realize it comes down to philosophy only. As a Bayesian you shouldn't care about the long run. It's all about the data, and seen from this perspective this is a feature, not a bug. I still haven't decided entirely to what extent the long run actually matters.

  3. The dance of the p-values or BF is so shockingly large because we do not have a proper understanding of p-values or BF. BF are ratios and we are not very used to ratios. Using log(BF) or log(p) might help, but I prefer converting everything into z-scores. Say you got a z-score of 1.96 (p = .05) and you need an 80% interval around that. You take the standard normal and you get z = qnorm(.80) = .84, so you have an interval from 1.12 to 2.80 (corresponding p-values are p = .26 to .005).

    If you got p = .001, you get z = 3.29, 80% CI = 2.45 to 4.13; corresponding p-values are .01 to .0000359.

    I find 2.45 to 4.13 easier to interpret than .01 to .0000359, but it expresses the same amount of uncertainty about the probability of obtaining a test statistic when the null hypothesis is true.
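The arithmetic in this comment can be checked with a small sketch; the function name `z_interval` is made up for illustration. One caveat worth knowing: adding and subtracting qnorm(.80) ≈ 0.84 leaves 20% in each tail, so it covers the central 60% of a standard normal, whereas a central 80% interval would use qnorm(.90) ≈ 1.28. The half-width is therefore a parameter, so either reading can be tried.

```python
from statistics import NormalDist


def z_interval(p_two_sided, half_width):
    """Interval around the z-score for a two-sided p, plus matching p-values."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - p_two_sided / 2)
    z_low, z_high = z - half_width, z + half_width

    def p_at(value):
        return 2 * std_normal.cdf(-abs(value))

    return (z_low, z_high), (p_at(z_low), p_at(z_high))


half_width = NormalDist().inv_cdf(0.80)            # qnorm(.80), about 0.84
coverage = 2 * NormalDist().cdf(half_width) - 1    # central coverage of +/- half_width

(z_low, z_high), (p_at_low, p_at_high) = z_interval(0.05, half_width)
print(f"z from {z_low:.2f} to {z_high:.2f}, p from {p_at_low:.2f} to {p_at_high:.3f}")
print(f"central coverage of +/- qnorm(.80): {coverage:.2f}")
```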

  4. > For me, there is one important difference between the dance of the p-values and the dance of the Bayes factors: When people draw dichotomous conclusions, p-values allow you to control your error rate in the long run, while error rates are ignored when people use Bayes factors.

    You can choose the BF threshold such that the BF tests p(H_0|D) with the 0-1 loss function used by the Neyman-Pearson framework. For instance, for p(H_0)=p(H_1) and alpha=0.05, H_0 can be rejected if B_10 > 19, and in general the threshold is given by (1-alpha)/alpha*p(H_0)/p(H_1). Of course this is rarely done in the literature, so your point stands; but in principle BF are amenable to your complaint.
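A short sketch of that threshold: rejecting H_0 when p(H_0|D) < alpha is the same as requiring posterior odds p(H_1|D)/p(H_0|D) above (1-alpha)/alpha, and since posterior odds equal B_10 times the prior odds p(H_1)/p(H_0), the cutoff on B_10 is (1-alpha)/alpha multiplied by p(H_0)/p(H_1). The function names below are made up for illustration, and the sketch assumes H_0 and H_1 are the only two models.

```python
def bf10_threshold(alpha, prior_null=0.5):
    """Smallest BF10 at which the posterior probability of H0 drops to alpha."""
    prior_alt = 1 - prior_null
    return (1 - alpha) / alpha * prior_null / prior_alt


def posterior_null(bf10, prior_null=0.5):
    """Posterior P(H0 | D) when H0 and H1 are the only two models."""
    posterior_odds_alt = bf10 * (1 - prior_null) / prior_null
    return 1 / (1 + posterior_odds_alt)


for prior_null in (0.5, 0.8):
    cutoff = bf10_threshold(0.05, prior_null)
    print(f"P(H0) = {prior_null}: reject H0 when BF10 > {cutoff:.1f}, "
          f"where P(H0 | D) = {posterior_null(cutoff, prior_null):.3f}")
```

A stronger prior belief in the null raises the cutoff, so more evidence is needed before the null is rejected.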

    An even better strategy is to just forget about Bayes factors and to derive a test from first principles based on the loss function. Christian Robert's book The Bayesian Choice covers this topic. See also Robert's opinion on Bayes factors: http://stats.stackexchange.com/a/204334

    As for my opinion, I'm with Geoff Cumming on this. Just escape the false trichotomy of "p-values, confidence intervals, or Bayes factors" by going with Bayesian estimation (Kruschke's Bayesian new statistics).

  5. Hi Daniel,
    very interesting post. However, one criticism of Bayesian statistics is that it is (too) subjective. You advise staying away from default priors. I have the feeling that I would fuel the criticism that it's subjective by staying away from the default priors.

    1. This is the long and ever ongoing discussion within Bayesian statistics, between the objective Bayesians and the subjective Bayesians. There are arguments for both sides, but I don't think there is one side that is right and another side that is wrong. They both have benefits and downsides, so the duality will continue, I'm afraid.

  6. As you probably know, I have done such simulations in the past as well. Your argument here actually seems to repeat some of the debate I had about the Boekel replication study, summarized in my (now ancient) blog post: https://neuroneurotic.net/2015/03/26/failed-replication-or-flawed-reasoning/

    What these simulations are essentially doing is to restate what the Bayes Factor means. You get a dance of the BFs with little conclusive evidence either way because most results simply do not provide conclusive evidence for either model. Is this really a problem? It seems to tell us what we want to know.

    What I think is a problem is when results are misinterpreted (which was the reason for my blog post then although my thinking has also evolved since then). I think while Bayesians keep reminding us that we should condition on the data and not the truth, the way we report and discuss results is still largely focused on the latter. Even studies employing Bayesian hypothesis tests seem guilty of this.

    We recently published a paper (my first with default BFs) where I've tried to counteract this by emphasizing that a BF quantifies relative evidence. In this we were mostly trying to make an inference about whether participants are guessing, so the BFs are all tests of whether accuracy at the group level is at chance. Most of the BF10 values are > 1/3 even though they're below 1. What this means is that you should update your belief that people can do the task better than chance towards the null, but not by very much. If you are really convinced (Bem style) that people can actually do the task, then this won't reduce your belief by much, but it's up to you to decide if your belief is justified in the first place.

  7. Thanks Daniel for your inspiring work. I recently gave a talk where I show many statistical dances in graphical form, and I mentioned your blog post: http://www.aviz.fr/badstats#sec0
