A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, September 15, 2014

Bayes Factors and p-values for independent t-tests



This Thursday I’ll be giving a workshop on good research practices in Leuven, Belgium. The other guest speaker at the workshop is Eric-Jan Wagenmakers, so I thought I’d finally dive into the relationship between Bayes Factors and p-values to be prepared to talk in the same workshop as such an expert on Bayesian statistics and methodology. This was a good excuse to finally play around with the BayesFactor package for R written by Richard Morey, who was super helpful through Twitter at 21:30 on a Sunday to enable me to do the calculations in this post. Remaining errors are my own responsibility (see the R script below to reproduce these calculations).

Bayes Factors tell you how strongly the data favor H1 over H0 (or vice versa), and thereby how your beliefs about H0 and H1 should shift given the data (as opposed to p-values, which give you the probability of the observed or more extreme data, given H0). As explained in detail by Felix Schönbrodt here, you can express Bayes Factors as support for H0 over H1 (BF01) or as support for H1 over H0 (BF10), and report raw Bayes Factors (ranging from 0 to infinity, where 1 means equal support for H1 and H0) or Bayes Factors on a log scale (from minus infinity through 0 to plus infinity, where 0 means equal support for H1 and H0). And yes, that gets pretty confusing pretty fast. Luckily, Richard Morey was kind enough to adjust the output of Jeff Rouder's Bayes Factor calculation website to include the R script for the BayesFactor package, which makes the output of different tools to compute Bayes Factors more uniform.
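To keep the different scales apart, here is a small sketch in R. The starting value is made up purely for illustration; it does not come from any dataset.

```r
# Hypothetical Bayes factor in favor of H1 over H0 (illustrative value only)
bf10 <- 9

bf01     <- 1 / bf10    # the same evidence, expressed as support for H0 over H1
log_bf10 <- log(bf10)   # natural-log scale: 0 means equal support
log_bf01 <- -log_bf10   # switching direction only flips the sign on the log scale

# Going back from the log scale to the raw scale
exp(log_bf01)           # equals bf01
```

So whenever a tool reports a log Bayes Factor, exponentiating it gives the raw value, and taking the reciprocal (or flipping the sign of the log) switches between BF10 and BF01.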

Doing a single Bayes independent t-test in R is easy. Run the code below, and replace the t with the t-value from your Student's t-test, fill in n1 and n2 (the sample size in each of the two groups in the independent t-test) and you are ready to go. For example, entering a t-value of 3, and 50 participants in each condition gives BF01 = 0.11, indicating the data are around (1/0.11) = 9 times more likely under the alternative hypothesis than under the null hypothesis.

library(BayesFactor)
exp(-ttest.tstat(t, n1, n2, rscale = 1)$bf)  # BF01; $bf holds the log of BF10
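If you are curious what happens under the hood, here is a minimal sketch in base R of the two-sample JZS Bayes Factor integral described by Rouder et al. (2009). The function name jzs_bf01 is made up for this sketch, and this is not the BayesFactor package's own code; treat the package output as the reference. It assumes a Cauchy prior on the effect size with scale r.

```r
# Sketch of the two-sample JZS Bayes factor (Rouder et al., 2009), base R only.
# jzs_bf01 is a hypothetical helper name, not a function from the BayesFactor package.
jzs_bf01 <- function(t, n1, n2, r = 1) {
  nu    <- n1 + n2 - 2            # degrees of freedom
  n_eff <- n1 * n2 / (n1 + n2)    # effective sample size

  # Marginal likelihood under H0 (up to a constant shared with H1)
  m0 <- (1 + t^2 / nu)^(-(nu + 1) / 2)

  # Under H1: average over g, which has an inverse chi-square(1) prior.
  # Computed on the log scale to avoid Inf * 0 for very small g.
  integrand <- function(g) {
    a <- 1 + n_eff * r^2 * g
    exp(-0.5 * log(a) -
          (nu + 1) / 2 * log(1 + t^2 / (a * nu)) -
          0.5 * log(2 * pi) - 1.5 * log(g) - 1 / (2 * g))
  }
  m1 <- integrate(integrand, lower = 0, upper = Inf)$value

  m0 / m1                         # BF01: support for H0 over H1
}

bf01 <- jzs_bf01(t = 3, n1 = 50, n2 = 50)
bf01        # BF01
1 / bf01    # BF10
```

With r = 1 this should land close to the value of about 0.11 from the example above, but small numerical differences from the package are possible.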

In the figure below, raw BF01 are plotted, which means they indicate the Bayes Factor for the null over the alternative. Therefore, small values (closer to 0) indicate stronger support for H1, 1 means equal support for H1 and H0, and large values indicate support for H0. First, let’s give an overview of Bayes Factors as a function of the t-value of an independent t-test, ranging from t=0 (no differences between groups) to t=5.



You can see three curves (for 20, 50, or 100 participants per condition) displaying the corresponding Bayes Factors as a function of increasing t-values. The green lines mark a BF01 of 3 (upper line, 3:1 in favor of H0) and a BF01 of 1/3 (lower line, 3:1 in favor of H1). Bayes Factors, just like p-values, are continuous, and shouldn’t be thought of in a dichotomous manner (but I know polar opposition is a foundation of human cognition, so I expect almost everyone will ignore this explicit statement in their implicit interpretation of Bayes Factors). Let’s zoom in a little for our comparison of BF and p-values, to t-values above 1.96.




The dark grey line in this figure illustrates data in favor of H1 of 3:1 (some support for H1), and the light grey line represents data in favor of H1 of 10:1 (strong support for H1). The vertical lines indicate which t-values represent an effect in a t-test that is statistically different from 0 at p = 0.05 (the larger the sample size, the closer this t-value lies to 1.96). There are two interesting observations we can make from this figure. 

First of all, where smaller sample sizes require slightly higher t-values to find a p<0.05 (as indicated by the blue vertical dotted line being further to the right than the black vertical dotted line), smaller sample sizes actually yield better Bayes Factors (a lower BF01, so more support for H1) for the same t-value. The reason for this, I think (but there's a comment section below, so if you know better, let me know) is that the larger the sample size, the less likely it is to find a relatively low t-value if there is an effect; instead, you’d expect to find a higher t-value, on average.

P-values are altogether much less dependent on the sample size in a t-test. The figure below shows three curves (for 20, 50, and 100 participants per condition). Researchers can conclude their data is ‘significant’ for t-values somewhere around 2, ranging from 1.96 for large samples, to 2.02 for 20 participants per condition. In other words, there is a relatively small effect of sample size. The dark and light grey lines indicate p = 0.05 and p = 0.01 thresholds.
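The small effect of sample size on the significance threshold is easy to check in base R. This is just a sketch using the three sample sizes from the figure and a two-sided alpha of 0.05:

```r
# Critical t-values for a two-sided test at alpha = 0.05,
# for 20, 50, and 100 participants per condition
n_per_group <- c(20, 50, 100)
df     <- 2 * n_per_group - 2          # degrees of freedom of the independent t-test
t_crit <- qt(1 - 0.05 / 2, df)
round(t_crit, 3)

# Two-sided p-value for a given t-value, e.g. t = 2 with 100 per condition
p_val <- 2 * pt(-abs(2), df = 198)
round(p_val, 3)
```

The critical values shrink toward 1.96 as the groups get larger, but they never move far from 2.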




The second thing that becomes clear from the plot of Bayes Factors is that the p<0.05 threshold allows researchers to conclude their data supports H1 long before a BF01 of 0.33. The t-values at which a Frequentist t-test yields a p < 0.05 are much lower than the t-values required for a BF to be lower than 0.33. For 20 participants per condition, a t-value of 2.487 is needed to conclude that there is some support for H1. A Frequentist t-test would give p=0.017. The larger the sample size, the more pronounced this difference becomes (e.g., with 200 participants per condition, a t=2.732 gives a BF = 0.33 and a p = 0.007).
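To see where such thresholds come from, here is a hedged sketch that searches for the t-value at which BF01 drops to 1/3 and then computes the matching two-sided p-value. It uses a base R approximation of the JZS integral from Rouder et al. (2009) with a Cauchy scale of 1; jzs_bf01 and t_for_bf_third are names made up for this sketch, not functions from the BayesFactor package, so expect small numerical differences from the package.

```r
# Base R sketch of the two-sample JZS BF01 (hypothetical helper, scale r = 1 by default)
jzs_bf01 <- function(t, n1, n2, r = 1) {
  nu    <- n1 + n2 - 2
  n_eff <- n1 * n2 / (n1 + n2)
  integrand <- function(g) {
    a <- 1 + n_eff * r^2 * g
    exp(-0.5 * log(a) - (nu + 1) / 2 * log(1 + t^2 / (a * nu)) -
          0.5 * log(2 * pi) - 1.5 * log(g) - 1 / (2 * g))
  }
  (1 + t^2 / nu)^(-(nu + 1) / 2) /
    integrate(integrand, lower = 0, upper = Inf)$value
}

# Find the t-value where BF01 equals 1/3 for n participants per condition
t_for_bf_third <- function(n) {
  uniroot(function(t) jzs_bf01(t, n, n) - 1 / 3,
          interval = c(1, 5), tol = 1e-6)$root
}

n_per_group <- c(20, 200)
t_needed <- sapply(n_per_group, t_for_bf_third)
p_at_t   <- 2 * pt(-t_needed, df = 2 * n_per_group - 2)

data.frame(n_per_group,
           t_needed = round(t_needed, 3),
           p        = round(p_at_t, 3))
```

The required t-value goes up with sample size, while the p-value at that point goes down, which is the pattern described above.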

It can even be the case that a ‘significant’ p-value in an independent t-test with 100 participants per condition (e.g., a t-value of 2, yielding a p=0.047) gives a BF>1, which means support in the opposite direction (favoring H0). Such high p-values really don’t provide support for our hypotheses. Furthermore, the use of a fixed significance level (0.05) regardless of the sample size of the study is a bad research practice. If we would require a higher t-value (and thus lower p-value) in larger samples, we would at least prevent the rather ridiculous situations where we interpret data as support for H1, when the BF actually favors H0. 

On the other hand, the recommendation to use p<0.001 by some statisticians is a bit of an overreaction to the problem. As you can see from the grey line at p=0.01 in the p-value plot, and the grey line at 0.33 in the Bayes Factor plot, using p<0.01 gets us pretty close to the same conclusions as we would draw using Bayes Factors. Stronger evidence is preferable to weaker evidence, but can come at too high a cost.

In the end, our first priority should be to draw logical inferences about our hypotheses from our data. Given how easy it is to calculate the Bayes Factor, I'd say that at the very minimum you should want to calculate it to make sure your significant p-value actually isn't stronger support for H0. You can easily report it alongside p-values, confidence intervals, and effect sizes. For example, in a recent paper (Evers & Lakens, 2014, Study 2b) we wrote: "Overall, there was some indication of a diagnosticity effect of 4.4% (SD = 13.32), t(38) = 2.06, p = 0.046, gav = 0.24, 95% CI [0.00, 0.49], but this difference was not convincing when evaluated with Bayesian statistics, JZS BF10 = 0.89".

If you want to play around with the functions, you can use the R script below to produce the zoomed-in versions of the Bayes Factor and p-value graphs (you need to install and load the BayesFactor package for the script to work). If you want to read more about this (or see similar graphs and more), read this paper by Rouder et al. (2009).



12 comments:

  1. Readers who want more of the theory can check out my post "Bayes factor t tests, part 2: Two-sample tests" and the previous posts linked there. (bayesfactor.blogspot.com)

  2. Hey Daniel, thanks for this post. Just wanted to let you know that the link to Rouder et al. (2009) is set incorrectly.

  3. great post. I think something went wrong in Figure 2. You write 'smaller sample sizes actually yield higher Bayes Factors for the same t-value' even though the figure shows the exact opposite. I guess the legend got mixed up.

    1. Thanks! You are right. Fixed it - graph is correct, but should be lower (or now: better) Bayes Factor. That's how counterintuitive it was ;)

  4. "If we would require a higher t-value (and thus lower p-value) in larger samples, we would at least prevent the rather ridiculous situations where we interpret data as support for H1, when the BF actually favors H0."

    This doesn't take into account publication bias. If we assume that publication bias is a larger problem for smaller studies (which I think we can), we would require lower p-values for studies with smaller samples, i.e. the opposite of what you suggest.

    1. Hi, the correction is intended to prevent situations where a p-value says there is support for H1, but a Bayes factor says there is stronger support for H0. This post is written for researchers who want to interpret their own data. Publication bias is a problem, but not in interpreting and reporting your own data. Obviously, people should pre-register their hypotheses if they want their statistical inferences to be taken seriously by others.

    2. That is right, if you want to interpret your own data or your study is preregistered, it makes sense. But as a general rule, we would want it the other way around, i.e. a lower (more strict and conservative) significance threshold for smaller studies.

    3. Why would we want a significance threshold for anything? "Significance" -- that is, a threshold on the p value itself -- doesn't really mean anything of value. It's fine to set a "more strict and conservative" threshold on something of meaning -- for instance, an evidential threshold, perhaps -- but we really need to stop thinking in terms of "significance" altogether. It's a useless, arbitrary idea.

    4. Hi Richard, why would you want to use a Cauchy prior? Why would you make any fixed recommendation except 'use your brain'? If we could live in a world where everyone had the time to build the expertise to know everything about all statistics they use (factor analysis, mixed models, Bayes Factors, etc.) in addition to all measurement techniques they use (physiological data, scales, etc.) in all theories they test, people would probably spend 20 years before they feel comfortable publishing anything. Recommendations are not perfect, but useful in getting people to do the right thing, most of the time. I think that's why. If you know of a more efficient system (except the 'use your brain' alternative), I think we all want to know.

    5. I can justify the necessity of using *some* prior, and then we can argue over the details of that prior. Fine, but that's a detail, not part of the basic logical structure of method. But my question was why use *any* criterion on the p value to justify scientific claims? Using the p value in this way leads to logically invalid arguments (choosing a ridiculous prior leads to silly arguments, but not *logically invalid* ones).

      Scientists are lifelong learners; learning new things is what we do for a living. Asking scientists to stop using a moribund, completely unreasonable method -- and to learn one that is (or can be) reasonable -- is within the bounds of the job description. After all, method is the *core* of science, not a peripheral concern. Without reasonable method, science is nothing.

      As for "[r]ecommendations [about p values] are not perfect, but useful in getting people to do the right thing, most of the time" I'm more interested in making reasonable judgments, than the "do[ing] the right thing". I don't know what "do[ing] the right thing" means when we're talking about interpreting the results of an experiment.

    6. Doing the right thing means interpreting data as support for your hypothesis when they support your hypothesis. Let's say I expect two groups (randomly assigned) to differ on some variable (a rather boring, unpretentious hypothesis, but ok). I do a test, find the difference is statistically greater than 0. I conclude the data support my hypothesis.

      Did I do anything wrong? No. I don't know how likely it is my hypothesis is correct, but it is supported by the data. This could be a freak accident, a one in a million error. The prior for H0 might be 99%. But the data I have still support my hypothesis.

      Why would we conclude this, only when p <.05? Or perhaps p < .01? First of all, no one is saying you should. If you find a p = .77 and still want to continue believing in your hypothesis, go ahead - try again, better. Is it nonsense to require some threshold researchers should aim for when they want to convince others? I don't think so. Which value works best is a matter of opinion, and we cautiously allow higher p-values every now and then. But asking researchers to provide sufficient improbability of their data makes sense.

      Now you have a problem with scientists saying 'I found this p-value, now I make the scientific claim that H1 is true'. But the data can be 'in line with' or 'supporting' H1. We are not talking about truth, but about collecting observations in line with what might be the truth. P-values quantify the extent to which these observations should be regarded in line with the hypothesis, not whether the hypothesis is correct.

      This is more or less the core of a future blog post, so if I'm completely bonkers, stop me now :)
