A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, September 15, 2014

Bayes Factors and p-values for independent t-tests



This Thursday I’ll be giving a workshop on good research practices in Leuven, Belgium. The other guest speaker at the workshop is Eric-Jan Wagenmakers, so I thought I’d finally dive into the relationship between Bayes Factors and p-values to be prepared to talk in the same workshop as such an expert on Bayesian statistics and methodology. This was a good excuse to finally play around with the BayesFactor package for R written by Richard Morey, who was super helpful through Twitter at 21:30 on a Sunday to enable me to do the calculations in this post. Remaining errors are my own responsibility (see the R script below to reproduce these calculations).

Bayes Factors tell you something about the probability H0 or H1 are true, given some data (as opposed to p-values, which give you the probability of some data, given the H0). As explained in detail by Felix Schönbrodt here, you can express Bayes Factors as support for H0 over H1 (BF01) or as support for H1 over H0 (BF10), and report raw Bayes Factors (ranging from 0 to infinity, where 1 means equal support for H1 and H0) or Bayes Factors on a log scale (from minus infinity through 0 to plus infinity, where 0 means equal support for H1 and H0). And yes, that gets pretty confusing pretty fast. Luckily, Richard Morey was kind enough to adjust the output of Jeff Rouder's Bayes Factor calculation website to include the R script for the BayesFactor package, which makes the output of different tools for computing Bayes Factors more uniform.

Doing a single Bayesian independent t-test in R is easy. Run the code below, replace t with the t-value from your Student's t-test, fill in n1 and n2 (the sample size in each of the two groups in the independent t-test), and you are ready to go. For example, entering a t-value of 3 and 50 participants in each condition gives BF01 = 0.11, indicating the alternative hypothesis is around 1/0.11 ≈ 9 times more likely than the null hypothesis.

library(BayesFactor)  # provides ttest.tstat(); install.packages("BayesFactor") if needed
exp(-ttest.tstat(t, n1, n2, rscale = 1)$bf)  # $bf is the natural log of BF10, so exp(-$bf) gives the raw BF01
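
To make this concrete, here is a minimal sketch with the values from the example filled in (t = 3, 50 participants per condition); it assumes the BayesFactor package is installed, and the variable names BF01 and BF10 are just for illustration:

library(BayesFactor)

t  <- 3    # t-value from the Student's t-test
n1 <- 50   # participants in condition 1
n2 <- 50   # participants in condition 2

BF01 <- exp(-ttest.tstat(t, n1, n2, rscale = 1)$bf)  # support for H0 over H1 (~0.11 here)
BF10 <- 1 / BF01                                     # support for H1 over H0 (~9 here)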

In the figure below, raw BF01 values are plotted, which means they indicate the Bayes Factor for the null over the alternative. Therefore, small values (closer to 0) indicate stronger support for H1, 1 means equal support for H1 and H0, and large values indicate support for H0. First, let’s give an overview of Bayes Factors as a function of the t-value of an independent t-test, ranging from t=0 (no difference between groups) to t=5.
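
For readers who want to see how curves like these can be generated, here is a rough sketch (the full R script at the end of the post produces the zoomed-in versions); the variable names t_values and bf01 are just shorthand, and the rscale = 1 prior width matches the single-test example above:

library(BayesFactor)

t_values <- seq(0, 5, 0.1)
bf01 <- function(t, n) exp(-ttest.tstat(t, n, n, rscale = 1)$bf)  # raw BF01 for equal group sizes

plot(t_values, sapply(t_values, bf01, n = 20), type = "l", log = "y",
     xlab = "t-value", ylab = "BF01 (support for H0 over H1)")
lines(t_values, sapply(t_values, bf01, n = 50), lty = 2)
lines(t_values, sapply(t_values, bf01, n = 100), lty = 3)
abline(h = c(3, 1/3), col = "green")  # 3:1 for H0 (upper line) and 3:1 for H1 (lower line)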



You can see three curves (for 20, 50, or 100 participants per condition) displaying the corresponding Bayes Factors as a function of increasing t-values. The green lines correspond to Bayes Factors of 1:3 (upper line, favoring H0) or 3:1 (lower line, favoring H1). Bayes Factors, just like p-values, are continuous, and shouldn’t be thought of in a dichotomous manner (but I know polar opposition is a foundation of human cognition, so I expect almost everyone will ignore this explicit statement in their implicit interpretation of Bayes Factors). Let’s zoom in a little for our comparison of Bayes Factors and p-values, to t-values above 1.96.




The dark grey line in this figure marks evidence in favor of H1 of 3:1 (some support for H1), and the light grey line marks evidence in favor of H1 of 10:1 (strong support for H1). The vertical lines indicate the t-values at which an effect in a t-test is statistically different from 0 at p = 0.05 (the larger the sample size, the closer this t-value lies to 1.96). There are two interesting observations we can make from this figure.
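
The vertical lines follow directly from the t-distribution; a quick sketch of that calculation, assuming a standard two-sided Student's t-test at p = 0.05 with equal group sizes:

qt(0.975, df = 2 * 20  - 2)   # 20 per condition:  critical t ~ 2.02
qt(0.975, df = 2 * 50  - 2)   # 50 per condition:  critical t ~ 1.98
qt(0.975, df = 2 * 100 - 2)   # 100 per condition: critical t ~ 1.97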

First of all, where smaller sample sizes require slightly higher t-values to find a p < 0.05 (as indicated by the blue vertical dotted line lying further to the right than the black vertical dotted line), smaller sample sizes actually yield better Bayes Factors for the same t-value. The reason for this, I think (but there's a comment section below, so if you know better, let me know), is that the larger the sample size, the less likely it is to find a relatively low t-value if there is an effect; instead, you’d expect to find a higher t-value, on average.

P-values are altogether much less dependent on the sample size in a t-test. The figure below shows three curves (for 20, 50, and 100 participants per condition). Researchers can conclude their data is ‘significant’ for t-values somewhere around 2, ranging from 1.96 for large samples to about 2.02 for 20 participants per condition. In other words, there is a relatively small effect of sample size. The dark and light grey lines indicate the p = 0.05 and p = 0.01 thresholds.
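
The same point can be checked directly from the t-distribution; for example, the two-sided p-value for a t-value of 2 barely moves as the per-condition sample size grows (again assuming a standard Student's t-test with equal group sizes):

2 * pt(-2, df = 2 * 20  - 2)   # 20 per condition:  p ~ .053
2 * pt(-2, df = 2 * 100 - 2)   # 100 per condition: p ~ .047 (the value used in the example below)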




The second thing that becomes clear from the plot of Bayes Factors is that the p < 0.05 threshold allows researchers to conclude their data support H1 long before a BF01 of 0.33. The t-values at which a Frequentist t-test yields p < 0.05 are much lower than the t-values required for the BF01 to drop below 0.33. For 20 participants per condition, a t-value of 2.487 is needed to conclude that there is some support for H1, where a Frequentist t-test would already give p = 0.017. The larger the sample size, the more pronounced this difference becomes (e.g., with 200 participants per condition, t = 2.732 gives BF01 = 0.33 and p = 0.007).
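
If you want to find these cut-offs yourself, a simple numerical search does the trick; the sketch below uses base R's uniroot with a helper function I'll call t_for_bf3 (not part of any package) to look for the t-value at which the raw BF01 equals 1/3 for a given per-condition sample size:

library(BayesFactor)

t_for_bf3 <- function(n) {
  uniroot(function(t) exp(-ttest.tstat(t, n, n, rscale = 1)$bf) - 1/3,
          interval = c(1, 10))$root
}

t_for_bf3(20)                            # ~2.49 for 20 per condition
t_for_bf3(200)                           # ~2.73 for 200 per condition
2 * pt(-t_for_bf3(20), df = 2 * 20 - 2)  # the corresponding p-value (~.017)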

It can even be the case that a ‘significant’ p-value in an independent t-test with 100 participants per condition (e.g., a t-value of 2, yielding p = 0.047) gives a BF01 > 1, which means support in the opposite direction (favoring H0). Such high p-values really don’t provide support for our hypotheses. Furthermore, the use of a fixed significance level (0.05) regardless of the sample size of the study is a bad research practice. If we would require a higher t-value (and thus lower p-value) in larger samples, we would at least prevent the rather ridiculous situations where we interpret data as support for H1, when the BF actually favors H0.
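
You can verify this borderline case yourself: the p-value computed above for a t-value of 2 with 100 participants per condition is about .047, and the corresponding Bayes Factor (with the same rscale = 1 prior as before) is:

library(BayesFactor)
exp(-ttest.tstat(2, 100, 100, rscale = 1)$bf)  # raw BF01 > 1, i.e. the data actually favor H0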

On the other hand, the recommendation by some statisticians to use p < 0.001 is a bit of an overreaction to the problem. As you can see from the grey line at p = 0.01 in the p-value plot, and the grey line at 0.33 in the Bayes Factor plot, using p < 0.01 gets us pretty close to the same conclusions as we would draw using Bayes Factors. Stronger evidence is preferable to weaker evidence, but it can come at too high a cost.

In the end, our first priority should be to draw logical inferences about our hypotheses from our data. Given how easy it is to calculate the Bayes Factor, I'd say that at the very minimum you should want to calculate it to make sure your significant p-value isn't actually stronger support for H0. You can easily report it alongside p-values, confidence intervals, and effect sizes. For example, in a recent paper (Evers & Lakens, 2014, Study 2b) we wrote: "Overall, there was some indication of a diagnosticity effect of 4.4% (SD = 13.32), t(38) = 2.06, p = 0.046, gav = 0.24, 95% CI [0.00, 0.49], but this difference was not convincing when evaluated with Bayesian statistics, JZS BF10 = 0.89".

If you want to play around with the functions, you can grab the script to produce the zoomed-in versions of the Bayes Factor and p-value graphs using the R script below (you need to install and load the BayesFactor package for the script to work). If you want to read more about this (or see similar graphs and more), read this paper by Rouder et al. (2009).



14 comments:

  1. Readers who want more of the theory can check out my post "Bayes factor t tests, part 2: Two-sample tests" and the previous posts linked there. (bayesfactor.blogspot.com)

  2. Hey Daniel, thanks for this post. Just wanted to let you know that the link to Rouder et al. (2009) is set incorrectly.

  3. Great post. I think something went wrong in Figure 2. You write 'smaller sample sizes actually yield higher Bayes Factors for the same t-value' even though the figure shows the exact opposite. I guess the legend got mixed up.

    Replies
    1. Thanks! You are right. Fixed it: the graph is correct, but the text should have read lower (or now: better) Bayes Factors. That's how counterintuitive it was ;)

  4. "If we would require a higher t-value (and thus lower p-value) in larger samples, we would at least prevent the rather ridiculous situations where we interpret data as support for H1, when the BF actually favors H0."

    This doesn't take into account publication bias. If we assume that publication bias is a larger problem for smaller studies (which I think we can), we would require lower p-values for studies with smaller samples, i.e., the opposite of what you suggest.

    Replies
    1. Hi, the correction is intended to prevent situations where a p-value says there is support for H1, but a Bayes factor says there is stronger support for H0. This post is written for researchers who want to interpret their own data. Publication bias is a problem, but not in interpreting and reporting your own data. Obviously, people should pre-register their hypotheses if they want their statistical inferences to be taken seriously by others.

    2. That is right, if you want to interpret your own data or your study is preregistered, it makes sense. But as a general rule, we would want it the other way around, i.e., a lower (more strict and conservative) significance threshold for smaller studies.

    3. Why would we want a significance threshold for anything? "Significance" -- that is, a threshold on the p value itself -- doesn't really mean anything of value. It's fine to set a "more strict and conservative" threshold on something of meaning -- for instance, an evidential threshold, perhaps -- but we really need to stop thinking in terms of "significance" altogether. It's a useless, arbitrary idea.

    4. Hi Richard, why would you want to use a Cauchy prior? Why would you make any fixed recommendation except 'use your brain'? If we lived in a world where everyone had the time to build the expertise to know everything about all the statistics they use (factor analysis, mixed models, Bayes Factors, etc.) in addition to all the measurement techniques they use (physiological data, scales, etc.) in all the theories they test, people would probably spend 20 years before they feel comfortable publishing anything. Recommendations are not perfect, but useful in getting people to do the right thing, most of the time. I think that's why. If you know of a more efficient system (other than the 'use your brain' alternative), I think we all want to know.

    5. I can justify the necessity of using *some* prior, and then we can argue over the details of that prior. Fine, but that's a detail, not part of the basic logical structure of method. But my question was why use *any* criterion on the p value to justify scientific claims? Using the p value in this way leads to logically invalid arguments (choosing a ridiculous prior leads to silly arguments, but not *logically invalid* ones).

      Scientists are lifelong learners; learning new things is what we do for a living. Asking scientists to stop using a moribund, completely unreasonable method -- and to learn one that is (or can be) reasonable -- is within the bounds of the job description. After all, method is the *core* of science, not a peripheral concern. Without reasonable method, science is nothing.

      As for "[r]ecommendations [about p values] are not perfect, but useful in getting people to do the right thing, most of the time": I'm more interested in making reasonable judgments than in "do[ing] the right thing". I don't know what "do[ing] the right thing" means when we're talking about interpreting the results of an experiment.

    6. Doing the right thing means interpreting data as support for your hypothesis when they support your hypothesis. Let's say I expect two groups (randomly assigned) to differ on some variable (a rather boring, unpretentious hypothesis, but ok). I do a test, find the difference is statistically greater than 0. I conclude the data support my hypothesis.

      Did I do anything wrong? No. I don't know how likely it is my hypothesis is correct, but it is supported by the data. This could be a freak accident, a one in a million error. The prior for H0 might be 99%. But the data I have still support my hypothesis.

      Why would we conclude this only when p < .05? Or perhaps p < .01? First of all, no one is saying you should. If you find a p = .77 and still want to continue believing in your hypothesis, go ahead: try again, better. Is it nonsense to require some threshold researchers should aim for when they want to convince others? I don't think so. Which value works best is a matter of opinion, and we cautiously allow higher p-values every now and then. But asking researchers to show that their data are sufficiently improbable makes sense.

      Now you have a problem with scientists saying 'I found this p-value, now I make the scientific claim that H1 is true'. But the data can be 'in line with' or 'supporting' H1. We are not talking about truth, but about collecting observations in line with what might be the truth. P-values quantify the extent to which these observations should be regarded as in line with the hypothesis, not whether the hypothesis is correct.

      This is more or less the core of a future blog post, so if I'm completely bonkers, stop me now :)

  5. Dear Daniel,
    I struggled with the following sentence: "Bayes Factors tell you something about the probability H0 or H1 are true, given some data (as opposed to p-values, which give you the probability of some data, given the H0)."
    A Bayes Factor is defined as BF_10 = P(y|H1)/P(y|H0)
    So, the Bayes Factor is the ratio of the likelihood of data under H1 and the likelihood of data under H0.
    This confuses me when contrasting it with the p-value, which also gives the probability of data given the null hypothesis.

    Is it that you mean by "Bayes Factors tell you something about the probability H0 or H1 are true, given some data" they are a central component in updating the prior odds to arrive at the posterior odds?
    But then, wouldn't the posterior odds, i.e. P(H1|y)/P(H0|y) constitute the central difference between Bayesian and Frequentist inference?

    Replies
    1. p values just tell you the probability, assuming the null hypothesis is true, of getting more extreme results (in the direction of the alternative hypothesis) than the observed value of the test statistic with the actual data. Thus, you are looking at a multitude of possible samples that might occur and yield worse results than your actual sample.

      The Bayes factor does a better job: you are focusing on your actual data and not on other (virtual) samples that might have occurred. Most importantly, however, the Bayes factor directly compares two different models: the null model and an alternative model (representing the alternative hypothesis).
