First of all, the v-statistic
is not based on Frequentist or Bayesian statistics. It introduces a third
perspective on accuracy. This is great, because I greatly dislike any type of
polarized discussion, and especially the one between Frequentists and
Bayesians. With a new kid on the block, perhaps people will start to
acknowledge the value of multiple perspectives on statistical inferences.
Second, v is determined by the number of parameters you
examine (p), the effect size R-squared (Rsq), and the sample size. To increase
accuracy, you need to increase the sample size. But where other approaches,
such as those based on the width of a confidence interval, lack a clear minimal
value researchers should aim for, the v-statistic
has a clear lower boundary to beat: the 50% you would get from guessing. You want a v > .5. It’s great to say people should
think for themselves, and not blindly use numbers (significance levels of 0.05,
80% power, medium effect sizes of .5, Bayes Factors > 10) but let’s be
honest: That’s not what the majority of researchers want. And whereas under
certain circumstances the use of p = .05 as a cutoff
is rather silly, you can’t go wrong with using v > .5 as a minimum. Everyone is happy.
Third, Ellen Evers and I wrote about the v-statistic in our 2014 paper on
improving the informational value of studies (Lakens & Evers, 2014),
way before v won an award. It’s like
discovering a really great band before it becomes popular.
Fourth,
mathematically v is the volume of a
hypersphere. How cool is that? It’s like it’s from an X-men comic!
I also have a weakness for v because calculating it required R. I had never used R before I wanted to calculate v, so v is the reason I started using it. When re-reading the paper by Clintin Davis-Stober &
Jason Dana, I felt the graphs they present (for studies estimating 3 to 18 parameters, and sample sizes from 0 to 600) did not directly correspond to my
typical studies. So, it being the 1.5-year anniversary of R and me, I thought
I’d plot v as a function of R-squared
for some more typical numbers of parameters (2, 3, 4, and 6), effect sizes (R-squared of 0.01 - 0.25), and sample sizes in
psychology (30-300).
A quick R-squared to R conversion table for those who need
it, and remember Cohen’s guidelines suggest an R = .1 is small, R = .3 is
medium, and R = .5 is large.
R-squared   0.05   0.10   0.15   0.20   0.25
R           0.22   0.32   0.39   0.45   0.50
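The conversion is just a square root, so for values that are not in the table a minimal R sketch (the variable names are only for illustration) would be:

```r
# R-squared values from the conversion table
rsq <- c(0.05, 0.10, 0.15, 0.20, 0.25)

# R is the square root of R-squared, rounded to two decimals
r <- round(sqrt(rsq), 2)

# Show both side by side
print(data.frame(Rsq = rsq, R = r))
```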
As we see, v
depends on the sample size, number of parameters, and the effect size. For 2,
3, and 4 parameters, the effect size at which v > .5 doesn’t change
much, but with more parameters being estimated (e.g., > 6) accuracy
decreases substantially, which means you need considerably larger samples. For example, when estimating 2 parameters, a sample size of 50 requires
an effect size larger than R-squared = 0.115 (R = .34) to have a v > .5.
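To check this example without reading it off a plot, here is a minimal sketch. It follows the steps of the vstat function in the script for g, z, and alpha, but only for the special case p = 2, where (as far as I can tell) the hypergeometric term reduces to elementary functions, so no extra package is needed. The helper name v_p2 and the closed form in its last line are my own simplification, not something taken from the paper, so treat it as a sanity check only.

```r
# v for the special case p = 2 (hypothetical helper, not from the paper).
# The steps for g, z and alpha mirror the vstat script; the last line swaps
# the hypergeometric expression for its p = 2 closed form (an assumption).
v_p2 <- function(n, Rsq) {
  p <- 2
  if (Rsq <= 0) Rsq <- 0.0001
  g <- min(((p - 1) * (1 - Rsq)) / ((n - p) * Rsq), 1)
  if (g < 0.5001 && g > 0.4999) g <- 0.5001   # avoid 0/0 at g = .5
  z <- (g - sqrt(g - g^2)) / (2 * g - 1)
  alpha <- acos((1 - z) / sqrt(1 - 2 * z * (1 - z)))
  1 - (2 / pi) * (alpha + sin(alpha) * cos(alpha))
}

# v for the example in the text: n = 50, p = 2, R-squared = 0.115
v_example <- v_p2(n = 50, Rsq = 0.115)
print(v_example)

# R-squared at which v crosses .5 for n = 50, searched over the plotted range
threshold <- uniroot(function(x) v_p2(50, x) - 0.5,
                     interval = c(0.01, 0.25))$root
print(threshold)
```

With these inputs v should come out just above .5, with the crossing point slightly below R-squared = 0.115.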
When planning sample sizes, the v-statistic can be one of the criteria you use to decide on a sample size. You can also use v
to evaluate the accuracy of published studies (see Lakens & Evers, 2014, for two examples).
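As a sketch of how this could work when planning a study: search for the smallest sample size at which v exceeds .5 for a given number of parameters and expected R-squared. The helper name smallest_n_for_v and its arguments are hypothetical; it accepts any function that takes n, p, and Rsq, such as the vstat function in the script.

```r
# Hypothetical helper: smallest n (up to n_max) at which v_fun exceeds a threshold.
# v_fun must accept the arguments n, p and Rsq and return a single number.
smallest_n_for_v <- function(v_fun, p, Rsq, threshold = 0.5, n_max = 1000) {
  for (n in (p + 1):n_max) {
    # isTRUE() guards against NA or NaN coming back from v_fun
    if (isTRUE(v_fun(n = n, p = p, Rsq = Rsq) > threshold)) return(n)
  }
  NA  # threshold not reached within n_max
}

# Example call, once vstat from the script has been defined:
# smallest_n_for_v(vstat, p = 2, Rsq = 0.115)
```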
The R script to create these curves for different numbers of parameters, sample
sizes, and effect sizes is available below.
#You will need to load the R package "hypergeo" to use the vstat function
library(hypergeo)
#Below, I'm vectorizing the function so that I can plot curves.
#The rest is unchanged from the vstat function by Davis-Stober & Dana.
#If you want to use an unbiased estimate of R-squared, remove the # before the Rsq adjustment calculation below
vstat <- Vectorize(function(n,p,Rsq)
{
#Rsq = Re(1-((n-2)/(n-p))*(1-Rsq)*hypergeo(1,1,(n-p+2)*.5,1-Rsq))
#guard against zero or negative R-squared
if (Rsq<=0) {Rsq = .0001}
r = ((p-1)*(1-Rsq))/((n-p)*Rsq)
#cap the ratio at 1
g = min(r,1)
#avoid division by zero at g = .5
if (g<.5001 && g>.4999) {g = .5001}
z = (g - sqrt(g-g^2))/(2*g - 1)
alpha = acos((1-z)/sqrt(1-2*z*(1-z)))
v = Re((((2*cos(alpha)*gamma((p+2)/2))/(sqrt(pi)*gamma((p+1)/2)))*(hypergeo(.5,(1-p)/2, 3/2, cos(alpha)^2) - sin(alpha)^(p-1))))
return(v)
}
)
#Plot all curves (there's probably a cleaner way to do this, if so, let me know)
curve(vstat(Rsq=x, n=300, p=2), 0.01, 0.25, type="l", col="orange", ylim=c(0, 1), xlab="R-squared when Estimating 2 Parameters", ylab="v-stat")
par(new=TRUE)
curve(vstat(Rsq=x, n=200, p=2), 0.01, 0.25, type="l", col="red", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=100, p=2), 0.01, 0.25, type="l", col="green", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=50, p=2), 0.01, 0.25, type="l", col="purple", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=30, p=2), 0.01, 0.25, type="l", col="black", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
#draw horizontal line at 0.5 cut-off
abline(h=0.5, col="azure4", lty=5)
#add legend
legend(0.2,0.44,c("n=300","n=200","n=100","n=50","n=30"), lty=c(1,1,1,1,1), lwd=c(2.5,2.5,2.5,2.5,2.5), col=c("orange","red","green","purple","black"))
curve(vstat(Rsq=x, n=300, p=3), 0.01, 0.25, type="l", col="orange", ylim=c(0, 1), xlab="R-squared when Estimating 3 Parameters", ylab="v-stat")
par(new=TRUE)
curve(vstat(Rsq=x, n=200, p=3), 0.01, 0.25, type="l", col="red", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=100, p=3), 0.01, 0.25, type="l", col="green", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=50, p=3), 0.01, 0.25, type="l", col="purple", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=30, p=3), 0.01, 0.25, type="l", col="black", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
#draw horizontal line at 0.5 cut-off
abline(h=0.5, col="azure4", lty=5)
#add legend
legend(0.2,0.44,c("n=300","n=200","n=100","n=50","n=30"), lty=c(1,1,1,1,1), lwd=c(2.5,2.5,2.5,2.5,2.5), col=c("orange","red","green","purple","black"))
curve(vstat(Rsq=x, n=300, p=4), 0.01, 0.25, type="l", col="orange", ylim=c(0, 1), xlab="R-squared when Estimating 4 Parameters", ylab="v-stat")
par(new=TRUE)
curve(vstat(Rsq=x, n=200, p=4), 0.01, 0.25, type="l", col="red", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=100, p=4), 0.01, 0.25, type="l", col="green", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=50, p=4), 0.01, 0.25, type="l", col="purple", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=30, p=4), 0.01, 0.25, type="l", col="black", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
#draw horizontal line at 0.5 cut-off
abline(h=0.5, col="azure4", lty=5)
#add legend
legend(0.2,0.44,c("n=300","n=200","n=100","n=50","n=30"), lty=c(1,1,1,1,1), lwd=c(2.5,2.5,2.5,2.5,2.5), col=c("orange","red","green","purple","black"))
#Plot all curves (there's probably a cleaner way to overlay them)
curve(vstat(Rsq=x, n=300, p=6), 0.01, 0.25, type="l", col="orange", ylim=c(0, 1), xlab="R-squared when Estimating 6 Parameters", ylab="v-stat")
par(new=TRUE)
curve(vstat(Rsq=x, n=200, p=6), 0.01, 0.25, type="l", col="red", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=100, p=6), 0.01, 0.25, type="l", col="green", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=50, p=6), 0.01, 0.25, type="l", col="purple", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
par(new=TRUE)
curve(vstat(Rsq=x, n=30, p=6), 0.01, 0.25, type="l", col="black", ylim=c(0, 1), xaxt = "n", yaxt = "n", xlab="", ylab="")
#draw horizontal line at 0.5 cut-off
abline(h=0.5, col="azure4", lty=5)
#add legend
legend(0.2,0.44,c("n=300","n=200","n=100","n=50","n=30"), lty=c(1,1,1,1,1), lwd=c(2.5,2.5,2.5,2.5,2.5), col=c("orange","red","green","purple","black"))
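On the "cleaner way" question in the script's comments: one option is to draw an empty plot once and add one line per sample size in a loop, instead of stacking plots with par(new=TRUE). A minimal sketch, assuming a v function that is vectorized over Rsq (as vstat is, via Vectorize); the helper name plot_v_panel and its arguments are hypothetical:

```r
# Hypothetical helper: one panel of v curves for a given number of parameters p.
# v_fun must accept n, p and Rsq, and be vectorized over Rsq.
plot_v_panel <- function(v_fun, p,
                         ns = c(300, 200, 100, 50, 30),
                         cols = c("orange", "red", "green", "purple", "black"),
                         rsq = seq(0.01, 0.25, length.out = 101)) {
  # Empty plot with the same axes and labels as the original script
  plot(range(rsq), c(0, 1), type = "n",
       xlab = paste("R-squared when Estimating", p, "Parameters"),
       ylab = "v-stat")
  # One column of v values per sample size
  v <- sapply(ns, function(n) v_fun(n = n, p = p, Rsq = rsq))
  for (i in seq_along(ns)) lines(rsq, v[, i], col = cols[i])
  # Horizontal line at the 0.5 cut-off, then the legend
  abline(h = 0.5, col = "azure4", lty = 5)
  legend(0.2, 0.44, paste0("n=", ns), lty = 1, lwd = 2.5, col = cols)
  # Return the plotted values invisibly so they can be inspected
  invisible(v)
}

# Example call, once vstat has been defined:
# for (p in c(2, 3, 4, 6)) plot_v_panel(vstat, p)
```

Passing the v function as an argument keeps the plotting code separate from the calculation, so the same helper works with or without the R-squared adjustment switched on.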