I was reworking a lecture on confidence intervals I’ll be
teaching when I came across a perfect real-life example of a common error
people make when interpreting confidence intervals. I hope everyone (Harvard professors,
Science editors, my bachelor students) will benefit from a clear explanation of
this misinterpretation.
Let’s assume a Harvard professor and two Science editors make
the following statement: 
If you take 100 original studies and replicate them, then “sampling error alone should cause 5% of the
replication studies to “fail” by producing results that fall outside the 95% confidence
interval of the original study.”*
The formal meaning of a 95% confidence interval is that, in the long run, 95% of
the confidence intervals computed this way should contain the true population
parameter. See Kristoffer Magnusson’s excellent visualization, where you can see how 95% of the
confidence intervals include the true population value. Remember that the
confidence level is a statement about future confidence intervals: in the long
run, 95% of them will contain the true value.
Single confidence intervals are not a statement about where the
means of future samples will fall. The percentage of means in future samples that fall within a single
confidence interval is called the capture
percentage. The percentage of future means that fall within a single unbiased confidence interval depends
upon which single confidence interval you happened to observe, but in the long run, 95% confidence intervals have an 83.4% capture percentage (Cumming & Maillardet, 2006). In
other words, in a large number of unbiased original studies, 16.6% (not 5%) of replication
studies will observe a parameter estimate that falls outside of the confidence
interval of the original study. (Note that this percentage assumes an equal sample size in the
original and replication study. If sample sizes differ, you would need to
simulate the capture percentages for each study.)
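The 83.4% figure can be checked with a short calculation. This is only a sketch under simplifying assumptions that are mine, not part of the cited paper's derivation as given here: the population standard deviation is treated as known (so normal rather than t quantiles apply) and both studies have the same sample size. Under those assumptions, the difference between two independent sample means has a standard deviation of sqrt(2) standard errors, and a replication mean is captured whenever that difference is smaller than the half-width of the original interval.

```r
# Analytic check of the long-run capture percentage.
# Assumptions for this sketch: known population SD, equal sample sizes.
z <- qnorm(0.975)  # half-width of a 95% CI, in standard-error units

# The difference between two independent sample means has an SD of
# sqrt(2) standard errors, so the replication mean is captured when
# the absolute difference is smaller than z standard errors.
capture <- 2 * pnorm(z / sqrt(2)) - 1

round(100 * capture, 1)  # 83.4
```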
Let’s experience this through simulation. Run the entire
R script available at the bottom of this post. This script will simulate a single sample with a true population mean
of 100 and standard deviation of 15 (the mean and SD of an IQ test), and create
a plot. Samples drawn from this true population will show variation, as you can
see from the mean and standard deviation of the sample in the plot. The black
dotted line illustrates the true mean of 100. The orange area illustrates the
95% confidence interval around the sample mean, and in the long run 95% of such orange bars will contain the black dotted line. For example:
The simulation also generates a large number of additional
samples, after the initial one that was plotted. The simulation returns the
percentage of confidence intervals from these simulations that contain the true mean (which should be 95%
in the long run). The simulation also returns the % of sample means from future
studies that fall within the 95% confidence interval of the original study. This is the capture
percentage. It differs from (and is typically lower than) the confidence level.
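To make the two quantities concrete, here is a minimal sketch of such a simulation. It is not the script referred to in this post; the function name `simulate_capture` and its arguments are made up for illustration, and the defaults (n = 20, 10000 simulations) are assumptions chosen to keep it fast.

```r
# Minimal sketch: coverage of 95% CIs versus capture percentage of one CI.
# simulate_capture() is a hypothetical helper written for this example.
simulate_capture <- function(n = 20, n_sims = 10000, mu = 100, sigma = 15) {
  # The "original study": one sample and its 95% confidence interval
  orig <- rnorm(n, mean = mu, sd = sigma)
  half <- qt(0.975, df = n - 1) * sd(orig) / sqrt(n)
  ci <- c(mean(orig) - half, mean(orig) + half)

  # Many "replication studies" drawn from the same population
  sims <- replicate(n_sims, {
    x <- rnorm(n, mean = mu, sd = sigma)
    h <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
    c(mean = mean(x), lower = mean(x) - h, upper = mean(x) + h)
  })

  # Coverage: share of replication CIs that contain the true mean
  coverage <- mean(sims["lower", ] <= mu & mu <= sims["upper", ])
  # Capture: share of replication means inside the original CI
  capture <- mean(sims["mean", ] > ci[1] & sims["mean", ] < ci[2])

  list(ci = ci, coverage = 100 * coverage, capture = 100 * capture)
}

res <- simulate_capture()
res$coverage  # close to 95 in the long run
res$capture   # varies from one original study to the next
```

Running it repeatedly shows the coverage staying near 95%, while the capture percentage moves around depending on which original sample was drawn.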
Q1: Run the simulations multiple times (the 100000
simulations take a few seconds). Look at the output you will get in the R
console. For example: “95.077 % of the 95% confidence intervals contained the
true mean” and “The capture percentage for the plotted study, or the % of
values within the observed confidence interval from 88.17208 to 103.1506 is:
82.377 %”. While running the simulations multiple times, look at the confidence
interval around the sample mean, and relate this to the capture percentage.
Which statement is true?
A) The further the sample mean in the original study is from the true population
mean, the lower the capture percentage.
B) The further the sample mean in the original study is from the true population
mean, the higher the capture percentage.
C) The wider the confidence interval around the mean, the
higher the capture percentage.
D) The narrower the confidence interval around the mean, the
higher the capture percentage.
Q2: Simulations in R are randomly generated, but you can
make a specific simulation reproducible by setting the seed of the random
generation process. Copy-paste “set.seed(123456)” to the first line of the R
script, and run the simulation. The sample mean should be 108 (see the picture below). This is a clear
overestimate of the true population parameter. Indeed, just by chance, this
simulation yielded a result that is significantly different from the null
hypothesis (the mean IQ of 100). Because the true mean is 100, this significant
result is a Type 1 error. Such overestimates
are common in a literature rife with publication bias. A recent large-scale
replication project revealed that even for studies that replicated (according
to a p < 0.05 criterion), the
effect sizes in the original studies were substantially inflated. Given the true mean of 100, many sample means should fall to the left of the orange bar, and this percentage is clearly much larger than 5%. What is the
capture percentage in this specific situation where the original study yielded
an upwardly biased estimate?
A) 95% (because I believe Harvard Professors and Science editors over you and your simulations!)
B) 42.2%
C) 84.3%
D) 89.2%
I always find it easier to understand how statistics work if you
can simulate them. I hope this example makes it clear what the difference between a
confidence interval and a capture percentage is.
* This is a hypothetical statement. Any similarity to
commentaries that might be published in Science in the future is purely
coincidental.


 
Nice explanation of capture percentage, clearly differentiating it from coverage percentage. AND, thanks for the link to Magnusson's mesmerizing demo.
Note how Nate Silver gets this wrong in regard to polling, despite his linking to a correct definition. (Some commentators attempted explanations.)
http://errorstatistics.com/2016/02/12/rubbing-off-uncertainty-confidence-and-nate-silver/
Nice post - although you might want to include the 95% replication capture intervals (what you should do for this type of inference) as a comparator for the 95% CI.
Hacked into your script below:
if(!require(ggplot2)){install.packages('ggplot2')}
library(ggplot2)
n=30 #set sample size
nSims<-1000 #set number of simulations
x<-rnorm(n = n, mean = 100, sd = 15) #create sample from normal distribution
samplemean <-mean(x)
#95%CI
CIU<-samplemean+qt(0.975, df = n-1)*sd(x)*sqrt(1/n)
CIL<-samplemean-qt(0.975, df = n-1)*sd(x)*sqrt(1/n)
RCIU<-samplemean+qt(0.975, df = n-1)*sd(x)*sqrt(2/n)
RCIL<-samplemean-qt(0.975, df = n-1)*sd(x)*sqrt(2/n)
#plot data
#png(file="CI_mean.png",width=2000,height=2000, res = 300)
ggplot(as.data.frame(x), aes(x)) +
geom_rect(aes(xmin=CIL, xmax=CIU, ymin=0, ymax=Inf), fill="#E69F00") +
geom_histogram(colour="black", fill="grey", aes(y=..density..), binwidth=2) +
xlab("IQ") + ylab("number of people") + ggtitle("Data") + theme_bw(base_size=20) +
theme(panel.grid.major.x = element_blank(), axis.text.y = element_blank(), panel.grid.minor.x = element_blank()) +
geom_vline(xintercept=100, colour="black", linetype="dashed", size=1) +
coord_cartesian(xlim=c(50,150)) + scale_x_continuous(breaks=c(50,60,70,80,90,100,110,120,130,140,150)) +
annotate("text", x = mean(x), y = 0.02, label = paste("Mean = ",round(mean(x)),"\n","SD = ",round(sd(x)),sep=""), size=6.5)
#dev.off()
#Simulate Confidence Intervals
CIU_sim<-numeric(nSims)
CIL_sim<-numeric(nSims)
RCIU_sim<-numeric(nSims)
RCIL_sim<-numeric(nSims)
mean_sim<-numeric(nSims)
capture = 0
Tcrit = qt(0.975, df = n-1)
for(i in 1:nSims){ #for each simulated experiment
x<-rnorm(n = n, mean = 100, sd = 15) #create sample from normal distribution
sim_mean = mean(x)
CIW = Tcrit*sd(x)*sqrt(1/n)
CIU_sim[i]<-sim_mean+CIW
CIL_sim[i]<-sim_mean-CIW
RCIU_sim[i]<-sim_mean+CIW*sqrt(2)
RCIL_sim[i]<-sim_mean-CIW*sqrt(2)
mean_sim[i]<-sim_mean #store means of each sample
for (j in 1:i){
if(mean_sim[i]<=RCIU_sim[j]&&mean_sim[i]>=RCIL_sim[j]){
capture=capture+1
}
if(mean_sim[j]<=RCIU_sim[i]&&mean_sim[j]>=RCIL_sim[i]){
capture=capture+1
}
}
}
#How many simulations does the true value lie outside the 95% CI
CIU_sim<-CIU_sim[CIU_sim<100]
CIL_sim<-CIL_sim[CIL_sim>100]
#How many simulations does our original observed value lie outside the 95% RCI
RCIU_sim<-RCIU_sim[RCIU_simsamplemean]
cat((100*(1-(length(CIU_sim)/nSims+length(CIL_sim)/nSims))),"% of the 95% confidence intervals contained the true mean")
cat((100*(1-(length(RCIU_sim)/nSims+length(RCIL_sim)/nSims))),"% of the 95% replication capture intervals contained the observed mean")
#Calculate how many times the simulated mean fell within the 95% CI of the original study
mean_sim1<-mean_sim[mean_sim>CIL&mean_simRCIL&mean_sim<RCIU]
cat("The RCI capture percentage for the plotted study, or the % of means from other simulations within the observed 95% replication capture interval from",RCIL,"to",RCIU,"is:",100*length(mean_sim2)/nSims,"%")
#What proportion of the time did one simulation capture another within the RCI
cat(100*(capture-nSims)/(i*(j-1)),"% of pairwise replication captures were successful from simulated 95% RCIs")
Not sure if you (Mike Atiken) are still reading this, but the code breaks for me from this point on.
RCIU_sim<-RCIU_sim[RCIU_simsamplemean]
(There isn't an object RCIU_simsamplemean, or an object mean_simRCIL or an object mean_sim2)
Apologies - didn't mean to post previous anonymously!
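One possible explanation for the broken lines: the comment system may have swallowed everything between a `<` and the next `>`, fusing separate statements together. The sketch below is a guess at what the missing lines were meant to do, not the commenter's confirmed code. It is a condensed, self-contained version (no plot, no pairwise loop); the seed is arbitrary, and the lines marked "assumed" are reconstructions.

```r
# Hypothetical reconstruction of the garbled part of the script above.
set.seed(1)   # arbitrary seed, added only to make this sketch reproducible
n <- 30       # sample size
nSims <- 1000 # number of simulations

x <- rnorm(n = n, mean = 100, sd = 15)
samplemean <- mean(x)
half <- qt(0.975, df = n - 1) * sd(x)
CIL <- samplemean - half * sqrt(1 / n)   # 95% confidence interval
CIU <- samplemean + half * sqrt(1 / n)
RCIL <- samplemean - half * sqrt(2 / n)  # 95% replication capture interval
RCIU <- samplemean + half * sqrt(2 / n)

mean_sim <- numeric(nSims)
RCIU_sim <- numeric(nSims)
RCIL_sim <- numeric(nSims)
for (i in seq_len(nSims)) {
  s <- rnorm(n = n, mean = 100, sd = 15)
  w <- qt(0.975, df = n - 1) * sd(s) * sqrt(2 / n)
  mean_sim[i] <- mean(s)
  RCIU_sim[i] <- mean(s) + w
  RCIL_sim[i] <- mean(s) - w
}

# Assumed: keep the simulated RCIs that miss the originally observed mean
RCIU_sim <- RCIU_sim[RCIU_sim < samplemean]
RCIL_sim <- RCIL_sim[RCIL_sim > samplemean]
rci_hit <- 100 * (1 - (length(RCIU_sim) + length(RCIL_sim)) / nSims)

# Assumed: simulated means inside the original CI and inside the original RCI
mean_sim1 <- mean_sim[mean_sim > CIL & mean_sim < CIU]
mean_sim2 <- mean_sim[mean_sim > RCIL & mean_sim < RCIU]
ci_capture <- 100 * length(mean_sim1) / nSims
rci_capture <- 100 * length(mean_sim2) / nSims

cat(rci_hit, "% of the 95% replication capture intervals contained the observed mean\n")
cat("CI capture:", ci_capture, "% ; RCI capture:", rci_capture, "%\n")
```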
How is this different from a prediction interval versus a confidence interval (as is often discussed in regression)? Rob Hyndman has a post on this (http://robjhyndman.com/hyndsight/intervals/)
Prediction intervals are typically for a single new observation, and they are much wider than confidence intervals.
Well, I seem to recall that prediction intervals can be used for future sample means (of independent samples), too, treating the future sample mean as an observable. The usual multiplier becomes sqrt(1/n_new + 1/n_old) rather than sqrt(1 + 1/n_old). So, loosely, a 95% CI is really about an 84% PI.
Seymour Geisser wrote a book on this.
A bit late, but I came across a direct reference today:
* Kalbfleisch (1975, 1989) Probability and Statistical Inference II Example 16.3.1
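The prediction-interval view raised in this thread can be sketched in a few lines. Both function names below are hypothetical helpers written for this example, not functions from any package. The sketch assumes normally distributed data and independent samples, in which case the difference between the future and the observed mean, divided by s * sqrt(1/n_new + 1/n_old), follows a t distribution with n_old - 1 degrees of freedom.

```r
# Prediction interval for the mean of a future independent sample.
# prediction_interval_mean() is a hypothetical helper for illustration.
prediction_interval_mean <- function(x, n_new = length(x), level = 0.95) {
  n_old <- length(x)
  tcrit <- qt(1 - (1 - level) / 2, df = n_old - 1)
  # Multiplier from the comment above: sqrt(1/n_new + 1/n_old)
  half <- tcrit * sd(x) * sqrt(1 / n_new + 1 / n_old)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

# Level of a CI when it is read as a prediction interval for the mean of
# a same-sized replication: the critical value shrinks by sqrt(2).
ci_as_pi_level <- function(n, level = 0.95) {
  df <- n - 1
  2 * pt(qt(1 - (1 - level) / 2, df) / sqrt(2), df) - 1
}

x <- rnorm(30, mean = 100, sd = 15)
prediction_interval_mean(x)  # wider than the 95% CI by a factor of sqrt(2)
ci_as_pi_level(30)           # roughly 0.84, in line with the comment above
```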
I do not know whether there is an issue with the simulation, but with a large sample size (n = 1000) and repeating the simulation 100 times, I've found the capture percentage is higher than 84% (I've found 92%)
@AntoViral (do not know how to sign, I haven't a URL)
This is the code, adapted from yours:
library(ggplot2)
### creation of an empty dataframe
data <- data.frame()
N <- 100
for (i in 1:N) {
### original
## n=20 #set sample size
## nSims<-100000 #set number of simulations
### modified by me
set.seed(i) ### set seed for reproducibility
n=1000 #set sample size
nSims<-1000 #set number of simulations
x<-rnorm(n = n, mean = 100, sd = 15) #create sample from normal distribution
#95%CI
CIU<-mean(x)+qt(0.975, df = n-1)*sd(x)*sqrt(1/n)
CIL<-mean(x)-qt(0.975, df = n-1)*sd(x)*sqrt(1/n)
#plot data
#png(file="CI_mean.png",width=2000,height=2000, res = 300)
ggplot(as.data.frame(x), aes(x)) +
geom_rect(aes(xmin=CIL, xmax=CIU, ymin=0, ymax=Inf), fill="#E69F00") +
geom_histogram(colour="black", fill="grey", aes(y=..density..), binwidth=2) +
xlab("IQ") + ylab("number of people") + ggtitle("Data") + theme_bw(base_size=20) +
theme(panel.grid.major.x = element_blank(), axis.text.y = element_blank(), panel.grid.minor.x = element_blank()) +
geom_vline(xintercept=100, colour="black", linetype="dashed", size=1) +
coord_cartesian(xlim=c(50,150)) + scale_x_continuous(breaks=c(50,60,70,80,90,100,110,120,130,140,150)) +
annotate("text", x = mean(x), y = 0.02, label = paste("Mean = ",round(mean(x)),"\n","SD = ",round(sd(x)),sep=""), size=6.5)
#dev.off()
#Simulate Confidence Intervals
CIU_sim<-numeric(nSims)
CIL_sim<-numeric(nSims)
mean_sim<-numeric(nSims)
for(i in 1:nSims){ #for each simulated experiment
x<-rnorm(n = n, mean = 100, sd = 15) #create sample from normal distribution
CIU_sim[i]<-mean(x)+qt(0.975, df = n-1)*sd(x)*sqrt(1/n)
CIL_sim[i]<-mean(x)-qt(0.975, df = n-1)*sd(x)*sqrt(1/n)
mean_sim[i]<-mean(x) #store means of each sample
}
#Keep only those simulations where the true value was outside the 95% CI
CIU_sim<-CIU_sim[CIU_sim<100]
CIL_sim<-CIL_sim[CIL_sim>100]
# cat((100*(1-(length(CIU_sim)/nSims+length(CIL_sim)/nSims))),"% of the 95% confidence intervals contained the true mean")
#Calculate how many times the observed mean fell within the 95% CI of the original study
mean_sim<-mean_sim[mean_sim>CIL&mean_sim<CIU]
# cat("The capture percentage for the plotted study, or the % of values within the observed confidence interval from",CIL,"to",CIU,"is:",100*length(mean_sim)/nSims,"%")
conf <- (100*(1-(length(CIU_sim)/nSims+length(CIL_sim)/nSims)))
capt <- 100*length(mean_sim)/nSims
### collect the data in a dataframe
data <- rbind(data, c(conf, capt))
names(data) <- c("95% CI", "Capture %")
}
### check the result
head(data)
cap <- ifelse(data[,2]<94.9, 1, 0)
plot(data,pch=19)
mtext(paste0("95% confidence intervals have a ", sum(cap), "% capture percentage"))
Sure. Normal. Add
colMeans(data)
and see it's 83.4 - ON AVERAGE
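Why the average matters can be shown without simulation. This is a sketch under the simplifying assumption of a known population SD and equal sample sizes, with `capture_given_offset` as a made-up helper name. The capture percentage of one particular confidence interval depends on how far the original sample mean landed from the true mean; averaging over all the places it could have landed gives the long-run value.

```r
# Capture probability of a single 95% CI as a function of how far the
# original sample mean lies from the true mean (d, in standard errors).
# Assumptions for this sketch: known population SD, equal sample sizes.
z <- qnorm(0.975)
capture_given_offset <- function(d) pnorm(d + z) - pnorm(d - z)

# Capture probability for original means 0, 1 and 2 standard errors off
capture_given_offset(c(0, 1, 2))

# The offset of an original study is itself standard normal, so the
# long-run capture percentage is the average over that distribution.
avg <- integrate(function(d) capture_given_offset(d) * dnorm(d),
                 lower = -Inf, upper = Inf)$value
round(100 * avg, 1)  # 83.4
```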
Hi Daniel, quick question. All of the discussion around CI's has focused on population data. I'm wondering about the implications for individual data (such as neuropsych assessment).
As a neuropsych I was trained that the 95% CI provides a range about which we can be 95% confident contains the individual's 'true' score. But is this actually the case? Is a more accurate interpretation that if we tested the patient over and over again (not accounting for practice effects), their score would fall within the 95% CI 95% of the time?
Hi, no, that is incorrect. It is often taught incorrectly. Confidence intervals are counterintuitive things.
Okay, do you mind explaining the application of CI's in this context (please)?
There is no special application - CI are always what they are, as explained above. It sounds like they are misused - but I can't explain that.
That's a great post but I think it misses the point that not only are statements like the one being critiqued incorrect but they are caring about the wrong thing. They're like early astronomy where the earth is the centre of the universe. The researcher is thinking in terms of their mean and CI as the centre of the universe. Accepting what the CI really means and what a proper statement about it is allows one to be correct 95% of the time. So, after the calculations the critical method is that making your CI the centre of discussion you've reduced your long run accuracy of statements dramatically and further reduced the useful relevance of your study.