A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Tuesday, December 2, 2014

98% Capture Percentages and 99.9% Confidence Intervals

I think it is important that people report confidence intervals to provide an indication of the uncertainty in the point estimates they report. However, I am not too enthusiastic about the current practice of reporting 95% confidence intervals. I think there are good reasons to consider alternatives, such as reporting 99.9% confidence intervals instead.

I’m not alone in my dislike of a 95% CI. Sterne and Smith (2001, p. 230) have provided the following recommendation for the use of confidence intervals:

Confidence intervals for the main results should always be included, but 90% rather than 95% levels should be used. Confidence intervals should not be used as a surrogate means of examining significance at the conventional 5% level. Interpretation of confidence intervals should focus on the implications (clinical importance) of the range of values in the interval.

This last sentence is, along with world peace, an excellent recommendation of what we should focus on. Neither seems very likely in the near future. I think people (especially when they do purely theoretical research) will continue to interpret confidence intervals as an indication of whether an effect is statistically different from 0, and make even more dichotomous statements than they do with p-values. After all, a confidence interval either includes 0 or it doesn't, but p-values come in three* different** magnitudes***.

We know that relatively high p-values (e.g., p-values > 0.01) provide relatively weak support for H1 (or sometimes, in large samples with high power, they actually provide support for H0). So instead of using a 90% CI, I think it’s a better idea to use a 99.9% CI. This has several benefits.

First of all, instead of arguing that we should stop reporting p-values (which I don’t think is necessary) because confidence intervals give us exactly the same information, we can report p-values as we are used to (using p < 0.05) alongside 99.9% CIs that tell us whether an effect differs from 0 with p < .001. We can then immediately see whether an effect is statistically different from 0 using the typical alpha level, and whether it would still be statistically different from 0 under a much stricter alpha level of 0.001. Note that I have argued against using p < .001 as a hard criterion to judge whether a scientific finding has evidential value, and prefer a more continuous evaluation of research findings. However, when using a 99.9% CI, you can apply the traditional significance criterion we are used to, while at the same time seeing what would have happened had you followed the stricter recommendation of p < .001. Since evidential value is negatively correlated with p-values (the lower the p-value, the higher the evidential value, all else being equal; see Good, 1992), any effect that would have been significant with p < .001 has more evidential value than an effect significant only at p < .05.

Second, confidence intervals are often incorrectly interpreted as the range that will contain the true parameter of interest (such as an effect size) with 95% probability. Confidence intervals are statements about the proportion of future confidence intervals that will include the true parameter, not statements about the probability that the parameter falls within any single interval.

The more intuitive interpretation people want to use when they see a confidence interval is to interpret it as a Capture Percentage (CP): the percentage of future sample statistics that will fall within the interval. My back-of-an-envelope explanation in the picture below shows how a 95% CI is only a 95% CP when the parameter (such as an effect size) you observe in a single sample happens to be exactly the same as the true parameter (left). When this is not the case (and it is almost never exactly the case), less than 95% of future effect sizes will fall within the CI from your current sample (see the right side of the figure). In the long run, so on average, a 95% CI has an 83.4% capture percentage.
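A quick simulation makes the 83.4% concrete. This sketch is my own addition, and it assumes the idealized textbook setting: normally distributed sample means with a known standard error.

```python
import random

random.seed(1)
mu, se = 0.0, 1.0   # true mean and standard error of the sample mean
z = 1.96            # critical value for a 95% CI

captures, trials = 0, 20000
for _ in range(trials):
    # one "original" study: its observed mean and 95% CI
    m1 = random.gauss(mu, se)
    lo, hi = m1 - z * se, m1 + z * se
    # one future replication mean: does it land in the original CI?
    m2 = random.gauss(mu, se)
    captures += lo <= m2 <= hi

print(round(captures / trials, 3))  # close to 0.834, not 0.95
```

The CI captures 95% of future means only when the original mean happens to equal the true mean; averaged over where the original mean actually lands, the capture rate drops to about 83.4%.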

This is explained in Cumming & Maillardet (2006), who present formulas to convert a confidence interval to a capture percentage. I’ve made a spreadsheet in case you want to try out different values:
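The conversion can also be sketched in a few lines of Python. This is a minimal version of the normal-theory relation CP = 2Φ(z/√2) − 1, with z the critical value for the chosen confidence level; the `phi_inv` bisection helper is my own addition to avoid external dependencies, and the spreadsheet may compute it differently.

```python
from math import erf, sqrt

def phi(x):
    # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi_inv(p):
    # inverse standard normal CDF via bisection
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def capture_percentage(conf_level):
    # average capture percentage of a conf_level CI for the mean
    z = phi_inv((1 + conf_level) / 2)
    return 2 * phi(z / sqrt(2)) - 1

print(round(capture_percentage(0.95), 3))   # 0.834
print(round(capture_percentage(0.999), 3))  # 0.98
```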


Capture percentages are interesting. You typically have only a single sample, and thus a single confidence interval, so any statement about what an infinity of future confidence intervals will do is relatively uninteresting. A capture percentage, however, is a statement you can make based on a single confidence interval: it says something about where future statistics (such as means or effect sizes) are likely to fall. A value of 83.4% is a little low (it means that, on average, you will be wrong about the future 16.6% of the time). For a 99.9% confidence interval, the capture percentage is 98%. Those are two easy-to-remember numbers, and being 98% certain of where you can expect something is pretty good.

So, changing reporting practices away from 95% confidence intervals to 99.9% confidence intervals (which are also 98% capture intervals) has at least two benefits. The only downside is that the confidence intervals are a little wider (see below for an independent t-test with n = 20 and a true d of 1), but if you really care about the width of a confidence interval, you can always collect a larger sample. Does this make sense? I'd love to hear your ideas about using 99.9% confidence intervals in the comments.
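To put a number on "a little wider", here is a back-of-the-envelope sketch. The assumptions are mine: n = 20 per group, a pooled SD of 1 (so the mean difference equals d), and approximate t critical values for df = 38 taken from standard tables.

```python
# Half-widths of the 95% and 99.9% CIs for a two-sample mean difference
se = (2 / 20) ** 0.5          # standard error of the difference
t95, t999 = 2.024, 3.566      # approximate t(38) critical values
half95, half999 = t95 * se, t999 * se
print(round(half95, 2), round(half999, 2))  # roughly 0.64 and 1.13
print(round(half999 / half95, 2))           # the 99.9% CI is ~1.76x wider
```

So under these assumptions the 99.9% interval is wider, but not dramatically so, and sample size remains the main lever for precision.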


  1. If you really want to use confidence intervals instead of probability/credible intervals, there is nothing stopping you from reporting many different coverages at the same time, and this is especially easy if you use a plot. This is sometimes called a caterpillar plot (an example: http://xavier-fim.net/packages/ggmcmc/figure/xcaterpillar.png.pagespeed.ic.bTv6onD83d.png). Reporting many different coverages at the same time (say, 50%, 90%, and 99.9%) would, I believe, put more focus on the actual estimate and less focus on whether an interval crosses zero or not.

    1. That's indeed an excellent suggestion for plots, and might even be a good suggestion for tables! I think it will indeed work in moving the attention of readers to estimation instead of statistical differences from 0.

  2. Interesting post, I did not know of capture percentages before. I am not sure how much it helps to report p-values with CIs, because most people might focus more (or only) on the p-value. Besides, I do not see the added value of a p-value when you also present the CI, except for pleasing readers who like to see p-values. If one were to present information on both, one could easily do this in one plot. The plot would contain the CI, and one could use the usual *, **, *** as marker labels right next to each CI.

  3. I think pleasing readers who like to see p-values is (for now) a pretty good reason to also report p-values. Give people time to get used to seeing both, and then maybe someday we won't need p-values.
    Another common mistake I see in interpreting CIs is people forgetting that values at the edges of a CI are less likely than values in the middle. I like the caterpillar plot suggestion for that reason.

  4. I report the mean + 95% interval and use the caterpillar (eh, is that what it is called?) plot with 50% and 95% intervals.
    I think reporting a larger interval than 95% is a bad idea. Actually, I sometimes feel bad about 95%. Let me put aside the obvious problem that for informed interpretation you would never, ever be interested in something like a 99.9% interval. I just wish to point out that extreme CIs are simply beyond the resolution of your data and statistical method. Your statistical tools come with a certain precision. E.g., if you use the bootstrap or MCMC to construct the estimated interval, you will need to get more samples. With 1000 samples, your 99.9% interval will depend on the position of the 2 most extreme points. Such a boundary estimate will not be very robust and will vary considerably across replications of your analysis. The problem arises because we usually use distributions with a small amount of probability mass in their tails. That is, the discrepancy in the estimated value at the tail (e.g., between 99.9 and 99.8) will be greater compared to more central regions. Similarly, noise will have a bigger impact at the tails.
    There is another reason for this problem, namely that there will be a discrepancy between the model our analysis assumes and how nature really works. This discrepancy will be larger in parts of the population where we have little data to check the model's assumptions. In the worst case, the estimate of an extreme interval boundary such as 99.9% will just reflect the assumptions behind the model. Or the estimate will reflect arbitrary outliers, which should properly be handled by a different model - as is the case in various outlier detection algorithms.
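The point about bootstrap resolution is easy to illustrate. The sketch below is my own (assumptions: a percentile bootstrap for a sample mean, 1000 resamples, the same data re-analyzed several times); with 1000 resamples the 99.9% upper boundary is literally the largest bootstrap mean, so it wobbles far more across re-runs than the 95% boundary does.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)]

def boot_means(data, reps=1000):
    # sorted percentile-bootstrap distribution of the sample mean
    return sorted(statistics.fmean(random.choices(data, k=len(data)))
                  for _ in range(reps))

# Re-run the identical bootstrap analysis 20 times on the same data,
# recording the upper bound of the 95% and 99.9% intervals each time.
upper95, upper999 = [], []
for _ in range(20):
    means = boot_means(data)
    upper95.append(means[975])   # ~97.5th percentile of 1000 resamples
    upper999.append(means[999])  # 99.95th percentile = the maximum

print(round(statistics.stdev(upper95), 4),
      round(statistics.stdev(upper999), 4))
# the 99.9% boundary varies considerably more across identical re-runs
```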

    1. Hi Matus, thanks for your reply - very good points. Could this be resolved by using robust statistics? I have to learn more about this, but some winsorized variances might improve things, perhaps?

    2. I don't like the ad-hoc, black-box approaches to handling outliers/extreme values, like winsorizing, outlier exclusion, or a t distribution with wide tails. If you are really interested in the extreme boundary, I would model outliers explicitly with some mixture model. The simplest way to do so would be a bimodal distribution. Of course, that means you have to make assumptions about how outliers behave, which is often difficult.

      The best way, really, is to treat the value at, say, the 99.9% boundary as the target dependent variable. Something like: extend model M1: y~bx, where you report the 99.9% CI of b, into M2: y~bx; u~getCI(b,99.9), and report the mean and CI of u. Then you would design an experiment that allows you to estimate u precisely (most probably at the expense of b). In a simple study this may just mean that you mostly target well-skilled and miserably skilled populations, or that you use very easy and very difficult items.

      It may look to you like this creates an infinite regress, but it really is about what your target variable is, which should be determined by your research question. A meta-CI (say, a 99.9% CI of the 99.9th percentile of some variable) is of little theoretical importance. As I said, I can barely imagine any use case for a 99.9% CI.

  5. Am I missing something or is the case for capture percentages actually quite weak?

    1) 95% of future CIs will contain the true effect size (so, my CI has a 95% chance of including the *true* effect size)

    2) 83% of future effect sizes will be in my CI (so, my CI has an 83% chance of including any individual future effect size)

    Surely, we are interested in the true effect size, not any individual effect size. We are in the game of investigating human psychology, not predicting each other's stats results.

    1. A late reply, but anyway: the interpretation in 1) is wrong, because the 95% probability is not about your particular CI, but about how many future CIs contain the true effect size. After all, it is frequentism we are talking about. (By Ingo Rohlfing; the website does not allow me to use my profile.)

  6. Any way you can re-upload your Excel spreadsheet to calculate this? Thanks in advance!