Sunday, May 18, 2014

Communicating uncertainty with p-values



It’s important to communicate any remaining uncertainty in the results of an experiment. P-values have been criticized for inviting dichotomous ‘real vs. not real’ judgments, and recently we’ve seen recommendations to stop reporting p-values and to report 95% confidence intervals instead (e.g., Cumming, 2014). We should realize, though, that this means switching from an arguably limited and often misunderstood metric (the p-value) to another limited and often misunderstood metric (the confidence interval). One thing that strikes me as a benefit of continuing to report p-values and 95% CI side by side is that people have a gut feeling for p-values they don’t yet have for 95% CI. Their gut feeling is not statistically accurate, but it has, in general, been sufficient to lead to scientific progress. Confidence intervals provide more information, but in all fairness, a confidence interval is made up of three numbers (the confidence level and two boundary values), so it’s no surprise the single p-value can’t compete with that. What if we use p-values to communicate uncertainty?

Let’s say I have made up data for two imaginary independent groups of 50 people, with M1 = 6.25, SD1 = 2.23, and M2 = 7.24, SD2 = 2.46. If I report this in line with recent recommendations, I’d say something like: the means were different, t(98) = 2.09, Cohen’s d = 0.42, 95% CI [0.02, 0.81]. This means that in the long run, 95% of the confidence intervals calculated for close replications (with the same sample size!) will contain the true effect size. To me, this all sounds pretty good. There’s a difference, there are two boundary values that do not include zero, and I can say something about what will happen in 95% of future studies I run.
I could also say: the means were different, t(98) = 2.09, p = .039, Cohen’s d = 0.42. In approximately 83.4% of close replication studies we can expect a Cohen’s d between 0.02 and 0.81, and thus a p-value (given our sample size of 50 per condition) between .0001 and .92. To me, this sounds pretty bad. It tells me that only 83.4% of studies (roughly five out of six) will observe an effect that, with a sample size of 50 in each condition, yields a p-value somewhere between .0001 and .92. Furthermore, I know that relatively high p-values are at best very weak support for the alternative hypothesis (e.g., Lakens & Evers, 2014), and seeing the p-value is useful to gauge how confident I should be in the observed effect (i.e., not very). If we want to communicate uncertainty, I have to say this second formulation is doing a much better job, for me.
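
To make this concrete, here is a minimal sketch (assuming SciPy) of how these numbers can be checked: the t-test and Cohen’s d reconstructed from the made-up summary statistics, a rough normal-approximation CI around d (the [0.02, 0.81] above comes from the exact noncentral-t interval, which differs slightly), and the p-values implied by the reported CI boundaries with 50 people per condition.

```python
# Sketch: reconstruct t, p, and Cohen's d from the summary statistics, and
# translate the reported 95% CI boundaries around d back into p-values.
from scipy import stats
import math

n1 = n2 = 50
m1, sd1 = 6.25, 2.23
m2, sd2 = 7.24, 2.46

sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m2 - m1) / sd_pooled                        # ~0.42
t = d * math.sqrt(n1 * n2 / (n1 + n2))           # ~2.1
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df)                   # ~.04

# Rough 95% CI around d via the large-sample variance approximation
se_d = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
ci_d = (d - 1.96 * se_d, d + 1.96 * se_d)        # close to [0.02, 0.81]

# p-values implied by the reported CI boundaries, at the same sample size
for d_bound in (0.02, 0.81):
    t_bound = d_bound * math.sqrt(n1 * n2 / (n1 + n2))
    print(round(2 * stats.t.sf(t_bound, df), 4))  # ~.92 and ~.0001
```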

If we had observed the same effect size (d = .42) but with 1000 participants in each of two groups, we would be considerably more certain there was a real difference. We could say: t(1998) = 9.36, Cohen’s d = 0.42, 95% CI [0.33, 0.51]. To be honest, I’m not feeling this huge increase in certainty. I feel I know the effect with more precision, and I might feel a little more certain, but this increase in certainty is difficult to quantify. I could also say: t(1998) = 9.36, p < .001, Cohen’s d = 0.42. In approximately 83.4% of close replication studies we can expect a p < .001 (actually, any Cohen’s d higher than 0.14 would yield a p < .001 with 2000 participants).
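
A quick sketch (again assuming SciPy) of where that 0.14 comes from: the smallest Cohen’s d that still yields a two-sided p < .001 with 1000 participants per condition.

```python
# Sketch: smallest Cohen's d giving two-sided p < .001 at n = 1000 per group
from scipy import stats
import math

n1 = n2 = 1000
df = n1 + n2 - 2
t_crit = stats.t.isf(0.001 / 2, df)               # critical t for p = .001
d_min = t_crit / math.sqrt(n1 * n2 / (n1 + n2))
print(round(d_min, 3))                            # ~0.147, i.e. d above ~0.14
```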

Note that if you report a 99% CI (which would not be weird if you have 2000 participants), you could report Cohen’s d = 0.42, 99% CI [0.30, 0.54]. You could also say that, on average, in 93.1% of close replications the p-value will be smaller than .001. To me, this really drives home the point. With 99% CIs the boundaries around the observed effect size become slightly wider, but I don’t have a good feeling for how much more certainty a 99% CI [0.30, 0.54] indicates than a 95% CI [0.33, 0.51]. However, I do understand that 93.1% of replications yielding a p < .001 is much better than 83.4% of replications yielding a p < .001. (See Cumming & Maillardet, 2006, for the explanation of why 83.4% and 93.1% of studies will yield a Cohen’s d that falls within the 95% or 99% CI of a single study.)
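
For the curious, here is a small sketch of where the 83.4% and 93.1% figures come from, using a normal approximation to the argument in Cumming & Maillardet (2006): both the original estimate and a replication estimate carry their own sampling error, so on average a replication’s point estimate lands inside the original C% CI with probability 2*Phi(z_C/sqrt(2)) - 1.

```python
# Sketch: average probability that a replication's point estimate falls
# inside the original study's 95% or 99% CI (normal approximation).
from scipy import stats
import math

for level in (0.95, 0.99):
    z = stats.norm.isf((1 - level) / 2)                # 1.96 and 2.58
    capture = 2 * stats.norm.cdf(z / math.sqrt(2)) - 1
    print(f"{level:.0%} CI captures ~{capture:.1%} of replication estimates")
    # -> ~83.4% and ~93.1%
```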

Given time, I might feel a specific 95% CI will be capable of driving home the point in exactly the same manner, but for now, it doesn’t. That’s why I think it’s a good move to keep reporting p-values alongside 95% CI. If you teach the meaning of 95% CI to students who already have a feel for p-values, you might actually want to go back and forth between p-values and 95% CI to improve their understanding of what a 95% CI means. People who are enthusiastic about 95% CI might shiver at this suggestion, but I sincerely wonder whether I will ever feel the difference between a 99% CI [0.30, 0.54] and a 95% CI [0.33, 0.51]. I'd also be more than happy to hear how I can accurately gauge relative differences in the remaining uncertainty communicated by confidence intervals without having to rely on something as horribly subjective and erroneous as my gut feeling.

P.S. I only now realize that ESCI by Geoff Cumming only allows you to calculate a 95% CI around Cohen's d for an independent t-test if you have fewer than 100 subjects in the two conditions, which is quite a limitation. I've calculated the 95% CI around d using these excellent instructions and SPSS files by Karl Wuensch.
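
For anyone without SPSS, here is a rough SciPy-based sketch of the same kind of noncentral-t confidence interval around Cohen's d for two independent groups. It is my own quick translation of the general procedure, so treat it as an illustration and double-check it against Wuensch's files.

```python
# Sketch: noncentral-t CI around Cohen's d for two independent groups.
from scipy import stats, optimize
import math

def d_ci(t_obs, n1, n2, level=0.95):
    df = n1 + n2 - 2
    alpha = 1 - level
    const = math.sqrt(n1 * n2 / (n1 + n2))    # converts noncentrality to d
    # Noncentrality parameters that place the observed t at the upper and
    # lower alpha/2 tails of the noncentral t distribution.
    nc_lo = optimize.brentq(
        lambda nc: stats.nct.cdf(t_obs, df, nc) - (1 - alpha / 2), -50, 50)
    nc_hi = optimize.brentq(
        lambda nc: stats.nct.cdf(t_obs, df, nc) - alpha / 2, -50, 50)
    return nc_lo / const, nc_hi / const

print(d_ci(2.09, 50, 50))   # roughly (0.02, 0.81), as reported above
```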

8 comments:

  1. Since p < .001, presumably the 99.9% CI doesn't include zero either.
    So one approach could be to specify, for each test, the CI size of the highest CI which doesn't include zero. We could maybe replace 95% by *, 99% by **, and 99.9% by ***...

    Slightly more seriously, perhaps we could use a new statistic to present CI results. The reason why a 95% CI of [0.33, 0.51] is better than one of [0.02, 0.81] is surely that in the former case, there are almost two CI-widths (the width being 0.51 - 0.33 = 0.18) of clear blue water between the lower bound of the CI and zero:
    cbw = (0.33 / 0.18) = 1.83. In the latter case, this is (0.02 / 0.79) = 0.025.

    This statistic would then need to be adjusted for different CIs, for example by dividing it by (1 - the CI percentage), although doubtless a better formula could be found by doing all this from first principles. So the 95% CI of [0.33, 0.51] would have a final statistic of (1.83 / .05) = 36.7, whereas a 99% CI of [0.30, 0.54] would have a final cbw value of ((0.30 / 0.24) / .01) = 125. This adjustment has the effect of making (what are in effect) p < .01 results visibly more impressive --- your two similar-looking sets of numbers now boil down to two single numbers that differ by a factor of more than three --- and p < .001 results even more so.
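
    In code, the whole idea boils down to something like this (a throwaway sketch of my made-up statistic, nothing more):

```python
# Hypothetical "cbw" statistic: lower bound expressed in CI-widths from zero,
# then divided by (1 - confidence level).
def cbw(lower, upper, level):
    widths_from_zero = lower / (upper - lower)
    return widths_from_zero / (1 - level)

print(round(cbw(0.02, 0.81, 0.95), 1))   # ~0.5  (the small-sample 95% CI)
print(round(cbw(0.33, 0.51, 0.95), 1))   # ~36.7
print(round(cbw(0.30, 0.54, 0.99), 1))   # ~125.0
```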

    This is probably all meaningless if you're a proper statistician, so I await the learning experience of refutation with considerable interest.

  2. "In approximately 83.4% of close replication studies we can expect a Cohen’s d between 0.02 and .81, and thus a p-value (given our sample size of 50 per condition) between .0001 and .92. To me, this sounds pretty bad."

    I think if you used the same language when you interpreted your CIs then you'd feel just as uncertain. Think if you had originally said this: "the means were different, t(98) = 2.09, Cohen’s d = 0.42, 95% CI [0.02, 0.81]. This interval serves as a rough prediction interval, indicating where 83.4% of replication ds would likely fall (Cumming, Williams, & Fidler, 2004)." If the next time you run the study you might find a difference between means as small as .02 or as large as .81 (nearly double!), or even smaller/larger the remaining 16.6% of the time, how can you feel very certain about that? I can't say I do, even without the p-value language.

    "If we would have observed the same effect size (d = .42) but had 1000 participants in each of two groups, we would be considerably more certain there was a real difference. We could say: t(1998) = 9.36, Cohen’s d = 0.42, 95% CI [0.33, 0.51]. To be honest, I’m not feeling this huge increase in certainty."

    You honestly don't feel an increase in certainty going from "83% of replication ds could likely be as small as .02 or as large as .81" to "83% of replication ds could likely be as small as .33 or as large as .51"? That range of values feels quite a bit smaller to me.

    Replies
    1. Well, obviously I feel *something* - my statistical heart is not made of stone! Yes, I understand 95% CI [0.02, 0.81] is not as good as I'd want it to be (so I was overstating it a little - it is an accepted literary technique). Let's make it more real: with 200 participants, 100 per condition, I get 95% CI [0.18, 0.74]. That's a little better, sure, I can count, but the p < .001 is really telling me much more. But, as I say, perhaps with time, I'll get a better feeling for when things enter the corridor of stability ;)

    2. My statistical heart is made out of swiss cheese, still a lot to learn to fill the holes ;)

      So I've realized that I may not be picking up a certain distinction. Are you using "certainty" to characterize your belief in an effect (i.e., this effect is real or not)? That may be where I lose you. Your intro accurately says that there is criticism of the dichotomous thinking promoted by p-values. Is this post intending to attach a level of belief to our p-values, in the form of p intervals, in such a way as to reduce that dichotomous thinking?

  3. I was working on a piece on calibrated p-values when I thought of this. To be clear, I don't think we should really write up such statements in the results section. I think that (in small samples) it is good to move from a dichotomous interpretation of p-values to one where we think p < .001 indicates more certainty (about the alternative hypothesis) than p = .032. Bayes factors will generally tell you the same. So, I think I mean certainty about the alternative hypothesis, which normal p-values overstate, but nevertheless provide some indication of (especially if they are really low). Using calibrated p-values would make this even better.

  4. Daniel, I agree with you that different indices vary in their effectiveness of communicating a situation. And I share your gut feeling that CIs often do not drive a point home.

    Here are my current favorites which serve my personal statistical heart well:

    - Communicate the overall *evidence* “Is there an effect or not”? —> Bayes Factors

    - Communicate the *magnitude* of an effect: Use the common language effect size (McGraw and Wong, 1992)

    - Communicate the *uncertainty* of the parameter estimate: Well, that’s the arena of CIs. If you want to reduce it to a single number, you could use the SE (although this already needs a lot of inference by the reader before it can be decoded - “multiply by 2, and subtract and add it to the point estimate”. A lot of mental arithmetic). A potential solution could be a plot of the posterior samples of the ES estimate - for me this is one of the best ways to visualize the uncertainty, in a way that I _feel_ it! (Rough sketch below.)
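
    Roughly what I have in mind, as a toy sketch (a bootstrap stands in for proper posterior samples from a fitted Bayesian model, just to show the kind of plot I mean):

```python
# Toy sketch: visualize uncertainty in a standardized effect size estimate
# by plotting resampled (bootstrap) estimates; real posterior samples from
# an MCMC fit would be the proper version of this.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
group1 = rng.normal(6.25, 2.23, 50)      # made-up data, as in the post
group2 = rng.normal(7.24, 2.46, 50)

samples = []
for _ in range(5000):
    g1 = rng.choice(group1, 50, replace=True)
    g2 = rng.choice(group2, 50, replace=True)
    sd_pooled = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
    samples.append((g2.mean() - g1.mean()) / sd_pooled)

plt.hist(samples, bins=50)
plt.xlabel("Standardized effect size")
plt.title("Uncertainty in the effect size estimate")
plt.show()
```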

    Replies
    1. Hi Felix, I love your multi-perspective approach to statistics, and I fully agree. I think that's how we should move forward. Personally, I see ways in which p-values can play a role, given that they ARE related to the weight of evidence (unlike what some people seem to be suggesting), but I fully understand you prefer Bayes Factors. I have no idea what you mean by plotting the posterior samples of the ES estimate, but I can't wait for your blog post on it :)

  5. "One thing that strikes me as a benefit of continuing to report p-values and 95% CI side by side is that people have a gut-feeling for p-values they don’t yet have for 95% CI."

    The core of the problem is to avoid the standardized effect size. If you instead describe the effect size and CI in original units, it's easy to interpret. This is because the effect size mostly describes the effect (it's in the word!!!) of a causal manipulation. The "gut feeling" you refer to is domain-specific knowledge (think of Conway's data analysis diagram). This tells a clinician whether a health improvement due to treatment of magnitude M and CI [L, U] is notable or not.

    Of course, once you standardize or if you can't figure out the causal interpretation of the quantity then you are lost and p-values won't help you.
