Comments on The 20% Statistician: Communicating uncertainty with p-values

"One thing that strikes me as a benefit of co...

2014-05-27T10:13:55.809+02:00

"One thing that strikes me as a benefit of continuing to report p-values and 95% CI side by side is that people have a gut-feeling for p-values they don’t yet have for 95% CI."

The key is to avoid the standardized effect size. If you instead describe the effect size and CI in original units, it's easy to interpret. This is because the effect size mostly describes the effect (it's in the word!!!) of a causal manipulation. The "gut feeling" you refer to is domain-specific knowledge (think of Conway's data analysis diagram). This tells a clinician whether a health improvement due to treatment of magnitude M and CI [L, U] is notable or not.

Of course, once you standardize or if you can't figure out the causal interpretation of the quantity then you are lost and p-values won't help you.

Hi Felix, I love your multi-perspective approach t...

2014-05-23T06:23:08.168+02:00

Hi Felix, I love your multi-perspective approach to statistics, and I fully agree. I think that's how we should move forward. Personally, I see ways in which p-values can play a role, given that they ARE related to the weight of evidence (unlike what some people seem to be suggesting), but I fully understand that you prefer Bayes Factors. I have no idea what you mean by plotting the posterior samples of the ES estimate, but I can't wait for your blog post on it :)

Daniel, I agree with you that different indices va...

2014-05-20T18:27:14.524+02:00

Daniel, I agree with you that different indices vary in their effectiveness of communicating a situation. And I share your gut feeling that CIs often do not drive a point home.

Here are my current favorites which serve my personal statistical heart well:

- Communicate the overall *evidence* “Is there an effect or not”? —> Bayes Factors

- Communicate the *magnitude* of an effect: Use the common language effect size (McGraw and Wong, 1992)

- Communicate the *uncertainty* of the parameter estimate: Well, that’s the arena of CIs. If you want to reduce it to a single number, you could use the SE (although this already needs a lot of inference by the reader before it can be decoded - “multiply by 2, and subtract and add it to the point estimate”. A lot of mental arithmetic). A potential solution could be a plot of the posterior samples of the ES estimate - for me this is one of the best ways to visualize the uncertainty, in a way that I _feel_ it!
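As a rough illustration of the second and third points, here is a minimal Python sketch. It assumes normally distributed scores with equal variances, in which case the common language effect size reduces to Φ(d / √2). The function names, the d = 0.42 example (borrowed from the post), and the SE of 0.20 are illustrative assumptions, not values from McGraw and Wong.

```python
from math import sqrt
from statistics import NormalDist


def common_language_es(d):
    """Probability that a random score from group 1 exceeds a random score
    from group 2, ASSUMING normal data with equal variances."""
    return NormalDist().cdf(d / sqrt(2))


def rough_interval(estimate, se):
    """The 'multiply by 2, subtract and add' mental arithmetic from the SE."""
    return (estimate - 2 * se, estimate + 2 * se)


# Illustrative inputs only: d = 0.42 is the post's example, SE = 0.20 is assumed.
cl = common_language_es(0.42)
low, high = rough_interval(0.42, 0.20)
print(f"common language ES: {cl:.3f}")
print(f"rough 95% interval: [{low:.2f}, {high:.2f}]")
```

Under these assumptions, d = 0.42 corresponds to a bit more than a 60% chance that a randomly picked person from one group scores higher than a randomly picked person from the other.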

I was working on a piece on calibrated p-values wh...

2014-05-20T07:25:59.964+02:00

I was working on a piece on calibrated p-values when I thought of this. To be clear, I don't think we should really write up such statements in the result section. I think that (in small samples) it is good to move from a dichotomous interpretation of p-values to one where we think p < .001 indicates more certainty (for the alternative hypothesis) than p = .032. Bayes factors will generally tell you the same. So, I think I mean certainty about the alternative hypothesis, which normal p-values overstate, but nevertheless provide some indication of (especially if they are really low). Using calibrated p-values would make this even better.
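The comment does not say which calibration is meant. One widely used option, assumed here purely for illustration, is the Sellke, Bayarri and Berger (2001) bound, which turns a p-value into a lower bound on the Bayes factor in favor of the null, -e · p · ln(p), valid for p < 1/e. The function names and the 1:1 prior odds are assumptions of this sketch.

```python
from math import e, log


def bayes_factor_bound(p):
    """Lower bound on the Bayes factor for H0 vs H1: -e * p * ln(p).
    Only valid for 0 < p < 1/e."""
    if not 0 < p < 1 / e:
        raise ValueError("bound is only defined for 0 < p < 1/e")
    return -e * p * log(p)


def min_posterior_null(p):
    """Smallest posterior probability of H0 implied by the bound,
    ASSUMING prior odds of 1:1."""
    b = bayes_factor_bound(p)
    return b / (1 + b)


for p in (0.05, 0.032, 0.001):
    print(f"p = {p}: BF bound = {bayes_factor_bound(p):.3f}, "
          f"min P(H0 | data) = {min_posterior_null(p):.3f}")
```

On this calibration the ordering of the p-values is preserved, so p < .001 still counts as stronger evidence than p = .032, but each p-value looks less impressive than its raw value suggests.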

My statistical heart is made out of swiss cheese, ...

2014-05-19T22:23:12.587+02:00

My statistical heart is made out of swiss cheese, still a lot to learn to fill the holes ;)

So I've realized that I may not be picking up a certain distinction. Are you using "certainty" to characterize your belief in an effect (i.e. this effect is real or not)? That may be somewhere that I lose you. Your intro accurately says that there is criticism of dichotomous thinking promoted by p-values. Is this post intending to put a level of belief to our p-values in the form of p intervals, in such a way as to reduce the dichotomous thinking?

Well, obviously i feel something - my statistica...

2014-05-19T21:37:31.228+02:00

Well, obviously I feel *something* - my statistical heart is not made of stone! Yes, I understand 95% CI [0.02, 0.81] is not as good as I want it to be (so I was overstating it a little - it is an accepted literary technique). Let's make it more real: with 200 participants, 100 per condition, I get 95% CI [0.18, 0.74]. That's a little better, sure, I can count, but the p < .001 is really telling me much more. But, as I say, perhaps with time, I'll get a better feeling for when things enter the corridor of stability ;)

"In approximately 83.4% of close replication ...

2014-05-19T20:54:19.937+02:00

"In approximately 83.4% of close replication studies we can expect a Cohen’s d between 0.02 and .81, and thus a p-value (given our sample size of 50 per condition) between .0001 and .92. To me, this sounds pretty bad."

I think if you used the same language when you interpreted your CIs then you'd feel just as uncertain. Imagine you had originally said this: "the means were different, t(98) = 2.09, Cohen’s d = 0.42, 95% CI [0.02, 0.81]. This interval serves as a rough prediction interval, indicating where 83.4% of replication ds would likely fall (Cumming, Williams, Fidler, 2004)." If the next time you run the study you might find a Cohen's d as small as .02 or as large as .81 (nearly double the original estimate!), or, in the remaining 16.6% of replications, something even smaller or larger, how can you feel very certain about that? I can't say I do, even without the p-value language.
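For readers wondering where 83.4% comes from, here is a short sketch under simplifying assumptions: normal sampling distributions, known variance, and a replication with the same sample size as the original. Under those assumptions the difference between two independent estimates has √2 times the standard error of one estimate, which gives the figure directly. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def capture_rate(level=0.95):
    """Expected share of same-sized replication estimates that land inside
    an original CI of the given level (normal, known-variance ASSUMPTION)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    # Original and replication estimates both vary, so their difference
    # has sqrt(2) times the standard error of a single estimate.
    return 2 * NormalDist().cdf(z / sqrt(2)) - 1


print(f"{capture_rate(0.95):.3f}")
```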

"If we would have observed the same effect size (d = .42) but had 1000 participants in each of two groups, we would be considerably more certain there was a real difference. We could say: t(1998) = 9.36, Cohen’s d = 0.42, 95% CI [0.33, 0.51]. To be honest, I’m not feeling this huge increase in certainty."

You honestly don't feel an increase in certainty going from "83% of replication ds could likely be as small as .02 or as large as .81" to "83% of replication ds could likely be as small as .33 or as large as .51"? That range of values feels quite a bit smaller to me.
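To put numbers on how much the range shrinks, here is a sketch that uses a common large-sample approximation to the standard error of Cohen's d, not the exact noncentral-t interval the post presumably used, so the limits can differ slightly in the second decimal. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def d_ci(d, n_per_group, level=0.95):
    """Approximate CI for Cohen's d with two equal-sized groups.
    Uses a large-sample normal approximation (an ASSUMPTION of this sketch)."""
    n1 = n2 = n_per_group
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (d - z * se, d + z * se)


small = d_ci(0.42, 50)     # 50 per condition, as in the post
large = d_ci(0.42, 1000)   # 1000 per condition
for label, (lo, hi) in (("n = 50", small), ("n = 1000", large)):
    print(f"{label}: [{lo:.2f}, {hi:.2f}], width = {hi - lo:.2f}")
```

With the same d, the interval for 1000 per group is less than a quarter as wide as the one for 50 per group.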

Since p < .001, presumably the 99.9% CI doesn'...

2014-05-19T11:20:37.569+02:00

Since p < .001, presumably the 99.9% CI doesn't include zero either.
So one approach could be to specify, for each test, the CI size of the highest CI which doesn't include zero. We could maybe replace 95% by *, 99% by **, and 99.9% by ***...

Slightly more seriously, perhaps we could use a new statistic to present CI results. The reason why a 95% CI of [0.33, 0.51] is better than one of [0.02, 0.81] is surely that in the former case, there are almost two CI-widths (i.e., 0.51 - 0.33 = 0.18) of clear blue water between the lower bound of the CI and zero:
cbw = (0.33 / 0.18) = 1.83. In the latter case, this is (0.02 / 0.79) = 0.025.

This statistic would then need to be adjusted for different CIs, for example by dividing it by (1 - the CI percentage), although doubtless a better formula could be found by doing all this from first principles. So the 95% CI of [0.33, 0.51] would have a final statistic of (1.83 / .05) = 36.7, whereas a 99% CI of [0.30, 0.54] would have a final cbw value of ((0.30 / 0.24) / .01) = 125. This adjustment has the effect of making (what are in effect) p < .01 results visibly more impressive --- your two similar-looking sets of numbers now boil down to two single numbers that differ by a factor of more than three --- and p < .001 results even more so.
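The arithmetic above is easy to check with a few lines of Python. This is only a sketch of the ad hoc statistic as described in this comment; the function name `cbw` and its signature are made up for illustration, and it assumes a CI whose lower bound is above zero.

```python
def cbw(lower, upper, level=None):
    """'Clear blue water': lower bound divided by CI width.
    If a confidence level is given (e.g. 0.95), also divide by (1 - level),
    which is the adjustment proposed in the comment."""
    ratio = lower / (upper - lower)
    if level is None:
        return ratio
    return ratio / (1 - level)


print(round(cbw(0.33, 0.51), 2))        # unadjusted, narrow CI
print(round(cbw(0.02, 0.81), 3))        # unadjusted, wide CI
print(round(cbw(0.33, 0.51, 0.95), 1))  # adjusted 95% CI
print(round(cbw(0.30, 0.54, 0.99), 1))  # adjusted 99% CI
```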

This is probably all meaningless if you're a proper statistician, so I await the learning experience of refutation with considerable interest.