The 20% Statistician: Sample Size Planning: P-values or Precision?

Thursday, October 30, 2014

Yesterday Mike McCullough posted an interesting question on Twitter. He had collected some data and observed p = 0.026 for his hypothesis, but he wasn't happy. Being aware that higher p-values do not always provide strong support for H1, he wanted to set a new goal and collect more data. With sequential analyses it's no problem to look at the data and collect additional observations (at least when you plan the analyses ahead of time, because then you control the false positive rate), so the question was: which goal should you have?

Mike suggested a p-value of 0.01, which (especially with increasing sample sizes) is a good target. But others quickly suggested forgetting about those damned p-values altogether and planning for accuracy instead. Planning for accuracy is simple: you decide upon the width of the confidence interval you'd like, and determine the sample size you need.
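To make "planning for accuracy" concrete, here is a minimal sketch in Python. It assumes a single mean with a known SD and a normal-approximation confidence interval. The function name `n_for_ci_width` and its defaults are made up for illustration, and the three widths are simply example targets.

```python
from math import ceil
from statistics import NormalDist


def n_for_ci_width(width, sd=1.0, confidence=0.95):
    """Smallest n for which a normal-approximation CI for one mean
    has a total width of at most `width`.

    Assumption: the SD is treated as known, so z-quantiles are used
    instead of t-quantiles.
    """
    if width <= 0 or sd <= 0:
        raise ValueError("width and sd must be positive")
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Total width = 2 * z * sd / sqrt(n); solve for n and round up
    return ceil((2 * z * sd / width) ** 2)


for target in (0.40, 0.30, 0.25):
    print(f"width {target:.2f}: n = {n_for_ci_width(target)}")
```

The only inputs are the SD and the width you want. No expected effect size enters the calculation, which is exactly what people like about this approach.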

I don't really understand why people are pretending like these two choices are any different. They always boil down to the same thing: your sample size. A larger sample size will give you more power, and thus a better chance of observing a p-value of 0.01, or 0.001. A larger sample will also reduce the width of your confidence interval.

So the only difference is which calculation you base your sample size on. You either decide upon an effect size you expect, or a width of the confidence interval you desire, and calculate the sample size. One criticism against power analysis is that you often don't know the effect size (e.g., Maxwell, Kelley, & Rausch, 2008). But with sequential analyses (e.g., Lakens, 2014) you can simply collect some data, and calculate conditional power based on the observed effect size for the remaining sample.
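A conditional power calculation along these lines can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure from the cited papers: a one-sample z-test with known SD, rejection counted in the upper tail only, and `alpha` standing for whatever critical level your sequential design assigns to the final look. The function name and the interim numbers in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def conditional_power(z_interim, n_interim, n_total, alpha=0.01, d_assumed=None):
    """Approximate probability that the final z-statistic exceeds the
    critical value, given the z-statistic observed at the interim look.

    Assumptions: one-sample z-test, known SD, upper rejection region only.
    If `d_assumed` is None, the observed standardized effect is carried
    forward to the remaining observations.
    """
    if not 0 < n_interim < n_total:
        raise ValueError("need 0 < n_interim < n_total")
    norm = NormalDist()
    if d_assumed is None:
        # Observed standardized effect size at the interim look
        d_assumed = z_interim / sqrt(n_interim)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    n_remaining = n_total - n_interim
    # Final z = (sqrt(n1) * z1 + S) / sqrt(N), where S is the sum of the
    # remaining standardized observations: S ~ Normal(n_rem * d, n_rem)
    needed = z_crit * sqrt(n_total) - z_interim * sqrt(n_interim)
    return 1 - norm.cdf((needed - n_remaining * d_assumed) / sqrt(n_remaining))


# Hypothetical interim result: z = 2.2 after 50 observations
for planned_n in (100, 150, 200):
    print(planned_n, round(conditional_power(2.2, 50, planned_n), 3))
```

Increasing `n_total` until the conditional power reaches the level you want gives the number of additional observations to collect. Keep in mind that the observed effect size is itself a noisy estimate, so the result is only as good as that estimate.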

I think a bigger problem is that people have no clue whatsoever when determining an appropriate width for a confidence interval. I've argued before that people have a much better feel for p-values than confidence intervals.

In the graph below, you see 60 one-sample t-tests, all examining a true effect with a mean difference of 0.3 (dark vertical line) with an SD of 1. The bottom 20 are based on a sample size of 118, the middle 20 on a sample size of 167, and the top 20 on a sample size of 238. This gives you 90% power for an alpha of 0.05, 0.01, and 0.001, respectively. Not surprisingly, as power increases, fewer confidence intervals include 0 (i.e., more tests are significant). The larger the sample size, the further the confidence intervals stay away from 0.
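The sample sizes can be checked approximately with a few lines of Python. This is a sketch and not the script behind the figure: it assumes a two-sided one-sample test and uses the normal approximation, which ignores the extra uncertainty from estimating the SD. It therefore returns values a few observations below the 118, 167, and 238 used here. The function name `n_for_power` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_for_power(d, alpha, power=0.90):
    """Approximate n for a two-sided one-sample test to detect a
    standardized effect `d` with the requested power.

    Assumption: normal approximation (z-test); an exact t-test
    calculation needs slightly more observations.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = norm.inv_cdf(power)          # quantile for the desired power
    return ceil(((z_alpha + z_power) / d) ** 2)


for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: n is roughly {n_for_power(0.3, alpha)}")
```

The same effect size of 0.3 requires roughly twice as many observations when you move from an alpha of 0.05 to an alpha of 0.001 at 90% power.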

Take a look at the width of the confidence intervals. Can you see the differences? Do you feel the difference in aiming for a width of the confidence interval of 0.40, 0.30, or 0.25 (more or less the width in the three groups from bottom to top)? If not, but you do have a feel for the difference between aiming for p = 0.01 or p = 0.001, then go for the conditional power analysis. I would.

R script that produced the figure above:

5 comments:

Michael Frank, October 30, 2014 at 9:32 PM
Nice post! Another way of thinking about this: "what do I want to compare this measurement to?" If your answer is "the null" then the p-value and conventional power analysis are a good idea. If your answer is "another set of conditions that might plausibly differ by the following amount," then it may make more sense to think about the precision of the CI. They are the same thing, but one privileges testing against the null, while the other is more compatible with model-based thinking about actual numerical predictions across different conditions.

matus, October 30, 2014 at 10:44 PM
Once again, Daniel, you present your very limited idea of what statistics offers to researchers. Your argument is all about decisions. But what about people like me who don't make any decisions? If I develop a new therapy and test it, I then report that BDI decreases by x%, 95% CI [a, b]. Every idiot understands what this means (well, maybe minus the APA standard for reporting CIs). A clinician further knows whether this decrease is notable. This depends on what alternative therapies there are to choose from. It is left to him/her to decide whether he wants to use this therapy. In comparison, what does he gain if I tell him that my therapy significantly decreases the BDI score with p = 0.023? I decide to discard a practically irrelevant and arbitrary hypothesis (H0, e.g. no change in BDI score) based on a practically irrelevant and arbitrary cut-off (e.g. alpha = 0.05). Or does your magical intuition for p-values somehow tell you whether a therapy with p = 0.023 is worth applying? I don't have such an intuition and I certainly doubt that any therapist has. Maybe you could finally disclose where and how these intuitions are obtained. You speak about them a lot on this blog...

The same goes for precision. For instance, a clinician may know that the lower bound a of the ES is not difficult to achieve with the available methods, some of which are less costly than the proposed therapy. But maybe the mean estimate x is a notable decrease, such that he asks for more data, which would increase the precision of the estimate and tell us whether the mean is indeed around x or whether the result was a fluke and the mean estimate is more like a. Again, this kind of reasoning is beyond what p and sample size can tell you.

As to your graph, if you told us in what units the outcome was measured, experts could use their domain-relevant knowledge to decide whether the precision is good or should be improved.

Michael McCullough, November 4, 2014 at 10:48 PM
This is very helpful, Daniel! I really love the idea of using what we know (i.e., the effect size estimate from the data we already have) and using that to plan further data collection. What an ingenious idea. Thanks so much for taking my question/problem so seriously!