A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, October 30, 2014

Sample Size Planning: P-values or Precision?

Yesterday Mike McCullough posted an interesting question on Twitter. He had collected some data and observed p = 0.026 for his hypothesis, but he wasn't happy. Being aware that p-values just below 0.05 do not always provide strong support for H1, he wanted to set a new goal and collect more data. With sequential analyses it's no problem to look at your data (at least when you plan the looks ahead of time) and collect additional observations (because you control the false positive rate), so the question was: which goal should you have?

Mike suggested aiming for a p-value of 0.01, which (especially with increasing sample sizes) is a good target. But others quickly suggested forgetting about those damned p-values altogether and planning for accuracy instead. Planning for accuracy is simple: you decide upon the width of the confidence interval you'd like, and determine the sample size you need to achieve it.

I don't really understand why people pretend these two choices are any different. They always boil down to the same thing: your sample size. A larger sample size will give you more power, and thus a better chance of observing a p-value of 0.01, or 0.001. A larger sample will also reduce the width of your confidence interval.

So the only difference is which calculation you base your sample size on. You either decide upon an effect size you expect, or the width of a confidence interval you desire, and calculate the sample size. One criticism of power analysis is that you often don't know the effect size (e.g., Maxwell, Kelley, & Rausch, 2008). But with sequential analyses (e.g., Lakens, 2014) you can simply collect some data, and calculate conditional power for the remaining sample based on the observed effect size.
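To make the conditional-power idea concrete, here is a minimal sketch (in Python rather than R, and using a simple normal approximation rather than a full sequential design): treat the observed effect size as if it were the true one, and recompute the power that a given total sample size would yield. The effect size, sample size, and alpha level below are illustrative choices, not values from the Twitter exchange.

```python
# Sketch: power of a two-sided one-sample test for a given effect size d
# and sample size n, under a normal approximation. A simplified stand-in
# for conditional power: d would be the effect size observed at an
# interim look, and n the planned total sample size.
from math import sqrt
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # probability that the test statistic lands beyond the critical value
    return nd.cdf(d * sqrt(n) - z_crit)

print(round(approx_power(0.3, 118), 2))  # ≈ 0.90
```

If the interim effect size looks smaller than hoped, the same function tells you how much the planned total n would need to grow to keep power at the desired level.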

I think a bigger problem is that people have no clue whatsoever how to determine an appropriate width for a confidence interval. I've argued before that people have a much better feel for p-values than for confidence intervals.

In the graph below, you see 60 one-sided t-tests, all examining a true effect with a mean difference of 0.3 (dark vertical line) and an SD of 1. The bottom 20 are based on a sample size of 118, the middle 20 on a sample size of 167, and the top 20 on a sample size of 238. This gives you 90% power for p = 0.05, p = 0.01, and p = 0.001, respectively. Not surprisingly, as power increases, fewer confidence intervals include 0 (i.e., more tests are significant). The higher the sample size, the further the confidence intervals stay away from 0.
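Those three sample sizes can be reproduced approximately with a textbook power formula. A hedged sketch in Python (the post's figure came from an R script): it uses a two-sided criterion and a normal approximation, so it lands a few observations below the exact t-based values of 118, 167, and 238.

```python
# Sketch: sample size for 90% power to detect a mean difference of 0.3
# with SD 1 (one-sample, two-sided, normal approximation). The exact
# t-based values quoted in the post are slightly higher.
from math import ceil
from statistics import NormalDist

def n_for_power(d, alpha, power):
    z = NormalDist().inv_cdf
    # sum of the critical z and the z for the desired power, scaled by d
    return ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)

for alpha in (0.05, 0.01, 0.001):
    print(alpha, n_for_power(0.3, alpha, 0.90))
```

Lowering alpha from 0.05 to 0.001 roughly doubles the required sample size here, which is the trade-off the three groups in the figure illustrate.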

Take a look at the width of the confidence intervals. Can you see the differences? Do you have a feel for the difference between aiming for a confidence interval width of 0.40, 0.30, or 0.25 (more or less the widths in the three groups, from bottom to top)? If not, but you do have a feel for the difference between aiming for p = 0.01 or p = 0.001, then go for the conditional power analysis. I would.
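For comparison, planning for precision with those widths is the same kind of calculation run in reverse: pick a desired full CI width and solve for n. A minimal sketch, assuming a 95% CI around a single mean with SD 1 and a normal approximation (so the resulting sample sizes won't exactly match the three groups above, where the widths are approximate and the confidence level varies):

```python
# Sketch: sample size needed for a desired full width w of a 95% CI
# around a mean with SD 1, solving w = 2 * z * sd / sqrt(n) for n.
from math import ceil
from statistics import NormalDist

def n_for_ci_width(width, sd=1.0, level=0.95):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil((2 * z * sd / width) ** 2)

for w in (0.40, 0.30, 0.25):
    print(w, n_for_ci_width(w))
```

Either way, the knob you end up turning is the sample size, which is the point of the post.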

R script that produced the figure above:


  1. Nice post! Another way of thinking about this - "what do I want to compare this measurement to?" If your answer is "the null", then the p-value and a conventional power analysis are a good idea. If your answer is "another set of conditions that might plausibly differ by the following amount," then it may make more sense to think about the precision of the CI. They are the same thing, but one privileges testing against the null, while the other is more compatible with model-based thinking about actual numerical predictions across different conditions.

  2. Once again, Daniel, you present your very limited idea of what statistics offers to researchers. Your argument is all about decisions. But what about people like me who don't make any decisions? If I develop a new therapy and test it, I then report that BDI decreases by x%, 95% CI [a, b]. Every idiot understands what this means (well, maybe minus the APA standard for reporting CIs). A clinician further knows whether this decrease is notable. This depends on what alternative therapies there are to choose from. It is left to him/her to decide whether to use this therapy. In comparison, what does he gain if I tell him that my therapy significantly decreases the BDI score with p = 0.023? I decide to discard a practically irrelevant and arbitrary hypothesis (H0, e.g. no change in BDI score) based on a practically irrelevant and arbitrary cut-off (e.g. alpha = 0.05). Or does your magical intuition for p-values somehow tell you whether a therapy with p = 0.023 is worth applying? I don't have such an intuition and I certainly doubt that any therapist has. Maybe you could finally disclose where and how these intuitions are obtained. You speak about them a lot on this blog...

    The same goes for precision. For instance, a clinician may know that the lower bound a of the ES is not difficult to achieve with the available methods - some of which are less costly than the proposed therapy. But maybe the mean estimate x is a notable decrease, such that he asks for more data, which would increase the precision of the estimate and tell us whether the mean is indeed around x, or the result was a fluke and the mean is more like a. Again, this kind of reasoning is beyond what p and sample size can tell you.

    As to your graph, if you would tell us what was measured and in what units, experts could use their domain-relevant knowledge to decide whether the precision is good or should be improved.

    1. Hi Matus, I'm talking about how to solve a practical problem a real-life researcher had. How would you have solved it?

      I'm not sure I understand how your clinician 'knows' so many things. In particular, any understanding of how to achieve a lower bound on an ES is not in line with the statistical understanding of any clinicians I know. You are betting a lot on relevant domain knowledge - but I doubt it exists, and you do not clearly explain where it should come from.

      In this example, the researcher was clearly doing novel research and was trying to convince himself whether this was real, or not. So how should a sample size for such exploratory research be planned?

    2. As a student I had to choose an applied field for core study, and I went for clinical neuropsychology. Much of the diagnostic training was focused on how to interpret various measures. Aside from the training, if you go into practice you will be handed the results of various tests as part of the patient record. If you just spend time looking at the scores on these tests, you will necessarily gain an impression of what the variability in these measures is and how various interventions affect them. It's no black magic.

      In research, the relevant domain knowledge comes from prior research, usually the published kind. There has been more than a century of this. It's also the reason why each report has an introduction where you describe the prior research. Although nowadays the authors just provide a literature review and then go and do something else.

      As to your researcher's problem: it may be a real-life problem, but imo psychologists are solving many artificial problems in their real life, and this one seems to be of that sort. I already told you in our email exchange that I don't care much about planning sample sizes. If the estimates are imprecise, you can always collect more data. I can't suggest a more detailed strategy, since the tweet doesn't tell us more about the research question. But in general, if you say that it's exploratory research, this suggests that the author does not have any hypothesis. So what use does he have for hypothesis testing if he doesn't have any hypothesis? Looks like one more artificial real-life problem to me...

  3. This is very helpful, Daniel! I really love the idea of using what we know (i.e., the effect size estimate from the data we already have) and using that to plan further data collection. What an ingenious idea. Thanks so much for taking my question/problem so seriously!