Recently, some statisticians have argued that we should lower the
widely used p < .05 threshold. David Colquhoun got me
thinking about this by posting a manuscript here, but Valen Johnson’s paper in PNAS is
probably better known. They both suggest that a p
< .001 threshold would lower the false discovery rate. The false discovery rate
(the probability that an observed
significant effect is actually false) is 1 minus the positive predictive value (Ioannidis, 2005; see this
earlier post for details).
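To make that relationship concrete, here is a minimal sketch of the calculation. The 50% prior probability that the effect is true and the 80% power are illustrative assumptions of mine, not numbers taken from either paper:

```python
# False discovery rate as 1 minus the positive predictive value (Ioannidis, 2005).
# The prior and power used here are illustrative assumptions.
alpha, power, prior = 0.05, 0.80, 0.5   # significance threshold, statistical power, P(effect is true)

ppv = (prior * power) / (prior * power + (1 - prior) * alpha)  # positive predictive value
fdr = 1 - ppv                                                  # false discovery rate
print(f"PPV = {ppv:.3f}, FDR = {fdr:.3f}")                     # PPV ≈ 0.941, FDR ≈ 0.059
```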
Using p < .001
works to reduce the false discovery rate in much the same way as lowering the
maximum speed to 10 kilometers an hour works to prevent lethal traffic
accidents (assuming people adhere to the speed limit). With such a threshold, it is extremely unlikely bad things will happen. It has a strong prevention
focus, but ignores a careful cost/benefit analysis of implementing such a
threshold. (I’ll leave it up to you to ponder the consequences in the case of
car driving – in The Netherlands there were 570 traffic deaths in 2013 [not
all of which would have been prevented by lowering the speed limit], and we apparently
find this an acceptable price to pay for the benefit of being able to drive faster than 10 kilometers an hour).
The cost of lowering the threshold for considering a difference
support for a hypothesis (see how hard I’m trying not to say ‘significant’?)
is clear: we need larger samples to achieve the same level of power as we would
with a p < .05 threshold. Colquhoun
doesn’t talk about the consequence of having to increase sample sizes. Interestingly,
he mentions power only when stating why power is commonly set to 80%: “Clearly
it would be better to have 99% [power] but
that would often mean using an unfeasibly large sample size.)”. For an
independent two-sided t-test
examining an effect expected to be of size d = 0.5 with an α
= .05, you need 64 participants in each cell for .80 power. To have .99 power,
with α = .05, you need
148 participants in each cell. To have .80
power with α = .001 you need 140 participants in each cell.
So, Colquhoun states that .99 power often
requires ‘unfeasibly large sample sizes’, only to recommend p < .001, which often requires equally
large sample sizes.
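If you want to check these numbers yourself, a quick sketch using the statsmodels power module (an assumed tool here; G*Power will do the same job) reproduces them, give or take rounding:

```python
# A sketch of the sample size calculations above, assuming the statsmodels package.
# Results may differ from the numbers in the text by a participant due to rounding.
from statsmodels.stats.power import TTestIndPower

tt = TTestIndPower()
scenarios = [
    {"alpha": 0.05,  "power": 0.80},   # text: 64 per cell
    {"alpha": 0.05,  "power": 0.99},   # text: 148 per cell
    {"alpha": 0.001, "power": 0.80},   # text: 140 per cell
]
for s in scenarios:
    n = tt.solve_power(effect_size=0.5, alternative="two-sided", **s)
    print(f"alpha = {s['alpha']}, power = {s['power']}: n = {n:.0f} per cell")
```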
Johnson discusses the required increase in sample sizes when
lowering the threshold (his example uses p < .005): “To achieve 80% power in detecting a standardized
effect size of 0.3 on a normal mean, for instance, decreasing the threshold for
significance from 0.05 to 0.005 requires an increase in sample size from 69 to
130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from
112 to 172.”
Perhaps that doesn’t sound so bad, but let me give you some
more examples in the table below. As you see, decreasing the
threshold from p < .05 to p < .001 requires approximately doubling the sample
size.
| α | d = .3, power .80 | d = .3, power .90 | d = .5, power .80 | d = .5, power .90 | d = .8, power .80 | d = .8, power .90 |
|---|---|---|---|---|---|---|
| .05 | 176 | 235 | 64 | 86 | 26 | 34 |
| .001 | 383 | 468 | 140 | 170 | 57 | 69 |
| Ratio | 2.18 | 1.99 | 2.19 | 1.98 | 2.19 | 2.03 |
Now that we have surveyed the benefit (a lower false discovery
rate) and the cost (a doubled sample size for independent t-tests – the costs are much higher when you examine interactions),
let’s consider alternatives and other considerations.
The first thing I want to note is how silent these
statisticians are about problems associated with Type 2 errors. Why is it
really bad to say something is true when it isn’t, but perfectly fine to
have 80% power, which means you have a 20% chance of concluding there is
nothing while there actually was something? Fiedler, Kutzner, and
Krueger (2012) discuss this oversight in the wider discussion of false positives in psychology, although I
prefer the discussion of this issue by Cohen (1988) himself. Cohen realized we
would end up using his minimum 80% power recommendation as a number divorced from its
context. So let me remind you that his reason for recommending 80% power was that he preferred a 1 to 4
balance between Type 1 and Type 2 errors. If you have a 5% false positive rate,
and a 20% Type 2 error rate (because of 80% power) this basically means you
consider Type 1 errors four times more serious than Type 2 errors. By
recommending p < .001 and 80%
power, Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a
Type 2 error. Cohen would not agree with this at all.
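The arithmetic behind that 200 is simply the ratio of the Type 2 to the Type 1 error rate. A small sketch of this reading of Cohen's logic (it is my rendering, not a formula from either paper):

```python
# Implied relative seriousness of Type 1 vs Type 2 errors, read off as beta / alpha.
beta = 0.20  # Type 2 error rate implied by 80% power
for alpha in (0.05, 0.001):
    ratio = beta / alpha
    print(f"alpha = {alpha}: a Type 1 error is treated as {ratio:.0f} times as bad as a Type 2 error")
```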
The second thing I want to note is that because you need to
double the sample size when using α
= .001, you might just as well perform two studies with p < .05. If you find a statistical difference from zero in two
studies, the false discovery rate has gone down substantially (for calculations,
see my
previous blog post). Doing two studies with p < .05 instead of one study with p < .001 has many benefits. For example, it allows you to
generalize over samples and stimuli. This means you are giving the taxpayer
more value for their money.
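As a rough sketch of that calculation (assuming two independent studies with 80% power each and an illustrative 50% prior that the effect is true; the detailed numbers are in the earlier post linked above):

```python
# False discovery rate after significant results, assuming independent studies,
# 80% power per study, and an illustrative 50% prior that the effect is true.
def fdr(p_sig_if_false, p_sig_if_true, prior):
    """P(effect is false | the observed significant result or results)."""
    false_pos = (1 - prior) * p_sig_if_false
    true_pos = prior * p_sig_if_true
    return false_pos / (false_pos + true_pos)

prior, power, alpha = 0.5, 0.80, 0.05

single_05  = fdr(alpha, power, prior)          # one study at p < .05:  ~5.9%
single_001 = fdr(0.001, power, prior)          # one study at p < .001: ~0.1%
double_05  = fdr(alpha**2, power**2, prior)    # two studies both at p < .05: ~0.4%
print(f"one study, p < .05:   FDR = {single_05:.2%}")
print(f"one study, p < .001:  FDR = {single_001:.2%}")
print(f"two studies, p < .05: FDR = {double_05:.2%}")
```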
Finally, I don’t think single conclusive studies are the
most feasible and efficient way to do science, at least in psychology. This
model might work in medicine, where you sometimes really want to be sure a
treatment is beneficial. But especially for more exploratory research (currently
the default in psychology), an approach where you simply report everything and
perform a meta-analysis over all studies is a much more efficient route to
scientific progress. Otherwise, what should we do with studies that yield p = .02? Or even p = .08? I assume everyone agrees publication bias is a problem, and if we consider only studies with p < .001 as worthy of publication, publication bias is likely to get much worse.
I think it is often smart to at least slightly lower the
alpha level (say to α =
.025) because in principle I agree with the main problem Colquhoun and Johnson
try to address, that high
p-values are not very strong support for
your hypothesis (see also Lakens & Evers, 2014). It’s just important that the solution to this problem is
realistic, instead of overly simplistic. In general, I don’t think fixed alpha levels are a
good idea (instead, you should select and pre-register the alpha level as a
function of the sample size, the prior probability the effect is true, and the
balance between Type 1 errors and Type 2 errors you can achieve, given the sample
size you can collect - more on that in a future blog post). These types of discussions remind me that statistics is
an applied science. If you want to make good recommendations, you need to have
sufficient experience with a specific field of research, because every field
has its own specific challenges.
Nice post. However, I just can't help thinking there is some inconsistent thing going on here.
We have the highly cited paper by Simmons et al. (2011) claiming that "perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis".
We have the 'replication movement' making people who published a positive research finding based on p values between 0.01 and 0.05 quite nervous.
We have Schimmack (2012), discussing "the ironic effect of significant results on the credibility of multiple-study articles".
And now we suddenly hear that we should go for high alpha values and small studies. How to make sense of all this? You can't push for both low alpha and low beta....
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-566.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.
Hi anonymous (I really dislike anonymous comments, so I'm keeping it very short): there is no inconsistency. Replications can weed out false positives, and if we report everything, as long as the meta-analytic effect size estimate allows you to confirm your hypotheses, we are fine. My solution might differ from Schimmack's, who prefers huge samples, so take your pick.
Yes those calculations are all true.
I guess the reasoning in terms of costs / benefits is sound as well, that is, if you are a manager.
They should not matter to a scientific discipline.
It's like physicists saying: "Wait a minute, we calculated that to find evidence for the Higgs boson at 5 sigma we need to build a machine that will cost €7.5 billion, take 10 years to build, and require the collaboration of over 10,000 scientists and engineers from over 100 countries.
....
Let's build two smaller ones instead so we have 2.5 sigma"
If the consensus is that we need larger samples, lower alpha levels, maximal power, this is what we should do.
The taxpayer hired the physicists who built the LHC as well as us social scientists to figure out the best possible answers to fundamental questions given the scientific record, not the most cost-effective answers. That is ultimately for the industry or a government agency to figure out, they are the ones who want to turn the acquired fundamental knowledge into something useful or profitable.
I am not surprised companies like Google, essentially trading in knowledge about human behaviour and cognition, started their own campuses and conduct their own studies. The current scientific record of social science does not contain consensus theoretical knowledge with sufficient scientific credibility to be exploited to build some useful technology.
Anyway, I would change the recommendation to: we need more large scale collaborations to be able to establish effect sizes at the highest power and most skeptical evidence levels possible.
Hi Fred, sure - if there are things we all agree are really important to know, large scale collaborations are the future of psychological science. I agree. Here, I focus more on exploratory research, as I mention. After some replications and extensions, reliable effects lend themselves to accurate ES estimation.
So, in clinical trials they use the concept of phase I (proof of concept), phase II (pilot experiment, with all details in place, enough to demonstrate the effect but not large enough to fully generalize), and phase III (large-scale demonstration of efficacy). The sample sizes vary accordingly: phase I ~50, phase II ~200, phase III 1000-10000. These are bulk estimates; numbers will vary with effect sizes and animal/human research (for phase I). I don't know about psychology, but in human neuroimaging we've basically spent decades publishing only phase I science. Phase II/III was basically left to meta-analysis, either formal or through community consensus, but these attempts are crippled by the lack of methods standardization and data sharing. I think we need the formal equivalent of phases II and III as well, to have solid findings to build upon. But requiring every study to adhere to phase II-III standards is nonsense.
This post (and those preceding it) forms a terrific online book chapter for teaching these concepts: clear, succinct, and goal-focussed! I especially like this sentence: "Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a Type 2 error. Cohen would not agree with this at all." That's the question in a nutshell - there is no free lunch...
Thanks for the nice comment! Happy to see it's helpful.
I agree with your arguments against using p<.001, but I think essentially the same argument can be applied to p<.05. Standard hypothesis testing allows control of Type I error for a single experiment; but things get complicated when one considers replications and so forth. A different perspective is to give up on controlling Type I errors and instead focus on evidence for different models (Bayes Factors) or on Type S (sign) and M (magnitude) errors. For the latter see http://www.stat.columbia.edu/~gelman/research/published/francis8.pdf
Hi Greg, yes, interpreting results in terms of sign errors and magnitude errors is an interesting approach. I don't think any existing approach is mutually exclusive with any other, so they can all be used to evaluate results. Depending on the research question (more exploratory, or testing predictions from a formal model), some may be more useful than others.
Well, speak for yourself. Personally I aim to be right rather more than 30% of the time. I fear that with the attitudes expressed here it isn't surprising that there is a crisis of believability in psychology. Many people's response to the latest psychology headlines is "oh yeah?". That is not a good thing.
Your argument seems to be that it doesn't matter much if people publish results that aren't true because someone else will sort it out later.
I don't think that most people will be very impressed by this.
Another really sensible post that I only saw now.
Another crew with some pretty big science reform heavy hitters (Nosek, Wagenmakers, Ioannidis, Vazire, and many others) is now recommending .005 for most exploratory analyses. https://osf.io/preprints/psyarxiv/mky9j/
I saw that you tweeted on this, but this post focuses exclusively on .001. I am returning to the .005 proposal in this comment.
This is obviously considerably less severe of a tradeoff than .001. Also, in the U.S., your driving metaphor makes things ... interesting. People routinely drive 5-10mph over the speed limit. And the cops will rarely ticket you unless you are at least 10mph over. So, if you want to keep people's speeds, say, under 35mph, you would post a speed limit of 25mph.
Metaphorically, then, given the evidence that scientific "speeding" (or, to use Schimmack's term, "doping") is pretty widespread – p-hacking, the garden of forking paths, suboptimal statistics – one could plausibly argue that a lower "speed limit" (a lower p-value, with .005 being proposed by this new crew) is necessary to keep the "true" speed down to the more reasonable .01 or .05.
This is an argument from scientific human behavior, not stats or methods per se; or, at least, it is at the intersection of the psychology of scientific behavior and stats/methods. It is surely useful to know what is the ideal stat solution, if any, but proposing solutions that also address the frailty of scientists' behaviors may not be ridiculous and just might be even more productive.
You are more stat-oriented than I am, so I am guessing you will stick to the stats. I am more of a social psychologist than a stats guy (I teach grad stats, have pubbed some fairly sophisticated analyses, latent variable modeling, Bayesian analysis, etc., and have a paper or two on how data is routinely misinterpreted, but I am not a hardcore stats guy). Anyway, on balance, it seems to me that reform that attempts to address the psychology of stats use and interpretation has some merit.
What do you think?
(I think I signed up as "I'dratherbeplayingtennis," but I played already today, so)...
Best,
Lee Jussim
"It’s just important that the solution to this problem is realistic, instead of overly simplistic. In general, I don’t think fixed alpha levels are a good idea (instead, you should select and pre-register the alpha level as a function of the sample size, the prior probability the effect is true, and the balance between Type 1 errors and Type 2 errors you can achieve, given the sample size you can collect - more on that in a future blog post)."
How realistic do you think your own proposal is? Changing the threshold is easy to accomplish.