A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, July 25, 2014

Why Psychologists Should Ignore Recommendations to Use α < .001



Recently, some statisticians have argued that we have to lower the widely used p < .05 threshold. David Colquhoun got me thinking about this by posting a manuscript here, but Valen Johnson’s paper in PNAS is probably better known. They both suggest a p < .001 threshold would lower the false discovery rate. The false discovery rate (the probability that an observed significant effect is actually false) is 1 minus the positive predictive value (Ioannidis, 2005; see this earlier post for details).
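To make the relationship between alpha and the false discovery rate concrete, here is a minimal sketch in Python. The 50% prior probability that the effect is true and the 80% power are illustrative assumptions of mine, not values taken from either paper:

```python
# Positive predictive value (PPV) and false discovery rate (FDR) in the
# spirit of Ioannidis (2005). The prior and power below are illustrative
# assumptions, not estimates for any real literature.

def ppv(prior, power, alpha):
    """Probability that a significant result reflects a true effect."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

def fdr(prior, power, alpha):
    """False discovery rate: 1 minus the positive predictive value."""
    return 1 - ppv(prior, power, alpha)

# With a 50% prior and 80% power, lowering alpha shrinks the FDR:
print(round(fdr(0.5, 0.8, 0.05), 4))   # 0.0588 at alpha = .05
print(round(fdr(0.5, 0.8, 0.001), 4))  # 0.0012 at alpha = .001
```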

Using p < .001 works to reduce the false discovery rate in much the same way as lowering the maximum speed to 10 kilometers an hour works to prevent lethal traffic accidents (if people adhere to speed limits). With such a threshold, it is extremely unlikely bad things will happen. It has a strong prevention focus, but ignores a careful cost/benefit analysis of implementing such a threshold. (I’ll leave it up to you to ponder the consequences in the case of car driving – in The Netherlands there were 570 traffic deaths in 2013 [not all of which would have been prevented by lowering the speed limit], and we apparently find this an acceptable price to pay for the benefit of being able to drive faster than 10 kilometers an hour.)

The cost of lowering the threshold for considering a difference as support for a hypothesis (see how hard I’m trying not to say ‘significant’?) is clear: we need larger samples to achieve the same level of power as we would with a p < .05 threshold. Colquhoun doesn’t talk about the consequence of having to increase sample sizes. Interestingly, he mentions power only when stating why power is commonly set to 80%: “Clearly it would be better to have 99% [power] but that would often mean using an unfeasibly large sample size.)”. For a two-sided independent t-test examining an effect expected to be of size d = 0.5 with α = .05, you need 64 participants in each cell for .80 power. To have .99 power with α = .05, you need 148 participants in each cell. To have .80 power with α = .001, you need 140 participants in each cell. So Colquhoun states that .99 power often requires ‘unfeasibly large sample sizes’, only to recommend p < .001, which often requires equally large sample sizes.
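For readers who want to check these numbers themselves, they can be reproduced with the power module of the Python statsmodels library (assuming it is installed; this is my sketch, not code from either paper):

```python
# Per-cell sample sizes for a two-sided independent t-test with d = 0.5,
# reproducing the numbers in the text (rounded up to whole participants).
import math
from statsmodels.stats.power import TTestIndPower

solve = TTestIndPower().solve_power

n_05_80  = solve(effect_size=0.5, alpha=0.05,  power=0.80)  # -> 64 per cell
n_05_99  = solve(effect_size=0.5, alpha=0.05,  power=0.99)  # -> 148 per cell
n_001_80 = solve(effect_size=0.5, alpha=0.001, power=0.80)  # -> 140 per cell

print(math.ceil(n_05_80), math.ceil(n_05_99), math.ceil(n_001_80))
```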

Johnson discusses the required increase in sample sizes when lowering the threshold to p < .001: “To achieve 80% power in detecting a standardized effect size of 0.3 on a normal mean, for instance, decreasing the threshold for significance from 0.05 to 0.005 requires an increase in sample size from 69 to 130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from 112 to 172.”

Perhaps that doesn’t sound so bad, but let me give you some more examples in the table below. As you see, decreasing the threshold from p < .05 to p < .001 requires approximately doubling the sample size.


α        d = .3    d = .3    d = .5    d = .5    d = .8    d = .8
         pwr .80   pwr .90   pwr .80   pwr .90   pwr .80   pwr .90
.05         176       235        64        86        26        34
.001        383       468       140       170        57        69
Ratio      2.18      1.99      2.19      1.98      2.19      2.03
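The near-constant ratio in the bottom row is no coincidence: under the normal approximation, the required sample size per group is roughly 2(z₁₋α/₂ + z₁₋β)²/d², so the effect size cancels when comparing two alpha levels at the same power. A quick sketch using scipy (an approximation, not the exact t-test calculation behind the table):

```python
# Why the ratio is ~2.2 in every column: under the normal approximation,
# n per group ~ 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2, so d cancels
# when comparing two alpha levels at the same power.
from scipy.stats import norm

def approx_n(d, alpha, power):
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-sided test
    z_beta = norm.ppf(power)           # quantile corresponding to power
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

ratio = approx_n(0.5, 0.001, 0.80) / approx_n(0.5, 0.05, 0.80)
print(round(ratio, 2))  # 2.18, matching the bottom row of the table
```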

Now that we have surveyed the benefit (lower false discovery rate) and the cost (doubling sample size for independent t-tests – the costs are much higher when you examine interactions) let’s consider alternatives and other considerations.

The first thing I want to note is how silent these statisticians are about problems associated with Type 2 errors. Why is it really bad to say something is true when it isn’t, but perfectly fine to have 80% power, which means you have a 20% chance of concluding there is nothing while there actually was something? Fiedler, Kutzner, and Krueger (2012) discuss this oversight in the wider discussion of false positives in psychology, although I prefer the discussion of this issue by Cohen (1988) himself. Cohen realized we would be using his minimum 80% power recommendation as a number outside of its context. So let me remind you that his reason for recommending 80% power was that he preferred a 1 to 4 balance between Type 1 and Type 2 errors. If you have a 5% false positive rate and a 20% Type 2 error rate (because of 80% power), this basically means you consider Type 1 errors four times more serious than Type 2 errors. By recommending p < .001 and 80% power, Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a Type 2 error. Cohen would not agree with this at all.
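The implied weighting is just the ratio of the Type 2 error rate to the Type 1 error rate:

```python
# Relative seriousness of Type 1 vs Type 2 errors implied by alpha and power.
beta = 0.20  # Type 2 error rate at 80% power

print(round(beta / 0.05))   # 4: Cohen's intended 1-to-4 balance
print(round(beta / 0.001))  # 200: the balance implied by alpha = .001
```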

The second thing I want to note is that because you need to double the sample size when using α = .001, you might just as well perform two studies with p < .05. If you find a statistical difference from zero in two studies, the false discovery rate goes down substantially (for calculations, see my previous blog post). Doing two studies with p < .05 instead of one study with p < .001 has many benefits. For example, it allows you to generalize over samples and stimuli. This means you are giving the taxpayer more value for their money.
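Under illustrative assumptions (a 50% prior probability that the effect is true and 80% power per study — my numbers, chosen for the example), the gain from requiring two significant studies can be sketched as:

```python
# False discovery rate for one study at p < .05 vs. two independent studies
# that are both significant at p < .05. The 50% prior and 80% power are
# illustrative assumptions, not estimates for any real literature.

def fdr(prior, power, alpha):
    ppv = (power * prior) / (power * prior + alpha * (1 - prior))
    return 1 - ppv

one_study   = fdr(0.5, 0.80, 0.05)        # ~0.059
two_studies = fdr(0.5, 0.80**2, 0.05**2)  # both significant: ~0.004
print(round(one_study, 3), round(two_studies, 3))
```

Requiring two independent significant results squares both the joint false-positive rate and the joint power, which is what drives the false discovery rate down.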

Finally, I don’t think single conclusive studies are the most feasible and efficient way to do science, at least in psychology. This model might work in medicine, where you sometimes really want to be sure a treatment is beneficial. But especially for more exploratory research (currently the default in psychology) an approach where you simply report everything, and perform a meta-analysis over all studies, is a much more feasible approach to scientific progress. Otherwise, what should we do with studies that yield p = .02? Or even p = .08? I assume everyone agrees publication bias is a problem, and if we consider only studies with p < .001 as worthy of publication, publication bias is likely to get much worse.

I think it is often smart to at least slightly lower the alpha level (say to α = .025) because in principle I agree with the main problem Colquhoun and Johnson try to address, that high p-values are not very strong support for your hypothesis (see also Lakens & Evers, 2014). It’s just important that the solution to this problem is realistic, instead of overly simplistic. In general, I don’t think fixed alpha levels are a good idea (instead, you should select and pre-register the alpha level as a function of the sample size, the prior probability the effect is true, and the balance between Type 1 errors and Type 2 errors you can achieve, given the sample size you can collect - more on that in a future blog post). These types of discussions remind me that statistics is an applied science. If you want to make good recommendations, you need to have sufficient experience with a specific field of research, because every field has its own specific challenges.

14 comments:

  1. Nice post. However, I just can't help thinking there is something inconsistent going on here.

    We have the highly cited paper Simmons et al. (2011) claiming that "perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis"

    We have the 'replication movement' making people who published a positive research finding based on p values between 0.01 and 0.05 quite nervous.

    We have Schimmack (2012), discussing "the ironic effect of significant results on the credibility of multiple-study articles"

    And now we suddenly hear that we should go for high alpha values and small studies. How to make sense of all this? You can't push for both low alpha and low beta....

    Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-566.

    Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

    ReplyDelete
    Replies
    1. Hi anonymous (I really dislike anonymous comments, so I'm keeping it very short): there is no inconsistency. Replications can weed out false positives, and if we report everything, as long as the meta-analytic effect size estimate allows you to confirm your hypotheses, we are fine. My solutions might differ from Schimmack's, who prefers huge samples, so take your pick.

      Delete
  2. Yes those calculations are all true.
    I guess the reasoning in terms of costs / benefits is sound as well, that is, if you are a manager.
    They should not matter to a scientific discipline.

    It's like physicists saying: "Wait a minute, we calculated that to evidence the Higgs Boson at 5 sigma we need to build a machine that will cost 7.5 billion € to build in 10 years, require collaboration of over 10,000 scientists and engineers from over 100 countries.
    ....
    Let's build two smaller ones instead so we have 2.5 sigma"

    If the consensus is that we need larger samples, lower alpha levels, maximal power, this is what we should do.

    The taxpayer hired the physicists who built the LHC as well as us social scientists to figure out the best possible answers to fundamental questions given the scientific record, not the most cost-effective answers. That is ultimately for the industry or a government agency to figure out, they are the ones who want to turn the acquired fundamental knowledge into something useful or profitable.

    I am not surprised companies like Google, essentially trading in knowledge about human behaviour and cognition, started their own campuses and conduct their own studies. The current scientific record of social science does not contain consensus theoretical knowledge with sufficient scientific credibility to be exploited to build some useful technology.

    Anyway, I would change the recommendation to: we need more large scale collaborations to be able to establish effect sizes at the highest power and most skeptical evidence levels possible.

    ReplyDelete
    Replies
    1. Hi Fred, sure - if there are things we all agree are really important to know, large scale collaborations are the future of psychological science. I agree. Here, I focus more on exploratory research, as I mention. After some replications and extensions, reliable effects lend themselves to accurate ES estimation.

      Delete
    2. So, in clinical trials they use the concept of phase I (proof of concept), phase II (pilot experiment, with all details in place, enough to demonstrate the effect but not large enough to fully generalize), and phase III (large-scale demonstration of efficacy). The sample sizes vary accordingly: phase I ~50, phase II ~200, phase III 1000-10000. These are rough estimates; numbers will vary with effect sizes and animal/human research (for phase I). I don't know about psychology, but in human neuroimaging, we've basically spent decades publishing only phase I science. Phase II/III was basically left to meta-analysis, either formal or through community consensus, but these attempts are crippled by the lack of methods standardization and data sharing. I think we need the formal equivalent of phases II and III as well, to have solid findings to build upon. But requiring every study to adhere to phase II-III standards is nonsense.

      Delete
  3. This post (and those preceding it) form a terrific online book chapter for teaching these concepts: Clear, succinct, and goal-focussed! Especially like this sentence: "Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a Type 2 error. Cohen would not agree with this at all." That's the question in a nutshell - There is no free lunch...

    ReplyDelete
    Replies
    1. Thanks for the nice comment! Happy to see it's helpful.

      Delete
  4. I agree with your arguments against using p<.001, but I think essentially the same argument can be applied to p<.05. Standard hypothesis testing allows control of Type I error for a single experiment; but things get complicated when one considers replications and so forth. A different perspective is to give up on controlling Type I errors and instead focus on evidence for different models (Bayes Factors) or on Type S (sign) and M (magnitude) errors. For the latter see http://www.stat.columbia.edu/~gelman/research/published/francis8.pdf

    ReplyDelete
    Replies
    1. Hi Greg, yes, interpreting results as sign errors and magnitude errors are an interesting approach. I don't think any existing approach is mutually exclusive to any other, so they can all be used to evaluate results. Depending on the research question (more exploratory, or testing predictions from a formal model) some may be more useful than others.

      Delete
  5. Well, speak for yourself. Personally I aim to be right rather more than 30% of the time. I fear that with attitudes expressed here it isn't surprising that there is a crisis of believability in psychology. Many people's response to the latest psychology headlines is "oh yeah?". That is not a good thing.

    ReplyDelete
  6. The broad contours of this point came up in commentaries on Johnson's PNAS article.
    http://www.stat.columbia.edu/~gelman/research/published/Val_Johnson_Letter.pdf

    http://xianblog.wordpress.com/2014/03/25/adaptive-revised-standards-for-statistical-evidence-guest-post/



    ReplyDelete
  7. Your argument seems to be that it doesn't matter much if people publish results that aren't true because someone else will sort it out later.
    I don't think that most people will be very impressed by this.

    ReplyDelete
  8. Another really sensible post that I only saw now.

    ReplyDelete
  9. Another crew with some pretty big science reform heavy hitters (Nosek, Wagenmakers, Ioannidis, Vazire, and many others) is now recommending .005 for most exploratory analyses. https://osf.io/preprints/psyarxiv/mky9j/

    I saw that you tweeted on this, but this post focuses exclusively on .001. I am returning to the .005 proposal in this comment.

    This is obviously considerably less severe of a tradeoff than .001. Also, in the U.S., your driving metaphor makes things ... interesting. People routinely drive 5-10mph over the speed limit. And the cops will rarely ticket you unless you are at least 10mph over. So, if you want to keep people's speeds, say, under 35mph, you would post a speed limit of 25mph.

    Metaphorically, then, given the evidence that scientific "speeding" (or, to use Schimmack's term, "doping") is pretty widespread, in the form of p-hacking, the garden of forking paths, and suboptimal statistics, one could plausibly argue that a lower "speed limit" (a lower p-value, .005 being proposed by this new crew) is necessary to keep the "true" speed down to the more reasonable .01 or .05.

    This is an argument from scientific human behavior, not stats or methods per se; or, at least, it is at the intersection of the psychology of scientific behavior and stats/methods. It is surely useful to know what is the ideal stat solution, if any, but proposing solutions that also address the frailty of scientists' behaviors may not be ridiculous and just might be even more productive.

    You are more stat-oriented than I am, so I am guessing you will stick to the stats. I am more of a social psychologist than a stats guy (I teach grad stats, have pubbed some fairly sophisticated analyses, latent variable modeling, bayesian analysis, etc., and have a paper or two on how data is routinely misinterpreted, but I am not a hardcore stats guy). Anyway, on balance, it seems to me that reform that attempts to address the psychology of stats use and interpretation has some merit.

    What do you think?

    (I think I signed up as "I'dratherbeplayingtennis," but I played already today, so)...

    Best,

    Lee Jussim

    ReplyDelete