A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, September 20, 2015

How can p = 0.05 lead to wrong conclusions 30% of the time with a 5% Type 1 error rate?

David Colquhoun (2014) recently wrote “If you use p = 0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.” At the same time, you might have learned that if you set your alpha at 5%, the Type 1 error rate (or false positive rate) will not be higher than 5%. How are these two statements related?

First of all, the statement by David Colquhoun is obviously incorrect – peer reviewers nowadays are not what they never were – but we can correct his sentence by changing ‘will’ into ‘might, under specific circumstances, very well be’. After all, if you would only examine true effects, you could never be wrong when you suggested, based on p = 0.05, that you made a discovery.

The probability that a claim of a true effect, based on a single study, is correct depends on the percentage of studies you do where there is an effect (H1 is true) versus no effect (H0 is true), the statistical power, and the alpha level. The false discovery rate is the percentage of positive results that are false positives (not the percentage of all studies that are false positives). If you perform 200 tests with 80% power, and 50% (i.e., 100) of the tests examine a true effect, you’ll find 80 true positives (0.8*100), but in the 50% of the tests that do not examine a true effect, you’ll find 5 false positives (0.05*100). For the 85 positive results (80 + 5), the false discovery rate is 5/85=0.0588, or approximately 6% (see the Figure below, from Lakens & Evers, 2014, for a visualization).
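
The arithmetic in this example can be reproduced in a few lines. This is only a minimal sketch: the function name `expected_outcomes` and its parameter names are made up for illustration, and it works with long-run expected counts rather than simulated studies.

```python
def expected_outcomes(n_tests, prevalence, power, alpha):
    """Expected true positives, false positives, and false discovery rate.

    prevalence is the assumed proportion of tests where H1 is true.
    """
    n_true = n_tests * prevalence   # tests that examine a true effect
    n_null = n_tests - n_true       # tests where H0 is true
    true_pos = power * n_true       # significant results when H1 is true
    false_pos = alpha * n_null      # significant results when H0 is true
    fdr = false_pos / (true_pos + false_pos)
    return true_pos, false_pos, fdr


# The scenario from the text: 200 tests, 50% true effects, 80% power, alpha = 5%.
tp, fp, fdr = expected_outcomes(200, prevalence=0.5, power=0.8, alpha=0.05)
print(f"true positives: {tp:.0f}, false positives: {fp:.0f}, FDR: {fdr:.4f}")
```

Changing `prevalence` shows how strongly the false discovery rate depends on how many of the tested hypotheses are true, which is the quantity nobody knows in practice.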

At the same time, the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors. This is also true in the Figure above. Of 200 studies, at most 0.05*200 = 10 will be false positives. This happens only when H0 is true for all 200 studies. In our situation, only 5 studies (2.5% of all studies) are Type 1 errors, which is indeed less than 5% of all the studies we’ve performed.
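
The two quantities can be written side by side. The notation here is an assumption introduced for illustration and is not used in the post itself: $\pi$ is the proportion of studies where H1 is true and $1-\beta$ is the power.

$$
\begin{aligned}
P(\text{Type 1 error}) &= \alpha\,(1-\pi) \le \alpha, \\
\text{FDR} &= \frac{\alpha\,(1-\pi)}{\alpha\,(1-\pi) + (1-\beta)\,\pi}.
\end{aligned}
$$

With $\alpha = 0.05$, $\pi = 0.5$ and $1-\beta = 0.8$, the first line gives $0.025$ (the 2.5% of all studies) and the second gives $0.025/0.425 \approx 0.0588$. The bound in the first line is reached only when $\pi = 0$, that is, when H0 is true in every study.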

So what’s the problem? The problem is that you should not try to translate your Type 1 error rate into the evidential value of a single study. If you want to make a statement about a single p < 0.05 study representing a true effect, there is no way to quantify this without knowing the power in the studies where H1 is true, and the percentage of studies where H1 is true. P-values and evidential value are not completely unrelated, in the long run, but a single study won’t tell you a lot – especially when you investigate counterintuitive findings that are unlikely to be true.

So what should you do? The solution is to never say you’ve made a discovery based on a single p-value. This will not just make statisticians, but also philosophers of science, very happy. And instead of making a fool out of yourself perhaps as often as 30% of the time, you won't make a fool out of yourself at all.

A statistically significant difference might be ‘in line with’ predictions from a theory. After all, your theory predicts data patterns, and the p-value tells you the probability of observing data (or more extreme data), assuming the null hypothesis is true. ‘In line with’ is a nice way to talk about your results. It is not a quantifiable statement about your hypothesis (that would be silly, based on a p-value!), but it is a fair statement about your data.

P-values are important tools because they allow you to control error rates. Not the false positive discovery rate, but the false positive rate. If you do 200 studies in your life, and you control your error rates, you won't say that there is an effect, when there is no effect, more than 10 times (on average). That’s pretty sweet. Obviously, there are also Type 2 errors to take into account, which is why you should design high-powered studies, but that’s a different story.

Some people recommend lowering p-value thresholds to as low as 0.001 before you announce a ‘discovery’ (I've already explained why we should ignore this), and others say we should get rid of p-values altogether. But I think we should get rid of ‘discovery’, and use p-values to control our error rates.

It’s difficult to know, for any single dataset, whether a significant effect is indicative of a true hypothesis. With Bayesian statistics, you can convince everyone who has the same priors. Or, you can collect such a huge amount of data, that you can convince almost everyone (irrespective of their priors). But perhaps we should not try to get too much out of single studies. It might just be the case that, as long as we share all our results, a bunch of close replications extended with pre-registered novel predictions of a pattern in the data will be more useful for cumulative science than quantifying the likelihood a single study provides support for a hypothesis. And if you agree we need multiple studies, you'd better control your Type 1 errors in the long run.


  1. You wrote: " there is no way to quantify this without knowing the power in the studies where H1 is true, and the percentage of studies where H1 is true"

    How will you compute power for the case where H1 is true in your definition of H1?
    Under your definition, H1 is that $\mu \neq 0$. Power is not a single number here for $\mu\neq 0$; you would have to commit to a specific value of $\mu$. Power is best seen as a function, with different specific values for $\mu$. So the above sentence is not really correct.

    1. Yes, to calculate power, you need to know the true effect size, which is unknown. This post is about the difficulties in determining the evidential value of a single study - the difficulty of knowing the power fits with that message. My point is: Perhaps we should care more about error control, and less about determining the evidential value of single studies.

  2. Very sensible post. As far as I can tell, David Colquhoun's argument, and his overstatement, rests on the assumption of very low base rates (e.g. he uses the example of 10% of tested hypotheses being true). That will certainly be a valid assumption in some cases but it is completely preposterous in others. There must be a pretty enormous variance in base rates across different hypotheses, between different fields and even within fields. I can certainly see how testing numerous chemical compounds or thousands of genes or an exploratory fMRI analysis with thousands of voxels will have inflated false discovery rates (and you're supposed to correct for multiple comparisons in that case - while this doesn't take into account the base rate it will already help the situation). But not all science operates that way. A lot of hypotheses are being tested because researchers have good reasons to expect that they could be true, either because it follows from previous literature or from theoretical models. The best studies contrast different hypotheses that both have some footing in theory and the outcome can adjudicate between them. In the perfect scenario (one was discussed by that Firestone & Scholl review recently as the El Greco Fallacy) you could even use a significant finding to disprove a hypothesis. In many such situations surely the probability that the H1 you're testing is true should be better than a coin toss.

    Anyway, you say we should get rid of 'discovery'. Do you think you can? I am not sure that it isn't simply human nature to interpret it this way. If you find something that reaches some criterion of evidence (however loosely defined) most people will inevitably be led to treat it as a discovery or important result etc. This isn't the fault of p-values either, you'll get the same with Bayes factors or whatever other statistical approach. Perhaps this emotional reaction can be counteracted by education and political changes to how research works but I am not sure it can.

    1. "As far as I can tell, David Colquhoun's argument, and his overstatement, rests on the assumption of very low base rates (e.g. he uses the example of 10% of tested hypotheses being true)."

      It is based on nothing of the sort!

    2. If you are accusing me of overstatement, I think we deserve a proper argument about why!

    3. Okay fine, let me rephrase. In your very comment on this blog today you again repeat the argument:

      "This prevalence (prior) is not known, but it’s not reasonable to assume any value greater than 0.5."

      I don't believe it's tenable. It may be tenable for certain questions such as drug testing (not sure about that actually, but you know that far better than I so I accept it). I also have a strong suspicion that for social priming (I wasn't supposed to use that term anymore but nobody told me what to say instead so I will use it) the base rate is probably approaching 0 (of course we're now also getting into questions of effect size etc but let's not complicate matters). But I think a blanket statement that all scientific hypotheses have *at best* the same chance of being true as a coin toss is almost certainly wrong.

    4. Well, as I also said in my response here,

      "This prevalence (prior) is not known, but it’s not reasonable to assume any value greater than 0.5. To do so would amount to saying to the journal editor that I have made a discovery and my evidence for that claim is based on the assumption that I was almost sure that there was an effect before I did the experiment. I have never seen anyone advance such an argument in a paper, and to do so would invite derision. "

      In the absence of strong empirical evidence that your hypothesis is more likely than not to be true, I wish you luck with referees if you use that assumption in a paper.

    5. I honestly simply can't follow this logic. Saying that p(H1)>0.5 is not saying that you're "almost sure that there was an effect". Let's say the probability was 0.6-0.7, which is not unreasonable in my mind. It also matches the results of the (obviously subjective) survey of what researchers believe about many hypotheses (I can't recall where I saw that otherwise I'd show you - but as it's subjective I don't think it's all that relevant). I would call a probability of 0.7 "a good hunch" not being "almost sure".

      As I said to you on Twitter, I may write a blog about this some time. I think I can go through a few papers in my field and count the numbers of significance tests in which researchers could probably have a fairly good hunch that the hypothesis could be true. The outcome of this exercise would also interest me so I may do it but I'm afraid I can't do it now.

      More generally speaking though, I do think good hunches are normal when hypotheses are grounded in reasonably solid theory or at least based on prior research. A lot of science actually does this because it is inherently incremental and building on past findings. Obviously not all does - nor should it. I am fully willing to accept that the base rate for stabbing in the dark is lower than 50%.

    6. Your figure of 70% for the proportion of cases in which the hypothesis proved to be true could have come from Loiselle & Ramachandran's comment on my paper: http://rsos.royalsocietypublishing.org/content/2/8/150217

      However as I pointed out in my response ( http://rsos.royalsocietypublishing.org/content/2/8/150319 ), this number was arrived at by asking some colleagues how often their guesses were right. Their argument is, therefore, entirely circular, and so has little value.

      If one were to rely on the much better-documented 36% replication rate found in the recent reproducibility project ( http://www.sciencemag.org/content/349/6251/aac4716 ) then the false discovery rate after observing P = 0.047 would be 39% rather than 26% (for power = 0.8).

    7. True and this is why I said it's subjective and probably not very helpful (I'm not sure it is this one - I am pretty sure I didn't see it on your blog but was referred to it from somewhere else - but it sounds very similar). As I said a more empirical way of going about this would be useful. It isn't a trivial thing in any case. The reproducibility project you keep mentioning (as if nobody had heard about it) is one set of results from two subfields of psychology. That's just my point. It will vary between fields, subfields, and research questions. Anyway, I am not discussing this any more as this won't be fruitful until we actually have more quantitative evidence to discuss.

  3. You say that

    “First of all, the statement by David Colquhoun is obviously incorrect – peer reviewers nowadays are not what they never were”

    That is quite an accusation, especially since you don’t say what’s wrong with it.
    Royal Society Open Science has open peer review so you can read the reports of the referees. They might be interested in your accusation too.

    In fact you have not addressed at all the question that I asked, which was “if you observe P = 0.047 in a perfect single experiment, and claim that there is a real effect, what is the probability that you make a fool of yourself by claiming a discovery when there is none?”

    This, of course, depends on the power of the test and on the prevalence of true effects (the probability that there is a real effect before the experiment was done).

    This prevalence (prior) is not known, but it’s not reasonable to assume any value greater than 0.5. To do so would amount to saying to the journal editor that I have made a discovery and my evidence for that claim is based on the assumption that I was almost sure that there was an effect before I did the experiment. I have never seen anyone advance such an argument in a paper, and to do so would invite derision.

    If the prevalence is 0.5, the chance of making a fool of yourself is AT LEAST 26% (rounded to 30% in my strap line). If the prevalence is lower than 0.5, the false discovery rate will be much higher (e.g. it is at least 76% for a prevalence of 0.1). Your figure of 6% is based on what happens when we look at all P values equal to or less than 0.05. This does not answer my question. In order to get the answer one has to look not at all tests that give P < 0.05, but only at those tests that give what we observed: P = 0.047. This is explained in section 10 of my paper http://rsos.royalsocietypublishing.org/content/1/3/140216#sec-10

    That can be done algebraically, but I do it by simulation, which necessitates looking only at those tests which give P close to 0.047 (I used P between 0.045 and 0.05). When this is done, it’s found that the false discovery rate is not 6%, but 26%.

    While it is true that “the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors”, it is irrelevant because type 1 errors don’t answer the question of how often your discovery is false.

    Your argument seems to be that it doesn't matter much if people publish results that aren't true because someone else will sort it out later.

    I don't think that most people will be very impressed by this.

    The recent replication study shows that a majority of results can't be replicated. If I were a psychologist, I would be very worried indeed by that. It represents a colossal waste of research funds. The use of P < 0.05 must take some of the blame for this sad state of affairs.

    One result of this is that every new psychology study that appears in the news is greeted with yawns and "Oh yeah?". Using P=0.05 may get you lots of papers, but it damages science. Until people realise how little evidence is provided by marginal P values, this will continue.

    I guess one reason is the great pressure that's placed on academics to publish before the work is ready. Even that is, in a sense, a statistical problem. It's a problem that results from the statistical illiteracy of senior academics who rely on crude metrics and who care about quantity more than quality.

    1. David, I understand the stats you calculate (and published a similar, but more nuanced, explanation of them in Lakens & Evers, 2014). My main points are that I don't want to know the false discovery rate, as long as I control my Type 1 error rate. Nowhere do I suggest 'others' should sort out what's true (see Nosek & Lakens, 2014, for our special issue full of registered replications, or Evers & Lakens, 2014, for a registered replication, or Koole & Lakens, 2012, for how to reward replications, or Zhang, Lakens, & IJsselsteijn, 2015, for how we replicate our own work in published papers).

      If you want to try to guess the evidential value of single studies, be my guest. I prefer a multi-study approach while controlling Type 1 errors (and having no file-drawer). Obviously, when we do good science, you should only observe very few p-values just below 0.05. See http://daniellakens.blogspot.nl/2015/05/after-how-many-p-values-between-0025.html. But when we start to publish everything, and replicate and extend, and show we can predict, judging the evidential value of single p-values is not so important - we can use meta-analysis. I agree with you that the current *use* of p-values is problematic. But I am making a more basic point here about what researchers should aim for in single studies: error control, not evidential value.

    2. And of course all this is also eminently related to the tweet conversation the two of us had a few days ago that I generated this figure for: https://twitter.com/sampendu/status/642237534332940288

      If I had time I would try to figure out if this actually does depend on the effect size. Since I am timeless (not really right but hey) I will just wait for somebody smarter to tell me instead.

    3. As far as I can see, the main results are more-or-less independent of the effect size, as long as you adjust the sample size to keep the power constant.

    4. I tried to do just that in the simulation but there is a chance that this wasn't right. The x-axis is effectively the empirically observed power rather than theoretical power. Not sure if this makes a difference. One fine day I may try this more properly.

    5. Well, if you do it by simulation, you know the true power. In my paper (section 10) I found that the false discovery rate was insensitive to power.

  4. That answer seems to confirm the impression that you aren't worried by the fact that a majority of published results are wrong, on the basis that someone else will sort it out eventually.
    That seems to me to be irresponsible.

    1. From the last paragraph of the original article, I think the author is suggesting 'we' move away from the viewpoint that everything that is published must be true. The very nature of these studies implies that there is always a chance that you are not in fact observing a true phenomenon. Rather than trying to minimize the chance of this happening, we need to be more careful about how we interpret published results: interpreting them as what they are, a probabilistic statement about a complicated issue, and not as some boolean result.

    2. It's always probabilistic. The question to be answered is what level of (im)probability are you willing to accept. Lakens seems quite happy to accept a chance of at least 30% that you are publishing nonsense, I'm not.
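
Much of the disagreement in the comments above comes down to which set of results you condition on: all tests with p < 0.05, or only tests with p close to 0.047. The sketch below illustrates the second calculation algebraically rather than by simulation. It rests on assumptions that are not in the thread: a two-sided z-test instead of the t-tests used in the cited paper, an effect size chosen to give roughly 80% power, and a hypothetical function name `fdr_near_p`. Because of these simplifications the results should differ by a few percentage points from the 26%, 39% and 76% quoted above.

```python
from statistics import NormalDist


def fdr_near_p(prevalence, power=0.8, alpha=0.05, p_lo=0.045, p_hi=0.05):
    """False discovery rate among tests whose p-value falls in [p_lo, p_hi].

    Assumes a two-sided z-test (an illustrative simplification).
    """
    std = NormalDist()
    # Mean of the z statistic under H1 that gives approximately the requested power.
    shift = std.inv_cdf(1 - alpha / 2) + std.inv_cdf(power)
    # |z| values that correspond to the edges of the p-value window.
    z_lo = std.inv_cdf(1 - p_hi / 2)
    z_hi = std.inv_cdf(1 - p_lo / 2)

    # Under H0 the p-value is uniform, so the window has probability p_hi - p_lo.
    window_h0 = p_hi - p_lo
    # Under H1 the z statistic is normal around `shift`; add both tails.
    alt = NormalDist(mu=shift, sigma=1)
    window_h1 = (alt.cdf(z_hi) - alt.cdf(z_lo)) + (alt.cdf(-z_lo) - alt.cdf(-z_hi))

    false_pos = (1 - prevalence) * window_h0
    true_pos = prevalence * window_h1
    return false_pos / (false_pos + true_pos)


for prevalence in (0.5, 0.36, 0.1):
    print(f"prevalence {prevalence}: FDR near p = 0.047 is {fdr_near_p(prevalence):.2f}")
```

Conditioning on the narrow window rather than on everything below 0.05 is what moves the answer from about 6% to something several times larger at a prevalence of 0.5. Whether that quantity or the long-run Type 1 error rate is the one to care about is exactly what the post and the comments disagree on.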