The 20% Statistician: How can p = 0.05 lead to wrong conclusions 30% of the time with a 5% Type 1 error rate?

Sunday, September 20, 2015

David Colquhoun (2014) recently wrote “If you use p = 0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.” At the same time, you might have learned that if you set your alpha at 5%, the Type 1 error rate (or false positive rate) will not be higher than 5%. How are these two statements related?

First of all, the statement by David Colquhoun is obviously incorrect – peer reviewers nowadays are not what they never were – but we can correct his sentence by changing ‘will’ into ‘might, under specific circumstances, very well be’. After all, if you only examined true effects, you could never be wrong when you suggested, based on p = 0.05, that you had made a discovery.

The probability that a claim of a true effect, based on a single study, is correct depends on the percentage of studies you do in which there is an effect (H1 is true) versus no effect (H0 is true), on the statistical power, and on the alpha level. The false discovery rate is the percentage of positive results that are false positives (not the percentage of all studies that are false positives). If you perform 200 tests with 80% power, and 50% (i.e., 100) of the tests examine a true effect, you’ll find 80 true positives (0.8*100), but in the other 50% of the tests, which do not examine a true effect, you’ll find 5 false positives (0.05*100). For the 85 positive results (80 + 5), the false discovery rate is 5/85=0.0588, or approximately 6% (see the Figure below, from Lakens & Evers, 2014, for a visualization).
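The arithmetic in this example can be sketched in a few lines of Python. The function name `expected_outcomes` and its parameter names are hypothetical, chosen only for illustration; the inputs are the numbers used above (200 tests, 50% true effects, 80% power, an alpha of 5%).

```python
def expected_outcomes(n_tests, prop_h1_true, power, alpha):
    """Expected true positives, false positives, and false discovery rate."""
    n_h1 = n_tests * prop_h1_true      # tests that examine a true effect
    n_h0 = n_tests - n_h1              # tests where the null hypothesis is true
    true_pos = power * n_h1            # expected significant results when H1 is true
    false_pos = alpha * n_h0           # expected significant results when H0 is true
    fdr = false_pos / (true_pos + false_pos)
    return true_pos, false_pos, fdr


true_pos, false_pos, fdr = expected_outcomes(200, 0.5, 0.8, 0.05)
print(f"true positives: {true_pos:.0f}, false positives: {false_pos:.0f}, FDR: {fdr:.4f}")
```

Changing `prop_h1_true` or `power` shows how strongly the false discovery rate depends on quantities other than alpha.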

At the same time, the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors. This is also true in the Figure above. Of 200 studies, at most 0.05*200 = 10 will, on average, be false positives. This happens only when H0 is true for all 200 studies. In our situation, only 5 studies (2.5% of all studies) are Type 1 errors, which is indeed less than 5% of all the studies we’ve performed.
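The same bookkeeping shows why the two rates diverge. In this sketch (the function name `error_shares` and the grid of values for the proportion of true effects are assumptions made for illustration), the expected share of all studies that are Type 1 errors equals alpha times the proportion of true nulls, so it cannot exceed alpha, while the false discovery rate has no such ceiling.

```python
def error_shares(prop_h1_true, power=0.8, alpha=0.05):
    """Return (Type 1 errors as a share of all studies, false discovery rate)."""
    false_pos = alpha * (1 - prop_h1_true)   # expected share of all studies
    true_pos = power * prop_h1_true          # expected share of all studies
    positives = true_pos + false_pos
    fdr = false_pos / positives if positives > 0 else float("nan")
    return false_pos, fdr


for prop in (0.0, 0.1, 0.5, 0.9):
    type1_share, fdr = error_shares(prop)
    print(f"P(H1 true) = {prop:.1f}: Type 1 share = {type1_share:.3f}, FDR = {fdr:.3f}")
```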

So what’s the problem? The problem is that you should not try to translate your Type 1 error rate into the evidential value of a single study. If you want to make a statement about a single p < 0.05 study representing a true effect, there is no way to quantify this without knowing the power in the studies where H1 is true, and the percentage of studies where H1 is true. P-values and evidential value are not completely unrelated, in the long run, but a single study won’t tell you a lot – especially when you investigate counterintuitive findings that are unlikely to be true.

So what should you do? The solution is to never say you’ve made a discovery based on a single p-value. This will not just make statisticians, but also philosophers of science, very happy. And instead of making a fool out of yourself perhaps as often as 30% of the time, you won't make a fool out of yourself at all.

A statistically significant difference might be ‘in line with’ predictions from a theory. After all, your theory predicts data patterns, and the p-value tells you the probability of observing data (or more extreme data), assuming the null hypothesis is true. ‘In line with’ is a nice way to talk about your results. It is not a quantifiable statement about your hypothesis (that would be silly, based on a p-value!), but it is a fair statement about your data.

P-values are important tools because they allow you to control error rates. Not the false discovery rate, but the false positive rate. If you do 200 studies in your life, and you control your error rates, you won't say that there is an effect, when there is no effect, more than 10 times (on average). That’s pretty sweet. Obviously, there are also Type 2 errors to take into account, which is why you should design high-powered studies, but that’s a different story.

Some people recommend lowering p-value thresholds to as low as 0.001 before you announce a ‘discovery’ (I've already explained why we should ignore this), and others say we should get rid of p-values altogether. But I think we should get rid of ‘discovery’, and use p-values to control our error rates.

It’s difficult to know, for any single dataset, whether a significant effect is indicative of a true hypothesis. With Bayesian statistics, you can convince everyone who has the same priors. Or you can collect such a huge amount of data that you can convince almost everyone (irrespective of their priors). But perhaps we should not try to get too much out of single studies. It might just be the case that, as long as we share all our results, a bunch of close replications extended with pre-registered novel predictions of a pattern in the data will be more useful for cumulative science than quantifying the likelihood that a single study provides support for a hypothesis. And if you agree we need multiple studies, you'd better control your Type 1 errors in the long run.

19 comments:

Shravan Vasishth, September 21, 2015 at 9:01 AM
You wrote: "there is no way to quantify this without knowing the power in the studies where H1 is true, and the percentage of studies where H1 is true"

How will you compute power for the case where H1 is true in your definition of H1?
Under your definition, H1 is that $\mu \neq 0$. Power is not a single number for $\mu \neq 0$; you would have to commit to a specific value of $\mu$. Power is best seen as a function of $\mu$, taking a different value for each specific $\mu$. So the above sentence is not really correct.
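To illustrate the point that power is a function of $\mu$ rather than one number, here is a minimal sketch for a two-sided one-sample z-test of $\mu = 0$. The choice of a z-test, the known standard deviation `sigma`, the sample size `n=50`, and the function name are all assumptions made for illustration; none of them come from the post.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_two_sided_z(mu, n, sigma=1.0, alpha=0.05):
    """Power of a two-sided one-sample z-test of H0: mu = 0 when the true mean is mu."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = mu * sqrt(n) / sigma   # mean of the z statistic when the true mean is mu
    # Probability that the z statistic lands in the upper or lower rejection region
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


for mu in (0.0, 0.1, 0.2, 0.5, 0.8):
    print(f"mu = {mu:.1f}: power = {power_two_sided_z(mu, n=50):.3f}")
```

At `mu = 0` the function returns alpha, and it rises toward 1 as `mu` moves away from zero, so any single "power" figure presupposes a specific value of `mu`.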

Sam Schwarzkopf, September 21, 2015 at 10:31 AM
Very sensible post. As far as I can tell, David Colquhoun's argument, and his overstatement, rests on the assumption of very low base rates (e.g. he uses the example of 10% of tested hypotheses being true). That will certainly be a valid assumption in some cases but it is completely preposterous in others. There must be a pretty enormous variance in base rates across different hypotheses, between different fields and even within fields. I can certainly see how testing numerous chemical compounds or thousands of genes or an exploratory fMRI analysis with thousands of voxels will have inflated false discovery rates (and you're supposed to correct for multiple comparisons in that case; while this doesn't take the base rate into account, it will already help the situation). But not all science operates that way. A lot of hypotheses are being tested because researchers have good reasons to expect that they could be true, either because they follow from previous literature or from theoretical models. The best studies contrast different hypotheses that both have some footing in theory and the outcome can adjudicate between them. In the perfect scenario (one was discussed recently in the Firestone & Scholl review as the El Greco Fallacy) you could even use a significant finding to disprove a hypothesis. In many such situations surely the probability that the H1 you're testing is true should be better than a coin toss.

Anyway, you say we should get rid of 'discovery'. Do you think you can? I am not sure that it isn't simply human nature to interpret it this way. If you find something that reaches some criterion of evidence (however loosely defined) most people will inevitably be led to treat it as a discovery or important result etc. This isn't the fault of p-values either, you'll get the same with Bayes factors or whatever other statistical approach. Perhaps this emotional reaction can be counteracted by education and political changes to how research works but I am not sure it can.

David Colquhoun, September 21, 2015 at 1:38 PM
You say that

“First of all, the statement by David Colquhoun is obviously incorrect – peer reviewers nowadays are not what they never were”

That is quite an accusation, especially since you don’t say what’s wrong with it.
Royal Society Open Science has open peer review, so you can read the reports of the referees. They might be interested in your accusation too.

In fact you have not addressed at all the question that I asked, which was “if you observe P = 0.047 in a perfect single experiment, and claim that there is a real effect, what is the probability that you make a fool of yourself by claiming a discovery when there is none?”

This, of course, depends on the power of the test and on the prevalence of true effects (the probability that there is a real effect before the experiment was done).

This prevalence (prior) is not known, but it’s not reasonable to assume any value greater than 0.5. To do so would amount to saying to the journal editor that I have made a discovery and my evidence for that claim is based on the assumption that I was almost sure that there was an effect before I did the experiment. I have never seen anyone advance such an argument in a paper, and to do so would invite derision.

If the prevalence is 0.5, the chance of making a fool of yourself is AT LEAST 26% (rounded to 30% in my strap line). If the prevalence is lower than 0.5, the false discovery rate will be much higher (e.g. it is at least 76% for a prevalence of 0.1). Your figure of 6% is based on what happens when we look at all P values equal to or less than 0.05. This does not answer my question. In order to get the answer one has to look not at all tests that give P < 0.05, but only at those tests that give what we observed, P = 0.047. This is explained in section 10 of my paper http://rsos.royalsocietypublishing.org/content/1/3/140216#sec-10

That can be done algebraically, but I do it by simulation, which necessitates looking only at those tests which give P close to 0.047 (I used P between 0.045 and 0.05). When this is done, it’s found that the false discovery rate is not 6%, but 26%.
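A minimal sketch of that kind of simulation, restricted to P values between 0.045 and 0.05, might look as follows. The comment does not give the details of the original simulation, so this is not a reproduction of it: the use of a two-sided z-test, the 80% power, the number of simulated tests, the seed, and the function name `fdr_near_p` are all assumptions made for illustration. The result should therefore land in the same region as the figures quoted above rather than match them exactly.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def fdr_near_p(prevalence, power=0.8, alpha=0.05, p_low=0.045, p_high=0.05,
               n_sims=200_000, seed=1):
    """Simulate two-sided z-tests and return the share of true nulls among
    the tests whose p-value falls between p_low and p_high."""
    rng = random.Random(seed)
    # Mean of the z statistic under H1 that gives approximately the requested
    # power (ignoring the negligible chance of rejecting in the wrong tail).
    delta = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    false_hits = true_hits = 0
    for _ in range(n_sims):
        effect_is_real = rng.random() < prevalence
        z = rng.gauss(delta if effect_is_real else 0.0, 1.0)
        p = 2 * (1 - STD_NORMAL.cdf(abs(z)))   # two-sided p-value
        if p_low <= p <= p_high:
            if effect_is_real:
                true_hits += 1
            else:
                false_hits += 1
    return false_hits / (false_hits + true_hits)


print(f"prevalence 0.5: FDR near p = 0.047 is about {fdr_near_p(0.5):.2f}")
```

Calling `fdr_near_p(0.1)` shows how much higher the proportion becomes when only 10% of tested hypotheses are true.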

While it is true that “the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors”, it is irrelevant because type 1 errors don’t answer the question of how often your discovery is false.

Your argument seems to be that it doesn't matter much if people publish results that aren't true because someone else will sort it out later.

I don't think that most people will be very impressed by this.

The recent replication study shows that a majority of results can't be replicated. If I were a psychologist, I would be very worried indeed by that. It represents a colossal waste of research funds. The use of P < 0.05 must take some of the blame for this sad state of affairs.

One result of this is that every new psychology study that appears in the news is greeted with yawns and "Oh yeah?". Using P=0.05 may get you lots of papers, but it damages science. Until people realise how little evidence is provided by marginal P values, this will continue.

I guess one reason is the great pressure that's placed on academics to publish before the work is ready. Even that is, in a sense, a statistical problem. It's a problem that results from the statistical illiteracy of senior academics who rely on crude metrics and who care about quantity more than quality.

David Colquhoun, September 21, 2015 at 2:13 PM
That answer seems to confirm the impression that you aren't worried by the fact that a majority of published results are wrong, on the basis that someone else will sort it out eventually.
That seems to me to be irresponsible.