A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, June 27, 2014

Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True

Most of this post is inspired by a lecture on probabilities by Ellen Evers during a PhD workshop we taught (together with Job van Wolferen and Anna van ‘t Veer) called ‘How do we know what’s likely to be true’. I’d heard this lecture before (we taught the same workshop at Eindhoven a year ago) but now she extended her talk to the probability of observing a mix of significant and non-significant findings. If this post is useful for you, credit goes to Ellen Evers.

A few days ago, I sent around some questions on Twitter (thanks for answering!) and in this blog post, I’d like to explain the answers. Understanding this is incredibly important and will change the way you look at sets of studies that contain a mix of significant and non-significant results, so you want to read until the end. It’s not that difficult, but you probably want to get a coffee. 42 people answered the questions, and all but 3 worked in science, for anywhere from 1 to 26 years. If you want to do the questions before reading the explanations below (which I recommend), go here.

I’ll start with the easiest question, and work towards the most difficult one.

Running a single study

I asked: You are planning a new study. Beforehand, you judge it is equally likely that the null-hypothesis is true, as that it is false (a uniform prior). You set the significance level at 0.05 (and pre-register this single confirmatory test to guarantee the Type 1 error rate). You design the study to have 80% power if there is a true effect (assume you succeed perfectly). What do you expect is the most likely outcome of this single study?

The four response options were:

1) It is most likely that you will observe a true positive (i.e., there is an effect, and the observed difference is significant).

2) It is most likely that you will observe a true negative (i.e., there is no effect, and the observed difference is not significant)

3) It is most likely that you will observe a false positive (i.e., there is no effect, but the observed difference is significant).

4) It is most likely that you will observe a false negative (i.e., there is an effect, but the observed difference is not significant)

59% of the people chose the correct answer: It’s most likely that you’ll observe a true negative. You might be surprised, because the scenario (5% significance level, 80% power, the null hypothesis (H0) and the alternative hypothesis (H1) are equally likely to be true) is pretty much the prototypical experiment. It thus means that a typical experiment (at least when you think your hypothesis is 50% likely to be true) is most likely not to reject the null-hypothesis (earlier, I wrote 'fail', but in the comments Ron Dotsch correctly points out not rejecting the null can be informative as well). Let’s break it down slowly.

If you perform a single study, the effect you are examining is either true or false, and the difference you observe is either significant or not significant. These four possible outcomes are referred to as true positives, false positives, true negatives, and false negatives. The percentage of false positives equals the Type 1 error rate (or α, the significance level), and false negatives (or Type 2 errors, β) equal 1 minus the power of the study. When the null hypothesis (H0) and the alternative hypothesis (H1) are a-priori equally likely, the significance level is 5%, and the study has 80% power, the relative likelihood of the four possible outcomes of this study before we collect the data is detailed in the table below.

                            H0 True                    H1 True
                            (A-Priori 50% Likely)      (A-Priori 50% Likely)
Significant Finding         False Positive (α)         True Positive (1-β)
Non-Significant Finding     True Negative (1-α)        False Negative (β)

The only way a true positive is most likely (the answer provided by 24% of the participants) given this a-priori likelihood of H0 is when the power is higher than 1-α, so in this example higher than 95%. After asking which outcome was most likely, I asked how likely this outcome was. In the sample of 42 people who filled out my questionnaire, there were people who responded intuitively, and those who did the math. Twelve people correctly reported 47.5%. What’s interesting is that 16 people (more than one-third) reported a percentage higher than 50%. These people might have simply ignored the information that the hypothesis was equally likely to be true as to be false (which implies no outcome can be more than 50% likely), and intuitively calculated probabilities assuming the effect was true, while ignoring the probability that it was not. The modal response of people who had indicated earlier that they thought a true positive was most likely also points in this direction, because they judged it to be 80% probable that this true positive would be observed.
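The four cells of the table above are easy to check with a few lines of code (this is just a sketch; the variable names are mine, not from the original post). Each cell is the prior probability of the hypothesis times the conditional probability of the result:

```python
# Joint probabilities of the four possible outcomes of a single study,
# before data collection: 50% prior, alpha = .05, power = .80.

prior_h1 = 0.5   # a-priori probability that H1 is true
alpha = 0.05     # significance level (Type 1 error rate)
power = 0.80     # 1 - beta

outcomes = {
    "true positive":  prior_h1 * power,              # H1 true, significant
    "false negative": prior_h1 * (1 - power),        # H1 true, non-significant
    "false positive": (1 - prior_h1) * alpha,        # H0 true, significant
    "true negative":  (1 - prior_h1) * (1 - alpha),  # H0 true, non-significant
}

for name, p in outcomes.items():
    print(f"{name}: {p:.1%}")
```

Running this shows the true negative is the single most likely outcome, at 47.5%, matching the correct answer above.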

Then I asked: 

“Assume you performed the single study described above, and have observed a statistical difference (p < .05, but you don’t have any further details about effect sizes, exact p-values, or the sample size). Simply based on the fact that the study is statistically significant, how likely do you think it is you observed a significant difference because you were examining a true effect?”

Eight people (who did the math) answered 94.1%, the correct answer. All but two people who responded intuitively underestimated the correct answer (the average answer was 57%). The remaining two answered 95%, which indicates they might have made the common error of assuming that observing a significant result means it’s 95% likely the effect is true (it’s not, see Nickerson, 2000). It’s interesting that people who responded intuitively overestimated the a-priori chance of a specific outcome, but then massively underestimated the probability that the effect was true after a significant result had been observed. The correct answer is 94.1% because now that we know we did not observe a non-significant result, we are left with only the outcomes in which the result is significant. There was a 2.5% chance of a Type 1 error, and a 40% chance of a true positive. That means the probability that the effect is true, given this significant result, is 40 divided by the total, which is 40+2.5, and 40/(40+2.5) = 94.1%. Ioannidis (2005) calls this post-study probability that the effect is true the positive predictive value, or PPV (thanks to Marcel van Assen for pointing this out).
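The same division can be sketched in code (again, the names are mine): the PPV is the true positive probability divided by the total probability of a significant result.

```python
# Positive predictive value (PPV) for the single-study scenario:
# given a significant result, how likely is it that the effect is true?

prior_h1 = 0.5
alpha = 0.05
power = 0.80

true_positive = prior_h1 * power          # 40% of all outcomes
false_positive = (1 - prior_h1) * alpha   # 2.5% of all outcomes

ppv = true_positive / (true_positive + false_positive)
print(f"PPV: {ppv:.1%}")  # 94.1%
```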

What happens if you run multiple studies?

Continuing the example as Ellen Evers taught it, I asked people to imagine they performed three of the studies described above, and found that two were significant but one was not. How likely would it be to observe this outcome if the alternative hypothesis is true? All people who did the math gave the answer 38.4%. This is the a-priori likelihood of finding 2 out of 3 studies to be significant with 80% power and a 5% significance level. If the effect is true, there’s an 80% probability of finding an effect, times an 80% probability of finding an effect, times a 20% probability of a Type 2 error: 0.8*0.8*0.2 = 12.8%. If you calculate the probability for the three orders in which you can get two out of three significant results (S S NS; S NS S; NS S S), you multiply this by 3, and 3*12.8 gives 38.4%. Ellen prefers to focus on the single outcome you have observed, including the specific order in which it was observed.
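This is just the binomial probability of 2 successes in 3 trials with a success probability of .80, which can be sketched directly (variable names are illustrative):

```python
from math import comb

power = 0.80

# One specific order, e.g. significant, significant, non-significant:
p_one_order = power * power * (1 - power)   # 12.8%

# Any 2-out-of-3 pattern: multiply by the number of possible orderings
p_any_order = comb(3, 2) * p_one_order      # 38.4%

print(f"one order: {p_one_order:.1%}, any order: {p_any_order:.1%}")
```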

I might not have formulated the question clearly enough (most probability statements are so unlike natural language that they can be difficult to formulate precisely), but I tried to ask not for the a-priori probability, but for the probability that, given these observations, the studies examined a true effect (similar to the single-study case above, where the answer was not 80%, but 94.1%). In other words, the probability that H1 is true, conditional on the acceptance of H1, which Ioannidis (2005) calls the PPV. This is the likelihood of finding a true positive, divided by the total probability of finding a significant result (either a true positive or a false positive).

We therefore also need to know how likely it is to observe this finding when the null-hypothesis is true. In that case, we would find a Type 1 error (5%), another Type 1 error (5%), and a true negative (95%), and 0.05*0.05*0.95 = 0.002375, or 0.2375%. There are three ways to get this pattern of results, so if you want the probability of 2 out of 3 significant findings under H0 irrespective of the order, this probability is 0.7125%. That’s not very likely at all.

To answer the question, we need to calculate 12.8/(12.8+0.2375) (for the specific order in which the results were observed) or 38.4/(38.4+0.7125) (for any 2 out of 3 studies) and both calculations give us 98.18%. Although a-priori it is not extremely likely to observe 2 significant and 1 non-significant finding, after you have observed this outcome, it is more than 98% likely to have observed 2 significant and one non-significant result in three studies when the effect is true (and thus only 1.82% when the effect is not true).
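Putting the two pieces together in code (a sketch with illustrative names; note that because H0 and H1 are a-priori equally likely, the 50% prior cancels out of the ratio):

```python
# PPV for observing 2 significant + 1 non-significant result, combining
# the probability of the data under H1 (12.8% per ordering) and under
# H0 (0.2375% per ordering). With a 50% prior, the prior cancels.

alpha = 0.05
power = 0.80

p_data_h1 = power * power * (1 - power)   # 0.128
p_data_h0 = alpha * alpha * (1 - alpha)   # 0.002375

ppv = p_data_h1 / (p_data_h1 + p_data_h0)
print(f"PPV: {ppv:.2%}")  # 98.18%
```

Multiplying both numerator and denominator by 3 (for the three orderings) gives the same 98.18%, as noted above.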

The probability that, given that you observed a mix of significant and non-significant studies, the effect you observed was true, is important to understand correctly if you do research. In a time where sets of 5 or 6 significant low-powered studies are criticized for being ‘too good to be true’ it’s important that we know when a set of studies with a mix of significant and non-significant studies is ‘too true to be bad’. Ioannidis (2005) briefly mentions you can extend the calculations for multiple studies, but focuses too much on when findings are most likely to be false. What struck me from the lecture Ellen Evers gave, is how likely some sets of studies that include non-significant findings are to be true.

These calculations depend on the power, significance level, and a-priori likelihood that H0 is true. If Ellen and I ever find the time to work on a follow up to our recent article on Practical Recommendations to Increase the Informational Value of Studies, I would like to discuss these issues in more detail. To interpret whether 1 out of 2 studies is still support for your hypothesis, these values matter a lot, but to interpret whether 4 out of 6 studies are support for your hypothesis, they are almost completely irrelevant. This means that one or two non-significant findings in a larger set of studies do almost nothing to reduce the likelihood that you were examining a true effect. If you’ve performed three studies that all worked, and a close replication isn’t significant, don’t get distracted by looking for moderators, at least until the unexpected result is replicated.
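The claim that the prior hardly matters for larger sets can be checked with a small sketch (the `ppv` function below is mine, not from the post or the spreadsheet): compute the PPV for 4 significant results out of 6 studies under different priors.

```python
from math import comb

def ppv(k, n, prior_h1=0.5, alpha=0.05, power=0.80):
    """PPV for observing exactly k significant results out of n studies."""
    p_data_h1 = comb(n, k) * power**k * (1 - power)**(n - k)
    p_data_h0 = comb(n, k) * alpha**k * (1 - alpha)**(n - k)
    return (prior_h1 * p_data_h1) / (
        prior_h1 * p_data_h1 + (1 - prior_h1) * p_data_h0
    )

# Even with a skeptical 10% prior, 4 out of 6 significant results
# leave the PPV above 99%.
for prior in (0.1, 0.5, 0.9):
    print(f"prior {prior:.0%}: PPV for 4/6 = {ppv(4, 6, prior):.4%}")
```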

I've taken the spreadsheet Ellen Evers made and shared with the PhD students, and extended it slightly. You can download it here, and use it to perform your own calculations with different levels of power, significance levels, and a-priori likelihoods of H0. On the second tab of the spreadsheet, you can perform these calculations for studies that have different power and significance levels. If you want to start trying out different options immediately, use the online spreadsheet below:

If we want to reduce publication bias, understanding (I mean, really understanding) that sets of studies that include non-significant findings are extremely likely, assuming H1 is true, is a very important realization. Depending on the number of studies, their power, significance level, and the a-priori likelihood of the idea you were testing, it can be no problem to submit a set of studies with mixed significant and non-significant results for publication. If you do, make sure that the Type 1 error rate is controlled (e.g., by pre-registering your study design). 

I want to end with a big thanks to Ellen Evers for explaining this to me last week, and thanks so much to all of you who answered my questionnaire about probabilities.


  1. Do you know any paper that contains these explanations? Alternatively, how could I cite your blog?

    1. Hi, I don't think there's a paper out there that specifies the calculations and probabilities for multiple studies like Ellen Evers did in the lecture, and I detail here in this blog. I can understand that, if you are about to submit a set of studies with some non-significant findings, you want to point the reviewers and editors here. For now, you can cite it as:

Lakens, D., & Evers, E. R. K. (2014, June 27). Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True. Retrieved from http://daniellakens.blogspot.nl/2014/06/too-true-to-be-badwhen-sets-of-studies.html

      Thanks for your comment - that really motivates us to write it up for a real paper - we'll try to find the time, but Ellen is busy writing up her PhD thesis, so it might be a month or 2.

  2. Very interesting, Daniel (and Ellen!), thank you for this. I looked at the questions earlier and decided that I'd wait for the responses, because honestly, despite having thought about these things a lot, I wouldn't have been able to come up with the correct answers some of the times.

    Just a short question about terminology. You write that a true negative is a failed experiment: "59% of the people chose the correct answer: It’s most likely that you’ll observe a true negative. [...] It thus means that a typical experiment (at least when you think your hypothesis is 50% likely to be true) is most likely to fail." -- I would say that in the case of true negatives and true positives, the experiment did not fail at all, as it led the researcher to draw the correct conclusion. Am I missing something?

Hi Ron, completely right - the term 'fail' is not correct - I'll update the blog post! Finding a true negative can be very interesting. Just as Ioannidis focuses too much on findings that are most probably false, I focused too much on findings that are most probably true positives - you could shift the focus and talk about the probability that you find a true negative, which can be just as important!

  3. I am tempted to state the conclusion in a slightly different way. What matters in practical situations, as I understand it, is the number of studies that observed significant effects. The number that failed to reach significance is much less important, because failure to reach significance is not itself evidence for anything that matters, except in cases where you have prior information that you are unlikely to have in practice.

  4. Daniel,

    This is a very clear discussion of an important issue. However, I feel I must add two thoughts that fit very well with your thesis but are inconsistent with some of your other postings (on this blog and elsewhere).

1) Your suggestion that a set of mixed results is likely for common experimental designs in psychology is entirely consistent with the Test for Excess Success (TES) that I have used (but you have criticised on Twitter). The TES simply looks for the absence of the expected non-significant findings. For example, in Jostmann, Lakens and Schubert (2009) every experiment was significant, but the estimated probability of a significant outcome (assuming H1 is true) for each of the four experiments is: 0.512, 0.614, 0.560, and 0.512. The probability of all four such experiments rejecting the null is the product of the probabilities: 0.09. The low probability implies that something is wrong with the data collection, data analysis, or reporting of these studies. Corroborating this claim, Jostmann reported an unpublished non-significant finding at PsychFileDrawer. Details of this analysis are in Francis (2014, Psychonomic Bulletin & Review).
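[The product Greg describes can be verified in a couple of lines; this sketch uses the per-experiment success probabilities he reports above.]

```python
# Probability that all four experiments reject the null, given the
# estimated power of each experiment (values from Francis's comment).
powers = [0.512, 0.614, 0.560, 0.512]

p_all_significant = 1.0
for p in powers:
    p_all_significant *= p

print(f"{p_all_significant:.2f}")  # 0.09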

    I can understand you not liking the conclusion of the TES in this particular case, but it appears to be true and it follows logically from the observations you made in the post.

    2) In other blog posts and articles you have encouraged the use of sequential methods rather than fixed sampling approaches because the former allows a researcher to generate more significant findings. I don’t think this claim is necessarily true, but even if it were true it seems like a silly goal. As your post explains, true effects should produce a fair number of non-significant outcomes. I cannot see any motivation to use a method that generates more significant outcomes when we know that there should be a certain number of non-significant outcomes.

    I appreciate that you are thinking seriously about these statistical issues and that you go to the trouble to write up your thoughts on a blog. I hope you can step back and look at the bigger picture and see that some of your observations do not fit together.

    Best wishes,

    Greg Francis

    1. Dear Greg, thanks for your comment.

      Let's get some things straight first:

      There is no 'liking' or 'not liking' the conclusion of TES - it's statistics, and when it comes to data, I'm like Commander Data. I have no feelings about them, they are what they are. My problem with TES is 1) that it is pointing out a problem we have all known for 50 years existed, and 2) it doesn't solve the problem. The Data Colada blog post today makes the same argument: http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/

      The goal of my blog post here (as well as the recent special issue full of pre-registered replication studies I co-edited with Brian Nosek) is to solve the problem. Getting people to realize they can submit studies with non-significant results might be one way to reduce publication bias.

      The goal of my Sequential Analyses paper is not about getting more significant findings (you might want to read it more closely). The goal is to make it more efficient to run well-powered studies. Obviously, higher powered studies are very important (see Lakens & Evers, 2014). I don't see any inconsistencies in any of my papers and blog posts: They are all ways to improve finding out what is likely to be true. That's also why I don't use TES - it doesn't tell me if it's true or not, it just tells me there is publication bias.

      On a sidenote, I'm not completely sure your analysis of our 2009 paper is correct - we can at least debate about it. In Study 3, we say: "Initial analyses revealed no simple effects of clipboard weight on mayor evaluations or city attitudes, all Fs<1. We continued by regressing city attitudes on clipboard weight (heavy vs. light, contrast-coded), mayor evaluations (continuous and centered), and their interaction term." We don't explicitly say: our initial prediction was not confirmed, and we performed exploratory analyses, but this was 2008, and our statement is pretty close. I would not have taken it along in a TES analysis as a study that confirmed our a-priori prediction. But these details don't matter. There was publication bias (just as there is in other studies where TES doesn't work, e.g., one or two study papers). I'm not proud we contributed to the filedrawer effect, but I am proud that when we realized the problems with the filedrawer effect, we were the first (and I think still the only) researchers to upload a failed replication of their own work to PsychFileDrawer.

    2. Daniel,

      I appreciate your attitude towards statistics and your willingness to discuss these issues. Your problem 1) with the TES is misplaced, and I do not think you actually believe it (I mean that in a good way!). Everyone agrees that there is some publication bias across the field. However, the TES, as I have used it, asks a much more specific question: is there bias within a particular set of studies that relate to a theoretical conclusion? Whether there is bias for topics in other papers hardly makes any statement about bias in any particular paper.

      In particular, the knowledge of bias over the last 50 years did not (properly, I would suggest) stop you from publishing your paper. I am inclined to believe that if you and your co-authors had believed the non-significant finding was evidence against your theoretical conclusion, then you would not have dropped the experiment (that would be fraud, and I have no reason to think you would do such a thing).

      Rather, I suspect that you believed there was some methodological flaw in the design or execution of the experiment and that not reporting it was justified. Most likely, you and your co-authors did not consider the point raised in your recent blog post that for a set of studies with relatively weak power, a non-significant finding is expected, from time to time. So, I would suggest that the TES really did point out a problem with the study that you (and others) did not know prior to the analysis.

      I readily concede that the TES does not solve the broader problem. I appreciate your (and Brian's and other's) efforts to address the problem; I am skeptical about whether they are going to work, but the effort is laudable. In contrast, today's Data Colada blog post is almost utter nonsense. (The math is valid but almost everything else is wrong.) I wrote Uri about it, but he seems to want to persist in spreading confusion about the TES. I don't know why. My view on these issues is at http://www.sciencedirect.com/science/article/pii/S0022249613000527

      We can discuss sequential analyses another time, but I am skeptical about the efficiency claims. More generally, it seems to me that we need to use statistics that allow (and encourage) gathering additional data as needed. Hypothesis testing (sequential or fixed sample) does not do that very well.

Regarding the details of how the TES was applied to your paper: I can only follow the interpretation of the authors about the relation between the statistics and the theoretical claims. The first sentence of the General Discussion in your paper is "In four studies, we obtained evidence that the abstract concept of importance is linked to bodily experiences of weight." I think that makes it clear that, at the time, you felt Study 3 did provide evidence in support of the theoretical claim. If you feel otherwise now, I guess you could try to submit a comment to Psych Science.