Most of this post is inspired by a lecture on probabilities
by Ellen Evers during a PhD workshop we taught (together with Job van Wolferen
and Anna van ‘t Veer) called ‘How do we know what’s likely to be true’. I’d
heard this lecture before (we taught the same workshop at Eindhoven a year ago)
but now she extended her talk to the probability of observing a mix of
significant and nonsignificant findings. If this post is useful for you, credit
goes to Ellen Evers.
A few days ago, I sent around some questions on Twitter (thanks
for answering!) and in this blog post, I’d like to explain the answers.
Understanding this is incredibly important and will change the way you look at
sets of studies that contain a mix of significant and nonsignificant results,
so you want to read until the end. It’s not that difficult, but you probably
want to get a coffee. 42 people answered the questions, and all but 3 worked in
science, anywhere from 1 to 26 years. If you want to do the questions before
reading the explanations below (which I recommend), go here.
I’ll start with the easiest question, and work towards the most difficult one.
Running a single study
I asked: You are
planning a new study. Beforehand, you judge it is equally likely that the
null hypothesis is true, as that it is false (a uniform prior). You set the
significance level at 0.05 (and preregister this single confirmatory test to
guarantee the Type 1 error rate). You design the study to have 80% power if
there is a true effect (assume you succeed perfectly). What do you expect is
the most likely outcome of this single study?
The four response options were:

1) It is most likely that you will observe a true positive (i.e., there is an effect, and the observed difference is significant).

2) It is most likely that you will observe a true negative (i.e., there is no effect, and the observed difference is not significant).

3) It is most likely that you will observe a false positive (i.e., there is no effect, but the observed difference is significant).

4) It is most likely that you will observe a false negative (i.e., there is an effect, but the observed difference is not significant).

59% of the people chose the correct answer: It’s most likely
that you’ll observe a true negative. You might be surprised, because the
scenario (5% significance level, 80% power, the null hypothesis (H0) and the
alternative hypothesis (H1) are equally likely to be true) is pretty much the
prototypical experiment. It thus means that a typical experiment (at least when
you think your hypothesis is 50% likely to be true) is most likely not to reject the null hypothesis (earlier, I wrote 'fail', but in the comments Ron Dotsch correctly points out that not rejecting the null can be informative as well).
Let’s break it down slowly.
If you perform a single study, the effect you are examining
is either true or false, and the difference you observe is either significant
or not significant. These four possible outcomes are referred to as true
positives, false positives, true negatives, and false negatives. The percentage of false positives equals the Type 1 error rate (or α, the significance level), and the percentage of false negatives (or Type 2 errors, β) equals 1 minus the power of the study. When the null hypothesis (H0) and the alternative hypothesis (H1) are a priori equally
likely, the significance level is 5%, and the study has 80% power, the relative
likelihood of the four possible outcomes of this study before we collect the
data is detailed in the table below.
                         H0 True (a priori 50% likely)   H1 True (a priori 50% likely)
Significant finding      False positive (α): 2.5%        True positive (1−β): 40%
Nonsignificant finding   True negative (1−α): 47.5%      False negative (β): 10%

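If you want to check these numbers yourself, each cell follows from multiplying the prior probability of the hypothesis by the relevant error rate or power. A quick sketch in Python (the variable names are mine, not from the post):

```python
# Reproducing the four cells of the table above.
prior_h1 = 0.5   # a priori probability that H1 is true
alpha = 0.05     # significance level
power = 0.80     # probability of a significant result if H1 is true

false_positive = (1 - prior_h1) * alpha        # 2.5%
true_positive = prior_h1 * power               # 40%
true_negative = (1 - prior_h1) * (1 - alpha)   # 47.5%
false_negative = prior_h1 * (1 - power)        # 10%

print(false_positive, true_positive, true_negative, false_negative)
```

Note that the four cells sum to 100%, and the true negative (47.5%) is indeed the largest single cell.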
The only way a true positive is most likely (the answer provided by 24% of the participants) given this a priori likelihood of H0 is when the power is higher than 1−α, so in this example higher than 95%. After asking which outcome was most likely, I asked how likely this outcome was. In the sample of 42 people who filled out my questionnaire, there were people who responded intuitively, and those who did the math. Twelve people correctly reported 47.5%. What's interesting is that 16 people (more than one-third) reported a percentage higher than 50%. These people might have simply ignored the information that the hypothesis was as likely to be true as it was to be false (which implies no outcome can be higher than 50%), and intuitively calculated probabilities assuming the effect was true, while ignoring the probability that it was not. The modal response of people who had indicated earlier that they thought it was most likely to observe a true positive also points to this, because they judged it would be 80% probable that this true positive was observed.
Then I asked:
“Assume you performed the single study
described above, and have observed a statistical difference (p < .05, but you don’t have any
further details about effect sizes, exact p-values,
or the sample size). Simply based on the fact that the study is statistically
significant, how likely do you think it is you observed a significant
difference because you were examining a true effect?”
Eight people (who did the math) answered 94.1%, the correct
answer. All but two people who responded intuitively underestimated the correct
answer (the average answer was 57%). The remaining two answered 95%, which
indicates they might have made the common error to assume that observing a
significant result means it’s 95% likely the effect is true (it’s not, see
Nickerson, 2000). It's interesting that people who responded intuitively overestimated the a priori chance of a specific outcome, but then massively underestimated the probability that the effect was true, given the observed outcome. The correct answer is 94.1%
because now that we know we did not observe a nonsignificant result, we are left with the two remaining possibilities: a significant result when H0 is true, or a significant result when H1 is true. There was a 2.5% chance of a Type 1 error, and a 40% chance of a true positive. That means the probability that the effect is true, given this significant outcome, is 40 divided by the total, which is 40+2.5, and 40/(40+2.5) = 94.1%. Ioannidis (2005) calls this post-study probability that the effect is true the positive predictive value (PPV; thanks to Marcel van Assen for pointing this out).
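The PPV arithmetic above can be checked in a few lines (a Python sketch; variable names are mine):

```python
# Positive predictive value (PPV): probability the effect is true,
# given a single significant result, using the post's numbers.
true_positive = 0.40    # P(H1) * power = 0.5 * 0.80
false_positive = 0.025  # P(H0) * alpha = 0.5 * 0.05

ppv = true_positive / (true_positive + false_positive)
print(round(ppv * 100, 1))  # 94.1
```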
What happens if you run multiple studies?
Continuing the example as Ellen Evers taught it, I asked people to imagine they performed three of the studies described above, and found that two were significant but one was not. How likely would it be to observe this outcome if the alternative hypothesis is true? All people who did the math gave the answer 38.4%. This is the a priori likelihood of finding 2 out of 3 studies to be significant with 80% power and a 5% significance level. If the effect is true, there's an 80% probability of finding an effect, times an 80% probability of finding an effect, times a 20% probability of making a Type 2 error. 0.8*0.8*0.2 = 12.8%. If you calculate the probability for the three ways to get two out of three significant results (S S NS; S NS S; NS S S), you multiply it by 3, and 3*12.8% gives 38.4%. Ellen prefers to focus on the single outcome you have observed, including the specific order in which it was observed.
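The 12.8% and 38.4% figures can be verified as follows (a Python sketch, assuming three independent studies with identical power):

```python
from math import comb

power = 0.80

# One specific order (e.g., S S NS) of 2 significant results in 3 studies under H1:
one_order = power * power * (1 - power)

# Any of the comb(3, 2) = 3 possible orders:
any_order = comb(3, 2) * one_order

print(round(one_order * 100, 1), round(any_order * 100, 1))  # 12.8 38.4
```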
We therefore also need to know how likely it is to observe this finding when the null hypothesis is true. In that case, we would find a Type 1 error (5%), another Type 1 error (5%), and a true negative (95%), and 0.05*0.05*0.95 = 0.2375%. There are three ways to get this pattern of results, so if you want the probability of 2 out of 3 significant findings under H0 irrespective of the order, this probability is 0.7125%. That's not very likely at all.
To answer the question, we need to calculate 12.8/(12.8+0.2375)
(for the specific order in which the results were observed) or 38.4/(38.4+0.7125) (for any 2 out of 3 studies), and both calculations give us 98.18%. Although a priori it is not extremely likely to observe 2 significant and 1 nonsignificant finding, after you have observed this outcome, it is more than 98% likely to have observed 2
significant and one nonsignificant result in three studies when the effect is
true (and thus only 1.82% when the effect is not true).
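The 98.18% figure follows from comparing the two remaining possibilities, just as in the single-study case (a Python sketch using the post's numbers):

```python
# Probability of 2 significant + 1 nonsignificant result (specific order):
p_if_h1 = 0.80 * 0.80 * 0.20   # 12.8% if the effect is true
p_if_h0 = 0.05 * 0.05 * 0.95   # 0.2375% if the effect is not true

# With a 50/50 prior, the priors cancel, leaving:
posterior = p_if_h1 / (p_if_h1 + p_if_h0)
print(round(posterior * 100, 2))  # 98.18
```

The same ratio results whether you use the single-order probabilities (12.8 vs. 0.2375) or the any-order probabilities (38.4 vs. 0.7125), because the factor of 3 cancels.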
The probability that, given that you observed a mix of significant and nonsignificant studies, the effect you observed was true, is important to understand correctly if you do research. In a time where sets of 5 or 6 significant low-powered studies are criticized for being 'too good to be true', it's important that we know when a set of studies with a mix of significant and nonsignificant studies is 'too true to be bad'. Ioannidis (2005) briefly mentions you can extend the calculations to multiple studies, but focuses too much on when findings are most likely to be false. What struck me from the lecture Ellen Evers gave is how likely some sets of studies that include nonsignificant findings are to be true.
These calculations depend on the power, significance level, and a priori likelihood that H0 is true. If Ellen and I ever find the time to work on a follow-up to our recent article on Practical Recommendations to Increase the Informational Value of Studies, I would like to discuss these issues in more detail. To interpret whether 1
out of 2 studies is still support for your hypothesis, these values matter a
lot, but to interpret whether 4 out of 6 studies are support for your
hypothesis, they are almost completely irrelevant. This means that one or two
nonsignificant findings in a larger set of studies do almost nothing to reduce
the likelihood that you were examining a true effect. If you’ve performed three
studies that all worked, and a close replication isn’t significant, don’t get
distracted by looking for moderators, at least until the unexpected result is
replicated.
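A general version of this calculation (a hypothetical helper, not from the post or Ellen's spreadsheet) makes it easy to try other combinations of studies, power, significance level, and prior. It assumes independent studies with identical power:

```python
from math import comb

def prob_h1_given_results(k, n, power=0.80, alpha=0.05, prior_h1=0.5):
    """Posterior probability that H1 is true after observing
    k significant results in n independent studies."""
    # Likelihood of exactly k significant results under each hypothesis:
    p_data_h1 = comb(n, k) * power**k * (1 - power)**(n - k)
    p_data_h0 = comb(n, k) * alpha**k * (1 - alpha)**(n - k)
    numerator = prior_h1 * p_data_h1
    return numerator / (numerator + (1 - prior_h1) * p_data_h0)

print(round(prob_h1_given_results(2, 3) * 100, 2))  # 98.18
print(round(prob_h1_given_results(4, 6) * 100, 2))  # 99.97
```

As the 4-out-of-6 example shows, one or two nonsignificant findings in a larger set barely dent the posterior probability that the effect is true, which is exactly the point made above.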
I've taken the spreadsheet Ellen Evers made and shared with the PhD students, and extended it slightly. You can download it here, and use it to perform your own calculations with different levels of power, significance levels, and a priori likelihoods of H0. On the second tab of the spreadsheet, you can perform these calculations for studies that have different power and significance levels. If you want to start trying out different options immediately, use the online spreadsheet below:
If we want to reduce publication bias, understanding (I mean, really understanding) that sets of studies that include nonsignificant findings are extremely likely, assuming H1 is true, is a very important realization. Depending on the number of studies, their power,
significance level, and the a priori likelihood of the idea you were testing,
it can be no problem to submit a set of studies with mixed significant and
nonsignificant results for publication. If you do, make sure that the Type 1
error rate is controlled (e.g., by preregistering your study design).
I want to end with a big thanks to Ellen Evers for
explaining this to me last week, and thanks so much to all of you who answered
my questionnaire about probabilities.
Do you know any paper that contains these explanations? Alternatively, how could I cite your blog?
Hi, I don't think there's a paper out there that specifies the calculations and probabilities for multiple studies like Ellen Evers did in the lecture, and I detail here in this blog. I can understand that, if you are about to submit a set of studies with some nonsignificant findings, you want to point the reviewers and editors here. For now, you can cite it as:
Lakens, D., & Evers, E. R. K. (2014, June 27). Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True. Retrieved from http://daniellakens.blogspot.nl/2014/06/tootruetobebadwhensetsofstudies.html
Thanks for your comment; that really motivates us to write it up as a real paper. We'll try to find the time, but Ellen is busy writing up her PhD thesis, so it might be a month or two.
Very interesting, Daniel (and Ellen!), thank you for this. I looked at the questions earlier and decided that I'd wait for the responses, because honestly, despite having thought about these things a lot, I wouldn't have been able to come up with the correct answers some of the times.
Just a short question about terminology. You write that a true negative is a failed experiment: "59% of the people chose the correct answer: It’s most likely that you’ll observe a true negative. [...] It thus means that a typical experiment (at least when you think your hypothesis is 50% likely to be true) is most likely to fail." I would say that in the case of true negatives and true positives, the experiment did not fail at all, as it led the researcher to draw the correct conclusion. Am I missing something?
Hi Ron, completely right: the term 'fail' is not correct, and I'll update the blog post! Finding a true negative can be very interesting. Just as Ioannidis focuses too much on findings that are most probably false, I focused too much on findings that are most probably true positives. You could shift the focus and talk about the probability that you find a true negative, which can be just as important!
I am tempted to state the conclusion in a slightly different way. What matters in practical situations, as I understand it, is the number of studies that observed significant effects. The number that failed to reach significance is much less important, because failure to reach significance is not itself evidence for anything that matters, except in cases where you have prior information that you are unlikely to have in practice.
ReplyDeleteDaniel,
This is a very clear discussion of an important issue. However, I feel I must add two thoughts that fit very well with your thesis but are inconsistent with some of your other postings (on this blog and elsewhere).
1) Your suggestion that a set of mixed results is likely for common experimental designs in psychology is entirely consistent with the Test for Excess Success (TES) that I have used (but you have criticised on Twitter). The TES simply looks for the absence of the expected nonsignificant findings. For example, in Jostmann, Lakens, and Schubert (2009) every experiment was significant, but the estimated probability of a significant outcome (assuming H1 is true) for each of the four experiments is: 0.512, 0.614, 0.560, and 0.512. The probability of all four such experiments rejecting the null is the product of the probabilities: 0.09. The low probability implies that something is wrong with the data collection, data analysis, or reporting of these studies. Corroborating this claim, Jostmann reported an unpublished nonsignificant finding at PsychFileDrawer. Details of this analysis are in Francis (2014, Psychonomic Bulletin & Review).
I can understand you not liking the conclusion of the TES in this particular case, but it appears to be true and it follows logically from the observations you made in the post.
2) In other blog posts and articles you have encouraged the use of sequential methods rather than fixed sampling approaches because the former allows a researcher to generate more significant findings. I don’t think this claim is necessarily true, but even if it were true it seems like a silly goal. As your post explains, true effects should produce a fair number of nonsignificant outcomes. I cannot see any motivation to use a method that generates more significant outcomes when we know that there should be a certain number of nonsignificant outcomes.
I appreciate that you are thinking seriously about these statistical issues and that you go to the trouble to write up your thoughts on a blog. I hope you can step back and look at the bigger picture and see that some of your observations do not fit together.
Best wishes,
Greg Francis
Dear Greg, thanks for your comment.
Let's get some things straight first:
There is no 'liking' or 'not liking' the conclusion of TES; it's statistics, and when it comes to data, I'm like Commander Data: I have no feelings about them, they are what they are. My problem with TES is 1) that it points out a problem we have all known existed for 50 years, and 2) that it doesn't solve the problem. The Data Colada blog post today makes the same argument: http://datacolada.org/2014/06/27/24pcurvevsexcessivesignificancetest/
The goal of my blog post here (as well as the recent special issue full of preregistered replication studies I coedited with Brian Nosek) is to solve the problem. Getting people to realize they can submit studies with nonsignificant results might be one way to reduce publication bias.
The goal of my Sequential Analyses paper is not about getting more significant findings (you might want to read it more closely). The goal is to make it more efficient to run well-powered studies. Obviously, higher-powered studies are very important (see Lakens & Evers, 2014). I don't see any inconsistencies in any of my papers and blog posts: they are all ways to improve finding out what is likely to be true. That's also why I don't use TES; it doesn't tell me if a finding is true or not, it just tells me there is publication bias.
On a side note, I'm not completely sure your analysis of our 2009 paper is correct; we can at least debate about it. In Study 3, we say: "Initial analyses revealed no simple effects of clipboard weight on mayor evaluations or city attitudes, all Fs < 1. We continued by regressing city attitudes on clipboard weight (heavy vs. light, contrast-coded), mayor evaluations (continuous and centered), and their interaction term." We don't explicitly say that our initial prediction was not confirmed and that we performed exploratory analyses, but this was 2008, and our statement is pretty close. I would not have taken it along in a TES analysis as a study that confirmed our a priori prediction. But these details don't matter. There was publication bias (just as there is in other studies where TES doesn't work, e.g., one- or two-study papers). I'm not proud we contributed to the file-drawer effect, but I am proud that when we realized the problems with the file-drawer effect, we were the first (and I think still the only) researchers to upload a failed replication of their own work to PsychFileDrawer.
Daniel,
I appreciate your attitude towards statistics and your willingness to discuss these issues. Your problem 1) with the TES is misplaced, and I do not think you actually believe it (I mean that in a good way!). Everyone agrees that there is some publication bias across the field. However, the TES, as I have used it, asks a much more specific question: is there bias within a particular set of studies that relate to a theoretical conclusion? Whether there is bias for topics in other papers hardly makes any statement about bias in any particular paper.
In particular, the knowledge of bias over the last 50 years did not (properly, I would suggest) stop you from publishing your paper. I am inclined to believe that if you and your coauthors had believed the nonsignificant finding was evidence against your theoretical conclusion, then you would not have dropped the experiment (that would be fraud, and I have no reason to think you would do such a thing).
Rather, I suspect that you believed there was some methodological flaw in the design or execution of the experiment and that not reporting it was justified. Most likely, you and your coauthors did not consider the point raised in your recent blog post that for a set of studies with relatively weak power, a nonsignificant finding is expected, from time to time. So, I would suggest that the TES really did point out a problem with the study that you (and others) did not know prior to the analysis.
I readily concede that the TES does not solve the broader problem. I appreciate your (and Brian's and other's) efforts to address the problem; I am skeptical about whether they are going to work, but the effort is laudable. In contrast, today's Data Colada blog post is almost utter nonsense. (The math is valid but almost everything else is wrong.) I wrote Uri about it, but he seems to want to persist in spreading confusion about the TES. I don't know why. My view on these issues is at http://www.sciencedirect.com/science/article/pii/S0022249613000527
We can discuss sequential analyses another time, but I am skeptical about the efficiency claims. More generally, it seems to me that we need to use statistics that allow (and encourage) gathering additional data as needed. Hypothesis testing (sequential or fixed sample) does not do that very well.
Regarding the details of how the TES was applied to your paper: I can only follow the interpretation of the authors about the relation between the statistics and the theoretical claims. The first sentence of the General Discussion in your paper is "In four studies, we obtained evidence that the abstract concept of importance is linked to bodily experiences of weight." I think that makes it clear that, at the time, you felt Study 3 did provide evidence in support of the theoretical claim. If you feel otherwise now, I guess you could try to submit a comment to Psych Science.