A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, June 27, 2014

Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True

Most of this post is inspired by a lecture on probabilities by Ellen Evers during a PhD workshop we taught (together with Job van Wolferen and Anna van ‘t Veer) called ‘How do we know what’s likely to be true’. I’d heard this lecture before (we taught the same workshop at Eindhoven a year ago) but now she extended her talk to the probability of observing a mix of significant and non-significant findings. If this post is useful for you, credit goes to Ellen Evers.

A few days ago, I sent around some questions on Twitter (thanks for answering!) and in this blog post, I’d like to explain the answers. Understanding this is incredibly important and will change the way you look at sets of studies that contain a mix of significant and non-significant results, so you want to read until the end. It’s not that difficult, but you probably want to get a coffee. 42 people answered the questions, and all but 3 worked in science, for anywhere from 1 to 26 years. If you want to do the questions before reading the explanations below (which I recommend), go here.

I’ll start with the easiest question, and work towards the most difficult one.

Running a single study

I asked: You are planning a new study. Beforehand, you judge it is equally likely that the null-hypothesis is true, as that it is false (a uniform prior). You set the significance level at 0.05 (and pre-register this single confirmatory test to guarantee the Type 1 error rate). You design the study to have 80% power if there is a true effect (assume you succeed perfectly). What do you expect is the most likely outcome of this single study?

The four response options were:

1) It is most likely that you will observe a true positive (i.e., there is an effect, and the observed difference is significant).

2) It is most likely that you will observe a true negative (i.e., there is no effect, and the observed difference is not significant)

3) It is most likely that you will observe a false positive (i.e., there is no effect, but the observed difference is significant).

4) It is most likely that you will observe a false negative (i.e., there is an effect, but the observed difference is not significant)

59% of the people chose the correct answer: It’s most likely that you’ll observe a true negative. You might be surprised, because the scenario (5% significance level, 80% power, the null hypothesis (H0) and the alternative hypothesis (H1) are equally likely to be true) is pretty much the prototypical experiment. It thus means that a typical experiment (at least when you think your hypothesis is 50% likely to be true) is most likely not to reject the null-hypothesis (earlier, I wrote 'fail', but in the comments Ron Dotsch correctly points out not rejecting the null can be informative as well). Let’s break it down slowly.

If you perform a single study, the effect you are examining is either true or false, and the difference you observe is either significant or not significant. These four possible outcomes are referred to as true positives, false positives, true negatives, and false negatives. The percentage of false positives equals the Type 1 error rate (or α, the significance level), and false negatives (or Type 2 errors, β) equal 1 minus the power of the study. When the null hypothesis (H0) and the alternative hypothesis (H1) are a-priori equally likely, the significance level is 5%, and the study has 80% power, the relative likelihood of the four possible outcomes of this study before we collect the data is detailed in the table below.

                            H0 True                    H1 True
                            (A-Priori 50% Likely)      (A-Priori 50% Likely)
Significant Finding         False Positive (α)         True Positive (1-β)
Non-Significant Finding     True Negative (1-α)        False Negative (β)

The only way a true positive is most likely (the answer provided by 24% of the participants) given this a-priori likelihood of H0 is when the power is higher than 1-α, so in this example higher than 95%. After asking which outcome was most likely, I asked how likely this outcome was. In the sample of 42 people who filled out my questionnaire, there were people who responded intuitively, and those who did the math. Twelve people correctly reported 47.5%. What’s interesting is that 16 people (more than one-third) reported a percentage higher than 50%. These people might have simply ignored the information that the hypothesis was equally likely to be true as to be false (which implies no outcome can be more than 50% likely), and intuitively calculated probabilities assuming the effect was true, while ignoring the probability that it was not. The modal response of people who had indicated earlier that they thought a true positive was most likely also points in this direction, because they judged it to be 80% probable that this true positive would be observed.
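The four cells of the table above are easy to check with a few lines of code (this is just a sketch; the variable names are mine, not from the original post). Each cell is the prior probability of the hypothesis times the conditional probability of the result:

```python
# Joint probabilities of the four possible outcomes of a single study,
# before data collection: 50% prior, alpha = .05, power = .80.

prior_h1 = 0.5   # a-priori probability that H1 is true
alpha = 0.05     # significance level (Type 1 error rate)
power = 0.80     # 1 - beta

outcomes = {
    "true positive":  prior_h1 * power,              # H1 true, significant
    "false negative": prior_h1 * (1 - power),        # H1 true, non-significant
    "false positive": (1 - prior_h1) * alpha,        # H0 true, significant
    "true negative":  (1 - prior_h1) * (1 - alpha),  # H0 true, non-significant
}

for name, p in outcomes.items():
    print(f"{name}: {p:.1%}")
```

Running this shows the true negative is the single most likely outcome, at 47.5%, matching the correct answer above.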

Then I asked: 

“Assume you performed the single study described above, and have observed a statistical difference (p < .05, but you don’t have any further details about effect sizes, exact p-values, or the sample size). Simply based on the fact that the study is statistically significant, how likely do you think it is you observed a significant difference because you were examining a true effect?”

Eight people (who did the math) answered 94.1%, the correct answer. All but two people who responded intuitively underestimated the correct answer (the average answer was 57%). The remaining two answered 95%, which indicates they might have made the common error of assuming that observing a significant result means it’s 95% likely the effect is true (it’s not, see Nickerson, 2000). It’s interesting that people who responded intuitively overestimated the a-priori chance of a specific outcome, but then massively underestimated the probability that the effect was true after a significant result had been observed. The correct answer is 94.1% because now that we know we did not observe a non-significant result, we are left with only the outcomes in which the result is significant. There was a 2.5% chance of a Type 1 error, and a 40% chance of a true positive. That means the probability that the effect is true, given this significant result, is 40 divided by the total, which is 40+2.5, and 40/(40+2.5) = 94.1%. Ioannidis (2005) calls this post-study probability that the effect is true the positive predictive value, or PPV (thanks to Marcel van Assen for pointing this out).
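The same division can be sketched in code (again, the names are mine): the PPV is the true positive probability divided by the total probability of a significant result.

```python
# Positive predictive value (PPV) for the single-study scenario:
# given a significant result, how likely is it that the effect is true?

prior_h1 = 0.5
alpha = 0.05
power = 0.80

true_positive = prior_h1 * power          # 40% of all outcomes
false_positive = (1 - prior_h1) * alpha   # 2.5% of all outcomes

ppv = true_positive / (true_positive + false_positive)
print(f"PPV: {ppv:.1%}")  # 94.1%
```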

What happens if you run multiple studies?

Continuing the example as Ellen Evers taught it, I asked people to imagine they performed three of the studies described above, and found that two were significant but one was not. How likely would it be to observe this outcome if the alternative hypothesis is true? All people who did the math gave the answer 38.4%. This is the a-priori likelihood of finding 2 out of 3 studies to be significant with 80% power and a 5% significance level. If the effect is true, there’s an 80% probability of finding an effect, times an 80% probability of finding an effect, times a 20% probability of a Type 2 error: 0.8*0.8*0.2 = 12.8%. If you calculate the probability for the three orders in which you can get two out of three significant results (S S NS; S NS S; NS S S), you multiply this by 3, and 3*12.8 gives 38.4%. Ellen prefers to focus on the single outcome you have observed, including the specific order in which it was observed.
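This is just the binomial probability of 2 successes in 3 trials with a success probability of .80, which can be sketched directly (variable names are illustrative):

```python
from math import comb

power = 0.80

# One specific order, e.g. significant, significant, non-significant:
p_one_order = power * power * (1 - power)   # 12.8%

# Any 2-out-of-3 pattern: multiply by the number of possible orderings
p_any_order = comb(3, 2) * p_one_order      # 38.4%

print(f"one order: {p_one_order:.1%}, any order: {p_any_order:.1%}")
```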

I might not have formulated the question clearly enough (most probability statements are so unlike natural language that they can be difficult to formulate precisely), but I tried to ask not for the a-priori probability, but for the probability that, given these observations, the studies examined a true effect (similar to the single-study case above, where the answer was not 80%, but 94.1%). In other words, the probability that H1 is true, conditional on the acceptance of H1, which Ioannidis (2005) calls the PPV. This is the likelihood of finding a true positive, divided by the total probability of finding a significant result (either a true positive or a false positive).

We therefore also need to know how likely it is to observe this finding when the null-hypothesis is true. In that case, we would find a Type 1 error (5%), another Type 1 error (5%), and a true negative (95%), and 0.05*0.05*0.95 = 0.002375, or 0.2375%. There are three ways to get this pattern of results, so if you want the probability of 2 out of 3 significant findings under H0 irrespective of the order, this probability is 0.7125%. That’s not very likely at all.

To answer the question, we need to calculate 12.8/(12.8+0.2375) (for the specific order in which the results were observed) or 38.4/(38.4+0.7125) (for any 2 out of 3 studies) and both calculations give us 98.18%. Although a-priori it is not extremely likely to observe 2 significant and 1 non-significant finding, after you have observed this outcome, it is more than 98% likely to have observed 2 significant and one non-significant result in three studies when the effect is true (and thus only 1.82% when the effect is not true).
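Putting the two pieces together in code (a sketch with illustrative names; note that because H0 and H1 are a-priori equally likely, the 50% prior cancels out of the ratio):

```python
# PPV for observing 2 significant + 1 non-significant result, combining
# the probability of the data under H1 (12.8% per ordering) and under
# H0 (0.2375% per ordering). With a 50% prior, the prior cancels.

alpha = 0.05
power = 0.80

p_data_h1 = power * power * (1 - power)   # 0.128
p_data_h0 = alpha * alpha * (1 - alpha)   # 0.002375

ppv = p_data_h1 / (p_data_h1 + p_data_h0)
print(f"PPV: {ppv:.2%}")  # 98.18%
```

Multiplying both numerator and denominator by 3 (for the three orderings) gives the same 98.18%, as noted above.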

The probability that, given that you observed a mix of significant and non-significant studies, the effect you observed was true, is important to understand correctly if you do research. In a time where sets of 5 or 6 significant low-powered studies are criticized for being ‘too good to be true’ it’s important that we know when a set of studies with a mix of significant and non-significant studies is ‘too true to be bad’. Ioannidis (2005) briefly mentions you can extend the calculations for multiple studies, but focuses too much on when findings are most likely to be false. What struck me from the lecture Ellen Evers gave, is how likely some sets of studies that include non-significant findings are to be true.

These calculations depend on the power, significance level, and a-priori likelihood that H0 is true. If Ellen and I ever find the time to work on a follow up to our recent article on Practical Recommendations to Increase the Informational Value of Studies, I would like to discuss these issues in more detail. To interpret whether 1 out of 2 studies is still support for your hypothesis, these values matter a lot, but to interpret whether 4 out of 6 studies are support for your hypothesis, they are almost completely irrelevant. This means that one or two non-significant findings in a larger set of studies do almost nothing to reduce the likelihood that you were examining a true effect. If you’ve performed three studies that all worked, and a close replication isn’t significant, don’t get distracted by looking for moderators, at least until the unexpected result is replicated.
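The claim that the prior hardly matters for larger sets can be checked with a small sketch (the `ppv` function below is mine, not from the post or the spreadsheet): compute the PPV for 4 significant results out of 6 studies under different priors.

```python
from math import comb

def ppv(k, n, prior_h1=0.5, alpha=0.05, power=0.80):
    """PPV for observing exactly k significant results out of n studies."""
    p_data_h1 = comb(n, k) * power**k * (1 - power)**(n - k)
    p_data_h0 = comb(n, k) * alpha**k * (1 - alpha)**(n - k)
    return (prior_h1 * p_data_h1) / (
        prior_h1 * p_data_h1 + (1 - prior_h1) * p_data_h0
    )

# Even with a skeptical 10% prior, 4 out of 6 significant results
# leave the PPV above 99%.
for prior in (0.1, 0.5, 0.9):
    print(f"prior {prior:.0%}: PPV for 4/6 = {ppv(4, 6, prior):.4%}")
```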

I've taken the spreadsheet Ellen Evers made and shared with the PhD students, and extended it slightly. You can download it here, and use it to perform your own calculations with different levels of power, significance levels, and a-priori likelihoods of H0. On the second tab of the spreadsheet, you can perform these calculations for studies that have different power and significance levels. If you want to start trying out different options immediately, use the online spreadsheet below:

If we want to reduce publication bias, understanding (I mean, really understanding) that sets of studies that include non-significant findings are extremely likely, assuming H1 is true, is a very important realization. Depending on the number of studies, their power, significance level, and the a-priori likelihood of the idea you were testing, it can be no problem to submit a set of studies with mixed significant and non-significant results for publication. If you do, make sure that the Type 1 error rate is controlled (e.g., by pre-registering your study design). 

I want to end with a big thanks to Ellen Evers for explaining this to me last week, and thanks so much to all of you who answered my questionnaire about probabilities.


  1. Do you know any paper that contains these explanations? Alternatively, how could I cite your blog?

    1. Hi, I don't think there's a paper out there that specifies the calculations and probabilities for multiple studies like Ellen Evers did in the lecture, and I detail here in this blog. I can understand that, if you are about to submit a set of studies with some non-significant findings, you want to point the reviewers and editors here. For now, you can cite it as:

Lakens, D., & Evers, E. R. K. (2014, June 27). Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True. Retrieved from http://daniellakens.blogspot.nl/2014/06/too-true-to-be-badwhen-sets-of-studies.html

      Thanks for your comment - that really motivates us to write it up for a real paper - we'll try to find the time, but Ellen is busy writing up her PhD thesis, so it might be a month or 2.

  2. Very interesting, Daniel (and Ellen!), thank you for this. I looked at the questions earlier and decided that I'd wait for the responses, because honestly, despite having thought about these things a lot, I wouldn't have been able to come up with the correct answers some of the times.

    Just a short question about terminology. You write that a true negative is a failed experiment: "59% of the people chose the correct answer: It’s most likely that you’ll observe a true negative. [...] It thus means that a typical experiment (at least when you think your hypothesis is 50% likely to be true) is most likely to fail." -- I would say that in the case of true negatives and true positives, the experiment did not fail at all, as it led the researcher to draw the correct conclusion. Am I missing something?

Hi Ron, completely right - the term 'fail' is not correct - I'll update the blog post! Finding a true negative can be very interesting. Just as Ioannidis focuses too much on findings that are most probably false, I focused too much on findings that are most probably true positives - you could shift the focus and talk about the probability that you find a true negative, which can be just as important!

  3. I am tempted to state the conclusion in a slightly different way. What matters in practical situations, as I understand it, is the number of studies that observed significant effects. The number that failed to reach significance is much less important, because failure to reach significance is not itself evidence for anything that matters, except in cases where you have prior information that you are unlikely to have in practice.

  4. Daniel,

    This is a very clear discussion of an important issue. However, I feel I must add two thoughts that fit very well with your thesis but are inconsistent with some of your other postings (on this blog and elsewhere).

1) Your suggestion that a set of mixed results is likely for common experimental designs in psychology is entirely consistent with the Test for Excess Success (TES) that I have used (but you have criticised on Twitter). The TES simply looks for the absence of the expected non-significant findings. For example, in Jostmann, Lakens and Schubert (2009) every experiment was significant, but the estimated probability of a significant outcome (assuming H1 is true) for each of the four experiments is: 0.512, 0.614, 0.560, and 0.512. The probability of all four such experiments rejecting the null is the product of the probabilities: 0.09. The low probability implies that something is wrong with the data collection, data analysis, or reporting of these studies. Corroborating this claim, Jostmann reported an unpublished non-significant finding at PsychFileDrawer. Details of this analysis are in Francis (2014, Psychonomic Bulletin & Review).
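[The product Greg describes can be verified in a couple of lines; this sketch uses the per-experiment success probabilities he reports above.]

```python
# Probability that all four experiments reject the null, given the
# estimated power of each experiment (values from Francis's comment).
powers = [0.512, 0.614, 0.560, 0.512]

p_all_significant = 1.0
for p in powers:
    p_all_significant *= p

print(f"{p_all_significant:.2f}")  # 0.09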

    I can understand you not liking the conclusion of the TES in this particular case, but it appears to be true and it follows logically from the observations you made in the post.

    2) In other blog posts and articles you have encouraged the use of sequential methods rather than fixed sampling approaches because the former allows a researcher to generate more significant findings. I don’t think this claim is necessarily true, but even if it were true it seems like a silly goal. As your post explains, true effects should produce a fair number of non-significant outcomes. I cannot see any motivation to use a method that generates more significant outcomes when we know that there should be a certain number of non-significant outcomes.

    I appreciate that you are thinking seriously about these statistical issues and that you go to the trouble to write up your thoughts on a blog. I hope you can step back and look at the bigger picture and see that some of your observations do not fit together.

    Best wishes,

    Greg Francis

    1. Dear Greg, thanks for your comment.

      Let's get some things straight first:

      There is no 'liking' or 'not liking' the conclusion of TES - it's statistics, and when it comes to data, I'm like Commander Data. I have no feelings about them, they are what they are. My problem with TES is 1) that it is pointing out a problem we have all known for 50 years existed, and 2) it doesn't solve the problem. The Data Colada blog post today makes the same argument: http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/

      The goal of my blog post here (as well as the recent special issue full of pre-registered replication studies I co-edited with Brian Nosek) is to solve the problem. Getting people to realize they can submit studies with non-significant results might be one way to reduce publication bias.

      The goal of my Sequential Analyses paper is not about getting more significant findings (you might want to read it more closely). The goal is to make it more efficient to run well-powered studies. Obviously, higher powered studies are very important (see Lakens & Evers, 2014). I don't see any inconsistencies in any of my papers and blog posts: They are all ways to improve finding out what is likely to be true. That's also why I don't use TES - it doesn't tell me if it's true or not, it just tells me there is publication bias.

      On a sidenote, I'm not completely sure your analysis of our 2009 paper is correct - we can at least debate about it. In Study 3, we say: "Initial analyses revealed no simple effects of clipboard weight on mayor evaluations or city attitudes, all Fs<1. We continued by regressing city attitudes on clipboard weight (heavy vs. light, contrast-coded), mayor evaluations (continuous and centered), and their interaction term." We don't explicitly say: our initial prediction was not confirmed, and we performed exploratory analyses, but this was 2008, and our statement is pretty close. I would not have taken it along in a TES analysis as a study that confirmed our a-priori prediction. But these details don't matter. There was publication bias (just as there is in other studies where TES doesn't work, e.g., one or two study papers). I'm not proud we contributed to the filedrawer effect, but I am proud that when we realized the problems with the filedrawer effect, we were the first (and I think still the only) researchers to upload a failed replication of their own work to PsychFileDrawer.

    2. Daniel,

      I appreciate your attitude towards statistics and your willingness to discuss these issues. Your problem 1) with the TES is misplaced, and I do not think you actually believe it (I mean that in a good way!). Everyone agrees that there is some publication bias across the field. However, the TES, as I have used it, asks a much more specific question: is there bias within a particular set of studies that relate to a theoretical conclusion? Whether there is bias for topics in other papers hardly makes any statement about bias in any particular paper.

      In particular, the knowledge of bias over the last 50 years did not (properly, I would suggest) stop you from publishing your paper. I am inclined to believe that if you and your co-authors had believed the non-significant finding was evidence against your theoretical conclusion, then you would not have dropped the experiment (that would be fraud, and I have no reason to think you would do such a thing).

      Rather, I suspect that you believed there was some methodological flaw in the design or execution of the experiment and that not reporting it was justified. Most likely, you and your co-authors did not consider the point raised in your recent blog post that for a set of studies with relatively weak power, a non-significant finding is expected, from time to time. So, I would suggest that the TES really did point out a problem with the study that you (and others) did not know prior to the analysis.

      I readily concede that the TES does not solve the broader problem. I appreciate your (and Brian's and other's) efforts to address the problem; I am skeptical about whether they are going to work, but the effort is laudable. In contrast, today's Data Colada blog post is almost utter nonsense. (The math is valid but almost everything else is wrong.) I wrote Uri about it, but he seems to want to persist in spreading confusion about the TES. I don't know why. My view on these issues is at http://www.sciencedirect.com/science/article/pii/S0022249613000527

      We can discuss sequential analyses another time, but I am skeptical about the efficiency claims. More generally, it seems to me that we need to use statistics that allow (and encourage) gathering additional data as needed. Hypothesis testing (sequential or fixed sample) does not do that very well.

Regarding the details of how the TES was applied to your paper: I can only follow the interpretation of the authors about the relation between the statistics and the theoretical claims. The first sentence of the General Discussion in your paper is "In four studies, we obtained evidence that the abstract concept of importance is linked to bodily experiences of weight." I think that makes it clear that, at the time, you felt Study 3 did provide evidence in support of the theoretical claim. If you feel otherwise now, I guess you could try to submit a comment to Psych Science.