A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, September 12, 2015

Researchers who don't share their file-drawer for a meta-analysis

I’ve been reviewing a number of meta-analyses in the last few months, and want to share a problematic practice I’ve noticed: many researchers do not share their unpublished data when colleagues performing a meta-analysis send around requests for unpublished datasets.

It’s like these colleagues, standing right in front of a huge elephant in the room, say: “Elephant? Which elephant?” Please. We can all see the elephant, and do the math. If you have published multiple studies on a topic (as many of the researchers who have become associated with specific lines of research have), it is often very improbable that you have no file drawer.

If a meta-analytic effect size suggests an effect of d = 0.4 (not uncommon), and you contribute 5 significant studies with 50 participants in each between-subjects condition (a slightly optimistic sample size perhaps, but ok), you had approximately 50% power. If there is a true effect, finding a significant effect five times in a row happens, but only 0.5*0.5*0.5*0.5*0.5 = 3.125% of the time. The probability that someone contributes 10 significant, but no non-significant, studies to a meta-analysis is about 0.1% if they had 50% power. Take a look at some published meta-analyses. This happens. These are the people I’m talking about.
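For those who want to check the arithmetic, the calculation above is just the power raised to the number of studies. A quick sketch, using the 50% power figure from the text and assuming independent study outcomes:

```python
# Probability that ALL studies in a set come out significant, assuming
# each study has 50% power (the figure used above) and the outcomes
# are independent.
power = 0.5

for k in (5, 10):
    print(f"{k} studies, all significant: {power ** k:.3%}")
# 5 studies, all significant: 3.125%
# 10 studies, all significant: 0.098%
```

With realistic power, an all-significant set of ten studies is roughly a one-in-a-thousand event, which is exactly why such sets set off alarm bells.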

I think we should have a serious discussion about how we let it slide when researchers don't share their file drawer after colleagues request it for a meta-analysis. Not publishing these studies in the first place was clearly undesirable, but it was also pretty difficult in the past. Meta-analyses, however, are one of the rare occasions when non-significant findings can enter the literature. Not sharing your file drawer, when colleagues explicitly ask for it, is something that rubs me the wrong way.

Scientists who do not share their file drawer are like people who throw their liquor bottle onto a bicycle lane (yeah, I’m Dutch, we have bicycle lanes everywhere, and sometimes we have people who drop liquor bottles on them). First of all, you are being wasteful by not recycling data that would make the world a better (more accurate) place. Second, you can be pretty sure that every now and then, some students on their way to a PhD will ride through your glass and get a flat tire. The delay is costly. If you don’t care about that, I don’t like you.

If you don’t contribute non-significant studies, not only are you displaying a worrisome lack of interest in good science, and a very limited understanding of how improbable it is to find only significant results, but you are actually making the meta-analysis less believable. When people don’t share non-significant findings, the alarm bells of every statistical technique that tests for publication bias will go off. Techniques that estimate the true effect size while correcting for publication bias (like meta-regression or p-curve analysis) will be more likely to conclude there is no effect. So not only will you be clearly visible as someone who does not care about science, but you are shooting yourself in the foot if your goal is to make sure the meta-analysis reveals an effect size estimate significantly larger than 0.

I think this is something we need to deal with if we want to improve meta-analyses. A start could be to complement trim-and-fill analyses (which test for missing studies) with a more careful examination of which researchers are not contributing their missing studies to the meta-analysis. It might be a good idea to send these people an e-mail when you have identified them, to give them the opportunity to decide whether, on second thought, it is worth the effort to locate and share their non-significant studies.


  1. Thanks for pointing me to this post on Twitter. Let me elaborate my point a bit more than I could there. First of all, I can't resist my familiar spiel on post-hoc probabilities: I still take issue with the idea that you can use observed effect sizes to derive the power of a study. Power is an a priori concept and becomes meaningless once the data have been collected. The way I see it, meta-analytic effect sizes are not exempt from this.

    Having said that, pragmatically speaking (and I am usually a pragmatist), you are probably right in most cases. The original studies were probably underpowered, and so the probability that they are all significant should indeed be very small. I just question the idea that you can estimate this accurately.

    So something fishy is probably happening here. But I find it difficult to believe that there is a massive file drawer. Perhaps this again has something to do with the difference between (social) psychology and neuroscience, but I honestly can't believe that this can be common. Just imagine doing this in neuroimaging. You'd burn through thousands if not millions in research funds with very little to show for it. That's completely unrealistic. It's more feasible in behavioural, psychophysical experiments, but even there I don't think the file drawer for most people can be as massive as this.

    Don't get me wrong. I'm sure there is some file drawer. I have known of some cases where substantial amounts of data were thrown away, even in imaging studies. But both from personal experience and simply theoretically, the problem can't be as widespread as measures of 'publication bias' suggest.

    Much unpublished data may also be impossible to combine meta-analytically because it tests separate concepts. For instance, I know of one neuroimaging study where the authors threw away at least a third of the data collected. They tested a different condition (which isn't mentioned in the published paper). I have no idea why the data were excluded, but I presume it's because they couldn't make sense of the results... This is really abhorrent in my opinion, but it wouldn't make sense to meta-analyse this result together with the published ones because they tested conceptually different effects.

    In my view this is certainly a questionable research practice. In this case, it doesn't enhance the statistical significance of the published finding - it just makes it impossible to interpret it in context. Presumably the unpublished results would cast doubt on the validity of the published findings.

    But many QRPs do enhance the significance (and in fact I believe some were also used in this case). Dropping outliers willy-nilly, changing your design halfway through the experiment (Bem's Psi experiment 1, anyone? :P), optional stopping, creative data analysis, etc. can all fundamentally change the outcome (I know I'm not telling you anything you didn't know. You know this stuff better than I do!). You said on Twitter that you can't p-hack 100% of your experiments to be significant. I don't think you have to, but my guess is the percentage where this succeeds is in the mid-90s.

  2. Of course, also relevant to this topic (what is and isn't justifiable to exclude from a meta-analysis) is the discussion we had a while ago about piloting. I am still hoping for a comment from you on my blog post on this ;)

  3. Daniel, it is nice to see that you embrace the ideas of the test for excess success.

    Sam, I think you are right that most people do not have file drawers bursting with non-significant outcomes. My guess is that most of the time non-promising experiments are identified early in data collection (e.g., after just 10% of the planned sample size). If the result looks promising, the study is continued as planned; but otherwise the study is terminated early and the researcher can (honestly) say that he does not know what would have happened if the study were run to completion, so he does not feel obligated to share the partial experiment (it does not even go into a file drawer). A problem with this approach is that the studies that run to completion are often those that had a (random) head start to show the effect.

    There are two other strategies that I suspect thin out the file drawer but still produce biased experiment sets. The first is identification of boundary conditions that mitigate the effect. The second is to measure several different variables and then generate a coherent story across a variety of experiments with different measures (or by using the one measure that happened to hit consistently). These approaches provide a lot of flexibility and ensure that the (expensive) neuroscience studies do not seem to be wasted.

    I think this characterization of the problem explains why so many people do not report their unpublished non-significant outcomes when asked to do so for a meta-analysis. From their perspective, they do not have (many) studies in a file drawer. Sure, there were the studies that terminated early, but no one wants those (and the data probably were not even saved). Likewise, there were those studies that showed the effect was weaker when subjects used their left hand, but that's a different effect. Finally, there were other measures that did not show the effect, but the researcher did not (in a post hoc sense, once he generated the coherent story) expect those measures to show the effect.

    From my perspective, publication bias is largely a by-product of the deeper problem of relating empirical data to theoretical conclusions. If one does not have a specific theoretical prediction to test, it is really difficult to test a theory. Likewise, trying to generate a theory from empirical data is quite difficult; and simply following whatever the data tell you is often a recipe for tracking noise in the data (and signal too, but surely noise).

    1. Hi Greg, incidentally I just looked at what happens when you stop running experiments after only a few observations when they don't show anything "promising". You would still make a substantial number of erroneous inferences in that situation; that is, even for healthy mid-range effect sizes that are typical in the field, you would make a type 2 error around a third of the time. So I think it would be unwise not to share those kinds of pilot findings. Even if publishing them would be too much hassle (which I can understand), they should be included in meta-analyses.

      I think you're completely right about the flexibility in neuroscience. In these discussions it is also often easy to forget that the problem may not be whether an effect exists (i.e. that H0 is true) but that the effect is being overestimated by biased studies. I know you are aware of this, I'm sure Daniel is aware of this, and most people who actually discuss this are aware of this - but my feeling is the broader debate often misses this point.

      It's interesting what you say about what people regard as being the same effect. I think it's a bit more complex than this. In the example I gave above, the issue was trickier than "the effect was weaker with the left hand". It was a different condition and, at least from a naive perspective, it was not compatible with the published effect (from a less naive point of view I would probably disagree with that though - there were prior behavioural data showing that it was likely to produce a similar effect...). Of course I personally think all of this is kind of moot: the experiment should have been published in the same study instead of pretending that it never happened! It could then be included in meta-analysis as well.

      Anyway, it's important to distinguish different things here. As I argued on my blog, piloting is perfectly valid. You shouldn't clog up meta-analyses with garbage either. If you have criteria, orthogonal to the outcome measure, that tell you an experiment is of insufficient quality to contribute to our knowledge, then it seems entirely correct to me to exclude it.
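A minimal simulation of the "terminate early if it doesn't look promising" strategy discussed in the thread above can be sketched as follows. All settings here (a true effect of d = 0.5, a peek after 5 of a planned 50 participants per group, and a lenient rule that counts any effect in the predicted direction as promising) are illustrative assumptions, not the commenters' actual parameters:

```python
import numpy as np

# Sketch of the early-termination strategy: peek at ~10% of the planned
# sample and abandon the study if the effect doesn't look "promising".
# All parameters below are illustrative assumptions.
rng = np.random.default_rng(2015)

d_true = 0.5      # true standardized effect (sd = 1 in both groups)
n_peek = 5        # peek after ~10% of a planned n = 50 per group
n_sims = 20_000

abandoned = 0
for _ in range(n_sims):
    treatment = rng.normal(d_true, 1.0, n_peek)
    control = rng.normal(0.0, 1.0, n_peek)
    # Lenient "promising" rule: continue only if the observed effect
    # points in the predicted direction.
    if treatment.mean() - control.mean() <= 0:
        abandoned += 1

print(f"True effects abandoned at the peek: {abandoned / n_sims:.1%}")
```

Even with this very lenient rule, roughly a fifth of studies of a real d = 0.5 effect get abandoned at the peek (analytically about 21% in this setup); demanding a promising observed effect size, rather than just the right sign, pushes the error rate toward the one-in-three figure mentioned above.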
