A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, August 24, 2014

On the Reproducibility of Meta-Analyses

I have no idea how many people make the effort to reproduce a meta-analysis in their spare time. What I do know, based on my personal experiences of the last week, is that A) it’s too much work to reproduce a meta-analysis, primarily due to low reporting standards, and B) we need to raise the bar when doing meta-analyses. At the end of this post, I’ll explain how to do a meta-analysis in R in five seconds (assuming you have effect sizes and sample sizes for each individual study) to convince you that you can produce (or reproduce) meta-analyses yourself.

Any single study is nothing more than a data-point in a future meta-analysis. In recent years researchers have shared a lot of thoughts and opinions about reporting standards for individual studies, ranging from disclosure statements, additional reporting of alternative statistical results whenever a researcher had the flexibility to choose from multiple analyses, to sharing all the raw data and analysis files. When it comes to meta-analyses, reporting standards are even more important.

Recently I tried to reproduce a meta-analysis (by Sheinfeld Gorin, Krebs, Badr, Janke, Jim, Spring, Mohr, Berendsen, & Jacobsen, 2012, with the title “Meta-Analysis of Psychosocial Interventions to Reduce Pain in Patients With Cancer”, published in the Journal of Clinical Oncology, which has an IF of 18; the article has been cited 38 times) for a talk about statistics and reproducibility at the International Conference of Behavioral Medicine. Of the 38 effect sizes included in the meta-analysis I could reproduce 27 effect sizes (71%). Overall, I agreed with the way the original effect size was calculated for 18 articles (47%). I think both these numbers are too low. It could be my lack of ability in calculating effect sizes (let's call it a theoretical possibility) and I could be wrong in all cases in which I disagreed with which effect size to use (I offered the authors of the meta-analysis the opportunity to comment on this blog post, which they declined). But we need to make sure meta-analyses are 100% reproducible, if we want to be able to discuss and resolve such disagreements.

For three papers, statistics were not reported in enough detail for me to calculate effect sizes. The researchers who performed the meta-analysis might have contacted authors for the raw data in these cases. If so, it is important that authors of a meta-analysis share the summary statistics their effect size estimate is based on. Without additional information, those effect sizes are not reproducible by reviewers or readers. After my talk, an audience member noted that sharing data you have gotten from someone would require their permission - so if you ask for additional data when doing a meta-analysis, also ask to be able to share the summary statistics you will use in the meta-analysis to improve reproducibility.

For 9 studies, the effect sizes I calculated differed substantially from those by the authors of the meta-analysis (so much that it's not just due to rounding differences). It is difficult to resolve these inconsistencies, because I do not know how the authors calculated the effect size in these studies. Meta-analyses should give information about the data effect sizes are based on. A direct quote from the article that contains the relevant statistical test, or pointing to a row and column in a Table that contains the means and standard deviations would have been enough to allow me to compare calculations. 
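To illustrate how little needs to be disclosed to make an effect size checkable, here is a minimal sketch of Cohen's d computed from the means, standard deviations, and sample sizes of two groups, using the pooled standard deviation. It is written in Python rather than R only so that it runs without installing any packages. The function name and the numbers are made up for illustration and do not come from any of the studies discussed in this post.

```python
import math


def cohens_d(mean_control, sd_control, n_control,
             mean_treatment, sd_treatment, n_treatment):
    """Standardized mean difference (treatment minus control), pooled SD."""
    pooled_variance = (
        (n_control - 1) * sd_control ** 2
        + (n_treatment - 1) * sd_treatment ** 2
    ) / (n_control + n_treatment - 2)
    return (mean_treatment - mean_control) / math.sqrt(pooled_variance)


# Hypothetical summary statistics, as they might be read from a table.
d = cohens_d(mean_control=5.0, sd_control=2.0, n_control=40,
             mean_treatment=6.0, sd_treatment=2.0, n_treatment=40)
print(round(d, 2))
```

With the table row and column known, anyone can plug in the same six numbers and see whether they arrive at the same d. Note that the sign depends on which group is subtracted from which, which is one more reason to state the calculation explicitly.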

We might still have disagreed about which effect size should be included, as was the case for 10 articles where I could reproduce the effect size the authors included, but where I would use a different effect size estimate. The most noteworthy disagreement probably was a set of three articles the authors included, namely:

de Wit R, van Dam F, Zandbelt L, et al: A pain education program for chronic cancer pain patients: Follow-up results from a randomized controlled trial. Pain 73:55-69, 1997

de Wit R, van Dam F: From hospital to home care: A randomized controlled trial of a pain education programme for cancer patients with chronic pain. J Adv Nurs 36:742-754, 2001a

de Wit R, van Dam F, Loonstra S, et al: Improving the quality of pain treatment by a tailored pain education programme for cancer patients in chronic pain. Eur J Pain 5:241-256, 2001b

The authors of the meta-analysis calculated three effect sizes for these studies: 0.21, 0.14, and -0.19. I had a research assistant prepare a document with as much statistical information about the articles as possible, and she noticed that in her calculations, the effect sizes of De Wit et al (2001b) and De Wit et al (1997) were identical. I checked De Wit 2001a (referred to as De Wit 2002 in the forest plot in the meta-analysis) and noticed that all three studies reported the data of 313 participants. It’s the same data, written up three times. It’s easy to miss, because the data is presented in slightly different ways, and there are no references to earlier articles in the later articles (in the 2001b article, the two earlier articles are in the reference list, but not mentioned in the main text). I contacted the corresponding second author for clarifications, but received no reply (I would still be happy to add any comments I receive). In any case, since this is the same data, and since the effect sizes are not independent, it should only be included in the meta-analysis once.

A second disagreement comes from Table 2 in Anderson, Mendoza, Payne, Valero, Palos, Nazario, Richman, Hurley, Gning, Lynch, Kalish, and Cleeland (2006; “Pain Education for Underserved Minority Cancer Patients: A Randomized Controlled Trial”, also published in the Journal of Clinical Oncology), reproduced below:

See if you can find the seven similar means and standard deviations – a clear copy-paste error. Regrettably, this makes it tricky to calculate the difference on the Pain Control Scale, because the reported values might not be correct. I contacted the corresponding first author for clarifications, but have received no reply (but the production office of the journal is looking into it).

There are some effect size calculations where I strongly suspect errors were made, for example because adjusted means from an ANCOVA seem to be used instead of unadjusted means, or the effect size seems to be based on part of the data (only post-scores instead of the differences on Time1 and Time2 change scores, or the effect size in the intervention condition instead of the difference between the intervention and control condition). To know this for sure, the authors should have shared the statistics their effect size calculations were based on. I could be wrong, but disagreements can only be resolved if the data the effect sizes are calculated on is clearly communicated together with the meta-analysis.

The most important take home message at this point should be that A) there are enough things that researchers can disagree about if you take a close look at published meta-analyses, and B) the only way to resolve these disagreements is by full disclosure about how the meta-analysis was performed. All meta-analyses should include a meta-analysis disclosure table with the publication which provides a detailed description of the effect sizes that were used, including copy-pasted sentences from the original article or references to rows and columns in Tables that contain the relevant data. In p-curve analyses (Simonsohn, Nelson, & Simmons, 2013) such disclosure tables are required, including alternative effects that could have been included and a description of the methods and design of the study. All meta-analyses should include a disclosure table with information on how effect sizes were calculated.
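As a rough sketch of what such a disclosure table could look like in machine-readable form, the Python snippet below writes one row per effect size to CSV. The column names are only a suggestion, not an established standard, and the single row contains placeholder values rather than data from any real study.

```python
import csv
import io

# Suggested (hypothetical) columns for a meta-analysis disclosure table.
COLUMNS = [
    "study",
    "effect_size_used",
    "source_of_statistics",  # direct quote, or table row and column
    "calculation",
    "alternative_effects",
    "design_and_methods",
]

rows = [
    {
        "study": "Placeholder Study A",
        "effect_size_used": "d = 0.30",
        "source_of_statistics": "Table 2, row 'Pain intensity', columns M and SD",
        "calculation": "difference in post-test means divided by pooled SD",
        "alternative_effects": "change-score difference (not used)",
        "design_and_methods": "two-group randomized trial, post-test only",
    },
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
disclosure_csv = buffer.getvalue()
print(disclosure_csv)
```

A plain CSV file like this can be shared alongside the analysis script, so a reader can trace every effect size back to a sentence or table cell in the original article.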

Inclusion Criteria: The Researcher Degrees of Freedom of the Meta-Analyst

The choice of which studies you do or do not include in a meta-analysis is necessarily subjective. It requires researchers to determine what their inclusion criteria are, and to decide whether a study meets their inclusion criteria or not. More importantly, if meta-analysts share all the data their meta-analysis is based on, it’s easy for reviewers or readers to repeat the analysis, based on their own inclusion criteria. In the meta-analysis I checked, 3 types of interventions to reduce pain in cancer patients were used. The first is pain management education, which involves increasing knowledge about pain, how to treat pain, and when and how to contact healthcare providers when in pain (for example to change their pain treatment). The second is hypnosis, provided in individual sessions by a therapist, often tailored to each patient, consisting of for example suggestions for pleasant visual imagery and muscle relaxation. The third is relaxation and cognitive coping skills, consisting of training and practice in relaxation exercises, attention diversion, and positive affirmations.

When doing a random effects meta-analysis, effects under investigation should be ‘different but similar’ and not ‘different and unrelated’ (Higgins, Thompson, & Spiegelhalter, 2009). If there is heterogeneity in the effect size estimate, you should not just stop after reporting the overall effect size, but examine subsamples of studies. I wanted to know whether the conclusion of a positive effect size that was statistically different from zero over all studies would also hold for the subsamples (and whether the subsets would no longer show heterogeneity). It turns out that the evidence for pain management education is pretty convincing, while the effect size estimates for the relaxation interventions were less convincing. The hypnosis intervention (sometimes consisting of only a 15 minute session) yielded effect sizes that were twice as large, but based on my calculations and after controlling for outliers, were not yet convincing. Thus, even though I disagreed on which effect sizes to include, based on the set of studies selected from the literature (which is in itself another interesting challenge for reproducibility!) the main difference in conclusions was based on which effects were 'different but similar'.
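The idea of pooling each subsample separately and checking its heterogeneity can be sketched as follows. This is a minimal Python illustration that assumes the DerSimonian-Laird estimator for the between-study variance, which is one common choice and not necessarily the one used in the analyses described here. The studies listed are made up and only serve to show the mechanics.

```python
import math
from collections import defaultdict


def random_effects(effects, variances):
    """DerSimonian-Laird pooling. Returns (estimate, se, tau2, q)."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    estimate = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    return estimate, se, tau2, q


# Made-up studies: (intervention type, effect size d, sampling variance of d).
studies = [
    ("education", 0.30, 0.02),
    ("education", 0.25, 0.03),
    ("education", 0.35, 0.04),
    ("hypnosis", 0.90, 0.10),
    ("hypnosis", 0.20, 0.08),
    ("relaxation", 0.10, 0.05),
    ("relaxation", 0.40, 0.06),
]

by_type = defaultdict(list)
for kind, d, variance in studies:
    by_type[kind].append((d, variance))

subgroup_results = {}
for kind, rows in by_type.items():
    est, se, tau2, q = random_effects([d for d, _ in rows],
                                      [v for _, v in rows])
    subgroup_results[kind] = (est, se, tau2, q)
    print(f"{kind}: d = {est:.2f}, "
          f"95% CI [{est - 1.96 * se:.2f}; {est + 1.96 * se:.2f}], "
          f"tau^2 = {tau2:.3f}, Q = {q:.2f} (df = {len(rows) - 1})")
```

Because the grouping variable is just a label on each study, a reader who disagrees with the inclusion criteria can relabel or drop studies and rerun the same few lines.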

You can agree or disagree with my calculations. But what’s most important is that you should be able to perform your own meta-analysis on publicly shared, open, and easily accessible data, to test your own ideas of which effects should and should not be included.

Performing a meta-analysis in R

I had no idea how easy doing a meta-analysis was in R (fun fact: when I was talking about this to someone, she pointed out the benefits of not sharing this too widely, to have an individual benefit of 'knowing how to do meta-analyses' - obviously, I think the collective benefit of everyone being able to do or check a meta-analysis is much greater). I did one small-scale meta-analysis once (Lakens, 2012), mainly by hand, which was effortful. Recently, I reviewed a paper by Carter & McCullough (2014) where the authors were incredibly nice to share their entire R script alongside their (very interesting) paper. I was amazed how easy it was to reproduce (or adapt) meta-analyses this way. If this part is useful, credit goes to Carter and McCullough and their R script (their script contains many more cool analyses, such as tests of excessive significance, and PET-PEESE meta-regressions, which are so cool they deserve an individual blog post in the future).

All you need to have to do a meta-analysis is the effect size for each study (for example Cohen’s d) and the sample size in each of the two conditions Cohen’s d is based on. The first vector, es.d, contains five effect sizes from 5 studies. The n1 and n2 vectors contain the sample sizes for the control condition (n1) and the experimental condition (n2). That’s all you need to provide, and assuming you’ve calculated the effect sizes (not to brag, but I found my own Excel sheets to calculate effect sizes that accompany my 2013 effect size paper very useful in this project) and coded the sample sizes, the rest of the meta-analysis takes 5 seconds. You need to copy-paste the entire code below in R or RStudio (both are free) and first need to install the meta and metafor packages. After that, you just insert your effect sizes and sample sizes, and run it. The code below is by Carter and McCullough, with some additions I made.

The output you get will contain the results of the meta-analysis showing an overall effect size of d = 0.31, 95% CI [0.13; 0.50]:

                  95%-CI %W(fixed) %W(random)
1  0.38 [ 0.0571; 0.7029]     33.62      33.62
2  0.41 [ 0.0136; 0.8064]     22.32      22.32
3 -0.14 [-0.7387; 0.4587]      9.78       9.78
4  0.63 [-0.0223; 1.2823]      8.24       8.24
5  0.22 [-0.1470; 0.5870]     26.04      26.04

Number of studies combined: k=5

                                     95%-CI      z p.value
Fixed effect model   0.3148 [0.1275; 0.502] 3.2945   0.001
Random effects model 0.3148 [0.1275; 0.502] 3.2945   0.001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 2.12]; I^2 = 0% [0%; 77.8%]

Test of heterogeneity:
    Q d.f.  p.value
 3.75    4   0.4411
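The pooled estimate in this output can be checked by hand. The sketch below is not the R script by Carter and McCullough and does not use the meta or metafor packages. It is a small Python illustration that assumes a standard inverse-variance fixed-effect model, and it backs out each study's standard error from the width of the 95% confidence interval printed above, because the sample sizes themselves are not shown in the output.

```python
import math

# (effect size d, lower 95% limit, upper 95% limit), copied from the output.
studies = [
    (0.38, 0.0571, 0.7029),
    (0.41, 0.0136, 0.8064),
    (-0.14, -0.7387, 0.4587),
    (0.63, -0.0223, 1.2823),
    (0.22, -0.1470, 0.5870),
]

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

effects = [d for d, _, _ in studies]
# A 95% CI spans 2 * Z_975 standard errors, so SE = width / (2 * Z_975).
std_errors = [(upper - lower) / (2 * Z_975) for _, lower, upper in studies]
weights = [1.0 / se ** 2 for se in std_errors]

sum_w = sum(weights)
pooled = sum(w * d for w, d in zip(weights, effects)) / sum_w
pooled_se = math.sqrt(1.0 / sum_w)
ci_low = pooled - Z_975 * pooled_se
ci_high = pooled + Z_975 * pooled_se
z_value = pooled / pooled_se

# Cochran's Q and I^2 (truncated at zero when Q is below its df).
q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100

percent_weights = [100 * w / sum_w for w in weights]

print(f"pooled d = {pooled:.4f}, 95% CI [{ci_low:.4f}; {ci_high:.4f}], "
      f"z = {z_value:.4f}")
print(f"Q = {q:.2f} on {df} df, I^2 = {i_squared:.1f}%")
print("weights (%):", [round(w, 2) for w in percent_weights])
```

Because Q is smaller than its degrees of freedom, the estimated between-study variance is zero, which is why the fixed and random effects rows in the output are identical. Small differences in the last decimal of the weights are expected, since the confidence limits used as input are themselves rounded.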

In addition, there’s a check for outliers and influential cases, and a forest plot:

This is just the basics, but it hopefully has convinced you that the calculations involved in doing a meta-analysis take no more than 5 seconds if you use the right software. Remember that you can easily share your R script, containing all your data (but don't forget a good disclosure table) and analyses when submitting your manuscript to a journal, or when it has been accepted for publication. Now go and reproduce.


  1. Hi Daniel
    Nice post - you raise important and overlooked issues, some of which have also bugged me for a while.

    People often take meta analyses at face value ...as arbiters of truth. But it is a serious mistake to view meta analyses as less fallible or less prone to bias than individual studies. I have argued elsewhere that "Meta analyses are not a 'ready-to-eat' dish ...and they require as much inspection as any primary data paper...possibly closer inspection" http://keithsneuroblog.blogspot.co.uk/2014/06/meta-matic-meta-analyses-of-cbt-for.html

    The above link refers amongst others, to a meta analysis by Burns et al on CBT for treatment-resistant psychosis, which is a good example of researchers not declaring their methods. Although the authors do present their effect size calculation method (even an equation), it is deceptive. Burns et al choose to analyse symptom 'change' score (baseline to post symptom change for CBT vs controls) as their effect size (rather than end-point differences of CBT vs control, which is the approach used in the dozen or so other metas published in this area). Change scores are fine, but rather than use 'change standard deviation' as the denominator, the authors use end-point standard deviation ...and this is not clear in the paper.

    It may seem a minor technical point, but it is not because: a) as you say, it lacks the methodological transparency that we ought to expect of meta analyses; and b) such an effect size cannot be related to the nomenclature of 'small' 'medium' and 'large' (after Cohen) - although the authors do in fact make such judgments about their effect sizes.

    On a related issue, we have tried to promote the publishing of meta analysis databases. When we published our recent meta analysis of CBT for symptoms of schizophrenia, we made available our interactive Excel database http://www.cbtinschizophrenia.com/
    This contains all of our analyses, heterogeneity values, forest plots and so on and permits interested readers to add or remove studies and recalculate effect sizes, heterogeneity etc and actually test their hypotheses. I very much like your 'disclosure table' idea and perhaps alongside, we could require authors to publish something like the database we propose

    1. Hi Keith, your http://www.cbtinschizophrenia.com website is an excellent demonstration of sharing data. I love the interactive forest plot! Note that the type of disclosure table I am proposing here is much more detailed than just the effect size for each study. It contains direct references to the original paper, and a justification of which effects were included and what alternatives were possible. I'm thinking more like a p-curve disclosure table (see http://supp.apa.org/psycarticles/supplemental/a0033242/p-curve-Supplemental_Materials_Simonsohn.pdf).

  2. Hi Daniel,
    thanks so much for the instructive post!
    Can the R syntax provided above be adapted to make use of partial eta-squares (instead of ds)?
    If yes, how are the standard errors for eta-squared calculated?

    1. Hi, it's possible, I guess, but I don't exactly know how. More typical would be to convert the ES to r, instead of eta. Borenstein et al have a great book about meta-analyses with the formulas you need: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002313.html

    2. thanks for the swift and helpful response!