A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Wednesday, September 24, 2014

Publication Bias in Psychology: Putting Things in Perspective

I'd like to gratefully acknowledge the extremely helpful comments and suggestions on an earlier draft of this blog post by Anton Kühberger and his co-authors, who patiently answered my questions and shared additional analyses. I'd also like to thank Marcel van Assen, Christina Bergmann, JP de Ruiter, and Uri Simonsohn for comments and suggestions. Any remaining errors are completely my own doing.

In this post I'll be taking a closer look at recent support for publication bias in psychology provided by Kühberger et al (2014). I'll show how it is easy to misinterpret some of their data, and how p-curve analyses of their data show strong evidential support for the underlying studies (with no signs of p-hacking). Finally, I'll illustrate recently developed techniques (Simonsohn et al., in press) to estimate the power of studies based on the p-curve distribution, and check these estimates using Z-score simulations in R.

Publication bias is a problem in science (Everyone, 1950-2014). It’s difficult to quantify the extent of the problem, and therefore articles that attempt to do this can be worthwhile. In a recent paper ‘Publication Bias in Psychology: A Diagnosis Based on the Correlation between Effect Size and Sample Size’ Kühberger, Fritz, and Scherndl (2014) coded 1000 articles in psychology and examined the presence of publication bias. 531 articles ended up in the dataset, and they were able to extract p-values and effect sizes from around 400 studies. I tweeted a picture from their article, showing a distribution of Z-scores that, as the authors conclude, ‘shows a peak in the number of articles with a Z-score just above the critical level, while there are few observations just below it.’




It’s easy to misinterpret this figure (I know I did) if you believe the bump around Z = 1.96 means there is a surprising number of just-significant p-values. I’ve spent time on my blog explaining what a healthy distribution of p-values should look like: we should not find a high number of p-values around 0.05 (or Z = 1.96), but a large number of low p-values, for example as in the graph below:



You might be surprised to hear that the p-values and Z-scores in both figures above are exactly the same data (although I’ve only plotted the p-values below 0.1), presented in a different way. In the graph below, you see the relationship between p-values (from 1 to 0) and Z-scores. As you can see, the lower the p-value, the higher the Z-score (in Excel, Z = NORMSINV(1-p/2)). However, Z-scores do not increase linearly as p-values decrease; the relationship follows a concave upward curve.
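The same conversion is a one-liner in R; a small illustration (qnorm is the R equivalent of Excel's NORMSINV):

p <- c(0.10, 0.05, 0.01, 0.001)
z <- qnorm(1 - p / 2)   # convert two-sided p-values to Z-scores
round(z, 2)             # 1.64 1.96 2.58 3.29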



If you look at Figure 6 by Kühberger et al above, you will see they have plotted Z-scores on a linear scale (from 0 to 9.80 and higher), in bars of equal width. Let’s plot Z-scores (and their associated p-values) linearly, starting at Z = 1.96 (p = 0.05) and going upward. Both Z-scores and p-values are presented in the top pane, but because the p-values are barely visible there, they are plotted in isolation in the bottom pane.

The main point the authors want to make is the strong drop to the left of the 1.96 Z-score (indicating far fewer non-significant than significant results in the literature). I, on the other hand, was looking at the drop to the right of the figure, which can be perfectly normal (once you understand how to interpret it). The first bar (Z = 1.96-2.205) contains p-values between p = 0.05 and p = 0.0275 (a width of 0.0225), the bar to the right (Z = 2.205-2.45) p-values between p = 0.0275 and p = 0.0143 (a width of 0.0132), the next bar (Z = 2.45-2.695) p-values between p = 0.0143 and p = 0.007 (a width of 0.0073), the next bar (Z = 2.695-2.94) p-values between p = 0.007 and p = 0.0033 (a width of 0.0037), etc. Depending on the power of the studies, the shape of the curve on the right side could be perfectly expected, without a peculiar predominance of p-values just below the traditional significance level. I’ll return to this at the end.
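These bin widths are easy to verify. A small sketch in R (the bin width of 0.245 mirrors the figure; this is my own reconstruction, not the authors' code):

edges <- seq(1.96, 2.94, by = 0.245)          # Z bin edges, mirroring the figure
p_at_edges <- 2 * (1 - pnorm(edges))          # two-sided p-value at each bin edge
data.frame(z_from = head(edges, -1),
           z_to = tail(edges, -1),
           p_width = abs(diff(p_at_edges)))   # width of each bin in p-value units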

So, based on the distribution of p-values I can correct my earlier tweet: There are not too many just-significant p-values compared to very significant p-values, but there are too many significant p-values compared to non-significant p-values. The authors suggest nothing more.

One thing that has greatly contributed to my expectation of a bump around p = .05 is pictures of the prevalence of p-values based on Google Scholar searches, such as the ones below by Matthew Hankins on Twitter:


If we compare these search results with the hand-coded p-values by Kühberger et al (2014), we can be relatively certain the Google Scholar p-curves are completely due to an artefact of the search method (people very often report p < .05 or p < .1 instead of exact p-values). Nevertheless, and despite Matthew Hankins warning against this interpretation, such figures spread the idea through the community that there is something very wrong with p-curves across science, which is not just incorrect, but damaging to the reputation of science. A picture says more than a thousand words - and posting pictures of search method artefacts on social media might not be the thing we need now that many scientists are becoming overly skeptical and see p-hacking everywhere.

Kühberger et al (2014) provide 344 significant and 66 non-significant Z-scores. We can easily perform a p-curve analysis on these Z-scores, which yields the pattern below, illustrating strong evidential value for studies with N’s larger than 100, χ²(258) = 826.46, p < .0001, as well as for studies with N’s smaller than 100, χ²(428) = 1185.78, p < .0001. Researchers are not just reporting false positives in a scientific environment where publication bias reigns. Yes, there is probably some publication bias, and who knows whether there are specific research areas in this large set of studies where a lot of the results are p-hacked, but in general what ends up in the literature seems to have something going for it. I’d say it’s a pretty good-looking p-curve for a random selection of psychology articles.

 

We can count the percentage of significant studies in small studies (N ≤ 100) and in large studies (N > 100). In 152 small studies, 84.21% of the p-values were smaller than p = 0.05. In 259 large studies, 84.17% of the p-values were smaller than p = 0.05. This is too similar. Because smaller studies typically have lower power, we should expect at least slightly more non-significant results – another indication of publication bias, in addition to the missing Z-values to the left of the 1.96 threshold.



The central observation in Kühberger et al (2014) was actually not the distribution of p-values, but the correlation between effect size and sample size. The authors state that ‘ES and sample size (SS) ought to be unrelated.’ This is true when people completely ignore power. Note that if researchers take power into account when designing an experiment, effect size and sample size should be related: the smaller the expected effect size, the larger the sample you need to collect to have a decent probability of observing it. An interesting aspect of the data Kühberger et al have collected is that the negative correlation between sample size and effect size is mainly driven by small samples (N < 100), while effect sizes remain stable across larger sample sizes.



This indeed points to actual publication bias. For an r = 0.3, a power analysis tells us you have 95% power with a sample size of 134 – around the sample size where the negative correlation disappears. Another possibility is the use of within-subject designs. These studies rarely collect more than 100 individuals, but they typically don’t have to, because they often have higher power than between-subject designs. In an e-mail, the authors presented additional analyses, which suggest that the effects are indeed smaller for within designs, but don’t completely disappear.
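You can run this kind of power analysis yourself, for example with the pwr package (an assumption on my part: the original calculation may have used different software, and the exact required sample size depends on the approximation used):

# install.packages("pwr")
library(pwr)
pwr.r.test(r = 0.3, power = 0.95, sig.level = 0.05)   # solves for n (roughly 134-138, depending on the approximation)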

Kühberger et al (2014) asked the authors of the studies they sampled to estimate the direction and size of the correlation between sample size and effect size in the data they had collected. Because the contacted authors had no idea, Kühberger et al (2014) concluded power considerations by the original authors were ‘unlikely as the source of a possible ES-SS relationship’. I don’t think this question is conclusive about a lack of power considerations. After all, most researchers would probably also not be able to define the meaning of a p-value, which doesn’t mean significance considerations played no role in their studies. It’s difficult to measure whether researchers took power into account, and direct questions would probably have led to socially desirable answers, so at least the authors tried. Still, it’s an interesting question to what extent people design studies in which power is (at least implicitly) taken into account.

On a slightly more positive note, the estimated percentage of negative results in the psychological literature (15.78%) is much higher than earlier estimates (e.g., the 8.5% reported by Fanelli, 2010).

Estimating the power of the studies


Remember that the distribution of p-values is a function of the power of the studies. We should therefore be able to estimate the power of the sets of studies with N ≤ 100 and N > 100 based on the p-curve we observe. When we plot the p-curves for 53% power and 63% power, the distributions are quite comparable to the p-curves observed in the data by Kühberger et al.
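The expected p-curve for a given level of power is straightforward to compute. Below is a minimal sketch of my own (not the p-curve authors' code), assuming the test statistic is normally distributed with a noncentrality chosen to give the desired power:

expected_pcurve <- function(power, alpha = .05) {
  ncp <- qnorm(1 - alpha / 2) + qnorm(power)       # noncentrality that yields this power
  p_smaller <- function(p) pnorm(qnorm(1 - p / 2), mean = ncp, lower.tail = FALSE)
  edges <- seq(0, alpha, by = .01)                 # p-value bins: .00-.01, ..., .04-.05
  shares <- diff(sapply(edges, p_smaller)) / p_smaller(alpha)
  setNames(round(shares, 3), paste(head(edges, -1), tail(edges, -1), sep = "-"))
}
expected_pcurve(0.53)   # expected shares of significant p-values per bin at 53% power
expected_pcurve(0.63)   # and at 63% power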



This visual comparison is nice to communicate the basic idea, but a more formal mathematical approach is better. Simonsohn, Nelson, & Simmons (in press) have recently extended the use of p-curve analyses in exactly this direction, and provide the R script to estimate the power of the studies.

If we enter all significant Z-scores, the power estimate based on the p-curve distribution for all the studies is 90%. That's extremely high, mainly due to a decent number of studies that observed extremely low p-values (or high Z-scores). For example, 115 out of 410 p-values are p < .0001. The huge heterogeneity in the effects in the dataset by Kühberger et al (2014) is problematic for these kinds of analyses, but for illustrative purposes, I'm going to give it a go anyway.


There is a difference between my visual matching attempt and the mathematical matching by Simonsohn et al (2014). I’ve focused on matching the percentage of p-values between .00 and .01, while Simonsohn et al (2014) plot a function that matches the entire p-value distribution (including the higher p-values, and differences between p = 0.0001 and p = 0.00006). These decisions about which loss function to use leave room for debate, and I expect future research on these techniques will address different possibilities.

As mentioned above, the shape of the Z-score (or p-value) distribution depends critically on the power of the performed studies. We can simulate Z-score distributions in R using the code below:
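A minimal sketch of such a simulation (my own reconstruction, assuming each Z-score is drawn from a normal distribution with unit variance and a mean equal to the noncentrality that gives the desired power; the bin width of 0.245 mirrors the figure by Kühberger et al):

power <- 0.90
ncp <- qnorm(.975) + qnorm(power)               # noncentrality that yields 90% power at alpha = .05
z <- rnorm(100000, mean = ncp)                  # Z-scores of 100,000 simulated studies
hist(abs(z), breaks = seq(0, max(abs(z)) + 0.245, by = 0.245),
     xlab = "Z-score", main = "Simulated Z-scores, 90% power")
abline(v = qnorm(.975), col = "red", lwd = 2)   # the significance threshold at Z = 1.96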

If we simulate Z-scores with 90% power, our picture does not look like the figure in Kühberger et al (2014), because the peak is too far to the right:



Note that even with 90% power, our simulation expects far fewer Z-scores > 5 than the figure by Kühberger et al (2014) shows. The heterogeneity in their study set makes it difficult to simulate using a single distribution. Normally you would examine heterogeneity in a meta-analysis by looking more closely at how the studies differ, but the studies in the dataset are not identified. So let’s resort to a more extreme solution of excluding the very high Z-scores (perhaps assuming these are manipulation checks or other types of tests) and only look at Z < 4 (or p > 0.00006). We then get a (probably more reasonable) power estimate of 51%.


It's clear that power estimation based on p-curve distributions with huge heterogeneity is difficult, and I'm expecting more future work on this technique that examines different ways to deal with it. However, simulating Z-scores with a power of 51% does lead to a distribution with a peak located closer to that observed in Kühberger et al (2014).

We can be reasonably certain that all the studies we see to the left of Z = 1.96 in the simulation, but that are missing from Kühberger et al (2014), were performed but not reported. What a waste of resources! As such, the article by Kühberger et al is an excellent reminder of the negative consequences of performing underpowered studies (in addition to the difficulty of drawing statistical inferences from these studies, see Lakens & Evers, 2014). The missing studies would lower all effect size estimates that are calculated based on the published literature.

Below is the R-code to perform the power-estimation using the R-script by Simonsohn et al (you can find the original code here).
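For readers who just want the gist of how such an estimate works, here is a rough sketch of my own (not Simonsohn et al.'s actual script): find the noncentrality parameter under which the observed significant Z-scores, after a probability transform, look most uniform, and convert that noncentrality into power.

pp_values <- function(z, ncp) {
  # probability of a Z-score at least this large, conditional on significance,
  # when the true noncentrality is ncp; uniformly distributed if ncp is correct
  pnorm(z, mean = ncp, lower.tail = FALSE) / pnorm(qnorm(.975), mean = ncp, lower.tail = FALSE)
}
ks_distance <- function(ncp, z) {
  as.numeric(ks.test(pp_values(z, ncp), "punif")$statistic)
}
estimate_power <- function(z) {
  ncp_hat <- optimize(ks_distance, interval = c(0, 10), z = z)$minimum
  pnorm(qnorm(.975), mean = ncp_hat, lower.tail = FALSE)   # implied power at alpha = .05
}
# quick check with simulated studies that have 80% power:
set.seed(1)
z_sim <- rnorm(10000, mean = qnorm(.975) + qnorm(.80))
estimate_power(z_sim[z_sim > qnorm(.975)])                 # should land close to 0.80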

Monday, September 15, 2014

Bayes Factors and p-values for independent t-tests



This Thursday I’ll be giving a workshop on good research practices in Leuven, Belgium. The other guest speaker at the workshop is Eric-Jan Wagenmakers, so I thought I’d finally dive into the relationship between Bayes Factors and p-values, to be prepared to talk in the same workshop as such an expert on Bayesian statistics and methodology. This was a good excuse to finally play around with the BayesFactor package for R written by Richard Morey, who was super helpful through Twitter at 21:30 on a Sunday to enable me to do the calculations in this post. Remaining errors are my own responsibility (see the R script below to reproduce these calculations).

Bayes Factors tell you something about the relative support the data provide for H0 or H1 (as opposed to p-values, which give you the probability of data at least as extreme as those observed, given H0). As explained in detail by Felix Schönbrodt here, you can express Bayes Factors as support for H0 over H1 (BF01) or as support for H1 over H0 (BF10), and report raw Bayes Factors (ranging from 0 to infinity, where 1 means equal support for H1 and H0) or Bayes Factors on a log scale (from minus infinity through 0 to plus infinity, where 0 means equal support for H1 and H0). And yes, that gets pretty confusing pretty fast. Luckily, Richard Morey was so nice to adjust the output of Jeff Rouder's Bayes Factor calculation website to include the R script for the BayesFactor package, which makes the output of different tools to compute Bayes Factors more uniform.

Doing a single Bayesian independent t-test in R is easy. Run the code below, replace the t with the t-value from your Student's t-test, fill in n1 and n2 (the sample size in each of the two groups in the independent t-test), and you are ready to go. For example, entering a t-value of 3 and 50 participants in each condition gives BF01 = 0.11, indicating the alternative hypothesis is around (1/0.11) = 9 times more likely than the null hypothesis.

library(BayesFactor)   # ttest.tstat()$bf is the natural log of BF10, so exp(-bf) gives BF01
exp(-ttest.tstat(t=3, n1=50, n2=50, rscale=1)$bf)

In the figure below, raw BF01 are plotted, which means they indicate the Bayes Factor for the null over the alternative. Therefore, small values (closer to 0) indicate stronger support for H1, 1 means equal support for H1 and H0, and large values indicate support for H0. First, let’s give an overview of Bayes Factors as a function of the t-value of an independent t-test, ranging from t=0 (no differences between groups) to t=5.



You can see three curves (for 20, 50, or 100 participants per condition) displaying the corresponding Bayes Factors as a function of increasing t-values. The green lines correspond to Bayes Factors of 1:3 (upper line, favoring H0) or 3:1 (lower line, favoring H1). Bayes Factors, just like p-values, are continuous, and shouldn’t be thought of in a dichotomous manner (but I know polar opposition is a foundation of human cognition, so I expect almost everyone will ignore this explicit statement in their implicit interpretation of Bayes Factors). Let’s zoom in a little for our comparison of BF and p-values, to t-values above 1.96.




The dark grey line in this figure illustrates data in favor of H1 of 3:1 (some support for H1), and the light grey line represents data in favor of H1 of 10:1 (strong support for H1). The vertical lines indicate which t-values represent an effect in a t-test that is statistically different from 0 at p = 0.05 (the larger the sample size, the closer this t-value lies to 1.96). There are two interesting observations we can make from this figure. 

First of all, where smaller sample sizes require slightly higher t-values to find a p<0.05 (as indicated by the blue vertical dotted line being further to the right than the black vertical dotted line), smaller sample sizes actually yield better Bayes Factors for the same t-value. The reason for this, I think (but there's a comment section below, so if you know better, let me know) is that the larger the sample size, the less likely it is to find a relatively low t-value if there is an effect – instead, you’d expect to find a higher t-value, on average.

 P-values are altogether much less dependent on the sample size in a t-test. The figure below shows three curves (for 20, 50, and 100 participants per condition). Researchers can conclude their data is ‘significant’ for t-values somewhere around 2, ranging from 1.96 for large samples, to 2.03 for N=20. In other words, there is a relatively small effect of sample size. The dark and light grey lines indicate p = 0.05 and p = 0.01 thresholds.




The second thing that becomes clear from the plot of Bayes Factors is that the p < 0.05 threshold allows researchers to conclude their data support H1 long before a BF01 of 0.33. The t-values at which a Frequentist t-test yields p < 0.05 are much lower than the t-values required for BF01 to drop below 0.33. For 20 participants per condition, a t-value of 2.487 is needed to conclude that there is some support for H1; a Frequentist t-test would give p = 0.017. The larger the sample size, the more pronounced this difference becomes (e.g., with 200 participants per condition, a t = 2.732 gives a BF01 = 0.33 and a p = 0.007).
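These thresholds are easy to verify yourself. A small sketch using the same BayesFactor function as above (rscale = 1), searching for the t-value at which BF01 reaches 1/3 for a given per-group sample size:

library(BayesFactor)
bf01 <- function(t, n) exp(-ttest.tstat(t = t, n1 = n, n2 = n, rscale = 1)$bf)
t_for_bf01 <- function(n) uniroot(function(t) bf01(t, n) - 1/3, c(1, 6))$root
sapply(c(20, 100, 200), t_for_bf01)                          # t-values needed for BF01 = 1/3
2 * pt(t_for_bf01(20), df = 2 * 20 - 2, lower.tail = FALSE)  # corresponding p-value for n = 20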

It can even be the case that a ‘significant’ p-value in an independent t-test with 100 participants per condition (e.g., a t-value of 2, yielding p = 0.047) gives a BF01 > 1, which means support in the opposite direction (favoring H0). Such high p-values really don’t provide support for our hypotheses. Furthermore, the use of a fixed significance level (0.05) regardless of the sample size of the study is a bad research practice. If we required a higher t-value (and thus a lower p-value) in larger samples, we would at least prevent the rather ridiculous situation where we interpret data as support for H1 when the BF actually favors H0.

On the other hand, the recommendation by some statisticians to use p < 0.001 is a bit of an overreaction to the problem. As you can see from the grey line at p = 0.01 in the p-value plot, and the grey line at 0.33 in the Bayes Factor plot, using p < 0.01 gets us pretty close to the same conclusions as we would draw using Bayes Factors. Stronger evidence is preferable to weaker evidence, but can come at too high a cost.

In the end, our first priority should be to draw logical inferences about our hypotheses from our data. Given how easy it is to calculate the Bayes Factor, I'd say that at the very minimum you should want to calculate it to make sure your significant p-value actually isn't stronger support for H0. You can easily report it alongside p-values, confidence intervals, and effect sizes. For example, in a recent paper (Evers & Lakens, 2014, Study 2b) we wrote: "Overall, there was some indication of a diagnosticity effect of 4.4% (SD = 13.32), t(38) = 2.06, p = 0.046, gav = 0.24, 95% CI [0.00, 0.49], but this difference was not convincing when evaluated with Bayesian statistics, JZS BF10 = 0.89".

If you want to play around with the functions, you can grab the script to produce the zoomed-in versions of the Bayes Factor and p-value graphs below (you need to install and load the BayesFactor package for the script to work). If you want to read more about this (or see similar graphs and more), read this paper by Rouder et al (2009).



Sunday, August 24, 2014

On the Reproducibility of Meta-Analyses


I have no idea how many people take the effort to reproduce a meta-analysis in their spare time. What I do know, based on my personal experiences of the last week, is that A) it’s too much work to reproduce a meta-analysis, primarily due to low reporting standards, and B) we need to raise the bar when doing meta-analyses. At the end of this post, I’ll explain how to do a meta-analysis in R in five seconds (assuming you have effect sizes and sample sizes for each individual study) to convince you that you can produce (or reproduce) meta-analyses yourself.

Any single study is nothing more than a data-point in a future meta-analysis. In recent years researchers have shared a lot of thoughts and opinions about reporting standards for individual studies, ranging from disclosure statements and additional reporting of alternative statistical results whenever a researcher had the flexibility to choose between multiple analyses, to sharing all raw data and analysis files. When it comes to meta-analyses, reporting standards are even more important.

Recently I tried to reproduce a meta-analysis (by Sheinfeld Gorin, Krebs, Badr, Janke, Jim, Spring, Mohr, Berendsen, & Jacobsen, 2012, with the title “Meta-Analysis of Psychosocial Interventions to Reduce Pain in Patients With Cancer”, published in the Journal of Clinical Oncology, which has an impact factor of 18; the article has been cited 38 times) for a talk about statistics and reproducibility at the International Conference of Behavioral Medicine. Of the 38 effect sizes included in the meta-analysis I could reproduce 27 (71%). Overall, I agreed with the way the original effect size was calculated for 18 articles (47%). I think both these numbers are too low. It could be my lack of ability in calculating effect sizes (let's call it a theoretical possibility), and I could be wrong in all cases in which I disagreed with which effect size to use (I offered the authors of the meta-analysis the opportunity to comment on this blog post, which they declined). But we need to make sure meta-analyses are 100% reproducible if we want to be able to discuss and resolve such disagreements.

For three papers, statistics were not reported in enough detail for me to calculate effect sizes. The researchers who performed the meta-analysis might have contacted authors for the raw data in these cases. If so, it is important that authors of a meta-analysis share the summary statistics their effect size estimate is based on. Without additional information, those effect sizes are not reproducible by reviewers or readers. After my talk, an audience member noted that sharing data you have gotten from someone would require their permission - so if you ask for additional data when doing a meta-analysis, also ask to be able to share the summary statistics you will use in the meta-analysis to improve reproducibility.

For 9 studies, the effect sizes I calculated differed substantially from those by the authors of the meta-analysis (so much that it's not just due to rounding differences). It is difficult to resolve these inconsistencies, because I do not know how the authors calculated the effect size in these studies. Meta-analyses should give information about the data effect sizes are based on. A direct quote from the article that contains the relevant statistical test, or pointing to a row and column in a Table that contains the means and standard deviations would have been enough to allow me to compare calculations. 

We might still have disagreed about which effect size should be included, as was the case for 10 articles where I could reproduce the effect size the authors included, but where I would use a different effect size estimate. The most noteworthy disagreement probably was a set of three articles the authors included, namely:

de Wit R, van Dam F, Zandbelt L, et al: A pain education program for chronic cancer pain patients: Follow-up results from a randomized controlled trial. Pain 73:55-69, 1997

de Wit R, van Dam F: From hospital to home care: A randomized controlled trial of a pain education programme for cancer patients with chronic pain. J Adv Nurs 36:742-754, 2001a

de Wit R, van Dam F, Loonstra S, et al: Improving the quality of pain treatment by a tailored pain education programme for cancer patients in chronic pain. Eur J Pain 5:241-256, 2001b

The authors of the meta-analyses calculated three effect sizes for these studies: 0.21, 0.14, and -0.19. I had a research assistant prepare a document with as much statistical information about the articles as possible, and she noticed that in her calculations, the effect sizes of De Wit et al (2001b) and De Wit et al (1997) were identical. I checked De Wit 2001a (referred to as De Wit 2002 in the forest plot in the meta-analysis) and noticed that all three studies reported the data of 313 participants. It’s the same data, written up three times. It’s easy to miss, because the data is presented in slightly different ways, and there are no references to earlier articles in the later articles (in the 2001b article, the two earlier articles are in the reference list, but not mentioned in the main text). I contacted the corresponding second author for clarifications, but received no reply (I would still be happy to add any comments I receive). In any case, since this is the same data, and since the effect sizes are not independent, it should only be included in the meta-analysis once.

A second disagreement comes from Table 2 in Anderson, Mendoza, Payne, Valero, Palos, Nazario, Richman, Hurley, Gning, Lynch, Kalish, and Cleeland (2006), “Pain Education for Underserved Minority Cancer Patients: A Randomized Controlled Trial” (also published in the Journal of Clinical Oncology), reproduced below:


See if you can find the seven similar means and standard deviations – a clear copy-paste error. Regrettably, this makes it tricky to calculate the difference on the Pain Control Scale, because the reported values might not be correct. I contacted the corresponding first author for clarifications, but have received no reply (but the production office of the journal is looking into it).

There are some effect size calculations where I strongly suspect errors were made, for example because adjusted means from an ANCOVA seem to be used instead of unadjusted means, or because the effect size seems to be based on only part of the data (post-test scores only instead of Time 1 to Time 2 change scores, or the effect size within the intervention condition instead of the difference between the intervention and control conditions). To know this for sure, the authors would have to share the statistics their effect size calculations were based on. I could be wrong, but disagreements can only be resolved if the data the effect sizes are calculated from are clearly communicated together with the meta-analysis.

The most important take-home message at this point should be that A) there are enough things researchers can disagree about if you take a close look at published meta-analyses, and B) the only way to resolve these disagreements is full disclosure about how the meta-analysis was performed. All meta-analyses should therefore include a disclosure table that provides a detailed description of how each effect size was calculated, including copy-pasted sentences from the original article or references to the rows and columns in tables that contain the relevant data. In p-curve analyses (Simonsohn, Nelson, & Simmons, 2013) such disclosure tables are already required, including alternative effects that could have been included and a description of the methods and design of each study.

Inclusion Criteria: The Researcher Degrees of Freedom of the Meta-Analyst


The choice of which studies you do or do not include in a meta-analysis is necessarily subjective. It requires researchers to determine what their inclusion criteria are, and to decide whether a study meets those criteria or not. More importantly, if meta-analysts share all the data their meta-analysis is based on, it’s easy for reviewers or readers to repeat the analysis based on their own inclusion criteria. In the meta-analysis I checked, three types of interventions to reduce pain in cancer patients were examined. The first is pain management education, which involves increasing knowledge about pain, how to treat pain, and when and how to contact healthcare providers when in pain (for example to change the pain treatment). The second is hypnosis, provided in individual sessions by a therapist, often tailored to each patient, consisting of, for example, suggestions for pleasant visual imagery and muscle relaxation. The third is relaxation and cognitive coping skills, consisting of training and practice in relaxation exercises, attention diversion, and positive affirmations.

When doing a random effects meta-analysis, the effects under investigation should be ‘different but similar’ and not ‘different and unrelated’ (Higgins, Thompson, & Spiegelhalter, 2009). If there is heterogeneity in the effect size estimate, you should not just stop after reporting the overall effect size, but examine subsamples of studies. I wanted to know whether the conclusion of a positive effect size that was statistically different from zero across all studies would also hold for the subsamples (and whether the subsets would no longer show heterogeneity). It turns out that the evidence for pain management education is pretty convincing, while the effect size estimates for the relaxation interventions were less convincing. The hypnosis interventions (sometimes consisting of only a 15-minute session) yielded effect sizes that were twice as large, but based on my calculations, and after controlling for outliers, were not yet convincing. Thus, even though I disagreed on which effect sizes to include, based on the set of studies selected from the literature (which is in itself another interesting challenge for reproducibility!), the main difference in conclusions was based on which effects were 'different but similar'.

You can agree or disagree with my calculations. But what’s most important is that you should be able to perform your own meta-analysis on publically shared, open, and easily accessible data, to test your own ideas of which effects should and should not be included.

Performing a meta-analysis in R


I had no idea how easy doing a meta-analysis was in R (fun fact: when I was talking about this to someone, she pointed out the benefits of not sharing this too widely, to keep the individual benefit of 'knowing how to do meta-analyses' - obviously, I think the collective benefit of everyone being able to do or check a meta-analysis is much greater). I did one small-scale meta-analysis once (Lakens, 2012), mainly by hand, which was effortful. Recently, I reviewed a paper by Carter & McCullough (2014) where the authors were incredibly nice to share their entire R script alongside their (very interesting) paper. I was amazed how easy it was to reproduce (or adapt) meta-analyses this way. If this part is useful, credit goes to Carter and McCullough and their R script (their script contains many more cool analyses, such as tests of excessive significance, and PET-PEESE meta-regressions, which are so cool they deserve an individual blog post in the future).


All you need to do a meta-analysis is the effect size for each study (for example Cohen’s d) and the sample size in each of the two conditions Cohen’s d is based on. The first vector, es.d, contains the effect sizes of the five studies. The n1 and n2 vectors contain the sample sizes for the control condition (n1) and the experimental condition (n2). That’s all you need to provide, and assuming you’ve calculated the effect sizes (not to brag, but I found my own Excel sheets to calculate effect sizes that accompany my 2013 effect size paper very useful in this project) and coded the sample sizes, the rest of the meta-analysis takes 5 seconds. You need to copy-paste the entire code below into R or RStudio (both are free) and first install the meta and metafor packages. After that, you just insert your effect sizes and sample sizes, and run it. The code I used is based on the script by Carter and McCullough, with some additions I made.
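A minimal sketch of that workflow looks like this (my own reconstruction rather than the full Carter and McCullough script; the effect sizes match those in the output shown below, but the sample sizes are placeholders of my own, so the confidence intervals and weights will not match exactly):

# install.packages(c("meta", "metafor"))
library(meta)
library(metafor)

es.d <- c(0.38, 0.41, -0.14, 0.63, 0.22)   # Cohen's d for each of the five studies
n1   <- c(75, 50, 22, 18, 60)              # sample sizes of the control conditions (placeholders)
n2   <- c(75, 50, 22, 18, 60)              # sample sizes of the experimental conditions (placeholders)

var.d <- (n1 + n2) / (n1 * n2) + es.d^2 / (2 * (n1 + n2))   # variance of Cohen's d

m <- metagen(TE = es.d, seTE = sqrt(var.d), sm = "SMD")     # fixed and random effects meta-analysis
print(m)
forest(m)                                  # forest plot

res <- rma(yi = es.d, vi = var.d)          # same model in metafor
influence(res)                             # check for outliers and influential cases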





The output you get will contain the results of the meta-analysis, showing an overall effect size of d = 0.31, 95% CI [0.13; 0.50]:

    SMD              95%-CI %W(fixed) %W(random)
1  0.38 [ 0.0571; 0.7029]     33.62      33.62
2  0.41 [ 0.0136; 0.8064]     22.32      22.32
3 -0.14 [-0.7387; 0.4587]      9.78       9.78
4  0.63 [-0.0223; 1.2823]      8.24       8.24
5  0.22 [-0.1470; 0.5870]     26.04      26.04

Number of studies combined: k=5

                                     95%-CI      z p.value
Fixed effect model   0.3148 [0.1275; 0.502] 3.2945   0.001
Random effects model 0.3148 [0.1275; 0.502] 3.2945   0.001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 2.12]; I^2 = 0% [0%; 77.8%]

Test of heterogeneity:
    Q d.f.  p.value
 3.75    4   0.4411

In addition, there’s a check for outliers and influential cases, and a forest plot:

 
This is just the basics, but it has hopefully convinced you that the calculations involved in doing a meta-analysis take no more than 5 seconds if you use the right software. Remember that you can easily share your R script, containing all your data (but don't forget a good disclosure table) and analyses, when submitting your manuscript to a journal, or when it has been accepted for publication. Now go and reproduce.