A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, August 24, 2014

On the Reproducibility of Meta-Analyses

I have no idea how many people make the effort to reproduce a meta-analysis in their spare time. What I do know, based on my personal experiences of the last week, is that A) it’s too much work to reproduce a meta-analysis, primarily due to low reporting standards, and B) we need to raise the bar when doing meta-analyses. At the end of this post, I’ll explain how to do a meta-analysis in R in five seconds (assuming you have effect sizes and sample sizes for each individual study), to convince you that you can produce (or reproduce) meta-analyses yourself.

Any single study is nothing more than a data-point in a future meta-analysis. In recent years, researchers have shared a lot of thoughts and opinions about reporting standards for individual studies, ranging from disclosure statements and additional reporting of alternative statistical results (whenever a researcher had the flexibility to choose from multiple analyses) to sharing all the raw data and analysis files. When it comes to meta-analyses, reporting standards are even more important.

Recently I tried to reproduce a meta-analysis (by Sheinfeld Gorin, Krebs, Badr, Janke, Jim, Spring, Mohr, Berendsen, & Jacobsen, 2012, with the title “Meta-Analysis of Psychosocial Interventions to Reduce Pain in Patients With Cancer” in the Journal of Clinical Oncology, which has an IF of 18; the article is cited 38 times) for a talk about statistics and reproducibility at the International Conference of Behavioral Medicine. Of the 38 effect sizes included in the meta-analysis I could reproduce 27 effect sizes (71%). Overall, I agreed with the way the original effect size was calculated for 18 articles (47%). I think both these numbers are too low. It could be my lack of ability in calculating effect sizes (let's call it a theoretical possibility), and I could be wrong in all cases in which I disagreed with which effect size to use (I offered the authors of the meta-analysis the opportunity to comment on this blog post, which they declined). But if we want to be able to discuss and resolve such disagreements, we need to make sure meta-analyses are 100% reproducible.

For three papers, statistics were not reported in enough detail for me to calculate effect sizes. The researchers who performed the meta-analysis might have contacted authors for the raw data in these cases. If so, it is important that authors of a meta-analysis share the summary statistics their effect size estimate is based on. Without additional information, those effect sizes are not reproducible by reviewers or readers. After my talk, an audience member noted that sharing data you have gotten from someone would require their permission - so if you ask for additional data when doing a meta-analysis, also ask to be able to share the summary statistics you will use in the meta-analysis to improve reproducibility.

For 9 studies, the effect sizes I calculated differed substantially from those by the authors of the meta-analysis (so much that it's not just due to rounding differences). It is difficult to resolve these inconsistencies, because I do not know how the authors calculated the effect size in these studies. Meta-analyses should give information about the data effect sizes are based on. A direct quote from the article that contains the relevant statistical test, or pointing to a row and column in a Table that contains the means and standard deviations would have been enough to allow me to compare calculations. 

We might still have disagreed about which effect size should be included, as was the case for 10 articles where I could reproduce the effect size the authors included, but where I would use a different effect size estimate. The most noteworthy disagreement probably was a set of three articles the authors included, namely:

de Wit R, van Dam F, Zandbelt L, et al: A pain education program for chronic cancer pain patients: Follow-up results from a randomized controlled trial. Pain 73:55-69, 1997

de Wit R, van Dam F: From hospital to home care: A randomized controlled trial of a pain education programme for cancer patients with chronic pain. J Adv Nurs 36:742-754, 2001a

de Wit R, van Dam F, Loonstra S, et al: Improving the quality of pain treatment by a tailored pain education programme for cancer patients in chronic pain. Eur J Pain 5:241-256, 2001b

The authors of the meta-analysis calculated three effect sizes for these studies: 0.21, 0.14, and -0.19. I had a research assistant prepare a document with as much statistical information about the articles as possible, and she noticed that in her calculations, the effect sizes of De Wit et al (2001b) and De Wit et al (1997) were identical. I checked De Wit 2001a (referred to as De Wit 2002 in the forest plot of the meta-analysis) and noticed that all three studies report the data of the same 313 participants. It’s the same data, written up three times. It’s easy to miss, because the data is presented in slightly different ways, and there are no references to the earlier articles in the later articles (in the 2001b article, the two earlier articles are in the reference list, but not mentioned in the main text). I contacted the corresponding second author for clarifications, but received no reply (I would still be happy to add any comments I receive). In any case, since this is the same data, and the effect sizes are therefore not independent, it should only be included in the meta-analysis once.

A second disagreement comes from Table 2 in Anderson, Mendoza, Payne, Valero, Palos, Nazario, Richman, Hurley, Gning, Lynch, Kalish, and Cleeland (2006), “Pain Education for Underserved Minority Cancer Patients: A Randomized Controlled Trial” (also published in the Journal of Clinical Oncology), reproduced below:

See if you can find the seven similar means and standard deviations – a clear copy-paste error. Regrettably, this makes it tricky to calculate the difference on the Pain Control Scale, because the reported values might not be correct. I contacted the corresponding first author for clarifications, but have received no reply (the production office of the journal is looking into it).

There are some effect size calculations where I strongly suspect errors were made, for example because adjusted means from an ANCOVA seem to have been used instead of unadjusted means, or because the effect size seems to be based on only part of the data (post-test scores only, instead of Time 1 to Time 2 change scores, or the effect size within the intervention condition instead of the difference between the intervention and control conditions). To know this for sure, the authors should have shared the statistics their effect size calculations were based on. I could be wrong, but disagreements can only be resolved if the data the effect sizes are calculated from is clearly communicated together with the meta-analysis.

The most important take-home message at this point should be that A) there are enough things researchers can disagree about if you take a close look at published meta-analyses, and B) the only way to resolve these disagreements is full disclosure about how the meta-analysis was performed. All meta-analyses should therefore include a disclosure table with the publication, which provides a detailed description of the effect sizes that were used, including copy-pasted sentences from the original article or references to rows and columns in tables that contain the relevant data. In p-curve analyses (Simonsohn, Nelson, & Simmons, 2013) such disclosure tables are already required, including alternative effects that could have been included and a description of the methods and design of the study.

Inclusion Criteria: The Researcher Degrees of Freedom of the Meta-Analyst

The choice of which studies you do or do not include in a meta-analysis is necessarily subjective. It requires researchers to determine what their inclusion criteria are, and to decide whether a study meets those criteria or not. More importantly, if meta-analysts share all the data their meta-analysis is based on, it’s easy for reviewers or readers to repeat the analysis based on their own inclusion criteria. In the meta-analysis I checked, 3 types of interventions to reduce pain in cancer patients were used. The first is pain management education, which involves increasing knowledge about pain, how to treat pain, and when and how to contact healthcare providers when in pain (for example to change their pain treatment). The second is hypnosis, provided in individual sessions by a therapist, often tailored to each patient, consisting of for example suggestions for pleasant visual imagery and muscle relaxation. The third is relaxation and cognitive coping skills, consisting of training and practice in relaxation exercises, attention diversion, and positive affirmations.

When doing a random effects meta-analysis, effects under investigation should be ‘different but similar’ and not ‘different and unrelated’ (Higgins, Thompson, & Spiegelhalter, 2009). If there is heterogeneity in the effect size estimate, you should not just stop after reporting the overall effect size, but examine subsamples of studies. I wanted to know whether the conclusion of a positive effect size that was statistically different from zero over all studies would also hold for the subsamples (and whether the subsets would no longer show heterogeneity). It turns out that the evidence for pain management education is pretty convincing, while the effect size estimate for the relaxation interventions was less convincing. The hypnosis interventions (sometimes consisting of only a 15-minute session) yielded effect sizes that were twice as large but, based on my calculations and after controlling for outliers, not yet convincing. Thus, even though I disagreed on which effect sizes to include, based on the set of studies selected from the literature (which is in itself another interesting challenge for reproducibility!), the main difference in conclusions was based on which effects were 'different but similar'.
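With shared data, such a subgroup analysis takes only a few lines. A sketch using the meta package (the effect sizes, standard errors, and intervention labels below are made-up placeholders, not the values from this meta-analysis; in recent versions of meta the byvar argument is called subgroup):

```r
library(meta)

# Hypothetical per-study effect sizes, standard errors, and intervention labels
d    <- c(0.5, 0.6, 0.2, 0.3, 0.9, 1.1)
se   <- c(0.20, 0.25, 0.15, 0.20, 0.30, 0.35)
type <- c("education", "education", "relaxation",
          "relaxation", "hypnosis", "hypnosis")

# Random effects meta-analysis with a separate estimate per intervention type
m <- metagen(TE = d, seTE = se, byvar = type)
summary(m)  # overall effect, plus effect size and heterogeneity per subgroup
```

Because the grouping is just another column in the shared data, a reader who disagrees with how studies were grouped can relabel them and rerun the analysis in seconds.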

You can agree or disagree with my calculations. But what’s most important is that you should be able to perform your own meta-analysis on publically shared, open, and easily accessible data, to test your own ideas of which effects should and should not be included.

Performing a meta-analysis in R

I had no idea how easy doing a meta-analysis was in R (fun fact: when I was talking about this to someone, she pointed out the benefits of not sharing this too widely, to have an individual benefit of 'knowing how to do meta-analyses' - obviously, I think the collective benefit of everyone being able to do or check a meta-analysis is much greater). I did one small-scale meta-analysis once (Lakens, 2012), mainly by hand, which was effortful. Recently, I reviewed a paper by Carter & McCullough (2014) where the authors were incredibly nice to share their entire R script alongside their (very interesting) paper. I was amazed how easy it was to reproduce (or adapt) meta-analyses this way. If this part is useful, credit goes to Carter and McCullough and their R script (their script contains many more cool analyses, such as tests of excessive significance, and PET-PEESE meta-regressions, which are so cool they deserve an individual blog post in the future).

All you need to do a meta-analysis is the effect size for each study (for example Cohen’s d) and the sample size in each of the two conditions Cohen’s d is based on. The first vector, es.d, contains the five effect sizes from 5 studies. The n1 and n2 vectors contain the sample sizes for the control condition (n1) and the experimental condition (n2). That’s all you need to provide, and assuming you’ve calculated the effect sizes (not to brag, but I found my own Excel sheets to calculate effect sizes that accompany my 2013 effect size paper very useful in this project) and coded the sample sizes, the rest of the meta-analysis takes 5 seconds. Copy-paste the entire code into R or RStudio (both are free); you first need to install the meta and metafor packages. After that, you just insert your effect sizes and sample sizes, and run it. The code I used is from Carter and McCullough’s script, with some additions I made.
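The full Carter and McCullough script isn’t reproduced in this post, but a minimal sketch of the same approach with the meta package looks like this. The five effect sizes are the ones visible in the output below; the sample sizes are made-up placeholders, so the weights and confidence intervals won’t match the output exactly:

```r
library(meta)  # install.packages("meta") the first time

es.d <- c(0.38, 0.41, -0.14, 0.63, 0.22)  # Cohen's d for each of the 5 studies
n1   <- c(40, 25, 15, 12, 35)             # control group sizes (placeholders)
n2   <- c(40, 25, 15, 12, 35)             # experimental group sizes (placeholders)

# Standard error of Cohen's d, computed from the group sizes
se.d <- sqrt((n1 + n2) / (n1 * n2) + es.d^2 / (2 * (n1 + n2)))

m <- metagen(TE = es.d, seTE = se.d)
summary(m)  # fixed and random effects estimates, plus heterogeneity statistics
forest(m)   # forest plot of the individual and meta-analytic effect sizes
```

Swap in your own effect sizes and sample sizes, and the same three function calls give you the full meta-analysis.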

The output you get will contain the results of the meta-analysis showing an overall effect size of d = 0.31, 95% CI [0.13; 0.50]:

                  95%-CI %W(fixed) %W(random)
1  0.38 [ 0.0571; 0.7029]     33.62      33.62
2  0.41 [ 0.0136; 0.8064]     22.32      22.32
3 -0.14 [-0.7387; 0.4587]      9.78       9.78
4  0.63 [-0.0223; 1.2823]      8.24       8.24
5  0.22 [-0.1470; 0.5870]     26.04      26.04

Number of studies combined: k=5

                                     95%-CI      z p.value
Fixed effect model   0.3148 [0.1275; 0.502] 3.2945   0.001
Random effects model 0.3148 [0.1275; 0.502] 3.2945   0.001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 2.12]; I^2 = 0% [0%; 77.8%]

Test of heterogeneity:
    Q d.f.  p.value
 3.75    4   0.4411

In addition, the script includes a check for outliers and influential cases, and produces a forest plot.

This is just the basics, but it hopefully has convinced you that the calculations involved in doing a meta-analysis take no more than 5 seconds if you use the right software. Remember that you can easily share your R script, containing all your data (but don't forget a good disclosure table) and analyses, when submitting your manuscript to a journal, or when it has been accepted for publication. Now go and reproduce.

Thursday, August 21, 2014

What is a p-value?

I'll be teaching my first statistics lectures this September together with Chris Snijders. While preparing some assignments, I gained a new appreciation for clearly explaining the basics of statistics. I'm also becoming an R fanboy, so I thought I'd adapt one assignment into this blog post to explain what a p-value is, and how to plot data, make figures, and do a t-test in R, for people who'd like to brush up on or learn some of the basics.

When testing predictions in empirical research, researchers often report whether their results are statistically different from 0. For example, a researcher might be interested in whether the use of a cell-phone increases the likelihood of getting into a car accident compared to not using a cell phone. A researcher might ask one group of individuals to use cell phones while driving in a driving simulator, while a control group does not use cell phones. The researcher might predict cell phone users get into more accidents, and test this prediction by comparing whether the difference between the two groups in the experiment is statistically different from zero. This is typically referred to as null-hypothesis significance testing (NHST). The ‘significance’ part of this name is a misnomer: what we understand as the 'significance' of a finding in normal English depends on its theoretical or practical importance and has very little to do with statistics. Instead, we will therefore refer to such tests as ‘null-hypothesis testing’, and use ‘statistical difference’ for what is sometimes referred to in the literature as a ‘significant finding’ or a ‘statistically significant finding’.

Assume we ask two groups of 10 people how much they liked the extended directors cut version of the Lord of the Rings (LOTR) trilogy. The first group consists of my friends, and the second group consists of friends of my wife. Our friends rate the trilogy on a score from 1 to 10. We can calculate the average rating by my friends, which is 8.7, and the average rating by my wife’s friends, which is 7.7. The difference is 1.

            My Friends   My Wife’s Friends
Friend 1         9               9
Friend 2         7               6
Friend 3         8               7
Friend 4         9               8
Friend 5         8               7
Friend 6         9               9
Friend 7         9               8
Friend 8        10               8
Friend 9         9               8
Friend 10        9               7
Mean           8.7             7.7

To get these data in R, we use:

myfriends = c(9,7,8,9,8,9,9,10,9,9) 
mywifesfriends = c(9,6,7,8,7,9,8,8,8,7)

We can see in the table that the ratings vary, not just between the groups, but also within the groups. The average difference is 1 scale point between groups, but the ratings also differ between friends within each group. To take a look at the distribution of these scores, let’s plot them in a density plot (closely related to a histogram).
d1 <- density(myfriends)
d2 <- density(mywifesfriends)
plot(range(d1$x, d2$x), range(d1$y, d2$y), type = "n", xlab = "LOTR Rating", ylab = "Density")
polygon(d1, col=rgb(0,0,1,1/4), border=rgb(0,0,1))
polygon(d2, col=rgb(1,0,0,1/4), border=rgb(1,0,0,1))
We can see the groups overlap to a certain extent, but there is also some non-overlap. So is the difference between the two groups just random variation, or can we conclude my friends really like the extended directors cut of the Lord of the Rings (LOTR) trilogy more than my wife’s friends?

In null-hypothesis testing we try to answer this question by calculating the probability of observing a specific, or more extreme, outcome of a test (i.e., a difference in movie ratings of 1 point or more) assuming that the null hypothesis is true (i.e., there is no real difference between how much my friends and my wife’s friends like the extended directors cut of LOTR). This probability is called the p-value.

The null-hypothesis assumes that if we would ask an infinite number of my friends  and an infinite number of my wife’s friends how much they like LOTR, the difference between these huge groups is exactly 0. However, in a small sample of my friends and my wife’s friends (say 10 friends in each group), random variation is very likely to lead to a difference larger or smaller than 0. How do we know whether an observed difference is due to random variation around a difference of 0, or whether an observed difference is due to random variation around a real difference between our friends?
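One way to build intuition for this is to simulate the null-hypothesis. In the sketch below (with illustrative numbers: both groups are drawn from one distribution with a mean of 8 and an sd of 0.9), the true difference is exactly 0, and we count how often random variation alone produces a difference of 1 point or more:

```r
set.seed(1)  # for reproducibility

# Under the null-hypothesis, both groups of 10 come from the SAME distribution
diffs <- replicate(10000, mean(rnorm(10, mean = 8, sd = 0.9)) -
                          mean(rnorm(10, mean = 8, sd = 0.9)))

hist(diffs, breaks = 50, xlab = "Difference between group means", main = "")
mean(abs(diffs) >= 1)  # proportion of null samples with a difference of 1 or more
```

Even though the true difference is 0, the simulated differences spread out around it, and a small fraction of the samples show a difference of a full scale point just by chance.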

To answer this question, we need to know what constitutes a reasonable expectation about how much the differences between groups can vary, if we would repeatedly ask samples of groups of friends to rate LOTR. When we compare two groups, we use the means, standard deviations, and number of observations in each group to calculate the t-statistic.
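In words: the t-statistic divides the difference between the means by the pooled standard deviation times √(1/n1 + 1/n2). As a sketch, computed by hand for our ratings (t.test, used further below, does all of this for us):

```r
myfriends <- c(9, 7, 8, 9, 8, 9, 9, 10, 9, 9)
mywifesfriends <- c(9, 6, 7, 8, 7, 9, 8, 8, 8, 7)

n1 <- length(myfriends); n2 <- length(mywifesfriends)

# Pooled variance: weighted average of the two group variances
sp2 <- ((n1 - 1) * var(myfriends) + (n2 - 1) * var(mywifesfriends)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se <- sqrt(sp2 * (1 / n1 + 1 / n2))

t <- (mean(myfriends) - mean(mywifesfriends)) / se
t  # 2.5175, the same value t.test reports
```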

With the means (M), standard deviations (SD), and sample sizes (N) for each of the two groups from the table above, we can examine the probability that the difference between the two groups is larger than 0, given the data we have available. This is done by calculating the t-statistic, which is related to a probability (a p-value) of getting the observed or a more extreme t-statistic in the t-distribution. The t-distribution describes samples drawn from a population. It is similar to the normal distribution (which describes the entire population). Because the t-distribution describes the probability density function for a specific sample, its shape depends on the sample size. The larger the sample, the more similar the t-distribution becomes to the normal distribution. Below, you can see the normal distribution (with M = 0 and SD = 1) in black, and the t-distribution for 10 and 5 degrees of freedom. We can see the t-distribution has somewhat thicker tails than the normal distribution.

x = seq(-4.5, 4.5, length = 200)
y = dnorm(x) # the standard normal distribution (M = 0, SD = 1)
plot(x, y, type="l", col="black", lwd=3)
lines(x, dt(x, df = 10), col="red", lwd=2.5) # t-distribution with 10 df
lines(x, dt(x, df = 5), col="blue", lwd=2.5) # t-distribution with 5 df
legend(-4,0.4, # places a legend at the appropriate place
c("Normal","t(10)", "t(5)"), # puts text in the legend
lty=c(1,1,1), # gives the legend appropriate symbols (lines)
lwd=c(2.5,2.5,2.5),col=c("black", "red","blue")) # format lines and colors
Now that we have a probability density function to compare our t-statistic against, all we need to do is calculate the t-test. Actually, you should first check whether both variables appear to come from a normal distribution and have equal variances, but we're skipping the tests for assumptions here. In R, you use the command below (here, we assume equal variances), which returns the output (in grey):
t.test(myfriends,mywifesfriends, var.equal = TRUE)

     Two Sample t-test

data:  myfriends and mywifesfriends
t = 2.5175, df = 18, p-value = 0.02151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1654875 1.8345125
sample estimates:
mean of x mean of y
      8.7       7.7

The t-value is 2.5175. The t-test returns the probability of finding a difference as extreme as, or more extreme than, the observed difference. We can graph the t-distribution (for df = 18) with a vertical line at t = 2.5175 in R using:
curve(dt(x, df = 18), from = -4.5, to = 4.5, xlab = "t", ylab = "Density", lwd = 3)
abline(v=2.5175, lwd=2)

Let’s recall the definition of a p-value: a p-value is the probability of the observed or more extreme data, assuming the null hypothesis is true. The assumption that the null-hypothesis is true is represented by the t-distribution being centered around 0, which is the t-value if the difference between the two groups is exactly 0. The probability of an outcome as high as, or higher than, the one observed is indicated by the blue area. This is known as a one-tailed probability. It is equivalent to the probability that my friends like the extended directors cut of LOTR more than my wife’s friends.

However, that wasn’t my question. My question was whether my friends and my wife’s friends would differ in how much they liked the LOTR movies. It would have been just as interesting to find my friends liked the movies less than my wife’s friends. Since many of my friends have read the book more than 20 times, and it’s often the case that people like the book better than the movie, an opposite outcome would have been just as likely. Furthermore, there has only been one person in my family who has been to a marathon movie night in the cinema when the third LOTR movie came out, and it wasn’t me. So let’s plot the probability that we found a positive or negative difference as extreme as, or more extreme than, the one observed (the two-tailed probability):

curve(dt(x, df = 18), from = -4.5, to = 4.5, xlab = "t", ylab = "Density", lwd = 3)
abline(v=2.5175, lwd=2)
abline(v=-2.5175, lwd=2)
abline(v=2.100922, col="grey", lty=3)
abline(v=-2.100922, col="grey", lty=3)
I’ve plotted two critical values as vertical grey dotted lines, which indicate the critical t-values for a difference to be statistically different from zero (p < .05) in a t-distribution with 18 degrees of freedom. We can see that the observed difference is more extreme than these critical values. However, we want to know the exact probability of the two blue areas under the curve. The t-test already provided us with this information, but let’s calculate it ourselves. For the left tail, the probability of a value smaller than t = -2.5175 can be found by:

pt(-2.5175, df = 18)

If we would simply change the sign on -2.5175, we would get the probability of a value smaller than 2.5175. What we want is the probability of a value larger than that, which equals:

1 - pt(2.5175, df = 18)

This value is identical to the probability in the other tail, because the t-distribution is symmetric. If we add these two probabilities together, we get p = 0.02151, the same p-value the t-test we performed earlier returned.


What does a p<.05 mean?

If we find a difference which is statistically different from 0 with p < .05, this means that the probability of observing this difference, assuming there is no difference between all my friends and all my wife’s friends, is relatively small. That is, if both groups of friends would enjoy LOTR equally (on average), then the probability of getting the data we have (or data more extreme than the data we have) is small (0.02151 in our previous example). The data we have observed should therefore be considered surprising. That’s all a p-value smaller than 0.05 means.

People, even seasoned researchers, often think p-values mean a lot more. Remember that p-values are a statement about the probability of observing the data, given that the null-hypothesis is true. They are not the probability that the null-hypothesis is true, given the data (we need Bayesian statistics if we want to make such statements). Results can be surprising when the null-hypothesis is true, but even more surprising when the alternative hypothesis is true (which would make you think the null-hypothesis is perhaps the one to go for after all). Moral: since a p-value only takes the null-hypothesis into account, it says nothing about the probability of the data under the alternative hypothesis.

If a p-value is larger than 0.05, the data we have observed are not surprising. This doesn’t mean the null-hypothesis is true. The p-value can only be used to reject the null-hypothesis; it can never be used to accept the null-hypothesis as true.

Finally, a p-value is not the probability that a finding will replicate. It’s not 97.849% likely the observed difference in LOTR evaluations will be replicated. We can easily show this in R by simulating some friends. We will use the observed means and standard deviations of the 10 friends of my wife and me, to draw 10 new friends from a normal distribution with the same means and standard deviations. Let’s also perform a t-test, and plot the results.

myfriends <- rnorm(n = 10, mean = 8.7, sd = 0.82)
mywifesfriends <- rnorm(n = 10, mean = 7.7, sd = 0.95)
t.test(myfriends, mywifesfriends, var.equal = TRUE) #perform the t-test
m1 <- mean(myfriends); sd1 <- sd(myfriends) #means and sds for the legends
m2 <- mean(mywifesfriends); sd2 <- sd(mywifesfriends)
d1 <- density(myfriends)
d2 <- density(mywifesfriends)
plot(range(d1$x, d2$x), range(d1$y, d2$y), type = "n", xlab = "LOTR Rating", ylab = "Density")
legend(5, 0.4, legend=paste("mean=",round(m2,2),"; sd=",round(sd2,2), sep=""))
legend(9.1, 0.4, legend=paste("mean=",round(m1,2),"; sd=",round(sd1,2), sep=""))
polygon(d1, col=rgb(0,0,1,1/4), border="black")
polygon(d2, col=rgb(1,0,0,1/4), border="black")

Run this script 20 times, and note whether the p-value in the outcome of the t-test is smaller than 0.05. You will probably see some variation in the p-values (with p > .05 several times), and quite some variation in the means. This happens because there is a lot of variation in small samples. Now change the number of simulated friends of my wife and me (n = 10 in the first two lines of the script) to 100 simulated friends, and run the adapted script 20 times. You will see that the p-values are very low practically every time you run the script.
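Rather than re-running the script by hand, you can let R do the 20 runs for you (a sketch; replicate repeats the simulation and collects the p-values):

```r
set.seed(42)  # for reproducibility

# 20 simulated replications of the study, keeping only the p-value of each t-test
pvals <- replicate(20, t.test(rnorm(10, mean = 8.7, sd = 0.82),
                              rnorm(10, mean = 7.7, sd = 0.95),
                              var.equal = TRUE)$p.value)
round(pvals, 3)
sum(pvals < 0.05)  # how many of the 20 simulated studies reach p < .05
```

Change the 10s to 100s in the rnorm calls to see the same exercise with the larger samples.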

Look at the means and standard deviations. What do you notice with respect to the means and standard deviations? How is it possible that the t-value becomes more extreme, while the means and standard deviations stay the same (or actually become closer to the true means and standard deviations with increasing sample size)? The answer can be found in the formula for the t-statistic, where (described in general terms) the difference between means is divided by the pooled standard deviation, divided by the square root of the sample size. This √n aspect of the function explains why, all else remaining equal and when there is a real difference to be observed, t-values increase, and p-values decrease, the larger the sample size.
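You can see this √n effect directly by holding the mean difference and pooled standard deviation fixed while increasing the group size (a sketch; 0.89 is roughly the pooled standard deviation of our two groups):

```r
mdiff <- 1   # difference between the group means, held constant
sp <- 0.89   # pooled standard deviation (approximately, for our ratings)

# With equal group sizes n, the standard error is sp * sqrt(2/n),
# so the t-value grows with the square root of n
for (n in c(10, 40, 160)) {
  t <- mdiff / (sp * sqrt(2 / n))
  cat("n per group =", n, ": t =", round(t, 2), "\n")
}
```

Each quadrupling of the sample size doubles the t-value, even though the means and standard deviations never change.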