The 20% Statistician: meta-analysis

Saturday, December 1, 2018

Justify Your Alpha by Decreasing Alpha Levels as a Function of the Sample Size

A preprint ("Justify Your Alpha: A Primer on Two Practical Approaches") that extends and improves the ideas in this blog post is available at: https://psyarxiv.com/ts4r6

Testing whether observed data should surprise us, under the assumption that some model of the data is true, is a widely used procedure in psychological science. Tests against a null model, or against the smallest effect size of interest for an equivalence test, can guide your decisions to continue or abandon research lines. Seeing whether a p-value is smaller than an alpha level is rarely the only thing you want to do, but especially early on in experimental research lines where you can randomly assign participants to conditions, it can be a useful thing.

Regrettably, this procedure is performed rather mindlessly. To do Neyman-Pearson hypothesis testing well, you should think carefully about the error rates you find acceptable. How often do you want to miss the smallest effect size you care about, if it is really there? And how often do you want to say there is an effect, but actually be wrong? It is important to justify your error rates when designing an experiment. In this post I will provide one justification for setting the alpha level (something we recommended makes more sense than using a fixed alpha level).

Papers explaining how to justify your alpha level are very rare (for an example, see Mudge, Baker, Edge, & Houlahan, 2012). Here I want to discuss one of the least known, but easiest suggestions on how to justify alpha levels in the literature, proposed by Good. The idea is simple, and has been supported by many statisticians in the last 80 years: Lower the alpha level as a function of your sample size.

The idea behind this recommendation is most extensively discussed in a book by Leamer (1978, p. 92). He writes:

The rule of thumb quite popular now, that is, setting the significance level arbitrarily to .05, is shown to be deficient in the sense that from every reasonable viewpoint the significance level should be a decreasing function of sample size.

Leamer (you can download his book for free) correctly notes that this behavior, an alpha level that is a decreasing function of the sample size, makes sense from both a Bayesian and a Neyman-Pearson perspective. Let me explain.

Imagine a researcher who performs a study that has 99.9% power to detect the smallest effect size the researcher is interested in, based on a test with an alpha level of 0.05. Such a study also has 99.8% power when using an alpha level of 0.03. Feel free to follow along here, by setting the sample size to 204, the effect size to 0.5, alpha or p-value (upper limit) to 0.05, and the p-value (lower limit) to 0.03.

We see that if the alternative hypothesis is true only 0.1% of the observed studies will, in the long run, observe a p-value between 0.03 and 0.05. When the null-hypothesis is true 2% of the studies will, in the long run, observe a p-value between 0.03 and 0.05. Note how this makes p-values between 0.03 and 0.05 more likely when there is no true effect, than when there is an effect. This is known as Lindley’s paradox (and I explain this in more detail in Assignment 1 in my MOOC, which you can also do here).
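If you prefer R over the online app, the two percentages above can be approximated with a few lines of base R. This is a minimal sketch, assuming the numbers refer to a two-sided independent-samples t-test with 204 participants per group (the variable names are my own and do not come from the app):

```r
# Assumption: two-sided independent-samples t-test, 204 participants per group, d = 0.5
n_per_group <- 204
d <- 0.5

# Power at the two alpha levels (power.t.test is part of base R's stats package)
power_05 <- power.t.test(n = n_per_group, delta = d, sd = 1, sig.level = 0.05)$power
power_03 <- power.t.test(n = n_per_group, delta = d, sd = 1, sig.level = 0.03)$power

# Probability of observing 0.03 < p < 0.05 when the alternative is true
p_between_h1 <- power_05 - power_03

# Under the null, p-values are uniformly distributed, so this is just the width of the interval
p_between_h0 <- 0.05 - 0.03

round(c(H1 = p_between_h1, H0 = p_between_h0), 4)
```

Under these assumptions a p-value between 0.03 and 0.05 should come out far more likely under the null than under the alternative, which is the pattern described above.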

Although you can argue that you are still making a Type 1 error at most 5% of the time in the above situation, I think it makes sense to acknowledge there is something weird about having a Type 1 error of 5% when you have a Type 2 error of 0.1% (again, see Mudge, Baker, Edge, & Houlahan, 2012, who suggest balancing error rates). To me, it makes sense to design a study where error rates are more balanced, and a significant effect is declared for p-values more likely to occur when the alternative model is true than when the null model is true.

Because power increases as the sample size increases, and because Lindley’s paradox (Lindley, 1957, see also Cousins, 2017) can be prevented by lowering the alpha level sufficiently, the idea to lower the significance level as a function of the sample size is very reasonable. But how?

Zellner (1971) discusses how the critical value for a frequentist hypothesis test approaches a limit as the sample size increases (i.e., a critical value of 1.96 for p = 0.05 in a two-sided test) whereas the critical value for a Bayes factor increases as the sample size increases (see also Rouder, Speckman, Sun, Morey, & Iverson, 2009). This difference lies at the heart of Lindley’s paradox, and under certain assumptions comes down to a factor of √n. As Zellner (1971, footnote 19, page 304) writes (K₀₁ is the formula for the Bayes factor):

If a sampling theorist were to adjust his significance level upward as n grows larger, which seems reasonable, z_a would grow with n and tend to counteract somewhat the influence of the √n factor in the expression for K01.

Jeffreys (1939) discusses Neyman and Pearson’s work and writes:

We should therefore get the best result, with any distribution of α, by some form that makes the ratio of the critical value to the standard error increase with n. It appears then that whatever the distribution may be, the use of a fixed P limit cannot be the one that will make the smallest number of mistakes.

He discusses the issue more in Appendix B, where he compared his own test (Bayes factors) against Neyman-Pearson decision procedures, and he notes that:

In spite of the difference in principle between my tests and those based on the P integrals, and the omission of the latter to give the increase of the critical values for large n, dictated essentially by the fact that in testing a small departure found from a large number of observations we are selecting a value out of a long range and should allow for selection, it appears that there is not much difference in the practical recommendations. Users of these tests speak of the 5 per cent. point in much the same way as I should speak of the K = 10^-½ point, and of the 1 per cent. point as I should speak of the K = 10^-1 point; and for moderate numbers of observations the points are not very different. At large numbers of observations there is a difference, since the tests based on the integral would sometimes assert significance at departures that would actually give K > 1. Thus there may be opposite decisions in such cases. But they will be very rare.

So even though extremely different conclusions between Bayes factors and frequentist tests will be rare, according to Jeffreys, when the sample size grows, the difference becomes noticeable.

This brings us to Good’s (1982) easy solution. His paper is basically just a single page (I’d love something akin to a Comments, Conjectures, and Conclusions format in Meta-Psychology! – note that Good himself was the section editor, which started with ‘Please be succinct but lucid and interesting’, and it reads just like a blog post).

He also explains the rationale in Good (1992):

‘we have empirical evidence that sensible P values are related to weights of evidence and, therefore, that P values are not entirely without merit. The real objection to P values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’

Based on the observation by Jeffreys (1939) that, under specific circumstances, the Bayes factor against the null-hypothesis is approximately inversely proportional to √N, Good (1982) suggests a standardized p-value to bring p-values in closer relationship with weights of evidence:

p_stan = min(1/2, p × √(N/100))

This formula standardizes the p-value to the evidence against the null hypothesis that would be found if the p_stan-value was the tail area probability observed in a sample of 100 participants (I think the formula is only intended for between-subjects designs - I would appreciate anyone weighing in in the comments if it can be extended to within-subjects designs). When the sample size is 100, the p-value and p_stan are identical. But for larger sample sizes p_stan is larger than p. For example, a p = .05 observed in a sample size of 500 would have a p_stan of 0.11, which is not enough to reject the null-hypothesis for the alternative. Good (1988) demonstrates great insight when he writes: ‘I guess that standardized p-values will not become standard before the year 2000.’

Good doesn’t give a lot of examples of how standardized p-values should be used in practice, but I guess it makes things easier to think about a standardized alpha level (even though the logic is the same, just like you can double the p-value, or halve the alpha level, when you are correcting for 2 comparisons in a Bonferroni correction). So instead of an alpha level of 0.05, we can think of a standardized alpha level:

α_stan = α / √(N/100)

Again, with 100 participants α and α_stan are the same, but as the sample size increases above 100, the alpha level becomes smaller. For example, an α = .05 in a sample size of 500 would correspond to an α_stan of 0.02236.
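A minimal sketch of both standardizations in R, assuming the √(N/100) scaling implied by the worked examples above (0.11 and 0.02236 for N = 500) and, for the p-value, a cap at 0.5, which is how I read Good (1982). The function names p_stan and alpha_stan are my own labels:

```r
# Standardized p-value: scale p up by sqrt(N/100), capped at 0.5 (assumed cap)
p_stan <- function(p, n) {
  pmin(0.5, p * sqrt(n / 100))
}

# Standardized alpha level: scale alpha down by sqrt(N/100)
alpha_stan <- function(alpha, n) {
  alpha / sqrt(n / 100)
}

p_stan(0.05, 100)         # identical to p when N = 100
p_stan(0.05, 500)         # roughly 0.11
alpha_stan(0.05, 500)     # roughly 0.02236
alpha_stan(0.05, 10000)   # 0.005
```

Comparing p against alpha_stan leads to the same decision as comparing p_stan against alpha, as long as the cap at 0.5 is not reached.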

So one way to justify your alpha level is by using a decreasing alpha level as the sample size increases. I for one have always thought it was rather nonsensical to use an alpha level of 0.05 in all meta-analyses (especially when testing a meta-analytic effect size based on thousands of participants against zero), or in large collaborative research projects such as Many Labs, where analyses are performed on very large samples. If you have thousands of participants, you have extremely high power for most effect sizes original studies could have detected in a significance test. With such a low Type 2 error rate, why keep the Type 1 error rate fixed at 5%, which is so much larger than the Type 2 error rate in these analyses? It just doesn’t make any sense to me. Alpha levels in meta-analyses or large-scale data analyses should be lowered as a function of the sample size. In case you are wondering: an alpha level of .005 would be used when the sample size is 10,000.

When designing a study based on a specific smallest effect size of interest, where you desire to have decent power (e.g., 90%), we run into a small challenge because in the power analysis we now have two unknowns: The sample size (which is a function of the power, effect size, and alpha), and the standardized alpha level (which is a function of the sample size). Luckily, this is nothing that some R-fu can’t solve by some iterative power calculations. [R code to calculate the standardized alpha level, and perform an iterative power analysis, is at the bottom of the post]
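To give an idea of what such an iterative calculation could look like, here is a rough sketch in base R. It is not the script referred to above. It assumes a two-sided independent-samples t-test, assumes that N in the standardization is the total sample size across both groups, and uses hypothetical helper names (alpha_stan, iterate_n):

```r
# Standardized alpha level, assuming the sqrt(N/100) scaling
alpha_stan <- function(alpha, n) alpha / sqrt(n / 100)

# Repeat the power analysis until the sample size no longer changes.
# Assumptions: independent-samples t-test, N = total sample size (both groups).
iterate_n <- function(d, power = 0.90, alpha = 0.05, max_iter = 100) {
  n_total <- 100  # starting value, where alpha_stan equals alpha
  for (i in seq_len(max_iter)) {
    a <- alpha_stan(alpha, n_total)
    n_per_group <- ceiling(
      power.t.test(delta = d, sd = 1, sig.level = a, power = power)$n
    )
    n_new <- 2 * n_per_group
    if (n_new == n_total) break  # sample size and alpha level now agree
    n_total <- n_new
  }
  list(n_per_group = n_total / 2,
       n_total = n_total,
       alpha = alpha_stan(alpha, n_total))
}

res <- iterate_n(d = 0.5)
res
```

Each pass lowers the alpha level for the current sample size, which in turn increases the required sample size a little, until the two settle on values that are consistent with each other.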

When we wrote Justify Your Alpha (I recommend downloading the original draft before peer review because it has more words and more interesting references) one of the criticisms I heard the most is that we gave no solutions for how to justify your alpha. I hope this post makes it clear that statisticians have argued that the alpha level should not be any fixed value ever since it was invented. There are already some solutions available in the literature. I like Good’s approach because it is simple. In my experience, people like simple solutions. It might not be a full-fledged decision theoretical cost-benefit analysis, but it beats using a fixed alpha level. I recently used it in a submission for a Registered Report. At the same time, I think it has never been used in practice, so I look forward to any comments, conjectures, and conclusions you might have.

References

Good, I. J. (1982). C140. Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16(1), 65–66. https://doi.org/10.1080/00949658208810607
Good, I. J. (1988). The interface between statistics and philosophy of science. Statistical Science, 3(4), 386–397.
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. https://doi.org/10.2307/2290192
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data (1st ed.). New York: Wiley.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.

Sunday, October 11, 2015

Practicing Meta-Analytic Thinking Through Simulations

People find it difficult to think about random variation. Our mind is more strongly geared towards recognizing patterns than randomness. In this blogpost, you can learn what random variation looks like, how to reduce it by running well-powered studies, and how to meta-analyze multiple small studies. This is a long read, and most educational if you follow the assignments. You'll probably need about an hour.

We'll use R, and the R script at the bottom of this post (or download it from GitHub). Run the first section (sections are differentiated by # # # #) to install the required packages and change some settings.

IQ tests have been designed such that the mean IQ of the entire population of adults is 100, with a standard deviation of 15. This will not be true for every sample we draw from the population. Let’s get a feel for what the IQ scores from a sample look like. Which IQ scores will people in our sample have?

Assignment 1

We will start by simulating a random sample of 10 individuals. Run the script in the section #Assignment 1. Both the mean and the standard deviation differ from their true values in the population. Simulate some more samples of 10 individuals and look at the means and SDs. They differ quite a lot. This type of variation is perfectly normal in small samples of 10 participants. See below for one example of a simulated sample.

Let’s simulate a larger sample, of 100 participants by changing the n=10 in line 23 of the R script to n = 100 (remember R code is case-sensitive).

We are slowly seeing what is known as the normal distribution. This is the well-known bell-shaped curve that represents the distribution of many variables in scientific research (although some other types of distributions are quite common as well). The mean and standard deviation are much closer to the true mean and standard deviation, and this is true for most of the simulated samples. Simulate at least 10 samples with n = 10, and 10 samples with n = 100. Look at the means and standard deviations. Let’s simulate one really large sample, of 1000 people (run the code, changing n=10 to n=1000). The picture shows one example.

Not every simulated study of 1000 people will yield the true mean and standard deviation, but this one did. And although the distribution is very close to a normal distribution, even with 1000 people it is not perfect.
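If you want to see this pattern in numbers rather than pictures, the following sketch (base R only, with an arbitrary seed and a hypothetical helper sim_means) draws 1000 samples at each sample size and looks at how much the sample means vary:

```r
set.seed(2015)  # arbitrary seed, only so the run is reproducible

# Draw many samples of size n from the IQ population and return their means
sim_means <- function(n, n_samples = 1000) {
  replicate(n_samples, mean(rnorm(n, mean = 100, sd = 15)))
}

sample_sizes <- c(10, 100, 1000)

# Standard deviation of the sample means for each sample size
spread <- sapply(sample_sizes, function(n) sd(sim_means(n)))
names(spread) <- paste0("n=", sample_sizes)
round(spread, 2)

# For comparison: the theoretical standard error, 15 / sqrt(n)
round(15 / sqrt(sample_sizes), 2)
```

The spread of the means should shrink roughly with the square root of the sample size, which is why the larger samples look so much more stable.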

The accuracy with which you can measure the IQ in a population is easy to calculate when you know the standard deviation, and the long-run probability of making an error that you are willing to accept. If you choose a 95% confidence interval, and want to estimate IQ within an error range of 2 IQ points, you first convert the 95% confidence interval to a Z-score (1.96), and use the formula:

N = (Z * SD/error)²

In this example, (1.96*15/2)²= 216 people (rounded down). Feel free to check by running the code with n = 216 (remember that this is a long term average!)
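The same calculation in R, as a small sketch (the variable names are mine). It uses qnorm to get the Z-score instead of typing 1.96 by hand:

```r
conf_level <- 0.95
sd_iq <- 15    # population standard deviation
error <- 2     # desired margin of error in IQ points

# Z-score for a two-sided 95% interval (about 1.96)
z <- qnorm(1 - (1 - conf_level) / 2)

# N = (Z * SD / error)^2
n_exact <- (z * sd_iq / error)^2
n_exact
floor(n_exact)  # rounded down, as in the text
```

Rounding up instead of down would be the more conservative choice if you want the margin of error to be at most 2 points.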

In addition to planning for accuracy, you can plan for power. The power of a study is the probability of observing a statistically significant effect, given that there is a true effect to be found. It depends on the effect size, the sample size, and the alpha level.

We can simulate experiments, and count how many statistically significant results are observed, to see how much power we have. For example, when we simulate 100,000 studies, and 50% of the studies reveal a p-value smaller than 0.05, this means the power of our study (given a specific effect size, sample size, and alpha-level) is 50%.

We can use the code in the section of Assignment 2. Running this code will take a while. It will simulate 100000 experiments, where 10 participants are drawn from a normal distribution with the mean of 110, and a SD of 15. To continue our experiment, let’s assume the numbers represent measured IQ, which is 110 in our samples. For each simulated sample, we test whether the effect differs from an IQ of 100. In other words, we are testing whether our sample is smarter than average.

The program returns all p-values, and it will return the power, which will be somewhere around 47%. It will also yield a plot of the p-values. The first bar is the count of all p-values smaller than 0.05, so all statistically significant p-values. The percentage of p-values in this single bar visualizes the power of the study.

Instead of simulating the power of the study, you can also perform power calculations in R (see the code at the end of assignment 2). To calculate the power of a study, we need the sample size (in our case, n = 10), the alpha level (in our case, 0.05), and the effect size, which for a one-sample t-test is Cohen’s d, which can be calculated as d = (X-μ)/σ, or (110-100)/15 = 0.6667.
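A compact sketch of both approaches using only base R. Note the assumptions: it uses 10,000 simulations instead of 100,000 to keep the run short, an arbitrary seed, and power.t.test from the stats package as a stand-in for the pwr package:

```r
set.seed(1)      # arbitrary seed
n_sims <- 10000  # fewer simulations than in the text, to keep it fast
n <- 10

# Simulate one-sample t-tests against mu = 100 when the true mean is 110
p_values <- replicate(n_sims, t.test(rnorm(n, mean = 110, sd = 15), mu = 100)$p.value)

# Simulated power: proportion of significant results
power_sim <- mean(p_values < 0.05)

# Analytic power for the same design (d = 10 / 15 = 0.6667)
power_analytic <- power.t.test(n = n, delta = 10, sd = 15, sig.level = 0.05,
                               type = "one.sample")$power

round(c(simulated = power_sim, analytic = power_analytic), 3)

# The first bar of this histogram contains the significant p-values
hist(p_values, breaks = 20, main = "P-value distribution", xlab = "p-value")
```

The two estimates should agree closely, with the simulated value wobbling a little from run to run.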

Assignment 2

Using the simulation and the pwr package, examine what happens with the power of the experiments when the sample size is increased to 20. How does the p-value distribution change?

Using the simulation and the pwr package, examine what happens with the power of the experiments when the mean in the sample changes from 110 to 105 (set the sample size to 10 again). How does the p-value distribution change?

Using the simulation and the pwr package, examine what happens with the power of the experiments when the mean in the sample is set to 100 (set the sample size to 10 again). Now, there is no difference between the sample and the average IQ. How does the p-value distribution change? Can we formally speak of ‘power’ in this case? What is a better name in this specific situation?

Variance in two groups, and their difference.

Now, assume we have a new IQ training program that will increase people's IQ scores by 6 points. People in condition 1 are in the control condition – they do not get IQ training. People in condition 2 get IQ training. Let’s simulate 10 people in each group, assuming the IQ in the control condition is 100, and in the experimental group is 106 (the SD is still 15 in each group) by running the code for Assignment 3.

The graph you get will look like a version of the one below. The means and SD for each sample drawn are provided in the graph (control condition on the left, experimental condition on the right).

The two groups differ in how close they are to their true means, and as a consequence, the difference between groups varies as well. Note that this difference is the main variable in statistical analyses when comparing two groups. Run at least 10 more simulations to look at the data pattern.

Assignment 3

Compared to the one-sample case above, we now have 2 variable group means, and two variable standard deviations. If we perform a power analysis, how do you think this additional variability will influence the power of our test? In other words, for the exact same effect size (e.g., 0.6667), will the power of our study remain the same, will it increase, or will it decrease?

Test whether your intuition was correct or not by running this power analysis for an independent samples t-test:

pwr.t.test(d=0.6667,n=10,sig.level=0.05,type="two.sample",alternative="two.sided")

In dependent samples, the mean in one sample correlates with the mean in the other sample. This reduces the amount of variability in the difference scores. If we perform a power analysis, how do you think this will influence the power of our test?

Effect size calculations for dependent samples are influenced by the correlation between the means. If this correlation is 0.5, the effect size calculation for the dependent case and the independent case is identical. But the power for a dependent t-test will be identical to the power in a one-sample t-test.

Verify this by running the power analysis for a dependent samples t-test, with a true effect size of 0.6667, and compare the power with the same power analysis for a one-sample t-test we performed above:

pwr.t.test(d=0.6667,n=10,sig.level=0.05,type="paired",alternative="two.sided")
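If you do not have the pwr package installed, the three power analyses can be compared side by side with power.t.test from base R's stats package. This is a sketch under the assumption that both functions are fed the same standardized effect size; the results should be essentially the same as those from pwr.t.test:

```r
d <- 0.6667
n <- 10  # per group for the two-sample test, number of pairs for the paired test

# Helper that returns the power for a given design type
power_for <- function(type) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
               type = type, alternative = "two.sided")$power
}

power_one    <- power_for("one.sample")
power_two    <- power_for("two.sample")
power_paired <- power_for("paired")

round(c(one.sample = power_one, two.sample = power_two, paired = power_paired), 3)
```

The paired and one-sample designs give the same power, while the independent-samples design has lower power for the same effect size and n.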

Variation across studies

Up until now, we have talked about the variation of data points within a single study. It is clear that the larger the sample size, the more the observed difference (in the case of two means) or the more the observed correlation (in the case of two related variables) mirrors the true difference or correlation. We can calculate the variation in the effects we are interested in directly. Both correlations and mean differences are effect sizes. Because mean differences are difficult to compare across studies that use different types of measures to examine an effect, or different scales to measure differences on, whenever multiple effect sizes are compared researchers often use standardized effect sizes. In this example, we will focus on Cohen’s d, which provides the standardized mean difference.

As explained in Borenstein, Hedges, Higgins, & Rothstein (2009) a very good approximation of the variance of d is provided by:

V_d = (n1 + n2)/(n1 × n2) + d²/(2 × (n1 + n2))

This formula shows that the variance of d depends only on the sample size and the value of d itself.
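As a sketch in R, assuming the usual large-sample approximation V_d = (n1 + n2)/(n1 × n2) + d²/(2 × (n1 + n2)), with n1 and n2 the sample sizes of the two groups (the function name var_d is my own):

```r
# Approximate sampling variance of Cohen's d for two independent groups
var_d <- function(d, n1, n2) {
  (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
}

# Same effect size, small versus large study
v_small <- var_d(d = 0.4, n1 = 20, n2 = 20)
v_large <- var_d(d = 0.4, n1 = 200, n2 = 200)

# Standard errors and approximate 95% confidence intervals around d = 0.4
se <- sqrt(c(small = v_small, large = v_large))
se
rbind(lower = 0.4 - qnorm(0.975) * se,
      upper = 0.4 + qnorm(0.975) * se)
```

With 20 participants per group the interval around d = 0.4 is wide enough to include zero; with 200 per group it is not.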

Single study meta-analysis

Perhaps you remember that whenever the 95% confidence interval around an effect size estimate excludes zero, the effect is statistically significant. When you want to test whether effect sizes across a number of studies differ from 0, you have to perform what is known as a meta-analysis. In essence, you perform an analysis over analyses. You first analyze individual studies, and then analyze the set of effect sizes you calculated from each individual study. To perform a meta-analysis, all you need are the effect sizes and the sample size of each individual study.

Let’s first begin with something you will hardly ever do in real life: a meta-analysis of a single study. This is a little silly, because a simple t-test or correlation will tell you the same thing – but that’s educational to see.

We will simulate one study examining our IQ training program. The IQ in the control condition has M = 100, SD = 15, and in the experimental condition the average IQ has improved to M = 106, SD = 15. We will randomly select the sample size, and draw between 20-50 participants in each condition.

Our simulated results for a single simulation (see the code below) for the control condition give M = 97.03, and for the experimental condition give M = 107.52. The difference (of the experimental condition – the control condition, so higher scores mean better performance in the experimental condition) is statistically significant, t(58) = 2.80, p = 0.007. The effect size Hedges’ g = 0.71. This effect size overestimates the true effect size substantially. The true effect size is d = 0.4 – calculate this for yourself.

Run the code in assignment 6 (I'm skipping some parts I do use in teaching - feel free to run that code to explore variation in correlations) to see the data. Remove the # in front of the set.seed line to get the same result as in this example.

Assignment 6

If we perform a meta-analysis, we get almost the same result - the calculations used by the meta package differ slightly (although it will often round to the same 2 digits after the decimal point), because it uses a different (Wald) type of test and confidence interval – but that’s not something we need to worry about here.

Run the simulation a number of times to see the variation in the results, and the similarity between the meta-analytic result and the t-test.

The meta-analysis compares the meta-analytic effect size estimate (which in this example is based on a single study) to zero, and tests whether the difference from zero is statistically significant. We see the estimated effect size g = 0.7144, a 95% CI, and a z-score (2.7178), which is the test statistic for which a p-value can be calculated. The p-value of 0.0066 is very similar to that observed in the t-test.

                          95%-CI      z p-value
0.7143703 [0.1992018; 1.2295387] 2.7178  0.0066
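The z-score and p-value in this output follow directly from the estimate and its confidence interval. A small sketch in R, assuming the interval is a symmetric Wald interval (estimate ± 1.96 × SE), recovers the standard error and reproduces the test:

```r
g  <- 0.7143703
ci <- c(0.1992018, 1.2295387)

# Recover the standard error from the width of the 95% interval
se <- (ci[2] - ci[1]) / (2 * qnorm(0.975))

# Wald test of the estimate against zero
z <- g / se
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

round(c(se = se, z = z, p = p), 4)
```

This is also why a 95% interval that excludes zero goes together with p < 0.05: both are based on the same estimate and standard error.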

Meta-analyses are often visualized using forest plots. We see a forest plot summarizing our single test below:

In this plot we see a number (1) for our single study. The effect size (0.71), which is Hedges's g, the unbiased estimate of Cohen's d, and the confidence interval [0.2; 1.23] are presented on the right. The effect size and confidence interval are also visualized: the effect size is shown by the orange square (the larger the sample size, the bigger the square), and the length of the line running through it shows the 95% confidence interval.

A small-scale meta-analysis

Meta-analyses are made to analyze more than one study. Let’s analyze 4 studies, with different effect sizes (0.44, 0.7, 0.28, 0.35) and sample sizes (60, 35, 23, and 80 in one condition, and 60, 36, 25, and 80 in the other).

Researchers have to choose between a fixed effect model or a random effects model when performing a meta-analysis.

Fixed effect models assume a single true effect size underlies all the studies included in the meta-analysis. Fixed effect models are therefore only appropriate when all studies in the meta-analysis are practically identical (e.g., use the same manipulation) and when researchers do not want to generalize to different populations (Borenstein, Hedges, Higgins, & Rothstein, 2009).

By contrast, random effects models allow the true effect size to vary from study to study (e.g., due to differences in the manipulations between studies). Note the difference between fixed effect and random effects (plural, meaning multiple effects). Random effects models therefore are appropriate when a wide range of different studies is examined and there is substantial variance between studies in the effect sizes. Since the assumption that all effect sizes are identical is implausible in most meta-analyses, random effects meta-analyses are generally recommended (Borenstein et al., 2009).

The meta-analysis in this assignment, where we have simulated studies based on exactly the same true effect size, and where we don’t want to generalize to different populations, is one of the rare examples where a fixed effect meta-analysis would be appropriate – but for educational purposes, I will only show the random effects model. When variation in effect sizes is small, both models will give the same results.

In a meta-analysis, a weighted mean is computed. The reason studies are weighted when calculating the meta-analytic effect size is that larger studies are considered to be more accurate estimates of the true effect size (as we have seen above, this is true in general). Instead of simply averaging over an effect size estimate from a study with 20 people in each condition, and an effect size estimate from a study with 200 people in each condition, the larger study is weighted more strongly when calculating the meta-analytic effect size.

R makes it relatively easy to perform a meta-analysis by using the meta or metafor package. Run the code related to Assignment 7. We get the following output, where we see four rows (one for each study), the effect sizes and 95% CI for each effect, and the %W (random), which is the relative weight for each study in a random effects meta-analysis.

                  95%-CI %W(random)
1 0.44 [ 0.0802; 0.7998]      30.03
2 0.70 [ 0.2259; 1.1741]      17.30
3 0.28 [-0.2797; 0.8397]      12.41
4 0.35 [ 0.0392; 0.6608]      40.26

Number of studies combined: k=4

                                      95%-CI      z  p-value
Random effects model 0.4289 [0.2317; 0.6261] 4.2631 < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.97]; I^2 = 0% [0%; 74.2%]

Test of heterogeneity:
    Q d.f. p-value
 1.78    3  0.6194

The line below the summary gives us the statistics for the random effects model. First, the meta-analytic effect size estimate (0.43) with the 95% CI [0.23; 0.63], and the associated z-score and p-value. Based on the set of studies we simulated here, we would conclude it looks like there is a true effect.
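To see where these numbers come from, the weighted mean can be computed by hand. The sketch below uses base R and the large-sample variance approximation for d from Borenstein et al. (2009). It is an inverse-variance (fixed effect) calculation, which I assume coincides with the random effects result here because the estimated between-study variance (tau^2) is 0. The meta package uses a slightly different variance, so expect small differences in the later decimals:

```r
d  <- c(0.44, 0.70, 0.28, 0.35)
n1 <- c(60, 35, 23, 80)
n2 <- c(60, 36, 25, 80)

# Approximate variance of each d, and inverse-variance weights
v <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
w <- 1 / v
rel_w <- 100 * w / sum(w)  # relative weights in percent

# Weighted mean effect size, its standard error, 95% CI, and z-test
pooled <- sum(w * d) / sum(w)
se <- sqrt(1 / sum(w))
ci <- pooled + c(-1, 1) * qnorm(0.975) * se
z <- pooled / se
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Q-statistic for heterogeneity, tested against chi-square with k - 1 df
Q <- sum(w * (d - pooled)^2)
p_Q <- pchisq(Q, df = length(d) - 1, lower.tail = FALSE)

round(rel_w, 2)
round(c(pooled = pooled, lower = ci[1], upper = ci[2], z = z, Q = Q, p_Q = p_Q), 4)
```

The largest study (80 per condition) gets the most weight and the smallest study the least, which is exactly the pattern in the %W column.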

The same information is visualized in a forest plot:

The meta-analysis also provides statistics for heterogeneity. Tests for heterogeneity examine whether there is large enough variation in the effect sizes included in the meta-analysis to assume there might be important moderators of the effect. For example, assume studies examine how happy receiving money makes people. Half of the studies gave people around 10 euros, while the other half of the studies gave people 100 euros. It would not be surprising to find both these manipulations increase happiness, but 100 euros does so more strongly than 10 euros. Many manipulations in psychological research differ similarly in their strength. If there is substantial heterogeneity, researchers should attempt to examine the underlying reason for this heterogeneity, for example by identifying subsets of studies, and then examining the effect in these subsets. In our example, there does not seem to be substantial heterogeneity (the test for heterogeneity, the Q-statistic, is not statistically significant).

Assignment 7

Play around with the effect sizes and sample sizes in the 4 studies in our small meta-analysis. What happens if you increase the sample sizes? What happens if you make the effect sizes more diverse? What happens when the effect sizes become smaller (e.g., all effect sizes vary a little bit around d = 0.2)? Look at the individual studies. Look at the meta-analytic effect size.

Simulating small studies

Instead of typing in specific numbers for every meta-analysis, we can also simulate a number of studies with a specific true effect size. This is quite informative, because it will show how much variability there is in small, underpowered studies. Remember that many studies in psychology are small and underpowered.

In this simulation, we will randomly draw data from a normal distribution for two groups. There is a real difference in means between the two groups. Like above, the IQ in the control condition has M = 100, SD = 15, and in the experimental condition the average IQ has improved to M = 106, SD = 15. We will simulate between 20 and 50 participants in each condition (and thus create a ‘literature’ that consists primarily of small studies).

You can run the code we have used above (for a single meta-analysis) to simulate 8 studies, perform the meta-analysis, and create a forest plot. The code for Assignment 8 is the same as earlier; we just changed the nSims=1 to nSims=8.

The forest plot of one of the random simulations looks like:

The studies show a great deal of variability, even though the true difference between both groups is exactly the same in every simulated study. Only 50% of the studies reveal a statistically significant effect, but the meta-analysis provides clear evidence for the presence of a true effect in the fixed-effect model (p < 0.0001):

                     95%-CI %W(fixed) %W(random)
1 -0.0173 [-0.4461; 0.4116]     14.47      13.83
2 -0.0499 [-0.5577; 0.4580]     10.31      11.16
3  0.6581 [ 0.0979; 1.2183]      8.48       9.74
4  0.5806 [ 0.0439; 1.1172]      9.24      10.35
5  0.3104 [-0.1693; 0.7901]     11.56      12.04
6  0.4895 [ 0.0867; 0.8923]     16.40      14.87
7  0.7362 [ 0.3175; 1.1550]     15.17      14.22
8  0.2278 [-0.2024; 0.6580]     14.37      13.78

Number of studies combined: k=8

                                      95%-CI      z  p-value
Fixed effect model   0.3624 [0.1993; 0.5255] 4.3544 < 0.0001

Assignment 8

Pretend these would be the outcomes of studies you actually performed. Would you have continued to test your hypothesis in this line of research after study 1 and 2 showed no results?

Simulate at least 10 small meta-analyses. Look at the pattern of the studies, and how much they vary. Look at the meta-analytic effect size estimate. Does it vary, or is it more reliable? What happens if you increase the sample size? For example, instead of choosing samples between 20 and 50 [SampleSize<-sample(20:50, 1)], choose samples between 100 and 150 [SampleSize<-sample(100:150, 1)].

Meta-Analysis, not Miracles

Some people are skeptical about the usefulness of meta-analysis. It is important to realize what meta-analysis can and can’t do. Some researchers argue meta-analyses are garbage-in, garbage-out. If you calculate the meta-analytic effect size of a bunch of crappy studies, the meta-analytic effect size estimate will also be meaningless. It is true that a meta-analysis cannot turn bad data into a good effect size estimate. Similarly, meta-analytic techniques that aim to address publication bias (not discussed in this blog post) can never provide certainty about the unbiased effect size estimate.

However, meta-analysis does more than just provide a meta-analytic effect size estimate that is statistically different from zero or not. It allows researchers to examine the presence of bias, and the presence of variability. These analyses might allow researchers to identify different subsets of studies, some stronger than others. Very often, a meta-analysis will provide good suggestions for future research, such as large scale tests of the most promising effect under investigation.

Meta-analyses are not always performed with enough attention to detail (e.g., Lakens, Hilgard, & Staaks, 2015). It is important to realize that a meta-analysis has the potential to synthesize a large set of studies, but the extent to which a meta-analysis successfully achieves this is open for discussion. For example, it is well-known that researchers on opposite sides of a debate (e.g., concerning the question whether aggressive video games do or do not lead to violence) can publish meta-analyses reaching opposite conclusions. This is obviously undesirable, but it points towards the many degrees of freedom in choosing which articles to include in the meta-analysis, as well as in other choices that are made throughout the meta-analysis.

Nevertheless, meta-analyses can be very useful. First of all, small-scale meta-analyses can actually mitigate publication bias, by allowing researchers to publish individual studies that show statistically significant effects alongside studies that do not, while the overall meta-analytic effect size provides clear support for a hypothesis. Second, meta-analyses provide us with a best estimate (or a range of best estimates, given specific assumptions of bias) of the size of effects, or the variation in effect sizes depending on specific aspects of the performed studies, which can inspire future research.

That’s a lot of information about variation in single studies, variation across studies, meta-analyzing studies, and performing power analyses to design studies that have a high probability of showing a true effect, if it’s there! I hope this is helpful in designing studies and evaluating their results.

Saturday, September 12, 2015

Researchers who don't share their file-drawer for a meta-analysis

I’ve been reviewing a number of meta-analyses in the last few months, and want to share a problematic practice I’ve noticed. Many researchers do not share unpublished data when colleagues who are performing a meta-analysis send around requests for unpublished datasets.

It’s like these colleagues, standing right in front of a huge elephant in the room, say: “Elephant? Which elephant?” Please. We can all see the elephant, and do the math. If you have published multiple studies on a topic (as many of the researchers who have become associated with specific lines of research have) it is often very improbable that you have no file-drawer.

If a meta-analytic effect size suggests an effect of d = 0.4 (not uncommon), and you contribute 5 significant studies with 50 participants in each between-subjects condition (a little optimistic sample size perhaps, but ok) you had approximately 50% power. If there is a true effect, finding a significant effect five times in a row happens, but only 0.5*0.5*0.5*0.5*0.5 = 3.125% of the time. The probability that someone contributes 10 significant, but no non-significant, studies to a meta-analysis is about 0.1% if they had 50% power. Take a look at some published meta-analyses. This happens. These are the people I’m talking about.
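These numbers are easy to check yourself. The sketch below uses power.t.test from base R's stats package, and assumes a two-sided independent-samples t-test with an alpha of 0.05 (the post does not state the test explicitly, so that part is an assumption).

```r
# Power for d = 0.4 with 50 participants per condition
# (assumed: two-sided independent-samples t-test, alpha = 0.05)
pw <- power.t.test(n = 50, delta = 0.4, sd = 1, sig.level = 0.05,
                   type = "two.sample", alternative = "two.sided")$power
print(pw)  # close to 0.5

# Probability that k independent studies are ALL significant
# when every study has 50% power
p_five <- 0.5^5    # 0.03125
p_ten  <- 0.5^10   # 0.0009765625
print(c(five_in_a_row = p_five, ten_in_a_row = p_ten))

# Same calculation with the computed power instead of a rounded 50%
print(c(five_in_a_row = pw^5, ten_in_a_row = pw^10))
```

The exact power is slightly above 50%, so the probabilities in the last line are a little higher than the rounded ones, but the conclusion does not change.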

I think we should have a serious discussion about how we are letting it slide when researchers don't share their file drawer when they get a request from colleagues who plan to do a meta-analysis. Not publishing these studies in the first place was clearly undesirable, but it also was pretty difficult in the past. But meta-analyses are a rare exception when non-significant findings can enter the literature. Not sharing your file-drawer, when colleagues specifically ask for it, is something that rubs me the wrong way.

Scientists who do not share their file drawer are like people who throw their liquor bottle on a bicycle lane (yeah, I’m Dutch, we have bicycle lanes everywhere, and sometimes we have people who drop liquor bottles on them). First of all, you are being wasteful by not recycling data that would make the world a better (more accurate) place. Second, you can be pretty sure that every now and then, some students on their way to a PhD will drive through your glass and get a flat tire. The delay is costly. If you don’t care about that, I don’t like you.

If you don’t contribute non-significant studies, not only are you displaying a worrisome lack of interest in good science, and a very limited understanding of the statistical probability of finding only significant results, but you are actually making the meta-analysis less believable. When people don’t share non-significant findings, the alarm bells for every statistical technique to test for publication bias will go off. Techniques that estimate the true effect size while correcting for publication bias (like meta-regression or p-curve analysis) will be more likely to conclude there is no effect. So not only will you be clearly visible as a person who does not care about science, but you are shooting yourself in the foot if your goal is to make sure the meta-analysis reveals an effect size estimate significantly larger than 0.

I think this is something we need to deal with if we want to improve meta-analyses. A start could be to complement trim-and-fill analyses (which test for missing studies) with a more careful examination of which researchers are not contributing their missing studies to the meta-analysis. It might be a good idea to send these people an e-mail when you have identified them, to give them the possibility to decide whether, on second thought, it is worth the effort to locate and share their non-significant studies.

Thursday, August 27, 2015

Power of replications in the Reproducibility Project

The Open Science Collaboration has completed 100 replication studies of findings published in the scientific literature, and the results are available. The replicated studies have become much more likely to be true, but we are left with some questions about what it means that many studies did not replicate. This is a very rich dataset, and although there can be many reasons a finding does not replicate, I wanted to examine one concern. Studies in the Reproducibility Project were well powered for the effect sizes observed in the original studies. But we know effect sizes in the published literature are often overestimated. So is it possible that most of the replication studies that did not yield significant results actually examined much smaller effects, and thus lacked power?

The table below (from the article in Science) summarizes some of the results. There is a nice range of interpretations (even though I'll focus a lot on the p < 0.05 criterion in this post). The probability of observing a statistically significant effect, if there is an effect to be found, depends on the statistical power of a study. The ‘average replication power’ provides estimates of the statistical power of the studies, assuming the effect size estimate in the original study was exactly the true effect size.

As the Open Science Collaboration (including myself) write: “On the basis of only the average replication power of the 97 original, significant effects [M = 0.92, median (Mdn) = 0.95], we would expect approximately 89 positive results in the replications if all original effects were true and accurately estimated.”

With 35 significant effects out of 89, we get a replication rate of roughly 39%. But we have very good reasons to believe that not all original effect sizes were accurately estimated, and that the average power of replications was lower (Shravan Vasishth called this 'power inflation' earlier today). And when the average power is lower, fewer findings are expected to replicate, which means the replication success is relatively higher (i.e., instead of 35 out of 89, 35 out of some number lower than 89 replicated).
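To make this reasoning concrete, here is a small sketch of how the expected number of successful replications, and the replication success relative to that expectation, change with the average power. Only the 0.92 comes from the article; the lower power values are hypothetical and chosen purely for illustration.

```r
n_original   <- 97   # original significant effects
n_replicated <- 35   # significant effects in the replications

# 0.92 is the reported average replication power;
# the other values are hypothetical, for illustration only
avg_power <- c(0.92, 0.80, 0.60, 0.40)

# Expected number of significant replications if all effects are true
expected <- n_original * avg_power

# Observed successes relative to what the power allows us to expect
relative_success <- n_replicated / expected

print(data.frame(avg_power,
                 expected = round(expected, 1),
                 relative_success = round(relative_success, 2)))
```

The lower the true average power, the fewer significant replications we should expect, and the better 35 successes look in comparison.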

When there is severe publication bias, effect sizes are overestimated. We can examine whether there is publication bias in the original studies in a meta-analysis (below, I follow one meta-analysis of the data analysis team and look at studies which reported t-tests and F-tests, 73 out of the 100). Effect sizes observed in studies should be independent of their standard errors, but when there is publication bias, they are not. There is a funnel plot of these 73 original studies on the OSF, but I prefer contour-enhanced funnel plots, which I made by first running the (absolutely amazing - I'm serious, check out the work they put into this R script!) masterscript for the data analysis, and then running:

funnel(res, level=c(90, 95, 99), shade=c("white", "gray", "darkgray"), refline=0, main = "Funnel plot based on original studies")

A contour-enhanced funnel plot makes it more strikingly clear that almost all original studies observed a statistically significant effect. This is surprising, given that sample sizes were much smaller than in replication attempts (and the replication studies had 92% power, based on the original effect sizes). This is also clear from the distribution of the effects – small studies (with large standard errors, on the bottom of the plot) have large effect sizes (because otherwise they would not be statistically significant), while larger studies (at the top) have smaller effect sizes (but still just large enough to be statistically significant, or fall outside of the white triangle).

A trim and fill analysis is often used to examine whether there are missing studies. Now we are grouping together 73 completely different and highly heterogeneous effects, so the following numbers should be interpreted in light of huge heterogeneity, but we can perform this analysis using:

taf <- trimfill(res)
taf
funnel(taf, level=c(90, 95, 99), shade=c("white", "gray", "darkgray"), refline=0, main = "Trim and Fill funnel plot based on original studies")

Trim-and-fill analysis can only be used as a sensitivity analysis (it does not provide accurate effect sizes or estimates of the actual number of missing studies), but it clearly shows studies are missing (there are 29 white dots in the trim-and-fill funnel plot, which represent the studies assumed to be missing), and reports a meta-analytic effect size estimate of r = 0.28 (instead of r = 0.42) based on these hypothetical missing studies. This does not mean r = 0.28 is the true effect size, but it’s probably close (a meta-analysis of meta-analyses estimated the average effect size in psychology at r = 0.21 – so we might be in the ballpark).

The difference between the biased and unbiased effect size is substantial, and this means power could very reasonably be somewhat lower than 0.92. There’s not much the Reproducibility Project could do about publication bias (e.g., there is no foolproof statistical technique to obtain unbiased effect size estimates). The solution should come from us: We should publish all our effects, regardless of their significance level. If we don’t, we are sabotaging cumulative science.

However: power only matters when there is a true effect. An unknown percentage of studies did not replicate, because they were originally a false positive, and there simply is no true effect to be found (i.e., the true effect size is 0). It is difficult to tease apart failed replications due to low power, and failed replications because the original studies were false positives, and again, this is a very heterogeneous set of studies. But a look at the p-value distribution is interesting, which we can plot with:

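# replication p-values for studies where both the original and the replication p-value are available
pdist<-MASTER$T_pval_USE..R.[!is.na(MASTER$T_pval_USE..O.) & !is.na(MASTER$T_pval_USE..R.)]
hist(pdist, breaks=20) # request 20 bins (width 0.05), so all p < 0.05 fall in the left-most bin
abline(h=3.4, lty = 3, col = "gray60") # expected count per bin if all 64 non-significant results were null effects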

The histogram is divided into 20 bins, and the frequency of p-values in each bin is plotted. This means all significant results (p < 0.05) fall in the left-most bin. If all non-significant studies examined no true effects, the p-values would be uniformly distributed, with 3.4 studies in each bin (64 non-significant studies (there are 99 p-values plotted, so 99-35=64) in 19 remaining bins). If we think of this p-value distribution as a mix of null effects (uniformly distributed) and true effects (a skewed distribution highest at low p-values), the distribution is not a shallow curve (which would be a sign of low power, see p-value distributions as a function of power here). Instead, the distribution looks more like a sharp angle, which mirrors a p-value distribution from a set of highly powered experiments. It really looks like our power was very high (but we should remember we only have 100 datapoints). There will certainly be some replication studies that, with a much larger sample size, will reveal an effect. In general, it is extremely difficult (and requires huge sample sizes) to distinguish between a real but very small effect, and no effect. But at least the distribution of p-values takes away the concern I had when I started this blog post that the biased effect size estimates in the original studies affected the power in the replication studies.
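If you want to see the difference between a 'sharp angle' and a 'shallow curve' for yourself, the sketch below simulates p-values from a mix of null effects and true effects, once with high and once with low power. It is only an illustration: it assumes two-sided z-tests, an equal split of null and true effects, and a large number of simulated studies (none of which come from the Reproducibility Project data), and sim_p is a hypothetical helper function.

```r
# Reference line used in the histogram: 64 non-significant p-values
# spread evenly over the 19 bins above 0.05
ref_line <- (99 - 35) / 19
print(round(ref_line, 1))

set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: p-values from two-sided z-tests (alpha = .05)
# for a mix of null effects and true effects with a given power
sim_p <- function(n_null, n_true, power) {
  ncp <- qnorm(0.975) + qnorm(power)  # mean of z that gives roughly this power
  z <- c(rnorm(n_null, mean = 0), rnorm(n_true, mean = ncp))
  2 * pnorm(-abs(z))
}

bins <- seq(0, 1, by = 0.05)  # 20 bins, the first one holds p < 0.05

high <- hist(sim_p(5000, 5000, power = 0.92), breaks = bins, plot = FALSE)$counts
low  <- hist(sim_p(5000, 5000, power = 0.30), breaks = bins, plot = FALSE)$counts

# Counts in the first five bins: with high power almost all true effects
# land in the first bin and the rest is flat; with low power the counts
# decline gradually across the bins
print(rbind(high_power = high[1:5], low_power = low[1:5]))
```

Replace plot = FALSE with plot = TRUE to draw the two histograms and compare their shapes.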

For now, it means 35 out of 97 replicated effects have become quite a bit more likely to be true. We have learned something about what predicts replicability. For example, at least for some indicators of replication success, “Surprising effects were less reproducible” (take note, journalists and editors of Psychological Science!). For the studies that did not replicate, we have more data, which can inform not just our statistical inferences, but also our theoretical inferences. The Reproducibility Project demonstrates large scale collaborative efforts can work, so if you still believe in an effect that did not replicate, get some people together, collect enough data, and let me know what you find.