A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Showing posts with label p-curve. Show all posts

Sunday, October 11, 2015

Practicing Meta-Analytic Thinking Through Simulations

People find it difficult to think about random variation. Our mind is more strongly geared towards recognizing patterns than randomness. In this blogpost, you can learn what random variation looks like, how to reduce it by running well-powered studies, and how to meta-analyze multiple small studies. This is a long read, and most educational if you follow the assignments. You'll probably need about an hour.

We'll use R, and the R script at the bottom of this post (or download it from GitHub). Run the first section (sections are differentiated by # # # #) to install the required packages and change some settings.


IQ tests have been designed such that the mean IQ of the entire population of adults is 100, with a standard deviation of 15. This will not be true for every sample we draw from the population. Let’s get a feel for what the IQ scores from a sample look like. Which IQ scores will people in our sample have?


Assignment 1

We will start by simulating a random sample of 10 individuals. Run the script in the section #Assignment 1. Both the mean and the standard deviation differ from the true values in the population. Simulate some more samples of 10 individuals and look at the means and SDs. They differ quite a lot. This type of variation is perfectly normal in small samples of 10 participants. See below for one example of a simulated sample.




Let’s simulate a larger sample, of 100 participants by changing the n=10 in line 23 of the R script to n = 100 (remember R code is case-sensitive). 


We are slowly seeing what is known as the normal distribution. This is the well known bell shaped curve that represents the distribution of many variables in scientific research (although some other types of distributions are quite common as well). The mean and standard deviation are much closer to the true mean and standard deviation, and this is true for most of the simulated samples. Simulate at least 10 samples with n = 10, and 10 samples with n = 100. Look at the means and standard deviations. Let’s simulate one really large sample, of 1000 people (run the code, changing n=10 to n=1000). The picture shows one example.



Not every simulated study of 1000 people will yield the true mean and standard deviation, but this one did. And although the distribution is very close to a normal distribution, even with 1000 people it is not perfect.
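If you want to see the same pattern without the full script, the sampling exercise above can be sketched in a few lines of base R. This is a minimal sketch, not the script from this post: the seed and the variable names are my own choices, and only the population values (M = 100, SD = 15) come from the text.

```r
# Draw simulated IQ samples from a population with mean 100 and SD 15.
set.seed(1)  # arbitrary seed, only so the run can be repeated

# One small sample of 10 people
x <- rnorm(n = 10, mean = 100, sd = 15)
round(c(mean = mean(x), sd = sd(x)), 2)

# The same summary for increasingly large samples
sizes <- c(10, 100, 1000)
results <- t(sapply(sizes, function(k) {
  s <- rnorm(k, mean = 100, sd = 15)
  c(n = k, mean = mean(s), sd = sd(s))
}))
results  # estimates tend to sit closer to 100 and 15 as n grows
```

Remove the set.seed line and rerun it a few times to see how much the small samples jump around.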

The accuracy with which you can measure the IQ in a population is easy to calculate when you know the standard deviation and the long-run probability of making an error that you are willing to accept. If you choose a 95% confidence interval, and want to estimate IQ within an error range of 2 IQ points, you first convert the 95% confidence interval to a Z-score (1.96), and use the formula:

$$N = \left(\frac{Z \cdot SD}{\text{error}}\right)^2$$

In this example, (1.96*15/2)^2 = 216 people (rounded down). Feel free to check by running the code with n = 216 (remember that this is a long term average!)
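The arithmetic is easy to check in R. A minimal sketch, using qnorm to get the Z-score instead of typing 1.96 by hand (the variable names are mine):

```r
# Sample size needed to estimate a mean within a given margin of error
z <- qnorm(0.975)   # Z-score for a two-sided 95% interval, about 1.96
sd_pop <- 15        # population standard deviation of IQ
error <- 2          # desired margin of error in IQ points

n_required <- (z * sd_pop / error)^2
n_required          # just over 216
floor(n_required)   # rounded down, as in the text
```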

In addition to planning for accuracy, you can plan for power. The power of a study is the probability of observing a statistically significant effect, given that there is a true effect to be found. It depends on the effect size, the sample size, and the alpha level.

We can simulate experiments, and count how many statistically significant results are observed, to see how much power we have. For example, when we simulate 100,000 studies, and 50% of the studies reveal a p-value smaller than 0.05, this means the power of our study (given a specific effect size, sample size, and alpha-level) is 50%.

We can use the code in the section of Assignment 2. Running this code will take a while. It will simulate 100000 experiments, where 10 participants are drawn from a normal distribution with the mean of 110, and a SD of 15. To continue our experiment, let’s assume the numbers represent measured IQ, which is 110 in our samples. For each simulated sample, we test whether the effect differs from an IQ of 100. In other words, we are testing whether our sample is smarter than average.

The program returns all p-values, and it will return the power, which will be somewhere around 47%. It will also yield a plot of the p-values. The first bar is the count of all p-values smaller than 0.05, so all statistically significant p-values. The percentage of p-values in this single bar visualizes the power of the study.
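A stripped-down version of this kind of simulation might look like the sketch below. It is not the Assignment 2 code itself: I use 10,000 simulations instead of 100,000 to keep the run short, and the seed and variable names are my own.

```r
# Estimate power by simulation: one-sample t-test of IQ scores against 100
set.seed(123)      # arbitrary seed
nSims <- 10000     # fewer than in the post, so the estimate is a bit noisier
n <- 10            # participants per simulated study

p <- numeric(nSims)
for (i in seq_len(nSims)) {
  x <- rnorm(n, mean = 110, sd = 15)     # true mean is 110
  p[i] <- t.test(x, mu = 100)$p.value    # test against the population mean
}

power_sim <- mean(p < 0.05)  # proportion of significant results
power_sim

# Bin the p-values in 20 bars of width 0.05; the first bar holds p < 0.05
h <- hist(p, breaks = (0:20) / 20, plot = FALSE)
h$counts[1] / nSims          # share of p-values in the first bar
# plot(h) draws the histogram
```

Because the first bar covers exactly the p-values below 0.05, its share of the total should match the simulated power, up to p-values that land exactly on a bin edge.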


Instead of simulating the power of the study, you can also perform power calculations in R (see the code at the end of assignment 2). To calculate the power of a study, we need the sample size (in our case, n = 10), the alpha level (in our case, 0.05), and the effect size, which for a one-sample t-test is Cohen’s d, which can be calculated as d = (X-μ)/σ, or (110-100)/15 = 0.6667. 


Assignment 2

Using the simulation and the pwr package, examine what happens with the power of the experiments when the sample size is increased to 20. How does the p-value distribution change?

Using the simulation and the pwr package, examine what happens with the power of the experiments when the mean in the sample changes from 110 to 105 (set the sample size to 10 again). How does the p-value distribution change?

Using the simulation and the pwr package, examine what happens with the power of the experiments when the mean in the sample is set to 100 (set the sample size to 10 again). Now, there is no difference between the sample and the average IQ. How does the p-value distribution change? Can we formally speak of ‘power’ in this case? What is a better name in this specific situation?

Variance in two groups, and their difference.

Now, assume we have a new IQ training program that will increase people's IQ scores by 6 points. People in condition 1 are in the control condition – they do not get IQ training. People in condition 2 get IQ training. Let’s simulate 10 people in each group, assuming the IQ in the control condition is 100, and in the experimental group is 106 (the SD is still 15 in each group) by running the code for Assignment 3.


The graph you get will look like a version of the one below. The means and SD for each sample drawn are provided in the graph (control condition on the left, experimental condition on the right).



The two groups differ in how close they are to their true means, and as a consequence, the difference between groups varies as well. Note that this difference is the main variable in statistical analyses when comparing two groups. Run at least 10 more simulations to look at the data pattern.
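To get a feel for how much the difference itself moves around, you can repeat the two-group draw many times. The sketch below is my own illustration rather than the Assignment 3 code; the seed, the number of repetitions, and the variable names are assumptions.

```r
# Two groups of 10: control (true mean 100) and IQ training (true mean 106)
set.seed(42)  # arbitrary seed
n <- 10

control  <- rnorm(n, mean = 100, sd = 15)
training <- rnorm(n, mean = 106, sd = 15)
c(mean_control  = mean(control),
  mean_training = mean(training),
  difference    = mean(training) - mean(control))

# Repeat the study 1000 times and keep only the observed mean difference
diffs <- replicate(1000, mean(rnorm(n, 106, 15)) - mean(rnorm(n, 100, 15)))
c(mean = mean(diffs), sd = sd(diffs))

# For comparison: the theoretical SD of the difference between two means
15 * sqrt(2 / n)
```

The observed differences average out near the true difference of 6, but with only 10 people per group any single study can easily be several points off.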


Assignment 3

Compared to the one-sample case above, we now have 2 variable group means, and two variable standard deviations. If we perform a power analysis, how do you think this additional variability will influence the power of our test? In other words, for the exact same effect size (e.g., 0.6667), will the power of our study remain the same, will it increase, or will it decrease?

Test whether your intuition was correct or not by running this power analysis for an independent samples t-test:

pwr.t.test(d=0.6667,n=10,sig.level=0.05,type="two.sample",alternative="two.sided")

In dependent samples, the mean in one sample correlates with the mean in the other sample. This reduces the amount of variability in the difference scores. If we perform a power analysis, how do you think this will influence the power of our test?

Effect size calculations for dependent samples are influenced by the correlation between the means. If this correlation is 0.5, the effect size calculation for the dependent case and the independent case is identical. But the power for a dependent t-test will be identical to the power in a one-sample t-test.

Verify this by running the power analysis for a dependent samples t-test, with a true effect size of 0.6667, and compare the power with the same power analysis for a one-sample t-test we performed above:

pwr.t.test(d=0.6667,n=10,sig.level=0.05,type="paired",alternative="two.sided")
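If you don't have the pwr package at hand, base R's power.t.test can run the same three comparisons. This is a swapped-in alternative, not the function used in this post. I assume sd = 1, so that delta plays the role of the standardized effect size, and for the paired test that effect size refers to the difference scores.

```r
# Power for the same effect size (d = 0.6667) and n = 10 under three designs
d <- 0.6667
n <- 10   # per group in the two-sample design

power_one <- power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
                          type = "one.sample",
                          alternative = "two.sided")$power
power_two <- power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
                          type = "two.sample",
                          alternative = "two.sided")$power
power_paired <- power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
                             type = "paired",
                             alternative = "two.sided")$power

round(c(one_sample = power_one,
        two_sample = power_two,
        paired     = power_paired), 3)
```

Try to answer the questions above before you look at the numbers.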



Variation across studies

Up until now, we have talked about the variation of data points within a single study. It is clear that the larger the sample size, the more the observed difference (in the case of two means) or the more the observed correlation (in the case of two related variables) mirrors the true difference or correlation. We can calculate the variation in the effects we are interested in directly. Both correlations and mean differences are effect sizes. Because mean differences are difficult to compare across studies that use different types of measures to examine an effect, or different scales to measure differences on, whenever multiple effect sizes are compared researchers often use standardized effect sizes. In this example, we will focus on Cohen’s d, which provides the standardized mean difference.

As explained in Borenstein, Hedges, Higgins, & Rothstein (2009) a very good approximation of the variance of d is provided by:


This formula shows that the variance of d depends only on the sample size and the value of d itself. 
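For two independent groups of sizes n1 and n2, the approximation usually cited from Borenstein et al. (2009) is V_d = (n1 + n2)/(n1 * n2) + d^2/(2 * (n1 + n2)). The sketch below wraps it in a small helper; the function name var_d and the example sample sizes are my own.

```r
# Approximate variance of Cohen's d for two independent groups
var_d <- function(d, n1, n2) {
  (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
}

d <- 0.4                 # the same d for every sample size
n <- c(10, 50, 200)      # participants per group

se <- sqrt(var_d(d, n, n))
data.frame(n_per_group = n,
           se          = round(se, 3),
           ci_lower    = round(d - qnorm(0.975) * se, 3),
           ci_upper    = round(d + qnorm(0.975) * se, 3))
```

The first term dominates unless d is very large, which is why the confidence interval shrinks mostly as a function of sample size.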

Single study meta-analysis

Perhaps you remember that whenever the 95% confidence interval around an effect size estimate excludes zero, the effect is statistically significant. When you want to test whether effect sizes across a number of studies differ from 0, you have to perform what is known as a meta-analysis. In essence, you perform an analysis over analyses. You first analyze individual studies, and then analyze the set of effect sizes you calculated from each individual study. To perform a meta-analysis, all you need are the effect sizes and the sample size of each individual study.

Let’s first begin with something you will hardly ever do in real life: a meta-analysis of a single study. This is a little silly, because a simple t-test or correlation will tell you the same thing – but that’s educational to see.

We will simulate one study examining our IQ training program. The IQ in the control condition has M = 100, SD = 15, and in the experimental condition the average IQ has improved to M = 106, SD = 15. We will randomly select the sample size, and draw between 20-50 participants in each condition.

Our simulated results for a single simulation (see the code below) for the control condition gives M = 97.03, and for the experimental condition gives M = 107.52. The difference (the experimental condition minus the control condition, so positive scores mean better performance in the experimental condition) is statistically significant, t(158) = 2.80, p = 0.007. The effect size Hedges’ g = 0.71. This effect size overestimates the true effect size substantially. The true effect size is d = 0.4 – calculate this for yourself.
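The sketch below shows one way to get from raw scores to Hedges' g. It is not the Assignment 6 code and will not reproduce the numbers in this example: the seed is arbitrary, I assume the same sample size in both conditions, and I use the common approximation J = 1 - 3/(4*df - 1) for the small-sample correction.

```r
# True effect size implied by the simulation settings
true_d <- (106 - 100) / 15
true_d   # 0.4

# One simulated study
set.seed(2015)                   # arbitrary seed
n <- sample(20:50, 1)            # same n in both conditions (my assumption)
control  <- rnorm(n, mean = 100, sd = 15)
training <- rnorm(n, mean = 106, sd = 15)

# Cohen's d with the pooled SD (simple average of variances, since n is equal)
sd_pooled <- sqrt((var(control) + var(training)) / 2)
d_obs <- (mean(training) - mean(control)) / sd_pooled

# Hedges' g: d multiplied by a small-sample correction factor
df <- 2 * n - 2
J <- 1 - 3 / (4 * df - 1)
g_obs <- J * d_obs

c(n_per_group = n, d = d_obs, g = g_obs)
t.test(training, control, var.equal = TRUE)
```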

Run the code in assignment 6 (I'm skipping some parts I do use in teaching - feel free to run that code to explore variation in correlations) to see the data. Remove the # in front of the set.seed line to get the same result as in this example.


Assignment 6

If we perform a meta-analysis, we get almost the same result - the calculations used by the meta package differ slightly (although it will often round to the same 2 digits after the decimal point), because it uses a different (Wald) type of test and confidence interval – but that’s not something we need to worry about here.

Run the simulation a number of times to see the variation in the results, and the similarity between the meta-analytic result and the t-test.


The meta-analysis compares the meta-analytic effect size estimate (which in this example is based on a single study) to zero, and tests whether the difference from zero is statistically significant. We see the estimated effect size g = 0.7144, a 95% CI, and a z-score (2.7178), which is the test statistic for which a p-value can be calculated. The p-value of 0.0066 is very similar to that observed in the t-test.

                          95%-CI      z  p-value
 0.7143703 [0.1992018; 1.2295387] 2.7178   0.0066
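You can recover the z-score and the p-value from the printed estimate and confidence interval alone. A minimal sketch, assuming the interval is a symmetric Wald interval (estimate plus or minus 1.96 standard errors):

```r
# Numbers copied from the output above
g        <- 0.7143703
ci_lower <- 0.1992018
ci_upper <- 1.2295387

# The width of a Wald 95% CI is 2 * 1.96 standard errors
se <- (ci_upper - ci_lower) / (2 * qnorm(0.975))

z <- g / se               # test statistic
p <- 2 * pnorm(-abs(z))   # two-sided p-value

round(c(se = se, z = z, p = p), 4)
```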

Meta-analyses are often visualized using forest plots. We see a forest plot summarizing our single test below:

In this plot we see a number (1) for our single study. The effect size (0.71), which is Hedges's g, the unbiased estimate of Cohen's d, and the confidence interval [0.2; 1.23] are presented on the right. The effect size and confidence interval are also visualized. The effect size is shown by the orange square (the larger the sample size, the bigger the square is), and the line running through it is the 95% confidence interval.

A small-scale meta-analysis

Meta-analyses are made to analyze more than one study. Let’s analyze 4 studies, with different effect sizes (0.44, 0.7, 0.28, 0.35) and sample sizes (60, 35, 23, 80 in one condition and 60, 36, 25, 80 in the other).

 Researchers have to choose between a fixed effect model or a random effects model when performing a meta-analysis.

Fixed effect models assume a single true effect size underlies all the studies included in the meta-analysis. Fixed effect models are therefore only appropriate when all studies in the meta-analysis are practically identical (e.g., use the same manipulation) and when researchers do not want to generalize to different populations (Borenstein, Hedges, Higgins, & Rothstein, 2009).

By contrast, random effects models allow the true effect size to vary from study to study (e.g., due to differences in the manipulations between studies). Note the difference between fixed effect and random effects (plural, meaning multiple effects). Random effects models therefore are appropriate when a wide range of different studies is examined and there is substantial variance between studies in the effect sizes. Since the assumption that all effect sizes are identical is implausible in most meta-analyses, random effects meta-analyses are generally recommended (Borenstein et al., 2009).

The meta-analysis in this assignment, where we have simulated studies based on exactly the same true effect size, and where we don’t want to generalize to different populations, is one of the rare examples where a fixed effect meta-analysis would be appropriate – but for educational purposes, I will only show the random effects model. When variation in effect sizes is small, both models will give the same results.

In a meta-analysis, a weighted mean is computed. The reason studies are weighted when calculating the meta-analytic effect size is that larger studies are considered to be more accurate estimates of the true effect size (as we have seen above, this is true in general). Instead of simply averaging over an effect size estimate from a study with 20 people in each condition, and an effect size estimate from a study with 200 people in each condition, the larger study is weighted more strongly when calculating the meta-analytic effect size.

R makes it relatively easy to perform a meta-analysis by using the meta or metafor package. Run the code related to Assignment 7. We get the following output, where we see four rows (one for each study), the effect sizes and 95% CI for each effect, and the %W (random), which is the relative weight for each study in a random effects meta-analysis.


                  95%-CI %W(random)
1 0.44 [ 0.0802; 0.7998]      30.03
2 0.70 [ 0.2259; 1.1741]      17.30
3 0.28 [-0.2797; 0.8397]      12.41
4 0.35 [ 0.0392; 0.6608]      40.26
Number of studies combined: k=4

                                      95%-CI      z  p-value
Random effects model 0.4289 [0.2317; 0.6261] 4.2631 < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.97]; I^2 = 0% [0%; 74.2%]

Test of heterogeneity:
    Q d.f.  p-value
 1.78    3   0.6194

The line below the summary gives us the statistics for the random effects model. First, the meta-analytic effect size estimate (0.43) with the 95% CI [0.23; 0.63], and the associated z-score and p-value. Based on the set of studies we simulated here, we would conclude it looks like there is a true effect.
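The weighted mean behind this output can be rebuilt by hand from the printed effect sizes and confidence intervals. The sketch below is not how the meta package is called in Assignment 7. It is plain inverse-variance weighting, with standard errors backed out of the 95% CIs. Because tau^2 is estimated at 0 here, the random effects weights reduce to these fixed effect weights.

```r
# Effect sizes and 95% CIs copied from the output above
d        <- c(0.44, 0.70, 0.28, 0.35)
ci_lower <- c(0.0802, 0.2259, -0.2797, 0.0392)
ci_upper <- c(0.7998, 1.1741,  0.8397, 0.6608)

# Standard errors recovered from the CI widths
se <- (ci_upper - ci_lower) / (2 * qnorm(0.975))

# Inverse-variance weights: precise studies count more
w <- 1 / se^2
weight_pct <- 100 * w / sum(w)

# Weighted mean effect size, its standard error, CI and z-score
pooled    <- sum(w * d) / sum(w)
se_pooled <- sqrt(1 / sum(w))
ci_pooled <- pooled + c(-1, 1) * qnorm(0.975) * se_pooled
z         <- pooled / se_pooled

round(weight_pct, 2)
round(c(estimate = pooled, lower = ci_pooled[1],
        upper = ci_pooled[2], z = z), 4)
```

Up to rounding in the printed confidence intervals, the weights, the estimate and the z-score should line up with the output.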

The same information is visualized in a forest plot:


The meta-analysis also provides statistics for heterogeneity. Tests for heterogeneity examine whether there is large enough variation in the effect sizes included in the meta-analysis to assume there might be important moderators of the effect. For example, assume studies examine how happy receiving money makes people. Half of the studies gave people around 10 euros, while the other half of the studies gave people 100 euros. It would not be surprising to find both these manipulations increase happiness, but 100 euro does so more strongly than 10 euro. Many manipulations in psychological research differ similarly in their strength. If there is substantial heterogeneity, researchers should attempt to examine the underlying reason for this heterogeneity, for example by identifying subsets of studies, and then examining the effect in these subsets. In our example, there does not seem to be substantial heterogeneity (the test for heterogeneity, the Q-statistic, is not statistically significant).
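The Q-statistic is a weighted sum of squared deviations of each study from the pooled estimate, compared against a chi-square distribution with k - 1 degrees of freedom. The sketch below recomputes it from the four printed effect sizes and confidence intervals. It is my own reconstruction, using the standard definitions of Q and I^2, and not the internals of the meta package.

```r
# Effect sizes and 95% CIs copied from the four-study output
d        <- c(0.44, 0.70, 0.28, 0.35)
ci_lower <- c(0.0802, 0.2259, -0.2797, 0.0392)
ci_upper <- c(0.7998, 1.1741,  0.8397, 0.6608)

se     <- (ci_upper - ci_lower) / (2 * qnorm(0.975))
w      <- 1 / se^2
pooled <- sum(w * d) / sum(w)

# Cochran's Q and its p-value
Q    <- sum(w * (d - pooled)^2)
df_Q <- length(d) - 1
p_Q  <- pchisq(Q, df = df_Q, lower.tail = FALSE)

# I^2: share of the variation beyond what sampling error predicts
I2 <- max(0, (Q - df_Q) / Q) * 100

round(c(Q = Q, df = df_Q, p = p_Q, I2 = I2), 4)
```

Because Q is smaller than its degrees of freedom here, I^2 is truncated at 0%.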


Assignment 7

Play around with the effect sizes and sample sizes in the 4 studies in our small meta-analysis. What happens if you increase the sample sizes? What happens if you make the effect sizes more diverse? What happens when the effect sizes become smaller (e.g., all effect sizes vary a little bit around d = 0.2)? Look at the individual studies. Look at the meta-analytic effect size.

Simulating small studies

Instead of typing in specific numbers for every meta-analysis, we can also simulate a number of studies with a specific true effect size. This is quite informative, because it will show how much variability there is in small, underpowered studies. Remember that many studies in psychology are small and underpowered.

In this simulation, we will randomly draw data from a normal distribution for two groups. There is a real difference in means between the two groups. Like above, the IQ in the control condition has M = 100, SD = 15, and in the experimental condition the average IQ has improved to M = 106, SD = 15. We will simulate between 20 and 50 participants in each condition (and thus create a ‘literature’ that consists primarily of small studies).

You can run the code we have used above (for a single meta-analysis) to simulate 8 studies, perform the meta-analysis, and create a forest plot. The code for Assignment 8 is the same as earlier, we just changed the nSims=1 to nSims=8.

The forest plot of one of the random simulations looks like:


The studies show a great deal of variability, even though the true difference between both groups is exactly the same in every simulated study. Only 50% of the studies reveal a statistically significant effect, but the meta-analysis provides clear evidence for the presence of a true effect in the fixed-effect model (p < 0.0001):


                     95%-CI %W(fixed) %W(random)
1 -0.0173 [-0.4461; 0.4116]     14.47      13.83
2 -0.0499 [-0.5577; 0.4580]     10.31      11.16
3  0.6581 [ 0.0979; 1.2183]      8.48       9.74
4  0.5806 [ 0.0439; 1.1172]      9.24      10.35
5  0.3104 [-0.1693; 0.7901]     11.56      12.04
6  0.4895 [ 0.0867; 0.8923]     16.40      14.87
7  0.7362 [ 0.3175; 1.1550]     15.17      14.22
8  0.2278 [-0.2024; 0.6580]     14.37      13.78
 
Number of studies combined: k=8
 
                                      95%-CI      z  p-value
Fixed effect model   0.3624 [0.1993; 0.5255] 4.3544 < 0.0001
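A compact version of such a simulated 'literature' can be written in base R. The sketch below is not the Assignment 8 code and will not reproduce the numbers above: the seed is arbitrary, I use Cohen's d without the Hedges correction, and I pool with plain inverse-variance (fixed effect) weights based on the usual approximation for the variance of d.

```r
# Simulate 8 small studies of the IQ training effect (true d = 0.4)
set.seed(8)   # arbitrary seed
nSims <- 8

d <- numeric(nSims)
v <- numeric(nSims)
for (i in seq_len(nSims)) {
  n <- sample(20:50, 1)                    # participants per condition
  control  <- rnorm(n, mean = 100, sd = 15)
  training <- rnorm(n, mean = 106, sd = 15)
  sd_pooled <- sqrt((var(control) + var(training)) / 2)
  d[i] <- (mean(training) - mean(control)) / sd_pooled
  v[i] <- (2 * n) / (n * n) + d[i]^2 / (2 * (2 * n))  # approx. variance of d
}

# 95% CI per study, and whether each study is significant on its own
lower <- d - qnorm(0.975) * sqrt(v)
upper <- d + qnorm(0.975) * sqrt(v)
significant <- lower > 0 | upper < 0

# Fixed effect (inverse-variance) meta-analysis
w <- 1 / v
pooled    <- sum(w * d) / sum(w)
se_pooled <- sqrt(1 / sum(w))
z <- pooled / se_pooled

round(data.frame(d, lower, upper, significant), 3)
round(c(pooled = pooled, z = z, p = 2 * pnorm(-abs(z))), 4)
```

Rerun it without the set.seed line to see how often the individual studies disagree while the pooled estimate stays fairly stable.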


Assignment 8

Pretend these would be the outcomes of studies you actually performed. Would you have continued to test your hypothesis in this line of research after study 1 and 2 showed no results?

Simulate at least 10 small meta-analyses. Look at the pattern of the studies, and how much they vary. Look at the meta-analytic effect size estimate. Does it vary, or is it more reliable? What happens if you increase the sample size? For example, instead of choosing samples between 20 and 50 [SampleSize<-sample(20:50, 1)], choose samples between 100 and 150 [SampleSize<-sample(100:150, 1)].



Meta-Analysis, not Miracles

Some people are skeptical about the usefulness of meta-analysis. It is important to realize what meta-analysis can and can’t do. Some researchers argue meta-analyses are garbage-in, garbage-out. If you calculate the meta-analytic effect size of a bunch of crappy studies, the meta-analytic effect size estimate will also be meaningless. It is true that a meta-analysis cannot turn bad data into a good effect size estimation. Similarly, meta-analytic techniques that aim to address publication bias (not discussed in this blog post) can never provide certainty about the unbiased effect size estimate.

However, meta-analysis does more than just provide a meta-analytic effect size estimate that is statistically different from zero or not. It allows researchers to examine the presence of bias, and the presence of variability. These analyses might allow researchers to identify different subsets of studies, some stronger than others. Very often, a meta-analysis will provide good suggestions for future research, such as large scale tests of the most promising effect under investigation.

Meta-analyses are not always performed with enough attention to detail (e.g., Lakens, Hilgard, & Staaks, 2015). It is important to realize that a meta-analysis has the potential to synthesize a large set of studies, but the extent to which a meta-analysis successfully achieves this is open for discussion. For example, it is well-known that researchers on opposite sides of a debate (e.g., concerning the question whether aggressive video games do or do not lead to violence) can publish meta-analyses reaching opposite conclusions. This is obviously undesirable, but points towards the large degrees of freedom in choosing which articles to include in the meta-analysis, as well as other choices that are made throughout the meta-analysis.

Nevertheless, meta-analyses can be very useful. First of all, small-scale meta-analyses can actually mitigate publication bias, by allowing researchers to publish individual studies that show statistically significant effects and studies that do not show statistically significant effects, while the overall meta-analytic effect size provides clear support for a hypothesis. Second, meta-analyses provide us with a best estimate (or a range of best estimates, given specific assumptions of bias) of the size of effects, or the variation in effect sizes depending on specific aspects of the performed studies, which can inspire future research.

That’s a lot of information about variation in single studies, variation across studies, meta-analyzing studies, and performing power analyses to design studies that have a high probability of showing a true effect, if it’s there! I hope this is helpful in designing studies and evaluating their results.

Thursday, August 27, 2015

Power of replications in the Reproducibility Project

The Open Science Collaboration has completed 100 replication studies of findings published in the scientific literature, and the results are available. The replicated studies have become much more likely to be true, but we are left with some questions about what it means that many studies did not replicate. This is a very rich dataset, and although there can be many reasons a finding does not replicate, I wanted to examine one concern. Studies in the Reproducibility Project were well powered for the effect sizes observed in the original studies. But we know effect sizes in the published literature are often overestimated. So is it possible that most of the replication studies that did not yield significant results actually examined much smaller effects, and thus lacked power?

The table below (from the article in Science) summarizes some of the results. There is a nice range of interpretations (even though I'll focus a lot on the p < 0.05 criterion in this post). The probability of observing a statistically significant effect, if there is an effect to be found, depends on the statistical power of a study. The ‘average replication power’ provides estimates of the statistical power of the studies, assuming the effect size estimate in the original study was exactly the true effect size.

 

As the Open Science Collaboration (including myself) write: “On the basis of only the average replication power of the 97 original, significant effects [M = 0.92, median (Mdn) = 0.95], we would expect approximately 89 positive results in the replications if all original effects were true and accurately estimated.”

With 35 significant effects out of 89, we get a replication rate of roughly 40%. But we have very good reasons to believe that not all original effect sizes were accurately estimated, and that the average power of replications was lower (Shravan Vasishth called this 'power inflation' earlier today). And when the average power is lower, fewer findings are expected to replicate, which means the replication success is relatively higher (i.e., instead of 35 out of 89, 35 out of some number lower than 89 replicated).
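The arithmetic behind this argument fits in a few lines. In the sketch below, 0.92 is the average replication power reported in the paper. The values 0.80 and 0.60 are purely hypothetical, chosen only to show how the benchmark shifts if true power was lower.

```r
# Expected number of successful replications as a function of average power
n_original <- 97   # original studies with a significant effect
observed   <- 35   # significant replications

avg_power <- c(0.92, 0.80, 0.60)  # 0.92 from the paper, the rest hypothetical
expected  <- n_original * avg_power
rate      <- observed / expected  # success relative to what power allows

data.frame(avg_power,
           expected = round(expected, 1),
           relative_rate = round(rate, 2))
```

The lower the real power, the smaller the number of replications we could have expected, and the better 35 successes looks by comparison.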

When there is severe publication bias, effect sizes are overestimated. We can examine whether there is publication bias in the original studies in a meta-analysis (below, I follow one meta-analysis of the data analysis team and look at studies which reported t-tests and F-tests, 73 out of the 100). Effect sizes observed in studies should be independent from standard errors, but when there is publication bias, they are not. There is a funnel plot of these 73 original studies on the OSF, but I prefer contour enhanced funnel plots, which I made by first running the (absolutely amazing - I'm serious, check out the work they put into this R script!) masterscript for the data analysis, and then running:

funnel(res, level=c(90, 95, 99), shade=c("white", "gray", "darkgray"), refline=0, main = "Funnel plot based on original studies")


A contour-enhanced funnel plot makes it more strikingly clear that almost all original studies observed a statistically significant effect. This is surprising, given that sample sizes were much smaller than in replication attempts (and the replication studies had 92% power, based on the original effect sizes). This is also clear from the distribution of the effects – small studies (with large standard errors, on the bottom of the plot) have large effect sizes (because otherwise they would not be statistically significant), while larger studies (at the top) have smaller effect sizes (but still just large enough to be statistically significant, or fall outside of the white triangle).

A trim and fill analysis is often used to examine whether there are missing studies. Now we are grouping together 73 completely different and highly heterogeneous effects, so the following numbers should be interpreted in light of huge heterogeneity, but we can perform this analysis using:

taf <- trimfill(res)
taf
funnel(taf, level=c(90, 95, 99), shade=c("white", "gray", "darkgray"), refline=0, main = "Trim and Fill funnel plot based on original studies")

Trim-and-fill analysis can only be used as a sensitivity analysis (it does not provide accurate effect sizes or estimates of the actual number of missing studies), but it clearly shows studies are missing (there are 29 white dots in the trim-and-fill funnel plot, which represent the studies assumed to be missing), and reports a meta-analytic effect size estimate of r = 0.28 (instead of r = 0.42) based on these hypothetical missing studies. This does not mean r = 0.28 is the true effect size, but it’s probably close (a meta-analysis of meta-analyses estimated the average effect size in psychology at r = 0.21 – so we might be in the ballpark).

The difference between the biased and unbiased effect size is substantial, and this means power could very reasonably be somewhat lower than 0.92. There’s not much the Reproducibility Project could do about publication bias (e.g., there is no foolproof statistical technique to obtain unbiased effect size estimates). The solution should come from us: We should publish all our effects, regardless of their significance level. If we don’t, we are sabotaging cumulative science.

However: power only matters when there is a true effect. An unknown percentage of studies did not replicate, because they were originally a false positive, and there simply is no true effect to be found (i.e., the true effect size is 0). It is difficult to tease apart failed replications due to low power, and failed replications because the original studies were false positives, and again, this is a very heterogeneous set of studies. But a look at the p-value distribution is interesting, which we can plot with:

pdist<-MASTER$T_pval_USE..R.[!is.na(MASTER$T_pval_USE..O.) & !is.na(MASTER$T_pval_USE..R.)]
hist(pdist, breaks=20)
abline(h=3.4, lty = 3, col = "gray60")

The histogram is divided into 20 bins, and the frequency of p-values in each bin is plotted. This means all significant results (p < 0.05) fall in the left-most bin. If all non-significant studies examined no true effects, the p-values would be uniformly distributed, with 3.4 studies in each bin (64 non-significant studies (there are 99 p-values plotted, so 99-35=64) in 19 remaining bins). If we think of this p-value distribution as a mix of null effects (uniformly distributed) and true effects (a skewed distribution highest at low p-values), the distribution is not a shallow curve (which would be a sign of low power, see p-value distributions as a function of power here). Instead, the distribution looks more like a sharp angle, which mirrors a p-value distribution from a set of highly powered experiments. It really looks like our power was very high (but we should remember we only have 100 datapoints). There will certainly be some replication studies that, with a much larger sample size, will reveal an effect. In general, it is extremely difficult (and requires huge sample sizes) to distinguish between a real but very small effect, and no effect. But at least the distribution of p-values takes away the concern I had when I started this blog post that the biased effect size estimates in the original studies affected the power in the replication studies.
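The height of the reference line follows directly from the counts, and a quick simulation shows what 64 non-significant p-values look like when they are spread uniformly over the 19 remaining bins. The simulated p-values are my own illustration with an arbitrary seed. They are not the Reproducibility Project data.

```r
# Expected count per bin if all non-significant results were true nulls
n_pvalues <- 99   # p-values plotted
n_sig     <- 35   # significant replications, all in the first bin
expected_per_bin <- (n_pvalues - n_sig) / 19
round(expected_per_bin, 2)   # about 3.4, the height of the dotted line

# Simulated example: 64 p-values drawn uniformly between 0.05 and 1
set.seed(7)   # arbitrary seed
p_null <- runif(n_pvalues - n_sig, min = 0.05, max = 1)
counts <- table(cut(p_null, breaks = (1:20) / 20, include.lowest = TRUE))
as.vector(counts)   # noisy, but scattered around the expected count
```

With so few p-values per bin, even a perfectly uniform distribution looks bumpy, which is one reason to interpret the shape of the observed histogram with some caution.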




For now, it means 35 out of 97 replicated effects have become quite a bit more likely to be true. We have learned something about what predicts replicability. For example, at least for some indicators of replication success, “Surprising effects were less reproducible” (take note, journalists and editors of Psychological Science!). For the studies that did not replicate, we have more data, which can inform not just our statistical inferences, but also our theoretical inferences. The Reproducibility Project demonstrates large scale collaborative efforts can work, so if you still believe in an effect that did not replicate, get some people together, collect enough data, and let me know what you find.

Saturday, April 4, 2015

Why a meta-analysis of 90 precognition studies does not provide convincing evidence of a true effect

A meta-analysis of 90 studies on precognition by Bem, Tressoldi, Rabeyron, & Duggan has been circulating recently. I have looked at this meta-analysis of precognition experiments for an earlier blog post. I had a very collaborative exchange with the authors, which was cordial and professional, and led the authors to correct the mistakes I pointed out and answer some questions I had. I thought it was interesting to write a pre-publication peer review of an article that had been posted in a public repository, and since I had invested time in commenting on this meta-analysis anyway, I was more than happy to accept the invitation to peer-review it. This blog is a short summary of my actual review - since a pre-print of the paper is already online, and it is already cited 11 times, perhaps people are interested in my criticism of the meta-analysis. I expect that many of my comments below apply to other meta-analyses by the same authors (e.g., this one), and a preliminary look at the data confirms this. Until I sit down and actually do a meta-meta-analysis, here's why I don't think there is evidence for pre-cognition in the Bem et al meta-analysis.

Only 18 statistically significant precognition effects have been observed in the last 14 years, by just 7 different labs, as the meta-analysis by Bem, Tressoldi, Rabeyron, and Duggan reveals. 72 studies reveal no effect. If research on precognition has demonstrated anything, it is that when you lack a theoretical model, scientific insights are gained at a painstakingly slow pace, if they are gained at all.

The question the authors attempt to answer in their meta-analysis is whether there is a true signal in this noisy set of 90 studies. Even if this is the case, it obviously does not mean we have proof that precognition exists. In science, we distinguish between statistical inferences and theoretical inferences (e.g., Meehl, 1990). Even if a meta-analysis would lead to the statistical inference that there is a signal in the noise, there is as of yet no compelling reason to draw the theoretical inference that precognition exists, due to the lack of a theoretical framework as acknowledged by the authors. Nevertheless, it is worthwhile to see if after 14 years and 90 studies something is going on.

In the abstract, the authors conclude: there is “an overall effect greater than 6 sigma, z = 6.40, p = 1.2 × 10^-10 with an effect size (Hedges’ g) of 0.09. A Bayesian analysis yielded a Bayes Factor of 1.4 × 10^9, greatly exceeding the criterion value of 100 for “decisive evidence” in support of the experimental hypothesis.” Let’s check the validity of this claim.

Dealing with publication bias.

Every meta-analysis needs to deal with publication bias, to make sure the meta-analytic effect size estimate is more than just the inflation away from 0 that emerges because people are more likely to share positive results. Bem and colleagues use Begg and Mazumdar’s rank correlation test to examine publication bias, stating that: “The preferred method for calculating this is the Begg and Mazumdar’s rank correlation test, which calculates the rank correlation (Kendall’s tau) between the variances or standard errors of the studies and their standardized effect sizes (Rothstein, Sutton & Borenstein, 2005).”

I could not find this recommendation in Rothstein et al., 2005. From the same book, Chapter 11, p. 196, about the rank correlation test: “the test has low power unless there is severe bias, and so a non-significant tau should not be taken as proof that bias is absent (see also Sterne et al., 2000, 2001b, c)”. Similarly, from the Cochrane handbook of meta-analyses: “The test proposed by Begg and Mazumdar (Begg 1994) has the same statistical problems but lower power than the test of Egger et al., and is therefore not recommended.”

When the observed effect size is tiny (as in the case of the current meta-analysis), just a small amount of bias can yield a small meta-analytic effect size estimate that is statistically different from 0. In other words, whereas a significant test result is reason to worry, a non-significant test result is not reason not to worry.
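To make the rank correlation test a bit more concrete, here is a minimal sketch in base R. It is not the code used in the meta-analysis or in my review: the effect sizes `yi` and variances `vi` are simulated, and the helper name `begg_test` is made up for this illustration. The test correlates standardized deviations from the pooled estimate with the sampling variances using Kendall's tau.

```r
# Hypothetical data: 30 unbiased studies with a true effect of 0.
set.seed(1)
k  <- 30
n  <- sample(20:200, k, replace = TRUE)   # per-study sample sizes
vi <- 1 / n                               # approximate sampling variance of dz
yi <- rnorm(k, mean = 0, sd = sqrt(vi))   # observed effect sizes

# Illustrative helper (assumed name), following Begg and Mazumdar's logic.
begg_test <- function(yi, vi) {
  w      <- 1 / vi
  pooled <- sum(w * yi) / sum(w)          # fixed-effect pooled estimate
  v_star <- vi - 1 / sum(w)               # variance of (yi - pooled)
  z_star <- (yi - pooled) / sqrt(v_star)  # standardized deviates
  # exact = FALSE avoids warnings when sample sizes (and thus vi) are tied
  cor.test(z_star, vi, method = "kendall", exact = FALSE)
}

res <- begg_test(yi, vi)
res$estimate   # Kendall's tau
res$p.value    # a non-significant result does not show bias is absent
```

With only a few dozen studies, a test like this will often be non-significant even when selection is present, which is exactly the low-power problem described above.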

The authors also report the trim-and-fill method to correct for publication bias. It is known that when publication bias is induced by a p-value boundary, rather than an effect size boundary, and there is considerable heterogeneity in the effects included in the meta-analysis, the trim-and-fill method might not perform well enough to yield a corrected meta-analytic effect size estimate that is close to the true effect size (Peters, Sutton, Jones, Abrams, & Rushton, 2007; Terrin, Schmid, Lau, & Olkin, 2003, see also the Cochrane handbook). I’m not sure what upsets me more: The fact that people continue to use this method, or the fact that the people who use this method still report the uncorrected effect size estimate in their abstract.

Better tests for publication bias

PET-PEESE meta-regression seems to be the best test to correct effect size estimates for publication bias we currently have. This approach is based on first using the precision-effect test (PET, Stanley, 2008) to examine whether there is a true effect beyond publication bias, and then following up on this test (if the confidence intervals for the estimate exclude 0) by a PEESE (precision-effect estimate with standard error, Stanley and Doucouliagos, 2007) to estimate the true effect size.

In the R code where I have reproduced the meta-analysis (see below), I have included the PET-PEESE meta-regression. The results are clear: the estimated effect size when correcting for publication bias is 0.008, and the confidence intervals around this effect size estimate do not exclude 0. In other words, there is no good reason to assume that anything more than publication bias is going on in this meta-analysis.
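For readers who want to see the mechanics, the following is a rough sketch of PET-PEESE as two weighted regressions in base R. It runs on a simulated literature, not on the precognition data: the true effect, the selection rule (non-significant studies are published 10% of the time), and the helper name `simulate_study` are all assumptions chosen for illustration.

```r
set.seed(42)

# Hypothetical literature: true effect is 0, positive significant results
# are always published, everything else only 10% of the time.
simulate_study <- function() {
  repeat {
    n  <- sample(20:200, 1)
    x  <- rnorm(n, mean = 0, sd = 1)
    d  <- mean(x) / sd(x)                  # Cohen's dz
    se <- sqrt(1 / n + d^2 / (2 * n))      # approximate standard error of dz
    p  <- t.test(x)$p.value
    if ((p < 0.05 && d > 0) || runif(1) < 0.10) return(c(d = d, se = se))
  }
}
studies <- as.data.frame(t(replicate(90, simulate_study())))

# Naive inverse-variance weighted estimate, ignoring publication bias.
naive <- weighted.mean(studies$d, 1 / studies$se^2)

# PET: regress effect size on the standard error; the intercept estimates
# the effect for a hypothetical study with a standard error of 0.
pet <- lm(d ~ se, data = studies, weights = 1 / se^2)
pet_ci <- confint(pet)["(Intercept)", ]

# PEESE: same idea, but with the squared standard error as predictor.
peese <- lm(d ~ I(se^2), data = studies, weights = 1 / se^2)

# Conditional rule: report PEESE only if the PET interval excludes 0.
pet_excludes_zero <- unname(pet_ci[1] > 0 || pet_ci[2] < 0)
corrected <- if (pet_excludes_zero) coef(peese)[1] else coef(pet)[1]

naive
unname(corrected)
```

The naive estimate is pulled away from 0 by the selection step alone, while the regression intercept tries to estimate what an infinitely precise study would find.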

Perhaps it will help to realize that if precognition had a true effect size of Cohen’s dz = 0.09, you’d need about 1300 participants to have 90% power in a two-sided t-test with an alpha level of 0.05. Only 1 experiment has been performed with a sufficiently large sample size (Galak, exp 7), and this experiment did not show an effect. Meier (study 3) has 1222 participants, and finds an effect at a significance level of 0.05. However, using a significance level of 0.05 is rather silly when sample sizes are so large (see http://daniellakens.blogspot.nl/2014/05/the-probability-of-p-values-as-function.html), and when we calculate a Bayes Factor using the t-value and the sample size, we see this results in a JZS Bayes Factor of 1.90, which is nothing that should convince us.

library(BayesFactor)
# ttest.tstat() returns the log Bayes factor; 1/exp() turns it into its inverse on the raw scale
1/exp(ttest.tstat(t=2.37, n1=1222, rscale = 0.707)[['bf']])

[1] 1.895727
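The sample size of roughly 1300 mentioned above can be checked with `power.t.test()` from base R's stats package. This sketch assumes a one-sample (or paired) design, which is what an effect size expressed as dz implies, with the standard deviation fixed at 1 so that `delta` equals dz.

```r
# Required n for 90% power to detect dz = 0.09 in a two-sided
# one-sample t-test with alpha = 0.05.
res <- power.t.test(delta = 0.09, sd = 1, power = 0.90,
                    sig.level = 0.05, type = "one.sample",
                    alternative = "two.sided")
ceiling(res$n)   # participants needed, rounded up
```

The result should land very close to 1300.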


Estimating the evidential value with p-curve and p-uniform.

The authors report two analyses to examine the effect size based on the distribution of p-values. These techniques are new, and although it is great the authors embrace these techniques, they should be used with caution. (I'm skipping a quite substantial discussion of the p-uniform test that was part of the review. The short summary is that the authors didn't know what they were doing).

The new test of the p-curve app returns a statistically significant effect when testing for right skew, or evidential value, when we use the test values the authors use (the test has recently been updated - in the version the authors used, the p-curve analysis was not significant). However, the p-curve analysis now also includes an exploration of how much this test result depends on a single p-value, by plotting the significance levels of the test if the k most extreme p-values are removed. As we see in the graph below (blue, top-left), the test for evidential value returns a p-value above 0.05 after excluding only 1 p-value, which means we cannot put a lot of confidence in these results.


 
I think it is important to note that I have already uncovered many coding errors in a previous blog post, even though the authors note that 2 authors independently coded the effect sizes. I feel I could keep pointing out more and more errors in the meta-analysis (instead, I will just recommend including a real statistician as a co-author), but let’s add one to illustrate how easily the conclusion in the current p-curve analysis changes.

The authors include Bierman and Bijl (2013) in their spreadsheet. The raw data of this experiment is shared by Bierman and Bijl (and available at: https://www.dropbox.com/s/j44lvj0c561o5in/Main%20datafile.sav - another excellent example of open science), and I can see that although Bierman and Bijl exclude one participant for missing data, the reaction times that are the basis for the effect size estimate in the meta-analysis are not missing. Indeed, in the master's thesis itself (Bijl & Bierman, 2013), all reaction time data is included. If I reanalyze the data, I find the same result as in the master's thesis:



I don’t think there can be much debate about whether all reaction time data should have been included (and Dick Bierman agrees with me in personal communication), and I think that the choice to report reaction time data from 67 instead of 68 participants is one of those tiny sources of bias that creep into the decisions researchers almost unconsciously make (after all, the results were statistically different from zero regardless of the final choice). However, for the p-curve analysis (which assumes authors stop their analysis when p-values are smaller than 0.05) this small difference matters. If we include t(67)=2.11 in the p-curve analysis instead of t(67)=2.59, the new p-curve test no longer indicates the studies have evidential value.
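To see why this matters, it helps to convert both test statistics to two-sided p-values. The small helper `p_from_t` below is an illustrative name, not part of any package, and the degrees of freedom are taken as written above.

```r
# Two-sided p-value for a t statistic with the given degrees of freedom.
p_from_t <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)

p_coded    <- p_from_t(2.59, 67)   # value included in the meta-analysis
p_all_data <- p_from_t(2.11, 67)   # value when all reaction times are used

round(c(coded = p_coded, all_data = p_all_data), 4)
```

Both p-values are below 0.05, but one sits just above 0.01 and the other close to 0.04. Because p-curve looks at where significant p-values fall within the 0 to 0.05 range, moving a single value from the low end to the high end reduces the right skew the test is looking for.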

Even if the p-curve test based on the correct data would have shown there is evidential value (although it is comforting it doesn’t) we should not be mindlessly interpreting the p-values we get from the analyses. Let’s just look at the plot of our data. We see a very weird p-value distribution with many more p-values between 0.01-0.02 than between 0.00-0.01 (whereas the reverse pattern should be observed, see for example Lakens, 2014).
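A quick simulation shows the pattern we would expect when there is a true effect. The effect size (dz = 0.3) and sample size (n = 50) below are arbitrary choices for illustration and have nothing to do with the precognition studies.

```r
set.seed(123)
n_sims <- 10000

# p-values from one-sample t-tests when a true effect exists.
p <- replicate(n_sims, t.test(rnorm(50, mean = 0.3, sd = 1))$p.value)

# Keep the significant results and count them in bins of width 0.01.
sig  <- p[p < 0.05]
bins <- table(cut(sig, breaks = seq(0, 0.05, by = 0.01)))
bins
```

The first bin (0 to 0.01) contains by far the most p-values, and the counts drop off towards 0.05. A distribution that peaks between 0.01 and 0.02 instead is not what a set of studies examining a true effect should produce.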




Remember that p-curve is a relatively new technique. For many tests we use (e.g., the t-test) we first perform assumption checks. In the case of the t-test, we check the normality assumption. If data isn’t normally distributed, we cannot trust the conclusions from a t-test. I would severely doubt whether we can trust the conclusion from a p-curve if there is such a clear deviation from the expected distribution. Regardless of whether the p-curve tells us there is evidential value or not, the p-curve doesn’t look like a ‘normal p-value distribution’. Consider the p-curve analysis as an overall F-test for an ANOVA. The p-curve tells us there is an effect, but if we then perform the simple effects (looking at p-values between 0.00-0.01, and between 0.01-0.02) our predictions about what these effects look like are not confirmed. This is just my own interpretation of how we could improve the p-curve test, and it will be useful to see how this test develops. For now, I just want to conclude it is debatable whether the conclusion there is an effect has passed the p-curve test for evidential value (I would say it has not), and passing the test is not immediately a guarantee there is evidential value.

The presence of bias

In the literature, a lot has been said about the fact that the low-powered studies reported in Bem (2011) strongly suggest there are an additional number of unreported experiments, or that the effect size estimates were artificially inflated by p-hacking (see Francis, 2012). The authors mention the following when discussing the possibility that there is a file-drawer (page 9):

“In his own discussion of potential file-drawer issues, Bem (2011) reported that they arose most acutely in his two earliest experiments (on retroactive habituation) because they required extensive preexperiment pilot testing to select and match pairs of photographs and to adjust the number and timing of the repeated subliminal stimulus exposures. Once these were determined, however, the protocol was “frozen” and the formal experiments begun. Results from the first experiment were used to rematch several of the photographs used for its subsequent replication. In turn, these two initial experiments provided data relevant for setting the experimental procedures and parameters used in all the subsequent experiments. As Bem explicitly stated in his article, he omitted one exploratory experiment conducted after he had completed the original habituation experiment and its successful replication.”

This is not sufficient. The power for his studies is too low to have observed the number of low p-values reported in Bem (2011) without having a much more substantial file-drawer, or p-hacking. It simply is not possible, and we should not accept vague statements about what has been reported. Where I would normally give researchers the benefit of the doubt (our science is built on this, to a certain extent) I cannot do this when there is a clear statistical indication that something is wrong. To illustrate this, let’s take a look at the funnel plot for just the studies by Dr. Bem.


Data outside of the grey triangle is statistically significant (in a two-sided test). The smaller the sample size (and the larger the standard error), the larger the effect size needs to be to show a statistically significant effect. If you would report everything you find, effect sizes should be randomly distributed around the true effect size. If they all fall on the edge of the grey triangle, there is a clear indication the studies were selected based on their (one-sided) p-value. It’s also interesting to note that the effect size estimates provided by Dr Bem are twice as large as the overall meta-analytic effect size estimate. The fact that there are no unpublished studies by Dr Bem in his own meta-analysis, even when the statistical signs are very clear that such studies should exist, is for me a clear sign of bias.
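The logic of the funnel plot can be sketched with a few lines of R. This is a simulation of honest reporting, not a reproduction of the plot above: the number of studies, the sample sizes, and the true effect of 0.09 are assumptions for illustration, and the standard error of dz is approximated as 1/sqrt(n).

```r
set.seed(7)
k  <- 200
n  <- sample(20:200, k, replace = TRUE)
se <- 1 / sqrt(n)                       # approximate standard error of dz
d  <- rnorm(k, mean = 0.09, sd = se)    # everything is reported, no selection

# Edge of the triangle: the effect size needed for two-sided p < .05.
crit <- qnorm(0.975) * se
share_significant <- mean(abs(d) > crit)
share_significant

# Draw the funnel only in an interactive session.
if (interactive()) {
  o <- order(se)
  plot(d, se, ylim = rev(range(se)),
       xlab = "Effect size", ylab = "Standard error")
  lines(crit[o], se[o])
  lines(-crit[o], se[o])
}
```

Without selection, most points fall inside the triangle and scatter symmetrically around the true effect. A set of studies that hugs the right edge of the triangle looks nothing like this.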

Now you can publish a set of studies like this in a top journal in psychology as evidence for precognition, but I just use these studies to explain to my students what publication bias looks like in a funnel plot.
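The earlier point that low-powered studies cannot produce a long run of significant results without a file-drawer or p-hacking comes down to simple binomial arithmetic. The numbers below are placeholders chosen for illustration (they are not power estimates for Bem, 2011): suppose 9 studies each have 50% power, and 8 of them are reported as significant.

```r
power_per_study <- 0.5   # hypothetical power of each study
k_studies       <- 9     # number of studies in the set
k_significant   <- 8     # hypothetical number of significant results

# Probability of observing at least k_significant significant results
# if every study that was run is also reported.
prob <- pbinom(k_significant - 1, k_studies, power_per_study,
               lower.tail = FALSE)
prob   # 10/512, about 0.02
```

With lower power per study, this probability shrinks further, which is why a string of successes from underpowered studies is a statistical warning sign.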

For this research area to be taken seriously by scientists, it should make every attempt to be free from bias. I know many researchers in this field, among others Dr Tressoldi, one of the co-authors, are making every attempt to meet the highest possible standards, for example by publishing pre-registered studies (e.g., https://koestlerunit.wordpress.com/study-registry/registered-studies/). I think this is the true way forward. I also think it is telling us something that if replications are performed, these consistently fail to replicate the original results (such as a recent replication by one of the co-authors, Rabeyron, 2014, which did not replicate his own original results – note his original results are included in the meta-analysis, but his replication is not). Publishing a biased meta-analysis stating in the abstract there is “decisive evidence” in support of the experimental hypothesis, while upon closer scrutiny the meta-analysis fails to provide any conclusive evidence of the presence of an effect (let alone support for the hypothesis that psi exists), would be a step back, rather than a step forward.

Conclusion

No researcher should be convinced by this meta-analysis that psi effects exist. I think it is comforting that PET meta-regression indicates the effect is not reliably different from 0 after controlling for publication bias, and that p-curve analyses do not indicate the studies have evidential value. However, even when statistical techniques would all conclude there is no bias, we should not be fooled into thinking there is no bias. There most likely will be bias, but statistical techniques are simply limited in the bias they can reliably indicate.

I think that based on my reading of the manuscript, the abstract of the manuscript in a future revision should read as follows:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded, a generalized variant of the phenomenon traditionally denoted by the term precognition (Bem, 2011). To encourage replications, all materials needed to conduct them were made available on request. We here report a meta-analysis of 90 experiments from 33 laboratories in 14 countries which yielded an overall effect size (Hedges’ g) of 0.09, which after controlling for publication bias using a PET-meta-regression is reduced to 0.008, which is not reliably different from 0, 95% CI [-0.03; 0.05]. These results suggest positive findings in the literature are an indication of the ubiquitous presence of publication bias, but cannot be interpreted as support for psi-phenomena. In line with these conclusions, a p-curve analysis on the 18 significant studies did not provide evidential value for a true effect. We discuss the controversial status of precognition and other anomalous effects collectively known as psi, and stress that even if future statistical inferences from meta-analyses would result in an effect size estimate that is statistically different from zero, the results would not allow for any theoretical inferences about the existence of psi as long as there are no theoretical explanations for psi-phenomena.