A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, April 4, 2015

Why a meta-analysis of 90 precognition studies does not provide convincing evidence of a true effect

A meta-analysis of 90 studies on precognition by Bem, Tressoldi, Rabeyron, & Duggan has been circulating recently. I examined this meta-analysis of precognition experiments in an earlier blog post. I had a very collaborative exchange with the authors, which was cordial and professional, and led the authors to correct the mistakes I pointed out and answer some questions I had. I thought it was interesting to write a pre-publication peer review of an article that had been posted in a public repository, and since I had invested time in commenting on this meta-analysis anyway, I was more than happy to accept the invitation to peer-review it. This blog is a short summary of my actual review - since a pre-print of the paper is already online, and it has already been cited 11 times, perhaps people are interested in my criticism of the meta-analysis. I expect that many of my comments below apply to other meta-analyses by the same authors (e.g., this one), and a preliminary look at the data confirms this. Until I sit down and actually do a meta-meta-analysis, here's why I don't think there is evidence for precognition in the Bem et al. meta-analysis.

Only 18 statistically significant precognition effects have been observed in the last 14 years, by just 7 different labs, as the meta-analysis by Bem, Tressoldi, Rabeyron, and Duggan reveals. 72 studies reveal no effect. If research on pre-cognition has demonstrated anything, it is that when you lack a theoretical model, scientific insights are gained at a painstakingly slow pace, if they are gained at all.

The question the authors attempt to answer in their meta-analysis is whether there is a true signal in this noisy set of 90 studies. If there is, it obviously does not mean we have proof that precognition exists. In science, we distinguish between statistical inferences and theoretical inferences (e.g., Meehl, 1990). Even if a meta-analysis were to lead to the statistical inference that there is a signal in the noise, there is as yet no compelling reason to draw the theoretical inference that precognition exists, due to the lack of a theoretical framework, as acknowledged by the authors. Nevertheless, it is worthwhile to see whether, after 14 years and 90 studies, something is going on.

In the abstract, the authors conclude: there is “an overall effect greater than 6 sigma, z = 6.40, p = 1.2 × 10⁻¹⁰ with an effect size (Hedges’ g) of 0.09. A Bayesian analysis yielded a Bayes Factor of 1.4 × 10⁹, greatly exceeding the criterion value of 100 for “decisive evidence” in support of the experimental hypothesis.” Let’s check the validity of this claim.

Dealing with publication bias.

Every meta-analysis needs to deal with publication bias to prevent the meta-analytic effect size estimate from reflecting nothing more than the inflation away from 0 that emerges because people are more likely to share positive results. Bem and colleagues use Begg and Mazumdar’s rank correlation test to examine publication bias, stating that: “The preferred method for calculating this is the Begg and Mazumdar’s rank correlation test, which calculates the rank correlation (Kendall’s tau) between the variances or standard errors of the studies and their standardized effect sizes (Rothstein, Sutton & Borenstein, 2005).”

I could not find this recommendation in Rothstein et al., 2005. From the same book, Chapter 11, p. 196, about the rank correlation test: “the test has low power unless there is severe bias, and so a non-significant tau should not be taken as proof that bias is absent (see also Sterne et al., 2000, 2001b, c)”. Similarly, from the Cochrane handbook of meta-analyses: “The test proposed by Begg and Mazumdar (Begg 1994) has the same statistical problems but lower power than the test of Egger et al., and is therefore not recommended.”
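To make concrete what this test computes, here is a minimal sketch (in Python, on simulated data rather than the meta-analytic data): Begg and Mazumdar's test is nothing more than Kendall's tau between the observed effect sizes and their variances. The quotes above warn that this correlation test has low power, so a non-significant result is weak evidence that bias is absent.

```python
# Sketch of Begg & Mazumdar's rank correlation test for publication bias.
# All data below are simulated for illustration; this is not the authors' analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
se = rng.uniform(0.05, 0.35, 90)   # hypothetical standard errors for 90 studies
g = rng.normal(0.0, se)            # simulate a true effect size of zero
keep = g / se > 0.5                # crude selective publication of "positive" studies

# The test itself: Kendall's tau between effect sizes and their variances.
tau, p = stats.kendalltau(g[keep], se[keep] ** 2)
print(f"Kendall's tau = {tau:.2f}, p = {p:.4f}")
```

Even with this fairly blatant simulated selection, the test statistic is all the method has to work with, which is why a non-significant tau should never be read as evidence of no bias.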

When the observed effect size is tiny (as in the case of the current meta-analysis), just a small amount of bias can yield a small meta-analytic effect size estimate that is statistically different from 0. In other words, whereas a significant test result is reason to worry, a non-significant test result is not reason not to worry.

The authors also report the trim-and-fill method to correct for publication bias. It is known that when publication bias is induced by a p-value boundary, rather than an effect size boundary, and there is considerable heterogeneity in the effects included in the meta-analysis, the trim-and-fill method might not perform well enough to yield a corrected meta-analytic effect size estimate that is close to the true effect size (Peters, Sutton, Jones, Abrams, & Rushton, 2007; Terrin, Schmid, Lau, & Olkin, 2003, see also the Cochrane handbook). I’m not sure what upsets me more: The fact that people continue to use this method, or the fact that the people who use this method still report the uncorrected effect size estimate in their abstract.

Better tests for publication bias

PET-PEESE meta-regression seems to be the best test to correct effect size estimates for publication bias we currently have. This approach is based on first using the precision-effect test (PET, Stanley, 2008) to examine whether there is a true effect beyond publication bias, and then following up on this test (if the confidence intervals for the estimate exclude 0) by a PEESE (precision-effect estimate with standard error, Stanley and Doucouliagos, 2007) to estimate the true effect size.

In the R code where I have reproduced the meta-analysis (see below), I have included the PET-PEESE meta-regression. The results are clear: the estimated effect size when correcting for publication bias is 0.008, and the confidence intervals around this effect size estimate do not exclude 0. In other words, there is no good reason to assume that anything more than publication bias is going on in this meta-analysis.
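For readers who want to see the logic of the method, here is a minimal PET-PEESE sketch in Python on simulated data (the analysis reported in this post itself uses R). PET is a weighted regression of effect sizes on their standard errors; the intercept, the predicted effect in a hypothetical study with a standard error of zero, serves as the bias-corrected estimate. PEESE uses the squared standard error instead.

```python
# Minimal PET-PEESE sketch on simulated data (not the meta-analytic data).
import numpy as np

rng = np.random.default_rng(7)
se = rng.uniform(0.05, 0.35, 90)   # hypothetical standard errors
d = rng.normal(0.0, se)            # true effect of zero
keep = d / se > 0.5                # selective publication inflates the naive average
d, se = d[keep], se[keep]

naive = np.average(d, weights=1 / se ** 2)                # uncorrected fixed-effect estimate
# PET: d_i = b0 + b1 * SE_i, weighted by 1/SE_i^2 (np.polyfit expects w = 1/sigma).
b1_pet, b0_pet = np.polyfit(se, d, 1, w=1 / se)
# PEESE: d_i = b0 + b1 * SE_i^2, used only if PET suggests a nonzero effect.
b1_peese, b0_peese = np.polyfit(se ** 2, d, 1, w=1 / se)
print(f"uncorrected: {naive:.3f}, PET intercept: {b0_pet:.3f}, PEESE intercept: {b0_peese:.3f}")
```

A full analysis would also compute confidence intervals for the intercept, which is what a dedicated meta-regression package (such as metafor in R) provides.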

Perhaps it will help to realize that if precognition had a true effect size of Cohen’s dz = 0.09, you’d need 1300 participants to have 90% power in a two-sided t-test with an alpha level of 0.05. Only 1 experiment has been performed with a sufficiently large sample size (Galak, exp 7), and this experiment did not show an effect. Meier (study 3) has 1222 participants, and finds an effect at a significance level of 0.05. However, using a significance level of 0.05 is rather silly when sample sizes are so large (see http://daniellakens.blogspot.nl/2014/05/the-probability-of-p-values-as-function.html) and when we calculate a Bayes Factor using the t-value and the sample size, we see this results in a JZS Bayes Factor of 1.90 – nothing that should convince us.

# Requires the BayesFactor R package; ttest.tstat() returns the natural log of the JZS BF10,
# so 1/exp(...) gives BF01, the evidence for the null over the alternative.
library(BayesFactor)
1/exp(ttest.tstat(t=2.37, n1=1222, rscale=0.707)[['bf']])

[1] 1.895727
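The sample size claim above is easy to check with the standard normal-approximation power formula (a sketch; dedicated power software gives nearly the same answer after the small-sample t correction):

```python
# Required N for 90% power, two-sided test, alpha = .05, true dz = 0.09,
# using the normal approximation n = ((z_alpha/2 + z_beta) / d)^2.
from scipy import stats

d, alpha, power = 0.09, 0.05, 0.90
z_a = stats.norm.ppf(1 - alpha / 2)   # critical z for a two-sided test, ~1.96
z_b = stats.norm.ppf(power)           # z for 90% power, ~1.28
n = ((z_a + z_b) / d) ** 2
print(f"required N = {n:.0f}")        # roughly 1300 participants
```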

Estimating the evidential value with p-curve and p-uniform.

The authors report two analyses to examine the effect size based on the distribution of p-values. These techniques are new, and although it is great the authors embrace these techniques, they should be used with caution. (I'm skipping a quite substantial discussion of the p-uniform test that was part of the review. The short summary is that the authors didn't know what they were doing).

The new test of the p-curve app returns a statistically significant effect when testing for right skew, or evidential value, when we use the test values the authors use (the test has recently been updated - in the version the authors used, the p-curve analysis was not significant). However, the p-curve analysis now also includes an exploration of how much this test result depends on a single p-value, by plotting the significance levels of the test when the k most extreme p-values are removed. As we see in the graph below (blue, top-left), the test for evidential value returns a p-value above 0.05 after excluding only 1 p-value, which means we cannot put a lot of confidence in these results.

I think it is important to note that I have already uncovered many coding errors in a previous blog post, even though the authors note that 2 authors independently coded the effect sizes. I feel I could keep pointing out more and more errors in the meta-analysis (instead, I will just recommend including a real statistician as a co-author), but let’s add one more to illustrate how easily the conclusion of the current p-curve analysis changes.

The authors include Bierman and Bijl (2013) in their spreadsheet. The raw data of this experiment is shared by Bierman and Bijl (and available at: https://www.dropbox.com/s/j44lvj0c561o5in/Main%20datafile.sav - another excellent example of open science), and I can see that although Bierman and Bijl exclude one participant for missing data, the reaction times that are the basis for the effect size estimate in the meta-analysis are not missing. Indeed, in the master thesis itself (Bijl & Bierman, 2013), all reaction time data is included. If I reanalyze the data, I find the same result as in the master thesis:

I don’t think there can be much debate about whether all reaction time data should have been included (and Dick Bierman agrees with me in personal communication), and I think that the choice to report reaction time data from 67 instead of 68 participants is one of those tiny sources of bias that creep into the decisions researchers almost unconsciously make (after all, the results were statistically different from zero regardless of the final choice). However, for the p-curve analysis (which assumes authors stop their analysis when p-values are smaller than 0.05) this small difference matters. If we include t(67)=2.11 in the p-curve analysis instead of t(67)=2.59, the new p-curve test no longer indicates the studies have evidential value.
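To see why this single study matters so much, we can recompute the two-sided p-values for the two t-values (a quick check, not part of the original analysis): the corrected t-value moves the p-value from clearly below .025 to above it, which is exactly the region p-curve is sensitive to.

```python
# Two-sided p-values for the two candidate t-values with df = 67.
from scipy import stats

for t in (2.59, 2.11):
    p = 2 * stats.t.sf(t, df=67)   # survival function gives the upper tail
    print(f"t(67) = {t:.2f}, two-sided p = {p:.4f}")
```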

Even if the p-curve test based on the correct data would have shown there is evidential value (although it is comforting it doesn’t), we should not be mindlessly interpreting the p-values we get from the analyses. Let’s just look at the plot of our data. We see a very weird p-value distribution with many more p-values between 0.01 and 0.02 than between 0.00 and 0.01 (whereas the reverse pattern should be observed; see, for example, Lakens, 2014).

Remember that p-curve is a relatively new technique. For many tests we use (e.g., the t-test) we first perform assumption checks. In the case of the t-test, we check the normality assumption. If data aren’t normally distributed, we cannot trust the conclusions from a t-test. I would severely doubt whether we can trust the conclusion from a p-curve if there is such a clear deviation from the expected distribution. Regardless of whether the p-curve tells us there is evidential value or not, the p-curve doesn’t look like a ‘normal p-value distribution’. Consider the p-curve analysis as an overall F-test for an ANOVA. The p-curve tells us there is an effect, but if we then perform the simple effects (looking at p-values between 0.00-0.01, and between 0.01-0.02), our predictions about what these effects look like are not confirmed. This is just my own interpretation of how we could improve the p-curve test, and it will be useful to see how this test develops. For now, I just want to conclude that it is debatable whether the conclusion that there is an effect has passed the p-curve test for evidential value (I would say it has not), and that passing the test is not immediately a guarantee there is evidential value.

The presence of bias

In the literature, a lot has been said about the fact that the low-powered studies reported in Bem (2011) strongly suggest there are an additional number of unreported experiments, or that the effect size estimates were artificially inflated by p-hacking (see Francis, 2012). The authors mention the following when discussing the possibility that there is a file-drawer (page 9):

“In his own discussion of potential file-drawer issues, Bem (2011) reported that they arose most acutely in his two earliest experiments (on retroactive habituation) because they required extensive preexperiment pilot testing to select and match pairs of photographs and to adjust the number and timing of the repeated subliminal stimulus exposures. Once these were determined, however, the protocol was “frozen” and the formal experiments begun. Results from the first experiment were used to rematch several of the photographs used for its subsequent replication. In turn, these two initial experiments provided data relevant for setting the experimental procedures and parameters used in all the subsequent experiments. As Bem explicitly stated in his article, he omitted one exploratory experiment conducted after he had completed the original habituation experiment and its successful replication.”

This is not sufficient. The power of his studies is too low to have observed the number of low p-values reported in Bem (2011) without a much more substantial file-drawer, or without p-hacking. It simply is not possible, and we should not accept vague statements about what has been reported. Where I would normally give researchers the benefit of the doubt (our science is built on this, to a certain extent), I cannot do so when there is a clear statistical indication that something is wrong. To illustrate this, let’s take a look at the funnel plot for just the studies by Dr. Bem.

Data outside of the grey triangle are statistically significant (in a two-sided test). The smaller the sample size (and the larger the standard error), the larger the effect size needs to be to reach statistical significance. If you reported everything you found, effect sizes would be randomly distributed around the true effect size. If they all fall on the edge of the grey triangle, there is a clear indication that the studies were selected based on their (one-sided) p-value. It’s also interesting to note that the effect size estimates provided by Dr. Bem are twice as large as the overall meta-analytic effect size estimate. The fact that there are no unpublished studies by Dr. Bem in his own meta-analysis, even when the statistical signs are very clear that such studies should exist, is for me a clear sign of bias.
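The boundary of the grey triangle is easy to compute yourself: at any given standard error, an effect is significant in a two-sided test at alpha = .05 only when it exceeds roughly 1.96 times that standard error. A short sketch:

```python
# The funnel plot significance boundary: |g| > z_crit * SE for two-sided p < .05.
from scipy import stats

z_crit = stats.norm.ppf(0.975)   # ~1.96 for a two-sided test at alpha = .05
for se in (0.10, 0.15, 0.20, 0.25):
    print(f"SE = {se:.2f}: significant only if |g| > {z_crit * se:.3f}")
```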

Now you can publish a set of studies like this in a top journal in psychology as evidence for precognition, but I just use these studies to explain to my students what publication bias looks like in a funnel plot.

For this research area to be taken seriously by scientists, it should make every attempt to be free from bias. I know many researchers in this field, among others Dr. Tressoldi, one of the co-authors, are making every attempt to meet the highest possible standards, for example by publishing pre-registered studies (e.g., https://koestlerunit.wordpress.com/study-registry/registered-studies/). I think this is the true way forward. I also think it is telling that when replications are performed, they consistently fail to replicate the original results (such as a recent replication by one of the co-authors, Rabeyron, 2014, which did not replicate his own original results - note that his original results are included in the meta-analysis, but his replication is not). Publishing a biased meta-analysis stating in the abstract that there is “decisive evidence” in support of the experimental hypothesis, while upon closer scrutiny the meta-analysis fails to provide any conclusive evidence of the presence of an effect (let alone support for the hypothesis that psi exists), would be a step back, rather than a step forward.


No researcher should be convinced by this meta-analysis that psi effects exist. I think it is comforting that the PET meta-regression indicates the effect is not reliably different from 0 after controlling for publication bias, and that p-curve analyses do not indicate the studies have evidential value. However, even if statistical techniques were all to conclude there is no bias, we should not be fooled into thinking there is none. There will most likely be bias; statistical techniques are simply limited in the amount of bias they can reliably detect.

I think that based on my reading of the manuscript, the abstract of the manuscript in a future revision should read as follows:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded, a generalized variant of the phenomenon traditionally denoted by the term precognition (Bem, 2011). To encourage replications, all materials needed to conduct them were made available on request. We here report a meta-analysis of 90 experiments from 33 laboratories in 14 countries, which yielded an overall effect size (Hedges’ g) of 0.09. After controlling for publication bias using a PET meta-regression, this estimate is reduced to 0.008, which is not reliably different from 0, 95% CI [-0.03; 0.05]. These results suggest that positive findings in the literature are an indication of the ubiquitous presence of publication bias, and cannot be interpreted as support for psi-phenomena. In line with these conclusions, a p-curve analysis of the 18 significant studies did not provide evidential value for a true effect. We discuss the controversial status of precognition and other anomalous effects collectively known as psi, and stress that even if future statistical inferences from meta-analyses were to result in an effect size estimate that is statistically different from zero, the results would not allow any theoretical inferences about the existence of psi as long as there are no theoretical explanations for psi-phenomena.


  1. Hi Daniel,

    we all know RT are skewed - in those 90 studies how many used mean(RT)? and what is the impact on p values? if I compare mean(RT) with median(RT) and/or mean(log(RT)), the mean(RT) can give very different results ..

    1. The relevant quantity in each study will not be RT but some comparison between RTs such as difference or quotient.

    2. My criticisms are not yet focussed on the raw data - this is purely based on the meta-analysis, and in some cases the data used to calculate the effect size. If we'd look at the raw data, I'm sure more criticisms would pop up.

  2. Muhaha Daniel, give up! Hypothesis testing does not work. H0 is always false and even the crazy precognition studies show this!

    1. Bayes Factors lead to the same conclusion. And unless you know how to model an effect without any theoretical prediction, you are either left with rejecting precognition outright (which would make you vulnerable to the criticism of not being scientific) or testing the H0. Unless you have an alternative?

    2. Yeah, discard BFs with the rest of hypothesis testing. Just do parameter estimation. The problem of precognition is that the effects are so tiny that we can discard them based on their magnitude and do not need to care whether they are reliably tiny or just non-existent. Instead we can focus on research that demonstrates reliably big effects - for instance modern-day physics, which currently tells us that precognition can't exist.

    3. What would you say to using a null interval instead of a point null for hypothesis testing? Say it was uniform spanning +/-10^-6 or some other small interval.

    4. Matus, if we needed to dismiss everything modern physics can't explain, you needed to dismiss the light waved coming from your electronical device which allow you to read this. There's no good theory of light (Waves? Particles?) but it is empirically demonstrable. People who believe in precognition are not convinced by your argument, and use NHST to convince the general public there might be something there. Yes, we can ignore it. Or we can show they are not using NHST correctly and draw conclusions based on a biased literature.

    5. Daniel,
      I didn't say that "we needed to dismiss everything modern physics can't explain". I said we should focus on reliably big effect sizes - such as found in physics. I also explicitly used the qualifier "currently". If there is some quantum magic involved in precognition, physics is in excellent position to uncover it.
      "People who believe in precognition are not convinced by your argument, and use NHST to convince the general public"
      ... and they so far fail miserably (minus perhaps a few eso-freaks), so maybe they should start heeding my argument, which would help them better persuade the public. Btw. the Soviets and Americans studied psi extensively during the cold war, but gave up at some point. Why? No, not because they figured out that H0 is correct. What they figured out is that even if psi exists it's so weak that it doesn't impact our lives and there is no practical use for it.

    6. Alex, we could just test H0: theta>0.6 where theta is the success probability and theta=0.5 is random performance. But I really prefer to know theta and its CI. If it's 0.7>theta>0.6 I would take notice. If it's 0.9>theta>0.7 I want to try that experiment in our lab, and if it's 0.9<theta I'm running to the betting kiosk right away.

    7. Great post!
      I have a topical suggestion below...

      First: Let's just not talk about physics for a while as psychologists ok?

      [quickly puts on an -access all areas- philosopher badge]
      "... you needed to dismiss the light waved coming from your electronical device which allow you to read this. There's no good theory of light (Waves? Particles?) "

      I think what Bem et al. have produced warrants the following note to be included with almost all modern electronic devices: "made possible by Quantum Physics, the ultimate empirically accurate scientific theory about light and matter that we will not give up just because some psychologists can't properly apply the scientific method and misinterpret everything that we have achieved in less than a century since the 1930s when the Quantum Formalism was first postulated -what have you guys been doing all that time... still figuring out what that Fisher character went on about in 1925?- and if you think the theory is false please refrain from using all electronic devices or you will be fined."

      If you can use the fundamental knowledge about the world posited by a scientific theory to actually build things that would have been impossible to build without it... the theory must be on to something.

      About detecting small effects: The S.Q.U.I.D. sensors in a MEG scanner are Superconducting Quantum Interference Devices. The best are able to detect magnetic fields of 10⁻¹⁸ Tesla (refrigerator magnets produce 10⁻² T). Flash memory makes use of quantum tunneling to erase data, every light emitting diode display has some quantum physics going on, especially of course the QuantumDot technology. Oh, and then there's that old CD player which uses a LASER, which is based on a quantum phenomenon described by Einstein.

      The 'explanation' of the wave-like and particle-like behavior of light and matter IS Quantum Physics (more precisely, the Quantum Formalism of 1925-1935). The quantum revolution started with Einstein's explanation of the photo-electric effect in 1905, for which he won the Nobel Prize in 1921 (not for relativity). In the Quantum Formalism the duality of Wave and Particle descriptions (the Schrödinger picture and Heisenberg picture) are two sides of the same structural coin and are equivalent for all intents and purposes. Oversimplifying: the difference lies in whether the distribution parameters (density-matrix) or the covariates (system observables) are considered time-varying. There is also an interaction picture in which both are time-varying: http://en.wikipedia.org/wiki/Heisenberg_picture#Summary_comparison_of_evolution_in_all_pictures

      Back to common ground...
      The need to have independent replications conducted by different labs is being accepted... slowly (this has btw always been the ultimate test in physics needed to get consensus on a discovery)

      I think there's a real need for independent Meta-Analyses, conducted by teams who have no stakes in the topic analysed. There are other examples of Meta-gone-wrong by researchers testing their own beef, crop, pudding and what not.

  3. Daniel, I want to commend you for taking the time and providing the technical expertise to evaluate this meta-analysis. Since your posting focuses on technical details of the meta-analysis, it may inspire readers to think that meta-analysis can compensate for weak methodology and solve a scientific controversy like this. I want to emphasize your comment that the way forward is pre-registered, well-powered confirmatory research. If the experimenters actually understand and control a real phenomenon as they claim, 80% or more of pre-registered confirmatory studies should provide significant evidence for an effect—as compared to the 20% to 33% found in this and other meta-analyses in parapsychology.

    Meta-analyses of mostly nonsignificant, underpowered, unregistered studies attempt to use post hoc analyses of observational data to compensate for methodological weaknesses in the original studies (“synthesis-generated evidence” rather than “study-generated evidence”--discussion and references at http://jeksite.org/psi/jp13a.pdf). That strategy has not been and cannot be expected to be effective for resolving scientific controversies. Confirmatory studies are needed that eliminate the methodological weaknesses. As you noted, some parapsychological researchers are making efforts to do this type of research.

    I have not delved into the technical details of this meta-analysis, but I did look into the distribution of effect sizes for Bem’s original paper. I came to the conclusion that a meaningful evaluation of effect sizes could not be done because the experiments used different tasks and had different numbers of trials per subject for the different tasks. The effective sample size for a study depends on both the number of subjects and number of trials per subject. I do not see any reason to expect a fixed effect size for the different experiments (as if they were replications of one experiment), and am skeptical of efforts (pro or con) to interpret funnel plots and related evaluations that combine the different types of experiments.



  6. Hi Daniel, I have two questions:

    1. In your critique, you argue (regarding publication bias) that: “PET-PEESE meta-regression seems to be the best test to correct effect size estimates for publication bias we currently have.” Do you have a citation for that claim?

    The authors (Bem et al.) do use several measures (based on correlation between effect size and sample size) to explore the possibility of publication bias. The only one where they don’t get a significant effect is the PET technique, very similar to the one you choose for your analysis. However, the authors note that “Sterne & Egger (2005) (upon which PET is based) themselves caution, however, this procedure cannot assign a causal mechanism, such as selection bias, to the correlation between study size and effect size, and they urge the use of the more noncommittal term “small-study effect.” So the literature suggests we should use caution in how we interpret this particular measure.

    2. I’m puzzled by your claim “there are no unpublished studies by Dr Bem in his own meta-analysis.” In Table 1, which summarizes the studies included in the meta-analysis you are critiquing, I count 38 lines (separate experiments) that were not peer reviewed.




  8. The Lakens 2014 reference link to SSRN is dead. Is it possible to post a working link? Thanks!
