I recently read a meta-analysis on precognition studies by Bem, Tressoldi, Rabeyron, and Duggan (available on SSRN). The authors conclude in the abstract: 'We can now report a metaanalysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma with an effect size (Hedges’ g) of 0.09, combined z = 6.33, p = 1.2 ×10^-10). A Bayesian analysis yielded a Bayes Factor of 1.24 × 10^9, greatly exceeding the criterion value of 100 for “decisive evidence” in favor of the experimental hypothesis (Jeffries, 1961).'
If precognition is true, this would be quite something. I think our science should be able to answer the question of whether precognition exists in an objective manner. My interest was drawn to the meta-analysis because the authors reported a p-curve analysis (recently developed by Simonsohn, Nelson, & Simmons, 2014), which is an excellent complement to traditional meta-analyses (see Lakens & Evers, 2014). The research area of precognition has been quite responsive to methodological and statistical improvements (see this site where parapsychologists can pre-register their experiments).
The p-curve analysis provided support for evidential value in a binomial test (no longer part of the 2.0 p-curve app), but not when the more sensitive chi-squared test was used (now the only remaining option). The authors did not provide a p-curve disclosure table, so I worked on one for a little bit to distract myself while my wife was in the hospital and I took time off from work (it's emotion regulation, researcher-style). My p-curve disclosure table is not intended to be final or complete, but it did lead me to some questions. I contacted the authors of the paper, who were just about as responsive and friendly as you could imagine, and responded incredibly fast to my questions - I jokingly said that if they wanted to respond any better, they'd have to answer my questions before I asked them ;).
First of all, I’d like to thank Patrizio Tressoldi for his extremely fast and helpful replies to my e-mails. Professor Bem will reply to some of the questions below later, but did not have time due to personal circumstances that happen to coincide with this blog post. I would like to share the correspondence at this point, but will update it in the future with answers by Prof. Bem. I also want to thank the many other scholars who have provided answers and discussed this topic with me - the internet makes scientific discussions so much better. I'm also grateful to the researchers who replied to Patrizio Tressoldi's questions (answers included below). Patrizio and I both happen to share a view on scientific communication not unlike that outlined in 'Scientific Utopia' by Brian Nosek and Yoav Bar-Anan. I think this pre-publication peer review of a paper posted on SSRN might be an example of a step in that direction.
Below are some questions concerning the manuscript ‘Feeling the Future: A Meta-analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events’ by Bem, Tressoldi, Rabeyron, & Duggan. My questions below concern only a small part of the meta-analysis, primarily those studies that were included in the p-curve analysis reported in the manuscript. I did not read all unpublished studies included in the meta-analysis, but will make some more general comments about the way the meta-analysis is performed.
The authors do not provide a p-curve disclosure table (as recommended by Simonsohn, Nelson, & Simmons, 2014). I’ve extended the database by Bem and colleagues to include a p-curve disclosure table. In total, I’ve included 18 significant p-values (Bem and colleagues mention 21 significant values). I’ve included a different p-value for study 6 by Bem (2011) and Bierman (2011). I’ve also excluded the p-value by Traxler, which accounts for the 3 different p-values (these choices are explained below).
In my p-curve analysis, the results
are inconclusive, since we can neither conclude there is evidential value
(χ²(36)=45.52, p=.13) nor that they lack evidential value (χ²(36)=43.5, p=.1824).
It is clear from the graph there is no indication of p-hacking. 
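As a rough check on the two test results above, the p-values can be recomputed from the reported χ² statistics. The sketch below uses only the Python standard library and relies on the closed-form upper tail of a chi-squared distribution with even degrees of freedom (here 36 = 2 × 18 p-values). The helper `fisher_chi2` is illustrative only: it assumes the pp-values have already been computed (for the right-skew test, the p-curve paper describes these as the significant p-values divided by .05), and it is not the p-curve app's own code.

```python
import math

def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-squared variable with even df.

    For df = 2k the survival function equals
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no external library is needed.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

def fisher_chi2(pp_values):
    """Fisher's method: k pp-values -> (chi-squared statistic, 2k df).

    Hypothetical helper; the pp-values must be computed beforehand.
    """
    stat = -2.0 * sum(math.log(pp) for pp in pp_values)
    return stat, 2 * len(pp_values)

# The two statistics reported in the text, each with df = 36.
p_evidential = chi2_sf_even_df(45.52, 36)  # test for evidential value
p_lack = chi2_sf_even_df(43.5, 36)         # test for lack of evidential value
print(round(p_evidential, 4), round(p_lack, 4))
```

Both recomputed values land close to the p = .13 and p = .1824 given above, so neither test crosses the conventional .05 threshold.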
An inconclusive result with 18 p-values is not an encouraging result. The analysis has substantial power (as Simonsohn et al. note, ‘With 20 p-values, it is virtually guaranteed to detect evidential value, even when the set of studies is powered at just 50%’). The lack of evidential value is mainly due to the fact that Psi researchers have failed to perform studies of high informational value (e.g., with large samples) that, if the Psi effect exists, should return primarily p-values smaller than .01 (for an exception, discussed below, see Maier et al., in press). Running high-powered studies is a lot of effort. Perhaps psi-researchers can benefit from a more coordinated and pre-registered research effort similar to the many labs approach (https://osf.io/wx7ck/). The current approach seems not to lead to studies that can provide conclusive support for their hypothesis. The p-curve is a new tool, and although Nelson et al. (2014) argue it might be better than traditional meta-analyses, we should see comparisons of p-curve analyses and meta-analyses in the future.
Reply Tressoldi: The main use of p-curve
is a  control of p-hacking procedure,
certainly not a stat to summarize quantitatively the evidence available related
to a given phenomenon given the well-known limitations of the p values.  The parameters obtained by the meta-analyses based
on all available studies, are surely more apt to this demonstration.
For the meta-analysis, the authors
simply state that ‘Effect sizes (Hedges’ g) and their standard errors were
computed from t test values and sample sizes.’ However, for within-subject
designs, such as the priming effects, the standard error can only be calculated
when the correlation between the two means that are compared is known. This
correlation is practically never provided, and thus one would expect some
information about how these correlations were estimated when the raw data was
not available. 
Reply Tressoldi: We probably refer to different formula to  calculate the ES. That used by the
Comprehensive Meta-Analysis software we used, is t/Sqr(N) multiplied by the
Hedges’ correction factor. It gives the same result of using independent t-test
assuming a correlation of 0.50.
(comment DL - the assumption of a correlation of 0.50 is suboptimal, I think). 
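The role of the assumed correlation can be made concrete with a small sketch. It computes Cohen's dz as t/√N, applies the common approximation for the Hedges' correction factor, J = 1 − 3/(4·df − 1), and rescales dz to a d expressed in raw standard deviation units via d = dz·√(2(1 − r)), which assumes equal standard deviations in the two conditions. The function names are illustrative and this is not the Comprehensive Meta-Analysis implementation; the t-value is the t(149) = 1.80 reported for Bem (2011), experiment 6, and the correlations other than 0.50 are hypothetical.

```python
import math

def cohens_dz(t: float, n: int) -> float:
    """Within-subject standardized effect size from a one-sample/paired t-test."""
    return t / math.sqrt(n)

def hedges_correction(df: int) -> float:
    """Common approximation of the small-sample correction factor J."""
    return 1.0 - 3.0 / (4.0 * df - 1.0)

def dz_to_d(dz: float, r: float) -> float:
    """Rescale dz to raw-SD units, given the correlation r between the
    paired measures (assumes equal SDs in both conditions)."""
    return dz * math.sqrt(2.0 * (1.0 - r))

n = 150
dz = cohens_dz(1.80, n)            # t(149) = 1.80
g = dz * hedges_correction(n - 1)  # bias-corrected effect size
print(round(dz, 3), round(g, 3))

# With r = 0.50 the rescaling factor is exactly 1, so dz and d coincide.
# Other (hypothetical) correlations change the result considerably.
for r in (0.2, 0.5, 0.8):
    print(r, round(dz_to_d(dz, r), 3))
```

Under these assumptions, using t/√N is equivalent to fixing r at 0.50; if the true correlation is higher, the same t-value corresponds to a smaller d, and if it is lower, to a larger one.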
Reply Tressoldi: this
study is related to the so called predictive physiological anticipatory effects
summarized and discussed in the Mossbridge et al. (2012) meta-analysis cited on
pag. 5.
Similarly, Bem himself contributes
only studies with effect sizes that are equal to or higher than the
meta-analytic effect size, despite the fact that Schimmack (2012) has noted
that it is incredibly unlikely anyone would observe such a set of studies
without having additional studies in a filedrawer. That this issue is not
explicitly addressed in the current meta-analysis is surprising and worrisome.
Prof. Bem will respond to this later.
If we look at which effects are
included in the meta-analysis, we can see surprising inconsistencies. Boer &
Bierman (2006) observed an effect for positive, but not for negative pictures.
Bem finds an effect for erotic pictures, but not for positive pictures. The
question is which effect is analyzed: If it’s the effect of positive non-erotic
pictures, the study by Bem should be excluded, if it’s erotic pictures, the
study by Boer & Bierman should be excluded. If the authors want to analyze
psi effects for all positive pictures (erotic or not), both the erotic and the positive
pictures by Bem should be included. Now, the authors of the meta-analysis are
cherry-picking significant results that share nothing (except their
significance level). This hugely increases the overall effect size in the
meta-analysis, and is extremely inconsistent. A better way to perform this
meta-analysis would be to include all stimuli (positive, negative, erotic, etc)
in a combined score, or to perform separate meta-analysis for each picture
type. Now, it is unclear what is being meta-analyzed, but it looks like a
meta-analysis of picture types for which significant effects were observed,
while the meta-analysis should be about a specific or all picture types. This
makes the meta-analysis flawed as it is.
Reply Tressoldi: Boer
& Bierman (2006) used a retro-priming whereas Bem exp. 1 used a reward protocol.
As you can see in our Table 2, we considered seven different protocols. Even if
all seven protocols test a similar hypothesis, their results are quite
different as expected. They ESs range 
from negative using the retro-practice for reading speed, to 0.14 for
the detection of reinforcement. Furthermore with only two exceptions
(retroactive practice and the detection of reinforcement) all other ES show a
random effect, caused by probable moderator variables, we did not analyzed
further given the low number of studies, but that future investigation must
take in account.
Bierman reply: The mean response time in the retro conditon for negative valence is 740 while the control gives 735. The conclusion is of course that there is no priming effect. It doesn’t make sense to run a t-test if the effect is so small and not in the predicted direction.
Of course
at the time we were not so sensitive for specifying precisely where we would
expect the retro-active priming effect and I think the proper way to deal with
this is to correct the p-value for selecting only the positive condition. On
the other hand the p-value given is two-tailed which is in retrospect rather
strange because priming generally results in faster response times so it is a
directionally specified effect. The quoted p-value of 0.004 two tailed should
have been presented as 0.002 one-tailed but should have been corrected for the
selection of the positive condition. So I would use a score corresponding to
p=0.004 (this time one-tailed) (which corresponds to the present t-test value
2.98, my note)
Similarly, in the study by Starkie
(2009) the authors include all 4 types of trials, and perform some (unspecified)
averaging which yields a single effect size. In this case, there are 4 effects,
provided by the same participants. These observations are not independent, but
assuming the correct averaging function is applied, the question is whether
this is desirable. For example, Bem (2011) argued neutral trials do not
necessarily need to lead to an effect. The authors seem to follow this
reasoning by not including the neutral trials from study 6 by Bem (2011), but
ignore this reasoning by including the neutral trials in Starkie (2009). This
inconsistency again works to the benefit of the authors' own hypothesis, given that
excluding the neutral trials in the study by Starkie (2009) would substantially
increase the effect size, which is in the opposite direction as predicted, and
would thus lower the meta-analytic Psi effect size.
Reply Tressoldi: We
already clarified how we averaged multiple outcomes from a single study. As to
Starkie’s data,  yes, you are right, we
added the results of the neutral stimuli increasing the negative effect (with
respect to the alternative hypothesis). We will correct this result in the
revision of our paper.
In Bem, experiment 6, significant
effects are observed for negative and erotic stimuli (in opposite directions,
without any theoretical explanation for the reduction in the hit-rate for
erotic stimuli). The tests for negative (51.8%,t(149) = 1.80,p =.037, d=0.15,
binomial z = 1.74, p =.041) and erotic (48.2%, t(149) = 1.77, p = .039, d
=0.14, binomial z = 1.74, p = .041) should be included (as is the case in the
other experiments in the meta-analysis, where the tests with deviations against
guessing average are included). Instead, the authors include the test of the
difference between positive and erotic pictures. This is a different effect, and
should not be included in the meta-analysis (indeed, they correctly do not use
a comparable difference test reported in Study 1 in Bem, 2011). Moreover, the
two separate effects have a Cohen’s dz of around 0.145 each, but the difference
score has a Cohen’s dz of 0.19 – this inflates the ES.
Prof. Bem will respond to this later.
An even worse inflation of the
effect size is observed in Morris (2004). Here, the effects for negative,
erotic, and neutral trials with effect sizes of dz = 0.31, dz = 0.14, and dz = 0.32
are combined in some (unspecified) way to yield an effect size of dz = 0.447.
That is obviously plain wrong – combining three effects meta-analytically
cannot return a substantially higher effect size.
Reply Tressoldi: Yes,
you are correct. We averaged erroneously the t-test values. We corrected the
database and all corresponding analyses with the correct value 2.02 (averaging
the combining erotic and negative trials with the boredom effect).
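The bound behind the Morris (2004) point can be illustrated in a few lines: any weighted average with non-negative weights must lie between the smallest and the largest of the values being combined. The sketch below uses the three dz values quoted above; the equal weights are an assumption (the three trial types come from the same participants), and the helper name is illustrative.

```python
def weighted_mean(values, weights):
    """Weighted average; requires non-negative weights with a positive sum."""
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# dz values quoted in the text for Morris (2004).
effects = {"negative": 0.31, "erotic": 0.14, "neutral": 0.32}
values = list(effects.values())

# Equal weights are an assumption made for illustration.
combined = weighted_mean(values, [1.0] * len(values))
print(round(combined, 3))

# Whatever the weights, the combined value stays within [min, max],
# so a combined dz of 0.447 cannot result from averaging these effects.
print(min(values) <= combined <= max(values), 0.447 > max(values))
```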
Bierman (2011) reported a
non-significant psi-effect of t(168) = 1.41, p = 0.08 one-tailed. He reported
additional exploratory analyses where all individuals who had outliers on more
than 9 (out of 32) trials were removed. No justification for this ad-hoc
criterion is given (note that normal outlier removal procedures, such as
removing responses more than 3 SD removed from the mean response time, were
already applied). The authors nevertheless choose to include this (t(152) =
1.97, p = 0.026, one-tailed) test in their meta-analysis, which inflates the
overall ES. A better (and more conservative) approach is to include the effect
size of the test that includes all individuals.
Reply Tressoldi: we
emailed this comment to Bierman, and he agreed with your proposal. Consequently
we will update our database.
The authors do make one
inconsistent choice that is against their own hypothesis. In Traxler et al (Exp
1b) they include an analysis over items, which is strongly in the opposite
direction of the predicted effect, while they should actually have included the
(not significant) analysis over participants (as I have done in my p-curve
analysis).
Reply Tressoldi: The
choice to include item vs participants analysis is always debated.
Psycholinguistics prefer the first one, others the second one. We wrote to
Traxler and now we corrected the data using the by participants stats. You are
correct that in our p-curve analysis we erroneously added this negative effect
due to a bug of the apps that does not take in account the sign of the stats.
The authors state that ‘Like many
experiments in psychology, psi experiments tend to produce effect sizes that
Cohen (1988) would classify as “small” (d ≈ 0.2).’ However, Cohen’s dz cannot be interpreted following
the guidelines by Cohen, and the statement by the authors is misleading. 
Reply Tressoldi: Misleading
to what? We agree that Cohen’s classification is arbitrary and each ES must be
interpreted within its context given that sometime “small differences = big
effects" and viceversa.
(Comment DL - From my effect size primer: As Olejnik and Algina (2003) explain for eta squared (but the same is true for Cohen's dz), these benchmarks were developed for comparisons between unrestricted populations (e.g., men vs. women), and using these benchmarks when interpreting the effect size in designs that include covariates or repeated measures is not consistent with the considerations upon which the benchmarks were based.)
On the ‘Franklin’ tab of the
spreadsheet, it looks like 10 studies are analyzed which all show effects in
the expected direction. That is incredibly unlikely, if so. Other included
studies, such as http://www.chronos.msu.ru/old/EREPORTS/polikarpov_EXPERIMENTS_OF_D.BEM.pdf
are of too low quality to be included. The authors would do well to explain more clearly
explain which data is included in the meta-analysis, and provide some more
openness about the unpublished materials.
Reply Tressoldi: all
materials are available upon request. As to Franklin data, we had a rich
correspondence with him and he sent us the more “conservative” results. As to
Fontana et al. study, we obtained the raw data from the authors we analyzed
independently.
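To put a number on how unlikely 10 out of 10 effects in the expected direction would be, a back-of-the-envelope sketch can help. It treats the 10 studies as independent (which may not hold for the ‘Franklin’ data) and uses a normal approximation for the probability that a single study's estimate falls in the expected direction. The effect size of 0.09 is the meta-analytic estimate from the abstract; the per-study sample size of 100 is purely hypothetical, since the actual sample sizes are not given here.

```python
from statistics import NormalDist

def prob_all_same_direction(k: int, p_positive: float = 0.5) -> float:
    """Probability that k independent studies all land in the expected direction."""
    return p_positive ** k

def prob_positive_estimate(d: float, n: int) -> float:
    """Normal approximation to P(observed effect > 0) in a one-sample design
    with true standardized effect d and n participants."""
    return NormalDist().cdf(d * n ** 0.5)

# If there is no effect at all, each direction is a coin flip.
p_null = prob_all_same_direction(10)

# If the true effect equals the meta-analytic estimate (0.09) and each
# study has a HYPOTHETICAL sample size of 100.
p_single = prob_positive_estimate(0.09, 100)
p_alt = prob_all_same_direction(10, p_single)

print(round(p_null, 5), round(p_single, 3), round(p_alt, 3))
```

Under these assumptions, a perfect run of 10 would be rare without a true effect and still fairly uncommon with an effect of the size the meta-analysis reports.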
Finally, I have been in e-mail
contact with Alexander Batthyany.
[Note that an earlier version of this blog post did not accurately reflect the comments made by Alexander Batthyany. I have corrected his comments below. My sincere apologies for this miscommunication].
The meta-analysis includes several of his data-sets, and one dataset by him with the comment ‘closed-desk but controlled’. He writes: ‘I can briefly tell you that the closed desk condition meant that there was an equal amount of positive and negative targets, and this is why I suggested to the authors of the meta-analysis that they should not include the study. Either consciously or unconsciously, subjects could have sensed whether more or less negative or positive stimuli would be left in the database, in which case their “guessing” would have been altered by expectancy effects (or a normal unconscious) bias. My idea to test this possibility was to see whether there were more hits towards the end for individual participants – I let Daryl Bem run this analysis and he said that there is no such trend.
Despite no such trend, I personally believe the difference between a closed desk and open desk condition seems worthwhile to discuss in a meta-analysis.
Prof. Bem will respond to this later.
Indeed, there is some additional
support for this confound in the literature. Maier and colleagues (in press, p
10) write: ‘A third unsuccessful
replication was obtained with another web-based study that was run after the
first web study. Material, design, and procedure were the same as in the
initial web study with two changes. Instead of trial randomization without
replacement, a replacement procedure was used, i.e., the exact equal
distribution of negative pictures to the left and right response key across the
60 trials was abandoned.’ 
Note that Maier and colleagues are
the only research team to perform a-priori power analyses, and run high powered
studies. With such high sample sizes, the outcome of studies should be stable.
In other words, it would be very surprising if several studies with more than
95% power do not reveal an effect. Nevertheless, several studies with extremely
high power did fail to provide an effect. This is a very surprising outcome (although
Maier and colleagues do not discuss it, instead opting to meta-analyze all
results), and makes it fruitful to consider important moderators indicated by
procedural differences between studies. This equal distribution of pictures
issue is very important to consider, as it could be an important confound, and
there seems to be support for this confound in the literature.
Markus
Maier and Vanessa Buechner reply: We agree with Dr. Lakens’ argument that
studies with a high power should have a higher likelihood of finding the
effect. However, his “high-power” argument does not apply to our second web
study. Although most of our studies have been performed based on a priori power
analyses the two web studies reported in our paper have not (see p. 10,
participants section). For both web studies a data collection window of three
months was pre-defined. This led to a sample size of 1222 individuals in web
study 1 but only to 640 participants in web study 2 (we think that the subject
pool was exhausted after a while). If we use the effect size of web study 1 (d
= .07) to perform an a posteriori power calculation (G*power 3.1.3; Faul et
al., 2007) of web study 2 we reach a power of .55. This means that web study 2
is significantly under-powered, what might explain the statistical trend that
we found only. Web study 1 instead had an a posteriori power of .79. 
Nevertheless,
we admit that unsuccessful study 2 had an a priori power of 95% and yet did not
reveal any significant effect. This is surprising and constitutes from a
frequentist point of view a clear empirical failure of finding evidence for
retro-causation. However, from a Bayesian perspective this study does neither
provide evidence for H0 nor H1.
We also
agree that procedural difference such as open vs. closed deck procedures should
be experimentally explored in future studies, but we think that this variation
does not explain whether a study in our series was significant or not. For
example, in the successful studies 1 to 3 as well as unsuccessful studies 1 and
2 a closed deck procedure was used and in successful study 4 and unsuccessful
study 3 an open deck procedure was applied. 
This data pattern indicates that such a procedural difference is
uncorrelated with the appearance of a retroactive avoidance effect. However, we
do think that such a variation might explain effect size differences found in
the studies (see our discussion of Study 4).
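The a posteriori power values quoted in the reply above can be approximated without G*Power. The sketch below uses a normal approximation for a one-sample test and assumes a one-tailed α of .05, which is the assumption under which the approximation lands near the quoted values; G*Power itself works with the noncentral t distribution, so small differences are to be expected. The function names are illustrative.

```python
import math
from statistics import NormalDist

def approx_power_one_tailed(d: float, n: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a one-tailed one-sample test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha)
    return z.cdf(d * math.sqrt(n) - z_crit)

def approx_n_for_power(d: float, power: float, alpha: float = 0.05) -> int:
    """Approximate sample size needed to reach the requested power."""
    z = NormalDist()
    return math.ceil(((z.inv_cdf(1.0 - alpha) + z.inv_cdf(power)) / d) ** 2)

d = 0.07  # effect size of web study 1, as quoted in the reply
power_web1 = approx_power_one_tailed(d, 1222)
power_web2 = approx_power_one_tailed(d, 640)
print(round(power_web1, 2), round(power_web2, 2))  # close to .79 and .55

# Sample size that would be needed for 95% power at this effect size.
print(approx_n_for_power(d, 0.95))
```

With an effect this small, the approximation suggests that a sample of well over two thousand participants would be needed for 95% power, which illustrates why even studies with many hundreds of participants can remain inconclusive.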
These are only some questions, and some answers, which I hope will benefit researchers interested in the topic of precognition. I present the responses without any 'concluding' comments from my side (except 2 short clarifications) because I believe other scientists should continue this discussion, and these points are only intermediate observations. It's interactions like these, between collaborative researchers who are trying to figure something out, that make me proud to be a scientist.

 
About the item vs. participant analysis issue: it is well known by now that both of these analyses are wrong. The classic reference here is Clark (1973), although even that is not the first (for that see Coleman, 1964). Relying on the conjunction of these two flawed analyses does, as well, constitute a flawed analysis, a fact probably discussed best by Raaijmakers (1999, 2003). There are many, many other papers on this topic, including more recently a 2008 one by Baayen, Davidson, & Bates, and a 2012 one by Judd, myself, and Kenny. Furthermore, I know that Bem at least is aware of these issues, because we talked to him about it around the time that we published our paper. (We contacted him because one of the real data examples we use in our paper is his Study 4 from his ESP paper, where we show that the correct analysis on the original data yields little evidence of ESP.) So what is the explanation for the total neglect of these issues by Bem et al.?
Dear Jake, on the contrary we reflected a lot on how to consider this issue after Judd et al. paper. The problem, as for most m-a, is the difficulty to obtain and analyse the raw data instead of the summary stats. If you know how to adjust the summary stats (we have all one sample t-test), I'll be pleased to apply this correction. However for a theoretical point of view I think that the item analysis is only necessary when there are studies using the same items and not when there are studies using different items, in other words an empirical demonstration is better than an estimation one. Do you agree?
Patrizio,
I would think that given the nature of your meta-analysis, it should not be hard to obtain and analyze the raw data, since (a) all of the studies appeared in just the last 2-3 years, so all the electronic data should be readily available and I see no reason why the authors would hesitate to share, (b) the majority of the studies use the materials that Bem provided for running the experiments and storing the data, so the datasets should mostly all be in the same format and therefore require minimal case-by-case data munging, and (c) the studies all use the same small number of experimental designs, so the statistical models should be exactly the same. So yes, you do need the raw data (a correction cannot be obtained from the commonly reported summary statistics), but I think that is eminently doable in this case.
To your theoretical point, I certainly do not agree. I have two responses. First, the total number of stimuli used in all of the replication studies is almost certainly still relatively small. For the photographs for example, the vast majority of the studies will have just used the same set of photographs employed in the Bem (2012) studies. For the words, some actual replication would have had to take place in the case where the language of the participants differed, but across the entire set of studies I think we are still looking at something on the order of a few hundred words. So there is *certainly* still sampling variability in the photo samples, and probably non-negligible variability in the word samples as well. Second and more importantly, even if it were the case that every study used new items, it just doesn't follow that this makes it okay to ignore stimulus variance in every individual study. Can you imagine if someone made this claim about participants? That because each study used a new sample of participants, it's therefore okay to ignore participant variance?
Hi Daniel,
Given your willingness to engage and unusually open mind, I think you may enjoy the commentary I provided on Wagenmakers' recent blog post: http://osc.centerforopenscience.org/2014/06/25/a-skeptics-review/#disqus_thread
Please let me know if there was anything I misrepresented about your analyses or your views, or if there is anything in my arguments that could be improved.
Keep up the good work, - Johann
P.S. If this is a repeat comment, please delete the first one.
It is possible to examine the data in the meta-analysis by lab as many labs reported multiple studies. If you do this, you see that ESP is moderated by lab. A clear outlier is Bem (2011). So, we can ask about moderators. Is ESP more prevalent at Cornell? Does it take a magician to show ESP? Aside from Bem, the only other noteworthy evidence comes from Maier's studies. I found out he is conducting a preregistered replication study with N > 1000. Main conclusion from the meta-analysis is that the true effect size is somewhere between d = 0 and .10. We can therefore safely ignore any additional replication studies with less than N = 1,000. Maybe ESP will die a slow death of attrition. When we are debating whether ESP effect size is d = .000 or d = .001, most people may not care about the existence of ESP.