# The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

## Sunday, April 12, 2015

### How many participants should you collect? An alternative to the N * 2.5 rule

Because the true size of effects is uncertain, determining the sample size for a study is a challenge. A-priori power analysis is often recommended, but practically impossible when effect sizes are very uncertain. One situation in which effect sizes are by definition uncertain is a replication study where the goal is to establish whether a previously observed effect can be reproduced. When replication studies are too small, they lack informational value and cannot distinguish between the signal (true effects) and the noise (random variation). When studies are too large and data collection is costly, research becomes inefficient. The question is how we can efficiently collect informative data when effect sizes are uncertain. Simonsohn (2015) recently proposed to design replication studies using sample sizes 2.5 times as large as the original study. Such studies are designed to have 80% power to detect an effect the original study had 33% power to detect. The N * 2.5 rule is not a recommendation to design a study that will allow researchers to make a decision about whether they will accept or reject H0. Because I think this is what most researchers want to know, I propose an easy-to-follow recommendation to design studies that will allow researchers to accept or reject the null-hypothesis efficiently without making too many errors. Thanks to Richard Morey and Uri Simonsohn for their comments on an earlier draft.

The two dominant approaches to design studies that will yield informative results when the true effect size is unknown are Bayesian statistics and sequential analyses (see Lakens, 2014). Both these approaches are based on repeatedly analyzing data, and making the decision of whether or not to proceed with the data collection conditional on the outcome of the performed statistical test. If the outcome of the test indicates the null hypothesis can be rejected or the alternative hypothesis is supported (based on p-values and Bayes Factors, respectively) the data collection is terminated. The data is interpreted as indicating the presence of a true effect. If the test indicates strong support for the null-hypothesis (based on Bayes Factors), or whenever it is unlikely that a statistically significant difference will be observed (given the maximum number of participants a researcher is willing to collect), the trial is terminated. The data is interpreted as indicating there is no true effect, or if there is an effect, it is most likely very small. If neither of these two sets of decisions can be made, data collection is continued.

Both Bayes Factors and p-values (interpreted within Neyman-Pearson theory) tell us something about the relation between the null-hypothesis and the data. Designing studies to yield informative Bayes Factors and p-values is therefore a logical goal if one wants to make decisions based on data that are not too often wrong, in the long run.

Simonsohn (2015) recently proposed an alternative approach to designing replication studies. He explains how a replication that is 2.5 times larger than the original study has 80% power to detect an effect that the original study had 33% power to detect. For example, when an original study is performed with 20 participants per condition, it has 33% power to detect an effect of d = 0.5 in a two-sided t-test.

A replication with 2.5 times as many participants (i.e., 50 per condition) has 80% power to detect an effect of d = 0.5 in a one-sided t-test.
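The two power figures above can be checked with base R's `power.t.test`. This is a minimal sketch, assuming a standardized effect (sd = 1) and equal group sizes; the variable names are my own.

```r
# Power of the original study: 20 per condition, d = 0.5, two-sided alpha = .05
power_original <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05,
                               type = "two.sample",
                               alternative = "two.sided")$power

# Power of the replication: 2.5 * 20 = 50 per condition, one-sided alpha = .05
power_replication <- power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05,
                                  type = "two.sample",
                                  alternative = "one.sided")$power

round(c(original = power_original, replication = power_replication), 2)
```

The first value should land near one third and the second near 80%.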

When you design a replication study using 2.5 times the original sample size, your goal is not to design a study that will allow you to draw correct conclusions about the presence or absence of an effect. When you use 2.5 times the sample size of the original study, the goal is to reject the null of a detectable effect (defined as an effect size the original sample had 33% power for) with 80% power. Note that in this question, the null is the assumption that d = 0.5, and the alternative hypothesis is d = 0 (a reversal of what we typically would call the null and alternative hypothesis). If we can reject the null (d = 0.5) when testing the alternative hypothesis (d = 0), this means the replication results are inconsistent with an effect size big enough to be detectable by the original study, 80% of the time. In other words, with 2.5 times the sample size of an original study, we can reject an effect size the original study had 33% power for 80% of the time, if the true effect size is 0. It has become less likely that the effect size estimate in the original study was accurate, but we have learned nothing about whether the effect is true or not.
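The reversed test described above can be sketched with the noncentral t distribution in base R. This illustrates the logic only and is not necessarily the exact computation in Simonsohn (2015); the one-sided 5% criterion and all variable names are assumptions of the sketch.

```r
n_orig <- 20   # participants per condition in the original study
n_rep  <- 50   # 2.5 times as many in the replication

# Effect size the original study had 33% power to detect (two-sided, alpha = .05)
d33 <- power.t.test(n = n_orig, power = 0.33, sd = 1, sig.level = 0.05,
                    alternative = "two.sided")$delta

# Under the reversed null (d = d33) the replication t-value follows a
# noncentral t distribution with this noncentrality parameter
df  <- 2 * n_rep - 2
ncp <- d33 * sqrt(n_rep / 2)

# Reject d = d33 when the observed t falls below the 5% quantile of that distribution
t_crit <- qt(0.05, df = df, ncp = ncp)

# Probability of such a rejection when the true effect is zero (central t)
power_reject_d33 <- pt(t_crit, df = df)

round(c(d33 = d33, t_crit = t_crit, power = power_reject_d33), 2)
```

Under these assumptions `d33` comes out close to 0.5, and the probability of rejecting it when the true effect is zero is close to 80%.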

Bayesian and sequential analyses allow researchers to design high-powered studies that efficiently provide informational value about the presence or absence of an effect. However, researchers rarely use these methods. One reason might be that there are no easy-to-use guidelines to help researchers design a study where they will repeatedly analyze their data.

I believe researchers most often perform studies because they want to learn something about the presence or absence of an effect (and not to have an 80% probability to conclude the effect is smaller than the effect size the original study had 33% power to observe). I also worry that researchers will use the N * 2.5 rule without realizing which question they will be answering, just because it is so easy to use without reading or understanding Uri's paper. I’ve already seen it happen, and have even done it myself.

Here, I will provide a straightforward recommendation to design novel and replication studies that is easy to implement and will allow researchers to draw informative statistical inferences about the presence or absence of an effect. (I might work this into a future paper if people are interested, so feel free to leave comments below).

1) Determine the maximum sample size you are willing to collect (e.g., N = 400).
2) Plan equally spaced analyses (e.g., four looks at the data, after 50, 100, 150, and 200 participants per condition in a two-sample t-test).
3) Use an alpha level at each look that controls the overall Type 1 error rate (e.g., for four looks: 0.019 at each look; for three looks: 0.023 at each look; for two looks: 0.030 at each look).
4) Calculate one-sided p-values and JZS Bayes Factors (with a scale r on the effect size of 0.5) at every analysis. Stop when the effect is statistically significant and/or the JZS Bayes Factor > 3. Stop when there is support for the null hypothesis based on a JZS Bayes Factor < 0.33. If the results are inconclusive, continue. In small samples (e.g., 50 participants per condition) the risk of Type 2 errors when accepting the null using Bayes Factors is relatively high, so always interpret results from small samples with caution.
5) When the maximum sample size is reached without providing convincing evidence for the null or alternative hypothesis, interpret the Bayes Factor while acknowledging that it provides only weak support for either the null or the alternative hypothesis. Conclude that based on the power you had to observe a small effect size (e.g., 91% power to observe a d = 0.3) the true effect size is most likely either zero or small.
6) Report the effect size and its 95% confidence interval, and interpret it in relation to other findings in the literature or to theoretical predictions about the size of the effect.

See the footnote for the reason behind the specific choices in this procedure. You can use R, JASP, or online calculators to compute the Bayes Factor. This approach will give you a good chance of providing support for the alternative hypothesis if there is a true effect (i.e., a low Type 2 error). This makes it unlikely researchers will falsely conclude an effect is not replicable, when there is a true effect.
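Steps 3 to 5 can be sketched as a small decision function. The function name, its arguments, and the example inputs are hypothetical; the alpha levels are the four one-sided values given in the footnote, and the Bayes Factor is assumed to be computed elsewhere (for example with the BayesFactor package or JASP).

```r
# One-sided alpha levels for four looks (values from the footnote of this post)
alphas <- c(0.018, 0.019, 0.020, 0.021)

# Hypothetical helper: returns the decision at a given look.
# p    : one-sided p-value at this look
# bf10 : Bayes Factor for the alternative over the null (e.g., a JZS Bayes Factor)
# look : index of the current look (1 to 4)
sequential_decision <- function(p, bf10, look, alpha = alphas) {
  if (p < alpha[look] || bf10 > 3) {
    "stop: support for the alternative hypothesis"
  } else if (bf10 < 0.33) {
    "stop: support for the null hypothesis"
  } else if (look < length(alpha)) {
    "continue data collection"
  } else {
    "stop: maximum sample size reached, results inconclusive"
  }
}

# Made-up inputs, one for each possible outcome
sequential_decision(p = 0.010, bf10 = 4.20, look = 1)
sequential_decision(p = 0.400, bf10 = 0.25, look = 2)
sequential_decision(p = 0.060, bf10 = 1.10, look = 3)
sequential_decision(p = 0.030, bf10 = 1.80, look = 4)
```

The last call illustrates step 5: at the final look a p-value of 0.030 is not below 0.021, so the result stays inconclusive even though it would pass a fixed alpha of 0.05.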

Demonstration 1: d = 0.5

If we run 10 simulations of replication studies where the true effect size is d = 0.5, and the sample size in the original study was 20 participants per cell, we can compare different approaches to determine the sample size. The graphs display 10 lines for the p-values (top) and effect size (bottom) as the sample size grows from 20 to 200. The vertical colored lines represent the points at which the data is analyzed. There are four looks after 50, 100, 150, and 200 participants in each condition. The green curve in the upper plot is the power for a d = 0.5 as the sample size increases, and the yellow lines in the bottom plot display the 95% prediction interval around a d = 0.5 as the sample size increases. Effect sizes above the brown curved line are statistically different from 0 (the brown straight line) using a one-sided test with an alpha of 0.02. This plot is heavily inspired by Schönbrodt & Perugini, 2013.

Let’s compare the results from the tests (the p-values, Bayes Factors, and effect sizes) at the four looks at our data. The results from each test for the 10 simulations are presented below.

Our decision rules were to interpret the data as support of the presence of a true effect when p < α, or BF > 3. If we use sequential analyses, we rely on an alpha level of 0.018 for the first look. We can conclude an effect is present after the first look in four of the studies. We continue the data collection for 6 studies to the second look, where three studies now provide convincing support for the presence of an effect, but the remaining three do not. We continue to N = 150 and see all three remaining studies have now yielded significant p-values and stop the data collection. We make no Type 2 errors, and we have collected 4*50, 3*100, and 3*150 participants per condition across the 10 studies (950 per condition, or 1900 in total).

Based on 10000 simulations, we’d have 64%, 92%, 99%, and 99.9% power at each look to detect an effect of d = 0.5. Bayes Factors would correctly provide support for the alternative hypothesis at the four looks in 67%, 92%, 98%, and 99.7% of the studies, respectively. Bayes Factors would also incorrectly provide support for the null hypothesis at the four looks in 0.9%, 0.2%, 0.02%, and approximately 0% of the studies, respectively. Note that repeatedly using BF > 3 as a decision rule within Neyman-Pearson theory inflates the Type 1 error rate in the same way as repeated testing does for p-values, as Uri Simonsohn explains.
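The simulated power figures for the p-values can be approximated analytically. This sketch, using base R's `power.t.test` and the one-sided alpha levels from the footnote, treats each look as a single test and ignores that some studies stop early, so it should land close to the simulated values rather than reproduce them exactly.

```r
n_per_look <- c(50, 100, 150, 200)          # participants per condition at each look
alphas     <- c(0.018, 0.019, 0.020, 0.021) # one-sided alpha levels from the footnote

# Power of a single one-sided t-test at each look for d = 0.5
power_per_look <- mapply(function(n, a) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = a,
               alternative = "one.sided")$power
}, n_per_look, alphas)

round(power_per_look, 3)
```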

Using N * 2.5 to design a study

If our simulations were replications of an original study with 20 participants in each condition, analyzing the data after 50 participants in each condition would be identical to using the N * 2.5 rule. A study with a sample size of 20 per cell has 33% power for an effect with d = 0.5, and the power after 50 participants per cell is 80% (illustrated by the green power line for a one-sided test crossing the green horizontal line at 50 participants).

If we use the N * 2.5 rule, and test after 50 participants with an alpha of .05, we see we can conclude 6 simulated studies reveal a data pattern that is surprising, assuming H0 is true. Even though we have approximately 80% power in the long run, in a small set such as 10 replications, a success rate of 60% is no exception (but we could just as well have gotten lucky, and found 10 studies to be statistically significant). There are 4 Type 2 errors, and we have collected 10*50 = 500 participants in each condition (or 1000 participants in total).
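How unexceptional is 6 out of 10 with 80% power? A quick binomial calculation in base R gives a sense of it; treating the 10 studies as independent with exactly 80% power each is an assumption of this sketch.

```r
power_long_run <- 0.80  # approximate power of each replication
n_studies      <- 10

# Probability that 6 or fewer of the 10 studies are significant
p_six_or_fewer <- pbinom(6, size = n_studies, prob = power_long_run)

# Probability that all 10 studies are significant
p_all_ten <- dbinom(10, size = n_studies, prob = power_long_run)

round(c(six_or_fewer = p_six_or_fewer, all_ten = p_all_ten), 3)
# roughly 0.121 and 0.107
```

Both outcomes occur a bit more than one time in ten, so neither should surprise us in a set of 10 replications.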

It should be clear that the use of N * 2.5 is unrelated to the probability a study will provide convincing support for either the null hypothesis or the alternative hypothesis. If the original study was much larger than N = 20 per condition the study might have too much power (which is problematic if collecting data is costly), and when the true effect size is smaller than d = 0.5 in the current example, the N*2.5 rule could easily lead to studies that are substantially underpowered to detect a true effect. The N*2.5 rule is conditioned on the effect size that could reliably be observed based on the sample size used in the original study. It is not conditioned on the true effect size.
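To see how quickly a 50-per-condition replication becomes underpowered when the true effect is smaller than d = 0.5, here is a short sketch with `power.t.test`. The set of true effect sizes is chosen only for illustration.

```r
# Power of a 50-per-condition replication (one-sided, alpha = .05)
# for a range of hypothetical true effect sizes
true_d <- c(0.2, 0.3, 0.4, 0.5)
power_n50 <- sapply(true_d, function(d) {
  power.t.test(n = 50, delta = d, sd = 1, sig.level = 0.05,
               alternative = "one.sided")$power
})

round(setNames(power_n50, paste0("d = ", true_d)), 2)
```

For d = 0.3 the power already drops below 50%, even though the design is unchanged.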

Demonstration 2: d = 0

Let’s run the simulation 1000 times, while setting the true effect size to d=0, to evaluate our decision rule when there is no true effect.

The top plot has turned black, because p-values are uniformly distributed when H0 is true, and thus end up all over the place. The bottom plot shows 95% of the effect sizes fall within the 95% prediction interval around 0.

Sequential analyses yield approximately 2% Type 1 errors on each of the four looks, and (more importantly) control the overall Type 1 error rate so that it stays below 0.05 over the four looks combined. After the first look, 53% of the Bayes Factors correctly lead to the decision to accept the null hypothesis (which we can interpret as 53% power, or 53% chance of concluding the null hypothesis is true, when the null hypothesis is true). This percentage increases to 66%, 73%, and 76% in the subsequent 3 looks (based on 10000 simulations). When using Bayes Factors, we will also make Type 1 errors (concluding there is an effect, when there is no effect). Across the 4 looks, this will happen 2.2%, 1.9%, 1.4%, and 1.4% of the time. Thus, the error rates for the decision to accept the null hypothesis or reject the alternative hypothesis become smaller as the sample size increases.

I’ve run additional simulations (N = 10000 per simulation) and have listed the probabilities of Type 1 errors and Type 2 errors in the table below.
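The Type 1 error control for the p-values can be checked with a small simulation in base R. This is not the script used for the post. It is a simplified sketch that covers only the p-values (no Bayes Factors), uses fewer simulations, and assumes normally distributed data with equal variances.

```r
set.seed(2015)                              # arbitrary seed for reproducibility
n_sims     <- 2000                          # fewer than the 10000 used in the post
n_per_look <- c(50, 100, 150, 200)          # participants per condition at each look
alphas     <- c(0.018, 0.019, 0.020, 0.021) # one-sided alpha levels from the footnote

# For one simulated study with d = 0, return the one-sided p-value at each look
simulate_p_values <- function() {
  x <- rnorm(max(n_per_look))  # condition 1
  y <- rnorm(max(n_per_look))  # condition 2, same population mean
  sapply(n_per_look, function(n) {
    t.test(x[1:n], y[1:n], alternative = "greater", var.equal = TRUE)$p.value
  })
}

p_values <- t(replicate(n_sims, simulate_p_values()))  # n_sims x 4 matrix
significant <- sweep(p_values, 2, alphas, "<")         # p < alpha at each look

# Type 1 error rate at each look considered on its own
per_look_rate <- colMeans(significant)

# Overall Type 1 error rate: significant at one or more of the four looks
overall_rate <- mean(apply(significant, 1, any))

round(per_look_rate, 3)
round(overall_rate, 3)
```

With only 2000 simulations the estimates are noisy, but each per-look rate should be near 2% and the overall rate near 5%.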

Conclusion

When effect sizes are uncertain, an efficient way to design a study is to analyze data repeatedly, and decide to continue the data collection conditional upon the results of a statistical test. Using p-values and/or Bayes Factors from a Neyman-Pearson perspective on statistics it is possible to make correct decisions about whether to accept or reject the null hypothesis without making too many mistakes (i.e., Type 1 and Type 2 errors). The six steps I propose are relatively straightforward to implement, and do not require any advanced statistics beyond calculating Bayes Factors, effect sizes, and their confidence intervals. Anyone interested in diving into the underlying rationale of sequential analyses can easily choose a different number or timing of looks at the data, choose a different alpha spending function, or design more advanced sequential analyses (see Lakens, 2014).

The larger the sample we are willing to collect, the more often we can make good decisions. The smaller the true effect size, the more data we typically need. For example, effects smaller than d = 0.3 quickly require more than 200 participants to find the effect reliably, and testing these effects in smaller sample sizes (either using p-values or Bayes Factors) will often not allow you to make good decisions about whether to accept or reject the null hypothesis. When it is possible the true effect size is small, be especially careful in accepting the null-hypothesis based on Bayes Factors. The error rates when effect sizes are small (e.g., d = 0.3) and sample sizes are small (to detect small effects, e.g., N < 100 per condition) are relatively high (see Figure 3 above).

There will always remain studies where the data are inconclusive. Reporting and interpreting the effect size (and its 95% confidence interval) is always useful, even when conclusions about the presence or absence of an effect must be postponed to a future meta-analysis. The goal to accurately estimate effect sizes is a different question than the goal to distinguish a signal from the noise (for the sample sizes needed to accurately estimate effect sizes, see Maxwell, Kelley, & Rausch, 2008). Indeed, some people recommend a similar estimation focus when using Bayesian statistics (e.g., Kruschke). I think estimation procedures become more important after we are relatively certain the effect exists over multiple studies, and can also be done using meta-analysis.

Whenever researchers are interested in distinguishing the signal from the noise efficiently, and have a maximum number of participants they are willing to collect, the current approach allows researchers to design studies with high informational value that will allow them to accept or reject the null-hypothesis efficiently for large and medium effect sizes without making too many errors. Give it a try.

Footnote

Obviously, all choices in this recommendation can be easily changed to your preferences, and I might change these recommendations based on progressive insights. To control the Type 1 error rate in sequential analyses, a spending function needs to be chosen. The Type 1 error rate can be spent in many different ways, but the continuous approximation of a Pocock spending function, which reduces the alpha level at each interim test to approximately the same value, should work well in replication studies where the true effect size is unknown. For four looks at the data, the alpha levels for the four one-sided tests are 0.018, 0.019, 0.020, and 0.021. It does not matter which test you perform when you look at the data. Lowering alpha levels reduces the power of the test, and thus requires somewhat more participants, but this is compensated by being able to stop the data collection early (see Lakens, 2014). These four looks after 50, 100, 150, and 200 participants per condition give you 90% power in a one-sided t-test to observe an effect of d = 0.68, d = 0.48, d = 0.39, and d = 0.33, respectively. You can also choose to look at the data less, or more.

Finally, Bayes Factors come in many flavors: I use the JZS Bayes Factor, with a scale r on the effect size of 0.5, recommended by Rouder et al (2009) when one primarily expects small effects. This keeps erroneous conclusions about H0 being true when there is a small effect reasonably acceptable (e.g., for d = 0.3, Type 2 errors are smaller than 5% after 100 participants in each condition of a t-test). It somewhat reduces the power to accept the null-hypothesis (e.g., 54%, 67%, 73%, and 77% in each of the four looks, respectively) but since inconclusive outcomes also indicate the true effect is either zero or very small, this decision procedure seems acceptable.

Thanks to Uri Simonsohn for pointing to the importance of carefully choosing a scale r on the effect size.
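Two of the numbers in this footnote can be checked in base R. The sketch assumes that the "continuous approximation of a Pocock spending function" refers to the Lan-DeMets Pocock-type function, α(t) = α · ln(1 + (e − 1)t); it computes the cumulative alpha spent at each look and the effect size each look can detect with 90% power.

```r
n_per_look <- c(50, 100, 150, 200)          # participants per condition at each look
alphas     <- c(0.018, 0.019, 0.020, 0.021) # one-sided alpha levels quoted above

# Pocock-type alpha spending function (Lan-DeMets form, an assumption about
# which approximation is meant): cumulative alpha spent at information fraction t
pocock_spending <- function(t, alpha = 0.05) alpha * log(1 + (exp(1) - 1) * t)
cumulative_alpha <- pocock_spending(n_per_look / max(n_per_look))

# Effect size each look can detect with 90% power in a one-sided t-test
detectable_d <- mapply(function(n, a) {
  power.t.test(n = n, power = 0.90, sd = 1, sig.level = a,
               alternative = "one.sided")$delta
}, n_per_look, alphas)

round(cumulative_alpha, 3)
round(detectable_d, 2)
```

The cumulative alpha at the first look equals the first nominal alpha level (about 0.018). Turning the later cumulative values into nominal per-look levels requires accounting for the correlation between looks, which dedicated sequential analysis software handles; that step is not attempted here.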

The script used to run the simulations and create the graphs is available below:

## Saturday, April 4, 2015

### Why a meta-analysis of 90 precognition studies does not provide convincing evidence of a true effect

A meta-analysis of 90 studies on precognition by Bem, Tressoldi, Rabeyron, & Duggan has been circulating recently. I have looked at this meta-analysis of precognition experiments for an earlier blog post. I had a very collaborative exchange with the authors, which was cordial and professional, and led the authors to correct the mistakes I pointed out and answer some questions I had. I thought it was interesting to write a pre-publication peer review of an article that had been posted in a public repository, and since I had invested time in commenting on this meta-analysis anyway, I was more than happy to accept the invitation to peer-review it. This blog post is a short summary of my actual review. Since a pre-print of the paper is already online, and it is already cited 11 times, perhaps people are interested in my criticism of the meta-analysis. I expect that many of my comments below apply to other meta-analyses by the same authors (e.g., this one), and a preliminary look at the data confirms this. Until I sit down and actually do a meta-meta-analysis, here's why I don't think there is evidence for pre-cognition in the Bem et al meta-analysis.

Only 18 statistically significant precognition effects have been observed in the last 14 years, by just 7 different labs, as the meta-analysis by Bem, Tressoldi, Rabeyron, and Duggan reveals. 72 studies reveal no effect. If research on pre-cognition has demonstrated anything, it is that when you lack a theoretical model, scientific insights are gained at a painstakingly slow pace, if they are gained at all.

The question the authors attempt to answer in their meta-analysis is whether there is a true signal in this noisy set of 90 studies. If this is the case, it obviously does not mean we have proof that precognition exists. In science, we distinguish between statistical inferences and theoretical inferences (e.g., Meehl, 1990). Even if a meta-analysis would lead to the statistical inference that there is a signal in the noise, there is as of yet no compelling reason to draw the theoretical inference that precognition exists, due to the lack of a theoretical framework as acknowledged by the authors. Nevertheless, it is worthwhile to see if after 14 years and 90 studies something is going on.

In the abstract, the authors conclude: there is “an overall effect greater than 6 sigma, z = 6.40, p = 1.2 × 10⁻¹⁰ with an effect size (Hedges’ g) of 0.09. A Bayesian analysis yielded a Bayes Factor of 1.4 × 10⁹, greatly exceeding the criterion value of 100 for “decisive evidence” in support of the experimental hypothesis.” Let’s check the validity of this claim.

Dealing with publication bias.

Every meta-analysis needs to deal with publication bias, to rule out that the meta-analytic effect size estimate is nothing more than the inflation away from 0 that emerges because people are more likely to share positive results. Bem and colleagues use Begg and Mazumdar’s rank correlation test to examine publication bias, stating that: “The preferred method for calculating this is the Begg and Mazumdar’s rank correlation test, which calculates the rank correlation (Kendall’s tau) between the variances or standard errors of the studies and their standardized effect sizes (Rothstein, Sutton & Borenstein, 2005).”

I could not find this recommendation in Rothstein et al., 2005. From the same book, Chapter 11, p. 196, about the rank correlation test: “the test has low power unless there is severe bias, and so a non-significant tau should not be taken as proof that bias is absent (see also Sterne et al., 2000, 2001b, c)”. Similarly, from the Cochrane handbook of meta-analyses: “The test proposed by Begg and Mazumdar (Begg 1994) has the same statistical problems but lower power than the test of Egger et al., and is therefore not recommended.”

When the observed effect size is tiny (as in the case of the current meta-analysis), just a small amount of bias can yield a small meta-analytic effect size estimate that is statistically different from 0. In other words, whereas a significant test result is reason to worry, a non-significant test result is not reason not to worry.

The authors also report the trim-and-fill method to correct for publication bias. It is known that when publication bias is induced by a p-value boundary, rather than an effect size boundary, and there is considerable heterogeneity in the effects included in the meta-analysis, the trim-and-fill method might not perform well enough to yield a corrected meta-analytic effect size estimate that is close to the true effect size (Peters, Sutton, Jones, Abrams, & Rushton, 2007; Terrin, Schmid, Lau, & Olkin, 2003, see also the Cochrane handbook). I’m not sure what upsets me more: The fact that people continue to use this method, or the fact that the people who use this method still report the uncorrected effect size estimate in their abstract.

Better tests for publication bias

PET-PEESE meta-regression seems to be the best method we currently have to correct effect size estimates for publication bias. This approach is based on first using the precision-effect test (PET, Stanley, 2008) to examine whether there is a true effect beyond publication bias, and then following up on this test (if the confidence intervals for the estimate exclude 0) by a PEESE (precision-effect estimate with standard error, Stanley and Doucouliagos, 2007) to estimate the true effect size.

In the R code where I have reproduced the meta-analysis (see below), I have included the PET-PEESE meta-regression. The results are clear: the estimated effect size when correcting for publication bias is 0.008, and the confidence intervals around this effect size estimate do not exclude 0. In other words, there is no good reason to assume that anything more than publication bias is going on in this meta-analysis.
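To show what a PET-PEESE meta-regression does mechanically, here is a minimal sketch in base R. It does not use the Bem et al. data and is not the code from my review: the data are simulated with a true effect of zero, and the selection rule, the number of studies (unrealistically large, to keep the illustration stable), and all variable names are assumptions made for this example.

```r
set.seed(90)                       # arbitrary seed
k_run  <- 20000                    # studies run (large, to keep the sketch stable)
n      <- sample(20:500, k_run, replace = TRUE)
se     <- 1 / sqrt(n)              # approximate standard error of dz
d_obs  <- rnorm(k_run, mean = 0, sd = se)  # true effect is zero

# Hypothetical selection rule: significant results (one-sided) are always
# reported, non-significant results only 20% of the time
significant <- d_obs / se > qnorm(0.95)
published   <- significant | runif(k_run) < 0.20
d  <- d_obs[published]
se <- se[published]

# Naive fixed-effect estimate (inverse-variance weighted mean)
naive_est <- weighted.mean(d, w = 1 / se^2)

# PET: weighted regression of effect size on standard error;
# the intercept estimates the effect for a study with a standard error of 0
pet     <- lm(d ~ se, weights = 1 / se^2)
pet_est <- unname(coef(pet)[1])
pet_ci  <- confint(pet)[1, ]

# PEESE: same regression with the squared standard error as predictor,
# interpreted only when the PET confidence interval excludes zero
peese     <- lm(d ~ I(se^2), weights = 1 / se^2)
peese_est <- unname(coef(peese)[1])

round(c(naive = naive_est, PET = pet_est, PEESE = peese_est), 3)
round(pet_ci, 3)
```

In this simulation the naive estimate is pushed away from zero by selective reporting alone, while the PET intercept stays much closer to the true effect of zero.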

Perhaps it will help to realize that if precognition had an effect size of Cohen’s dz = 0.09, you’d need approximately 1300 participants to have 90% power to detect it in a two-sided t-test with an alpha level of 0.05. Only 1 experiment has been performed with a sufficiently large sample size (Galak, exp 7), and this experiment did not show an effect. Meier (study 3) has 1222 participants, and finds an effect at a significance level of 0.05. However, using a significance level of 0.05 is rather silly when sample sizes are so large (see http://daniellakens.blogspot.nl/2014/05/the-probability-of-p-values-as-function.html) and when we calculate a Bayes Factor using the t-value and the sample size, we see this results in a JZS Bayes Factor of 1.90, which is nothing that should convince us.

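```r
library(BayesFactor)

# ttest.tstat() returns the log Bayes Factor in the 'bf' element;
# exp() undoes the log and 1/... inverts the resulting Bayes Factor
1/exp(ttest.tstat(t=2.37, n1=1222, rscale = 0.707)[['bf']])
# 1.895727
```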
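The sample size mentioned above can be checked with base R's `power.t.test`. This sketch assumes a one-sample (within-subject) design, which is what an effect size expressed as dz implies.

```r
# Participants needed for 90% power to detect dz = 0.09
# (one-sample / within-subject t-test, two-sided, alpha = .05)
n_needed <- power.t.test(delta = 0.09, sd = 1, sig.level = 0.05, power = 0.90,
                         type = "one.sample", alternative = "two.sided")$n
ceiling(n_needed)
```

The result should be very close to the 1300 participants mentioned in the text.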

Estimating the evidential value with p-curve and p-uniform.

The authors report two analyses to examine the effect size based on the distribution of p-values. These techniques are new, and although it is great the authors embrace these techniques, they should be used with caution. (I'm skipping a quite substantial discussion of the p-uniform test that was part of the review. The short summary is that the authors didn't know what they were doing).

The new test of the p-curve app returns a statistically significant effect when testing for right skew, or evidential value, when we use the test values the authors use (the test has recently been updated - in the version the authors used, the p-curve analysis was not significant). However, the p-curve analysis now also includes an exploration of how much this test result depends on a single p-value, by plotting the significance levels of the test if the k most extreme p-values are removed. As we see in the graph below (blue, top-left), the test for evidential value returns a p-value above 0.05 after excluding only 1 p-value, which means we cannot put a lot of confidence in these results.

I think it is important to note that I have already uncovered many coding errors in a previous blog post, even though the authors note that 2 authors independently coded the effect sizes. I feel I could keep pointing out more and more errors in the meta-analysis (instead, I will just recommend including a real statistician as a co-author), but let’s add one to illustrate how easily the conclusion in the current p-curve analysis changes.

The authors include Bierman and Bijl (2013) in their spreadsheet. The raw data of this experiment is shared by Bierman and Bijl (and available at: https://www.dropbox.com/s/j44lvj0c561o5in/Main%20datafile.sav - another excellent example of open science), and I can see that although Bierman and Bijl exclude one participant for missing data, the reaction times that are the basis for the effect size estimate in the meta-analysis are not missing. Indeed, in the master's thesis itself (Bijl & Bierman, 2013), all reaction time data is included. If I reanalyze the data, I find the same result as in the master's thesis.

I don’t think there can be much debate about whether all reaction time data should have been included (and Dick Bierman agrees with me in personal communication), and I think that the choice to report reaction time data from 67 instead of 68 participants is one of those tiny sources of bias that creep into the decisions researchers almost unconsciously make (after all, the results were statistically different from zero regardless of the final choice). However, for the p-curve analysis (which assumes authors stop their analysis when p-values are smaller than 0.05) this small difference matters. If we include t(67)=2.11 in the p-curve analysis instead of t(67)=2.59, the new p-curve test no longer indicates the studies have evidential value.
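To see how much the p-value entering the p-curve shifts between these two t-values, here is a short check in base R. Treating both as two-sided tests with 67 degrees of freedom follows the notation in the text.

```r
df <- 67
p_meta     <- 2 * pt(-abs(2.59), df)  # t-value used in the meta-analysis
p_all_data <- 2 * pt(-abs(2.11), df)  # t-value when all reaction times are included

round(c(meta_analysis = p_meta, all_data = p_all_data), 3)
```

The p-value moves from the 0.01-0.02 bin to the 0.03-0.04 bin. That matters for a test of right skew, which draws its support from very small p-values.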

Even if the p-curve test based on the correct data would have shown there is evidential value (although it is comforting it doesn’t) we should not be mindlessly interpreting the p-values we get from the analyses. Let’s just look at the plot of our data. We see a very weird p-value distribution with many more p-values between 0.01-0.02 than between 0.00-0.01 (whereas the reverse pattern should be observed, see for example Lakens, 2014).

Remember that p-curve is a relatively new technique. For many tests we use (e.g., the t-test) we first perform assumption checks. In the case of the t-test, we check the normality assumption. If data isn’t normally distributed, we cannot trust the conclusions from a t-test. I would severely doubt whether we can trust the conclusion from a p-curve if there is such a clear deviation from the expected distribution. Regardless of whether the p-curve tells us there is evidential value or not, the p-curve doesn’t look like a ‘normal p-value distribution’. Consider the p-curve analysis as an overall F-test for an ANOVA. The p-curve tells us there is an effect, but if we then perform the simple effects (looking at p-values between 0.00-0.01, and between 0.01-0.02) our predictions about what these effects look like are not confirmed. This is just my own interpretation of how we could improve the p-curve test, and it will be useful to see how this test develops. For now, I just want to conclude it is debatable whether the conclusion there is an effect has passed the p-curve test for evidential value (I would say it has not), and passing the test is not immediately a guarantee there is evidential value.
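The expected shape of a p-curve can be worked out analytically. The sketch below uses a two-sided z-test as a stand-in for the tests in the meta-analysis (an assumption made for simplicity); the function names and the chosen noncentrality values are my own.

```r
# Probability that a two-sided z-test yields p below `a` when the
# expected z-value (noncentrality) is `ncp`
p_below <- function(a, ncp) {
  z_crit <- qnorm(1 - a / 2)
  pnorm(z_crit - ncp, lower.tail = FALSE) + pnorm(-z_crit - ncp)
}

# Share of the significant p-values (p < .05) falling in each .01-wide bin
p_curve_shape <- function(ncp) {
  edges <- seq(0, 0.05, by = 0.01)
  cumulative <- sapply(edges, p_below, ncp = ncp)
  diff(cumulative) / p_below(0.05, ncp)
}

bins <- c("0-.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05")
round(rbind(
  null       = setNames(p_curve_shape(0), bins),    # no true effect: flat
  low_power  = setNames(p_curve_shape(1), bins),    # small expected z-value
  high_power = setNames(p_curve_shape(2.8), bins)   # roughly 80% power
), 2)
```

Whenever there is a true effect, the first bin holds the largest share of significant p-values, and the shares decrease from there. More p-values between 0.01-0.02 than between 0.00-0.01 is not a pattern any of these rows produce.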

The presence of bias

In the literature, a lot has been said about the fact that the low-powered studies reported in Bem (2011) strongly suggest there are an additional number of unreported experiments, or that the effect size estimates were artificially inflated by p-hacking (see Francis, 2012). The authors mention the following when discussing the possibility that there is a file-drawer (page 9):

“In his own discussion of potential file-drawer issues, Bem (2011) reported that they arose most acutely in his two earliest experiments (on retroactive habituation) because they required extensive preexperiment pilot testing to select and match pairs of photographs and to adjust the number and timing of the repeated subliminal stimulus exposures. Once these were determined, however, the protocol was “frozen” and the formal experiments begun. Results from the first experiment were used to rematch several of the photographs used for its subsequent replication. In turn, these two initial experiments provided data relevant for setting the experimental procedures and parameters used in all the subsequent experiments. As Bem explicitly stated in his article, he omitted one exploratory experiment conducted after he had completed the original habituation experiment and its successful replication.”

This is not sufficient. The power for his studies is too low to have observed the number of low p-values reported in Bem (2011) without having a much more substantial file-drawer, or p-hacking. It simply is not possible, and we should not accept vague statements about what has been reported. Where I would normally give researchers the benefit of the doubt (our science is built on this, to a certain extent) I cannot do this when there is a clear statistical indication that something is wrong. To illustrate this, let’s take a look at the funnel plot for just the studies by Dr. Bem.

Data outside of the grey triangle is statistically significant (in a two-sided test). The smaller the sample size (and the larger the standard error), the larger the effect size needs to be to show a statistically significant effect. If you report everything you find, effect sizes should be randomly distributed around the true effect size. If they all fall on the edge of the grey triangle, there is a clear indication the studies were selected based on their (one-sided) p-value. It’s also interesting to note that the effect size estimates provided by Dr Bem are twice as large as the overall meta-analytic effect size estimate. The fact that there are no unpublished studies by Dr Bem in his own meta-analysis, even when the statistical signs are very clear that such studies should exist, is for me a clear sign of bias.
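The edge of the grey triangle is simply the smallest effect size that reaches significance for a given standard error. A minimal Python sketch of that boundary, assuming a normal approximation and (as a rough simplification of my own) a standard error of 1/sqrt(n) for a one-sample standardized effect; the function name `critical_effect` and the sample sizes are hypothetical:

```python
from math import sqrt
from statistics import NormalDist


def critical_effect(standard_error, alpha=0.05, two_sided=True):
    """Smallest effect size that is significant for a given standard error.

    Assumption: the effect size estimate is approximately normally
    distributed, so the boundary is z_crit * standard_error.
    """
    tail = alpha / 2 if two_sided else alpha
    z_crit = NormalDist().inv_cdf(1 - tail)
    return z_crit * standard_error


# Hypothetical sample sizes; SE = 1 / sqrt(n) is a rough approximation.
for n in (50, 100, 150, 200):
    se = 1 / sqrt(n)
    print(
        f"n = {n}: SE = {se:.3f}, "
        f"two-sided boundary = {critical_effect(se):.3f}, "
        f"one-sided boundary = {critical_effect(se, two_sided=False):.3f}"
    )
```

For a standard error of 0.1 the two-sided boundary lies at an effect size of about 0.196 and the one-sided boundary at about 0.164. Studies that were selected on a one-sided p-value just below .05 will cluster around that one-sided line, close to the edge of the triangle, instead of scattering around the true effect size.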

Now you can publish a set of studies like this in a top journal in psychology as evidence for precognition, but I just use these studies to explain to my students what publication bias looks like in a funnel plot.

For this research area to be taken seriously by scientists, it should make every attempt to be free from bias. I know many researchers in this field, among others Dr Tressoldi, one of the co-authors, are making every attempt to meet the highest possible standards, for example by publishing pre-registered studies (e.g., https://koestlerunit.wordpress.com/study-registry/registered-studies/). I think this is the true way forward. I also think it is telling us something that if replications are performed, these consistently fail to replicate the original results (such as a recent replication by one of the co-authors, Rabeyron, 2014, which did not replicate his own original results – note his original results are included in the meta-analysis, but his replication is not). Publishing a biased meta-analysis stating in the abstract there is “decisive evidence” in support of the experimental hypothesis, while upon closer scrutiny the meta-analysis fails to provide any conclusive evidence of the presence of an effect (let alone support for the hypothesis that psi exists), would be a step back, rather than a step forward.

Conclusion

No researcher should be convinced by this meta-analysis that psi effects exist. I think it is comforting that PET meta-regression indicates the effect is not reliably different from 0 after controlling for publication bias, and that p-curve analyses do not indicate the studies have evidential value. However, even when statistical techniques would all conclude there is no bias, we should not be fooled into thinking there is no bias. There most likely will be bias, but statistical techniques are simply limited in the bias they can reliably indicate.

Based on my reading of the manuscript, I think the abstract in a future revision should read as follows:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded, a generalized variant of the phenomenon traditionally denoted by the term precognition (Bem, 2011). To encourage replications, all materials needed to conduct them were made available on request. We here report a meta-analysis of 90 experiments from 33 laboratories in 14 countries which yielded an overall effect size (Hedges’ g) of 0.09, which after controlling for publication bias using a PET-meta-regression is reduced to 0.008, which is not reliably different from 0, 95% CI [-0.03; 0.05]. These results suggest positive findings in the literature are an indication of the ubiquitous presence of publication bias, but cannot be interpreted as support for psi-phenomena. In line with these conclusions, a p-curve analysis on the 18 significant studies did not provide evidential value for a true effect. We discuss the controversial status of precognition and other anomalous effects collectively known as psi, and stress that even if future statistical inferences from meta-analyses would result in an effect size estimate that is statistically different from zero, the results would not allow for any theoretical inferences about the existence of psi as long as there are no theoretical explanations for psi-phenomena.