Comments on The 20% Statistician: Data peeking without p-hacking

See http://pps.sagepub.com/content/9/3/293.abstrac...

2016-10-25T14:48:41.072+02:00

See http://pps.sagepub.com/content/9/3/293.abstract for an approach to evaluate how bad adding extra participants was after looking at the data once.

Hi Daniel, I wondered how to correctly apply sequ...

2016-10-25T14:38:47.821+02:00

Hi Daniel,

I wondered how to correctly apply sequential analyses/adjust p-values in the following situation:

We ran a study with four conditions and an overall N of 600. When we analyzed the data, we found that the pattern we expected (two specified contrasts) was there but with p-values higher than .05. We then spontaneously decided to add 600 more participants to the sample, leaving us with an overall N of 1200. Now we get p-values of .002 and .005 for the two contrasts. We would like to adjust these for having peeked at the data after half of it was collected. The situation differs from the one in your 2014 EJSP “Performing high-powered studies efficiently with sequential analyses” paper because we didn’t plan such sequential analyses. Accordingly, the p-value we used as a decision rule when looking at the data at time 1 was p < .05. Does this mean we need to enter a z-value of 1.96 for time 1 when computing the monitoring adjusted p-value as described in the paper (and the OSF materials)? It seems like a very conservative approach (which obviously doesn’t make it wrong).
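To give a feel for how much one unplanned peek can matter, here is a rough Monte Carlo sketch in Python. It is not the procedure from the paper or the OSF materials. It assumes a single z-test (not four conditions with two contrasts), a true effect of zero, one look at half the final sample, and the rule "stop if p < .05, otherwise double the sample and test again at .05". The function name and its defaults are made up for illustration.

```python
import random
from statistics import NormalDist

def peeking_type1_rate(sims=200_000, alpha=0.05, seed=1):
    """Estimate the false-positive rate when a non-significant first look
    leads to doubling the sample and testing again at the same alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical z
    hits = 0
    for _ in range(sims):
        z_first = rng.gauss(0.0, 1.0)  # z-statistic from the first half of the data
        z_added = rng.gauss(0.0, 1.0)  # z-statistic from the added, independent half
        # With two equal-sized halves, the z-statistic on all data is (z1 + z2) / sqrt(2).
        z_full = (z_first + z_added) / 2 ** 0.5
        # A false positive occurs if either look crosses the threshold.
        if abs(z_first) > z_crit or abs(z_full) > z_crit:
            hits += 1
    return hits / sims

rate = peeking_type1_rate()
print(f"Type I error rate with one unplanned peek: {rate:.3f}")
```

Under these assumptions the error rate comes out at roughly .08 rather than the nominal .05, which is the inflation any adjusted p-value has to account for.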

Thank you so much for this super-informative blog and the very nice EJSP paper on sequential analysis!

Marie Hennecke

Hi Greg, in sequential testing, the difference bet...

2014-07-02T09:35:21.095+02:00

Hi Greg, in sequential testing, the difference between NHST and effect size estimation is even more pronounced than in a single testing situation. They are different goals, and require different things to be taken into account when designing an experiment. I think NHST has a function in exploratory scientific research, and effect size estimation is more important when the presence of an effect has been established. So I'm fine with sequential analyses being very good at NHST but not perfect at effect size estimation. As I have mentioned many times on this blog, we should abandon the illusion that there will ever be one statistic that tells us everything we want to know.

Hi Daniel, Thanks for pointing out that your prim...

2014-07-02T09:18:35.543+02:00

Hi Daniel,

Thanks for pointing out that your primer mentions a need for correcting effect size estimates. I'm not following the calculation of the drift parameter (mentioned in your supplemental material), and I leave for a trip in a few hours and will not get back to this topic until next week. Many (to my reading, "most") discussions of sequential methods do not address effect size overestimation or methods for correcting it. It would be interesting to see if the proposed adjustments fix the problems I noted above (obviously, more than one simulation is needed to make that judgment). I am rather skeptical that there is an advantage to a process that deliberately overestimates the effect size and then corrects for the overestimation. In most cases, simpler is better; but I am happy to be proven wrong.

Dear Greg, thanks for your comment. I'm a bit ...

2014-07-02T06:04:41.734+02:00

Dear Greg, thanks for your comment. I'm a bit disappointed in the quality of your criticism of sequential analyses. There are adjustments for bias you should calculate before you report effect sizes based on sequential analyses. You'd have known that if you had taken the time to actually read my article.

Better statisticians than you and me together have been working on this for decades - your conclusions based on one simulation are not going to make a dent in the idea behind sequential analyses. I suggest you read the paper, and if you have any valid criticisms after that, feel free to come back here and try a little harder.

Hi Daniel, In a comment for a previous posting, I...

2014-07-01T23:08:46.694+02:00

Hi Daniel,

In a comment for a previous posting, I mentioned that I had concerns about sequential methods. This post motivates me to mention some of my concerns (there are others that we can discuss elsewhere). The crux of the issue here is that standard (fixed sample) hypothesis testing is essentially a method for separating over-estimated effect sizes (called significant results) from under-estimated effect sizes (called non-significant results). Sequential methods control the Type I error rate by exaggerating this separation process. They stop sampling when the estimated effect size is really large and reach the maximum sample size when the estimated effect size is really small.

For example, I ran a version of the simulation you ran with a true effect size of d=0.4:

> set.seed(3)
> res <- phack(initialN=50, hackrate=50, grp1M=0.4, grp2M=0.0, grp1SD=1, grp2SD=1, maxN=150, alpha=.0221, alternative="two.sided", graph=TRUE, sims=1000)

The simulation produces 391 experiments that stop at N=50 with a pooled effect size (Hedges' g) of 0.579. Notice that the estimated effect size is much larger than the true effect size of 0.4. This result is not unexpected because with a criterion of 0.0221, the only way this kind of sequential process will stop at N=50 is when g is greater than 0.462.

The simulation produces 333 experiments that stop at N=100 with a pooled effect size of 0.417. These experiments produce an effect size that cannot be too large (else it likely would have stopped at N=50) nor too small (because the effect size must be at least 0.325 for it to reject the null at N=100).
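The two cut-offs quoted above (0.462 at N=50 and 0.325 at N=100) can be checked with a short calculation. The Python sketch below assumes that N is the sample size per group, that the test is a two-sided independent-samples t-test, and that g is Cohen's d multiplied by the usual small-sample correction 1 - 3/(4df - 1). The critical t value is approximated with a Cornish-Fisher expansion of the normal quantile so that only the standard library is needed; the function names are illustrative and not part of `phack`.

```python
from statistics import NormalDist

def t_critical(alpha, df):
    """Approximate two-sided critical t value via a Cornish-Fisher
    expansion around the normal quantile (an approximation, not exact)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (z
            + (z**3 + z) / (4 * df)
            + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2))

def min_significant_g(n_per_group, alpha):
    """Smallest Hedges' g that reaches significance in a two-group t-test
    with equal group sizes (assumption: N in the text is per group)."""
    df = 2 * n_per_group - 2
    d = t_critical(alpha, df) * (2 / n_per_group) ** 0.5  # convert t to Cohen's d
    correction = 1 - 3 / (4 * df - 1)                      # Hedges' small-sample correction
    return d * correction

for n in (50, 100, 150):
    print(n, round(min_significant_g(n, alpha=0.0221), 3))
```

Under these assumptions the values for N=50 and N=100 come out close to the 0.462 and 0.325 given above, which suggests that reading N as per-group size matches how the thresholds were derived.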

The simulation produces 276 experiments that stop at N=150 with a pooled effect size of 0.281. Notice that the estimated effect size is much smaller than the true effect size of 0.4. Again, this is not unexpected because an experiment with a larger effect size usually would have rejected at N=100 or N=50.

What bothers me is that if someone uses this kind of sequential method and then introduces a publication bias (e.g., by not publishing some of the non-significant findings), they will overestimate the true effect size; and the overestimation will often be even larger than publication bias with a fixed sample approach (Francis, 2012, Perspectives on Psychological Science), depending on the true effect size and the hack rate.

One could argue that researchers should publish all of their studies regardless of whether the findings are significant or not (and I think that is a good idea). Indeed, if we pool all experiments using meta-analysis (giving higher weight to the studies with larger sample sizes), we get an unbiased estimate (0.389) of the effect size because the studies with the largest samples have the smallest effect sizes. But if all studies will/can be published, then why bother with the sequential approach? For that matter why bother with hypothesis testing at all? There seems little reason to classify experiments as being significant or non-significant when you are going to pool both types of findings to estimate the effect size.
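The effect of the weighting can be illustrated with the three group-level summaries reported above. The Python sketch below is only a back-of-the-envelope check: it weights each group mean by its sample size N as a stand-in for inverse-variance weighting, and it works from the rounded group means, not from the 1000 individual simulated studies, so it is not expected to reproduce 0.389 exactly.

```python
# (number of studies stopping at this N, sample size N, mean Hedges' g),
# taken from the simulation results quoted in this comment
groups = [
    (391, 50, 0.579),
    (333, 100, 0.417),
    (276, 150, 0.281),
]

def unweighted_mean(groups):
    """Average g across studies, treating every study equally."""
    total_studies = sum(k for k, _, _ in groups)
    return sum(k * g for k, _, g in groups) / total_studies

def n_weighted_mean(groups):
    """Average g across studies, weighting each study by its sample size."""
    total_weight = sum(k * n for k, n, _ in groups)
    return sum(k * n * g for k, n, g in groups) / total_weight

print(f"unweighted: {unweighted_mean(groups):.3f}")
print(f"N-weighted: {n_weighted_mean(groups):.3f}")
```

With these inputs the unweighted mean is about 0.44, while the N-weighted mean is about 0.39, close to the reported 0.389 and to the true value of 0.4. The gap between the two shows why the weighting matters when small early-stopping studies carry the largest estimates.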

One could further argue that analysis methods based on the p-curve could be used to estimate the effect size from just the significant studies. But such estimation requires that the experiments used a known significance criterion (more precisely a known publication criterion). Typically this is 0.05, but in the sequential method used in the simulation it was 0.0221. The p-curve analysis could be adjusted for a different criterion; but if different researchers use different sequential sampling techniques (vary initial N, hack rate, max N), then there will be no common significance criterion and (as far as I can tell), the p-curve analyses will not work properly.

The conclusion for me is that sequential methods only perpetuate and exaggerate problems with fixed-sample hypothesis testing. As a field, we have to find some other way to do our analyses.

Best wishes

Hi Ryne, there's nothing that can beat the fle...

2014-07-01T16:11:40.476+02:00

Hi Ryne, there's nothing that can beat the flexibility of just running some participants and seeing what happens. And in exploratory research, that's ok. But if you want to test an idea, and are serious enough to do a power analysis beforehand, working in sequential analyses takes 2 minutes, tops. It is worth it, always.

I'm always intrigued by the difference between one-sided and two-sided testing. Two-sided is simply the default (for example when reproducing the Situation B by Simmons et al.) but one-sided tests have a lot more going for them, both mathematically and theoretically. I'd love to dive into the topic. It might be that one-sided tests should be used more often (when pre-registered), but I don't know yet. If you have any recommended reading on this, let me know!

Very interesting. You already anticipated my first...

2014-07-01T15:17:47.839+02:00

Very interesting. You already anticipated my first reaction to sequential analyses (as did Simmons et al.) which is "...that in sequential analyses, researchers need to determine the number of looks at the data, and the alpha correction function." This is a (minor) bummer because I get the sense that many exploratory experiments don't employ that sort of forethought. I can see why sequential analyses would be considered very important in medicine. Nonetheless, the examples you provided suggest that it could be used in experimental psychology much more.

I'm not 100% sure about the impact of using two.sided. I wrote that function a long time ago (actually, I just looked...my log says I wrote it in January of 2013) and am not sure about the impact two.sided has on the effect size estimation. I point this out because it would be interesting to know if sequential analyses also provide more accurate effect size estimates (along with controlling Type I error rate). I am sure I only included two.sided as an argument because it was already included in the t.test function. Two-sided tests always struck me as odd though (hence I default to "greater") because it seemed like an experiment would have a direction in mind ahead of time. I guess I can imagine studies wondering if two naturally occurring groups are different from each other desiring a two-tailed test though.