Tuesday, July 1, 2014

Data peeking without p-hacking



You might have looked at your data while the data collection was still in progress, and have been tempted to stop the study because the result was already significant. Alternatively, you might have analyzed your data, only to find the result was not yet significant, and decided to collect additional data. There are good ethical arguments to do this. You should spend tax money in the most efficient manner, and if adding some data makes your study more informative, that's better than running a completely new and bigger study. Similarly, asking 200 people to spend 5 minutes thinking about their death in a mortality salience manipulation when you only needed 100 participants to do this depressing task is not desirable. However, if you peek at your data but don’t control the Type 1 error rate when deciding to terminate or continue the data collection, you are p-hacking.

No worries: There’s an easy way to peek at data the right way, and decide whether to continue the data collection or call it a day while controlling the Type 1 error rate. It’s called sequential analyses. It has been used extensively in large medical trials, and the math to control the false positive rate is worked out pretty well (and there's easy-to-use software to perform the required calculations). If you’ve been reading this blog, you might have realized I think it’s important to point out what’s wrong, but even more important to prevent people from doing the wrong thing by explaining how to do the right thing.

Last week, Ryne Sherman posted a very cool R function to examine the effects of repeatedly analyzing data, and adding participants when the result is not yet significant. It allows you to specify how often you will collect additional samples, and gives the inflated alpha level and effect size. Here, I’m going to use his function to show how easy it is to prevent p-hacking while still being able to repeatedly analyze results while the data collection is in progress. I should point out that Bayesian analyses have no problem with repeated testing, so if you want to abandon NHST, that's also an option.

I’ve modified his original code slightly as follows:

res <- phack(initialN=50, hackrate=50, grp1M=0, grp2M=0, grp1SD=1, grp2SD=1, maxN=150, alpha=.0221, alternative="two.sided", graph=TRUE, sims=100000)

I’ve set the initial sample size (per condition) to 50, and the ‘hackrate’ (the number of participants that are added if the result is not yet significant) to 50 additional participants in each group. I’ve set maxN, the maximum sample size you are willing to collect, to 150. This means that you get three tries: after 50, after 100, and after 150 participants per condition. That’s not a p-hacking rampage (Ryne simulates results of checking after every 5 participants), but as we’ll see below, it’s enough to substantially inflate the Type 1 error rate. I also use two-sided tests (alternative="two.sided") in this simulation, and increased the number of simulations from 1000 to 100000 for more stable results.

Most importantly, I have adjusted the alpha-level. Instead of the typical .05 level, I’ve lowered it to .0221. Before I explain why I adjusted the alpha level, let’s see if it works.

Running The Code


Make sure to have first installed and loaded the ‘psych’ package, and read in the p-hack function Ryne made:

install.packages("psych")  # install the psych package (only needed once)
library(psych)  # load the psych package
source("http://rynesherman.com/phack.r") # read in the p-hack function  

Then, run the code below (the set.seed(3) function makes sure you get the same result as in this example - remove it to simulate different random data).

set.seed(3)
res <- phack(initialN=50, hackrate=50, grp1M=0, grp2M=0, grp1SD=1, grp2SD=1, maxN=150, alpha=.0221, alternative="two.sided", graph=TRUE, sims=100000)

The output you get will tell you a lot of different things, but here I'm mainly interested in:

Proportion of Original Samples Statistically Significant = 0.02205 
Proportion of Samples Statistically Significant After Hacking = 0.04947 

This means that if we look at the data after the first 50 participants are in, only 2.205% of the studies (a proportion of 0.02205) reveal a statistically significant result. That’s pretty close to the significance level of 0.0221 (as it should be, when there is no true effect to be found). Now for the nice part: We see that after ‘p-hacking’ (looking at the data multiple times) the overall alpha level is approximately 0.04947, or 4.947%. It stays nicely below the 0.05 (5%) significance level that we would adhere to if we had performed only a single statistical test.

What I have done here is formally called ‘sequential analyses’. I’ve applied Pocock’s (1977) boundary for three sequential analyses, and not surprisingly, it works very nicely. It lowers the alpha level for each analysis in such a way that the overall alpha level for three looks at the data stays below 0.05. If we hadn’t lowered the significance level (which you can try out by re-running the analysis, changing alpha=.0221 to alpha=.05), we would have found an overall Type 1 error rate of 10.7%, which is an inflated alpha level due to flexibility in the data analysis that can be quite problematic (see also Lakens & Evers, 2014).
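As a rough cross-check that does not depend on the phack function, the three-look design can be simulated in a few lines of base R. This is a minimal sketch under stated assumptions: simulate_looks() is a hypothetical helper written for this illustration (it is not part of Ryne Sherman's code), it uses Student's t-test with equal variances, and it runs far fewer simulations than above to keep the runtime short. The first lines also show where .0221 comes from: it is the two-sided p-value that corresponds to Pocock's critical value of z = 2.289 for three looks at an overall alpha of .05.

```r
# Pocock's critical z-value for three equally spaced looks at an overall
# alpha of .05, converted to a two-sided nominal alpha per look.
pocock_z <- 2.289
pocock_alpha <- 2 * (1 - pnorm(pocock_z))
round(pocock_alpha, 4)

# Hypothetical helper (not part of phack): for each simulated study, return
# the per-group sample size at which the test first became significant,
# or 0 if no look was significant.
simulate_looks <- function(looks, alpha, d = 0, sims = 4000) {
  max_n <- max(looks)
  stopped_at <- numeric(sims)
  for (i in seq_len(sims)) {
    g1 <- rnorm(max_n, mean = d, sd = 1)
    g2 <- rnorm(max_n, mean = 0, sd = 1)
    for (n in looks) {
      # Assumption: Student's t-test with equal variances, two-sided
      p <- t.test(g1[1:n], g2[1:n], var.equal = TRUE)$p.value
      if (p < alpha) {
        stopped_at[i] <- n
        break
      }
    }
  }
  stopped_at
}

set.seed(1)
looks <- c(50, 100, 150)
corrected <- simulate_looks(looks, alpha = 0.0221)
uncorrected <- simulate_looks(looks, alpha = 0.05)

# Overall Type 1 error rate: proportion of studies significant at any look
mean(corrected > 0)
mean(uncorrected > 0)
```

With the corrected alpha the overall rate should land near .05, and with alpha = .05 at every look it should land near the 10.7% mentioned above, give or take simulation error.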

On page 7 of Simmons, Nelson, & Simonsohn (2011), the authors discuss correcting alpha levels (as we’ve done above), where they even refer to Pocock (1977). The paragraph reads a little bit like a reviewer made them write it, but in it, they say: “unless there is an explicit rule about exactly how to adjust alphas for each degree of freedom […] the additional ambiguity may make things worse by introducing new degrees of freedom.” I think there are good explicit rules that can be used in the specific case of repeatedly analyzing data and adding participants. Nevertheless, they are right that in sequential analyses, researchers need to determine the number of looks at the data, and the alpha correction function. All of these could be additional sources of flexibility, and therefore I think sequential analyses need to be pre-registered. But for a pre-registered rule to determine the sample size, a sequential analysis allows for surprising flexibility in the data collection, while controlling the Type 1 error rate.

Note that Pocock’s rule is actually not the one I would recommend, and it isn’t even the rule Pocock would recommend (!), but it’s the only one that has the same alpha level for each intermittent test, and thus the only one I could demonstrate in the function Ryne Sherman wrote. I won’t go into too much detail about which adjustments to the alpha level you should make, because I’ve written a practical primer on sequential analyses in which this, and a lot more, is discussed.

Note that another adjustment of Ryne's code nicely reproduces the 'Situation B' in Simmons et al's False Positive Psychology paper of collecting 20 participants, and adding 10 if the test is not significant (for a significance level of .05):

res <- phack(initialN=20, hackrate=10, grp1M=0, grp2M=0, grp1SD=1, grp2SD=1, maxN=30, alpha=.05, alternative="two.sided", graph=TRUE, sims=100000)


When there is an effect to be found

I want to end by showing why sequential analyses can be very beneficial if there is a true effect to be found. Run the following code, where the grp1M (mean in group 1) is 0.4.

set.seed(3)
res <- phack(initialN=50, hackrate=50, grp1M=0.4, grp2M=0, grp1SD=1, grp2SD=1, maxN=150, alpha=.0221, alternative="two.sided", graph=TRUE, sims=100000)

This study has an effect size of d=0.4. Remember that in real life, the true effect size is not known, so you might have just chosen to collect a number of participants based on some convention (e.g., 50 participants in each condition) which would lead to an underpowered study. In situations when the true effect is uncertain, sequential analyses can have a real benefit. After running the script above, we get:

Proportion of Original Samples Statistically Significant = 0.37624
Proportion of Samples Statistically Significant After Hacking = 0.89686 

Now that there is a true effect of d=0.4, these numbers mean that in 37.6% of the studies, we got lucky and already observe a statistically significant difference after collecting only 50 participants in each condition. That’s efficient, and you can take an extra week off, because even though single studies are never enough to accurately estimate the true effect size, the data give an indication something might be going on. Note that the power at this first look is quite a lot lower than it would be if we looked at the data only once and tested at the .05 level - other corrections for the alpha level than Pocock's correction have a lower cost in power.
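The first-look figure can also be checked analytically with power.t.test() from base R's stats package, which needs no simulation. The sketch below assumes a two-sample, two-sided t-test with 50 participants per condition and d = 0.4; the variable names are illustrative.

```r
# Power at the first look (n = 50 per condition, d = 0.4), two-sided t-test,
# using the Pocock-corrected alpha of .0221
power_pocock <- power.t.test(n = 50, delta = 0.4, sd = 1,
                             sig.level = 0.0221,
                             type = "two.sample",
                             alternative = "two.sided")$power

# Power of a single fixed-sample test at alpha = .05 with the same
# 50 participants per condition
power_single <- power.t.test(n = 50, delta = 0.4, sd = 1,
                             sig.level = 0.05,
                             type = "two.sample",
                             alternative = "two.sided")$power

round(c(pocock = power_pocock, single = power_single), 3)
```

The first value should sit close to the simulated 37.6%, and the second shows what a single look at the .05 level would have achieved with the same 50 participants per condition.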

The data also tell us that after collecting 150 participants, we will have observed an effect in approximately 90% of the studies. If the difference happens not to be significant after running 100 participants, and you deem a significant difference to be important, you can continue collecting participants – without it being p-hacking – and improve your chances of observing a significant result.

If, after 50 participants in each condition, you observe a Cohen’s d of 0.001 (and you don’t have access to thousands and thousands of people on Facebook) you might decide you are not interested in pursuing this specific effect any further, or choose to increase the strength of your manipulation in a new study. That’s also more efficient than collecting 100 participants in each condition without looking at the data until you are done, and hoping for the best.
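To put a rough number on the efficiency argument, the sketch below simulates the same three-look design with d = 0.4 in base R and tabulates where each study stops. It is an illustration under assumptions, not a reimplementation of phack: the loop, the variable names, and the choice of Student's t-test are mine, and studies that never reach significance are counted as stopping at the maximum sample size.

```r
set.seed(2)
looks <- c(50, 100, 150)   # per-condition sample size at each look
alpha <- 0.0221            # Pocock-corrected alpha for three looks
sims <- 3000               # kept small so the sketch runs quickly

final_n <- numeric(sims)       # per-condition n when data collection ends
significant <- logical(sims)   # was any look significant?

for (i in seq_len(sims)) {
  g1 <- rnorm(max(looks), mean = 0.4, sd = 1)
  g2 <- rnorm(max(looks), mean = 0, sd = 1)
  final_n[i] <- max(looks)     # default: continue until the last look
  for (n in looks) {
    # Assumption: Student's t-test with equal variances, two-sided
    p <- t.test(g1[1:n], g2[1:n], var.equal = TRUE)$p.value
    if (p < alpha) {
      final_n[i] <- n
      significant[i] <- TRUE
      break
    }
  }
}

prop.table(table(final_n))   # share of studies ending at each look
mean(significant)            # overall power across the three looks
mean(final_n)                # average per-condition sample size
```

Under these assumptions the average sample size should come out well below the maximum of 150 per condition, while the overall power should be close to the roughly 90% reported above.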

It was because of these efficiency benefits that Wald (1945), who published an early paper on sequential analyses, was kept from publicly sharing his results during wartime. These insights were judged to be sufficiently useful for the war effort to keep them out of the hands of the enemy.


Given how much more efficient sequential analyses are, it’s very surprising people don’t use them more often. If you want to get started, check out my practical primer on sequential analyses, which is in press in The European Journal of Social Psychology in a special issue on methodological improvements. If you want to listen to me explain it in person (or see what I look like when wearing a tie), you can listen to my talk about this at the European Association of Social Psychology conference (EASP 2014) in Amsterdam, Wednesday July 9th, 09:40 AM in room OMHP F0.02. But I would suggest you just read the paper. There’s an easy step-by-step instruction (also for calculations in R), and the time it takes is easily worth it, since your data collection will be much more efficient in the future, while you will be able to aim for well-powered studies at a lower cost. I call that a win-win situation.

Thanks to Ryne Sherman for his very useful function (which can be used to examine the effects of peeking at data, even when it's not p-hacking!). This was his first post, and if future ones will be as useful, you will want to follow his blog or twitter account.


References

Lakens, D. (in press). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology. DOI: 10.1002/ejsp.2023. Pre-print available at SSRN: http://ssrn.com/abstract=2333729 

Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2), 191-199. 
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2), 117-186.

8 comments:

  1. Very interesting. You already anticipated my first reaction to sequential analyses (as did Simmons et al.) which is "...that in sequential analyses, researchers need to determine the number of looks at the data, and the alpha correction function." This is a (minor) bummer because I get the sense that many exploratory experiments don't employ that sort of forethought. I can see why sequential analyses would be considered very important in medicine. Nonetheless, the examples you provided suggest that it could be used in experimental psychology much more.

    I'm not 100% sure about the impact of using two.sided. I wrote that function a long time ago (actually, I just looked...my log says I wrote it in January of 2013) and am not sure about the impact two.sided has on the effect size estimation. I point this out because it would be interesting to know if sequential analyses also provide more accurate effect size estimates (along with controlling the Type I error rate). I am sure I only included two.sided as an argument because it was already included in the t.test function. Two-sided tests always struck me as odd though (hence I default to "greater") because it seemed like an experiment would have a direction in mind ahead of time. I guess I can imagine studies wondering if two naturally occurring groups are different from each other desiring a two-tailed test though.

    Replies
    1. Hi Ryne, there's nothing that can beat the flexibility of just running some participants and seeing what happens. And in exploratory research, that's ok. But if you want to test an idea, and are serious enough to do a power analysis beforehand, working in sequential analyses is 2 minutes, tops. It is worth it, always.

      I'm always intrigued by the difference between one-sided and two-sided testing. Two-sided is simply the default (for example when reproducing the Situation B by Simmons et al.) but one-sided tests have a lot more going for them, both mathematically and theoretically. I'd love to dive into the topic. It might be that one-sided tests should be used more often (when pre-registered), but I don't know yet. If you have any recommended reading on this, let me know!

  2. Hi Daniel,

    In a comment for a previous posting, I mentioned that I had concerns about sequential methods. This post motivates me to mention some of my concerns (there are others that we can discuss elsewhere). The crux of the issue here is that standard (fixed sample) hypothesis testing is essentially a method for separating over-estimated effect sizes (called significant results) from under-estimated effect sizes (called non-significant results). Sequential methods control the Type I error rate by exaggerating this separation process. They stop sampling when the estimated effect size is really large and reach the maximum sample size when the estimated effect size is really small.

    For example, I ran a version of the simulation you ran with a true effect size of d=0.4:

    > set.seed(3)
    > res <- phack(initialN=50, hackrate=50, grp1M=0.4, grp2M=0.0, grp1SD=1, grp2SD=1, maxN=150, alpha=.0221, alternative="two.sided", graph=TRUE, sims=1000)

    The simulation produces 391 experiments that stop at N=50 with a pooled effect size (Hedges' g) of 0.579. Notice that the estimated effect size is much larger than the true effect size of 0.4. This result is not unexpected because with a criterion of 0.0221, the only way this kind of sequential process will stop at N=50 is when g is greater than 0.462.

    The simulation produces 333 experiments that stop at N=100 with a pooled effect size of 0.417. These experiments produce an effect size that cannot be too large (else it likely would have stopped at N=50) nor too small (because the effect size must be at least 0.325 for it to reject the null at N=100).

    The simulation produces 276 experiments that stop at N=150 with a pooled effect size of 0.281. Notice that the estimated effect size is much smaller than the true effect size of 0.4. Again, this is not unexpected because an experiment with a larger effect size usually would have rejected at N=100 or N=50.

    What bothers me is that if someone uses this kind of sequential method and then introduces a publication bias (e.g., not publish some of the non-significant findings), they will overestimate the true effect size; and the overestimation will often be even larger than publication bias with a fixed sample approach (Francis, 2012, Perspectives in Psych Sci), depending on the true effect size and the hack rate.

    One could argue that researchers should publish all of their studies regardless of whether the findings are significant or not (and I think that is a good idea). Indeed, if we pool all experiments using meta-analysis (giving higher weight to the studies with larger sample sizes), we get an unbiased estimate (0.389) of the effect size because the studies with the largest samples have the smallest effect sizes. But if all studies will/can be published, then why bother with the sequential approach? For that matter why bother with hypothesis testing at all? There seems little reason to classify experiments as being significant or non-significant when you are going to pool both types of findings to estimate the effect size.

    One could further argue that analysis methods based on the p-curve could be used to estimate the effect size from just the significant studies. But such estimation requires that the experiments used a known significance criterion (more precisely a known publication criterion). Typically this is 0.05, but in the sequential method used in the simulation it was 0.0221. The p-curve analysis could be adjusted for a different criterion; but if different researchers use different sequential sampling techniques (vary initial N, hack rate, max N), then there will be no common significance criterion and (as far as I can tell), the p-curve analyses will not work properly.

    The conclusion for me is that sequential methods only perpetuate and exaggerate problems with fixed-sample hypothesis testing. As a field, we have to find some other way to do our analyses.

    Best wishes

    Replies
    1. Dear Greg, thanks for your comment. I'm a bit disappointed in the quality of your criticism of sequential analyses. There are adjustments for bias you should calculate before you report effect sizes based on sequential analyses. You'd have known that if you had taken the time to actually read my article.

      Better statisticians than you and me together have been working on this for decades - your conclusions based on one simulation are not going to make a dent in the idea behind sequential analyses. I suggest you read the paper, and if you have any valid criticisms after that, feel free to come back here and try a little harder.

    2. Hi Daniel,

      Thanks for pointing out that your primer mentions a need for correcting effect size estimates. I'm not following the calculation of the drift parameter (mentioned in your supplemental material), and I leave for a trip in a few hours and will not get back to this topic until next week. Many (to my reading, "most") discussions of sequential methods do not address effect size overestimation or methods for correcting it. It would be interesting to see if the proposed adjustments fix the problems I noted above (obviously, more than one simulation is needed to make that judgment). I am rather skeptical that there is an advantage to a process that deliberately overestimates the effect size and then corrects for the overestimation. In most cases, simpler is better; but I am happy to be proven wrong.

    3. Hi Greg, in sequential testing, the difference between NHST and effect size estimation is even more pronounced than in a single testing situation. They are different goals, and require different things to be taken into account when designing an experiment. I think NHST has a function in exploratory scientific research, and effect size estimation is more important when the presence of an effect has been established. So I'm fine with sequential analyses being very good at NHST but not perfect at effect size estimation. As I mentioned many times on this blog, we should abandon the illusion there will ever be one statistic that tells us everything we want to know.

  3. Hi Daniel,

    I wondered how to correctly apply sequential analyses/adjust p-values in the following situation:

    We ran a study with four conditions and an overall N of 600. When we analyzed the data, we found that the pattern we expected (two specified contrasts) was there but with p-values higher than .05. We then spontaneously decided to add 600 more participants to the sample, leaving us with an overall N of 1200. Now we get p-values of .002 and .005 for the two contrasts. We would like to adjust these for having peeked at the data after half of it was collected. The situation differs from the one in your 2014 EJSP “Performing high-powered studies efficiently with sequential analyses” paper because we didn’t plan such sequential analyses. Accordingly, the p-value we used as a decision rule when looking at the data at time 1 was p < .05. Does this mean we need to enter a z-value of 1.96 for time 1 when computing the monitoring adjusted p-value as described in the paper (and the OSF materials)? It seems like a very conservative approach (which obviously doesn’t make it wrong).

    Thank you so much for this super-informative blog and the very nice EJSP paper on sequential analysis!

    Marie Hennecke

    Replies
    1. See http://pps.sagepub.com/content/9/3/293.abstract for an approach to evaluate how bad adding extra participants was after looking at the data once.
