A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, January 1, 2016

Error Control in Exploratory ANOVA's: The How and the Why

In a 2x2x2 design, there are three main effects, three two-way interactions, and one three-way interaction to test. That’s 7 statistical tests. With an alpha of 0.05 for each test, the probability of making at least one Type 1 error in a single ANOVA is 1-(0.95)^7 = 30%.

There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (put it on your reading list, if you haven't read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVAs in R, and repeat why it’s necessary if you don’t want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn’t thought about this issue before now, and it will reduce the number of p < 0.05 results you’ll find in exploratory ANOVAs.

Simulating Type 1 errors in 3-way ANOVAs

Let’s simulate 250000 2x2x2 ANOVAs where all factors are manipulated between individuals, with 50 participants in each condition, and without any true effect (all group means are equal). The R code is at the bottom of this page. We store the p-values of the 7 tests. The total p-value distribution has the by now familiar uniform shape we see if the null hypothesis is true.
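A scaled-down sketch of such a simulation is given here. It is not the post's original script: the factor names `a`, `b`, `c`, the seed, and the much smaller number of simulations are assumptions chosen so it runs in a few seconds.

```r
# Simulate 2x2x2 between-subjects ANOVAs without any true effect.
set.seed(1)
n_sims     <- 200   # assumption for speed; the post uses 250000
n_per_cell <- 50

# Build the balanced design: 8 cells, 50 participants each
cells  <- expand.grid(a = factor(1:2), b = factor(1:2), c = factor(1:2))
design <- cells[rep(seq_len(nrow(cells)), each = n_per_cell), ]

simulate_p <- function() {
  design$y <- rnorm(nrow(design))        # all group means are equal
  fit <- aov(y ~ a * b * c, data = design)
  summary(fit)[[1]][["Pr(>F)"]][1:7]     # 7 effects; row 8 is Residuals
}

p_values <- t(replicate(n_sims, simulate_p()))
colnames(p_values) <- c("a", "b", "c", "a:b", "a:c", "b:c", "a:b:c")

colMeans(p_values < 0.05)            # error rate of each test, near 0.05
mean(rowSums(p_values < 0.05) > 0)   # ANOVAs with at least one false positive
hist(p_values, breaks = 20)          # roughly uniform under the null
```

With so few simulations the rates are noisy; increasing `n_sims` brings the per-test rates closer to 5% and the per-ANOVA rate closer to 30%.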

 
If we count the number of significant findings (even though there is no real effect), we see that from 250,000 2x2x2 ANOVAs, approximately 87,500 p-values were smaller than 0.05 (the leftmost bar in the Figure). This equals 250,000 ANOVAs x 0.05 Type 1 errors x 7 tests. If we split up the p-values for each of the 7 tests, we see in the table below that, as expected, each test has its own 5% error rate, which together combine into a 30% error rate due to multiple testing (i.e., the probability of not making a Type 1 error is 0.95^7, and the probability of making at least one Type 1 error is 1 minus this number). With a 2x2x2x2 ANOVA, which has 15 tests, the probability of making at least one Type 1 error reaches a massive 54%, making you about as accurate as a scientist as a coin-flipping toddler.
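The arithmetic behind these percentages can be checked in a few lines of R. The helper names `familywise_error` and `n_effects` are made up for this illustration, and the calculation treats the tests as independent, as the formula in the text does.

```r
# Probability of at least one Type 1 error across k tests,
# each performed at the same alpha (tests treated as independent).
familywise_error <- function(k, alpha = 0.05) {
  1 - (1 - alpha)^k
}

# A full factorial with f two-level factors has 2^f - 1
# main effects and interactions.
n_effects <- function(f) 2^f - 1

n_effects(3)                     # 7 tests in a 2x2x2 design
familywise_error(n_effects(3))   # about 0.30
n_effects(4)                     # 15 tests in a 2x2x2x2 design
familywise_error(n_effects(4))   # about 0.54
```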

Let’s fix this. We need to adjust the error rate. The Bonferroni correction (divide your alpha level by the number of tests, so for 7 tests and alpha = 0.05 use 0.05/7 = 0.007 for each test) communicates the basic idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g., economics, gene discovery) work on optimal Type 1 error control procedures continues. I’ve used the mutoss package in R in my simulations to check a wide range of corrections, and came to the conclusion that unless the number of tests is huge, we don’t need anything fancier than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I'm wrong in the comments!). It orders p-values from lowest to highest, and tests them sequentially against an increasingly more lenient alpha level. If you prefer a spreadsheet, go here.
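To make the procedure concrete, here is a small sketch in base R. The seven p-values are invented for illustration, and `holm_adjust` is a hand-written helper (not from the mutoss package) that is only there to show the steps; in practice the built-in `p.adjust()` does the same job.

```r
# Holm-Bonferroni by hand: sort the p-values, multiply the smallest by m,
# the next by m - 1, and so on, keep the sequence non-decreasing, cap at 1.
holm_adjust <- function(p) {
  m   <- length(p)
  ord <- order(p)
  adj_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p[ord]))
  adj <- numeric(m)
  adj[ord] <- adj_sorted   # put results back in the original order
  adj
}

# Hypothetical p-values for the 7 tests of a 2x2x2 ANOVA
p <- c(0.001, 0.008, 0.012, 0.04, 0.2, 0.5, 0.9)

holm_adjust(p)                       # hand-written version
p.adjust(p, method = "holm")         # built-in, same result
p.adjust(p, method = "bonferroni")   # plain Bonferroni, for comparison
```

For these made-up values, Holm leaves two tests below 0.05 where plain Bonferroni leaves only one, which is the sense in which Holm is slightly better.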

In a 2x2x2 ANOVA, we can test three main effects, three 2-way interactions, and one 3-way interaction. The table below shows the error rate for each of these 7 tests is 5% (for a total of 1-0.95^7 = 30%) but after the Holm-Bonferroni correction, the Type 1 error rate is nicely controlled.



However, another challenge is to not let Type 1 error control increase the Type 2 errors too much. To examine this, I’ve simulated 2x2x2 ANOVA’s where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (a main effect, two 2-way interactions, and the 3-way interaction). 

Let’s first look at the p-value distribution. I’ve added a horizontal and vertical line. The horizontal line indicates the null-distribution caused by the four null-effects. The vertical line indicates the significance level of 0.05. The two lines create four quadrants. Top left are the true positives, bottom left are the false positives, top right are the false negatives (not significant due to a lack of power) and the bottom right are the true negatives.


Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).
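The two ways of applying Holm's correction lead to the same decisions. The sketch below checks this on random p-values; `holm_reject` is a hypothetical helper written for this illustration that compares the sorted raw p-values against the step-wise alpha levels.

```r
# Holm via adjusted alpha levels: compare the i-th smallest p-value with
# alpha / (m - i + 1) and stop at the first test that fails.
holm_reject <- function(p, alpha = 0.05) {
  m          <- length(p)
  ord        <- order(p)
  thresholds <- alpha / (m - seq_len(m) + 1)
  passed     <- p[ord] <= thresholds
  n_reject   <- if (all(passed)) m else which(!passed)[1] - 1
  reject     <- logical(m)
  reject[ord[seq_len(n_reject)]] <- TRUE
  reject
}

# Holm via adjusted p-values: keep alpha fixed at 0.05
set.seed(2)
same_decision <- replicate(1000, {
  p <- runif(7)^3   # skewed toward small values so some tests are rejected
  identical(holm_reject(p), p.adjust(p, method = "holm") <= 0.05)
})
all(same_decision)
```

Adjusted p-values are convenient for plotting, because every test can be judged against the same 0.05 line.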


We see a substantial drop in the left-most column, and this drop is larger than the part of that column's height that was due to false positives, so true positives are lost as well. We also see a peculiarly high bar on the right, caused by the Holm correction adjusting a large number of p-values to 1. We can see this drop in power in the Table below as well. It’s substantial: From 87% power to 68% power.
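The pile-up at 1 follows directly from how the adjustment works: once a multiplied p-value exceeds 1 it is capped, and every larger p-value inherits that cap. A small illustration with invented p-values:

```r
# Hypothetical p-values: two clear effects and five unremarkable tests
p <- c(0.0004, 0.003, 0.30, 0.45, 0.60, 0.75, 0.90)

p_holm <- p.adjust(p, method = "holm")
p_holm

# The third-smallest value is multiplied by 5 (0.30 * 5 = 1.5), capped at 1,
# and all larger p-values are then set to 1 as well.
sum(p_holm == 1)
```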

If you perform a 2x2x2 ANOVA, we might expect you are not really interested in the main effects (if you were, a simple t-test would have sufficed). The power cost is already much lower if the exploratory analysis focuses on only four tests, the three 2-way interactions, and the 3-way interaction (see the third row in the Table below). Even exploratory 2x2x2 ANOVA’s are typically not 100% exploratory. If so, preregistering the subset of all tests you are interested in, and controlling the error rate for this subset of tests, provides an important boost in power.
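The gain from correcting over a smaller family can be seen by adjusting the same p-values twice. The values and effect labels below are hypothetical and only illustrate the mechanics.

```r
# Hypothetical p-values from one 2x2x2 ANOVA
p_all <- c("a" = 0.40, "b" = 0.55, "c" = 0.70,
           "a:b" = 0.011, "a:c" = 0.030, "b:c" = 0.200, "a:b:c" = 0.600)

# Family 1: all 7 tests
holm_all <- p.adjust(p_all, method = "holm")

# Family 2: only the four interactions (the preregistered subset)
interactions <- c("a:b", "a:c", "b:c", "a:b:c")
holm_subset  <- p.adjust(p_all[interactions], method = "holm")

holm_all[interactions]   # a:b is 0.011 * 7 = 0.077, not significant
holm_subset              # a:b is 0.011 * 4 = 0.044, significant
```

The data are identical in both cases; only the number of tests the error rate is spread over has changed, which is why the subset has to be chosen (and ideally preregistered) before looking at the results.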


Oh come on you silly methodological fetishist!

If you think Type 1 error control should not endanger the discovery of true effects, here’s what you should not do. You should not wave your hands at controlling Type 1 error rates, saying it is ‘methodological fetishism’ (Ellemers, 2013). It ain’t gonna work. If you choose to report p-values (by all means, don’t), and want to do quantitative science (by all means, don’t) then the formal logic you are following (even if you don’t realize this) is the Neyman-Pearson approach. It allows you to say: ‘In the long run, I’m not saying there’s something, when there is nothing, more than X% of the time’. If you don’t control error rates, your epistemic foundation of making statements reduces to ‘In the long run, I’m not saying there’s something, when there is nothing, more than … uhm … nice weather for the time of the year, isn’t it?’.

Now just because you need to control error rates, doesn’t mean you need to use a Type 1 error rate of 5%. If you plan to replicate any effect you find in an exploratory study, and you set the alpha to 0.2, the probability of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting your overall Type 1 error across two studies to 0.2, and then using an alpha of 0.05 for each of the 4 tests might be a good idea. If some effects would be costlier to miss, but others less costly, you can use an alpha of 0.08 for two effects, and an alpha of 0.02 for the other two. This is just one example. It’s your party. You can easily pre-register the choices you make to the OSF or AsPredicted to transparently communicate them.
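The alpha budgeting described here is simple arithmetic. The sketch below assumes an unequal split of 0.08, 0.08, 0.02, 0.02 (one allocation that sums to the overall 0.2); the variable names and p-values are made up for illustration.

```r
alpha_overall <- 0.2   # familywise alpha chosen for the exploratory study

# A false positive has to survive the exploratory study and its replication
alpha_overall^2        # 0.04

# Equal split over the four interaction tests (Bonferroni-style)
alpha_equal <- rep(alpha_overall / 4, 4)   # 0.05 each

# Unequal split: more alpha for effects that are costlier to miss.
# Any allocation works as long as the alphas sum to the overall level.
alpha_unequal <- c(0.08, 0.08, 0.02, 0.02)
sum(alpha_unequal)     # 0.2

# Hypothetical p-values for the four tests, judged against each allocation
p <- c(0.06, 0.15, 0.01, 0.03)
p <= alpha_equal
p <= alpha_unequal
```

With the unequal split, the first test (p = 0.06) counts as significant and the last (p = 0.03) does not, the reverse of the equal split. This is why the allocation needs to be fixed before seeing the data.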

You can also throw error control out of the window. There are approximately 1,950,000 hits in Google Scholar when I search for ‘An Exploratory Analysis Of’. Put these words in the title, list all your DVs in the main text (e.g., in a table), add Bayesian statistics and effect sizes with their confidence intervals, and don’t draw strong conclusions (Bender & Lange, 2001).

Obviously, the tricky thing is always what to do if your prediction was not confirmed. I think you enter a Lakatosian degenerative research line (as opposed to the progressive research line you’d be in if your predictions were confirmed). With some luck, there’s an easy fix. The same study, but using a larger sample (or, if you designed a study using sequential analyses, simply continuing the data collection after the first look at the data; Lakens, 2014), might get you back in a progressive research line after an update in the predicted effect size. Try again, with a better manipulation or dependent variable. Giving up on a research idea after a single failed confirmation is not how science works, in general. Statistical inferences tell you how to interpret the data without fooling yourself. Type 1 error control matters, and in most psychology experiments, is relatively easy to do. But it’s only one aspect of the things you take into account when you decide which research you want to do.

My main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study which you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot of room between 100% exploratory and 100% confirmatory research, and there are many reasonable ideas about what the ‘family’ of errors is you want to control (for a good discussion on this, see Bender & Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed, and which error rate is controlled for”. Clear words, no hand waving.


Thanks to @RogierK for correcting an error in an earlier version of this blog post.


Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343–349. 
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416
Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8. http://doi.org/10.1002/ejsp.1932
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023


28 comments:

  1. Thank you. I knew it was false to ignore the problem of multiple testing in ANOVAs but I didn't know what to do when you are interested in some, but not all, of the interactions.
    However, I don't quite understand what you mean with don't do quantitative science?

    1. Although I mainly perform quantitative research, I think there is a time and place for qualitative research - not trying to quantify anything, but collecting anecdotal insights, and your thoughts about these. This does not require p-values, but it has a place in science.

  2. I also want to say "thank you". I really admire, and appreciate researchers' efforts to improve practices, and i view writing blogs like this as a way to achieve this.

    Even though i'm often not smart enough to follow every detail and nuance in your posts, i still think they are useful for me in providing a warning signal: hey, this might be important stuff, try and learn more about it and/or get some help from a statistician when trying to deal with the issue.

    I look forward to reading your posts in this new year, and thanks again for all your efforts!

  3. Hi.

    I thought the usual reason for not correcting for multiple tests across factors and interactions was that the study could have been done without the extra factors and therefore they are separate "families" of tests. Analysts want the "familywise" false-alarm rate to be 5%, for each family. If I understand correctly, in this post you are requiring the "experimentwise" false-alarm rate to be 5% (combining all families). I am NOT defending the distinction between familywise and experimentwise, just bringing it up.

    Because the division from one study to the next can be arbitrary (i.e., when does a "study" end?) I think it might be more appropriate to use "career-wise" corrections for multiple tests: We should correct all the tests done by a researcher in his/her career, which presumably is several thousand tests.

    Also, the corrections are supposed to apply to the full set of tests that one INTENDS to do, whether or not one actually reports the tests, and whether or not the tests are possible in principle but uninteresting. That would include all the comparisons and contrasts one might be interested in, which could be dozens more. (Again, just raising the issue, not defending a position.) Exactly which correction to use depends on the set of intended tests and the specific structural relation of the tests.

    1. Thanks for the comment. The article by Bender & Lange (2001) provides a really good discussion of the difference between family-wise and experiment-wise error control (including a nice example of a distinction between primary outcomes and secondary outcomes, both being corrected independently).

      I expected the career-wise error correction joke - tried to find an official reference for it, but I couldn't find one (except a chapter by Dienes). It's too extreme. You can debate what a family is, and as long as you justify and clearly communicate the decision, you'll be fine.

      The correction is not needed for the tests you intend to do. At least, I've not seen any paper arguing this. Again, the common view is that the tests examine a single theory. As with any point where statistical inferences interact directly with theory, there will never be one-size fits all rules you can blindly follow, but you will need to think, argue, and try your best.

    2. But the set of intended tests is exactly what is at issue for the corrections. Perhaps it sounds better for the set of tests to be called "the tests implied by a theory" but it's still just the set of intended tests.

      For a researcher entertaining theory A, it's 13 (say) comparisons and contrasts that might be interesting and therefore the researcher intends to do those. For a researcher entertaining theory B, it's 17 comparisons and contrasts (some the same as in theory A) that might be interesting and therefore the researcher intends to do those instead.

      Intentions are central for other corrections too. In particular, corrections for optional stopping ("data peeking") are all about the intention to continue sampling data if an interim test fails. In optional stopping, the analyst is doing a series of multiple tests -- it's the multiple test situation again, but with a different structure. The only thing that defines the structure is the intention to stop when reaching "significance" or when patience expires.

      For me, THE key problem that drove me away from frequentism to Bayesian approaches is the issue of correcting for multiple comparisons and stopping intentions.

    3. John: It is entirely appropriate to see a concern with error probabilities as the key point distinguishing frequentist (error statistical) methods and Bayesian ones. That is too often forgotten or overlooked. However, it's a mistake to suppose a concern with error probabilities is a concern with "intentions" rather than a concern with probing severely and avoiding confirmation bias. To pick up on gambits that alter or even destroy error probability control is to pick up on something quite irrelevant to evidence for Bayesian updating, Bayes ratios, likelihood ratios. This follows from the Likelihood Principle. Unless the Bayesian adds a concern to ensure the output (be it a posterior or ratio) will frequently inform you of erroneous interpretations of data--in which case he or she has become an error statistician--the only way to save yourself from cheating is to invoke a prior which is supposed to save the day. But then the whole debate turns on competing beliefs rather than a matter of whether the data and method warrant the inference well or terribly. In other words, the work that the formal inference method was supposed to perform has to (hopefully) be performed by you, informally. Moreover, it entitles you to move from a statistical result to a substantive research hypothesis, which frequentist tests prohibit.

    4. Here is everything anyone needs to know about your vaunted frequentist error control:

      http://www.bayesianphilosophy.com/test-for-anti-bayesian-fanaticism-part-ii/

    5. Deborah: Concern for error probabilities is of course an earnest and important concern. We should all hope to control error rates in making decisions. My gripe is not with the concern, but with the impossibility of pinning it down. An error rate is defined in terms of the space of possible errors that could have been generated by counterfactual imaginary data, and those hypothetical data are defined by the sampling intentions, by which I mean the analyst's intended stopping criterion and intended tests. (Sorry to persist with the i-word, but for me the word accurately captures the true meaning of the situation because the word makes explicit the role of the analyst beyond what's in the data.) Because the error rate is inherently conditional on the intended stopping criterion and tests, "the" error rate for a single set of real data can vary dramatically depending the intentions. This fact is, of course, what motivates the blog post that we're commenting on. If a person wants to make a decision based on an error rate, then that person must work through the various error rates that arise from different stopping intentions and different testing intentions, and live with the vagaries of doing that. As Daniel said in his post, "My [Daniel's] main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. ... there are many reasonable ideas about what the ‘family’ of errors is you want to control." Indeed there are.

    6. It's important to understand the limitations of any approach - and the dependency on intentions is an important limitation. I prefer to teach people how to make these corrections (also when it comes to optional stopping, see Lakens, 2014), and I do think error control is really important, especially in more exploratory research lines. There are other options, and I always encourage people to explore these. But I also think the dependency on intentions is not too problematic in practice. We never take single studies as proof of much, and in lines of studies, the differences in intentions will wash out, just as differences in priors will wash out in Bayesian statistics, where the dependency on beliefs yields differences between researchers.

    7. "We never take single studies as proof of much, and in lines of studies, the differences in intentions will wash out (...)," that's an odd thing to say for someone whose statistical approach prohibits integration of evidence over studies.

    8. An exploratory test in follow up studies will become part of everyone's intention to test. So really, there's no big issue. Lines of research can be meta-analyzed, which have their own error rates.

    9. [Part 1 of 2]
      Hi again Daniel:

      I resonate with the desire to control errors especially in exploratory research. I agree that it's important to apprise researchers of these issues of false-alarm rates for different stopping and testing intentions, and I admire your efforts to do so! And I tried, very hard, to let this thread end with your previous comment, but alas herewith I have failed . :-)

      DL: "I also think the dependency on intentions is not too problematic in practice."
      That has not been my experience. I have reviewed so many manuscripts in which people are obviously prevaricating to rationalize why they've made only a few tests with the least severe correction for multiple tests, just so the "corrected" p values squeak under .05. In one manuscript I reviewed, a set of experimental conditions that obviously went together (they were run at the same time on the same subject pool) were split apart and labeled as Experiment 1 and Experiment 2 so that the multiple tests "within experiments" had lower corrected p values than if the conditions were all put together into a single set of tests. Later in that same manuscript, in the discussion, they tacked on some extra comparisons "across experiments" and used uncorrected p values for those. In general, for any study with more than a few conditions there are almost always more interesting comparisons and contrasts that could be done, yet people feign disinterest in those comparisons because testing them would inflate the p values of the tests they want to report with p<.05. Defenders of corrections might reply, "Well of course the corrections can be 'gamed' by unscrupulous researchers, but the corrections are not problematic when applied honestly." But even then the corrections remain problematic because honest researchers can be honestly interested in different comparisons and contrasts. Researcher A is honestly interested in testing ten comparisons because she has thought through a lot of interesting implications of the research design. Researcher B, looking at the very same data, is honestly interested in testing only three of those comparisons (perhaps because he hasn't thought about it deeply enough). For those three comparisons, the p values of Researcher B are honestly lower than the p values of Researcher A, even though the data and three comparisons are identical.
      Maybe that apparent clash of conclusions is not problematic, by definition, because it properly takes into account the set of intended tests. But for me it remains a problem in practice.

      [continues in next comment]

    10. [Part 2 of 2]


      DL: "... differences in intentions will wash out, just as differences in priors will wash out in Bayesian statistics, where the dependency on beliefs yields differences between researchers."
      An analogy, between (a) intentions for sampling distributions and (b) prior beliefs for Bayesian inference, is tempting but extremely misleading. (I don't know to what extent you intended to make the analogy, but many people ask me about this so I couldn't let it rest.) Consider the following simple example. There are two professional baseball players, A who is a pitcher and B who is a catcher. In a particular season A had 4 hits in 25 at-bats while B had 6 hits in 30 at-bats. Question: Are the batting abilities of A and B "really" different? By the way, we are also interested in testing many dozens of comparisons of other players. The Bayesian analysis uses the prior knowledge that A is a pitcher and B is catcher and concludes that the two players almost certainly have different batting abilities, because professional pitchers and catchers generally have very different batting abilities and it would take a lot more data to budge us away from the prior knowledge. On the contrary, the frequentist analysis "corrects" for the dozens of other intended tests and concludes, with p>>.05, that the batting abilities are not significantly different. Infinite sample size eventually makes the conclusions of the two methods converge, but in the mean time there are real differences between sampling intentions and prior knowledge.

      (Again, sorry to perseverate on this so long... It just truly is the topic that pushed me over the brink into Bayesian approaches...)

    11. No need to apologize! These discussions go to the core of the problem, and it is important to address them. With respect to the first post - researchers clearly did not pre-register. This leaves open selective reporting, flexibility in the data-analysis, etc. These researchers will try to game any threshold - and yes, you are right, approaches that don't require a threshold are more likely to get people to just report the data. Except pre-registration and a changing reward system, I see no solution.

      With respect to the second point, I meant that an exploratory analysis in study 1 will be a confirmatory test in a follow up study. If a finding looks promising, over studies, it will be everyone's intention to test it. With huge numbers of tests, Holm's correction is not workable. More modern approaches that correct for false discovery rates might be preferable. I also would not know how other approaches (assuming you have very little prior info) would fare better. Obviously, if you have very usable priors, Bayesian statistics are always outperforming Frequentist approaches.

  4. This comment has been removed by the author.

  5. Hi Daniel, have you ever looked at the Hochberg step-up procedure? looking at all tests starting with the largest p value

    see http://www.stat.osu.edu/~jch/PDF/HuangHsuPreprint.pdf

    I use this in multiple pair-wise testing if that's on any interest see https://github.com/CPernet/Robust_Statistical_Toolbox/blob/master/stats_functions/univariate/rst_multicompare.m (same as R Wilcox R function)

    1. I did! The mutoss package has a huge number of tests. The method is perfectly fine, just has a small extra assumption of stochastic independence if I remember correctly. There is some improvement in power, but it's really 3 digits after the separator, and since my motto is to explain the 20% that give an 80% improvement in inferences, and the post was quite long, I left it out.

  6. Just a side note: the initial sentence that there are 7 hypotheses to test in a 2x2x2 design (3 mains, 3 two-way interactions and 1 three-way interaction) is in tune with what seems to be the standard approach.

    First, there are obviously many more tests you could run (x1 = 1, x1 = 2*x2, x1*x2*x3=1, x1 = log(x2), etc), so in this sense the problem is even worse than you make it seem. Second, and I think this is covered in some of the other comments, one could argue that the only tests to run are the ones about which you have formulated clear hypotheses. Then the problem is a bit less serious than suggested in the example.

    Anyhow, my point is that the standard approach to always test for all main and interaction effects (against 0) is equally weird as the reluctance to adapt your alpha.

    1. Indeed, the 7 tests is completely reasoned from the output SPSS would provide, but it's a lower bound.

      I agree testing for all these effects is perhaps weird, but I don't think it's uncommon to add all factors in experiments, and interpret the outcome based on what is significant without correcting for multiple tests. If this post motivates researchers to more carefully choose their tests, that would be great.

  7. Thanks to a contribution by Frederik Aust, the latest version of my afex package (0.15-2) allows you to specify this type of correction directly in the call to the ANOVA function. For example:


    require(afex)
    data(obk.long)
    # with correction:
    aov_ez("id", "value", obk.long, between = "treatment", within = c("phase", "hour"), anova_table = list(p.adjust.method = "holm"))

    # without correction:
    aov_ez("id", "value", obk.long, between = "treatment", within = c("phase", "hour"))

  8. 1. I touched upon this topic in my blog:

    http://simkovic.github.io/2014/04/20/No-Way-Anova---Interactions-need-more-power.html

    Also see Maxwell (2004) referenced in my blog who wrote about these problems and should have been cited by Cramer et al.

    As noted in my blog and ignored in your blog, the issue is not only that multiple testing inflates the error rate but there is bias towards interactions/main effects depending on the sample size.

    2. Bonferroni-Holm: As always with multiple comparisons the solution is to use hierarchical modeling:

    Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5(2), 189-211.

    3. Exploratory Anova is a contradiction. Exploratory = no hypotheses, so why is everyone doing hypothesis testing? Do estimation and you are ok. There is a large literature that criticizes the use of omnibus Anova and many researchers (incl. Geoff Cumming) recommend contrasts instead of Anova.

    4. To do estimation properly you need to abandon the block design and use continuous IVs. Then you can do regression and avoid the discussed problems by using hierarchical priors on the regression coefficients. Another disadvantage with 2^n design is that it does not allow you to infer functional relationship between the IVs and DV. This allows the researcher to twist the result in any way by assuming a functional relationship such that the result confirms his/her theory. With regression design any such assumptions can be directly tested.

    5. I wish you and your blog a productive 2016 :)

    1. Thanks for the references and link to your blog! I'm still getting used to the behavior of ANOVA's in simulations - will try to replicate your simulations, looking at power is a near future goal! Have a good 2016 as well!

  9. Have you accounted for possible dependencies between the measures?

    I'm kinda nitpicking, but it's important sometimes.

    When you have, say two slightly different, but highly correlated measures, x2 correction of the p-value is too much.

    In your example above, the distortion is only small, as the between measures correlations aren't large. But it's good to be aware of it.

    You might find out by simulating. Which is actually how it was argued before by Weber

    1. Weber who? Do you have a reference? Assumptions of correlations are always tricky. If you have good estimates, yes, error rates depend on the correlation. Difference is not huge, but it matters a little bit.

    2. http://www-stat.wharton.upenn.edu/~steele/Courses/956/Resource/MultipleComparision/Simes86pdf.pdf

    3. You do not need any correlation assumptions for this method.

      For 2 measures, you essentially do:
      min ( (2min (p1,p2)) , (max (p1,p2)) )

      The correlations will not matter in this case. Unless you want to assume independence and multiply the measures.

      Technically, bonferroni is proven mathematically in an ironclad way. This measure could very theoretically get wrong. But to get wrong it requires extremely weird conditions etc. so it is good for every practical purpose.

      I love this measure, because I happened to work it out on my own in the late 90s but did not know it was already solved. So I have a little warm corner for it in my heart.

      There are more papers on it, I think there is a paper somewhere by Sergiu Hart. Should be called SIMes method
