The 20% Statistician: Error Control in Exploratory ANOVA's: The How and the Why

Friday, January 1, 2016

In a 2x2x2 design, there are three main effects, three two-way interactions, and one three-way interaction to test. That’s 7 statistical tests. If each is tested at an alpha of 0.05 and none of the effects is real, the probability of making at least one Type 1 error in a single ANOVA is 1-(0.95)^7 = 30%.

There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (Cramer et al., 2014; put it on your reading list, if you haven't read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVAs in R, and repeat why it’s necessary if you don’t want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn’t thought about this issue before now, and it will reduce the number of p < 0.05 results you’ll find in exploratory ANOVAs.

Simulating Type 1 errors in 3-way ANOVAs

Let’s simulate 250,000 2x2x2 ANOVAs where all factors are manipulated between individuals, with 50 participants in each condition, and without any true effect (all group means are equal). The R code is at the bottom of this page. We store the p-values of the 7 tests. The total p-value distribution has the by now familiar uniform shape we see if the null hypothesis is true.
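A scaled-down sketch of such a simulation is given here. It is not the original script from the bottom of the page: it runs far fewer ANOVAs so that it finishes in seconds, and names such as `n_sims`, `design`, and `sim_once` are illustrative assumptions.

```r
# Minimal sketch: simulate 2x2x2 between-subjects ANOVAs under the null.
# n_sims is kept small for speed; the post itself uses 250000 simulations.
set.seed(2016)
n_sims     <- 500
n_per_cell <- 50

# One row per participant: 8 cells x 50 participants = 400 rows.
cells  <- expand.grid(a = factor(1:2), b = factor(1:2), c = factor(1:2))
design <- cells[rep(seq_len(nrow(cells)), each = n_per_cell), ]

sim_once <- function() {
  # All group means are equal, so every effect is truly null.
  design$y <- rnorm(nrow(design))
  fit <- aov(y ~ a * b * c, data = design)
  # Rows 1-7 of the ANOVA table are the 3 main effects, the 3 two-way
  # interactions, and the three-way interaction; row 8 is the residual.
  summary(fit)[[1]][["Pr(>F)"]][1:7]
}

# One row per simulated ANOVA, one column per test.
p_mat <- t(replicate(n_sims, sim_once()))

per_test_rate   <- colMeans(p_mat < 0.05)           # each should be near 0.05
familywise_rate <- mean(rowSums(p_mat < 0.05) > 0)  # should be near 0.30
print(round(per_test_rate, 3))
print(familywise_rate)
```

With only 500 simulated ANOVAs the estimates are noisy; increasing `n_sims` brings them closer to 5% per test and 30% per ANOVA.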

If we count the number of significant findings (even though there is no real effect), we see that from 250,000 2x2x2 ANOVAs, approximately 87,500 p-values were smaller than 0.05 (the left-most bar in the Figure). This equals 250,000 ANOVAs x 7 tests x a 0.05 Type 1 error rate. If we split up the p-values for each of the 7 tests, we see in the table below that, as expected, each test has its own 5% error rate, which together add up to a 30% error rate due to multiple testing (i.e., the probability of not making a Type 1 error is 0.95*0.95*0.95*0.95*0.95*0.95*0.95, and the probability of making at least one Type 1 error is 1 minus this number). With a 2x2x2x2 ANOVA, which has 15 tests, the probability of making at least one Type 1 error reaches a massive 54%, making you about as accurate as a scientist as a coin-flipping toddler.
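The arithmetic above is easy to check in R. The helper names `n_tests` and `familywise_error` are made up for this sketch, and the formula assumes the tests are independent and that every null hypothesis is true.

```r
# Number of tests (main effects plus all interactions) in a full
# factorial ANOVA with k factors: 2^k - 1.
n_tests <- function(k) 2^k - 1

# Probability of at least one Type 1 error across m independent null tests.
familywise_error <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

familywise_error(n_tests(3))   # 2x2x2:   7 tests, about 0.30
familywise_error(n_tests(4))   # 2x2x2x2: 15 tests, about 0.54

# Expected number of p < 0.05 across 250000 null ANOVAs with 7 tests each.
250000 * 7 * 0.05              # 87500
```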

Let’s fix this. We need to adjust the error rate. The Bonferroni correction (divide your alpha level by the number of tests, so for 7 tests and alpha = 0.05 use 0.05/7 = 0.007 for each test) communicates the basic idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g., economics, gene discovery) work on optimal Type 1 error control procedures continues. I’ve used the mutoss package in R in my simulations to check a wide range of corrections, and came to the conclusion that unless the number of tests is huge, we don’t need anything fancier than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I'm wrong in the comments!). It orders the p-values from lowest to highest, and tests them sequentially against an increasingly lenient alpha level (0.05/7 for the smallest p-value, 0.05/6 for the next, and so on), stopping at the first p-value that is not significant. If you prefer a spreadsheet, go here.
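To make the sequential procedure concrete, here is a minimal sketch of the Holm-Bonferroni step-down rule in base R. The function name `holm_reject` and the example p-values are invented for illustration; in practice `p.adjust(p, method = "holm")` from base R does the same job.

```r
# Holm-Bonferroni as a step-down procedure on alpha levels.
# Returns TRUE for each hypothesis that is rejected.
holm_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  o <- order(p)                               # indices from smallest to largest p
  thresholds <- alpha / (m - seq_len(m) + 1)  # alpha/m, alpha/(m-1), ..., alpha/1
  passed <- p[o] <= thresholds
  # Count the leading run of passes: testing stops at the first failure.
  n_reject <- sum(cumprod(passed))
  reject <- logical(m)
  reject[o[seq_len(n_reject)]] <- TRUE
  reject
}

# Hypothetical p-values for the 7 tests of a 2x2x2 ANOVA.
p <- c(0.001, 0.012, 0.020, 0.040, 0.300, 0.600, 0.900)
holm_reject(p)

# The same decisions follow from comparing Holm-adjusted p-values to alpha.
p.adjust(p, method = "holm") <= 0.05
```

In this example the second-smallest p-value (0.012) is larger than 0.05/6, so testing stops there and only the first effect is declared significant.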

In a 2x2x2 ANOVA, we can test three main effects, three 2-way interactions, and one 3-way interaction. The table below shows that the error rate for each of these 7 tests is 5% (for a total of 1-0.95^7 = 30%), but after the Holm-Bonferroni correction the Type 1 error rate is nicely controlled.
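The effect of the correction on the error rate per ANOVA can be checked with a quick shortcut simulation. Instead of fitting ANOVAs, this sketch assumes that the 7 p-values under the null are independent and uniformly distributed, which is an approximation for illustration and not the simulation used for the table.

```r
set.seed(1)
n_sims  <- 50000
n_tests <- 7
alpha   <- 0.05

# Assumption: under the null, the 7 p-values are independent and uniform.
p_mat <- matrix(runif(n_sims * n_tests), ncol = n_tests)

# Holm-adjust within each simulated ANOVA (each row is one family of 7 tests).
p_holm <- t(apply(p_mat, 1, p.adjust, method = "holm"))

# Proportion of simulated ANOVAs with at least one significant test.
fwer_raw  <- mean(rowSums(p_mat  < alpha) > 0)   # close to 0.30
fwer_holm <- mean(rowSums(p_holm < alpha) > 0)   # close to 0.05
print(c(uncorrected = fwer_raw, holm = fwer_holm))
```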

However, another challenge is to not let Type 1 error control increase the Type 2 errors too much. To examine this, I’ve simulated 2x2x2 ANOVA’s where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (a main effect, two 2-way interactions, and the 3-way interaction).

Let’s first look at the p-value distribution. I’ve added a horizontal and vertical line. The horizontal line indicates the null-distribution caused by the four null-effects. The vertical line indicates the significance level of 0.05. The two lines create four quarters. Top left are the true positives, bottom left are the false positives, top right are the false negatives (not significant due to a lack of power) and the bottom right are the true negatives.

Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).

We see a substantial drop in the left-most column, and this drop is larger than the part of that bar’s height that was due to false positives, so some true positives are lost as well. We also see a peculiarly high bar on the right, caused by the Holm correction adjusting a large number of p-values to 1. We can see this drop in power in the Table below as well. It’s substantial: from 87% power to 68% power.
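Adjusting the p-values rather than the alpha level is what base R's `p.adjust` does. The sketch below uses made-up p-values (not output from the simulation) to show how the adjusted values compare with Bonferroni and why large p-values pile up at 1.

```r
# Hypothetical p-values from one 2x2x2 ANOVA (7 tests), for illustration only.
p_raw <- c(A = 0.001, B = 0.012, C = 0.020, AB = 0.040,
           AC = 0.300, BC = 0.600, ABC = 0.900)

p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Adjusted p-values are compared against the unchanged alpha of 0.05.
print(data.frame(p_raw, p_holm, p_bonf, sig_holm = p_holm < 0.05))
```

Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones, and both are capped at 1, which is where the tall right-most bar in a histogram of adjusted p-values comes from.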

If you perform a 2x2x2 ANOVA, we might expect you are not really interested in the main effects (if you were, a simple t-test would have sufficed). The power cost is already much lower if the exploratory analysis focusses on only four tests: the three 2-way interactions and the 3-way interaction (see the third row in the Table below). Even exploratory 2x2x2 ANOVAs are typically not 100% exploratory. If so, preregistering the subset of all tests you are interested in, and controlling the error rate for this subset of tests, provides an important boost in power.
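A small sketch of why a smaller family helps: the same interaction p-values are adjusted less severely when only the four interactions are in the family. The p-values are invented for illustration, and the choice of family would have to be made (and ideally preregistered) before seeing the data.

```r
# Hypothetical p-values for all 7 tests of a 2x2x2 ANOVA.
p_raw <- c(A = 0.30, B = 0.004, C = 0.45, AB = 0.011,
           AC = 0.20, BC = 0.60, ABC = 0.015)
interactions <- c("AB", "AC", "BC", "ABC")

# Family of 7 tests: adjust everything, then look at the interactions.
holm_all7  <- p.adjust(p_raw, method = "holm")[interactions]

# Family of 4 tests: adjust only the interactions of interest.
holm_only4 <- p.adjust(p_raw[interactions], method = "holm")

print(rbind(holm_all7, holm_only4))
```

With these made-up values, AB and ABC fall below 0.05 only when the family is restricted to the four interactions.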

Oh come on you silly methodological fetishist!

If you think Type 1 error control should not endanger the discovery of true effects, here’s what you should not do. You should not wave your hands at controlling Type 1 error rates, saying it is ‘methodological fetishism’ (Ellemers, 2013). It ain’t gonna work. If you choose to report p-values (by all means, don’t), and want to do quantitative science (by all means, don’t), then the formal logic you are following (even if you don’t realize this) is the Neyman-Pearson approach. It allows you to say: ‘In the long run, I’m not saying there’s something, when there is nothing, more than X% of the time’. If you don’t control error rates, your epistemic foundation of making statements reduces to ‘In the long run, I’m not saying there’s something, when there is nothing, more than … uhm … nice weather for the time of the year, isn’t it?’.

Now just because you need to control error rates, doesn’t mean you need to use a Type 1 error rate of 5%. If you plan to replicate any effect you find in an exploratory study, and you set the alpha to 0.2, the probability of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting your overall Type 1 error across two studies to 0.2, and then using an alpha of 0.05 for each of the 4 tests might be a good idea. If some effects would be costlier to miss, but others less costly, you can use an alpha of 0.08 for two effects, and an alpha of 0.02 for the other two, which still adds up to 0.2. This is just one example. It’s your party. You can easily pre-register the choices you make to the OSF or AsPredicted to transparently communicate them.
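The numbers in this example can be written out as a short sketch. It assumes the original study and the replication are independent, and it treats the split of the overall alpha across tests as a simple Bonferroni-style sum; the variable names are illustrative.

```r
alpha_explore <- 0.2

# Probability that a null effect is significant in both the original study
# and an independent replication, each tested at alpha = 0.2.
alpha_explore^2                                   # 0.04

# Two ways to split an overall alpha of 0.2 across four tests.
equal_split   <- rep(alpha_explore / 4, 4)        # 0.05 for each test
unequal_split <- c(0.08, 0.08, 0.02, 0.02)        # more alpha where a miss is costlier

# Both allocations add up to the same overall 0.2.
sum(equal_split)
sum(unequal_split)
```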

You can also throw error control out of the window. There are approximately 1,950,000 hits in Google Scholar when I search for ‘An Exploratory Analysis Of’. Put these words in the title, list all your DVs in the main text (e.g., in a table), add Bayesian statistics and effect sizes with their confidence intervals, and don’t draw strong conclusions (Bender & Lange, 2001).

Obviously, the tricky thing is always what to do if your prediction was not confirmed. I think you enter a Lakatosian degenerative research line (as opposed to the progressive research line you’d be in if your predictions were confirmed). With some luck, there’s an easy fix. The same study, but with a larger sample (or, if you designed a study using sequential analyses, simply continuing the data collection after the first look at the data; Lakens, 2014), might get you back in a progressive research line after an update in the predicted effect size. Or try again, with a better manipulation or dependent variable. Giving up on a research idea after a single failed confirmation is not how science works, in general. Statistical inferences tell you how to interpret the data without fooling yourself. Type 1 error control matters, and in most psychology experiments it is relatively easy to do. But it’s only one aspect of the things you take into account when you decide which research you want to do.

My main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study which you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot of room between 100% exploratory and 100% confirmatory research, and there are many reasonable ideas about what the ‘family’ of errors is you want to control (for a good discussion on this, see Bender & Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed, and which error rate is controlled for”. Clear words, no hand waving.

Thanks to @RogierK for correcting an error in an earlier version of this blog post.

Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416

Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8. http://doi.org/10.1002/ejsp.1932

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023

29 comments:

Unknown, January 1, 2016 at 12:41 PM
Thank you. I knew it was false to ignore the problem of multiple testing in ANOVAs but I didn't know what to do when you are interested in some, but not all, of the interactions.
However, I don't quite understand what you mean by "don't do quantitative science"?

Anonymous, January 1, 2016 at 1:03 PM
I also want to say "thank you". I really admire, and appreciate researchers' efforts to improve practices, and i view writing blogs like this as a way to achieve this.

Even though i'm often not smart enough to follow every detail and nuance in your posts, i still think they are useful for me in providing a warning signal: hey, this might be important stuff, try and learn more about it and/or get some help from a statistician when trying to deal with the issue.

I look forward to reading your posts in this new year, and thanks again for all your efforts!

John K. Kruschke, January 1, 2016 at 8:10 PM
Hi.

I thought the usual reason for not correcting for multiple tests across factors and interactions was that the study could have been done without the extra factors and therefore they are separate "families" of tests. Analysts want the "familywise" false-alarm rate to be 5%, for each family. If I understand correctly, in this post you are requiring the "experimentwise" false-alarm rate to be 5% (combining all families). I am NOT defending the distinction between familywise and experimentwise, just bringing it up.

Because the division from one study to the next can be arbitrary (i.e., when does a "study" end?) I think it might be more appropriate to use "career-wise" corrections for multiple tests: We should correct all the tests done by a researcher in his/her career, which presumably is several thousand tests.

Also, the corrections are supposed to apply to the full set of tests that one INTENDS to do, whether or not one actually reports the tests, and whether or not the tests are possible in principle but uninteresting. That would include all the comparisons and contrasts one might be interested in, which could be dozens more. (Again, just raising the issue, not defending a position.) Exactly which correction to use depends on the set of intended tests and the specific structural relation of the tests.

Dr Cyril Pernet, January 2, 2016 at 10:04 AM
Hi Daniel, have you ever looked at the Hochberg step-up procedure? looking at all tests starting with the largest p value

see http://www.stat.osu.edu/~jch/PDF/HuangHsuPreprint.pdf

I use this in multiple pair-wise testing, if that's of any interest; see https://github.com/CPernet/Robust_Statistical_Toolbox/blob/master/stats_functions/univariate/rst_multicompare.m (same as R Wilcox R function)

Chris_en_Jip, January 2, 2016 at 1:59 PM
Just a side note: the initial sentence that there are 7 hypotheses to test in a 2x2x2 design (3 mains, 3 two-way interactions and 1 three-way interaction) is in tune with what seems to be the standard approach.

First, there are obviously many more tests you could run (x1 = 1, x1 = 2*x2, x1*x2*x3=1, x1 = log(x2), etc), so in this sense the problem is even worse than you make it seem. Second, and I think this is covered in some of the other comments, one could argue that the only tests to run are the ones about which you have formulated clear hypotheses. Then the problem is a bit less serious than suggested in the example.

Anyhow, my point is that the standard approach to always test for all main and interaction effects (against 0) is equally weird as the reluctance to adapt your alpha.

Henrik Singmann, January 3, 2016 at 3:59 PM
Thanks to a contribution by Frederik Aust, the latest version of my afex package (0.15-2) allows you to specify this type of correction directly in the call to the ANOVA function. For example:

require(afex)
data(obk.long)
# with correction:
aov_ez("id", "value", obk.long, between = "treatment", within = c("phase", "hour"), anova_table = list(p.adjust.method = "holm"))

# without correction:
aov_ez("id", "value", obk.long, between = "treatment", within = c("phase", "hour"))

matus, January 5, 2016 at 4:13 PM
1. I touched upon this topic in my blog:

http://simkovic.github.io/2014/04/20/No-Way-Anova---Interactions-need-more-power.html

Also see Maxwell (2004) referenced in my blog who wrote about these problems and should have been cited by Cramer et al.

As noted in my blog and ignored in your blog, the issue is not only that multiple testing inflates the error rate but there is bias towards interactions/main effects depending on the sample size.

2. Bonferroni-Holm: As always with multiple comparisons the solution is to use hierarchical modeling:

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5(2), 189-211.

3. Exploratory Anova is a contradiction. Exploratory = no hypotheses, so why is everyone doing hypothesis testing? Do estimation and you are ok. There is a large literature that criticizes the use of omnibus Anova and many researchers (incl. Geoff Cumming) recommend contrasts instead of Anova.

4. To do estimation properly you need to abandon the block design and use continuous IVs. Then you can do regression and avoid the discussed problems by using hierarchical priors on the regression coefficients. Another disadvantage with 2^n design is that it does not allow you to infer functional relationship between the IVs and DV. This allows the researcher to twist the result in any way by assuming a functional relationship such that the result confirms his/her theory. With regression design any such assumptions can be directly tested.

5. I wish you and your blog a productive 2016 :)

Jazi Zilber, October 11, 2016 at 1:47 PM
Have you accounted for possible dependencies between the measures?

I'm kinda nitpicking, but it's important sometimes.

When you have, say two slightly different, but highly correlated measures, x2 correction of the p-value is too much.

In your example above, the distortion is only small, as the between measures correlations aren't large. But it's good to be aware of it.

You might find out by simulating. Which is actually how it was argued before by Weber