There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (put it on your reading list, if you haven't read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVAs in R, and repeat why it’s necessary if you don’t want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn’t thought about this issue before now, and it will reduce the number of p < 0.05 results you’ll find in exploratory ANOVAs.
Simulating Type 1 errors in 3-way ANOVAs
Let’s simulate 250,000 2x2x2 ANOVAs where all
factors are manipulated between individuals, with 50 participants in each
condition, and without any true effect (all group means are equal).
The R code is at the bottom of this page. We store the p-values of the 7 tests.
The total p-value distribution has
the by now familiar uniform shape we see if the null hypothesis is true.
If we count the number of significant findings (even though there is no real effect), we see that
from 250,000 2x2x2 ANOVAs, approximately 87,500 p-values were smaller than 0.05
(the leftmost bar in the Figure). This equals 250,000 ANOVAs x 0.05 Type 1
errors x 7 tests. If we split up the p-values
for each of the 7 tests, we see in the table below that, as expected, each test has its own 5%
error rate, which together add up to a 30% error rate due to multiple testing (i.e., the probability of not making any Type 1 error is 0.95^7 = 0.70, and the probability of making at least one Type 1 error is 1 minus this number). With a 2x2x2x2 ANOVA and its 15 tests, the probability of making at least one Type 1 error reaches a massive 54%, making you about as accurate as a scientist as a coin-flipping toddler.
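The arithmetic behind the 30% and 54% can be checked in a few lines of R. This is a minimal sketch that assumes the tests are independent (only approximately true in an ANOVA, where all tests share one error term); the function names are my own and not from any package.

```r
# Familywise Type 1 error rate for k independent tests at a given alpha:
# 1 - (1 - alpha)^k. The function names here are illustrative only.
familywise_error <- function(k, alpha = 0.05) {
  1 - (1 - alpha)^k
}

# A full factorial ANOVA with f two-level factors tests 2^f - 1 effects
n_tests <- function(f) 2^f - 1

fwer_3way <- familywise_error(n_tests(3))  # 7 tests in a 2x2x2 ANOVA
fwer_4way <- familywise_error(n_tests(4))  # 15 tests in a 2x2x2x2 ANOVA
round(c(fwer_3way, fwer_4way), 3)
```

Rounded to whole percentages, these are the 30% and 54% mentioned above.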
Let’s fix this. We need to adjust the error rate. The
Bonferroni correction (divide your alpha level by the number of tests, so for 7
tests and alpha = 0.05 use 0.05/7 = 0.007 for each test) communicates the basic
idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g.,
economics, gene discovery) work on optimal Type 1 error control procedures
continues. I’ve used the mutoss package in R in my simulations to check a wide
range of corrections, and came to the conclusion that unless the number of tests is huge, we don’t need anything more
fancy than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I'm wrong in the comments!). It orders
p-values from lowest to highest, and
tests them sequentially against an increasingly more lenient alpha level. If
you prefer a spreadsheet, go
here.
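To make the sequential logic concrete, here is a small sketch of the Holm-Bonferroni procedure written out by hand. The helper holm_reject() and the p-values are made up for illustration; in practice the built-in p.adjust() function does the same job.

```r
# Holm-Bonferroni step-down procedure written out by hand.
# holm_reject() is an illustrative helper, not a function from any package.
holm_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                             # indices from smallest to largest p
  thresholds <- alpha / (m - seq_len(m) + 1)  # alpha/m, alpha/(m-1), ..., alpha
  passed <- p[ord] <= thresholds
  # Stop at the first p-value that fails; it and all larger ones are not rejected
  n_reject <- if (all(passed)) m else which(!passed)[1] - 1
  reject <- logical(m)
  reject[ord[seq_len(n_reject)]] <- TRUE
  reject
}

# Made-up p-values for the 7 tests of a 2x2x2 ANOVA
p <- c(0.001, 0.008, 0.012, 0.030, 0.200, 0.450, 0.700)
holm_reject(p)

# For comparison: plain Bonferroni tests every p-value against alpha / 7
p <= 0.05 / length(p)
```

With these made-up values Holm rejects the two smallest p-values, while plain Bonferroni rejects only the smallest, which is the sense in which Holm is slightly better.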
In a 2x2x2 ANOVA, we can test three main effects, three
2-way interactions, and one 3-way interaction. The table below shows the error
rate for each of these 7 tests is 5% (for a total of 1-0.95^7=30%) but after the
Holm-Bonferroni correction, the Type 1 error rate is nicely controlled.
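For readers who want to try this themselves, the following is a scaled-down sketch of such a null simulation, not the exact script used for the numbers in this post. It runs 1,000 ANOVAs instead of 250,000 to keep it quick, and the seed and object names are arbitrary choices.

```r
set.seed(1)
n_sims <- 1000       # far fewer than the 250000 in the post, to keep it quick
n_per_cell <- 50

# 2x2x2 between-subjects design with 50 participants per cell
design <- expand.grid(a = factor(1:2), b = factor(1:2), c = factor(1:2))
design <- design[rep(seq_len(nrow(design)), each = n_per_cell), ]

# One simulated study without any true effect: returns the 7 p-values
sim_p <- function() {
  design$y <- rnorm(nrow(design))  # all cell means are equal
  summary(aov(y ~ a * b * c, data = design))[[1]][["Pr(>F)"]][1:7]
}

p_raw  <- t(replicate(n_sims, sim_p()))                   # n_sims x 7
p_holm <- t(apply(p_raw, 1, p.adjust, method = "holm"))   # adjusted per study

per_test_rate <- colMeans(p_raw < 0.05)               # close to 0.05 for each test
fwer_raw  <- mean(apply(p_raw  < 0.05, 1, any))       # close to 0.30
fwer_holm <- mean(apply(p_holm < 0.05, 1, any))       # at or below about 0.05

round(per_test_rate, 3)
round(c(uncorrected = fwer_raw, holm = fwer_holm), 3)
```

The exact numbers vary with the seed and the number of simulations, but the pattern should match the table: 5% per test, roughly 30% per ANOVA without correction, and roughly 5% per ANOVA with the Holm correction.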
However, another challenge is to not let Type 1 error
control increase the Type 2 errors too much. To examine this, I’ve simulated
2x2x2 ANOVAs where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (a main effect, two 2-way interactions, and the 3-way interaction).
Let’s first look at the p-value
distribution. I’ve added a horizontal and vertical line. The horizontal line
indicates the null-distribution caused by the four null-effects. The vertical
line indicates the significance level of 0.05. The two lines create four
quarters. Top left are the true positives, bottom left are the false positives,
top right are the false negatives (not significant due to a lack of power) and
the bottom right are the true negatives.
Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).
We see a substantial drop in the leftmost bar, and this
drop is larger than the part of the bar's height that is due to false positives, so true positives are lost as well. We also see a peculiarly
high bar on the right, caused by the Holm correction adjusting a large number of
p-values to 1. We can see this drop in
power in the Table below as well. It’s substantial: from 87% power to 68%
power.
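Adjusting the p-values instead of the alpha level is what R's built-in p.adjust() function does. The sketch below uses made-up p-values and reproduces the Holm adjustment by hand, which also shows where the pile of adjusted p-values at 1 comes from.

```r
# Made-up p-values for the 7 tests of one 2x2x2 ANOVA
p <- c(0.001, 0.008, 0.012, 0.030, 0.350, 0.600, 0.900)
p_holm <- p.adjust(p, method = "holm")

# Same result by hand: multiply the i-th smallest p-value by (m - i + 1),
# keep the sequence non-decreasing with cummax(), and cap at 1
m <- length(p)
ord <- order(p)
by_hand <- numeric(m)
by_hand[ord] <- pmin(1, cummax((m - seq_len(m) + 1) * p[ord]))

rbind(raw = p, holm = p_holm, by_hand = by_hand)
```

Here the three largest p-values would end up above 1 after multiplication, so all three are set to exactly 1. Comparing the adjusted p-values against a fixed alpha of 0.05 gives the same decisions as comparing the raw p-values against the stepwise Holm thresholds.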
If you perform a 2x2x2 ANOVA, we might expect you are not
really interested in the main effects (if you were, a simple t-test would have sufficed). The power
cost is already much lower if the exploratory analysis focusses on only four
tests, the three 2-way interactions, and the 3-way interaction (see the third
row in the Table below). Even exploratory 2x2x2 ANOVAs are typically not 100%
exploratory. If so, preregistering the subset of all tests you are interested
in, and controlling the error rate for this subset of tests, provides an
important boost in power.
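A rough sketch of this power comparison is given below. It is not the simulation behind the table: the shift d = 0.6 in two cells is my own assumption (chosen to give uncorrected power somewhere around 85%), and only 1,000 ANOVAs are simulated. The two shifted cells differ only in factor a, so a, a:b, a:c and a:b:c are the four true effects.

```r
set.seed(2)
n_sims <- 1000
n_per_cell <- 50
d <- 0.6   # assumed cell-mean shift; not the value used in the post

design <- expand.grid(a = factor(1:2), b = factor(1:2), c = factor(1:2))
design <- design[rep(seq_len(nrow(design)), each = n_per_cell), ]
# One cell shifted up by d, one cell shifted down by d, all others at 0
mu <- with(design, ifelse(a == 1 & b == 1 & c == 1,  d,
                   ifelse(a == 2 & b == 1 & c == 1, -d, 0)))

effects <- c("a", "b", "c", "a:b", "a:c", "b:c", "a:b:c")  # order used by aov()
sim_p <- function() {
  design$y <- rnorm(nrow(design), mean = mu)
  setNames(summary(aov(y ~ a * b * c, data = design))[[1]][["Pr(>F)"]][1:7],
           effects)
}
p_raw <- t(replicate(n_sims, sim_p()))

interactions <- c("a:b", "a:c", "b:c", "a:b:c")  # the preregistered subset
true_int     <- c("a:b", "a:c", "a:b:c")         # true effects inside that subset

# Holm correction over all 7 tests versus over the 4 interactions only
p_holm7 <- t(apply(p_raw, 1, p.adjust, method = "holm"))
p_holm4 <- t(apply(p_raw[, interactions], 1, p.adjust, method = "holm"))
colnames(p_holm7) <- effects
colnames(p_holm4) <- interactions

# Average power for the three true interaction effects
power_raw   <- mean(p_raw[, true_int]   < 0.05)
power_holm7 <- mean(p_holm7[, true_int] < 0.05)
power_holm4 <- mean(p_holm4[, true_int] < 0.05)
round(c(uncorrected = power_raw, holm_7 = power_holm7, holm_4 = power_holm4), 3)
```

Power after correcting for four tests should fall between the uncorrected power and the power after correcting for all seven tests. How large the gap is depends on the assumed effect size and sample size.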
Oh come on you silly methodological fetishist!
If you think Type 1 error control should not endanger the
discovery of true effects, here’s what you should not do. You should not wave
your hands at controlling Type 1 error rates, saying it is ‘methodological
fetishism’ (Ellemers,
2013).
It ain’t gonna work.
If you choose to report p-values (by
all means, don’t), and want to do quantitative science (by all means, don’t)
then the formal logic you are following (even if you don’t realize this) is the
Neyman-Pearson approach. It allows you to say: ‘In the long run, I’m not saying
there’s something, when there is nothing, more than X% of the time’. If you don’t
control error rates, your epistemic foundation of making statements reduces to ‘In
the long run, I’m not saying there’s something, when there is nothing, more
than … uhm … nice weather for the time of the year, isn’t it?’.
Now just because you need to control error rates, doesn’t
mean you need to use a Type 1 error rate of 5%. If you plan to replicate any
effect you find in an exploratory study, and you set the alpha to 0.2, the probability
of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to
explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting your overall Type
1 error across two studies to 0.2, and then using an alpha of 0.05 for each of
the 4 tests might be a good idea. If some effects would be costlier to miss,
but others less costly, you can use an alpha of 0.08 for two effects, and an
alpha of 0.02 for the other two. This is just one example. It’s your party. You
can easily pre-register the choices you make to the OSF
or AsPredicted to transparently
communicate them.
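The bookkeeping for such an alpha budget is simple enough to check in R. This sketch assumes the replication is independent of the exploratory study, and that the unequal split is meant to add up to the same budget of 0.2 (0.08 + 0.08 + 0.02 + 0.02); the variable names are mine.

```r
alpha_explore   <- 0.2
alpha_replicate <- 0.2

# Probability that a null effect is significant in both the exploratory
# study and an independent replication
two_in_a_row <- alpha_explore * alpha_replicate

# Bonferroni-style split of a 0.2 budget over the 4 interaction tests
equal_split   <- rep(alpha_explore / 4, 4)    # 0.05 for each test
unequal_split <- c(0.08, 0.08, 0.02, 0.02)    # assumed unequal weights

two_in_a_row
sum(equal_split)
sum(unequal_split)
```

As long as the alpha levels of the individual tests sum to the budget, the probability of at least one Type 1 error across those tests cannot exceed it, however you divide it.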
You can also throw error control out of the window. There
are approximately 1,950,000 hits in Google Scholar when I search for ‘An
Exploratory Analysis Of’. Put these words in the title, list all your DVs
in the main text (e.g., in a table), add Bayesian statistics and effect sizes
with their confidence intervals, and don’t draw strong conclusions (Bender
& Lange, 2001).
Obviously, the tricky thing is always what to do if your
prediction was not confirmed. I think you enter a Lakatosian degenerative
research line (as opposed to the progressive research line you’d be in if your
predictions were confirmed). With some luck, there’s an easy fix. The same
study, but using a larger sample (or, if you designed a study using sequential analyses, simply continuing
the data collection after the first look at the data, Lakens,
2014), might get you back in a progressive research
line after an update in the predicted effect size.
Try again, with a better manipulation or dependent variable. Giving up on a
research idea after a single failed confirmation is not how science works, in general.
Statistical inferences tell you how to interpret the data without fooling
yourself. Type 1 error control matters, and in most psychology experiments, is
relatively easy to do. But it’s only one aspect of the things you take into
account when you decide which research you want to do.
My main point here is that there are many possible solutions,
and all you have to do is choose one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study which you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot
of room between 100% exploratory and 100% confirmatory research, and there are
many reasonable ideas about what the ‘family’ of errors is you want to control (for a good discussion on this, see Bender &
Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly
be stated why and how the chosen analyses are performed, and which error rate
is controlled for”. Clear words, no hand waving.
Bender, R.,
& Lange, S. (2001). Adjusting for multiple testing—when and how? Journal
of Clinical Epidemiology, 54(4), 343–349.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416
Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal
the big picture in social psychology (and why we should do this).
European Journal of Social Psychology, 43(1),
1–8. http://doi.org/10.1002/ejsp.1932
Lakens, D. (2014). Performing high-powered studies efficiently with
sequential analyses. European Journal of Social
Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023



