The main goal in a Neyman-Pearson approach is to develop a procedure that will guide behavior, without being wrong too often. The long-run error rate of the decisions you make (based on the p-values you calculate) is easily controlled at a specific alpha level when only a single statistical test is performed. When multiple tests are performed, one can’t simply use the overall alpha level for all performed tests. Although there is some misguided discussion in the literature about whether error rates should be controlled when making multiple comparisons, the need for adjustments is a logical consequence of using Frequentist statistics such as the Neyman-Pearson approach (Thompson, 1998).
Consider an experiment where people are randomly assigned to either a control or experimental condition. Two unrelated dependent variables are measured to test a hypothesis. A researcher will conclude a specific manipulation has an effect, if there is a difference between the control group and the experimental group on either of these two dependent variables. Because two independent tests are performed, the probability of not making a Type 1 error when α = 0.05 is 0.95*0.95, or 0.9025. This means that the probability of concluding there is an effect, when there is no effect, is 1 - 0.9025 = 0.0975 instead of 0.05.
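The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the tests are independent, as in the example above; the function name `familywise_error_rate` is purely illustrative.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type 1 error across n independent tests,
    each performed at the same alpha level, when every null hypothesis is true."""
    # Probability of making no Type 1 error in any test is (1 - alpha) ** n_tests.
    return 1 - (1 - alpha) ** n_tests


# Two independent dependent variables, each tested at alpha = 0.05.
print(round(familywise_error_rate(0.05, 2), 4))  # 0.0975
```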
There are different ways to control for error rates, the easiest being the Bonferroni correction (divide the α by the number of tests), and an ever-so-slightly less conservative correction being the Holm-Bonferroni sequential procedure. For some multiple testing situations, dedicated statistical approaches have been developed. For example, sequential analyses (Lakens, 2014) control the error rate when researchers want to look at their data as it comes in, and stop the data collection whenever a statistically significant result is observed (this is also needed when updating meta-analyses). When the number of statistical tests becomes substantial, it is sometimes preferable to control false discovery rates, instead of error rates (Benjamini, Krieger, & Yekutieli, 2006). Many procedures that control for false discovery rates take dependencies among hypotheses into account. All these approaches have the same goal of limiting the probability of saying there is an effect, when there is no effect.
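To make the difference between the two simplest corrections concrete, here is a minimal sketch of the Bonferroni correction and the Holm-Bonferroni sequential procedure. The function names and the example p-values are made up for illustration; in practice you would probably rely on a tested implementation in a statistics package rather than on this sketch.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each null hypothesis whose p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm-Bonferroni: test p-values from smallest to largest against
    alpha / (number of tests not yet rejected), stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


# Hypothetical p-values from four tests within one family.
p = [0.01, 0.04, 0.02, 0.005]
print(bonferroni(p))  # [True, False, False, True]
print(holm(p))        # [True, True, True, True]
```

In this made-up example the Holm procedure rejects two hypotheses that the plain Bonferroni correction does not, while both control the familywise error rate at 5%.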
The Bonferroni correction controls the familywise error rate, but what a family of tests is, requires some thought. The main reason this question is not straightforward is that error control does not just aim to control the number of erroneous statistical inferences, but the number of erroneous theoretical inferences. We therefore need to make a statement about which tests relate to a single theoretical inference, which depends on the theoretical question. I believe many of the problems researchers have in deciding how to correct for multiple comparisons are actually problems in deciding what their theoretical question is.
Error rates can be controlled for all tests in an experiment (the experimentwise Type 1 error rate) or for a specific group of tests (the familywise Type 1 error rate). Broad questions have many possible answers. If we want to know if there is ‘an effect’ in a study, then rejecting the null-hypothesis in any test we perform would lead us to decide the answer to our question is ‘yes’. In this situation, the experimentwise Type 1 error rate correctly controls the probability of deciding there is any effect, when all null hypotheses are true. For example, in a 2x2x2 ANOVA, we test for three main effects, three two-way interactions, and one three-way interaction, which makes seven tests in total. If we use a 5% alpha level for every test, the probability that we will conclude there is an effect, when all seven null hypotheses are true, is 1 - 0.95^7, or approximately 30% (treating the tests as independent).
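The roughly 30% figure can also be checked by simulation. The sketch below assumes that the seven tests are independent and that p-values are uniformly distributed when the null hypothesis is true; the function name, the number of simulations, and the seed are arbitrary choices for illustration.

```python
import random


def simulate_any_false_positive(n_tests=7, alpha=0.05, n_sims=100_000, seed=1):
    """Estimate how often at least one of n_tests independent tests is
    significant when every null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under the null, each p-value is a draw from a uniform(0, 1) distribution.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_sims


estimate = simulate_any_false_positive()
print(estimate)           # should be close to the analytic value below
print(1 - 0.95 ** 7)      # analytic value, approximately 0.30
```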
But researchers often have more specific questions. Let’s assume a researcher has designed an experiment that compares predictions from two competing theories. Theory A predicts an interaction in a 2x2 ANOVA, while Theory B predicts no interaction, but at least one significant main effect. The researcher will perform three tests, each of which we will assume is highly powered for any theoretically relevant effect size. One might intuitively assume that since we will perform three tests (two main effects and one interaction) we should control the error rate for all three tests, for example by using α/3. But when controlling the familywise error rate, what constitutes a ‘family’ depends on a set of theoretically related tests. In this case, where we test two theories, there are two families of tests, the first family consisting of a single interaction effect, and the second family of two main effects. With an overall alpha level of 5%, we will decide to accept Theory A when p < α for the interaction, and we will decide not to accept Theory A when p > α. If the null is true, at most 5% of these decisions we make in the long run will be incorrect, so the percentage of decision errors is controlled. Furthermore, we will decide to accept Theory B when p < α/2 (using a Bonferroni correction) for either of the two main effects, and not accept Theory B when p > α/2. When the null hypothesis is true, we will decide to accept Theory B when it is not true at most 5% of the time. We could accept neither theory, or even both, if it turned out the experiment was not the crucial test the researcher had thought.
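The decision rule for the two families can be written down explicitly. This is only a sketch of the rule as described above; the function name, argument names, and example p-values are hypothetical.

```python
def evaluate_theories(p_interaction, p_main_1, p_main_2, alpha=0.05):
    """Apply a separate 5% error budget to each theory's family of tests."""
    # Family 1 (Theory A): a single interaction test, so no correction is needed.
    accept_a = p_interaction < alpha
    # Family 2 (Theory B): two main effects, either of which would support the
    # theory, so each is tested at alpha / 2 (Bonferroni).
    accept_b = min(p_main_1, p_main_2) < alpha / 2
    return {"theory_a": accept_a, "theory_b": accept_b}


# Hypothetical p-values: the interaction passes, neither main effect does.
print(evaluate_theories(p_interaction=0.03, p_main_1=0.04, p_main_2=0.20))
# {'theory_a': True, 'theory_b': False}
```

Note that a main effect with p = 0.04 is not enough to accept Theory B here, because that family contains two tests and each is evaluated against 0.025.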
Some researchers criticize corrections for multiple comparisons because one might as well correct for all tests you do in your lifetime (Perneger, 1998). If you choose to use a Neyman-Pearson paradigm, as opposed to a Likelihood approach or Bayesian statistics, the only reason to correct for all tests you perform in your lifetime is when all the work you have done in your life tests a single theory, and you would use your last words to decide to accept or reject this theory, as long as only one of all individual tests you have performed yielded a p < α. Researchers rarely work like this. Instead, they often draw a conclusion after a single study. It’s these intermediate decisions to accept or reject the null hypothesis that should not be wrong too often, in the long run. We control errors when we make decisions about theories, and we make these decisions more than once in our lifetime.
It might seem as if researchers can find a way out of using error control by formulating a hypothesis for every possible test they will perform. Indeed, they can. For a ten by ten correlation matrix, a researcher might have theoretical predictions for all 45 individual correlations. If all these 45 predicted correlations are tested using an alpha level of 5%, the statistical inference is valid. However, readers might reasonably question the theoretical validity of these 45 tests. All statistical inferences interact with theoretical inferences at some point, and choices to control error rates are a good example of this.
Another criticism of corrections for multiple comparisons is that it is strange that the conclusions a researcher draws depend on the number of additional tests a researcher performs. For example, if a researcher had measured only a single dependent variable, a p = 0.04 would have led to a decision to reject the null hypothesis, but with a second dependent variable, the alpha level is reduced to 0.025, and now the same data no longer lead to the conclusion to reject the null hypothesis. Lowering alpha levels is a mathematical necessity when you want to control error rates, but it is not needed if all you want to do is quantify relative likelihoods of the data under different hypotheses.
Likelihood approaches look at the relative likelihood of the data, given two hypotheses (complemented with prior knowledge in Bayesian statistics). Likelihoods only care about the data. Obviously the probability that the strong evidence in favor of the alternative hypothesis is a fluke increases with the number of tests that were performed. There are ways to control error rates in likelihood approaches and Bayesian statistics, but they are less straightforward than using a Neyman-Pearson approach. It might seem strange for someone who uses a likelihood approach (or Bayesian statistics) that conclusions depend on the number of additional tests that are performed. But from a Neyman-Pearson approach, it is similarly strange to interpret one out of 45 likelihood ratios or Bayes factors from a ten by ten correlation matrix as ‘strong evidence’ for a true effect, without taking into account that 44 other tests were performed at the same time. Combining both approaches is probably a win-win, where long run error rates are controlled, after which the evidential value in individual studies is interpreted (and, because why not, parameters are estimated).
A better understanding of controlling error rates is useful. There are researchers who fear the current scientific climate is focusing too much on Type 1 error control, at the expense of Type 2 error control (Fiedler, Kutzner, & Krueger, 2012). But this is not necessarily so. It all depends on how you design your experiments. Just like you need to lower the alpha level if multiple tests would allow you to reject the null hypothesis, you can choose to increase the alpha level if you will only reject the null hypothesis when multiple independent tests yield a p < α. For example, it is perfectly fine to pre-register a set of two experiments, the second a close replication of the first, where you will choose to reject the null-hypothesis if the p-value is smaller than 0.2236 in both experiments. The probability that you will reject the null hypothesis twice in a row if the null hypothesis is true is α * α, or 0.2236 * 0.2236 = 0.05. In other words, if you set out to do a line of pre-registered studies, which you will report without publication bias, it makes sense to increase your alpha level. For example, an alpha level of 0.1 in both studies effectively limits the Type 1 error rate to 0.1 * 0.1 = 0.01. Conceptually, this is similar to deciding to base your decision on the outcome of a small-scale meta-analysis with an alpha of 0.01.
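The per-study alpha for such a conjunctive decision rule follows from taking a root of the overall alpha level. The sketch below assumes the studies are independent and that the null hypothesis is rejected only if every study is significant; the function name is illustrative.

```python
def per_study_alpha(overall_alpha: float, n_studies: int) -> float:
    """Alpha level to use in each of n independent studies when the null is
    rejected only if ALL studies yield p < alpha."""
    # Under the null, the probability that all studies are significant is
    # alpha ** n_studies, so we solve alpha ** n_studies = overall_alpha.
    return overall_alpha ** (1 / n_studies)


print(round(per_study_alpha(0.05, 2), 4))  # 0.2236
print(round(0.1 * 0.1, 4))                 # 0.01, the overall rate for alpha = 0.1 in two studies
```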
There is only one reason to calculate p-values, and that is to control Type 1 error rates using a Neyman-Pearson approach. Therefore, if you use p-values, you need to correct for multiple comparisons, but be smart about it. We need better error control, not necessarily stricter error control.
References
Benjamini, Y., Krieger, A. M., & Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3), 491–507.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The Long Way From α-Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspectives on Psychological Science, 7(6), 661–669. http://doi.org/10.1177/1745691612462587
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023
Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. BMJ, 316(7139), 1236–1238.
Thompson, J. R. (1998). Invited Commentary: Re: “Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data”. American Journal of Epidemiology, 147(9), 801–806. http://doi.org/10.1093/oxfordjournals.aje.a009530
 
>>>>For example, it is perfectly fine to pre-register a set of two experiments, the second a close replication of the first, where you will choose to reject the null-hypothesis if the p-value is smaller than 0.2236 in both experiments. The probability that you will reject the null hypothesis twice in a row if the null hypothesis is true is α * α, or 0.2236 * 0.2236 = 0.05.
Interesting logic. In practice, however, what would happen if after your first experiment, the *one* and only target statistical test yields p < .30? Do you still run the second experiment?
I guess you have to given you publicly pre-registered the study. But if the first experiment was highly-powered (e.g., 95%) to detect a plausible effect size (e.g., d=.20), doesn't it seem odd to still run the second experiment?
Hi Etienne, I'm thinking of registered reports. There, you could pre-register a set of 2 studies, and they will be published regardless. Let's say the second p-value is 0.8. If you indeed had high power for a minimum effect (e.g., 95%) you could decide that the effect is small, or null. That should be good to know, right?
If I got it correct (maybe I didn't), when you say 'Combining both approaches is probably a win-win, where long run error rates are controlled, after which the evidential value in individual studies is interpreted (and, because why not, parameters are estimated).', does it mean one could perform, say, a Bayesian t-test and a Welch t-test on a pairwise comparison and report both Bayesian and frequentist p-values to come up with a decision? Would it even be OK to combine those p-values?
You cannot combine p-values (a Bayesian t-test does not give a p-value). You could do both tests, interpret the p-value in terms of a NP approach (in the long run, I would rarely be wrong if I act as if there is an effect) and then interpret the evidence at hand (and the current data provide strong/weak evidence for the alternative hypothesis).
No, it would not. You perform separate tests for each individual study. If you want to evaluate all the studies, you need to do a meta-analysis. This has a new theoretical prediction (is there an effect, if I combine all these studies). If this was really one big investigation, it would not make sense to publish these papers separately, right? And if it makes sense to publish them separately, then you don't need to control the error rate across all studies.
Hi Daniel,
Thinking about multiple tests: What about calculating the number of significant findings that you'd expect to observe due to chance (given number of tests), and then running a chi-squared test to determine whether the number of significant results you obtained are themselves, significantly different from what you'd expect due to chance?
Intuitively i feel like this makes sense...what do you think?
If you have some data, you can better use a meta-analysis. A chi-square would be a dichotomous test (significant, yes or no), while a meta-analysis is continuous. Alternatively, you might be interested in literature on controlling the false discovery rate (instead of the Type 1 error rate) - see Benjamini & Hochberg, 1995.
Here's another puzzler for folks interested in the issue of error control over families of tests: Should researchers be correcting for multiple tests, even when they themselves did not run the tests, but all of the tests were run on the same data? Link is HERE.
Nice post! We should definitely pay more attention to the logical structure of the inferences we make, in particular whether multiple pieces of evidence are combined in a disjunctive (OR operator) or conjunctive (AND operator) manner. I also think it is sometimes sensible to do neither and simply "average" multiple pieces of evidence (without correction) when we interpret our results.
On a different topic, Fisher would have probably hated to read this sentence at the end: "There is only one reason to calculate p-values, and that is to control Type 1 error rates using a Neyman-Pearson approach"
See (Gigerenzer, 2004) http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf
I can understand Fisher's dismay, but it remains true :)
In your online class you just said the opposite of this post
I did not. Please provide the quote you are referring to. Then we can discuss what I meant.