Which hypotheses *H*₁, …, *H*ₘ belong to one ‘family’ of tests "has often been in dispute and there is no general consensus." In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.
I will be honest: I have never understood this confusion about what a family of tests is when controlling the family-wise error rate. At least not in a Neyman-Pearson approach to hypothesis testing, where the goal is to use data to make decisions about how to act. Neyman (1957) calls his approach *inductive behavior*. The outcome of an experiment leads one to take different possible actions, which can be either practical (e.g., implement a new procedure, abandon a research line) or scientific (e.g., claim there is or is no effect). From an error-statistical approach (Mayo, 2018), inflated Type 1 error rates mean that it has become very likely that you will be able to claim support for your hypothesis, even when the hypothesis is wrong. This reduces the severity of the test. To prevent this, we need to control our error rate *at the level of our claim*.
One reason the issue of family-wise error rates might remain vague is that researchers are often vague about their claims. We do not specify our hypotheses unambiguously, and therefore this issue remains unclear. To be honest, I suspect another reason there is a continuing debate about whether and how to lower the alpha level to control for multiple comparisons in some disciplines is that 1) there are a surprisingly large number of papers written on this topic that argue you do not need to control for multiple comparisons, which are 2) cited a huge number of times, giving rise to the feeling that surely they must have a point. Regrettably, the main reason these papers are written is that there are people who don’t think a Neyman-Pearson approach to hypothesis testing is a good idea, and the main reason these papers are cited is that doing so is convenient for researchers who want to publish statistically significant results, as they can justify why they are not lowering their alpha level, making that *p* = 0.02 in one of three tests really ‘significant’. **All papers that argue against the need to control for multiple comparisons when testing hypotheses are wrong.** Yes, their existence and massive citation counts frustrate me. It is fine not to test a hypothesis, but when you do, and you make a claim based on a test, you need to control your error rates.

But let’s get back to our first problem, which we can solve by making the claims people need to control Type 1 error rates for less vague. Lisa DeBruine and I recently proposed machine-readable hypothesis tests to remove any ambiguity in the tests we will perform to examine statistical predictions, and in when we will consider a claim corroborated or falsified. In this post, I am going to use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.

### An example of formalizing family-wise error control

Let’s assume we collect data from 100 participants in a control and a treatment condition. We collect 3 dependent variables (dv1, dv2, and dv3). In the population there is no difference between groups on any of these three variables (the true effect size is 0). We will analyze the three dv’s in independent *t*-tests. This requires specifying our alpha level, and thus deciding whether we need to correct for multiple comparisons. How we control error rates depends on the claim we want to make.

We might want to act as if (or claim that) our treatment works if there is a difference between the treatment and control conditions on any of the three variables. In scienceverse terms, this means we consider the prediction corroborated when the *p*-value of the first *t*-test is smaller than the alpha level, the *p*-value of the second *t*-test is smaller than the alpha level, or the *p*-value of the third *t*-test is smaller than the alpha level. In the scienceverse code, we specify a criterion for each test (a *p*-value smaller than the alpha level, *p.value < alpha_level*) and conclude the hypothesis is corroborated if any of these criteria are met (*"p_t_1 | p_t_2 | p_t_3"*).
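To make the disjunctive criterion concrete, here is a minimal sketch in Python (rather than scienceverse’s R); the function name and the example p-values are mine, chosen to mirror the single simulated dataset shown below:

```python
# Minimal sketch (in Python, not scienceverse's R) of the disjunctive
# criterion "p_t_1 | p_t_2 | p_t_3": the single claim "the treatment
# works" is corroborated if ANY of the three t-tests is significant.
alpha_level = 0.05

def corroborated(p_values, alpha=alpha_level):
    # True if at least one p-value is below the alpha level.
    return any(p < alpha for p in p_values)

# Hypothetical p-values for the three t-tests on dv1, dv2, and dv3:
print(corroborated([0.452, 0.21, 0.02]))  # True: one significant test suffices
```

Note that the claim is a single claim about the whole set of tests, which is exactly why these three tests form one family.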

We could also want to make three different predictions. Instead of one hypothesis (“something will happen”) we have three different hypotheses, and predict there will be an effect on dv1, on dv2, and on dv3. The criterion for each *t*-test is the same, but we now have three hypotheses to evaluate (H1, H2, and H3). Each of these claims can be corroborated, or not.
Scienceverse allows you to specify your hypothesis tests unambiguously (for the code used in this blog, see the bottom of the post). It also allows you to simulate a dataset, which we can use to examine Type 1 errors by simulating data in which no true effects exist. Finally, scienceverse allows you to run the pre-specified analyses on the (simulated) data, and will automatically create a report that summarizes which hypotheses were corroborated (which is useful when checking whether the conclusions in a manuscript indeed follow from the preregistered analyses). The output for a single simulated dataset, for the scenario where we will interpret any effect on the three dv’s as support for the hypothesis, looks like this:

# Evaluation of Statistical Hypotheses

#### 12 March, 2020

### Simulating Null Effects Postregistration

## Results

### Hypothesis 1: H1

Something will happen

`p_t_1` is confirmed if analysis ttest_1 yields `p.value<0.05`

The result was p.value = 0.452 (FALSE)

`p_t_2` is confirmed if analysis ttest_2 yields `p.value<0.05`

The result was p.value = 0.21 (FALSE)

`p_t_3` is confirmed if analysis ttest_3 yields `p.value<0.05`

The result was p.value = 0.02 (TRUE)

#### Corroboration ( TRUE )

The hypothesis is corroborated if anything is significant. `p_t_1 | p_t_2 | p_t_3`

#### Falsification ( FALSE )

The hypothesis is falsified if nothing is significant. `!p_t_1 & !p_t_2 & !p_t_3`

**All criteria were met for corroboration.**

We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 - even though this was a Type 1 error, since we simulated data with a true effect size of 0 - and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:
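The error-rate arithmetic above is easy to verify; here is a short Python sketch (variable names are mine), including a Monte Carlo check that relies on p-values being uniformly distributed under the null hypothesis:

```python
import random

alpha = 0.05
m = 3  # three independent t-tests (dv1, dv2, dv3)

# Family-wise Type 1 error rate when any significant test counts as support:
fwer_uncorrected = 1 - (1 - alpha) ** m       # 1 - 0.95^3
fwer_bonferroni = 1 - (1 - alpha / m) ** m    # after Bonferroni correction

print(round(fwer_uncorrected, 4))  # 0.1426, i.e., 14.26%
print(round(fwer_bonferroni, 4))   # 0.0492, i.e., 4.92%

# Monte Carlo check: under the null, each p-value is uniform on [0, 1].
random.seed(42)
n_sims = 100_000
hits = sum(any(random.random() < alpha for _ in range(m)) for _ in range(n_sims))
print(hits / n_sims)  # close to 0.1426
```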

# Evaluation of Statistical Hypotheses

#### 12 March, 2020

### Simulating Null Effects Postregistration

## Results

### Hypothesis 1: H1

dv1 will show an effect

`p_t_1` is confirmed if analysis ttest_1 yields `p.value<0.05`

The result was p.value = 0.452 (FALSE)

#### Corroboration ( FALSE )

The hypothesis is corroborated if dv1 is significant. `p_t_1`

#### Falsification ( TRUE )

The hypothesis is falsified if dv1 is not significant. `!p_t_1`

**All criteria were met for falsification.**

### Hypothesis 2: H2

dv2 will show an effect

`p_t_2` is confirmed if analysis ttest_2 yields `p.value<0.05`

The result was p.value = 0.21 (FALSE)

#### Corroboration ( FALSE )

The hypothesis is corroborated if dv2 is significant. `p_t_2`

#### Falsification ( TRUE )

The hypothesis is falsified if dv2 is not significant. `!p_t_2`

**All criteria were met for falsification.**

### Hypothesis 3: H3

dv3 will show an effect

`p_t_3` is confirmed if analysis ttest_3 yields `p.value<0.05`

The result was p.value = 0.02 (TRUE)

#### Corroboration ( TRUE )

The hypothesis is corroborated if dv3 is significant. `p_t_3`

#### Falsification ( FALSE )

The hypothesis is falsified if dv3 is not significant. `!p_t_3`

**All criteria were met for corroboration.**

We now see that two hypotheses were falsified (yes, yes, I know you should not use *p* > 0.05 to falsify a prediction in real life; this part of the example is formally wrong, but I use it so that I don’t also have to explain equivalence testing to readers not familiar with it - if that is you, read this, and know scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction; see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.

The strength of a claim does not depend only on the *p*-value (but also on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc.), but if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’, because this would be a violation of the code of research integrity, we should always be able to see the meager track record of predictions.

If you don’t feel ready to make specific predictions (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10⁻⁸ to control the Type 1 error rate).

Hopefully, specifying our hypotheses (and what would corroborate them) transparently by using scienceverse makes it clear what happens in the long run in both scenarios. In the long run, the first scenario (if we use an alpha level of 0.05/3 instead of 0.05) and the second scenario (with an alpha level of 0.05 for each individual hypothesis) lead to the same end result: no more than 5% of our claims will be wrong, if the null hypothesis is true. In the first scenario, we make one claim per experiment; in the second, we make three. In the second scenario we will end up with more false claims in an absolute sense, but the relative number of false claims is the same in both scenarios. And that’s exactly the goal of family-wise error control.
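A simulation makes this long-run equivalence concrete. The sketch below (Python rather than R; names are mine) counts false claims per claim made in both scenarios, assuming p-values are uniform under the null:

```python
import random

random.seed(2020)
alpha, m, n_sims = 0.05, 3, 50_000

# Under the null hypothesis, each p-value is uniform on [0, 1].
studies = [[random.random() for _ in range(m)] for _ in range(n_sims)]

# Scenario 1: ONE claim per study ("the treatment works"), corroborated
# if any test passes the Bonferroni-corrected level alpha / m.
wrong_1 = sum(any(p < alpha / m for p in ps) for ps in studies)
rate_1 = wrong_1 / n_sims  # false claims per claim made

# Scenario 2: THREE claims per study (one per dv), each tested at alpha.
wrong_2 = sum(p < alpha for ps in studies for p in ps)
rate_2 = wrong_2 / (n_sims * m)  # false claims per claim made

print(rate_1, rate_2)  # both close to 0.05: the same rate of wrong claims
```

Scenario 2 produces roughly three times as many false claims in absolute terms (it makes three times as many claims), but the proportion of wrong claims is the same in both scenarios.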

**References**

Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? *Journal of Clinical Epidemiology*, *54*(4), 343–349.

Bretz, F., Hothorn, T., & Westfall, P. H. (2011). *Multiple comparisons using R*. CRC Press.

Mayo, D. G. (2018). *Statistical inference as severe testing: How to get beyond the statistics wars*. Cambridge University Press.

Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. *Revue de l’Institut International de Statistique / Review of the International Statistical Institute*, *25*(1/3), 7. https://doi.org/10.2307/1401671

*Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.*