H0: Mean 1 – Mean 2 = 0
H1: Mean 1 – Mean 2 ≠ 0
or a one-sided test:
H0: Mean 1 – Mean 2 ≤ 0
H1: Mean 1 – Mean 2 > 0
One-sided tests are more powerful than two-sided tests when the effect lies in the predicted direction. If you design a test with 80% power at an alpha level of 0.05, a one-sided test requires approximately 79% of the total sample of a two-sided test. This means that the use of one-sided tests would make researchers more efficient. Tax money would be spent more efficiently.
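The 79% figure can be checked with a quick back-of-the-envelope calculation. The sketch below is my own illustration, not an exact power analysis: it assumes a two-group comparison of means, an alpha level of 0.05, a standardized effect size of d = 0.5 (an arbitrary choice; the ratio does not depend on it), and the normal approximation to the t-test. The function name `n_per_group` is made up for this example.

```python
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, one_sided=False):
    """Approximate per-group sample size for comparing two means.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) / d)^2,
    so the result is slightly smaller than an exact t-test calculation.
    """
    z = NormalDist()
    # A one-sided test puts all of alpha in one tail; a two-sided test splits it.
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2


d = 0.5  # assumed effect size, for illustration only
n_two = n_per_group(d)
n_one = n_per_group(d, one_sided=True)
ratio = n_one / n_two

print(f"two-sided: {n_two:.1f} per group")
print(f"one-sided: {n_one:.1f} per group")
print(f"one-sided / two-sided: {ratio:.2f}")
```

Under these assumptions the ratio comes out at roughly 0.79. Because the effect size cancels out of the ratio, the saving is the same for any d.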
Many researchers have reacted negatively to the “widespread overuse of two-tailed testing for directional research hypotheses tests” (Cho & Abe, 2013 – this is a good read). As Jones (1952, p. 46) remarks: “Since the test of the null hypothesis against a one-sided alternative is the most powerful test for all directional hypotheses, it is strongly recommended that the one-tailed model be adopted wherever its use is appropriate”.
Nevertheless, researchers predominantly use two-sided tests. One-sided tests are often viewed with suspicion, because their use is associated with attempts to turn a non-significant two-sided p-value of 0.08 into a one-sided p-value below the 0.05 threshold. I predict that the increased use of pre-registration will finally allow researchers to take advantage of more efficient one-sided tests, whenever they have a clear one-sided hypothesis.
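To see why a p-value of 0.08 is such a tempting candidate, here is a minimal sketch of how a two-sided p-value maps onto a one-sided one. It assumes a test statistic with a symmetric sampling distribution (such as t or z), and the helper name `one_sided_p` is made up for this example.

```python
def one_sided_p(p_two_sided, effect_in_predicted_direction):
    """Convert a two-sided p-value to a one-sided p-value.

    Only valid for test statistics with a symmetric null distribution
    (e.g., t or z), where the two-sided p-value is twice the tail area.
    """
    half = p_two_sided / 2
    # In the predicted direction the p-value halves; in the opposite
    # direction it becomes large, so the result can never be significant.
    return half if effect_in_predicted_direction else 1 - half


p_predicted = one_sided_p(0.08, True)
p_opposite = one_sided_p(0.08, False)

print(f"predicted direction: p = {p_predicted:.2f}")
print(f"opposite direction:  p = {p_opposite:.2f}")
```

The halving is only legitimate when the direction was specified before looking at the data, which is exactly what pre-registration documents.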
There has been some discussion in the literature about the validity of one-sided tests, even when researchers have a directional hypothesis. This discussion has probably confused researchers enough to prevent them from changing the status quo of default use of two-sided tests. However, ignorance is not a good excuse to waste tax money in science. Furthermore, we can expect that in competitive research environments, researchers would prefer to be more efficient, whenever this is justified. Let’s discuss the factors that determine whether someone would use a one-sided or two-sided test.
First of all, a researcher should have a hypothesis where the expected effect lies in a specific direction. Importantly, the question is not whether a result in the opposite direction is possible, but whether it supports your hypothesis. For example, quizzing students during a series of lectures seems to be a useful way to improve their grade for the final exam. I set out to test this hypothesis. Half of the students receive weekly quizzes, while the other half do not. It is possible that, contrary to my prediction, the students who are quizzed actually perform worse. However, this is not of interest to me. I want to decide if I should take time during my lectures to quiz my students to improve their grades, or whether I should not do this. Therefore, I want to know if quizzes improve grades, or not. A one-sided test answers my question. If I decide to introduce quizzes in my lectures whenever p < alpha, where my alpha level is an acceptable Type 1 error rate, a one-sided test is a more efficient way to answer my question than a two-sided test.
If the introduction of quizzes substantially reduces exam grades, as opposed to my hypothesis, this might be an interesting observation for other researchers. A second concern raised against one-sided tests is that surprising findings in the opposite direction might be meaningful, and should not be ignored. I agree, but this is not an argument against one-sided testing. The goal in null-hypothesis significance testing is, not surprisingly, to test a hypothesis. But we are not in the business of testing a hypothesis we fabricated after looking at the data. Remember that the only correct use of a p-value is to control error rates when testing a hypothesis (the Neyman-Pearson approach to hypothesis testing). If you have a directional hypothesis, a result in the opposite direction can never confirm your hypothesis. It can confirm a new hypothesis, but this new hypothesis cannot be tested with a p-value calculated from the same data that was used to generate the hypothesis. It makes sense to describe the unexpected pattern in your data when you publish your research. The descriptive statistics can be used to communicate the direction and size of the observed effect. Although you can’t report a meaningful p-value, you are free to add a Bayes Factor or likelihood ratio as a measure of evidence in the data. There is a difference between describing data, and testing a hypothesis. A one-sided hypothesis test does not prohibit researchers from describing unexpected data patterns.
A third concern is that a one-sided test leads to weaker evidence (e.g., Schulz & Grimes, 2005). This is trivially true: Any change to the design of a study that requires a smaller sample size reduces the strength of the evidence you collect, since the evidence is inherently tied to the total number of observations. Other techniques to design more efficient studies (e.g., sequential analyses, Lakens, 2014) also lead to smaller sample sizes, and thus less evidence. The response to this concern is straightforward: If you desire a specific level of evidence, design a study that provides this desired level of evidence. Criticizing a one-sided test because it reduces the level of evidence is an implicit acknowledgement that a two-sided test provides the desired level of evidence, which is illogical, since p-values are only weakly related to evidence to begin with (Good, 1992). Furthermore, the use of a one-sided test does not force you to reduce the sample size. For example, a researcher who will collect the maximum number of participants available given the current resources should still use a one-sided test whenever possible to increase statistical power, even though the choice between a one-sided and a two-sided test does not change the level of evidence in the data. There is a difference between designing a study that yields a certain level of evidence, and a study that adequately controls the error rates when performing a hypothesis test.
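The point about fixed resources can be made concrete. The sketch below compares the approximate power of a one-sided and a two-sided test when the sample size is the same. The numbers are assumptions chosen for illustration (50 participants per group, d = 0.5, alpha = 0.05), the calculation uses the normal approximation rather than the exact noncentral t distribution, and `approx_power` is a made-up helper name.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05, one_sided=False):
    """Approximate power for comparing two means (normal approximation)."""
    z = NormalDist()
    # Expected value of the z statistic under the alternative hypothesis.
    ncp = d * (n_per_group / 2) ** 0.5
    if one_sided:
        return 1 - z.cdf(z.inv_cdf(1 - alpha) - ncp)
    crit = z.inv_cdf(1 - alpha / 2)
    # A two-sided test can reject in either tail; the lower tail adds
    # almost nothing when the true effect is positive.
    return (1 - z.cdf(crit - ncp)) + z.cdf(-crit - ncp)


power_two = approx_power(0.5, 50)
power_one = approx_power(0.5, 50, one_sided=True)

print(f"two-sided power: {power_two:.2f}")
print(f"one-sided power: {power_one:.2f}")
```

With these assumed numbers, the same 100 participants give roughly 71% power for the two-sided test and roughly 80% for the one-sided test. The data are identical in both cases; only the error control of the test changes.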
I think this sufficiently addresses the concerns raised in the literature (but this blog is my invitation to you to tell me why I am wrong, or raise new concerns).
We can now answer the question when we should use one-sided tests. To prevent wasting tax money, one-sided tests should be performed whenever:
1) a hypothesis involves a directional prediction
2) a p-value is calculated.
I believe there are many studies that meet these two requirements. Researchers should take 10 minutes to pre-register their experiment (just to prevent reviewers from drawing an incorrect inference about why you are using a one-sided test), to benefit from the 20% reduction in sample size (perform 5 studies, get one free). Also, these benefits stack with the reduction in the required sample when you use sequential analyses, such that a one-sided sequential analysis easily provides a 20% reduction, on top of a 20% reduction. You are welcome.
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. http://doi.org/10.2307/2290192
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023
Schulz, K. F., & Grimes, D. A. (2005). Sample size calculations in randomised trials: mandatory and mystical. The Lancet, 365(9467), 1348–1353.