The 20% Statistician: One-sided tests: Efficient and Underused

Thursday, March 17, 2016

Researchers often have a directional hypothesis (e.g., the reaction times in the implicit association test are slower in the incongruent block compared to the congruent block). In these situations, researchers can choose to use either a two-sided test:

H0: Mean 1 – Mean 2 = 0
H1: Mean 1 – Mean 2 ≠ 0

or a one-sided test:

H0: Mean 1 – Mean 2 ≤ 0
H1: Mean 1 – Mean 2 > 0

One-sided tests are more powerful than two-sided tests. If you design a test with 80% power, a one-sided test requires approximately 79% of the total sample size of a two-sided test at the same alpha level. This means that using one-sided tests would make researchers more efficient, and tax money would be spent more efficiently.
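The 79% figure can be checked with a few lines of Python. This is a minimal sketch, assuming the usual normal approximation for a comparison of two means, in which the required sample size is proportional to (z_alpha + z_power)^2. The function name n_ratio_one_vs_two_sided is made up for this illustration, and the effect size cancels out of the ratio, so it does not appear in the code.

```python
from statistics import NormalDist


def n_ratio_one_vs_two_sided(alpha=0.05, power=0.80):
    """Required sample size of a one-sided test divided by that of a
    two-sided test, under a normal approximation (assumption)."""
    z = NormalDist().inv_cdf          # standard normal quantile function
    z_power = z(power)
    one_sided = (z(1 - alpha) + z_power) ** 2       # whole alpha in one tail
    two_sided = (z(1 - alpha / 2) + z_power) ** 2   # alpha split over two tails
    return one_sided / two_sided


ratio = n_ratio_one_vs_two_sided()
print(f"one-sided n / two-sided n = {ratio:.2f}")
```

With the defaults (alpha = 0.05, 80% power) the ratio comes out just under 0.79. An exact calculation based on the t-distribution gives slightly different numbers for small samples.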

Many researchers have reacted negatively to the “widespread overuse of two-tailed testing for directional research hypotheses tests” (Cho & Abe, 2013 – this is a good read). As Jones (1952, p. 46) remarks: “Since the test of the null hypothesis against a one-sided alternative is the most powerful test for all directional hypotheses, it is strongly recommended that the one-tailed model be adopted wherever its use is appropriate”.

Nevertheless, researchers predominantly use two-sided tests. The use of one-sided tests is associated with attempts to get a non-significant two-sided p-value of 0.08 below the 0.05 threshold. I predict that the increased use of pre-registration will finally allow researchers to take advantage of more efficient one-sided tests, whenever they have a clear one-sided hypothesis.

There has been some discussion in the literature about the validity of one-sided tests, even when researchers have a directional hypothesis. This discussion has probably confused researchers enough to prevent them from changing the status quo of default use of two-sided tests. However, ignorance is not a good excuse to waste tax money in science. Furthermore, we can expect that in competitive research environments, researchers would prefer to be more efficient, whenever this is justified. Let’s discuss the factors that determine whether someone would use a one-sided or two-sided test.

First of all, a researcher should have a hypothesis where the expected effect lies in a specific direction. Importantly, the question is not whether a result in the opposite direction is possible, but whether it supports your hypothesis. For example, quizzing students during a series of lectures seems to be a useful way to improve their grades on the final exam. I set out to test this hypothesis. Half of the students receive weekly quizzes, while the other half do not. It is possible that, contrary to my prediction, the students who are quizzed actually perform worse. However, this is not of interest to me. I want to decide if I should take time during my lectures to quiz my students to improve their grades, or whether I should not do this. Therefore, I want to know if quizzes improve grades, or not. A one-sided test answers my question. If I decide to introduce quizzes in my lectures whenever p < alpha, where my alpha level is an acceptable Type 1 error rate, a one-sided test is a more efficient way to answer my question than a two-sided test.

A second concern raised against one-sided tests is that surprising findings in the opposite direction might be meaningful, and should not be ignored. If the introduction of quizzes substantially reduces exam grades, contrary to my hypothesis, this might be an interesting observation for other researchers. I agree, but this is not an argument against one-sided testing. The goal in null-hypothesis significance testing is, not surprisingly, to test a hypothesis. But we are not in the business of testing a hypothesis we fabricated after looking at the data. Remember that the only correct use of a p-value is to control error rates when testing a hypothesis (the Neyman-Pearson approach to hypothesis testing). If you have a directional hypothesis, a result in the opposite direction can never confirm your hypothesis. It can confirm a new hypothesis, but this new hypothesis cannot be tested with a p-value calculated from the same data that was used to generate the hypothesis. It makes sense to describe the unexpected pattern in your data when you publish your research. The descriptive statistics can be used to communicate the direction and size of the observed effect. Although you can’t report a meaningful p-value, you are free to add a Bayes Factor or likelihood ratio as a measure of evidence in the data. There is a difference between describing data, and testing a hypothesis. A one-sided hypothesis test does not prohibit researchers from describing unexpected data patterns.

A third concern is that a one-sided test leads to weaker evidence (e.g., Schulz & Grimes, 2005). This is trivially true: Any change to the design of a study that requires a smaller sample size reduces the strength of the evidence you collect, since the evidence is inherently tied to the total number of observations. Other techniques to design more efficient studies (e.g., sequential analyses, Lakens, 2014) also lead to smaller sample sizes, and thus less evidence. The response to this concern is straightforward: If you desire a specific level of evidence, design a study that provides this desired level of evidence. Criticizing a one-sided test because it reduces the level of evidence is an implicit acknowledgement that a two-sided test provides the desired level of evidence, which is illogical, since p-values are only weakly related to evidence to begin with (Good, 1992). Furthermore, the use of a one-sided test does not force you to reduce the sample size. For example, a researcher who will collect the maximum number of participants available given the current resources should still use a one-sided test whenever possible to increase statistical power, even though the choice between a one-sided and a two-sided test does not change the level of evidence in the data. There is a difference between designing a study that yields a certain level of evidence, and a study that adequately controls the error rates when performing a hypothesis test.
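To see what the one-sided test buys you when the sample size is fixed, here is a rough sketch that compares the power of both tests for the same design. It assumes a two-group comparison with equal group sizes, a standardized effect size d, and a normal approximation to the t-test. The function power_z_test and the example values (d = 0.5, 50 participants per group) are hypothetical choices for illustration.

```python
from statistics import NormalDist


def power_z_test(d, n_per_group, alpha=0.05, one_sided=True):
    """Approximate power of a two-sample z-test for standardized effect d
    (equal group sizes, unit variance; normal approximation, an assumption)."""
    norm = NormalDist()
    ncp = d * (n_per_group / 2) ** 0.5   # expected z statistic under H1
    if one_sided:
        crit = norm.inv_cdf(1 - alpha)
        return 1 - norm.cdf(crit - ncp)
    crit = norm.inv_cdf(1 - alpha / 2)
    # Two-sided: reject in either tail (the lower tail adds almost nothing here)
    return (1 - norm.cdf(crit - ncp)) + norm.cdf(-crit - ncp)


one = power_z_test(0.5, 50, one_sided=True)
two = power_z_test(0.5, 50, one_sided=False)
print(f"one-sided power: {one:.2f}, two-sided power: {two:.2f}")
```

Same data and same alpha level, but the one-sided test has noticeably higher power for an effect in the predicted direction.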

I think this sufficiently addresses the concerns raised in the literature (but this blog is my invitation to you to tell me why I am wrong, or raise new concerns).

We can now answer the question of when we should use one-sided tests. To prevent wasting tax money, one-sided tests should be performed whenever:

1) a hypothesis involves a directional prediction

2) a p-value is calculated.

I believe there are many studies that meet these two requirements. Researchers should take 10 minutes to pre-register their experiment (just to prevent reviewers from drawing an incorrect inference about why you are using a one-sided test), to benefit from the 20% reduction in sample size (perform 5 studies, get one free). Also, these benefits stack with the reduction in the required sample when you use sequential analyses, such that a one-sided sequential analysis easily provides a 20% reduction, on top of a 20% reduction. You are welcome.

References

Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. http://doi.org/10.2307/2290192

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023

Schulz, K. F., & Grimes, D. A. (2005). Sample size calculations in randomised trials: mandatory and mystical. The Lancet, 365(9467), 1348–1353.

24 comments:

Nick Brown, March 18, 2016 at 12:33 AM
Yes, you get more power. But if you keep an alpha level of .05, you also increase your false positive rate, because the area of the part of the tail that causes you to say you got a "significant result" is twice as large.

Also, many statistical methods (are there any apart from t-tests and z-tests?) don't admit one-tailed tests. So you can be in the situation of using ANOVA (2-tailed) and t-tests (1-tailed) very close to each other in the same analyses. This seems like a recipe for confusion (at least).

Anonymous, March 18, 2016 at 9:34 AM
The use of one-sided t-tests should be restricted to pre-registered studies. No pre-registration, no use of one-sided t-tests.

I hate when people use a one-sided t-test in their paper and you realize that p=0.0264. Therefore, a two-sided t-test would not yield a significant effect. That would be interesting to study. Are one-sided t-tests more often associated with p-values between 0.025 and 0.05 than they should be?

JJ

Daniel Lakens, March 18, 2016 at 9:36 AM
They undoubtedly are, just like p-values between 0.025 and 0.05 are more often associated with non-replicable results than they should be.

Hence, the pre-registration. We solve the problem of inflated error rates, AND YOU GET TO BE 20% MORE EFFICIENT.

Win-Win.

Architectonic, March 18, 2016 at 12:52 PM
Excellent post. It seems many people seem to miss the forest for the trees! The goal is appropriate hypothesis testing, not getting excited (or frustrated) about a p value falling in a particular range.

Ben Cairns, March 18, 2016 at 1:29 PM
Pre-registration seems a pre-requisite for avoiding the need to do a two-tailed test -- the main point of which is to keep researchers honest where some might be tempted to report a one-sided p-value in the 'other' direction. Pre-registration should probably pre-specify both the directional hypothesis and, to be safe, the intention to use a one-tailed test.

FWIW, the ANOVA F-test is one tailed because extreme differences in means are represented by larger F-statistics, whatever the direction of the difference. In effect, two tails have been combined into one. If your hypothesis is directional, and requires a one-tailed test, it would be an error to use ANOVA to test it.

That said, it might often seem reasonable to calculate a "two-tailed" confidence interval, even alongside a one-tailed p-value. That would confuse those readers (and possibly, co-authors) who cannot reconcile a 95% CI crossing 0 and a P<0.05. I suspect that faith in the CI/p-value correspondence is more dearly held (particularly by non-statisticians) than the belief that all tests should be two-tailed.

I'dratherbeplayingtennis, March 29, 2016 at 11:13 PM
The major problem with one tailed tests (as far as I can tell) is that the researcher then CANNOT interpret a result in the "wrong" or opposite direction as statistically significant.

When one has a single, strong, clear, pre-registered, uni-directional hypothesis, this is a non-issue.

However, my view is that some of the best social psychology (my field) tests plausible alternative hypotheses. Here is a simple example.

1. Racial stereotypes bias judgments of African American job applicants, who, all things being equal, will be judged more negatively than White applicants.

2. Racial stereotypes set up expectations, which, when violated, lead to more extreme evaluations. A White job applicant with a weak background has no situational excuse; because people are aware of discrimination, an African American applicant with an identically weak background is probably more competent. Similarly, among equally strong applicants, the African American will be seen as even more impressive than the White applicant, by virtue of getting there by overcoming discrimination. In both cases, all things being equal, on average, people will more positively evaluate the African American applicant.

Testing plausible alternative hypotheses may not be "the" answer to social psychology's troubles, but it should be in the toolbox, bigtime.

Lee Jussim

Anonymous, April 16, 2016 at 2:32 PM
Hi Prof Lakens, interesting article. One-sided confidence intervals never seem to be discussed and I wonder if this would be a useful discussion point, along with their interpretation.

Dan

Anonymous, August 7, 2016 at 9:38 PM
Dear Daniel, do you think that the use of a one-sided test is still appropriate if it is evident from the means that in the first run of the experiment the changes follow the expected direction, and in the second run there is no change at all?

Unknown, January 28, 2017 at 12:48 PM
I agree with Lee and propose that one should not use one-tailed tests in drug testing or basic biological research, even when one guesses at a direction of change a priori or even when preregistered. This is because a change in the opposite direction may be a vital new counterexample for challenging a hypothesis (a la Popper) that may shed light on a new mechanism or reveal an unexpected toxicity. Sample size calculations reveal that two-tailed tests are often not much more demanding in resources for typical studies in Neuroscience. Try it out in G Power!

Unknown, January 28, 2017 at 6:04 PM
http://rsos.royalsocietypublishing.org/content/1/3/140216

Your false positive rate won't be 5 percent with alpha of 0.05 if your experiments are underpowered. Could be as much as 30 percent.

Daniel Lakens, January 28, 2017 at 6:16 PM
Thanks for your comment. This is not related to Type 1 errors. See my blog where I discuss the topic you raise in more detail: http://daniellakens.blogspot.nl/2015/09/how-can-p-005-lead-to-wrong-conclusions.html

Ani, January 23, 2018 at 4:56 PM
Thanks for blog post Daniel :) Just wondering about the unequal variance issue?
I am guessing that when converting a two-way to a one-way test, I would assume that a balanced set and equal variance will be assumed, or not?