A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, March 17, 2016

One-sided tests: Efficient and Underused

Researchers often have a directional hypothesis (e.g., the reaction times in the implicit association test are slower in the incongruent block compared to the congruent block). In these situations, researchers can choose to use either a two-sided test:

H0: Mean 1 – Mean 2 = 0
H1: Mean 1 – Mean 2 ≠ 0

or a one-sided test:

H0: Mean 1 – Mean 2 ≤ 0
H1: Mean 1 – Mean 2 > 0

One-sided tests are more powerful than two-sided tests. If you design a test with 80% power, a one-sided test requires approximately 79% of the total sample of a two-sided test. This means that the use of one-sided tests would make researchers more efficient. Tax money would be spent more efficiently.
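As a rough check on the 79% figure, here is a minimal sketch based on the normal approximation for comparing two independent group means. The function name `n_per_group` and the effect size of d = 0.5 are illustrative assumptions, not values from this post; an exact calculation based on the t-distribution gives slightly larger samples but a very similar ratio.

```python
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, one_sided=False):
    """Approximate sample size per group for a two-group comparison of means.

    Uses the normal (z) approximation; d is the standardized effect size.
    """
    z = NormalDist()
    # A one-sided test puts all of alpha in one tail, so its critical value is smaller.
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2


# Hypothetical effect size, chosen only for illustration.
n_two = n_per_group(0.5)
n_one = n_per_group(0.5, one_sided=True)
ratio = n_one / n_two
print(f"two-sided: {n_two:.1f} per group, one-sided: {n_one:.1f} per group")
print(f"one-sided needs {ratio:.0%} of the two-sided sample")
```

Under this approximation the ratio does not depend on the effect size, because d cancels out when the two sample sizes are divided.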

Many researchers have reacted negatively to the “widespread overuse of two-tailed testing for directional research hypotheses tests” (Cho & Abe, 2013 – this is a good read). As Jones (1952, p. 46) remarks: “Since the test of the null hypothesis against a one-sided alternative is the most powerful test for all directional hypotheses, it is strongly recommended that the one-tailed model be adopted wherever its use is appropriate”.

Nevertheless, researchers predominantly use two-sided tests. The use of one-sided tests is associated with attempts to get a non-significant two-sided p-value of 0.08 below the 0.05 threshold. I predict that the increased use of pre-registration will finally allow researchers to take advantage of more efficient one-sided tests, whenever they have a clear one-sided hypothesis.

There has been some discussion in the literature about the validity of one-sided tests, even when researchers have a directional hypothesis. This discussion has probably confused researchers enough to prevent them from changing the status quo of default use of two-sided tests. However, ignorance is not a good excuse to waste tax money in science. Furthermore, we can expect that in competitive research environments, researchers would prefer to be more efficient, whenever this is justified. Let’s discuss the factors that determine whether someone would use a one-sided or two-sided test.

First of all, a researcher should have a hypothesis where the expected effect lies in a specific direction. Importantly, the question is not whether a result in the opposite direction is possible, but whether it supports your hypothesis. For example, quizzing students during a series of lectures seems to be a useful way to improve their grade for the final exam. I set out to test this hypothesis. Half of the students receive weekly quizzes, while the other half does not get weekly quizzes. It is possible that, opposed to my prediction, the students who are quizzed actually perform worse. However, this is not of interest to me. I want to decide if I should take time during my lectures to quiz my students to improve their grades, or whether I should not do this. Therefore, I want to know if quizzes improve grades, or not. A one-sided test answers my question. If I decide to introduce quizzes in my lectures whenever p < alpha, where my alpha level is an acceptable Type 1 error rate, a one-sided test is a more efficient way to answer my question than a two-sided test.
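To make the decision rule concrete, here is a small sketch that computes the one-sided and two-sided p-value for the same test statistic. It assumes a z-test for simplicity, and the observed value of 1.80 and the names `p_values` and `introduce_quizzes` are hypothetical, used only for illustration.

```python
from statistics import NormalDist


def p_values(z_stat):
    """Return (one-sided, two-sided) p-values for a z statistic.

    The one-sided test is for H1: Mean 1 - Mean 2 > 0, so only large
    positive values of z_stat count as evidence against H0.
    """
    std_normal = NormalDist()
    p_one = 1 - std_normal.cdf(z_stat)
    p_two = 2 * (1 - std_normal.cdf(abs(z_stat)))
    return p_one, p_two


alpha = 0.05
z_observed = 1.80  # hypothetical result: quizzed students score higher
p_one, p_two = p_values(z_observed)

# The decision described above: introduce quizzes whenever p < alpha.
introduce_quizzes = p_one < alpha
print(f"one-sided p = {p_one:.3f}, two-sided p = {p_two:.3f}")
print("introduce quizzes:", introduce_quizzes)
```

When the effect is in the predicted direction, the one-sided p-value is half the two-sided p-value. When the effect is in the opposite direction, the one-sided p-value is larger than 0.5, so a result in the wrong direction can never lead to the decision to introduce quizzes.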

If the introduction of quizzes substantially reduces exam grades, as opposed to my hypothesis, this might be an interesting observation for other researchers. A second concern raised against one-sided tests is that surprising findings in the opposite direction might be meaningful, and should not be ignored. I agree, but this is not an argument against one-sided testing. The goal in null-hypothesis significance testing is, not surprisingly, to test a hypothesis. But we are not in the business of testing a hypothesis we fabricated after looking at the data. Remember that the only correct use of a p-value is to control error rates when testing a hypothesis (the Neyman-Pearson approach to hypothesis testing). If you have a directional hypothesis, a result in the opposite direction can never confirm your hypothesis. It can confirm a new hypothesis, but this new hypothesis cannot be tested with a p-value calculated from the same data that was used to generate the hypothesis. It makes sense to describe the unexpected pattern in your data when you publish your research. The descriptive statistics can be used to communicate the direction and size of the observed effect. Although you can’t report a meaningful p-value, you are free to add a Bayes Factor or likelihood ratio as a measure of evidence in the data. There is a difference between describing data, and testing a hypothesis. A one-sided hypothesis test does not prohibit researchers from describing unexpected data patterns.

A third concern is that a one-sided test leads to weaker evidence (e.g., Schulz & Grimes, 2005). This is trivially true: Any change to the design of a study that requires a smaller sample size reduces the strength of the evidence you collect, since the evidence is inherently tied to the total number of observations. Other techniques to design more efficient studies (e.g., sequential analyses, Lakens, 2014) also lead to smaller sample sizes, and thus less evidence. The response to this concern is straightforward: If you desire a specific level of evidence, design a study that provides this desired level of evidence. Criticizing a one-sided test because it reduces the level of evidence is an implicit acknowledgement that a two-sided test provides the desired level of evidence, which is illogical, since p-values are only weakly related to evidence to begin with (Good, 1992). Furthermore, the use of a one-sided test does not force you to reduce the sample size. For example, a researcher who will collect the maximum number of participants available given the current resources should still use a one-sided test whenever possible to increase statistical power, even when the choice for a one-sided vs. two-sided test does not change the level of evidence in the data. There is a difference between designing a study that yields a certain level of evidence, and a study that adequately controls the error rates when performing a hypothesis test.
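The fixed-sample case can be sketched as follows. This is a normal-approximation calculation, and the function name `power_z`, the effect size of d = 0.5, and the 50 participants per group are assumptions chosen for illustration, not values from the literature cited here.

```python
from statistics import NormalDist


def power_z(d, n_per_group, alpha=0.05, one_sided=False):
    """Approximate power for comparing two group means (z approximation)."""
    z = NormalDist()
    # Expected value of the z statistic when the true effect size is d.
    expected_z = d * (n_per_group / 2) ** 0.5
    if one_sided:
        crit = z.inv_cdf(1 - alpha)
        return 1 - z.cdf(crit - expected_z)
    crit = z.inv_cdf(1 - alpha / 2)
    # Two-sided: reject in either tail.
    return (1 - z.cdf(crit - expected_z)) + z.cdf(-crit - expected_z)


# Hypothetical study: the sample size is fixed by the available resources.
power_two = power_z(0.5, 50)
power_one = power_z(0.5, 50, one_sided=True)
print(f"two-sided power: {power_two:.2f}, one-sided power: {power_one:.2f}")

# With a true effect of zero, both tests reject at the nominal alpha level.
type1_two = power_z(0.0, 50)
type1_one = power_z(0.0, 50, one_sided=True)
print(f"Type 1 error rate: two-sided {type1_two:.3f}, one-sided {type1_one:.3f}")
```

With the same data, the one-sided test has higher power for the predicted direction, while the Type 1 error rate stays at alpha for both tests.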

I think this sufficiently addresses the concerns raised in the literature (but this blog is my invitation to you to tell me why I am wrong, or raise new concerns).

We can now answer the question of when we should use one-sided tests. To prevent wasting tax money, one-sided tests should be performed whenever:

1) a hypothesis involves a directional prediction

2) a p-value is calculated.

I believe there are many studies that meet these two requirements. Researchers should take 10 minutes to pre-register their experiment (just to prevent reviewers from drawing an incorrect inference about why you are using a one-sided test), to benefit from the 20% reduction in sample size (perform 5 studies, get one free). Also, these benefits stack with the reduction in the required sample when you use sequential analyses, such that a one-sided sequential analysis easily provides a 20% reduction, on top of a 20% reduction. You are welcome.
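The arithmetic behind "perform 5 studies, get one free" and the stacked reduction can be sketched as below. It takes the two 20% figures from the paragraph above at face value and assumes the savings combine multiplicatively, which is my reading of "on top of" rather than a derived result.

```python
# Both savings are the rounded 20% figures from the text, not exact values.
one_sided_saving = 0.20
sequential_saving = 0.20

# Budget for 4 two-sided studies, spent on one-sided studies instead.
studies_for_price_of_four = 4 / (1 - one_sided_saving)

# Assumption: the two reductions multiply.
combined_fraction = (1 - one_sided_saving) * (1 - sequential_saving)

print(f"studies for the price of four: {studies_for_price_of_four:.0f}")
print(f"expected sample relative to a fixed two-sided design: {combined_fraction:.0%}")
```

Under these assumptions a one-sided sequential design needs roughly 64% of the observations, on average, of a fixed two-sided design.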



References

Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. http://doi.org/10.2307/2290192
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023
Schulz, K. F., & Grimes, D. A. (2005). Sample size calculations in randomised trials: mandatory and mystical. The Lancet, 365(9467), 1348–1353.

21 comments:

  1. Yes, you get more power. But if you keep an alpha level of .05, you also increase your false positive rate, because the area of the part of the tail that causes you to say you got a "significant result" is twice as large.

    Also, many statistical methods (are there any apart from t-tests and z-tests?) don't admit one-tailed tests. So you can be in the situation of using ANOVA (2-tailed) and t tests (1-tailed) very close to each other in the same analyses. This seems like a recipe for confusion (at least).

    1. Hi Nick, you have both things wrong. An alpha of 0.05 means you have a 5% Type 1 error rate, max. Your false positives rate is 5% max. You will say 'There is something here' when there is nothing, 5% of the time, max. So one-sided testing does not increase the false positives.

      A two-tailed ANOVA does not exist. An ANOVA is always a one-tailed test. There is a difference in the means, or there is no difference in the means. See: http://stats.stackexchange.com/questions/67543/why-do-we-use-a-one-tailed-test-f-test-in-analysis-of-variance-anova

    2. I will think more about the false-positive issue. I'm not convinced it is so simple.

      On the other question, maybe an ANOVA should be a one-tailed test, but that doesn't seem to be what the software is doing. Have a look at nick.brown.free.fr/stuff/TvsF/TvsF.R (or for SPSS users, nick.brown.free.fr/stuff/TvsF/TvsF.sav and nick.brown.free.fr/stuff/TvsF/TvsF.spv). I compare two groups with a t test and an ANOVA. t is -1.68, which would be significant with a one-tailed test; the two-tailed p is .098, so the one-tailed p would be .049 (give or take the Welch question ;-)). F is 2.82 (i.e., -1.68 squared), and the p value is the same, give or take rounding at some point. So if the ANOVA is doing a one-tailed test, why does it give the same p value as a two-tailed t test on the same data? What am I missing here?

    3. The problem is that a one-tailed t test is a directional test, but a one-tailed F test is non-directional (being equivalent to the sum of both tail probabilities in the t test).

      > pf(2.82,1,1000) # about 10% in right tail
      [1] 0.9065912
      > pt(1.68,1000) # about 5% in right tail
      [1] 0.9533652

      I dislike the one-tailed or two-tailed terminology. Conceptually we should care about whether the test is directional or not. In principle one can carve up a non-directional test into 2 or more directional hypotheses (depending on how many df the effect has).

    4. I think Nick's sense that something is not right here might reflect the difference between the statistical error rate and what we might call the effective error rate, by which I mean the error rate that would lead to a change in practice, for example.

      Let's take your lecture quiz example, and imagine that the quizzes have no effect. Under a one-tailed test (in the direction of improving scores), 5% of the time we'd erroneously conclude that they were helpful and keep giving them, thus wasting everyone's time. If we'd done a two-tailed test, we'd also have been wrong 5% of the time, but half of those times would be in the opposite direction, making us think quizzes actually hurt performance. In only 2.5% of the cases would we think that quizzes helped and keep giving them. As you point out, the question of quizzes being counterproductive is not really relevant to the question of whether to give them - to make that decision, it's enough that they're not helpful.

      So even though the rate of wrong conclusions is 5% in both types of tests, the rate of wrong responses is not. With a two-tailed test, we'd have kept giving quizzes only 2.5% of the time; with a one-tailed, we'd keep giving them 5% of the time.

      The same sort of logic applies to something like a drug trial. With the more common two-sided test, we'd only make an error on the side of benefit 2.5% of the time, whereas we'd do that 5% of the time with a one-sided test. That might still be an acceptable rate, but I think we'd want to acknowledge that a major shift to one-sided tests would likely lead to an increase in the rate at which useless treatments were pursued.

  2. The use of one-sided t-tests should be restricted to pre-registered studies. No pre-registration, no use of one-sided t-tests.

    I hate when people use a one-sided t-test in their paper and you realize that p=0.0264. Therefore, a two-sided t-test would not yield a significant effect. That would be interesting to study. Are one-sided t-tests more often associated with p-values between 0.025 and 0.05 than they should be?

    JJ

  3. They undoubtedly are, just like p-values between 0.025 and 0.05 are more often associated with non-replicable results than they should be.

    Hence, the pre-registration. We solve the problem of inflated error rates, AND YOU GET TO BE 20% MORE EFFICIENT.

    Win-Win.

    1. The current goal has to be to remove bogus claims from the literature. False social science is so bad that replication is as likely to find that purported treatments have significant deleterious effects as it is to show null effects. The only people, IMHO, who need to pre-register are adherents of the original study. Giving them a 1-sided option doubles the confirmatory error rate, makes their "we had >80% power" inconsistent with what most people understand power to mean (i.e., power at a=.05), and excuses them from concluding that their potion is actually harmful. I'd rate it as a positively harmful change.

  4. Excellent post. It seems many people miss the forest for the trees! The goal is appropriate hypothesis testing, not getting excited (or frustrated) about a p value falling in a particular range.

  5. Pre-registration seems a pre-requisite for avoiding the need to do a two-tailed test -- the main point of which is to keep researchers honest where some might be tempted to report a one-sided p-value in the 'other' direction. Pre-registration should probably pre-specify both the directional hypothesis and, to be safe, the intention to use a one-tailed test.

    FWIW, the ANOVA F-test is one tailed because extreme differences in means are represented by larger F-statistics, whatever the direction of the difference. In effect, two tails have been combined into one. If your hypothesis is directional, and requires a one-tailed test, it would be an error to use ANOVA to test it.

    That said, it might often seem reasonable to calculate a "two-tailed" confidence interval, even alongside a one-tailed p-value. That would confuse those readers (and possibly, co-authors) who cannot reconcile a 95% CI crossing 0 and a P<0.05. I suspect that faith in the CI/p-value correspondence is more dearly held (particularly by non-statisticians) than the belief that all tests should be two-tailed.

  6. The major problem with one tailed tests (as far as I can tell) is that the researcher then CANNOT interpret a result in the "wrong" or opposite direction as statistically significant.

    When one has a single, strong, clear, pre-registered, uni-directional hypothesis, this is a non-issue.

    However, my view is that some of the best social psychology (my field) tests plausible alternative hypotheses. Here is a simple example.

    1. Racial stereotypes bias judgments of African American job applicants, who, all things being equal, will be judged more negatively than White applicants.

    2. Racial stereotypes set up expectations, which, when violated, lead to more extreme evaluations. A White job applicant with a weak background has no situational excuse; because people are aware of discrimination, an African American applicant with an identically weak background is probably more competent. Similarly, among equally strong applicants, the African American will be seen as even more impressive than the White applicant, by virtue of getting there by overcoming discrimination. In both cases, all things being equal, on average, people will more positively evaluate the African American applicant.

    Testing plausible alternative hypotheses may not be "the" answer to social psychology's troubles, but it should be in the toolbox, bigtime.

    Lee Jussim

    1. Hi Lee, thanks for dropping by! I wrote a follow-up post on asymmetric tests: http://daniellakens.blogspot.de/2016/03/who-do-you-love-most-your-left-tail-or.html which achieves what you want (test effects in two directions) while giving you power benefits for the effect you care about.

      Your example makes sense, if you are examining a theory that makes both these predictions. There are still many situations where theories make directional predictions, and where we can be more efficient.

  7. Hi Prof Lakens, interesting article. One-sided confidence intervals never seem to be discussed and I wonder if this would be a useful discussion point, along with their interpretation.

    Dan

    1. Our region of certainty extends in all directions around the estimate. Standard-error-based intervals reflect 1.96*SE around the estimate. Profile-based estimates can be asymmetric (and are better - which is why programs like OpenMx support them). A CI may run up against a bound in the model. Some such bounds are statistical anomalies (negative variance, for instance), but others are artificial (like bounding a mean difference to be positive) and should be interpreted as one-sided only in the sense that the author doesn't wish to explore that side of space.

      PS: This stack item is worth reading on profile likelihoods:
      https://stats.stackexchange.com/questions/9833/constructing-confidence-intervals-based-on-profile-likelihood

  8. Dear Daniel, do you think that the use of a one-sided test is still appropriate if from the means it is evident that in the first run of the experiment the changes follow the expected direction, and in the second run there is no change at all?

    1. Yes, because it tests your hypothesis, even if the data of the second study is not in line with your prediction.

  9. I agree with Lee and propose that one should not use one-tailed tests in drug testing or basic biological research, even when one guesses at a direction of change a priori or even when preregistered. This is because a change in the opposite direction may be a vital new counterexample for challenging a hypothesis (a la Popper) that may shed light on a new mechanism or reveal an unexpected toxicity. Sample size calculations reveal that two-tailed tests are often not much more demanding in resources for typical studies in Neuroscience. Try it out in G*Power!

    1. Hi anonymous, I think you are completely wrong, because exploration (finding surprising data) is not hypothesis testing (testing a prediction).

  10. http://rsos.royalsocietypublishing.org/content/1/3/140216

    Your false positive rate won't be 5 percent with alpha of 0.05 if your experiments are underpowered. Could be as much as 30 percent.

  11. Thanks for your comment. This is not related to Type 1 errors. See my blog where I discuss the topic you raise in more detail: http://daniellakens.blogspot.nl/2015/09/how-can-p-005-lead-to-wrong-conclusions.html

  12. Hi Daniel,

    That's a good piece, and I was quite shocked when I understood that most people in a niche I work in use two-tailed tests when the questions they want answered are almost exclusively answerable only by one-tailed tests. I've written a post myself, where I even make the case that two-tailed tests are a misapplication of statistics in almost all cases of practical and scientific research. Even though it is tailored towards testing in online marketing, it should help your readers as well: http://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/
