The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, December 1, 2018

Justify Your Alpha by Decreasing Alpha Levels as a Function of the Sample Size

Testing whether observed data should surprise us, under the assumption that some model of the data is true, is a widely used procedure in psychological science. Tests against a null model, or against the smallest effect size of interest for an equivalence test, can guide your decisions to continue or abandon research lines. Seeing whether a p-value is smaller than an alpha level is rarely the only thing you want to do, but especially early on in experimental research lines where you can randomly assign participants to conditions, it can be a useful thing.

Regrettably, this procedure is performed rather mindlessly. To do Neyman-Pearson hypothesis testing well, you should carefully think about the error rates you find acceptable. How often do you want to miss the smallest effect size you care about, if it is really there? And how often do you want to say there is an effect, but actually be wrong? It is important to justify your error rates when designing an experiment. In this post I will provide one justification for setting the alpha level (something we recommended makes more sense than using a fixed alpha level).

Papers explaining how to justify your alpha level are very rare (for an example, see Mudge, Baker, Edge, & Houlahan, 2012). Here I want to discuss one of the least known, but easiest suggestions on how to justify alpha levels in the literature, proposed by Good. The idea is simple, and has been supported by many statisticians in the last 80 years: Lower the alpha level as a function of your sample size.

The idea behind this recommendation is most extensively discussed in a book by Leamer (1978, p. 92). He writes:

The rule of thumb quite popular now, that is, setting the significance level arbitrarily to .05, is shown to be deficient in the sense that from every reasonable viewpoint the significance level should be a decreasing function of sample size.

Leamer (you can download his book for free) correctly notes that this behavior, an alpha level that is a decreasing function of the sample size, makes sense from both a Bayesian and a Neyman-Pearson perspective. Let me explain.

Imagine a researcher who performs a study that has 99.9% power to detect the smallest effect size the researcher is interested in, based on a test with an alpha level of 0.05. Such a study also has 99.8% power when using an alpha level of 0.03. Feel free to follow along here, by setting the sample size to 204, the effect size to 0.5, alpha or p-value (upper limit) to 0.05, and the p-value (lower limit) to 0.03.

We see that if the alternative hypothesis is true only 0.1% of the observed studies will, in the long run, observe a p-value between 0.03 and 0.05. When the null-hypothesis is true 2% of the studies will, in the long run, observe a p-value between 0.03 and 0.05. Note how this makes p-values between 0.03 and 0.05 more likely when there is no true effect, than when there is an effect. This is known as Lindley’s paradox (and I explain this in more detail in Assignment 1 in my MOOC, which you can also do here).
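The numbers above can be checked with a rough sketch in Python. It rests on assumptions that are not stated in the post: a two-sided test comparing two independent groups of 204 participants each, and a normal approximation instead of the exact t-distribution. The function names are made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_sided(d, n_per_group, alpha):
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    ncp = d * sqrt(n_per_group / 2)      # expected z-statistic under the alternative
    crit = Z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    return (1 - Z.cdf(crit - ncp)) + Z.cdf(-crit - ncp)


def share_between(d, n_per_group, p_low, p_high):
    """Long-run proportion of studies with p_low < p < p_high."""
    return power_two_sided(d, n_per_group, p_high) - power_two_sided(d, n_per_group, p_low)


under_h1 = share_between(0.5, 204, 0.03, 0.05)  # alternative is true (d = 0.5)
under_h0 = share_between(0.0, 204, 0.03, 0.05)  # null is true (d = 0)
print(f"H1 true: {under_h1:.4f}, H0 true: {under_h0:.4f}")
```

Under these assumptions the share of p-values between 0.03 and 0.05 is roughly 0.1% when the alternative is true and 2% when the null is true, which is the pattern described above.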

Although you can argue that you are still making a Type 1 error at most 5% of the time in the above situation, I think it makes sense to acknowledge there is something weird about having a Type 1 error of 5% when you have a Type 2 error of 0.1% (again, see Mudge, Baker, Edge, & Houlahan, 2012, who suggest balancing error rates). To me, it makes sense to design a study where error rates are more balanced, and a significant effect is declared for p-values more likely to occur when the alternative model is true than when the null model is true.

Because power increases as the sample size increases, and because Lindley’s paradox (Lindley, 1957, see also Cousins, 2017) can be prevented by lowering the alpha level sufficiently, the idea to lower the significance level as a function of the sample is very reasonable. But how?

Zellner (1971) discusses how the critical value for a frequentist hypothesis test approaches a limit as the sample size increases (i.e., a critical value of 1.96 for p = 0.05 in a two-sided test) whereas the critical value for a Bayes factor increases as the sample size increases (see also Rouder, Speckman, Sun, Morey, & Iverson, 2009). This difference lies at the heart of Lindley’s paradox, and under certain assumptions comes down to a factor of √n. As Zellner (1971, footnote 19, page 304) writes (K01 is the formula for the Bayes factor):

If a sampling theorist were to adjust his significance level upward as n grows larger, which seems reasonable, zα would grow with n and tend to counteract somewhat the influence of the √n factor in the expression for K01.

Jeffreys (1939) discusses Neyman and Pearson’s work and writes:

We should therefore get the best result, with any distribution of α, by some form that makes the ratio of the critical value to the standard error increase with n. It appears then that whatever the distribution may be, the use of a fixed P limit cannot be the one that will make the smallest number of mistakes.

He discusses the issue more in Appendix B, where he compared his own test (Bayes factors) against Neyman-Pearson decision procedures, and he notes that:

In spite of the difference in principle between my tests and those based on the P integrals, and the omission of the latter to give the increase of the critical values for large n, dictated essentially by the fact that in testing a small departure found from a large number of observations we are selecting a value out of a long range and should allow for selection, it appears that there is not much difference in the practical recommendations. Users of these tests speak of the 5 per cent. point in much the same way as I should speak of the K = 10^(-1/2) point, and of the 1 per cent. point as I should speak of the K = 10^(-1) point; and for moderate numbers of observations the points are not very different. At large numbers of observations there is a difference, since the tests based on the integral would sometimes assert significance at departures that would actually give K > 1. Thus there may be opposite decisions in such cases. But they will be very rare.

So even though extremely different conclusions between Bayes factors and frequentist tests will be rare, according to Jeffreys, when the sample size grows, the difference becomes noticeable.

This brings us to Good’s (1982) easy solution. His paper is basically just a single page (I’d love something akin to a Comments, Conjectures, and Conclusions format in Meta-Psychology! Note that Good himself was the editor of that section, whose instructions started with ‘Please be succinct but lucid and interesting’, and the paper reads just like a blog post).


He also explains the rationale in Good (1992):


‘we have empirical evidence that sensible P values are related to weights of evidence and, therefore, that P values are not entirely without merit. The real objection to P values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’

Based on the observation by Jeffreys (1939) that, under specific circumstances, the Bayes factor against the null-hypothesis is approximately inversely proportional to √N, Good (1982) suggests a standardized p-value to bring p-values in closer relationship with weights of evidence:

pstan = min(1/2, p × √(N/100))

This formula standardizes the p-value to the evidence against the null hypothesis that would have been found if the pstan-value was the tail area probability observed in a sample of 100 participants (I think the formula is only intended for between designs - I would appreciate anyone weighing in in the comments if it can be extended to within-designs). When the sample size is 100, the p-value and pstan are identical. But for larger sample sizes pstan is larger than p. For example, a p = .05 observed in a sample size of 500 would have a pstan of 0.11, which is not enough to reject the null-hypothesis for the alternative. Good (1988) demonstrates great insight when he writes: ‘I guess that standardized p-values will not become standard before the year 2000.’

Good doesn’t give a lot of examples of how standardized p-values should be used in practice, but I guess it makes things easier to think about a standardized alpha level (even though the logic is the same, just like you can double the p-value, or halve the alpha level, when you are correcting for 2 comparisons in a Bonferroni correction). So instead of an alpha level of 0.05, we can think of a standardized alpha level:

αstan = α / √(N/100)

Again, with 100 participants α and αstan are the same, but as the sample size increases above 100, the alpha level becomes smaller. For example, an α of .05 combined with a sample size of 500 gives an αstan of 0.02236.
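To make the arithmetic concrete, here is a minimal Python sketch of both standardizations. It assumes the standardized p-value is p × √(N/100), capped at 1/2, and the standardized alpha is α / √(N/100), which reproduces the worked numbers in this post. The function names are invented for this illustration and do not come from Good's paper.

```python
from math import sqrt


def standardized_p(p, n):
    """Standardized p-value: rescale p to a reference sample of 100 (capped at 1/2)."""
    return min(0.5, p * sqrt(n / 100))


def standardized_alpha(alpha, n):
    """Standardized alpha level: shrink alpha as the sample size grows beyond 100."""
    return alpha / sqrt(n / 100)


for n in (100, 500, 10_000):
    print(n, round(standardized_p(0.05, n), 4), round(standardized_alpha(0.05, n), 5))
```

Comparing the standardized p-value to 0.05 leads to the same decision as comparing the raw p-value to the standardized alpha, as long as the cap at 1/2 is not reached.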

So one way to justify your alpha level is by using a decreasing alpha level as the sample size increases. I for one have always thought it was rather nonsensical to use an alpha level of 0.05 in all meta-analyses (especially when testing a meta-analytic effect size based on thousands of participants against zero), or large collaborative research projects such as Many Labs, where analyses are performed on very large samples. If you have thousands of participants, you have extremely high power for most effect sizes original studies could have detected in a significance test. With such a low Type 2 error rate, why keep the Type 1 error rate fixed at 5%, which is so much larger than the Type 2 error rate in these analyses? It just doesn’t make any sense to me. Alpha levels in meta-analyses or large-scale data analyses should be lowered as a function of the sample size. In case you are wondering: an alpha level of .005 would be used when the sample size is 10,000.

When designing a study based on a specific smallest effect size of interest, where you desire to have decent power (e.g., 90%), we run into a small challenge because in the power analysis we now have two unknowns: The sample size (which is a function of the power, effect size, and alpha), and the standardized alpha level (which is a function of the sample size). Luckily, this is nothing that some R-fu can’t solve by some iterative power calculations. [R code to calculate the standardized alpha level, and perform an iterative power analysis, is at the bottom of the post]
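The iterative idea can be sketched as follows. This is not the R code referred to above, but a Python illustration under my own assumptions: a two-sided test comparing two independent groups, a normal approximation to the power calculation (an exact t-test calculation would give slightly larger samples), and N in the standardization taken to be the total sample size across both groups. All function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def n_per_group(d, alpha, power):
    """Approximate per-group sample size for a two-sided, two-group test."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def standardized_alpha(alpha, total_n):
    """Alpha level lowered as a function of the total sample size."""
    return alpha / sqrt(total_n / 100)


def iterate_design(d, alpha=0.05, power=0.90, max_iter=100):
    """Alternate between sample size and alpha until the sample size stops changing."""
    n = n_per_group(d, alpha, power)  # start from the unstandardized alpha
    for _ in range(max_iter):
        alpha_stan = standardized_alpha(alpha, 2 * n)
        n_new = n_per_group(d, alpha_stan, power)
        if n_new == n:  # alpha and sample size are now consistent with each other
            return n, alpha_stan
        n = n_new
    raise RuntimeError("no stable solution found within max_iter iterations")


n, alpha_stan = iterate_design(d=0.5)
print(f"n per group: {n}, standardized alpha: {alpha_stan:.4f}")
```

Each pass lowers the alpha level for the current sample size and then recomputes the sample size needed at that alpha level, so the loop stops once the two no longer change each other.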

When we wrote Justify Your Alpha (I recommend downloading the original draft before peer review because it has more words and more interesting references) one of the criticisms I heard most often was that we gave no solutions for how to justify your alpha. I hope this post makes it clear that statisticians have argued that the alpha level should not be any fixed value ever since it was invented. There are already some solutions available in the literature. I like Good’s approach because it is simple. In my experience, people like simple solutions. It might not be a full-fledged decision theoretical cost-benefit analysis, but it beats using a fixed alpha level. I recently used it in a submission for a Registered Report. At the same time, I think it has never been used in practice, so I look forward to any comments, conjectures, and conclusions you might have.


References
Good, I. J. (1982). C140. Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16(1), 65–66. https://doi.org/10.1080/00949658208810607
Good, I. J. (1988). The interface between statistics and philosophy of science. Statistical Science, 3(4), 386–397.
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. https://doi.org/10.2307/2290192
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data (1st ed.). New York: Wiley.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.



 

Tuesday, August 28, 2018

Equivalence Testing and the Second Generation P-Value


Recently Blume, D’Agostino McGowan, Dupont, & Greevy (2018) published an article titled: “Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses”. As it happens, I would greatly appreciate more rigor, reproducibility, and transparency in statistical analyses, so my interest was piqued. On Twitter I saw the following slide, promising an updated version of the p-value that can support null-hypotheses, takes practical significance into account, has a straightforward interpretation, and ideally never needs adjustments for multiple comparisons. Now it sounded like someone found the goose that lays the golden eggs.




Upon reading the manuscript, I noticed the statistic is surprisingly similar to equivalence testing, which I’ve written about recently and created an R package for (Lakens, 2017). The second generation p-value (SGPV) relies on specifying an equivalence range of values around the null-hypothesis that are practically equivalent to zero (e.g., 0 ± 0.3). If the estimation interval falls completely within the equivalence range, the SGPV is 1. If the confidence interval lies completely outside of the equivalence range, the SGPV is 0. Otherwise the SGPV is a value between 0 and 1 that expresses the overlap of the confidence interval with the equivalence range, divided by the total width of the confidence interval.

Testing whether the confidence interval falls completely within the equivalence bounds is equivalent to the two one-sided tests (TOST) procedure, where the data is tested against the lower equivalence bound in the first one-sided test, and against the upper equivalence bound in the second one-sided test. If both tests allow you to reject an effect as extreme or more extreme than the equivalence bound, you can reject the presence of an effect large enough to be meaningful, and conclude the observed effect is practically equivalent to zero. You can also simply check if a 90% confidence interval falls completely within the equivalence bounds. Note that testing whether the 95% confidence interval falls completely outside of the equivalence range is known as a minimum-effect test (Murphy, Myors, & Wolach, 2014).
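The link between the two one-sided tests and the 90% confidence interval can be illustrated with a small Python sketch. It uses a normal approximation rather than a t-test to stay within the standard library, and the numbers (a mean difference of 0.05 or 0.20, a standard deviation of 1, n = 100, equivalence bounds of ±0.3) are hypothetical values chosen for illustration, as are the function names.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def tost_p(mean, sd, n, low, high):
    """TOST p-value (normal approximation): the larger of the two one-sided p-values."""
    se = sd / sqrt(n)
    p_lower = 1 - Z.cdf((mean - low) / se)  # H0: true effect <= lower bound
    p_upper = Z.cdf((mean - high) / se)     # H0: true effect >= upper bound
    return max(p_lower, p_upper)


def confidence_interval(mean, sd, n, level=0.90):
    """Symmetric confidence interval around the observed mean."""
    half_width = Z.inv_cdf(1 - (1 - level) / 2) * sd / sqrt(n)
    return mean - half_width, mean + half_width


for observed in (0.05, 0.20):
    p = tost_p(observed, sd=1, n=100, low=-0.3, high=0.3)
    lo, hi = confidence_interval(observed, sd=1, n=100)
    inside = -0.3 < lo and hi < 0.3
    print(f"mean={observed}: TOST p={p:.4f}, 90% CI=({lo:.3f}, {hi:.3f}), inside={inside}")
```

With an alpha of 0.05, the TOST p-value is below alpha exactly when the 90% confidence interval falls inside the equivalence bounds, which is why either check can be used.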

So my collaborator Marie Delacre and I compared the two approaches, to truly understand how second generation p-values accomplished what they were advertised to do, and what they could contribute to our statistical toolbox.

To examine the relation between the TOST p-value and the SGPV we can calculate both statistics across a range of observed effect sizes. In Figure 1 p-values are plotted for the TOST procedure and the SGPV. The statistics are calculated for hypothetical one-sample t-tests for all observed means ranging from 140 to 150 (on the x-axis). The equivalence range is set to 145 ± 2 (i.e., an equivalence range from 143 to 147), the observed standard deviation is assumed to be 2, and the sample size is 30. The SGPV treats the equivalence range as the null-hypothesis, while the TOST procedure treats the values outside of the equivalence range as the null-hypothesis. For ease of comparison we can reverse the SGPV (by calculating 1-SGPV), which is used in the plot below.
 
 
Figure 1: Comparison of p-values from TOST (black line) and 1-SGPV (dotted grey line) across a range of observed sample means (x-axis) tested against a mean of 145 in a one-sample t-test with a sample size of 30 and a standard deviation of 2.

It is clear the SGPV and the p-value from TOST are very closely related. The situation in Figure 1 is not an exception – in our pre-print we describe how the SGPV and the p-value from the TOST procedure are always directly related when confidence intervals are symmetrical. You can play around with this Shiny app to confirm this for yourself: http://shiny.ieis.tue.nl/TOST_vs_SGPV/.

There are 3 situations where the p-value from the TOST procedure and the SGPV are not directly related. The SGPV is 1 when the confidence interval falls completely within the equivalence bounds. P-values from the TOST procedure continue to differentiate and will for example distinguish between a p = 0.048 and p = 0.002. The same happens when the SGPV is 0 (and p-values fall between 0.975 and 1).

The third situation where the TOST and SGPV differ is when the ‘small sample correction’ is at play in the SGPV. This “correction” kicks in whenever the confidence interval is more than twice as wide as the equivalence range. However, it is not a correction in the typical sense of the word, since the SGPV is not adjusted to any ‘correct’ value. When the normal calculation would be ‘misleading’ (i.e., the SGPV would be small, which normally would suggest support for the alternative hypothesis, when all values in the equivalence range are also supported), the SGPV is set to 0.5, which according to Blume and colleagues signals the SGPV is ‘uninformative’. In all three situations the p-value from equivalence tests distinguishes between scenarios where the SGPV yields the same result.
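As a rough illustration of how the SGPV behaves in these situations, the Python sketch below computes it from the bounds of an interval estimate and an equivalence range. It follows my reading of the formula in Blume et al. (2018): the proportion of the interval that overlaps the equivalence range, multiplied by a factor that only exceeds 1 when the interval is more than twice as wide as the range. The function name and the example intervals are hypothetical.

```python
def sgpv(ci_low, ci_high, eq_low, eq_high):
    """Second generation p-value for an interval estimate and an equivalence range."""
    ci_width = ci_high - ci_low
    eq_width = eq_high - eq_low
    # Length of the part of the interval that lies inside the equivalence range
    overlap = max(0.0, min(ci_high, eq_high) - max(ci_low, eq_low))
    # 'Small sample correction': active only when the CI is more than twice as wide
    correction = max(ci_width / (2 * eq_width), 1.0)
    return overlap / ci_width * correction


print(sgpv(-0.1, 0.2, -0.3, 0.3))  # CI inside the equivalence range
print(sgpv(0.4, 0.9, -0.3, 0.3))   # CI outside the equivalence range
print(sgpv(0.1, 0.5, -0.3, 0.3))   # partial overlap, no correction
print(sgpv(-1.0, 1.0, -0.4, 0.4))  # wide CI covering the whole range: correction active
```

In the last call the interval covers the whole equivalence range and is more than twice as wide, so the result lands at 0.5 wherever the interval is centered, as long as it keeps covering the range. That is the flat segment described for the small sample correction.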

We can examine this situation by calculating the SGPV and performing the TOST for a situation where sample sizes are small and the equivalence range is narrow, such that the CI is more than twice as large as the equivalence range.

 
Figure 2: Comparison of p-values from TOST (black line) and SGPV (dotted grey line) across a range of observed sample means (x-axis). Because the sample size is small (n = 10) and the CI is more than twice as wide as the equivalence range (set to -0.4 to 0.4), the SGPV is set to 0.5 (horizontal light grey line) across a range of observed means.


The main novelty of the SGPV is that it is meant to be used as a descriptive statistic. However, we show that the SGPV is difficult to interpret when confidence intervals are asymmetric, and when the 'small sample correction' is operating. For an extreme example, see Figure 3 where the SGPVs are plotted for a correlation (where confidence intervals are asymmetric).

Figure 3: Comparison of p-values from TOST (black line) and 1-SGPV (dotted grey curve) across a range of observed sample correlations (x-axis) tested against equivalence bounds of r = 0.4 and r = 0.8 with n = 10 and an alpha of 0.05.

Even under ideal circumstances, the SGPV is mainly meaningful when it is either 1, 0, or inconclusive (see all examples in Blume et al., 2018). But to categorize your results into one of these three outcomes you don’t need to calculate anything – you can just look at whether the confidence interval falls inside, outside, or overlaps with the equivalence bound, and thus the SGPV loses its value as a descriptive statistic. 

When discussing the lack of a need for error correction, Blume and colleagues compare the SGPV to null-hypothesis tests. However, the more meaningful comparison is with the TOST procedure, and given the direct relationship, not correcting for multiple comparisons will inflate the probability of concluding the absence of a meaningful effect in exactly the same way as when calculating p-values for an equivalence test. Equivalence tests provide an easier and more formal way to control both Type I error rates (by setting the alpha level) and the Type II error rate (by performing an a-priori power analysis, see Lakens, Scheel, & Isager, 2018).

Conclusion

There are strong similarities between p-values from the TOST procedure and the SGPV, and in all situations where the statistics yield different results, the behavior of the p-value from the TOST procedure is more consistent and easier to interpret. More details can be found in our pre-print (where you can also leave comments or suggestions for improvement using hypothes.is). Our comparisons show that when proposing alternatives to null-hypothesis tests, it is important to compare new proposals to already existing procedures. We believe equivalence tests achieve the goals of the second generation p-value while allowing users to more easily control error rates, and while yielding more consistent statistical outcomes.



References
Blume, J. D., D’Agostino McGowan, L., Dupont, W. D., & Greevy, R. A. (2018). Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLOS ONE, 13(3), e0188299. https://doi.org/10.1371/journal.pone.0188299
Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 2515245918770963. https://doi.org/10.1177/2515245918770963
Murphy, K. R., Myors, B., & Wolach, A. H. (2014). Statistical power analysis: a simple and general model for traditional and modern hypothesis tests (Fourth edition). New York: Routledge, Taylor & Francis Group.

Monday, July 2, 2018

Strong versus Weak Hypothesis Tests

The goal of a hypothesis test is to carefully examine whether predictions that are derived from a scientific theory hold up under scrutiny. Not all predictions we can test are equally exciting. For example, if a researcher asks two groups to report their mood on a scale from 1 to 7, and then predicts the difference between these groups will fall within a range of -6 to +6, we know in advance that it must be so. No result can falsify the prediction, and therefore finding a result that corroborates the prediction is completely trivial and a waste of time.

To demonstrate our theory has good predictive validity, we need to divide all possible states of the world into a set that is predicted by our theory, and a set that is not predicted by our theory. We can then collect data, and if the results are in line with our prediction (repeatedly, across replication studies), our theory gains verisimilitude – it seems to be related to the truth. We can never know the truth, but by corroborating theoretical predictions, we can hope to get closer to it.

The most common division of states of the world into those that are predicted and those that are not predicted by a theory in null-hypothesis significance testing is the following: An effect of exactly zero is not predicted by a theory, and all other effects are taken to corroborate the theoretical prediction. Here, I want to explain why this is a very weak hypothesis test. In certain lines of research, it might even be a pretty trivial prediction. Luckily, it is quite easy to perform much stronger tests of hypotheses. I’ll also explain how to do so in practice.

Risky Predictions


Take a look at the three circles below. Each circle represents all possible outcomes of an empirical test of a theory. The blue line illustrates the state of the world that was observed. The line could have fallen anywhere on the circle. We performed a study and found one specific outcome. The black area in the circle represents the states of the world that will be interpreted as falsifying our prediction, whereas the white area represents the states of the world that will be interpreted as corroborating our prediction.




In the figure on the left, only a tiny fraction of states of the world will falsify our prediction. This represents a hypothesis test where only an infinitely small portion of all possible states of the world is not in line with the prediction. A common example is a two-sided null-hypothesis significance test, which forbids (and tries to reject) only the state of the world where the true effect size is exactly zero.

In the middle circle, 50% of all possible outcomes falsify the prediction, and 50% corroborate it. A common example is a one-sided null-hypothesis test. If you predict the mean is larger than zero, this prediction is falsified by all states of the world where the true effect is either equal to zero, or smaller than zero. This means that half of all possible states of the world can no longer be interpreted as corroborating your prediction. The blue line, or observed state of the world in the experiment, happens to fall in the white area for the middle circle, so we can still conclude the prediction is supported. However, our prediction was already slightly more risky than in the circle on the left representing a two-sided test.

In the scenario in the right circle, almost all possible outcomes are not in line with our prediction – only 5% of the circle is white. Again, the blue line, our observed outcome, falls in this white area, and our prediction is confirmed. However, now our prediction is confirmed in a very risky test. There were many ways in which we could be wrong – but we were right regardless.

Although our prediction is confirmed in all three scenarios above, philosophers of science such as Popper and Lakatos would be most impressed after your prediction has withstood the most severe test (i.e., in the scenario illustrated by the right circle). Our prediction was most specific: 95% of possible outcomes were judged as falsifying our prediction, and only 5% of possible outcomes would be interpreted as support for our theory. Despite this high hurdle, our prediction was corroborated. Compare this to the scenario on the left – almost any outcome would have supported our theory. That our prediction was confirmed in the scenario in the left circle is hardly surprising.

Systematic Noise


The scenario on the left, where only a very small part of all possible outcomes is seen as falsifying a prediction, is very similar to how people commonly use null-hypothesis significance tests. In a null-hypothesis significance test, any effect that is not zero is interpreted as support for a theory. Is this impressive? That depends on the possible states of the world. According to Meehl, there are many situations where null-hypothesis significance tests are performed, but the true difference is highly unlikely to be exactly zero. Meehl is especially worried about research where there is room for systematic noise, or the crud factor.

Systematic noise can only be excluded in an ideal experiment. In this ideal experiment, there is perfect random assignment to conditions, and only one single thing can cause a difference, such as in a randomized controlled trial. Perfection is notoriously hard to achieve in practice. In any close to perfect experiment, there can be tiny factors that, although not being the main goal of the experiment, lead to differences between the experimental and control condition. Participants in the experimental condition might read more words, answer more questions, need more time, have to think more deeply, or process more novel information. Any of these things could slightly move the true effect size away from zero – without being related to the independent variable the researchers aimed to manipulate. This is why Meehl calls it systematic noise, and not random noise: The difference is reliable, but not due to something you are theoretically interested in.

Many experiments are not even close to perfect and consequently have a lot of room for systematic noise. And then there are many studies where there isn’t even random assignment to conditions, but where data is correlational. As an example of correlational data, think about research examining differences between women and men. If we examine differences between men and women, the subjects in our study cannot be randomly assigned to a condition. In such non-experimental studies, it is possible that ‘everything is correlated to everything’. For example, men are on average taller than women, and as a consequence it is more common for a man to be asked to pick an object off a high shelf in a supermarket, than vice versa. If we then ask men and women ‘how often do you help strangers’ this average difference in height has some tiny but systematic effect on their responses. In this specific case, systematic noise moves the mean difference from zero to a slightly higher value for men – but an unknown number of other sources of systematic noise are at play, and these interact, leading to an unknown final true population difference that is very unlikely to be exactly zero.

I think there are experiments that, for all practical purposes, are controlled enough to make a null-hypothesis a valid and realistic model to test against. However, I also think that these experiments are much more limited than the current widespread use of null-hypothesis testing. There are many experiments where a test against a null-hypothesis is performed, while the null-hypothesis is not reasonable to entertain, and we can not expect the difference to be exactly zero.

In those studies (e.g., as in the experiment examining gender differences above) it is much more impressive to have a theory that is able to predict how big an effect is (approximately). In other words, we should aim for theories that make point predictions, or a bit more reasonably, given that most sciences have a hard time predicting a single exact value, range predictions.

Range Predictions


Making more risky range predictions has some important benefits over the widespread use of null-hypothesis tests. These benefits mean that even if a null-hypothesis test is defensible, it would be preferable if you could test a range prediction.

Making a more risky prediction gives your theory higher verisimilitude. You will get more credit in darts when you correctly predict you will hit the bullseye, than when you correctly predict you will hit the board. Similarly, you get more credit for the predictive power of your theory when you correctly predict an effect will fall within 0.5 scale points of 8 on a 10 point scale, than when you predict the effect will be larger than the midpoint of the scale. A theory allows you to make predictions, and a good theory allows you to make precise predictions.

Range predictions allow you to design a study that can be falsified based on clear criteria. If you specify the bounds within which an effect should fall, any effect that is either smaller or larger will falsify the prediction. For a traditional null-hypothesis test, an effect of 0.0000001 will officially still fall in the possible states of the world that support the theory. However, it is practically impossible to falsify such tiny differences from zero, because doing so would require huge resources.

To increase the falsifiability of psychological research, the lower bound of the range prediction can be used as the smallest effect size of interest. Designing a study that has high power for this smallest effect size of interest (for example, a Cohen’s d of 0.1) will lead to an informative result. It may be that the threshold for the smallest effect size of interest really is so close to zero (e.g., 0.0000001) that a researcher does not have the resources to design a high-powered study that could falsify this prediction. Specifying this range prediction is still useful, because then it is clear to everyone that we do not have the resources to falsify that prediction.

Many of the criticisms on p-values in null-hypothesis tests disappear when p-values are calculated for range predictions. In a traditional hypothesis test with at least some systematic noise (meaning the true effect differs slightly from zero) all studies where the null is not exactly true will lead to a significant effect with a large enough sample size. This makes it a boring prediction, and we will end up stating there is a ‘significant’ difference for tiny irrelevant effects. I expect this problem will become more important now that it is easier to get access to Big Data.

However, we don’t want just any effect to become statistically significant – we want theoretically relevant effects to be significant, but not theoretically irrelevant effects. A range prediction achieves this. If we expect effects between 0.1 and 0.3, an effect of 0.05 might be statistically different from 0 in a huge sample, but it is not support for our prediction. To provide support for a range prediction your prediction needs to be accurate.
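A small sketch in base R makes this concrete. All numbers here are hypothetical (an observed difference of 0.05, a standard deviation of 1 in both groups, and 100,000 participants per group), chosen only to mimic a tiny effect in a huge sample.

```r
# Hypothetical summary statistics (all values are illustrative assumptions)
diff_obs <- 0.05      # observed mean difference
sd_each  <- 1         # standard deviation in each group
n_each   <- 100000    # participants per group

se <- sqrt(sd_each^2 / n_each + sd_each^2 / n_each)  # SE of the difference
df <- 2 * n_each - 2   # equal n and equal SDs, so Student's df applies

# Two-sided test against a difference of 0
p_null <- 2 * pt(-abs(diff_obs / se), df)

# One-sided test against the lower bound of the predicted range (0.1):
# can we reject effects smaller than 0.1?
p_lower <- pt((diff_obs - 0.1) / se, df, lower.tail = FALSE)

c(p_null = p_null, p_lower = p_lower)
```

The test against 0 is highly significant, but the test against the lower bound of 0.1 is nowhere near significant, so the data do not support the range prediction.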

Testing Range Predictions in Practice 

 

In a null-hypothesis test (visualized below) we compare the data against the hypothesis that the difference is 0 (indicated by the dotted vertical line at 0). The test yields p = 0.047; with an alpha level of 0.05, this is just below the threshold. The observed difference (indicated by the square) has a confidence interval that ranges from almost 0 to 1.69. We can reject the null, but beyond that, we haven’t learned much.



In the example above, we were testing against a mean difference of 0. But there is no reason why a hypothesis test should be limited to testing against a mean difference of 0. Meehl (1967 – yes, that is more than 40 years ago!) compared the use of statistical tests in psychology and physics, and noted that in physics, researchers make point predictions. For example, say a theory predicts a mean difference of 0.35. Let’s assume effects smaller than 0.1 are considered too small to matter, and effects larger than 0.6 are considered too large. Note that the bounds happen to be symmetric around the expected effect size (0.35 ± 0.25), but you can set the bounds wherever you like. It is also perfectly acceptable not to specify an upper bound (in which case you are performing a minimal effects test, where you aim to reject effects smaller than a lower bound).

If you have learned about equivalence testing (see Lakens, Scheel, & Isager, 2018), you might recognize the practice of specifying equivalence bounds, and testing whether effects outside of this equivalence range can be rejected. In most equivalence tests the bounds are set up to fall on either side of 0 (e.g., -0.3 to 0.3), and the goal is to reject effects that are large enough to matter, so that we can conclude the effect is practically equivalent to zero.

But you can use equivalence tests to test any range. If you specify the bounds as ranging from 0.1 to 0.6, you can use, for example, the TOSTER package to test whether the observed effect falls within the range of values you predicted. Below you see the hypothetical output for an experiment with n = 254 in each of two conditions, where ratings on a 7-point scale were collected from an experimental group (M = 5.25, SD = 1.12) and a control group (M = 4.87, SD = 0.98). A mean difference of 0.38 is observed, which is close to our predicted value of 0.35. We can set up an equivalence test to examine whether we can statistically reject effect sizes outside the range we predicted, that is, effects smaller than 0.1 and effects larger than 0.6. The code below performs the test for our range prediction:

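library(TOSTER)

# Test the range prediction (0.1 to 0.6) from summary statistics
TOSTtwo.raw(m1 = 5.25,           # mean of the experimental group
            m2 = 4.87,           # mean of the control group
            sd1 = 1.12,          # SD of the experimental group
            sd2 = 0.98,          # SD of the control group
            n1 = 254,            # sample size of the experimental group
            n2 = 254,            # sample size of the control group
            low_eqbound = 0.1,   # lower bound, in raw scale points
            high_eqbound = 0.6,  # upper bound, in raw scale points
            alpha = 0.05,
            var.equal = FALSE)   # FALSE requests Welch's t-test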



The results show that we can not only reject a mean difference of 0, but also statistically reject values smaller than 0.1 and larger than 0.6:

Using alpha = 0.05 the equivalence test based on Welch's t-test was significant, t(497.2383) = -2.355984, p = 0.009430463
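If you want to see where these numbers come from, the two one-sided Welch tests can be computed by hand from the summary statistics. The sketch below uses only base R; the variable names are my own, and it is meant as a check on the arithmetic rather than a replacement for the package.

```r
# Summary statistics from the example above
m1 <- 5.25; sd1 <- 1.12; n1 <- 254   # experimental group
m2 <- 4.87; sd2 <- 0.98; n2 <- 254   # control group
low <- 0.1; high <- 0.6              # bounds of the predicted range

se <- sqrt(sd1^2 / n1 + sd2^2 / n2)  # standard error of the difference

# Welch-Satterthwaite degrees of freedom
df <- (sd1^2 / n1 + sd2^2 / n2)^2 /
  ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))

t_low  <- ((m1 - m2) - low) / se     # test against the lower bound
t_high <- ((m1 - m2) - high) / se    # test against the upper bound

p_low  <- pt(t_low, df, lower.tail = FALSE)  # H0: difference <= 0.1
p_high <- pt(t_high, df, lower.tail = TRUE)  # H0: difference >= 0.6

# The equivalence test is summarized by the larger of the two p-values
c(t_high = t_high, df = df, p_tost = max(p_low, p_high))
```

The test against the upper bound has the larger p-value here, which is why the output reports a negative t-value: the observed difference of 0.38 lies closer to 0.6 than to 0.1 when expressed in standard errors.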


We have made a riskier prediction than a traditional two-sided hypothesis test, and our prediction was confirmed – impressive!




Note that although Meehl prefers point predictions that lie within a certain bound, he doesn’t completely reject the use of null-hypothesis significance testing. When he asks ‘Is it ever correct to use null-hypothesis significance tests?’ his own answer is ‘Of course it is’ (Meehl, 1990). There are times, such as very early in research lines, where researchers do not have good enough models, or reliable existing data, to make point predictions. Other times, two competing theories are no more precise than this: one predicts that rats in a maze will learn something, while the other predicts that the rats will learn nothing. As Meehl writes: “When I was a rat psychologist, I unabashedly employed significance testing in latent-learning experiments; looking back I see no reason to fault myself for having done so in the light of my present methodological views.”

There are no good or bad statistical approaches – all statistical approaches are just answers to questions. What matters is asking the best possible question. It makes sense to allow traditional null-hypothesis tests early in research lines, when theories do not make more specific predictions than that ‘something’ will happen. But we should also push ourselves to develop theories that make more precise range predictions, and then test these more specific predictions. More mature theories should be able to predict effects in some range – even when these ranges are relatively wide.

The narrower the range you predict, the smaller the confidence interval needs to be to have a high probability of falling within the equivalence bounds (or to have high power for the equivalence test). Collecting a much larger sample size, with the direct real-world costs associated, might not immediately feel worth it, just for the lofty reward of higher verisimilitude (a concept philosophers don’t even know how to quantify!).
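The trade-off between the width of the predicted range and the required sample size can be sketched with a normal approximation to the power of the two one-sided tests. The function `range_power` is a hypothetical helper, and the common standard deviation of 1, the alpha level of 0.05, and the sample sizes are illustrative assumptions.

```r
# Approximate power of a range (equivalence) test under a normal
# approximation. Assumes equal group sizes and a common SD; the default
# SD of 1 and alpha of 0.05 are illustrative assumptions.
range_power <- function(n_each, low, high, true_diff, sd = 1, alpha = 0.05) {
  se <- sd * sqrt(2 / n_each)
  z  <- qnorm(1 - alpha)
  # Both one-sided tests reject when the estimate lies inside
  # the interval (low + z * se, high - z * se)
  upper <- (high - z * se - true_diff) / se
  lower <- (low + z * se - true_diff) / se
  max(0, pnorm(upper) - pnorm(lower))
}

# Same true effect (0.35), wide versus narrow predicted range
power_wide       <- range_power(254,  low = 0.10, high = 0.60, true_diff = 0.35)
power_narrow     <- range_power(254,  low = 0.25, high = 0.45, true_diff = 0.35)
power_narrow_big <- range_power(2000, low = 0.25, high = 0.45, true_diff = 0.35)

round(c(wide = power_wide, narrow = power_narrow,
        narrow_big = power_narrow_big), 2)
```

With 254 participants per group the wide range can be tested with decent power under these assumptions, while the narrow range cannot be confirmed at all, because the interval in which both tests reject is empty. Only a much larger sample makes the narrow prediction testable.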

But thinking about hypothesis tests as range predictions is a useful skill. A two-sided null-hypothesis test sets the range of predictions to anywhere but zero. A one-sided test halves all possible states of the world that are predicted. This is a very efficient way to gain verisimilitude – indeed, because you can now only make a Type 1 error in one direction, you even have the benefit of a small increase in power when performing a one-sided test. You could even go a step further, and instead of testing against the value of 0, acknowledge that there might be some systematic noise you are not interested in, and test against an effect of 0.05 (known as a minimal effects test). And finally, if you have a good theory, and see value in confirming a point prediction, you might want to put in the effort to collect enough data to test a range prediction (e.g., a difference between 0.3 and 0.6). All these tests use the same philosophical and statistical framework but make increasingly narrow range predictions. Thinking more carefully about the range of effects you want to corroborate or falsify, and relying less often on two-sided null-hypothesis tests, will make your hypothesis tests much stronger.
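The power benefit of a one-sided test is easy to check with `power.t.test` from base R. The design values below (50 participants per group, a true difference of 0.5 standard deviations, alpha = 0.05) are illustrative assumptions.

```r
# Illustrative assumptions: 50 participants per group, a true difference
# of 0.5 standard deviations, and alpha = 0.05
power_two_sided <- power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05,
                                alternative = "two.sided")$power
power_one_sided <- power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05,
                                alternative = "one.sided")$power

round(c(two_sided = power_two_sided, one_sided = power_one_sided), 2)
```

For the same design, the one-sided test has higher power, because the full alpha level is spent in the predicted direction.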




References



Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 2515245918770963. https://doi.org/10/gdj7s9

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 103–115.

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.