The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, May 9, 2022

Tukey on Decisions and Conclusions

In 1955 Tukey gave a dinner talk about the difference between decisions and conclusions at a meeting of the Section of Physical and Engineering Science of the American Statistical Association. The talk was published in 1960. The distinction relates directly to different goals researchers might have when they collect data. This blog is largely a summary of his paper.

 


Tukey was concerned about the ‘tendency of decision theory to attempt to conquest all of statistics’. In hindsight, he needn’t have worried. In the social sciences, most statistics textbooks do not even discuss decision theory. His goal was to distinguish decisions from conclusions, and to carve out a space for ‘conclusion theory’ to complement decision theory.

 

In practice, making a decision means to ‘decide to act for the present as if’. Possible actions are defined, possible states of nature identified, and we make an inference about each state of nature. Decisions can be made even when we remain extremely uncertain about any ‘truth’. Indeed, in extreme cases we can even make decisions without access to any data. We might even decide to act as if two mutually exclusive states of nature are true! For example, we might buy a train ticket for a holiday three months from now, but also take out life insurance in case we die tomorrow.   

 

Conclusions differ from decisions. First, conclusions are established without taking consequences into consideration. Second, conclusions are used to build up a ‘fairly well-established body of knowledge’. As Tukey writes: “A conclusion is a statement which is to be accepted as applicable to the conditions of an experiment or observation unless and until unusually strong evidence to the contrary arises.” A conclusion is not a decision on how to act in the present. Conclusions are to be accepted, and thereby incorporated into what Frick (1996) calls a ‘corpus of findings’. According to Tukey, conclusions are used to narrow down the number of working hypotheses still considered consistent with observations. Conclusions should be reached, not based on their consequences, but because of their lasting (but not everlasting, as conclusions can now and then be overturned by new evidence) contribution to scientific knowledge.

 

Tests of hypotheses

 

According to Tukey, a test of hypotheses can have two functions. The first function is as a decision procedure, and the second function is to reach a conclusion. In a decision procedure the goal is to choose a course of action given an acceptable risk. This risk can be high. For example, a researcher might decide not to pursue a research idea after a first study, designed to have 80% power for a smallest effect size of interest, yields a non-significant result. The error rate is at most 20%, but the researcher might have enough good research ideas to not care.
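To make the 80% power example concrete, here is a minimal sketch of the sample size such a first study would need. It assumes a two-sided, two-group comparison and uses a normal approximation (an exact t-test calculation gives a slightly larger number). The function name `n_per_group` and the smallest effect size of interest of d = 0.5 are hypothetical choices for illustration, not values from Tukey's paper.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sesoi_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group test.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the alpha level
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / sesoi_d) ** 2)


# Hypothetical smallest effect size of interest of d = 0.5.
n = n_per_group(sesoi_d=0.5)
type_2_error_rate = 1 - 0.80  # the risk the researcher decides to accept

print(n, type_2_error_rate)
```

Raising the desired power lowers the Type 2 error rate, but at the cost of a larger sample. Whether that cost is worth paying is exactly the kind of trade-off a decision procedure is about.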

 

The second function is to reach a conclusion. This is done, according to Tukey, by controlling the Type 1 and Type 2 error rate at ‘suitably low levels’ (Note: Tukey’s discussion of concluding an effect is absent is hindered somewhat by the fact that equivalence tests were not yet widely established in 1955 – Hodges & Lehmann’s paper appeared in 1954). Low error rates, such as the conventions to use a 5% or 1% alpha level, are needed to draw conclusions that can enter the corpus of findings (even though some of these conclusions will turn out to be wrong, in the long run).

 

Why would we need conclusions?

 

One might reasonably wonder if we need conclusions in science. Tukey also ponders this question in Appendix 2. He writes: “Science, in the broadest sense, is both one of the most successful of human affairs, and one of the most decentralized. In principle, each of us puts his evidence (his observations, experimental or not, and their discussion) before all the others, and in due course an adequate consensus of opinion develops.” He argues that we do, not for an epistemological reason, nor for a statistical reason, but for a sociological reason. Tukey writes: “There are four types of difficulty, then, ranging from communication through assessment to mathematical treatment, each of which by itself will be sufficient, for a long time, to prevent the replacement, in science, of the system of conclusions by a system based more closely on today’s decision theory.” He notes how scientists can no longer get together in a single room (as was somewhat possible in the early decades of the Royal Society of London) to reach consensus about decisions. Therefore, they need to communicate conclusions, as “In order to replace conclusions as the basic means of communication, it would be necessary to rearrange and replan the entire fabric of science.”

 

I hadn’t read Tukey’s paper when we wrote our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests”. In this preprint, we also discuss a sociological reason for the presence of dichotomous claims in science. We also ask: “Would it be possible to organize science in a way that relies less on tests of competing theories to arrive at intersubjectively established facts about phenomena?” and similarly conclude: “Such alternative approaches seem feasible if stakeholders agree on the research questions that need to be investigated, and methods to be utilized, and coordinate their research efforts”.  We should add a citation to Tukey's 1960 paper.

 

Is the goal of a study a conclusion, a decision, or both?

 

Tukey writes he “looks forward to the day when the history and status of tests of hypotheses will have been disentangled.” I think that in 2022 that day has not yet come. At the same time, Tukey admits in Appendix 1 that the two are sometimes intertwined.

 

A situation Tukey does not discuss, but that I think is especially difficult to disentangle, is a cumulative line of research. Although I would prefer to only build on an established corpus of findings, this is simply not possible. Not all conclusions in the current literature are reached with low error rates. This is true both for claims about the absence of an effect (which are rarely based on an equivalence test against a smallest effect size of interest with a low error rate) and for claims about the presence of an effect, not just because of p-hacking, but also because I might want to build on an exploratory finding from a previous study. In such cases, I would like to be able to conclude the effects I build on are established findings, but more often than not, I have to decide these effects are worth building on. The same holds for choices about the design of a set of studies in a research line. I might decide to include a factor in a subsequent study, or drop it. Sometimes these decisions are based on conclusions with low error rates, if I had the resources to collect large samples and perform replication studies, but other times they involve decisions about how to act in my next study with quite considerable risk.

 

We allow researchers to publish feasibility studies, pilot studies, and exploratory studies. We don’t require every study to be a Registered Report or Phase 3 trial. Not all information in the literature that we build on has been established with the rigor Tukey associates with conclusions. And the replication crisis has taught us that more conclusions from the past are later rejected than we might have thought based on the alpha levels reported in the original articles. And in some research areas, where data is scarce, we might need to accept that, if we want to learn anything, the conclusions will always be more tentative (and the error rates accepted in individual studies will be higher) than in research areas where data is abundant.

 

Even if decisions and conclusions cannot be completely disentangled, reflecting on their relative differences is very useful, as I think it can help us to clarify the goal we have when we collect data.

 

For a 2013 blog post by Justin Esarey, who found the distinction a bit less useful than I found it, see https://polmeth.org/blog/scientific-conclusions-versus-scientific-decisions-or-we%E2%80%99re-having-tukey-thanksgiving

 

References

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Tukey, J. W. (1960). Conclusions vs decisions. Technometrics, 2(4), 423–433.

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by

 

 

 

 

 

Tuesday, May 3, 2022

Collaborative Author Involved Replication Studies

Recently a new category of studies has started to appear in the psychological literature that provides the strongest support to date for a replication crisis in psychology: Large scale collaborative replication studies where the authors of the original study are directly involved in the study. These replication studies have often provided conclusive demonstrations of the absence of any effect large enough to matter. Despite considerable attention for these extremely interesting projects, I don’t think the scientific community has fully appreciated what we have learned from these studies.

 

Three examples of Collaborative Author Involved Replication Studies

 

Vohs and colleagues (2021) performed a multi-lab replication study of the ego-depletion effect, which (deservedly) has become a poster child of non-replicable effects in psychology. The teams used different combinations of protocols, allowing an unsuccessful prediction to generalize across minor variations in how the experiment was operationalized. Across these conditions, a non-significant effect was observed of d = 0.06, 95% CI [-0.02, 0.14]. Although the authors regrettably did not specify a smallest effect size of interest in their frequentist analyses, they mention “we pitted a point-null hypothesis, which states that the effect is absent, against an informed one-sided alternative hypothesis centered on a depletion effect (δ) of 0.30 with a standard deviation of 0.15” in their Bayesian analyses. Based on the confidence interval, we can reject effects of d = 0.3, and even d = 0.2, suggesting that we have extremely informative data concerning the absence of an effect most ego-depletion researchers would consider large enough to matter.
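The step from the confidence interval to "we can reject d = 0.2" can be sketched as follows. This is a rough illustration, not the analysis reported by Vohs and colleagues. It assumes a normal sampling distribution, recovers the standard error from the rounded interval quoted above, and performs a one-sided test of the hypothesis that the true effect is at least as large as a given bound. The function name `p_effect_at_least` is hypothetical.

```python
from statistics import NormalDist


def p_effect_at_least(estimate, ci_low, ci_high, bound, level=0.95):
    """One-sided p-value for H0: true effect >= bound (normal approximation).

    The standard error is back-calculated from the reported confidence
    interval, so the result is only as precise as the rounded inputs.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - (1 - level) / 2)
    se = (ci_high - ci_low) / (2 * z_crit)
    return z.cdf((estimate - bound) / se)


# Reported values: d = 0.06, 95% CI [-0.02, 0.14].
p_02 = p_effect_at_least(0.06, -0.02, 0.14, bound=0.2)
p_03 = p_effect_at_least(0.06, -0.02, 0.14, bound=0.3)

print(f"p for d >= 0.2: {p_02:.5f}")
print(f"p for d >= 0.3: {p_03:.2e}")
```

Both p-values fall far below conventional alpha levels, which is the formal counterpart of noting that the upper bound of the interval lies below 0.2.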

 

Morey et al (2021) performed a multi-lab replication study of the Action-Sentence Compatibility effect (Glenberg & Kaschak, 2002). I cited the original paper in my PhD thesis, and it was an important finding that I built on, so I was happy to join this project. As written in the replication study, the replication team, together with the original authors, “established and pre-registered ranges of effects on RT that we would deem (a) uninteresting and inconsistent with the ACE theory: less than 50 ms.” An effect between 50 ms and 100 ms was seen as inconsistent with the previous literature, but in line with predictions of the ACE effect. The replication study consisted (after exclusions) of 903 native English speakers, and 375 non-native English speakers. The original study had used 44, 70, and 72 participants across 3 studies. The conclusion in the replication study was that the median ACE interactions were close to 0 and all within the range that we pre-specified as negligible and inconsistent with the existing ACE literature. There was little heterogeneity.

 

Last week, Many Labs 4 was published (Klein et al., 2022). This study was designed to examine the mortality salience effect (which I think deserves the same poster child status of a non-replicable effect in psychology, but which seems to have gotten less attention so far). Data from 1550 participants was collected across 17 labs, some of which performed the study with involvement of the original author, and some of which did not. Several variations of the analyses were preregistered, but none revealed the predicted effect, Hedges’ g = 0.07, 95% CI = [-0.03, 0.17] (for exclusion set 1). The authors did not provide a formal sample size justification based on a smallest effect size of interest, but in a sensitivity power analysis indicate they had 95% power for effect sizes of d = 0.18 to d = 0.21. If we assume all authors found effect sizes around d = 0.2 small enough to no longer support their predictions, we can see based on the confidence intervals that we can indeed exclude effect sizes large enough to matter. The mortality salience effect, even with involvement of the original authors, seems to be too small to matter. There was little heterogeneity in effect sizes (in part because of the absence of an effect).

 

These are just three examples (there are more, of which the multi-lab test of the facial feedback hypothesis by Coles et al., 2022, is worth highlighting), but they highlight some interesting properties of collaborative author involved replication studies. I will highlight four strengths of these studies.

 

Four strengths of Collaborative Author Involved Replication Studies

 

1) The original authors are extensively involved in the design of the study. They sign off on the final design, and agree that the study is, with the knowledge they currently have, the best test of their prediction. This means the studies tell us something about the predictive validity of state of the art knowledge in a specific field. If the predictions these researchers make are not corroborated, the knowledge we have accumulated in these research areas is not reliable enough to make successful predictions.

2) The studies are not always direct replications, but the best possible test of the hypothesis, in the eyes of the researchers involved. Criticism of past replication studies has been that directly replicating a study performed many years ago is not always insightful, as the context has changed (even though Many Labs 5 found no support for this criticism). In this new category of collaborative author involved replication studies, the original authors are free to design the best possible test of their prediction. If these tests fail, we cannot attribute the failure to replicate to the ‘protective belt’ of auxiliary hypotheses that no longer hold. Of course, it is possible that the theory can be adjusted in a constructive manner after this unsuccessful prediction. But at this moment, these original authors do not have a solid enough understanding of their research topic to be able to predict if an effect will be observed.

3) The other researchers involved in these projects often have extensive expertise in the content area. They are not just researchers interested in mechanistically performing a replication study on a topic they have little expertise with. Instead, many of the researchers are peers who have worked in a specific research area, published on the topic of the replication study, but have collectively developed some doubts about the reliability of past claims, and have decided to spend some of their time replicating a previous finding.

4) The statistical analyses in these studies yield informative conclusions. The studies typically do not conclude the prediction was unsuccessful based on p > 0.05 in a small sample. In the most informative studies, original authors have explicitly specified a smallest effect size of interest, which makes it possible to perform an equivalence test, and statistically reject the presence of any effect deemed large enough to matter. In other cases, Bayesian hypothesis tests are performed which provide support for the null, compared to the alternative, model. This makes these replication studies severe tests of the predicted effect. In cases where original authors did not specify a smallest effect size of interest, the very large sample sizes allow readers to examine effects that can be rejected based on the observed confidence interval, and in all the studies discussed here, we can reject the presence of effects large enough to be considered meaningful. There is most likely not a PhD student in the world who would be willing to examine these effects, given the size that remains possible after these collaborative author involved replication studies. We can never conclude an effect is exactly zero, but that hardly matters – the effects are clearly too small to study.

 

The Steel Man for Replication Crisis Deniers

 

Given the reward structures in science, it is extremely rewarding for individual researchers to speak out against the status quo. Currently, the status quo is that the scientific community has accepted there is a replication crisis. Some people attempt to criticize this belief. This is important. All established beliefs in science should be open to criticism.

Most papers that aim to challenge the fact that many scientific domains have a surprising difficulty successfully replicating findings once believed reliable focus on the 100 studies in the Reproducibility Project: Psychology (RP:P) that was started a decade ago, and published in 2015. This project was incredibly successful in creating awareness of concerns around replicability, but it was not incredibly informative about how big the problem was.

In the conclusion of the RP:P, the authors wrote: “After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. Humans desire certainty, and science infrequently provides it. As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation.” The RP:P was an important project, but it is no longer the project to criticize if you want to provide evidence against the presence of a replication crisis.

Since the start of the RP:P, other projects have aimed to complement our insights about replicability. Registered Replication Reports focused on single studies, replicated in much larger sample sizes, to reduce the probability of a Type 2 error. These studies often quite conclusively showed original studies did not replicate, and a surprisingly large number yielded findings not statistically different from 0, despite sample sizes much larger than psychologists would be able to collect in normal research lines. Many Labs studies focused on a smaller set of studies, replicated many times, sometimes with minor variations to examine the role of possible moderators proposed to explain failures to replicate, which were typically absent.

The collaborative author involved replications are the latest addition to this expanding literature that consistently shows great difficulties in replicating findings. I believe they currently make up the steel man for researchers motivated to cast doubt on the presence of a replication crisis. I believe the fact that these large projects with direct involvement of the original authors cannot find support for predicted effects is the strongest evidence to date that we have a problem replicating findings. Of course, these studies are complemented by Registered Replication Reports and Many Labs studies, and together they make up the Steel Man to argue against if you are a Replication Crisis Denier.

 

References

Coles, N. A., March, D. S., Marmolejo-Ramos, F., Larsen, J., Arinze, N. C., Ndukaihe, I., Willis, M., Francesco, F., Reggev, N., Mokady, A., Forscher, P. S., Hunter, J., Gwenaël, K., Yuvruk, E., Kapucu, A., Nagy, T., Hajdu, N., Tejada, J., Freitag, R., … Marozzi, M. (2022). A Multi-Lab Test of the Facial Feedback Hypothesis by The Many Smiles Collaboration. PsyArXiv. https://doi.org/10.31234/osf.io/cvpuw

 

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Hilgard, J., Ahn, P. H., Brady, A. J., Chartier, C. R., Christopherson, C. D., Clay, S., Collisson, B., Crawford, J. T., Cromar, R., Gardiner, G., Gosnell, C. L., Grahe, J., Hall, C., Howard, I., … Ratliff, K. A. (2022). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement. Collabra: Psychology, 8(1), 35271. https://doi.org/10.1525/collabra.35271

 

Morey, R. D., Kaschak, M. P., Díez-Álamo, A. M., Glenberg, A. M., Zwaan, R. A., Lakens, D., Ibáñez, A., García, A., Gianelli, C., Jones, J. L., Madden, J., Alifano, F., Bergen, B., Bloxsom, N. G., Bub, D. N., Cai, Z. G., Chartier, C. R., Chatterjee, A., Conwell, E., … Ziv-Crispel, N. (2021). A pre-registered, multi-lab non-replication of the action-sentence compatibility effect (ACE). Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-021-01927-8

 

Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J. L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi, J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay, S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733

 

Saturday, November 20, 2021

Why p-values should be interpreted as p-values and not as measures of evidence

Update: Florian Hartig has also published a blog post criticizing the paper by Muff et al (2021). 

In a recent paper Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence”. This is a surprising recommendation, given that p-values are not valid measures of evidence (Royall, 1997). The authors follow Bland (2015) who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis” and proposes verbal labels for p-values in specific ranges (i.e., p-values above 0.1 are ‘little to no evidence’, p-values between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman & Royall, 1988). If the null-hypothesis is true, p-values are uniformly distributed. This means it is just as likely to observe a p-value of 0.001 as it is to observe a p-value of 0.999. This indicates that the interpretation of p = 0.001 as ‘strong evidence’ cannot be defended just because the probability to observe this p-value is very small. After all, if the null hypothesis is true, the probability of observing p = 0.999 is exactly as small.
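The uniformity of p-values under the null hypothesis is easy to check by simulation. The sketch below is a minimal illustration under simplifying assumptions: a one-sample z-test with known standard deviation, a hypothetical sample size of 20, and 10,000 simulated studies. None of these choices come from Muff and colleagues.

```python
import random
from statistics import NormalDist

random.seed(2021)  # fixed seed so the simulation is reproducible
std_normal = NormalDist()


def null_p_value(n=20):
    """Two-sided p-value of a one-sample z-test when the true mean is 0."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) * n ** 0.5  # known sigma = 1
    return 2 * (1 - std_normal.cdf(abs(z)))


p_values = [null_p_value() for _ in range(10_000)]

# Under the null, equally wide intervals should be (roughly) equally likely.
share_low = sum(p <= 0.05 for p in p_values) / len(p_values)
share_high = sum(p >= 0.95 for p in p_values) / len(p_values)

print(f"share of p <= 0.05: {share_low:.3f}")
print(f"share of p >= 0.95: {share_high:.3f}")
```

Both shares should land near 5%, which illustrates why a small p-value cannot count as evidence merely because it is rare under the null: an equally narrow band of large p-values is just as rare.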

The reason that small p-values can be used to guide us in the direction of true effects is not because they are rarely observed when the null-hypothesis is true, but because they are relatively less likely to be observed when the null hypothesis is true, than when the alternative hypothesis is true. For this reason, statisticians have argued that the concept of evidence is necessarily ‘relative’. We can quantify evidence in favor of one hypothesis over another hypothesis, based on the likelihood of observing data when the null hypothesis is true, compared to this probability when an alternative hypothesis is true. As Royall (1997, p. 8) explains: “The law of likelihood applies to pairs of hypotheses, telling when a given set of observations is evidence for one versus the other: hypothesis A is better supported than B if A implies a greater probability for the observations than B does. This law represents a concept of evidence that is essentially relative, one that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569) write, “The p-value is not adequate for inference because the measurement of evidence requires at least three components: the observations, and two competing explanations for how they were produced.”

In practice, the problem of interpreting p-values as evidence in absence of a clearly defined alternative hypothesis is that they at best serve as proxies for evidence, but not as a useful measure where a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. In practice, when researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will be correlated with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and in general this correlation is too weak for p-values to function as a valid measure of evidence, where p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.

 

Why single p-values cannot be interpreted as the strength of evidence

 

The evidential value of a single p-value depends on the statistical power of the test (i.e., on the sample size in combination with the effect size of the alternative hypothesis). The statistical power expresses the probability of observing a p-value smaller than the alpha level if the alternative hypothesis is true. When the null hypothesis is true, statistical power is formally undefined, but in practice in a two-sided test α% of the observed p-values will fall below the alpha level, as p-values are uniformly distributed under the null-hypothesis. The horizontal grey line in Figure 1 illustrates the expected p-value distribution for a two-sided independent t-test if the null-hypothesis is true (that is, when the true effect size Cohen’s d is 0). As every p-value is equally likely, they cannot quantify the strength of evidence against the null hypothesis.

 

Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve) and 99% (dotted black curve). 



If the alternative hypothesis is true the strength of evidence that corresponds to a p-value depends on the statistical power of the test. If power is 50%, we should expect that 50% of the observed p-values fall below the alpha level. The remaining p-values fall above the alpha level. The black curve in Figure 1 illustrates the p-value distribution for a test with a statistical power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when there is a true effect that is examined in a statistical test with 50% power than when the null hypothesis is true (as illustrated by the black curve being above the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence for an alternative hypothesis examined with 50% power, compared to the null hypothesis.

If an effect is examined in a test with 99% power (the dotted line in Figure 1) we would draw a different conclusion. With such high power p-values larger than the alpha level of 5% are rare (they occur only 1% of the time) and a p-value of 0.168 is much more likely to be observed when the null-hypothesis is true than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is evidence against an alternative hypothesis examined with 99% power, compared to the null hypothesis.

Figure 1 illustrates that with 99% power even a ‘statistically significant’ p-value of 0.04 is evidence for the null-hypothesis. The reason for this is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the grey horizontal line at p = 0.04 is above the dotted black curve). This fact, which is often counterintuitive when first encountered, is known as the Lindley paradox, or the Jeffreys-Lindley paradox (for a discussion, see Spanos, 2013).
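The comparisons read off Figure 1 can be approximated numerically. The sketch below is not the code behind the figure. It swaps the t-test for a two-sided z-test (a normal approximation), in which the p-value density under the null is 1 everywhere, so the density under the alternative directly gives the likelihood ratio against the null. The function names and the bisection search for the noncentrality parameter are illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()


def power(delta, alpha=0.05):
    """Power of a two-sided z-test with noncentrality parameter delta."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(delta - z_crit) + Z.cdf(-delta - z_crit)


def delta_for_power(target, alpha=0.05):
    """Find the noncentrality that yields the target power (bisection)."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if power(mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def p_density(p, delta):
    """Density of the two-sided p-value at p; equals 1 when delta = 0."""
    z_p = Z.inv_cdf(1 - p / 2)
    return (Z.pdf(z_p - delta) + Z.pdf(z_p + delta)) / (2 * Z.pdf(z_p))


delta_50 = delta_for_power(0.50)
delta_99 = delta_for_power(0.99)

# Values above 1 favor the alternative, values below 1 favor the null.
print(f"p = 0.168, 50% power: {p_density(0.168, delta_50):.2f}")
print(f"p = 0.168, 99% power: {p_density(0.168, delta_99):.3f}")
print(f"p = 0.04,  99% power: {p_density(0.04, delta_99):.2f}")
```

Under this approximation the same p-value of 0.168 is slightly more likely under the alternative with 50% power, and far less likely under the alternative with 99% power, while p = 0.04 is also less likely under the 99% power alternative than under the null.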

Figure 1 illustrates that different p-values can correspond to the same relative evidence in favor of a specific alternative hypothesis, and that the same p-value can correspond to different levels of relative evidence. This is obviously undesirable if we want to use p-values as a measure of the strength of evidence. Therefore, it is incorrect to verbally label any p-value as providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis, as depending on the alternative hypothesis a researcher is interested in, the level of evidence will differ (and the p-value could even correspond to evidence in favor of the null hypothesis).

 

All p-values smaller than 1 correspond to evidence for some non-zero effect

 

If the alternative hypothesis is not specified, any p-value smaller than 1 should be treated as at least some evidence (however small) for some alternative hypotheses. It is therefore not correct to follow the recommendations of the authors in their Table 2 to interpret p-values above 0.1 (e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against the arguments by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies with a p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis based on p < 0.05 (see the forest plot in Figure 2). It thus seems ill-advised to follow their recommendation to describe a single study with p = 0.168 as ‘no evidence’ for a relationship.

 

Figure 2: Forest plot for a meta-analysis of three identical studies yielding p = 0.168.
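The arithmetic behind this meta-analytic claim can be sketched with Stouffer's method, which for identical, equally weighted studies gives the same answer as a fixed-effect meta-analysis under a normal approximation. This is an illustration rather than the analysis behind Figure 2. It assumes all three effects point in the same direction, and the function name `combine_same_direction` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()


def combine_same_direction(p_values):
    """Stouffer's equally weighted combination of two-sided p-values.

    Assumes every study observed an effect in the same direction, so each
    two-sided p-value maps onto a positive z-score.
    """
    z_scores = [Z.inv_cdf(1 - p / 2) for p in p_values]
    z_combined = sum(z_scores) / len(z_scores) ** 0.5
    return 2 * (1 - Z.cdf(z_combined))


combined_p = combine_same_direction([0.168, 0.168, 0.168])
print(f"combined two-sided p-value: {combined_p:.4f}")
```

Three results that would each be labeled ‘no evidence’ combine into a p-value below 0.05, which is hard to reconcile with the idea that each individual study contributed nothing.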


However, replacing the label of ‘no evidence’ with the label ‘at least some evidence for some hypotheses’ leads to practical problems when communicating the results of statistical tests. It seems generally undesirable to allow researchers to interpret any p-value smaller than 1 as ‘at least some evidence’ against the null hypothesis. This is the price one pays for not specifying an alternative hypothesis, and trying to interpret p-values from a null hypothesis significance test in an evidential manner. If we do not specify the alternative hypothesis, it becomes impossible to conclude there is evidence for the null hypothesis, and we cannot statistically falsify any hypothesis (Lakens, Scheel, et al., 2018). Some would argue that if you cannot falsify hypotheses, you have a bit of a problem (Popper, 1959).

 

Interpreting p-values as p-values

 

Instead of interpreting p-values as measures of the strength of evidence, we could consider a radical alternative: interpret p-values as p-values. This would, perhaps surprisingly, solve the main problems that Muff and colleagues aim to address, namely ‘black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff’. The idea to interpret p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative statistical frequentist philosophy was developed by Neyman and Pearson (1933a) who propose to use p-values to guide decisions about the null and alternative hypothesis by, in the long run, controlling the Type I and Type II error rate. Researchers specify an alpha level and design a study with a sufficiently high statistical power, and reject (or fail to reject) the null hypothesis.

Neyman and Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. First, Neyman and Pearson (1933b) leave open whether the states of the world are divided in two (‘accept’ and ‘reject’) or three regions, and write that a “region of doubt may be obtained by a further subdivision of the region of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics is to test range predictions instead of limiting oneself to a null hypothesis significance test (Lakens, 2021). This implements the idea of Neyman and Pearson to introduce a region of doubt, and distinguishes inconclusive results (where neither the null hypothesis nor the alternative hypothesis can be rejected, and more data needs to be collected to draw a conclusion) from conclusive results (where either the null hypothesis or the alternative hypothesis can be rejected).
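One way to picture such a region of doubt is to pair a null hypothesis test with an equivalence test against a smallest effect size of interest (SESOI). The sketch below is one possible implementation under a normal approximation, not a procedure taken from Neyman and Pearson or from Lakens (2021). The function name `classify`, the labels it returns, and the example numbers are all hypothetical.

```python
from statistics import NormalDist


def classify(estimate, se, sesoi, alpha=0.05):
    """Classify a result using a null test plus an equivalence test."""
    z = NormalDist()
    # Two-sided test of H0: effect = 0.
    p_null = 2 * (1 - z.cdf(abs(estimate) / se))
    # Two one-sided tests of H0: |effect| >= sesoi.
    p_upper = z.cdf((estimate - sesoi) / se)
    p_lower = 1 - z.cdf((estimate + sesoi) / se)
    reject_null = p_null < alpha
    reject_alternative = max(p_upper, p_lower) < alpha

    if reject_null and reject_alternative:
        return "non-zero, but smaller than the SESOI"
    if reject_null:
        return "reject the null hypothesis"
    if reject_alternative:
        return "reject effects as large as the SESOI"
    return "inconclusive: collect more data"


# Hypothetical estimates and standard errors, with a SESOI of d = 0.2.
print(classify(0.50, 0.10, sesoi=0.2))
print(classify(0.06, 0.04, sesoi=0.2))
print(classify(0.15, 0.10, sesoi=0.2))
```

The third example falls in the region of doubt: the estimate is compatible both with no effect and with an effect as large as the SESOI, so neither hypothesis can be rejected.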

In a Neyman-Pearson approach to hypothesis testing the act of rejecting a hypothesis comes with a maximum long run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of rejections must be only tentative.” So when we reject the null model, we do so tentatively, aware of the fact we might have done so in error, and without necessarily believing the null model is false. For Neyman (1957, p. 13) inferential behavior is an “act of will to behave in the future (perhaps until new experiments are performed) in a particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.
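
The phrase ‘maximum long run probability’ can be illustrated with a small simulation. The sketch below is only an illustration under assumed conditions (a two-sided one-sample z-test, normally distributed data with a known standard deviation of 1, and a hypothetical function name `long_run_rejection_rate`). When the null hypothesis is true, the proportion of simulated studies that reject it should settle close to the alpha level.

```python
import random
from statistics import NormalDist


def long_run_rejection_rate(true_effect, n, alpha=0.05, n_studies=10_000, seed=1):
    """Proportion of simulated studies in which H0: mean = 0 is rejected.

    Each study draws n observations from Normal(true_effect, 1) and runs a
    two-sided z-test that treats the standard deviation of 1 as known.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean * n ** 0.5  # mean divided by its standard error 1/sqrt(n)
        if abs(z) > critical_z:
            rejections += 1
    return rejections / n_studies


# True effect of zero: every rejection is a Type I error.
print(long_run_rejection_rate(true_effect=0.0, n=20))
# True effect of 0.5: the rejection rate estimates the statistical power.
print(long_run_rejection_rate(true_effect=0.5, n=20))
```

No single rejection in this simulation tells us whether that particular study was in error. The error rate is a property of the procedure, which is why each individual rejection remains tentative.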

Furthermore, it is important to remember that hypothesis tests reject a statistical hypothesis, but not a theoretical hypothesis. As Neyman (1960, p. 290) writes: “the frequency of correct conclusions regarding the statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis in a specific experiment does not necessarily inform us about the truth of the theory. Decisions about the truthfulness of a theory require a careful evaluation of the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).

Neyman (1976) provides some reporting examples that reflect his philosophy on statistical inferences: “after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that "high" scores on "potential" and on "education" are indicative of better chances of success in the drive to home ownership”. An example of a shorter statement that Neyman provides reads: “As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.”

A complete verbal description of the result of a Neyman-Pearson hypothesis test acknowledges two sources of uncertainty. First, the assumptions of the statistical test must be met (e.g., the data are normally distributed), or any deviations should be small enough to not have any substantial effect on the frequentist error rates. Second, conclusions are made “Without hoping to know whether each separate hypothesis is true or false” (Neyman & Pearson, 1933a). Any single conclusion can be wrong, and assuming the test assumptions are met, we make claims under a known maximum error rate (which is never zero). Future replication studies are needed to provide further insights about whether the current conclusion was erroneous or not.

After observing a p-value smaller than the alpha level, one can therefore conclude: “Until new data emerges that proves us wrong, we decide to act as if there is an effect, while acknowledging that the methodological procedure we base this decision on has a maximum error rate of alpha% (assuming the statistical assumptions are met), which we find acceptably low.” One can follow such a statement about the observed data with a theoretical inference, such as “assuming our auxiliary hypotheses hold, the result of this statistical test corroborates our theoretical hypothesis”. If a conclusive test result in an equivalence test is observed that allows a researcher to reject the presence of any effect large enough to be meaningful, the conclusion would be that the test result does not corroborate the theoretical hypothesis.

The criticism that the common application of null hypothesis significance testing in science is based on an arbitrary threshold of 0.05 is valid (Lakens, Adolfi, et al., 2018). There are surprisingly few attempts to provide researchers with practical approaches to determine an alpha level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021; Miller & Ulrich, 2019; Mudge et al., 2012). The problem seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science where the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which a researcher is not expected to make a claim about a finding (e.g., a BF < 3, see Kass & Raftery, 1995, or a likelihood ratio lower than k = 8, see Royall, 2000). Although it is true that an alpha level of 0.05 is arbitrary, there are some pragmatic arguments in its favor (e.g., it is established, and it might be low enough to yield claims that are taken seriously, but not high enough to prevent other researchers from attempting to refute the claim, see Uygun Tunç et al., 2021).
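
One family of approaches in the papers cited above chooses the alpha level that minimizes a combination of the Type I and Type II error rates. The sketch below is written in that spirit and should not be read as the implementation from any of those papers. It assumes a two-sided one-sample z-test with a known standard deviation of 1, equal prior probabilities for the null and alternative hypothesis, and a simple grid search. The function names and the `cost_ratio` parameter are hypothetical.

```python
from statistics import NormalDist


def type_2_error(alpha, effect, n):
    """Type II error rate of a two-sided one-sample z-test (known SD = 1)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5  # expected z-statistic if the effect is real
    power = (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)
    return 1 - power


def alpha_minimizing_errors(effect, n, cost_ratio=1.0):
    """Grid-search the alpha that minimizes the weighted combined error rate.

    cost_ratio states how many times worse a Type I error is considered to be
    than a Type II error (an assumption the researcher has to justify).
    """
    candidates = [i / 10_000 for i in range(1, 5_000)]  # 0.0001 to 0.4999

    def combined_error(alpha):
        beta = type_2_error(alpha, effect, n)
        return (cost_ratio * alpha + beta) / (cost_ratio + 1)

    return min(candidates, key=combined_error)


for n in (20, 80):
    print(n, alpha_minimizing_errors(effect=0.5, n=n))
```

Under these assumptions the selected alpha becomes smaller as the sample size grows, and smaller still when Type I errors are weighted more heavily. The point of the sketch is only that an alpha level can follow from stated costs and design choices instead of convention.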

 

Is there really no agreement on best practices in sight?

 

One major impetus for the flawed proposal to interpret p-values as evidence by Muff and colleagues is that “no agreement on a way forward is in sight”. The statement that there is little agreement among statisticians is an oversimplification. I will go out on a limb and state some things I assume most statisticians agree on. First, there are multiple statistical tools one can use, and each tool has its own strengths and weaknesses. Second, there are different statistical philosophies, each with their own coherent logic, and researchers are free to analyze data from the perspective of one or multiple of these philosophies. Third, one should not misuse statistical tools, or apply them to attempt to answer questions the tool was not designed to answer.

It is true that there is variation in the preferences individuals have about which statistical tools should be used, and which philosophies of statistics researchers should adopt. This should not be surprising. Individual researchers differ in which research questions they find interesting within a specific content domain, and similarly, they differ in which statistical questions they find interesting when analyzing data. Individual researchers differ in which approaches to science they adopt (e.g., a qualitative or a quantitative approach), and similarly, they differ in which approach to statistical inferences they adopt (e.g., a frequentist or Bayesian approach). Luckily, there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences. It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is. Researchers can believe it is important for reliable knowledge generation to control error rates when making scientific claims, while at the same time believing that it is important to quantify relative evidence using likelihoods or Bayes factors (for example by presenting a Bayes factor alongside every p-value for a statistical test, Lakens et al., 2020).
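
As a small illustration of reporting relative evidence alongside a p-value, the sketch below computes a two-sided p-value and a likelihood ratio for the same observed z-statistic. It is a simplified example and not the method of Lakens et al. (2020): it compares two point hypotheses under a normal model, where the null hypothesis predicts an expected z-statistic of 0 and the alternative predicts a value the researcher has to specify. The function name and argument names are hypothetical.

```python
from statistics import NormalDist


def p_value_and_likelihood_ratio(z_observed, z_expected_under_h1):
    """Return the two-sided p-value and the likelihood ratio of H1 over H0.

    H0 predicts an expected z-statistic of 0; H1 predicts z_expected_under_h1.
    Both are treated as point hypotheses with unit variance.
    """
    null_model = NormalDist(mu=0.0, sigma=1.0)
    alternative_model = NormalDist(mu=z_expected_under_h1, sigma=1.0)

    p_value = 2 * (1 - null_model.cdf(abs(z_observed)))
    likelihood_ratio = alternative_model.pdf(z_observed) / null_model.pdf(z_observed)
    return p_value, likelihood_ratio


# The same observed z = 2 evaluated against two different alternatives.
print(p_value_and_likelihood_ratio(z_observed=2.0, z_expected_under_h1=2.0))
print(p_value_and_likelihood_ratio(z_observed=2.0, z_expected_under_h1=5.0))
```

Both calls return the same p-value of about 0.046, but the likelihood ratio is about 7.4 in favor of the alternative in the first case and about 0.08 (favoring the null) in the second. Relative evidence depends on which alternative is specified, which is information a p-value from a null hypothesis significance test does not contain.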

Whatever approach to statistical inferences researchers choose to use, the approach should answer a meaningful statistical question (Hand, 1994), it should be logically coherent, and it should be applied correctly. Despite the common statement in the literature that p-values can be interpreted as measures of evidence, the criticism against the coherence of this approach should give us pause. Given that coherent alternatives exist, such as likelihoods (Royall, 1997) or Bayes factors (Kass & Raftery, 1995), researchers should not follow the recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p = 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.

 

References

Bland, M. (2015). An introduction to medical statistics (Fourth edition). Oxford University Press.

Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568–1574.

Hand, D. J. (1994). Deconstructing Statistical Questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3), 317–356. https://doi.org/10.2307/2983526

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572

Kim, J. H., & Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic Approach. Abacus, 57(1), 27–71. https://doi.org/10.1111/abac.12172

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16–26. https://doi.org/10.1037//0003-066X.56.1.16

Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2020). Improving Inferences About Null Effects With Bayes Factors and Equivalence Tests. The Journals of Gerontology: Series B, 75(1), 45–57. https://doi.org/10.1093/geronb/gby065

Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963

Maier, M., & Lakens, D. (2021). Justify Your Alpha: A Primer on Two Practical Approaches. PsyArXiv. https://doi.org/10.31234/osf.io/ts4r6

Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734

Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7–22. https://doi.org/10.2307/1401671

Neyman, J. (1960). First course in probability and statistics. Holt, Rinehart and Winston.

Neyman, J. (1976). Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics - Theory and Methods, 5(8), 737–751. https://doi.org/10.1080/03610927608827392

Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009

Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29(04), 492–510. https://doi.org/10.1017/S030500410001152X

Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall/CRC.

Royall, R. (2000). On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95(451), 760–768.

Spanos, A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80(1), 73–93.

Uygun Tunç, D., & Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework. In Meta-Psychology. https://doi.org/10.31234/osf.io/pdm7y

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by