The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, November 20, 2021

Why p-values should be interpreted as p-values and not as measures of evidence

In a recent paper Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence”. This is a surprising recommendation, given that p-values are not valid measures of evidence (Royall, 1997). The authors follow Bland (2015), who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis”, and propose verbal labels for p-values in specific ranges (e.g., p-values above 0.1 are ‘little to no evidence’, p-values between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman & Royall, 1988). If the null-hypothesis is true, p-values are uniformly distributed. This means it is just as likely to observe a p-value of 0.001 as it is to observe a p-value of 0.999. This indicates that the interpretation of p = 0.001 as ‘strong evidence’ cannot be defended just because the probability of observing this p-value is very small. After all, if the null hypothesis is true, the probability of observing p = 0.999 is exactly as small.
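To see the uniform distribution of p-values under the null hypothesis for yourself, here is a minimal simulation sketch in R (the sample size of 50 per group and the number of simulations are arbitrary choices for illustration):

# Simulate p-values from two-sided independent t-tests when the null hypothesis is true
set.seed(123)
p <- replicate(1e5, t.test(rnorm(50), rnorm(50))$p.value)  # takes a little while to run

# The distribution is uniform: every interval of the same width is equally likely
hist(p, breaks = 20, main = "p-value distribution when H0 is true", xlab = "p-value")

# A p-value near 0.001 is observed about as often as a p-value near 0.999
mean(p > 0.000 & p < 0.002)  # ~0.002
mean(p > 0.998 & p < 1.000)  # ~0.002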

The reason that small p-values can be used to guide us in the direction of true effects is not because they are rarely observed when the null-hypothesis is true, but because they are relatively less likely to be observed when the null hypothesis is true than when the alternative hypothesis is true. For this reason, statisticians have argued that the concept of evidence is necessarily ‘relative’. We can quantify evidence in favor of one hypothesis over another hypothesis, based on the likelihood of observing the data when the null hypothesis is true, compared to this probability when an alternative hypothesis is true. As Royall (1997, p. 8) explains: “The law of likelihood applies to pairs of hypotheses, telling when a given set of observations is evidence for one versus the other: hypothesis A is better supported than B if A implies a greater probability for the observations than B does. This law represents a concept of evidence that is essentially relative, one that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569) write, “The p-value is not adequate for inference because the measurement of evidence requires at least three components: the observations, and two competing explanations for how they were produced.”

In practice, the problem with interpreting p-values as evidence in the absence of a clearly defined alternative hypothesis is that they at best serve as proxies for evidence, not as a useful measure in which a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. When researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will be correlated with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and in general it is too weak for p-values to function as a valid measure of evidence in which p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.

 

Why single p-values cannot be interpreted as the strength of evidence

 

The evidential value of a single p-value depends on the statistical power of the test (i.e., on the sample size in combination with the effect size of the alternative hypothesis). The statistical power expresses the probability of observing a p-value smaller than the alpha level if the alternative hypothesis is true. When the null hypothesis is true, statistical power is formally undefined, but in practice, in a two-sided test, a proportion α of the observed p-values will fall below the alpha level, as p-values are uniformly distributed under the null-hypothesis. The horizontal grey line in Figure 1 illustrates the expected p-value distribution for a two-sided independent t-test if the null-hypothesis is true (i.e., when the true effect size, Cohen’s d, is 0). As every p-value is equally likely, p-values can not quantify the strength of evidence against the null hypothesis.
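As an illustration of how power links the alternative hypothesis to the expected p-value distribution, here is a minimal sketch in R (the effect size of d = 0.5 and the sample size of 64 per group are hypothetical choices, not the values underlying Figure 1):

# Power = the probability of observing p < alpha when the alternative hypothesis is true
# For a two-sided independent t-test with alpha = 0.05 and a true effect of d = 0.5:
power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
# power is approximately 0.80

# Verify by simulation: the proportion of p-values below the alpha level approximates the power
set.seed(123)
p <- replicate(1e4, t.test(rnorm(64, mean = 0.5), rnorm(64))$p.value)
mean(p < 0.05)  # ~0.80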

 

Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve) and 99% (dotted black curve). 



If the alternative hypothesis is true the strength of evidence that corresponds to a p-value depends on the statistical power of the test. If power is 50%, we should expect that 50% of the observed p-values fall below the alpha level. The remaining p-values fall above the alpha level. The black curve in Figure 1 illustrates the p-value distribution for a test with a statistical power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when there is a true effect that is examined in a statistical test with 50% power than when the null hypothesis is true (as illustrated by the black curve being above the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence for an alternative hypothesis examined with 50% power, compared to the null hypothesis.

If an effect is examined in a test with 99% power (the dotted line in Figure 1) we would draw a different conclusion. With such high power p-values larger than the alpha level of 5% are rare (they occur only 1% of the time) and a p-value of 0.168 is much more likely to be observed when the null-hypothesis is true than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is evidence against an alternative hypothesis examined with 99% power, compared to the null hypothesis.

Figure 1 illustrates that with 99% power even a ‘statistically significant’ p-value of 0.04 is evidence in favor of the null-hypothesis. The reason for this is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the grey horizontal line at p = 0.04 is above the dotted black curve). This fact, which is often counterintuitive when first encountered, is known as the Lindley paradox, or the Jeffreys-Lindley paradox (for a discussion, see Spanos, 2013).
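To make these comparisons concrete, here is a minimal sketch in R that computes how much more (or less) likely a given p-value is under the alternative than under the null. It uses a two-sided z-test as an approximation to the t-test in Figure 1, and derives the noncentrality parameters from the stated power levels; these are simplifying assumptions, not the original figure’s code.

# Density of a two-sided z-test p-value under an alternative with noncentrality ncp,
# relative to its density under the null hypothesis (which is 1 for every p)
p_density_ratio <- function(p, ncp) {
  z <- qnorm(1 - p / 2)  # observed |z| corresponding to the two-sided p-value
  (dnorm(z - ncp) + dnorm(z + ncp)) / (2 * dnorm(z))
}

# Noncentrality parameters that give 50% and 99% power at alpha = 0.05 (two-sided)
ncp_50 <- qnorm(0.975) + qnorm(0.50)  # ~1.96
ncp_99 <- qnorm(0.975) + qnorm(0.99)  # ~4.29

p_density_ratio(0.168, ncp_50)  # ~1.1: p = 0.168 is slightly MORE likely under H1 with 50% power
p_density_ratio(0.168, ncp_99)  # ~0.02: p = 0.168 is much LESS likely under H1 with 99% power
p_density_ratio(0.040, ncp_99)  # ~0.3: even p = 0.04 favors H0 over an H1 examined with 99% power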

Figure 1 illustrates that different p-values can correspond to the same relative evidence in favor of a specific alternative hypothesis, and that the same p-value can correspond to different levels of relative evidence. This is obviously undesirable if we want to use p-values as a measure of the strength of evidence. Therefore, it is incorrect to verbally label any p-value as providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis, as depending on the alternative hypothesis a researcher is interested in, the level of evidence will differ (and the p-value could even correspond to evidence in favor of the null hypothesis).

 

All p-values smaller than 1 correspond to evidence for some non-zero effect

 

If the alternative hypothesis is not specified, any p-value smaller than 1 should be treated as at least some evidence (however small) for some alternative hypotheses. It is therefore not correct to follow the recommendation of the authors in their Table 2 to interpret p-values above 0.1 (e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against the argument by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies that each yield a p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis based on p < 0.05 (see the forest plot in Figure 2). It thus seems ill-advised to follow their recommendation to describe a single study with p = 0.168 as ‘no evidence’ for a relationship.
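The arithmetic behind this combined result can be sketched with Stouffer’s method (a simplification: for three identical studies, a fixed-effect inverse-variance meta-analysis as shown in the forest plot gives approximately the same combined z-value):

# Combine three two-sided p-values of 0.168 with Stouffer's method
p <- c(0.168, 0.168, 0.168)
z <- qnorm(1 - p / 2)                    # convert each p-value to a z-score (~1.38)
z_combined <- sum(z) / sqrt(length(z))   # combined z-score (~2.39)
2 * (1 - pnorm(z_combined))              # combined two-sided p-value (~0.017), below 0.05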

 

Figure 2: Forest plot for a meta-analysis of three identical studies yielding p = 0.168.


However, replacing the label of ‘no evidence’ with the label ‘at least some evidence for some hypotheses’ leads to practical problems when communicating the results of statistical tests. It seems generally undesirable to allow researchers to interpret any p-value smaller than 1 as ‘at least some evidence’ against the null hypothesis. This is the price one pays for not specifying an alternative hypothesis, and trying to interpret p-values from a null hypothesis significance test in an evidential manner. If we do not specify the alternative hypothesis, it becomes impossible to conclude there is evidence for the null hypothesis, and we cannot statistically falsify any hypothesis (Lakens, Scheel, et al., 2018). Some would argue that if you can not falsify hypotheses, you have a bit of a problem (Popper, 1959).

 

Interpreting p-values as p-values

 

Instead of interpreting p-values as measures of the strength of evidence, we could consider a radical alternative: interpret p-values as p-values. This would, perhaps surprisingly, solve the main problem that Muff and colleagues aim to address, namely ‘black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff’. The idea of interpreting p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative frequentist statistical philosophy was developed by Neyman and Pearson (1933a), who proposed to use p-values to guide decisions about the null and alternative hypothesis by controlling the Type I and Type II error rate in the long run. Researchers specify an alpha level, design a study with sufficiently high statistical power, and reject (or fail to reject) the null hypothesis.

Neyman and Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. First, Neyman and Pearson (1933b) leave open whether the states of the world are divided into two (‘accept’ and ‘reject’) or three regions, and write that a “region of doubt may be obtained by a further subdivision of the region of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics is to test range predictions instead of limiting oneself to a null hypothesis significance test (Lakens, 2021). This implements the idea of Neyman and Pearson to introduce a region of doubt, and distinguishes inconclusive results (where neither the null hypothesis nor the alternative hypothesis can be rejected, and more data need to be collected to draw a conclusion) from conclusive results (where either the null hypothesis or the alternative hypothesis can be rejected).
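A minimal sketch of how such a range prediction can be tested in R with two one-sided t-tests against equivalence bounds (the bounds of ±0.5 and the simulated data are hypothetical; Lakens’s TOSTER package provides a more complete implementation):

# Test whether an effect falls outside the range (-0.5, 0.5) on the raw scale,
# using two one-sided t-tests (TOST) against the equivalence bounds
set.seed(123)
x <- rnorm(100, mean = 0.1)  # hypothetical data, true effect close to zero
y <- rnorm(100, mean = 0)

t.test(x, y)$p.value                                      # NHST against a difference of zero
t.test(x, y, mu = -0.5, alternative = "greater")$p.value  # is the effect larger than the lower bound?
t.test(x, y, mu =  0.5, alternative = "less")$p.value     # is the effect smaller than the upper bound?

# If both one-sided tests are significant, effects outside the range can be rejected
# (an equivalence conclusion); if neither the NHST nor the two one-sided tests are
# significant, the result is inconclusive and more data are needed.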

In a Neyman-Pearson approach to hypothesis testing, the act of rejecting a hypothesis comes with a maximum long run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of rejections must be only tentative.” So when we reject the null model, we do so tentatively, aware of the fact that we might have done so in error, and without necessarily believing the null model is false. For Neyman (1957, p. 13) inferential behavior is an “act of will to behave in the future (perhaps until new experiments are performed) in a particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.

Furthermore, it is important to remember that hypothesis tests reject a statistical hypothesis, but not a theoretical hypothesis. As Neyman (1960, p. 290) writes: “the frequency of correct conclusions regarding the statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis in a specific experiment does not necessarily inform us about the truth of the theory. Decisions about the truthfulness of a theory require a careful evaluation of the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).

Neyman (1976) provides some reporting examples that reflect his philosophy on statistical inferences: “after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that "high" scores on "potential" and on "education" are indicative of better chances of success in the drive to home ownership”. An example of a shorter statement that Neyman provides reads: “As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.”

A complete verbal description of the result of a Neyman-Pearson hypothesis test acknowledges two sources of uncertainty. First, the assumptions of the statistical test must be met (e.g., the data are normally distributed), or any deviations should be small enough to not have any substantial effect on the frequentist error rates. Second, conclusions are made “without hoping to know whether each separate hypothesis is true or false” (Neyman & Pearson, 1933a). Any single conclusion can be wrong, and assuming the test assumptions are met, we make claims under a known maximum error rate (which is never zero). Future replication studies are needed to provide further insights about whether the current conclusion was erroneous or not.

After observing a p-value smaller than the alpha level, one can therefore conclude: “Until new data emerges that proves us wrong, we decide to act as if there is an effect, while acknowledging that the methodological procedure we base this decision on has a maximum error rate of alpha% (assuming the statistical assumptions are met), which we find acceptably low.” One can follow such a statement about the observed data with a theoretical inference, such as “assuming our auxiliary hypotheses hold, the result of this statistical test corroborates our theoretical hypothesis”. If a conclusive test result in an equivalence test is observed that allows a researcher to reject the presence of any effect large enough to be meaningful, the conclusion would be that the test result does not corroborate the theoretical hypothesis.

It is true that the common application of null hypothesis significance testing in science is based on an arbitrary threshold of 0.05 (Lakens, Adolfi, et al., 2018). There are surprisingly few attempts to provide researchers with practical approaches to determine an alpha level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021; Miller & Ulrich, 2019; Mudge et al., 2012). The problem seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science in which the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which researchers are not expected to make a claim about a finding (e.g., a BF < 3, see Kass & Raftery, 1995, or a likelihood ratio lower than k = 8, see Royall, 2000). Although it is true that an alpha level of 0.05 is arbitrary, there are some pragmatic arguments in its favor (e.g., it is established, and it might be low enough to yield claims that are taken seriously, but not so strict that it prevents other researchers from attempting to refute the claim; see Uygun Tunç et al., 2021).
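As an illustration of what a more substantive justification could look like, here is a minimal sketch in the spirit of Mudge et al. (2012): choose the alpha level that minimizes the (equally weighted) average of the Type 1 and Type 2 error rates. The sample size of n = 50 per group and the effect size of d = 0.5 are hypothetical choices for illustration.

# Average of the Type 1 and Type 2 error rates as a function of alpha,
# for a two-sided two-sample t-test with n = 50 per group and a true effect of d = 0.5
avg_error <- function(alpha, n = 50, d = 0.5) {
  beta <- 1 - power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
                           type = "two.sample")$power
  (alpha + beta) / 2
}

# Find the alpha level that minimizes the average error rate
optimize(avg_error, interval = c(0.001, 0.5))
# For these hypothetical design choices the optimum lies at roughly alpha = 0.12,
# above the conventional 0.05; with other sample sizes, effect sizes, or error
# weights the optimal alpha can just as well be lower than 0.05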

 

Is there really no agreement on best practices in sight?

 

One major impetus for the flawed proposal to interpret p-values as evidence by Muff and colleagues is that “no agreement on a way forward is in sight”. The statement that there is little agreement among statisticians is an oversimplification. I will go out on a limb and state some things I assume most statisticians agree on. First, there are multiple statistical tools one can use, and each tool has its own strengths and weaknesses. Second, there are different statistical philosophies, each with its own coherent logic, and researchers are free to analyze data from the perspective of one or multiple of these philosophies. Third, one should not misuse statistical tools, or apply them to attempt to answer questions they were not designed to answer.

It is true that there is variation in the preferences individuals have about which statistical tools should be used, and which statistical philosophies researchers should adopt. This should not be surprising. Individual researchers differ in which research questions they find interesting within a specific content domain, and similarly, they differ in which statistical questions they find interesting when analyzing data. Individual researchers differ in which approaches to science they adopt (e.g., a qualitative or a quantitative approach), and similarly, they differ in which approach to statistical inferences they adopt (e.g., a frequentist or Bayesian approach). Luckily, there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences. It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is. Researchers can believe it is important for reliable knowledge generation to control error rates when making scientific claims, while at the same time believing that it is important to quantify relative evidence using likelihoods or Bayes factors (for example by presenting a Bayes factor alongside every p-value for a statistical test; Lakens et al., 2020).

Whatever approach to statistical inferences researchers choose to use, the approach should answer a meaningful statistical question (Hand, 1994), the approach to statistical inferences should be logically coherent, and the approach should be applied correctly. Despite the common statement in the literature that p-values can be interpreted as measures of evidence, the criticism against the coherence of this approach should make us pause. Given that coherent alternatives exist, such as likelihoods (Royall, 1997) or Bayes factors (Kass & Raftery, 1995), researchers should not follow the recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p = 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.

 

References

Bland, M. (2015). An introduction to medical statistics (Fourth edition). Oxford University Press.

Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568–1574.

Hand, D. J. (1994). Deconstructing Statistical Questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3), 317–356. https://doi.org/10.2307/2983526

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572

Kim, J. H., & Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic Approach. Abacus, 57(1), 27–71. https://doi.org/10.1111/abac.12172

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16–26. https://doi.org/10.1037//0003-066X.56.1.16

Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2020). Improving Inferences About Null Effects With Bayes Factors and Equivalence Tests. The Journals of Gerontology: Series B, 75(1), 45–57. https://doi.org/10.1093/geronb/gby065

Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963

Maier, M., & Lakens, D. (2021). Justify Your Alpha: A Primer on Two Practical Approaches. PsyArXiv. https://doi.org/10.31234/osf.io/ts4r6

Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734

Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7–22. https://doi.org/10.2307/1401671

Neyman, J. (1960). First course in probability and statistics. Holt, Rinehart and Winston.

Neyman, J. (1976). Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics - Theory and Methods, 5(8), 737–751. https://doi.org/10.1080/03610927608827392

Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009

Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29(04), 492–510. https://doi.org/10.1017/S030500410001152X

Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall/CRC.

Royall, R. (2000). On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95(451), 760–768.

Spanos, A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80(1), 73–93.

Uygun Tunç, D., & Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework. In Meta-Psychology. https://doi.org/10.31234/osf.io/pdm7y

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by

Sunday, October 31, 2021

Not All Flexibility P-Hacking Is, Young Padawan

During a recent workshop on Sample Size Justification an early career researcher asked me: “You recommend sequential analysis in your paper for when effect sizes are uncertain, where researchers collect data, analyze the data, stop when a test is significant, or continue data collection when a test is not significant, and, I don’t want to be rude, but isn’t this p-hacking?”

In linguistics there is a term for when children apply a rule they have learned to instances where it does not apply: overregularization. They learn ‘one cow, two cows’, and use the +s rule for plurals where it is not appropriate, such as ‘one mouse, two mouses’ (instead of ‘two mice’). The early career researcher who asked me if sequential analysis was a form of p-hacking was also overregularizing. We teach young researchers that flexibly analyzing data inflates error rates, is called p-hacking, and is a very bad thing that was one of the causes of the replication crisis. So, they apply the rule ‘flexibility in the data analysis is a bad thing’ to cases where it does not apply, such as in the case of sequential analyses. Yes, sequential analyses give a lot of flexibility to stop data collection, but they do so while carefully controlling error rates, with the added bonus that they can increase the efficiency of data collection. This makes them a good thing, not p-hacking.
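A minimal simulation sketch of this difference (the two-look design, the sample sizes, and the Pocock-style critical value of about 2.18 are illustrative assumptions; dedicated R packages such as rpact or gsDesign compute exact sequential boundaries):

# Optional stopping with two looks (n = 50 and n = 100 per group) when the null is true
set.seed(123)
one_study <- function(crit) {
  x <- rnorm(100); y <- rnorm(100)
  p1 <- t.test(x[1:50], y[1:50])$p.value   # interim analysis
  p2 <- t.test(x, y)$p.value               # final analysis
  alpha_per_look <- 2 * (1 - pnorm(crit))  # per-look alpha implied by the critical value
  (p1 < alpha_per_look) || (p2 < alpha_per_look)
}

# Naive optional stopping: test at p < 0.05 at every look -> Type 1 error rate inflates to ~0.08
mean(replicate(1e4, one_study(crit = qnorm(0.975))))

# Pocock-style corrected boundary for two looks (~2.18) -> Type 1 error rate stays close to 0.05
mean(replicate(1e4, one_study(crit = 2.178)))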

 

Children increasingly use correct language the longer they are immersed in it. Many researchers are not yet immersed in an academic environment where they see flexibility in the data analysis applied correctly. Many are scared of doing things wrong, which risks making them overly conservative, as the pendulum swings back too far from ‘we are all p-hacking without realizing the consequences’ to ‘all flexibility is p-hacking’. Therefore, I patiently explain during workshops that flexibility is not bad per se, but that making claims without controlling your error rate is problematic.

In a recent podcast episode of ‘Quantitude’ one of the hosts shared a similar experience 5 minutes into the episode. A young student remarked that flexibility during the data analysis was ‘unethical’. The remainder of the podcast episode on ‘researcher degrees of freedom’ discussed how flexibility is part of data analysis. They clearly state that p-hacking is problematic, and that opportunistic motivations to perform analyses that give you what you want to find should be constrained. But they then criticized preregistration in ways many people on Twitter disagreed with. They talk about ‘high priests’ who want to ‘stop bad people from doing bad things’, which they find uncomfortable, and say ‘you can not preregister every contingency’. They remark they would be surprised if data could be analyzed without requiring any on-the-fly judgment.

Although the examples they gave were not very good1, it is of course true that researchers sometimes need to deviate from an analysis plan. Deviating from an analysis plan is not p-hacking. But when people talk about preregistration, we often see overregularization: “Preregistration requires specifying your analysis plan to prevent inflation of the Type 1 error rate, so deviating from a preregistration is not allowed.” The whole point of preregistration is to transparently allow other researchers to evaluate the severity of a test, both when you stick to the preregistered statistical analysis plan and when you deviate from it. Some researchers have sufficient experience with the research they do that they can preregister an analysis that does not require any deviations2, and then readers can see that the Type 1 error rate for the study is at the level specified before data collection. Other researchers will need to deviate from their analysis plan because they encounter unexpected data. Some deviations reduce the severity of the test by inflating the Type 1 error rate. But other deviations actually get you closer to the truth. We can not know which is which. A reader needs to form their own judgment about this.

A final example of overregularization comes from a person who discussed a new study that they were preregistering with a junior colleague. They mentioned the possibility of including a covariate in an analysis but thought that was too exploratory to be included in the preregistration. The junior colleague remarked: “But now that we have thought about the analysis, we need to preregister it”. Again, we see an example of overregularization. If you want to control the Type 1 error rate in a test, preregister it, and follow the preregistered statistical analysis plan. But researchers can, and should, explore data to generate hypotheses about things that are going on in their data. You can preregister these, but you do not have to. Not exploring data could even be seen as research waste, as you are missing out on the opportunity to generate hypotheses that are informed by data. A case can be made that researchers should regularly include variables to explore (e.g., measures that are of general interest to peers in their field), as long as these do not interfere with the primary hypothesis test (and as long as these explorations are presented as such).

In the book “Reporting quantitative research in psychology: How to meet APA Style Journal Article Reporting Standards” by Cooper and colleagues from 2020 a very useful distinction is made between primary hypotheses, secondary hypotheses, and exploratory hypotheses. The first consist of the main tests you are designing the study for. The secondary hypotheses are also of interest when you design the study – but you might not have sufficient power to detect them. You did not design the study to test these hypotheses, and because the power for these tests might be low, you did not control the Type 2 error rate for secondary hypotheses. You can preregister secondary hypotheses to control the Type 1 error rate, as you know you will perform them, and if there are multiple secondary hypotheses, as Cooper et al (2020) remark, readers will expect “adjusted levels of statistical significance, or conservative post hoc means tests, when you conducted your secondary analysis”.
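For example, here is a minimal sketch of such an adjustment for three hypothetical secondary hypotheses, using a Holm correction to keep the familywise Type 1 error rate at 5% (the p-values are made up for illustration):

# Three p-values from (hypothetical) preregistered secondary hypothesis tests
p_secondary <- c(0.012, 0.030, 0.240)

# Holm correction controls the familywise Type 1 error rate at the alpha level
p.adjust(p_secondary, method = "holm")          # 0.036 0.060 0.240
p.adjust(p_secondary, method = "holm") < 0.05   # only the first test leads to a claim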

If you think of the possibility to analyze a covariate, but decide this is an exploratory analysis, you can decide to neither control the Type 1 error rate nor the Type 2 error rate. These are analyses, but not tests of a hypothesis, as any findings from these analyses have an unknown Type 1 error rate. Of course, that does not mean these analyses can not be correct in what they reveal – we just have no way to know the long run probability that exploratory conclusions are wrong. Future tests of the hypotheses generated in exploratory analyses are needed. But as long as you follow Journal Article Reporting Standards and distinguish exploratory analyses, readers know what they are getting. Exploring is not p-hacking.

People in psychology are re-learning the basic rules of hypothesis testing in the wake of the replication crisis. But because they are not yet immersed in good research practices, their lack of experience means they overregularize simplistic rules, applying them to situations where they do not apply. Not all flexibility is p-hacking, preregistered studies do not prevent you from deviating from your analysis plan, and you do not need to preregister every possible test that you think of. A good cure for overregularization is reasoning from basic principles. Do not follow simple rules (or what you see in published articles) but make decisions based on an understanding of how to achieve your inferential goal. If the goal is to make claims with controlled error rates, prevent Type 1 error inflation, for example by correcting the alpha level where needed. If your goal is to explore data, feel free to do so, but know these explorations should be reported as such. When you design a study, follow the Journal Article Reporting Standards and distinguish tests with different inferential goals.

 

1 E.g., they discuss having to choose between Student’s t-test and Welch’s t-test, depending on whether Levene’s test indicates the assumption of homogeneity is violated, which is not best practice – just follow R, and use Welch’s t-test by default.
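For instance (a minimal sketch with hypothetical data):

x <- rnorm(30); y <- rnorm(30, sd = 2)
t.test(x, y)                      # R's default is Welch's t-test (var.equal = FALSE)
t.test(x, y, var.equal = TRUE)    # Student's t-test is only used when explicitly requested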

2 But this is rare – only 2 out of 27 preregistered studies in Psychological Science made no deviations. https://royalsocietypublishing.org/doi/full/10.1098/rsos.211037 We can probably do a bit better if we only preregistered predictions at a time when we really understand our manipulations and measures.

Monday, September 20, 2021

Jerzy Neyman: A Positive Role Model in the History of Frequentist Statistics

Many of the facts in this blog post come from the biography ‘Neyman’ by Constance Reid. I highly recommend reading this book if you find this blog interesting.

In recent years researchers have become increasingly interested in the relationship between eugenics and statistics, especially focusing on the lives of Francis Galton, Karl Pearson, and Ronald Fisher. Some have gone as far as to argue for a causal relationship between eugenics and frequentist statistics. For example, in a recent book, ‘Bernoulli’s Fallacy’, Aubrey Clayton speculates that Fisher’s decision to reject prior probabilities and embrace a frequentist approach was “also at least partly political”. Rejecting prior probabilities, Clayton argues, makes science seem more ‘objective’, which would have helped Ronald Fisher and his predecessors to establish eugenics as a scientific discipline, despite the often-racist conclusions eugenicists reached in their work.

When I was asked to review an early version of Clayton’s book for Columbia University Press, I thought that the main narrative was rather unconvincing, and that the history of frequentist statistics it presented was one-sided and biased. Authors who link statistics to problematic political views often do not mention equally important figures in the history of frequentist statistics who were in all ways the opposite of Ronald Fisher. In this blog post, I want to briefly discuss the work and life of Jerzy Neyman, for two reasons.


Jerzy Neyman (image from https://statistics.berkeley.edu/people/jerzy-neyman)

First, the focus on Fisher’s role in the history of frequentist statistics is surprising, given that the dominant approach to frequentist statistics used in many scientific disciplines is the Neyman-Pearson approach. If you have ever rejected a null hypothesis because a p-value was smaller than an alpha level, or if you have performed a power analysis, you have used the Neyman-Pearson approach to frequentist statistics, and not the Fisherian approach. Neyman and Fisher disagreed vehemently about their statistical philosophies (in 1961 Neyman published an article titled ‘Silver Jubilee of My Dispute with Fisher’), but it was Neyman’s philosophy that won out and became the default approach to hypothesis testing in most fields[i]. Anyone discussing the history of frequentist hypothesis testing should therefore seriously engage with the work of Jerzy Neyman and Egon Pearson. Their work was not in line with the views of Karl Pearson, Egon's father, nor with the views of Fisher. Indeed, it was a great source of satisfaction to Neyman that their seminal 1933 paper was presented to the Royal Society by Karl Pearson, who was hostile to and skeptical of the work, and (as Neyman thought) reviewed by Fisher[ii], who strongly disagreed with their philosophy of statistics.

Second, Jerzy Neyman was also the opposite of Fisher in his political viewpoints. Instead of promoting eugenics, Neyman worked to improve the position of those less privileged throughout his life, teaching disadvantaged people in Poland, and creating educational opportunities for Americans at UC Berkeley. He hired David Blackwell, who was the first Black tenured faculty member at UC Berkeley. This is important, because it falsifies the idea put forward by Clayton[iii] that frequentist statistics became the dominant approach in science because the most important scientists who worked on it wanted to pretend their dubious viewpoints were based on ‘objective’ scientific methods.

I think it is useful to broaden the discussion of the history of statistics, beyond the work by Fisher and Karl Pearson, and credit the work of others[iv] who contributed in at least as important ways to the statistics we use today. I am continually surprised about how few people working outside of statistics even know the name of Jerzy Neyman, even though they regularly use his insights when testing hypotheses. In this blog, I will try to describe his work and life to add some balance to the history of statistics that most people seem to learn about. And more importantly, I hope Jerzy Neyman can be a positive role-model for young frequentist statisticians, who might so far have only been educated about the life of Ronald Fisher.


Neyman’s personal life


Neyman was born in 1894 in Russia, but raised in Poland. After attending the gymnasium, he studied at the University of Kharkov. Initially trying to become an experimental physicist, he was too clumsy with his hands, and switched to conceptual mathematics, completing his undergraduate studies in 1917 in politically tumultuous times. In 1919 he met his wife, and they married in 1920. Ten days later, because of the war between Russia and Poland, Neyman was imprisoned for a short time, and in 1921 he fled to a small village to avoid being arrested again, where he obtained food by teaching the children of farmers. He worked for the Agricultural Institute, and then at the University in Warsaw. He obtained his doctoral degree in 1924, at age 30. In September 1925 he was sent to London for a year to learn about the latest developments in statistics from Karl Pearson himself. It was there that he met Egon Pearson, Karl’s son, and a friendship and scientific collaboration started.

Neyman always spent a lot of time teaching, often at the expense of doing scientific work. He was involved in equal opportunity education in 1918 in Poland, teaching in dimly lit classrooms where the rag he used to wipe the blackboard would sometimes freeze. He always had a weak spot for intellectuals from ‘disadvantaged’ backgrounds. He and his wife were themselves very poor until he moved to UC Berkeley in 1938. In 1929, back in Poland, his wife became ill due to their bad living conditions, and the doctor who came to examine her was so struck by their miserable living conditions that he offered to let the couple stay in his house, for the same rent they were paying, while he visited France for 6 months. In his letters to Egon Pearson from this time, Neyman often complained that the struggle for existence took all his time and energy, and that he could not do any scientific work.

Even much later in his life, in 1978, he kept in mind that many people have very little money, and he called ahead to restaurants to make sure a dinner before a seminar would not cost too much for the students. It is perhaps no surprise that most of his students (and he had many) talk about Neyman with a lot of appreciation. He wasn’t perfect (for example, Erich Lehmann - one of Neyman's students - remarked how he was no longer allowed to teach a class after Lehmann's notes, building on and extending the work by Neyman, became extremely popular – suggesting Neyman was no stranger to envy). But his students were extremely positive about the atmosphere he created in his lab. For example, job applicants were told around 1947 that “there is no discrimination on the basis of age, sex, or race ... authors of joint papers are always listed alphabetically."

Neyman himself often suffered discrimination, sometimes because of his difficulty mastering the English language, sometimes for being Polish (when in Paris a piece of clothing, an ermine wrap, was stolen from their room, the police responded “What can you expect – only Poles live there!”), sometimes because he did not believe in God, and sometimes because his wife was Russian and very emancipated (living independently in Paris as an artist). He was fiercely against discrimination. In 1933, as anti-Semitism was on the rise among students at the university where he worked in Poland, he complained in a letter to Egon Pearson that the students were behaving towards Jews as Americans did towards people of color. In 1941 at UC Berkeley he hired women at a time when it was not easy for a woman to get a job in mathematics.

In 1942, Neyman examined the possibility of hiring David Blackwell, a Black statistician, then still a student. Neyman met him in New York (so that Blackwell did not need to travel to Berkeley at his own expense) and considered Blackwell the best candidate for the job. The wife of a mathematics professor (who was born in the south of the US) learned about the possibility that a Black statistician might be hired and warned that she would not invite a Black man to her house, and there was enough concern about the effect the hire would have on the department that Neyman could not make an offer to Blackwell. He was able to get Blackwell to Berkeley in 1953 as a visiting professor, and offered him a tenured job in 1954, making David Blackwell the first Black tenured faculty member at UC Berkeley. And Neyman did this even though Blackwell was a Bayesian[v] ;).

In 1963, Neyman travelled to the south of the US and for the first time directly experienced segregation. Back in Berkeley, a letter was written with a request for contributions for the Southern Christian Leadership Conference (founded by Martin Luther King, Jr. and others), and 4000 copies were printed and shared with colleagues at the university and friends around the country, which brought in more than $3000. He wrote a letter to his friend Harald Cramér saying that he believed Martin Luther King, Jr. deserved a Nobel Peace Prize (a letter which Cramér forwarded to the chairman of the Nobel Committee, and which he believed might have contributed at least a tiny bit to the fact that Martin Luther King, Jr. was awarded the Nobel Prize a year later). Neyman also worked towards the establishment of a Special Scholarships Committee at UC Berkeley with the goal of providing educational opportunities to disadvantaged Americans.

Neyman was not a pacifist. In the Second World War he actively looked for ways he could contribute to the war effort. He was involved in statistical models that computed the optimal spacing of bombs dropped by planes to clear a path across a beach of land mines. (When at a certain moment he needed specifics about the beach, a representative from the military who was not allowed to directly provide this information asked if Neyman had ever been to the seashore in France, to which Neyman replied he had been to Normandy, and the representative answered “Then use that beach!”). But Neyman opposed the Vietnam War early and actively, despite the risk of losing lucrative contracts the Statistical Laboratory had with the Department of Defense. In 1964 he joined a group of people who bought advertisements in local newspapers with a picture of a napalmed Vietnamese child and the quote “The American people will bluntly and plainly call it murder”.


A positive role model


It is important to know the history of a scientific discipline. Histories are complex, and we should resist overly simplistic narratives. If your teacher explains frequentist statistics to you, it is good if they highlight that someone like Fisher had questionable ideas about eugenics. But the early developments in frequentist statistics involved many researchers beyond Fisher[vi], and, luckily, there are many more positive role-models that also deserve to be mentioned - such as Jerzy Neyman. Even though Neyman’s philosophy on statistical inferences forms the basis of how many scientists nowadays test hypotheses, his contributions and personal life are still often not discussed in histories of statistics - an oversight I hope the current blog post can somewhat mitigate. If you want to learn more about the history of statistics through Neyman’s personal life, I highly recommend the biography of Neyman by Constance Reid, which was the source for most of the content of this blog post.

 



[i] See Hacking, 1965: “The mature theory of Neyman and Pearson is very nearly the received theory on testing statistical hypotheses.”

[ii] It turns out, in the biography, that it was not Fisher, but A. C. Aitken, who reviewed the paper positively.

[iii] Clayton’s book seems to be mainly intended as an attempt to persuade readers to become a Bayesian, and not as an accurate analysis of the development of frequentist statistics.

[iv] William Gosset (or 'Student', from 'Student's t-test'), who was the main inspiration for the work by Neyman and Pearson, is another giant in frequentist statistics who does not in any way fit into the narrative that frequentist statistics is tied to eugenics, as his statistical work was motivated by applied research questions in the Guinness brewery. Gosset was a modest man – which is probably why he rarely receives the credit he is due.

[v] When asked about his attitude towards Bayesian statistics in 1979, he answered: “It does not interest me. I am interested in frequencies.” He did note multiple legitimate approaches to statistics exist, and the choice one makes is largely a matter of personal taste. Neyman opposed subjective Bayesian statistics because their use could lead to bad decision procedures, but was very positive about later work by Wald, which inspired Bayesian statistical decision theory.

[vi] For a more nuanced summary of Fisher's life, see https://www.nature.com/articles/s41437-020-00394-6