The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Wednesday, January 1, 2020

Observed Type 1 Error Rates (Why Statistical Models are Not Reality)


“In the long run we are all dead.” - John Maynard Keynes

When we perform hypothesis tests in a Neyman-Pearson framework we want to make decisions while controlling the rate at which we make errors. We do this in part by setting an alpha level that guarantees we will not say there is an effect when there is no effect more than α% of the time, in the long run.

I like my statistics applied. And in practice I don’t do an infinite number of studies. As Keynes astutely observed, I will be dead before then. So when I control the error rate for my studies, what is a realistic Type 1 error rate I will observe in the ‘somewhat longer run’?

Let’s assume you publish a paper that contains only a single p-value. Let’s also assume the true effect size is 0, so the null hypothesis is true. Your test will return a p-value smaller than your alpha level (and this would be a Type 1 error) or not. With a single study, you don’t have the granularity to talk about a 5% error rate.


In experimental psychology 30 seems to be a reasonable average for the number of p-values that are reported in a single paper (http://doi.org/10.1371/journal.pone.0127872). Let’s assume you perform 30 tests in a single paper and every time the null is true (even though this is often unlikely in a real paper). In the long run, with an alpha level of 0.05 we can expect that 30 * 0.05 = 1.5 p-values will be significant. But in real sets of 30 p-values there is no half of a p-value, so you will either observe 0, 1, 2, 3, 4, 5, or even more Type 1 errors, which equals 0%, 3.33%, 6.66%, 10%, 13.33%, 16.66%, or even more. We can plot the frequency of Type 1 error rates for 1 million sets of 30 tests.


Each of these error rates occurs with a certain frequency. 21.5% of the time, you will not make any Type 1 errors. 12.7% of the time, you will make 3 Type 1 errors in 30 tests. The average over thousands of papers reporting 30 tests will be a Type 1 error rate of 5%, but no single set of studies is average.
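
The frequencies above can also be checked without simulation. The sketch below swaps the 1 million simulated sets for the exact binomial distribution, assuming the 30 tests are independent and the null is true for each one. The names `binom_pmf`, `n_tests`, and `distribution` are illustrative choices, not taken from the original code.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k Type 1 errors in n independent tests."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_tests = 30  # p-values per paper, as in the text
alpha = 0.05  # nominal Type 1 error rate

# Map each possible number of errors to the probability of observing it.
distribution = {k: binom_pmf(k, n_tests, alpha) for k in range(n_tests + 1)}

for k in range(6):
    observed_rate = k / n_tests
    print(f"{k} errors ({observed_rate:6.2%} observed rate): "
          f"{distribution[k]:.1%} of papers")

# The long-run average is still alpha * n_tests, even though no single
# paper can contain that number of errors.
expected_errors = sum(k * prob for k, prob in distribution.items())
print(f"Expected number of errors per paper: {expected_errors:.2f}")
```

Under these assumptions the probabilities for 0 and 3 errors come out at about 21.5% and 12.7%, matching the numbers in the text.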


Now maybe a single paper with 30 tests is not ‘long runnerish’ enough. What we really want to control the Type 1 error rate of is the literature, past, present, and future. Except, we will never read the literature. So let’s assume we are interested in a meta-analysis worth of 200 studies that examine a topic where the true effect size is 0 for each test. We can plot the frequency of Type 1 error rates for 1 million sets of 200 tests.
 

Now things start to look a bit more like what you would expect. The Type 1 error rate you will observe in your set of 200 tests is close to 5%. However, it is almost exactly as likely that the observed Type 1 error rate is 4.5%. 90% of the distribution of observed alpha levels will lie between 0.025 and 0.075. So, even in ‘somewhat longrunnish’ 200 tests, the observed Type 1 error rate will rarely be exactly 5%, and it might be more useful to think about it as being between 2.5 and 7.5%.
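
The same exact calculation can be sketched for 200 tests, again assuming independent tests with a true null. Two choices here are assumptions of the sketch: the 2.5% to 7.5% band is read as 5 to 15 errors with both endpoints included, and the exact binomial replaces the simulation, so the coverage it prints may differ a little from the rounded figure in the text.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k Type 1 errors in n independent tests."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_tests = 200
alpha = 0.05

p_rate_5_percent = binom_pmf(10, n_tests, alpha)   # 10 / 200 = 5.0%
p_rate_4_5_percent = binom_pmf(9, n_tests, alpha)  # 9 / 200 = 4.5%

# An observed rate of 2.5% to 7.5% corresponds to 5 to 15 errors
# (inclusive endpoints are an assumption of this sketch).
p_within_band = sum(binom_pmf(k, n_tests, alpha) for k in range(5, 16))

print(f"P(observed rate is exactly 5.0%): {p_rate_5_percent:.4f}")
print(f"P(observed rate is exactly 4.5%): {p_rate_4_5_percent:.4f}")
print(f"P(observed rate in [2.5%, 7.5%]): {p_within_band:.4f}")
```

The first two probabilities are nearly identical, which is the point made above: even the single most likely outcome, exactly 5%, is itself fairly unlikely.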

Statistical models are not reality.


A 5% error rate exists only in the abstract world of infinite repetitions, and you will not live long enough to perform an infinite number of studies. In practice, if you (or a group of researchers examining a specific question) do real research, the error rates are somewhere in the range of 5%. Everything has variation in samples drawn from a larger population - error rates are no exception.

When we quantify things, there is the tendency to get lost in digits. But in practice, the levels of random noise we can reasonably expect quickly overwhelm everything at least 3 digits after the decimal. I know we can compute the alpha level after a Pocock correction for two looks at the data in sequential analyses as 0.0294. But this is not the level of granularity that we should have in mind when we think of the error rate we will observe in real lines of research. When we control our error rates, we do so with the goal to end up somewhere reasonably low, after a decent number of hypotheses have been tested. Whether we end up observing 2.5% Type 1 errors or 7.5% errors: Potato, patato.
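
For readers curious where a number like 0.0294 comes from, here is a rough sketch of one way to derive it. It assumes a two-sided z-test, two equally spaced looks, and the same critical value at both looks (the Pocock approach). The numerical integration and bisection are my own illustrative choices, not how any particular package computes the boundary, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def overall_type1_error(c: float, steps: int = 400) -> float:
    """Overall two-sided Type 1 error rate for two looks with critical value c."""
    # Z1 is the statistic at the interim look. With equal information per
    # look, the final statistic is Z2 = (Z1 + Z_new) / sqrt(2), where Z_new
    # is an independent standard normal. We integrate
    # pdf(z1) * P(|Z2| < c | Z1 = z1) over |z1| < c to get P(no rejection).
    def integrand(z1: float) -> float:
        upper = STD_NORMAL.cdf(c * sqrt(2) - z1)
        lower = STD_NORMAL.cdf(-c * sqrt(2) - z1)
        return STD_NORMAL.pdf(z1) * (upper - lower)

    # Composite Simpson's rule on [-c, c]; steps must be even.
    h = 2 * c / steps
    total = integrand(-c) + integrand(c)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(-c + i * h)
    return 1 - total * h / 3


def pocock_critical_value(alpha: float = 0.05) -> float:
    """Bisect for the critical value that keeps the overall error at alpha."""
    low, high = 1.5, 4.0
    for _ in range(50):
        mid = (low + high) / 2
        if overall_type1_error(mid) > alpha:
            low = mid   # boundary too lenient, raise it
        else:
            high = mid  # boundary strict enough, try lowering it
    return (low + high) / 2


critical_z = pocock_critical_value()
nominal_alpha = 2 * (1 - STD_NORMAL.cdf(critical_z))
print(f"Critical z at each look: {critical_z:.3f}")
print(f"Nominal alpha per look:  {nominal_alpha:.4f}")
```

The fourth decimal this produces is exactly the kind of precision that the noise in any real set of studies will swamp.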

This does not mean we should stop quantifying numbers precisely when they can be quantified precisely, but we should realize what we get from the statistical procedures we use. We don't get a 5% Type 1 error rate in any real set of studies we will actually perform. Statistical inferences guide us roughly to where we would ideally like to end up. By all means calculate exact numbers where you can. Strictly adhere to hard thresholds to prevent you from fooling yourself too often. But maybe in 2020 we can learn to appreciate statistical inferences are always a bit messy. Do the best you reasonably can, but don’t expect perfection. In 2020, and in statistics.


Code
For a related paper on alpha levels that in practical situations can not be 5%, see https://psyarxiv.com/erwvk/ by Casper Albers. 

Sunday, November 24, 2019

Do You Really Want to Test a Hypothesis?


I’ve uploaded one of my favorite lectures in my new MOOC “Improving Your Statistical Questions” to YouTube. It asks the question whether you really want to test a hypothesis. A hypothesis is a very specific tool to answer a very specific question. I like hypothesis tests, because in experimental psychology it is common to perform lines of research where you can design a bunch of studies that test simple predictions about the presence or absence of differences on some measure. I think they have a role to play in science. I also think hypothesis testing is widely overused. As we are starting to do hypothesis tests better (e.g., by preregistering our predictions and controlling our error rates in more severe tests) I predict many people will start to feel a bit squeamish as they become aware that doing hypothesis tests as they were originally designed to be used isn’t really what they want in their research. One of the often overlooked gains in teaching people how to do something well is that they finally realize that they actually don’t want to do it.

The lecture “Do You Really Want to Test a Hypothesis” aims to explain which question a hypothesis test asks, and discusses when a hypothesis test answers a question you are interested in. It is very easy to say what not to do, or to point out what is wrong with statistical tools. Statistical tools are very limited, even under ideal circumstances. It’s more difficult to say what you can do. If you follow my work, you know that this latter question is what I spend my time on. Instead of telling you optional stopping can’t be done because it is p-hacking, I explain how you can do it correctly through sequential analysis. Instead of telling you it is wrong to conclude the absence of an effect from p > 0.05, I explain how to use equivalence testing. Instead of telling you p-values are the devil, I explain how they answer a question you might be interested in when used well. Instead of saying preregistration is redundant, I explain from which philosophy of science preregistration has value. And instead of saying we should abandon hypothesis tests, I try to explain in this video how to use them wisely. This is all part of my ongoing #JustifyEverything educational tour. I think it is a reasonable expectation that researchers should be able to answer at least a simple ‘why’ question if you ask why they use a specific tool, or use a tool in a specific manner.

This might help to move beyond the simplistic discussion I often see about these topics. If you ask me if I prefer frequentist or Bayesian statistics, or confirmatory or exploratory research, I am most likely to respond (see Wikipedia). It is tempting to think about these topics in a polarized either-or mindset – but then you would miss asking the real questions. When would any approach give you meaningful insights? Just as not every hypothesis test is an answer to a meaningful question, so will not every exploratory study provide interesting insights. The most important question to ask yourself when you plan a study is ‘when will the tools you use lead to interesting insights’? In the second week of my MOOC I discuss when effects in hypothesis tests could be deemed meaningful, but the same question applies to exploratory or descriptive research. Not all exploration is interesting, and we don’t want to simply describe every property of the world. Again, it is easy to dismiss any approach to knowledge generation, but it is so much more interesting to think about which tools will lead to interesting insights. And above all, realize that in most research lines, researchers will have a diverse set of questions that they want to answer given practical limitations, and they will need to rely on a diverse set of tools, limitations and all.

In this lecture I try to explain what the three limitations of hypothesis tests are, and the very specific question they try to answer. If you like to think about how to improve your statistical questions, you might be interested in enrolling in my free MOOC “Improving Your Statistical Questions”.




Sunday, November 3, 2019

The Value of Preregistration for Psychological Science: A Conceptual Analysis


This blog is an excerpt of an invited journal article for a special issue of Japanese Psychological Review, that I am currently one week overdue with (but that I hope to complete soon). I hope this paper will raise the bar in the ongoing discussion about the value of preregistration in psychological science. If you have any feedback on what I wrote here, I would be very grateful to hear it, as it would allow me to improve the paper I am working on. If we want to fruitfully discuss preregistration, researchers need to provide a clear conceptual definition of preregistration, anchored in their philosophy of science.

For as long as data has been used to support scientific claims, people have tried to selectively present data in line with what they wish to be true. In his treatise ‘On the Decline of Science in England: And on Some of its Causes’ Babbage (1830) discusses what he calls cooking: “One of its numerous processes is to make multitudes of observations, and out of these to select those only which agree or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he can not pick out fifteen or twenty that will do for serving up.” In the past researchers have proposed solutions to prevent bias in the literature. With the rise of the internet it has become feasible to create online registries that ask researchers to specify their research design and the planned analyses. Scientific communities have started to make use of this opportunity (for a historical overview, see Wiseman, Watt, & Kornbrot, 2019).

Preregistration in psychology has been a good example of ‘learning by doing’. Best practices are continuously updated as we learn from practical challenges and early meta-scientific investigations into how preregistrations are performed. At the same time, discussions have emerged about what the goal of preregistration is, whether preregistration is desirable, and what preregistration should look like across different research areas. Every practice comes with costs and benefits, and it is useful to evaluate whether and when preregistration is worth it. Finally, it is important to evaluate how preregistration relates to different philosophies of science, and when it facilitates or distracts from goals scientists might have. The discussion about benefits and costs of preregistration has not been productive up to now because there is a general lack of a conceptual analysis of what preregistration entails and aims to accomplish, which leads to disagreements that would be easily resolved if a conceptual definition were available. Any conceptual definition of a tool that scientists use, such as preregistration, must examine the goals it achieves, and thus requires a clearly specified view on philosophy of science, which provides an analysis of different goals scientists might have. Discussing preregistration without discussing philosophy of science is a waste of time.

What is Preregistration For?


Preregistration has the goal to transparently prevent bias due to selectively reporting analyses. Since bias in estimates only occurs in relation to a true population parameter, preregistration as discussed here is limited to scientific questions that involve estimates of population values from samples. Researchers can have many different goals when collecting data, perhaps most notably theory development, as opposed to tests of statistical predictions derived from theories. When testing predictions, researchers might want a specific analysis to yield a null effect, for example to show that including a possible confound in an analysis does not change their main results. More often perhaps, they want an analysis to yield a statistically significant result, for example so that they can argue the results support their prediction, based on a p-value below 0.05. Both examples are sources of bias in the estimate of a population effect size. In this paper I will assume researchers use frequentist statistics, but all arguments can be generalized to Bayesian statistics (Gelman & Shalizi, 2013). When effect size estimates are biased, for example due to the desire to obtain a statistically significant result, hypothesis tests performed on these estimates have inflated Type 1 error rates, and when bias emerges due to the desire to obtain a non-significant test result, hypothesis tests have reduced statistical power. In line with the general tendency to weigh Type 1 error rates (the probability of obtaining a statistically significant result when there is no true effect) as more serious than Type 2 error rates (the probability of obtaining a non-significant result when there is a true effect), publications that discuss preregistration have been more concerned with inflated Type 1 error rates than with low power. However, one can easily think of situations where the latter is a bigger concern.

If the only goal of a researcher is to prevent bias it suffices to make a mental note of the planned analyses, or to verbally agree upon the planned analysis with collaborators, assuming we will perfectly remember our plans when analyzing the data. The reason to write down an analysis plan is not to prevent bias, but to transparently prevent bias. By including transparency in the definition of preregistration it becomes clear that the main goal of preregistration is to convince others that the reported analysis tested a clearly specified prediction. Not all approaches to knowledge generation value prediction, and it is important to evaluate if your philosophy of science values prediction to be able to decide if preregistration is a useful tool in your research. Mayo (2018) presents an overview of different arguments for the role prediction plays in science and arrives at a severity requirement: We can build on claims that passed tests that were highly capable of demonstrating the claim was false, but supported the prediction nevertheless. This requires that researchers who read about claims are able to evaluate the severity of a test. Preregistration facilitates this.

Although falsifying theories is a complex issue, falsifying statistical predictions is straightforward. Researchers can specify when they will interpret data as support for their claim based on the result of a statistical test, and when not. An example is a directional (or one-sided) t-test testing whether an observed mean is larger than zero. Observing a value statistically smaller or equal to zero would falsify this statistical prediction (as long as statistical assumptions of the test hold, and with some error rate in frequentist approaches to statistics). In practice, only range predictions can be statistically falsified. Because resources and measurement accuracy are not infinitely large, there is always a value close enough to zero that is statistically impossible to distinguish from zero. Therefore, researchers will need to specify at least some possible outcomes that would not be considered support for their prediction that statistical tests can pick up on. How such bounds are determined is a massively understudied problem in psychology, but it is essential to have falsifiable predictions.

Where bounds of a range prediction enable statistical falsification, the specification of these bounds is not enough to evaluate how highly capable a test was to demonstrate a claim was wrong. Meehl (1990) argues that we are increasingly impressed by a prediction, the more ways a prediction could have been wrong.  He writes (1990, p. 128): “The working scientist is often more impressed when a theory predicts something within, or close to, a narrow interval than when it predicts something correctly within a wide one.” Imagine making a prediction about where a dart will land if I throw it at a dartboard. You will be more impressed with my darts skills if I predict I will hit the bullseye, and I hit the bullseye, than when I predict to hit the dartboard, and I hit the dartboard. Making very narrow range predictions is a way to make it statistically likely to falsify your prediction, if it is wrong. It is also possible to make theoretically risky predictions, for example by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory. Regardless of how researchers increase the capability of a test to be wrong, the approach to scientific progress described here places more faith in claims based on predictions that have a higher capability of being falsified, but where data nevertheless supports the prediction. Anyone is free to choose a different philosophy of science, and create a coherent analysis of the goals of preregistration in that framework, but as far as I am aware, Mayo’s severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration.

Researchers admit to research practices that make their predictions, or the empirical support for their prediction, look more impressive than it is. One example of such a practice is optional stopping, where researchers collect a number of datapoints, perform statistical analyses, and continue the data collection if the result is not statistically significant. In theory, a researcher who is willing to continue collecting data indefinitely will always find a statistically significant result. By repeatedly looking at the data, the Type 1 error rate can inflate to 100%. Even though in practice the inflation will be smaller, optional stopping strongly increases the probability that a researcher can interpret their result as support for their prediction. In the extreme case, where a researcher is 100% certain that they will observe a statistically significant result when they perform their statistical test, their prediction will never be falsified. Providing support for a claim by relying on optional stopping should not increase our faith in the claim by much, or even at all. As Mayo (2018, p. 222) writes: “The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter their error-probing capacities is a crucial part of a method’s objectivity.” If researchers were to transparently register their data collection strategy, readers could evaluate the capability of the test to falsify their prediction, conclude this capability is very small, and be relatively unimpressed by the study. If the stopping rule keeps the probability of finding a non-significant result when the prediction is incorrect high, and the data nevertheless support the prediction, we can choose to act as if the claim is correct because it has been severely tested. 
Preregistration thus functions as a tool to allow other researchers to transparently evaluate the severity with which a claim has been tested.
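
The inflation caused by optional stopping is easy to see in a small simulation. The sketch below is illustrative only: it assumes a true effect of zero, normally distributed data with a known standard deviation of 1, a two-sided z-test at an unadjusted alpha of 0.05, and a hypothetical researcher who looks after every 20 observations up to 100. None of these settings come from the text.

```python
import random
from math import sqrt
from statistics import NormalDist

CRITICAL_Z = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05


def study_is_significant(looks, rng):
    """Simulate one study under a true null; True if any look is significant."""
    total = 0.0
    n = 0
    for n_at_look in looks:
        # Collect data until the next look.
        while n < n_at_look:
            total += rng.gauss(0.0, 1.0)  # true effect is zero, SD is 1
            n += 1
        z = total / sqrt(n)  # z-statistic for the sample mean
        if abs(z) > CRITICAL_Z:
            return True  # stop collecting and report the 'effect'
    return False


def type1_error_rate(looks, n_studies=4000, seed=2020):
    """Proportion of simulated null studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(looks, rng) for _ in range(n_studies))
    return hits / n_studies


fixed_rate = type1_error_rate([100])
peeking_rate = type1_error_rate([20, 40, 60, 80, 100])
print(f"One analysis at n = 100:     {fixed_rate:.3f}")
print(f"Five looks, stop when p<.05: {peeking_rate:.3f}")
```

A reader who knows the stopping rule can run this kind of calculation to judge how severe the test was. Without a registered data collection plan, they can not.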

The severity of a test can also be compromised by selecting a hypothesis based on the observed results. In this practice, known as Hypothesizing After the Results are Known (HARKing, Kerr, 1998) researchers look at their data, and then select a prediction. This reversal of the typical hypothesis testing procedure makes the test incapable of demonstrating the claim was false. Mayo (2018) refers to this as ‘bad evidence, no test’. If we choose a prediction from among the options that yield a significant result, the claims we make based on these ‘predictions’ will never be wrong. In philosophies of science that value predictions, such claims do not increase our confidence that the claim is true, because it has not yet been tested. By preregistering our predictions, we transparently communicate to readers that our predictions predated looking at data, and therefore that the data we present as support of our prediction could have falsified our hypothesis. We have not made our test look more severe by narrowing the range of our predictions after looking at the data (like the Texas sharpshooter who draws the circles of the bullseye after shooting at the wall of the barn). A reader can transparently evaluate how severely our claim was tested.

As a final example of the value of preregistration to transparently allow readers to evaluate the capability of our prediction to be falsified, think about the scenario described by Babbage at the beginning of this article, where a researcher makes multitudes of observations, and selects out of all these tests only those that support their prediction. The larger the number of observations to choose from, the higher the probability that one of the possible tests could be presented as support for the hypothesis. Therefore, from a perspective on scientific knowledge generation where severe tests are valued, choosing to selectively report tests from among many tests that were performed strongly reduces the capability of a test to demonstrate the claim was false. This can be prevented by correcting for multiple testing by lowering the alpha level depending on the number of tests.
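
How quickly selective reporting erodes severity can be sketched with a little arithmetic. The example assumes the tests are independent and the null is true for all of them, and it uses the Bonferroni correction as one simple way to lower the alpha level. The text does not name a specific correction, so that choice is mine, as are the function names.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one significant result among n true nulls."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


alpha = 0.05
for n_tests in (1, 5, 20, 100):
    uncorrected = familywise_error_rate(alpha, n_tests)
    per_test = bonferroni_alpha(alpha, n_tests)
    corrected = familywise_error_rate(per_test, n_tests)
    print(f"{n_tests:3d} tests: P(at least one 'hit') = {uncorrected:.3f}, "
          f"per-test alpha = {per_test:.4f}, corrected rate = {corrected:.3f}")
```

With a hundred observations to choose from, Babbage's cook is all but guaranteed something to serve, unless the alpha level is lowered accordingly.
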
The fact that preregistration is about specifying ways in which your claim could be false is not generally appreciated. Preregistrations should carefully specify not just the analysis researchers plan to perform, but also when they would infer from the analyses that their prediction was wrong. As the preceding section explains, successful predictions impress us more when the data that was collected was capable of falsifying the prediction. Therefore, a preregistration document should give us all the required information that allows us to evaluate the severity of the test. Specifying exactly which test will be performed on the data is important, but not enough. Researchers should also specify when they will conclude the prediction was not supported. Beyond specifying the analysis plan in detail, the severity of a test can be increased by narrowing the range of values that are predicted (without increasing the Type 1 and Type 2 error rate), or making the theoretical prediction more specific by specifying detailed circumstances under which the effect will be observed, and when it will not be observed.

When is preregistration valuable?


If one agrees with the conceptual analysis above, it follows that preregistration adds value for people who choose to increase their faith in claims that are supported by severe tests and predictive successes. Whether this seems reasonable depends on your philosophy of science. Preregistration itself does not make a study better or worse compared to a non-preregistered study. Sometimes, being able to transparently evaluate a study (and its capability to demonstrate claims were false) will reveal a study was completely uninformative. Other times we might be able to evaluate the capability of a study to demonstrate a claim was false even if the study is not transparently preregistered. Examples are studies where there is no room for bias, because the analyses are perfectly constrained by theory, or because it is not possible to analyze the data in any other way than was reported. Although the severity of a test is in principle unrelated to whether it is pre-registered or not, in practice there will be a positive correlation. This correlation is caused by studies where transparent preregistration improves our ability to evaluate how capable they were of demonstrating a claim was false, such as studies with multiple dependent variables to choose from, studies that do not use standardized measurement scales so that the dependent variable can be calculated in different ways, or studies where additional data is easily collected, to name a few.

We can apply our conceptual analysis of preregistration to hypothetical real-life situations to gain a better insight into when preregistration is a valuable tool, and when not. For example, imagine a researcher who preregisters an experiment where the main analysis tests a linear relationship between two variables. This test yields a non-significant result, thereby failing to support the prediction. In an exploratory analysis the authors find that fitting a polynomial model yields a significant test result with a low p-value. A reviewer of their manuscript has studied the same relationship, albeit in a slightly different context and with another measure, and has unpublished data from multiple studies that also yielded polynomial relationships. The reviewer also has a tentative idea about the underlying mechanism that causes not a linear, but a polynomial, relationship. The original authors will be of the opinion that the claim of a polynomial relationship has passed a less severe test than their original prediction of a linear relationship would have passed (had it been supported). However, the reviewer would never have preregistered a linear relationship to begin with, and therefore does not evaluate the switch to a polynomial test in the exploratory result section as something that reduces the severity of the test. Given that the experiment was well-designed, the test for a polynomial relationship will be judged as having greater severity by the reviewer than by the authors. In this hypothetical example the reviewer has additional data that would have changed the hypothesis they would have preregistered in the original study. It is also possible that the difference in evaluation of the exploratory test for a polynomial relationship is based purely on a subjective prior belief, or on the basis of knowledge about an existing well-supported theory that would predict a polynomial, but not a linear, relationship.

Now imagine that our reviewer asks for the raw data to test whether their assumed underlying mechanism is supported. They receive the dataset, and looking through the data and the preregistration, the reviewer realizes that the original authors didn’t adhere to their preregistered analysis plan. They violated their stopping rule, analyzing the data in batches of four and stopping earlier than planned. They did not carefully specify how to compute their dependent variable in the preregistration, and although the reviewer has no experience with the measure that has been used, the dataset contains eight ways in which the dependent variable was calculated. Only one of the eight ways in which the dependent variable was calculated yields a significant effect for the polynomial relationship. Faced with this additional information, the reviewer believes it is much more likely that the analysis testing the claim was the result of selective reporting, and now is of the opinion the polynomial relationship was not severely tested.

Both of these evaluations of how severely a hypothesis was tested were perfectly reasonable, given the information the reviewer had available. It reveals how sometimes switching from a preregistered analysis to an exploratory analysis does not impact the evaluation of the severity of the test by a reviewer, while in other cases a selectively reported result does reduce the perceived severity with which a claim has been tested. Preregistration makes more information available to readers that can be used to evaluate the severity of a test, but readers might not always evaluate the information in a preregistration in the same way. Whether a design or analytic choice increases or decreases the capability of a claim to be falsified depends on statistical theory, as well as on prior beliefs about the theory that is tested. Some practices are known to reduce the severity of tests, such as optional stopping and selectively reporting analyses that yield desired results, and therefore it is easier to evaluate how statistical practices impact the severity with which a claim is tested. If a preregistration is followed through exactly as planned then the tests that are performed have desired error rates in the long run, as long as the test assumptions are met. Note that because long run error rates are based on assumptions about the data generating process, which are never known, true error rates are unknown, and thus preregistration makes it relatively more likely that tests have desired long run error rates. The severity of a test also depends on assumptions about the underlying theory, and how the theoretical hypothesis is translated into a statistical hypothesis. There will rarely be unanimous agreement on whether a specific operationalization is a better or worse test of a hypothesis, and thus researchers will differ in their evaluation of how severely specific design choices test a claim.
This once more highlights how preregistration does not automatically increase the severity of a test. When it prevents practices that are known to reduce the severity of tests, such as optional stopping, preregistration leads to a relative increase in the severity of a test compared to a non-preregistered study. But when there is no objective evaluation of the severity of a test, as is often the case when we try to judge how severe a test was based on theoretical grounds, preregistration merely enables a transparent evaluation of the capability of a claim to be falsified.