The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, November 3, 2019

The Value of Preregistration for Psychological Science: A Conceptual Analysis


This blog post is an excerpt of an invited journal article for a special issue of Japanese Psychological Review that I am currently one week overdue with (but that I hope to complete soon). I hope this paper will raise the bar in the ongoing discussion about the value of preregistration in psychological science. If you have any feedback on what I wrote here, I would be very grateful to hear it, as it would allow me to improve the paper I am working on. If we want to fruitfully discuss preregistration, researchers need to provide a clear conceptual definition of preregistration, anchored in their philosophy of science.

For as long as data has been used to support scientific claims, people have tried to selectively present data in line with what they wish to be true. In his treatise ‘On the Decline of Science in England: And on Some of its Causes’ Babbage (1830) discusses what he calls cooking: “One of its numerous processes is to make multitudes of observations, and out of these to select those only which agree or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he can not pick out fifteen or twenty that will do up for serving.” In the past, researchers have proposed solutions to prevent bias in the literature. With the rise of the internet it has become feasible to create online registries that ask researchers to specify their research design and the planned analyses. Scientific communities have started to make use of this opportunity (for a historical overview, see Wiseman, Watt, & Kornbrot, 2019).

Preregistration in psychology has been a good example of ‘learning by doing’. Best practices are continuously updated as we learn from practical challenges and early meta-scientific investigations into how preregistrations are performed. At the same time, discussions have emerged about what the goal of preregistration is, whether preregistration is desirable, and what preregistration should look like across different research areas. Every practice comes with costs and benefits, and it is useful to evaluate whether and when preregistration is worth it. Finally, it is important to evaluate how preregistration relates to different philosophies of science, and when it facilitates or distracts from goals scientists might have. The discussion about benefits and costs of preregistration has not been productive up to now because there is a general lack of a conceptual analysis of what preregistration entails and aims to accomplish, which leads to disagreements that would be easily resolved if a conceptual definition were available. Any conceptual definition of a tool that scientists use, such as preregistration, must examine the goals it achieves, and thus requires a clearly specified view on philosophy of science, which provides an analysis of different goals scientists might have. Discussing preregistration without discussing philosophy of science is a waste of time.

What is Preregistration For?


Preregistration has the goal to transparently prevent bias due to selectively reporting analyses. Since bias in estimates only occurs in relation to a true population parameter, preregistration as discussed here is limited to scientific questions that involve estimates of population values from samples. Researchers can have many different goals when collecting data, perhaps most notably theory development, as opposed to tests of statistical predictions derived from theories. When testing predictions, researchers might want a specific analysis to yield a null effect, for example to show that including a possible confound in an analysis does not change their main results. More often perhaps, they want an analysis to yield a statistically significant result, for example so that they can argue the results support their prediction, based on a p-value below 0.05. Both examples are sources of bias in the estimate of a population effect size. In this paper I will assume researchers use frequentist statistics, but all arguments can be generalized to Bayesian statistics (Gelman & Shalizi, 2013). When effect size estimates are biased, for example due to the desire to obtain a statistically significant result, hypothesis tests performed on these estimates have inflated Type 1 error rates, and when bias emerges due to the desire to obtain a non-significant test result, hypothesis tests have reduced statistical power. In line with the general tendency to weigh Type 1 error rates (the probability of obtaining a statistically significant result when there is no true effect) as more serious than Type 2 error rates (the probability of obtaining a non-significant result when there is a true effect), publications that discuss preregistration have been more concerned with inflated Type 1 error rates than with low power. However, one can easily think of situations where the latter is a bigger concern.
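
The link between selectively reporting results and biased estimates can be illustrated with a small simulation. The sketch below is in Python and rests on simplifying assumptions that are not part of the text above: a one-sample design with a known standard deviation of 1 (so a z-test can stand in for a t-test), a hypothetical true effect of 0.2, and 50 observations per study. It draws the sample mean of each study directly from its sampling distribution, and compares the average of all estimates with the average of only those estimates that reached one-sided significance.

```python
import random
from statistics import NormalDist, mean


def simulate_selective_reporting(true_effect=0.2, n=50, studies=5000,
                                 alpha=0.05, seed=1):
    """Compare all effect estimates with the significant subset.

    Assumes a one-sample design with known SD = 1, so the sample mean of
    n observations is normally distributed with standard error 1 / sqrt(n).
    """
    rng = random.Random(seed)
    se = 1 / n ** 0.5
    # Smallest sample mean that is significant in a one-sided z-test.
    critical_mean = NormalDist().inv_cdf(1 - alpha) * se
    estimates = [rng.gauss(true_effect, se) for _ in range(studies)]
    significant = [m for m in estimates if m > critical_mean]
    return mean(estimates), mean(significant), len(significant) / studies


all_mean, significant_mean, share_significant = simulate_selective_reporting()
print(f"mean of all estimates:          {all_mean:.3f}")
print(f"mean of significant estimates:  {significant_mean:.3f}")
print(f"share of significant studies:   {share_significant:.3f}")
```

Under these assumptions the average of all estimates should be close to the true effect, while the average of the significant estimates alone is clearly larger, because only sample means above the critical value are retained.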

If the only goal of a researcher is to prevent bias it suffices to make a mental note of the planned analyses, or to verbally agree upon the planned analysis with collaborators, assuming we will perfectly remember our plans when analyzing the data. The reason to write down an analysis plan is not to prevent bias, but to transparently prevent bias. By including transparency in the definition of preregistration it becomes clear that the main goal of preregistration is to convince others that the reported analysis tested a clearly specified prediction. Not all approaches to knowledge generation value prediction, and it is important to evaluate if your philosophy of science values prediction to be able to decide if preregistration is a useful tool in your research. Mayo (2018) presents an overview of different arguments for the role prediction plays in science and arrives at a severity requirement: We can build on claims that passed tests that were highly capable of demonstrating the claim was false, but supported the prediction nevertheless. This requires that researchers who read about claims are able to evaluate the severity of a test. Preregistration facilitates this.

Although falsifying theories is a complex issue, falsifying statistical predictions is straightforward. Researchers can specify when they will interpret data as support for their claim based on the result of a statistical test, and when not. An example is a directional (or one-sided) t-test testing whether an observed mean is larger than zero. Observing a value statistically smaller or equal to zero would falsify this statistical prediction (as long as statistical assumptions of the test hold, and with some error rate in frequentist approaches to statistics). In practice, only range predictions can be statistically falsified. Because resources and measurement accuracy are not infinitely large, there is always a value close enough to zero that is statistically impossible to distinguish from zero. Therefore, researchers will need to specify at least some possible outcomes that would not be considered support for their prediction that statistical tests can pick up on. How such bounds are determined is a massively understudied problem in psychology, but it is essential to have falsifiable predictions.
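
A short sketch can show why a bound is needed before a statistical prediction can be falsified. The Python code below assumes a z-test with a known standard error, which is a simplification of the t-test mentioned above, and the observed mean (0.02), standard error (0.05), and bound (0.2) are hypothetical numbers chosen only for illustration. The two helper functions are likewise made up for this example.

```python
from statistics import NormalDist


def p_value_above_bound(observed_mean, standard_error, bound):
    """One-sided p-value for H0: true mean <= bound (support for 'larger than bound')."""
    z = (observed_mean - bound) / standard_error
    return 1 - NormalDist().cdf(z)


def p_value_below_bound(observed_mean, standard_error, bound):
    """One-sided p-value for H0: true mean >= bound (evidence for 'smaller than bound')."""
    z = (observed_mean - bound) / standard_error
    return NormalDist().cdf(z)


observed_mean, standard_error = 0.02, 0.05  # hypothetical study result

# Prediction "the effect is larger than zero": not supported ...
print(f"p (larger than 0):    {p_value_above_bound(observed_mean, standard_error, 0.0):.4f}")
# ... but the estimate is also not statistically smaller than zero.
print(f"p (smaller than 0):   {p_value_below_bound(observed_mean, standard_error, 0.0):.4f}")
# Prediction "the effect is at least 0.2": the data are statistically smaller than the bound.
print(f"p (smaller than 0.2): {p_value_below_bound(observed_mean, standard_error, 0.2):.5f}")
```

With these numbers the estimate cannot be distinguished from zero in either direction, but it is statistically smaller than the bound of 0.2. A prediction that specifies such a bound can therefore be rejected by the data, whereas the bare prediction of an effect larger than zero is left undecided.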

Where bounds of a range prediction enable statistical falsification, the specification of these bounds is not enough to evaluate how highly capable a test was to demonstrate a claim was wrong. Meehl (1990) argues that we are increasingly impressed by a prediction, the more ways a prediction could have been wrong.  He writes (1990, p. 128): “The working scientist is often more impressed when a theory predicts something within, or close to, a narrow interval than when it predicts something correctly within a wide one.” Imagine making a prediction about where a dart will land if I throw it at a dartboard. You will be more impressed with my darts skills if I predict I will hit the bullseye, and I hit the bullseye, than when I predict to hit the dartboard, and I hit the dartboard. Making very narrow range predictions is a way to make it statistically likely to falsify your prediction, if it is wrong. It is also possible to make theoretically risky predictions, for example by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory. Regardless of how researchers increase the capability of a test to be wrong, the approach to scientific progress described here places more faith in claims based on predictions that have a higher capability of being falsified, but where data nevertheless supports the prediction. Anyone is free to choose a different philosophy of science, and create a coherent analysis of the goals of preregistration in that framework, but as far as I am aware, Mayo’s severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration.

Researchers admit to research practices that make their predictions, or the empirical support for their prediction, look more impressive than it is. One example of such a practice is optional stopping, where researchers collect a number of datapoints, perform statistical analyses, and continue the data collection if the result is not statistically significant. In theory, a researcher who is willing to continue collecting data indefinitely will always find a statistically significant result. By repeatedly looking at the data, the Type 1 error rate can inflate to 100%. Even though in practice the inflation will be smaller, optional stopping strongly increases the probability that a researcher can interpret their result as support for their prediction. In the extreme case, where a researcher is 100% certain that they will observe a statistically significant result when they perform their statistical test, their prediction will never be falsified. Providing support for a claim by relying on optional stopping should not increase our faith in the claim by much, or even at all. As Mayo (2018, p. 222) writes: “The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter their error-probing capacities is a crucial part of a method’s objectivity.” If researchers were to transparently register their data collection strategy, readers could evaluate the capability of the test to falsify their prediction, conclude this capability is very small, and be relatively unimpressed by the study. If the stopping rule keeps the probability of finding a non-significant result when the prediction is incorrect high, and the data nevertheless support the prediction, we can choose to act as if the claim is correct because it has been severely tested. 
Preregistration thus functions as a tool to allow other researchers to transparently evaluate the severity with which a claim has been tested.
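
The inflation of the Type 1 error rate under optional stopping is easy to simulate. The Python sketch below is one possible illustration, not a definitive implementation. It assumes a two-sided z-test with a known standard deviation of 1, batches of 10 observations, and a null hypothesis that is true, and it stops data collection the first time p < .05. The batch size and the number of looks are arbitrary choices.

```python
import random
from statistics import NormalDist


def optional_stopping_error_rate(looks, n_per_look=10, alpha=0.05,
                                 sims=4000, seed=1):
    """Estimate the Type 1 error rate when testing after every batch.

    Each simulated study adds n_per_look observations from N(0, 1) per look,
    runs a two-sided z-test (known SD = 1), and stops at the first
    significant result. The null hypothesis is true throughout.
    """
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(looks):
            total += sum(rng.gauss(0, 1) for _ in range(n_per_look))
            n += n_per_look
            z = total / n ** 0.5  # z-statistic for the mean of n observations
            if abs(z) > z_critical:
                false_positives += 1
                break
    return false_positives / sims


rates = {looks: optional_stopping_error_rate(looks) for looks in (1, 5, 20)}
for looks, rate in rates.items():
    print(f"{looks:2d} look(s): Type 1 error rate of about {rate:.3f}")
```

With a single look the error rate should stay close to the nominal 5%, and it grows as the number of looks increases. A reader who knows the stopping rule can take this inflation into account, and a reader who does not know it cannot.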

The severity of a test can also be compromised by selecting a hypothesis based on the observed results. In this practice, known as Hypothesizing After the Results are Known (HARKing; Kerr, 1998), researchers look at their data, and then select a prediction. This reversal of the typical hypothesis testing procedure makes the test incapable of demonstrating the claim was false. Mayo (2018) refers to this as ‘bad evidence, no test’. If we choose a prediction from among the options that yield a significant result, the claims we make based on these ‘predictions’ will never be wrong. In philosophies of science that value predictions, such claims do not increase our confidence that the claim is true, because it has not yet been tested. By preregistering our predictions, we transparently communicate to readers that our predictions predated looking at data, and therefore that the data we present as support of our prediction could have falsified our hypothesis. We have not made our test look more severe by narrowing the range of our predictions after looking at the data (like the Texas sharpshooter who draws the circles of the bullseye after shooting at the wall of the barn). A reader can transparently evaluate how severely our claim was tested.

As a final example of the value of preregistration to transparently allow readers to evaluate the capability of our prediction to be falsified, think about the scenario described by Babbage at the beginning of this article, where a researcher makes multitudes of observations, and selects out of all these tests only those that support their prediction. The larger the number of observations to choose from, the higher the probability that one of the possible tests could be presented as support for the hypothesis. Therefore, from a perspective on scientific knowledge generation where severe tests are valued, choosing to selectively report tests from among many tests that were performed strongly reduces the capability of a test to demonstrate the claim was false. This can be prevented by correcting for multiple testing by lowering the alpha level depending on the number of tests.
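
How quickly the probability of at least one significant result grows with the number of tests can be computed directly. The Python sketch below assumes that the tests are independent and that every null hypothesis is true, and it uses the Bonferroni correction as one common way of lowering the alpha level (the text above does not name a specific correction). The function names are made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one significant result among n_tests
    independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha level that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 8, 20, 100):
    uncorrected = familywise_error_rate(0.05, n_tests)
    corrected = familywise_error_rate(bonferroni_alpha(0.05, n_tests), n_tests)
    print(f"{n_tests:3d} tests: uncorrected {uncorrected:.3f}, "
          f"Bonferroni-corrected {corrected:.3f}")
```

Under these assumptions, a researcher who can choose from a hundred tests at an alpha of 0.05 is almost certain to find at least one significant result, while the corrected alpha level keeps that probability at or below 5%.
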
The fact that preregistration is about specifying ways in which your claim could be false is not generally appreciated. Preregistrations should carefully specify not just the analysis researchers plan to perform, but also when they would infer from the analyses that their prediction was wrong. As the preceding section explains, successful predictions impress us more when the data that was collected was capable of falsifying the prediction. Therefore, a preregistration document should give us all the required information that allows us to evaluate the severity of the test. Specifying exactly which test will be performed on the data is important, but not enough. Researchers should also specify when they will conclude the prediction was not supported. Beyond specifying the analysis plan in detail, the severity of a test can be increased by narrowing the range of values that are predicted (without increasing the Type 1 and Type 2 error rate), or making the theoretical prediction more specific by specifying detailed circumstances under which the effect will be observed, and when it will not be observed.

When is preregistration valuable?


If one agrees with the conceptual analysis above, it follows that preregistration adds value for people who choose to increase their faith in claims that are supported by severe tests and predictive successes. Whether this seems reasonable depends on your philosophy of science. Preregistration itself does not make a study better or worse compared to a non-preregistered study. Sometimes, being able to transparently evaluate a study (and its capability to demonstrate claims were false) will reveal a study was completely uninformative. Other times we might be able to evaluate the capability of a study to demonstrate a claim was false even if the study is not transparently preregistered. Examples are studies where there is no room for bias, because the analyses are perfectly constrained by theory, or because it is not possible to analyze the data in any other way than was reported. Although the severity of a test is in principle unrelated to whether it is preregistered or not, in practice there will be a positive correlation. This correlation is caused by studies where transparent preregistration improves our ability to evaluate how capable they were of demonstrating a claim was false, such as studies with multiple dependent variables to choose from, studies that do not use a standardized measurement scale so that the dependent variable can be calculated in different ways, or studies where additional data is easily collected, to name a few.

We can apply our conceptual analysis of preregistration to hypothetical real-life situations to gain a better insight into when preregistration is a valuable tool, and when not. For example, imagine a researcher who preregisters an experiment where the main analysis tests a linear relationship between two variables. This test yields a non-significant result, thereby failing to support the prediction. In an exploratory analysis the authors find that fitting a polynomial model yields a significant test result with a low p-value. A reviewer of their manuscript has studied the same relationship, albeit in a slightly different context and with another measure, and has unpublished data from multiple studies that also yielded polynomial relationships. The reviewer also has a tentative idea about the underlying mechanism that causes not a linear, but a polynomial, relationship. The original authors will be of the opinion that the claim of a polynomial relationship has passed a less severe test than their original prediction of a linear relationship would have passed (had it been supported). However, the reviewer would never have preregistered a linear relationship to begin with, and therefore does not evaluate the switch to a polynomial test in the exploratory result section as something that reduces the severity of the test. Given that the experiment was well-designed, the test for a polynomial relationship will be judged as having greater severity by the reviewer than by the authors. In this hypothetical example the reviewer has additional data that would have changed the hypothesis they would have preregistered in the original study. It is also possible that the difference in evaluation of the exploratory test for a polynomial relationship is based purely on a subjective prior belief, or on the basis of knowledge about an existing well-supported theory that would predict a polynomial, but not a linear, relationship.

Now imagine that our reviewer asks for the raw data to test whether their assumed underlying mechanism is supported. They receive the dataset, and looking through the data and the preregistration, the reviewer realizes that the original authors didn’t adhere to their preregistered analysis plan. They violated their stopping rule, analyzing the data in batches of four and stopping earlier than planned. They did not carefully specify how to compute their dependent variable in the preregistration, and although the reviewer has no experience with the measure that has been used, the dataset contains eight ways in which the dependent variable was calculated. Only one of the eight ways in which the dependent variable was calculated yields a significant effect for the polynomial relationship. Faced with this additional information, the reviewer believes it is much more likely that the analysis testing the claim was the result of selective reporting, and now is of the opinion the polynomial relationship was not severely tested.

Both of these evaluations of how severely a hypothesis was tested were perfectly reasonable, given the information the reviewer had available. It reveals how sometimes switching from a preregistered analysis to an exploratory analysis does not impact the evaluation of the severity of the test by a reviewer, while in other cases a selectively reported result does reduce the perceived severity with which a claim has been tested. Preregistration makes more information available to readers that can be used to evaluate the severity of a test, but readers might not always evaluate the information in a preregistration in the same way. Whether a design or analytic choice increases or decreases the capability of a claim to be falsified depends on statistical theory, as well as on prior beliefs about the theory that is tested. Some practices are known to reduce the severity of tests, such as optional stopping and selectively reporting analyses that yield desired results, and therefore it is easier to evaluate how statistical practices impact the severity with which a claim is tested. If a preregistration is followed through exactly as planned then the tests that are performed have desired error rates in the long run, as long as the test assumptions are met. Note that because long run error rates are based on assumptions about the data generating process, which are never known, true error rates are unknown, and thus preregistration makes it relatively more likely that tests have desired long run error rates. The severity of a test also depends on assumptions about the underlying theory, and how the theoretical hypothesis is translated into a statistical hypothesis. There will rarely be unanimous agreement on whether a specific operationalization is a better or worse test of a hypothesis, and thus researchers will differ in their evaluation of how severely specific design choices test a claim.
This once more highlights how preregistration does not automatically increase the severity of a test. When it prevents practices that are known to reduce the severity of tests, such as optional stopping, preregistration leads to a relative increase in the severity of a test compared to a non-preregistered study. But when there is no objective evaluation of the severity of a test, as is often the case when we try to judge how severe a test was based on theoretical grounds, preregistration merely enables a transparent evaluation of the capability of a claim to be falsified.

Friday, October 11, 2019

Improving Your Statistical Questions

Three years after launching my first massive open online course (MOOC) ‘Improving Your Statistical Inferences’ on Coursera, today I am happy to announce a second completely free online course called ‘Improving Your Statistical Questions’. My first course is a collection of lessons about statistics and methods that we commonly use, but that I wish I had known how to use better when I was taking my first steps into empirical research. My new course is a collection of lessons about statistics and methods that we do not yet commonly use, but that I wish we would start using to improve the questions we ask. Where the first course tries to get people up to speed about commonly accepted best practices, my new course tries to educate researchers about better practices. Most of the modules consist of topics in which there have been more recent developments, or at least increasing awareness, over the last 5 years.

About a year ago, I wrote on this blog: If I ever make a follow up to my current MOOC, I will call it ‘Improving Your Statistical Questions’. The more I learn about how people use statistics, the more I believe the main problem is not how people interpret the numbers they get from statistical tests. The real issue is which statistical questions researchers ask from their data. If you approach a statistician to get help with the data analysis, most of their time will be spent asking you ‘but what is your question?’. I hope this course helps to take a step back, reflect on this question, and get some practical advice on how to answer it.

There are 5 modules, with 15 videos, and 13 assignments that provide hands on explanations of how to use the insights from the lectures in your own research. The first week discusses different questions you might want to ask. Only one of these is a hypothesis test, and I examine in detail if you really want to test a hypothesis, or are simply going through the motions of the statistical ritual. I also discuss why NHST is often not a very risky prediction, and why range predictions are a more exciting question to ask (if you can). Module 2 focuses on falsification in practice and theory, including a lecture and some assignments on how to determine the smallest effect size of interest in the studies you perform. I also share my favorite colloquium question for whenever you dozed off and wake up at the end only to find no one else is asking a question, when you can always raise your hand to ask ‘so, what would falsify your hypothesis?’ Module 3 discusses the importance of justifying error rates, a more detailed discussion on power analysis (following up on the ‘sample size justification’ lecture in MOOC1), and a lecture on the many uses of learning how to simulate data. Module 4 moves beyond single studies, and asks what you can expect from lines of research, how to perform a meta-analysis, and why the scientific literature does not look like reality (and how you can detect, and prevent contributing to, a biased literature). I was tempted to add this to MOOC1, but I am happy I didn’t, as there has been a lot of exciting work on bias detection that is now part of the lecture. The last module has three different topics I think are important: computational reproducibility, philosophy of science (this video would also have been a good first video lecture, but I don’t want to scare people away!) and maybe my favorite lecture in the MOOC on scientific integrity in practice. All are accompanied by assignments, and the assignments are where the real learning happens.

If after this course some people feel more comfortable to abandon hypothesis testing and just describe their data, make their predictions a bit more falsifiable, design more informative studies, publish sets of studies that look a bit more like reality, and make their work more computationally reproducible, I’ll be very happy.

The content of this MOOC is based on over 40 workshops and talks I gave in the last 3 years since my previous MOOC came out, testing this material on live crowds. It comes with some of the pressure a recording artist might feel for a second record when their first was somewhat successful. As my first MOOC hits 30k enrolled learners (many of whom engage with very little of the content, but still with thousands of people taking in a lot of the material) I hope it comes close and lives up to expectations.

I’m very grateful to Chelsea Parlett Pelleriti who checked all assignments for statistical errors or incorrect statements, and provided feedback that made every exercise in this MOOC better. If you need a statistics editor, you can find her at: https://cmparlettpelleriti.github.io/TheChatistician.html. Special thanks to Tim de Jonge who populated the Coursera environment as a student assistant, and Sascha Prudon for recording and editing the videos. Thanks to Uri Simonsohn for feedback on Assignment 2.1, Lars Penke for suggesting the SESOI example in lecture 2.2, Lisa DeBruine for co-developing Assignment 2.4, Joe Hilgard for the PET-PEESE code in assignment 4.3, Matti Heino for the GRIM test example in lecture 4.3, and Michelle Nuijten for feedback on assignment 4.4. Thanks to Seth Green, Russ Zack and Xu Fei at Code Ocean for help in using their platform to make it possible to run the R code online. I am extremely grateful for all alpha testers who provided feedback on early versions of the assignments: Daniel Dunleavy, Robert Gorsch, Emma Henderson, Martine Jansen, Niklas Johannes, Kristin Jankowsky, Cian McGinley, Robert Görsch, Chris Noone, Alex Riina, Burak Tunca, Laura Vowels, and Lara Warmelink, as well as the beta-testers who gave feedback on the material on Coursera: Johannes Breuer, Marie Delacre, Fabienne Ennigkeit, Marton L. Gy, and Sebastian Skejø. Finally, thanks to my wife for buying me six new shirts because ‘your audience has expectations’ (and for accepting how I worked through the summer holiday to complete this MOOC).

All material in the MOOC is shared with a CC-BY-NC-SA license, and you can access all material in the MOOC for free (and use it in your own education). Improving Your Statistical Questions is available from today. I hope you enjoy it!

Saturday, September 14, 2019

Improving Education about P-values

A recent paper in AMPPS points out that many textbooks for introduction to psychology courses incorrectly explain p-values. There are dozens, if not hundreds, of papers that point out problems in how people understand p-values. If we don’t do anything about it, there will be dozens of articles like this in the next decades as well. So let’s do something about it.

When I made my first MOOC three years ago I spent some time thinking about how to explain what a p-value is clearly (you can see my video here). Some years later I realized that if you want to prevent misunderstandings of p-values, you should also explicitly train people about what p-values are not. Now, I think that training away misconceptions is just as important as explaining the correct interpretation of a p-value. Based on a blog post I made a new assignment for my MOOC. In the last year Arianne Herrera-Bennett (@ariannechb) performed an A/B test in my MOOC ‘Improving Your Statistical Inferences’. Half of the learners received this new assignment, explicitly aimed at training away misconceptions. The results are in her PhD thesis that she will defend on the 27th of September, 2019, but one of the main conclusions in the study is that it is possible to substantially reduce common misconceptions about p-values by educating people about them. This is a hopeful message.

I tried to keep the assignment as short as possible, and therefore it is 20 pages. Let that sink in for a moment. How much space does education about p-values take up in your study material? How much space would you need to prevent misunderstandings? And how often would you need to repeat the same material across the years? If we honestly believe misunderstandings of p-values are a problem, then why don’t we educate people well enough to prevent misunderstandings? The fact that people do not understand p-values is not their mistake – it is ours.

In my own MOOC I needed 7 pages to explain what p-value distributions look like, how they are a function of power, why p-values are uniformly distributed when the null is true, and what Lindley’s paradox is. But when I tried to clearly explain common misconceptions, I needed a lot more words. Before you want to blame that poor p-value, let me tell you that I strongly believe the problem of misconceptions is not limited to p-values: Probability is just not intuitive. It might always take more time to explain ways you can misunderstand something, than to teach the correct way to understand something.
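
The behavior of p-values under the null and under a true effect can be explored with a few lines of simulation. The Python sketch below is a minimal illustration, not the material from the MOOC. It assumes a two-sided z-test with a known standard deviation of 1 and 50 observations per study, and the true effect of 0.5 is a hypothetical value chosen to give high power.

```python
import random
from statistics import NormalDist


def simulate_p_values(true_effect, n=50, sims=10000, seed=1):
    """Simulate two-sided z-test p-values for one-sample studies (known SD = 1)."""
    rng = random.Random(seed)
    normal = NormalDist()
    se = 1 / n ** 0.5
    p_values = []
    for _ in range(sims):
        sample_mean = rng.gauss(true_effect, se)  # sampling distribution of the mean
        z = sample_mean / se
        p_values.append(2 * (1 - normal.cdf(abs(z))))
    return p_values


def proportion_below(p_values, threshold):
    """Share of simulated p-values smaller than the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


null_p = simulate_p_values(true_effect=0.0)
effect_p = simulate_p_values(true_effect=0.5)

for threshold in (0.05, 0.25, 0.50):
    print(f"p < {threshold:.2f}: null {proportion_below(null_p, threshold):.3f}, "
          f"true effect {proportion_below(effect_p, threshold):.3f}")
```

When the null is true, the share of p-values below any threshold should be close to the threshold itself, which is what a uniform distribution implies. When there is a true effect, the share of p-values below 0.05 approximates the power of the test under the stated assumptions.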

In a recent pre-print I wrote on p-values, I reflect on the bad job we have been doing at teaching others about p-values. I write:

If anyone seriously believes the misunderstanding of p-values lies at the heart of reproducibility issues in science, why are we not investing more effort to make sure misunderstandings of p-values are resolved before young scholars perform their first research project? Although I am sympathetic to statisticians who think all the information researchers need to educate themselves on this topic is already available, as an experimental psychologist who works at a Human-Technology Interaction department this reminds me too much of the engineer who argues all the information to understand the copy machine is available in the user manual. In essence, the problems we have with how p-values are used is a human factors problem (Tryon, 2001). The challenge is to get researchers to improve the way they work.
Looking at the deluge of papers published in the last half century that point out how researchers have consistently misunderstood p-values, I am left to wonder: Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? It is nowadays relatively straightforward to create online apps where people can simulate studies and see the behavior of p-values across studies, which can easily be combined with exercises that fit the knowledge level of bachelor and master students. The second point I want to make in this article is that a dedicated attempt to develop evidence based educational material in a cross-disciplinary team of statisticians, educational scientists, cognitive psychologists, and designers seems worth the effort if we really believe young scholars should understand p-values. I do not think that the effort statisticians have made to complain about p-values is matched with a similar effort to improve the way researchers use p-values and hypothesis tests. We really have not tried hard enough.

So how about we get serious about solving this problem? Let’s get together and make a dent in this decades-old problem. Let’s try hard enough.

A good place to start might be to take stock of good ways to educate people about p-values that already exist, and then all together see how we can improve them.

I have uploaded my lecture about p-values to YouTube, and my assignment to train away misconceptions is available online as a Google Doc (the answers and feedback are here).

This is just my current approach to teaching p-values. I am sure there are many other approaches (and it might turn out that watching several videos, each explaining p-values in slightly different ways, is an even better way to educate people than having only one video). If anyone wants to improve this material (or replace it by better material) I am willing to open up my online MOOC for anyone who wants to do an A/B test of any good idea, so you can collect data from hundreds of students each year. I’m more than happy to collect best practices in p-value education – if you have anything you think (or have empirically shown) works well, send it my way - and make it openly available. Educators, pedagogists, statisticians, cognitive psychologists, software engineers, and designers interested in improving educational materials should find a place to come together. I know there are organizations that exist to improve statistics education (but have no good information about what they do, or which one would be best to join given my goals), and if you work for such an organization and are interested in taking p-value education to the next level, I’m more than happy to spread this message in my network and work with you.

If we really consider the misinterpretation of p-values to be one of the more serious problems underlying the lack of replicability of scientific findings, we need to seriously reflect on whether we have done enough to prevent misunderstandings. Treating it as a human factors problem might illuminate ways in which statistics education and statistical software can be improved. Let’s beat swords into ploughshares, and turn papers complaining about how people misunderstand p-values into papers that examine how we can improve education about p-values.

Saturday, August 10, 2019

Requiring high-powered studies from scientists with resource constraints


Underpowered studies make it very difficult to learn something useful from the studies you perform. Low power means you have a high probability of finding non-significant results, even when there is a true effect. Hypothesis tests with high rates of false negatives (concluding there is nothing, when there is something) become a malfunctioning tool. Low power is even more problematic when combined with publication bias (shiny app). After repeated warnings over at least half a century, high quality journals are starting to ask authors who rely on hypothesis tests to provide a sample size justification based on statistical power.

The first time researchers use power analysis software, they typically think they are making a mistake, because the sample sizes required to achieve high power for hypothesized effects are much larger than the sample sizes they collected in the past. After double checking their calculations, and realizing the numbers are correct, a common response is that there is no way they are able to collect this number of observations.

Published articles on power analysis rarely tell researchers what they should do if they are hired on a 4-year PhD project where the norm is to perform between 4 and 10 studies that can cost at most 1000 euro each, learn about power analysis, and realize there is absolutely no way they will have the time and resources to perform high-powered studies, given that an effect size estimate from an unbiased registered report suggests the effect they are examining is half as large as they were led to believe based on a published meta-analysis from 2010. Facing a job market that under the best circumstances is a nontransparent marathon for uncertainty-fetishists, the prospect of high quality journals rejecting your work due to a lack of a solid sample size justification is not pleasant.
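To make the scale of the problem concrete, here is a minimal sketch of a sample size calculation, assuming a two-sided test comparing two independent groups and using a normal approximation (dedicated power analysis software based on the t-distribution will give slightly larger numbers). The function name `n_per_group`, the effect sizes of d = 0.5 and d = 0.25, and the 90% power target are illustrative assumptions, not values from a specific study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.90):
    """Approximate n per group for a two-sided comparison of two independent means.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical scenario: the effect turns out to be half as large as expected.
for d in (0.5, 0.25):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because the required sample size scales with 1/d², an effect that is half as large requires roughly four times as many participants.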

The reason that published articles do not guide you towards practical solutions for a lack of resources is that there are no solutions for a lack of resources. Regrettably, the mathematics do not care about how small the participant payment budget is that you have available. This is not to say that you cannot improve your current practices by reading up on best practices to increase the efficiency of data collection. Let me give you an overview of some things that you should immediately implement if you use hypothesis tests, and data collection is costly.

1) Use directional tests where relevant. Just following statements such as ‘we predict X is larger than Y’ up with a logically consistent test of that claim (e.g., a one-sided t-test) will easily give you an increase of 10% power in any well-designed study. If you feel you need to give effects in both directions a non-zero probability, then at least use lopsided tests.

2) Use sequential analysis whenever possible. It’s like optional stopping, but then without the questionable inflation of the false positive rate. The efficiency gains are so great that, if you complain about the recent push towards larger sample sizes without already having incorporated sequential analyses, I will have a hard time taking you seriously.

3) Increase your alpha level. Oh yes, I am serious. Contrary to what you might believe, the recommendation to use an alpha level of 0.05 was not the sixth of the ten commandments – it is nothing more than, as Fisher calls it, a ‘convenient convention’. As we wrote in our Justify Your Alpha paper as an argument to not require an alpha level of 0.005: “without (1) increased funding, (2) a reward system that values large-scale collaboration and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined.” If you *have* to make a decision, and the data you can feasibly collect is limited, take a moment to think about how problematic Type 1 and Type 2 error rates are, and maybe minimize combined error rates instead of rigidly using a 5% alpha level.

4) Use within designs where possible. Especially when measurements are strongly correlated, this can lead to a substantial increase in power.

5) If you read this blog or follow me on Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and money: Designing statistically powerful studies while minimizing financial costs (link). They discuss I) better ways to screen participants for studies where participants need to be screened before participation, II) assigning participants unequally to conditions (if the control condition is much cheaper than the experimental condition, for example), III) using multiple measurements to increase measurement reliability (or use well-validated measures, if I may add), and IV) smart use of (preregistered, I’d recommend) covariates.

6) If you are really brave, you might want to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably, almost all approaches to statistical inferences become very limited when the number of observations is small. If you are very confident in your predictions (and your peers agree), incorporating prior information will give you a benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and colleagues.
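Points 1, 3, and 4 above can be illustrated with a rough numerical sketch. It assumes a normal (z-test) approximation to the t-test, a hypothetical effect of d = 0.5, 50 participants per group, and a correlation of 0.7 between repeated measures. The helper names `power_z` and `best_alpha` are made up for this illustration, and weighting Type 1 and Type 2 errors equally is an assumption you would need to justify in a real study.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power_z(ncp, alpha=0.05, two_sided=True):
    """Approximate power of a z-test; ncp is the expected z-value under H1."""
    crit = Z.inv_cdf(1 - alpha / 2) if two_sided else Z.inv_cdf(1 - alpha)
    power = 1 - Z.cdf(crit - ncp)
    if two_sided:
        power += Z.cdf(-crit - ncp)  # rejections in the wrong direction
    return power


def best_alpha(ncp, two_sided=True):
    """Grid search for the alpha that minimizes alpha + beta (equal weights)."""
    grid = [i / 1000 for i in range(1, 500)]
    return min(grid, key=lambda a: a + (1 - power_z(ncp, a, two_sided)))


d, n, r = 0.5, 50, 0.7  # hypothetical effect size, sample size, correlation

# Between-subjects design: n participants in each of two groups.
ncp_between = d * sqrt(n / 2)
power_two_sided = power_z(ncp_between)
power_one_sided = power_z(ncp_between, two_sided=False)

# Within-subjects design: n participants measured in both conditions.
# The effect size for the difference scores is d_z = d / sqrt(2 * (1 - r)).
ncp_within = d / sqrt(2 * (1 - r)) * sqrt(n)
power_within = power_z(ncp_within)

print(f"Two-sided, between: {power_two_sided:.2f}")
print(f"One-sided, between: {power_one_sided:.2f}")
print(f"Two-sided, within:  {power_within:.2f}")
print(f"Alpha minimizing combined error rates: {best_alpha(ncp_between):.3f}")
```

In this scenario the directional test gains roughly ten percentage points of power, the within design does better still with half the participants, and the alpha level that minimizes the combined error rates is well above 5%.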

Now if you care about efficiency, you might already have incorporated all these things. There is no way to further improve the statistical power of your tests, and by all plausible estimates of effect sizes you can expect or the smallest effect size you would be interested in, statistical power is low. Now what should you do?

What to do if best practices in study design won’t save you?

The first thing to realize is that you should not look at statistics to save you. There are no secret tricks or magical solutions. Highly informative experiments require a large number of observations. So what should we do then? The solutions below are, regrettably, a lot more work than making a small change to the design of your study. But it is about time we start to take them seriously. This is a list of solutions I see – but there is no doubt more we can/should do, so by all means, let me know your suggestions on twitter or in the comments.

1) Ask for a lot more money in your grant proposals.
Some grant organizations distribute funds to be awarded as a function of how much money is requested. If you need more money to collect informative data, ask for it. Obviously grants are incredibly difficult to get, but if you ask for money, include a budget that acknowledges that data collection is not as cheap as you hoped some years ago. In my experience, psychologists are often asking for much less money to collect data than other scientists. Increasing the requested funds for participant payment by a factor of 10 is often reasonable, given the requirements of journals to provide a solid sample size justification, and the more realistic effect size estimates that are emerging from preregistered studies.

2) Improve management.
If the implicit or explicit goals that you should meet are still the same now as they were 5 years ago, and you did not receive a miraculous increase in money and time to do research, then an update of the evaluation criteria is long overdue. I sincerely hope your manager is capable of this, but some ‘upward management’ might be needed. In the coda of Lakens & Evers (2014) we wrote “All else being equal, a researcher running properly powered studies will clearly contribute more to cumulative science than a researcher running underpowered studies, and if researchers take their science seriously, it should be the former who is rewarded in tenure systems and reward procedures, not the latter.” and “We believe reliable research should be facilitated above all else, and doing so clearly requires an immediate and irrevocable change from current evaluation practices in academia that mainly focus on quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure track, I thought it would be at least principled if I sent this coda to the head of my own department. He replied that the things we wrote made perfect sense, instituted a recommendation to aim for 90% power in studies our department intends to publish, and has since then tried to make sure quality, and not quantity, is used in evaluations within the faculty (as you might have guessed, I am not on the job market, nor do I ever hope to be).

3) Change what is expected from PhD students.
When I did my PhD, there was the assumption that you performed enough research in the 4 years you are employed as a full-time researcher to write a thesis with 3 to 5 empirical chapters (with some chapters having multiple studies). These studies were ideally published, but at least publishable. If we consider it important for PhD students to produce multiple publishable scientific articles during their PhDs, this will greatly limit the types of research they can do. Instead of evaluating PhD students based on their publications, we can see the PhD as a time where researchers learn skills to become an independent researcher, and evaluate them not based on publishable units, but in terms of clearly identifiable skills. I personally doubt data collection is particularly educational after the 20th participant, and I would probably prefer to hire a post-doc who had well-developed skills in programming and statistics, and who had read the literature broadly, than someone who used that time to collect participants 21 to 200. If we make it easier for PhD students to demonstrate their skill level (which would include at least 1 well-written article, I personally think) we can evaluate what they have learned in a more sensible manner than now. Currently, differences in the resources PhD students have at their disposal are a huge confound as we try to judge their skill based on their resume. Researchers at rich universities obviously have more resources; it should not be difficult to develop tools that allow us to judge the skills of people where resources are much less of a confound.

4) Think about the questions we collectively want answered, instead of the questions we can individually answer.
Our society has some serious issues that psychologists can help address. These questions are incredibly complex. I have long lost faith in the idea that a bottom-up organized scientific discipline that rewards individual scientists will manage to generate reliable and useful knowledge that can help to solve these societal issues. For some of these questions we need well-coordinated research lines where hundreds of scholars work together, pool their resources and skills, and collectively pursue answers to these important questions. And if we are going to limit ourselves in our research to the questions we can answer in our own small labs, these big societal challenges are not going to be solved. Call me a pessimist. There is a reason we resort to forming unions and organizations that have the goal to collectively coordinate what we do. If you greatly dislike team science, don’t worry: there will always be options to make scientific contributions by yourself. But right now, there are almost no ways for scientists who want to pursue huge challenges in large well-organized collectives of hundreds or thousands of scholars to do so (for a recent exception that proves my rule by remaining unfunded: see the Psychological Science Accelerator). If you honestly believe your research question is important enough to be answered, then get together with everyone who also thinks so, and pursue answers collectively. Doing so should, eventually (I know science funders are slow), also be more convincing as you ask for more resources to do the research (as in point 1).

If you are upset that as a science we lost the blissful ignorance surrounding statistical power, and are requiring researchers to design informative studies, which hits substantially harder in some research fields than in others: I feel your pain. I have argued against universally lower alpha levels for you, and have tried to write accessible statistics papers that make you more efficient without increasing sample sizes. But if you are in a research field where even best practices in designing studies will not allow you to perform informative studies, then you need to accept the statistical reality you are in. I have already written too long a blog post, even though I could keep going on about this. My main suggestions are to ask for more money, get better management, change what we expect from PhD students, and self-organize – but there is much more we can do, so do let me know your top suggestions. This will be one of the many challenges our generation faces, but if we manage to address it, it will lead to a much better science.