This blog is an excerpt of an invited
journal article for a special issue of Japanese Psychological Review, which I am
currently one week overdue with (but that I hope to complete soon). I hope this
paper will raise the bar in the ongoing discussion about the value of
preregistration in psychological science. If you have any feedback on what I
wrote here, I would be very grateful to hear it, as it would allow me to improve
the paper I am working on.

To fruitfully discuss preregistration, researchers need to provide a clear conceptual definition of
preregistration, anchored in their philosophy of science.
For as long as data has been used to support
scientific claims, people have tried to selectively present data in line with
what they wish to be true. In his treatise ‘On the Decline of Science in
England: And on Some of its Causes’, Babbage (1830) discusses
what he calls cooking: “One of its numerous processes is to make
multitudes of observations, and out of these to select those only which agree
or very nearly agree. If a hundred observations are made, the cook must be very
unlucky if he can not pick out fifteen or twenty that will do up for serving.” In
the past, researchers have proposed solutions to prevent such bias in the literature.
With the rise of the internet it has become feasible to create online
registries that ask researchers to specify their research design and the
planned analyses. Scientific communities have started to make use of this
opportunity (for a historical overview, see Wiseman, Watt, & Kornbrot, 2019).
Preregistration in psychology has been a
good example of ‘learning by doing’. Best practices are continuously updated as
we learn from practical challenges and early meta-scientific investigations
into how preregistrations are performed. At the same time, discussions have
emerged about what the goal of preregistration is, whether preregistration is
desirable, and what preregistration should look like across different research
areas. Every practice comes with costs and benefits, and it is useful to
evaluate whether and when preregistration is worth it. Finally, it is important
to evaluate how preregistration relates to different philosophies of science,
and when it facilitates or distracts from goals scientists might have. The
discussion about benefits and costs of preregistration has not been productive
up to now because there is a general lack of a conceptual analysis of what
preregistration entails and aims to accomplish, which leads to disagreements
that would be easily resolved if a conceptual definition were available. Any
conceptual definition of a tool that scientists use, such as
preregistration, must examine the goals it achieves, and thus requires a
clearly specified view on philosophy of science, which provides an analysis of
different goals scientists might have. Discussing preregistration without
discussing philosophy of science is a waste of time.
What is Preregistration For?
The goal of preregistration is to transparently prevent bias due to the
selective reporting of analyses. Since bias in
estimates only occurs in relation to a true population parameter,
preregistration as discussed here is limited to scientific questions that
involve estimates of population values from samples. Researchers can have many
different goals when collecting data, perhaps most notably theory development,
as opposed to tests of statistical predictions derived from theories. When
testing predictions, researchers might want a specific analysis to yield a null
effect, for example to show that including a possible confound in an analysis
does not change their main results. More often perhaps, they want an analysis
to yield a statistically significant result, for example so that they can argue
the results support their prediction, based on a p-value below 0.05.
Both examples are sources of bias in the estimate of a population effect size.
In this paper I will assume researchers use frequentist statistics, but all
arguments can be generalized to Bayesian statistics (Gelman & Shalizi,
2013). When effect size estimates are biased, for example due to the desire to
obtain a statistically significant result, hypothesis tests performed on these
estimates have inflated Type 1 error rates, and when bias emerges due to the
desire to obtain a non-significant test result, hypothesis tests have reduced
statistical power. In line with the general tendency to weigh Type 1 error
rates (the probability of obtaining a statistically significant result when
there is no true effect) as more serious than Type 2 error rates (the
probability of obtaining a non-significant result when there is a true effect),
publications that discuss preregistration have been more concerned with
inflated Type 1 error rates than with low power. However, one can easily think of
situations where the latter is a bigger concern.
If the only goal of a researcher is to
prevent bias, it suffices to make a mental note of the planned analyses, or to
verbally agree upon the planned analysis with collaborators, assuming we will
perfectly remember our plans when analyzing the data. The reason to write down
an analysis plan is not to prevent bias, but to transparently prevent bias. By
including transparency in the definition of preregistration it becomes clear
that the main goal of preregistration is to convince others that the reported
analysis tested a clearly specified prediction. Not all approaches to knowledge
generation value prediction, and it is important to evaluate if your philosophy
of science values prediction to be able to decide if preregistration is a
useful tool in your research. Mayo (2018) presents an overview of different
arguments for the role prediction plays in science and arrives at a severity
requirement: We can build on claims that passed tests that were highly capable
of demonstrating the claim was false, but supported the prediction
nevertheless. This requires that researchers who read about claims are able to
evaluate the severity of a test. Preregistration facilitates this.
Although falsifying theories is a complex
issue, falsifying statistical predictions is straightforward. Researchers can
specify when they will interpret data as support for their claim based on the
result of a statistical test, and when not. An example is a directional (or
one-sided) t-test testing whether an observed mean is larger than zero.
Observing a value statistically smaller than or equal to zero would falsify this
statistical prediction (as long as statistical assumptions of the test hold,
and with some error rate in frequentist approaches to statistics). In practice,
only range predictions can be statistically falsified. Because resources and
measurement accuracy are not infinitely large, there is always a value close
enough to zero that is statistically impossible to distinguish from zero.
Therefore, researchers will need to specify at least some possible outcomes that
statistical tests can pick up on and that would not be considered support for
their prediction. How such bounds are determined is a massively understudied problem
in psychology, but it is essential to have falsifiable predictions.
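To make this concrete, here is a minimal sketch in Python (using scipy and simulated data; the sample, the true effect of 0.3, and the smallest effect of interest of 0.1 are all illustrative assumptions, not recommendations) of a directional test against zero next to a test against the lower bound of a range prediction. Only the test against the bound can fail when the true effect is trivially small but not exactly zero.

import numpy as np
from scipy import stats

# Simulated sample; the true mean of 0.3 is an arbitrary assumption.
rng = np.random.default_rng(seed=1)
x = rng.normal(loc=0.3, scale=1.0, size=50)

# Directional (one-sided) test: is the mean larger than zero?
res_zero = stats.ttest_1samp(x, popmean=0.0, alternative='greater')

# Range prediction: is the mean larger than a smallest effect of interest?
# The bound of 0.1 is arbitrary; observed means at or below it would not
# count as support for the prediction.
bound = 0.1
res_bound = stats.ttest_1samp(x, popmean=bound, alternative='greater')

print(f"Test against 0:     t = {res_zero.statistic:.2f}, p = {res_zero.pvalue:.3f}")
print(f"Test against {bound}: t = {res_bound.statistic:.2f}, p = {res_bound.pvalue:.3f}")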
While the bounds of a range prediction enable
statistical falsification, specifying these bounds is not enough to
evaluate how capable a test was of demonstrating a claim was wrong. Meehl
(1990) argues that we are more impressed by a prediction the more ways it
could have been wrong. He
writes (1990, p. 128): “The working scientist is often more impressed when a
theory predicts something within, or close to, a narrow interval than when it
predicts something correctly within a wide one.” Imagine I make a prediction
about where a dart will land if I throw it at a dartboard. You will be more
impressed with my darts skills if I predict I will hit the bullseye, and I hit
the bullseye, than when I predict I will hit the dartboard, and I hit the dartboard.
Making very narrow range predictions is a way to make it statistically
likely that your prediction will be falsified if it is wrong. It is also possible to make
theoretically risky predictions, for example by predicting you will only
observe a statistically significant difference from zero in a hypothesis test
if a very specific set of experimental conditions is met that all follow from a
single theory. Regardless of how researchers increase the capability of a test
to demonstrate a prediction is wrong, the approach to scientific progress described here places more
faith in claims based on predictions that have a higher capability of being
falsified, but where data nevertheless supports the prediction. Anyone is free
to choose a different philosophy of science, and create a coherent analysis of
the goals of preregistration in that framework, but as far as I am aware, Mayo’s
severity argument currently provides one of the few philosophies of science
that allows for a coherent conceptual analysis of the value of preregistration.
Researchers admit to research practices
that make their predictions, or the empirical support for their prediction,
look more impressive than it is. One example of such a practice is optional
stopping, where researchers collect a number of datapoints, perform statistical
analyses, and continue the data collection if the result is not statistically
significant. In theory, a researcher who is willing to continue collecting data
indefinitely will always find a statistically significant result. By repeatedly
looking at the data, the Type 1 error rate can inflate to 100%. Even though in
practice the inflation will be smaller, optional stopping strongly increases
the probability that a researcher can interpret their result as support for
their prediction. In the extreme case, where a researcher is 100% certain that
they will observe a statistically significant result when they perform their
statistical test, their prediction will never be falsified. Providing support
for a claim by relying on optional stopping should not increase our faith in
the claim by much, or even at all. As Mayo (2018, p. 222) writes: “The good
scientist deliberately arranges inquiries so as to capitalize on pushback, on
effects that will not go away, on strategies to get errors to ramify quickly
and force us to pay attention to them. The ability to register how hunting,
optional stopping, and cherry picking alter their error-probing capacities is a
crucial part of a method’s objectivity.” If researchers were to transparently
register their data collection strategy, readers could evaluate the capability
of the test to falsify their prediction, conclude this capability is very
small, and be relatively unimpressed by the study. If the stopping rule keeps
the probability of finding a non-significant result high when the prediction is
incorrect, and the data nevertheless support the prediction, we can choose
to act as if the claim is correct because it has been severely tested.
Preregistration thus functions as a tool to allow other researchers to
transparently evaluate the severity with which a claim has been tested.
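To illustrate how much optional stopping can inflate error rates, the following simulation sketch (with assumed settings: no true effect, batches of 20 observations, and up to ten looks at the accumulating data) counts how often a researcher who stops as soon as p < .05 ends up declaring a 'significant' result. With these settings the rate comes out at roughly .19 instead of the nominal .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n_sims = 5000      # number of simulated studies
batch_size = 20    # observations added before each look at the data
max_looks = 10     # the accumulating data are analyzed up to ten times
alpha = 0.05

false_positives = 0
for _ in range(n_sims):
    data = np.empty(0)
    for _ in range(max_looks):
        # The null hypothesis is true: the population mean is zero.
        data = np.append(data, rng.normal(loc=0, scale=1, size=batch_size))
        p = stats.ttest_1samp(data, popmean=0).pvalue
        if p < alpha:          # stop collecting data as soon as p < .05
            false_positives += 1
            break

print(f"Type 1 error rate with optional stopping: {false_positives / n_sims:.2f}")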
The severity of a test can also be
compromised by selecting a hypothesis based on the observed results. In this
practice, known as Hypothesizing After the Results are Known (HARKing; Kerr,
1998), researchers look at their data, and then select a prediction. This
reversal of the typical hypothesis testing procedure makes the test incapable
of demonstrating the claim was false. Mayo (2018) refers to this as ‘bad
evidence, no test’. If we choose a prediction from among the options that yield
a significant result, the claims we make based on these ‘predictions’ will never
be wrong. In philosophies of science that value predictions, such claims do not
increase our confidence that they are true, because they have not yet been
tested. By preregistering our predictions, we transparently communicate to
readers that our predictions predated looking at data, and therefore that the
data we present as support of our prediction could have falsified our
hypothesis. We have not made our test look more severe by narrowing the range of
our predictions after looking at the data (like the Texas sharpshooter who
draws the circles of the bullseye after shooting at the wall of the barn). A
reader can transparently evaluate how severely our claim was tested.
As a final example of the value of
preregistration to transparently allow readers to evaluate the capability of
our prediction to be falsified, think about the scenario described by Babbage
at the beginning of this article, where a researcher makes multitudes of
observations, and selects from these only the results that support their
prediction. The larger the number of observations to choose from, the higher
the probability that one of the possible tests could be presented as support
for the hypothesis. Therefore, from a perspective on scientific knowledge
generation where severe tests are valued, choosing to selectively report tests
from among many tests that were performed strongly reduces the capability of a
test to demonstrate the claim was false. This can be prevented by correcting
for multiple testing by lowering the alpha level depending on the number of
tests.
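As a rough illustration of this last point, the sketch below (assuming, for simplicity, twenty independent outcomes and no true effects anywhere) shows how often at least one of the tests comes out 'significant' when each is evaluated at an uncorrected alpha of .05, and how a Bonferroni correction, which divides the alpha level by the number of tests, keeps this familywise error rate at the nominal level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n_sims, n_tests, n, alpha = 2000, 20, 30, 0.05

any_sig_uncorrected = 0
any_sig_bonferroni = 0
for _ in range(n_sims):
    # Twenty independent outcomes, each with a true effect of zero.
    pvals = np.array([stats.ttest_1samp(rng.normal(0, 1, n), popmean=0).pvalue
                      for _ in range(n_tests)])
    any_sig_uncorrected += np.any(pvals < alpha)
    any_sig_bonferroni += np.any(pvals < alpha / n_tests)  # Bonferroni threshold

# Uncorrected: close to 1 - 0.95**20, i.e. roughly .64. Bonferroni: close to .05.
print(f"At least one significant test, uncorrected: {any_sig_uncorrected / n_sims:.2f}")
print(f"At least one significant test, Bonferroni:  {any_sig_bonferroni / n_sims:.2f}")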
The fact that preregistration is about
specifying ways in which your claim could be false is not generally
appreciated. Preregistrations should carefully specify not just the analysis
researchers plan to perform, but also when they would infer from the analyses
that their prediction was wrong. As the preceding section explains, successful
predictions impress us more when the data that was collected was capable of
falsifying the prediction. Therefore, a preregistration document should give us
all the required information that allows us to evaluate the severity of the
test. Specifying exactly which test will be performed on the data is important,
but not enough. Researchers should also specify when they will conclude the
prediction was not supported. Beyond specifying the analysis plan in detail, the
severity of a test can be increased by narrowing the range of values that are
predicted (without increasing the Type 1 and Type 2 error rates), or by making the
theoretical prediction more specific by specifying detailed circumstances under
which the effect will be observed, and when it will not be observed.
When is Preregistration Valuable?
If one agrees with the conceptual analysis
above, it follows that preregistration adds value for people who choose to
increase their faith in claims that are supported by severe tests and
predictive successes. Whether this seems reasonable depends on your philosophy
of science. Preregistration itself does not make a study better or worse
compared to a non-preregistered study. Sometimes, being able to transparently
evaluate a study (and its capability to demonstrate claims were false) will
reveal a study was completely uninformative. Other times we might be able to
evaluate the capability of a study to demonstrate a claim was false even if the
study is not transparently preregistered. Examples are studies where there is
no room for bias, because the analyses are perfectly constrained by theory, or
because it is not possible to analyze the data in any other way than was
reported. Although the severity of a test is in principle unrelated to whether
it is preregistered or not, in practice there will be a positive correlation,
driven by the studies where transparent preregistration improves our ability to
evaluate how capable the study was of demonstrating a claim was false: for
example, studies with multiple dependent variables to choose from, studies that
do not use a standardized measurement scale so that the dependent variable can
be calculated in different ways, or studies where additional data are easily
collected.
We can apply our conceptual analysis of
preregistration to hypothetical but realistic situations to gain better insight
into when preregistration is a valuable tool, and when not. For example,
imagine a researcher who preregisters an experiment where the main analysis
tests a linear relationship between two variables. This test yields a
non-significant result, thereby failing to support the prediction. In an
exploratory analysis the authors find that fitting a polynomial model yields a
significant test result with a low p-value. A reviewer of their manuscript has
studied the same relationship, albeit in a slightly different context and with
another measure, and has unpublished data from multiple studies that also
yielded polynomial relationships. The reviewer also has a tentative idea about
the underlying mechanism that causes not a linear, but a polynomial,
relationship. The original authors will be of the opinion that the claim of a
polynomial relationship has passed a less severe test than their original
prediction of a linear relationship would have passed (had it been supported).
However, the reviewer would never have preregistered a linear relationship to
begin with, and therefore does not evaluate the switch to a polynomial test in
the exploratory result section as something that reduces the severity of the
test. Given that the experiment was well-designed, the test for a polynomial
relationship will be judged as having greater severity by the reviewer than by
the authors. In this hypothetical example the reviewer has additional data that
would have changed the hypothesis they would have preregistered in the original
study. It is also possible that the difference in evaluation of the exploratory
test for a polynomial relationship is based purely on a subjective prior
belief, or on knowledge about an existing well-supported theory
that would predict a polynomial, but not a linear, relationship.
Now imagine that our reviewer asks for the
raw data to test whether their assumed underlying mechanism is supported. They
receive the dataset, and looking through the data and the preregistration, the
reviewer realizes that the original authors didn’t adhere to their
preregistered analysis plan. They violated their stopping rule, analyzing the
data in batches of four and stopping earlier than planned. They did not
carefully specify how to compute their dependent variable in the
preregistration, and although the reviewer has no experience with the measure
that was used, the dataset contains eight ways in which the dependent variable
was calculated. Only one of these eight ways of calculating the dependent variable
yields a significant effect for the polynomial relationship. Faced with this
additional information, the reviewer believes it is much more likely that the
analysis testing the claim was the result of selective reporting, and now is of
the opinion the polynomial relationship was not severely tested.
Both of these evaluations of how severely a
hypothesis was tested were perfectly reasonable, given the information the reviewer
had available. This reveals how sometimes switching from a preregistered analysis
to an exploratory analysis does not impact the evaluation of the severity of
the test by a reviewer, while in other cases a selectively reported result does
reduce the perceived severity with which a claim has been tested.
Preregistration makes more information available to readers that can be used to
evaluate the severity of a test, but readers might not always evaluate the
information in a preregistration in the same way. Whether a design or analytic
choice increases or decreases the capability of a claim to be falsified depends
on statistical theory, as well as on prior beliefs about the theory that is
tested. Some practices are known to reduce the severity of tests, such as
optional stopping and selectively reporting analyses that yield desired results,
and therefore it is easier to evaluate how statistical practices impact the
severity with which a claim is tested. If a preregistration is followed through
exactly as planned, then the tests that are performed have desired error rates
in the long run, as long as the test assumptions are met. Note that because
long run error rates are based on assumptions about the data generating
process, which are never known, true error rates are unknown, and thus
preregistration makes it relatively more likely that tests have desired long
run error rates. The severity of a test also depends on assumptions about the
underlying theory, and how the theoretical hypothesis is translated into a
statistical hypothesis. There will rarely be unanimous agreement on whether a
specific operationalization is a better or worse test of a hypothesis, and thus
researchers will differ in their evaluation of how severely specific design
choices test a claim. This once more highlights how preregistration does not
automatically increase the severity of a test. When it prevents practices that
are known to reduce the severity of tests, such as optional stopping,
preregistration leads to a relative increase in the severity of a test compared
to a non-preregistered study. But when there is no objective evaluation of the
severity of a test, as is often the case when we try to judge how severe a test
was based on theoretical grounds, preregistration merely enables a transparent
evaluation of the capability of a claim to be falsified.