When you want to test a theory, the theory needs to make a prediction, and you
need a procedure that can evaluate whether a verification criterion has been
met. As De Groot writes: “A theory must afford at least a
number of opportunities for testing. That is to say, the relations stated in
the model must permit the deduction of hypotheses which can be empirically
tested. This means that these hypotheses must in turn allow the deduction of
verifiable predictions, the fulfillment or non-fulfillment of which will
provide relevant information for judging the validity or acceptability of the
hypotheses” (§ 3.1.4).
This last sentence is interesting – we collect data to test the ‘validity’ of a
theory. We are trying to see how well our theory does when we use it to predict
what unobserved data look like (whether those data are collected in the future
or, as De Groot remarks, in the past).
As De Groot writes: “Stated otherwise, the function of the prediction in the
scientific enterprise is to provide relevant information with respect to the
validity of the hypothesis from which it has been derived.” (§ 3.4.1).
To make a prediction that can be true or
false, we need to forbid certain states of the world and allow others. As De
Groot writes: “Thus, in the case of statistical predictions, where it is sought
to prove the existence of a causal factor from its effect, the interval of
positive outcomes is defined by the limits outside which the null hypothesis
is to be rejected. It is common practice that such limits are fixed by
selecting in advance a conventional level of significance: e.g., 5 %, 1 %, or .1
% risk of error in rejecting the assumption that the null hypothesis holds in
the universe under consideration. Though naturally a judicious choice will be
made, it remains nonetheless arbitrary. At all events, once it has been made,
there has been created an interval of positive outcome, and thus a verification
criterion. Any outcome falling within it stamps the prediction as ‘proven true’.”
(§ 3.4.2). Note that if you prefer, you can predict an effect size with some
accuracy, calculate a Bayesian highest density interval that excludes some value,
or compute a Bayes factor that is larger than some cut-off – as long as your
prediction can either be confirmed or not confirmed.
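To make this concrete, here is a minimal sketch of such verification criteria in code. The function names and the thresholds (an alpha of 0.05, a Bayes factor cut-off of 3) are illustrative choices of mine, not values prescribed by De Groot; the point is only that each criterion is fixed in advance and returns a clear ‘confirmed’ or ‘not confirmed’.

# Minimal sketch of pre-specified verification criteria (thresholds are illustrative).

def prediction_confirmed(p_value: float, alpha: float = 0.05) -> bool:
    """Frequentist criterion: reject the null at the pre-specified alpha level."""
    return p_value < alpha

def hdi_excludes(hdi_lower: float, hdi_upper: float, value: float = 0.0) -> bool:
    """Bayesian criterion: the highest density interval excludes some value."""
    return value < hdi_lower or value > hdi_upper

def bf_exceeds(bayes_factor: float, cutoff: float = 3.0) -> bool:
    """Bayes factor criterion: evidence larger than a pre-specified cut-off."""
    return bayes_factor > cutoff

print(prediction_confirmed(0.013, alpha=0.05))  # True: falls inside the interval of positive outcomes
print(prediction_confirmed(0.013, alpha=0.01))  # False: falls outside it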
Note that the prediction gets a ‘proven
true’ stamp – the theory does not. In this testing procedure, there is no
direct route from a ‘proven true’ prediction to a ‘true theory’ conclusion.
Indeed, the latter conclusion is not possible in science. We are mainly
indexing the ‘track record’ of a theory, as Meehl (1990) argues: “The main way
a theory gets money in the bank is by predicting facts that, absent the theory,
would be antecedently improbable.” Often (e.g., in non-experimental settings)
rejecting a null hypothesis with large sample sizes is not considered a very
improbable event, but that is another issue (see also the definition of a severe
test by Mayo (1996, p. 178): a passing result is a severe test of hypothesis H
just to the extent that it is very improbable for such a passing result to
occur, were H false).
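As a rough illustration of that idea (a sketch under my own assumptions, not Mayo’s formal severity calculation), the simulation below checks how often a one-sample t-test at an alpha of 0.05 ‘passes’ when the hypothesized effect is absent versus when it is present; the sample size and the effect sizes are arbitrary values chosen for the example.

# Sketch: a 'pass' (p < alpha) should be improbable when H ('there is an effect') is false.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2021)
alpha, n, n_sims = 0.05, 50, 10_000

def pass_rate(true_effect):
    """Proportion of simulated studies in which a one-sample t-test passes (p < alpha)."""
    passes = 0
    for _ in range(n_sims):
        sample = rng.normal(loc=true_effect, scale=1.0, size=n)
        if stats.ttest_1samp(sample, popmean=0.0).pvalue < alpha:
            passes += 1
    return passes / n_sims

print(pass_rate(0.0))  # around 0.05: a pass is improbable when H is false
print(pass_rate(0.5))  # much higher: a pass is common when the effect is real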
Regardless of how risky our prediction was, when we collect the data and test
the hypothesis, we either confirm the prediction or we do not. In frequentist
statistics, we add this outcome to the ‘track record’ of our theory, but we
cannot draw conclusions based on any single study. As Fisher (1926, p. 504)
writes: “if one in twenty does not seem high enough odds, we may, if we prefer
it, draw the line at one in fifty (the 2 per cent point), or one in a hundred
(the 1 per cent point). Personally, the writer prefers to set a low standard of
significance at the 5 per cent point, and ignore entirely all results which
fail to reach this level. A scientific
fact should be regarded as experimentally established only if a properly
designed experiment rarely fails to give this level of significance” (italics added).
The study needs to be ‘properly designed’ so that it ‘rarely’ fails to give
this level of significance – which, despite Fisher’s dislike of Neyman-Pearson
statistics, I can read in no other way than as an instruction to run
well-powered studies for whatever happens to be your smallest effect size of
interest. In other words: when testing the validity of theories through
predictions, where you build up a ‘track record’ of predictions, you need to
control your error rates to efficiently distinguish hits from misses. Design
well-powered studies, and do not fool yourself by inflating the probability of
observing a false positive.
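As a hedged sketch of what ‘well-powered for your smallest effect size of interest’ can look like in practice, the calculation below uses statsmodels to find the sample size for an independent-samples t-test. The smallest effect size of interest of d = 0.3, the alpha of 0.05, and the desired power of 0.90 are illustrative values I picked for the example, not recommendations.

# Sketch: sample size needed to detect a smallest effect size of interest
# (all numbers below are illustrative assumptions).
from statsmodels.stats.power import TTestIndPower

sesoi = 0.3   # smallest effect size of interest, in Cohen's d (assumed)
alpha = 0.05  # pre-specified significance level
power = 0.90  # desired probability of detecting an effect at least as large as the SESOI

n_per_group = TTestIndPower().solve_power(
    effect_size=sesoi, alpha=alpha, power=power, alternative='two-sided'
)
print(round(n_per_group))  # roughly 235 participants per group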
I think that when it comes to testing
theories, assessing their validity through prediction is extremely important
(and for me, perhaps the most important part). We don’t want to fool ourselves
when we test the validity of our theories. An example of ‘fooling yourself’ is
the set of studies on pre-cognition by Daryl Bem (2011). A result I like to use
in workshops comes from Study 1, where people pressed a left or right button to
predict whether a picture was hidden behind the left or the right curtain.
If we take this study as it is (without pre-registration), it is clear there
are 5 tests (for erotic, neutral, negative, positive, and ‘romantic but
non-erotic’ pictures). A Bonferroni correction would lead us to use an alpha
level of 0.01 (0.05 divided over 5 tests), and the reported result (0.01, but
more precisely 0.013) would not be enough to support our prediction, given the
pre-specified alpha level (the arithmetic is spelled out below). Note that Bem
(Bem, Utts, and Johnson, 2011) explicitly says this test was predicted.
However, I see absolutely no reason to believe Bem without a pre-registration
document for the study.
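The correction itself is just this (the number of tests and the p-value are the ones discussed above):

# Bonferroni correction for the five picture categories in Bem's Study 1,
# using the p-value of 0.013 mentioned above.
n_tests = 5
alpha = 0.05
alpha_corrected = alpha / n_tests       # 0.01
p_erotic = 0.013                        # the reported result for erotic pictures
print(alpha_corrected)                  # 0.01
print(p_erotic < alpha_corrected)       # False: the prediction is not supported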
Bayesian statistics do not provide a
solution when analyzing this pre-cognition experiment. As Gelman and Loken
(2013) write about this study (I just realized this ‘Garden of Forking Paths’
paper is unpublished, but already has 150 citations!): “we can
still take this as an observation that 53.1% of these guesses were correct, and
if we combine this with a flat prior distribution (that is, the assumption that
the true average probability of a correct guess under these conditions is
equally likely to be anywhere between 0 and 1) or, more generally, a
locally-flat prior distribution, we get a posterior probability of over 99%
that the true probability is higher than 0.5; this is one interpretation of the
one-sided p-value of 0.01.” The use of Bayes factors that quantify model
evidence provides no solution either. Where Wagenmakers, Wetzels, Borsboom, and
van der Maas (2011) argue, based on ‘default’ Bayesian t-tests, that the null
hypothesis is supported, Bem, Utts, and Johnson (2011) correctly point out that
this criticism is flawed, because the default Bayesian t-tests use completely
unrealistic priors for pre-cognition research (and for most other studies
published in psychology, for that matter).
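To make both points tangible, here is a toy version of these calculations for aggregated binomial data (not the default Bayesian t-test Wagenmakers and colleagues actually used). The only number taken from the text is the 53.1% hit rate; the total number of guesses is a hypothetical value chosen for illustration, and the two priors are likewise my own choices, meant only to show how strongly the Bayes factor depends on them.

# Toy Bayesian analysis of an aggregated 53.1% hit rate (binomial model).
# The trial count and the priors are illustrative assumptions, not Bem's data.
import numpy as np
from scipy import stats
from scipy.special import betaln

n_guesses = 1500                       # hypothetical total number of guesses
k_correct = round(0.531 * n_guesses)   # 53.1% correct, the figure quoted above

# Posterior probability that the true hit rate exceeds 0.5 under a flat
# Beta(1, 1) prior -- the quantity Gelman and Loken discuss.
posterior = stats.beta(1 + k_correct, 1 + n_guesses - k_correct)
print(1 - posterior.cdf(0.5))

# Bayes factor BF10 for H1 (hit rate ~ Beta(a, b)) against H0 (hit rate = 0.5).
def bf10(a, b):
    log_m1 = betaln(k_correct + a, n_guesses - k_correct + b) - betaln(a, b)
    log_m0 = n_guesses * np.log(0.5)
    return float(np.exp(log_m1 - log_m0))

print(bf10(1, 1))      # wide 'default'-style prior: hit rate anywhere in [0, 1]
print(bf10(100, 100))  # prior concentrated near 0.5 (tiny effects), arguably
                       # more realistic for pre-cognition

Depending on how wide the prior is, the same data can yield a Bayes factor that leans towards the null or towards the alternative, which is exactly the crux of the exchange between Wagenmakers and colleagues and Bem, Utts, and Johnson.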
It is interesting that the best solution Gelman
and Loken come up with is that “perhaps researchers can perform half as many
original experiments in each paper and just pair each new experiment with a
preregistered replication”. What matters is not just the data, but the
procedure used to collect the data. The procedure needs to be able to demonstrate
strong predictive validity, which is why pre-registration is such a great
solution to many problems science faces. Pre-registered studies are the best
way we have to show you can actually predict something – which gets your theory
money in the bank.
If people ask me if I care about evidence,
I typically say: ‘mwah’. For me, evidence is not a primary goal of doing research.
Evidence is a consequence of demonstrating that my theories have high validity
as I test predictions. Evidence is important to end up with, and it can be
useful to try to quantify model evidence through likelihoods or Bayes factors,
if you have good models. But if I am able to show that I can confirm
predictions in a line of pre-registered studies, whether by showing that a
p-value is smaller than the alpha level, that a Bayesian highest density
interval excludes some value, that a Bayes factor is larger than some cut-off,
or that the effect size is close enough to some predicted value, I will always
end up with strong evidence for the presence of some effect. As
De Groot (1969) writes: “If one knows something to be true, one is in a
position to predict; where prediction is impossible, there is no knowledge.”
References
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719. https://doi.org/10.1037/a0024777
De Groot, A. D. (1969). Methodology. The Hague: Mouton & Co.
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33, 503–513.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.
Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432. https://doi.org/10.1037/a0022790