A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, March 8, 2018

Prediction and Validity of Theories

What is the goal of data collection? This is a simple question, and as researchers we collect data all the time. But the answer is not straightforward: it depends on the question you are asking of your data, and since you can ask different questions, you can have different goals when collecting data. Here, I want to focus on collecting data to test scientific theories. I will be quoting a lot from De Groot’s book Methodology (1969), especially Chapter 3. If you haven’t read it, you should – I think it is the best book about doing good science that has ever been written.

When you want to test theories, the theory needs to make a prediction, and you need to have a procedure that can evaluate verification criteria. As De Groot writes: “A theory must afford at least a number of opportunities for testing. That is to say, the relations stated in the model must permit the deduction of hypotheses which can be empirically tested. This means that these hypotheses must in turn allow the deduction of verifiable predictions, the fulfillment or non-fulfillment of which will provide relevant information for judging the validity or acceptability of the hypotheses” (§ 3.1.4).

This last sentence is interesting – we collect data to test the ‘validity’ of a theory. We are trying to see how well our theory works when we predict what unobserved data look like (whether those data are collected in the future or in the past, as De Groot remarks). As De Groot writes: “Stated otherwise, the function of the prediction in the scientific enterprise is to provide relevant information with respect to the validity of the hypothesis from which it has been derived.” (§ 3.4.1).

To make a prediction that can be true or false, we need to forbid certain states of the world and allow others. As De Groot writes: “Thus, in the case of statistical predictions, where it is sought to prove the existence of a causal factor from its effect, the interval of positive outcomes is defined by the limits outside which the null hypothesis is to be rejected. It is common practice that such limits are fixed by selecting in advance a conventional level of significance: e.g., 5 %, 1 %, or .1 % risk of error in rejecting the assumption that the null hypothesis holds in the universe under consideration. Though naturally a judicious choice will be made, it remains nonetheless arbitrary. At all events, once it has been made, there has been created an interval of positive outcome, and thus a verification criterion. Any outcome falling within it stamps the prediction as ‘proven true’.” (§ 3.4.2). Note that if you prefer, you can predict an effect size with some accuracy, show that a Bayesian highest density interval excludes some value, or compute a Bayes factor that is larger than some cut-off – as long as your prediction can be either confirmed or not confirmed.
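As a minimal sketch of such a verification criterion (all numbers here are hypothetical, not taken from any study): fix the alpha level in advance, compute the observed outcome, and check whether it falls inside the pre-specified interval of positive outcomes.

```python
from math import comb

# Hypothetical prediction: people guess correctly more often than chance (50%).
# Verification criterion, fixed in advance: one-sided binomial p < .05.
alpha = 0.05
n, k = 20, 15  # hypothetical data: 15 correct guesses in 20 trials

# Exact one-sided binomial p-value: P(X >= k) under theta = 0.5
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n

prediction_confirmed = p_value < alpha
print(f"p = {p_value:.4f}, prediction confirmed: {prediction_confirmed}")
```

The point is not the specific test: any pre-specified criterion (an HDI excluding a value, a Bayes factor above a cut-off) plays the same role of turning the prediction into something that can be stamped ‘proven true’ or not.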

Note that the prediction gets a ‘proven true’ stamp – the theory does not. In this testing procedure, there is no direct route from the ‘proven true’ stamp to a ‘true theory’ conclusion. Indeed, the latter conclusion is not possible in science. We are mainly indexing the ‘track record’ of a theory, as Meehl (1990) argues: “The main way a theory gets money in the bank is by predicting facts that, absent the theory, would be antecedently improbable.” Often (e.g., in non-experimental settings) rejecting a null hypothesis with large sample sizes is not considered a very improbable event, but that is another issue (see also the definition of a severe test by Mayo (1996, 178): a passing result is a severe test of hypothesis H just to the extent that it is very improbable for such a passing result to occur, were H false).

Regardless of how risky the prediction we made was, when we then collect data and test the hypothesis, we either confirm our prediction, or we do not. In frequentist statistics, we add the outcome of this prediction to the ‘track record’ of our theory, but we cannot draw conclusions based on any single study. As Fisher (1926, 504) writes: “if one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance” (italics added).

The study needs to be ‘properly designed’ to ‘rarely’ fail to give this level of significance – which, despite Fisher’s dislike for Neyman–Pearson statistics, I can read in no other way than: make sure you run well-powered studies for whatever happens to be your smallest effect size of interest. In other words: when testing the validity of theories through predictions, where you keep a ‘track record’ of predictions, you need to control your error rates to efficiently distinguish hits from misses. Design well-powered studies, and do not fool yourself by inflating the probability of observing a false positive.
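To make ‘well-powered’ concrete, here is a rough sample-size calculation for an independent-samples comparison, using the common normal approximation (the smallest effect size of interest, d = 0.5, is a hypothetical choice for illustration):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample t-test.
    Uses the normal approximation, which slightly underestimates n."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical z for two-sided alpha
    z_b = NormalDist().inv_cdf(power)          # z corresponding to desired power
    return ceil(2 * ((z_a + z_b) / d) ** 2)

# Hypothetical smallest effect size of interest: d = 0.5
print(n_per_group(0.5))  # ~63 per group; exact t-based software gives ~64
```

Dedicated power software uses the noncentral t-distribution and adds a participant or two, but the approximation shows the logic: smaller effects of interest demand much larger samples.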

I think that when it comes to testing theories, assessing validity through prediction is extremely (and for me, perhaps the most) important. We don’t want to fool ourselves when we test the validity of our theories. An example of ‘fooling yourself’ is provided by the studies on pre-cognition by Daryl Bem (2011). A result I like to use in workshops comes from Study 1, where people pressed a left or right button to predict whether a picture was hidden behind a left or right curtain.

If we take this study as it is (without pre-registration), it is clear there are 5 tests (for erotic, neutral, negative, positive, and ‘romantic but non-erotic’ pictures). A Bonferroni correction would lead us to use an alpha level of 0.01 (0.05 divided by 5 tests), and the reported p-value (0.01, but more precisely 0.013) would not be enough to support our prediction, given the pre-specified alpha level. Note that Bem (Bem, Utts, and Johnson, 2011) explicitly says this test was predicted. However, I see absolutely no reason to believe Bem without a pre-registration document for the study.
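The correction itself is simple arithmetic; a sketch using the five tests and the reported p-value of 0.013:

```python
alpha = 0.05
n_tests = 5  # erotic, neutral, negative, positive, romantic-but-non-erotic
alpha_bonferroni = alpha / n_tests  # 0.01

p_erotic = 0.013  # reported p-value for the erotic pictures
supported = p_erotic < alpha_bonferroni
print(f"adjusted alpha = {alpha_bonferroni:.2f}, prediction supported: {supported}")
```

With the corrected alpha level, the prediction is not supported; without a pre-registration document, there is no way to justify skipping the correction.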

Bayesian statistics do not provide a solution when analyzing this pre-cognition experiment. As Gelman and Loken (2013) write about this study (I just realized this ‘Garden of Forking Paths’ paper is unpublished, but has 150 citations!): “we can still take this as an observation that 53.1% of these guesses were correct, and if we combine this with a flat prior distribution (that is, the assumption that the true average probability of a correct guess under these conditions is equally likely to be anywhere between 0 and 1) or, more generally, a locally-flat prior distribution, we get a posterior probability of over 99% that the true probability is higher than 0.5; this is one interpretation of the one-sided p-value of 0.01.” The use of Bayes factors that quantify model evidence provides no solution either. Where Wagenmakers and colleagues (2011) argue based on ‘default’ Bayesian t-tests that the null hypothesis is supported, Bem, Utts, and Johnson (2011) correctly point out this criticism is flawed, because the default Bayesian t-tests use completely unrealistic priors for pre-cognition research (and most other studies published in psychology, for that matter).
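Gelman and Loken’s flat-prior calculation can be reproduced approximately. With a Beta(1, 1) prior, the posterior for the hit rate after k correct guesses in n trials is Beta(k + 1, n - k + 1). The trial counts below are hypothetical (chosen to give a 53.1% hit rate and a one-sided p-value near 0.01; I did not look up Bem’s exact numbers), and the tail probability uses a normal approximation to the Beta posterior:

```python
from math import erf, sqrt

# Hypothetical counts: 765 correct guesses in 1440 trials (a 53.1% hit rate)
n, k = 1440, 765

# Flat Beta(1, 1) prior -> Beta(a, b) posterior for the true hit rate theta
a, b = k + 1, n - k + 1
mean = a / (a + b)
sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Normal approximation to the posterior tail probability P(theta > 0.5 | data)
z = (mean - 0.5) / sd
posterior_prob = 0.5 * (1 + erf(z / sqrt(2)))
print(f"P(theta > 0.5 | data) = {posterior_prob:.3f}")  # just over 0.99
```

This is exactly Gelman and Loken’s point: the flat-prior posterior simply restates the one-sided p-value, so switching to this Bayesian summary does nothing to repair the garden of forking paths.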

It is interesting that the best solution Gelman and Loken come up with is that “perhaps researchers can perform half as many original experiments in each paper and just pair each new experiment with a preregistered replication”. What matters is not just the data, but the procedure used to collect the data. The procedure needs to be able to demonstrate a strong predictive validity, which is why pre-registration is such a great solution to many problems science faces. Pre-registered studies are the best way we have to show you can actually predict something – which gets your theory money in the bank.

If people ask me if I care about evidence, I typically say: ‘mwah’. For me, evidence is not a primary goal of doing research. Evidence is a consequence of demonstrating that my theories have high validity as I test predictions. Evidence is important to end up with, and it can be useful to try to quantify model evidence through likelihoods or Bayes factors, if you have good models. But if I am able to show that I can confirm predictions in a line of pre-registered studies – by showing my p-value is smaller than an alpha level, that a Bayesian highest density interval excludes some value, that a Bayes factor is larger than some cut-off, or that the effect size is close enough to some predicted value – I will always end up with strong evidence for the presence of some effect. As De Groot (1969) writes: “If one knows something to be true, one is in a position to predict; where prediction is impossible, there is no knowledge.”


Bem, D. J. (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719. https://doi.org/10.1037/a0024777
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.
De Groot, A. D. (1969). Methodology. The Hague: Mouton & Co.
Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432. https://doi.org/10.1037/a0022790
