A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, January 20, 2020

Review of "The Generalizability Crisis" by Tal Yarkoni

In a recent preprint titled "The Generalizability Crisis", Tal Yarkoni examines whether the current practice of how psychologists generalize from studies to theories is problematic. He writes: “The question taken up in this paper is whether or not the tendency to generalize psychology findings far beyond the circumstances in which they were originally established is defensible. The case I lay out in the next few sections is that it is not, and that unsupported generalization lies at the root of many of the methodological and sociological challenges currently affecting psychological science.” We had a long Twitter discussion about the paper, and then read it in our reading group. In this review, I try to make my thoughts about the paper clear in one place, which might be useful if we want to continue to discuss whether or not there is a generalizability crisis.

First, I agree with Yarkoni that almost all the proposals he makes in the section “Where to go from here?” are good suggestions. I don’t think they follow logically from his points about generalizability, as I detail below, but they are nevertheless solid suggestions a researcher should consider. Second, I agree that there are research lines in psychology where modelling more things as random factors will be productive, and a forceful manifesto (even if it is slightly less practical than similar earlier papers) might be a wake-up call for people who had ignored this issue until now.

Beyond these two points of agreement, I found the main thesis in his article largely unconvincing. I don’t think there is a generalizability crisis, but the article is a nice illustration of why philosophers like Popper abandoned the idea of an inductive science. When Yarkoni concludes that “A direct implication of the arguments laid out above is that a huge proportion of the quantitative inferences drawn in the published psychology literature are so inductively weak as to be at best questionable and at worst utterly insensible”, I am primarily surprised he believes induction is a defensible philosophy of science. There is a very brief discussion of views by Popper, Meehl, and Mayo on page 19, but their work on testing theories is presented as a probably not feasible solution – which is peculiar, because these authors would probably disagree with most of the points made by Yarkoni, and I would expect at least somewhere in the paper a discussion comparing induction against the deductive approach (especially since the deductive approach is arguably the dominant approach in psychology, and therefore none of the generalizability issues raised by Yarkoni are a big concern). Because I believe the article starts from a faulty position (scientists are not concerned with induction, but use deductive approaches) and because Yarkoni provides no empirical support for any of his claims that a lack of generalizability has led to huge problems (such as incredibly high Type 1 error rates), I remain unconvinced there is anything remotely close to the generalizability crisis he so evocatively argues for. The topic addressed by Yarkoni is very broad. It probably needs a book-length treatment to do it justice. My review is already way too long, and I did not get into the finer details of the argument. But I hope this review helps to point out the parts of the manuscript where I feel important arguments lack a solid foundation, and where issues that deserve to be discussed are ignored.

Point 1: “Fast” and “slow” approaches need some grounding in philosophy of science.


Early in the introduction, Yarkoni says there is a “fast” and “slow” approach of drawing general conclusions from specific observations. Whenever people use words that don’t exactly describe what they mean, putting them in quotation marks is generally not a good idea. The “fast” and “slow” approaches he describes are not, I believe upon closer examination, two approaches “of drawing general conclusions from specific observations”.

The difference is actually between induction (the “slow” approach of generalizing from single observations to general observations) and deduction, as proposed by for example Popper. As Popper writes “According to the view that will be put forward here, the method of critically testing theories, and selecting them according to the results of tests, always proceeds on the following lines. From a new idea, put up tentatively, and not yet justified in any way—an anticipation, a hypothesis, a theoretical system, or what you will—conclusions are drawn by means of logical deduction.”

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments’”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they posited that “cleanliness reduces the severity of moral judgments”, collected data, performed an empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that “cleanliness reduces the severity of moral judgments””. This approach to science flies right in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.” Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument that his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a priori I doubt it will be very successful.

Point 2: Titles are not evidence for psychologists’ tendency to generalize too quickly.


This is a minor point, but I think it is a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a ‘constraint on generalizability’ statement, following Simons et al. (2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.” Is Yarkoni not happy with these two sentences clearly limiting the generalizability in the discussion?

For me, this observation raised serious concerns about the statement Yarkoni makes that, simply from the titles of scientific articles, we can make a statement about whether authors make ‘fast’ or ‘slow’ generalizations. One reason is that Yarkoni examined titles of articles in a journal that adheres to the publication manual of the APA. In the section on titles, the APA states: “A title should summarize the main idea of the manuscript simply and, if possible, with style. It should be a concise statement of the main topic and should identify the variables or theoretical issues under investigation and the relationship between them. An example of a good title is "Effect of Transformed Letters on Reading Speed."” To me, it seems the authors are simply following the APA publication manual. I do not think their choice of title provides us with any insight whatsoever about the tendency of authors to have a preference for ‘fast’ generalization. Again, this might be a minor point, but I found this an illustrative example of the strength of arguments in other places (see the next point for the most important example). Yarkoni needs to make a case that scientists are overgeneralizing, for there to be a generalizability crisis – but he does so unconvincingly. I sincerely doubt researchers expect their findings to generalize to all possible situations mentioned in the title, I doubt scientists believe titles are the place to accurately summarize limits of generalizability, and I doubt Yarkoni has made a strong point that psychologists overgeneralize based on this section. More empirical work would be needed to build a convincing case (e.g., code how researchers actually generalize their findings in a random selection of 250 articles, taking into account Gricean communication norms (especially the cooperative principle) in scientific articles).

Point 3: Theories and tests are not perfectly aligned in deductive approaches.


After explaining that psychologists use statistics to test predictions based on experiments that are operationalizations of verbal theories, Yarkoni notes: “From a generalizability standpoint, then, the key question is how closely the verbal and quantitative expressions of one’s hypothesis align with each other.”

Yarkoni writes: “When a researcher verbally expresses a particular hypothesis, she is implicitly defining a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis. If the researcher subsequently asserts that a particular statistical procedure provides a suitable test of the verbal hypothesis, she is making the tacit but critical assumption that the universe of admissible observations implicitly defined by the chosen statistical procedure (in concert with the experimental design, measurement model, etc.) is well aligned with the one implicitly defined by the qualitative hypothesis. Should a discrepancy between the two be discovered, the researcher will then face a choice between (a) working to resolve the discrepancy in some way (i.e., by modifying either the verbal statement of the hypothesis or the quantitative procedure(s) meant to provide an operational parallel); or (b) giving up on the link between the two and accepting that the statistical procedure does not inform the verbal hypothesis in a meaningful way.”

I highlighted what I think is the critical point in a bold font. To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni is arguing that there has to be a direct correspondence between the theoretical model and the statistical test. This is true in induction.

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them that you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned. As Yarkoni accurately summarizes based on a large multi-lab study on verbal overshadowing by Alogna et al.: “given very conservative background assumptions, the massive Alogna et al. study—an initiative that drew on the efforts of dozens of researchers around the world—does not tell us much about the general phenomenon of verbal overshadowing. Under more realistic assumptions, it tells us essentially nothing.” This is also why Yarkoni’s first practical recommendation on how to move forward is to not solve the problem, but to do something else: “One perfectly reasonable course of action when faced with the difficulty of extracting meaningful, widely generalizable conclusions from effects that are inherently complex and highly variable is to opt out of the enterprise entirely.”
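To make the trade-off in that quote concrete, here is a minimal sketch. It assumes a simple hypothetical design that is not taken from Yarkoni's paper: two conditions, each with its own subjects and its own stimuli, every subject responding once to every stimulus in their condition, and independent random intercepts for subjects and stimuli. The function name and all variance components are made up for illustration. Under those assumptions the standard error of the condition difference has a closed form, and treating stimuli as random simply adds a term that does not depend on the number of subjects.

```python
import math


def se_condition_difference(n_subjects, n_stimuli, sd_subject, sd_stimulus,
                            sd_error, stimuli_random=True):
    """Standard error of the difference between two condition means.

    Hypothetical design: each condition has its own n_subjects and its own
    n_stimuli, and every subject responds once to every stimulus in their
    condition. Subject effects, stimulus effects, and trial noise are
    independent with the given standard deviations.
    """
    # Variance of one condition mean due to sampling subjects and trial noise.
    var_mean = sd_subject ** 2 / n_subjects
    var_mean += sd_error ** 2 / (n_subjects * n_stimuli)
    if stimuli_random:
        # Generalizing to new stimuli adds the stimulus sampling variance.
        var_mean += sd_stimulus ** 2 / n_stimuli
    # The two condition means are independent, so their variances add.
    return math.sqrt(2 * var_mean)


# Illustrative, made-up variance components.
for n in (20, 200, 2000):
    fixed = se_condition_difference(n, 10, 1.0, 0.5, 1.0, stimuli_random=False)
    rand = se_condition_difference(n, 10, 1.0, 0.5, 1.0, stimuli_random=True)
    print(f"n_subjects={n:5d}  SE fixed={fixed:.3f}  SE random={rand:.3f}")
```

In this sketch the fixed-stimulus standard error keeps shrinking as subjects are added, while the random-stimulus standard error levels off at a floor set by the number of stimuli. Every further random factor one wants to generalize over adds another floor of this kind.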

This is exactly the reason Popper (among others) rejected induction, and proposed a deductive approach. Why isn’t the alignment between theories and tests raised by Yarkoni a problem for the deductive approach proposed by Popper, Meehl, and Mayo? The reason is that the theory is tentatively posited as true, but in no way believed to be a complete representation of reality. This is an important difference. Yarkoni relies on an inductive approach, and thus the test needs to be aligned with the theory, and the theory defines “a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis.” For deductive approaches, this is not true.

For philosophers of science like Popper and Lakatos, a theory is not a complete description of reality. Lakatos writes about theories: “Each of them, at any stage of its development, has unsolved problems and undigested anomalies. All theories, in this sense, are born refuted and die refuted.” Lakatos gives the example that Newton’s Principia could not even explain the motion of the moon when it was published. The main point here: All theories are wrong. The fact that all theories (or models) are wrong should not be surprising. Box’s quote “All models are wrong, some are useful” is perhaps best known, but I prefer Box (1976) on parsimony: “Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William Ockham (1285-1349) he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity (Ockham's knife).” He follows this up by stating “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

In a deductive approach, the goal of a theoretical model is to make useful predictions. I doubt anyone believes that any of the models they are currently working on is complete. Some researchers might follow an instrumentalist philosophy of science, and don’t expect their theories to be anything more than useful tools. Lakatos’s (1978) main contribution to philosophy of science was to develop a way to deal with our incorrect theories, admitting that all of them need adjustment, but that some adjustments lead to progressive research lines, and others to degenerative research lines.

In a deductive model, it is perfectly fine to posit a theory that eating ice-cream makes people happy, without assuming this holds for all flavors, across all cultures, at all temperatures, and irrespective of the amount of ice-cream eaten previously, and many other factors. After all, it is just a tentative model that we hope is simple enough to be useful, and that we expect to become more complex as we move forward. As we increase our understanding of food preferences, we might be able to modify our theory, so that it is still simple, but also allows us to predict the fact that eggnog and bacon flavoured ice-cream do not increase happiness (on average). The most important thing is that our theory is tentative, and posited to allow us to make good predictions. As long as the theory is useful, and we have no alternatives to replace it with, the theory will continue to be used – without any expectation that it will generalize to all possible situations. As Box (1976) writes: “Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory.” I think the paper should include a discussion of this large gap between Yarkoni, who thinks theories and tests need to align, and the deductive approaches proposed by Popper and Meehl, which see theories as tentative and wrong.


Point 4: The dismissal of risky predictions is far from convincing (and generalizability is typically a means to risky predictions, not a goal in itself).


If we read Popper (but also on the statistical side the work of Neyman) we see induction as a possible goal in science is clearly rejected. Yarkoni mentions deductive approaches briefly in his section on adopting better standards, in the sub-section on making riskier predictions. I intuitively expected this section to be crucial – after all, it finally turns to those scholars who would vehemently disagree with most of Yarkoni’s arguments in the preceding sections – but I found this part rather disappointing. Strangely enough, Yarkoni simply proposes predictions as a possible solution – but since the deductive approach goes directly against the inductive approach proposed by Yarkoni, it seems very weird to just mention risky predictions as one possible solution, when it is actually a completely opposite approach that rejects most of what Yarkoni argues for. Yarkoni does not seem to believe that the deductive mode proposed by Popper, Meehl, and Mayo, a hypothesis testing approach that is arguably the dominant approach in most of psychology (Cortina & Dunlap, 1997; Dienes, 2008; Hacking, 1965), has a lot of potential. The reason he doubts severe tests of predictions will be useful is that “in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding” (Yarkoni, p. 19). This could be resolved if risky predictions were possible, which Yarkoni doubts.

Yarkoni’s criticism of the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets cannot be generalized to all studies in psychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an “axiom that needs no testing” (Mulaik, Raju, & Harshman, 1997).” Altogether, I am very unconvinced that this cursory reference to crud makes a convincing point that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with their desire to inductively build a theory from all these single observations. They show the findings generalize, because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but follows from the goal to perform severe tests. It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force behind many of the more successful theories in social psychology (e.g., social identity theory), and they work fine.

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.

Point 5: Why care about statistical inferences, if these do not relate to sweeping verbal conclusions?


If we ignore all previous points, we can still read Yarkoni’s paper as a call to introduce more random factors in our experiments. This nicely complements recent calls to vary all factors you do not think should change the conclusions you draw (Baribault et al., 2018), and classic papers on random effects (Barr et al., 2013; Clark, 1969; Cornfield & Tukey, 1956).

Yarkoni generalizes from the fact that most scientists model subjects as a random factor, and then asks why scientists generalize to all sorts of other factors that were not in their models. He asks “Why not simply model all experimental factors, including subjects, as fixed effects”. It might be worth noting in the paper that sometimes researchers model subjects as fixed effects. For example, Fujisaki and Nishida (2009) write: “Participants were the two authors and five paid volunteers” and nowhere in their analyses do they assume there is any meaningful or important variation across individuals. In many perception studies, an eye is an eye, and an ear is an ear – whether from the author, or a random participant dragged into the lab from the corridor.

In other research areas, we do model individuals as a random factor. Yarkoni says of this practice: “The reason we model subjects as random effects is not that such a practice is objectively better, but rather, that this specification more closely aligns the meaning of the quantitative inference with the meaning of the qualitative hypothesis we’re interested in evaluating”. I disagree. I think we model certain factors as random effects because we have a high prior that these factors influence the effect, and leaving them out of the model would reduce the strength of our prediction. Leaving them out reduces the probability a test will show we are wrong, if we are wrong. It impacts the severity of the test. Whether or not we need to model factors (e.g., temperature, the experimenter, or day of the week) as random factors because not doing so reduces the severity of a test is a subjective judgment. Research fields need to decide for themselves. It is very well possible more random factors are generally needed, but I don’t know how many, and doubt the problem will ever be as severe as the ‘generalizability crisis’ suggests. If it is as severe as Yarkoni suggests, some empirical demonstrations of this would be nice. Clark (1973) showed his language-as-fixed-effect fallacy using real data. Barr et al. (2013) similarly made their point based on real data. I currently do not find the theoretical point very strong, but real data might convince me otherwise.
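To illustrate what it means for a left-out factor to reduce the severity of a test, here is a small simulation sketch. It uses simulated data (so it is not the kind of real-data demonstration asked for above), and the design, the function name, and all parameter values are my own assumptions: two conditions with their own subjects and their own stimuli, no true condition effect, and an analysis that averages over stimuli per subject and runs a two-sample t-test, which treats the sampled stimuli as fixed.

```python
import math
import random
import statistics


def false_positive_rate(sd_stimulus, n_sims=2000, n_subjects=50, n_stimuli=8,
                        sd_subject=1.0, sd_error=1.0, seed=1):
    """Share of significant t-tests when the true condition effect is zero."""
    rng = random.Random(seed)
    critical_t = 1.984  # two-sided 5% cutoff for roughly 98 degrees of freedom
    hits = 0
    for _ in range(n_sims):
        condition_scores = []
        for _condition in range(2):
            # Mean effect of this condition's sampled stimuli. Every subject
            # in the condition shares it, so it mimics a condition effect.
            stim_mean = statistics.fmean(
                [rng.gauss(0, sd_stimulus) for _ in range(n_stimuli)])
            # Per-subject mean score: subject intercept, shared stimulus mean,
            # and trial noise averaged over n_stimuli trials (drawn directly
            # with standard deviation sd_error / sqrt(n_stimuli)).
            scores = [rng.gauss(0, sd_subject) + stim_mean
                      + rng.gauss(0, sd_error / math.sqrt(n_stimuli))
                      for _ in range(n_subjects)]
            condition_scores.append(scores)
        a, b = condition_scores
        # Standard error that ignores stimulus sampling variability.
        se = math.sqrt(statistics.variance(a) / n_subjects
                       + statistics.variance(b) / n_subjects)
        t = (statistics.fmean(a) - statistics.fmean(b)) / se
        if abs(t) > critical_t:
            hits += 1
    return hits / n_sims


print("no stimulus variance:  ", false_positive_rate(sd_stimulus=0.0))
print("some stimulus variance:", false_positive_rate(sd_stimulus=0.5))
```

With these made-up values, the false positive rate stays close to the nominal 5% when stimuli do not vary, and is clearly higher when they do. How much higher depends entirely on the assumed stimulus variance, which is exactly the quantity a field would need to estimate from real data before deciding whether a factor has to be modelled as random.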

The issue of including random factors is discussed in a more complete and, importantly, more applicable manner in Barr et al. (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactorily. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al. (2013), which includes trade-offs between getting models to converge (a problem Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic. Similarly, Cornfield & Tukey (1956) more pragmatically list options ranging from ignoring factors altogether, to randomizing them, or including them as a factor, and note “Each of these attitudes is appropriate in its place. In every experiment there are many variables which could enter, and one of the great skills of the experimenter lies in leaving out only inessential ones.” Just as pragmatically, Clark (1973) writes: “The wide-spread capitulation to the language-as-fixed-effect fallacy, though alarming, has probably not been disastrous. In the older established areas, most experienced investigators have acquired a good feel for what will replicate on a new language sample and what will not. They then design their experiments accordingly.” As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher.

It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting “more expansive models” – and provide some indication where to stop, or at least suggestions of what an empirical research program would look like that tells us where to stop, and why. In some ways, Yarkoni’s point generalizes the argument that most findings in psychology do not generalize to non-WEIRD populations (Henrich et al., 2010), and it has the same weakness. WEIRD is a nice acronym, but it is just a completely random collection of 5 factors that might limit generalizability. The WEIRD acronym functions more as a nice reminder that boundary conditions exist, but it does not allow us to predict when they exist, or when they matter enough to be included in our theories. Currently, there is a gap between the factors that in theory could matter, and the factors that we should in practice incorporate. Maybe it is my pragmatic nature, but without such a discussion, I think the paper offers relatively little progress compared to previous discussions about generalizability (of which there are plenty).

Conclusion


A large part of Yarkoni’s argument is based on the idea that theories and tests should be closely aligned, while in a deductive approach based on severe tests of predictions, models are seen as simple, tentative, and wrong, and this is not considered a problem. Yarkoni does not convincingly argue researchers want to generalize extremely broadly (although I agree papers would benefit from including Constraints on Generality statements as proposed by Simons and colleagues (2017), but mainly because this improves falsifiability, not because it improves induction), and even if there is a tendency to overclaim in articles, I do not think this leads to an inferential crisis. Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1973). Yarkoni fails to provide any insights into where the balance between generalizing to everything, and generalizing to factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice. Until Yarkoni does the latter convincingly, I don’t think the generalizability crisis as he sketches it is something that will keep me up at night.



References


Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Ravenzwaaij, D. van, White, C. N., Boeck, P. D., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3). https://doi.org/10.1016/j.jml.2012.11.001

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10/gdm28w

Clark, H. H. (1969). Linguistic processes in deductive reasoning. Psychological Review, 76(4), 387–404. https://doi.org/10.1037/h0027578

Cornfield, J., & Tukey, J. W. (1956). Average Values of Mean Squares in Factorials. The Annals of Mathematical Statistics, 27(4), 907–949. https://doi.org/10.1214/aoms/1177728067

Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161.

Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Palgrave Macmillan.

Fujisaki, W., & Nishida, S. (2009). Audio–tactile superiority over visuo–tactile and audio–visual combinations in the temporal resolution of synchrony perception. Experimental Brain Research, 198(2), 245–259. https://doi.org/10.1007/s00221-009-1870-x

Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29–29.

Lakens, D. (2020). The Value of Preregistration for Psychological Science: A Conceptual Analysis. Japanese Psychological Review. https://doi.org/10.31234/osf.io/jbh4w

Munafò, M. R., & Smith, G. D. (2018). Robust research needs many lines of evidence. Nature, 553(7689), 399–401. https://doi.org/10.1038/d41586-018-01023-3

Orben, A., & Lakens, D. (2019). Crud (Re)defined. https://doi.org/10.31234/osf.io/96dpy

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on Generality (COG): A Proposed Addition to All Empirical Papers. Perspectives on Psychological Science, 12(6), 1123–1128. https://doi.org/10.1177/1745691617708630
