A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, February 2, 2020

Review of "Do Effect Sizes in Psychology Laboratory Experiments Mean Anything in Reality?"

Researchers spend a lot of time reviewing papers. These reviews are rarely made public. Sometimes reviews might be useful for readers of an article. Here, I'm sharing my review of "Do Effect Sizes in Psychology Laboratory Experiments Mean Anything in Reality?" by Roy Baumeister. I reviewed this (blinded) manuscript in December 2019 for a journal where it was rejected January 8 based on 2 reviews. Below you can read the review as I submitted it. I am sharing this review because the paper was accepted at another journal. 

In this opinion piece the authors try to argue for the lack of theoretical meaning of effect sizes in psychology. The opinion piece makes a point I think most reasonable people already agreed upon (social psychology makes largely ordinal predictions). The question is how educational and well argued their position is that this means effect sizes are theoretically not that important. On that aspect, I found the paper relatively weak. Too many statements are overly simplistic and the text is far behind on the state of the art (it reads as if it was written 20 years ago). I think a slightly brushed up version might make a relatively trivial but generally educational point for anyone not up to speed on this topic. If the authors put in a bit more effort to have a discussion that incorporates the state of the art, this could be a more nuanced piece that has a somewhat stronger analysis of the issues at play, and a bit more vision about where to go. I think the latter would be worth reading for a general audience at this journal.

Main points.

1) What is the real reason our theories do not predict effect sizes?

The authors argue that most theories in social psychology are verbal theories. I would say most verbal theories in social psych are actually not theories (Fiedler, 2004) but closer to tautologies. That being said, I think the anecdotal examples the authors use throughout their paper (obedience effects, bystander effect, cognitive dissonance) are all especially weak, and the paper would be improved if all these examples were removed. Although the authors are correct in stating we hardly care about the specific effect size in those studies, I am not interested in anecdotes of studies where we know we do not care about effect sizes. This is a weak (as in, not severe) test of the argument the authors are making, with confirmation bias dripping from every sentence. If you want to make a point about effect sizes not mattering, you can easily find situations where they do not theoretically matter. But this is trivial and boring. What would be more interesting is an analysis of why we do not care about effect sizes that generalizes beyond the anecdotal examples. The authors are close, but do not provide such a general argument yet. One reason I think the authors would like to mention is that there is a measurement crisis in psych – people use a hodgepodge of often completely unvalidated measures. It becomes a lot more interesting to quantify effect sizes if we all use the same measurement. This would also remove the concern about standardized vs unstandardized effect sizes. But more generally, I think the authors should argue from basic principles rather than from anecdotes if they want to be convincing.

2) When is something an ordinal prediction?

Now, if we use the same measures, are there cases where we predict effect sizes? The authors argue we never predict an exact effect size. True, but again, uninteresting. We can predict a range of effect sizes. The authors cite Meehl, but they should really incorporate Meehl’s point from his 1990 paper. Predicting a range of effect sizes is already quite something, and the authors do not give this a fair discussion. Predicting an effect in the range of 0.2 to 0.8, even though this is very wide, is very different from predicting any effect larger than zero. Again, a description of the state of the art is missing in the paper. This question has been discussed in many literatures the authors do not mention. The issue is the same as the discussion about whether we should use a uniform prior in Bayesian stats, or a slightly more informative prior, because we predict effects in some range. My own work specifying the smallest effect size of interest also provides quite a challenge to the arguments of the current authors. See especially the example of Burriss and colleagues in our 2018 paper (Lakens, Scheel, & Isager, 2018). They predicted an effect should be noticeable with the naked eye, and it is an example where a theory very clearly makes a range prediction, falsifying the authors' arguments in the current paper. That these cases exist means the authors are completely wrong in their central thesis. It also means they need to rephrase their main argument – when do we have range predictions, and when are we predicting *any* effect that is not zero? And why? I think many theories in psychology would argue that effects should be larger than some other effect size. This can be sufficient to make a valid range prediction.
Similarly, if psychologists would just think about effect sizes that are too large, we would not have papers in PNAS edited by Nobel prize winners who think the effect of judges on parole decisions over time is a psychological mechanism, when the effect size is too large to be plausible (Glöckner, 2016). So effects should often be larger or smaller than some value, and this does not align with the current argument by the authors.
I would argue the fact that psych theories predict range effects means psych theories make effect size predictions that are relevant enough to quantify. We do a similar thing when we compare 2 conditions, for example when we predict an effect is larger in condition X than Y. In essence, this means we say: We predict the effect size of X to be in a range that is larger than the effect size of Y. Now, this is again a range prediction. We do not just say both X and Y have effects that differ from zero. It is still an ordinal prediction, so it fits with the basic point of the authors about how we predict, but it no longer fits with their argument that we simply test for significance. Ordinal predictions can be more complex than the authors currently describe. To make a solid contribution they will need to address what ordinal predictions are in practice. With the space that is available after the anecdotes are removed, they can add a real analysis of how we test hypotheses in general, where range predictions fit with ordinal predictions, and how, if we used the same measures and had some widely used paradigms, we could, if we wanted to, create theories that make quantifiable range predictions. I agree with the authors it can be perfectly valid to choose a unique operationalization of a test, and that this allows you to increase or decrease the effect size depending on the operationalization. This is true. But we can make theories that predict things beyond just any effect if we fix our paradigms and measures – and the authors should discuss whether this might be desirable, to give their own argument a bit more oomph and credibility. If the authors want to argue that standard measures in psychology are undesirable or impossible, that might be interesting, but I doubt it will work. And thus, I expect the authors will need to give more credit to the power of ordinal predictions.
In essence, my point here is that if you think we cannot make a range prediction on a standardized measure, you also think there can be no prediction that condition X yields a larger effect than condition Y, and yet we make these predictions all the time. Again, in a Stroop experiment with 10 trials the effect size differs from the effect size with 10000 trials – but given a standardized measure, we can predict relative differences.
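To make the Stroop point concrete, here is a minimal sketch that is not part of the submitted review. It assumes a simple measurement model in which each participant's observed difference score is their true effect plus trial-level noise that averages out over trials. All numbers (60 ms and 30 ms mean effects, a 30 ms between-person SD, a 150 ms trial-level SD) and the function name `expected_dz` are hypothetical and chosen only for illustration.

```python
import math


def expected_dz(mean_diff_ms, sd_between_ms, sd_trial_ms, n_trials):
    """Expected standardized within-subject effect (d_z) under an assumed model.

    Each participant's difference score is the difference between two
    condition means, each based on n_trials trials, so the trial-level
    noise variance in the difference score is 2 * sd_trial^2 / n_trials.
    """
    noise_var = 2 * sd_trial_ms ** 2 / n_trials
    return mean_diff_ms / math.sqrt(sd_between_ms ** 2 + noise_var)


# Hypothetical conditions: X has a larger raw effect than Y.
for n_trials in (10, 10000):
    d_x = expected_dz(60, 30, 150, n_trials)
    d_y = expected_dz(30, 30, 150, n_trials)
    print(f"{n_trials:>5} trials: d_z(X) = {d_x:.2f}, d_z(Y) = {d_y:.2f}, X > Y: {d_x > d_y}")
```

Under these assumptions the standardized effect changes a lot with the number of trials, but once the paradigm is fixed, the ordinal prediction that X yields a larger effect than Y holds at either trial count.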

3) Standardized vs unstandardized effect sizes

I might be a bit too rigid here, but when scientists make claims, I like them to be accompanied by evidence. The authors write “Finally, it is commonly believed that in the case of arbitrary units, a standardized effect size is more meaningful and informative than its equivalent in raw score units.” There is no citation and this to me sounds 100% incorrect. I am sure they might be able to dig out one misguided paper making this claim. But this is not commonly believed, and the literature abundantly shows researchers argue the opposite – the most salient example is (Baguley, 2009) but any stats book would suffice. The lack of a citation to Baguley is just one of the examples where the authors seem not to be up to speed on the state of the art, and where their message is not nuanced enough, while the discussion in the literature surpassed many of their simple claims more than a decade ago. I think the authors should improve their discussion of standardized and unstandardized effect sizes. Standardized effect sizes are useful if measurement tools differ, and if you have little understanding of what you are measuring. Although I think this is true in general in social psychology (and the measurement crisis is real), I think the authors are not making the point that *given* that social psychology is such a mess when it comes to how researchers in the field measure things, we cannot make theoretically quantifiable predictions. I would agree with this. I think they try to argue that even if social psychology was not such a mess, we could still not make quantifiable predictions. I disagree. Issues related to the standardized and unstandardized effect sizes are a red herring. They do not matter at all. If we understood our measures and standardized them, we would have accurate estimates of the sd’s for what we are interested in, and this whole section can just be deleted.
The authors should be clear whether they think we will never standardize our measures and there is no value in them, or whether it is just difficult in practice right now. Regardless, the issue with standardized effects is moot, since their first sentence that standardized effect sizes are more meaningful is just wrong (for a discussion, see Lakens, 2013).
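As an illustration added for this post (not part of the submitted review), the sketch below computes a raw mean difference and Cohen's d with a pooled standard deviation for two groups. The scores are made up, and `raw_and_standardized` is a hypothetical helper. It only shows that once the standard deviation of a measure is known, moving between raw and standardized effects is a simple division.

```python
import statistics


def raw_and_standardized(group_a, group_b):
    """Return (raw mean difference, Cohen's d using the pooled SD)."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return mean_diff, mean_diff / pooled_var ** 0.5


# Made-up scores on an arbitrary scale.
treatment = [12.0, 14.0, 16.0, 18.0, 20.0]
control = [10.0, 12.0, 14.0, 16.0, 18.0]

# The same scores expressed in units that are 10 times smaller.
treatment_rescaled = [10 * x for x in treatment]
control_rescaled = [10 * x for x in control]

raw, d = raw_and_standardized(treatment, control)
raw_rescaled, d_rescaled = raw_and_standardized(treatment_rescaled, control_rescaled)
print(f"original units: raw difference = {raw:.1f}, d = {d:.3f}")
print(f"rescaled units: raw difference = {raw_rescaled:.1f}, d = {d_rescaled:.3f}")
```

The raw difference depends on the units while d does not, which is why standardization helps when measurement tools differ. With a shared, well-understood measure, the raw difference is directly interpretable and the distinction stops mattering.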

Minor points

When discussing Festinger and Carlsmith, it makes sense to point out how low quality and riddled with mistakes the study was: https://mattiheino.com/2016/11/13/legacy-of-psychology/.

The authors use the first studies of several classic research lines as an example that psychology predicts directional effects at best, and that these studies cared about demonstrating an effect. Are the authors sure their idea of scientific progress is that we forever limit ourselves to demonstrating effects? This is criticized in many research fields, and the idea of developing computational models in some domains deserves to be mentioned. Even a simple stupid model can make range predictions that can be tested theoretically. A broader discussion of psychologists who have a bit more ambition for social psychology than the current authors, and who believe that some progress towards even a rough computational model would allow us to predict not just ranges, but also the shapes of effects (e.g., linear vs exponential effects) would be warranted, I think. I think it is fine if the authors have the opinion that social psychology will not move along by more quantification. But I find the final paragraph a bit vague and uninspiring in what the vision is. No one argues against practical applications or conceptual replications. The authors rightly note it is easier (although I think not as much easier as the authors think) to use effects in cost-benefit analyses in applied research. But what is the vision? Demonstration proofs and an existentialistic leap of faith that we can apply things? That has not worked well. Applied psychological researchers have rightly criticized theoretically focused social psychologists for providing basically completely useless existence proofs that often do not translate to any application, and are too limited to be of any value. I do not know what the solution is here, but I would be curious to hear if the authors have a slightly more ambitious vision. If not, that is fine, but if they have one, I think it would boost the impact of the paper.

Daniel Lakens

Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3), 603–617. https://doi.org/10.1348/000712608X377117
Fiedler, K. (2004). Tools, toys, truisms, and theories: Some thoughts on the creative cycle of theory formation. Personality and Social Psychology Review, 8(2), 123–131. https://doi.org/10.1207/s15327957pspr0802_5
Glöckner, A. (2016). The irrational hungry judge effect revisited: Simulations reveal that the magnitude of the effect is overestimated. Judgment and Decision Making, 11(6), 601–610.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
