During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to nearly 0% as the number of cases a judge has decided increases. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn't believe my eyes. Not only is the drop indeed as large as mentioned: it occurs three times in a row over the course of the day, and after each break, it returns to exactly 65%!
I'm not the first person to be surprised by this data (thanks to Nick Brown for pointing me to these papers on Twitter). There was a published criticism of the study in PNAS (which no one reads or cites), and more recently an article by Andreas Glöckner showing how the data could be explained through a more plausible mechanism (for a nice write-up, see this blog by Tom Stafford). I appreciate that people have tried to think about which mechanism could cause this effect, and if you are interested, I highly recommend reading the commentaries (and perhaps even the response by the authors).
As Glöckner notes, one surprising aspect of this study is the magnitude of the effect: 'A drop of favorable decisions from 65% in the first trial to 5% in the last trial as observed in DLA is equivalent to an odds ratio of 35 or a standardized mean difference of d = 1.96 (Chinn, 2000)'.
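To see where these numbers come from: Chinn (2000) converts a log odds ratio to a standardized mean difference by dividing by π/√3 (the standard deviation of the logistic distribution). A minimal sketch of that arithmetic in Python (the function names are mine, not from any of the papers):

```python
import math

def odds_ratio(p1: float, p2: float) -> float:
    """Odds ratio for a drop from proportion p1 to proportion p2."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

def or_to_d(or_value: float) -> float:
    """Chinn (2000): d = ln(OR) * sqrt(3) / pi."""
    return math.log(or_value) * math.sqrt(3) / math.pi

or_drop = odds_ratio(0.65, 0.05)    # favorable decisions drop from 65% to 5%
print(round(or_drop))               # 35
print(round(or_to_d(or_drop), 2))   # 1.96
```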
Some people dislike statistics. They are only interested in effects so large that you can see them by just plotting the data. This study might seem to be a convincing illustration of such an effect. My goal in this blog is to argue against this idea. You need statistics, maybe especially when effects are so large they jump out at you.
When reporting findings, authors should report and interpret effect sizes. An important reason for this is that effects can be impossibly large. An example I give in my MOOC is the Ig Nobel Prize-winning finding that suicide rates among white people increased with the amount of airtime dedicated to country music. The reported (but not interpreted) correlation was a whopping r = 0.54. I once went to a Dolly Parton concert with my wife. It was a great two-hour show. If the true correlation between listening to country music and white suicide rates were 0.54, it would not have been a great concert, but a mass suicide.
Based on this data, the difference between the height of 21-year-old men and women in The Netherlands is approximately 13 centimeters. That is a Cohen's d of 2. And that's the effect size in the hungry judges study.
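To make the arithmetic explicit: Cohen's d is simply the mean difference divided by the pooled standard deviation. Working backwards from the quoted numbers (the pooled SD of roughly 6.5 cm is my assumption, implied by the 13 cm difference and d = 2, and in line with typical height SDs):

```python
mean_diff_cm = 13.0  # height difference between 21-year-old Dutch men and women
pooled_sd_cm = 6.5   # assumed pooled SD, implied by the quoted d of 2
d = mean_diff_cm / pooled_sd_cm
print(d)             # 2.0
```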
If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just as manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the hours before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don't need to discover it and publish it in a scientific journal: you would already know it exists. Sort of like how the 'after-lunch dip' is a strong and replicable finding that you can feel yourself (and one that, as it happens, directly conflicts with the finding that judges perform better immediately after lunch; surprisingly, the authors don't discuss the after-lunch dip).
We can look at the review paper by Richard, Bond, and Stokes-Zoota (2003) to see which effect sizes in legal psychology come close to a Cohen's d of 2, and we find two that are slightly smaller. The first is the effect that a jury's final verdict is likely to be the verdict a majority initially favored, which 13 studies show has an effect size of r = 0.63, or d = 1.62. The second is that when a jury is initially split on a verdict, its final verdict is likely to be lenient, which 13 studies show to have an effect size of r = 0.63 as well. In their entire database, some of the effect sizes that come close to d = 2 are the findings that personality traits are stable over time (r = 0.66, d = 1.76), that people who deviate from a group are rejected from that group (r = 0.60, d = 1.50), and that leaders have charisma (r = 0.62, d = 1.58). You might notice the almost tautological nature of these effects. The biggest effect in their database is for 'psychological ratings are reliable' (r = 0.75, d = 2.26): if we try to develop a reliable rating, it is pretty reliable. That is the type of effect that has a Cohen's d of around 2: tautologies. And that is, supposedly, the effect size that the passing of time (and subsequently eating lunch) has on parole decisions.
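All of the d values above follow from the standard conversion between a correlation and a standardized mean difference, d = 2r / √(1 − r²). A quick sketch to verify the quoted numbers (my own check, not code from the review paper; the last value prints as 2.27 due to rounding):

```python
import math

def r_to_d(r: float) -> float:
    """Convert a correlation to Cohen's d: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / math.sqrt(1 - r ** 2)

effects = [("majority verdict prevails", 0.63),
           ("personality traits are stable", 0.66),
           ("deviates are rejected", 0.60),
           ("leaders have charisma", 0.62),
           ("psychological ratings are reliable", 0.75)]

for label, r in effects:
    print(f"{label}: r = {r:.2f} -> d = {r_to_d(r):.2f}")
# prints d = 1.62, 1.76, 1.50, 1.58, 2.27, matching the values quoted above
```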
I think it is telling that most psychologists don't seem to be able to recognize data patterns that are too large to be caused by psychological mechanisms. There are simply no plausible psychological effects strong enough to cause the data pattern in the hungry judges study. Implausibility is not a reason to completely dismiss empirical findings, but impossibility is. It is up to authors to interpret the effect size in their study, and to show the mechanism through which an impossibly large effect becomes plausible. Without such an explanation, the finding should simply be dismissed.
Your comment "our society would have organized itself around this incredibly strong effect" is important here. A lot of (social) psychology seems to be about chasing effects that are invisible to the naked eye, and were unknown to Plato or Shakespeare, yet apparently emerge the size of an elephant with a sufficiently clever protocol. But hey, when a Nobel Prize winner writes that "The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true. More important, you must accept that they are true about you", who are ordinary mortals to deny the existence of the elephant?
As for how this result came about: Stephen Senn had a guest post on Deborah Mayo's blog this past weekend (https://errorstatistics.com/2017/07/01/s-senn-fishing-for-fakes-with-fisher-guest-post/) from which I learned this phrase: "every statistician should always ask 'how did I get to see what I see?'". It's very tempting to view all data as equal (R or SPSS doesn't care where the numbers come from), but it's very dangerous.
Why spoze it's hunger? And why spoze hunger is a psychological effect? I don't know a thing about how common this alleged pattern is, nor have I read the source, but I doubt the order of cases is randomly selected. If I grade two or three sections of a class, I start with those most likely to be best/easiest.
I agree; that seems to be the explanation that the Glöckner article (cited in the OP) considers as well, and finds evidence for via simulation.
The Nature link is down for maintenance; sci-hub is up. There's some kind of lesson. With a huge effect size.
Good post. The result also seems implausible from an evolutionary perspective. If mental acuity dropped so precipitously after just a few hours without eating, you wouldn't be able to hunt or find food in your debilitated state. How would we have survived this long?