Estimates
based on samples from the population will show variability. The larger the
sample, the closer our estimates will be to the true population values.
Sometimes we will observe larger estimates than the population value, and
sometimes we will observe smaller values. As long as we have an unbiased
collection of effect size estimates, combining effect size estimates through a
meta-analysis can increase the accuracy of the estimate. Regrettably, the
scientific literature is often biased. It is especially common that
statistically significant studies are published (e.g., studies with *p* values
smaller than 0.05) while studies with *p* values larger than 0.05 remain
unpublished (Ensinck & Lakens, 2023; Franco et al., 2014;
Sterling, 1959). Instead of having access to all
effect sizes, anyone reading the literature only has access to effects that
passed a *significance filter*. This will introduce systematic bias in our
effect size estimates.

To explain
how selection for significance introduces bias, it is useful to understand the
concept of a *truncated* or *censored* distribution. If we want to
measure the average height of people in The Netherlands we would collect a
representative sample of individuals, measure how tall they are, and compute
the average score. If we collect sufficient data the estimate will be close to
the true value in the population. However, if we collect data from participants
who are on a theme park ride where people need to be at least 150 centimeters
tall to enter, the mean we compute is
based on a truncated distribution where only individuals taller than 150 cm are
included. Smaller individuals are missing. Imagine we have measured the height
of two individuals in the theme park ride, and they are 164 and 184 cm tall. Their
average height is (164+184)/2 = 174 cm. Outside the entrance of the theme park
ride is one individual who is 144 cm tall. Had we measured this individual as
well, our estimate of the average height would be (144+164+184)/3 = 164 cm. Removing
low values from a distribution will lead to overestimation of the true value.
Removing high values would lead to underestimation of the true value.
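
The arithmetic of the theme park example can be checked with a few lines of Python. This is only a small sketch using the three heights from the text; the variable names are chosen for illustration.

```python
from statistics import mean

# Heights (in cm) of the three individuals in the example.
heights = [144, 164, 184]
min_height = 150  # people need to be at least this tall to enter the ride

# Mean of everyone versus mean of only those allowed on the ride.
full_mean = mean(heights)
truncated_mean = mean([h for h in heights if h >= min_height])

print(full_mean)       # 164
print(truncated_mean)  # 174
```

Dropping the one value below the cut-off raises the mean by 10 cm, even though nothing about the people on the ride was measured incorrectly.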

The
scientific literature suffers from publication bias. Non-significant test
results – based on whether a *p* value is smaller than 0.05 or not – are often
less likely to be published. When an effect size estimate is 0 the *p*
value is 1. The further removed effect sizes are from 0, the smaller the *p*
value. All else equal (e.g., studies have the same sample size, and measures
have the same distribution and variability) if results are selected for
statistical significance (e.g., *p* < .05) they are also selected for
larger effect sizes. As small effect sizes will be observed with their
corresponding probabilities, their absence will inflate effect size estimates. Every
study in the scientific literature provides its own estimate of the true
effect size, just as every individual provides its own estimate of the average
height of people in a country. When these estimates are combined – as happens
in meta-analyses in the scientific literature – the meta-analytic effect size
estimate will be biased (or systematically different from the true
population value) whenever the distribution is truncated. To achieve unbiased
estimates of population values when combining individual studies in the
scientific literature in meta-analyses researchers need access to the complete
distribution of values – or all studies that are performed, regardless of
whether they yielded a *p* value above or below 0.05.
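
To make the link between *p* values and effect sizes concrete, here is a minimal Python sketch. It assumes a two-sided test comparing two independent groups of equal size, and it uses a normal approximation instead of the *t* distribution, so the *p* values are approximate. The function name `approx_p_value` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_value(observed_d, n_per_group):
    """Approximate two-sided p value for an observed Cohen's d.

    Assumes two independent groups of equal size and replaces the
    t distribution by a standard normal (an approximation).
    """
    z = abs(observed_d) * sqrt(n_per_group / 2)  # test statistic implied by d
    return 2 * (1 - NormalDist().cdf(z))

# With a fixed sample size, larger observed effects give smaller p values.
for d in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    print(f"d = {d:.1f}  p = {approx_p_value(d, 50):.3f}")
```

With the sample size held constant, selecting on *p* < .05 is the same as selecting on the size of the observed effect.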

In the
figure below we see a distribution centered at an effect size of Cohen’s d =
0.5 for a two-sided *t*-test with 50 observations in each independent
condition. Given an alpha level of 0.05 in this test only effect sizes larger
than d = 0.4 will be statistically significant (i.e., all observed effect sizes
in the grey area). The threshold for which observed effect sizes will be
statistically significant is determined by the sample size and the alpha level
(and not influenced by the true effect size). The white area under the curve illustrates Type 2 errors
– non-significant results that will be observed if the alternative hypothesis
is true. If researchers only have access to the effect size estimates in the
grey area – a truncated distribution where non-significant results are removed –
a weighted average effect size from only these studies will be upwardly biased.
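
The significance threshold on the effect size scale can be approximated as follows. This is a rough sketch: it assumes two independent groups of equal size, and it uses the normal critical value instead of the critical *t* value, so the result is slightly smaller than the threshold an exact *t*-test would give. The function name `critical_d` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def critical_d(n_per_group, alpha=0.05):
    """Approximate smallest observed Cohen's d that reaches significance.

    For two equal groups, t = d * sqrt(n / 2), so the critical d is the
    critical test statistic times sqrt(2 / n). The normal critical value
    is used here as an approximation of the critical t value.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * sqrt(2 / n_per_group)

# The threshold depends on sample size and alpha, not on the true effect size.
for n in [20, 50, 200]:
    print(f"n = {n:3d} per group: significant if d > {critical_d(n):.2f}")
```

For 50 observations per group this gives a value just under 0.4, in line with the threshold described above.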

The inflation will be greater the larger the part of the distribution that is truncated, and the closer the true population effect size is to 0. In our example about the height of individuals, the inflation would be greater had we truncated the distribution by removing everyone shorter than 170 cm instead of 150 cm. If the true average height of individuals was 194 cm, removing the few people who are expected to be shorter than 150 cm (based on the assumption of normally distributed data) would inflate our estimate less than if the true average height was 150 cm, in which case we would remove 50% of individuals. In statistical tests where results are selected for significance at a 5% alpha level, more data will be removed if the true effect size is smaller, but also when the sample size is smaller. If the sample size is smaller, statistical power is lower, and more of the values in the distribution (those closest to 0) will be non-significant.
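
How much truncation inflates the average can be sketched with the formula for the mean of a truncated normal distribution. The code below rests on several simplifying assumptions that are not part of the original text: observed effect sizes are treated as normally distributed around the true effect with a standard error of sqrt(2/n), the significance threshold uses the normal critical value, and only significant results in the positive direction are kept. The function name `expected_significant_d` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def expected_significant_d(true_d, n_per_group, alpha=0.05):
    """Approximate mean observed d among positive significant results.

    Assumption: observed d ~ Normal(true_d, sqrt(2 / n)), truncated
    from below at the (approximate) significance threshold.
    Returns the conditional mean and the probability of passing the filter.
    """
    std = NormalDist()
    se = sqrt(2 / n_per_group)
    cutoff = std.inv_cdf(1 - alpha / 2) * se
    a = (cutoff - true_d) / se            # cutoff in standard units
    prob_significant = 1 - std.cdf(a)     # share of results that survive
    # Mean of a normal distribution truncated from below at the cutoff.
    cond_mean = true_d + se * std.pdf(a) / prob_significant
    return cond_mean, prob_significant

for true_d, n in [(0.5, 50), (0.2, 50), (0.5, 20)]:
    m, p = expected_significant_d(true_d, n)
    print(f"true d = {true_d}, n = {n}: mean significant d = {m:.2f}, "
          f"inflation = {m / true_d:.2f}x, share kept = {p:.2f}")
```

Under these assumptions, both a smaller true effect and a smaller sample remove a larger share of the distribution and lead to a larger inflation of the average significant effect.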

Any single
estimate of a population value will vary around the true population value. The
effect size estimate from a single study can be smaller than the true effect
size, even if studies have been selected for significance. For example, it is
possible that the true effect size is 0.5, you have observed an effect size of
0.45, but only effect sizes smaller than 0.4 are truncated when selecting
studies based on statistical significance (as in the figure above). At the same
time, this single effect size estimate of 0.45 is inflated. What inflates the
effect size is the long-run procedure used to generate the value. In the long
run effect size estimates based on a procedure where estimates are selected
for significance will be upwardly biased. This means that a single observed
effect size of d = 0.45 will be inflated if it is generated based on a
procedure where all non-significant effects are truncated, but it will be
unbiased if it is generated based on a distribution where all observed effect
sizes are reported, regardless of whether they are significant or not. This
also means that a single researcher cannot guarantee that the effect sizes
they contribute to a literature will contribute to an unbiased effect size
estimate: There needs to be a system in place where all researchers report all
observed effect sizes to prevent bias. An alternative is to not have to rely on
other researchers, and collect sufficient data in a single study to have a
highly accurate effect size estimate. Multi-lab replication studies are an example
of such an approach, where dozens of researchers collect a large number (up to
thousands) of observations.

The most
extreme consequence of the inflation of effect size estimates occurs when the
true effect size in the population is 0, but due to selection of statistically
significant results, only significant effects in the expected direction are
published. Note that if all significant results are published (and not only
effect sizes in the expected direction) 2.5% of results will be Type 1 errors in
the positive direction, and 2.5% will be Type 1 errors in the negative direction, and the
average effect size would actually be 0. Thus, as long as the true effect
size is exactly 0, and all Type 1 errors are published, the effect size
estimate would be unbiased. In practice, we see scientists often do not simply
publish all results, but only statistically significant results in the desired
direction. An example of this is the literature on ego-depletion, where
hundreds of studies were published, most showing statistically significant
effects, but unbiased large scale replication studies revealed effect sizes of
0 (Hagger
et al., 2015; Vohs et al., 2021).
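
The difference between publishing all Type 1 errors and publishing only those in the desired direction can be illustrated with a short calculation. As a simplifying assumption, observed effect sizes under a true effect of 0 are treated as normally distributed with a standard error of sqrt(2/n), and the normal critical value stands in for the critical *t* value. The function name is made up for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def mean_published_d_under_null(n_per_group, alpha=0.05, positive_only=True):
    """Approximate mean published d when the true effect is exactly 0.

    Assumption: observed d ~ Normal(0, sqrt(2 / n)); a result is published
    only if it is significant in a two-sided test at the given alpha.
    """
    std = NormalDist()
    se = sqrt(2 / n_per_group)
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Mean of the upper tail beyond the critical value (truncated normal).
    upper_tail_mean = se * std.pdf(z_crit) / (alpha / 2)
    if positive_only:
        return upper_tail_mean
    # Both tails are equally likely and mirror each other, so they cancel.
    return 0.5 * upper_tail_mean + 0.5 * (-upper_tail_mean)

print(mean_published_d_under_null(50, positive_only=False))
print(round(mean_published_d_under_null(50, positive_only=True), 2))
```

Under these assumptions, a literature with 50 observations per group that contains only significant effects in the expected direction would show an average effect size well above 0, even though the true effect is 0.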

What can be
done about the problem of biased effect size estimates if we mainly have access
to the studies that passed a significance filter? Statisticians have developed approaches
to adjust biased effect size estimates by taking a truncated distribution into
account (Taylor & Muller, 1996). This approach has recently been
implemented in R (Anderson et al., 2017). Implementing this approach in
practice is difficult, because we never know for sure if an effect size estimate
is biased, and if it is biased, how much bias there is. Furthermore, selection
based on significance is only one form of bias, whereas researchers who
selectively report significant results may engage in additional problematic
research practices, such as selectively reporting results, which are not
accounted for in the adjustment. Other researchers have referred to this
problem as a Type M error (Gelman & Carlin, 2014; Gelman & Tuerlinckx,
2000) and have suggested that researchers
always report the average inflation factor of effect sizes. I do not believe
this approach is useful. The Type M error is not an error, but a bias in
estimation, and it is more informative to compute the adjusted estimate based
on a truncated distribution as proposed by Taylor and Muller in 1996, than to compute
the average inflation for a specific study design. If effects are on average
inflated by a factor of 1.3 (the Type M error) it does not mean that the
observed effect size is inflated by this factor, and the truncated effect size
estimator by Taylor and Muller will provide researchers with an actual estimate
based on their observed effect size. Type M errors might have a function in
education, but they are not useful for scientists (I will publish a paper on Type S and M errors later this year, explaining in more detail why I think neither are useful concepts).

Of course
the real solution to bias in effect size estimates due to significance filters
that lead to truncated or censored distributions is to stop selectively
reporting results. Designing highly informative studies that have high power to
reject both the null and a smallest effect size of interest in an equivalence
test is a good starting point. Publishing research as Registered Reports is
even better. Eventually, if we do not solve this problem ourselves, it is
likely that we will face external regulatory actions that force us to add
all studies that have received ethical review board approval to a public
registry, and update the registration with the effect size estimate, as is done
for clinical trials.

*References*:

Anderson, S. F., Kelley, K., & Maxwell, S. E.
(2017). Sample-size planning for more accurate statistical power: A method
adjusting sample effect sizes for publication bias and uncertainty. *Psychological
Science*, *28*(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Ensinck, E., & Lakens,
D. (2023). *An Inception Cohort Study Quantifying How Many Registered Studies
are Published*. PsyArXiv. https://doi.org/10.31234/osf.io/5hkjz

Franco, A., Malhotra, N.,
& Simonovits, G. (2014). Publication bias in the social sciences: Unlocking
the file drawer. *Science*, *345*(6203), 1502–1505.
https://doi.org/10.1126/SCIENCE.1255484

Gelman, A., & Carlin,
J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M
(Magnitude) Errors. *Perspectives on Psychological Science*, *9*(6),
641–651.

Gelman, A., &
Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and
multiple comparison procedures. *Computational Statistics*, *15*(3),
373–390. https://doi.org/10.1007/s001800000040

Hagger, M. S.,
Chatzisarantis, N. L., Alberts, H., Anggono, C. O., Batailler, C., Birt, A.,
& Zwienenberg, M. (2015). A multi-lab pre-registered replication of the
ego-depletion effect. *Perspectives on Psychological Science*, 2.

Sterling, T. D. (1959).
Publication decisions and their possible effects on inferences drawn from tests
of significance—Or vice versa. *Journal of the American Statistical
Association*, *54*(285), 30–34. https://doi.org/10.2307/2282137

Taylor, D. J., &
Muller, K. E. (1996). Bias in linear model power and sample size calculation
due to estimating noncentrality. *Communications in Statistics-Theory and
Methods*, *25*(7), 1595–1610. https://doi.org/10.1080/03610929608831787

Vohs, K. D., Schmeichel,
B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J.
L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi,
J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay,
S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered
Paradigmatic Test of the Ego-Depletion Effect. *Psychological
Science*, *32*(10), 1566–1581. https://doi.org/10.1177/0956797621989733

The text was happy to point out a perverse effect of this problem: researchers will look for significant results. They will do this through processes that may include selective exclusion of outliers, treatment of variables, search for models that respond better in terms of results, and even gross manipulation of the data.

I just wanted to comment and thank you for your site. It was recommended to me by my supervisor, Steve Lindsay, and your blog posts (and publications) have elevated my understanding of statistics well beyond what I thought I would ever know. Shout-out to your twitter for directing me to the best realization I've ever had, that p-values are uniformly distributed under the null-hypothesis.

It is true of course that effect sizes are inflated if only a positive selection of significant results is considered. However, keep in mind that many scientific studies do not intend to estimate the effect size in the first place but to establish the existence of effects. To make this difference clear, just consider experimental studies. The very purpose of an experiment is to MAXIMIZE the size of the effect by controlling noise and confounding variables. Also, many important effects only exist in the laboratory, and even there in only a few experiments (this is why there is no such thing as the "true magnitude of the emotional Stroop effect with Taylor Swift faces"). Generally, a context of hypothesis testing (i.e., theory-testing) is different from a context of effect size estimation -- the latter would require large representative samples, while the former employs experimental control to bring out effects that would otherwise be tiny. There are actually very few studies in experimental psychology that are conducted with effect size estimation in mind.
