“In the long run we are all dead.” -
John Maynard Keynes
When we perform hypothesis tests in a
Neyman-Pearson framework we want to make decisions while controlling the rate
at which we make errors. We do this in part by setting an alpha level that guarantees we will not say there is an effect when there is no effect more than α (e.g., 5%) of the time, in the long run.
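To make that ‘long run’ concrete, here is a minimal sketch (my own illustration, not code from this post; the sample size, number of simulations, and seed are arbitrary choices) that simulates many one-sample t-tests in which the true effect is zero and counts how often p < α:

```python
# A minimal sketch: simulate many one-sample t-tests where the true effect
# is zero and count how often p < alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)    # arbitrary seed
alpha = 0.05
n_sims, n_per_study = 100_000, 50  # assumed numbers of studies and observations

# Draw samples from a population with a true mean of 0, so the null is true.
samples = rng.normal(loc=0, scale=1, size=(n_sims, n_per_study))
p_values = stats.ttest_1samp(samples, popmean=0, axis=1).pvalue

# In the long run, the proportion of p-values below alpha approaches alpha.
print(np.mean(p_values < alpha))  # close to 0.05
```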
I like my statistics applied. And in
practice I don’t do an infinite number of studies. As Keynes astutely observed,
I will be dead before then. So when I control the error rate for my studies,
what is a realistic Type 1 error rate I will observe in the ‘somewhat longer
run’?
Let’s assume you publish a paper that
contains only a single p-value. Let’s
also assume the true effect size is 0, so the null hypothesis is true. Your test will either return a p-value smaller than your alpha level (which would be a Type 1 error) or it will not. With a single study, you don’t have the granularity to talk about a 5% error rate: your observed Type 1 error rate is either 0% or 100%.
In experimental psychology 30 seems to be
a reasonable average for the number of p-values
that are reported in a single paper (http://doi.org/10.1371/journal.pone.0127872).
Let’s assume you perform 30 tests in a single paper and that the null hypothesis is true for every test (even though this is often unlikely in a real paper). In the long run, with an alpha level of 0.05 we can expect that 30 * 0.05 = 1.5 p-values will be significant. But in a real set of 30 p-values there is no half of a p-value, so you will observe 0, 1, 2, 3, 4, 5, or even more Type 1 errors, which equals 0%, 3.33%, 6.67%, 10%, 13.33%, 16.67%, or even more. We can plot the frequency of these observed Type 1 error rates for 1 million simulated sets of 30 tests.
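A sketch of the kind of simulation described here (my reconstruction, not necessarily the code used for the blog’s figure; the seed and plotting details are assumptions), using the fact that each test of a true null is a Type 1 error with probability α:

```python
# 1 million "papers" of 30 tests in which the null is always true, counting
# the number of Type 1 errors in each paper.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2020)
alpha, n_tests, n_sims = 0.05, 30, 1_000_000

# Each test of a true null is a Type 1 error with probability alpha, so the
# number of errors per set of 30 tests follows a binomial distribution.
errors_per_paper = rng.binomial(n=n_tests, p=alpha, size=n_sims)

values, counts = np.unique(errors_per_paper, return_counts=True)
plt.bar(values / n_tests * 100, counts / n_sims * 100, width=2)
plt.xlabel("Observed Type 1 error rate (%)")
plt.ylabel("Frequency (%)")
plt.show()
```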
Each of these error rates occurs with a
certain frequency. 21.5% of the time, you will not make any Type 1 errors.
12.7% of the time, you will make 3 Type 1 errors in 30 tests. The average over
thousands of papers reporting 30 tests will be a Type 1 error rate of 5%, but
no single set of studies is average.
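These frequencies follow directly from the binomial distribution; a quick check of the exact probabilities (a small sketch using scipy, not the original code):

```python
# Exact probabilities behind the frequencies reported above.
from scipy import stats

print(stats.binom.pmf(0, n=30, p=0.05))  # ~0.215: no Type 1 errors in 30 tests
print(stats.binom.pmf(3, n=30, p=0.05))  # ~0.127: exactly 3 Type 1 errors
```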
Now maybe a single paper with 30 tests is not ‘long runnerish’ enough. What we really want to control is the Type 1 error rate in the literature, past, present, and future. Except that we will never read that entire literature. So let’s assume we are interested in a meta-analysis’ worth of 200 studies that examine a topic where the true effect size is 0 for each test. Again, we can plot the frequency of observed Type 1 error rates for 1 million simulated sets of 200 tests.
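The same kind of simulation works here, only with 200 tests per set instead of 30; a minimal self-contained sketch (again assuming each test is an independent test of a true null):

```python
# 1 million sets of 200 tests of a true null, recording the observed
# Type 1 error rate in each set.
import numpy as np

rng = np.random.default_rng(7)
alpha, n_tests, n_sims = 0.05, 200, 1_000_000

observed_rates = rng.binomial(n=n_tests, p=alpha, size=n_sims) / n_tests
print(observed_rates.mean())  # averages to ~0.05 across all sets
```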
Now things start to look a bit more like
what you would expect. The most likely Type 1 error rate you will observe in your set of 200 tests is exactly 5%, but it is almost exactly as likely that the observed Type 1 error rate is 4.5%. 90% of the distribution of observed alpha levels will lie between
0.025 and 0.075. So, even in ‘somewhat longrunnish’ 200 tests, the observed
Type 1 error rate will rarely be exactly 5%, and it might be more useful to
think about it as being between 2.5 and 7.5%.
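A short sketch of where the ‘between 0.025 and 0.075’ figure comes from: the 5th and 95th percentiles of the number of Type 1 errors in 200 tests of a true null, divided by 200 to turn counts into observed error rates:

```python
# 90% of observed Type 1 error rates in 200 tests fall between these bounds.
from scipy import stats

n_tests, alpha = 200, 0.05
lower = stats.binom.ppf(0.05, n=n_tests, p=alpha) / n_tests
upper = stats.binom.ppf(0.95, n=n_tests, p=alpha) / n_tests
print(lower, upper)  # approximately 0.025 and 0.075
```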
Statistical models are not reality.
A 5% error rate exists only in the
abstract world of infinite repetitions, and you will not live long enough to
perform an infinite number of studies. In practice, if you (or a group of
researchers examining a specific question) do real research, the error rates
you actually observe are somewhere in the vicinity of 5%, not exactly 5%. Everything computed from samples drawn from a larger population varies - observed error rates are no exception.
When we quantify things, there is the
tendency to get lost in digits. But in practice, the level of random noise we can reasonably expect quickly overwhelms everything from the third digit after the decimal onward. I know we can compute the alpha level after a Pocock correction for
two looks at the data in sequential analyses as 0.0294. But this is not the
level of granularity that we should have in mind when we think of the error rate
we will observe in real lines of research. When we control our error rates, we
do so with the goal to end up somewhere reasonably low, after a decent number
of hypotheses have been tested. Whether we end up observing 2.5% Type 1 errors
or 7.5% errors: potato, potahto.
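As a rough illustration of where that 0.0294 comes from (a Monte Carlo sketch that assumes two equally spaced looks; this is not how such boundaries are computed in practice), applying the same nominal threshold at both looks should give an overall Type 1 error rate near 0.05:

```python
# Monte Carlo check of the Pocock-corrected alpha for two equally spaced looks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 1_000_000
nominal_alpha = 0.0294
z_crit = stats.norm.ppf(1 - nominal_alpha / 2)  # two-sided threshold, ~2.18

# Under the null, the z-statistic at the interim look (half the data) and at
# the final look are correlated: z_final = (z_interim + increment) / sqrt(2).
z_interim = rng.standard_normal(n_sims)
increment = rng.standard_normal(n_sims)
z_final = (z_interim + increment) / np.sqrt(2)

false_positive = (np.abs(z_interim) > z_crit) | (np.abs(z_final) > z_crit)
print(false_positive.mean())  # close to 0.05
```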
This does not mean we should stop
quantifying numbers precisely when they can be quantified precisely, but we
should realize what we get from the statistical procedures
we use. We don't get a 5% Type 1 error rate in any real set of studies we will actually perform. Statistical inferences guide us roughly to where we would ideally
like to end up. By all means calculate exact numbers where you can. Strictly
adhere to hard thresholds to prevent you from fooling yourself too often. But maybe in 2020 we can
learn to appreciate that statistical inferences are always a bit messy. Do the best
you reasonably can, but don’t expect perfection. In 2020, and in statistics.
For a related paper on alpha levels that in practical situations cannot be 5%, see https://psyarxiv.com/erwvk/ by Casper Albers.