An extended version of this blog post is now in press at PeerJ.
TL;DR version: De Winter and Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values between 0.041-0.049 in recent decades’ which 'suggests (but does not prove) questionable research practices have increased over the past 25 years'. I show the changes over the years in the ratios of p-values between 0.041-0.049 are better explained by a model of p-value distributions that assumes the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias over the years (cf. Fanelli, 2012), which has led to a relative decrease of 'marginally significant' p-values in the literature (instead of an increase in p-values just below 0.05). I (again, see Lakens, 2014) explain why researchers analyzing large numbers of p-values in the scientific literature need to develop better models of p-value distributions before drawing conclusions about questionable research practices. I want to thank De Winter and Dodou for sharing their data, assisting in the re-analysis, and reading an earlier version of this draft (to which they replied they were happy to see other researchers used the data to test alternative explanations, and that they did not see any technical mistakes in this blog post).
In recent
years researchers have become more aware of how flexibility during data
analysis can increase false positive results (e.g., Simmons, Nelson, &
Simonsohn, 2011). If the true Type 1 error rate is inflated
because researchers analyze their data until a p-value smaller than 0.05 is observed, this might substantially
decrease the robustness of scientific knowledge. However, as Stroebe and Strack
(2014, p. 60) have pointed out: “Thus
far, however, no solid data exist on the prevalence of such research practices”.
Some researchers have attempted to provide some indication of the prevalence of
questionable research practices by analyzing the distribution of p-values in the published literature.
The idea is that questionable research practices lead to ‘a peculiar prevalence
of p-values just below 0.05’
(Masicampo & Lalande, 2012) or the observation that '"just significant"
results are on the rise' (Leggett, Loetscher, & Nicholls, 2013).
Despite the
attention grabbing titles of these publications, the reported data does not afford the strong conclusions these researchers have drawn. The
observed pattern of a peak of p-values
just below 0.05 in Leggett et al (2013) does not replicate in other collected p-value distributions for the same
journal in later years (Masicampo & Lalande, 2012), in psychology in
general (Kühberger, Fritz, & Scherndl, 2014), or in scientific journals in
general (De Winter & Dodou, 2015). The peak in p-values observed in Masicampo & Lalande (2012) is only
surprising compared to an incorrectly modelled p-value distribution that ignores publication bias and its effect
on the frequency of p-values above
0.05 (Lakens, 2014).
Recently, De
Winter and Dodou (2015) have contributed to this emerging literature on p-value distributions and concluded that
there is a ‘surge of p-values between
0.041-0.049 in recent decades’. They improved upon earlier approaches to
analyze p-value distributions by
comparing the percentage of p-values
over time (from 1990-2013). Two observations in the data they collected could
seduce researchers to draw conclusions about a rise of p-values just below a significance level of 0.05. The first
observation is a much stronger rise in p-values
between 0.041 and 0.049 than in p-values
between 0.051-0.059. The second observation is that the percentage of p-values that falls between 0.041-0.049
has increased more from 1990 to 2013 than the percentage of p-values between 0.001-0.009, 0.011-0.019,
0.021-0.029, and 0.031-0.039 over the same years (the authors also analyze p-values reported with two digits (e.g., p = 0.04), which reveal similar patterns, but here I focus on the three-digit data, in which bins run from, for example, 0.041 to 0.049, because trailing zeroes (e.g., p = 0.040) are rarely reported). The authors (2015, p. 37) conclude
that: “The fact that p-values just
below 0.05 exhibited the fastest increase among all p-value ranges we searched for suggests (but does not prove) that
questionable research practices have increased over the past 25 years.”
I will
explain why the data does not provide any indication of an increase in
questionable research practices. First, I will discuss how the difference in
the increase in p-values just below
0.05 and just above 0.05 is due to publication bias, where (perhaps
surprisingly) p-values just above
0.05 are becoming relatively less likely to appear in the abstracts of
published articles over the years. Second, I will explain why the relatively
high increase in p-values between
0.041-0.049 over the years can easily be accounted for by a decrease in the
average power of studies, but is unlikely to emerge due to an inflated Type 1
error rate due to questionable research practices. I want to explicitly note
that it was possible to provide these alternative interpretations of the data
mainly because the authors shared all data and analysis scripts online (http://dx.doi.org/10.7717/peerj.733/supp-7)
and were furthermore extremely responsive and helpful in answering a number of
questions I had. While I criticize their interpretation of the data, I applaud
their adherence to open science principles (their Matlab code is an excellent
example of reproducible statistics), which greatly facilitates cumulative
science.
As I have
discussed before (Lakens, 2014), it is essential to accurately model p-value distributions before drawing
conclusions about p-values extracted
from the scientific literature. Statements about p-value distributions require a definition of four parameters. First,
researchers should specify the number of studies where H0 is true, and the number
of studies where H1 is true. Second, researchers need to estimate the average power
of the studies (or the average power of multiple subsets of studies, if
heterogeneity in power is substantial). Third, the true Type 1 error rate and
any possible mechanisms through which the error rate is inflated should be
specified. And finally, publication bias, and a model of how the p-value distribution is affected by
publication bias, should be proposed. It is important to look beyond simplistic
comparisons between p-values just below 0.05 and p-values elsewhere in the
distribution, and to interpret such comparisons within an explicit model of the
four parameters that determine p-value distributions.
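To make this concrete, below is a minimal sketch of such a four-parameter model. It assumes two-sided z-tests and a crude publication filter in which non-significant results are reported with some fixed probability; the function and parameter names are mine, not taken from any of the papers discussed here.

```python
from scipy.stats import norm

def p_smaller_than(p, ncp):
    """P(two-sided p-value < p) for a z-test with noncentrality parameter ncp."""
    z_crit = norm.ppf(1 - p / 2)
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)

def expected_bin_share(lo, hi, prop_h1, ncp, alpha=0.05, pub_bias=1.0):
    """Expected share of *reported* p-values falling between lo and hi.

    prop_h1  : proportion of tests for which H1 is true
    ncp      : noncentrality, which determines power (= p_smaller_than(alpha, ncp))
    alpha    : nominal Type 1 error rate
    pub_bias : probability that a non-significant result is reported,
               relative to a significant one (1 = no publication bias)
    """
    def mass(a, b):
        # expected proportion of all p-values in (a, b) before any selection
        h1 = prop_h1 * (p_smaller_than(b, ncp) - p_smaller_than(a, ncp))
        h0 = (1 - prop_h1) * (b - a)  # p is uniformly distributed under H0
        return h1 + h0

    weight = 1.0 if hi <= alpha else pub_bias  # crude publication filter
    total = mass(0, alpha) + pub_bias * mass(alpha, 1)
    return weight * mass(lo, hi) / total
```

For example, expected_bin_share(0.041, 0.049, prop_h1=0.5, ncp=2.09, pub_bias=0.1) gives the share of reported p-values expected in the 0.041-0.049 bin when half of all tested effects are true, power is roughly 55%, and only 10% of non-significant results are reported; all of these numbers are illustrative assumptions.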
Are p-values below 0.05 increasing, or p-values above 0.05 decreasing?
De Winter
and Dodou (2015) show there is a relatively stronger increase in p-values between 0.041-0.049 than
between 0.051-0.059 (see for example Figure 9, reproduced below). The data is
clear, but the reason for this difference is not. Are p-values below 0.05 increasing more, or are p-values above 0.05 increasing less? A direct comparison is
difficult, because the percentage of papers reporting p-values below 0.05 can increase due to an increase in p-hacking, but also due to an increase
in publication bias. If publication bias increases, and people report fewer
non-significant results, the percentage of papers reporting p-values smaller than 0.05 will also
increase, even if there is no increase in p-hacking.
Indeed, Fanelli (2012) has shown negative results have been disappearing from
the literature between 1990-2007, which would explain the relative differences
in p-values between 0.041-0.049 and
0.051-0.059 observed by De Winter and Dodou (2015).
We can
examine the alternative explanation that the relative differences observed are due
to publication bias increasing, instead of due to an increase in p-hacking, by comparing the relative
differences between p-values between
0.031-0.039 and 0.041-0.049 over the years on the one hand, and 0.051-0.059 and
0.061-0.069 on the other hand. If there is an increase in p-hacking, the biggest differences should be observed below 0.05
(in line with the idea of a surge of p-values
between 0.041-0.049).
However, there are reasons to assume the biggest difference might occur in p-values just above 0.05. As Lakens
(2014) noted, there seems to be some tolerance for p-values just above 0.05 to be published, as indicated by a higher
prevalence of p-values between
0.051-0.059 than would be expected based on the power of statistical tests and
an equal reduction of all p-values
above 0.05. If publication bias becomes more severe, we might expect a
reduction in the tolerance for p-values
just above 0.05, which would lead to the largest changes in ratios above 0.05. The spreadsheets and datafiles used to re-analyze and reconstruct the data are available on the OSF.
Across the
three time periods (1990-1997, 1998-2005, and 2006-2013) the ratio of p-values in the 0.03 range to p-values in the 0.04 range is pretty
stable: 1.13, 1.09, and 1.11, respectively. The ratio of p-values in the 0.05 range to p-values
in the 0.06 range is surprisingly large to begin with (given that purely based
on power, p-values between 0.051-0.059 and 0.061-0.069 should occur
approximately equally often in the literature), and shows a considerable
reduction over the years: 2.27, 1.94, and 1.79, respectively. The only larger
reduction in ratios is observed for p-values
between 0.001-0.009 (which is most likely due to differences in power over the
years, as will be explained below). This surprisingly large change in ratios over
time for p-values between 0.051-0.059
indicates that instead of a surge of p-hacking,
publication bias has become more pronounced over the years for p-values just above the 0.05 level,
which causes p-values just above 0.05
to increase relatively less over the years than p-values in all other bins (except for p-values below 0.009).
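These ratios are simple to recompute from the shared data: within each time period, divide the number of p-values in one bin by the number in the adjacent bin. A sketch of that check is below; the counts are placeholders scaled to reproduce the ratios reported above, not the raw frequencies from the dataset.

```python
# Placeholder bin counts per time period, scaled to reproduce the ratios
# reported above; substitute the actual counts from De Winter & Dodou's
# supplementary data to run the real check.
counts = {
    "1990-1997": {"0.031-0.039": 113, "0.041-0.049": 100,
                  "0.051-0.059": 227, "0.061-0.069": 100},
    "1998-2005": {"0.031-0.039": 109, "0.041-0.049": 100,
                  "0.051-0.059": 194, "0.061-0.069": 100},
    "2006-2013": {"0.031-0.039": 111, "0.041-0.049": 100,
                  "0.051-0.059": 179, "0.061-0.069": 100},
}

for period, b in counts.items():
    ratio_below = b["0.031-0.039"] / b["0.041-0.049"]  # bins just below 0.05
    ratio_above = b["0.051-0.059"] / b["0.061-0.069"]  # bins just above 0.05
    print(f"{period}: 0.03/0.04 ratio = {ratio_below:.2f}, "
          f"0.05/0.06 ratio = {ratio_above:.2f}")
```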
This might be
explained by the idea that p-values
between 0.051-0.059 (or 'marginally significant' p-values) were more readily interpreted as support for the hypothesis
in 1990-1997 than in 2006-2013. This idea is
speculative, but seems likely given the increase in publication bias over the
years (Fanelli, 2012). It should be noted that p-values just above the 0.05 level are still more frequent than can be explained just by the average power of the
tests and publication bias that is equal for all p-values above 0.05 (cf. Lakens, 2014). In other words, this data
is in line with the idea that publication bias is still slightly less severe
for p-values just above 0.05, even
though this benefit of p-values just
above 0.05 has become smaller over the years.
This seems
to be the driving force for the differences between p-values in the 0.041-0.049 range and p-values
in the 0.051-0.059 range, reported by De Winter and Dodou (2015, e.g., Figures
9 and 10). To conclude, these observed differences provide no indication for a
surge of p-values between 0.041-0.049
over the years due to an increase in questionable research practices.
How changes in average
power over the years affect ratios of p-values
below 0.05
The title
of the article, “A surge of p-values
between 0.041-0.049” is based on the observation that the ratio of p-values between 0.041-0.049 increases
more than the ratio of p-values
between 0.031-0.039, 0.021-0.029, and 0.011-0.019. There are no statistics
reported to indicate whether these differences in ratios are statistically
significant, nor are effect sizes reported to indicate whether the differences
are practically significant (or justify the term ‘surge’), but the ratios do
increase as you move from bins of low p-values
between 0.001-0.009 to bins of high p-values
between 0.041-0.049. Figure 23 reports the ratios of percentages of p-values in 1990 and 2013 for a range of
search terms. Most interesting for the current purpose are the p-values between 0.001 and 0.049.
The first
thing to understand is why these ratios are not close to 1. The reason is that
there is a massive increase in the percentage of papers in which p-values are reported over the years. As
De Winter & Dodou (2015, p. 15) note: “In
1990, 0.019% of papers (106 out of 563,023 papers) reported a p-value between
0.051 and 0.059. This increased 3.6-fold to 0.067% (1,549 out of 2,317,062
papers) in 2013. Positive results increased 10.3-fold in the same period: from
0.030% (171 out of 563,023 papers) in 1990 to 0.314% (7,266 out of 2,317,062
papers) in 2013.” This is not just an increase in the absolute number of
reported p-values in abstracts (in
which case the ratios could still be 1) but a relative 10.3-fold increase in
how often p-values end up in
abstracts. De Winter & Dodou (2015) demonstrate p-values are finding their way into more and more abstracts, which
points to a possible increase in the overreliance on null-hypothesis testing in
empirical articles. This is an important contribution to the literature, even
when other claims about an increase in questionable research practices would
not hold (also, the huge increase in the term 'paradigm shift' in abstracts over time is quite telling).
How can
these differences between the ratios across the 5 bins below 0.05 be explained
by a model of p-value distributions
that consists of the ratio of true to false effects examined, power, the Type 1
error rate, and publication bias? We can only explain the relative differences
between the ratios over the different bins of p-values if we allow at least one of the parameters of the model to
change over time. We can ignore publication bias, assuming all disciplines
that report p-values in abstracts use
α = 0.05 (this is not true, but we can assume it applies to the majority of
articles that are analyzed). The two remaining possibilities are a change in
the average power of studies over time, and an inflated Type 1 error rate over
time, such as an increase in questionable research practices in the
literature.
If we
ignore Type 1 errors, we can relatively easily reconstruct the observed data
purely based on differences in the average power across the years. I'm not
arguing the numbers in this reconstruction reflect the truth. However, they show
it is possible to model the ratios observed by De Winter & Dodou (2015) under
the assumption that power differs from 1990 to 2013. For example, if we assume
average power was 55% in 1990, and 42% in 2013, we can expect to observe the p-value distribution across the 5 bins
as detailed in the table below, with 29.855% of the p-values falling between 0.001 and 0.009 in 1990, but only 19.926%
of p-values falling between
0.001-0.009 in 2013 (which most likely explains the large differences in ratios
between 0.001-0.009 discussed earlier). This is just the p-value distribution as a function of the power of the tests.
Table 1: Expected proportion of p-values between 0.001-0.049 based on 55% power (1990) and 42% power (2013).

| Bin | 1990 (55% power) | 2013 (42% power) |
| --- | --- | --- |
| p0.001-p0.009 | 0.299 | 0.199 |
| p0.011-p0.019 | 0.085 | 0.072 |
| p0.021-p0.029 | 0.056 | 0.051 |
| p0.031-p0.039 | 0.042 | 0.040 |
| p0.041-p0.049 | 0.034 | 0.033 |
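A rough way to compute comparable proportions is sketched below, assuming two-sided z-tests. Because the exact values depend on the test statistic and degrees of freedom assumed (and on how p-values below 0.001 are handled in the first bin), this sketch reproduces the pattern in Table 1, but not every value to the third decimal.

```python
from scipy.stats import norm

def p_cdf(p, ncp):
    """P(two-sided p-value < p) for a z-test with noncentrality parameter ncp."""
    z = norm.ppf(1 - p / 2)
    return (1 - norm.cdf(z - ncp)) + norm.cdf(-z - ncp)

def ncp_for_power(power, alpha=0.05):
    """Noncentrality giving the requested two-sided power at alpha
    (ignoring the negligible mass in the 'wrong' tail)."""
    return norm.ppf(1 - alpha / 2) - norm.ppf(1 - power)

bins = [(0.001, 0.009), (0.011, 0.019), (0.021, 0.029),
        (0.031, 0.039), (0.041, 0.049)]

for year, power in [(1990, 0.55), (2013, 0.42)]:
    ncp = ncp_for_power(power)
    shares = [p_cdf(hi, ncp) - p_cdf(lo, ncp) for lo, hi in bins]
    print(year, [round(s, 3) for s in shares])
```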
If we
incorporate the fact that the proportion of papers reporting a p-value in the
abstract has increased roughly 10-fold over the years (from 0.01 to 0.1; columns 2 and 3 in
Table 2 below), and use 563,023 as the total number of papers in 1990 and
2,317,062 as the total in 2013 (taken from De Winter & Dodou, 2015), then we should expect
the total number of p-values
in 1990 and 2013 displayed in columns 4 and 5 below. These reconstructed numbers mirror
the frequencies observed by De Winter and Dodou (2015) (columns 6 and 7).
Table 2. Absolute number of reconstructed and observed p-values between 0.001-0.049 in 1990 and 2013.

| Bin | Proportion of papers with p-value in abstract, 1990 | Proportion of papers with p-value in abstract, 2013 | Reconstructed p-values 1990 | Reconstructed p-values 2013 | Observed p-values 1990 | Observed p-values 2013 |
| --- | --- | --- | --- | --- | --- | --- |
| p0.001-p0.009 | 0.01 | 0.1 | 1681 | 46170 | 1770 | 44970 |
| p0.011-p0.019 | 0.01 | 0.1 | 481 | 16728 | 462 | 14885 |
| p0.021-p0.029 | 0.01 | 0.1 | 316 | 11725 | 268 | 10630 |
| p0.031-p0.039 | 0.01 | 0.1 | 238 | 9210 | 240 | 9108 |
| p0.041-p0.049 | 0.01 | 0.1 | 191 | 7646 | 178 | 8250 |
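The reconstructed counts in Table 2 are simply the product of the total number of papers, the proportion of papers reporting a p-value in the abstract, and the expected share of p-values per bin from Table 1. A sketch using the rounded proportions from Table 1 (so the counts match Table 2 only approximately):

```python
# Rounded expected shares per bin, taken from Table 1.
share_1990 = {"p0.001-p0.009": 0.299, "p0.011-p0.019": 0.085,
              "p0.021-p0.029": 0.056, "p0.031-p0.039": 0.042,
              "p0.041-p0.049": 0.034}
share_2013 = {"p0.001-p0.009": 0.199, "p0.011-p0.019": 0.072,
              "p0.021-p0.029": 0.051, "p0.031-p0.039": 0.040,
              "p0.041-p0.049": 0.033}

papers_1990, papers_2013 = 563023, 2317062  # totals from De Winter & Dodou (2015)
report_1990, report_2013 = 0.01, 0.10       # assumed share of papers reporting a p-value

for bin_label in share_1990:
    n_1990 = papers_1990 * report_1990 * share_1990[bin_label]
    n_2013 = papers_2013 * report_2013 * share_2013[bin_label]
    print(f"{bin_label}: {n_1990:6.0f} (1990)  {n_2013:7.0f} (2013)")
```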
When we
calculate the ratios of the observed p-values,
we see in Table 3 they approach the general pattern of the ratios observed by
De Winter and Dodou (2015). The reconstruction is not perfect, for a number of
reasons. First of all, there is very little data from 1990, which will lead to
substantial variation between expected and observed frequencies for any model
(the fit of the model increases for comparisons between years where there is
more data available). For example, the observation that the ratio for p-values in the
0.021-0.029 bin is larger than the ratio for p-values in the 0.031-0.039 bin holds only when 2013 is compared with 1990 and with 2008, but
is reversed (as predicted by a model of p-value distributions where power
changes over time) in the remaining 21 comparisons of 2013 with each preceding
year.
Table 3. Ratios of reconstructed and observed p-values between 0.001-0.049 in 1990 and 2013.

| Bin | Reconstructed N/T 1990 | Reconstructed N/T 2013 | Reconstructed 1990/2013 ratio | Observed N/T 1990 | Observed N/T 2013 | Observed 1990/2013 ratio |
| --- | --- | --- | --- | --- | --- | --- |
| p0.001-p0.009 | 0.306 | 1.993 | 6.674 | 0.315 | 1.945 | 6.17 |
| p0.011-p0.019 | 0.085 | 0.722 | 8.454 | 0.082 | 0.644 | 7.83 |
| p0.021-p0.029 | 0.056 | 0.506 | 9.017 | 0.048 | 0.460 | 9.63 |
| p0.031-p0.039 | 0.042 | 0.398 | 9.417 | 0.043 | 0.394 | 9.21 |
| p0.041-p0.049 | 0.034 | 0.330 | 9.740 | 0.032 | 0.367 | 11.28 |

(N/T = number of p-values in the bin divided by the total number of papers, expressed as a percentage.)
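The ratios in Table 3 follow directly from these counts: each N/T value is a bin count divided by the total number of papers in that year, and the 1990/2013 ratio is the 2013 rate divided by the 1990 rate. A brief check for the 0.041-0.049 bin (rounding of intermediate values explains small discrepancies with the table):

```python
def ratio_1990_2013(n_1990, n_2013, papers_1990=563023, papers_2013=2317062):
    """2013 rate divided by 1990 rate, where rate = bin count / total papers."""
    return (n_2013 / papers_2013) / (n_1990 / papers_1990)

# The 0.041-0.049 bin, using the counts from Table 2:
print(round(ratio_1990_2013(178, 8250), 2))  # observed, ~11.3
print(round(ratio_1990_2013(191, 7646), 2))  # reconstructed from the power model, ~9.7
```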
Similarly,
when comparing 2013 to each of the 23 preceding years, the ratio is higher for p-values between 0.041-0.049 than for
0.031-0.039 in 12 out of 23 comparisons – only just more than 50% of the time,
which can hardly be called a ‘surge’. The model based on power differences
predicts that ratios for p-values
between 0.031-0.039 should be very similar to those between 0.041-0.049. Given
the small percentages of articles that report p-values and the variation inherent in observed p-value distributions, it is not
surprising the ratios for 0.041-0.049 are only just more than 50% likely to be
higher than those for p-values
between 0.031-0.039. This observation is more difficult to explain based on the
idea that questionable research practices have increased, which typically
assumes p-values between 0.041-0.049
increase more strongly than p-values
between 0.031-0.039 (e.g., Leggett et al., 2013; Masicampo & Lalande,
2012).
Obviously
this model is too simplistic. It does not include any Type 1 errors, and it
assumes homogeneity in the power of the performed tests. We can be certain
power varies substantially across studies and research disciplines, and we can
be certain there are a number of Type 1 errors in the literature. For the
current purpose, which is to demonstrate the observed pattern can be
reconstructed by assuming the average power has changed over time, a more advanced
model is not required, but future attempts to provide support for an increase
in Type 1 errors, or attempts to calculate average effect sizes based on p-value distributions (e.g., Simonsohn,
Nelson, & Simmons, 2014) need to develop more detailed models of p-value distributions.
Let’s assume
the average power has not changed over time, and try to reconstruct the
observed ratios by changing the Type 1 error rates. As long as the Type 1 error
rates are the same for each bin of p-values,
the ratios equal the overall increase in p-values
reported in abstracts over time. To reconstruct the ratios as observed by De
Winter and Dodou (2015), we need to assume p-hacking
leads to a stronger increase in higher p-values
than in lower p-values. Although this
is a reasonable assumption under many types of p-hacking, it turns out that the specific pattern of inflated
Type 1 error rates required to reconstruct the observed ratios is not very
likely to emerge in real life.
To simulate
the impact of questionable research practices, we need to decide upon the ratio
of studies where H0 is true and studies where H1 is true, and the exact
increase in Type 1 error rates for each bin of p-values below 0.05. Type 1 errors come exclusively from analyzing
results of studies where H0 is true (p-hacking
when H1 is true inflates the effect size estimate, and thus can be seen as an
incorrect way to increase the power of a test). In the calculations below,
power is kept constant, but p-hacking
is introduced. This is the equivalent of the true power of studies reducing
over the years, which is exactly compensated by an inflated Type 1 error rate.
The
observed ratios by De Winter & Dodou (2015) show the ratio is the smallest
for p-values between 0.001-0.009, and
substantially higher for p-values
between 0.011 and 0.049, with a relatively small increase in these 4 bins. This
pattern can be reproduced just based on inflated Type 1 errors, but the required
increase in Type 1 error rates over the 5 bins is very unlikely to occur when p-hacking.
The higher
the average power of statistical tests, the more frequently small p-values will be observed if there is a
true effect. This means there are more p-values
between 0.021-0.029 than between 0.041-0.049 whenever the power is larger than
0. Without p-hacking, the percentage of
Type 1 errors in each bin (e.g., between 0.001 and 0.009) should be 0.8% (it is
1% between 0 and 0.01). If we assume this was the situation in 1990 (which is a
conservative, albeit unlikely, estimate), the Type 1 error rates need to be
increased to higher levels to reproduce the observed ratios, after selecting
the average power of the studies, and the ratio of studies where H0 is true and
H1 is true. It becomes extremely difficult to reconstruct the observed absolute
numbers and ratios.
One attempt
to reconstruct the ratios (but not the absolute values) is presented
in Table 4. The ratio of studies where H0 is true to studies where H1 is true
is set to 1, and the average power is assumed to be 57.5%. The Type 1 error
rate inflation over time is substantial, and the difference in the increase
over the bins is not very typical, with a practically equal increase between
0.021-0.049. To achieve the ratios observed by De Winter & Dodou (2015) for
comparisons between 2013 and years more recent than 1990, the Type 1 error rate even needs
to be inflated more strongly for p-values
between 0.021-0.029 than for p-values
between 0.041-0.049. Such a pattern of Type 1 error rate inflation is
practically difficult to achieve, because questionable research practices (such
as performing multiple analyses on the same data with different outlier
criteria) produce a p-value
distribution where higher p-values
are observed more frequently than smaller p-values.
Thus, although it is not impossible to achieve the observed ratios purely by p-hacking (although it is very
challenging to reconstruct both the ratios and the absolute numbers), the required Type
1 error rate inflation over the 5 bins of p-values
is unlikely to occur in real life.
Table 4. Absolute number of reconstructed Type 1 errors between 0.001-0.049 in 1990 and 2013.

| Bin | True effects 1990 | True effects 2013 | Type 1 error rate 1990 | Type 1 errors 1990 | Type 1 error rate 2013 | Type 1 errors 2013 | Reconstructed 1990/2013 ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| p0.001-p0.009 | 1814 | 47784 | 0.008 | 90 | 0.015 | 4449 | 6.66 |
| p0.011-p0.019 | 492 | 12959 | 0.008 | 90 | 0.020 | 5932 | 7.89 |
| p0.021-p0.029 | 319 | 8399 | 0.008 | 90 | 0.025 | 7415 | 9.40 |
| p0.031-p0.039 | 238 | 6260 | 0.008 | 90 | 0.025 | 7415 | 10.14 |
| p0.041-p0.049 | 189 | 4988 | 0.008 | 90 | 0.027 | 8008 | 11.30 |
|
To
summarize, we can easily reconstruct the observed ratios by assuming a
relatively small decrease in power over the years (e.g., from 55% to 42%). Such
an assumption could be reasonable, as long as new research areas, or strongly
growing research areas, have lower power than average. One example of such a
research area is neuroscience, with a median power estimated to be as low as
21% (Button et al., 2013). On the other hand, while increases in Type 1 error
rates can be used to reconstruct the observed ratios, the pattern of inflated
Type 1 errors across the 5 bins of p-values
is unlikely to emerge in real life.
Therefore,
I conclude it is not true that there is a ‘surge of p-values between 0.041-0.049’, nor that these data suggest there is
an increase in questionable research practices over the last 25 years. The search for evidence of an increase in questionable
research practices is starting to mirror the search for the ether. After
repeatedly claiming to observe a rise in p-values
just below 0.05 without providing substantial evidence for such a rise (De
Winter & Dodou, 2015; Leggett et al., 2013; Masicampo & Lalande, 2012),
it is time for researchers investigating inflated Type 1 errors to use better models,
make better predictions, and collect better data. Analyzing huge numbers of p-values from studies with
huge heterogeneity will not provide any indication of the
prevalence of questionable research practices, not even when changes in p-value
distributions are analyzed over time. All that these papers are evidence of is a
peculiar prevalence of incorrect conclusions about p-value distributions.
References
- Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
- Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE 9(9): e105825. doi:10.1371/journal.pone.0105825
- Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66, 2303-2309. doi: 10.1080/17470218.2013.863371
- Lakens, D. (2014). What p-hacking really looks like: A comment on Masicampo and LaLande (2012). The Quarterly Journal of Experimental Psychology, (ahead-of-print), 1-4.
- Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65(11), 2271-2279.
- Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666-681.
- de Winter, J. C., & Dodou, D. (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ, 3, e733.
Nice analysis. I think part of the "problem" arises from the prevalence of Neyman-Pearson (vs. Fisherian) thinking about P-values, where we are fixated on the idea that P=0.049 and P=0.051 mean very different things. Fisher would not have approved! I mention this (admittedly not as clearly as I should have) in a recent post: https://scientistseessquirrel.wordpress.com/2015/02/09/in-defence-of-the-p-value/
Why is Fisher's significance testing framework necessarily better than Neyman and Pearson's? They have different goals. Fisher wants to quantify and measure evidence against a null (setting aside for the moment that p is not a valid measure of evidence), whereas Neyman and Pearson want to have rules that minimize error rates.
DeleteFisher didn't even believe that type-2 errors were possible! If you side with Fisher you don't have a way to calculate power.
A brief reply is available here
https://sites.google.com/site/jcfdewinter/Lakens_reply.pdf