An extended version of this blog post is now in press at PeerJ.
TL;DR version: De Winter and Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values between 0.041-0.049 in recent decades’ which 'suggests (but does not prove) questionable research practices have increased over the past 25 years'. I show the changes over the years in the ratios of p-values between 0.041-0.049 are better explained by a model of p-value distributions that assumes the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias over the years (cf. Fanelli, 2012), which has led to a relative decrease of 'marginally significant' p-values in the literature (instead of an increase in p-values just below 0.05). I (again, see Lakens, 2014) explain why researchers analyzing large numbers of p-values in the scientific literature need to develop better models of p-value distributions before drawing conclusions about questionable research practices. I want to thank De Winter and Dodou for sharing their data, assisting in the reanalysis, and reading an earlier version of this draft (to which they replied they were happy to see other researchers used the data to test alternative explanations, and that they did not see any technical mistakes in this blog post).
In recent years researchers have become more aware of how flexibility during the data analysis can increase false positive results (e.g., Simmons, Nelson, & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated because researchers analyze their data until a p-value smaller than 0.05 is observed, this might substantially decrease the robustness of scientific knowledge. However, as Stroebe and Strack (2014, p. 60) have pointed out: “Thus far, however, no solid data exist on the prevalence of such research practices”. Some researchers have attempted to provide some indication of the prevalence of questionable research practices by analyzing the distribution of p-values in the published literature. The idea is that questionable research practices lead to ‘a peculiar prevalence of p-values just below 0.05’ (Masicampo & Lalande, 2012) or the observation that ‘“just significant” results are on the rise’ (Leggett, Thomas, Loetscher, & Nicholls, 2013).
Despite the attention-grabbing titles of these publications, the reported data does not afford the strong conclusions these researchers have drawn. The observed pattern of a peak of p-values just below 0.05 in Leggett et al. (2013) does not replicate in other collected p-value distributions for the same journal in later years (Masicampo & Lalande, 2012), in psychology in general (Kühberger, Fritz, & Scherndl, 2014), or in scientific journals in general (De Winter & Dodou, 2015). The peak in p-values observed in Masicampo & Lalande (2012) is only surprising compared to an incorrectly modelled p-value distribution that ignores publication bias and its effect on the frequency of p-values above 0.05 (Lakens, 2014).
Recently, De Winter and Dodou (2015) have contributed to this emerging literature on p-value distributions and concluded that there is a ‘surge of p-values between 0.041-0.049 in recent decades’. They improved upon earlier approaches to analyze p-value distributions by comparing the percentage of p-values over time (from 1990-2013). Two observations in the data they collected could seduce researchers to draw conclusions about a rise of p-values just below a significance level of 0.05. The first observation is a much stronger rise in p-values between 0.041 and 0.049 than in p-values between 0.051-0.059. The second observation is that the percentage of p-values that falls between 0.041-0.049 has increased more from 1990 to 2013 than the percentage of p-values between 0.001-0.009, 0.011-0.019, 0.021-0.029, and 0.031-0.039 over the same years. The authors also analyze p-values with 2 digits (e.g., p = 0.04), which reveal similar patterns, but here I focus on the three-digit data. The three-digit bins run from, for example, 0.041 to 0.049 because trailing zeroes (e.g., p = 0.040) are rarely reported. The authors (2015, p. 37) conclude that: “The fact that p-values just below 0.05 exhibited the fastest increase among all p-value ranges we searched for suggests (but does not prove) that questionable research practices have increased over the past 25 years.”
I will explain why the data does not provide any indication of an increase in questionable research practices. First, I will discuss how the difference in the increase in p-values just below 0.05 and just above 0.05 is due to publication bias, where (perhaps surprisingly) p-values just above 0.05 are becoming relatively less likely to appear in the abstracts of published articles over the years. Second, I will explain why the relatively high increase in p-values between 0.041-0.049 over the years can easily be accounted for by a decrease in the average power of studies, but is unlikely to emerge due to an inflated Type 1 error rate due to questionable research practices. I want to explicitly note that it was possible to provide these alternative interpretations of the data mainly because the authors shared all data and analysis scripts online (http://dx.doi.org/10.7717/peerj.733/supp7) and were furthermore extremely responsive and helpful in answering a number of questions I had. While I criticize their interpretation of data, I applaud their adherence to open science principles (their Matlab code is an excellent example of reproducible statistics), which greatly facilitates cumulative science.
As I have discussed before (Lakens, 2014), it is essential to accurately model p-value distributions before drawing conclusions about p-values extracted from the scientific literature. Statements about p-value distributions require a definition of four parameters. First, researchers should specify the number of studies where H0 is true, and the number of studies where H1 is true. Second, researchers need to estimate the average power of the studies (or the average power of multiple subsets of studies, if heterogeneity in power is substantial). Third, the true Type 1 error rate and any possible mechanisms through which the error rate is inflated should be specified. And finally, publication bias, and a model of how the p-value distribution is affected by publication bias, should be proposed. It is important to look beyond simplistic comparisons between p-values just below 0.05 and p-values elsewhere in the distribution that are made without an explicit model of the four parameters that determine p-value distributions.
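To make the four parameters concrete, here is a minimal sketch in Python. It is not the spreadsheet model used for this post: it assumes every study is a two-sided z-test with α = 0.05, a single average power, a uniform multiplier on the Type 1 error rate below 0.05, and one publication probability for all non-significant results. The class `PValueModel` and all function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import NormalDist

STD_NORMAL = NormalDist()


@dataclass
class PValueModel:
    """Hypothetical container for the four ingredients of the model."""
    share_h1: float         # proportion of studies where H1 is true
    power: float            # average power at two-sided alpha = 0.05
    type1_inflation: float  # multiplier on the H0 rate below 0.05 (1.0 = no p-hacking)
    publish_nonsig: float   # probability that a result with p > 0.05 is published


def noncentrality(power, alpha=0.05):
    """Mean shift of a two-sided z-test that yields the requested power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    low, high = 0.0, 10.0
    for _ in range(100):  # bisection: power grows monotonically with the shift
        mid = (low + high) / 2
        achieved = 1 - STD_NORMAL.cdf(z_crit - mid) + STD_NORMAL.cdf(-z_crit - mid)
        if achieved < power:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def prob_p_below(p, shift):
    """P(two-sided p-value <= p) when the test statistic has mean `shift`."""
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    return 1 - STD_NORMAL.cdf(z - shift) + STD_NORMAL.cdf(-z - shift)


def expected_share(model, lower, upper):
    """Share of all studies published with lower < p <= upper.

    Assumes the bin lies entirely below or entirely above 0.05.
    """
    shift = noncentrality(model.power)
    from_h1 = model.share_h1 * (prob_p_below(upper, shift) - prob_p_below(lower, shift))
    from_h0 = (1 - model.share_h1) * (upper - lower)  # p is uniform under H0
    if upper <= 0.05:
        return from_h1 + from_h0 * model.type1_inflation
    return (from_h1 + from_h0) * model.publish_nonsig


# Illustrative parameter values only; none of these are estimates from the data.
strict = PValueModel(share_h1=0.5, power=0.55, type1_inflation=1.0, publish_nonsig=0.2)
lenient = PValueModel(share_h1=0.5, power=0.55, type1_inflation=1.0, publish_nonsig=0.8)
ratio_strict = expected_share(strict, 0.051, 0.059) / expected_share(strict, 0.061, 0.069)
ratio_lenient = expected_share(lenient, 0.051, 0.059) / expected_share(lenient, 0.061, 0.069)
print(round(ratio_strict, 3), round(ratio_lenient, 3))
```

In this sketch, publication bias that is equal for all p-values above 0.05 scales both bins by the same factor, so the two printed ratios are identical. A change in the ratio between neighbouring bins above 0.05 therefore requires bias that differs between those bins.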
Are p-values below 0.05 increasing, or p-values above 0.05 decreasing?
De Winter and Dodou (2015) show there is a relatively stronger increase in p-values between 0.041-0.049 than between 0.051-0.059 (see for example Figure 9, reproduced below). The data is clear, but the reason for this difference is not. Are p-values below 0.05 increasing more, or are p-values above 0.05 increasing less? A direct comparison is difficult, because the percentage of papers reporting p-values below 0.05 can increase due to an increase in p-hacking, but also due to an increase in publication bias. If publication bias increases, and people report fewer non-significant results, the percentage of papers reporting p-values smaller than 0.05 will also increase, even if there is no increase in p-hacking. Indeed, Fanelli (2012) has shown negative results have been disappearing from the literature between 1990-2007, which would explain the relative differences in p-values between 0.041-0.049 and 0.051-0.059 observed by De Winter and Dodou (2015).
We can examine the alternative explanation that the relative differences observed are due to publication bias increasing, instead of due to an increase in p-hacking, by comparing the relative differences between p-values between 0.031-0.039 and 0.041-0.049 over the years on the one hand, and 0.051-0.059 and 0.061-0.069 on the other hand. If there is an increase in p-hacking, the biggest differences should be observed below 0.05 (in line with the idea of a surge of p-values between 0.041-0.049). However, there are reasons to assume the biggest difference might occur in p-values just above 0.05. As Lakens (2014) noted, there seems to be some tolerance for p-values just above 0.05 to be published, as indicated by a higher prevalence of p-values between 0.051-0.059 than would be expected based on the power of statistical tests and an equal reduction of all p-values above 0.05. If publication bias becomes more severe, we might expect a reduction in the tolerance for p-values just above 0.05, which would lead to the largest changes in ratios above 0.05. The spreadsheets and data files used to reanalyze and reconstruct the data are available on the OSF.
Across the three time periods (1990-1997, 1998-2005, and 2006-2013) the ratio of p-values in the 0.03 range to p-values in the 0.04 range is pretty stable: 1.13, 1.09, and 1.11, respectively. The ratio of p-values in the 0.05 range to p-values in the 0.06 range is surprisingly large to begin with (given that purely based on power, p-values between 0.051-0.059 and 0.061-0.069 should occur approximately equally often in the literature), and shows a surprisingly large reduction over the years: 2.27, 1.94, and 1.79, respectively. The only larger reduction in ratios is observed for p-values between 0.001-0.009 (which is most likely due to differences in power over the years, as will be explained below). This surprisingly large change in ratios over time for p-values between 0.051-0.059 indicates that instead of a surge of p-hacking, publication bias has become more pronounced over the years for p-values just above the 0.05 level, which causes p-values just above 0.05 to increase relatively less over the years than p-values in all other bins (except for p-values below 0.009).
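The size of the two changes is easy to check with a few lines of arithmetic. The snippet below only restates the six ratios quoted above; the helper name `relative_change` is mine.

```python
# Ratios of adjacent p-value bins for 1990-1997, 1998-2005 and 2006-2013,
# copied from the paragraph above.
ratio_03_to_04 = [1.13, 1.09, 1.11]  # 0.031-0.039 relative to 0.041-0.049
ratio_05_to_06 = [2.27, 1.94, 1.79]  # 0.051-0.059 relative to 0.061-0.069


def relative_change(values):
    """Percentage change from the first to the last period."""
    return 100 * (values[-1] - values[0]) / values[0]


change_below = relative_change(ratio_03_to_04)
change_above = relative_change(ratio_05_to_06)
print(f"below 0.05: {change_below:.1f}%")  # below 0.05: -1.8%
print(f"above 0.05: {change_above:.1f}%")  # above 0.05: -21.1%
```

The ratio just below 0.05 drops by about 2% between the first and last period, while the ratio just above 0.05 drops by about 21%.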
This might be explained by the idea that p-values between 0.051-0.059 (or 'marginally significant' p-values) were more readily interpreted as support for the hypothesis in 1990-1997 than in 2006-2013. This idea is speculative, but seems likely given the increase in publication bias over the years (Fanelli, 2012). It should be noted that p-values just above the 0.05 level are still more frequent than can be explained just by the average power of the tests and publication bias that is equal for all p-values above 0.05 (cf. Lakens, 2014). In other words, this data is in line with the idea that publication bias is still slightly less severe for p-values just above 0.05, even though this benefit of p-values just above 0.05 has become smaller over the years.
This seems to be the driving force for the differences between p-values in the 0.041-0.049 range and p-values in the 0.051-0.059 range, reported by De Winter and Dodou (2015, e.g., Figures 9 and 10). To conclude, these observed differences provide no indication for a surge of p-values between 0.041-0.049 over the years due to an increase in questionable research practices.
How changes in average power over the years affect ratios of p-values below 0.05
The title of the article, “A surge of p-values between 0.041-0.049” is based on the observation that the ratio of p-values between 0.041-0.049 increases more than the ratio of p-values between 0.031-0.039, 0.021-0.029, and 0.011-0.019. There are no statistics reported to indicate whether these differences in ratios are statistically significant, nor are effect sizes reported to indicate whether the differences are practically significant (or justify the term ‘surge’), but the ratios do increase as you move from bins of low p-values between 0.001-0.009 to bins of high p-values between 0.041-0.049. Figure 23 reports the ratios of percentages of p-values in 1990 and 2013 for a range of search terms. Most interesting for the current purpose are the p-values between 0.001 and 0.049.
The first thing to understand is why these ratios are not close to 1. The reason is that there is a massive increase in the percentage of papers in which p-values are reported over the years. As De Winter & Dodou (2015, p. 15) note: “In 1990, 0.019% of papers (106 out of 563,023 papers) reported a p-value between 0.051 and 0.059. This increased 3.6-fold to 0.067% (1,549 out of 2,317,062 papers) in 2013. Positive results increased 10.3-fold in the same period: from 0.030% (171 out of 563,023 papers) in 1990 to 0.314% (7,266 out of 2,317,062 papers) in 2013.” This is not just an increase in the absolute number of reported p-values in abstracts (in which case the ratios could still be 1) but a relative 10.3-fold increase in how often p-values end up in abstracts. De Winter & Dodou (2015) demonstrate p-values are finding their way into more and more abstracts, which points to a possible increase in the overreliance on null-hypothesis testing in empirical articles. This is an important contribution to the literature, even when other claims about an increase in questionable research practices would not hold (also, the huge increase in the term 'paradigm shift' in abstracts over time is quite telling).
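The percentages and fold increases in the quoted passage follow directly from the counts it reports, as this short check shows. The variable and function names are mine; the numbers are those in the quote.

```python
papers = {1990: 563_023, 2013: 2_317_062}
just_above = {1990: 106, 2013: 1_549}  # abstracts with p between 0.051 and 0.059
positive = {1990: 171, 2013: 7_266}    # "positive results" in the quote


def pct(counts, year):
    """Percentage of all papers in `year` that fall in this category."""
    return 100 * counts[year] / papers[year]


def fold_increase(counts):
    """How many times larger the 2013 percentage is than the 1990 percentage."""
    return pct(counts, 2013) / pct(counts, 1990)


for label, counts in [("0.051-0.059", just_above), ("positive", positive)]:
    print(label, round(pct(counts, 1990), 3), round(pct(counts, 2013), 3),
          round(fold_increase(counts), 1))
```

Because the fold increase is a ratio of percentages rather than of raw counts, it already corrects for the roughly fourfold growth in the number of papers between 1990 and 2013.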
How can these differences between the ratios across the 5 bins below 0.05 be explained by a model of p-value distributions that consists of the ratio of true to false effects examined, power, the Type 1 error rate, and publication bias? We can only explain the relative differences between the ratios over the different bins of p-values if we allow at least one of the parameters of the model to change over time. We can ignore publication bias within the bins below 0.05, assuming all disciplines that report p-values in abstracts use α = 0.05 (this is not true, but we can assume it applies to the majority of articles that are analyzed). The two remaining possibilities are a change in the average power of studies over time, and an inflated Type 1 error rate over time, such as an increase in questionable research practices in the literature.
If we ignore Type 1 errors, we can relatively easily reconstruct the observed data purely based on differences in the average power across the years. I’m not arguing the numbers in this reconstruction reflect the truth. However, they show it is possible to model the ratios observed by De Winter & Dodou (2015) under the assumption that power differs from 1990 to 2013. For example, if we assume average power was 55% in 1990, and 42% in 2013, we can expect to observe the p-value distribution across the 5 bins as detailed in the table below, with 29.855% of the p-values falling between 0.001 and 0.009 in 1990, but only 19.926% of p-values falling between 0.001-0.009 in 2013 (which most likely explains the large differences in ratios between 0.001-0.009 discussed earlier). This is just the p-value distribution as a function of the power of the tests.
Table 1: Expected proportion of p-values between 0.001-0.049 based on 42% and 55% power.

| p-value bin | 1990 (55% power) | 2013 (42% power) |
|---|---|---|
| 0.001-0.009 | 0.299 | 0.199 |
| 0.011-0.019 | 0.085 | 0.072 |
| 0.021-0.029 | 0.056 | 0.051 |
| 0.031-0.039 | 0.042 | 0.040 |
| 0.041-0.049 | 0.034 | 0.033 |
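The qualitative pattern in Table 1 can be illustrated with a short Python sketch. It assumes every study is a two-sided z-test with α = 0.05 and uses the nominal bin edges, which is my simplification and not the calculation behind the table, so the proportions it prints will differ from Table 1 in detail. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
BINS = [(0.001, 0.009), (0.011, 0.019), (0.021, 0.029), (0.031, 0.039), (0.041, 0.049)]


def noncentrality(power, alpha=0.05):
    """Mean shift of a two-sided z-test that yields the requested power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    low, high = 0.0, 10.0
    for _ in range(100):  # bisection: power grows monotonically with the shift
        mid = (low + high) / 2
        achieved = 1 - STD_NORMAL.cdf(z_crit - mid) + STD_NORMAL.cdf(-z_crit - mid)
        if achieved < power:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def prob_p_below(p, shift):
    """P(two-sided p-value <= p) when the test statistic has mean `shift`."""
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    return 1 - STD_NORMAL.cdf(z - shift) + STD_NORMAL.cdf(-z - shift)


def bin_shares(power):
    """Proportion of studies with a true effect whose p-value lands in each bin."""
    shift = noncentrality(power)
    return [prob_p_below(hi, shift) - prob_p_below(lo, shift) for lo, hi in BINS]


shares_1990 = bin_shares(0.55)  # assumed average power in 1990
shares_2013 = bin_shares(0.42)  # assumed average power in 2013
relative = [late / early for early, late in zip(shares_1990, shares_2013)]

for (lo, hi), early, late, rel in zip(BINS, shares_1990, shares_2013, relative):
    print(f"{lo:.3f}-{hi:.3f}  {early:.3f}  {late:.3f}  {rel:.2f}")
```

Under these assumptions, lowering power removes proportionally more p-values from the lowest bin than from the bins closer to 0.05, so the 2013/1990 proportion rises steadily from the first bin to the last. This is the same ordering as in Table 1.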

If we incorporate the fact that the percentage of p-values reported in the abstract has increased tenfold over the years, from 1% to 10% (columns 2 and 3 in Table 2 below), and use 563,023 as the total number of studies in 1990 and 2,317,062 as the total number of studies in 2013 (taken from De Winter & Dodou, 2015), then we should expect the total number of observed p-values in 1990 and 2013 as displayed in columns 4 and 5 below. These numbers mirror the observed frequencies (columns 6 and 7) reported by De Winter and Dodou (2015).
Table 2. Absolute number of reconstructed and observed p-values between 0.001-0.049 from 1990 to 2013.

| p-value bin | % p-values in abstract (1990) | % p-values in abstract (2013) | reconstructed p-values 1990 | reconstructed p-values 2013 | observed p-values 1990 | observed p-values 2013 |
|---|---|---|---|---|---|---|
| 0.001-0.009 | 0.01 | 0.1 | 1681 | 46170 | 1770 | 44970 |
| 0.011-0.019 | 0.01 | 0.1 | 481 | 16728 | 462 | 14885 |
| 0.021-0.029 | 0.01 | 0.1 | 316 | 11725 | 268 | 10630 |
| 0.031-0.039 | 0.01 | 0.1 | 238 | 9210 | 240 | 9108 |
| 0.041-0.049 | 0.01 | 0.1 | 191 | 7646 | 178 | 8250 |
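As a check on how the reconstructed counts in Table 2 arise, the sketch below multiplies the total number of papers by the assumed share of abstracts reporting p-values and by the expected proportion in a bin. This formula is my reading of the reconstruction. It is shown for the 0.001-0.009 bin only, because the text gives the unrounded proportions for that bin (29.855% and 19.926%).

```python
papers = {1990: 563_023, 2013: 2_317_062}
share_reporting = {1990: 0.01, 2013: 0.10}      # columns 2 and 3 of Table 2
bin_proportion = {1990: 0.29855, 2013: 0.19926}  # 0.001-0.009 bin, from the text


def reconstructed_count(year):
    """Expected number of abstracts in `year` with a p-value in the bin."""
    return papers[year] * share_reporting[year] * bin_proportion[year]


print(round(reconstructed_count(1990)))  # 1681
print(round(reconstructed_count(2013)))  # 46170
```

Both values match the first row of the reconstructed columns. Using the rounded proportions from Table 1 for the other bins gives counts that are close to, but not exactly, the tabled values.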

When we calculate the ratios of the reconstructed p-values, we see in Table 3 they approach the general pattern of the ratios observed by De Winter and Dodou (2015). The reconstruction is not perfect, for a number of reasons. First of all, there is very little data from 1990, which will lead to substantial variation between expected and observed frequencies for any model (the fit of the model increases for comparisons between years where there is more data available). For example, the fact that the difference in the percentage of p-values in the 0.021-0.029 bin from 1990 to 2013 is larger than for p-values in the 0.031-0.039 bin is only true in 1990 and 2008, but is reversed (as predicted by a model of p-value distributions where power changes over time) in the remaining 21 comparisons of 2013 with each preceding year.
Table 3. Ratios of reconstructed and observed p-values between 0.001-0.049 from 1990 to 2013.

| p-value bin | reconstructed ratio N/T 1990 | reconstructed ratio N/T 2013 | reconstructed 1990/2013 Ratio | observed ratio N/T 1990 | observed ratio N/T 2013 | observed 1990/2013 Ratio |
|---|---|---|---|---|---|---|
| 0.001-0.009 | 0.306 | 1.993 | 6.674 | 0.315 | 1.945 | 6.17 |
| 0.011-0.019 | 0.085 | 0.722 | 8.454 | 0.082 | 0.644 | 7.83 |
| 0.021-0.029 | 0.056 | 0.506 | 9.017 | 0.048 | 0.460 | 9.63 |
| 0.031-0.039 | 0.042 | 0.398 | 9.417 | 0.043 | 0.394 | 9.21 |
| 0.041-0.049 | 0.034 | 0.330 | 9.740 | 0.032 | 0.367 | 11.28 |
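The observed ratios in the last column can be approximated from the observed counts in Table 2 and the yearly totals. In this sketch I take "N/T" to mean the number of abstracts in a bin as a percentage of all papers in that year, which is my reading of the column header. Small differences from the tabled values remain, presumably due to rounding.

```python
papers = {1990: 563_023, 2013: 2_317_062}
observed = {  # bin: (count in 1990, count in 2013), observed columns of Table 2
    "0.001-0.009": (1770, 44970),
    "0.011-0.019": (462, 14885),
    "0.021-0.029": (268, 10630),
    "0.031-0.039": (240, 9108),
    "0.041-0.049": (178, 8250),
}


def pct_of_papers(count, year):
    """Abstracts in the bin as a percentage of all papers in `year`."""
    return 100 * count / papers[year]


def growth_ratio(count_1990, count_2013):
    """2013 percentage divided by 1990 percentage."""
    return pct_of_papers(count_2013, 2013) / pct_of_papers(count_1990, 1990)


ratios = {bin_: growth_ratio(*counts) for bin_, counts in observed.items()}
for bin_, ratio in ratios.items():
    print(bin_, round(ratio, 2))
```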

Similarly, when comparing 2013 to each of the 23 preceding years, the ratio is higher for p-values between 0.041-0.049 than for 0.031-0.039 in 12 out of 23 comparisons (only just more than 50% of the time), which can hardly be called a ‘surge’. The model based on power differences predicts that ratios for p-values between 0.031-0.039 should be very similar to those between 0.041-0.049. Given the small percentages of articles that report p-values and the variation inherent in observed p-value distributions, it is not surprising the ratios for 0.041-0.049 are only just more than 50% likely to be higher than those for p-values between 0.031-0.039. This observation is more difficult to explain based on the idea that questionable research practices have increased, which typically assumes p-values between 0.041-0.049 increase more strongly than p-values between 0.031-0.039 (e.g., Leggett et al., 2013; Masicampo & Lalande, 2012).
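To put 12 out of 23 in perspective, a rough sign-test style calculation asks how often 12 or more 'wins' would occur if each comparison were a fair coin flip. Treating the 23 comparisons as independent is an assumption that does not strictly hold, because every comparison involves the same 2013 data, so this is an illustration rather than a formal test.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_12_or_more = prob_at_least(12, 23)
print(p_12_or_more)  # 0.5
```

With an odd number of comparisons, 12 or more wins is exactly as likely as 11 or fewer when there is no difference between the bins, so the observed count is as unremarkable as it could be.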
Obviously this model is too simplistic. It does not include any Type 1 errors, and it assumes homogeneity in the power of the performed tests. We can be certain power varies substantially across studies and research disciplines, and we can be certain there are a number of Type 1 errors in the literature. For the current purpose, which is to demonstrate the observed pattern can be reconstructed by assuming the average power has changed over time, a more advanced model is not required, but future attempts to provide support for an increase in Type 1 errors, or attempts to calculate average effect sizes based on p-value distributions (e.g., Simonsohn, Nelson, & Simmons, 2014) need to develop more detailed models of p-value distributions.
Let’s assume the average power has not changed over time, and try to reconstruct the observed ratios by changing the Type 1 error rates. As long as the Type 1 error rates are the same for each bin of p-values, the ratios equal the overall increase in p-values reported in abstracts over time. To reconstruct the ratios as observed by De Winter and Dodou (2015), we need to assume p-hacking leads to a stronger increase in higher p-values than in lower p-values. Although this is a reasonable assumption under many types of p-hacking, it turns out that the specific pattern of inflated Type 1 error rates required to reconstruct the observed ratios is not very likely to emerge in real life.
To simulate the impact of questionable research practices, we need to decide upon the ratio of studies where H0 is true and studies where H1 is true, and the exact increase in Type 1 error rates for each bin of p-values below 0.05. Type 1 errors come exclusively from analyzing results of studies where H0 is true (p-hacking when H1 is true inflates the effect size estimate, and thus can be seen as an incorrect way to increase the power of a test). In the calculations below, power is kept constant, but p-hacking is introduced. This is the equivalent of the true power of studies reducing over the years, which is exactly compensated by an inflated Type 1 error rate.
The observed ratios by De Winter & Dodou (2015) show the ratio is the smallest for p-values between 0.001-0.009, and substantially higher for p-values between 0.011 and 0.049, with a relatively small increase in these 4 bins. This pattern can be reproduced just based on inflated Type 1 errors, but the required increase in Type 1 error rates over the 5 bins is very unlikely to occur when p-hacking.
The higher the average power of statistical tests, the more frequently small p-values will be observed if there is a true effect. This means there are more p-values between 0.021-0.029 than between 0.041-0.049 whenever the power is larger than 0. Without p-hacking, the percentage of Type 1 errors in each bin (e.g., between 0.001 and 0.009) should be 0.8% (it is 1% between 0 and 0.01). If we assume this was the situation in 1990 (which is a conservative, albeit unlikely, estimate), the Type 1 error rates need to be increased to higher levels to reproduce the observed ratios, after selecting the average power of the studies, and the ratio of studies where H0 is true and H1 is true. It becomes extremely difficult to reconstruct the observed absolute numbers and ratios.
One attempt to reconstruct the ratios (but not the absolute values) is presented in Table 4. The ratio of studies where H0 is true to studies where H1 is true is set to 1, and the average power is assumed to be 57.5%. The Type 1 error rate inflation over time is substantial, and the difference in the increase over the bins is not very typical, with a practically equal increase between 0.021-0.049. To achieve the ratios observed by De Winter & Dodou (2015) for comparisons between 2013 and years later than 1990, the Type 1 error rate even needs to be inflated more strongly for p-values between 0.021-0.029 than for p-values between 0.041-0.049. Such a pattern of Type 1 error rate inflation is practically difficult to achieve, because questionable research practices (such as performing multiple analyses on the same data with different outlier criteria) produce a p-value distribution where higher p-values are observed more frequently than smaller p-values. Thus, although it is not impossible to achieve the observed ratios purely by p-hacking (although it is very challenging to reconstruct both ratios and absolute numbers), the required Type 1 error rate inflation over the 5 bins of p-values is unlikely to occur in real life.
Table 4. Absolute number of reconstructed Type 1 errors between 0.001-0.049 from 1990 to 2013.

| p-value bin | 1990 true effects | 2013 true effects | Type 1 error rate 1990 | 1990 Type 1 errors | Type 1 error rate 2013 | 2013 Type 1 errors | Reconstructed 1990/2013 Ratio |
|---|---|---|---|---|---|---|---|
| 0.001-0.009 | 1814 | 47784 | 0.008 | 90 | 0.015 | 4449 | 6.66 |
| 0.011-0.019 | 492 | 12959 | 0.008 | 90 | 0.020 | 5932 | 7.89 |
| 0.021-0.029 | 319 | 8399 | 0.008 | 90 | 0.025 | 7415 | 9.40 |
| 0.031-0.039 | 238 | 6260 | 0.008 | 90 | 0.025 | 7415 | 10.14 |
| 0.041-0.049 | 189 | 4988 | 0.008 | 90 | 0.027 | 8008 | 11.30 |
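The final column of Table 4 can be approximated from the other columns by adding true effects and Type 1 errors within each year, dividing by the total number of papers in that year, and taking the 2013 to 1990 ratio. This combination rule is my inference from the numbers, so treat it as a sketch of how the column appears to be built.

```python
papers = {1990: 563_023, 2013: 2_317_062}
table4 = {  # bin: (true effects 1990, Type 1 errors 1990, true effects 2013, Type 1 errors 2013)
    "0.001-0.009": (1814, 90, 47784, 4449),
    "0.011-0.019": (492, 90, 12959, 5932),
    "0.021-0.029": (319, 90, 8399, 7415),
    "0.031-0.039": (238, 90, 6260, 7415),
    "0.041-0.049": (189, 90, 4988, 8008),
}


def reconstructed_ratio(true_1990, type1_1990, true_2013, type1_2013):
    """Growth in the share of papers with a p-value in the bin, 1990 to 2013."""
    rate_1990 = (true_1990 + type1_1990) / papers[1990]
    rate_2013 = (true_2013 + type1_2013) / papers[2013]
    return rate_2013 / rate_1990


ratios = {bin_: reconstructed_ratio(*row) for bin_, row in table4.items()}
for bin_, ratio in ratios.items():
    print(bin_, round(ratio, 2))
```

The printed values come out close to the tabled ratios. Note that in the 2013 columns the Type 1 errors outnumber the true effects in the three highest bins, which shows how much inflation this reconstruction requires.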

To summarize, we can easily reconstruct the observed ratios by assuming a relatively small decrease in power over the years (e.g., from 55% to 42%). Such an assumption could be reasonable, as long as new research areas, or strongly growing research areas, have lower power than average. One example of such a research area is neuroscience, with a median power estimated to be as low as 21% (Button et al., 2013). On the other hand, while increases in Type 1 error rates can be used to reconstruct the observed ratios, the pattern of inflated Type 1 errors across the 5 bins of p-values is unlikely to emerge in real life.
Therefore, I conclude it is not true that there is a ‘surge of p-values between 0.041-0.049’, nor that these data suggest there is an increase in questionable research practices over the last 25 years. The search for evidence of an increase in questionable research practices is starting to mirror the search for the ether. After repeatedly claiming to observe a rise in p-values just below 0.05 without providing substantial evidence for such a rise (De Winter & Dodou, 2015; Leggett et al., 2013; Masicampo & Lalande, 2012), it is time researchers investigating inflated Type 1 errors use better models, make better predictions, and collect better data. Analyzing huge numbers of p-values, which come from studies with huge heterogeneity, will not be able to provide any indication of the prevalence of questionable research practices, not even when changes of p-value distributions are analyzed over time. All these papers are evidence of is a peculiar prevalence of incorrect conclusions about p-value distributions.
References
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE, 9(9), e105825. doi:10.1371/journal.pone.0105825
Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66, 2303-2309. doi:10.1080/17470218.2013.863371
Lakens, D. (2014). What p-hacking really looks like: A comment on Masicampo and LaLande (2012). The Quarterly Journal of Experimental Psychology, (ahead-of-print), 1-4.
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65(11), 2271-2279.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666-681.
de Winter, J. C., & Dodou, D. (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ, 3, e733.
Nice analysis. I think part of the "problem" arises from the prevalence of Neyman-Pearson (vs. Fisherian) thinking about P-values, where we are fixated on the idea that P=0.049 and P=0.051 mean very different things. Fisher would not have approved! I mention this (admittedly not as clearly as I should have) in a recent post: https://scientistseessquirrel.wordpress.com/2015/02/09/indefenceofthepvalue/
Why is Fisher's significance testing framework necessarily better than Neyman and Pearson's? They have different goals. Fisher wants to quantify and measure evidence against a null (setting aside for the moment that p is not a valid measure of evidence), whereas Neyman and Pearson want to have rules that minimize error rates.
Fisher didn't even believe that type-2 errors were possible! If you side with Fisher you don't have a way to calculate power.
A brief reply is available here: https://sites.google.com/site/jcfdewinter/Lakens_reply.pdf