Tuesday, February 17, 2015

A peculiar surge of incorrect conclusions about the prevalence of p-values just below .05

An extended version of this blog post is now in press at PeerJ.

TL;DR version: De Winter and Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values between 0.041-0.049 in recent decades’, which ‘suggests (but does not prove) questionable research practices have increased over the past 25 years’. I show that the changes over the years in the ratios of p-values between 0.041-0.049 are better explained by a model of p-value distributions that assumes the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias over the years (cf. Fanelli, 2012), which has led to a relative decrease of ‘marginally significant’ p-values in the literature (instead of an increase in p-values just below 0.05). I explain (again, see Lakens, 2014) why researchers analyzing large numbers of p-values in the scientific literature need to develop better models of p-value distributions before drawing conclusions about questionable research practices. I want to thank De Winter and Dodou for sharing their data, assisting in the re-analysis, and reading an earlier version of this draft (to which they replied they were happy to see other researchers use the data to test alternative explanations, and that they did not see any technical mistakes in this blog post).

In recent years researchers have become more aware of how flexibility during the data-analysis can increase false positive results (e.g., Simmons, Nelson, & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated because researchers analyze their data until a p-value smaller than 0.05 is observed, this could seriously undermine the robustness of scientific knowledge. However, as Stroebe and Strack (2014, p. 60) have pointed out: “Thus far, however, no solid data exist on the prevalence of such research practices”. Several researchers have attempted to provide an indication of the prevalence of questionable research practices by analyzing the distribution of p-values in the published literature. The idea is that questionable research practices lead to ‘a peculiar prevalence of p-values just below 0.05’ (Masicampo & Lalande, 2012) or to the observation that ‘“just significant” results are on the rise’ (Leggett, Thomas, Loetscher, & Nicholls, 2013).

Despite the attention-grabbing titles of these publications, the reported data does not afford the strong conclusions these researchers have drawn. The observed pattern of a peak of p-values just below 0.05 in Leggett et al. (2013) does not replicate in other collected p-value distributions for the same journal in later years (Masicampo & Lalande, 2012), in psychology in general (Kühberger, Fritz, & Scherndl, 2014), or in scientific journals in general (De Winter & Dodou, 2015). The peak in p-values observed in Masicampo and Lalande (2012) is only surprising compared to an incorrectly modelled p-value distribution that ignores publication bias and its effect on the frequency of p-values above 0.05 (Lakens, 2014).

Recently, De Winter and Dodou (2015) have contributed to this emerging literature on p-value distributions and concluded that there is a ‘surge of p-values between 0.041-0.049 in recent decades’. They improved upon earlier approaches to analyzing p-value distributions by comparing the percentage of p-values over time (from 1990-2013). Two observations in the data they collected could seduce researchers into drawing conclusions about a rise of p-values just below a significance level of 0.05. The first observation is a much stronger rise in p-values between 0.041-0.049 than in p-values between 0.051-0.059. The second observation is that the percentage of p-values that falls between 0.041-0.049 has increased more from 1990 to 2013 than the percentage of p-values between 0.001-0.009, 0.011-0.019, 0.021-0.029, and 0.031-0.039 over the same years (the authors also analyze p-values with 2 digits (e.g., p = 0.04), which reveal similar patterns, but here I focus on the three-digit data, which consist of bins such as 0.041-0.049 because trailing zeroes (e.g., p = 0.040) are rarely reported). The authors (2015, p. 37) conclude that: “The fact that p-values just below 0.05 exhibited the fastest increase among all p-value ranges we searched for suggests (but does not prove) that questionable research practices have increased over the past 25 years.”

I will explain why the data does not provide any indication of an increase in questionable research practices. First, I will discuss how the difference between the increase in p-values just below 0.05 and just above 0.05 is due to publication bias, where (perhaps surprisingly) p-values just above 0.05 have become relatively less likely to appear in the abstracts of published articles over the years. Second, I will explain why the relatively strong increase in p-values between 0.041-0.049 over the years can easily be accounted for by a decrease in the average power of studies, but is unlikely to emerge from a Type 1 error rate inflated by questionable research practices. I want to explicitly note that it was possible to provide these alternative interpretations of the data mainly because the authors shared all data and analysis scripts online (http://dx.doi.org/10.7717/peerj.733/supp-7) and were furthermore extremely responsive and helpful in answering a number of questions I had. While I criticize their interpretation of the data, I applaud their adherence to open science principles (their Matlab code is an excellent example of reproducible statistics), which greatly facilitates cumulative science.

As I have discussed before (Lakens, 2014), it is essential to accurately model p-value distributions before drawing conclusions about p-values extracted from the scientific literature. Statements about p-value distributions require a definition of four parameters. First, researchers should specify the number of studies where H0 is true, and the number of studies where H1 is true. Second, researchers need to estimate the average power of the studies (or the average power of multiple subsets of studies, if heterogeneity in power is substantial). Third, the true Type 1 error rate, and any possible mechanisms through which the error rate is inflated, should be specified. And finally, publication bias, and a model of how the p-value distribution is affected by publication bias, should be proposed. Simplistic comparisons between p-values just below 0.05 and p-values elsewhere in the distribution are not informative outside the scope of an explicit model of these four parameters.
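As a concrete illustration, the sketch below (my own simplification, not taken from any of the papers discussed here) computes the expected share of published p-values in a given bin from hypothetical values for these four parameters, assuming two-sided z-tests and a crude publication filter that suppresses a fixed proportion of non-significant results.

```python
# A minimal sketch (my own simplification, not from any of the cited papers) of
# the four-parameter model described above. All parameter defaults are
# hypothetical placeholders.
from scipy.stats import norm

def expected_share(lo, hi, prop_h1=0.5, power=0.5, alpha=0.05, pub_bias=0.9):
    """Expected share of *published* p-values falling in the bin (lo, hi),
    assuming two-sided z-tests.

    prop_h1  : proportion of studies in which H1 is true
    power    : average power of the studies testing a true effect
    alpha    : nominal Type 1 error rate (p-values are uniform under H0)
    pub_bias : probability that a non-significant result remains unpublished
    """
    ncp = norm.ppf(1 - alpha / 2) - norm.ppf(1 - power)     # noncentrality matching the power
    cdf_h1 = lambda p: norm.sf(norm.ppf(1 - p / 2) - ncp)   # P(p-value < p | H1 true)

    def mass(a, b):
        # Expected mass of p-values in (a, b) before publication bias ...
        h1 = prop_h1 * (cdf_h1(b) - cdf_h1(a))
        h0 = (1 - prop_h1) * (b - a)
        # ... thinned by a crude publication filter (bins must not straddle alpha).
        weight = 1.0 if b <= alpha else (1 - pub_bias)
        return weight * (h1 + h0)

    return mass(lo, hi) / (mass(0, alpha) + mass(alpha, 1))

print(expected_share(0.041, 0.049))  # share of published p-values between 0.041-0.049
print(expected_share(0.051, 0.059))  # share in the bin just above the significance level
```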

Are p-values below 0.05 increasing, or p-values above 0.05 decreasing?

De Winter and Dodou (2015) show there is a relatively stronger increase in p-values between 0.041-0.049 than between 0.051-0.059 (see for example Figure 9, reproduced below). The data is clear, but the reason for this difference is not. Are p-values below 0.05 increasing more, or are p-values above 0.05 increasing less? A direct comparison is difficult, because the percentage of papers reporting p-values below 0.05 can increase due to an increase in p-hacking, but also due to an increase in publication bias. If publication bias increases, and people report fewer non-significant results, the percentage of papers reporting p-values smaller than 0.05 will also increase, even if there is no increase in p-hacking. Indeed, Fanelli (2012) has shown negative results have been disappearing from the literature between 1990-2007, which would explain the relative differences between p-values in the 0.041-0.049 and 0.051-0.059 ranges observed by De Winter and Dodou (2015).

[Figure 9 from De Winter and Dodou (2015), reproduced from the original article.]

We can examine the alternative explanation that the relative differences observed are due to an increase in publication bias, instead of an increase in p-hacking, by comparing the relative differences between p-values between 0.031-0.039 and 0.041-0.049 over the years on the one hand, and between 0.051-0.059 and 0.061-0.069 on the other hand. If there is an increase in p-hacking, the biggest differences should be observed below 0.05 (in line with the idea of a surge of p-values between 0.041-0.049). However, there are reasons to assume the biggest difference might occur in p-values just above 0.05. As Lakens (2014) noted, there seems to be some tolerance for p-values just above 0.05 to be published, as indicated by a higher prevalence of p-values between 0.051-0.059 than would be expected based on the power of statistical tests and an equal reduction of all p-values above 0.05. If publication bias becomes more severe, we might expect a reduction in the tolerance for p-values just above 0.05, which would lead to the largest changes in ratios just above 0.05. The spreadsheets and datafiles used to re-analyze and reconstruct the data are available on the OSF.

Across the three time periods (1990-1997, 1998-2005, and 2006-2013) the ratio of p-values in the 0.03 range to p-values in the 0.04 range is pretty stable: 1.13, 1.09, and 1.11, respectively. The ratio of p-values in the 0.05 range to p-values in the 0.06 range is surprisingly large to begin with (given that, purely based on power, p-values between 0.051-0.059 and 0.061-0.069 should occur approximately equally often in the literature), and shows a substantial reduction over the years: 2.27, 1.94, and 1.79, respectively. The only larger reduction in ratios is observed for p-values between 0.001-0.009 (which is most likely due to differences in power over the years, as will be explained below). This large change in ratios over time for p-values between 0.051-0.059 indicates that, instead of a surge of p-hacking, publication bias has become more pronounced over the years for p-values just above the 0.05 level, which causes p-values just above 0.05 to increase relatively less over the years than p-values in all other bins (except for p-values below 0.009).

This might be explained by the idea that p-values between 0.051-0.059 (or ‘marginally significant’ p-values) were more readily interpreted as support for the hypothesis in 1990-1997 than in 2006-2013. This idea is speculative, but seems likely given the increase in publication bias over the years (Fanelli, 2012). It should be noted that p-values just above the 0.05 level are still more frequent than can be explained by the average power of the tests and a publication bias that is equal for all p-values above 0.05 (cf. Lakens, 2014). In other words, this data is in line with the idea that publication bias is still slightly less severe for p-values just above 0.05, even though this benefit of p-values just above 0.05 has become smaller over the years.

This seems to be the driving force for the differences between p-values in the 0.041-0.049 range and p-values in the 0.051-0.059 range, reported by De Winter and Dodou (2015, e.g., Figures 9 and 10). To conclude, these observed differences provide no indication for a surge of p-values between 0.041-0.049 over the years due to an increase in questionable research practices.

How changes in average power over the years affect ratios of p-values below 0.05

The title of the article, “A surge of p-values between 0.041-0.049”, is based on the observation that the ratio of p-values between 0.041-0.049 increases more than the ratios of p-values between 0.031-0.039, 0.021-0.029, and 0.011-0.019. There are no statistics reported to indicate whether these differences in ratios are statistically significant, nor are effect sizes reported to indicate whether the differences are practically significant (or justify the term ‘surge’), but the ratios do increase as you move from the bin of low p-values between 0.001-0.009 to the bin of high p-values between 0.041-0.049. Figure 23 reports the ratios of percentages of p-values in 1990 and 2013 for a range of search terms. Most interesting for the current purpose are the p-values between 0.001 and 0.049.

[Figure 23 from De Winter and Dodou (2015): ratios of the percentages of p-values in 1990 and 2013 for a range of search terms.]

The first thing to understand is why these ratios are not close to 1. The reason is that there is a massive increase in the percentage of papers in which p-values are reported over the years. As De Winter and Dodou (2015, p. 15) note: “In 1990, 0.019% of papers (106 out of 563,023 papers) reported a p-value between 0.051 and 0.059. This increased 3.6-fold to 0.067% (1,549 out of 2,317,062 papers) in 2013. Positive results increased 10.3-fold in the same period: from 0.030% (171 out of 563,023 papers) in 1990 to 0.314% (7,266 out of 2,317,062 papers) in 2013.” This is not just an increase in the absolute number of reported p-values in abstracts (in which case the ratios could still be 1) but a relative 10.3-fold increase in how often p-values end up in abstracts. De Winter and Dodou (2015) demonstrate that p-values are finding their way into more and more abstracts, which points to a possible increase in the overreliance on null-hypothesis significance testing in empirical articles. This is an important contribution to the literature, even if other claims about an increase in questionable research practices do not hold (also, the huge increase of the term 'paradigm shift' in abstracts over time is quite telling).

How can these differences between the ratios across the 5 bins below 0.05 be explained by a model of p-value distributions that consists of the ratio of true to false effects examined, power, the Type 1 error rate, and publication bias? We can only explain the relative differences between the ratios over the different bins of p-values if we allow at least one of the parameters of the model to change over time. We can ignore publication bias, assuming all disciplines that report p-values in abstracts use α = 0.05, so that publication bias affects all bins below 0.05 to the same extent (this is not true, but we can assume it applies to the majority of articles that are analyzed). The two remaining possibilities are a change in the average power of studies over time, and a Type 1 error rate that is increasingly inflated over time, for example through an increase in questionable research practices in the literature.

If we ignore Type 1 errors, we can relatively easily reconstruct the observed data purely based on differences in the average power across the years. I’m not arguing the numbers in this reconstruction reflect the truth. However, they show it is possible to model the ratios observed by De Winter and Dodou (2015) under the assumption that power differs from 1990 to 2013. For example, if we assume average power was 55% in 1990, and 42% in 2013, we can expect to observe the p-value distribution across the 5 bins as detailed in the table below, with 29.855% of the p-values falling between 0.001-0.009 in 1990, but only 19.926% of p-values falling between 0.001-0.009 in 2013 (which most likely explains the large differences in ratios between 0.001-0.009 discussed earlier). This is just the p-value distribution as a function of the power of the tests.

Table 1: Expected percentage of p-values between 0.001-0.049 based on 55% (1990) and 42% (2013) power.

Bin             1990 (55% power)   2013 (42% power)
p0.001-p0.009   0.299              0.199
p0.011-p0.019   0.085              0.072
p0.021-p0.029   0.056              0.051
p0.031-p0.039   0.042              0.040
p0.041-p0.049   0.034              0.033
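The expected proportions in Table 1 follow from the assumed average power. The sketch below shows one way to compute such proportions, assuming two-sided z-tests; because the original reconstruction used its own spreadsheet conventions for the bin boundaries, the numbers will not match Table 1 exactly, but the qualitative pattern is the same: lowering power drains the smallest p-value bins far more than the bins just below 0.05.

```python
# A sketch (assuming two-sided z-tests) of how the expected proportion of
# p-values per bin follows from the average power; exact values depend on the
# test and bin conventions used, so this only approximates Table 1.
from scipy.stats import norm

def prob_p_below(p, ncp):
    """P(p-value < p) for a two-sided z-test with noncentrality ncp."""
    return norm.sf(norm.ppf(1 - p / 2) - ncp)

bins = [(0.001, 0.009), (0.011, 0.019), (0.021, 0.029), (0.031, 0.039), (0.041, 0.049)]

for year, power in [(1990, 0.55), (2013, 0.42)]:
    ncp = norm.ppf(1 - 0.05 / 2) - norm.ppf(1 - power)  # noncentrality yielding this power at alpha = .05
    shares = [prob_p_below(hi, ncp) - prob_p_below(lo, ncp) for lo, hi in bins]
    print(year, [round(s, 3) for s in shares])
```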

If we incorporate the fact that the proportion of papers reporting a p-value in the abstract has increased roughly 10-fold over the years (columns 2 and 3 in Table 2 below), and use 563,023 as the total number of papers in 1990 and 2,317,062 as the total number of papers in 2013 (taken from De Winter & Dodou, 2015), then we should expect the number of p-values in 1990 and 2013 as displayed in the ‘reconstructed’ columns below. These reconstructed numbers closely mirror the frequencies observed by De Winter and Dodou (2015) in the last two columns.

Table 2. Absolute number of reconstructed and observed p-values between 0.001-0.049 from 1990 to 2013.

Bin             prop. with p in abstract 1990   prop. with p in abstract 2013   reconstructed 1990   reconstructed 2013   observed 1990   observed 2013
p0.001-p0.009   0.01                            0.1                             1681                 46170                1770            44970
p0.011-p0.019   0.01                            0.1                             481                  16728                462             14885
p0.021-p0.029   0.01                            0.1                             316                  11725                268             10630
p0.031-p0.039   0.01                            0.1                             238                  9210                 240             9108
p0.041-p0.049   0.01                            0.1                             191                  7646                 178             8250
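The arithmetic behind Table 2 is straightforward, as the sketch below illustrates: the expected count in a bin is the total number of papers, multiplied by the assumed proportion of papers reporting a p-value in the abstract, multiplied by the bin proportion from Table 1. Because the proportions in Table 1 are rounded, the resulting counts match the reconstructed columns of Table 2 only approximately.

```python
# A sketch of the arithmetic behind the reconstructed columns of Table 2.
bins = ["0.001-0.009", "0.011-0.019", "0.021-0.029", "0.031-0.039", "0.041-0.049"]
prop_1990 = [0.299, 0.085, 0.056, 0.042, 0.034]   # bin proportions at 55% power (Table 1)
prop_2013 = [0.199, 0.072, 0.051, 0.040, 0.033]   # bin proportions at 42% power (Table 1)

papers_1990, papers_2013 = 563023, 2317062        # totals from De Winter & Dodou (2015)
frac_abstract_1990, frac_abstract_2013 = 0.01, 0.10  # assumed proportion of papers with a p-value in the abstract

for b, p90, p13 in zip(bins, prop_1990, prop_2013):
    n90 = papers_1990 * frac_abstract_1990 * p90
    n13 = papers_2013 * frac_abstract_2013 * p13
    print(f"{b}: reconstructed 1990 = {n90:.0f}, reconstructed 2013 = {n13:.0f}")
```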

When we calculate the ratios of the observed p-values, we see in Table 3 that they approach the general pattern of the ratios observed by De Winter and Dodou (2015). The reconstruction is not perfect, for a number of reasons. First of all, there is very little data from 1990, which will lead to substantial variation between expected and observed frequencies for any model (the fit of the model increases for comparisons between years for which more data are available). For example, the observation that the increase in the percentage of p-values in the 0.021-0.029 bin from 1990 to 2013 is larger than the increase for p-values in the 0.031-0.039 bin holds only for comparisons with 1990 and 2008, but is reversed (as predicted by a model of p-value distributions where power changes over time) in the remaining 21 comparisons of 2013 with each preceding year.

Table 3. Ratios of reconstructed and observed p-values between 0.001-0.049 from 1990 to 2013. N/T = number of p-values in the bin as a percentage of all papers in that year.

Bin             reconstructed N/T 1990   reconstructed N/T 2013   reconstructed 1990/2013 ratio   observed N/T 1990   observed N/T 2013   observed 1990/2013 ratio
p0.001-p0.009   0.306                    1.993                    6.674                           0.315               1.945               6.17
p0.011-p0.019   0.085                    0.722                    8.454                           0.082               0.644               7.83
p0.021-p0.029   0.056                    0.506                    9.017                           0.048               0.460               9.63
p0.031-p0.039   0.042                    0.398                    9.417                           0.043               0.394               9.21
p0.041-p0.049   0.034                    0.330                    9.740                           0.032               0.367               11.28
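The ratio columns in Table 3 follow directly from the counts in Table 2, as the sketch below shows for the observed values: each count is expressed as a percentage of all papers in that year (the N/T columns), and the 1990/2013 ratio is the 2013 percentage divided by the 1990 percentage. Small discrepancies with Table 3 reflect rounding in the underlying counts.

```python
# A sketch of the ratio arithmetic in Table 3, using the observed counts from Table 2.
bins = ["0.001-0.009", "0.011-0.019", "0.021-0.029", "0.031-0.039", "0.041-0.049"]
observed_1990 = [1770, 462, 268, 240, 178]
observed_2013 = [44970, 14885, 10630, 9108, 8250]
papers_1990, papers_2013 = 563023, 2317062

for b, n90, n13 in zip(bins, observed_1990, observed_2013):
    pct90 = 100 * n90 / papers_1990   # N/T 1990, as a percentage of all papers
    pct13 = 100 * n13 / papers_2013   # N/T 2013
    print(f"{b}: N/T 1990 = {pct90:.3f}, N/T 2013 = {pct13:.3f}, 1990/2013 ratio = {pct13 / pct90:.2f}")
```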

Similarly, when comparing 2013 to each of the 23 preceding years, the ratio is higher for p-values between 0.041-0.049 than for 0.031-0.039 in 12 out of 23 comparisons – only just more than 50% of the time, which can hardly be called a ‘surge’. The model based on power differences predicts that ratios for p-values between 0.031-0.039 should be very similar to those between 0.041-0.049. Given the small percentages of articles that report p-values and the variation inherent in observed p-value distributions, it is not surprising the ratios for 0.041-0.049 are only just more than 50% likely to be higher than those for p-values between 0.031-0.039. This observation is more difficult to explain based on the idea that questionable research practices have increased, which typically assumes p-values between 0.041-0.049 increase more strongly than p-values between 0.031-0.039 (e.g., Leggett et al., 2013; Masicampo & Lalande, 2012).

Obviously this model is too simplistic. It does not include any Type 1 errors, and it assumes homogeneity in the power of the performed tests. We can be certain power varies substantially across studies and research disciplines, and we can be certain there are a number of Type 1 errors in the literature. For the current purpose, which is to demonstrate the observed pattern can be reconstructed by assuming the average power has changed over time, a more advanced model is not required, but future attempts to provide support for an increase in Type 1 errors, or attempts to calculate average effect sizes based on p-value distributions (e.g., Simonsohn, Nelson, & Simmons, 2014) need to develop more detailed models of p-value distributions.

Let’s assume the average power has not changed over time, and try to reconstruct the observed ratios by changing the Type 1 error rates. As long as the Type 1 error rates are the same for each bin of p-values, the ratios equal the overall increase in p-values reported in abstracts over time. To reconstruct the ratios as observed by De Winter and Dodou (2015), we need to assume p-hacking leads to a stronger increase in higher p-values than in lower p-values. Although this is a reasonable assumption under many types of p-hacking, it turns out that the specific pattern of inflated Type 1 error rates required to reconstruct the observed ratios is not very likely to emerge in real life.

To simulate the impact of questionable research practices, we need to decide upon the ratio of studies where H0 is true and studies where H1 is true, and the exact increase in Type 1 error rates for each bin of p-values below 0.05. Type 1 errors come exclusively from analyzing results of studies where H0 is true (p-hacking when H1 is true inflates the effect size estimate, and thus can be seen as an incorrect way to increase the power of a test). In the calculations below, power is kept constant, but p-hacking is introduced. This is the equivalent of the true power of studies reducing over the years, which is exactly compensated by an inflated Type 1 error rate.

The ratios observed by De Winter and Dodou (2015) show the ratio is smallest for p-values between 0.001-0.009, and substantially higher for p-values between 0.011 and 0.049, with only a relatively small further increase across these 4 bins. This pattern can be reproduced based purely on inflated Type 1 errors, but the required increase in Type 1 error rates over the 5 bins is very unlikely to occur when p-hacking.

The higher the average power of statistical tests, the more frequently small p-values will be observed if there is a true effect. This means there are more p-values between 0.021-0.029 than between 0.041-0.049 whenever there is a true effect (i.e., whenever the power exceeds the alpha level). Without p-hacking, the percentage of Type 1 errors in each bin (e.g., between 0.001 and 0.009) should be 0.8% of the tests in which H0 is true, because p-values are uniformly distributed under H0 and each bin spans 0.008 (it is 1% between 0 and 0.01). If we assume this was the situation in 1990 (which is a conservative, albeit unlikely, estimate), the Type 1 error rates need to be increased to higher levels to reproduce the observed ratios, after selecting the average power of the studies and the ratio of studies where H0 is true to studies where H1 is true. It becomes extremely difficult to reconstruct the observed absolute numbers and ratios.
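One reason it is so difficult is that realistic p-hacking does not spread the extra Type 1 errors evenly over the bins below 0.05. The hypothetical simulation below (my own illustration, not part of the original analysis or of the reconstruction in Table 4) implements one common questionable practice, optional stopping under a true null hypothesis: data are added until the test is significant or a maximum sample size is reached. In such simulations the inflated Type 1 errors typically end up disproportionately in the bins just below 0.05, rather than equally across all bins (cf. Lakens, 2014).

```python
# A hypothetical simulation (my own, not from the original analysis) of optional
# stopping when H0 is true: data are added until p < .05 or n = 50 per group.
# It tallies where the resulting Type 1 errors land; without any p-hacking each
# 0.008-wide bin would contain 0.8% of the p-values.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2015)
bins = [(0.001, 0.009), (0.011, 0.019), (0.021, 0.029), (0.031, 0.039), (0.041, 0.049)]
counts = [0] * len(bins)
n_sims = 20000

for _ in range(n_sims):
    a, b = rng.normal(size=20), rng.normal(size=20)    # two groups, no true effect
    p = ttest_ind(a, b).pvalue
    while p >= 0.05 and len(a) < 50:                   # peek, add 10 per group, test again
        a = np.concatenate([a, rng.normal(size=10)])
        b = np.concatenate([b, rng.normal(size=10)])
        p = ttest_ind(a, b).pvalue
    for i, (lo, hi) in enumerate(bins):
        if lo <= p <= hi:
            counts[i] += 1

for (lo, hi), c in zip(bins, counts):
    print(f"p {lo}-{hi}: {100 * c / n_sims:.2f}% of simulated null studies")
```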

One attempt to reconstruct the ratios (but not the absolute values) is presented in Table 4. The ratio of studies where H0 is true to studies where H1 is true is set to 1, and the average power is assumed to be 57.5%. The Type 1 error rate inflation over time is substantial, and the difference in the increase over the bins is not very typical, with a practically equal increase between 0.021-0.049. To achieve the ratios observed by De Winter and Dodou (2015) for comparisons between 2013 and years later than 1990, the Type 1 error rate even needs to be inflated more strongly for p-values between 0.021-0.029 than for p-values between 0.041-0.049. Such a pattern of Type 1 error rate inflation is practically difficult to achieve, because questionable research practices (such as performing multiple analyses on the same data with different outlier criteria) produce a p-value distribution where higher p-values are observed more frequently than smaller p-values. Thus, although it is not impossible to achieve the observed ratios purely by p-hacking (although it is very challenging to reconstruct both the ratios and the absolute numbers), the required Type 1 error rate inflation over the 5 bins of p-values is unlikely to occur in real life.

Table 4. Absolute number of reconstructed Type 1 errors between 0.001-0.049 from 1990 to 2013.

Bin             true effects 1990   true effects 2013   Type 1 error rate 1990   Type 1 errors 1990   Type 1 error rate 2013   Type 1 errors 2013   reconstructed 1990/2013 ratio
p0.001-p0.009   1814                47784               0.008                    90                   0.015                    4449                 6.66
p0.011-p0.019   492                 12959               0.008                    90                   0.020                    5932                 7.89
p0.021-p0.029   319                 8399                0.008                    90                   0.025                    7415                 9.40
p0.031-p0.039   238                 6260                0.008                    90                   0.025                    7415                 10.14
p0.041-p0.049   189                 4988                0.008                    90                   0.027                    8008                 11.30
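The ratio column of Table 4 can be checked with a few lines of arithmetic, sketched below using the values from the table: for each bin, the percentage of all papers reporting a p-value in that bin (true effects plus Type 1 errors) is computed for 1990 and 2013, and the reconstructed ratio is the 2013 percentage divided by the 1990 percentage. This reproduces the ratios in Table 4 up to rounding.

```python
# A sketch of the ratio arithmetic behind Table 4, using the values in the table.
bins = ["0.001-0.009", "0.011-0.019", "0.021-0.029", "0.031-0.039", "0.041-0.049"]
true_1990 = [1814, 492, 319, 238, 189]
true_2013 = [47784, 12959, 8399, 6260, 4988]
t1e_1990 = [90] * 5
t1e_2013 = [4449, 5932, 7415, 7415, 8008]
papers_1990, papers_2013 = 563023, 2317062

for b, n90, e90, n13, e13 in zip(bins, true_1990, t1e_1990, true_2013, t1e_2013):
    pct90 = 100 * (n90 + e90) / papers_1990   # % of 1990 papers with a p-value in this bin
    pct13 = 100 * (n13 + e13) / papers_2013   # % of 2013 papers with a p-value in this bin
    print(f"{b}: reconstructed 1990/2013 ratio = {pct13 / pct90:.2f}")
```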

To summarize, we can easily reconstruct the observed ratios by assuming a relatively small decrease in power over the years (e.g., from 55% to 42%). Such an assumption could be reasonable, as long as new research areas, or strongly growing research areas, have lower power than average. One example of such a research area is neuroscience, with a median power estimated to be as low as 21% (Button et al., 2013). On the other hand, while increases in Type 1 error rates can be used to reconstruct the observed ratios, the pattern of inflated Type 1 errors across the 5 bins of p-values is unlikely to emerge in real life.

Therefore, I conclude it is not true that there is a ‘surge of p-values between 0.041-0.049’, nor that these data suggest there is an increase in questionable research practices over the last 25 years. The search for evidence of an increase in questionable research practices is starting to mirror the search for the ether. After repeated claims of a rise in p-values just below 0.05 without substantial evidence for such a rise (De Winter & Dodou, 2015; Leggett et al., 2013; Masicampo & Lalande, 2012), it is time researchers investigating inflated Type 1 errors use better models, make better predictions, and collect better data. Analyzing huge numbers of p-values, which come from studies with huge heterogeneity, will not provide any indication of the prevalence of questionable research practices, not even when changes in p-value distributions are analyzed over time. All these papers provide evidence for is a peculiar prevalence of incorrect conclusions about p-value distributions.



References

  • Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
  • Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE 9(9): e105825. doi:10.1371/journal.pone.0105825
  • Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66, 2303-2309. doi: 10.1080/17470218.2013.863371
  • Lakens, D. (2014). What p-hacking really looks like: A comment on Masicampo and LaLande (2012). The Quarterly Journal of Experimental Psychology, (ahead-of-print), 1-4.
  • Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65(11), 2271-2279.
  • Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666-681.
  • de Winter, J. C., & Dodou, D. (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ, 3, e733.



3 comments:

  1. Nice analysis. I think part of the "problem" arises from the prevalence of Neyman-Pearson (vs. Fisherian) thinking about P-values, where we are fixated on the idea that P=0.049 and P=0.051 mean very different things. Fisher would not have approved! I mention this (admittedly not as clearly as I should have) in a recent post: https://scientistseessquirrel.wordpress.com/2015/02/09/in-defence-of-the-p-value/

    1. Why is Fisher's significance testing framework necessarily better than Neyman and Pearson's? They have different goals. Fisher wants to quantify and measure evidence against a null (setting aside for the moment that p is not a valid measure of evidence), whereas Neyman and Pearson want to have rules that minimize error rates.

      Fisher didn't even believe that type-2 errors were possible! If you side with Fisher you don't have a way to calculate power.

  2. A brief reply is available here
    https://sites.google.com/site/jcfdewinter/Lakens_reply.pdf
