A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Wednesday, September 24, 2014

Publication Bias in Psychology: Putting Things in Perspective

I'd like to gratefully acknowledge the extremely helpful comments and suggestions by Anton Kühberger and his co-authors on an earlier draft of this blog post, who patiently answered my questions and shared additional analyses. I'd also like to thank Marcel van Assen, Christina Bergmann, JP de Ruiter, and Uri Simonsohn for comments and suggestions. Any remaining errors are completely my own doing.

In this post I'll be taking a closer look at recent support for publication bias in psychology provided by Kühberger et al (2014). I'll show how it is easy to misinterpret some of their data, and how p-curve analyses of their data show strong evidential support for the underlying studies (with no signs of p-hacking). Finally, I'll illustrate recently developed techniques (Simonsohn et al., in press) to estimate the power of studies based on the p-curve distribution, and check these estimates using Z-score simulations in R.

Publication bias is a problem in science (Everyone, 1950-2014). It’s difficult to quantify the extent of the problem, and therefore articles that attempt to do this can be worthwhile. In a recent paper ‘Publication Bias in Psychology: A Diagnosis Based on the Correlation between Effect Size and Sample Size’ Kühberger, Fritz, and Scherndl (2014) coded 1000 articles in psychology and examined the presence of publication bias. 531 articles ended up in the dataset, and they were able to extract p-values and effect sizes from around 400 studies. I tweeted a picture from their article, showing a distribution of Z-scores that, as the authors conclude, ‘shows a peak in the number of articles with a Z-score just above the critical level, while there are few observations just below it.’

It’s easy to misinterpret this figure (I know I did) if you believe the bump around Z=1.96 should mean there are a surprising number of just significant p-values. I’ve spent time on my blog to explain how a healthy distribution of p-values should look like: We should not find a high number of p-values around 0.05 (or a Z=1.96), but a large number of low p-values, for example as in the graph below:

You might be surprised to hear that the p-values and Z-scores in both figures above are exactly the same data (although I’ve only plotted the p-values below 0.1), presented in a different way. In the graph below, you see the relationship between p-values (from 1 to 0) and Z-scores. As you see, the lower the p-value, the higher the Z-score (in excel, Z=NORMSINV(1-p/2). However, Z-scores do not increase linearly, but in a concave upward function.

If you look at the Figure 6 by Kühberger above, you will see they have plotted Z-scores as a linear function (from 0 to 9.80 and higher, in bars of equal width. Let’s plot Z-scores (and their associated p-values) linearly, starting at Z=1.96 (p=0.05) and going upward. Both Z-scores and p-values are presented in the top pane, but because the p-values are almost not visible, these are plotted in isolation in the bottom pane.

The main point the authors want to make is the strong drop to the left of the 1.96 Z-score (indicating much less non-significant than significant results in the literature). And I was looking at the drop to the right of the figure, which can be perfectly normal (once you understand how to interpret it). The first bar (Z=1.96-2.205) contains p-values between p = 0.05 and p = 0.0275 (a width of 0.0225), the bar to the right  (Z=2.025-2.45) p-values between p = 0.0275 and p = 0.0143 (a width of 0.0132), the bar to the right  (Z=2.45-2.695) p-values between p = 0.0143 and p = 0.007 (a width of 0.0073), the bar to the right  (Z=2.695-2.94) p-values between p = 0.007 and p = 0.0033 (a width of 0.0037), etc. Depending on the power of the studies, the shape of the curve on the right side could be perfectly expected without a peculiar pre-dominance of p-values just below the traditional significance level. I’ll return to this in the end.

So, based on the distribution of p-values I can correct my earlier tweet: There are not too many just-significant p-values compared to very significant p-values, but there are too many significant p-values compared to non-significant p-values. The authors suggest nothing more.

One thing that has greatly contributed to my expectation of a bump around p = .05 are pictures of the prevalence of p-values based on google searches, such as the ones below by Matthew Hankins on Twitter:

If we compare these searches based on the hand-coded p-values by Kühberger et al (2014) we can be relatively certain the google scholar search p-curves are completely due to an artefact of the search method (where people very often report p < .05 or p < .1 instead of precise p-values). Nevertheless, and despite Matthew Hankins warning against this, this spreads the idea through the community that there is something very wrong with p-curves across science, which is not just incorrect, but damaging the reputation of science. A picture says more than a 1000 words - and posting pictures of search method artefacts on social media might not be the thing we need now that many scienctist are becoming overly skeptical and see p-hacking everywhere.

Kühberger et al (2014) provide Z-scores for 344 significant and 66 non-significant Z-scores. We can easily perform a p-curve analysis on these Z-scores, which yields the pattern below, illustrating strong evidential value for studies with N’s larger than 100, χ²(258)=826.46, p=<.0001, as well as for studies with N’s smaller than 100, χ²(428)=1185.78, p=<.0001. Researchers are not just reporting false positives in a scientific environment where publication bias reigns. Yes, there is probably some publication bias, and who knows whether there are specific research areas in this large set of studies were a lot of the results are p-hacked, but in general what ends up in the literature seems to have something going for it. I’d say it’s a pretty good looking p-curve for a random selection of psychology articles.


We can count the percentage of significant studies in small studies (N ≤ 100) and in large studies (N > 100). In 152 small studies, 84.21% of the p-values were smaller than p = 0.05. In 259 large studies, 84.17% of the p-values were smaller than p = 0.05. This is too similar. Because smaller studies typically have lower power, we should expect at least slightly more non-significant results – another indication of publication bias, in addition to the missing Z-values to the left of the 1.96 threshold.

The central observation in Kühberger et al (2014) was actually not the distribution of p-values, but the correlation between effect size and sample size. The authors state that ‘ES and sample size (SS) ought to be unrelated.’ This is true when people completely ignore power. Note that if researchers would take power into account when designing an experiment, power and sample size should be related: The smaller the effect size, the larger the sample you need to collect to have a decent probability of observing it). An interesting aspect about the data Kühlberger et al have collected, is that the negative correlation between the sample size and effect size is mainly drive by small groups (N<100), but remains stable across larger sample sizes.

This indeed points to actual publication bias. For an r = 0.3, a power analyses tells us you have 95% power with a sample size of 134 – around the sample size where the negative correlation disappears. Another possibility is the use of within-subject designs. These studies rarely collect more than 100 individuals, but they typically don’t have to, because they often have higher power than between subject designs. In an e-mail, the authors presented additional analyses, which suggest that the effects are indeed smaller for within designs, but don’t completely disappear.

Kühberger et al (2014) asked the authors of the studies they sampled to estimate the direction and size of the correlation between sample size and effect size in the data they had collected. Because the contacted authors had no idea, Kühberger et al (2014) concluded power considerations by the original authors were ‘unlikely as the source of a possible ES-SS relationship’. I don’t think this question is conclusive about a lack of power considerations. After all, researchers will probably not have been able to define the meaning of a p-value, which doesn’t mean significance considerations played no role in the studies. It’s difficult to measure whether researchers took power into account, and direct questions would probably have led to socially desirable answers, so at least the authors tried, but it's an interesting question to what extent people design studies in which power is (at least implicitly) taken into account.

On a slightly more positive note, the estimated percentage of negative results in the psychological literature (15.78%) is much greater than in other estimates in the literature (e.g., the 8.5% reported by Fanelli, 2010).

Estimating the power of the studies

Remember that the distribution of p-values is a function of the power of the studies. We should therefore be able to estimate the power of the set of studies for N < 100 and N > 100 based on the p-curve we observe. When we plot the p-curves for 53% power and 63% power, the distributions are quite comparable to the p-curves observed based on the data by Kühberger.

This visual comparison is nice to communicate the basic idea, but a more formal mathematical approach is better. Simonsohn, Nelson, & Simmons (in press) have recently extended the use of p-curve analyses in exactly this direction, and provide the R script to estimate the power of the studies.

If we enter all significant Z-scores, the estimated power estimate based on the p-curve distribution for all the studies is 90%. That's extremely high, mainly due to a decent number of studies that observed extremely low p-values (or high Z-scores). For example, 115 out of 410 p-values are p < .0001. The huge heterogeneity in the effects in the dataset by Kühberger et al (2014) is problematic for these kinds of analyses, but for illustrative purposes, I'm going to give it a go anyway.

There is a difference between my visual matching attempt and the mathematical matching by Simonsohn et al (2014). I’ve focused on matching the percentage of p-values between .00 and .01, while Simonsohn et al (2014) plot a function that matches the entire p-value distribution (including the higher p-values, and differences between p = 0.0001 and p = 0.00006). These decisions about which loss function to use leave room for debate, and I expect future research on these techniques will address different possibilities.

As mentioned above, the shape of the Z-score (or p-value) distribution depends critically on the power of the performed studies.We can simulate Z-score distributions in R by using the code below:

If we simulate Z-scores with 90% power, our picture does not look like the figure in Kuhberger et al (2014), because the peak is too far to the right:

Note that even with 90% power, our simulation expects much less Z-scores > 5 compared to the figure by Kühberger et al (2014). The heterogeneity in their study set makes it difficult to simulate using a single distribution. Normally you would examine heterogeneity in a meta-analysis by looking more closely at how the studies differ, but the studies in the dataset are not identified. So let’s resort to a more extreme solution of excluding the very high Z-scores (perhaps assuming these are manipulation checks or other types of tests) and only look at Z< 4, or p > 0.00006). We get a (probably more reasonable) power estimate of 51%.

It's clear that power estimations based on p-curve distributions with huge heterogeneity are difficult, and I'm expecting more future work on this technique that examine different ways to attempt this. However, simulating Z-scores with a power of 51% does lead to a distribution with a peak located closer to that observed in Kühberger et al (2014).

We can be reasonably certain all studies that we see to the right of Z=1.96 in the simulation, but are missing from Kühberger et al (2014) were performed, but are not reported. What a waste in resources! As such, the article by Kühberger is an excellent reminder of the negative consequences of performing underpowered studies (in addition to the difficulty of drawing statistical inferences from these studies, see Lakens & Evers, 2014).The missing studies would lower all effect size estimates that are calculated based on the published literature.

Below is the R-code to perform the power-estimation using the R-script by Simonsohn et al (you can find the original code here).


  1. Very interesting stuff. So what are we to make of all of this, in the big scheme of things?

    On the one hand, the results of the Kühberger et al paper suggest that the literature is missing studies with non-significant results (e.g. all the studies with p > .05), and that studies with smaller samples are likely to over-estimate the effects they report (because there’s a negative relationship between sample size and effect size). These pieces of evidence would suggest that the findings we have in the literature are less reliable than they would appear (at least to someone who is taking only the reported results at face value), because of publication bias.

    On the other hand, the p-curve analysis of these data (at least, as I understand it) suggests that the effects reported do not seem to reflect widespread and flagrant p-hacking. This would seem to be a favorable piece of evidence for the underlying effects (though as you report through simulations, there appear to be problems with the power of the published studies and strong effects appear to be over-represented).

    As I read your post, I wasn’t sure what exactly to make of these conflicting pieces of evidence. I know I should be (and indeed, I am) concerned with the missing studies/publication bias, but can I take some comfort from the p-curves, which seem to suggest some evidence for the measured effects, or at least a lack of widespread p-hacking? Or is the message that, even with a nice-looking p-curve, the evidence for robust effects is not necessarily convincing, because of the “missing” (e.g. non-significant) studies?

    1. Hi Mike, good question. I will touch upon a lot of these things in a follow up post. For now, it means A) analyzing huge datasets is not the best way to draw conclusions about the overall occurrence of p-hacking (but is can show publication bias), B) we know p-hacking occurs, but quantifying how much is difficult (this will be the main point in my next blog), C) knowing there is 'at least some evidential value' is a bit comforting (we are not producing only chaos) but also not very comforting (because we don't know the content of these studies, there might be a lot of well established findings in there, but for more novel research the picture might be a lot worse. So, we cannot draw strong conclusions - I guess that's science for you!