# Science-Wise False Discovery Rate Does Not Explain the Prevalence of Bad Science

*Guest post by Nathan (Nat) Goodman (@gnatgoodman)*

*October 16, 2017*

*This article explores the statistical concept of science-wise false discovery rate (SWFDR). Some authors use SWFDR and its complement, positive predictive value, to argue that most (or, at least, many) published scientific results must be wrong unless most hypotheses are a priori true. I disagree. While SWFDR is valid statistically, the real cause of bad science is “Publish or Perish”.*

## Introduction

Is science broken? A lot of people seem to think so, including some esteemed statisticians. One line of reasoning uses the concepts of false discovery rate and its complement, positive predictive value, to argue that most (or, at least, many) published scientific results must be wrong unless most hypotheses are *a priori* true.

The *false discovery rate* (*FDR*) is the probability that a significant p-value indicates a false positive, or equivalently, the proportion of significant p-values that correspond to results without a real effect. The complement, *positive predictive value* (\(PPV=1-FDR\)), is the probability that a significant p-value indicates a true positive, or equivalently, the proportion of significant p-values that correspond to results with real effects.
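Under these definitions, the expected FDR follows from three quantities: the fraction of tested hypotheses that are true, the power of the test, and the significance level. The Python sketch below is illustrative only (the program discussed in this post is written in R). The function name `theoretical_fdr`, its argument names, and the example values are assumptions made for this illustration.

```python
def theoretical_fdr(prop_true, power, sig_level=0.05):
    """Expected FDR among significant results.

    prop_true: fraction of tested hypotheses with a real effect
    power:     probability that a real effect yields a significant p-value
    sig_level: probability that a null effect yields a significant p-value
    """
    false_pos = (1 - prop_true) * sig_level  # expected share of false positives
    true_pos = prop_true * power             # expected share of true positives
    return false_pos / (false_pos + true_pos)


# Hypothetical example: 10% of hypotheses are true, power is 80%,
# and the significance level is 0.05.
fdr = theoretical_fdr(prop_true=0.1, power=0.8)
print(f"FDR = {fdr:.2f}, PPV = {1 - fdr:.2f}")  # FDR = 0.36, PPV = 0.64
```

Even with good power, a low fraction of true hypotheses pushes the FDR well above the nominal significance level, which is the core of the SWFDR argument.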

I became interested in this topic after reading Felix Schönbrodt’s blog post, “What’s the probability that a significant p-value indicates a true effect?” and playing with his ShinyApp. Schönbrodt’s post led me to David Colquhoun’s paper, “An investigation of the false discovery rate and the misinterpretation of p-values” and blog posts by Daniel Lakens, “How can p = 0.05 lead to wrong conclusions 30% of the time with a 5% Type 1 error rate?” and Will Gervais, “Power Consequences”.

The term *science-wise false discovery rate* (SWFDR) is from Leah Jager and Jeffrey Leek’s paper, “An estimate of the science-wise false discovery rate and application to the top medical literature”. Earlier work includes Sholom Wacholder et al.’s 2004 paper, “Assessing the Probability That a Positive Report is False: An Approach for Molecular Epidemiology Studies” and John Ioannidis’s 2005 paper, “Why most published research findings are false”.

## Scenario

Being a programmer and not a statistician, I decided to write some R code to explore this topic on simulated data.

The program simulates a large number of problem instances representing published results, some of which are true and some false. The instances are very simple: I generate two groups of random numbers and use the t-test to assess the difference between their means. One group (the control group or simply *group0*) comes from a standard normal distribution with \(mean=0\). The other group (the treatment group or simply *group1*) is a little more involved:

- for *true* instances, I take numbers from a normal distribution with mean *d* (\(d>0\)) and standard deviation 1;
- for *false* instances, I use the same distribution as *group0*.

The parameter *d* is the effect size, aka *Cohen’s d*.

I then apply the t-test, which produces a p-value for the null hypothesis that both groups have the same mean.

The program does this thousands of times (drawing different random numbers each time, of course), collects the resulting p-values, and computes the FDR. The program repeats the procedure for a range of assumptions to determine the conditions under which most positive results are wrong.

For *true* instances, we expect the difference in means to be approximately *d* and for *false* ones to be approximately 0, but due to the vagaries of random sampling, this may not be so. If the actual difference in means is far from the expected value, the t-test may get it wrong, declaring a *false* instance to be positive or a *true* one to be negative. The goal is to see how often we get the wrong answer across a range of assumptions.
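The procedure described above can be sketched in a few lines. The version below is a simplified Python illustration, not the R program this post describes. It uses only the standard library, fixes the sample size at 16 per group, and replaces the p-value computation with a comparison of the pooled-variance t statistic against the two-sided 5% critical value for 30 degrees of freedom (about 2.042). The function names, the seed, and the choice of a pooled-variance test are all assumptions of this sketch.

```python
import math
import random

# Two-sided 5% critical value of Student's t with 2*16 - 2 = 30 degrees of
# freedom. Hard-coded (an assumption of this sketch) to avoid extra libraries.
T_CRIT_DF30 = 2.042


def t_statistic(group0, group1):
    """Pooled-variance two-sample t statistic for equal-sized groups."""
    n = len(group0)
    mean0 = sum(group0) / n
    mean1 = sum(group1) / n
    var0 = sum((x - mean0) ** 2 for x in group0) / (n - 1)
    var1 = sum((x - mean1) ** 2 for x in group1) / (n - 1)
    pooled_var = (var0 + var1) / 2
    return (mean1 - mean0) / math.sqrt(2 * pooled_var / n)


def simulate_fdr(prop_true, d, m=10_000, n=16, seed=1):
    """Empirical FDR over m simulated instances at the 5% significance level."""
    assert n == 16, "T_CRIT_DF30 is only valid for n = 16 per group"
    rng = random.Random(seed)
    n_true = round(prop_true * m)  # number of instances with a real effect
    true_pos = false_pos = 0
    for i in range(m):
        is_true = i < n_true
        mean1 = d if is_true else 0.0
        group0 = [rng.gauss(0.0, 1.0) for _ in range(n)]    # control group
        group1 = [rng.gauss(mean1, 1.0) for _ in range(n)]  # treatment group
        if abs(t_statistic(group0, group1)) > T_CRIT_DF30:  # positive instance
            if is_true:
                true_pos += 1
            else:
                false_pos += 1
    positives = true_pos + false_pos
    return false_pos / positives if positives else float("nan")


# Small effect, few true hypotheses: most positive instances are false.
print(simulate_fdr(prop_true=0.1, d=0.25))
# Large effect, many true hypotheses: few positive instances are false.
print(simulate_fdr(prop_true=0.9, d=1.0))
```

Calling `simulate_fdr` over a grid of `prop_true` and `d` values gives a rough picture of the conditions under which most positive results are wrong.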

## Nomenclature

To reduce confusion, I will be obsessively consistent in my terminology.

- An *instance* is a single run of the simulation procedure.
- The terms *positive* and *negative* refer to the results of the t-test. A *positive instance* is one for which the t-test reports a significant p-value; a *negative instance* is the opposite. Obviously the distinction between positive and negative depends on the chosen significance level.
- *true* and *false* refer to the correct answers. A *true instance* is one where the treatment group (*group1*) is drawn from a distribution with \(mean=d\) (\(d>0\)). A *false instance* is the opposite: an instance where *group1* is drawn from a distribution with \(mean=0\).
- *empirical* refers to results calculated from the simulated data, as opposed to *theoretical* which means results calculated using standard formulas.

The simulation parameters are

parameter | meaning | default |
---|---|---|
prop.true | fraction of cases where there is a real effect | `seq(.1,.9,by=.2)` |
m | number of iterations | `1e4` |
n | sample size | `16` |
d | standardized effect size (aka Cohen’s d) | `c(.25,.50,.75,1,2)` |
pwr | power. if set, the program adjusts d to achieve power | `NA` |
sig.level | significance level for power calculations when pwr is set | `0.05` |
pval.plot | p-values for which we plot results | `c(.001,.01,.03,.05,.1)` |

## Results

The simulation procedure with default parameters produces four graphs similar to the ones below.