A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, March 10, 2017

No, the p-values are not to blame: Part 53

In the latest exuberant celebration of how Bayes Factors will save science, Ravenzwaaij and Ioannidis write: “our study offers through simulations yet another demonstration of the unfortunate effect of p-values on statistical inferences.” Uh oh – what have these evil p-values been up to this time?

Because the Food and Drug Administration thinks two significant studies are a good threshold before they'll allow you to put stuff in your mouth, in a simple simulation, Ravenzwaaij and Ioannidis look at what Bayes factors have to say when researchers find exactly two p < 0.05.

If you find two effects in 2 studies, and there is a true effect of d = 0.5, the data is super-duper convincing. The blue bars below indicate Bayes Factors > 20, the tiny green parts are BF > 3 but < 20 (still fine).

Even when you study a small effect with d = 0.2, after observing two significant results in two studies, everything is hunky-dory.

So p-values work like a charm, and there is no problem. THE END.

What's that you say? This simple message does not fit your agenda? And it's unlikely to get published? Oh dear! Let's see what we can do!

Let's define 'support for the null-hypothesis' as a BF < 1. After all, just as a 49.999% rate of heads in a coin flip is support for a coin biased towards tails, any BF < 1 is stronger support for the null than for the alternative. Yes, normally researchers consider 1/3 < BF < 3 as 'inconclusive', but let's ignore that for now.
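To make the difference between the two cut-off rules concrete, here is a minimal sketch. The function names are made up for illustration, and the thresholds are just the conventional 1/3 and 3 mentioned above, not anything taken from the authors' script.

```python
def classify_bf_dichotomous(bf10: float) -> str:
    """BF < 1 counts as support for H0, anything else as support for H1."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive")
    return "supports H0" if bf10 < 1 else "supports H1"


def classify_bf_with_inconclusive(bf10: float,
                                  lower: float = 1 / 3,
                                  upper: float = 3.0) -> str:
    """Conventional rule: 1/3 < BF < 3 is treated as inconclusive."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive")
    if bf10 >= upper:
        return "supports H1"
    if bf10 <= lower:
        return "supports H0"
    return "inconclusive"


# A Bayes factor of 0.8 barely leans towards the null.
print(classify_bf_dichotomous(0.8))
print(classify_bf_with_inconclusive(0.8))
```

The same Bayes factor gets a very different label depending on which rule you pick, which is the whole point of the paragraph above.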

The problem is we don't even have BF < 1 in our simulations so far. So let's think of something else. Let's introduce our good old friend lack of power!

Now we simulate a bunch of studies, until we find exactly 2 significant results. Let's say we do 20 studies where the true effect is d = 0.2, and only find an effect in 2 studies. We have 15% power (because we do a tiny study examining a tiny effect). This also means that the effect size estimates in the 18 other studies have to be small enough not to be significant. Then, we calculate Bayes Factors "for the combined data from the total number of trials conducted." Now what do we find?

Look! Black stuff! That's bad. The 'statistical evidence actually favors the null hypothesis', at least based on a BF < 1 cut-off. If we include the possibility of 'inconclusive evidence' (applying the widely used 1/3 < BF < 3 thresholds), we see that actually, when you find only 2 out of 20 significant studies when you have 15% power, the overall data is sometimes inconclusive (but not support for H0).

That's not surprising. When we have 20 people per cell, and d = 0.2, when we combine all the data to calculate the Bayes factor (so we have N = 400 per cell) the data is inconclusive sometimes. After all, we only have 88% power! That's not bad, but the data you collect will sometimes still be inconclusive!
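If you want to check the 15% and 88% figures yourself, here is a rough sketch using a normal approximation to the two-sample test. Note the assumption: I use a one-sided test at alpha = 0.05, because that is what reproduces these numbers; the function name is mine, not from the authors' script.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_one_sided(d: float, n_per_cell: int, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided two-sample test (normal approximation).

    d is the true standardized mean difference, n_per_cell the sample size
    in each of the two groups.
    """
    z = NormalDist()                  # standard normal
    ncp = d * sqrt(n_per_cell / 2)    # expected value of the test statistic
    z_crit = z.inv_cdf(1 - alpha)     # one-sided critical value
    return 1 - z.cdf(z_crit - ncp)


# One small study (20 per cell) versus all 20 studies combined (400 per cell).
print(round(approx_power_one_sided(0.2, 20), 2))
print(round(approx_power_one_sided(0.2, 400), 2))
```

An exact calculation based on the noncentral t-distribution will differ slightly, especially for the small study, but not enough to change the story.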

Let's see if we can make it even worse, by introducing our other friend, publication bias. They show another example of when p-values lead to bad inferences, namely when there is no effect, we do 20 studies, and find 2 significant results (which are Type 1 errors).
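How often would you see exactly 2 significant results in 20 studies? If we assume the studies are independent and each has the same probability of being significant, this is a plain binomial probability. The sketch below (helper name is mine) computes it for the 15% power scenario and for the null scenario, where the only way to get a significant result is a Type 1 error at alpha = 0.05.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_studies = 20

# True effect d = 0.2, roughly 15% power per study.
p_two_sig_low_power = binom_pmf(2, n_studies, 0.15)

# No true effect, so each study is significant with probability alpha = 0.05.
p_two_sig_null = binom_pmf(2, n_studies, 0.05)

print(f"P(exactly 2 of 20 significant | power = 0.15): {p_two_sig_low_power:.3f}")
print(f"P(exactly 2 of 20 significant | H0 true):      {p_two_sig_null:.3f}")
```

Neither outcome is rare, so observing 2 out of 20 significant studies tells you very little about which of the two scenarios you are in.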

Wowzerds, what a darkness! Aren't you surprised? No, I didn't think so.

To conclude: Inconclusive results happen. In small samples and small effects, there is huge variability in the data. This is not only true for p-values, but it is just as true of Bayes Factors (see my post on Dance of the Bayes Factors here).
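You can see this variability for yourself with a few lines of code. The sketch below is deliberately simplified and is not the authors' analysis: it treats the observed effect size as normally distributed with known variance 2/n, and puts a normal prior with standard deviation tau = 0.5 on the effect under H1. Both the prior and the function names are my own assumptions for illustration.

```python
import random
from math import exp, sqrt


def bf10_normal_prior(d_hat: float, n_per_cell: int, tau: float = 0.5) -> float:
    """Bayes factor for H1: delta ~ N(0, tau^2) versus H0: delta = 0.

    Assumes d_hat ~ N(delta, 2 / n_per_cell), a known-variance approximation.
    The BF is the ratio of the marginal densities of d_hat under H1 and H0.
    """
    var_h0 = 2.0 / n_per_cell      # sampling variance of d_hat
    var_h1 = var_h0 + tau**2       # marginal variance of d_hat under H1
    return sqrt(var_h0 / var_h1) * exp(0.5 * d_hat**2 * (1 / var_h0 - 1 / var_h1))


def simulate_bfs(true_d: float, n_per_cell: int, n_studies: int, seed: int = 1) -> list:
    """Simulate observed effect sizes and return the Bayes factor for each study."""
    rng = random.Random(seed)
    se = sqrt(2.0 / n_per_cell)
    return [bf10_normal_prior(rng.gauss(true_d, se), n_per_cell)
            for _ in range(n_studies)]


bfs = simulate_bfs(true_d=0.2, n_per_cell=20, n_studies=1000)
inconclusive = sum(1 / 3 < bf < 3 for bf in bfs) / len(bfs)
print(f"smallest BF: {min(bfs):.2f}, largest BF: {max(bfs):.2f}")
print(f"proportion inconclusive (1/3 < BF < 3): {inconclusive:.2f}")
```

With a different prior the exact numbers change, but the dance remains: small studies of small effects produce Bayes factors that bounce around just as p-values do.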

I can understand the authors might be disappointed by the lack of enthusiasm of the FDA (which cares greatly about controlling error rates, given that they deal with life and death) to embrace Bayes Factors. But the problems the authors simulate are not going to be fixed by replacing p-values by Bayes Factors. It's not that "Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications." Publication bias and lack of power lead to spurious decision making - regardless of the statistic you throw at the data.

I'm gonna bet that a little less Bayesian propaganda, a little less p-value bashing for no good reason, and a little more acknowledgement of the universal problems of publication bias and too small sample sizes for any statistical inference we try to make, is what will really improve science in the long run.

P.S. The authors shared their simulation script with the publication, which was extremely helpful in understanding what they actually did, and which allowed me to make the figure above which includes an 'inconclusive' category (in which I also used a slightly more realistic prior for when you expect small effects; I don't think it matters, but I'm too impatient to redo the simulation with the same prior and only different cut-offs).


  1. "They show another example of when p-values lead to bad inferences, namely when there is no effect, we do 20 studies, and find 2 significant results (which are Type 1 errors)."

    Are they just saying that if significance tests are run ignoring multiple testing and selection effects, then it's easy to get spurious statistical significance? That's what Ioannidis argues elsewhere. Unbelievable.

    "Let's define 'support for the null-hypothesis' as a BF < 1." But BFs never give support for a hypothesis, only comparative support, and since data-dependent selections of hypotheses and priors are permitted, there is no error control.

  2. Maybe beside the point: for decades scholars (ranging from epistemologists to statisticians) have voiced legitimate concerns about the scientific value of significance testing and the mess research workers make in interpreting the results of significance tests. It is a (cheap) rhetorical move to keep calling these concerns "p-value bashing", the effect of which is to make any concern about significance testing not worth the trouble of thinking about ("it's just p-value bashing, you see"). I find this intellectually dishonest.

    1. Gerben, thank you for leaving this insult on my blog. I really appreciate you taking time out of your busy work to call me intellectually dishonest. If you think p-values are the problem, you simply don't understand the problem. I do not consider it dishonest to promote p-values. They are one of the best inventions in science. People misused them, and they have limitations. Many limitations people keep pointing out have been solved (e.g. by equivalence testing). Feel free to invite me for a talk at your department and we can discuss this in a more civil manner.

    2. Hi Daniel, please don't misunderstand me. I am not at all calling you intellectually dishonest. I had no intention to insult you. So, if you feel offended, I apologize. What I am saying is that the terminology you choose i.e. "p-value bashing" does not do justice to many of the legitimate concerns about p-values. I call the rhetorical effect of using these terms intellectually dishonest. I do not see how pointing this out is "uncivil" or insulting.
      I do not consider promotion of p-values dishonest, nor do I think (I repeat) that you are intellectually dishonest (on the contrary, I would say; if you care).

    3. Gerben, if you read a typical p-value bashing paper, the rhetoric is much stronger. There are only 2 problems with p-values: 1) don't interpret them as evidence for a theory, and 2) p > 0.05 does not mean there is no effect (this is solved by equivalence testing). Nickerson (2000) explained all of this. All other articles have tried to draw attention by making more extreme, often incorrect, statements. That deserves the p-value bashing label.

  3. Daniel, I hope you no longer feel insulted by my complaint about terminology.
    I think I've read most of the articles and book chapters about the use and misuse of p-values (including Nickerson), and beyond the two problems you mention there are many other ways in which p-values are misunderstood. Like: 3) p < .05 means that there is an effect, or 4) p < .05 means that the probability of a type I error is low (Nickerson himself interprets the p-value as a type I error probability), etc. (see Kline, 2013, chapter 4 for an overview of common misunderstandings of p-values). And even though the rhetoric is often quite strong (take the work of McCloskey and Ziliak), I do not really think that it's the p-value that people are complaining about. How can you complain about a simple number? It's always about the way p-values are used, e.g. as inductive evidence against some nil-hypothesis without specifying an alternative (as in Fisherian significance testing), or as part of a decision procedure without being able to specify beta (NHST).
    Kline (2013, p. 114) sums it up as follows (without any bashing): "statistical significance testing provides even in the best case nothing more than low-level support for the existence of an effect, relation or difference. That best case occurs when researchers estimate a priori power, specify the correct construct definitions and operationalizations, work with random or at least representative samples, analyze highly reliable scores in distributions that respect test assumptions, control other major sources of imprecision besides sampling error and test plausible null-hypotheses. In this idyllic scenario, p values from statistical tests may be reasonably accurate and potentially meaningful, if they are not misinterpreted. But science should deal with more than just the existence question, a point that researchers overly fixated on p values have trouble understanding."

    1. Kline's comment is perfectly fine. He nicely illustrates what p-values do, and don't. Nobody recommends only using p-values, but they will allow you to do things (error control) nothing else allows you to do. You will have to admit the Ravenzwaaij article is not even close to nuanced. See also https://errorstatistics.com/2017/04/23/s-senn-automatic-for-the-people-not-quite-guest-post/

  4. When the prior distribution and statistical model (likelihood) are fully specified, a change in the state of belief associated with seeing the results of a particular experiment (i.e. the posterior distribution) is defined for every possible result of that experiment. To use an analogy, it is as if the research team were asked to write several different versions of the final report with each version corresponding to a different potential finding regarding the impact of the intervention or treatment (e.g. harmful, neutral, beneficial). Such a process would ensure that any evaluation of a study’s evidentiary value is strictly independent of the findings. Researchers who do not want to risk placing a high degree of confidence in a study design which might produce an implausible result (e.g. the program is very harmful), might also temper their enthusiasm for results that are more likely to be received favorably (e.g. the program has the intended beneficial impact). The role of the data analyst is merely to determine which result is obtained and, therefore, which version of the report is sent out for publication.

    Although a Bayesian framework has the potential to make evaluations of a study’s prospective evidentiary value transparent, it is not clear from my brief foray into the literature that Bayesians have fully articulated an approach which would allow one to incorporate a priori evaluations regarding the level of rigor associated with a particular experimental/observational study into a prior distribution. Indeed, most discussions of the prior distribution seem to refer to it as a general state of knowledge/uncertainty about a research hypothesis, one that does not make reference to a particular observational or experimental study and its various design elements. However, researchers are clearly not equally credulous with regard to every observation or experiment their colleagues perform, and the consensus view regarding the credibility of a particular result should clearly enter into the calculations of the posterior state of knowledge of the field (i.e. the posterior distribution). The problem is that in a culture that is dominated by post-hoc debates about the interpretation of already completed studies (aka peer review), there are no institutional mechanisms or incentives in place that would allow this. If Bayesian inference is going to save us, isn't this how it would do it?

    1. Maybe in theory, but so far, not in practice. See the Bayesian analyses of pre-cognition. You get different people, with different priors, and different conclusions. That's exactly how Bayesian statistics is supposed to work, but I'm not sure it is saving us in practice? But maybe I'm missing something? Do you have good examples of this in practice?

    2. Nope. No examples. We just need people to write their conclusions before they see the results. Bayesian inference provides a mathematically formalized way of doing this. Most people find this a strange, nearly impossible task. How do we track the change in a consensus position regarding a proposition in response to seeing a small amount of data from an imperfect experiment/observation? Wouldn't it be easier to just collect the data (run the experiment) and see what happens? Yes it would be easier, but we wouldn't call it science, we would call it journalism.

    3. In many fields, we are forced to make probabilistic inferences from a handful of small, poorly designed studies. For example, in medicine, once a treatment is demonstrated to be superior to placebo or some other treatment in a clinical trial, it is unethical to replicate that trial in an attempt to reproduce the results. We can't just sit around confirming that the placebo group gets screwed over and over again like they can in animal studies. If we could, there would hardly be any need for statistics beyond some sloppy use of a p-value. Constraints on replication in many fields mean that data are never enough and that we need expert judgement. The key is that this judgement can only affect the posterior distribution through the prior distribution. The independent role of the data in the likelihood function must be preserved. Otherwise, you might as well throw the small amount of very valuable data that we did manage to collect in the trash. I have to think that Bayesian inference is the only way to go if you want to continue to make probabilistic inferences. My beef with the p-value is that it makes the data seem more valuable than it really is. Researchers try to communicate this fact in the "limitations" section of the study report, but a few lines of prose, no matter how well written, can never compete with a number when it comes to summarizing the risk of mistaken inference. Bayesian inference is unavoidable, I think.