The 20% Statistician: No, the p-values are not to blame: Part 53

Friday, March 10, 2017

In the latest exuberant celebration of how Bayes Factors will save science, Ravenzwaaij and Ioannidis write: “our study offers through simulations yet another demonstration of the unfortunate effect of p-values on statistical inferences.” Uh oh – what have these evil p-values been up to this time?

Because the Food and Drug Administration thinks two significant studies are a good threshold before they'll allow you to put stuff in your mouth, Ravenzwaaij and Ioannidis use a simple simulation to look at what Bayes factors have to say when researchers find exactly two p < 0.05.

If you find two effects in 2 studies, and there is a true effect of d = 0.5, the data is super-duper convincing. The blue bars below indicate Bayes Factors > 20, the tiny green parts are BF > 3 but < 20 (still fine).

Even when you study a small effect with d = 0.2, after observing two significant results in two studies, everything is hunky-dory.

So p-values work like a charm, and there is no problem. THE END.

What's that you say? This simple message does not fit your agenda? And it's unlikely to get published? Oh dear! Let's see what we can do!

Let's define 'support for the null-hypothesis' as a BF < 1. After all, just as a 49.999% rate of heads in a coin flip is support for a coin biased towards tails, any BF < 1 is stronger support for the null than for the alternative. Yes, normally researchers consider 1/3 < BF < 3 as 'inconclusive', but let's ignore that for now.

The problem is we don't even have BF < 1 in our simulations so far. So let's think of something else. Let's introduce our good old friend lack of power!

Now we simulate a bunch of studies, until we find exactly 2 significant results. Let's say we do 20 studies where the true effect is d = 0.2, and only find an effect in 2 studies. We have 15% power (because we do a tiny study examining a tiny effect). This also means that the effect size estimates in the 18 other studies have to be small enough not to be significant. Then, we calculate Bayes Factors "for the combined data from the total number of trials conducted." Now what do we find?
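To get a feel for this 'keep going until two studies are significant' procedure, here is a minimal sketch. It is not the authors' script. It assumes a one-sided test at alpha = 0.05 with known variance, so each study's z-statistic can be drawn directly from a normal distribution; the function name, the seed and the number of repetitions are all just illustrative choices.

```python
import math
import random
from statistics import NormalDist

def studies_until_two_hits(d=0.2, n_per_cell=20, alpha=0.05, rng=random):
    """Count how many studies are run before the second significant one."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided cut-off (assumption)
    shift = d * math.sqrt(n_per_cell / 2)      # expected z under the true effect
    hits = 0
    studies = 0
    while hits < 2:
        studies += 1
        z = rng.gauss(shift, 1.0)              # z-statistic of a single study
        if z > z_crit:
            hits += 1
    return studies

rng = random.Random(2017)                      # fixed seed for reproducibility
runs = [studies_until_two_hits(rng=rng) for _ in range(5000)]
mean_studies = sum(runs) / len(runs)
print(f"average number of studies needed: {mean_studies:.1f}")
```

Under these assumptions you typically need a double-digit number of studies before the second significant result shows up, which is why a scenario with 20 studies and only 2 'hits' is not far-fetched at this level of power.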

Look! Black stuff! That's bad. The 'statistical evidence actually favors the null hypothesis', at least based on a BF < 1 cut-off. If we include the possibility of 'inconclusive evidence' (applying the widely used 1/3 < BF < 3 thresholds), we see that actually, when you find only 2 out of 20 significant studies when you have 15% power, the overall data is sometimes inconclusive (but not support for H0).
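The difference between the two labelling rules is easy to spell out. The sketch below is only an illustration of the cut-offs discussed above (BF < 1 versus the 1/3 and 3 thresholds); the function names and the example Bayes factors are made up.

```python
def label_cutoff_at_one(bf10):
    """Label a Bayes factor using a single cut-off at 1."""
    if bf10 < 1:
        return "favors H0"
    if bf10 > 1:
        return "favors H1"
    return "favors neither"

def label_with_inconclusive(bf10, lower=1 / 3, upper=3):
    """Label a Bayes factor, treating lower <= BF <= upper as inconclusive."""
    if bf10 < lower:
        return "favors H0"
    if bf10 > upper:
        return "favors H1"
    return "inconclusive"

# Hypothetical Bayes factors, chosen only to show where the rules disagree.
for bf10 in (0.2, 0.8, 2.5, 25.0):
    print(bf10, label_cutoff_at_one(bf10), label_with_inconclusive(bf10), sep=" | ")
```

A BF of 0.8 counts as 'favors H0' under the first rule and as 'inconclusive' under the second, and that is the whole difference between the black bars and a much less dramatic picture.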

That's not surprising. When we have 20 people per cell, and d = 0.2, when we combine all the data to calculate the Bayes factor (so we have N = 400 per cell) the data is inconclusive sometimes. After all, we only have 88% power! That's not bad, but the data you collect will sometimes still be inconclusive!
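If you want to check those power numbers yourself, here is a rough sketch. It assumes a one-sided test at alpha = 0.05 and uses a normal approximation instead of the exact t-distribution (both are my assumptions for this example), which is enough to land on roughly 15% and 88%.

```python
import math
from statistics import NormalDist

def one_sided_power(d, n_per_cell, alpha=0.05):
    """Approximate power of a one-sided two-group test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # critical value under H0
    shift = d * math.sqrt(n_per_cell / 2)      # expected z under the true effect
    return 1 - NormalDist().cdf(z_crit - shift)

power_single = one_sided_power(0.2, 20)        # one small study
power_pooled = one_sided_power(0.2, 400)       # all 20 studies combined
print(f"n = 20 per cell:  {power_single:.2f}")
print(f"n = 400 per cell: {power_pooled:.2f}")
```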

Let's see if we can make it even worse, by introducing our other friend, publication bias. They show another example of when p-values lead to bad inferences, namely when there is no effect, we do 20 studies, and find 2 significant results (which are Type 1 errors).

Wowzerds, what a darkness! Aren't you surprised? No, I didn't think so.
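How often would 20 studies of a null effect give you two or more Type 1 errors? That is plain binomial arithmetic. The sketch below assumes independent studies that each use alpha = 0.05; the function name is just for this example.

```python
from math import comb

def prob_exactly(k, n, p):
    """Binomial probability of exactly k significant results in n studies."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_studies, alpha = 20, 0.05
p_exactly_two = prob_exactly(2, n_studies, alpha)
p_at_least_two = 1 - prob_exactly(0, n_studies, alpha) - prob_exactly(1, n_studies, alpha)
print(f"P(exactly 2 of 20 significant):  {p_exactly_two:.3f}")
print(f"P(at least 2 of 20 significant): {p_at_least_two:.3f}")
```

So under these assumptions, exactly two false positives turn up in roughly 19% of such sets of 20 studies, and two or more in roughly 26%. If only the significant ones get reported, that is publication bias doing the damage, whatever statistic you compute afterwards.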

To conclude: Inconclusive results happen. With small samples and small effects, there is huge variability in the data. This is not only true for p-values, but it is just as true of Bayes Factors (see my post on Dance of the Bayes Factors here).
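To see that variability in Bayes Factors, here is a small sketch. Note the swap: it does not use the Bayes factor from the paper, but a simpler closed-form one with a normal prior on the effect size (sd = 0.5, a hypothetical choice) and known variance, and it does not condition on finding exactly two significant studies. It just asks how often the pooled data (N = 400 per cell, true d = 0.2) end up between 1/3 and 3.

```python
import math
import random
from statistics import NormalDist

def bf10_normal_prior(d_obs, n_per_cell, prior_sd=0.5):
    """BF10 for H1: delta ~ Normal(0, prior_sd) against H0: delta = 0."""
    se = math.sqrt(2 / n_per_cell)             # standard error of d (known variance)
    under_h0 = NormalDist(0.0, se).pdf(d_obs)
    under_h1 = NormalDist(0.0, math.sqrt(se**2 + prior_sd**2)).pdf(d_obs)
    return under_h1 / under_h0

rng = random.Random(53)                        # fixed seed for reproducibility
n_per_cell, true_d = 400, 0.2
se = math.sqrt(2 / n_per_cell)
bfs = [bf10_normal_prior(rng.gauss(true_d, se), n_per_cell) for _ in range(10000)]

share_h0 = sum(bf <= 1 / 3 for bf in bfs) / len(bfs)
share_inconclusive = sum(1 / 3 < bf < 3 for bf in bfs) / len(bfs)
share_h1 = sum(bf >= 3 for bf in bfs) / len(bfs)
print(f"BF <= 1/3: {share_h0:.2f}, inconclusive: {share_inconclusive:.2f}, BF >= 3: {share_h1:.2f}")
```

Even with a true effect and 400 people per cell, a sizeable share of these simulated datasets lands in the inconclusive zone under this particular prior. A different prior will move the numbers around, but not the basic point.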

I can understand the authors might be disappointed by the lack of enthusiasm of the FDA (which cares greatly about controlling error rates, given that they deal with life and death) to embrace Bayes Factors. But the problems the authors simulate are not going to be fixed by replacing p-values by Bayes Factors. It's not that "Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications." Publication bias and lack of power lead to spurious decision making - regardless of the statistic you throw at the data.

I'm gonna bet that a little less Bayesian propaganda, a little less p-value bashing for no good reason, and a little more acknowledgement of the universal problems of publication bias and too small sample sizes for any statistical inference we try to make, is what will really improve science in the long run.

P.S. The authors shared their simulation script with the publication, which was extremely helpful in understanding what they actually did, and which allowed me to make the figure above which includes an 'inconclusive' category (in which I also used a slightly more realistic prior when you expect small effects; I don't think it matters, but I'm too impatient to redo the simulation with the same prior and only different cut-offs).

11 comments:

Mayo, March 11, 2017 at 1:06 AM
"They show another example of when p-values lead to bad inferences, namely when there is no effect, we do 20 studies, and find 2 significant results (which are Type 1 errors)."

Are they just saying that if significance tests are run ignoring multiple testing and selection effects, then it's easy to get spurious statistical significance? That's what Ioannidis argues elsewhere. Unbelievable.

"Let's define 'support for the null-hypothesis' as a BF < 1." But BFs never give support for a hypothesis, only comparative support, and since data-dependent selections of hypotheses and priors are permitted, there is no error control.

Gerben Mulder, April 26, 2017 at 7:06 PM
Maybe beside the point: for decades scholars (ranging from epistemologists to statisticians) have voiced legitimate concerns about the scientific value of significance testing and the mess research workers make in interpreting the results of significance tests. It is a (cheap) rhetorical move to keep calling these concerns "p-value bashing", the effect of which is to make any concern about significance testing seem not worth thinking about ("it's just p-value bashing, you see"). I find this intellectually dishonest.
Gerben Mulder, April 27, 2017 at 4:06 PM
Daniel, I hope you no longer feel insulted by my complaint about terminology.
I think I've read most of the articles, books and book chapters about the use and misuse of p-values (including Nickerson) and, like the two problems you mention, there are many ways in which p-values are misunderstood. For example: 3) p < .05 means that there is an effect, or 4) p < .05 means that the probability of a type I error is low (Nickerson himself interprets the p-value as a type I error probability), etc. (see Kline, 2013, chapter 4 for an overview of common misunderstandings of p-values). And even though the rhetoric is often quite strong (take the work of McCloskey and Ziliak), I do not really think that it's the p-value that people are complaining about. How can you complain about a simple number? It's always about the way p-values are used, e.g. as inductive evidence against some nil-hypothesis without specifying an alternative (as in Fisherian significance testing), or as part of a decision procedure without being able to specify beta (NHST).
Kline (2013, p. 114) sums it up as follows (without any bashing): "statistical significance testing provides even in the best case nothing more than low-level support for the existence of an effect, relation or difference. That best case occurs when researchers estimate a priori power, specify the correct construct definitions and operationalizations, work with random or at least representative samples, analyze highly reliable scores in distributions that respect test assumptions, control other major sources of imprecision besides sampling error and test plausible null-hypotheses. In this idyllic scenario, p values from statistical tests may be reasonably accurate and potentially meaningful, if they are not misinterpreted. But science should deal with more than just the existence question, a point that researchers overly fixated on p values have trouble understanding."
Sam Field, April 28, 2017 at 2:31 AM
When the prior distribution and statistical model (likelihood) are fully specified, a change in the state of belief associated with seeing the results of a particular experiment (i.e. the posterior distribution) is defined for every possible result of that experiment. To use an analogy, it is as if the research team were asked to write several different versions of the final report with each version corresponding to a different potential finding regarding the impact of the intervention or treatment (e.g. harmful, neutral, beneficial). Such a process would ensure that any evaluation of a study’s evidentiary value is strictly independent of the findings. Researchers who do not want to risk placing a high degree of confidence in a study design which might produce an implausible result (e.g. the program is very harmful) might also temper their enthusiasm for results that are more likely to be received favorably (e.g. the program has the intended beneficial impact). The role of the data analyst is merely to determine which result is obtained and, therefore, which version of the report is sent out for publication.

Although a Bayesian framework has the potential to make evaluations of a study’s prospective evidentiary value transparent, it is not clear from my brief foray into the literature that Bayesians have fully articulated an approach which would allow one to incorporate a priori evaluations regarding the level of rigor associated with a particular experimental/observational study into a prior distribution. Indeed, most discussions of the prior distribution seem to refer to it as a general state of knowledge/uncertainty about a research hypothesis, one that does not make reference to a particular observational or experimental study and its various design elements. However, researchers are clearly not equally credulous with regard to every observation or experiment their colleagues perform, and the consensus view regarding the credulity of a particular result should clearly enter into the calculations of the posterior state of knowledge of the field (i.e. the posterior distribution). The problem is that in a culture that is dominated by post-hoc debates about the interpretation of already completed studies (aka peer review), there are no institutional mechanisms or incentives in place that would allow this. If Bayesian inference is going to save us, isn't this how it would do it?