Comments on The 20% Statistician: No, the p-values are not to blame: Part 53

Sam Field (2017-04-29, 13:01):
In many fields, we are forced to make probabilistic inferences from a handful of small, poorly designed studies. For example, in medicine, once a treatment is demonstrated to be superior to placebo or some other treatment in a clinical trial, it is unethical to replicate that trial in an attempt to reproduce the results. We can't just sit around confirming that the placebo group gets screwed over and over again the way they can in animal studies. If we could, there would hardly be any need for statistics beyond some sloppy use of a p-value. Constraints on replication in many fields mean that data are never enough and that we need expert judgement. The key is that this judgement can only affect the posterior distribution through the prior distribution. The independent role of the data in the likelihood function must be preserved. Otherwise, you might as well throw the small amount of very valuable data that we did manage to collect in the trash. I have to think that Bayesian inference is the only way to go if you want to continue to make probabilistic inferences. My beef with the p-value is that it makes the data seem more valuable than it really is. Researchers try to communicate this fact in the "limitations" section of the study report, but a few lines of prose, no matter how well written, can never compete with a number when it comes to summarizing the risk of mistaken inference. Bayesian inference is unavoidable, I think.

Sam Field (2017-04-28, 13:39):
Nope. No examples. We just need people to write their conclusions before they see the results. Bayesian inference provides a mathematically formalized way of doing this. Most people find this a strange, nearly impossible task. How do we track the change in a consensus position regarding a proposition in response to seeing a small amount of data from an imperfect experiment or observation? Wouldn't it be easier to just collect the data (run the experiment) and see what happens? Yes, it would be easier, but we wouldn't call it science; we would call it journalism.

Daniel Lakens (2017-04-28, 04:41):
Maybe in theory, but so far, not in practice. See the Bayesian analyses of pre-cognition: you get different people, with different priors, and different conclusions. That's exactly how Bayesian statistics is supposed to work, but I'm not sure it is saving us in practice. But maybe I'm missing something? Do you have good examples of this in practice?
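A minimal sketch of the point about priors, assuming a conjugate Beta-Binomial setup in which all numbers (the hit counts and prior weights) are invented for illustration: the same data can leave a strong skeptic and an open-minded believer with visibly different posterior conclusions.

```python
# Toy Beta-Binomial illustration: identical data, different priors,
# different posterior conclusions. All numbers are made up.
from scipy import stats

hits, n = 527, 1000  # hypothetical forced-choice "pre-cognition" data; chance = 0.5

priors = {
    "skeptic":  (5000, 5000),  # prior worth 10,000 chance-level trials near 0.5
    "believer": (1, 1),        # uniform prior: any hit rate equally plausible
}

for name, (a, b) in priors.items():
    posterior = stats.beta(a + hits, b + n - hits)  # conjugate Beta update
    print(f"{name}: P(hit rate > 0.5 | data) = {posterior.sf(0.5):.2f}")
```

With these made-up numbers the believer ends up around .96 while the skeptic stays near .70, which is the divergence described above.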
Sam Field (2017-04-28, 02:31):
When the prior distribution and statistical model (likelihood) are fully specified, the change in the state of belief associated with seeing the results of a particular experiment (i.e., the posterior distribution) is defined for every possible result of that experiment. To use an analogy, it is as if the research team were asked to write several different versions of the final report, with each version corresponding to a different potential finding regarding the impact of the intervention or treatment (e.g., harmful, neutral, beneficial). Such a process would ensure that any evaluation of a study's evidentiary value is strictly independent of the findings. Researchers who do not want to risk placing a high degree of confidence in a study design that might produce an implausible result (e.g., the program is very harmful) might also have to temper their enthusiasm for results that are more likely to be received favorably (e.g., the program has the intended beneficial impact). The role of the data analyst is then merely to determine which result is obtained and, therefore, which version of the report is sent out for publication.

Although a Bayesian framework has the potential to make evaluations of a study's prospective evidentiary value transparent, it is not clear from my brief foray into the literature that Bayesians have fully articulated an approach that would allow one to incorporate a priori evaluations of the rigor of a particular experimental or observational study into a prior distribution. Indeed, most discussions of the prior distribution seem to refer to it as a general state of knowledge or uncertainty about a research hypothesis, one that does not make reference to a particular observational or experimental study and its various design elements. However, researchers are clearly not equally credulous with regard to every observation or experiment their colleagues perform, and the consensus view regarding the credibility of a particular result should clearly enter into the calculation of the posterior state of knowledge of the field (i.e., the posterior distribution). The problem is that in a culture dominated by post-hoc debates about the interpretation of already completed studies (a.k.a. peer review), there are no institutional mechanisms or incentives in place that would allow this. If Bayesian inference is going to save us, isn't this how it would do it?
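A toy version of this analogy, under an assumed Beta prior and arbitrary 95% decision cut-offs: because the prior and likelihood fix the posterior for every possible outcome, each "version of the report" for a planned trial can be computed before any data exist.

```python
# Enumerate the posterior for every possible result of a planned trial,
# i.e. "write each version of the report in advance". The prior and the
# 95% cut-offs are arbitrary choices for illustration.
from scipy import stats

n = 20        # planned number of binary outcomes in a hypothetical trial
a, b = 2, 2   # assumed Beta prior on the success rate

for k in range(n + 1):                    # every result the trial can produce
    post = stats.beta(a + k, b + n - k)   # posterior if k successes are observed
    if post.cdf(0.5) > 0.95:
        verdict = "harmful"
    elif post.sf(0.5) > 0.95:
        verdict = "beneficial"
    else:
        verdict = "neutral"
    print(f"{k:2d}/{n} successes -> P(rate > 0.5) = {post.sf(0.5):.3f} ({verdict})")
```

The analyst's job then reduces to observing which k occurred and reading off the pre-written verdict.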
Daniel Lakens (2017-04-27, 16:58):
Kline's comment is perfectly fine. He nicely illustrates what p-values do, and what they don't. Nobody recommends only using p-values, but they allow you to do things (error control) that nothing else allows you to do. You will have to admit the Ravenzwaaij article is not even close to nuanced. See also https://errorstatistics.com/2017/04/23/s-senn-automatic-for-the-people-not-quite-guest-post/

Gerben Mulder (2017-04-27, 16:06):
Daniel, I hope you no longer feel insulted by my complaint about terminology. I think I've read most of the articles and book chapters about the use and misuse of p-values (including Nickerson), and (like the two problems you mention) there are many ways in which p-values are misunderstood. Like: 3) p < .05 means that there is an effect, or 4) p < .05 means that the probability of a Type I error is low (Nickerson himself interprets the p-value as a Type I error probability), etc. (see Kline, 2013, chapter 4 for an overview of common misunderstandings of p-values). And even though the rhetoric is often quite strong (take the work of McCloskey and Ziliak), I do not really think that it's the p-value that people are complaining about. How can you complain about a simple number? It's always about the way p-values are used, e.g., as inductive evidence against some nil hypothesis without specifying an alternative (as in Fisherian significance testing), or as part of a decision procedure without being able to specify beta (NHST).
Kline (2013, p. 114) sums it up as follows (without any bashing): "statistical significance testing provides even in the best case nothing more than low-level support for the existence of an effect, relation or difference. That best case occurs when researchers estimate a priori power, specify the correct construct definitions and operationalizations, work with random or at least representative samples, analyze highly reliable scores in distributions that respect test assumptions, control other major sources of imprecision besides sampling error and test plausible null-hypotheses. In this idyllic scenario, p values from statistical tests may be reasonably accurate and potentially meaningful, if they are not misinterpreted. But science should deal with more than just the existence question, a point that researchers overly fixated on p values have trouble understanding."

Daniel Lakens (2017-04-27, 12:58):
Gerben, if you read a typical p-value bashing paper, the rhetoric is much stronger. There are only two problems with p-values: 1) don't interpret them as evidence for a theory, and 2) p > 0.05 does not mean there is no effect (this is solved by equivalence testing). Nickerson (2000) explained all of this. All other articles have tried to draw attention by making more extreme, often incorrect, statements. That deserves the p-value bashing label.
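A minimal sketch of the equivalence-testing fix mentioned here, using the TOST (two one-sided tests) procedure; the simulated data and the equivalence bounds are invented for illustration.

```python
# TOST equivalence test: instead of interpreting p > .05 as "no effect",
# test directly whether the effect lies inside an equivalence region.
# Data and bounds are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.05, scale=1.0, size=100)  # hypothetical effect scores

low, upp = -0.3, 0.3  # smallest effect sizes of interest, in raw units

# Reject "mean <= low" AND reject "mean >= upp" to conclude equivalence.
p_lower = stats.ttest_1samp(x, popmean=low, alternative="greater").pvalue
p_upper = stats.ttest_1samp(x, popmean=upp, alternative="less").pvalue
p_tost = max(p_lower, p_upper)

print(f"TOST p = {p_tost:.4f}; p < .05 supports equivalence within ({low}, {upp})")
```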
"p-value bashing" does not do justice to many of the legitimate concerns about p-values. I call the rhetorical effect of using these terms intellectually dishonest. I do not see how pointing this out is "uncivil" or insulting. <br />I do not consider promotion of p-values dishonest, nor do I think (I repeat) that you are intellectually dishonest (on the contrary, I would say; if you care). Gerben Mulderhttps://www.blogger.com/profile/13239926388232485676noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-73854911051696273332017-04-26T19:26:16.379+02:002017-04-26T19:26:16.379+02:00Gerben, thank you for leaving this insult on my bl...Gerben, thank you for leaving this insult on my blog. I really appreciate you taking time out of your busy work to call me intellectually dishonest. If you think p-values are the problem, you simple don't understand the problem. I do not consider it dishonest to promote p-values. They are one of the best inventions in science. People misused them, and they have limitations. Many limitations people keep pointing out have been solved (e.g. by equivalence testing). Feel free to invite me for a talk at your department and we can discuss this in a more civil manner.Daniel Lakenshttps://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-63779323065801742322017-04-26T19:06:15.317+02:002017-04-26T19:06:15.317+02:00Maybe besides the point: for decades scholars (ran...Maybe besides the point: for decades scholars (ranging from epistemologists to statisticians) have outed legitimate concerns about the scientific value of signifiance testing and the mess research workers make in interpreting the results of significance tests. It is a (cheap) rhetorical move to keep calling these concerns "p-value bashing", the effect of which is to make any concern about significance testing not worth the trouble thinking about ("it's just p-value bashing, you see"). I find this intelectually dishonest. Gerben Mulderhttps://www.blogger.com/profile/13239926388232485676noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-83742442613960591282017-03-11T01:06:56.222+01:002017-03-11T01:06:56.222+01:00"They show another example of when p-values l..."They show another example of when p-values lead to bad inferences, namely when there is no effect, we do 20 studies, and find 2 significant results (which are Type 1 errors)."<br /><br />Are they just saying that if significance tests are run ignoring multiple testing and selection effects, then it's easy to get spurious statistical significance? That's what Ioannidis argues elsewhere. Unbelievable.<br /><br />"Let's define 'support for the null-hypothesis' as a BF < 1.". But BFs never give support for a hypothesis, only comparative support,and since data dependent selections of hypotheses and priors are permitted,there is no error control. <br /><br />Mayohttps://www.blogger.com/profile/06527423269272136310noreply@blogger.com