A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, February 11, 2016

So you banned p-values, how’s that working out for you?

The journal Basic and Applied Social Psychology banned p-values a year ago. I read some of their articles published in the last year. I didn’t like many of them. Here’s why.

First of all, it seems BASP didn’t just ban p-values. They also banned confidence intervals, because God forbid you use that lower bound to check whether or not it includes 0. They also banned reporting sample sizes for between subject conditions, because God forbid you divide that SD by the square root of N and multiply it by 1.96 and subtract it from the mean and guesstimate whether that value is smaller than 0.
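For what it’s worth, the guesstimate the editors are apparently trying to prevent takes two lines of arithmetic. A minimal sketch, with made-up numbers (the mean, SD, and N below are purely illustrative, not taken from any BASP paper):

```python
# Back-of-the-envelope 95% confidence interval from a reported mean, SD, and N.
# All three values are hypothetical, chosen only to illustrate the calculation.
mean, sd, n = 0.45, 1.20, 30

se = sd / n ** 0.5           # standard error of the mean
lower = mean - 1.96 * se     # approximate lower bound of the 95% CI
upper = mean + 1.96 * se     # approximate upper bound

print(f"95% CI roughly [{lower:.2f}, {upper:.2f}]")  # does the interval include 0?
```

Which is exactly why the sample sizes have to disappear as well: with a mean, an SD, and an N on the page, any reader can reconstruct the forbidden interval.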

It reminds me of alcoholics who go into detox and have to hand in their perfume, before they are tempted to drink it. Thou shall not know whether a result is significant – it’s for your own good! Apparently, thou shall also not know whether effect sizes were estimated with any decent level of accuracy. Nor shall thou include the effect in future meta-analyses to commit the sin of cumulative science.

There are some nice papers where the p-value ban has no negative consequences. For example, Swab & Greitemeyer (2015) examine whether indirect (virtual) intergroup contact (seeing you have 1 friend in common with an outgroup member, vs. not) would influence intergroup attitudes. It did not, in 8 studies. P-values can’t be used to accept the null hypothesis, and these authors explicitly note they aimed to control Type 2 errors based on an a priori power analysis. So, after observing many null results, they drew the correct conclusion that if there was an effect, it was very unlikely to be larger than what the theory on evaluative conditioning predicted. After this conclusion, they logically switch to parameter estimation, perform a meta-analysis, and, based on a Cohen’s d of 0.05, suggest that this effect is basically 0. It’s a nice article, and the p-value ban did not make it better or worse.

But in many other papers, especially those where sample sizes were small, and experimental designs were used to examine hypothesized differences between conditions, things don’t look good.

In many of the articles published in BASP, researchers make statements about differences between groups. Without the need to report p-values, whether or not these differences provide support for their hypotheses becomes a moving target. For example, some authors interpret a d of 0.36 as support for an effect, while in the same study, a Cohen’s d < 0.29 (we are not told the exact value) is not interpreted as an effect. You can see how banning p-values solved the problem of dichotomous interpretations (I’m being ironic). Also, with 82 people divided over three conditions, the p-value associated with the d = 0.36 that was interpreted as an effect is around p = 0.2. If BASP had required authors to report p-values, they might have interpreted this effect a bit more cautiously. And in case you are wondering: No, this is not the only non-significant finding interpreted as an effect. Surprisingly enough, it seems to happen a lot more often than in journals where authors report p-values! Who would have predicted this?!
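If you want to check that p = 0.2 claim yourself, converting a Cohen’s d back to a t-statistic is straightforward. A rough sketch, assuming roughly 27 participants per cell for the pairwise comparison (the per-cell n is my assumption; the paper only gives 82 people over three conditions):

```python
# Rough check: what two-sided p-value corresponds to d = 0.36 with ~27 per group?
from scipy import stats

d = 0.36
n1 = n2 = 27  # assumed cell sizes (82 participants spread over three conditions)

t = d * (n1 * n2 / (n1 + n2)) ** 0.5  # convert Cohen's d to an independent-samples t
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df)        # two-sided p-value

print(f"t({df}) = {t:.2f}, p = {p:.2f}")  # about t(52) = 1.32, p = 0.19
```

Under these assumptions the ‘effect’ is nowhere near conventional significance, which is exactly the point.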

Saying one thing is bigger than something else, and reporting an effect size, works pretty well for simple effects. But how would you say there is a statistically significant interaction if you can’t report inferential statistics and p-values? Here are some of my favorite statements.

“The ANOVA also revealed an interaction between [X] and [Y], η² = 0.03 (small to medium effect).”

How much trust do you have in that interaction from an exploratory ANOVA with a small to medium effect size of .03, partial eta squared? That’s what I thought.
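To get a feel for how weak an interaction with a partial eta squared of .03 can be, here is a back-of-the-envelope conversion to an F-ratio and p-value. The degrees of freedom are entirely my assumption (a 2 × 2 between-subjects design with about 100 participants); the actual design and sample size are not reported here, so treat this as an illustration of the order of magnitude, nothing more:

```python
# Convert a partial eta squared to F and p, given assumed degrees of freedom.
from scipy import stats

eta_p2 = 0.03
df_effect, df_error = 1, 96  # hypothetical: 2 x 2 between-subjects design, ~100 participants

F = (eta_p2 / (1 - eta_p2)) * (df_error / df_effect)  # F-ratio implied by partial eta squared
p = stats.f.sf(F, df_effect, df_error)

print(f"F({df_effect}, {df_error}) = {F:.2f}, p = {p:.2f}")  # roughly F = 2.97, p = 0.09
```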

“The main effects were qualified by an [X] by [Y] interaction. See Figure 2 for means and standard errors”

The main effects were qualified, but the interaction was not quantified. What does this author expect me to do with the means and standard errors? Look at them while humming ‘om’ and wait to become enlightened? Everybody knows these authors calculated p-values, and based their statements on those values.

In normal scientific journals, authors sometimes report a Bonferroni correction. But there’s no way you are going to Bonferroni those means and standard deviations, now is there? With their ban on p-values and confidence intervals, BASP has banned error control. For example, read the following statement:

Willpower theories were also related to participants’ BMI. The more people endorsed a limited theory, the higher their BMI. This finding corroborates the idea that a limited theory is related to lower self-control in terms of dieting and might therefore also correlate with patients BMI.

This is based on a two-sided p-value of 0.026, and it was one of 10 calculated correlation coefficients. Would a Bonferroni-adjusted p-value have led to a slightly more cautious conclusion?
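The arithmetic is simple enough to do in your head, but for completeness, a sketch using only the two numbers reported above:

```python
# Bonferroni correction for one correlation out of ten tests.
p_observed = 0.026
n_tests = 10

p_adjusted = min(1.0, p_observed * n_tests)  # Bonferroni-adjusted p-value: 0.26
alpha_per_test = 0.05 / n_tests              # or equivalently, a per-test alpha of 0.005

print(f"adjusted p = {p_adjusted:.2f}, per-test alpha = {alpha_per_test:.3f}")
```

An adjusted p of 0.26 (or an observed 0.026 held against a per-test alpha of 0.005) would indeed have suggested a more cautious conclusion.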

Oh, and if you hoped banning p-values would lead anyone to use Bayesian statistics: No. It leads to a surprisingly large number of citations to Trafimow’s articles where he tries to use p-values as measures of evidence, and is disappointed they don’t do what he expects. Which is like going to The Hangover part 4 and complaining it’s really not that funny. Except everyone who publishes in BASP mysteriously agrees that Trafimow’s articles show NHST has been discredited and is illogical.

In their latest editorial, Trafimow and Marks knock down some arguments you could, after a decent bottle of liquor, interpret as straw men against their ban of p-values. They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise.

The absence of p-values has not prevented dichotomous conclusions, nor claims that data support theories (which is only possible using Bayesian statistics), nor anything else p-values were blamed for in science. After reading a year’s worth of BASP articles, you’d almost start to suspect p-values are not the real problem. Instead, it looks like researchers find making statistical inferences pretty difficult, and forcing them to ignore p-values didn’t magically make things better.

As far as I can see, all that banning p-values has done is increase the Type 1 error rate in BASP articles. Restoring a correct use of p-values would substantially improve how well the conclusions authors draw actually follow from the data they have collected. The only expense, I predict, is a much lower number of citations to articles written by Trafimow about how useless p-values are.

19 comments:

  1. So how many papers in BASP used Bayesian statistics one year prior to the p-value ban versus one year after? Is there a qualitative difference?

    Replies
    1. I did not encounter any paper using Bayesian statistics in 2015. Note that it would have made sense in the 8-study paper I mention in the blog post, where they don't find support for their hypotheses, but even there, nothing.

    2. It's surprising that even now, with JASP being so easy to use, no one reported a BF. That's even less improvement than my low expectations expected.

    3. If you read the editorial, you'll see that they are not exactly encouraging authors to do Bayes Factors either. I personally really like Bayes Factors for hypothesis testing, but I would perhaps not risk submitting them to BASP after reading that editorial.

  2. This comment has been removed by the author.

  3. It's almost as if the problem is with the incentives of the publishing system, rather than with the specific ways in which those problems manifest themselves.

    I suspect that if p-values were declared illegal worldwide tomorrow, we would quickly see a consensus around d=.02 or r=.10 or pesq2=.02 or even BF=6 as the new shorthand for "Look what a clever scientist I am, can I have some more money now please?".

    On the other hand, change takes time. Many of these articles will have been in the pipeline when the journal announced its new policies. Every journey starts with a small step, etc. The problem is to determine when to examine one's progress on that journey and decide whether to carry on, or go home and have a cup of tea on your familiar comfy sofa.

  4. While I think it's not strictly incorrect, personally I feel it ought to be 'thou shalt' not 'thou shall' ;)

    I agree banning p-values without giving any alternative for hypothesis-testing was a silly move.

    Replies
    1. I agree (with the second point; I have no opinions on Shakespearian English).

  5. https://en.wikipedia.org/wiki/Basic_and_Applied_Social_Psychology

    Replies
    1. ha, thanks! I'm sure someone will remove it very soon ;)

  6. Enlightening post, thanks! One comment: "They don’t, and have never, discussed the only thing p-values are meant to do: control error rates." Fisher didn't think so, did he?! I mean, you don't really need the p-values; if you just want to control long-term error rates you could ask people to calculate the relevant test statistic and compare it to a pre-set critical value. Maybe that wouldn't give the feel of a continuous measure of strength of evidence that the p has.

  7. Interesting post! You write: "They also banned reporting sample sizes for between subject conditions" but I don't remember seeing a ban for this anywhere, and checked with the editor & he says they never banned reporting sample sizes for between subject conditions -- only p-values and traditional confidence intervals. Did I miss something? Thanks for all your work -- cheers

    Replies
    1. You are right - I don't think they banned them, but they are missing from many papers (maybe the majority). The editors/reviewers should have asked for them, but I don't really think they are intentionally banned. I was slightly exaggerating there. ;)

  8. It seems that there is a lot of editor bashing here. I think that is inappropriate. The important lesson to be learned from this ban is that, without NHST, most researchers--and readers of empirical research--are incapable of evaluating empirical data. The fact that researchers are struggling should not be used to mock Trafimow and Marks. If anything, their ban on NHST has helped make salient just how much of our critical thinking we have outsourced to misunderstood statistical procedures.

    Replies
    1. Hi Chris, I think the editors are to blame for not taking responsibility for checking the articles they publish more carefully than they have. Also, the surprisingly large number of citations to articles that are not good, and that suggest NHST is crap, annoys me: http://daniellakens.blogspot.nl/2015/11/the-relation-between-p-values-and.html Obviously the authors and reviewers can and should improve, but I'm criticizing the editorial strategy here.

    2. The fact that researchers are struggling isn't the grounds on which to bash the editors, but the very real fact that they truly misunderstand the statistics of significance tests is. I came to learn that through Trafimow's papers.

  9. Hi Daniel, could you please tell me how you have come to think that the type I error rate increased? You seem to believe (but correct me if I am wrong) that a p-value tells you whether a type I error has been made or not. But that is simply not true. If my decision criterion is, for instance, to reject when p is between .95 and 1.00, the type I error rate is the same as when I reject when p < .05. In both cases it is .05. So the first criterion, rejecting when, say, p = .99, provides perfect control of type I errors (but of course not of type II errors). So, unless one magically determines which null hypotheses are actually true, there is no way of determining whether or not a type I error has been made. A rejection of a true null is a type I error regardless of the value of p used to make the decision. (The idea that p-values tell you something about the probability of a type I error is called the local type I error fallacy).

    Replies
    1. Hi, the Type 1 error rate has increased because people stop controlling their error rates at 5% when reporting multiple tests. So it must logically be higher.

  10. The p-value should be there, just to validate methodological correctness, bring uniformity to research work, and strengthen the justification for the findings with respect to the specific terms of the work, but not to support the hypothesis as universal fact. Of course, we can encourage reporting power and effect size, because there are many studies where power is compromised. What I liked about Trafimow’s article is that it highlights the dishonest attempt of researchers to get their papers published in journals on the basis of p-values obtained with unrealistic elements like exceptionally low n (as small as 3), skewed distributions, non-homogeneity, etc. BASP might have grown fatigued with that type of paper. That is why they wrote "we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem" (from Trafimow & Marks, 2015, doi:10.1080/01973533.2015.1012991). Honest and judicious use of p or CI is always welcome.
