Recently, people have wondered why researchers seem to have
a special interest in replicating studies that demonstrated unexpected or
surprising results. In this blog post, I will explain why, statistically speaking, this makes sense.
When we evaluate the likelihood that findings reflect real effects, we need to take the prior probability that the null-hypothesis is true into account. Null-hypothesis significance testing ignores this issue, because p-values give us the probability of observing data at least as extreme as the data we collected (D), assuming H0 is true, or Pr(D|H0). If we want to know the probability the null-hypothesis is true, given the data, or Pr(H0|D), we need Bayesian statistics. I generally like p-values, so I will not try to convince you to use Bayesian statistics (although it’s probably smart to educate yourself a little on the topic), but I will explain how you can use calibrated p-values to get a feel for the probability that H0 and H1 are true, given some data (see Sellke, Bayarri, & Berger, 2001). This nicely shows how p-values can be related to Bayes Factors (see also Good, 1992; Berger, 2003).
Everything I will talk about can be applied with the help of
the nomogram below (taken from Held, 2010). On the left, we have the prior probability that H0 is
true. For now, let’s assume the null hypothesis and the alternative hypothesis
are equally likely (so the probability of H0 is 50%, and the probability of H1
is 50%). The middle line gives the observed p-value
in a statistical test. It goes up to p = .37, and for a mathematical reason (which will become clear below) cannot be used for higher p-values. The right scale is the posterior probability of the null-hypothesis, from almost 0 (it is practically impossible that H0 is true) to a 50% probability that H0 is true (where 50% means that after we have performed a study, H0 and H1 are still equally likely to be true).
By drawing straight lines between two of the scales, you can read off the
corresponding value on the third scale. For example, assuming you think H0 and
H1 are equally likely to be true before you begin (a prior probability of H0 of
50%), and you observe a p-value of .37, a straight line brings you to a posterior probability for H0 of 50%, which means the probability that H0 or H1 is true has not changed, even though we have collected data.
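The logic behind the nomogram is simply Bayes’ rule in odds form: the posterior odds of H0 against H1 are the prior odds multiplied by the Bayes Factor.

$$ \frac{\Pr(H_0 \mid D)}{\Pr(H_1 \mid D)} = \frac{\Pr(H_0)}{\Pr(H_1)} \times \frac{\Pr(D \mid H_0)}{\Pr(D \mid H_1)} $$

When H0 and H1 are equally likely a-priori, the prior odds are 1, so the posterior odds equal the Bayes Factor itself.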
If we observe a p = .049, which is a statistically significant difference with an alpha level of .05, the posterior probability that H0 is true is still a rather high 29%. The probability of the alternative hypothesis (H1) is 100% - 29% = 71%. With equal priors, this corresponds to a Bayes Factor (the probability of H0, given the data, divided by the probability of H1, given the data, or Pr(H0|D)/Pr(H1|D)) of 0.40, or 2.5 to 1 odds against H0. Bayesians do not consider this strong enough support against H0 (instead, it should be at least 3 to 1 odds against H0). This might be a good moment to add that these calculations are a best case scenario: the prior distribution for the effect under H1 is chosen in a way that gives the highest possible Bayes Factor against H0, so the real evidence against H0 is at best the value that follows from the nomogram, and usually weaker. Also, now that you’ve seen how easy it is to use the nomogram, I hope showing the Sellke et al. (2001) formula these calculations are based on won’t scare you away:
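The calibration puts a lower bound on the Bayes Factor in favor of H0, which means the evidence against H0 can never be stronger than this bound suggests:

$$ \frac{\Pr(D \mid H_0)}{\Pr(D \mid H_1)} \;\geq\; -e \, p \, \ln(p), \qquad \text{for } p < 1/e \approx .37 $$

Converting this bound into a posterior probability, with Pr(H0) as the prior probability of the null-hypothesis:

$$ \Pr(H_0 \mid D) \;\geq\; \left( 1 + \frac{1 - \Pr(H_0)}{\Pr(H_0) \times \left(-e \, p \, \ln(p)\right)} \right)^{-1} $$

For p = .049 and a prior probability of 50%, the bound on the Bayes Factor is -e × .049 × ln(.049) ≈ 0.40, which gives a posterior probability for H0 of 0.40/1.40 ≈ 29%: exactly the numbers we just read off the nomogram. This also explains why the middle scale stops at .37: that is 1/e, the point where the bound equals 1 and the data no longer shift the odds.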
What if, a-priori, the hypothesized alternative hypothesis seems at least somewhat unlikely? This is a subjective judgment, and difficult to quantify, but you often see researchers themselves describe a result as ‘surprising’ or ‘unexpected’. Take a moment to think how likely H0 should be for a finding to count as ‘surprising’ and ‘unexpected’. Let’s see what happens if you think the a-priori probability of H0 is 75% (or 3 to 1 odds for H0). Observing a p = .04 would in that instance still leave at least a 51% probability that H0 is true, and at most a 49% probability that H1 is true. That means that even though the observed data are unlikely, assuming H0 is true (or Pr(D|H0)), it is still more likely that H0 is true (Pr(H0|D)) than that H1 is true (Pr(H1|D)). I've made a spreadsheet you can use to perform these calculations (without any guarantees), in case you want to try out some different values of the prior probability and the observed p-value.
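In case you prefer code over a spreadsheet, here is a minimal Python sketch of the same calculation (the function name is just illustrative), which reproduces the numbers in this post:

```python
from math import e, log

def posterior_prob_h0(p, prior_h0):
    """Lower bound on Pr(H0|D), based on the Sellke et al. (2001) calibration.

    Only valid for p-values below 1/e (roughly .37).
    """
    if not 0 < p < 1 / e:
        raise ValueError("The calibration only applies for 0 < p < 1/e (~ .37).")
    min_bf_h0 = -e * p * log(p)                      # minimum Bayes Factor in favor of H0
    posterior_odds_h0 = (prior_h0 / (1 - prior_h0)) * min_bf_h0
    return posterior_odds_h0 / (1 + posterior_odds_h0)

# The examples from this post:
print(round(posterior_prob_h0(0.049, 0.50), 2))   # 0.29
print(round(posterior_prob_h0(0.04, 0.50), 2))    # 0.26
print(round(posterior_prob_h0(0.04, 0.75), 2))    # 0.51
```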
With a prior probability of 50%, a
p = .04 would give a posterior probability of 26%. To have the same posterior probability of 26% with a prior probability for H0 of 75%, the p-value would need to be p = .009. In other
words, with decreasing a-priori likelihood, we need lower p-values to achieve a comparable posterior probability that H0 is
true. This is why Lakens & Evers (2014, p. 284) stress that “When designing
studies that examine an a priori unlikely hypothesis, power is even more important:
Studies need large sample sizes, and significant findings should be followed by
close replications.” To have a decent chance of observing a low enough p-value, you need a lot of statistical power. When reviewing studies that use the words 'unexpected' and 'surprising', be sure to check whether, given the a-priori probability of H0 (however subjective this assessment is), the p-values lead to a decent posterior probability that H1 is true. If we did this consistently and fairly, there would be a lot less complaining about effects that are 'sexy but unreliable'.
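To see where the p = .009 comes from, you can also turn the calculation around and search for the p-value that produces a given posterior probability at a given prior. A small sketch (the bisection approach and the function name are mine, just for illustration):

```python
from math import e, log

def required_p(target_post_h0, prior_h0, tol=1e-10):
    """Find the p-value at which the calibrated posterior Pr(H0|D) equals the target,
    by bisection on the interval (0, 1/e)."""
    def posterior(p):
        min_bf_h0 = -e * p * log(p)
        odds = (prior_h0 / (1 - prior_h0)) * min_bf_h0
        return odds / (1 + odds)

    lo, hi = 1e-12, 1 / e - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # the posterior increases with p on this interval, so move the boundary accordingly
        if posterior(mid) < target_post_h0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(required_p(0.26, 0.75), 3))  # roughly 0.009
```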
A consequence of this statistical reality is that, given two studies with equal sample sizes that yielded results with identical p-values, researchers who choose to replicate the more ‘unexpected and surprising’ finding are doing science a favor. After all, that is the study where H0 still has the highest posterior probability, and thus the finding where the probability that H1 is true is still relatively low. Replicating the more uncertain result leads to the greatest increase in the posterior probability that H1 is true. You can disagree about which finding is subjectively judged to be a-priori less likely, but the choice to replicate a-priori less likely results (all else being equal) makes sense.
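As a toy illustration of that last point, here is a sketch that applies the calibrated bound twice, under the simplifying assumptions that the bound behaves like an actual Bayes Factor, that the posterior after the original study can serve as the prior for the replication, and that the replication yields the same p = .04:

```python
from math import e, log

def update(prior_h0, p):
    """One calibrated update of Pr(H0), treating the Sellke et al. bound as the Bayes Factor."""
    bf_h0 = -e * p * log(p)
    odds_h0 = (prior_h0 / (1 - prior_h0)) * bf_h0
    return odds_h0 / (1 + odds_h0)

for prior_h0, label in [(0.50, "a-priori plausible"), (0.75, "a-priori surprising")]:
    after_original = update(prior_h0, 0.04)
    after_replication = update(after_original, 0.04)
    gain = (1 - after_replication) - (1 - after_original)   # increase in Pr(H1|D)
    print(f"{label}: Pr(H1|D) goes from {1 - after_original:.2f} "
          f"to {1 - after_replication:.2f} (gain of {gain:.2f})")
```

In this toy example the gain in the posterior probability that H1 is true is largest for the a-priori surprising finding, which is the statistical reason to replicate it first.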
Great explanation! One question though: since the calculation appears to rely on knowing a noncentral distribution for p(h1|d), planned effect size should play a role, yes? So is the size of the expected effect an implicit corollary of the priors? Or is it orthogonal to the priors?
The example focuses on p-values. Effect sizes stay the same as sample sizes increase, but p-values get smaller when sample sizes increase. Also, for small effects, you need larger sample sizes to get small p-values, whereas for big effects, smaller sample sizes suffice. So, the effect size and priors are independent, but both influence the sample size you need. Is that an answer to your question?
Thanks - really helpful, especially the example of how, with p = 0.04 and H0 and H1 equally likely a priori, H1 is still 26% likely to be wrong. Which is very counter-intuitive.
75% is 3:1 in odds.
Thanks for pointing out my error, and taking the effort to leave a comment! I appreciate it! Fixed.
Dan, just a quick note that the spreadsheet's link goes nowhere.