Comments on The 20% Statistician: Examining Non-Significant Results with Bayes Factors and Equivalence Tests

Hi Jay, what you are describing is statistically e...

2017-03-04T06:47:51.454+01:00

Hi Jay, what you are describing is statistically equivalent to combining TOST and NHST. See my explanation in the preprint.

Regarding performing both a test of the null hypot...

2017-03-04T06:11:22.421+01:00

Performing both a test of the null hypothesis that the effect is 0 and an equivalence test that the effect is, say, smaller than .5 in absolute value seems to suggest that the investigators are confused. If they believe that effects smaller than .5 in absolute value are equivalent to 0 for all practical purposes, then why would they care whether the null hypothesis that the effect is 0 is rejected? Clearly, rejecting that null would not imply that the effect is large enough to be considered different from 0 for all practical purposes.

Instead, it seems to me what would matter is whether the confidence or credible interval were (a) entirely within the equivalence limits, (b) entirely outside the equivalence limits, or (c) straddling an equivalence limit. From case (a) we would infer equivalence; from case (b) we would infer superiority; and case (c) would be indeterminate.
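The three cases can be written down as a small decision rule. This is only a sketch: the function name, the labels, and the assumption of symmetric equivalence limits at -bound and +bound are illustrative and not taken from the post or the preprint.

```python
def classify_interval(ci_low, ci_high, bound):
    """Classify an interval against symmetric equivalence limits (-bound, +bound).

    Hypothetical helper: the names and labels are illustrative only.
    """
    if ci_low > ci_high or bound <= 0:
        raise ValueError("need ci_low <= ci_high and bound > 0")
    # (a) interval entirely within the equivalence limits
    if -bound < ci_low and ci_high < bound:
        return "equivalent"
    # (b) interval entirely outside the equivalence limits
    if ci_low > bound or ci_high < -bound:
        return "outside"
    # (c) interval straddles at least one equivalence limit
    return "indeterminate"


print(classify_interval(-0.2, 0.3, 0.5))
print(classify_interval(0.6, 1.1, 0.5))
print(classify_interval(0.3, 0.8, 0.5))
```

An interval that touches a limit exactly is treated as indeterminate here; that is a choice made for the sketch, not something the comment specifies.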

I finally got convinced there's no problem wit...

2017-02-03T18:17:48.213+01:00

I finally got convinced there's no problem with doing TOST and t-test on the same set of data. What is still unintuitive to me is that this combination makes significant results more frequent without inflating alpha. I understand the reason is that both null hypotheses are mutually exclusive. Thanks for your patience :-)

My full response here: https://medium.com/@mazorma...

2017-02-02T11:43:20.232+01:00

My full response here: https://medium.com/@mazormatan/cant-have-your-tost-and-eat-it-too-f55efff0c85e#.a2vl4umpq

Thanks for the clarification, Bill! "The Bay...

2017-02-01T09:36:10.982+01:00

Thanks for the clarification, Bill!

"The Bayesian prior was a somewhat subjective illustration of how someone who believed in the effect would describe that belief."

I appreciate that it's difficult to quantify subjective beliefs, particularly if you don't share them. But would someone holding this belief expect that the ratio of the treatment effect and the standard deviation was the same for all ten outcome variables? I.e., that while the variability of the data for DV1 may be larger than that for DV2, the treatment effect for DV1 would be correspondingly larger as well to produce the same ratio?

This isn't a criticism of your study. But I don't understand, in general, why one would express predictions in standardised effect sizes. 'Because we don't know about the raw effect size' isn't really a strong argument, because you need the raw effect size to calculate the standardised effect size.
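To make the arithmetic behind this point concrete, here is a minimal sketch. The standard deviations below are invented purely for illustration and are not values from the study; the only fact used is that a standardised effect is the raw mean difference divided by the standard deviation.

```python
def raw_effect(d, sd):
    """Raw mean difference implied by a standardised effect d and a standard deviation sd."""
    return d * sd


def standardised_effect(raw, sd):
    """Standardised effect (mean difference divided by the standard deviation)."""
    return raw / sd


# Hypothetical outcome variables with very different variability.
sd_dv1, sd_dv2 = 10.0, 2.0

# Predicting d = 0.5 for both amounts to predicting different raw effects.
print(raw_effect(0.5, sd_dv1), raw_effect(0.5, sd_dv2))
```

Under these made-up numbers, the same d of 0.5 corresponds to a raw difference of 5 units on one outcome and 1 unit on the other.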

@Daniël: Thanks for the link.

I don't understand. In both cases the error ra...

2017-01-31T16:40:10.070+01:00

I don't understand. In both cases the error rate of each of the tests is kept under alpha, and in both cases it's meaningless to talk about alpha for single tests because the tests are the same, only with different rejection areas. The serious problem here is that at least one of the null hypotheses is always false: either d!=0, or d==0, and then it lies within the equivalence interval. This makes alpha completely meaningless, because there is no null distribution (so to correct my previous comment, alpha is not 1, it is just undefined).

I think there is a difference with the 2-tail exam...

2017-01-31T15:07:52.746+01:00

I think there is a difference with the 2-tail example, namely that in that case, you are testing: 1) d > 0, 2) d < 0, 3) d = 0. If d = 0 is true, but you test 1 and 2 with 5% alpha, your overall alpha is actually 10% when d = 0. But with equivalence tests, the t-test has a 5% error rate when d = 0, and the TOST test only has a 5% error rate when d <> 0. I think that's a difference.

Well, you can use the exact same argument to justi...

2017-01-31T14:33:53.148+01:00

Well, you can use the exact same argument to justify doing a right-tailed t-test and moving on to a left-tailed t-test only if the first is not significant. Either you make a type 1 error on the first test or on the second; it can never be both. In the example from my previous comment, you will reject at least one hypothesis for every possible combination of x̄ and σ, so your alpha is 1! Each of the tests is legitimate; it's the combination that's problematic.
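The one-tailed analogy can be checked numerically. The sketch below assumes a normal (z) approximation with a true effect of exactly 0; the function name is hypothetical.

```python
from statistics import NormalDist


def sequential_one_tailed_error(alpha=0.05):
    """Chance of at least one rejection when a right-tailed test is followed by
    a left-tailed test on the same data and the true effect is 0.

    Assumes a z-test; the two rejection regions are disjoint, so their
    probabilities add.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha)   # one-sided critical value, about 1.645
    p_right = 1 - z.cdf(crit)     # reject in the right tail
    p_left = z.cdf(-crit)         # reject in the left tail
    return p_right + p_left


print(round(sequential_one_tailed_error(0.05), 4))
```

Both rejection regions are reachable under the same null (effect of 0), which is why the error rates add to 2 * alpha in this case.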

Hi, yes, they are performed on the same data, but ...

2017-01-31T14:19:47.969+01:00

Hi, yes, they are performed on the same data, but either 1) the true effect is 0, so you can make a Type 1 error for the t-test, but not for the equivalence test, or 2) The true effect is <> 0, so you can make a Type 1 error for the equivalence test, but not for the t-test. Doesn't this solve the problem? It's an interesting question, and it is very well possible that I am missing something.

Corrections: should be: Rejection areas for the fi...

2017-01-31T13:49:44.552+01:00

Corrections: should be:
Rejection areas for the first test are t>1.98 OR t<-1.98, i.e., x̄/(σ/sqrt(100))=10*x̄/σ>1.98 OR 10*x̄/σ<-1.98, i.e., x̄/σ>0.198 OR x̄/σ<-0.198

More importantly, rejection areas for the second test are 1.66/sqrt(100)-0.5<x̄/σ<-1.66/sqrt(100)+0.5, i.e., -0.33<x̄/σ<0.33
The important point holds though - both have the same null distribution given the df, and thus should not be treated as independent tests.
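The corrected cut-offs can be reproduced with a few lines of arithmetic. This sketch uses the critical values quoted in the comment (1.98 two-sided, 1.66 one-sided) and, like the comment, treats the sample standard deviation as σ; the function names are hypothetical.

```python
import math


def rejection_cutoffs(n, t_two_sided, t_one_sided, bound):
    """Cut-offs for x̄/σ, expressed in standard-deviation units."""
    se = 1 / math.sqrt(n)                  # standard error of x̄/σ
    nhst_cut = t_two_sided * se            # t-test rejects when |x̄/σ| > nhst_cut
    tost_cut = bound - t_one_sided * se    # TOST rejects when |x̄/σ| < tost_cut
    return nhst_cut, tost_cut


def outcomes(ratio, nhst_cut, tost_cut):
    """Return (significant, equivalent) for an observed x̄/σ."""
    return abs(ratio) > nhst_cut, abs(ratio) < tost_cut


nhst_cut, tost_cut = rejection_cutoffs(100, 1.98, 1.66, 0.5)
print(round(nhst_cut, 3), round(tost_cut, 3))
print(outcomes(0.25, nhst_cut, tost_cut))
```

With these numbers, any observed x̄/σ whose absolute value lies between roughly 0.198 and 0.334 falls in both rejection regions at once, so the result is both statistically significant and statistically equivalent.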

Hi Daniel, Thanks for restoring my reply :-) The ...

2017-01-31T13:14:02.392+01:00

Hi Daniel,
Thanks for restoring my reply :-)

The fact that you can have a finding that is both statistically equivalent and statistically significant is not evidence for the two tests being independent. My point is that the two tests are performed on the same statistic under some rearrangement of terms, and thus are actually the same test with different rejection areas, just like right-tailed and left-tailed t tests.

Let's take a concrete example:
Say I sample 100 samples, with mean x̄ and std σ. Before sampling, I decided to perform
1. a t-test, to test whether μ==0
2. an equivalence test, to test whether d>0.5 or d<-0.5

Rejection areas for the first test are t>1.98 OR t<-1.98, i.e., x̄/(σ/sqrt(100))=10*x̄/σ>1.98 OR 10*x̄/σ<-1.98, i.e., x̄/σ>0.198 OR 10*x̄/σ<-0.198

Rejection area for the second test is 0.5>d>-0.5, i.e., 0.5>x̄/σ>-0.5.

The statistic x̄/σ has one null distribution. Once you know your sample size and the p value of test 1, you also know the p value of test 2. The same is true the other way around. It's not about family-wise correction, it's about maintaining the alpha for the same test, even when it's in disguise.

Hi Matan, when you have two distinct hypotheses, t...

2017-01-31T11:17:38.382+01:00

Hi Matan, when you have two distinct hypotheses, they each have their own alpha. In this case, the error rates you are controlling are the following: 1) I do not want to say there is a significant effect, when the null is true, more than 5% of the time (t-test), and 2) I do not want to say there is statistical equivalence, when there is actually a true effect that equals one of the equivalence bounds, more than 5% of the time (TOST).

You are controlling both of these error rates at 5%. Remember that you can have a finding that is statistically equivalent AND statistically significant. So there are really two different hypotheses. Each can be individually true or false. That's why you can perform both tests, and each has its own alpha level.

Now, you might want to say: I want to control my error rate over BOTH these tests. Yes, then you would need to correct (although how much, given their dependencies, is a difficult calculation). But then you might as well correct for all tests you do in the article, or all tests in your lifetime, and I explain here why that is not how NP testing works: http://daniellakens.blogspot.nl/2016/02/why-you-dont-need-to-adjust-you-alpha.html
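The two error rates described above can be illustrated with a small calculation. The sketch assumes a one-sample z-test with known standard deviation of 1, n = 100, and equivalence bounds of ±0.5; these settings and the function name are illustrative assumptions, not the analysis from the preprint.

```python
from statistics import NormalDist


def rejection_probs(d, n=100, bound=0.5, alpha=0.05):
    """Return (P(t-test significant), P(TOST declares equivalence)) for true effect d.

    Assumes a z approximation with known SD of 1, so x̄ ~ Normal(d, 1/sqrt(n)).
    """
    std = NormalDist()
    se = 1 / n ** 0.5
    z_two = std.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_one = std.inv_cdf(1 - alpha)       # one-sided critical value for TOST
    xbar = NormalDist(mu=d, sigma=se)

    # t-test (z-test here): reject when |x̄| exceeds z_two * se
    p_sig = xbar.cdf(-z_two * se) + (1 - xbar.cdf(z_two * se))

    # TOST: declare equivalence when x̄ lies inside (-bound + z_one*se, bound - z_one*se)
    lo, hi = -bound + z_one * se, bound - z_one * se
    p_equiv = xbar.cdf(hi) - xbar.cdf(lo) if hi > lo else 0.0
    return p_sig, p_equiv


print(rejection_probs(0.0))   # true effect 0: only "significant" is an error
print(rejection_probs(0.5))   # true effect at the bound: only "equivalent" is an error
```

Under these assumptions, the erroneous conclusion in each scenario (a significant effect when d = 0, equivalence when d sits on the bound) occurs about 5% of the time, while the other conclusion in that scenario is not an error.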

Hi, the very bad spam filter had flagged it as spa...

2017-01-31T11:04:31.799+01:00

Hi, the very bad spam filter had flagged it as spam (and lets through many messages that are clearly spam on other posts!). I restored your message.

Hi Daniel, Following our twitter conversation I wr...

2017-01-31T10:47:43.162+01:00

Hi Daniel,
Following our twitter conversation I wrote a comment and it now disappeared - have you erased it?
Thanks,
- Matan

I don't think the adoption of TOST for the pur...

2017-01-31T10:36:20.486+01:00

I don't think the adoption of TOST for the purpose of "examining non-significant results" (as in the title) or "interpreting non-significant results" (as in the text body) is responsible from a frequentist perspective, for two reasons. First, performing a significance test conditioned on a null result in a previous test increases your type 1 error rate by definition. Second, even if the TOST is preregistered together with the usual t-test, the two tests cannot be treated as two independent tests, and so the desired alpha should be split between them according to the researcher's prior.

Once a null result has been observed, all of the alpha has been used up, and the error rates of any additional test on the same contrast cannot even be approximated (e.g., http://www.sciencedirect.com/science/article/pii/S0001691814000304). This is one reason why the adoption of Bayes factors as a tool for examining non-significant results can't be justified from an NP perspective, and only makes sense from a Bayesian one. I think using TOST after observing a null result is no different from doing a right-tailed t-test, getting a null result, and then performing a left-tailed test on the same set of data. Alpha is guaranteed to inflate. TOST is exactly the same, except that instead of the right or left tails, you also spread your alpha over the center of the distribution. This additional rejection area should somehow be paid for.

The second argument follows from the first one. Even if you perform the two tests (t and TOST) to test two separate hypotheses (one that the effect is different from zero, the other that it is outside the equivalence region), you should split your alpha between the tests because there's a one-to-one mapping between the null distributions of the two tests. All you do is increase your rejection area, without paying for it.

Kruschke has a nice way to compensate for this additional rejection area (ROPE). The equivalent for our case would be that once you decided on your equivalence bounds, they should also be used for your original t-test (i.e., instead of showing the CI doesn't include 0, it should not intersect with the equivalence region). This way bigger equivalence regions don't only make "accepting the null" easier, but also make it harder to reject it. Not sure this has the desired frequentist properties (maintain alpha) - but would be interesting to examine.

Cheers,
- Matan

Hi Jan, A lot of good points here. I'd like t...

2017-01-31T01:10:40.164+01:00

Hi Jan,

A lot of good points here. I'd like to make a quick clarification about the alternative hypothesis, RE:"It seems to me that... the authors didn't have an alternative hypothesis, but since they needed a power analysis and got a null result, they needed to come up with canned ones."

While we did expect to see a null result due to prior research, we did not know we would get a null result on this particular data set at the time of the pre-registration. In accordance with the rules, we were explicitly forbidden to touch the data until the pre-registration was complete.

The point alternative for power analysis and the expected value of the alternative prior distribution are different, because the former sets a minimum value, while the latter sets the centre of a symmetrical distribution. We went with a canned standardized effect for power analysis, because we were unable to find many analogous studies from which we could form a more precise alternative. The Bayesian prior was a somewhat subjective illustration of how someone who believed in the effect would describe that belief. The alternative hypothesis may not be entirely ideal, but we did have one, so to speak.

Best,
Bill

Hi - I discuss how using standardized ES will boot...

2017-01-30T18:54:01.098+01:00

Hi - I discuss how using standardized ES will bootstrap the use of unstandardized effect sizes in my paper https://osf.io/preprints/psyarxiv/97gpc/

Hello Daniël, I realise this isn't the focus...

2017-01-30T18:45:00.999+01:00

Hello Daniël,

I realise this isn't the focus of your blog post, but would you like to elaborate on the following?

"It is interesting to see the authors wanted to specify their alternative in terms of a ‘standardized effect size’. I fully agree that using standardized effect sizes is currently the easiest way to think about the alternative hypothesis"

We've had our discussions on standardised ESs, and as you may recall, I think they're overused. The paper you discuss nicely illustrates why I think so: For the power analysis on p. 7, the authors didn't derive their effect size under H1 from theory, practical considerations or previous work on the topic at hand. Rather, they went for the typical mean difference-to-sample standard deviation ratio from a likely p-hacked literature (pre-2003 social psychology).

For the Bayes factor analyses, they similarly resorted to everyone's favourite standardised effect size ('moderate = d of 0.5') for all 10 variables in Table 2. (I don't really understand why the effect size would now be a different one, but I've only scanned the paper.) Is that really the authors' alternative hypothesis or just a convenient fiction?

It seems to me that rather than having an alternative hypothesis that, if the intervention 'worked', it would produce approximately a mean difference-to-sample standard deviation ratio of 0.4 (or 0.5+/-0.15), the authors didn't have an alternative hypothesis, but since they needed a power analysis and got a null result, they needed to come up with canned ones.

To be clear, I'm not blaming the authors here since few papers on power analysis and, presumably, Bayes factors provide guidance on working with genuine alternative hypotheses. But, to paraphrase John Tukey, are we really so uninterested in our hypotheses that we don't care about their units? So is using standardised effect sizes the easiest way to think about the alternative hypothesis or the most convenient way to avoid having to think about it?

I am sure there is ,ore modern work on Bayes Facto...

2017-01-30T18:28:06.281+01:00

I am sure there is more modern work on Bayes Factors in a Behrens context. The one I remember from the 1980s, when I was working on my PhD, is:

Bayes Factors for Behrens-Fisher Problems
Hari H. Dayal and James M. Dickey
Sankhyā: The Indian Journal of Statistics, Series B (1960-2002)
Vol. 38, No. 4 (Nov., 1976), pp. 315-328

Hi, you can have perfectly correct interpretation...

2017-01-30T13:51:42.955+01:00

Hi,

you can have perfectly correct interpretations of a Frequentist confidence interval without confusing them with post-data Bayesian interpretations. As a hard-core Bayesian, you might not like nuanced messages, but combining Frequentist and Bayesian inferences is my preferred approach, and I see strengths in both. If you don't like a nuanced message, I totally understand.

Hi, I think editors and reviewers are capable of c...

2017-01-30T13:43:29.102+01:00

Hi, I think editors and reviewers are capable of checking whether the pre-registration is followed through (even though making the pre-registration public after publishing the article makes total sense to me). One important aspect is that this format prevents publication bias. So it is excellent that these journals have pre-registration. As a reviewer of some articles at CRSP and other registered report journals, I see no problems or super secret sneaky stuff happening, but by all means, e-mail the editors of the journal; they might be willing to change their policy.

What?!? That defeats the entire purpose of pre-re...

2017-01-30T13:25:23.396+01:00

What?!?

That defeats the entire purpose of pre-registration?! Hiding pre-registration from the reader is the exact opposite of open science. In fact, I would argue that it is pseudo open science.

Registered Reports started out quite promising with your special issue (in which pre-registration information *was* included in the articles). As far as I am concerned, the Registered Report format has now been compromised already. Perhaps they should come up with a new format: SSRRs (Super Secret Registered Reports).

As long as pre-registration information is not publicly accessible to the reader, e.g. via a link in the paper, "Comprehensive Results in Social Psychology" most definitely is not "one of the highest quality journals" in my reasoning. In fact, I think it could be reasoned that they have set a pseudo (open) scientific precedent...

I agree wholeheartedly that just reporting Bayes F...

2017-01-30T12:57:08.127+01:00

I agree wholeheartedly that just reporting Bayes Factors is a very poor way to report data. I'm pretty sure any Bayesian would also agree, even those who support Bayes factors. The BF itself is just a random variable. I also agree that reporting the descriptive statistics you're recommending is wise as well, except that the CI should be a Bayesian CI because you're confusing the meaning of the CI you're using and the Bayes Factor.

A Bayesian states a belief and updates that belief based on evidence. A Bayesian can say, "I believe this coin is fair," and flip the coin once to see if she is right. She will of course correctly take this one flip as very little evidence to modify her belief and update it as she adds more evidence. This is a continuous process without "tests" per se. A Bayes Factor is not a test but is a way to quantify belief. The Bayes CI is a statement of where one believes a measure is likely to be with certainty attached to various components of it and a belief that it's not just one value but something flexible related to the density of the CI.

A frequentist has no ability to quantify the belief. He can only generate statistics with long-run probabilities, and therefore the scenario above is ridiculous. That doesn't mean that he won't update his beliefs based on tests with good frequentist properties, as any rational person would. It's just that the outcome of the test, or CI, doesn't actually measure the belief. A frequentist CI only gets its frequentist properties if the statement is that the true value is in the CI without equivocation and without any ability to say anything relative about values within the CI.

Therefore, the Bayes CI and frequentist CI are two very different things. Combining frequentist tests and Bayes Factors ends up producing a hodgepodge that does nothing to further the field and ends up confusing the fundamental meaning of both measures. Cheerful articles about how we should all be able to just get along and let's integrate Bayes and frequentist stats show a lack of understanding of both... that whole 80% I suppose.

HI, no, I don't have access to the pre-registr...

2017-01-30T12:27:53.846+01:00

Hi, no, I don't have access to the pre-registration. The pre-registration is handled by the editors. You can contact the authors if you want details (they were very responsive to my questions, and they might want to make it public). But the editors and reviewers at CRSP check the pre-registration; there is no requirement to make it public.