Comments on The 20% Statistician: Absence of evidence is not evidence of absence: Testing for equivalence

Never heard about it, but why don't you just u...

2017-11-08T06:31:15.875+01:00

Never heard about it, but why don't you just use the spreadsheet that comes with my 2017 article? And R is SUPER easy if you just want to use TOST. Like a simple calculator.

Have you had experience using XLSTAT 'add on'...

2017-11-08T01:45:48.380+01:00

Have you had experience using XLSTAT 'add on' software for Excel to calculate TOST? Their online tutorial makes it look simple. I am unfamiliar with R and to save me learning it, I thought this might be useful for equivalence testing. Any thoughts? Many thanks in advance.

Very good question. I would think so as well, but ...

2016-10-29T09:10:49.093+02:00

Very good question. I would think so as well, but I've not found a reference that does the simulations to show this. I will do them and let you know, when I publish a paper about this.
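A rough sketch of what such a simulation could look like, in Python rather than R. It assumes a known standard deviation of 1, two groups of equal size, and a z-test in place of the t-test; the function name and parameters are made up for illustration. Each one-sided test is run at the full alpha, with no Bonferroni correction, and equivalence is declared only when both reject.

```python
import math
import random
from statistics import NormalDist


def tost_rejection_rate(true_diff, bound, n, alpha=0.05, reps=20000, seed=1):
    """Share of simulated studies in which TOST declares equivalence.

    Assumes two groups of size n with a known SD of 1, so the observed
    mean difference is drawn directly from its sampling distribution.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    se = math.sqrt(2.0 / n)  # standard error of the mean difference
    rejections = 0
    for _ in range(reps):
        diff = rng.gauss(true_diff, se)
        p_low = 1 - norm.cdf((diff + bound) / se)  # H0: diff <= -bound
        p_high = norm.cdf((diff - bound) / se)     # H0: diff >= +bound
        # Equivalence is declared only if BOTH one-sided tests reject.
        if max(p_low, p_high) < alpha:
            rejections += 1
    return rejections / reps


# True effect sits exactly on the upper bound: any rejection is a Type I error.
error_rate = tost_rejection_rate(true_diff=0.5, bound=0.5, n=100)
# True effect is zero: rejections are correct decisions (power).
power = tost_rejection_rate(true_diff=0.0, bound=0.5, n=100)
print(error_rate, power)
```

Under these assumptions the error rate at the bound should stay close to alpha rather than doubling, because both tests have to reject before equivalence is claimed.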

Hi, I probably missed this, but wouldn't you n...

2016-10-29T09:07:07.248+02:00

Hi,
I probably missed this, but wouldn't you need a Bonferroni type of correction when looking at two tests?

I'm late to the game... I agree with nearly e...

2016-09-15T08:53:17.203+02:00

I'm late to the game...

I agree with nearly everything written in the post, except for one issue that I believe is crucial.

In the section under "Rejecting the presence of a meaningful effect" it reads:
"This means we can reject the null of an effect that is larger than d = 0.5 or smaller than d = -0.5 and conclude this effect is smaller than what we find meaningful (and you’ll be right 95% of the time, in the long run)."

The first part of the sentence is of course correct, but the second part makes a probabilistic claim about the truth of the alternative hypothesis, which cannot be made in the frequentist framework (yes, the sentence uses frequentist language, but the inference is about the truth of the hypothesis). If one wants to make such claims, one would need to use Bayes and a prior to go from P(data|H0) to P(H1|data).

I think a more accurate version of the cited sentence would be

"This means we can reject the null of an effect that is larger than d = 0.5 or smaller than d = -0.5 because the probability of the observed data given the hypothesis that |d| > 0.5 is smaller than 5%."

Maybe that doesn't sound very satisfying, but if one likes to make statements about the probability of hypotheses there is no way around a Bayesian approach.

If you only do an equivalence test after p > 0....

2016-07-19T04:17:27.089+02:00

If you only do an equivalence test after p > 0.05 isn't alpha now inflated for that test?

Hi, I'm thinking about turning this into a pap...

2016-06-22T15:14:48.092+02:00

Hi, I'm thinking about turning this into a paper - I will look into a power analysis for the TOST r script; it indeed makes sense to provide one!
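One way such a power analysis could be approximated, sketched in Python. This is not the author's script: it assumes the Fisher z transformation with standard error 1/sqrt(n - 3), symmetric bounds given as correlations, and a normal approximation throughout. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def tost_power_r(n, bound_r, true_r=0.0, alpha=0.05):
    """Approximate power of TOST for a correlation via Fisher's z.

    bound_r is the symmetric equivalence bound (e.g. 0.3 means -0.3 to 0.3).
    """
    norm = NormalDist()
    se = 1 / math.sqrt(n - 3)        # SE of the Fisher-transformed r
    z_crit = norm.inv_cdf(1 - alpha)
    bound = math.atanh(bound_r)      # bound on the Fisher z scale
    true_z = math.atanh(true_r)
    # Equivalence is declared when the observed z falls between these limits.
    upper = bound - z_crit * se
    lower = -bound + z_crit * se
    if upper <= lower:
        return 0.0                   # sample too small to ever declare equivalence
    return norm.cdf((upper - true_z) / se) - norm.cdf((lower - true_z) / se)


for n in (20, 50, 100, 200):
    print(n, round(tost_power_r(n, bound_r=0.3), 3))
```

With very small samples the two rejection limits cross and power is zero, which is why the function returns 0.0 in that case.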

Hi Daniel, thanks for making it so easy to conduct...

2016-06-22T15:13:00.110+02:00

Hi Daniel,
thanks for making it so easy to conduct these tests.

Are you planning on amending the syntax to provide a power analysis for TOST r (correlations)?
That would be really useful for me.

Yes, the null of non-equivalence or the null in eq...

2016-06-14T10:44:51.898+02:00

Yes, the null of non-equivalence or the null in equivalence tests is correct, the null of equivalence is not correct.

Hi, Daniel. Thanks for a very interesting post and...

2016-06-14T10:37:22.941+02:00

Hi, Daniel. Thanks for a very interesting post and for making your R code available.

In the conclusion, you refer to the "null of equivalence." Strictly speaking, shouldn't this be the null hypothesis of nonequivalence?

See Rogers, Howard, and Vessey (1993, p. 554): "There is a null hypothesis asserting that the difference between the two groups is at least as large as the one specified by the investigator [i.e., nonequivalence], and there is an alternative hypothesis asserting that the difference between two groups is smaller than the specified one [i.e., equivalence]."

Hi Daniel, interesting stuff! A couple more or le...

2016-06-03T18:53:27.142+02:00

Hi Daniel, interesting stuff! A couple more or less random thoughts on this:

1) Equivalence testing is usually used only when a researcher actually hypothesises that a particular null is true. But it can be used more widely: there's no reason we couldn't generally approach inference about any parameter as a problem of working out whether we can conclude that a parameter is trivially small, conclude that it is reasonably large, or conclude that there is too much uncertainty to say. We can do that by combining equivalence testing with traditional NHST, or using something like magnitude-based inference as used in sports science - see http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147311

2) The frequentist approach works fine, but one key advantage of a Bayesian approach here is that we can take into account the fact that most effects in psychology are small. I.e., we can place a prior that represents a belief that most effects aren't too far from zero. That makes it less likely that we'll conclude there's a substantial effect, but also more likely that we can conclude confidently that an effect is trivial.
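To make point 2 concrete, here is a minimal Python sketch. It assumes a normal prior centred on zero, a normal likelihood for the observed effect, and the standard conjugate update; the prior SD of 0.3 and the function names are illustrative choices, not values from the post.

```python
import math
from statistics import NormalDist


def posterior_normal(d_obs, se, prior_sd):
    """Posterior mean and SD for an effect with a N(0, prior_sd^2) prior."""
    shrink = prior_sd**2 / (prior_sd**2 + se**2)
    post_mean = shrink * d_obs               # estimate is pulled toward zero
    post_sd = math.sqrt(shrink) * se
    return post_mean, post_sd


def prob_within(d_obs, se, prior_sd, low, high):
    """Posterior probability that the effect lies between low and high."""
    mean, sd = posterior_normal(d_obs, se, prior_sd)
    post = NormalDist(mean, sd)
    return post.cdf(high) - post.cdf(low)


# Observed d = 0.3 with SE = 0.2, equivalence range -0.5 to 0.5.
informed = prob_within(0.3, 0.2, prior_sd=0.3, low=-0.5, high=0.5)
nearly_flat = prob_within(0.3, 0.2, prior_sd=100.0, low=-0.5, high=0.5)
print(round(informed, 3), round(nearly_flat, 3))
```

In this example the prior that favours small effects puts more posterior mass inside the equivalence range than the nearly flat prior does, which is the behaviour described above.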

Hi, if you set a smallest effect size of interest,...

2016-05-27T14:06:28.178+02:00

Hi, if you set a smallest effect size of interest, power up for it, and don't find a significant result, you might, but don't automatically, have evidence for an effect SMALLER than your SESOI. You could be in the 'undetermined' condition visualized above.

Hi Daniel. For good reason, Popper's principle...

2016-05-26T11:26:50.195+02:00

Hi Daniel. For good reason, Popper's principle of falsifiability has been a pillar of science, but even Gosset and Fisher recognised the imperfections of 0.05 as a cutoff. As you state, over- and under-powered tests will mask meaningful effects. In medicine and sport, there are two key questions about interventions: first, does the treatment/training work and second, if yes, how well? With equivalence-type trials where effects of a new therapy are compared with those of usual care, the conventional null hypothesis testing approach can, perhaps, be retained via the minimum clinically (or practically) important difference that is declared at the outset and that must be exceeded before the new treatment can be considered to be an improvement. The next stage is to evaluate if the improvement is cost effective. Apologies if I have missed something but that wasn't clear in your otherwise helpful account. Incidentally, the there-was-no-effect-(P > 0.05)-but-oh-yes-there-was-(d = 0.36) is the pantomime that arises from mixing null-hypothesis significance testing and magnitude-based inferences, especially when alpha (0.05) is stated in the methods section. The authors of such a statement are using oleaginous Uriah Heep statements to cover their backs but in fact, by so doing, confuse both themselves and readers.

Hi Remko, from my blog post above: One thing I no...

2016-05-24T06:15:37.241+02:00

Hi Remko, from my blog post above:

One thing I noticed while reading this literature is that TOST procedures, and power analyses for TOST, are not created to match the way psychologists design studies and think about meaningful effects. In medicine, equivalence is based on the raw data (a decrease of 10% compared to the default medicine), while we are more used to thinking in terms of standardized effect sizes (correlations or Cohen’s d). Biostatisticians are fine with estimating the pooled standard deviation for a future study when performing power analysis for TOST, but psychologists use standardized effect sizes to perform power analyses. Finally, the packages that exist in R (e.g., equivalence) or the software that does equivalence hypothesis tests (e.g., Minitab, which has TOST for t-tests, but not correlations) require that you use the raw data. In my experience (Lakens, 2013) researchers find it easier to use their own preferred software to handle their data, and then calculate additional statistics not provided by the software they use by typing in summary statistics in a spreadsheet (means, standard deviations, and sample sizes per condition). So my functions don’t require access to the raw data (which is good for reviewers as well). Finally, the functions make a nice picture such as the one above so you can see what you are doing.
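As an illustration of what a test from summary statistics can look like, here is a minimal Python sketch. It is not the R function from the post: the name and parameters are hypothetical, the bounds are in raw units rather than Cohen's d, and it uses a normal approximation where a proper implementation would use the t distribution.

```python
import math
from statistics import NormalDist


def tost_summary(m1, sd1, n1, m2, sd2, n2, low, high):
    """TOST for a difference between two independent means.

    Needs only means, SDs, and sample sizes. low and high are the
    equivalence bounds for the raw mean difference (m1 - m2).
    Returns the difference, its standard error, and the TOST p-value.
    """
    norm = NormalDist()
    diff = m1 - m2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    p_low = 1 - norm.cdf((diff - low) / se)  # H0: diff <= low
    p_high = norm.cdf((diff - high) / se)    # H0: diff >= high
    # The larger of the two one-sided p-values is reported.
    return diff, se, max(p_low, p_high)


diff, se, p = tost_summary(5.25, 1.0, 100, 5.20, 1.0, 100, low=-0.5, high=0.5)
print(round(diff, 2), round(se, 3), p < 0.05)
```

Equivalence would be declared when the returned p-value is below alpha, since that means both one-sided tests rejected.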

Thanks Nick for pointing out the very useful equiv...

2016-05-24T03:23:49.557+02:00

Thanks Nick for pointing out the very useful equivalence test. Perhaps you are not aware of the 'equivalence' R package, but if you are, how does your implementation differ?

Thanks. Looks like I actually understood for once....

2016-05-22T02:00:20.078+02:00

Thanks. Looks like I actually understood for once. :-) Your previous post(s) about one-tailed tests were a big part of why I asked.

Hi Nick - yes, all that is possible (and makes sen...

2016-05-21T07:02:08.080+02:00

Hi Nick - yes, all that is possible (and makes sense). You can test for noninferiority, for example. You can also set the equivalence range any way you like (from -0.1 to 0.5). The symmetric situation is easiest, and my code only works with symmetrical intervals (but I can update it). I discussed it in an earlier draft, but the blog was already so long, I removed it. But you know I am a big fan of one-sided tests if you have one-sided hypotheses, and that generalizes to equivalence tests.
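A small Python sketch of both ideas, assuming you already have an effect estimate and its standard error and are content with a normal approximation. The function names are hypothetical and this is not the code from the post.

```python
from statistics import NormalDist

NORM = NormalDist()


def tost_p(diff, se, low, high):
    """TOST p-value for an arbitrary (possibly asymmetric) range."""
    p_low = 1 - NORM.cdf((diff - low) / se)  # H0: effect <= low
    p_high = NORM.cdf((diff - high) / se)    # H0: effect >= high
    return max(p_low, p_high)


def noninferiority_p(diff, se, margin):
    """One-sided test of H0: effect <= -margin (new is meaningfully worse)."""
    return 1 - NORM.cdf((diff + margin) / se)


# Asymmetric equivalence range from -0.1 to 0.5, as in the example above.
print(tost_p(diff=0.2, se=0.05, low=-0.1, high=0.5))
# Noninferiority: only the lower bound matters, so only one test is run.
print(noninferiority_p(diff=0.0, se=0.1, margin=0.2))
```

The noninferiority test is simply the lower half of TOST, which is how the one-sided logic carries over to equivalence testing.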

Can you explain in nice small words why the equiva...

2016-05-21T02:29:22.267+02:00

Can you explain in nice small words why the equivalence range goes from -0.5 to +0.5, rather than from 0 to +0.5 or perhaps minus infinity to +0.5? That seems to imply that I'm equally interested in results in both directions. But if I'm testing medicines, for example, I don't really care (i.e., I don't have to distinguish between) whether my new pill is less good than the old one, or no good at all, or kills people; I just want to know if it's better than the old one.

Maybe what I'm saying is, this all sounds a bit two-tailed, so how would it fit into a one-tailed world? Or (most likely) have I missed something?

You'd have one of the two situations in the 4 ...

2016-05-20T22:11:32.248+02:00

You'd have one of the two situations in the 4 graphs at the end of the post, right? So either a significant meaningful effect, or an undetermined situation. As far as I understand you can perform both tests (NHST and EHT), and you interpret them both. And it seems stat training was a bit more complete where you had it than where I had it 10 years ago :)
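Running both tests and reading them together can be sketched like this in Python. The labels are my own wording for the four combinations, the function name is hypothetical, and a normal approximation stands in for the t-tests.

```python
from statistics import NormalDist


def classify(diff, se, low, high, alpha=0.05):
    """Combine a two-sided NHST against zero with TOST against the bounds."""
    norm = NormalDist()
    p_nhst = 2 * (1 - norm.cdf(abs(diff) / se))
    p_tost = max(1 - norm.cdf((diff - low) / se), norm.cdf((diff - high) / se))
    different = p_nhst < alpha   # effect differs from zero
    equivalent = p_tost < alpha  # effect lies within the equivalence range
    if equivalent and not different:
        return "equivalent, not different from zero"
    if equivalent and different:
        return "different from zero, but within the equivalence range"
    if different:
        return "different from zero, not equivalent"
    return "undetermined"


for diff, se in [(0.0, 0.1), (0.25, 0.1), (0.8, 0.1), (0.3, 0.3)]:
    print(diff, se, classify(diff, se, low=-0.5, high=0.5))
```

The last case is the one discussed here: neither test rejects, so the data support neither a difference nor equivalence.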

Similar to Rickard, I had equivalence testing in s...

2016-05-20T22:07:21.047+02:00

Similar to Rickard, I had equivalence testing in stats intro course ca. 10 years ago :)

What if your equivalence test fails to reject the equivalence hypothesis? Would you perform a post-hoc test for significant difference? Isn't this HARKing?

To be sure, Bayes factors can't avoid the inferential limbo either if the evidence isn't decisive (BF~1). But they can at least separate a "non-sig difference due to small power" (BF~1) from the lack of a difference (BF_01 > 1).

I recall reading about frequentist three-way hypothesis tests (reject H1, reject H0, needs more power), but I haven't seen them in use.

As far as I know, Neyman recommends to interpret a...

2016-05-20T21:39:06.295+02:00

As far as I know, Neyman recommends interpreting a p > 0.05 as either accepting the null, or 'remaining in doubt'. It seems to me that equivalence is a nice way to differentiate between the two depending on the smallest effect size of interest (and assuming you did not have 99% power for that effect size). I don't know how it is related to a severity test - sounds like a useful blog post on your end!

It's not "intentions" that change bu...

2016-05-20T20:04:32.436+02:00

It's not "intentions" that change but rather the relevant error probabilities (as a result of things like optional stopping, cherry picking, biasing selection effects).

https://errorstatistics.com/2015/05/27/intentions-is-the-new-code-word-for-error-probabilities-allan-birnbaums-birthday/

It's very strange that users of tests wouldn't know how to interpret insignificant results when it's part of N-P testing. I'm curious as to how this use of equivalence testing compares with (a) power analysis and (b) a severity analysis of a negative result. See for example sections 3.1, 4.2 and 4.3 of Mayo and Spanos (for a one sample Normal test). A severity analysis doesn't require that you set a range of interest or equivalence.
http://www.phil.vt.edu/dmayo/personal_website/2006Mayo_Spanos_severe_testing.pdf

Very nice post. I remember we had this in a course...

2016-05-20T19:00:33.813+02:00

Very nice post. I remember we had this in a course I TA'd about seven years ago (I was an undergrad then), but I also remember I didn't really see the point over simply eyeballing the CI. And then I totally forgot about this until you brought it up.

(The reason why we had it was probably that they had Minitab as well as SPSS! But I honestly don't remember.)

What are, in your opinion, the benefits over simply calculating a 95% CI around the observed d? I can think of two: 1) eyeballing is not very precise, and 2) p can be used as a continuous measure more easily. Are there any others? Am I missing something completely?
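On the CI comparison, one thing worth checking is which interval corresponds to the test. The Python sketch below assumes a normal approximation and uses hypothetical function names; it compares the TOST decision at alpha = 0.05 with the rule "the 90% CI lies entirely inside the equivalence bounds".

```python
from statistics import NormalDist

NORM = NormalDist()


def tost_p(diff, se, low, high):
    """Larger of the two one-sided p-values."""
    p_low = 1 - NORM.cdf((diff - low) / se)
    p_high = NORM.cdf((diff - high) / se)
    return max(p_low, p_high)


def confidence_interval(diff, se, level):
    """Symmetric normal-theory interval at the given confidence level."""
    z = NORM.inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


def ci_inside_bounds(diff, se, low, high, alpha=0.05):
    """Check whether the (1 - 2 * alpha) CI sits inside the bounds."""
    lo, hi = confidence_interval(diff, se, 1 - 2 * alpha)
    return low < lo and hi < high


# The two decision rules agree for every combination tried here.
for d in (0.0, 0.2, 0.4):
    for se in (0.05, 0.1, 0.2):
        print(d, se, tost_p(d, se, -0.5, 0.5) < 0.05, ci_inside_bounds(d, se, -0.5, 0.5))
```

So under these assumptions the matching interval is the 90% CI, not the 95% CI, and the p-value adds a continuous summary of how far inside the bounds the interval sits.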

Thank you for this nice post. An analogous procedu...

2016-05-20T18:14:45.409+02:00

Thank you for this nice post. An analogous procedure exists in Bayesian estimation: Check whether the posterior credible interval falls within the SESOI. I like the Bayesian version better than the frequentist version because frequentist confidence intervals change when the stopping or testing intentions change, but Bayesian intervals don't depend on those intentions. If interested, see Ch 12 of DBDA2E, or pp. 16-17 of this manuscript.