tag:blogger.com,1999:blog-987850932434001559.post4681941233549222118..comments2017-05-26T08:07:34.942-07:00Comments on The 20% Statistician: Always use Welch's t-test instead of Student's t-testDaniel Lakensnoreply@blogger.comBlogger24125tag:blogger.com,1999:blog-987850932434001559.post-43551346252237288942017-03-22T09:14:14.818-07:002017-03-22T09:14:14.818-07:00Hey, out of curiosity, what about in cases where y...Hey, out of curiosity, what about in cases where you are using an ANOVA with either one or multiple predictors? Kyle Morrisseyhttp://dogsbody.psych.mun.ca/rcdmc/Site/Kyle_Morrissey.htmlnoreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-83559159126723583832017-02-01T05:39:16.649-08:002017-02-01T05:39:16.649-08:00Thank you for this valuable information, it is rea...Thank you for this valuable information, it is really useful. Miguel Landa Blancohttp://www.blogger.com/profile/01204900184465837271noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-59170705488265236352016-05-28T02:25:08.590-07:002016-05-28T02:25:08.590-07:00This comment has been removed by a blog administrator.Romilda Garethhttp://www.blogger.com/profile/04571828795230778384noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-57240004794691983522016-05-09T07:21:48.982-07:002016-05-09T07:21:48.982-07:00Yes! Statistics posts with references and simulati...Yes! Statistics posts with references and simulations! Thanks! <br /><br />In your example, the data is normal, the variances are different but the means are the same. For my data on 4 or 6 different genotype groups, the data is not normal, and the means and variances differ between groups. I have chosen to log-transform my data, then perform Welch's test followed by the Games-Howell post hoc test. 
Is it correct to transform data in such a way before carrying out a Welch test, or can you not say without more information on the dataset?<br /><br />It is likely that my study has further problems, as the types of genetic crosses we do give 4 or 6 genotypes where 1/8th of a population has the mutant genotype we wished to study, so low n and varied sample sizes are inevitable. Since, as you say, 'Student's t-test is more powerful when variances and sample sizes are unequal and the larger group has the smaller variance' but it affects the Type I error rate, I am confused as to whether Welch's would be the best option for me in such a case. Would you recommend a non-parametric test such as Kruskal-Wallis instead? But Kruskal-Wallis assumes the same shaped distribution as far as I know, so would again not be correct for my data, I believe.<br /><br />I was a bit worried about having to explain Welch's test in my Viva to an older generation of scientists, especially as a young medical statistician had no idea what I was talking about recently. But I understand it much better now despite still having confusions about my own data!<br /><br />Thanks<br /><br />DeeAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-18246174946174656442015-12-21T03:25:23.475-08:002015-12-21T03:25:23.475-08:00At first it seems to be quite a difficult story wh...At first it seems to be quite a difficult story which will help in solving my problem but you are really very good things to clear the concepts and also it seems to be quite easy to understand now. 
<a href="http://www.spsshelp.org/how-to-run-a-manova-in-spss/" rel="nofollow">how to run a manova in spss</a>Thomas Leanorahttp://www.blogger.com/profile/13796930735622930749noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-82703495367789317512015-12-21T02:08:48.884-08:002015-12-21T02:08:48.884-08:00This comment has been removed by the author.Mark Dawkinshttp://www.blogger.com/profile/03049412878821578827noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-49145893609337767942015-01-29T09:29:46.403-08:002015-01-29T09:29:46.403-08:00Depends on your H0. It's true if your H0 says ...Depends on your H0. It's true if your H0 says 'mu 1 = mu 2' (i.e. two populations with the same mean), not if it says 'x1. and x2. are drawn from the same population'. If it's the latter H0 you're interested in (a decent choice given randomisation), you could actually make the case that the t-test (as well as the Mann-Whitney or permutation tests) outperforms the Welch test in detecting that the two populations are indeed different.Janhttp://www.blogger.com/profile/17765078332699225416noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-70744475134741515882015-01-29T09:00:17.651-08:002015-01-29T09:00:17.651-08:00This comment has been removed by the author.Janhttp://www.blogger.com/profile/17765078332699225416noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-52194743592440481922015-01-29T08:32:33.131-08:002015-01-29T08:32:33.131-08:00I think these are good questions, that require dat...I think these are good questions that require data. Setting the difference between means to 0 while assuming a difference in variance is the only way to examine the Type 1 error rate of a test, but does it happen in practice? I think it might, depending on the field you work in, but it's really an empirical question. 
<br /><br /><br /><br />Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-44419472057431662532015-01-29T08:13:12.054-08:002015-01-29T08:13:12.054-08:00"The R code below examines the Type 1 error r..."The R code below examines the Type 1 error rate of a hypothetical study where 38 participants were assigned to condition X, and 22 participants were assigned to condition Y. The mean score on some DV in both groups is the same (e.g., 0), so there is no effect, but the standard deviations between groups differ, with the SD in condition X being 1.11, and the SD in condition Y being 1.84."<br /><br />What kind of experimental manipulation would lead to identical means but affect variance? And wouldn't the upshot of such a randomised experiment have to be that there <i>was</i> an effect - just not in terms of the mean?<br /><br />I wholeheartedly agree that Levene's test (or a test for normality or <a href="http://janhove.github.io/silly%20significance%20tests/2014/09/26/balance-tests/" rel="nofollow">covariate balance</a> for that matter) is overused in randomised experiments, and I also think that the Welch test is a better default than the normal t test in <i>non-randomised</i> experiments.<br />But doesn't randomisation allow us to boldly assume that 'null hypothesis = no effect' comprises both 'no mean shift' <i>and</i> 'no change in variance', i.e. the assumption that both groups were drawn from <i>the same population</i> (whatever it may be)? 
(And use permutation tests while we're at it.)Janhttp://www.blogger.com/profile/17765078332699225416noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-54676300650552733882015-01-27T16:15:06.287-08:002015-01-27T16:15:06.287-08:001-Thanks for answering :) i think I'm with you...1-Thanks for answering :) i think I'm with you on the glass delta, and I'll be interested to hear about different robust effect size measures when you get around to writing about them. <br /><br />3-But what do we do if they disagree again once we replicate? Replicate again? But it's true they agree most of the time! <br /><br />4-that's true, and it's worthwhile ;)nicebrainhttps://nicebrain.wordpress.com/noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-62908588788514160402015-01-27T08:31:33.356-08:002015-01-27T08:31:33.356-08:00Daniel; thanks for your extensive answers.Daniel; thanks for your extensive answers.Joost de Winterhttps://sites.google.com/site/jcfdewinter/noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-79020145014961478982015-01-27T08:23:25.466-08:002015-01-27T08:23:25.466-08:00Hi Joost,
I provide one example, and then referen...Hi Joost,<br /><br />I provide one example, and then reference an extensive literature that has examined this issue in detail, with a vast number of different values, in hundreds of simulations. This is not my idea, it is not debated, I'm just explaining it. The references are there, so if you care enough about this topic to run simulations, read the literature I am summarizing. <br /><br />There is a very specific set of effect size/sample size combinations where Student's t-test has a little (but not enough to make it worthwhile) more power. This is discussed in the literature, and since you cannot be certain you have equal variances in tiny samples, you should always report Welch's t-test if you do science in the real world. I honestly don't care which specific combination you can come up with while running simulations in R where power values are a tiny bit to the advantage of Student's t. <br /><br />Your latest example has pointed out a situation where I need to add 2 participants to the smallest group to compensate for the difference in power, and you are already entering the domain of underpowered studies (which I hope you are not recommending as good practice). If you want to perform even more underpowered studies, you can boost the difference between Student's t-test and Welch's t-test even more. But none of this changes the very simple fact that you should always report Welch's t-test (or, if data are not normal, robust statistics, such as Yuen's method, an adaptation of Welch's method using trimmed means and Winsorized variances - but I'm leaving that for a future post).<br /><br />Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-87078397473941463092015-01-27T07:56:36.712-08:002015-01-27T07:56:36.712-08:00Dear Daniel,
1) One could try other combinations ...Dear Daniel,<br /><br />1) One could try other combinations using larger sample sizes (e.g., n1=100, n2=10, sd1=sd2=1), and this also shows that the t test can have a considerable power advantage (85% vs. 78% for a mean shift of 1). Anyway, I am unclear why you are saying that (n1=30, n2=5) is less informative than (n1=38, n2=22). Sometimes researchers face small sample sizes (and large effects). William Gosset developed the t test exactly because he wanted to obtain valid results for small samples. <br /><br />2) What I meant is NOT that unequal variances are rare IF you have normal distributions. What I meant is that unequal variances are usually accompanied by non-normal distributions. That is, unequal variances usually arise for some reason, such as a floor or ceiling measurement artefact. It would be interesting to explore how the Welch and t test compare in situations other than perfectly normal distributions. <br />For example, suppose one assumes a 5-point Likert scale, with the following density distribution: Totally disagree = 0%, Slightly disagree = 1%, Neutral = 3%, Slightly agree = 6%, Strongly agree = 90%, that is, a highly skewed distribution. Also assume n1=22, n2=100. Now sample the two vectors from the same population. The Type I error rate is now 13% for the Welch test, and 4% for the t test. Of course, this is an extreme situation, but it nicely illustrates that the Welch test can break down completely if the distributional assumptions are violated. <br /><br />3) I am also not sure whether the burden of proof should be on me now, while you are the one claiming that the Welch test should “always” be used. I just think you are making quite a generalization by saying that the Welch test is always preferred, because your conclusion seems to be based on 1 simulation using 1 set of parameters and 1 type of distribution. 
Counterexamples are easily found….<br /><br />Cheers, Joost<br />Joost de Winterhttps://sites.google.com/site/jcfdewinter/noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-59842833258617166712015-01-27T02:21:40.496-08:002015-01-27T02:21:40.496-08:00Finally, in your example, as in the previous examp...Finally, in your example, as in the previous example, the power difference is mitigated by running one additional participant (so 6 instead of 5). Unless you have something with practical relevance to report, I'm sticking with my recommendation to always report Welch's t-test.Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-42763917640773992372015-01-27T02:09:53.086-08:002015-01-27T02:09:53.086-08:00Also, please provide those empirical references fo...Also, please provide those empirical references for your earlier statement that unequal variances are rare in normal distributions. Thanks.Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-155032506549717422015-01-27T02:08:31.092-08:002015-01-27T02:08:31.092-08:00Joost, you do seem to love ridiculously small samp...Joost, you do seem to love ridiculously small sample sizes despite my previous statement that doing statistics on such samples is not telling you more than flipping a coin. A type 1 error rate of 3% when you want it to be 5% means the test is performing poorly - so your example is demonstrating that Welch's test is performing better, but you just don't realize it.Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-11965597877481161702015-01-27T00:55:18.943-08:002015-01-27T00:55:18.943-08:00Dear Daniel,
It is possible to devise more counte...Dear Daniel,<br /><br />It is possible to devise more counterexamples, even with normal distributions.<br /><br />For example, with n1 = 30, n2 = 5, sd1 = 1.2, sd2 = 1, I get a Type II error rate of 5% for the t test and 10% for the Welch test (for detecting a mean shift of 2). The Type I error rate is 3% for the t test and 6% for the Welch test. So we now have a situation where the Welch t test performs twice as poorly as the t test regarding both power and false positives.<br /><br />Best regards, JoostJoost de Winterhttps://sites.google.com/site/jcfdewinter/noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-45889072517598398962015-01-26T22:48:34.479-08:002015-01-26T22:48:34.479-08:00Hi Joost, yes, in tiny samples, power differences ...Hi Joost, yes, in tiny samples, power differences become more pronounced than 4%. But in your example (and as I note) adding one participant to the n=2 sample will mitigate the power differences (at least in one simulation I just ran). So if the difference in power can be solved by adding a single participant, I consider it practically unimportant. It's something people who like to run simulations might care about, but it is irrelevant for the researchers actually using statistics. <br /><br />Another reason I did not consider these situations is that doing inferential statistics on such small sample sizes is not very useful (http://daniellakens.blogspot.nl/2014/11/evaluating-estimation-accuracy-with.html). <br /><br />More importantly, because your Levene's test will have such low power you will never be able to convince yourself variances are equal - I don't think you can provide a reference for your claim that unequal variances do not occur a lot in normal distributions. As I explain, the assumption is, in its extreme form, practically untenable, and with different variances, Welch's test should be used. 
<br /><br />But feel free to provide empirical support that variances are often equal, that would be very interesting. Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-92132168659356222142015-01-26T22:39:50.474-08:002015-01-26T22:39:50.474-08:00Hi,
1) If the Type 1 error rate stays closer to t...Hi,<br /><br />1) If the Type 1 error rate stays closer to the nominal value, this means the 95% CI's also stay closer to their nominal values. About ES - good point! I guess different SD's means Glass's delta (using the SD from one of the two conditions) is more appropriate than Cohen's d in some situations, but it was surprisingly enough never mentioned. I think there's a future paper in that question! My advice now would be to use a robust Cohen's d (I'll write about it in a future blog post). This blog is just an intro to robust statistics later on.<br /><br />2) My idea here is simple: both approaches should (and will typically, except in huge N situations) agree (e.g., http://daniellakens.blogspot.nl/2014/09/bayes-factors-and-p-values-for.html). I want to improve the way we work, also for people who are not ready to switch to Bayes, and honestly, I think for practical purposes I don't care which test is reported. I only added the Bayesian t-test for the large N cases, and for situations where p-values are a little high (e.g., p > 0.03). In those cases, Bayes will let you rethink (if you were not smart enough to take the huge N or weakness of high p-values into account yourself!).<br /><br />3) When they disagree, replicate - neither test will be convincing. Both tests will agree (unless you have large samples, but since I believe p-values should be a decreasing function of sample size anyway.... http://daniellakens.blogspot.nl/2014/05/the-probability-of-p-values-as-function.html).<br /><br />4) Yes. But I'm not the person convincing people to use Bayes, I'm trying to show them why what they are currently doing is suboptimal even if they don't use Bayes. :)Daniel Lakenshttp://www.blogger.com/profile/18143834258497875354noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-10810638864806316702015-01-26T19:21:25.403-08:002015-01-26T19:21:25.403-08:00I agree that Levene's test serves no purpose. 
...I agree that Levene's test serves no purpose. However, I am not convinced about the general recommendation to ‘always’ use the Welch test instead of the t test. The results of the present simulation are clear, but it seems quite rare to me that unequal variances will occur for normal distributions. Usually, unequal variances arise because of some floor or ceiling effect. I would recommend investigating such conditions too.<br /><br />What is also not mentioned is the behavior of both tests for small sample sizes (something which the t test was originally developed for by ‘Student’). <br /><br />Take n1=n2=2 and sd1=sd2=1. The type I error rate is 2.4% for the Welch test, while for the t test it is the nominal 5.0%. Now let us detect a difference of 8 (add 8 to one of the distributions). The power is now 67% for the Welch test and 96% for the t test. Of course, this small sample size is unusual. However, it does illustrate that it is easy to come up with counterexamples where the Welch test does not work.Joost de Winterhttps://sites.google.com/site/jcfdewinter/noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-21111212916168004272015-01-26T19:13:12.579-08:002015-01-26T19:13:12.579-08:00Hey Daniel,
I liked your simulations and thought ...Hey Daniel,<br /><br />I liked your simulations and thought that scatter plot was a brilliant visualization of the difference between Welch and Student. I just have a couple of questions/thoughts I'd like to throw your way after reading this post. You may have considered these before, and feel free to tell me to be patient if there is a post in the pipeline to address them: <br /><br />1) What implication does using Welch's test have on estimating effect sizes from our studies? Wider CIs, smaller ES estimates, or something else? <br /><br />2) You say: "And finally, it is recommended to perform a Bayesian t-test to check whether any conclusions about rejecting the null-hypothesis converge with Welch’s t-test"<br /><br />I'm not sure I understand this bit. Does this mean we should seek to verify whether our conclusions using the Welch test are correct by using a Bayesian test? Why don't we just report the Bayesian test in the first place if that option is on the table? <br /><br />3) Does doing a Welch test lead to doing anything differently when your p-values and Bayes factors disagree, or do we defer to Bayes as usual here? Or will they perhaps disagree less frequently using Welch tests?<br /><br />And last one, I promise,<br /><br />4) Isn't the p-value you get from the Welch test going to suffer from all the same problems (e.g., misinterpretation, adjustments for multiple comparisons, confusion when N gets very large, etc etc etc) of a regular old p-value? 
Our type-1 error rate will be better off swapping to Welch's test, no doubt about that, but there's still that elephant with the p shaped trunk in the room.<br /><br />Interesting post as always, these were just my rambling thoughts too long for twitter :)nicebrainhttps://nicebrain.wordpress.com/noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-22678989835531547552015-01-26T07:03:06.215-08:002015-01-26T07:03:06.215-08:00And now that you've got me thinking about it m...And now that you've got me thinking about it more, I actually have a function in my old list of functions (not in my R package) called yuen.contrast() that lets one use any of those equations in the Table linked above. You are inspiring me to add it to my R package, so I am working on that now!Ryne Shermanhttp://www.blogger.com/profile/08475820023652785317noreply@blogger.comtag:blogger.com,1999:blog-987850932434001559.post-28200308171456068472015-01-26T06:59:10.806-08:002015-01-26T06:59:10.806-08:00Hi Daniel,
Cool post! It reminded me a lot of thi...Hi Daniel,<br /><br />Cool post! It reminded me a lot of this chart (http://rynesherman.com/T-Family.doc) I created in 2012. It displays all of the possible equations for T and their DF based on trimming (trim vs. no trim), variances (assume equality vs. do not assume equality), and number of means (two-sample case vs. more than two samples). Of course, it doesn't speak to repeated measures designs which is a different ball game.<br /><br />The idea came to me after a summer school with Rand Wilcox (whom you cite) learning about various robust T-tests in R. I had nowhere to put the thing until your blog post! Thanks!Ryne Shermanhttp://www.blogger.com/profile/08475820023652785317noreply@blogger.com
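
As a rough illustration of two cells of the chart described in the comment above (two samples, no trimming, variances assumed equal vs. not assumed equal), here is a minimal sketch of the textbook Student and Welch statistics with their degrees of freedom. It is not taken from the linked chart or from any R package mentioned in this thread. The function names `student_t` and `welch_t` are made up for illustration, and Python is used only as a stand-in for the R code the thread refers to.

```python
import math
from statistics import mean, variance


def student_t(x, y):
    """Student's two-sample t statistic and df (assumes equal variances)."""
    n1, n2 = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite df (no equal-variance assumption)."""
    n1, n2 = len(x), len(y)
    # Squared standard error of each group mean, kept separate instead of pooled.
    se1, se2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical data: a larger, low-variance group and a smaller, high-variance group.
    group_x = [1, 2, 3, 4, 5, 6]
    group_y = [0, 10, 20]
    print("Student:", student_t(group_x, group_y))
    print("Welch:  ", welch_t(group_x, group_y))
```

With equal group sizes and equal sample variances the two functions return the same t and the same df. The Welch df never exceeds n1 + n2 - 2 and shrinks as the variances or group sizes diverge, which is where the two tests start to disagree.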