The 20% Statistician - comments feed (blog by Daniel Lakens). Retrieved 2020-02-06.

Unknown (2019-10-11):
Thank you for the wonderful package.

Is there a good way to determine equivalence margins?

Unknown (2019-10-11):
I know that editorials are mass-produced, including similar criticisms.

However, I don't know anyone who has shown how to calculate post-hoc power for Welch's t-test, especially when the sample sizes of the two groups are different.

I'm asking for it to be shown as calculation code or a mathematical formula in the following community:
https://stats.stackexchange.com/questions/430030/what-is-the-post-hoc-power-in-my-experiment-how-to-calculate-this

Se (2019-10-04):
Gelman nails it down: "It's fine to estimate power (or, more generally, statistical properties of estimates) after the data have come in—but only only only only only if you do this based on a scientifically grounded assumed effect size. One should not not not not not estimate the power (or other statistical properties) of a study based on the 'effect size observed in that study.'"
https://statmodeling.stat.columbia.edu/2018/09/24/dont-calculate-post-hoc-power-using-observed-estimate-effect-size/

awfoot (2019-09-15):
This resource is available for repeated measures (and mixed) designs, though I think only for two-way designs:
https://www.aggieerin.com/shiny-server/tests/omegaprmss.html
Also provides CIs.

Dave Giles (2019-09-15):
Great post - thank you!

Unknown (2019-08-20):
Hi Daniel, many thanks for this resource. Would there be any chance of converting them to a Shiny app, to really drop the barriers to use? A really nice example of similar resources that have been implemented in a Web interface is https://www.estimationstats.com/#/
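A sketch of the "observed power" calculation asked about above, for Welch's unequal-variance test with unequal group sizes. This is a normal approximation: it plugs the observed standardized difference in as the noncentrality parameter and skips the Welch-Satterthwaite degrees of freedom, so it is only adequate for moderate-to-large groups. The function name and numbers are illustrative, not from the blog, and note Gelman's caveat quoted above: power computed from the observed effect is generally uninformative.

```python
from statistics import NormalDist

def welch_observed_power(m1, sd1, n1, m2, sd2, n2, alpha=0.05):
    """Approximate 'observed power' of a two-sided Welch t-test,
    plugging the observed group difference in as the true effect.
    Normal approximation to the noncentral t distribution."""
    se = (sd1**2 / n1 + sd2**2 / n2) ** 0.5   # unequal n handled here
    ncp = (m1 - m2) / se                      # observed noncentrality
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # probability of landing in either tail of the rejection region
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

# e.g. means 1.0 vs 0.0, SDs of 1, groups of 30 and 40:
print(welch_observed_power(1.0, 1, 30, 0.0, 1, 40))
```

When the observed difference is zero the formula returns exactly alpha, which illustrates why "observed power" is just a transformation of the p-value rather than new information.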
Thomas Schmidt (2019-08-14):
Power = people?
Thomas Schmidt, University of Kaiserslautern, Germany

There is yet another way to improve your power: use more trials from the participants you have. Actually, power depends on two things: the number of participants in the sample and the reliability of the measurements. However, reliability directly depends on the number of trials. There are several simulation studies that show that both levels (people and trials) are about equally important in determining statistical power. There are many areas of psychology that successfully work with small groups of subjects but massive repetition of measurement -- psychophysics is a good example. In my research, I almost invariably use eight participants and control power entirely by the number of sessions. In my experience, well-trained subjects perform so much more reliably than untrained ones that they can give you high data quality even with limited resources. There is also a convenient, citable name for this approach: Smith & Little (2018) call it "small-N design".

Apart from statistical power, there is yet another time-honoured concept that is used in engineering: measurement precision. Precision can simply be defined by setting an upper limit on the standard error of the dependent variable -- all you need is a rough idea of the standard deviation. In a recent paper, we included the following passage to justify our sample sizes (Biafora & Schmidt, 2019):

"In multi-factor repeated-measures designs, statistical power is difficult to predict because too many terms are unknown. Instead, we control measurement precision at the level of individual participants in single conditions. We calculate precision as s/√r (Eisenhart, 1969), where s is a single participant's standard deviation in a given cell of the design and r is the number of repeated measures per cell and subject. With r = 120 and 240 in the priming and prime identification task, respectively, we expect a precision of about 5.5 ms in response times (assuming individual SDs around 60 ms), at most 4.6 percentage points in error rates, and at most 3.2 percentage points in prime identification accuracy (assuming the theoretical maximum SD of .5)."
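The precision rule s/√r quoted in the passage above is easy to reproduce; this short Python sketch (variable names are mine, not from the paper) recovers the three numbers the authors report.

```python
import math

def precision(sd, r):
    """Standard error of one participant's cell mean: s / sqrt(r),
    where s is the within-cell SD and r the trials per cell."""
    return sd / math.sqrt(r)

# The three numbers given in the quoted passage:
print(round(precision(60, 120), 1))          # RTs, SD ~60 ms, r = 120 -> 5.5 ms
print(round(100 * precision(0.5, 120), 1))   # error rate, max SD .5, r = 120 -> 4.6 points
print(round(100 * precision(0.5, 240), 1))   # identification, max SD .5, r = 240 -> 3.2 points
```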
prasad (2019-08-06):
Hi Daniel,

This might be a lame question, but your answer would be of immense help.

Can I conduct an equivalence test for a one-proportion test?

For example, I have a binomial outcome variable from an experiment in which participants answered yes or no (example: yes = 60, no = 40; N = 100), where p is the proportion of people who answered yes. My hypothesis is:

H0: p = 0.5
H1: p > 0.5

best, prasad

Unknown (2019-07-19):
Hey Daniel, great post - thanks for sharing! I have a couple of suggestions for improvement and a question:

1) Thought you might like to know the first line of R script for your function is missing double quotes:
res = optimal_alpha(power_function = [ADD_DOUBLE_QUOTES_HERE]pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power")

2) For some reason, the balance function produces incorrect total error rates. For example, the following produces res$tot = 8.888209e-08 but res$alpha + res$beta = 0.9967886:
res = optimal_alpha(power_function = "pwr.t.test(d=0.001, n=30000, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "balance")
res$alpha
res$beta
res$tot
res$beta + res$alpha

3) You mention "If you collect large amounts of data, you should really consider lowering your alpha level." I'm not sure I follow entirely. Assuming a sample size of 10000 where Cohen's d = 0.2, adjusting the alpha from 0.05 to something smaller such as .0000000000000000005 has no impact on power, right? I'm probably missing something here, so I'd love to hear your thoughts.

Kevin McConway (2019-07-16):
Either I've misunderstood this, or there's something wrong with it or missing from it. The decision tree in Figure 1 is fine, but the tree in Figure 2 isn't analogous to it. In Fig 1, you make the decision whether or not to invest, and then the chance nodes show all the possible outcomes - the product works, or it doesn't - and the probabilities of those are their unconditional probabilities, 0.5 and 0.5 for each. In Figure 2, you choose the alpha, but the following chance nodes don't include all the possible outcomes. They only include the possibilities that there is a Type 1 or a Type 2 error, but there's another possibility: that there's no error at all and the test gives the correct outcome. Also, the probabilities assigned to the two error types are conditional - alpha is the probability of a result in the critical region (i.e. 'significant') conditional on the null hypothesis being correct, that is, conditional on the true effect being zero, and beta is the probability of a result outside the critical region (i.e. 'not significant') conditional on the true effect being non-zero. So you can't just put them both in the same expected value calculation like that, as you then find the expected value from two different probability distributions that are conditional on different things, which makes no sense (to me at least).
In the Figure 1 example there are only two states (product works or not), but in the testing example there are four:
(i) There is no true effect (null hypothesis true) and the test result is non-significant.
(ii) There is no true effect and the test result is significant.
(iii) There is a true effect (null hypothesis false) and the test result is non-significant.
(iv) There is a true effect and the test result is significant.

Or you could draw a tree with two sets of chance nodes: one set for whether the null hypothesis is true, and one, which could then be conditional on the first node, for whether the test result is significant or not. Then the probabilities for the second set would be alpha, 1 - alpha for those following "Null hypothesis true", and 1 - beta, beta for those following "Null hypothesis not true". That would work, but you still have to specify the probabilities on the first set of nodes, that is, the probability that the null hypothesis is true - and that is the prior probability that you want to avoid. But I don't think you can avoid it: if you put in all four outcomes on the chance nodes and work out their probabilities, that involves the probability that the null is true, that is, the prior.

You might be able to take a different decision-theoretic approach that avoids using the prior probabilities, but the one you've used, with decision trees, is pretty well inevitably Bayesian, I think.

Unknown (2019-07-16):
Thanks Daniel, it's good to hear an informed opinion, which I see as a gentle push away from using the same significance threshold for all kinds of tests in a discipline, or even in the sciences as a whole. This has always perplexed me, as I mostly work in business settings where risks and rewards can be estimated with a fair degree of precision, since the number of people/situations affected by a given inference is more or less limited, unlike in the sciences.

I've actually worked on arriving at significance thresholds and sample sizes (and therefore power/minimum effect of interest) which achieve an optimal balance of risk and reward for an online controlled experiment based on its particular circumstances. A brief description of my work can be found at http://blog.analytics-toolkit.com/2017/risk-vs-reward-ab-tests-ab-testing-risk-management/ while a more detailed exposé will soon be released in my upcoming book, where I devote a solid 30 pages to the topic ( https://www.abtestingstats.com/ ), for anyone interested.

Dodger (2019-05-14):
Hi Daniel, how can this standardization of p-values based on sample size be coupled with multiple-testing adjustment by Bonferroni or BH?

Walter Lorenzo (2019-05-09):
I am taking your course on Coursera. This article summarizes many ideas that you put in the course. Thank you.

www.surjyasaikia.in (2019-05-03):
The p value should be there, just to validate methodological correctness and to assign uniformity to research work, or to strengthen justifications for the findings only with respect to the individualistic terms of the work - but not to support the hypothesis as a universal fact. Of course, we can encourage reporting power and effect size, because there are many studies where power is compromised. What I liked about Trafimow's article is that it exposes the dishonest attempts of researchers to get their papers published in journals based on p values with unrealistic elements like exceptionally low n (as small as 3), skewed distributions, non-homogeneity, etc. BASP might have grown fatigued with that type of paper. That is why they wrote "we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem" (from Trafimow & Marks, 2015, doi:10.1080/01973533.2015.1012991). Honest and judicious use of p or CI is always welcome.

gwern (2019-04-09):
It's not super-clear that Cohen wasn't.
Meehl, after all, didn't talk much about experimental randomized interventions, and he was called on it by Oakes (https://www.gwern.net/docs/statistics/1975-oakes.pdf), who gave as a counter-example the now-forgotten OEO 'performance contracting' school reform experiment (https://www.gwern.net/docs/sociology/1972-page.pdf) where, despite randomization of dozens of schools with ~33k students, not a single null could be rejected.

Tabea (2019-03-14):
Hi, thank you very much for this page, this is very helpful!

I used the SPSS script to calculate the CIs for eta squared in a MANOVA.

However, in some cases, mostly for the main effects in the MANOVA, I obtained an eta squared that was not covered by the CI: for instance, I had F(34, 508) = 1.72, partial η2 = .103, 90% CI = [.012; .086].

Is it possible that the multivariate design causes the problem here? And would you have any suggestions on how to fix this?

Thanks a lot and best regards,
Tabea

Daniel Lakens (2018-12-07):
As the blog explains, this is about solving a problem with large N - it is not intended to be used to increase the alpha for smaller N. Standardizing to 100 is a fairly arbitrary choice - for these Ns there is no substantial mismatch yet, according to Good. He mentions it is just a useful convention for everyone to share - but feel free to use another number, or devise another scaling.
- Daniel Lakens
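A quick sketch of how the scaling in Daniel's reply above behaves compared with two adaptive alternatives (division by log(n) and by the cube root of n) that come up in this exchange. The helper names are illustrative, not from the blog; the point the numbers make is that the square-root rule rises above .05 for n below the anchor of 100, while the other two start below .05 and only shrink.

```python
import math

ALPHA0 = 0.05  # starting criterion

def alpha_sqrt(n, n0=100):
    """Scale alpha by sqrt(n0 / n), standardized to n0 observations."""
    return ALPHA0 * math.sqrt(n0 / n)

def alpha_log(n):
    """Adaptive alternative: alpha / log(n)."""
    return ALPHA0 / math.log(n)

def alpha_cuberoot(n):
    """Adaptive alternative: alpha / n^(1/3)."""
    return ALPHA0 / n ** (1 / 3)

for n in (50, 100, 1000, 10000):
    print(n, round(alpha_sqrt(n), 4), round(alpha_log(n), 4),
          round(alpha_cuberoot(n), 4))
```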
Martin Dietz (2018-12-05):
Hi Daniel, I couldn't help thinking about your idea of scaling alpha by the square root of the sample size divided by the constant 100. I completely fail to understand your choice of constant, which obviously assigns a false positive rate higher than the traditional criterion of p < 0.05 to independent frequentist null hypothesis tests with a sample size below that arbitrary constant. Wouldn't you prefer an adaptive false positive rate that starts at the traditional criterion, or any other initial probability, and decreases with sample size - for example alpha = alpha/log(n) or alpha = alpha/n^(1/3)?

Best,
Martin Dietz

Anonymous (2018-06-20):
The gist of this reasoning, in slightly different words, using a different blogpost on this topic: https://pedermisager.netlify.com/post/what-to-replicate/

In the blogpost by Isager, various reasons are given why researchers could have decided to replicate certain findings. I was wondering if you have thought about the possibility of *not* replicating, and/or giving attention to, any past work.
If we take into account your assumptions regarding resource constraints and the willingness to replicate, it might be far more fruitful (and perhaps more ethical and responsible) for researchers not to replicate any past work, but to concentrate on replicating current and future work.

I reason that all the different reasons researchers give to replicate past work might be considered equivalent from the perspective of a cumulative science. This is because all the different reasons Isager provides are, could be, or will be intertwined and influence each other. Viewed from the perspective of psychological science as a cumulative science, it therefore possibly doesn't matter 1) which of your examples is the reason for replicating; 2) the same reason could even be a reason *not* to replicate; and 3) the "starting point" in a research program (e.g. a direct replication of past work) is perhaps far less important than the entire process of that research program.

For instance, assuming the narrative of the past few years is (partly) correct that "sexy" (but probably low-quality) findings have been rewarded, it could be reasoned that these "sexy" findings will have had theoretical impact, gathered personal interest, influenced policy, and amassed many citations. If this makes any sense, all the reasons researchers give for replicating past work in your blogpost may in fact be the exact reasons why they *shouldn't* want to replicate it, given resource constraints and the desire to replicate. All this replication of past work might be giving attention to sub-optimal work, and researchers, for a second time?! Also possibly see "Replication initiatives will not salvage the trustworthiness of psychology" by J. C. Coyne (https://bmcpsychology.biomedcentral.com/articles/10.1186/s40359-016-0134-3).

Here is a link to a research (and publication) format that incorporates direct replications of "new" work, and that involves a more continuous and cumulative manner of replicating and doing research:
http://andrewgelman.com/2017/12/17/stranger-than-fiction/#comment-628652

Daniel Lakens (2018-05-24):
Hi Timothy, Felix Schonbrodt and EJ Wagenmakers have papers on Bayesian Design Analysis - you should use those to plan your study.
Timothy Houtman (2018-05-24):
Hi Daniel,

Thank you for your post. This could be very helpful for me in my work and research.

However, I've run into some error messages when running the script. I am not very well versed in RStudio, so could you or someone else help me out in resolving this problem?

These are the messages I am getting:

Error in winProgressBar(title = "progress bar", min = 0, max = nSim, width = 300) :
  could not find function "winProgressBar"

Error in setWinProgressBar(pb, i, title = paste(round(i/nSim * 100, 1), :
  could not find function "setWinProgressBar"

Error in close(pb) : object 'pb' not found

Error in hist.default(log(bf), breaks = 20) : character(0)
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

Thanks in advance,

Timothy

Daniel Lakens (2018-05-20):
Hi Jim, the methods described in the blog are perfectly suited for confirmatory research. One-sided versions of equivalence tests exist (non-inferiority tests, as explained in my papers).
Thanks for the link to your pdf - it does contain some errors and outdated advice (see the criticism of the 'power approach' in my equivalence testing papers); you might want to read the latest paper to improve your understanding of equivalence tests.

Jim Kennedy (2018-05-20):
The recommendations in this blog post appear to be based on the assumption that a large initial study will be conducted when researchers do not have a clear prediction about an effect. This strategy is feasible when resources are available for large projects. However, if resources are limited, smaller initial exploratory studies may be useful to justify the greater resources for a large study. This is a common situation in medical research, which often requires expensive specialized measurements and a selected pool of subjects. From this perspective, magnitude-based inferences might be a useful exploratory method to evaluate whether a larger confirmatory study is justified. In general, any discussion of statistical research methods that does not distinguish between exploratory and confirmatory research, and describe how and whether the methods apply to each stage of research, will likely encourage continued blurring of exploration and confirmation and continued misuse of statistics.

The recommended methods appear to be useful in initial studies when researchers do not have clear predictions, but the methods may not be widely useful for confirmatory research. If the research question is practical - such as whether a certain type of shoe, or educational program, or medical treatment is better or worse than another - then it is reasonable that the researchers initially do not have a clear prediction and will use two-sided tests (although the sponsor of the research probably has a preferred outcome).

However, when the research questions are more theoretical, a two-sided test usually means the researchers do not have a clear theoretical prediction and want the flexibility to make up an explanation after looking at the results. Such post hoc explanations are often not distinguished from pre-specified theory, given that the planned statistical analysis was significant. Science is based on making and testing predictions. Two-sided tests usually belong to the exploratory stage of research, without a clear theoretically based prediction.

The extreme case is when the only prediction is that the effect size is not zero, as has been common in psychological research in recent decades. This prediction is not falsifiable in principle, because any finite sample size may have inadequate power to detect the extremely small effects consistent with the hypothesis. Without a smallest effect size of interest, research is not falsifiable.

The confirmatory research that is needed to make science valid and self-correcting will usually be based on one-sided statistical tests with falsifiable predictions. Unfortunately, statistical methods for conducting falsifiable research with classical (frequentist) statistics have not been widely known among psychological researchers. Such methods are described in a paper at https://jeksite.org/psi/falsifiable_research.pdf

SBR (2018-05-17):
Great article Daniel!!!

Rob56 (2018-05-14):
p values just tell you the probability of getting results more extreme (in the direction of the alternative hypothesis) than the observed value of the test statistic, given the actual data. Thus, you are looking at a multitude of possible samples that might occur and yield more extreme results than your actual sample.

The Bayes factor does a better job: you focus on your actual data and not on other (virtual) samples that might have occurred. Most importantly, however, the Bayes factor directly compares two different models: the null model and an alternative model (representing the alternative hypothesis).
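Rob56's description of the Bayes factor as a direct comparison of two models can be made concrete with the BIC approximation of Wagenmakers (2007), BF01 ≈ exp((BIC1 − BIC0)/2). This is only a rough stand-in for a full Bayes factor (it implies a particular default prior), and the data below are made up purely for illustration.

```python
import math

def bic(n, rss, k):
    """Gaussian-likelihood BIC (up to constants that cancel in the
    comparison), with k free parameters and residual sum of squares rss."""
    return n * math.log(rss / n) + k * math.log(n)

data = [0.3, -0.1, 0.4, 0.5, 0.2, 0.6, 0.1, 0.4]   # made-up sample
n = len(data)
mean = sum(data) / n
rss_null = sum(x**2 for x in data)            # null model: mean fixed at 0
rss_alt = sum((x - mean)**2 for x in data)    # alternative: mean estimated

# BF01 < 1 means the data favour the alternative model over the null
bf01 = math.exp((bic(n, rss_alt, 2) - bic(n, rss_null, 1)) / 2)
print(bf01)
```

For these numbers the alternative model fits much better, so BF01 comes out well below 1; with data scattered around zero it would come out above 1, favouring the null - which is exactly the two-sided model comparison a p value cannot deliver.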