The 20% Statistician: comment feed
Blog by Daniel Lakens

---
prasad (2019-08-06):
Hi Daniel,

This might be a lame question, but your answer would be of immense help. Can I conduct an equivalence test for a one-proportion test?

For example, I have a binomial outcome variable from an experiment in which participants answered yes or no (say yes = 60, no = 40, N = 100), where p is the proportion of people who answered yes. My hypothesis is:

H0: p = 0.5
H1: p > 0.5

Best, prasad

---
Unknown (2019-07-19):
Hey Daniel, great post - thanks for sharing! I have a couple of suggestions for improvement and a question:

1) Thought you might like to know that the first line of R script for your function is missing an opening double quote. It should read:

res = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power")

2) For some reason, the balance function produces incorrect total error rates. For example, the following produces res$tot = 8.888209e-08 but res$alpha + res$beta = 0.9967886:

res = optimal_alpha(power_function = "pwr.t.test(d=0.001, n=30000, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "balance")
res$alpha
res$beta
res$tot
res$beta + res$alpha

3) You mention "If you collect large amounts of data, you should really consider lowering your alpha level." I'm not sure I follow entirely. Assuming a sample size of 10000 where Cohen's d = 0.2, adjusting the alpha from 0.5 to something smaller such as .0000000000000000005 has no impact on power, right?
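Point 3 can be checked numerically. A minimal sketch in Python (using statsmodels rather than the R pwr package discussed in the thread; `TTestIndPower` is that library's API, not anything from the original post):

```python
# Power of a two-sample t-test at a very large sample size,
# for a conventional alpha and for an absurdly strict one.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# d = 0.2, n = 10000 per group, conventional alpha
power_05 = analysis.power(effect_size=0.2, nobs1=10000, alpha=0.05)
# same design, extremely small alpha
power_tiny = analysis.power(effect_size=0.2, nobs1=10000, alpha=5e-19)

print(power_05, power_tiny)  # both are essentially 1
```

Lowering alpha always lowers power, but at this sample size the noncentrality parameter is so large that power stays essentially 1 even at the extreme threshold, which is the intuition behind the commenter's question.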
I'm probably missing something here, so I'd love to hear your thoughts.

---
Kevin McConway (2019-07-16):
Either I've misunderstood this, or there's something wrong with it or missing from it. The decision tree in Figure 1 is fine, but the tree in Figure 2 isn't analogous to it. In Figure 1, you make the decision whether or not to invest, and then the chance nodes show all the possible outcomes (the product works, or it doesn't), and the probabilities of those are their unconditional probabilities, 0.5 and 0.5 for each. In Figure 2, you choose the alpha, but the following chance nodes don't include all the possible outcomes: they only include the possibilities that there is a Type 1 or a Type 2 error, yet there is another possibility, that there is no error at all and the test gives the correct outcome. Also, the probabilities assigned to the two error types are conditional. Alpha is the probability of a result in the critical region (i.e. 'significant') conditional on the null hypothesis being correct, that is, conditional on the true effect being zero, and beta is the probability of a result outside the critical region (i.e. 'not significant') conditional on the true effect being non-zero. So you can't just put them both in the same expected value calculation like that, because you are then finding the expected value from two different probability distributions that are conditional on different things, which makes no sense (to me at least).
In the Figure 1 example there are only two states (product works or not), but in the testing example there are four:
(i) There is no true effect (null hypothesis true) and the test result is non-significant.
(ii) There is no true effect and the test result is significant.
(iii) There is a true effect (null hypothesis false) and the test result is non-significant.
(iv) There is a true effect and the test result is significant.

Or you could draw a tree with two sets of chance nodes: one set for whether the null hypothesis is true, and one, which could then be conditional on the first node, for whether the test result is significant or not. Then the probabilities for the second set would be alpha and 1 - alpha for those following "Null hypothesis true", and 1 - beta and beta for those following "Null hypothesis not true". That would work, but you still have to specify the probabilities on the first set of nodes, that is, the probability of whether the null hypothesis is true, and that is the prior probability that you want to avoid. But I don't think you can avoid it: if you put all four outcomes on the chance nodes and work out their probabilities, that involves the probability that the null is true, that is, the prior.

You might be able to take a different decision-theoretic approach that avoids using the prior probabilities, but the one you've used, with decision trees, is pretty well inevitably Bayesian, I think.

---
Unknown (2019-07-16):
Thanks Daniel, it's good to hear an informed opinion, which I see as a gentle push away from using the same significance threshold for all kinds of tests in a discipline, or even in the sciences as a whole.
This has always perplexed me, as I'm mostly working in business settings where risks and rewards can be estimated with a fair degree of precision, since the number of people/situations affected by a given inference is more or less limited, unlike in the sciences.

I've actually worked on arriving at significance thresholds and sample sizes (and therefore power/minimum effect of interest) which achieve an optimal balance of risk and reward for an online controlled experiment based on its particular circumstances. A brief description of my work can be found at http://blog.analytics-toolkit.com/2017/risk-vs-reward-ab-tests-ab-testing-risk-management/ while a more detailed exposé will soon be released in my upcoming book, where I devote a solid 30 pages to the topic ( https://www.abtestingstats.com/ ), for anyone interested.

---
Unknown (2019-05-14):
Hi Daniel, how can this standardization of p-values based on sample size be coupled with multiple-testing adjustment by Bonferroni or BH?

---
www.surjyasaikia.in (2019-05-03):
The p-value should be there, just to validate methodological correctness, to assign uniformity to research work, or to strengthen justifications for the findings, but only with respect to the individual terms of the work, not to support the hypothesis as universal fact. Of course, we can encourage reporting power and effect size, because there are many studies where power is compromised.
What I liked about Trafimow's article is that it calls out the dishonest attempt of researchers to get their papers published in journals on the basis of p-values despite unrealistic elements like exceptionally low n (as small as 3), skewed distributions, non-homogeneity, etc. BASP might have grown fatigued with that type of paper. That is why they wrote "we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem" (Trafimow & Marks, 2015, doi:10.1080/01973533.2015.1012991). Honest and judicious use of p or CI is always welcome.

---
gwern (2019-04-09):
It's not super-clear that Cohen wasn't. Meehl, after all, didn't talk much about experimental randomized interventions, and he was called on it by Oakes (https://www.gwern.net/docs/statistics/1975-oakes.pdf), who gave as a counter-example the now-forgotten OEO 'performance contracting' school reform experiment (https://www.gwern.net/docs/sociology/1972-page.pdf), where despite randomization of dozens of schools with ~33k students, not a single null could be rejected.

---
Tabea (2019-03-14):
Hi, thank you very much for this page, this is very helpful!

I used the SPSS script to calculate the CIs for eta squared in a MANOVA. However, in some cases, mostly for the main effects in the MANOVA, I obtained an eta squared that was not covered by the CI: for instance I had F(34, 508) = 1.72, partial η² = .103, 90% CI = [.012; .086].

Is it possible that the multivariate design causes the problem here? And would you have any suggestions on how to fix this?

Thanks a lot and best regards,
Tabea

---
Daniel Lakens (2018-12-07):
As the blog explains, this is about solving a problem with large N, so it is not intended to be used to increase the alpha for smaller N. Standardizing at 100 is a fairly arbitrary choice; for these N's there is no substantial mismatch yet, according to Good. He mentions it is just a useful convention for everyone to use, but feel free to use another number, or devise another scaling.

---
Unknown (2018-12-05):
Hi Daniel, I couldn't help thinking about your idea of scaling alpha by the square root of the sample size divided by the constant 100. I completely fail to understand your choice of constant, which obviously assigns a false positive rate higher than the traditional criterion of p < 0.05 to independent frequentist null hypothesis tests with a sample size below that arbitrary constant.
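The scaling under discussion (I. J. Good's standardized p-values, as referenced in the reply above) can be sketched in a few lines of Python. The exact form, including the cap at 0.5, is my reading of the proposal, not code from the post:

```python
import math

def standardized_p(p, n):
    """Good's standardization of a p-value to a reference sample size of 100:
    p_stan = min(0.5, p * sqrt(n / 100)). The min() cap keeps the result
    probability-like; treat this as a sketch of the proposal."""
    return min(0.5, p * math.sqrt(n / 100))

def scaled_alpha(alpha, n):
    """Equivalent view: the alpha a raw p-value must beat so that the
    standardized p-value beats `alpha` (i.e. alpha / sqrt(n / 100))."""
    return alpha / math.sqrt(n / 100)

# At the reference size n = 100 nothing changes; for large n the effective
# alpha shrinks; for n < 100 it grows, which is the objection raised here.
print(scaled_alpha(0.05, 100))   # 0.05
print(scaled_alpha(0.05, 2500))  # 0.01
print(scaled_alpha(0.05, 25))    # 0.1
```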
Wouldn't you prefer an adaptive false positive rate that starts with the traditional criterion (or any other initial probability) and decreases with sample size, for example alpha = alpha/log(n) or alpha = alpha/n^(1/3)?

Best,
Martin Dietz

---
Anonymous (2018-06-20):
The gist of this reasoning, in slightly different words, using a different blog post on this topic: https://pedermisager.netlify.com/post/what-to-replicate/

In the blog post by Isager, various reasons are given why researchers could have decided to replicate certain findings. I was wondering if you have thought about the possibility of *not* replicating, and/or giving attention to, any past work.

If we take into account your assumptions regarding resource constraints and the willingness to replicate, it might be far more fruitful (and perhaps more ethical and responsible) for researchers not to replicate any past work but to concentrate on replicating current and future work.

I reason that all the different reasons researchers give to replicate past work might be considered equivalent from the perspective of a cumulative science, because all the different reasons Isager provides are, could be, or will be intertwined and influenced by each other. From the perspective of psychological science as a cumulative science, it therefore possibly doesn't matter 1) what the reason for replicating is among your examples, 2) any of them could even be a reason *not* to replicate, and 3) the "starting point" in a research program (e.g. a direct replication of past work) is perhaps far less important than the entire process of that research program.

For instance, assuming the narrative of the past few years is (partly) correct that "sexy" (but probably low-quality) findings have been rewarded, it could be reasoned that these "sexy" findings will have had theoretical impact, gathered personal interest, influenced policy, and amassed many citations. If this makes any sense, all the reasons researchers give for replicating past work in your blog post may in fact be the exact reasons why they *shouldn't* want to replicate them, given resource constraints and the desire to replicate things. All this replication of past work might be giving attention to sub-optimal work, and researchers, for a second time?! See also "Replication initiatives will not salvage the trustworthiness of psychology" by J. C. Coyne (https://bmcpsychology.biomedcentral.com/articles/10.1186/s40359-016-0134-3).

Here is a link to a research (and publication) format that incorporates direct replications of "new" work, and that involves a more continuous and cumulative manner of replicating and doing research:
http://andrewgelman.com/2017/12/17/stranger-than-fiction/#comment-628652

---
Daniel Lakens (2018-05-24):
Hi Timothy, Felix Schönbrodt and EJ Wagenmakers have papers on Bayesian Design Analysis; you should use those to plan your study.

---
Timothy Houtman (2018-05-24):
Hi Daniel,
Thank you for your post. This could be very helpful for me in my work and research.

However, I've run into some error messages when running the script. I am not very well versed in RStudio, so could you or someone else help me out in resolving this problem? These are the messages I am getting:

Error in winProgressBar(title = "progress bar", min = 0, max = nSim, width = 300) :
  could not find function "winProgressBar"

Error in setWinProgressBar(pb, i, title = paste(round(i/nSim * 100, 1), :
  could not find function "setWinProgressBar"

Error in close(pb) : object 'pb' not found

Error in hist.default(log(bf), breaks = 20) : character(0)
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

Thanks in advance,

Timothy

---
Daniel Lakens (2018-05-20):
Hi Jim, the methods described in the blog are perfectly suited for confirmatory research. One-sided versions of equivalence tests exist (non-inferiority tests, as explained in my papers).
Thanks for the link to your pdf. It does contain some errors and outdated advice (see the criticism of the 'power approach' in my equivalence testing papers); you might want to read the latest paper to improve your understanding of equivalence tests.

---
Jim Kennedy (2018-05-20):
The recommendations in this blog post appear to be based on the assumption that a large initial study will be conducted when researchers do not have a clear prediction about an effect. This strategy is feasible when resources are available for large projects. However, if resources are limited, smaller initial exploratory studies may be useful to justify the greater resources for a large study. This is a common situation in medical research, which often requires expensive specialized measurements and a selected pool of subjects. From this perspective, magnitude-based inferences might be a useful exploratory method to evaluate whether a larger confirmatory study is justified. In general, any discussion of statistical research methods that does not distinguish between exploratory and confirmatory research, and describe how and whether the methods apply to each stage of research, will likely encourage continued blurring of exploration and confirmation and continued misuse of statistics.

The recommended methods appear to be useful in initial studies when researchers do not have clear predictions, but the methods may not be widely useful for confirmatory research.
If the research question is practical, such as whether a certain type of shoe, or educational program, or medical treatment is better or worse than another, then it is reasonable that the researchers initially do not have a clear prediction and will use two-sided tests (although the sponsor of the research probably has a preferred outcome).

However, when the research questions are more theoretical, a two-sided test usually means the researchers do not have a clear theoretical prediction and want the flexibility to make up an explanation after looking at the results. Such post hoc explanations are often not distinguished from pre-specified theory, given that the planned statistical analysis was significant. Science is based on making and testing predictions. Two-sided tests usually mark the exploratory stage of research, without a clear theoretically based prediction.

The extreme case is when the only prediction is that the effect size is not zero, as has been common in psychological research in recent decades. This prediction is not falsifiable in principle, because any finite sample size may have inadequate power to detect the extremely small effects consistent with the hypothesis. Without a smallest effect size of interest, research is not falsifiable.

The confirmatory research that is needed to make science valid and self-correcting will usually be based on one-sided statistical tests with falsifiable predictions. Unfortunately, statistical methods for conducting falsifiable research with classical (frequentist) statistics have not been widely known among psychological researchers.
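The one-sided versus two-sided distinction above is mechanical as well as philosophical: when the observed effect is in the pre-specified direction, the one-sided p-value is exactly half the two-sided p-value. A minimal sketch in Python with scipy (the data are made up for illustration):

```python
from scipy import stats

# Hypothetical scores for two groups; the directional prediction is group_a > group_b.
group_a = [3.1, 2.9, 3.4, 3.0, 3.2, 2.8]
group_b = [2.5, 2.4, 2.7, 2.6, 2.3, 2.5]

two_sided = stats.ttest_ind(group_a, group_b)
one_sided = stats.ttest_ind(group_a, group_b, alternative='greater')

# With the effect in the predicted direction, the one-sided p
# is exactly half the two-sided p.
print(two_sided.pvalue, one_sided.pvalue)
```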
Such methods are described in a paper at https://jeksite.org/psi/falsifiable_research.pdf

Jim Kennedy

---
Unknown (2018-05-17):
Great article Daniel!

---
Rob56 (2018-05-14):
p-values just tell you the probability of getting more extreme results (in the direction of the alternative hypothesis) than the observed value of the test statistic with the actual data. Thus, you are looking at a multitude of possible samples that might occur and yield worse results than your actual sample.

The Bayes factor does a better job: you are focusing on your actual data and not on other (virtual) samples that might have occurred. Most importantly, however, the Bayes factor directly compares two different models: the null model and an alternative model (representing the alternative hypothesis).

---
Zad Chow (2018-05-12):
Great read.

---
Anonymous (2018-05-05):
"He self-plagiarized, and excessively cited his own work."

The real problem might be that the number of publications and citations are used as some sort of metric for quality, used for hiring and promoting researchers, etc. I fear nothing will be solved as long as this continues to be the case.
There is nothing wrong with self-citations in and of themselves, I reason. And when you all start acting like it is wrong, then people who want to manipulate that will simply ask their friends to cite them (as has probably already been happening a lot over the past decades, but let's all not think about that).

I fear nothing will be solved by swapping one Sternberg for another (version of an editor). This Sternberg dude may have been clumsy in his antics, but you don't really think he's the only editor who does bad stuff.

Journals, editors, and peer review are a joke and can possibly be viewed as anti-scientific in and of themselves, and the cause of many, if not the majority, of the problematic issues in science today.

But please, you all keep actively participating in this sh#tshow, and be proud of the fact that you all wrote a letter and got really angry about this Sternberg dude. Congrats, well done!

---
Daniel Lakens (2018-04-14):
TOSTtwo sets the equivalence bounds in d, and dataTOSTtwo sets them in raw units. Please, NO questions here; this is a horrible way to communicate.
Send an email or use GitHub.

---
Unknown (2018-04-14):
Maybe it's a stupid question, but why don't I get the same results when I use "TOSTtwo" and "dataTOSTtwo"?

Here is the example code:

# Illustration of 1.5 sigma distribution difference

n <- 10
test_mean <- 20
test_sd <- 3

Biosimilar <- rnorm(n, test_mean, test_sd)
Reference <- rnorm(n, test_mean, test_sd)

equiv.margin <- sd(Reference) * 1.5

Sample <- c(rep("Biosimilar", length(Biosimilar)),
            rep("Reference", length(Reference)))
Values <- c(Biosimilar, Reference)

TOSTtwo(m1 = mean(Biosimilar),
        m2 = mean(Reference),
        sd1 = sd(Biosimilar),
        sd2 = sd(Reference),
        n1 = length(Biosimilar),
        n2 = length(Reference),
        low_eqbound_d = -1.5,
        high_eqbound_d = 1.5)

df <- data.frame(Sample, Values)

dataTOSTtwo(df, deps = "Values", group = "Sample", var_equal = FALSE,
            low_eqbound = -1.5, high_eqbound = 1.5, alpha = 0.05,
            desc = TRUE, plots = TRUE)

---
Daniel Lakens (2018-03-28):
I agree with everything. Maybe I should only point out that I believe one-sided tests should be used more often, but after pre-registering.

---
Daniel Lakens (2018-03-28):
The Schönbrodt paper was published after my blog post.
The scale factor of 1 is a nonsense prior and should never be used.

---
Nick Brown (2018-03-28):
I found this interesting, from DelPriore et al.'s Study 5: "Given the absence of statistically significant main effects or interactions following from the randomly assigned writing prime, we proceeded to analyze for the effects of the emotions about fathers expressed in the essays." That sounds to me rather like "We decided which analyses to perform based on what we found in the data". (I'm looking forward to seeing how many registered reports have, as an a priori hypothesis, the specific prediction of a full mediation effect.)

Also, this, from Study 4: "Given the extant literature demonstrating reliable effects of paternal absence-disengagement on sexually proceptive behavior in women (as reviewed in the Introduction), a one-tailed statistical test (p = .031) could be justified here, supporting a causal effect of paternal disengagement on flirting." Now I know that Daniel is a big fan of one-tailed tests, but it seems to me that this is basically pleading for "something that we wish was 'true', but we can't say it's 'true' because its p-value is >.05" to be turned into "something that we can say is 'true' because its p-value is <.05", with the justification that *somebody else previously found a similar result*.
Taking this to its logical conclusion, we could run absolutely any study, and whenever the p-value doesn't pan out, just say "Oh well, we didn't quite get lucky today, but we know it's 'true' because these other people found a similar effect, so we'll pretend we had their results instead of ours".

---
Anonymous (2018-03-27):
Hi Daniel,
Why the recommendation to stop at Bayes factors > 3 (with a scale r on the effect size of 0.5)? Schönbrodt et al. suggest BF > 5 (with a scale parameter r of 1).
Best regards
/Bill