Comments on The 20% Statistician: Error Control in Exploratory ANOVA's: The How and the Why

You do not need any correlation assumptions for th...

2016-10-12T21:29:47.671+02:00

You do not need any correlation assumptions for this method.

For 2 measures, you essentially do:
min(2 * min(p1, p2), max(p1, p2))

The correlations will not matter in this case, unless you want to assume independence and multiply the measures.
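The two-measure rule above is the n = 2 case of the Simes combined p-value, min over i of n * p_(i) / i on the sorted p-values. A minimal Python sketch, assuming that general form; the function name `simes_p` and the example p-values are illustrative only:

```python
def simes_p(p_values):
    """Simes combined p-value for the global null: min over i of n * p_(i) / i."""
    ordered = sorted(p_values)
    n = len(ordered)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical p-values for two measures.
p1, p2 = 0.03, 0.04

# For two p-values the general form reduces to the expression quoted above.
two_measure_rule = min(2 * min(p1, p2), max(p1, p2))
print(simes_p([p1, p2]), two_measure_rule)
```

No correlation estimate enters the calculation, which is the point being made in the comment.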

Technically, Bonferroni is proven mathematically in an ironclad way. This method could, in theory, go wrong, but that requires extremely weird conditions, so it is good for every practical purpose.

I love this method because I happened to work it out on my own in the late 90s, but did not know it had already been solved. So I have a little warm corner for it in my heart.

There are more papers on it; I think there is a paper somewhere by Sergiu Hart. It should be called the Simes method.

http://www-stat.wharton.upenn.edu/~steele/Courses/...

2016-10-11T15:44:15.635+02:00

http://www-stat.wharton.upenn.edu/~steele/Courses/956/Resource/MultipleComparision/Simes86pdf.pdf

Weber who? Do you have reference? Assumptions of ...

2016-10-11T14:03:04.527+02:00

Weber who? Do you have a reference? Assumptions about correlations are always tricky. If you have good estimates, yes, error rates depend on the correlation. The difference is not huge, but it matters a little bit.

Have you accounted for possible dependencies betwe...

2016-10-11T13:47:24.130+02:00

Have you accounted for possible dependencies between the measures?

I'm kinda nitpicking, but it's important sometimes.

When you have, say, two slightly different but highly correlated measures, a 2x correction of the p-value is too much.

In your example above, the distortion is only small, as the between-measure correlations aren't large. But it's good to be aware of it.

You can find out by simulating, which is actually how Weber argued it before.
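A simulation of this kind could look like the sketch below. It is only an illustration under stated assumptions: two test statistics that are standard normal under the null with correlation `rho`, two-sided p-values, and a 2x (Bonferroni) correction. The function names, seed, and number of simulations are arbitrary choices, not taken from the post.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_false_alarm_rate(rho, n_sims=20000, alpha=0.05, seed=1):
    """Share of null simulations in which 2 * min(p1, p2) is at or below alpha."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0.0, 1.0)
        # z2 has unit variance and correlation rho with z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if 2.0 * min(two_sided_p(z1), two_sided_p(z2)) <= alpha:
            false_alarms += 1
    return false_alarms / n_sims


for rho in (0.0, 0.5, 0.99):
    print(rho, bonferroni_false_alarm_rate(rho))
```

With nearly redundant measures the corrected error rate drifts below the nominal alpha, which is the sense in which the correction is "too much".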

Thanks for the references and link to your blog! I...

2016-01-05T17:37:57.261+01:00

Thanks for the references and link to your blog! I'm still getting used to the behavior of ANOVAs in simulations - will try to replicate your simulations, looking at power is a near future goal! Have a good 2016 as well!

1. I touched upon this topic in my blog: http://s...

2016-01-05T16:13:34.444+01:00

1. I touched upon this topic in my blog:

http://simkovic.github.io/2014/04/20/No-Way-Anova---Interactions-need-more-power.html

Also see Maxwell (2004), referenced in my blog, who wrote about these problems and should have been cited by Cramer et al.

As noted in my blog and ignored in your blog, the issue is not only that multiple testing inflates the error rate but also that there is a bias towards interactions/main effects depending on the sample size.

2. Bonferroni-Holm: As always with multiple comparisons the solution is to use hierarchical modeling:

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5(2), 189-211.

3. Exploratory ANOVA is a contradiction. Exploratory = no hypotheses, so why is everyone doing hypothesis testing? Do estimation and you are ok. There is a large literature that criticizes the use of omnibus ANOVA, and many researchers (incl. Geoff Cumming) recommend contrasts instead of ANOVA.

4. To do estimation properly you need to abandon the block design and use continuous IVs. Then you can do regression and avoid the discussed problems by using hierarchical priors on the regression coefficients. Another disadvantage of a 2^n design is that it does not allow you to infer the functional relationship between the IVs and the DV. This allows the researcher to twist the result in any way by assuming a functional relationship such that the result confirms his/her theory. With a regression design any such assumptions can be directly tested.

5. I wish you and your blog a productive 2016 :)

Thanks to a contribution by Frederik Aust, the lat...

2016-01-03T15:59:23.539+01:00

Thanks to a contribution by Frederik Aust, the latest version of my afex package (0.15-2) allows you to specify this type of correction directly in the call to the ANOVA function. For example:

require(afex)
data(obk.long)
# with correction:
aov_ez("id", "value", obk.long, between = "treatment", within = c("phase", "hour"), anova_table = list(p.adjust.method = "holm"))

# without correction:
aov_ez("id", "value", obk.long, between = "treatment", within = c("phase", "hour"))
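For readers who want to see what a "holm" adjustment does to a set of p-values, here is a language-independent sketch in Python. It is not afex code; it assumes the standard Holm step-down rule (multiply the i-th smallest of n p-values by n - i + 1, then enforce monotonicity and cap at 1). The function name and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by n, the next by n - 1, and so on.
        candidate = min(1.0, (n - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # adjusted values never decrease
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for three effects in an ANOVA table.
print(holm_adjust([0.01, 0.04, 0.03]))
```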

Indeed, the 7 tests is completely reasoned from th...

2016-01-02T21:22:36.890+01:00

Indeed, the count of 7 tests is based entirely on the output SPSS would provide, but it's a lower bound.

I agree testing for all these effects is perhaps weird, but I don't think it's uncommon to add all factors in experiments, and interpret the outcome based on what is significant without correcting for multiple tests. If this post motivates researchers to choose their tests more carefully, that would be great.

No need to apologize! These discussions go to the ...

2016-01-02T20:56:33.589+01:00

No need to apologize! These discussions go to the core of the problem, and it is important to address them. With respect to the first post - researchers clearly did not pre-register. This leaves open selective reporting, flexibility in the data analysis, etc. These researchers will try to game any threshold - and yes, you are right, approaches that don't require a threshold are more likely to get people to just report the data. Apart from pre-registration and a changed reward system, I see no solution.

With respect to the second point, I meant that an exploratory analysis in study 1 will be a confirmatory test in a follow-up study. If a finding looks promising, over studies, it will be everyone's intention to test it. With huge numbers of tests, Holm's correction is not workable. More modern approaches that correct for false discovery rates might be preferable. I also would not know how other approaches (assuming you have very little prior info) would fare better. Obviously, if you have very usable priors, Bayesian statistics will always outperform frequentist approaches.
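One widely used false discovery rate procedure is Benjamini-Hochberg. The comment does not name a specific method, so treating it as the example here is an assumption. A minimal Python sketch of its adjusted p-values, with a hypothetical function name and made-up p-values:

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)  # largest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value when sorted ascending (1 = smallest)
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from three tests.
print(bh_adjust([0.01, 0.04, 0.03]))
```

Unlike Holm, the multiplier shrinks with the rank of the p-value, which is why the procedure stays usable when the number of tests is large.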

[Part 2 of 2] DL: "... differences in inten...

2016-01-02T20:37:21.973+01:00

[Part 2 of 2]

DL: "... differences in intentions will wash out, just as differences in priors will wash out in Bayesian statistics, where the dependency on beliefs yields differences between researchers."
An analogy between (a) intentions for sampling distributions and (b) prior beliefs for Bayesian inference is tempting but extremely misleading. (I don't know to what extent you intended to make the analogy, but many people ask me about this so I couldn't let it rest.) Consider the following simple example. There are two professional baseball players, A who is a pitcher and B who is a catcher. In a particular season A had 4 hits in 25 at-bats while B had 6 hits in 30 at-bats. Question: Are the batting abilities of A and B "really" different? By the way, we are also interested in testing many dozens of comparisons of other players. The Bayesian analysis uses the prior knowledge that A is a pitcher and B is a catcher and concludes that the two players almost certainly have different batting abilities, because professional pitchers and catchers generally have very different batting abilities and it would take a lot more data to budge us away from the prior knowledge. In contrast, the frequentist analysis "corrects" for the dozens of other intended tests and concludes, with p>>.05, that the batting abilities are not significantly different. Infinite sample size eventually makes the conclusions of the two methods converge, but in the meantime there are real differences between sampling intentions and prior knowledge.

(Again, sorry to perseverate on this so long... It just truly is the topic that pushed me over the brink into Bayesian approaches...)

[Part 1 of 2] Hi again Daniel: I resonate with t...

2016-01-02T20:36:53.534+01:00

[Part 1 of 2]
Hi again Daniel:

I resonate with the desire to control errors, especially in exploratory research. I agree that it's important to apprise researchers of these issues of false-alarm rates for different stopping and testing intentions, and I admire your efforts to do so! And I tried, very hard, to let this thread end with your previous comment, but alas herewith I have failed. :-)

DL: "I also think the dependency on intentions is not too problematic in practice."
That has not been my experience. I have reviewed so many manuscripts in which people are obviously prevaricating to rationalize why they've made only a few tests with the least severe correction for multiple tests, just so the "corrected" p values squeak under .05. In one manuscript I reviewed, a set of experimental conditions that obviously went together (they were run at the same time on the same subject pool) were split apart and labeled as Experiment 1 and Experiment 2 so that the multiple tests "within experiments" had lower corrected p values than if the conditions were all put together into a single set of tests. Later in that same manuscript, in the discussion, they tacked on some extra comparisons "across experiments" and used uncorrected p values for those. In general, for any study with more than a few conditions there are almost always more interesting comparisons and contrasts that could be done, yet people feign disinterest in those comparisons because testing them would inflate the p values of the tests they want to report with p<.05. Defenders of corrections might reply, "Well of course the corrections can be 'gamed' by unscrupulous researchers, but the corrections are not problematic when applied honestly." But even then the corrections remain problematic because honest researchers can be honestly interested in different comparisons and contrasts. Researcher A is honestly interested in testing ten comparisons because she has thought through a lot of interesting implications of the research design. Researcher B, looking at the very same data, is honestly interested in testing only three of those comparisons (perhaps because he hasn't thought about it deeply enough). For those three comparisons, the p values of Researcher B are honestly lower than the p values of Researcher A, even though the data and three comparisons are identical.
Maybe that apparent clash of conclusions is not problematic, by definition, because it properly takes into account the set of intended tests. But for me it remains a problem in practice.
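The Researcher A versus Researcher B contrast can be made concrete with a small Python sketch. It assumes a plain Bonferroni correction, which the comment does not specify, and the three raw p-values are invented for illustration.

```python
def bonferroni(p, n_intended_tests):
    """Bonferroni-adjusted p-value given the number of intended tests."""
    return min(1.0, p * n_intended_tests)


# Hypothetical raw p-values for the three comparisons both researchers share.
raw_p_values = [0.004, 0.012, 0.020]

researcher_a = [bonferroni(p, 10) for p in raw_p_values]  # intends ten comparisons
researcher_b = [bonferroni(p, 3) for p in raw_p_values]   # intends only these three

for raw, a, b in zip(raw_p_values, researcher_a, researcher_b):
    print(f"raw={raw:.3f}  A={a:.3f}  B={b:.3f}")
```

Same data, same three comparisons, different corrected p-values: only the number of intended tests changed.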

[continues in next comment]

An exploratory test in follow up studies will beco...

2016-01-02T19:40:28.963+01:00

An exploratory test in follow-up studies will become part of everyone's intention to test. So really, there's no big issue. Lines of research can be meta-analyzed, and meta-analyses have their own error rates.

"We never take single studies as proof of muc...

2016-01-02T18:37:49.600+01:00

"We never take single studies as proof of much, and in lines of studies, the differences in intentions will wash out (...)," that's an odd thing to say for someone whose statistical approach prohibits integration of evidence over studies.

Just a side note: the initial sentence that there ...

2016-01-02T13:59:36.686+01:00

Just a side note: the initial sentence that there are 7 hypotheses to test in a 2x2x2 design (3 mains, 3 two-way interactions and 1 three-way interaction) is in tune with what seems to be the standard approach.
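The count of 7 follows from enumerating every non-empty subset of the three factors. The Python sketch below also shows how quickly the chance of at least one false alarm grows if all 7 tests are run at alpha = .05. Treating the tests as independent is a simplifying assumption made here, and the factor labels are placeholders.

```python
from itertools import combinations

factors = ["A", "B", "C"]  # placeholder names for the three two-level factors

# Every non-empty subset of factors is one effect: 3 mains, 3 two-way, 1 three-way.
effects = [
    combo
    for size in range(1, len(factors) + 1)
    for combo in combinations(factors, size)
]

alpha = 0.05
# Probability of at least one false alarm across all effects, assuming independence.
familywise_rate = 1 - (1 - alpha) ** len(effects)

print(len(effects), round(familywise_rate, 3))
```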

First, there are obviously many more tests you could run (x1 = 1, x1 = 2*x2, x1*x2*x3=1, x1 = log(x2), etc), so in this sense the problem is even worse than you make it seem. Second, and I think this is covered in some of the other comments, one could argue that the only tests to run are the ones about which you have formulated clear hypotheses. Then the problem is a bit less serious than suggested in the example.

Anyhow, my point is that the standard approach of always testing all main and interaction effects (against 0) is just as weird as the reluctance to adapt your alpha.

I did! The mutoss package has a huge number of tes...

2016-01-02T10:23:04.877+01:00

I did! The mutoss package has a huge number of tests. The method is perfectly fine; it just has a small extra assumption of stochastic independence, if I remember correctly. There is some improvement in power, but it's really in the third digit after the decimal separator, and since my motto is to explain the 20% that gives an 80% improvement in inferences, and the post was quite long, I left it out.

Hi Daniel, have you ever looked at the Hochberg st...

2016-01-02T10:04:22.500+01:00

Hi Daniel, have you ever looked at the Hochberg step-up procedure? It looks at all tests, starting with the largest p value.

See http://www.stat.osu.edu/~jch/PDF/HuangHsuPreprint.pdf

I use this in multiple pair-wise testing; if that's of any interest, see https://github.com/CPernet/Robust_Statistical_Toolbox/blob/master/stats_functions/univariate/rst_multicompare.m (same as R. Wilcox's R function)
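As a rough illustration of the step-up idea (this is not the linked MATLAB function), here is a Python sketch of Hochberg adjusted p-values. It assumes the standard rule: start from the largest p-value, multiply the k-th largest by k, and take the running minimum. The procedure is usually justified under independence or positive dependence of the tests. The function name and p-values are hypothetical.

```python
def hochberg_adjust(p_values):
    """Hochberg step-up adjusted p-values, returned in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)  # largest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        # The largest p-value is multiplied by 1, the next largest by 2, and so on.
        running_min = min(running_min, (position + 1) * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from three pairwise comparisons.
print(hochberg_adjust([0.01, 0.04, 0.03]))
```

For these inputs the adjusted values are no larger than what a Holm step-down adjustment would give, which is where the small power gain comes from.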

It's important to understand the limitations o...

2016-01-02T08:46:28.046+01:00

It's important to understand the limitations of any approach - and the dependency on intentions is an important limitation. I prefer to teach people how to make these corrections (also when it comes to optional stopping, see Lakens, 2014), and I do think error control is really important, especially in more exploratory research lines. There are other options, and I always encourage people to explore these. But I also think the dependency on intentions is not too problematic in practice. We never take single studies as proof of much, and in lines of studies, the differences in intentions will wash out, just as differences in priors will wash out in Bayesian statistics, where the dependency on beliefs yields differences between researchers.

Deborah: Concern for error probabilities is of cou...

2016-01-02T03:02:36.442+01:00

Deborah: Concern for error probabilities is of course an earnest and important concern. We should all hope to control error rates in making decisions. My gripe is not with the concern, but with the impossibility of pinning it down. An error rate is defined in terms of the space of possible errors that could have been generated by counterfactual imaginary data, and those hypothetical data are defined by the sampling intentions, by which I mean the analyst's intended stopping criterion and intended tests. (Sorry to persist with the i-word, but for me the word accurately captures the true meaning of the situation because the word makes explicit the role of the analyst beyond what's in the data.) Because the error rate is inherently conditional on the intended stopping criterion and tests, "the" error rate for a single set of real data can vary dramatically depending on the intentions. This fact is, of course, what motivates the blog post that we're commenting on. If a person wants to make a decision based on an error rate, then that person must work through the various error rates that arise from different stopping intentions and different testing intentions, and live with the vagaries of doing that. As Daniel said in his post, "My [Daniel's] main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. ... there are many reasonable ideas about what the ‘family’ of errors is you want to control." Indeed there are.

Here is everything anyone needs to know about your...

2016-01-02T01:07:56.193+01:00

Here is everything anyone needs to know about your vaunted frequentist error control:

http://www.bayesianphilosophy.com/test-for-anti-bayesian-fanaticism-part-ii/

John: It is entirely appropriate to see a concern ...

2016-01-02T00:36:30.539+01:00

John: It is entirely appropriate to see a concern with error probabilities as the key point distinguishing frequentist (error statistical) methods and Bayesian ones. That is too often forgotten or overlooked. However, it's a mistake to suppose a concern with error probabilities is a concern with "intentions" rather than a concern with probing severely and avoiding confirmation bias. To pick up on gambits that alter or even destroy error probability control is to pick up on something quite irrelevant to evidence for Bayesian updating, Bayes ratios, likelihood ratios. This follows from the Likelihood Principle. Unless the Bayesian adds a concern to ensure the output (be it a posterior or ratio) will frequently inform you of erroneous interpretations of data--in which case he or she has become an error statistician--the only way to save yourself from cheating is to invoke a prior which is supposed to save the day. But then the whole debate turns on competing beliefs rather than a matter of whether the data and method warrant the inference well or terribly. In other words, the work that the formal inference method was supposed to perform has to (hopefully) be performed by you, informally. Moreover, it entitles you to move from a statistical result to a substantive research hypothesis, which frequentist tests prohibit.

But the set of intended tests is exactly what is a...

2016-01-01T22:26:30.146+01:00

But the set of intended tests is exactly what is at issue for the corrections. Perhaps it sounds better for the set of tests to be called "the tests implied by a theory" but it's still just the set of intended tests.

For a researcher entertaining theory A, it's 13 (say) comparisons and contrasts that might be interesting and therefore the researcher intends to do those. For a researcher entertaining theory B, it's 17 comparisons and contrasts (some the same as in theory A) that might be interesting and therefore the researcher intends to do those instead.

Intentions are central for other corrections too. In particular, corrections for optional stopping ("data peeking") are all about the intention to continue sampling data if an interim test fails. In optional stopping, the analyst is doing a series of multiple tests -- it's the multiple test situation again, but with a different structure. The only thing that defines the structure is the intention to stop when reaching "significance" or when patience expires.
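The effect of an intention to keep sampling can be checked by simulation. The Python sketch below is an illustration under simple assumptions chosen here, not taken from the thread: data are standard normal with a true mean of zero, a z-test with known variance is run after every batch, and sampling stops at the first |z| > 1.96. All names and settings are hypothetical.

```python
import math
import random


def false_alarm_rate(n_looks, batch_size=20, n_sims=5000, z_crit=1.96, seed=1):
    """Share of null simulations that reach |z| > z_crit at any of n_looks interim tests."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Collect another batch, then test the accumulated data.
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch_size))
            n += batch_size
            if abs(total / math.sqrt(n)) > z_crit:
                false_alarms += 1
                break  # stop sampling as soon as "significance" is reached
    return false_alarms / n_sims


print("one planned test:", false_alarm_rate(1))
print("peeking at five points:", false_alarm_rate(5))
```

The data-generating process is identical in both runs; only the stopping rule differs, and the false-alarm rate with it.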

For me, THE key problem that drove me away from frequentism to Bayesian approaches is the issue of correcting for multiple comparisons and stopping intentions.

Thanks for the comment. The article by Bender &...

2016-01-01T20:40:33.242+01:00

Thanks for the comment. The article by Bender & Lange (2001) provides a really good discussion of the difference between family-wise and experiment-wise error control (including a nice example of a distinction between primary outcomes and secondary outcomes, both being corrected independently).

I expected the career-wise error correction joke - tried to find an official reference for it, but I couldn't find one (except a chapter by Dienes). It's too extreme. You can debate what a family is, and as long as you justify and clearly communicate the decision, you'll be fine.

The correction is not needed for the tests you intend to do. At least, I've not seen any paper arguing this. Again, the common view is that the tests examine a single theory. As with any point where statistical inferences interact directly with theory, there will never be one-size-fits-all rules you can blindly follow, but you will need to think, argue, and try your best.

Hi. I thought the usual reason for not correctin...

2016-01-01T20:10:10.824+01:00

Hi.

I thought the usual reason for not correcting for multiple tests across factors and interactions was that the study could have been done without the extra factors and therefore they are separate "families" of tests. Analysts want the "familywise" false-alarm rate to be 5%, for each family. If I understand correctly, in this post you are requiring the "experimentwise" false-alarm rate to be 5% (combining all families). I am NOT defending the distinction between familywise and experimentwise, just bringing it up.

Because the division from one study to the next can be arbitrary (i.e., when does a "study" end?) I think it might be more appropriate to use "career-wise" corrections for multiple tests: We should correct all the tests done by a researcher in his/her career, which presumably is several thousand tests.

Also, the corrections are supposed to apply to the full set of tests that one INTENDS to do, whether or not one actually reports the tests, and whether or not the tests are possible in principle but uninteresting. That would include all the comparisons and contrasts one might be interested in, which could be dozens more. (Again, just raising the issue, not defending a position.) Exactly which correction to use depends on the set of intended tests and the specific structural relation of the tests.