Thursday, May 11, 2017

How a power analysis implicitly reveals the smallest effect size you care about

When designing a study, you need to justify the sample size you aim to collect. If one of your goals is to observe a p-value lower than the alpha level you decided upon (e.g., 0.05), one justification for the sample size can be a power analysis. A power analysis tells you the probability of observing a statistically significant effect, based on a specific sample size, alpha level, and true effect size. At our department, people who use power as a sample size justification need to aim for 90% power if they want to get money from the department to collect data.

A power analysis is performed based on the effect size you expect to observe. When you expect an effect with a Cohen’s d of 0.5 in an independent two-tailed t-test, and you use an alpha level of 0.05, you will have 90% power with 86 participants in each group. What this means is that only 10% of the distribution of effect sizes you can expect when d = 0.5 and n = 86 per group falls below the critical value required to get p < 0.05 in an independent t-test.
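If you want to check these numbers yourself, here is a minimal sketch in base R (the exact decimals may differ slightly from G*Power’s output):

# Sample size per group for 90% power, d = 0.5, alpha = .05,
# two-sided independent t-test (assuming SD = 1 in both groups)
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.90,
             type = "two.sample", alternative = "two.sided")
# n comes out at roughly 85, which is rounded up to 86 participants per group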

In the figure below, the power analysis is visualized by plotting the distribution of Cohen’s d given 86 participants per group when the true effect size is 0 (or the null-hypothesis is true), and when d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α, when there is a true effect).
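The blue area can be computed directly from the noncentral t-distribution; a minimal sketch, assuming equal group sizes and a standard deviation of 1 (so the raw mean difference equals Cohen’s d):

n <- 86                             # participants per group
d <- 0.5                            # assumed true effect size
df <- 2 * n - 2
t_crit <- qt(1 - 0.05/2, df)        # critical t-value for alpha = .05, two-sided
ncp <- d * sqrt(n / 2)              # noncentrality parameter of the t-distribution
beta <- pt(t_crit, df, ncp = ncp)   # probability of a t-value below the critical value
beta                                # ~0.10: the Type 2 error rate (the blue area)
1 - beta                            # ~0.90: the statistical power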


You’ve probably seen such graphs before (indeed, G*Power, the widely used power analysis software, provides these graphs as output). The only thing I have done is transform the t-value distribution that is commonly used in these graphs into a distribution of Cohen’s d. This is a straightforward transformation, but instead of presenting the critical t-value, the figure provides the critical d-value. I think people find it easier to interpret d than t. Only t-tests that yield a t ≥ 1.974, or a d ≥ 0.30, will be statistically significant. All effects smaller than d = 0.30 will never be statistically significant with 86 participants in each condition.
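The transformation from the critical t-value to a critical d-value only requires the group sizes; a minimal sketch for the example above:

n <- 86                                  # participants per group
t_crit <- qt(1 - 0.05/2, df = 2*n - 2)   # ~1.974
d_crit <- t_crit * sqrt(1/n + 1/n)       # ~0.30: the smallest observed d that reaches p < .05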
 
If you design a study where the results will be analyzed with an independent two-tailed t-test with α = 0.05, the smallest effect size that can be statistically significant is determined exclusively by the sample size. The (unknown) true effect size only determines how far to the right the distribution of d-values lies, and thus which percentage of effect sizes will be larger than the smallest effect size of interest (and will be statistically significant; this percentage is the statistical power).
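To see this, you can hold the sample size at 86 per group and vary the assumed true effect size: the critical d-value (about 0.30) never moves, only the proportion of expected effects exceeding it (the power) changes. A sketch, with the true effect sizes below chosen purely for illustration:

n <- 86
true_d <- c(0.2, 0.3, 0.5, 0.8)   # hypothetical true effect sizes
power <- sapply(true_d, function(d)
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
               type = "two.sample")$power)
round(data.frame(true_d, power), 2)
# The critical d-value stays ~0.30 in every scenario; only the power changes.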

I think it is reasonable to assume that if you decide to collect data for a study where you plan to perform a null-hypothesis significance test, you are not interested in effect sizes that will never be statistically significant. If you design a study that has 90% power for a medium effect of d = 0.5, the sample size you decide to use means that effects smaller than d = 0.3 will never be statistically significant. We can use this fact to infer what your smallest effect size of interest, or SESOI (Lakens, 2014), will be. Unless you state otherwise, we can assume your SESOI is d = 0.3, and any effects smaller than this are considered too small to be interesting. Obviously, you are free to explicitly state that any effect smaller than d = 0.5 or d = 0.4 is already too small to matter for theoretical or practical purposes. But without such an explicit statement about what your SESOI is, we can infer it from your power analysis.

This is useful. Researchers who use null-hypothesis significance testing often only specify the effect they expect when the null is true (d = 0), but not the smallest effect size that should still be considered support for their theory when there is a true effect. This leads to a psychological science that is unfalsifiable (Morey & Lakens, under review). Alternative approaches to determining the smallest effect size of interest have recently been suggested. For example, Simonsohn (2015) suggested setting the smallest effect size of interest to the effect size the original study had 33% power to detect. So, if an original study used 20 participants per group, the smallest effect size of interest would be d = 0.49 (the effect size the study had 33% power to detect with n = 20).
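Simonsohn’s benchmark can be recovered by solving for the effect size the original design had 33% power to detect; a minimal sketch in base R (the exact decimals depend on the software’s rounding):

# Effect size a study with n = 20 per group had 33% power to detect
power.t.test(n = 20, power = 1/3, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$delta
# roughly d = 0.49-0.50, depending on rounding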

Let’s assume the original study used a sample size of n = 20 per group. The figure below shows that an observed effect size of d = 0.8 would be statistically significant (d = 0.8 lies to the right of the critical d-value), and that the critical d-value is d = 0.64. That means that effects smaller than d = 0.64 would never be statistically significant in a study with 20 participants per group in a between-subjects design. I think it makes more sense to assume the smallest effect size of interest for researchers who design a study with n = 20 is d = 0.64, rather than d = 0.49.
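The critical d-value for n = 20 per group follows from the same calculation as before (a sketch):

n <- 20                                  # participants per group in the original study
t_crit <- qt(1 - 0.05/2, df = 2*n - 2)   # ~2.02 with df = 38
d_crit <- t_crit * sqrt(2 / n)           # ~0.64: observed effects below this can never reach p < .05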


The figures can be produced by a new Shiny app I created (the Shiny app also plots power curves and the p-value distribution [they are not all visible on Shinyapps.org, but you can try HERE as long as bandwidth lasts, or just grab the code and app from GitHub] – I might discuss these figures in a future blog post). If you have designed your next study, check the critical d-value to make sure that the smallest effect size you care about isn’t smaller than the critical effect size you can actually detect. If you think smaller effects are interesting, but you don’t have the resources to detect them, specify your SESOI explicitly in your article. You can also use this specified smallest effect size of interest in an equivalence test to statistically reject effects as large as, or larger than, the effects you deem worthwhile (Lakens, 2017), which will help you interpret t-tests where p > α. In short, we really need to start specifying the effects we expect under the alternative model, and if you don’t know where to start, your power analysis might have been implicitly telling you what your smallest effect size of interest is.
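To illustrate how such an equivalence test against a SESOI of d = 0.3 might look, here is a minimal two one-sided tests (TOST) sketch in base R; the summary statistics are hypothetical and only serve as an example (see Lakens, 2017, and the TOSTER R package for the full procedure):

# Hypothetical summary statistics for two independent groups
m1 <- 0.10; m2 <- 0.00; sd1 <- 1; sd2 <- 1; n1 <- 86; n2 <- 86
sesoi_d <- 0.3                                 # smallest effect size of interest

sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
se <- sd_pooled * sqrt(1/n1 + 1/n2)
df <- n1 + n2 - 2
bound <- sesoi_d * sd_pooled                   # SESOI on the raw scale

t_lower <- ((m1 - m2) + bound) / se            # test against the lower equivalence bound
t_upper <- ((m1 - m2) - bound) / se            # test against the upper equivalence bound
p_lower <- pt(t_lower, df, lower.tail = FALSE)
p_upper <- pt(t_upper, df, lower.tail = TRUE)
max(p_lower, p_upper)                          # equivalence is declared if this p < alpha

If the larger of the two p-values is below your alpha level, you can reject the presence of effects as large as or larger than your SESOI.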


References
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. https://doi.org/10.1002/ejsp.2023

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. https://doi.org/10.1177/1948550617697177

Morey, R. D., & Lakens, D. (under review). Why most of psychology is statistically unfalsifiable.

Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341

23 comments:

  1. If experimental psychologists are basing their sample size requirements on the effect size they expect to observe, then they are making a mistake, because, for one thing, their experiment will be underpowered to detect a smaller effect size that they would still consider scientifically of interest. Sample size planning should always be based on the smallest effect size of interest.

    1. That's what I was thinking - the d = 0.3, i.e. what you call the smallest effect size of interest, is just the sample effect size, whereas the d = 0.5 you based the power calculation on is a postulated population effect size. If you have a smallest effect size of interest, wouldn't you want to treat it as a postulated population effect size and base your power computation on that?

      Incidentally, your departmental policy sounds interesting (swing for 90% power). Do you have any worked out examples, i.e., of your colleagues identifying the effect size that they're investigating etc.?

    2. Hi - current practice in power analysis in psychology is not very state of the art, and the problem is that smallest effect sizes of interest are almost never specified or used. That's why I wrote this post - to bootstrap a SESOI, building from a practice people already use (power analysis).

      We've had this practice of 90% power for about 2 years. I could give examples - very often people in our group specify a SESOI, or they look at a pilot study, and then use a more conservative estimate in a power analysis. If there is large uncertainty, we recommend sequential analyses.

    3. Jan, although your last paragraph was presumably aimed at Daniel, rather than me, when applying for funding, I always base my proposed sample size on having 90% power to detect the smallest effect size of interest. This usually winds up giving me 99% power or more to detect the hypothesized true effect size.

    4. JT - it's good you don't work at our department! Our ethics department would not be easily convinced by designing studies with 99% power - it's wasteful, and our resources can be spent more efficiently! You should really do sequential analyses (see Lakens, 2014, for an introduction; you are anonymous, so I don't know what you know about stats, but if you've never learned about sequential analyses, you should!).

    5. Daniel, I'm a biostatistician, but I occasionally consult for social scientists. I disagree that 99% power is inefficient. We are interested not just in detecting an effect, but in obtaining a reasonably precise estimate of the effect size. The flip side of high power is narrow confidence intervals.

      As to sequential analysis, I agree that in many psych experiments the approach is useful. However, sequential analysis would not be practical in most studies I've been involved with. For example, if we need patients to be under treatment for three months, then, for many logistical reasons as well as financial ones, we really need the study to terminate after three months.

    6. Should you ever run out of ideas for blog posts, I think one where you detail how you or your collaborators arrived at an effect size or SESOI would make for interesting reading. My sense is that power analyses are often based on canned effect sizes with little regard to the specifics of the study (theory and design), so it would be useful to see some more sophisticated approaches to specifying ESs.

  2. Hi Daniël! Interesting post. Just a detail: I think Simonsohn (2015) did not suggest setting the smallest effect size of interest to 33% of the effect size in the original study, as you write. He suggested setting the smallest effect size of interest such that the original experiment had 33% power to reject the null if this ES were true. This smallest ES of interest thus does not depend on the effect size found in the original study: it only depends on the sample size. For instance, for n = 20 per cell in a two-cell design, the effect size would be d = 0.5, because this gives 33% power. Your approach is that the smallest ES is the effect size that gives 50% power in the original study. It makes a difference, but I think your approach is, in the end, quite close to Simonsohn's approach.

    1. Thanks! Changed (and I knew that - last minute addition I didn't think through! Thanks for correcting me!).

    2. I think there is still a typo (or words missing) in that passage: "Simonsohn (2015) suggested to set the smallest effect size of interest to 33% of the effect size in the original study could detect."

  3. I find this equivocates between the observed ES and the population ES. This is very common in psych, and it would really help if you labelled which you have in mind whenever one is used. Cohen had a subscript s for the observed ES. (I use difference for the observed effect size, and discrepancy for the parametric one.)
    To take the simple one-sample test of a Normal mean, H0: μ ≤ 0 vs H1: μ > 0, the cut-off for rejection at the 0.025 level is a sample mean M of 1.96 SE. Are you saying the pop effect size of interest is this cut-off, 1.96 SE? That would be to take, as the pop ES of interest, one against which the test has 50% power. I'm not saying that would be bad, I'm just trying to figure out your equivocal use of effect size.

  4. Since you asked about where it's equivocal, between pop ES and sample ES, here's one: you say "true effect size is 0 (or the null-hypothesis is true), and when d = 0.5."
    Here d = .5 appears to speak of the pop ES. On your graph it's the observed.
    Another: your first figure shows d = .5 and also d = .3; the first, I take it, is a pop ES, the second a sample ES.

    A separate issue I have with using these standardized pop d's is that it seems you're allowed to do the analysis without knowing the standard deviation. Is that so?

  6. Dear Daniel,

    Thanks for sharing your thoughts. Doing power analysis using effect size as its metric does not really alleviate the problem you're thoughtfully raising, namely finding the smallest effect of interest, because effect size is a simple transformation of other summary statistics (e.g., test statistics). As such, one can simply convert a critical test-statistic value to a corresponding critical effect-size value. Such a context-free, hypothesis-based view of power analysis is both old and impractical. Plotting power against effect sizes is useful in conveying the message that the expression "power of the test = some number" is basically not noteworthy. At a larger level, these revelations are instead important in moving social and behavioral research toward thinking in terms of Bayesian estimation of effect sizes. If you're interested in frequentist power analysis, much better ways of doing such power analyses are available via loss functions (frequentist decision-theoretic approaches). The traditional power-analytic approach you discuss here has criticisms that would take more space to cover, but in short, types of error other than Type I and Type II should be involved in the power analysis process.

    1. Hi, loss functions are what we need to work towards. But our field has no clue where to start - this post shows how we can bootstrap a SESOI, and in 10 years, maybe end up with loss functions. If you can give me 20 examples of adequately developed loss functions in psychology, that would be great. If you can't (and you can't) it proves the point of my post.

  7. Dear Daniel,

    That's what good methodologists such as yourself are for, right! Promoting good stuff! Ken Kelley, for example, has implemented loss functions for use in a wide range of power-analytic situations in his package. I'm a mathematical statistician and really understand some of the limitations in human research. I also mentioned in passing that other types of error in small-sample research, such as the research done in psychology, should be seriously heeded when doing power analysis in the traditional sense. I believe Andrew Gelman has a math-free paper on this topic which is publicly available. Anyway, the point was that a "Cohen's d" sampling density is simply a "location-scale" version of the t-distribution. You make good comments that prepare your colleagues to adopt a Bayesian approach to the estimation of the parameters of their interest. Best of luck with your work. Keep it up!

    1. I don't see a lot of benefit of incorporating prior information in inferences in the next decades, given the current state of knowledge. Can you point me to Ken Kelley's loss functions? I could not find any information in MBESS or on his website, but it is a topic I actively plan to work on in the coming years, so I'd appreciate any info you can share.

  8. Dear Daniel,

    Thank you for your post! I'm a beginner trying to understand power analysis and equivalence tests/SESOI calculation. One of your assumptions is that the data have a normal distribution. What would your approach be if they do not?

  9. It's the same idea, just different calculations.

  10. Thanks Daniel! Do you have any recommendations for testing equivalence with non-normal data?

  11. Re: "you are not interested in effect sizes that will never be statistically significant"

    I would add "...within a time frame that is credible" i.e. findings are most likely to be significant while power is low, well before the "estimated" duration, but these are less likely to reproduce.

    Anyway, I had a similar thought, I think. I opted to address uncertainty about the effect size by showing a range of effect sizes. I think ranges are more intuitive than single numbers, especially when the likely effect size is hard to estimate. The idea is: I first see some reasonable proposal as to how long to run and what I can detect, then I adjust the duration until the range feels achievable given what is being tested. In online analytics it works, because duration can be extended indefinitely, but usually quick decisions are needed. If my duration is fixed, I can gauge by the best-case scenario.

    Here's the reverse sample size / effect size calculator I came up with: http://vladmalik.com/abstats

    On top of visualizing power, I also wanted to see the impact on false positive rate in case my hypothesis is wrong e.g., I could gauge if the results I am seeing are within the range predicted by pure chance.

    I've no formal training in stats, so I was always curious if my approach to this has merit in other real-life scenarios. Glad to see you're doing something somewhat similar. Would love to hear your thoughts too.
