*This blog post is now included in the paper "Sample size justification" available at PsyArXiv.*

In many published articles the sample size is *stated*, but not *justified*. This makes it difficult to evaluate how informative the study was. Referees can’t just assume the number of observations is sufficient to provide an informative answer to your research question, so leaving out a justification for the number of observations is not best practice, and a reason reviewers can criticize your submitted article.

A common reason why a specific number
of observations is collected is that collecting more data was not feasible.
Note that all decisions for the sample size we collect in a study are based on
the resources we have available in some way. A **feasibility justification** makes these resource limitations the
primary reason for the sample size that is collected. Because we always have
resource limitations in science, even when feasibility is not our primary
justification for the number of observations we plan to collect, feasibility is
always at least a secondary reason for any sample size justification. Despite
the omnipresence of resource limitations, the topic often receives very little
attention in texts on experimental design. This might make it feel like a feasibility
justification is not appropriate, and you should perform an a-priori power
analysis or plan for a desired precision instead. But feasibility limitations
play a role in every sample size justification, and therefore regardless of
which justification for the sample size you provide, you will almost always
need to include a feasibility justification as well.

Time and money are the two main resource limitations a scientist faces. Our master students write their thesis in 6 months, and therefore their data collection is necessarily limited to whatever can be collected in 6 months, minus the time needed to formulate a research question, design an experiment, analyze the data, and write up the thesis. A PhD student at our department would have 4 years to complete their thesis, but is also expected to complete multiple research lines in this time. In addition to limitations on time, we have limited financial resources. Although nowadays it is possible to collect data online quickly, if you offer participants a decent pay (as you should) most researchers do not have the financial means to collect thousands of data points.

**A feasibility justification puts the limited resources at the center of the justification for the sample size that will be collected**. For example, one might argue that 120 observations is the most that can be collected in the three weeks a master student has available to collect data, when each observation takes an hour to collect. A PhD student might collect data until the end of the academic year, and then needs to write up the results over the summer to stay on track to complete the thesis in time.

A feasibility justification thus *starts* with the expected number of
observations (N) that a researcher expects to be able to collect. The challenge
is to evaluate whether collecting N observations is worthwhile. The answer
should sometimes be that data collection is *not*
worthwhile. For example, assume I plan to manipulate the mood of participants
using funny cartoons and then measure the effect of mood on some dependent
variable - say the amount of money people donate to charity. I should expect an
effect size around d = 0.31 for the mood manipulation
(Joseph et al., 2020), and it seems unlikely that the effect on donations
will be larger than the effect size of the manipulation. If I can only collect
mood data from 30 participants in total, how do we decide if this study will be
informative?

## How informative is the data that is feasible to collect?

If we want to evaluate whether the feasibility limitations make data collection uninformative, we need to think about what the goal of data collection is. First of all, having data always provides more knowledge than not having data, so in an absolute sense, all additional data that is collected is better than not collecting data. However, in line with the idea that we need to take into account costs and benefits, it is possible that the cost of data collection outweighs the benefits. To determine this, one needs to think about what the benefits of having the data are. The benefits are clearest when we know for certain that someone is going to make a decision, with or without data. If this is the case, then any data you collect will reduce the error rates of a well-calibrated decision process, even if only ever so slightly. In these cases, the value of information might be positive, as long as the reduction in error rates is more beneficial than the costs of data collection. If your sample size is limited and you know you will make a decision anyway, perform a compromise power analysis, where you balance the error rates, given a specified effect size and sample size.
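As a rough illustration of what a compromise power analysis does, the base R sketch below searches for the alpha level that minimizes the average of the Type 1 and Type 2 error rates. The inputs are assumptions chosen for illustration only (15 participants per group, d = 0.5, equal weight on both errors), and the helper names `beta_for_alpha` and `combined_error` are made up for this example. Dedicated software offers more options, such as unequal weights for the two errors.

```r
# Compromise power analysis sketch for a two-sided independent t-test.
# Assumed (illustrative) inputs: n = 15 per group, effect size d = 0.5.
n_per_group <- 15
d_assumed <- 0.5

# Type 2 error rate (beta) for a given alpha level
beta_for_alpha <- function(alpha) {
  1 - power.t.test(n = n_per_group, delta = d_assumed, sd = 1,
                   sig.level = alpha, type = "two.sample",
                   alternative = "two.sided", strict = TRUE)$power
}

# Average of the two error rates (equal weights are an assumption)
combined_error <- function(alpha) (alpha + beta_for_alpha(alpha)) / 2

# Search for the alpha level that minimizes the combined error rate
opt <- optimize(combined_error, interval = c(0.001, 0.999))
alpha_star <- opt$minimum
beta_star <- beta_for_alpha(alpha_star)

c(alpha = alpha_star, beta = beta_star)
```

With a small sample, the alpha level that balances the errors is typically higher than 0.05, because holding alpha at 0.05 leaves the Type 2 error rate very large.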

Another way in which a small dataset can be valuable is if its existence makes it possible to combine several small datasets into a meta-analysis. This argument in favor of collecting a small dataset requires 1) that **you share the results in a way that a future meta-analyst can find them regardless of the outcome of the study**, and 2) that **there is a decent probability that someone will perform a meta-analysis in the future whose inclusion criteria would cover your study**, because a sufficient number of small studies exist. The uncertainty about whether there will ever be such a meta-analysis should be weighed against the costs of data collection. Will anyone else collect more data on cognitive performance during bungee jumps, to complement the 12 data points you can collect?

One way to increase the probability of a future meta-analysis is if you commit to performing this meta-analysis yourself in the future. For example, you might plan to repeat a study for the next 12 years in a class you teach, with the expectation that a meta-analysis of 360 participants would be sufficient to achieve around 90% power for d = 0.31. If it is not plausible you will collect all the required data by yourself, you can attempt to set up a collaboration, where fellow researchers in your field commit to collecting similar data, with identical measures, over the next years. If it is not likely sufficient data will emerge over time, we will not be able to draw informative conclusions from the data, and it might be more beneficial to not collect the data to begin with, and examine an alternative research question with a larger effect size instead.

Even if you believe sufficient data will emerge over time, you will most likely compute statistics after collecting a small sample. Before embarking on a study where your main justification for the sample size is based on feasibility, it helps to know what you can expect from these statistics. I propose that a feasibility justification for the sample size, in addition to a reflection on the plausibility that a future meta-analysis will be performed, and/or the need to make a decision, even with limited data, is always accompanied by three statistics, detailed in the following three sections.

### The smallest effect size that can be statistically significant

In the figure below the distribution of Cohen’s d given 15 participants per group is plotted when the true effect size is 0 (the null hypothesis is true), and when the true effect size is d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α when there is a true effect, with α = 0.05). One minus the Type 2 error rate is the statistical power of the test, given an assumption about a true effect size in the population. **Statistical power** is the probability that a test yields a statistically significant result if the alternative hypothesis is true. Power depends on the Type 1 error rate (α), the true effect size in the population, and the number of observations.

*Null and alternative distribution, assuming d = 0.5, alpha = 0.05, and N = 15 per group.*

You might have seen such graphs before. The only thing I have done is transform the *t*-value distribution that is commonly used in these graphs into the distribution of Cohen’s d. This is a straightforward transformation, but instead of presenting the critical *t*-value the figure provides the critical *d*-value. For a two-sided independent *t*-test, this is calculated as:

qt(1-(a / 2), (n1 + n2) - 2) * sqrt(1/n1 + 1/n2)

where ‘a’ is the alpha level (e.g., 0.05) and n1 and n2 are the sample sizes of the two independent groups. For the example above, where alpha is 0.05 and n = 15 in each group:

qt(1-(0.05 / 2), (15 * 2) - 2) * sqrt(1/15 + 1/15)

## [1] 0.7479725

The critical *t*-value (2.0484071) is also provided in commonly used power analysis software such as G*Power. We can compute the critical Cohen’s d from the *t*-value and sample size using $d_{crit} = t_{crit} \sqrt{1/n_1 + 1/n_2}$.

*The critical t-value is provided by G\*Power software.*

When you test an association between variables with a correlation, G*Power will directly provide you with the critical effect size. When you compute a correlation based on a two-sided test, your alpha level is 0.05, and you have 30 observations, only effects larger than r = 0.361 will be statistically significant. In other words, the effect needs to be quite large to even have the mathematical possibility of becoming statistically significant.

*The critical r is provided by G\*Power software.*
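If you do not have G*Power at hand, the critical correlation can also be derived from the critical *t*-value. The sketch below uses the standard relation between *t* and *r* for a test of a correlation against zero (r = t / sqrt(t² + df), with df = n - 2); the function name `critical_r` is made up for this example.

```r
# Critical correlation for a two-sided test of r against zero,
# derived from the critical t-value with n - 2 degrees of freedom.
critical_r <- function(n, alpha = 0.05) {
  df <- n - 2
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(t_crit^2 + df)
}

# 30 observations, alpha = 0.05
critical_r(30)
```

For 30 observations this should reproduce the critical r of about 0.361 mentioned above.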

The critical effect size gives you
information about the smallest effect size that, if observed, would be
statistically significant. If you observe a smaller effect size, the *p*-value will be larger than your
significance threshold. You always have some probability of observing effects
larger than the critical effect size. After all, even if the null hypothesis is
true, 5% of your tests will yield a significant effect. But what you should ask
yourself is whether the effect sizes that could be statistically significant
are realistically what you would expect to find. If this is not the case, it
should be clear that there is little (if any) use in performing a significance
test. Mathematically, when the critical effect size is larger than effects you
expect, your statistical power will be less than 50%. If you perform a
statistical test with less than 50% power, your single study is not very
informative. Reporting the critical effect size in a feasibility justification
should make you reflect on whether a hypothesis test will yield an informative
answer to your research question.

### Compute the width of the confidence interval around the effect size

The second statistic to report alongside a feasibility justification is the width of the 95% confidence interval around the effect size. 95% confidence intervals will capture the true population parameter 95% of the time in repeated identical experiments. The more uncertain we are about the true effect size, the wider a confidence interval will be. Cumming (2013) calls the difference between the observed effect size and the upper (or lower) limit of its 95% confidence interval the margin of error (MOE).

# Compute the effect size d and 95% CI

res <- MOTE::d.ind.t(m1 = 0, m2 = 0, sd1 = 1, sd2 = 1, n1 = 15, n2 = 15, a = .05)

# Print the result

res$estimate

## [1] "$d_s$ = 0.00, 95\\% CI [-0.72, 0.72]"

If we compute the 95% CI for an effect
size of 0, we see that with 15 observations in each condition of an independent
*t*-test the 95% CI ranges from -0.72
to 0.72. The MOE is half the width of the 95% CI, 0.72. This clearly shows we
have a very imprecise estimate. A Bayesian estimator who uses an uninformative
prior would compute a credible interval with the same upper and lower bound, and might conclude they personally believe there is a
95% chance the true effect size lies in this interval. A frequentist would
reason more hypothetically: If the observed effect size in the data I plan to
collect is 0, I could only reject effects more extreme than d = 0.72 in an
equivalence test with a 5% alpha level (even though if such a test would be
performed, power might be low, depending on the true effect size). Regardless
of the statistical philosophy you plan to rely on when analyzing the data, our
evaluation of what we can conclude based on the width of our interval tells us
we will not learn a lot. Effect sizes in the range of d = 0.7 are findings such
as “People become aggressive when they are provoked”, “People prefer their own
group to other groups”, and “Romantic partners resemble one another in physical
attractiveness” (Richard et al., 2003). The width of the confidence interval tells
you that you can only reject the presence of effects that are so large, if they
existed, you would probably already have noticed them. It might still be
important to establish these large effects in a well-controlled experiment. But
since most effect sizes we should realistically expect are much smaller, we
do not learn something we didn’t already know from the data that we plan to
collect. Even without data, we would exclude effects larger than d = 0.7 in most
research lines.

We see that the MOE is almost, but not exactly, the same as the critical effect size d we observed above (d = 0.7479725). The reason for this is that the 95% confidence interval is calculated based on the *t*-distribution. If the true effect size is not zero, the confidence interval is calculated based on the non-central *t*-distribution, and the 95% CI is asymmetric. The figure below visualizes three *t*-distributions, one symmetric at 0, and two asymmetric distributions with a noncentrality parameter of 2 and 3. The asymmetry is most clearly visible in very small samples (the distributions in the plot have 5 degrees of freedom) but remains noticeable when calculating confidence intervals and statistical power. For example, for a true effect size of d = 0.5 the 95% CI is [-0.23, 1.22]. The MOE based on the lower bound is 0.7317584 and based on the upper bound is 0.7231479. If we compute the 95% CI around the critical effect size (d = 0.7479725) we see the 95% CI ranges from exactly 0.00 to 1.48. If the 95% CI excludes zero, the test is statistically significant. In this case the lower bound of the confidence interval exactly touches 0, which means we would observe *p* = 0.05 if we exactly observed the critical effect size.

*Central (black) and 2 non-central (red
and blue) t-distributions.*
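To make the role of the non-central *t*-distribution concrete, here is a minimal base R sketch of how such an interval around Cohen’s d can be computed for two independent groups. It is not the MOTE implementation; the function name `ci_d` and its arguments are made up for this example. The idea is to search for the noncentrality parameters under which the observed *t*-value falls at the 97.5th and the 2.5th percentile, and then convert those back to the d scale.

```r
# Confidence interval around Cohen's d for two independent groups,
# based on the non-central t-distribution (illustrative sketch).
ci_d <- function(d, n1, n2, conf_level = 0.95) {
  scale <- sqrt(1 / n1 + 1 / n2)  # converts between d and t
  t_obs <- d / scale
  df <- n1 + n2 - 2
  tail <- (1 - conf_level) / 2

  # Find the noncentrality parameter for which t_obs sits at
  # cumulative probability p of the non-central t-distribution
  find_ncp <- function(p) {
    uniroot(function(ncp) pt(t_obs, df, ncp = ncp) - p,
            interval = t_obs + c(-3, 3), extendInt = "yes",
            tol = 1e-10)$root
  }

  # Convert the noncentrality limits back to the d scale
  c(lower = find_ncp(1 - tail), upper = find_ncp(tail)) * scale
}

ci_d(0, 15, 15)    # symmetric around 0
ci_d(0.5, 15, 15)  # asymmetric around 0.5

# Interval around the critical effect size: lower bound touches 0
d_crit <- qt(1 - 0.05 / 2, 28) * sqrt(1 / 15 + 1 / 15)
ci_d(d_crit, 15, 15)
```

Up to rounding, these calls should match the intervals reported in the text above.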

Where computing the critical effect size can make it clear that a *p*-value is of little interest, computing the 95% CI around the effect size can make it clear that the effect size estimate is of little value. The estimate will often be so uncertain, and the range of effect sizes you cannot reject when there is no effect so large, that the effect size estimate is not very useful. This is also the reason why performing a pilot study to estimate an effect size for an a-priori power analysis is not a sensible strategy (Albers & Lakens, 2018; Leon et al., 2011). Your effect size estimate will be so uncertain that it is not a good guide in an a-priori power analysis.

However, it is possible that the sample size is large enough to exclude some effect sizes that are still a-priori plausible. For example, with 50 observations in each independent group, you have 82% power for an equivalence test with bounds of -0.6 and 0.6. If the literature includes claims of effect size estimates larger than 0.6, and if effects larger than 0.6 can be rejected based on your data, this might be sufficient to tentatively start to question claims in the literature, and the data you collect might fulfill that very specific goal.
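The power of such an equivalence test can be approximated in a few lines of base R. The sketch below assumes a true effect size of 0, symmetric equivalence bounds expressed in Cohen’s d, equal group sizes, and a normal approximation to the test statistic; the function name `tost_power_approx` is made up for this example, and dedicated packages will give more exact results.

```r
# Approximate power of an equivalence test (two one-sided tests) for
# two independent groups, assuming a true effect of 0 and symmetric
# bounds in Cohen's d. Uses a normal approximation.
tost_power_approx <- function(n_per_group, bound_d, alpha = 0.05) {
  se <- sqrt(2 / n_per_group)  # standard error of d when d = 0
  power <- 2 * pnorm(bound_d / se - qnorm(1 - alpha)) - 1
  max(power, 0)                # power cannot be negative
}

# 50 observations per group, bounds of -0.6 and 0.6
tost_power_approx(50, 0.6)
```

Under these assumptions the result is close to the 82% mentioned above. With 15 observations per group and the same bounds the approximation returns 0, because the 90% confidence interval can never fall entirely within the bounds.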

### Plot a sensitivity power analysis

In a **sensitivity power analysis** the sample size and the alpha level are
fixed, and you compute the effect size you have the desired statistical power
to detect. For example, in the figure below the sample size in each group
is set to 15, the alpha level is 0.05, and the desired power is set to 90%. The
sensitivity power analysis shows we have 90% power to detect an effect of d =
1.23.

*Sensitivity power analysis in G\*Power software.*

Perhaps you feel a power of 90% is a bit high, and you would be happy with 80% power. We can plot a sensitivity curve across all possible levels of statistical power. In the figure below we see that if we desire 80% power, the effect size should be d = 1.06. The smaller the true effect size, the lower the power we have. This plot should again remind us not to put too much faith in a significance test when our sample size is small, since for 15 observations in each condition, statistical power is very low for anything but extremely large effect sizes.

*Plot of the effect size against the desired power when n = 15 per group and alpha = 0.05.*

If we look at the effect size that we
would have 50% power for, we see it is d = 0.7411272. This is very close to our
critical effect size of d = 0.7479725 (the smallest effect size that, if
observed, would be significant). The difference is due to the non-central *t*-distribution.
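A sensitivity power analysis like the one described in this section can also be run in base R with `power.t.test()`, by fixing the sample size and alpha level and leaving the effect size to be solved for. The sketch below is one possible way to do this, not the code behind the figures; the variable names and the grid of power levels are arbitrary choices for this example.

```r
# Sensitivity power analysis for a two-sided independent t-test:
# which effect size can be detected with a given power?
n_per_group <- 15
alpha <- 0.05
desired_power <- c(0.50, 0.80, 0.90, 0.95)

# Solve for the effect size (delta, with sd = 1 this is Cohen's d)
detectable_d <- sapply(desired_power, function(p) {
  power.t.test(n = n_per_group, power = p, sd = 1, sig.level = alpha,
               type = "two.sample", alternative = "two.sided")$delta
})

data.frame(power = desired_power, d = round(detectable_d, 2))

# Sensitivity curve: detectable effect size against desired power
plot(desired_power, detectable_d, type = "b",
     xlab = "Desired power", ylab = "Detectable effect size (d)")
```

The values for 50%, 80%, and 90% power should be close to the effect sizes reported in this section.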

### Reporting a feasibility justification

To summarize, I recommend addressing the following components in a feasibility sample size justification. Addressing these points explicitly will allow you to evaluate for yourself if collecting the data will have scientific value. If not, there might be other reasons to collect the data. For example, at our department, students often collect data as part of their education. However, if the primary goal of data collection is educational, the sample size that is collected can be very small. It is often educational to collect data from a small number of participants to experience what data collection looks like in practice, but there is often no educational value in collecting data from more than 10 participants. Despite the small sample size, we often require students to report statistical analyses as part of their education, which is fine as long as it is clear the numbers that are calculated cannot meaningfully be interpreted. The table below should help to evaluate if the interpretation of statistical tests has any value, or not.

*Overview of recommendations when
reporting a sample size justification based on feasibility.*

| What to address? | How to address it? |
|---|---|
| Will a future meta-analysis be performed? | Consider the plausibility that sufficient highly similar studies will be performed in the future to, eventually, make a meta-analysis possible. |
| Will a decision be made, regardless of the amount of data that is available? | If it is known that a decision will be made, with or without data, then any data you collect will reduce error rates. |
| What is the critical effect size? | Report and interpret the critical effect size, with a focus on whether a hypothesis test would even be significant for expected effect sizes. If not, indicate you will not interpret the data based on *p*-values. |
| What is the width of the confidence interval? | Report and interpret the width of the confidence interval. What will an estimate with this much uncertainty be useful for? If the null hypothesis is true, would rejecting effects outside of the confidence interval be worthwhile (ignoring you might have low power to actually test against these values)? |
| Which effect sizes would you have decent power to detect? | Report a sensitivity power analysis, and report the effect sizes you could detect across a range of desired power levels (e.g., 80%, 90%, and 95%), or plot a sensitivity curve of effect sizes against desired power. |

If the study is not performed for educational purposes, but the goal is to answer a research question, the feasibility justification might indicate that there is no value in collecting the data. If it were never possible to conclude that one should not proceed with the data collection, there would be no use in justifying the sample size. There should be cases where it is unlikely there will ever be enough data to perform a meta-analysis (for example because of a lack of general interest in the topic), the information will not be used to make any decisions, and the statistical tests do not allow you to test a hypothesis or estimate an effect size with any useful accuracy. It should be a feasibility justification - not a feasibility excuse. If there is no good justification to collect the maximum number of observations that is feasible, performing the study nevertheless is a waste of participants’ time, and/or a waste of money if data collection has associated costs. Collecting data without a good justification why the planned sample size will yield worthwhile information has an ethical component. As Button and colleagues (2013) write:

**Low power therefore has an ethical
dimension — unreliable research is inefficient and wasteful. This applies to
both human and animal research.**

Think carefully if you can defend data collection based on a feasibility justification. Sometimes data collection is just not feasible, and we should accept this.

Thank you very much, Daniel! That's exactly what I need right now for interpreting my results and reporting! I would like to see this in a Journal! Can you suggest some literature that points out some of the recommendations you mentioned?


Hi Daniel. Thanks very much for the blog. How would you suggest doing a sensitivity for a multi-level regression? I am under the impression that G*Power does not include this.
