Wednesday, August 5, 2020

Feasibility Sample Size Justification

This blog post is now included in the paper "Sample size justification" available at PsyArXiv.
  
When you perform an experiment, you want it to provide an answer to your research question that is as informative as possible. However, since all scientists are faced with resource limitations, you need to balance the cost of collecting each additional datapoint against the increase in information that datapoint provides. In economics, this is known as the Value of Information (Eckermann et al., 2010). Calculating the value of information is notoriously difficult. You need to specify the costs and benefits of possible outcomes of the study, and quantifying a utility function for scientific research is not easy.

Because of the difficulty of quantifying the value of information, scientists use less formal approaches to justify the amount of data they set out to collect. That is, if they provide a justification for the number of observations at all. In some fields a justification for the number of observations is required when submitting a grant proposal to a science funder, a research proposal to an ethical review board, or a manuscript to a journal. In other research fields, the number of observations is stated, but not justified, which makes it difficult to evaluate how informative the study was. Reviewers cannot simply assume the number of observations is sufficient to provide an informative answer to your research question, so leaving out a justification for the number of observations is not best practice, and a reason reviewers can criticize your submitted article.

A common reason why a specific number of observations is collected is because collecting more data was not feasible. Note that all decisions for the sample size we collect in a study are based on the resources we have available in some way. A feasibility justification makes these resource limitations the primary reason for the sample size that is collected. Because we always have resource limitations in science, even when feasibility is not our primary justification for the number of observations we plan to collect, feasibility is always at least a secondary reason for any sample size justification. Despite the omnipresence of resource limitations, the topic often receives very little attention in texts on experimental design. This might make it feel like a feasibility justification is not appropriate, and you should perform an a-priori power analysis or plan for a desired precision instead. But feasibility limitations play a role in every sample size justification, and therefore regardless of which justification for the sample size you provide, you will almost always need to include a feasibility justification as well.

Time and money are the two main resource limitations a scientist faces. Our master students write their thesis in 6 months, so their data collection is necessarily limited to whatever can be collected in 6 months, minus the time needed to formulate a research question, design an experiment, analyze the data, and write up the thesis. A PhD student at our department has 4 years to complete their thesis, but is also expected to complete multiple research lines in this time. In addition to limitations on time, we have limited financial resources. Although nowadays it is possible to collect data online quickly, if you offer participants decent pay (as you should), most researchers do not have the financial means to collect thousands of datapoints.

A feasibility justification puts the limited resources at the center of the justification for the sample size that will be collected. For example, one might argue that 120 observations is the most that can be collected in the three weeks a master student has available to collect data, when each observation takes an hour to collect. A PhD student might collect data until the end of the academic year, and then needs to write up the results over the summer to stay on track to complete the thesis in time.

A feasibility justification thus starts with the number of observations (N) a researcher expects to be able to collect. The challenge is to evaluate whether collecting N observations is worthwhile. The answer should sometimes be that data collection is not worthwhile. For example, assume I plan to manipulate the mood of participants using funny cartoons and then measure the effect of mood on some dependent variable, say the amount of money people donate to charity. I should expect an effect size around d = 0.31 for the mood manipulation (Joseph et al., 2020), and it seems unlikely that the effect on donations will be larger than the effect size of the manipulation itself. If I can only collect mood data from 30 participants in total, how do I decide whether this study will be informative?

How informative is the data that is feasible to collect?

If we want to evaluate whether the feasibility limitations make data collection uninformative, we need to think about what the goal of data collection is. First of all, having data always provides more knowledge than not having data, so in an absolute sense, any additional data that is collected is better than collecting no data. However, in line with the idea that we need to take into account costs and benefits, it is possible that the cost of data collection outweighs the benefits. To determine this, one needs to think about what the benefits of having the data are. The benefits are clearest when we know for certain that someone is going to make a decision, with or without data. If this is the case, then any data you collect will reduce the error rates of a well-calibrated decision process, even if only ever so slightly. In these cases, the value of information might be positive, as long as the reduction in error rates is more beneficial than the costs of data collection. If your sample size is limited and you know you will make a decision anyway, perform a compromise power analysis, in which you balance the error rates, given a specified effect size and sample size, as sketched below.
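As a minimal sketch of what such a compromise power analysis could look like in R (assuming a two-sided independent t-test; the function name, the assumed d = 0.5, and the illustrative 1:1 ratio of Type 2 to Type 1 errors are my choices here, not fixed rules):

# Sketch of a compromise power analysis for an independent t-test:
# given a fixed sample size per group and an assumed effect size d,
# find the critical t-value at which beta / alpha equals a desired ratio.
compromise_power <- function(n_per_group, d, ratio = 1) {
  df  <- 2 * n_per_group - 2
  ncp <- d * sqrt(n_per_group / 2)
  balance <- function(crit) {
    alpha <- 2 * (1 - pt(crit, df))                  # two-sided Type 1 error rate
    beta  <- pt(crit, df, ncp) - pt(-crit, df, ncp)  # Type 2 error rate
    beta / alpha - ratio
  }
  crit  <- uniroot(balance, c(0.1, 10))$root
  alpha <- 2 * (1 - pt(crit, df))
  c(alpha = alpha, beta = ratio * alpha)
}

# Example: 15 participants per group, an assumed d = 0.5, equal error rates
compromise_power(n_per_group = 15, d = 0.5, ratio = 1)

With a fixed sample size, the only thing left to tune is how the total error is divided between Type 1 and Type 2 errors, which is exactly what the balancing above does.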

Another way in which a small dataset can be valuable is if its existence makes it possible to combine several small datasets into a meta-analysis. This argument in favor of collecting a small dataset requires 1) that you share the results in a way that a future meta-analyst can find them regardless of the outcome of the study, and 2) that there is a decent probability that someone will perform a meta-analysis in the future whose inclusion criteria would cover your study, because a sufficient number of similar small studies exist. The uncertainty about whether such a meta-analysis will ever be performed should be weighed against the costs of data collection. Will anyone else collect more data on cognitive performance during bungee jumps, to complement the 12 data points you can collect?

One way to increase the probability of a future meta-analysis is to commit to performing this meta-analysis yourself in the future. For example, you might plan to repeat a study for the next 12 years in a class you teach, with the expectation that a meta-analysis of 360 participants would be sufficient to achieve around 90% power for d = 0.31. If it is not plausible that you will collect all the required data yourself, you can attempt to set up a collaboration, where fellow researchers in your field commit to collecting similar data, with identical measures, over the next years. If it is not likely that sufficient data will emerge over time, we will not be able to draw informative conclusions from the data, and it might be more beneficial to not collect the data to begin with, and examine an alternative research question with a larger effect size instead.
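A quick way to check an expectation like this is base R's power.t.test. The sketch below assumes the 360 participants end up split into two equal groups and that a one-sided alpha of 0.05 is used; with other designs the required total will differ:

power.t.test(delta = 0.31, sd = 1, sig.level = 0.05, power = 0.90,
             alternative = "one.sided")
# Returns just under 180 participants per group, or about 360 in total.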

Even if you believe that sufficient data will emerge over time, you will most likely compute statistics after collecting your own small sample. Before embarking on a study where your main justification for the sample size is based on feasibility, you should therefore know what you can expect to learn from those statistics. I propose that a feasibility justification for the sample size, in addition to a reflection on the plausibility that a future meta-analysis will be performed, and/or the need to make a decision even with limited data, is always accompanied by three statistics, detailed in the following three sections.

The smallest effect size that can be statistically significant

In the figure below, the distribution of Cohen’s d given 15 participants per group is plotted when the true effect size is 0 (i.e., the null hypothesis is true), and when the true effect size is d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α when there is a true effect, and α = 0.05). One minus the Type 2 error rate is the statistical power of the test, given an assumption about the true effect size in the population. Statistical power is the probability that a test will yield a statistically significant result if the alternative hypothesis is true. Power depends on the Type 1 error rate (α), the true effect size in the population, and the number of observations.



Null and alternative distribution, assuming d = 0.5, alpha = 0.05, and N = 15 per group.
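The blue area in the figure can also be computed directly. A minimal check with base R, assuming a two-sided independent t-test:

power.t.test(n = 15, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
# Power is roughly 0.26, so the Type 2 error rate (the blue area) is roughly 0.74.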

You might have seen such graphs before. The only thing I have done is transform the t-value distribution that is commonly used in these graphs into a distribution of Cohen’s d. This is a straightforward transformation, but instead of presenting the critical t-value, the figure provides the critical d-value. For a two-sided independent t-test, this is calculated as:

qt(1-(a / 2), (n1 + n2) - 2) * sqrt(1/n1 + 1/n2)

where ‘a’ is the alpha level (e.g., 0.05) and n1 and n2 are the sample sizes of the two independent groups. For the example above, where alpha is 0.05 and n = 15 per group:

qt(1-(0.05 / 2), (15 * 2) - 2) * sqrt(1/15 + 1/15)

## [1] 0.7479725

The critical t-value (2.0484071) is also provided by commonly used power analysis software such as G*Power. We can compute the critical Cohen’s d from the critical t-value and the sample sizes using t * sqrt(1/n1 + 1/n2).




The critical t-value is provided by G*Power software.

When you test an association between variables with a correlation, G*Power directly provides you with the critical effect size. If you compute a correlation based on a two-sided test with an alpha level of 0.05 and 30 observations, only effects larger than r = 0.361 will be statistically significant. In other words, the effect needs to be quite large to even have the mathematical possibility of becoming statistically significant.


The critical r is provided by G*Power software.
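The same critical r can be reproduced in R from the critical t-value (a sketch, assuming a two-sided test with alpha = 0.05 and 30 observations):

n <- 30
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 2)      # critical t-value
r_crit <- t_crit / sqrt(t_crit^2 + (n - 2))  # convert to a critical correlation
r_crit
# approximately 0.361, matching the G*Power output above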

The critical effect size gives you information about the smallest effect size that, if observed, would be statistically significant. If you observe a smaller effect size, the p-value will be larger than your significance threshold. You always have some probability of observing effects larger than the critical effect size: even if the null hypothesis is true, 5% of your tests will yield a significant effect. But what you should ask yourself is whether the effect sizes that could be statistically significant are realistically what you would expect to find. If this is not the case, it should be clear that there is little (if any) use in performing a significance test. Mathematically, when the critical effect size is larger than the effect size you expect, your statistical power will be less than 50%. If you perform a statistical test with less than 50% power, your single study is not very informative. Reporting the critical effect size in a feasibility justification should make you reflect on whether a hypothesis test will yield an informative answer to your research question.

Compute the width of the confidence interval around the effect size

The second statistic to report alongside a feasibility justification is the width of the 95% confidence interval around the effect size. 95% confidence intervals will capture the true population parameter 95% of the time in repeated identical experiments. The more uncertain we are about the true effect size, the wider the confidence interval will be. Cumming (2013) calls the difference between the observed effect size and the upper bound of its 95% confidence interval (or the lower bound) the margin of error (MOE).

# Compute the effect size d and 95% CI
res <- MOTE::d.ind.t(m1 = 0, m2 = 0, sd1 = 1, sd2 = 1, n1 = 15, n2 = 15, a = .05)

# Print the result
res$estimate

## [1] "$d_s$ = 0.00, 95\\% CI [-0.72, 0.72]"

If we compute the 95% CI for an effect size of 0, we see that with 15 observations in each condition of an independent t-test the 95% CI ranges from -0.72 to 0.72. The MOE is half the width of the 95% CI, 0.72. This clearly shows we have a very imprecise estimate. A Bayesian estimator who uses an uninformative prior would compute a credible interval with the same upper and lower bounds, and might conclude they personally believe there is a 95% chance the true effect size lies in this interval. A frequentist would reason more hypothetically: if the observed effect size in the data I plan to collect is 0, I could only reject effects more extreme than d = 0.72 in an equivalence test with a 5% alpha level (even though, if such a test were performed, power might be low, depending on the true effect size). Regardless of the statistical philosophy you plan to rely on when analyzing the data, the width of this interval tells us we will not learn a lot. Effect sizes in the range of d = 0.7 are findings such as “People become aggressive when they are provoked”, “People prefer their own group to other groups”, and “Romantic partners resemble one another in physical attractiveness” (Richard et al., 2003). The width of the confidence interval tells you that you can only reject the presence of effects that are so large that, if they existed, you would probably already have noticed them. It might still be important to establish these large effects in a well-controlled experiment. But since most effect sizes we should realistically expect are much smaller, we do not learn anything we did not already know from the data we plan to collect. Even without data, we would exclude effects larger than d = 0.7 in most research lines.

We see that the MOE is almost, but not exactly, the same as the critical effect size d we calculated above (d = 0.7479725). The reason is that the 95% confidence interval is calculated based on the t-distribution. If the true effect size is not zero, the confidence interval is calculated based on the non-central t-distribution, and the 95% CI is asymmetric. The figure below visualizes three t-distributions: one central distribution at 0, and two non-central distributions with noncentrality parameters of 2 and 3. The asymmetry is most clearly visible in very small samples (the distributions in the plot have 5 degrees of freedom) but remains noticeable when calculating confidence intervals and statistical power. For example, for a true effect size of d = 0.5 the 95% CI is [-0.23, 1.22]. The MOE based on the lower bound is 0.7317584 and based on the upper bound is 0.7231479. If we compute the 95% CI around the critical effect size (d = 0.7479725) we see the 95% CI ranges from exactly 0.00 to 1.48. If the 95% CI excludes zero, the test is statistically significant. In this case the lower bound of the confidence interval exactly touches 0, which means we would observe p = 0.05 if we observed exactly the critical effect size.


Central (black) and 2 non-central (red and blue) t-distributions.
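The asymmetric interval for d = 0.5 can be reproduced with the same MOTE function used above. As a sketch, the group means of 0.5 and 0 with standard deviations of 1 are simply one way to obtain an observed d of 0.5:

res_d5 <- MOTE::d.ind.t(m1 = 0.5, m2 = 0, sd1 = 1, sd2 = 1,
                        n1 = 15, n2 = 15, a = .05)
res_d5$estimate
# Should report d = 0.50 with a 95% CI of roughly [-0.23, 1.22].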

Where computing the critical effect size can make it clear that a p-value is of little interest, computing the 95% CI around the effect size can make it clear that the effect size estimate is of little value. The estimate will often be so uncertain, and the range of effect sizes you will not be able to reject if there is no effect so large, that the effect size estimate is not very useful. This is also the reason why performing a pilot study to estimate an effect size for an a-priori power analysis is not a sensible strategy (Albers & Lakens, 2018; Leon et al., 2011). Your effect size estimate will be so uncertain that it is not a good guide in an a-priori power analysis.

However, it is possible that the sample size is large enough to exclude some effect sizes that are still a-priori plausible. For example, with 50 observations in each independent group, you have 82% power for an equivalence test with bounds of -0.6 and 0.6. If the literature includes claims of effect size estimates larger than 0.6, and effects larger than 0.6 can be rejected based on your data, this might be sufficient to tentatively start to question claims in the literature, and the data you collect might fulfill that very specific goal.
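As a rough check of this 82% figure, the power of the two one-sided tests (TOST) equivalence procedure can be approximated with a normal approximation, assuming the true effect size is exactly zero:

n <- 50            # observations per group
eq_bound <- 0.6    # equivalence bounds of -0.6 and 0.6 in units of d
alpha <- 0.05
2 * pnorm(eq_bound * sqrt(n / 2) - qnorm(1 - alpha)) - 1
# approximately 0.82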

Plot a sensitivity power analysis

In a sensitivity power analysis the sample size and the alpha level are fixed, and you compute the effect size you have the desired statistical power to detect. For example, in the Figure below the sample size in each group is set to 15, the alpha level is 0.05, and the desired power is set to 90%. The sensitivity power analysis shows we have 90% power to detect an effect of d = 1.23.


Sensitivity power analysis in G*Power software.
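The same sensitivity analysis can be approximated in R with the pwr package, which solves for the effect size when n, alpha, and the desired power are given (a sketch; using pwr here is my choice, not something required for a sensitivity analysis):

library(pwr)
pwr.t.test(n = 15, sig.level = 0.05, power = 0.90, type = "two.sample")
# Returns d of approximately 1.23, in line with the G*Power output.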

Perhaps you feel a power of 90% is a bit high, and you would be happy with 80% power. We can plot a sensitivity curve across all possible levels of statistical power. In the figure below we see that if we desire 80% power, the effect size should be d = 1.06. The smaller the true effect size, the lower the power we have. This plot should again remind us not to put too much faith in a significance test when our sample size is small, since with 15 observations in each condition, statistical power is very low for anything but extremely large effect sizes.


Plot of the effect size against the desired power when n = 15 per group and alpha = 0.05.
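A sketch of how such a sensitivity curve could be computed, again with the pwr package, by solving for d across a range of desired power levels:

library(pwr)
power_levels <- seq(0.10, 0.99, by = 0.01)
d_values <- sapply(power_levels, function(p)
  pwr.t.test(n = 15, sig.level = 0.05, power = p, type = "two.sample")$d)
plot(d_values, power_levels, type = "l",
     xlab = "Effect size (Cohen's d)", ylab = "Desired power")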

If we look at the effect size that we would have 50% power for, we see it is d = 0.7411272. This is very close to our critical effect size of d = 0.7479725 (the smallest effect size that, if observed, would be significant). The difference is due to the non-central t-distribution.

Reporting a feasibility justification

To summarize, I recommend addressing the following components in a feasibility sample size justification. Addressing these points explicitly will allow you to evaluate for yourself whether collecting the data will have scientific value. If not, there might be other reasons to collect the data. For example, at our department, students often collect data as part of their education. However, if the primary goal of data collection is educational, the sample size that is collected can be very small. It is often educational to collect data from a small number of participants to experience what data collection looks like in practice, but there is often no educational value in collecting data from more than 10 participants. Despite the small sample size, we often require students to report statistical analyses as part of their education, which is fine as long as it is clear that the numbers that are calculated cannot be meaningfully interpreted. The table below should help to evaluate whether the interpretation of statistical tests has any value, or not.

Overview of recommendations when reporting a sample size justification based on feasibility.

What to address: Will a future meta-analysis be performed?
How to address it: Consider the plausibility that sufficient highly similar studies will be performed in the future to, eventually, make a meta-analysis possible.

What to address: Will a decision be made, regardless of the amount of data that is available?
How to address it: If it is known that a decision will be made, with or without data, then any data you collect will reduce error rates.

What to address: What is the critical effect size?
How to address it: Report and interpret the critical effect size, with a focus on whether a hypothesis test would even be significant for expected effect sizes. If not, indicate you will not interpret the data based on p-values.

What to address: What is the width of the confidence interval?
How to address it: Report and interpret the width of the confidence interval. What will an estimate with this much uncertainty be useful for? If the null hypothesis is true, would rejecting effects outside of the confidence interval be worthwhile (ignoring you might have low power to actually test against these values)?

What to address: Which effect sizes would you have decent power to detect?
How to address it: Report a sensitivity power analysis, and report the effect sizes you could detect across a range of desired power levels (e.g., 80%, 90%, and 95%), or plot a sensitivity curve of effect sizes against desired power.



If the study is not performed for educational purposes, but the goal is to answer a research question, the feasibility justification might indicate that there is no value in collecting the data. If it were never possible to conclude that one should not proceed with the data collection, there would be no point in justifying the sample size. There will be cases where it is unlikely there will ever be enough data to perform a meta-analysis (for example because of a lack of general interest in the topic), where the information will not be used to make any decisions, and where the statistical tests do not allow you to test a hypothesis or estimate an effect size with any useful accuracy. It should be a feasibility justification, not a feasibility excuse. If there is no good justification to collect the maximum number of observations that is feasible, performing the study nevertheless is a waste of participants' time, and/or a waste of money if data collection has associated costs. Collecting data without a good justification of why the planned sample size will yield worthwhile information has an ethical component. As Button and colleagues (2013) write:

Low power therefore has an ethical dimension — unreliable research is inefficient and wasteful. This applies to both human and animal research.

Think carefully about whether you can defend data collection based on a feasibility justification. Sometimes data collection is just not feasible, and we should accept this.
