A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, July 21, 2019

Calculating Confidence Intervals around Standard Deviations


Power analyses require accurate estimates of the standard deviation. In this blog, I explain how to calculate confidence intervals around standard deviation estimates obtained from a sample, and show how much sample sizes in an a-priori power analysis can differ based on variation in estimates of the standard deviation.

If we calculate a standard deviation from a sample, this value is an estimate of the true value in the population. In small samples, our estimate can be quite far off, while due to the law of large numbers, as our sample size increases, we will be measuring the standard deviation more accurately. Since the sample standard deviation is an estimate with uncertainty, we can calculate a 95% confidence interval around it.

Expressing the uncertainty in our estimate of the standard deviation can be useful. When researchers plan to simulate data, or perform an a-priori power analysis, they need accurate estimates of the standard deviation. For simulations, the standard deviation needs to be accurate because we want to generate data that will look like the real data we will eventually collect. For power analyses we often want to think about the smallest effect size of interest, which can be specified as the difference in means you care about. To perform a power analysis we also need to specify the expected standard deviation of the data. Sometimes researchers will use pilot data to get an estimate of the standard deviation. Since the estimate of the population standard deviation based on a pilot study has some uncertainty, the width of confidence intervals around the standard deviation might be a useful way to show how much variability one can expect.

Below is the R code to calculate the confidence interval around a standard deviation from a sample, but you can also use this free GraphPad calculator. The R code then calculates an effect size based on a smallest effect size of interest of half a scale point (0.5) for a scale that has a true standard deviation of 1. The 95% confidence interval for the standard deviation based on a sample of 100 observations ranges from 0.878 to 1.162. If we draw a sample of 100 observations and happen to observe a value on the lower or upper bound of the 95% CI, the effect size we calculate will be a Cohen’s d of 0.5/0.878 = 0.57 or 0.5/1.162 = 0.43. This is quite a difference in the effect size we might use for a power calculation. If we enter these effect size estimates in an a-priori power analysis where we aim to get 90% power using an alpha of 0.05, we will estimate that we need either 66 participants in each group, or 115 participants in each group.
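A minimal sketch of this calculation, assuming normally distributed data (so that the chi-square interval for a variance applies) and an independent-samples t-test. The variable names are my own, and base R's `power.t.test` is used here as a stand-in for whichever power software you prefer:

```r
# Assumed inputs: an observed SD of 1 in a sample of 100 observations,
# and a smallest effect size of interest of 0.5 scale points.
n <- 100
sd_obs <- 1
sesoi <- 0.5
conf_level <- 0.95
tail_p <- (1 - conf_level) / 2

# Chi-square based confidence interval for a standard deviation
sd_lower <- sd_obs * sqrt((n - 1) / qchisq(1 - tail_p, df = n - 1))
sd_upper <- sd_obs * sqrt((n - 1) / qchisq(tail_p, df = n - 1))

# Cohen's d if the observed SD happens to fall on either bound
d_large <- sesoi / sd_lower
d_small <- sesoi / sd_upper

# Per-group sample size for 90% power with a two-sided alpha of .05
n_for <- function(d) {
  ceiling(power.t.test(delta = d, sd = 1, power = 0.90, sig.level = 0.05)$n)
}
n_large_d <- n_for(d_large)
n_small_d <- n_for(d_small)

round(c(sd_lower, sd_upper), 3)
round(c(d_large, d_small), 2)
c(n_large_d, n_small_d)
```

Changing `n` shows how quickly the interval narrows as the pilot sample grows, and how much the required sample size can still swing for small pilots.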

It is clear that sample sizes from a-priori power analyses depend strongly on an accurate estimate of the standard deviation. Take into account that estimates of the standard deviation have uncertainty. Use validated or existing measures for which accurate estimates of the standard deviation in your population of interest are available, so that you can rely on a better estimate of the standard deviation in power analyses.

Some people argue that if you have such a limited understanding of the measures you are using that you do not even know the standard deviation of the measure in your population of interest, you are not ready to use that measure to test a hypothesis. If you are doing a power analysis but realize you have no idea what the standard deviation is, maybe you first need to spend more time validating the measures you are using.

When performing simulations or power analyses the same cautionary note can be made for estimates of correlations between dependent variables. When you are estimating these values from a sample, and want to perform simulations and/or power analyses, be aware that all estimates have some uncertainty. Try to get as accurate estimates as possible from pre-existing data. If possible, be a bit more conservative in sample size calculations based on estimated parameters, just to be sure.

Monday, July 15, 2019

Using Decision Theory to Justify Your Alpha


Recently, social scientists have begun to critically re-examine one of their most sacred (yet knowingly arbitrary) traditions: α = .05. This reflection was prompted by 72 researchers (Benjamin et al., 2017) who argued that researchers who use Null Hypothesis Significance Testing should redefine significance criteria to α = .005 when claiming the discovery of a new effect. Their rationale is that p-values near .05 often provide only weak evidence for the alternative hypothesis from a Bayesian perspective. Furthermore, from a Bayesian perspective, if one assumes that most alternative hypotheses are wrong (an assumption they justify based on prediction markets and replication results), p-values near .05 often provide evidence in favor of the null hypothesis. Consequently, Benjamin and colleagues suggest that redefining statistical significance to α = .005 can limit the frequency of non-replicable effects in the social science literature (i.e., Type 1 errors).

In a reply to Benjamin and colleagues, Lakens and colleagues (2017) argued that researchers should not constrain themselves to a single significance criterion. Instead, they suggested that researchers should use different significance criteria, so long as they justify their decision prior to collecting data. Intuitively, we can think of real-world scenarios where this makes sense. For example, when screening for cancer, we allow more false positives in order to ensure that real cancer cases are rarely missed (i.e., larger αs). On the other hand, many courts of law try to strictly limit how often individuals are wrongly convicted (i.e., smaller αs). Nevertheless, if we accept Lakens and colleagues’ proposal, we are left with a more difficult question: How can we justify our alphas? I suggest that the answer lies in decision theory.

A simple overview of decision theory: Making rational decisions under risk


Before using decision theory to justify alphas, it is helpful to first review decision theory in a more classic domain: financial decision making. Figure 1 is an illustration of a hypothetical investment decision where you must decide whether to invest $4 million in the development of a product. The so-called decision tree in Figure 1 has three major components:

  1. Acts: Acts are the possible behaviors related to the decision. In this example, you either invest $4 million or do not invest in the product.
  2. States: States represent the possible truths of the world as it relates to the decision-making context. To simplify this example, we will assume that there are only two relevant states: the product works or the product does not work. In this example, let's assume that we know there is a 50% chance that the product will work.
  3. Outcomes: Outcomes are the consequences of each act in each potential state. In this example, if you decide to invest and the product works, you receive a $6 million return on your investment. If you invest and the product does not work, you lose $4 million. If you abstain from investing, you neither gain nor lose money regardless of whether the product works.

To be a rational decision maker, you should choose whichever act maximizes your expected value. The expected value of each act is calculated by taking the sum of the probability-weighted value of each potential outcome. Whichever act has higher expected value is considered the rational choice, and the law of large numbers dictates that you will be better off in the long run if you act in a manner that maximizes your expected value. In this example, although the investment is risky, you should typically invest because the expected value of investing exceeds the expected value of not investing.
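The expected values for the investment example can be worked out in a few lines of R. This is only an illustrative sketch; the object names are my own:

```r
# Acts, states, and outcomes from the investment example (in $ millions)
p_works <- 0.5
state_probs <- c(works = p_works, fails = 1 - p_works)
outcomes <- list(
  invest  = c(works = 6, fails = -4),
  abstain = c(works = 0, fails = 0)
)

# Expected value of each act: the sum of probability-weighted outcomes
expected_values <- sapply(outcomes, function(x) sum(state_probs * x))
expected_values

# The rational act is the one with the highest expected value
best_act <- names(which.max(expected_values))
best_act
```

Investing has an expected value of (0.5 × 6) + (0.5 × -4) = 1 million, against 0 for abstaining, which is why investing is the rational choice here.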

Figure 1. Decision tree for a simple investment decision

Evaluating significance criteria using decision theory


Figure 2 illustrates a decision tree that formalizes the decision to use α = .05 or α = .005. In order to evaluate which significance criterion to adopt, we need to consider not only the Type 1 error rate (i.e., α) but also the Type 2 error rate (i.e., β, which equals 1 - power). This is because, all else equal, lowering the Type 1 error rate increases the Type 2 error rate.
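To make the trade-off concrete, here is a small sketch using base R's `power.t.test`. The design (64 participants per group, a true effect of d = 0.5) is a hypothetical example chosen for illustration, not one taken from the figure:

```r
# Hypothetical design: 64 participants per group, true effect of d = 0.5
power_at <- function(alpha, n = 64, d = 0.5) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = alpha)$power
}

# Type 2 error rate (1 - power) under each significance criterion
type2_at_05  <- 1 - power_at(0.05)
type2_at_005 <- 1 - power_at(0.005)
c(alpha_05 = type2_at_05, alpha_005 = type2_at_005)
```

With the sample size and effect size fixed, the stricter criterion yields the larger Type 2 error rate.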

Figure 2. Simplified decision tree for comparing statistical significance criteria

To calculate the expected value of adopting each significance criterion, researchers need to specify the costs of Type 1 and Type 2 errors. These costs are denoted in Figure 2 as CT1E (cost of Type 1 error) and CT2E (cost of Type 2 error). Like the investment example, we could operationalize cost in terms of money. However, in the following examples, I will operationalize cost on a unit-less continuous 10-point scale. (This is an inconsequential matter of preference.) In this post, I will specify the cost of a Type 1 error as -9 out of 10 (i.e., CT1E = -9) and the cost of a Type 2 error as -7 out of 10 (i.e., CT2E = -7). However, costs could, of course, vary based on the research context.


Just like the investment example, the expected value of using each significance criterion is calculated using the sum of the probability-weighted cost of each potential outcome.


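# Expected value of a significance criterion:
# the probability-weighted cost of a Type 1 error plus that of a Type 2 error
ExpVal <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}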

Example 1: Comparing significance criteria with power held constant


First, we will examine the expected value of using α = .05 vs. α = .005 when power is held constant at .80.


# Expected value of using alpha = .05
ExpVal(alpha = .05, pwr = .80)
## [1] -1.85
# Expected value of using alpha = .005
ExpVal(alpha = .005, pwr = .80)
## [1] -1.445
When power is held constant at .80, results indicate that it is rational to adopt α = .005 rather than α = .05. This is perhaps not surprising: if power is held constant, it is always more rational to adopt stricter α values (assuming, of course, that you place negative value on committing errors). This is illustrated below.



This figure demonstrates that, when power is constant, α = .005 is more rational than α = .05. However, given that stricter significance thresholds are always more rational when power is constant, α = .005 is less rational than α = .001, and α = .001 is in turn less rational than any still smaller threshold. Unfortunately, social scientists cannot adopt an infinitely small significance threshold because we do not have infinite participants and resources. Since power is directly related to sample size, researchers with a set number of participants achieve less power when they adopt stricter significance thresholds. Consequently, determining the expected value of adopting stricter significance thresholds requires researchers to determine how much power they can achieve (i.e., their sample size, effect size, and research design) and how much value they place on Type 1 and Type 2 errors.
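The constant-power case can be checked numerically with a short sketch. `ExpVal` is restated so the block runs on its own, and the three alpha values are simply the ones mentioned above:

```r
# Same cost structure as before: CT1E = -9, CT2E = -7
ExpVal <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}

# Progressively stricter thresholds, with power (unrealistically) fixed at .80
alphas <- c(0.05, 0.005, 0.001)
ev_constant_power <- ExpVal(alpha = alphas, pwr = 0.80)
setNames(ev_constant_power, alphas)
```

Because the Type 2 term never changes when power is fixed, the expected value can only improve as alpha shrinks. The interesting cases are the ones where power is allowed to drop along with alpha.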


Example 2: Comparing alphas based on sample size, effect size, and costs assigned to errors


The figure below compares the expected value of using α = .005 vs. α = .05 in a two-group between-subject design when (a) CT1E = -9, (b) CT2E = -7, and (c) the alternative hypothesis is two-sided. This figure illustrates that the expected value of a significance threshold depends on power, i.e., on both sample size and the size of the underlying effect. For example, when examining small effects, it is rational to use α = .05 instead of α = .005 until the sample exceeds 930. However, if one is examining a medium-sized effect, it becomes rational to use α = .005 once the sample exceeds just 160. We could also use this figure to compare the expected value when different sample sizes are used for each criterion. For example, if one has the option to collect 100 participants with α = .05 or collect 220 participants with α = .005, it is rational under this formalization to adopt the stricter α = .005 threshold.





This figure contains a lot of information, but the main takeaway is that the expected value of adopting stricter significance thresholds requires researchers to determine how much power they can achieve (which depends on sample size, effect size, and experimental design). Put simply, Figure 4 demonstrates that sometimes it is more rational to use α = .05 instead of α = .005, and other times it is not.
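A rough way to explore such comparisons yourself is sketched below. It uses base R's `power.t.test` rather than the code behind the figure, treats `n` as the number of participants per group (an assumption on my part), and the two designs at the end are illustrative picks, not the crossover points read off the figure:

```r
ExpVal <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}

# Expected value of an alpha for a two-group, two-sided t-test.
# n is per group; strict = TRUE counts rejections in both tails.
ev_for_design <- function(alpha, n, d) {
  pwr <- power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
                      type = "two.sample", alternative = "two.sided",
                      strict = TRUE)$power
  ExpVal(alpha = alpha, pwr = pwr)
}

# Compare the two criteria for one design
compare_alphas <- function(n, d) {
  c(alpha_05  = ev_for_design(0.05, n, d),
    alpha_005 = ev_for_design(0.005, n, d))
}

small_effect  <- compare_alphas(n = 100, d = 0.2)
medium_effect <- compare_alphas(n = 200, d = 0.5)
small_effect
medium_effect
```

In the first design the lenient criterion has the higher expected value, and in the second the stricter one does, which is the pattern the figure describes.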

Example 3: Choosing optimal alpha values


So far, we have used a decision theory framework to compare just two alpha values. However, in theory, we should consider all potential alpha values in order to determine which one maximizes our expected value. This is just what the optimize function in R allows us to do. Imagine you are examining a small effect using a two-group between-subjects design, but only have time to collect 100 participants per group (the n argument of pwr.t.test is the number of observations per group). What alpha value should you use to maximize your expected value? Using the optimize function, the code below demonstrates that the answer is approximately α = 0.18. On the other hand, if you can collect 200 participants per group, the most rational alpha to adopt is .14. You can adjust the power function to further explore how the experimental design, sample size, effect size, and alternative hypothesis influence the optimal alpha value.

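library(pwr)  # provides pwr.t.test

# Find the alpha that maximizes expected value for n = 100 per group, d = .20
optimize(
  f = function(alpha) {
    ExpVal(alpha = alpha,
           pwr = pwr.t.test(n = 100, d = .20,
                            sig.level = alpha, power = NULL,
                            type = "two.sample",
                            alternative = "two.sided")$power)
  },
  interval = c(0, 1), maximum = TRUE)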
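If you want to repeat this search for several designs, the call can be wrapped in a small helper. The sketch below is a variation of my own: the name `optimal_alpha` is hypothetical, and base R's `power.t.test` (with `strict = TRUE` so that both tails count toward power) is swapped in for `pwr.t.test` so the block has no package dependency:

```r
ExpVal <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}

# Alpha that maximizes expected value for a two-group, two-sided t-test
# with n participants per group and a true standardized effect of d
optimal_alpha <- function(n, d, CT1E = -9, CT2E = -7) {
  optimize(
    f = function(alpha) {
      pwr <- power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
                          type = "two.sample", alternative = "two.sided",
                          strict = TRUE)$power
      ExpVal(alpha = alpha, pwr = pwr, CT1E = CT1E, CT2E = CT2E)
    },
    interval = c(0, 1), maximum = TRUE
  )$maximum
}

alpha_n100 <- optimal_alpha(n = 100, d = 0.2)
alpha_n200 <- optimal_alpha(n = 200, d = 0.2)
round(c(n100 = alpha_n100, n200 = alpha_n200), 2)
```

The results should land close to the values quoted above, and passing different `CT1E` and `CT2E` values shows how strongly the optimum depends on the costs you assign to each error.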

Limitations of the framework presented here

The framework I present here provides a simple illustration of how decision theory can be used to justify alphas. However, in the interest of keeping the framework simple, I introduced at least four limitations.

First, the framework does not currently highlight how researchers could formally specify the costs of Type 1 and 2 errors. Choosing a number on a Likert-type scale is a simple approach. However, decision theorists often specify more complex loss functions, wherein they identify the various factors that influence the cost of a state.

Second, this framework currently assumes that researchers are uninterested in the prior probability that the alternative and null hypotheses are true. In order to incorporate these priors, Bayesian Decision Theory is an excellent alternative.

Third, this framework helps specify what is rational for the individual, but not necessarily what is rational for the scientific community. For example, an individual researcher may not care so much about committing a Type 1 error (i.e., they might assign a low negative value). However, Type 1 errors may be more costly for the scientific community, as significant resources may be spent chasing and correcting the Type 1 error. When considering what is rational for the scientific community, researchers will have to consider more complex decision theory frameworks, such as game theory.

Fourth, this framework does not currently specify what is rational in scenarios where researchers plan to conduct multiple studies. For example, researchers may assign lower cost to committing Type 2 errors if they plan to conduct pooled or meta-analyses after several studies. Nevertheless, decision theory frameworks can be easily expanded to evaluate multi-step decision making problems.

Conclusion


In a decision theory framework, justifying your alpha is an act where you strive to maximize expected value. This differs from other proposed approaches to justifying alphas. For example, in a previous blog post, Lakens discussed how researchers could justify alphas in a way that (1) minimizes the total combined error rate (i.e., Type 1 + Type 2 error), or (2) balances error rates. Although outside of the scope of this blog post, there are scenarios where neither of Lakens’ proposed approaches is rational (i.e., neither maximizes expected value). When we use decision theory, on the other hand, we can ensure that our decisions always maximize our expected value.

I agree with Benjamin et al. (2017) that p-values near .05 can provide weak evidence for an alternative hypothesis. I also agree that changing α to .005 could potentially reduce the number of Type 1 errors in the literature. However, I do not believe that strictly adopting α = .005 (or even α = .05) is rational. Rather, I agree with Lakens and colleagues’ (2017) call to “justify your alpha”, and I argue that decision theory provides an ideal framework for formalizing these justifications. In the simple decision theory framework I presented here, the expected value of using a significance criterion depends on (1) the probability of committing a Type 1 error, (2) the perceived cost of a Type 1 error, (3) the probability of committing a Type 2 error (i.e., 1 - power, which requires knowledge of sample size, effect size, and research design), and (4) the perceived cost of Type 2 errors.

Consequently, depending on obtainable power and the costs assigned to errors, the most rational significance criterion will vary in different experimental contexts.

Some researchers may feel uncomfortable with such a flexible approach to defining statistical significance and argue that the field needs a clear significance criterion to maintain order. Although I feel that a flexible significance criterion is the most valid way to engage in null hypothesis significance testing, I concede there may be practical benefits to establishing a single significance criterion, or even a few different significance criteria. (Ultimately, though, this is a question that can be answered by, you guessed it, decision theory!) Through critical discussion, perhaps scientists will agree that they are willing to sacrifice nuanced rationality in the name of simpler guidelines for significance testing. If this is the case, we should still use decision theory to formally justify what this criterion should be. In order to do so, researchers will need to specify (a) the average effect size of interest, (b) the average achievable sample size, (c) the typical experimental design, and (d) the average costs of Type 1 and 2 errors.

Whether researchers decide to use flexible significance criteria, multiple significance criteria, or a single significance criterion, we should not arbitrarily define statistical significance. Instead, we should rationalize statistical significance using a decision theory framework.

References


Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2017). Redefine statistical significance. Nature Human Behaviour, 1.

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2017, September 18). Justify Your Alpha: A Response to “Redefine Statistical Significance”. Retrieved from psyarxiv.com/9s3y6