# The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

## Saturday, August 10, 2019

### Requiring high-powered studies from scientists with resource constraints

Underpowered studies make it very difficult to learn something useful from the studies you perform. Low power means you have a high probability of finding non-significant results, even when there is a true effect. Hypothesis tests with high rates of false negatives (concluding there is nothing, when there is something) become a malfunctioning tool. Low power is even more problematic combined with publication bias (shiny app). After repeated warnings over at least half a century, high quality journals are starting to ask authors who rely on hypothesis tests to provide a sample size justification based on statistical power.

The first time researchers use power analysis software, they typically think they are making a mistake, because the sample sizes required to achieve high power for hypothesized effects are much larger than the sample sizes they collected in the past. After double checking their calculations, and realizing the numbers are correct, a common response is that there is no way they are able to collect this number of observations.

Published articles on power analysis rarely tell researchers what they should do if they are hired on a 4-year PhD project where the norm is to perform between 4 and 10 studies that can cost at most 1000 euro each, learn about power analysis, and realize there is absolutely no way they will have the time and resources to perform high-powered studies, given that an effect size estimate from an unbiased registered report suggests the effect they are examining is half as large as they were led to believe based on a published meta-analysis from 2010. Facing a job market that under the best circumstances is a nontransparent marathon for uncertainty-fetishists, the prospect of high quality journals rejecting your work due to a lack of a solid sample size justification is not pleasant.

The reason that published articles do not guide you towards practical solutions for a lack of resources, is that there are no solutions for a lack of resources. Regrettably, the mathematics do not care about how small the participant payment budget is that you have available. This is not to say that you can not improve your current practices by reading up on best practices to increase the efficiency of data collection. Let me give you an overview of some things that you should immediately implement if you use hypothesis tests, and data collection is costly.

1) Use directional tests where relevant. Just following statements such as ‘we predict X is larger than Y’ up with a logically consistent test of that claim (e.g., a one-sided t-test) will easily give you an increase of 10% power in any well-designed study. If you feel you need to give effects in both directions a non-zero probability, then at least use lopsided tests.

2) Use sequential analysis whenever possible. It’s like optional stopping, but then without the questionable inflation of the false positive rate. The efficiency gains are so great that, if you complain about the recent push towards larger sample sizes without already having incorporated sequential analyses, I will have a hard time taking you seriously.

3) Increase your alpha level. Oh yes, I am serious. Contrary to what you might believe, the recommendation to use an alpha level of 0.05 was not the sixth of the ten commandments – it is nothing more than, as Fisher calls it, a ‘convenient convention’. As we wrote in our Justify Your Alpha paper as an argument to not require an alpha level of 0.005: “without (1) increased funding, (2) a reward system that values large-scale collaboration and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined.” If you *have* to make a decision, and the data you can feasibly collect is limited, take a moment to think about how problematic Type 1 and Type 2 error rates are, and maybe minimize combined error rates instead of rigidly using a 5% alpha level.

4) Use within designs where possible. Especially when measurements are strongly correlated, this can lead to a substantial increase in power.

5) If you read this blog or follow me on Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and money: Designing statistically powerful studies while minimizing financial costs (link). They discuss I) better ways to screen participants for studies where participants need to be screened before participation, II) assigning participants unequally to conditions (if the control condition is much cheaper than the experimental condition, for example), III) using multiple measurements to increase measurement reliability (or use well-validated measures, if I may add), and IV) smart use of (preregistered, I’d recommend) covariates.

6) If you are really brave, you might want to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably, almost all approaches to statistical inferences become very limited when the number of observations is small. If you are very confident in your predictions (and your peers agree), incorporating prior information will give you a benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and colleagues.
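
To get a feel for the size of the gains from points 1 and 4, here is a minimal sketch using base R's `power.t.test`. The sample size (50 per group), the effect size (d = 0.5), and the correlation between repeated measurements (r = 0.7) are illustrative assumptions, not values from any particular study.

```r
# Illustrative assumptions: 50 participants per group, d = 0.5, alpha = .05
n_per_group <- 50
d <- 0.5

# Point 1: two-sided versus one-sided independent-samples t-test
power_two_sided <- power.t.test(n = n_per_group, delta = d, sd = 1,
                                sig.level = 0.05,
                                alternative = "two.sided")$power
power_one_sided <- power.t.test(n = n_per_group, delta = d, sd = 1,
                                sig.level = 0.05,
                                alternative = "one.sided")$power

# Point 4: within-subjects design with 50 participants measured twice.
# With equal SDs of 1 and a correlation r between the two measurements,
# the SD of the difference scores is sqrt(2 * (1 - r)).
r <- 0.7  # assumed correlation between the two measurements
sd_diff <- sqrt(2 * (1 - r))
power_within <- power.t.test(n = n_per_group, delta = d, sd = sd_diff,
                             sig.level = 0.05, type = "paired",
                             alternative = "two.sided")$power

round(c(two_sided = power_two_sided,
        one_sided = power_one_sided,
        within = power_within), 2)
```

Under these assumptions the one-sided test gains roughly ten percentage points of power over the two-sided test, and the within-subjects design gains considerably more. The exact gain depends on the design, so rerun the sketch with your own numbers.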

Now if you care about efficiency, you might already have incorporated all these things. There is no way to further improve the statistical power of your tests, and by all plausible estimates of effect sizes you can expect or the smallest effect size you would be interested in, statistical power is low. Now what should you do?

What to do if best practices in study design won’t save you?

The first thing to realize is that you should not look at statistics to save you. There are no secret tricks or magical solutions. Highly informative experiments require a large number of observations. So what should we do then? The solutions below are, regrettably, a lot more work than making a small change to the design of your study. But it is about time we start to take them seriously. This is a list of solutions I see – but there is no doubt more we can/should do, so by all means, let me know your suggestions on twitter or in the comments.

1) Ask for more money in your grant proposals.
Some grant organizations distribute funds to be awarded as a function of how much money is requested. If you need more money to collect informative data, ask for it. Obviously grants are incredibly difficult to get, but if you ask for money, include a budget that acknowledges that data collection is not as cheap as you hoped some years ago. In my experience, psychologists are often asking for much less money to collect data than other scientists. Increasing the requested funds for participant payment by a factor of 10 is often reasonable, given the requirements of journals to provide a solid sample size justification, and the more realistic effect size estimates that are emerging from preregistered studies.

2) Improve management.
If the implicit or explicit goals that you should meet are still the same now as they were 5 years ago, and you did not receive a miraculous increase in money and time to do research, then an update of the evaluation criteria is long overdue. I sincerely hope your manager is capable of this, but some ‘upward management’ might be needed. In the coda of Lakens & Evers (2014) we wrote “All else being equal, a researcher running properly powered studies will clearly contribute more to cumulative science than a researcher running underpowered studies, and if researchers take their science seriously, it should be the former who is rewarded in tenure systems and reward procedures, not the latter.” and “We believe reliable research should be facilitated above all else, and doing so clearly requires an immediate and irrevocable change from current evaluation practices in academia that mainly focus on quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure track, I thought it would be at least principled if I sent this coda to the head of my own department. He replied that the things we wrote made perfect sense, instituted a recommendation to aim for 90% power in studies our department intends to publish, and has since then tried to make sure quality, and not quantity, is used in evaluations within the faculty (as you might have guessed, I am not on the job market, nor do I ever hope to be).

3) Change what is expected from PhD students.
When I did my PhD, there was the assumption that you performed enough research in the 4 years you are employed as a full-time researcher to write a thesis with 3 to 5 empirical chapters (with some chapters having multiple studies). These studies were ideally published, but at least publishable. If we consider it important for PhD students to produce multiple publishable scientific articles during their PhDs, this will greatly limit the types of research they can do. Instead of evaluating PhD students based on their publications, we can see the PhD as a time where researchers learn skills to become an independent researcher, and evaluate them not based on publishable units, but in terms of clearly identifiable skills. I personally doubt data collection is particularly educational after the 20th participant, and I would probably prefer to hire a post-doc who had well-developed skills in programming, statistics, and who broadly read the literature, than someone who used that time to collect participant 21 to 200. If we make it easier for PhD students to demonstrate their skill level (which would include at least 1 well written article, I personally think) we can evaluate what they have learned in a more sensible manner than now. Currently, differences in the resources PhD students have at their disposal are a huge confound as we try to judge their skill based on their resume. Researchers at rich universities obviously have more resources – it should not be difficult to develop tools that allow us to judge the skills of people where resources are much less of a confound.

Our society has some serious issues that psychologists can help address. These questions are incredibly complex. I have long lost faith in the idea that a bottom-up organized scientific discipline that rewards individual scientists will manage to generate reliable and useful knowledge that can help to solve these societal issues. For some of these questions we need well-coordinated research lines where hundreds of scholars work together, pool their resources and skills, and collectively pursue answers to these important questions. And if we are going to limit ourselves in our research to the questions we can answer in our own small labs, these big societal challenges are not going to be solved. Call me a pessimist. There is a reason we resort to forming unions and organizations that have the goal to collectively coordinate what we do. If you greatly dislike team science, don’t worry – there will always be options to make scientific contributions by yourself. But now, there are almost no ways for scientists who want to pursue huge challenges in large well-organized collectives of hundreds or thousands of scholars (for a recent exception that proves my rule by remaining unfunded: see the Psychological Science Accelerator). If you honestly believe your research question is important enough to be answered, then get together with everyone who also thinks so, and pursue answers collectively. Doing so should, eventually (I know science funders are slow) also be more convincing as you ask for more resources to do the research (as in point 1).

## Sunday, July 21, 2019

### Calculating Confidence Intervals around Standard Deviations

Power analyses require accurate estimates of the standard deviation. In this blog, I explain how to calculate confidence intervals around standard deviation estimates obtained from a sample, and show how much sample sizes in an a-priori power analysis can differ based on variation in estimates of the standard deviation.

If we calculate a standard deviation from a sample, this value is an estimate of the true value in the population. In small samples, our estimate can be quite far off, while due to the law of large numbers, as our sample size increases, we will be measuring the standard deviation more accurately. Since the sample standard deviation is an estimate with uncertainty, we can calculate a 95% confidence interval around it.

Expressing the uncertainty in our estimate of the standard deviation can be useful. When researchers plan to simulate data, or perform an a-priori power analysis, they need accurate estimates of the standard deviation. For simulations, the standard deviation needs to be accurate because we want to generate data that will look like the real data we will eventually collect. For power analyses we often want to think about the smallest effect size of interest, which can be specified as the difference in means you care about. To perform a power analysis we also need to specify the expected standard deviation of the data. Sometimes researchers will use pilot data to get an estimate of the standard deviation. Since the estimate of the population standard deviation based on a pilot study has some uncertainty, the width of confidence intervals around the standard deviation might be a useful way to show how much variability one can expect.

Below is the R code to calculate the confidence interval around a standard deviation from a sample, but you can also use this free GraphPad calculator. The R code then calculates an effect size based on a smallest effect size of interest of half a scale point (0.5) for a scale that has a true standard deviation of 1. The 95% confidence interval for the standard deviation based on a sample of 100 observations ranges from 0.878 to 1.162. If we draw a sample of 100 observations and happen to observe a value on the lower or upper bound of the 95% CI, the effect size we calculate will be a Cohen’s d of 0.5/0.878 = 0.57 or 0.5/1.162 = 0.43. This is quite a difference in the effect size we might use for a power calculation. If we enter these effect size estimates in an a-priori power analysis where we aim to get 90% power using an alpha of 0.05, we will estimate that we need either 66 participants in each group, or 115 participants in each group.
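
A minimal sketch of this calculation, assuming normally distributed scores, so that (n - 1)s²/σ² follows a chi-square distribution with n - 1 degrees of freedom. The helper name `sd_ci` is hypothetical, and base R's `power.t.test` is used for the sample size step.

```r
# Confidence interval around a sample standard deviation s based on n
# observations, assuming the data are normally distributed.
sd_ci <- function(s, n, conf_level = 0.95) {
  a <- 1 - conf_level
  df <- n - 1
  c(lower = sqrt(df * s^2 / qchisq(1 - a / 2, df)),
    upper = sqrt(df * s^2 / qchisq(a / 2, df)))
}

ci <- sd_ci(s = 1, n = 100)
round(ci, 3)

# Cohen's d for a smallest effect size of interest of 0.5 scale points,
# if the observed SD happened to fall on either bound of the interval
sesoi <- 0.5
d <- sesoi / ci
round(d, 2)

# Participants per group for 90% power at alpha = .05 (two-sided,
# independent-samples t-test), rounded up to whole participants
n_per_group <- sapply(d, function(es) {
  ceiling(power.t.test(delta = es, sd = 1, sig.level = 0.05,
                       power = 0.90)$n)
})
n_per_group
```

Rounded to three decimals, the interval is 0.878 to 1.162. The per-group sample sizes can be compared against the 66 and 115 reported above.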

It is clear that sample sizes from a-priori power analyses depend strongly on an accurate estimate of the standard deviation. Take into account that estimates of the standard deviation have uncertainty. Use validated or existing measures for which accurate estimates of the standard deviation in your population of interest are available, so that you can rely on a better estimate of the standard deviation in power analyses.

Some people argue that if you have such a limited understanding of the measures you are using that you do not even know the standard deviation of the measure in your population of interest, you are not ready to use that measure to test a hypothesis. If you are doing a power analysis but realize you have no idea what the standard deviation is, maybe you first need to spend more time validating the measures you are using.

When performing simulations or power analyses the same cautionary note can be made for estimates of correlations between dependent variables. When you are estimating these values from a sample, and want to perform simulations and/or power analyses, be aware that all estimates have some uncertainty. Try to get as accurate estimates as possible from pre-existing data. If possible, be a bit more conservative in sample size calculations based on estimated parameters, just to be sure.

## Monday, July 15, 2019

### Using Decision Theory to Justify Your Alpha

Recently, social scientists have begun to critically re-examine their most sacred (yet knowingly arbitrary) traditions: α = .05. This reflection was prompted by 72 researchers (Benjamin et al., 2017) who argued that researchers who use Null Hypothesis Significance Testing should redefine significance criteria to α = .005 when claiming the discovery of a new effect. Their rationale is that p-values near .05 often provide only weak evidence for the alternative hypothesis from a Bayesian perspective. Furthermore, from a Bayesian perspective, if one assumes that most alternative hypotheses are wrong (an assumption they justify based on prediction markets and replication results), p-values near .05 often provide evidence in favor of the null hypothesis. Consequently, Benjamin and colleagues suggest that redefining statistical significance to α = .005 can limit the frequency of non-replicable effects in the social science literature (i.e., Type 1 errors).

In a reply to Benjamin and colleagues, Lakens and colleagues (2017) argued that researchers should not constrain themselves to a single significance criterion. Instead, they suggested that researchers should use different significance criteria, so long as they justify their decision prior to collecting data. Intuitively, we can think of real-world scenarios where this makes sense. For example, when screening for cancer, we allow more false positives in order to ensure that real cancer cases are rarely missed (i.e., larger αs). On the other hand, many courts of law try to strictly limit how often individuals are wrongly convicted (i.e., smaller αs). Nevertheless, if we accept Lakens and colleagues’ proposal, we are left with a more difficult question: How can we justify our alphas? I suggest that the answer lies in decision theory.

## A simple overview of decision theory: Making rational decisions under risk

Before using decision theory to justify alphas, it is helpful to first review decision theory in a more classic domain: financial decision making. Figure 1 is an illustration of a hypothetical investment decision where you must decide whether to invest $4 million in the development of a product. The so-called decision tree in Figure 1 has three major components:

1. Acts: Acts are the possible behaviors related to the decision. In this example, you either invest $4 million or do not invest in the product.
2. States: States represent the possible truths of the world as it relates to the decision-making context. To simplify this example, we will assume that there are only two relevant states: the product works or the product does not work. In this example, let’s assume that we know there is a 50% chance that the product will work.
3. Outcomes: Outcomes are the consequences of each potential state. In this example, if you decide to invest and the product works, you receive a $6 million return on your investment. If you invest and the product does not work, you lose $4 million. If you abstain from investing, you neither gain nor lose money regardless of whether the product works.

To be a rational decision maker, you should choose whichever act maximizes your expected value. The expected value of each act is calculated by taking the sum of the probability-weighted value of each potential outcome. Whichever act has higher expected value is considered the rational choice, and the law of large numbers dictates that you will be better off in the long run if you act in a manner that maximizes your expected value. In this example, although the investment is risky, you should typically invest because the expected value of investing exceeds the expected value of not investing.
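
The arithmetic behind this conclusion can be written out in a few lines of R. This is only a sketch of the investment example above; the variable names are illustrative.

```r
# Probability that the product works (from the example)
p_works <- 0.5

# Outcomes in millions of dollars for each act, under each state
invest  <- c(works = 6, fails = -4)
abstain <- c(works = 0, fails = 0)

# Expected value: sum of probability-weighted outcomes
probs <- c(works = p_works, fails = 1 - p_works)
ev_invest  <- sum(probs * invest)   # 0.5 * 6 + 0.5 * -4 = 1
ev_abstain <- sum(probs * abstain)  # 0

c(invest = ev_invest, abstain = ev_abstain)
```

Investing has an expected value of $1 million against $0 for abstaining, which is why investing is the rational act in this example even though you lose money half of the time.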

## Evaluating significance criteria using decision theory

Figure 2 illustrates a decision tree that formalizes the decision to use α = .05 or α = .005. In order to evaluate which significance criterion to adopt, we need to consider not only the Type 1 error rate (i.e., α) but also the Type 2 error rate (i.e., β, which equals 1 - power). This is because, all else equal, lowering the Type 1 error rate increases the Type 2 error rate.

Figure 2. Simplified decision tree for comparing statistical significance criteria

To calculate the expected value of adopting each significance criterion, researchers need to specify the costs of Type 1 and Type 2 errors. These costs are denoted in Figure 2 as CT1E (cost of Type 1 error) and CT2E (cost of Type 2 error). Like the investment example, we could operationalize cost in terms of money. However, in the following examples, I will operationalize cost on a unit-less continuous 10-point scale. (This is an inconsequential matter of preference.) In this post, I will specify the cost of a Type 1 error as -9 out of 10 (i.e., CT1E = -9) and the cost of a Type 2 error as -7 out of 10 (i.e., CT2E = -7). However, costs could, of course, vary based on the research context.

Just like the investment example, the expected value of using each significance criterion is calculated using the sum of the probability-weighted cost of each potential outcome.

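```r
# Expected value of a significance criterion: the probability-weighted
# cost of a Type 1 error plus that of a Type 2 error.
# CT1E and CT2E are the (negative) costs of Type 1 and Type 2 errors.
ExpVal <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}
```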

### Example 1: Comparing significance criteria with power held constant

First, we will examine the expected value of using α = .05 vs. α = .005 when power is held constant at .80.

```r
# Expected value of using alpha = .05
ExpVal(alpha = .05, pwr = .80)
## [1] -1.85
# Expected value of using alpha = .005
ExpVal(alpha = .005, pwr = .80)
## [1] -1.445
```
When power is held constant at .80, results indicate that it is rational to adopt α = .005 rather than α = .05. This is perhaps not surprising: if power is held constant, it is always more rational to adopt stricter α values (assuming, of course, that you place negative value on committing errors). This is illustrated below.

This figure demonstrates that, when power is constant, α = .005 is more rational than α = .05. However, given that stricter significance thresholds are always more rational when power is constant, α = .005 is less rational than α = .001, and α = .001 is in turn less rational than any smaller threshold. Unfortunately, social scientists cannot adopt an infinitely small significance threshold because we do not have infinite participants and resources. Since power is directly related to sample size, researchers with a set number of participants achieve less power when they adopt stricter significance thresholds. Consequently, determining the expected value of adopting stricter significance thresholds requires researchers to determine how much power they can achieve (i.e., their sample size, effect size, and research design) and how much value they place on Type 1 and Type 2 errors.
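
To see how power falls as the threshold gets stricter while the sample size stays fixed, here is a small sketch. The design (100 participants per group, d = 0.5) is an assumption chosen for illustration, base R's `power.t.test` stands in for a power calculation, and `exp_val` restates the `ExpVal` formula so the block runs on its own.

```r
# Same expected-value formula as ExpVal(), restated to be self-contained
exp_val <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}

# Power of a two-sided, two-group t-test at a fixed sample size
# (assumed: 100 per group, d = 0.5) for increasingly strict alphas
alphas <- c(0.05, 0.005, 0.001)
power_fixed_n <- sapply(alphas, function(a) {
  power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = a)$power
})

data.frame(alpha = alphas,
           power = round(power_fixed_n, 3),
           exp_val = round(exp_val(alphas, power_fixed_n), 3))
```

Under these assumptions the stricter thresholds no longer have the higher expected value, because the loss in power outweighs the reduction in Type 1 errors.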

### Example 2: Comparing alphas based on sample size, effect size, and costs assigned to errors

The figure below compares the expected value of using α = .005 vs. α = .05 in a two-group between-subject design when (a) CT1E = -9, (b) CT2E = -7, and (c) the alternative hypothesis is two-sided. This figure illustrates that the expected value of a significance threshold depends on power, i.e., on both sample size and the size of the underlying effect. For example, when examining small effects, it is rational to use α = .05 instead of α = .005 until the sample exceeds 930. However, if one is examining a medium-sized effect, it becomes rational to use α = .005 once the sample exceeds just 160. We could also use this figure to compare the expected value when different sample sizes are used for each criterion. For example, if one has the option to collect 100 participants with α = .05 or collect 220 participants with α = .005, it is rational under this formalization to adopt the stricter α = .005 threshold.

This figure contains a lot of information, but the main takeaway is that the expected value of adopting stricter significance thresholds requires researchers to determine how much power they can achieve (which depends on sample size, effect size, and experimental design). Put simply, Figure 4 demonstrates that sometimes it is more rational to use α = .05 instead of α = .005, and other times it is not.
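
A comparison of this kind can be sketched as follows. This is not the code behind the figure. It assumes that n is the number of participants per group (as in `pwr.t.test`), uses base R's `power.t.test` for power, and introduces the hypothetical helpers `ev_design` and `ev_gain`.

```r
# Same expected-value formula as ExpVal(), restated to be self-contained
exp_val <- function(alpha, pwr, CT1E = -9, CT2E = -7) {
  (alpha * CT1E) + ((1 - pwr) * CT2E)
}

# Expected value of a two-group, two-sided t-test with n per group
ev_design <- function(n, d, alpha) {
  pwr <- power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
                      type = "two.sample",
                      alternative = "two.sided")$power
  exp_val(alpha, pwr)
}

# Expected value of alpha = .005 minus that of alpha = .05;
# a negative value means alpha = .05 is the more rational choice
ev_gain <- function(n, d) {
  ev_design(n, d, 0.005) - ev_design(n, d, 0.05)
}

ev_gain(n = 100, d = 0.2)
ev_gain(n = 2000, d = 0.2)

# Sample size per group at which the two criteria have equal expected
# value, for a small (d = 0.2) and a medium (d = 0.5) effect
crossover_small  <- uniroot(ev_gain, interval = c(50, 5000), d = 0.2)$root
crossover_medium <- uniroot(ev_gain, interval = c(20, 2000), d = 0.5)$root
round(c(small = crossover_small, medium = crossover_medium))
```

The crossover points should land in the neighborhood of the values read off the figure, although the exact numbers depend on the assumptions above.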

### Example 3: Choosing optimal alpha values

So far, we have used a decision theory framework to compare just two alpha values. However, in theory, we should consider all potential alpha values in order to determine which one maximizes our expected value. This is just what the optimize function in R allows us to do. Imagine you are examining a small effect using a two-group between-subjects design, but only have time to collect 100 participants. What alpha value should you use to maximize your expected value? Using the optimize function, the code below demonstrates that the answer is approximately α = 0.18. On the other hand, if you can collect 200 participants, the most rational alpha to adopt is α = .14. You can adjust the power function to further explore how the experimental design, sample size, effect size, and alternative hypothesis influence the optimal alpha value.

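```r
library(pwr)  # provides pwr.t.test()

# Find the alpha that maximizes expected value for a small effect
# (d = .20) with n = 100 (pwr.t.test treats n as the number per group).
# Uses ExpVal() as defined above.
optimize(f = function(alpha) {
  ExpVal(alpha = alpha,
         pwr = pwr.t.test(n = 100, d = .20,
                          sig.level = alpha, power = NULL,
                          type = "two.sample",
                          alternative = "two.sided")$power)
}, interval = c(0, 1), maximum = TRUE)
```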

### Limitations of the framework presented here

The framework I present here provides a simple illustration of how decision theory can be used to justify alphas. However, in the interest of keeping the framework simple, I introduced at least four limitations.
First, the framework does not currently highlight how researchers could formally specify the costs of Type 1 and 2 errors. Choosing a number on a Likert-type scale is a simple approach. However, decision theorists often specify more complex loss functions, wherein they identify the various factors that influence the cost of a state. Second, this framework currently assumes that researchers are uninterested in the prior probability that the alternative and null hypotheses are true. In order to incorporate these priors, Bayesian Decision Theory is an excellent alternative. Third, this framework helps specify what is rational for the individual, but not necessarily what is rational for the scientific community. For example, an individual researcher may not care so much about committing a Type 1 error (i.e., they might assign a low negative value). However, Type 1 errors may be more costly for the scientific community, as significant resources may be spent chasing and correcting the Type 1 error. When considering what is rational for the scientific community, researchers will have to consider more complex decision theory frameworks, such as game theory. Fourth, this framework does not currently specify what is rational in scenarios where researchers plan to conduct multiple studies. For example, researchers may assign lower cost to committing Type 2 errors if they plan to conduct pooled or meta-analyses after several studies. Nevertheless, decision theory frameworks can be easily expanded to evaluate multi-step decision making problems.

## Conclusion

In a decision theory framework, justifying your alpha is an act where you strive to maximize expected value. This differs from other proposed approaches to justifying alphas. For example, in a previous blog post, Lakens discussed that researchers could justify alphas in a way that (1) minimizes the total combined error rate (i.e., Type 1 + Type 2 error), or (2) balances error rates. Although outside of the scope of this blog post, there are scenarios where both of Lakens’ proposed approaches are not rational (i.e., do not maximize expected value). When we use decision theory, on the other hand, we can ensure that our decisions always maximize our expected value.
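
For comparison, the two alternative approaches mentioned above can be sketched in a few lines. The design (64 participants per group, d = 0.5) and the helper name `beta_at` are assumptions for illustration, and base R's `power.t.test` is used for power.

```r
# Type 2 error rate of a two-sided, two-group t-test at a given alpha
# (assumed design: 64 per group, d = 0.5)
beta_at <- function(alpha, n = 64, d = 0.5) {
  1 - power.t.test(n = n, delta = d, sd = 1, sig.level = alpha)$power
}

# (1) Alpha that minimizes the combined error rate (alpha + beta)
min_combined <- optimize(function(a) a + beta_at(a),
                         interval = c(1e-6, 0.5))

# (2) Alpha that balances the error rates (alpha equal to beta)
balanced <- uniroot(function(a) a - beta_at(a),
                    interval = c(1e-6, 0.5))

round(c(minimize_combined = min_combined$minimum,
        balance = balanced$root), 3)
```

Neither approach uses the costs of the two errors, which is what sets them apart from maximizing expected value.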
I agree with Benjamin et al. (2017) that p-values near .05 can provide weak evidence for an alternative hypothesis. I also agree that changing to .005 could potentially reduce the number of Type 1 errors in the literature. However, I do not believe that strictly adopting α = .005 (or even α = .05) is rational. Rather, I agree with Lakens and colleagues’ (2017) call to “justify your alpha”, and I argue that decision theory provides an ideal framework for formalizing these justifications. In the simple decision theory framework I presented here, the expected value of using a significance criterion depends on (1) the probability of committing a Type 1 error, (2) the perceived cost of a Type 1 error, (3) the probability of committing a Type 2 error (i.e., 1 - power, which requires knowledge of sample size, effect size, and research design), and (4) the perceived cost of Type 2 errors.

Consequently, depending on obtainable power and the costs assigned to errors, the most rational significance criterion will vary in different experimental contexts.

Some researchers may feel uncomfortable with such a flexible approach to defining statistical significance and argue that the field needs a clear significance criterion to maintain order. Although I feel that a flexible significance criterion is the most valid way to engage in null hypothesis significance testing, I concede there may be practical benefits to establishing a single significance criterion, or even a few different significance criteria. (Ultimately, though, this is a question that can be answered by, you guessed it, decision theory!) Through critical discussion, perhaps scientists will agree that they are willing to sacrifice nuanced rationality in the name of simpler guidelines for significance testing. If this is the case, we should still use decision theory to formally justify what this criterion should be. In order to do so, researchers will need to specify (a) the average effect size of interest, (b) the average achievable sample size, (c) the typical experimental design, and (d) the average costs of Type 1 and 2 errors.

Whether researchers decide to use flexible significance criteria, multiple significance criteria, or a single significance criterion, we should not arbitrarily define statistical significance. Instead, we should rationalize statistical significance using a decision theory framework.

## References

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2017). Redefine statistical significance. Nature Human Behaviour, 1.

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2017, September 18). Justify Your Alpha: A Response to “Redefine Statistical Significance”. Retrieved from psyarxiv.com/9s3y6