# The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

## Wednesday, July 1, 2020

### The Red Team Challenge (Part 3): Is it Feasible in Practice?

By Daniël Lakens & Leonid Tiokhin

Also read Part 1 and Part 2 in this series on our Red Team Challenge.

Six weeks ago, we launched the Red Team Challenge: a feasibility study to see whether it could be worthwhile to pay people to find errors in scientific research. In our project, we wanted to see to what extent a “Red Team” - people hired to criticize a scientific study with the goal to improve it - would improve the quality of the resulting scientific work.

Currently, the way that error detection works in science is a bit peculiar. Papers go through the peer-review process and get the peer-reviewed “stamp of approval”. Then, upon publication, some of these same papers receive immediate and widespread criticism. Sometimes this even leads to formal corrections or retractions. And this happens even at some of the most prestigious scientific journals.

So, it seems that our current mechanisms of scientific quality control leave something to be desired. Nicholas Coles, Ruben Arslan, and the authors of this post (Leonid Tiokhin and Daniël Lakens) were interested in whether Red Teams might be one way to improve quality control in science.

Ideally, a Red Team joins a research project from the start and criticizes each step of the process. However, doing this would have taken the duration of an entire study. At the time, it also seemed a bit premature -- we didn’t know whether anyone would be interested in a Red Team approach, how it would work in practice, and so on. So, instead, Nicholas Coles, Brooke Frohlich, Jeff Larsen, and Lowell Gaertner volunteered one of their manuscripts (a completed study that they were ready to submit for publication). We put out a call on Twitter, Facebook, and the 20% Statistician blog, and 22 people expressed interest. On May 15th, we randomly selected five volunteers based on five areas of expertise: Åse Innes-Ker (affective science), Nicholas James (design/methods), Ingrid Aulike (statistics), Melissa Kline (computational reproducibility), and Tiago Lubiana (wildcard category). The Red Team was then given three weeks to report errors.

Our Red Team project was somewhat similar to traditional peer review, except that we 1) compensated Red Team members’ time with a $200 stipend, 2) explicitly asked the Red Teamers to identify errors in any part of the project (i.e., not just writing), 3) gave the Red Team full access to the materials, data, and code, and 4) provided financial incentives for identifying critical errors (a donation to the GiveWell charity non-profit for each unique “critical error” discovered).

The Red Team submitted 107 error reports. Ruben Arslan--who helped inspire this project with his Bug Bounty Program--served as the neutral arbiter. Ruben examined the reports, evaluated the authors’ responses, and ultimately decided whether an issue was “critical” (see this post for Ruben’s reflection on the Red Team Challenge). Of the 107 reports, Ruben concluded that there were 18 unique critical issues (for details, see this project page). Ruben decided that any major issues that potentially invalidated inferences were worth $100, minor issues related to computational reproducibility were worth $20, and minor issues that could be resolved without much work were worth $10. After three weeks, the total final donation was $660. The Red Team detected 5 major errors. These included two previously unknown limitations of a key manipulation, inadequacies in the design and description of the power analysis, an incorrectly reported statistical test in the supplemental materials, and a lack of information about the sample in the manuscript. Minor issues concerned reproducibility of code and clarifications about the procedure.

After receiving this feedback, Nicholas Coles and his co-authors decided to hold off submitting their manuscript (see this post for Nicholas’ personal reflection). They are currently conducting a new study to address some of the issues raised by the Red Team.

We consider this to be a feasibility study of whether a Red Team approach is practical and worthwhile. So, based on this study, we shouldn’t draw any conclusions about a Red Team approach in science except one: it can be done.

That said, our study does provide some food for thought. Many people were eager to join the Red Team. The study’s corresponding author, Nicholas Coles, was graciously willing to acknowledge issues when they were pointed out. And it was obvious that, had these issues been pointed out earlier, the study would have been substantially improved before being carried out. These findings make us optimistic that Red Teams can be useful and feasible to implement.

In an earlier column, the issue was raised that rewarding Red Team members with co-authorship on the subsequent paper would create a conflict of interest -- overly severe criticism of the paper might make the paper unpublishable. So, instead, we paid each Red Teamer $200 for their service. We wanted to reward people for their time. We did not want to reward them only for finding issues because, before we knew that 18 unique issues would be found, we were naively worried that the Red Team might find few things wrong with the paper. In interviews with Red Team members, it became clear that the charitable donations for each issue were not a strong motivator. Instead, people were just happy to detect issues for decent pay. They didn't think that they deserved authorship for their work, and several Red Team members didn't consider authorship on an academic paper to be valuable, given their career goals.

After talking with the Red Team members, we started to think that certain people might enjoy Red Teaming as a job – it is challenging, requires skills, and improves science. This opens up the possibility of a freelance services marketplace (such as Fiverr) for error detection, where Red Team members are hired at an hourly rate and potentially rewarded for finding errors. It should be feasible to hire people to check for errors at each phase of a project, depending on their expertise and reputation as good error-detectors. If researchers do not have money for such a service, they might be able to set up a volunteer network where people “Red Team” each other’s projects. It could also be possible for universities to create Red Teams (e.g., Cornell University has a computational reproducibility service that researchers can hire).

As scientists, we should ask ourselves when, and for which type of studies, we want to invest time and/or money to make sure that published work is as free from errors as possible. As we continue to consider ways to increase the reliability of science, a Red Team approach might be something to further explore.

## Monday, May 11, 2020

### Red Team Challenge

by Nicholas A. Coles, Leonid Tiokhin, Ruben Arslan, Patrick Forscher, Anne Scheel, & Daniël Lakens

All else equal, scientists should trust studies and theories that have been more critically evaluated. The more that a scientific product has been exposed to processes designed to detect flaws, the more that researchers can trust the product (Lakens, 2019; Mayo, 2018). Yet, there are barriers to adopting critical approaches in science. Researchers are susceptible to biases, such as confirmation bias, the “better than average” effect, and groupthink. Researchers may gain a competitive advantage for jobs, funding, and promotions by sacrificing rigor in order to produce larger quantities of research (Heesen, 2018; Higginson & Munafò, 2016) or to win priority races (Tiokhin & Derex, 2019). And even if researchers were transparent enough to allow others to critically examine their materials, code, and ideas, there is little incentive for others--including peer reviewers--to do so. These combined factors may hinder the ability of science to detect errors and self-correct (Vazire, 2019).

Today we announce an initiative that we hope can incentivize critical feedback and error detection in science: the Red Team Challenge. Daniël Lakens and Leonid Tiokhin are offering a total of $3,000 for five individuals to provide critical feedback on the materials, code, and ideas in the forthcoming preprint titled “Are facial feedback effects solely driven by demand characteristics? An experimental investigation”. This preprint examines the role of demand characteristics in research on the controversial facial feedback hypothesis: the idea that an individual’s facial expressions can influence their emotions. This is a project that Coles and colleagues will submit for publication in parallel with the Red Team Challenge. We hope that this challenge will serve as a useful case study of the role Red Teams might play in science.

We are looking for five individuals to join “The Red Team”. Unlike traditional peer review, this Red Team will receive financial incentives to identify problems. Each Red Team member will receive a $200 stipend to find problems, including (but not limited to) errors in the experimental design, materials, code, analyses, logic, and writing. In addition to these stipends, we will donate $100 to a GiveWell top-ranked charity (maximum total donations: $2,000) for every new “critical problem” detected by a Red Team member. Defining a “critical problem” is subjective, but a neutral arbiter--Ruben Arslan--will make these decisions transparently. At the end of the challenge, we will release: (1) the names of the Red Team members (if they wish to be identified), (2) a summary of the Red Team’s feedback, (3) how much each Red Team member raised for charity, and (4) the authors’ responses to the Red Team’s feedback.

For us, this is a fun project for several reasons. Some of us are just interested in the feasibility of Red Team challenges in science (Lakens, 2020). Others want feedback about how to make such challenges more scientifically useful and to develop best practices. And some of us (mostly Nick) are curious to see what good and bad might come from throwing their project into the crosshairs of financially-incentivized research skeptics. Regardless of our diverse motivations, we’re united by a common interest: improving science by recognizing and rewarding criticism (Vazire, 2019).

References
Heesen, R. (2018). Why the reward structure of science makes reproducibility problems inevitable. The Journal of Philosophy, 115(12), 661-674.
Higginson, A. D., & Munafò, M. R. (2016). Current incentives for scientists lead to underpowered studies with erroneous conclusions. PLoS Biology, 14(11), e2000995.
Lakens, D. (2019). The value of preregistration for psychological science: A conceptual analysis. Japanese Psychological Review.
Lakens, D. (2020). Pandemic researchers — recruit your own best critics. Nature, 581, 121.
Mayo, D. G. (2018). Statistical inference as severe testing. Cambridge: Cambridge University Press.
Tiokhin, L., & Derex, M. (2019). Competition for novelty reduces information sampling in a research game - a registered report. Royal Society Open Science, 6(5), 180934.
Vazire, S. (2019). A toast to the error detectors. Nature, 577(9).

## Sunday, March 29, 2020

### Effect Sizes and Power for Interactions in ANOVA Designs

Based on our recent preprint explaining power analysis for ANOVA designs, in this post I want to provide a step-by-step mathematical overview of power analysis for interactions. These details often do not make it into tutorial papers because of word limitations, and few good free resources are available (for a paid resource worth your money, see Maxwell, Delaney, & Kelley, 2018). This post is a bit technical, but nothing in it requires more knowledge than multiplying and dividing numbers, and I believe that for anyone willing to really understand effect sizes and power in ANOVA designs, digging into these details will be quite beneficial. There are some take-home messages in this post:
1. In power analyses for ANOVA designs, you should always think of the predicted pattern of means. Different patterns of means can have the same effect size, and your intuition cannot be relied on when predicting an effect size for ANOVA designs.
2. Understanding how patterns of means relate to the effect you predict is essential to design an informative study.
3. Always perform a power analysis if you want to test a predicted interaction effect, and always calculate the effect size based on means, sd’s, and correlations, instead of plugging in a ‘medium’ partial eta squared.
4. Crossover interaction effects often have larger effects than ordinal interaction effects and can thus often be studied with high power in smaller samples. If your theory can predict crossover interactions, such experiments might be worthwhile to design.
5. There are some additional benefits of examining interactions (risky predictions, generalizability, efficiently examining multiple main effects) and it would be a shame if the field is turned away from examining interactions because they sometimes require large samples.

# Getting started: Comparing two groups

We are planning an experiment with two independent groups. We are using a validated measure, and we know the standard deviation of our measure is approximately 2. Psychologists are generally horribly bad at knowing the standard deviation of their measures, even though a very defensible position is that you are not ready to perform a power analysis without solid knowledge of the standard deviation of your measure. We are interested in observing a mean difference of 1 or more, because smaller effects would not be practically meaningful. We expect the mean in the control condition to be 0, and therefore want the mean in the intervention group to be 1 or higher.
This means the standardized effect size is the mean difference, divided by the standard deviation, or 1/2 = 0.5. This is the Cohen’s d we want to be able to detect in our study:

$$d = \frac{m_1 - m_2}{\sigma} = \frac{1 - 0}{2} = 0.5.$$

An independent t-test is mathematically identical to an F-test with two groups. For an F-test, the effect size used for power analyses is Cohen’s f, which is a generalization of Cohen’s d to more than two groups (Cohen, 1988). It is calculated based on the standard deviation of the population means divided by the population standard deviation (which we know for our measure is 2), or:

$$f = \frac{\sigma_m}{\sigma}$$

where for equal sample sizes,

$$\sigma_m = \sqrt{\frac{\sum_{i=1}^{k} (m_i - m)^2}{k}}.$$

In this formula $m$ is the grand mean, $k$ is the number of means, and $m_i$ is the mean in each group. The formula above might look a bit daunting, but calculating Cohen’s f is not that difficult for two groups.
If we take the expected means of 0 and 1, and a standard deviation of 2, the grand mean (the $m$ in the formula above) is (0 + 1)/2 = 0.5. The formula says we should subtract this grand mean from the mean of each group, square these values, and sum them. So we have (0-0.5)^2 and (1-0.5)^2, which are both 0.25. If we sum these values (0.25 + 0.25 = 0.5), divide the sum by the number of groups (0.5/2 = 0.25), and take the square root, we find that ${\sigma }_{m}$ = 0.5. We can now calculate Cohen’s f (remember that we know $\sigma$ = 2 for our measure):

$$f = \frac{\sigma_m}{\sigma} = \frac{0.5}{2} = 0.25.$$

We see that for two groups Cohen’s f is half as large as Cohen’s d, $f = \frac{1}{2}d$, which always holds for an F-test with two independent groups.
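For readers who would rather check the arithmetic than trust it, the hand calculation can be sketched in a few lines of base R. This is only an illustration of the formulas in the text; the variable names (`means`, `sigma`, `sigma_m`) are my own choices and not part of any package.

```r
# Two-group example from the text: means of 0 and 1, population SD of 2.
means <- c(0, 1)
sigma <- 2

# Cohen's d: the mean difference divided by the standard deviation.
d <- (means[2] - means[1]) / sigma

# Cohen's f: the SD of the group means (dividing by k, not k - 1)
# divided by the population standard deviation.
grand_mean <- mean(means)
sigma_m <- sqrt(sum((means - grand_mean)^2) / length(means))
f <- sigma_m / sigma

print(c(d = d, sigma_m = sigma_m, f = f))
```

With these inputs `d` is 0.5, `sigma_m` is 0.5, and `f` is 0.25, matching the values derived by hand.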
Although calculating effect sizes by hand is obviously an incredibly enjoyable thing to do, you might prefer using software that performs these calculations for you. Here, I will use our Superpower power analysis package (developed by Aaron Caldwell and me). The code below uses a function from the package that computes power analytically for a one-way ANOVA where all conditions are manipulated between participants. In addition to the effect size, the function will compute power for any sample size per condition you enter. Let’s assume you have a friend who told you that they heard from someone else that you now need to use 50 observations in each condition (n = 50), so you plan to follow this trustworthy advice. We see the code below returns a Cohen’s f of 0.25, and also tells us we would have 61.78% power if we use a preregistered alpha level of 0.03 (yes, the alpha level can be set to something else than 0.05 - really).
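```r
library(Superpower)

# Two-group between-participants design: n = 50 per condition,
# expected means of 1 and 0, and a common standard deviation of 2.
design <- ANOVA_design(
  design = "2b",
  n = 50,
  mu = c(1, 0),
  sd = 2)

power_oneway_between(design, alpha_level = 0.03)$Cohen_f
## [1] 0.25
power_oneway_between(design, alpha_level = 0.03)$power
## [1] 61.78474
```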
We therefore might want to increase our sample size for our planned study. Using the plot_power function, we can see we would pass 90% power with 100 observations per condition.
```r
plot_power(design, alpha_level = 0.03, min_n = 45, max_n = 150)$plot_ANOVA
```
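If you do not have Superpower installed, a rough cross-check of these numbers is possible with `power.t.test()` from base R's stats package, since the two-group F-test is equivalent to an independent t-test. This is a sketch under the same assumptions as above (mean difference of 1, standard deviation of 2, alpha of 0.03), not a replacement for the power curve; the object names are illustrative.

```r
# Power with 50 observations per group (two-sided independent t-test).
power_n50 <- power.t.test(n = 50, delta = 1, sd = 2, sig.level = 0.03,
                          type = "two.sample",
                          alternative = "two.sided")$power

# Solve for the sample size per group that gives 90% power.
n_for_90 <- power.t.test(delta = 1, sd = 2, sig.level = 0.03, power = 0.90,
                         type = "two.sample",
                         alternative = "two.sided")$n

print(round(100 * power_n50, 2))  # power in percent for n = 50
print(ceiling(n_for_90))          # smallest whole n per group for 90% power
```

The first value should agree closely with the power reported by Superpower, and the second should fall a little below 100 observations per condition, in line with the power curve.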

# Interaction Effects

So far we have explained the basics for effect size calculations (and we have looked at statistical power) for 2 group ANOVA designs. Now we have the basis to look at interaction effects.
One of the main points in this blog post is that it is better to talk about interactions in ANOVAs in terms of the pattern of means, standard deviations, and correlations, than in terms of a standardized effect size. The reason for this is that, while for two groups a difference between means directly relates to a Cohen’s d, wildly different patterns of means in an ANOVA will have the same Cohen’s f. In my experience helping colleagues out with their power analyses for ANOVA designs, talking about effects in terms of a Cohen’s f is rarely a good place to start when thinking about what your hypothesis predicts. Instead, you need to specify the predicted pattern of means, have some knowledge about the standard deviation of your measure, and then calculate your predicted effect size.
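To make the point about different patterns of means concrete, here is a small base R sketch. The two three-group patterns are hypothetical examples I constructed so that they give the same effect size as the two-group example; they do not come from any study. The helper `cohen_f()` is an illustrative function, not part of Superpower.

```r
# Cohen's f for a one-way design with equal group sizes:
# the SD of the group means (dividing by k) over the population SD.
cohen_f <- function(means, sigma) {
  sqrt(mean((means - mean(means))^2)) / sigma
}

sigma <- 2

# Three patterns of means (the last two are hypothetical constructions).
patterns <- list(
  two_groups      = c(0, 1),                            # the example in the text
  one_group_apart = c(0, 0, sqrt(1.125)),               # only one group differs
  linear_trend    = c(0, sqrt(0.375), 2 * sqrt(0.375))  # evenly spaced means
)

f_values <- sapply(patterns, cohen_f, sigma = sigma)
print(round(f_values, 4))
```

All three patterns yield f = 0.25, even though they describe quite different predictions, which is why a Cohen’s f on its own says little about what a hypothesis actually predicts.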
There are two types of interactions, as visualized below. In an ordinal interaction, the mean of one group (“B1”) is always higher than the mean for the other group (“B2”). Disordinal interactions are also known as ‘cross-over’ interactions, and occur when the group with the larger mean switches over. The difference is important, since another main takeaway of this blog post is that, in two studies where the largest simple comparison has the same effect size, a study with a disordinal interaction has much higher power than a study with an ordinal interaction (note that an ordinal interaction can have a bigger effect than a disordinal one - in general it is not just about the pattern of means, but also how much means differ!). Thus, if possible, you will want to design experiments where an effect in one condition flips around in the other condition, instead of an experiment where the effect in the other condition just disappears. I personally never realized this before I learned how to compute power for interactions, and never took this simple but important fact into account. Let’s see why it is important.

# Calculating effect sizes for interactions

Mathematically the interaction effect is computed as the cell mean minus the sum of the grand mean, the marginal mean in each condition of one factor minus the grand mean, and the marginal mean in each condition for the other factor minus grand mean (see Maxwell et al., 2018).
Let’s consider two cases comparable to the figure above: one where we have a perfect disordinal interaction (the means of 0 and 1 flip around in the other condition, and are 1 and 0) and one where we have an ordinal interaction (the effect is present in one condition, with means of 0 and 1, but there is no effect in the other condition, and both means are 0). We can calculate the interaction effect as follows. First, let’s look at the disordinal interaction in a 2x2 matrix:

|          | A1  | A2  | marginal |
|----------|-----|-----|----------|
| B1       | 1   | 0   | 0.5      |
| B2       | 0   | 1   | 0.5      |
| marginal | 0.5 | 0.5 | 0.5      |

The grand mean is (1 + 0 + 0 + 1) / 4 = 0.5.
We can compute the marginal means for A1, A2, B1, and B2, which is simply averaging per row and column, which gets us for the A1 column (1+0)/2=0.5. For this perfect disordinal interaction, all marginal means are 0.5. This means there are no main effects. There is no main effect of factor A (because the marginal means for A1 and A2 are both exactly 0.5), nor is there a main effect of B.
We can also calculate the interaction effect. For each cell we take the value in the cell (e.g., for a1b1 this is 1) and compute the difference between the cell mean and the additive effect of the two factors as: 1 - (the grand mean of 0.5 + (the marginal mean of a1 minus the grand mean, or 0.5 - 0.5 = 0) + (the marginal mean of b1 minus the grand mean, or 0.5 - 0.5 = 0)).

Thus, for each cell we get:
a1b1: 1 - (0.5 + (0.5 - 0.5) + (0.5 - 0.5)) = 0.5
a1b2: 0 - (0.5 + (0.5 - 0.5) + (0.5 - 0.5)) = -0.5
a2b1: 0 - (0.5 + (0.5 - 0.5) + (0.5 - 0.5)) = -0.5
a2b2: 1 - (0.5 + (0.5 - 0.5) + (0.5 - 0.5)) = 0.5

Cohen’s f is then

$$f = \frac{\sqrt{\frac{(0.5)^2 + (-0.5)^2 + (-0.5)^2 + (0.5)^2}{4}}}{2} = 0.25$$

or in R code: sqrt(((0.5)^2 +(-0.5)^2 + (-0.5)^2 + (0.5)^2)/4)/2 = 0.25.
For the ordinal interaction the grand mean is (1 + 0 + 0 + 0) / 4, or 0.25. The marginal means are a1: 0.5, a2: 0, b1: 0.5, and b2: 0.

|          | A1  | A2 | marginal |
|----------|-----|----|----------|
| B1       | 1   | 0  | 0.5      |
| B2       | 0   | 0  | 0        |
| marginal | 0.5 | 0  | 0.25     |

Completing the calculation for all four cells for the ordinal interaction gives:

a1b1: 1 - (0.25 + (0.5 - 0.25) + (0.5 - 0.25)) = 0.25
a1b2: 0 - (0.25 + (0.5 - 0.25) + (0 - 0.25)) = -0.25
a2b1: 0 - (0.25 + (0 - 0.25) + (0.5 - 0.25)) = -0.25
a2b2: 0 - (0.25 + (0 - 0.25) + (0 - 0.25)) = 0.25

Cohen’s f is then

$$f = \frac{\sqrt{\frac{(0.25)^2 + (-0.25)^2 + (-0.25)^2 + (0.25)^2}{4}}}{2} = 0.125$$

or in R code: sqrt(((0.25)^2 +(-0.25)^2 + (-0.25)^2 + (0.25)^2)/4)/2 = 0.125.

We see the effect size of the cross-over interaction (f = 0.25) is twice as large as the effect size of the ordinal interaction (f = 0.125).
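The cell-by-cell calculation for both patterns can also be written as a short base R function. This is a minimal sketch of the steps described in this section, assuming a 2x2 between-participants design with equal cell sizes; `interaction_f()` is an illustrative helper name, not a Superpower function.

```r
# Interaction effect size for a matrix of cell means (rows = B, columns = A).
interaction_f <- function(cell_means, sigma) {
  grand   <- mean(cell_means)
  row_eff <- rowMeans(cell_means) - grand  # marginal mean of B minus grand mean
  col_eff <- colMeans(cell_means) - grand  # marginal mean of A minus grand mean
  # Additive prediction: grand mean plus both main effects for every cell.
  additive <- grand + outer(row_eff, col_eff, "+")
  # Interaction residuals: what is left after removing the additive part.
  resid <- cell_means - additive
  sqrt(mean(resid^2)) / sigma
}

labels <- list(c("B1", "B2"), c("A1", "A2"))
disordinal <- matrix(c(1, 0,
                       0, 1), nrow = 2, byrow = TRUE, dimnames = labels)
ordinal    <- matrix(c(1, 0,
                       0, 0), nrow = 2, byrow = TRUE, dimnames = labels)

f_disordinal <- interaction_f(disordinal, sigma = 2)
f_ordinal    <- interaction_f(ordinal, sigma = 2)
print(c(disordinal = f_disordinal, ordinal = f_ordinal))
```

Running this gives 0.25 for the disordinal pattern and 0.125 for the ordinal pattern, the same values as the calculations by hand.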
If the math so far was a bit too much to follow, there is an easier way to think of why the effect sizes are halved. In the disordinal interaction we are comparing cells a1b1 and a2b2 against a1b2 and a2b1, or (1+1)/2 vs. (0+0)/2. Thus, if we see this as a t-test for a contrast, it is clear the mean difference is 1, as it was in the simple effect we started with. For the ordinal interaction, we have (1+0)/2 vs. (0+0)/2, so the mean difference is halved, namely 0.5.
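The same contrast logic can be sketched in base R. The weights below simply average the two diagonal cells and subtract the average of the other two; the names are illustrative, and the final step uses the f = d/2 relation from the two-group case because the contrast compares two equally sized sets of observations.

```r
# Contrast weights for the interaction, in the order a1b1, a1b2, a2b1, a2b2.
weights <- c(a1b1 = 0.5, a1b2 = -0.5, a2b1 = -0.5, a2b2 = 0.5)

disordinal <- c(1, 0, 0, 1)
ordinal    <- c(1, 0, 0, 0)
sigma      <- 2

# Mean difference tested by the interaction contrast.
diff_disordinal <- sum(weights * disordinal)
diff_ordinal    <- sum(weights * ordinal)

# Standardize, then convert to Cohen's f as in the two-group case (f = d / 2).
f_disordinal <- (diff_disordinal / sigma) / 2
f_ordinal    <- (diff_ordinal / sigma) / 2

print(c(diff_disordinal = diff_disordinal, diff_ordinal = diff_ordinal))
print(c(f_disordinal = f_disordinal, f_ordinal = f_ordinal))
```

The mean differences are 1 and 0.5, and the resulting effect sizes are again 0.25 and 0.125.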

# Power for interactions

All of the above obviously matters for the statistical power we will have when we examine interaction effects in our experiments. Let’s use Superpower to perform power analyses for the disordinal interaction first, if we would collect 50 participants in each condition.
```r
# 2x2 between-participants design with a perfect crossover (disordinal)
# pattern of means: a1b1 = 1, a1b2 = 0, a2b1 = 0, a2b2 = 1.
design <- ANOVA_design(
  design = "2b*2b",
  n = 50,
  mu = c(1, 0, 0, 1),
  sd = 2)

ANOVA_exact(design, alpha_level = 0.03)
## Power and Effect sizes for ANOVA tests
##      power partial_eta_squared cohen_f non_centrality
## a    3.000                0.00  0.0000            0.0
## b    3.000                0.00  0.0000            0.0
## a:b 91.055                0.06  0.2525           12.5
##
## Power and Effect sizes for pairwise comparisons (t-tests)
##                       power effect_size
## p_a_a1_b_b1_a_a1_b_b2 61.78        -0.5
## p_a_a1_b_b1_a_a2_b_b1 61.78        -0.5
## p_a_a1_b_b1_a_a2_b_b2  3.00         0.0
## p_a_a1_b_b2_a_a2_b_b1  3.00         0.0
## p_a_a1_b_b2_a_a2_b_b2 61.78         0.5
## p_a_a2_b_b1_a_a2_b_b2 61.78         0.5
```
First, let’s look at the power and effect sizes for the pairwise comparisons. Not surprisingly, these are just the same as for our original t-test, given that we have 50 observations per condition: either the mean difference is 1, a Cohen’s d of 0.5 (in which case we have 61.78% power), or the mean difference is 0 and we have no power (because there is no true effect), but we will still observe significant results 3% of the time because we set our alpha level to 0.03.
Then, let’s look at the results for the ANOVA. Since there are no main effects in a perfect crossover interaction, we have a 3% Type 1 error rate. We see the power for the crossover interaction between factor a and b is 91.06%. This is much larger than the power for the simple effects. The reason is that the contrast that is equal to the test of the interaction is based on all 200 observations. Unlike the pairwise comparisons with 50 vs 50 observations, the contrast for the interaction has 100 vs 100 observations. Given that the effect size is the same (f = 0.25) we end up with much higher power.
If you currently think it is impossible to find a statistically significant interaction without a huge sample size, you can clearly see this is wrong. Power can be higher for an interaction than for the simple effect - but this depends on the pattern of means underlying the interaction. If possible, design studies where your theory predicts a perfect crossover interaction.
For the ordinal interaction, our statistical power does not look that good based on an a-priori power analysis. Superpower tells us we have 33.99% power for the main effects and interaction (yes, we have exactly the same power for the 2 main effects and the interaction effect - if you think about the three contrasts that are tested, you will see these have the same effect size).

```r
# 2x2 between-participants design with an ordinal pattern of means:
# the effect is present only in cell a1b1 (mean of 1), all other cells are 0.
design <- ANOVA_design(
  design = "2b*2b",
  n = 50,
  mu = c(1, 0, 0, 0),
  sd = 2)

ANOVA_exact(design, alpha_level = 0.03)
## Power and Effect sizes for ANOVA tests
##       power partial_eta_squared cohen_f non_centrality
## a   33.9869              0.0157  0.1263          3.125
## b   33.9869              0.0157  0.1263          3.125
## a:b 33.9869              0.0157  0.1263          3.125
##
## Power and Effect sizes for pairwise comparisons (t-tests)
##                       power effect_size
## p_a_a1_b_b1_a_a1_b_b2 61.78        -0.5
## p_a_a1_b_b1_a_a2_b_b1 61.78        -0.5
## p_a_a1_b_b1_a_a2_b_b2 61.78        -0.5
## p_a_a1_b_b2_a_a2_b_b1  3.00         0.0
## p_a_a1_b_b2_a_a2_b_b2  3.00         0.0
## p_a_a2_b_b1_a_a2_b_b2  3.00         0.0
```
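The power values for both interactions can be approximated without Superpower by using the noncentral F-distribution in base R. The sketch below assumes a between-participants design with equal cell sizes, takes the population Cohen’s f (0.25 and 0.125) as input, and computes the noncentrality parameter as f² times the total sample size, which reproduces the `non_centrality` column above (12.5 and 3.125). The function name `f_test_power()` and its arguments are illustrative, and this is not necessarily how Superpower computes power internally.

```r
# Power (in percent) of an F-test given a population Cohen's f.
f_test_power <- function(f, n_total, df1, n_cells, alpha) {
  df2  <- n_total - n_cells           # error degrees of freedom
  ncp  <- f^2 * n_total               # noncentrality parameter
  crit <- qf(1 - alpha, df1, df2)     # critical F-value under the null
  100 * pf(crit, df1, df2, ncp = ncp, lower.tail = FALSE)
}

# 2x2 design, 50 participants per cell (200 in total), alpha = 0.03.
power_disordinal <- f_test_power(f = 0.25,  n_total = 200, df1 = 1,
                                 n_cells = 4, alpha = 0.03)
power_ordinal    <- f_test_power(f = 0.125, n_total = 200, df1 = 1,
                                 n_cells = 4, alpha = 0.03)

print(round(c(disordinal = power_disordinal, ordinal = power_ordinal), 2))
```

Both results should land very close to the 91.06% and 33.99% reported by `ANOVA_exact`, which illustrates that halving the effect size while keeping the sample size fixed costs a great deal of power.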
If you have heard people say you should be careful when designing studies predicting interaction patterns because you might have very low power, this is the type of pattern of means they are warning about. Maxwell, Delaney, and Kelley (2018) discuss why power for interactions is often smaller, and note that interaction effects are often smaller in the real world, and that we often examine ordinal interactions. This might be true. But in experimental psychology it might be possible to think about hypotheses that predict disordinal interactions. In addition to the fact that such predictions are often theoretically riskier and more impressive (after all, many things can make an effect go away, but without your theory it might be difficult to explain why an effect flips around), they also have larger effects and are easier to test with high power.
Some years ago, blog posts by Uri Simonsohn and Roger Giner-Sorolla did a great job of warning researchers that they need large sample sizes for ordinal interactions, and my post repeats this warning. But it would be a shame if researchers stopped examining interaction effects. There are some nice benefits to studying interactions, such as 1) making riskier theoretical predictions, 2) greater generalizability (if there is no interaction effect, you might show a main effect operates across different conditions of a second factor), and 3) if you want to study two main effects, it is more efficient to do this in a 2x2 design than in two separate designs (see Maxwell, Delaney, & Kelley, 2018 for a discussion). So maybe this blog post has been able to highlight some scenarios where examining interaction effects is still beneficial.

Thanks to Lisa DeBruine for the idea and html-code to color code the calculation of the interaction effect.