A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, January 29, 2016

The correlation between original and replication effect sizes might be spurious

In the reproducibility project, original effect sizes correlated r=0.51 with the effect sizes of replications. Some researchers find this hopeful.

I don’t think we should be interpreting this correlation at all, because it might very well be completely spurious. One important reason why correlations might be spurious is the presence of different subgroups, as introduction to statistics textbooks explain.

When we consider the Reproducibility Project (note: I’m a co-author of the paper) we can assume there are two subsets, one subgroup consisting of experiments that examine true effects, and one subgroup consisting of experiments that examine effects that are not true. This logically implies that for one subgroup, the true effect size is 0, while for the other, the true effect size is an unknown larger value. Different means in subgroups is a classic case where spurious correlations can emerge.

I find the best way to learn to understand statistics is through simulations. So let’s simulate 100 normally distributed effect sizes from original studies that are comparable to the 100 studies included in the Reproducibility Project, and 100 effect sizes for their replications, and correlate these. We create two subgroups. Forty effect sizes will have true effects (e.g., d = 0.4). The original and replication effect sizes will be correlated (e.g., r = 0.5). Sixty of the effect sizes will have an effect size of d = 0, and a correlation between replication and original studies of r = 0. I’m not suggesting this reflects the truth of the studies in the Reproducibility Project – there’s no way to know. The parameters look sort of reasonable to me, but feel free to explore different choices for parameters by running the code yourself.

As you see, the pattern is perfectly expected, under reasonable assumptions, when 60% of the studies is simulated to have no true effect. With a small N (100 studies gives a pretty unreliable correlation, see for yourself by running the code a few times) the spuriousness of the correlation might not be clear. So let’s simulate 100 times more studies.

Now, the spuriousness becomes clear. The two groups differ in their means, and if we calculate the correlation over the entire sample, the r = 0.51 we get is not very meaningful (I cut off original studies at d = 0, to simulate publication bias and make the graph more similar to Figure 1 in the paper, but it doesn't matter for the current point).

So: be careful interpreting correlations when there are different subgroups. There’s no way to know what is going on. The correlation of 0.51 between effect sizes in original and replication studies might not mean anything.

Thursday, January 14, 2016

Power analysis for default Bayesian t-tests

One important benefit of Bayesian statistics is that you can provide relative support for the null hypothesis. When the null hypothesis is true, p-values will forever randomly wander between 0 and 1, but a Bayes factor has consistency (Rouder, Speckman, Sun, Morey, & Iverson, 2009), which means that as the sample size increases, the Bayes Factor will tell you which of two hypotheses has received more support.

Bayes Factors can express relative evidence for the null hypothesis (H0) compared to the alternative hypothesis (H1), referred to as a BF01, or relative evidence for H1 compared to H0 (a BF10). A BF01 > 3 is sometimes referred to as substantial evidence for H0 (see Wagenmakers, Wetzels, Borsboom, & Van Der Maas, 2011), but what they really mean is substantial evidence for H0 compared to H1.

Table 1 from Wagenmakers et al., 2011, #fixedthatforyou

Since a Bayes Factor provides relative support for one over another hypothesis, the choice for an alternative hypothesis is important. In Neyman-Pearson Frequentist approaches, researchers have to specify the alternative hypothesis to be able to calculate the power of their test. For example, let’s assume a researcher no idea what to expect, and decided to use the average effect size in psychology of d = 0.43 as the expected effect when the alternative hypothesis is true. It then becomes easy to calculate that you need 115 participants in each group of your experiment, if you plan to perform a two-sided independent t-test with an alpha of 0.05, and you want to have 90% power.

When calculating the Bayes Factor, the alternative is specified through the r-scale. This is a scale for the Cauchy prior, which is chosen in such a way that the researcher expects there is a 50% chance of observing an absolute effect larger than the scale value chosen. As Rouder and colleagues say: “A well-known and attractive alternative to placing priors on the mean µ is to place them on effect size, where effect size is denoted by δ and given as δ = µ/σ. Thus, the researcher above would choose an r-scale of 0.43.

The default Bayesian t-test originally used a r-scale of 1 (a very large effect), but the updated default test uses a r-scale of 0.707. This means that whenever you perform a study where you calculate the default Bayes Factor, and find a BF10 < 1/3, you have observed support for the null-hypothesis, relative to an alternative hypothesis of an effect of d = 0.707.

Let’s assume the default Bayesian t-test answers a question you are interested in, and you want to know whether your data is stronger support for a d = 0 compared to a d = 0.707. You know that to have 90% power to observe a d = 0.707 in an independent t-test, you need 44 participants in each condition, and you have heard people recommend to use at least 50 participants in each condition, so you set out to collect 50 participants in each condition. You are also completely fine with a null-result: You have submitted your study to a journal that accepts pre-registered reports, and the reviewers there agreed the data will be interesting whether they support the null hypothesis or the alternative hypothesis. What is the probability you will observe support for H0 relative to H1 (see this paper by Felix Schonbrodt and colleagues for similar Frequentist calculations of Bayes Factors)?

There are no Bayesian power calculators (yet, as far as I know). Bayes Factors express the relative evidence in your data, and you don’t need a threshold to interpret a Bayes Factor. In principle, any BF10 < 1 provides stronger support for the null hypothesis than for the alternative hypothesis. But no one will be convinced if you argue for the null hypothesis based on a BF10 = 0.98, and it’s practical to have some sort of agreed upon threshold where we all agree data can be interpreted as ‘support’ – hence tables with Bayes Factor thresholds as the one reported in Wagenmakers et al, 2011 (see above). So what are the chances of actually observing relative support for the null hypothesis, compare to an alternative hypothesis of d = 0.707, with 50 participants in each condition, when you want a BF10 < 1/3?

Using the R code below, you can easily calculate this probability yourself. It also provides the probability of finding support for the alternative hypothesis. This means you can run the simulations twice, once setting the true effect size to 0, once setting the true effect size to 0.707, and see whether your sample size is large enough to get what you want from the data you plan to collect.Note you need quite some simulations to be accurate, and this takes minutes)

Running these simulations, we see there’s about a 69% probability of finding support for the null-hypothesis with a BF10 < 1/3 if the true effect size is 0. In the histogram of Bayes Factors below, this is the distribution to the left of the left vertical red line. Approximately 1.6% of the Bayes Factors give misleading evidence (a BF > 3, to the right of the right vertical red line) and around 30% of the Bayes Factors are inconclusive evidence.

The second set of simulations reveals there’s approximately a 85% chance of finding a BF10 > 3 when the true effect size is 0.707. Based on these data, and your explicit interest in providing support for the null-hypothesis, you might want to collect some more participants.

Let’s look at some other scenarios. Imagine that in the previous situation, the true effect size was d = 0.43. It turns out that 13% of our experiments will find relative support for the null-hypothesis, 38% of the tests will find relative support for the alternative hypothesis, and 48% of the tests will be inconclusive, with a Bayes Factor between 1/3 and 3. Luckily, with Bayesian statistics you can continue collecting data, but since no one has unlimited resources, it’s important to know whether it’s feasible to collect sufficient data to answer your question in the first place. For example, in the default Bayesian t-test, don’t bother trying to find ‘strong evidence’ (BF < 1/10) for the null if you are planning to collect less than 200 participants. When the null is really true, and you plan to collect data from 100 to 200 participants in each condition, you’ll never find the evidence you are looking for. With 300 participants in each condition, you’ll start to get around 33% probability of finding what you want, indicating a threshold of 10 might be unreasonable for single studies, and more suitable for meta-analyses.  

The simulations also show that the probability of an incorrect conclusion (observing support for H0 when H1 is true, or support for H1 when H0 is true) is very low. Indeed, it’s so low, that we might want to reconsider used a Bayes Factor of 3 as a threshold.

For example, if we set a threshold as low as 1.2, instead of 3, the simulations show that, when the null-hypothesis is true, approximately 5% of the Bayes Factors we observe will provide relative support for the alternative hypothesis, and over 90% of the Bayes Factors will provide relative support for the null hypothesis, with 50 participants in each condition (the remainder, 5% fall in the small inconclusive evidence area). This may not be ‘substantial evidence’ but you won’t be wrong very often, in the long run, when you decide to accept the null hypothesis over the alternative hypothesis of d = 0.707 when you collect 50 participants in each condition.

You may not like the use of a cut-off value to indicate ‘relative support’ as proposed by some Bayesians (Dienes, 2016; Wagenmakers et al., 2011). I think these cut-offs are most important when planning an experiment. It makes sense to aim for support most people will find convincing. When the data are in hand, you can interpret it any way you like. If you agree, then important questions are which cut-off values are acceptable under which circumstances? Can we develop formulas for Bayesian power analysis instead of having to rely on simulations?

I learned something by playing around with these simulations, and I expect others moving to Bayes Factors will have similar questions as I tried to examine here. Power analysis is  Frequentist concept, and it might not be close to the heart of Bayesians, but I expect it meets a future need of researchers switching to Bayesian statistics, especially when they want to design studies that provide support for the null.

Dienes, Z. (2016). How Bayes factors change scientific practice. Journal of Mathematical Psychology. http://doi.org/10.1016/j.jmp.2015.10.003

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. http://doi.org/10.3758/PBR.16.2.225

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & Van Der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). Retrieved from http://psycnet.apa.org/journals/psp/100/3/426/

Wednesday, January 13, 2016

The Replication Value: What should be replicated?

Researchers are often reminded that replications are a cornerstone of empirical science (e.g., Koole & Lakens, 2012). However, we don’t need to regard every replication as equally valuable. Although most researchers will agree that a journal editor who rejects a manuscript reporting 20 high-powered direct replications of the Stroop-effect (Stroop, 1935) is making the right decision, they also know that some replications are worthy of being performed and published. Cumulative scientific knowledge requires a balance between original research and close replications of important findings.

The question when a close replication of an empirical finding is of sufficient value to the scientific community to justify being performed and published is an important question for any science that operates within financial and time constraints. Some years ago, I started a project on The Replication Value. The goal of the replication value was to create a quantitative and objective index to determine the value and importance of a close replication. The Replication Value can guide decisions of what to replicate directly, and can serve as a tool both for researchers to assess whether time and resources should be spent on replicating a finding, and for journal editors to help determine whether close replications should be considered for publication.

Developing a formula that can quantify the value of a replication is an interesting challenge. I realized I needed more knowledge of statistics before I could contribute, and, even though we were working with a pretty large team, I think it’s even better if even more people contribute suggestions.

Now, Courtney Soderberg and Charlie Ebersole have taken over the coordination of this project, and from now, anyone who feels like contributing to this important question can generate candidate formulas. Read more about how to contribute here. Want to demonstrate how the replication value can only be computed using Bayesian statistics? Convinced we need to rely on estimation instead? Show us what’s the best way to quantify the value of replications, and earn authorship to what will no doubt be a nice paper in the end.

My approach

I’m not going to give away my approach completely – I don’t want to limit the creativity of others – but I want to give some pointers to get people started.

I think at least two components determine the Replication Value of empirical findings: the impact of the effect, the precision of the effect size estimate. Quantifying the impact of studies is notably difficult, but I think citation counts are an easy to use proxy. Based on the idea that more data yields a better estimate of the population effect size, sample size is a dominant factor in precision (Borenstein, Hedges, Higgins, & Rothstein, 2009). The larger the sample, the lower the variance of the effect size estimate, which leads to a narrower confidence interval around the effect size estimate. We can take the precision of the effect size estimate: The confidence interval for r is calculated by first transforming r to Fisher’s z:

z=0.5 ×ln((1+r)/(1-r)) 

A very good approximation of the variance of z is:

Vz=  1/(n-3)

The confidence interval can then be calculated as normal:

95% CI=r ±1.96*√(Vz )

The values acquired through this procedure can be transformed back to r using:

r= (e^(2 × z)-1)/(e^(2 × z)+1)

where the z value is the z transformed upper or lower boundary of the 95% CI.

By expressing the width of the confidence interval of the effect size estimate of an effect as a percentage of the total possible width of the confidence interval, we have an index of the precision of the effect size estimate, which I call the ‘spielraum’, or the playing field, based on the conceptual similarity to the precision of a theoretical prediction in Meehl’s (1990) work on appraising theories.

Now the tricky thing is how these two factors interact, and determine the replication value. While I’m going back to solve that question, perhaps you want to propose a completely different approach. I mean, really, this is a question that requires Bayesian statistics, right? Are citation counts the absolutely worst way to quantify impact?

See how to contribute here: https://docs.google.com/document/d/1ufO7gwwI2rI7PnESn4wDA-pLcns46NyE7Pp-3zG3PN8/edit  I really look forward to your suggestions.

Friday, January 1, 2016

Error Control in Exploratory ANOVA's: The How and the Why

In a 2X2X2 design, there are three main effects, three two-way interactions, and one three-way interaction to test. That’s 7 statistical tests.The probability of making at least one Type 1 error in a single ANOVA is 1-(0.95)^7=30%.

There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (put it on your reading list, if you haven't read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVA’s in R, and repeat why it’s necessary if you don’t want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn’t thought about this issue before now, and it will reduce the amount of p <0.05 you’ll find in exploratory ANOVA's.

Simulating Type 1 errors in 3-way ANOVA's

Let’s simulate 250000 2x2x2 ANOVAs where all factors are manipulated between individuals, with 50 participants in each condition, and without any true effect (all group means are equal). The R code is at the bottom of this page. We store the p-values of the 7 tests. The total p-value distribution has the by now familiar uniform shape we see if the null hypothesis is true.

If we count the number of significant findings (even though there is no real effect), we see that from 250000 2x2x2 ANOVA’s, approximately 87.500 p-values were smaller than 0.05 (the left most bar in the Figure). This equals 250.000 ANOVA’s x 0.05 Type 1 errors x 7 tests. If we split up the p-values for each of the 7 tests, we see in the table below that as expected, each test has it’s own 5% error rate, which together add up to a 30% error rate due to multiple testing (i.e., the probability of not making a Type 1 error is 0.95*0.95*0.95*0.95*0.95*0.95*0.95, and the probability of making a Type 1 error is 1 minus this number). With a 2x2x2x2 ANOVA, the Type 1 errors you'll make reach a massive 54%, making you about as accurate as a scientist as a coin-flipping toddler.

Let’s fix this. We need to adjust the error rate. The Bonferroni correction (divide your alpha level by the number of tests, so for 7 tests and alpha = 0.05 use 0.05/7-= 0.007 for each test) communicates the basic idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g., economics, gene discovery) work on optimal Type 1 error control procedures continues. I’ve used the mutoss package in R in my simulations to check a wide range of corrections, and came to the conclusion that unless the number of tests is huge, we don’t need anything more fancy than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I'm wrong in the comments!). It orders p-values from lowest to highest, and tests them sequentially against an increasingly more lenient alpha level. If you prefer a spreadsheet, go here.

In a 2x2x2 ANOVA, we can test three main effects, three 2-way interactions, and one 3-way interaction. The table below shows the error rate for each of these 7 tests is 5% (for a total of 1-0.95^7=30%) but after the Holm-Bonferroni correction, the Type 1 error rate nicely controlled.

However, another challenge is to not let Type 1 error control increase the Type 2 errors too much. To examine this, I’ve simulated 2x2x2 ANOVA’s where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (a main effect, two 2-way interactions, and the 3-way interaction). 

Let’s first look at the p-value distribution. I’ve added a horizontal and vertical line. The horizontal line indicates the null-distribution caused by the four null-effects. The vertical line indicates the significance level of 0.05. The two lines create four quarters. Top left are the true positives, bottom left are the false positives, top right are the false negatives (not significant due to a lack of power) and the bottom right are the true negatives.

Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).

We see a substantial drop in the left-most column, and this drop is larger than the false height due to false positives. We also see a peculiarly high bar on the right, caused by the Holm correction adjusting a large number of p-values to 1. We can see this drop in power in the Table below as well. It’s substantial: From 87% power to 68% power.

If you perform a 2x2x2 ANOVA, we might expect you are not really interested in the main effects (if you were, a simply t-test would have sufficed). The power cost is already much lower if the exploratory analysis focusses on only four tests, the three 2-way interactions, and the 3-way interaction (see the third row in the Table below). Even exploratory 2x2x2 ANOVA’s are typically not 100% exploratory. If so, preregistering the subset of all tests you are interesting in, and controlling the error rate for this subset of tests, provides an important boost in power. 

Oh come on you silly methodological fetishist!

If you think Type 1 error control should not endanger the discovery of true effects, here’s what you should not do. You should not wave your hands at controlling Type 1 error rates, saying it is ‘methodological fetishism’ (Ellemers, 2013). It ain’t gonna work. If you choose to report p-values (by all means, don’t), and want to do quantitative science (by all means, don’t) than the formal logic you are following (even if you don’t realize this) is the Neyman-Pearson approach. It allows you to say: ‘In the long run, I’m not saying there’s something, when there is nothing, more than X% of the time’. If you don’t control error rates, your epistemic foundation of making statements reduces to ‘In the long run, I’m not saying there’s something, when there is nothing, more than … uhm … nice weather for the time of the year, isn’t it?’.

Now just because you need to control error rates, doesn’t mean you need to use a Type 1 error rate of 5%. If you plan to replicate any effect you find in an exploratory study, and you set the alpha to 0.2, the probability of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting you overall Type 1 error across two studies to 0.2, and then using an alpha of 0.05 for each of the 4 tests might be a good idea. If some effects would be costlier to miss, but others less costly, you can use an alpha of 0.8 for two effects, and an alpha of 0.02 for the other two. This is just one example. It’s your party. You can easily pre-register the choices you make to the OSF or AsPredicted to transparently communicate them.

You can also throw error control out of the window. There are approximately 1.950.000 hits in Google Scholar when I search for ‘An Exploratory Analysis Of’. Put these words in the title, list all your DV’s in the main test (e.g., in a table), add Bayesian statistics and effect sizes with their confidence intervals, and don’t draw strong conclusions (Bender & Lange, 2001).

Obviously, the tricky thing is always what to do if your prediction was not confirmed. I think you enter a Lakatosian degenerative research line (as opposed to the progressive research line you’d be in if your predictions were confirmed). With some luck, there’s an easy fix. The same study, but using a larger sample, (or, if you designed a study using sequential analyses, simply continue the data collection after the first look at the data, Lakens, 2014) might get you back in a progressive research line after an update in the predicted effect size. Try again, with a better manipulation of dependent variable. Giving up on a research idea after a single failed confirmation is not how science works, in general. Statistical inferences tell you how to interpret the data without fooling yourself. Type 1 error control matters, and in most psychology experiments, is relatively easy to do. But it’s only one aspect of the things you take into account when you decide which research you want to do.

My main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study which you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot of room between 100% exploratory and 100% confirmatory research, and there are many reasonable ideas about what the ‘family’ of errors is you want to control (for a good discussion on this, see Bender & Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed, and which error rate is controlled for”. Clear words, no hand waving.

Thanks to @RogierK for correcting an error in an earlier version of this blog post.

Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343–349. 
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416
Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8. http://doi.org/10.1002/ejsp.1932
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023