A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, February 21, 2016

Where are all the competent researchers?

In response to failed replications, some researchers argue that replication studies are especially convincing when the people who performed the replication are ‘competent’ ‘experts’.

Paul Bloom has recently remarked: “Plainly, a failure to replicate means a lot when it’s done by careful and competent experimenters, and when it’s clear that the methods are sensitive enough to find an effect if one exists. Many failures to replicate are of this sort, and these are of considerable scientific value. But I’ve read enough descriptions of failed replications to know how badly some of them are done. I’m aware as well that some attempts at replication are done by undergraduates who have never run a study before. Such replication attempts are a great way to train students to do psychological research, but when they fail to get an effect, the response of the scientific community should be: Meh.”

This mirrors the response by John Bargh after replications of the elderly priming studies yielded no significant effects: “The attitude that just anyone can have the expertise to conduct research in our area strikes me as more than a bit arrogant and condescending, as if designing and conducting these studies were mere child's play.” “Believe it or not, folks, a PhD in social psychology actually means something; the four or five years of training actually matters.”

So where is the evidence we should ‘meh’ replications by novices that show no effect? And how do we define a ‘competent’ experimenter? And can we justify the intuition that a non-significant finding by undergraduate students is ‘meh’, when we are more than willing to submit the work by the same undergraduate when the outcome is statistically significant?

One way to define a competent experimenter is simply by looking at who managed to observe the effect in the past. However, this won’t do. If we look at the elderly priming literature, a p-curve analysis gives no reason to assume anything more is going on than p-hacking. Thus, merely finding a significant result in the past should not be our definition of competence. It is a good definition of an ‘expert’, where the difference between an expert and novice is the amount of expertise one has in researching a topic. But I see no reason to believe expertise and competence are perfectly correlated.

There are cases where competence matters, as Paul Meehl reminds us in his lecture series (video 2, 46:30 minutes). He discusses a situation where Dayton Miller provided evidence in support of the ether drift, long after Einstein’s relativity theory explained it away. This is perhaps the opposite of a replication showing a null effect, but the competence of Miller, who had the reputation of being a very reliable experimenter, is clearly being taken into account by Meehl. It took until 1955 before the ‘occult result’ observed by Miller was explained by a temperature confound.

Showing that you can reliably reproduce findings is an important sign of competence – if this has been done without relying on publication bias and researchers’ degrees of freedom. This could easily be done in a single well-powered pre-registered replication study, but over the last years, I am not aware of researchers demonstrating their competence in reproducing contested findings in a pre-registered study. I definitely understand researchers prefer to spend their time in other ways than defending their past research. At the same time, I’ve seen many researchers who spend a lot of time writing papers criticizing replications that yield null results. Personally, I would say that if you are going to invest in defending your study, and data collection doesn’t take too much time, the most convincing demonstration of competence is a pre-registered study showing the effect.

So, the idea that there are competent researchers who can reliably demonstrate the presence of effects that others do not observe is not supported by empirical data (so far). In the extreme case of clear incompetence, there is no need for an empirical justification, as the importance of competence to observe an effect is trivially true. It might very well be true under less trivial circumstances. These circumstances are probably not experiments that occur completely in computer cubicles, where people are guided through the experiment by a computer program. I can’t see how the expertise of experimenters has a large influence on psychological effects in these situations. This is also one of the reasons (along with the 50 participants randomly assigned to four between-subject conditions) why I don’t think the ‘experimenter bias’ explanation for the elderly priming studies by Doyen and colleagues is particularly convincing (see Lakens & Evers, 2014).

In a recent pre-registered replication project re-examining the ego-depletion effect, both experts and novices performed replication studies. Although this paper is still in press, preliminary reports at conferences and on social media tell us the overall effect is not reliably different from 0. Is expertise a moderator? I have it on good authority that the answer is: No.

This last set of studies shows the importance of getting experts involved in replication efforts, since it allows us to empirically examine the idea that competence plays a big role in replication success. There are, apparently, people who will go ‘meh’ whenever non-experts perform replications. As is clear from my post, I am not convinced the correlation between expertise and competence is 1, but in light of the importance of social aspects of science, I think experts in specific research areas should get more involved in registered replication efforts of contested findings. In my book, and regardless of the outcome of such studies, performing pre-registered studies examining the robustness of your findings is a clear sign of competence.

Sunday, February 14, 2016

Why you don't need to adjust your alpha level for all tests you'll do in your lifetime.

In this blog post about error control, I discuss when we need it, how it depends on the question you want answered, why you don’t need to control the alpha level over all experiments you will perform in your lifetime, and when you might want to increase your alpha level above the holy 5%.

1740 words. Reading Time: 8 minutes 

The main goal in a Neyman-Pearson approach is to develop a procedure that will guide behavior, without being wrong too often. The long-run error rate of the decisions you make (based on the p-values you calculate) is easily controlled at a specific alpha level when only a single statistical test is performed. When multiple tests are performed, one can’t simply use the overall alpha level for all performed tests. Although there is some misguided discussion in the literature about whether error rates should be controlled when making multiple comparisons, the need for adjustments is a logical consequence of using Frequentist statistics such as the Neyman-Pearson approach (Thompson, 1998).

Consider an experiment where people are randomly assigned to either a control or experimental condition. Two unrelated dependent variables are measured to test a hypothesis. A researcher will conclude a specific manipulation has an effect, if there is a difference between the control group and the experimental group on either of these two dependent variables. Because two independent tests are performed, the probability of not making a Type 1 error when α = 0.05 is 0.95*0.95, or 0.9025. This means that the probability of concluding there is an effect, when there is no effect, is 1 - 0.9025 = 0.0975 instead of 0.05.
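The arithmetic above generalizes to any number of tests. Here is a minimal sketch in Python; the function name `familywise_error_rate` is my own, and it assumes the tests are independent and all null hypotheses are true.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type 1 error across n_tests independent
    tests, each performed at the same alpha, when every null is true."""
    # (1 - alpha) ** n_tests is the probability of making no Type 1 error at all.
    return 1 - (1 - alpha) ** n_tests


# One test keeps the error rate at alpha; two tests almost double it.
print(round(familywise_error_rate(0.05, 1), 4))  # 0.05
print(round(familywise_error_rate(0.05, 2), 4))  # 0.0975
```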

There are different ways to control for error rates, the easiest being the Bonferroni correction (divide the α by the number of tests), and an ever-so-slightly less conservative correction being the Holm-Bonferroni sequential procedure. For some multiple testing situations, dedicated statistical approaches have been developed. For example, sequential analyses (Lakens, 2014) control the error rate when researchers want to look at their data as it comes in, and stop the data collection whenever a statistically significant result is observed (this is also needed when updating meta-analyses). When the number of statistical tests becomes substantial, it is sometimes preferable to control false discovery rates, instead of error rates (Benjamini, Krieger, & Yekutieli, 2006). Many procedures that control for false discovery rates take dependencies among hypotheses into account. All these approaches have the same goal of limiting the probability of saying there is an effect, when there is no effect.
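To make the difference between the two simplest corrections concrete, here is a rough sketch of both in Python. The function names and the three p-values are hypothetical, chosen only to show a case where the two procedures disagree.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null whose p-value is at most alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: test p-values from smallest to largest against
    alpha / (number of hypotheses not yet tested), stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject


p_values = [0.01, 0.02, 0.04]  # hypothetical p-values from three tests
print(bonferroni_reject(p_values))  # [True, False, False]
print(holm_reject(p_values))        # [True, True, True]
```

Both procedures keep the familywise error rate at or below alpha; Holm simply spends the alpha less wastefully, so it never rejects fewer hypotheses than Bonferroni.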

The Bonferroni correction controls the familywise error rate, but what a family of tests is, requires some thought. The main reason this question is not straightforward is that error control does not just aim to control the number of erroneous statistical inferences, but the number of erroneous theoretical inferences. We therefore need to make a statement about which tests relate to a single theoretical inference, which depends on the theoretical question. I believe many of the problems researchers have in deciding how to correct for multiple comparisons are actually problems in deciding what their theoretical question is.

Error rates can be controlled for all tests in an experiment (the experimentwise Type 1 error rate) or for a specific group of tests (the familywise Type 1 error rate). Broad questions have many possible answers. If we want to know if there is ‘an effect’ in a study, then rejecting the null-hypothesis in any test we perform would lead us to decide the answer to our question is ‘yes’. In this situation, the experimentwise Type 1 error rate correctly controls the probability of deciding there is any effect, when all null hypotheses are true. For example, in a 2x2x2 ANOVA, we test for three main effects, three two-way interactions, and one three-way interaction, which makes seven tests in total. If we use a 5% alpha level for every test, the probability that we will conclude there is an effect, when the null hypothesis is true, is 30%.
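If you would rather see the 30% than take it on faith, a small simulation gets close. This is only a sketch: it assumes the seven tests are independent (in a real ANOVA they share an error term, so this is an approximation) and it uses the fact that p-values are uniformly distributed when the null hypothesis is true. The function name, seed, and number of simulations are my own choices.

```python
import random


def simulate_any_false_positive(n_tests=7, alpha=0.05, n_sims=100_000, seed=1):
    """Estimate how often at least one of n_tests null tests is significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under the null, each p-value is a draw from a uniform(0, 1) distribution.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_sims


rate = simulate_any_false_positive()
print(rate)             # simulated estimate, close to the analytic value
print(1 - 0.95 ** 7)    # analytic value, roughly 0.30
```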

But researchers often have more specific questions. Let’s assume a researcher has designed an experiment that compares predictions from two competing theories. Theory A predicts an interaction in a 2x2 ANOVA, while Theory B predicts no interaction, but at least one significant main effect. The researcher will perform three tests, which we will assume are highly powered for any theoretically relevant effect size. One might intuitively assume that since we will perform three tests (two main effects and one interaction) we should control the error rate for all three tests, for example by using α/3. But when controlling the familywise error rate, what constitutes a ‘family’ depends on a set of theoretically related tests. In this case, where we test two theories, there are two families of tests, the first family consisting of a single interaction effect, and the second family of two main effects. With an overall alpha level of 5%, we will decide to accept Theory A when p < α for the interaction, and we will decide not to accept Theory A when p > α. If the null is true, at most 5% of these decisions we make in the long run will be incorrect, so the percentage of decision errors is controlled. Furthermore, we will decide to accept Theory B when p < α/2 (using a Bonferroni correction) for either of the two main effects, and not accept Theory B when p > α/2. When the null hypothesis is true, we will decide to accept Theory B when it is not true at most 5% of the time. We could accept neither theory, or even both, if it turned out the experiment was not the crucial test the researcher had thought.
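The decision rule for the two families can be written down in a few lines. This is a sketch of the rule as described above, not a general-purpose tool; the function name, argument names, and the example p-values are all hypothetical.

```python
def evaluate_theories(p_interaction, p_main_1, p_main_2, alpha=0.05):
    """Apply a separate 5% familywise error rate to each theory.

    Family A contains one test (the interaction), so it uses the full alpha.
    Family B contains two tests (the main effects), so each uses alpha / 2.
    """
    accept_a = p_interaction < alpha
    accept_b = min(p_main_1, p_main_2) < alpha / 2
    return {"theory_a": accept_a, "theory_b": accept_b}


# Hypothetical outcome: interaction p = 0.03, main effects p = 0.04 and p = 0.20.
print(evaluate_theories(0.03, 0.04, 0.20))
# {'theory_a': True, 'theory_b': False}
```

Note that with α/3 applied to all three tests, the interaction at p = 0.03 would not have been significant, even though only one test bears on Theory A.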

Some researchers criticize corrections for multiple comparisons because one might as well correct for all tests you do in your lifetime (Perneger, 1998). If you choose to use a Neyman-Pearson paradigm, as opposed to a Likelihood approach or Bayesian statistics, the only reason to correct for all tests you perform in your lifetime is when all the work you have done in your life tests a single theory, and you would use your last words to decide to accept or reject this theory, as long as only one of all individual tests you have performed yielded a p < α. Researchers rarely work like this. Instead, they often draw a conclusion after a single study. It’s these intermediate decisions to accept or reject the null hypothesis that should not be wrong too often, in the long run. We control errors when we make decisions about theories, and we make these decisions more than once in our lifetime. 

It might seem as if researchers can find a way out of using error control by formulating a hypothesis for every possible test they will perform. Indeed, they can. For a ten by ten correlation matrix, a researcher might have theoretical predictions for all 45 individual correlations. If all these 45 predicted correlations are tested using an alpha level of 5%, the statistical inference is valid. However, readers might reasonably question the theoretical validity of these 45 tests. All statistical inferences interact with theoretical inferences at some point, and choices to control error rates are a good example of this.

Another criticism of corrections for multiple comparisons is that it is strange that the conclusions a researcher draws depend on the number of additional tests a researcher performs. For example, if a researcher had measured only a single dependent variable, a p = 0.04 would have led to a decision to reject the null hypothesis, but with a second dependent variable, the alpha level is reduced to 0.025, and now the same data no longer leads to the conclusion to reject the null hypothesis. Lowering alpha levels is a mathematical necessity when you want to control error rates, but it is not needed if all you want to do is quantify relative likelihoods of the data under different hypotheses.

Likelihood approaches look at the relative likelihood of the data, given two hypotheses (complemented with prior knowledge in Bayesian statistics). Likelihoods only care about the data. Obviously the probability that the strong evidence in favor of the alternative hypothesis is a fluke increases with the number of tests that were performed. There are ways to control error rates in likelihood approaches and Bayesian statistics, but they are less straightforward than using a Neyman-Pearson approach. It might seem strange for someone who uses a likelihood approach (or Bayesian statistics) that conclusions depend on the number of additional tests that are performed. But from a Neyman-Pearson approach, it is similarly strange to interpret one out of 45 likelihood ratios or Bayes factors from a ten by ten correlation matrix as ‘strong evidence’ for a true effect, without taking into account that 44 other tests were performed at the same time. Combining both approaches is probably a win-win, where long run error rates are controlled, after which the evidential value in individual studies is interpreted (and, because why not, parameters are estimated).

A better understanding of controlling error rates is useful. There are researchers who fear the current scientific climate is focusing too much on Type 1 error control, at the expense of Type 2 error control (Fiedler, Kutzner, & Krueger, 2012). But this is not necessarily so. It all depends on how you design your experiments. Just like you need to lower the alpha level if multiple tests would allow you to reject the null hypothesis, you can choose to increase the alpha level if you will only reject the null hypothesis when multiple independent tests yield a p < α. For example, it is perfectly fine to pre-register a set of two experiments, the second a close replication of the first, where you will choose to reject the null-hypothesis if the p-value is smaller than 0.2236 in both experiments. The probability that you will reject the null hypothesis twice in a row if the null hypothesis is true is α * α, or 0.2236 * 0.2236 = 0.05. In other words, if you set out to do a line of pre-registered studies, which you will report without publication bias, it makes sense to increase your alpha level. For example, an alpha level of 0.1 in both studies effectively limits the Type 1 error rate to 0.1 * 0.1 = 0.01. Conceptually, this is similar to deciding to base your decision on the outcome of a small-scale meta-analysis with an alpha of 0.01.
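The numbers in this example are easy to reproduce. Below is a minimal sketch, assuming the studies are independent and that you only reject the null hypothesis when every study is significant; the function names are mine.

```python
def per_study_alpha(overall_alpha: float, n_studies: int) -> float:
    """Alpha to use in each study so that rejecting the null in ALL n_studies
    studies has a combined Type 1 error rate of overall_alpha."""
    return overall_alpha ** (1 / n_studies)


def combined_alpha(alpha_per_study: float, n_studies: int) -> float:
    """Type 1 error rate of rejecting the null in every one of n_studies studies."""
    return alpha_per_study ** n_studies


print(round(per_study_alpha(0.05, 2), 4))  # 0.2236
print(round(combined_alpha(0.1, 2), 4))    # 0.01
```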

There is only one reason to calculate p-values, and that is to control Type 1 error rates using a Neyman-Pearson approach. Therefore, if you use p-values, you need to correct for multiple comparisons, but be smart about it. We need better error control, not necessarily stricter error control.

Benjamini, Y., Krieger, A. M., & Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3), 491–507.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The Long Way From α-Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspectives on Psychological Science, 7(6), 661–669. http://doi.org/10.1177/1745691612462587
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023
Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. BMJ, 316(7139), 1236–1238.
Thompson, J. R. (1998). Invited Commentary: Re: “Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data”. American Journal of Epidemiology, 147(9), 801–806. http://doi.org/10.1093/oxfordjournals.aje.a009530

Thursday, February 11, 2016

So you banned p-values, how’s that working out for you?

The journal Basic and Applied Social Psychology banned p-values a year ago. I read some of their articles published in the last year. I didn’t like many of them. Here’s why.

First of all, it seems BASP didn’t just ban p-values. They also banned confidence intervals, because God forbid you use that lower bound to check whether or not it includes 0. They also banned reporting sample sizes for between-subject conditions, because God forbid you divide that SD by the square root of N and multiply it by 1.96 and subtract it from the mean and guesstimate whether that value is smaller than 0.
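For the record, the forbidden calculation is a one-liner. A small sketch, with a function name and example numbers I made up for illustration (they are not taken from any BASP article), and using the normal approximation implied by the 1.96:

```python
import math


def ci_lower_bound(mean: float, sd: float, n: int, z: float = 1.96) -> float:
    """Lower bound of an approximate 95% confidence interval around a mean:
    the mean minus z times the standard error (SD / sqrt(N))."""
    return mean - z * sd / math.sqrt(n)


# Hypothetical numbers: mean = 0.5, SD = 1.0, N = 25.
lower = ci_lower_bound(0.5, 1.0, 25)
print(round(lower, 3))  # 0.108
print(lower > 0)        # True: the interval excludes 0
```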

It reminds me of alcoholics who go into detox and have to hand in their perfume, before they are tempted to drink it. Thou shall not know whether a result is significant – it’s for your own good! Apparently, thou shall also not know whether effect sizes were estimated with any decent level of accuracy. Nor shall thou include the effect in future meta-analyses to commit the sin of cumulative science.

There are some nice papers where the p-value ban has no negative consequences. For example, Schwab & Greitemeyer (2015) examine whether indirect (virtual) intergroup contact (seeing you have 1 friend in common with an outgroup member, vs not) would influence intergroup attitudes. It did not, in 8 studies. P-values can’t be used to accept the null-hypothesis, and these authors explicitly note they aimed to control Type 2 errors based on an a-priori power analysis. So, after observing many null-results, they drew the correct conclusion that if there was an effect, it was very unlikely to be larger than what the theory on evaluative conditioning predicted. After this conclusion, they logically switch to parameter estimation, perform a meta-analysis and based on a Cohen’s d of 0.05, suggest that this effect is basically 0. It’s a nice article, and the p-value ban did not make it better or worse.

But in many other papers, especially those where sample sizes were small, and experimental designs were used to examine hypothesized differences between conditions, things don’t look good.

In many of the articles published in BASP, researchers make statements about differences between groups. Whether or not these provide support for their hypotheses becomes a moving target, without the need to report p-values. For example, some authors interpret a d of 0.36 as support for an effect, while in the same study, a Cohen’s d < 0.29 (we are not told the exact value) is not interpreted as an effect. You can see how banning p-values solved the problem of dichotomous interpretations (I’m being ironic). Also, with 82 people divided over three conditions, the p-value associated with the d = 0.36 interpreted as an effect is around p = 0.2. If BASP had required authors to report p-values, they might have interpreted this effect a bit more cautiously. And in case you are wondering: No, this is not the only non-significant finding interpreted as an effect. Surprisingly enough, it seems to happen a lot more often than in journals where authors report p-values! Who would have predicted this?!
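Here is roughly how you can recover that p-value yourself. This sketch makes assumptions the article does not let me check: that the 82 participants were split about evenly (27 per condition), and that the effect is a simple two-group comparison tested with a pooled-variance t-test on those two groups only. To stay free of dependencies, it integrates the t-distribution numerically instead of calling a statistics library; all function names are mine.

```python
import math


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-sided p-value, using Simpson's rule to integrate the density
    from 0 to |t| (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return 1 - central_area


def p_from_cohens_d(d: float, n1: int, n2: int) -> float:
    """Convert Cohen's d from two independent groups into a two-sided p-value."""
    t = d * math.sqrt(n1 * n2 / (n1 + n2))
    return two_sided_p(t, n1 + n2 - 2)


# Assumed group sizes: 27 per condition (82 people over three conditions).
print(round(p_from_cohens_d(0.36, 27, 27), 2))
```

Under these assumptions the result lands in the neighborhood of 0.2, well above any conventional alpha level.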

Saying one thing is bigger than something else, and reporting an effect size, works pretty well in simple effects. But how would you say there is a statistically significant interaction, if you can’t report inferential statistics and p-values? Here are some of my favorite statements.

“The ANOVA also revealed an interaction between [X] and [Y], η² = 0.03 (small to medium effect).”

How much trust do you have in that interaction from an exploratory ANOVA with a small to medium effect size of .03, partial eta squared? That’s what I thought.

“The main effects were qualified by an [X] by [Y] interaction. See Figure 2 for means and standard errors”

The main effects were qualified, but the interaction was not quantified. What does this author expect I do with the means and standard errors? Look at it while humming ‘ohm’ and wait to become enlightened? Everybody knows these authors calculated p-values, and based their statements on these values.

In normal scientific journals, authors sometimes report a Bonferroni correction. But there’s no way you are going to Bonferroni those means and standard deviations, now is there? With their ban on p-values and confidence intervals, BASP has banned error control. For example, read the following statement:

“Willpower theories were also related to participants’ BMI. The more people endorsed a limited theory, the higher their BMI. This finding corroborates the idea that a limited theory is related to lower self-control in terms of dieting and might therefore also correlate with patients BMI.”

This is based on a two-sided p-value of 0.026, and it was one of 10 calculated correlation coefficients. Would a Bonferroni adjusted p-value have led to a slightly more cautious conclusion?
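The adjustment takes one multiplication. A minimal sketch, assuming all 10 correlations belong to the same family of tests (the function name is my own):

```python
def bonferroni_adjusted_p(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p * n_tests)


adjusted = bonferroni_adjusted_p(0.026, 10)
print(round(adjusted, 3))  # 0.26
print(adjusted < 0.05)     # False
```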

Oh, and if you hoped banning p-values would lead anyone to use Bayesian statistics: No. It leads to a surprisingly large number of citations to Trafimow’s articles where he tries to use p-values as measures of evidence, and is disappointed they don’t do what he expects. Which is like going to The Hangover part 4 and complaining it’s really not that funny. Except everyone who publishes in BASP mysteriously agrees that Trafimow’s articles show NHST has been discredited and is illogical.

In their latest editorial, Trafimow and Marks knock down some arguments you could, after a decent bottle of liquor, interpret as straw men against their ban of p-values. They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise.

The absence of p-values has not prevented dichotomous conclusions, nor claims that data support theories (which is only possible using Bayesian statistics), nor anything else p-values were blamed for in science. After reading a year’s worth of BASP articles, you’d almost start to suspect p-values are not the real problem. Instead, it looks like researchers find making statistical inferences pretty difficult, and forcing them to ignore p-values didn’t magically make things better.

As far as I can see, all that banning p-values has done is increase the Type 1 error rate in BASP articles. Restoring a correct use of p-values would substantially improve how well the conclusions authors draw actually follow from the data they have collected. The only expense, I predict, is a much lower number of citations to articles written by Trafimow about how useless p-values are.