The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Wednesday, August 18, 2021

P-values vs. Bayes Factors

At the first partially in-person scientific meeting I am attending since the start of the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and me to engage in a discussion about p-values and Bayes factors. We each gave a 15-minute presentation to set up our arguments, centered around three questions: What is the goal of statistical inference? What is the advantage of your approach in a practical/applied context? And when do you think the other approach may be applicable?


What is the goal of statistical inference?


Browse through the latest issue of Psychological Science and you will see that many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science – there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data – so there is never a single goal of statistical inference. But if you browse through scientific journals, you will see that a large percentage of published articles have the goal to make one or more scientific claims.


Claims can be correct or wrong. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high for scientific claims to be useful, and therefore scientists have developed somewhat more advanced methodological approaches to make claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum rate of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where the researchers specified a smallest effect size of interest, and used an equivalence test to statistically reject the presence of any worthwhile effect of language status on the bilingual advantage in children’s executive functioning. They would make such a claim with a pre-specified maximum Type 1 error rate, the alpha level, often set to 5%. Formally, authors are saying “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerges that proves us wrong) assume our claim is correct”. Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.
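To make the logic of such an equivalence test concrete, here is a minimal sketch in Python. It is not the analysis from the meta-analysis mentioned above: the data are simulated, and the smallest effect size of interest (a raw difference of 0.2) and the 5% alpha level are assumptions I picked purely for illustration.

```python
# Minimal sketch of an equivalence test (two one-sided tests, TOST).
# Everything here is illustrative: simulated data, an assumed smallest
# effect size of interest of +/- 0.2 (raw units), and a 5% alpha level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bilingual = rng.normal(loc=0.0, scale=1.0, size=200)    # hypothetical scores
monolingual = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical scores

bound = 0.2   # smallest effect size of interest (raw mean difference)
alpha = 0.05  # pre-specified maximum Type 1 error rate

# Test 1: is the mean difference reliably larger than -bound?
_, p_lower = stats.ttest_ind(bilingual + bound, monolingual, alternative="greater")
# Test 2: is the mean difference reliably smaller than +bound?
_, p_upper = stats.ttest_ind(bilingual - bound, monolingual, alternative="less")

p_tost = max(p_lower, p_upper)  # claim equivalence only if both tests reject
if p_tost < alpha:
    print(f"Claim: no meaningful effect (both bounds rejected, p = {p_tost:.3f})")
else:
    print(f"No equivalence claim possible (p = {p_tost:.3f})")
```

The claim “there is no meaningful effect” is made only when the larger of the two one-sided p-values falls below the pre-specified alpha level, which is what ties the claim to a known maximum error rate.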


The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead to the same claims by all scientists who evaluate the tests of these claims. Incorporating individual priors into statistical inferences, and making claims that depend on these individual prior beliefs, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to update individual beliefs, but they are not a useful tool to establish collectively agreed upon scientific claims.


What is the advantage of your approach in a practical/applied context?


A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but where we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occur to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become so small that the scientific community no longer considers it sensible to perform additional replication studies, based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, the field often resorts to rules or regulations. Just as we could debate whether it is sensible to allow a specific person to drive 138 kilometers per hour on a specific stretch of road at a specific time of day given their level of driving experience, such discussions are too complex to implement in practice, and instead speed limits of 50, 80, 100, and 130 kilometers per hour are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.0000003 in physics to declare a discovery, or the two-study rule of the FDA).
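As an illustration of what designing for severity looks like in practice, here is a minimal power analysis sketch in Python using statsmodels. The standardized effect size of d = 0.5 is a placeholder I chose; it is not a value discussed above, and the smallest effect size of interest should of course be justified for the specific study.

```python
# Sketch: sample size required for 99.9% power at a 0.1% alpha level in an
# independent-samples t-test. The smallest effect size of interest (d = 0.5)
# is an arbitrary placeholder, not a value from the text above.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,
                                   alpha=0.001,
                                   power=0.999,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")
```

With these numbers the required sample size is large, which is exactly the point: the more severe the test, the more everyone can agree the risk of an erroneous claim is low.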


Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a Bayes factor is larger than 10 or smaller than 0.1. This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonard Held brought up these Bayesian/frequentist compromise methods as well – during coffee after our discussion, EJ and I agreed that we like those approaches, as they allow researchers to control frequentist errors while interpreting the evidential value in the data – it is a win-win solution). This works by determining through simulations which cut-off value for the test statistic (here, the Bayes factor) should be used to make claims so that error rates are controlled. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, do it.
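The kind of simulation alluded to above is not hard to sketch. The Python snippet below is an illustration under assumptions I chose myself (a default Bayes factor as implemented in the pingouin library’s bayesfactor_ttest function, 50 participants per group, and a BF > 10 decision rule); it is not the calibration procedure used in any specific Registered Report.

```python
# Sketch: a frequentist calibration of a Bayes factor decision rule.
# Under a true null effect, how often does a default two-sample Bayes factor
# exceed 10? Assumes pingouin's bayesfactor_ttest (a JZS Bayes factor with a
# default prior); all other numbers are illustrative choices.
import numpy as np
from scipy import stats
from pingouin import bayesfactor_ttest

rng = np.random.default_rng(42)
n, n_sims, threshold = 50, 5000, 10
false_claims = 0

for _ in range(n_sims):
    x = rng.normal(size=n)  # both groups are drawn from the same
    y = rng.normal(size=n)  # distribution, so the null is true
    t, _ = stats.ttest_ind(x, y)
    if float(bayesfactor_ttest(t, n, n)) > threshold:
        false_claims += 1

print(f"Simulated rate of BF10 > {threshold} under the null: "
      f"{false_claims / n_sims:.4f}")
```

One could then raise or lower the threshold, or adjust the sample size, until the simulated rate of erroneous claims matches a desired maximum, which is exactly the compromise approach described above.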


In practice, an advantage of frequentist approaches is that criticism has to focus on the data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011) Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I strongly expect that most reasonable people would agree more with the prior chosen by Bem and colleagues than with the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed that in hindsight he did not believe the prior in the main paper was the best choice, but noted that the supplementary files included a sensitivity analysis that demonstrated the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed way). More productive than discussing priors is the data collected in direct replications since 2011, which consistently lead to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on the data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).


When do you think the other approach may be applicable?


Even though, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful to make collectively agreed upon scientific claims, there are situations in which all scientists are Bayesians. First, we have to rely on our beliefs when we can not collect sufficient data to repeatedly test a prediction. When data is scarce, we can’t use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we can not be wrong. Incorrect priors can mislead, but if we know our priors are correct, even though this might be rare, we should use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.


Conclusion


In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion in the rest of the day was how we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to scientific claims that are incorrect too often. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.

Wednesday, May 26, 2021

Can joy and rigor co-exist in science?

This is a post-publication peer review of "Joy and rigor in behavioral science". A response by the corresponding author, Leslie John, is at the bottom of this post - make sure to read this as well. 

In a recent paper “Joy and rigor in behavioral science” https://doi.org/10.1016/j.obhdp.2021.03.002 Hanne Collins, Ashley Whillans, and Leslie John aim to examine the behavioral and subjective consequences of performing confirmatory research (e.g., a preregistered study). In their abstract they conclude from Study 1 that “engaging in a pre-registration task impeded the discovery of an interesting but non-hypothesized result” and from Study 2 that “relative to confirmatory research, researchers found exploratory research more enjoyable, motivating, and interesting; and less anxiety-inducing, frustrating, boring, and scientific.” An enjoyable talk about this paper is available at: https://www.youtube.com/watch?v=y31G63iw2xw.

I like meta-scientific work that examines the consequences of changes in scientific practices. It is likely that new initiatives (e.g., preregistration) will have unintended negative consequences, and describing these consequences will make it possible to prevent them through, for example, education. I also think it is important to examine what makes scientists more or less happy in their job (although in this respect, my prior is very low that topics such as preregistration explain a lot of variance compared to job uncertainty, stress, and a lack of work-life balance).

However, I am less confident in the insights this study provides than the authors suggest in their abstract and conclusion. First, and perhaps somewhat ironically, the authors base their conclusions from Study 1 on exploratory analyses that I am willing to bet are a Type 1 error (or maybe a confound), and are not strong enough to be taken seriously.

In Study 1 researchers are asked to go through a hypothetical research process in a survey. They collect data on whether people do yoga on a weekly basis, how happy participants are today, and the gender of participants. Across three conditions, the study was preregistered (see the Figure below), preregistered with a message that they could still explore, or not preregistered. The researchers counted how many of 7 possible analysis options were selected (including an ‘other’ option). The hypothesis is that if researchers explore more in non-preregistered analyses, they would select more of these 7 analysis options to perform in the hypothetical research project.


The authors write their interest is in whether “participants in the confirmation condition viewed fewer analyses overall and were less likely to view and report the results of the gender interaction”. This first analysis seems to be a direct test of a logical prediction. The second prediction is surprising. Why would researchers care about the results of a gender interaction? It turns out that this is the analysis where the authors have hidden a significant interaction that can be discovered through exploring. Of course, the participants do not know this.

The results showed the following: 

A negative binomial logistic regression (Hilbe, 2011) revealed no difference between conditions in the number of analyses participants viewed (M_exploration = 3.48, SD_exploration = 2.08; M_confirmation = 3.79, SD_confirmation = 1.99; M_hybrid = 3.67, SD_hybrid = 2.19; all ps ≥ 0.45). Of particular interest, we assessed between condition differences in the propensity to view the results of an exploratory interaction using binary logistic regressions. In the confirmation condition, 53% of participants viewed the results of the interaction compared with 69% in the exploration condition, b = 0.70, SE = 0.24, p = .01.

So the main result here is clear: there is no effect of confirmatory research on the tendency to explore. This is the main conclusion from this analysis. Then, the authors do something very weird. They analyze the item that, unbeknownst to participants, would have revealed a significant interaction. This is one of 7 options participants could click on. The difference they report (p = 0.01) is not significant if the authors correct for multiple comparisons [NOTE: The authors made the preregistration public after I wrote a draft of this blog https://aspredicted.org/dg9m9.pdf and this reveals they did plan a priori to analyze this item separately – it nicely shows how preregistration allows readers to evaluate the severity of a test (Lakens, 2019), and this test was statistically more severe than I initially thought before I had access to the preregistration – I left this comment in the final version of the blog for transparency, and because I think it is a nice illustration of a benefit of preregistration]. But more importantly, there is no logic behind only testing this item. It is, from the perspective of participants, not special at all. They don’t know it will yield a significant result. Furthermore, why would we only care about exploratory analyses that yield a significant result? There are many reasons to explore, such as getting a better understanding of the data.
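For concreteness, the multiple-comparisons point above is just arithmetic. A minimal sketch, assuming a Bonferroni correction over the 7 analysis options (the paper itself does not report corrected p-values):

```python
# Bonferroni-corrected threshold when 7 analysis options are each tested.
# Only the p = .01 for the gender interaction is reported; the correction
# shown here is my own illustration, not the authors' analysis.
alpha = 0.05
n_options = 7
corrected_threshold = alpha / n_options      # ~0.0071
p_reported = 0.01                            # gender-interaction test
print(f"Corrected threshold: {corrected_threshold:.4f}")
print(f"Significant after correction: {p_reported < corrected_threshold}")  # False
```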

To me, this study nicely shows a problem with exploration. You might get a significant result, but you don’t know what it means, and you don’t know if you just fooled yourself. This might be noise. It might be something about this specific item (e.g., people realize that, due to the crud factor, exploring gender interactions without a clear theory is uninteresting, as there are many uninteresting reasons you observe a significant effect). We don’t know what drives the effect on this single item.

The authors conclude “Study 1 provides an “existence proof” that a focus on confirmation can impede exploration”. First of all, I would like it if we banned the term ‘existence proof’ following a statistical test. We did not find a black swan feather, and we didn’t dig up a bone. We observed a p-value in a test that lacked severity, and we might very well be talking about noise. If you want to make a strong claim, we know what to do: Follow up on this study, and show the effect again in a confirmatory test. Results of exploratory analysis of slightly illogical predictions are not ‘existence proofs’. They are a pattern that is worth following up on, but that we can not make any strong claims about as it stands.

In Study 2 we get some insights into why Study 2 was not a confirmatory study replicating Study 1: performing confirmatory studies is quite enjoyable, interesting, and motivating – but it is slightly less so than exploratory work (see the Figure below). Furthermore, confirmatory tests are more anxiety-inducing. I remember this feeling very well from when I was a PhD student. We didn’t want to do a direct replication, because what if that exploratory finding in our previous study didn’t replicate? Then we could no longer make a strong claim based on the study we had. Furthermore, doing the same thing again, but better, is simply less enjoyable, interesting, and motivating. The problem in Study 2 is not in the questions that were asked, but in the questions that were not asked.

For example, the authors did not ask ‘how enjoyable is exploratory research when, after you have written up the exploratory finding, someone writes a blog post about how that finding does not support the strong claims you have made?’ Yet the latter might get a lot more weight in the overall evaluation of the utility of performing confirmatory and exploratory research. Another relevant question is ‘How bad would you feel if someone tried to replicate your exploratory finding, but they failed, and they published an article that demonstrated your ‘existence proof’ was just a fluke?’ Other relevant questions are ‘How enjoyable is it to see a preregistered hypothesis support your prediction?’ or ‘How enjoyable are the consequences of providing strong support for your claims for where the paper is published, how often it is cited, and how seriously it is taken by academic peers?’ The costs and benefits of confirmatory studies are multi-faceted. We should look not just at the utility of performing the actions, but at the utility of the consequences. I don’t enjoy doing the dishes, but I enjoy taking that time to call friends and being able to eat from a clean plate. A complete evaluation of the joy of confirmatory research needs to ask questions about all facets that go into the utility function.

To conclude, I like articles that examine consequences of changes in scientific practice, but in this case I felt the conclusions were too far removed from the data. In the conclusion, the authors write “Like exploration, confirmation is integral to the research process, yet, more so than exploration, it seems to spur negative sentiment.” Yet we could just as easily have concluded from the data that confirmatory and exploratory research are both enjoyable, given the means of 5.39 and 5.87 on a 7-point scale, respectively. If anything, I was surprised by how small the difference is (d = 0.33 for how enjoyable both are). Although the authors do not interpret the size of the effects in their paper, that was quite a striking conclusion for me – I would have predicted the difference in how enjoyable these two types of research are to perform would have been larger. The authors also conclude about Study 1 that “researchers who were randomly assigned to preregister a prediction were less likely to discover an interesting, non-hypothesized result.” I am not sure this is not just a Type 1 error, as the main analysis yielded no significant result, and my prior is very low that a manipulation that makes people less likely to explore would only make them less likely to explore the one item that, unbeknownst to the participants, would reveal a statistically significant interaction, as I don’t know which mechanism could cause such a specific effect. Instead of exploring this in hypothetical research projects in a survey, I would love to see an analysis of the number of exploratory analyses in published preregistered studies. I would predict that in real research projects, researchers report all the possibly interesting significant results they can find in exploratory analyses.


Leslie John’s response:


  1. Lakens characterizes Study 1 as exploratory, but it was explicitly a confirmatory study. As specified in our pre-registration, our primary hypothesis was “H1: We will test whether participants in a confirmatory mindset (vs. control) will be less likely to seek results outside of those that would confirm their prediction (i.e., to seek the results of an interaction, as opposed to simply the predicted main effect).” Lakens is accurate in noting that participants did not know a priori that the interaction was significant, but this is precisely our point: when people feel that exploration is off-limits, informative effects can be missed. Importantly, we also stress that a researcher shouldn’t simply send a study off for publication once s/he has discovered something new (like this interaction effect) through exploration; rather “exploration followed by rigorous confirmation is integral to scientific discovery” (p. 188). As a result, Lakens’ questions such as “How bad would you feel if someone tried to replicate your exploratory finding, but they failed, and they published an article that demonstrated your ‘existence proof’ was just a fluke” is not an event that researchers would need to worry about if they follow the guidance we offer in our paper (which underscores the importance of exploration followed by confirmation).


  2. Lakens raises additional research questions stemming from our findings about the subjective experience of conducting exploratory versus confirmatory work. We welcome additional research to better understand the subjective experience of conducting research in the reform era. We suspect that one reason that research in the reform era can induce anxiety is the fear of post-publication critiques, and the valid concern that such critiques will misrepresent both one’s findings as well as one’s motives for conducting the research. We are therefore particularly solicitous of research that speaks to making the research process, including post-publication critique, not only rigorous but joyful.