The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, September 20, 2021

Jerzy Neyman: A Positive Role Model in the History of Frequentist Statistics

Many of the facts in this blog post come from the biography ‘Neyman’ by Constance Reid. I highly recommend reading this book if you find this blog interesting.

In recent years researchers have become increasingly interested in the relationship between eugenics and statistics, especially focusing on the lives of Francis Galton, Karl Pearson, and Ronald Fisher. Some have gone as far as to argue for a causal relationship between eugenics and frequentist statistics. For example, in a recent book ‘Bernoulli’s Fallacy’, Aubrey Clayton speculates that Fisher’s decision to reject prior probabilities and embrace a frequentist approach was “also at least partly political”. Rejecting prior probabilities, Clayton argues, makes science seem more ‘objective’, which would have helped Ronald Fisher and his predecessors to establish eugenics as a scientific discipline, despite the often-racist conclusions eugenicists reached in their work.

When I was asked to review an early version of Clayton’s book for Columbia University Press, I found the main narrative rather unconvincing, and the presented history of frequentist statistics too one-sided and biased. Authors who link statistics to problematic political views often do not mention equally important figures in the history of frequentist statistics who were in all ways the opposite of Ronald Fisher. In this blog post, I want to briefly discuss the work and life of Jerzy Neyman, for two reasons.


Jerzy Neyman (image from https://statistics.berkeley.edu/people/jerzy-neyman)

First, the focus on Fisher’s role in the history of frequentist statistics is surprising, given that the dominant approach to frequentist statistics used in many scientific disciplines is the Neyman-Pearson approach. If you have ever rejected a null hypothesis because a p-value was smaller than an alpha level, or if you have performed a power analysis, you have used the Neyman-Pearson approach to frequentist statistics, and not the Fisherian approach. Neyman and Fisher disagreed vehemently about their statistical philosophies (in 1961 Neyman published an article titled ‘Silver Jubilee of My Dispute with Fisher’), but it was Neyman’s philosophy that won out and became the default approach to hypothesis testing in most fields[i]. Anyone discussing the history of frequentist hypothesis testing should therefore seriously engage with the work of Jerzy Neyman and Egon Pearson. Their work was in line with neither the views of Karl Pearson, Egon's father, nor those of Fisher. Indeed, it was a great source of satisfaction to Neyman that their seminal 1933 paper was presented to the Royal Society by Karl Pearson, who was hostile to and skeptical of the work, and (as Neyman thought) reviewed by Fisher[ii], who strongly disagreed with their philosophy of statistics.

Second, Jerzy Neyman was also the opposite of Fisher in his political viewpoints. Instead of promoting eugenics, Neyman worked throughout his life to improve the position of those less privileged, teaching disadvantaged people in Poland and creating educational opportunities for Americans at UC Berkeley. He hired David Blackwell, who became the first Black tenured faculty member at UC Berkeley. This is important, because it falsifies the idea put forward by Clayton[iii] that frequentist statistics became the dominant approach in science because the most important scientists who worked on it wanted to pretend their dubious viewpoints were based on ‘objective’ scientific methods.

I think it is useful to broaden the discussion of the history of statistics beyond the work of Fisher and Karl Pearson, and credit the work of others[iv] who contributed in at least as important ways to the statistics we use today. I am continually surprised by how few people working outside of statistics even know the name of Jerzy Neyman, even though they regularly use his insights when testing hypotheses. In this blog, I will try to describe his work and life to add some balance to the history of statistics that most people seem to learn about. And more importantly, I hope Jerzy Neyman can be a positive role model for young frequentist statisticians, who might so far have only been educated about the life of Ronald Fisher.


Neyman’s personal life


Neyman was born in 1894 in Russia, but raised in Poland. After attending the gymnasium, he studied at the University of Kharkov. He initially tried to become an experimental physicist, but he was too clumsy with his hands and switched to conceptual mathematics, completing his undergraduate degree in 1917 in politically tumultuous times. In 1919 he met his wife, and they married in 1920. Ten days later, because of the war between Russia and Poland, Neyman was imprisoned for a short time, and in 1921 he fled to a small village to avoid being arrested again, obtaining food by teaching the children of farmers. He worked for the Agricultural Institute, and then at the University in Warsaw. He obtained his doctor’s degree in 1924, at age 30. In September 1925 he was sent to London for a year to learn about the latest developments in statistics from Karl Pearson himself. It was here that he met Egon Pearson, Karl’s son, and a friendship and scientific collaboration started.

Neyman always spent a lot of time teaching, often at the expense of doing scientific work. He was involved in equal opportunity education in 1918 in Poland, teaching in dimly lit classrooms where the rag he used to wipe the blackboard would sometimes freeze. He always had a weak spot for intellectuals from ‘disadvantaged’ backgrounds. He and his wife were themselves very poor until he moved to UC Berkeley in 1938. In 1929, back in Poland, his wife became ill due to their bad living conditions, and the doctor who came to examine her was so struck by their miserable living conditions that he offered to let the couple stay in his house, for the same rent they were already paying, while he visited France for six months. In his letters to Egon Pearson from this time, Neyman often complained that the struggle for existence took all his time and energy, and that he could not do any scientific work.

Even much later in his life, in 1978, he kept in mind that many people have very little money, and he called ahead to restaurants to make sure a dinner before a seminar would not cost too much for the students. It is perhaps no surprise that most of his students (and he had many) speak about Neyman with a lot of appreciation. He wasn’t perfect (for example, Erich Lehmann, one of Neyman's students, remarked that he was no longer allowed to teach a class after his own notes, which built on and extended Neyman's work, became extremely popular – suggesting Neyman was no stranger to envy). But his students were extremely positive about the atmosphere he created in his lab. For example, job applicants were told around 1947 that “there is no discrimination on the basis of age, sex, or race ... authors of joint papers are always listed alphabetically."

Neyman himself often suffered discrimination, sometimes because of his difficulty mastering the English language, sometimes for being Polish (when in Paris a piece of clothing, an ermine wrap, was stolen from their room, the police responded “What can you expect – only Poles live there!”), sometimes because he did not believe in God, and sometimes because his wife was Russian and very emancipated (living independently in Paris as an artist). He was fiercely against discrimination. In 1933, as anti-Semitism was on the rise among students at the university where he worked in Poland, he complained in a letter to Egon Pearson that the students were behaving toward Jews as Americans did toward people of color. In 1941 at UC Berkeley he hired women at a time when it was not easy for a woman to get a job in mathematics.

In 1942, Neyman examined the possibility of hiring David Blackwell, a Black statistician, then still a student. Neyman met him in New York (so that Blackwell did not need to travel to Berkeley at his own expense) and considered Blackwell the best candidate for the job. The wife of a mathematics professor (who was born in the south of the US) learned about the possibility that a Black statistician might be hired and warned that she would not invite a Black man to her house, and there was enough concern about the effect the hire would have on the department that Neyman could not make an offer to Blackwell. He was able to get Blackwell to Berkeley in 1953 as a visiting professor, and offered him a tenured job in 1954, making David Blackwell the first Black tenured faculty member at the University of California, Berkeley. And Neyman did this even though Blackwell was a Bayesian[v] ;).

In 1963, Neyman travelled to the south of the US and directly experienced segregation for the first time. Back in Berkeley, a letter was written with a request for contributions for the Southern Christian Leadership Conference (founded by Martin Luther King, Jr. and others), and 4000 copies were printed and shared with colleagues at the university and friends around the country, which brought in more than $3000. He wrote a letter to his friend Harald Cramér saying that he believed Martin Luther King, Jr. deserved a Nobel Peace Prize (which Cramér forwarded to the chairman of the Nobel Committee, and which he believed might have contributed at least a tiny bit to the fact that Martin Luther King, Jr. was awarded the Nobel Prize a year later). Neyman also worked towards the establishment of a Special Scholarships Committee at UC Berkeley with the goal of providing educational opportunities to disadvantaged Americans.

Neyman was not a pacifist. During the Second World War he actively looked for ways he could contribute to the war effort. He was involved in statistical models that computed the optimal spacing of bombs dropped by planes to clear a path across a beach of land mines. (When at a certain moment he needed specifics about the beach, a representative from the military who was not allowed to directly provide this information asked whether Neyman had ever been to the seashore in France, to which Neyman replied he had been to Normandy, and the representative answered “Then use that beach!”). But Neyman opposed the Vietnam War early and actively, despite the risk of losing lucrative contracts the Statistical Laboratory had with the Department of Defense. In 1964 he joined a group of people who bought advertisements in local newspapers with a picture of a napalmed Vietnamese child and the quote “The American people will bluntly and plainly call it murder”.


A positive role model


It is important to know the history of a scientific discipline. Histories are complex, and we should resist overly simplistic narratives. If your teacher explains frequentist statistics to you, it is good if they highlight that someone like Fisher had questionable ideas about eugenics. But the early developments in frequentist statistics involved many researchers beyond Fisher[vi], and, luckily, there are many more positive role models who also deserve to be mentioned - such as Jerzy Neyman. Even though Neyman’s philosophy of statistical inference forms the basis of how many scientists nowadays test hypotheses, his contributions and personal life are still often not discussed in histories of statistics - an oversight I hope the current blog post can somewhat mitigate. If you want to learn more about the history of statistics through Neyman’s personal life, I highly recommend the biography of Neyman by Constance Reid, which was the source for most of the content of this blog post.

 



[i] See Hacking, 1965: “The mature theory of Neyman and Pearson is very nearly the received theory on testing statistical hypotheses.”

[ii] The biography reveals that it was not Fisher but A. C. Aitken who reviewed the paper, and that the review was positive.

[iii] Clayton’s book seems to be mainly intended as an attempt to persuade readers to become a Bayesian, and not as an accurate analysis of the development of frequentist statistics.

[iv] William Gosset (or 'Student', from 'Student's t-test'), who was the main inspiration for the work by Neyman and Pearson, is another giant in frequentist statistics who does not in any way fit into the narrative that frequentist statistics is tied to eugenics, as his statistical work was motivated by applied research questions in the Guinness brewery. Gosset was a modest man – which is probably why he rarely receives the credit he is due.

[v] When asked about his attitude towards Bayesian statistics in 1979, he answered: “It does not interest me. I am interested in frequencies.” He did note that multiple legitimate approaches to statistics exist, and that the choice one makes is largely a matter of personal taste. Neyman opposed subjective Bayesian statistics because their use could lead to bad decision procedures, but he was very positive about later work by Wald, which inspired Bayesian statistical decision theory.

[vi] For a more nuanced summary of Fisher's life, see https://www.nature.com/articles/s41437-020-00394-6 

 

Wednesday, August 18, 2021

P-values vs. Bayes Factors

At the first partially in-person scientific meeting I attended after the start of the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and myself to engage in a discussion about p-values and Bayes factors. We each gave 15-minute presentations to set up our arguments, centered on three questions: what is the goal of statistical inference, what is the advantage of your approach in a practical/applied context, and when do you think the other approach may be applicable?

 

What is the goal of statistical inference?

 

When you browse through the latest issue of Psychological Science, you will see that many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science – there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data, so there is never a single goal in statistical inference – but if you browse through scientific journals, you will see that a large percentage of published articles have the goal to make one or more scientific claims.

 

Claims can be correct or incorrect. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high for scientific claims to be useful, and therefore scientists have developed somewhat more advanced methodological approaches to make claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you have engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum rate of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where people specified a smallest effect size of interest, and statistically rejected the presence of any worthwhile effect of bilingual advantage on children’s executive functioning based on language status in an equivalence test. They would make such a claim with a pre-specified maximum Type 1 error rate, the alpha level, often set to 5%. Formally, authors are saying “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerges that proves us wrong) assume our claim is correct”. Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.
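To make this logic concrete, here is a minimal sketch (in Python) of the kind of equivalence test alluded to above, using two one-sided tests (TOST). Everything in it (the simulated data, the smallest effect size of interest, the alpha level) is hypothetical and only meant to illustrate how a claim is tied to a pre-specified maximum error rate; it is not the analysis from that meta-analysis.

```python
# Minimal sketch of an equivalence test (TOST): reject the presence of any
# effect larger than a smallest effect size of interest (SESOI) at a
# pre-specified alpha level. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)   # hypothetical group 1
group_b = rng.normal(loc=0.05, scale=1.0, size=100)  # hypothetical group 2

alpha = 0.05   # pre-specified maximum Type 1 error rate
sesoi = 0.3    # smallest effect size of interest (raw difference, hypothetical)

n_a, n_b = len(group_a), len(group_b)
diff = group_a.mean() - group_b.mean()
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
df = n_a + n_b - 2

# Two one-sided tests: is the difference reliably above -SESOI and below +SESOI?
p_lower = stats.t.sf((diff + sesoi) / se, df)   # H0: diff <= -sesoi
p_upper = stats.t.cdf((diff - sesoi) / se, df)  # H0: diff >= +sesoi
p_tost = max(p_lower, p_upper)

if p_tost < alpha:
    print(f"Claim: no effect as large as {sesoi} (p = {p_tost:.3f})")
else:
    print(f"No claim: effects as large as {sesoi} cannot be rejected (p = {p_tost:.3f})")
```

The point is not the specific numbers, but that any researcher running this procedure with the same pre-specified bounds and alpha level would arrive at the same claim.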

 

The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead to the same claims by all scientists who evaluate the tests of these claims. Incorporating individual priors into statistical inferences, and making claims dependent on their prior belief, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to update individual beliefs, but they are not a useful tool to establish collectively agreed upon scientific claims.

 

What is the advantage of your approach in a practical/applied context?

 

A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but where we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach to making scientific claims is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occur to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become too small for the scientific community to consider it sensible to perform additional replication studies, based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, the field often resorts to rules or regulations. Just as we can debate whether it is sensible to allow people with a certain level of driving experience to drive 138 kilometers per hour on some stretches of road at some times of the day, such discussions are currently too complex to implement practically, and instead thresholds of 50, 80, 100, and 130 are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.000003 in physics to declare a discovery, or the two-study rule of the FDA).
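As a small, hypothetical illustration of how demanding such a design is, the required sample size can be computed with standard power software. The sketch below uses the statsmodels power module in Python; the effect size is made up, and the point is only that everyone can compute, and agree on, the error rates of the design.

```python
# Minimal sketch: sample size per group for 99.9% power at a 0.1% alpha level,
# for a hypothetical smallest effect size of interest (Cohen's d = 0.5).
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,          # smallest effect size of interest (hypothetical)
    alpha=0.001,              # maximum Type 1 error rate
    power=0.999,              # desired statistical power
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")
```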

 

Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a Bayes factor is larger than 10 or smaller than 0.1. This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonhard Held brought up these Bayesian/frequentist compromise methods as well – during coffee after our discussion, EJ and I agreed that we like those approaches, as they allow researchers to control frequentist error rates while interpreting the evidential value in the data – it is a win-win solution). This works by determining through simulations which test statistic should be used as a cut-off value to make claims. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, do it.
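Such a calibration can be sketched in a few lines: simulate data under the null hypothesis, compute the Bayes factor for each simulated dataset, and check how often the preregistered decision rule would lead to an erroneous claim (or pick the cut-off that keeps that rate at the desired alpha). The sketch below uses a simple normal-prior Bayes factor rather than the default Bayes factor of any particular software package, and all settings (sample size, prior width, number of simulations) are hypothetical.

```python
# Minimal sketch of a Bayesian/frequentist compromise: simulate Bayes factors
# under the null to find (a) the Type 1 error rate of a "BF > 10" rule and
# (b) the BF cut-off that would keep that error rate at 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50            # observations per group (hypothetical)
prior_sd = 1.0    # prior SD on the standardized effect under H1 (hypothetical)
n_sims = 20_000

def bayes_factor_10(x, y, prior_sd):
    """Simple analytic BF10, treating the observed standardized mean
    difference as approximately normally distributed."""
    d_hat = (x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    se = np.sqrt(2 / len(x))  # approximate SE of d for equal group sizes
    m1 = stats.norm.pdf(d_hat, 0, np.sqrt(prior_sd**2 + se**2))  # marginal under H1
    m0 = stats.norm.pdf(d_hat, 0, se)                            # likelihood under H0
    return m1 / m0

bfs_under_h0 = np.array([
    bayes_factor_10(rng.normal(size=n), rng.normal(size=n), prior_sd)
    for _ in range(n_sims)
])

print("Type 1 error rate of the 'BF > 10' rule:", np.mean(bfs_under_h0 > 10))
print("BF cut-off for a 5% Type 1 error rate:", np.quantile(bfs_under_h0, 0.95))
```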

 

In practice, an advantage of frequentist approaches is that criticism has to focus on the data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011) Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I strongly expect that most reasonable people would agree more with the prior chosen by Bem and colleagues than with the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed that, in hindsight, the prior in the main paper was not the best choice, but noted that the supplementary files included a sensitivity analysis demonstrating the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed way). More productively than discussing priors, data collected in direct replications since 2011 have consistently led to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).

 

When do you think the other approach may be applicable?

 

Even when, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful to make collectively agreed upon scientific claims, all scientists are Bayesians. First, we have to rely on our beliefs when we can not collect sufficient data to repeatedly test a prediction. When data is scarce, we can’t use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we can not be wrong. Incorrect priors can mislead, but if we know our priors are correct, even though this might be rare, we should use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.

 

Conclusion

 

In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion in the rest of the day was how we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet, a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to scientific claims that are incorrect too often. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.

Wednesday, May 26, 2021

Can joy and rigor co-exist in science?

This is a post-publication peer review of "Joy and rigor in behavioral science". A response by the corresponding author, Leslie John, is at the bottom of this post - make sure to read this as well. 

In a recent paper “Joy and rigor in behavioral science” https://doi.org/10.1016/j.obhdp.2021.03.002 Hanne Collins, Ashley Whillans, and Leslie John aim to examine the behavioral and subjective consequences of performing confirmatory research (e.g., a preregistered study). In their abstract they conclude from Study 1 that “engaging in a pre-registration task impeded the discovery of an interesting but non-hypothesized result” and from Study 2 that “relative to confirmatory research, researchers found exploratory research more enjoyable, motivating, and interesting; and less anxiety-inducing, frustrating, boring, and scientific.” An enjoyable talk about this paper is available at: https://www.youtube.com/watch?v=y31G63iw2xw.

I like meta-scientific work that examines the consequences of changes in scientific practices. It is likely that new initiatives (e.g., preregistration) will have unintended negative consequences, and describing these consequences will make it possible to prevent them through, for example, education. I also think it is important to examine what makes scientists more or less happy in their job (although in this respect, my prior is very low that topics such as preregistration explain a lot of variance compared to job uncertainty, stress, and a lack of work-life balance).

However, I am less confident in the insights this study provides than the authors suggest in their abstract and conclusion. First, and perhaps somewhat ironically, the authors base their conclusions from Study 1 on exploratory analyses that I am willing to bet are a Type 1 error (or maybe a confound), and are not strong enough to be taken seriously.

In Study 1, researchers are asked to go through a hypothetical research process in a survey. In this hypothetical study, they collect data on whether people do yoga on a weekly basis, how happy participants are today, and the gender of participants. There are three conditions: the study is preregistered (see the Figure below), preregistered with a message that researchers could still explore, or not preregistered. The authors counted how many of 7 possible analysis options were selected (including an ‘other’ option). The hypothesis is that if researchers explore more in non-preregistered analyses, they would select more of these 7 analysis options to perform in the hypothetical research project.


The authors write their interest is in whether “participants in the confirmation condition viewed fewer analyses overall and were less likely to view and report the results of the gender interaction”. This first analysis seems to be a direct test of a logical prediction. The second prediction is surprising. Why would researchers care about the results of a gender interaction? It turns out that this is the analysis where the authors have hidden a significant interaction that can be discovered through exploring. Of course, the participants do not know this.

The results showed the following: 

A negative binomial logistic regression (Hilbe, 2011) revealed no difference between conditions in the number of analyses participants viewed (M_exploration = 3.48, SD_exploration = 2.08; M_confirmation = 3.79, SD_confirmation = 1.99; M_hybrid = 3.67, SD_hybrid = 2.19; all ps ≥ 0.45). Of particular interest, we assessed between condition differences in the propensity to view the results of an exploratory interaction using binary logistic regressions. In the confirmation condition, 53% of participants viewed the results of the interaction compared with 69% in the exploration condition, b = 0.70, SE = 0.24, p = .01.

So the main result here is clear: there is no effect of confirmatory research on the tendency to explore. This is the main conclusion from this analysis. Then, the authors do something very weird. They analyze the item that, unbeknownst to participants, would have revealed a significant interaction. This is one of 7 options participants could click on. The difference they report (p = 0.01) is not significant if the authors correct for multiple comparisons [NOTE: The authors made the preregistration public after I wrote a draft of this blog https://aspredicted.org/dg9m9.pdf and this reveals they did a-priori plan to analyze this item separately – it nicely shows how preregistration allows readers to evaluate the severity of a test (Lakens, 2019), and this test was statistically more severe than I initially thought before I had access to the preregistration – I left this comment in the final version of the blog for transparency, and because I think it is a nice illustration of a benefit of preregistration]. But more importantly, there is no logic behind only testing this item. It is, from the perspective of participants, not special at all. They don’t know it will yield a significant result. Furthermore, why would we only care about exploratory analyses that yield a significant result? There are many reasons to explore, such as getting a better understanding of the data.
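For readers who want the arithmetic behind the multiple comparisons point: if one corrects across all seven analysis options (an assumption on my part about the relevant family of tests, and, as noted above, the preregistration specified this particular test a priori), a simple Bonferroni threshold is already stricter than the reported p-value.

```python
# Hypothetical Bonferroni correction across the seven analysis options
# (which comparisons belong in the family is debatable).
alpha = 0.05
n_comparisons = 7
print(alpha / n_comparisons)  # ~0.0071, stricter than the reported p = .01
```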

To me, this study nicely shows a problem with exploration. You might get a significant result, but you don’t know what it means, and you don’t know if you just fooled yourself. This might be noise. It might be something about this specific item (e.g., people realize that due to the CRUD factor, exploring gender interactions without a clear theory is uninteresting, as there are many uninteresting reasons you observe a significant effect). We don’t know what drives the effect on this single item.

The authors conclude “Study 1 provides an “existence proof” that a focus on confirmation can impede exploration”. First of all, I would like it if we banned the term ‘existence proof’ following a statistical test. We did not find a black swan feather, and we didn’t dig up a bone. We observed a p-value in a test that lacked severity, and we might very well be talking about noise. If you want to make a strong claim, we know what to do: Follow up on this study, and show the effect again in a confirmatory test. Results of exploratory analysis of slightly illogical predictions are not ‘existence proofs’. They are a pattern that is worth following up on, but that we can not make any strong claims about as it stands.

In Study 2 we get some insight into why Study 2 was not a confirmatory study replicating Study 1: performing confirmatory studies is quite enjoyable, interesting, and motivating – but it is slightly less so than exploratory work (see the Figure below). Furthermore, confirmatory tests are more anxiety-inducing. I remember this feeling very well from when I was a PhD student. We didn’t want to do a direct replication, because what if that exploratory finding from our previous study didn’t replicate? Then we could no longer make a strong claim based on the study we had. Furthermore, doing the same thing again, but better, is simply less enjoyable, interesting, and motivating. The problem in Study 2 is not in the questions that were asked, but in the questions that were not asked.

For example, the authors did not ask ‘How enjoyable is exploratory research when, after you have written up the exploratory finding, someone writes a blog post about how that finding does not support the strong claims you have made?’ Yet, the latter might get a lot more weight in the overall evaluation of the utility of performing confirmatory and exploratory research. Another relevant question is ‘How bad would you feel if someone tried to replicate your exploratory finding, but they failed, and they published an article that demonstrated your ‘existence proof’ was just a fluke?’ Another relevant question is ‘How enjoyable is it to see a preregistered prediction supported?’ or ‘How enjoyable are the consequences of providing strong support for your claims for where the paper is published, how often it is cited, and how seriously it is taken by academic peers?’ The costs and benefits of confirmatory studies are multifaceted. We should look not just at the utility of performing the actions, but at the utility of the consequences. I don’t enjoy doing the dishes, but I enjoy taking that time to call friends and being able to eat from a clean plate. A complete evaluation of the joy of confirmatory research needs to ask questions about all facets that go into the utility function.

To conclude, I like articles that examine the consequences of changes in scientific practice, but in this case I felt the conclusions were too far removed from the data. In the conclusion, the authors write “Like exploration, confirmation is integral to the research process, yet, more so than exploration, it seems to spur negative sentiment.” Yet, we could just as easily have concluded from the data that confirmatory and exploratory research are both enjoyable, given the means of 5.39 and 5.87 on a 7-point scale, respectively. If anything, I was surprised by how small the difference in effect size is (d = 0.33 for how enjoyable both are). Although the authors do not interpret the size of the effects in their paper, that was quite a striking conclusion for me – I would have predicted the difference in how enjoyable these two types of research are to perform would have been larger. The authors also conclude about Study 1 that “researchers who were randomly assigned to preregister a prediction were less likely to discover an interesting, non-hypothesized result.” I am not sure this is not just a Type 1 error, as the main analysis yielded no significant result, and my prior is very low that a manipulation that makes people less likely to explore would only make them less likely to explore the one item that, unbeknownst to the participants, would reveal a statistically significant interaction, as I don’t know which mechanism could cause such a specific effect. Instead of exploring this in hypothetical research projects in a survey, I would love to see an analysis of the number of exploratory analyses in published preregistered studies. I would predict that in real research projects, researchers report all the possibly interesting significant results they can find in exploratory analyses.

 

Leslie John’s response:

 

  1. Lakens characterizes Study 1 as exploratory, but it was explicitly a confirmatory study. As specified in our pre-registration, our primary hypothesis was “H1: We will test whether participants in a confirmatory mindset (vs. control) will be less likely to seek results outside of those that would confirm their prediction (i.e., to seek the results of an interaction, as opposed to simply the predicted main effect).” Lakens is accurate in noting that participants did not know a priori that the interaction was significant, but this is precisely our point: when people feel that exploration is off-limits, informative effects can be missed. Importantly, we also stress that a researcher shouldn’t simply send a study off for publication once s/he has discovered something new (like this interaction effect) through exploration; rather “exploration followed by rigorous confirmation is integral to scientific discovery” (p. 188). As a result, Lakens’ questions such as “How bad would you feel if someone tried to replicate your exploratory finding, but they failed, and they published an article that demonstrated your ‘existence proof’ was just a fluke” is not an event that researchers would need to worry about if they follow the guidance we offer in our paper (which underscores the importance of exploration followed by confirmation).

 

  2. Lakens raises additional research questions stemming from our findings about the subjective experience of conducting exploratory versus confirmatory work. We welcome additional research to better understand the subjective experience of conducting research in the reform era. We suspect that one reason that research in the reform era can induce anxiety is the fear of post-publication critiques, and the valid concern that such critiques will misrepresent both one’s findings as well as one’s motives for conducting the research. We are therefore particularly solicitous of research that speaks to making the research process, including post-publication critique, not only rigorous but joyful.