The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, October 3, 2024

Andrew Huberman vs. Decoding the Gurus (Metascience course assignment)

Below I am providing the first assignment in a new Metascience course I am teaching at Eindhoven University of Technology. The goal of the assignment is to teach students to critically evaluate claims scientists make. 


Assignment 1: Andrew Huberman vs. Decoding the Gurus

 

Andrew Huberman is an associate professor at Stanford University who hosts a popular podcast called ‘Huberman Lab’. His podcast is one of the most listened-to podcasts in the world, and he has more than 5 million subscribers on YouTube and more than 6 million followers on Instagram. He discusses science and science-based tools for everyday life, focusing on physical and mental health. Before starting the main part of this assignment, answer the following two questions.

 

Question 1:

a) Which factors increase your trust in Andrew Huberman as a reliable source of information on topics surrounding physical and mental health?

b) Which factors decrease your trust?

Feel free to use the internet to form an opinion.

 

Question 2:

On a scale from 1 (not at all reliable) to 10 (extremely reliable), how reliable do you consider Andrew Huberman to be as a source of information on topics surrounding physical and mental health?

 

As indicated on Wikipedia, Andrew Huberman’s podcast “has attracted criticism for promoting poorly supported health claims”. In this assignment, you will reflect on whether and why Andrew Huberman promotes poorly supported health claims. More generally, you will reflect on a number of factors that can help you to evaluate if information people provide about scientific findings is reliable.

 

The study material for this assignment is podcast episode 85 of “Decoding the Gurus” by Christopher Kavanagh and Matthew Browne called “Andrew Huberman and Peter Attia: Self-enhancement, supplements & doughnuts?” released on the 9th of November 2023.

You can listen to the episode here: https://decoding-the-gurus.captivate.fm/episode/andrew-huberman-and-peter-attia-optimising-your-pizza-binges. Note that most Decoding the Gurus episodes are very long. The section you need to listen to for this episode starts at 1 hour, 46 minutes, and 50 seconds. If you listen to the end, it will take 1 hour and 26 minutes. Before you listen, read through the questions you will have to answer about the podcast below (especially question 5).  

Although it is not necessary to read this information, the paper Huberman discusses is: https://www.biorxiv.org/content/10.1101/2022.07.15.500226v2. The paper was published two years later, but it is in a journal we do not have access to because the subscription fees are too high, so we cannot read the final version of the scientific research these authors did (a good reminder of why open access publication is important).

 

Question 3:

Which criteria for the quality of scientific research does Andrew Huberman rely on? In the episode he remarks how the study is not peer reviewed, and in other episodes he often discusses whether a study appeared in a peer reviewed journal (and sometimes if the journal is considered prestigious). Do you think this is a good criterion of scientific quality? Which aspects make this a good criterion? Which aspects do not make this a good criterion?

a) I believe the following aspects make this a good criterion:

b) I believe the following aspects do not make this a good criterion:

c) My overall evaluation about whether a study being peer reviewed or not is a good criterion for scientific quality is:

 

Question 4:

Another criterion Andrew Huberman uses to evaluate whether a finding can be trusted is if there are multiple published articles that show a similar effect. Which aspects make this a good criterion? Which aspects do not make this a good criterion? The section in the textbook on publication bias might help to reflect on this question: https://lakens.github.io/statistical_inferences/12-bias.html#sec-publicationbias

a) I believe the following aspects make this a good criterion:

b) I believe the following aspects do not make this a good criterion:

c) My overall evaluation about whether the presence of multiple studies in the literature is a good criterion for scientific quality is:

 

Question 5:

a) Which criticisms do Christopher Kavanagh and Matthew Browne raise of the study Huberman discusses?

b) Which criticisms do the podcast hosts raise about how Huberman presents the study?

c) Which warning signs about past studies by the same lab do the podcast hosts raise?

 

Question 6:

The podcast hosts discuss the ‘dead salmon’ study. I agree with podcast host Christopher Kavanagh that people interested in metascience should know about this study. It led to lasting changes in the data analysis of fMRI studies. A similar point was made in a full paper, which you can read here. The title of the paper is “Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition”. The original title of this paper when submitted to the journal was “Voodoo Correlations in Social Neuroscience”. The peer reviewers did not like this title, and the authors had to change it before publication, but it is still often referred to as the ‘voodoo correlations’ paper, together with the ‘dead salmon’ poster. Read through the study (which was presented as a poster at a conference, not as a full paper). It is not intended as a serious paper. What is the main point of the poster? A high-resolution version is available here.






Question 7:

Huberman discusses the power analysis of the study, but does not criticize it. Below, you can find the power analysis in the original study. The authors plan to detect an effect of d = 0.69, which is as large as the effect of reward learning observed in an earlier study (a small sketch of what such a power analysis implies for the required sample size follows after question 7b). The following two questions are difficult, and there is not a lot of accessible reading material in the literature yet to help you. Some information to help you can be found in https://lakens.github.io/statistical_inferences/06-effectsize.html#interpreting-effect-sizes and the references in that section.

a) How plausible do you think it is that the placebo effect would have an effect size as large as the effect for reward learning?

b) How large should an effect be for an individual to be aware of it?
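
For reference, the sample size implied by such a power analysis can be computed with base R’s power.t.test function. This is only a sketch, assuming a two-sided independent-samples t-test with an alpha level of 0.05 and 80% power – the study’s actual design and power target may have differed.

# Sketch: sample size needed to detect d = 0.69 (assumed: two-sided
# independent-samples t-test, alpha = .05, 80% power).
power.t.test(delta = 0.69, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# With sd = 1, 'delta' equals the standardized difference (Cohen's d);
# the output lists the required sample size per group.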

 

Question 8:

a) Do you think Andrew Huberman is overclaiming at the end of the podcast about possible applications of this effect? Is he overhyping it?

b) How do you think the studies should have been communicated to a general audience?

 

Question 9:

It is not possible to ask the following question in any other way than to make it a loaded question. It is clear what I think about this topic, as I chose to make this assignment. Nevertheless, feel free to disagree with my beliefs.

a) Is Andrew Huberman’s understanding of statistics (and of red flags when reading the results of a study) strong enough to adequately weigh the evidence in studies?

b) How well should science communicators be able to interpret the evidence underlying scientific claims in the literature, for example through adequate training in research methods and statistics?

c) How well should you be trained in research methods and statistics to be able to weigh the evidence in research yourself?

 

Question 10:

After completing the assignment, we will revisit question 2 by asking you once more: On a scale from 1 (not at all reliable) to 10 (extremely reliable), how reliable do you consider Andrew Huberman to be as a source of information on topics surrounding physical and mental health?

 

 

 

Further reading and listening

Additional episodes by Decoding the Gurus on Andrew Huberman:

Episode 81: Andrew Huberman: Forest Bathing in Negative Ions https://decoding-the-gurus.captivate.fm/episode/andrew-huberman-forest-bathing-in-negative-ions (This starts with some kind words about our own podcast, Nullius in Verba).

Episode 90: Mini-Decoding: Huberman on the Vaccine-Autism Controversy https://decoding-the-gurus.captivate.fm/episode/mini-decoding-huberman-on-vaccine-autism-controversy

In Dutch, see https://www.youtube.com/watch?v=KHnQK6wliJU for extra information.

 


Saturday, September 21, 2024

The distinction between logical justifications and empirical justifications: A reply to Prof. Hullman

In my previous blog post I explained why debates about whether or not we should preregister will not be solved through empirical means. This blog was inspired by a preprint by my good friend Leonie Dudda, which contains a scoping review of interventions to improve replicability and reproducibility. The scoping review finds that authors concluded interventions had positive effects in 60 out of 104 research conclusions (and negative effects in only 1). This is good news. At the same time, evidence for interventions was often lacking. I argued in my blog that this is not problematic for all interventions, as it would probably be better if my peers could provide a logical argument for why they preregister. I strongly stressed the importance of coherence – scientists should work in a way where their methods, aims of science, and the theories they work on are coherently aligned. I've taken this idea from Larry Laudan (see the figure below, from Laudan, 1986), who provides what I think is the best take on how we deal with disagreements in science. Scientists disagree about many things. Should we use Bayesian statistics, or frequentist statistics? The main thing I learned from discussing this on Twitter for a decade is that there is no universal answer. The only answer is conditional: if your aim is X, then given a valid and logical justification, you use method Y. Laudan refers to this as the triadic network of justification. And if you know me, you know I would like scientists to #JustifyEverything.



Prof Hullman read my blog, and in a blog post writes that she is left “with more questions than answers”. Because the point I was making is so important, I will explain it in more detail. Hullman titles her blog post ‘Getting a pass on evaluating ways to improve science’. One of the main points in my blog was to delineate when proposed improvements do not get a ‘free pass’ (i.e., when they need to be empirically justified). These are the improvements that are not ‘principled’ – they are not directly tied to an aim in science. I wrote “After the 2010 crisis in psychology, scientists did make changes to how they work. Some of these changes were principled, others less so. For example, badges were introduced for certain open science practices, and researchers implementing these open science practices would get a badge presented alongside their article. This was not a principled change, but a nudge to change behavior.” And then I said “And for some changes to science, such as the introduction of Open Science Badges, there might not be any logical justifications (or if they exist, I have not seen them). For those changes, empirical justifications are the only possibility.”

 

But some discussions we have in science are not empirical disagreements. They are disagreements in philosophy of science. Prof Hullman agrees with me that “Logic is obviously an important part of rigor, and I can certainly relate to being annoyed with the undervaluing of logic in fields where evidence is conventionally empirical” but thinks that my arguments for preregistration were not just logical arguments based on a philosophy of science. She writes “Beyond your philosophy of scientific progress, it comes down to the extent to which you think that scientists owe it to others to “prove” that they followed the method they said they did. It’s about how much transparency (versus trust) we feel we owe our fellow scientists, not to mention how committed we are to the idea that lying or bad behavior on the part of scientists are the big limiter of scientific progress.” The point Prof Hullman makes does not go ‘beyond’ philosophy of science, because whether and how much we should trust scientists is a core topic in social epistemology. It is part of your philosophy of science. As I wrote “This in itself is not a sufficient argument for preregistration, because there are many procedures that we could rely on. For example, we can trust scientists. If they do not say anything about flexibly analyzing their data, we can trust that they did not flexibly analyze their data. You can also believe that science should not be based on trust. Instead, you might believe that scientists should be able to scrutinize claims by peers, and that they should not have to take their word for it: Nullius in Verba. If so, then science should be transparent. You do not need to agree with this, of course”. So the point really is one about philosophy of science, and we can make certain logical arguments about which practices follow from certain philosophies, as I did in my blog post.

 

Prof Hullman writes “It reads a bit as if it’s a defense of preregistration, delivered with an assurance that this logical argument could not possibly be paralleled by empirical evidence: “A little bit of logic is worth more than two centuries of cliometric metatheory.” I am not providing a ‘defense’ of preregistration - that is not the right way to think about this topic. I simply pointed out that your aims can logically justify your methods. For example, if my aim is to generate knowledge by withstanding criticism, then I need to be transparent about what I have done. Note the ‘if-then’ relationship. One of my main points was to get empirical scientists to realize the difference between a logical justification and an empirical justification.

 

Then Prof Hullman makes a big, but very insightful, mistake. She writes “He argues that all rational individuals who agree with the premise (i.e., share his philosophical commitments) should accept the logical view, whereas empirical evidence has to be “strong enough” to convince and may still be critiqued. And so while he seems to start out by admitting that we’ll never know if science would be better if preregistration was ubiquitous, he ends up concluding that if one shares his views on science, it’s logically necessary to preregister for science to improve.” She confuses the two things my post set out to educate scientists about: there is a difference between implementing a change because you claim it will improve science, and implementing a change because it logically follows from assumptions. I guess I did not do a good job explaining the distinction.

 

As I wrote: “There are two ways to respond to the question why scientific practices need to change. The first justification is ‘because science will improve’. This is an empirical justification. The world is currently in a certain observable state, and if we change things about our world, it will be in a different, but better, observable state. The second justification is ‘because it logically follows’.” Hullman’s statement that “he ends up concluding that if one shares his views on science, it’s logically necessary to preregister for science to improve” is exactly what I was *not* saying. Let me explain this again, because if Prof Hullman did not understand this, others might be confused as well.

 

It is essential to distinguish a coherent way of working from a better way of working. Is it better to be a subjective Bayesian, ignore error control, and aim to update beliefs until all scientists rationally believe the same thing? Or is it better to be a frequentist, make error-controlled claims that you do or do not believe, and create a collection of severely tested claims? As we have seen in the last century, there will never be evidence about the question which of these two approaches is an improvement. An all-knowing entity could tell us, but as mere mortals, we are unable to answer the question of which of these approaches is an improvement.

 

If a subjective Bayesian changes their research practice from using a ‘default’ prior in their work, to actually quantifying their prior beliefs, their research practice becomes more *coherent*. After all, the aim is to update *your* beliefs, not some generic default belief. Maybe the use of subjective priors will slow down knowledge generation compared to the use of default priors. It might not be an improvement. But it is more coherent, and in the absence of having an empirical guarantee about the single best way to do science, my argument is that we should at least be coherent as scientists. We preregister because it makes our approach to science more *coherent*, and we evaluate coherence based on logical arguments, not based on empirical data.

 

In my blog, I wrote that empirical evidence can be useful to convince some people to implement policies. I think this section was too short to clearly explain my point. I say this because Prof Hullman writes “It strikes me as contradictory to say that it is a flaw that “Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments” while at the same time saying it’s ok to produce weak empirical evidence to convince some people.” In the comments to the blog post Prof Hullman writes “I suspect he knows that his logical argument is conditional on a lot of assumptions but he wants to sell it as something more universal. That would be one explanation for why he then seems to walk it back by adding the part about how empirical evidence sometimes has value.” Prof Hullman’s main worry about my blog post seems to be: “For example, is the implication that logical justification should be enough for journals to require preregistration to publish, or that lack of preregistration should be valid ground for rejecting a paper that makes claims requiring error control?” Because this last point is exactly what I argued against, I must not have explained myself clearly enough. Let’s try again.

 

A logical justification can never lead to a policy such as ‘require preregistration to publish’ or ‘a lack of preregistration is grounds for rejecting a paper’. Logical arguments as I discussed have a premise: ‘if you aim to do X’. All studies that do not aim to do X do not have to use method Y. My blog is not just a reminder of the importance of a coherent approach to science, but also a reminder for the people who do not want to preregister to develop a logically coherent ‘if-then’. What are your aims, if not to make error-controlled claims? Which methods are logically coherent with those aims? Write this down clearly, address criticism you will get from peers, sharpen your argument, implement your ideas formally in your papers, and you are all set and never have to worry about not preregistering. Just as I have developed a coherent argument for preregistration, tied to a specific philosophy of science over the last decade, you should – if you want to be taken seriously – have a well-developed alternative philosophy of why preregistration is not in line with your aims.

 

If policy makers were smart and rational they would create policies based on logical justifications where possible. Regrettably, policy makers are typically not very smart and rational. Here is the kind of policy I want to see: “If preregistration is a logically coherent step in your scientific method, we want you to implement it.” This is the same logically principled justification of a research practice as ‘if we think scientists should discover the truth, they should not lie’. The policy requires scientists to act in a logically coherent manner. In practice, this means that if you set an alpha level, control your type 2 error rate through a power analysis, and make claims based on statistical tests that have sufficiently low error rates, you have decided to adopt Mayo’s error-statistical philosophy of science. As I explained in my blog, if we add a second assumption to the aims of science, namely that the aim is to make claims that can withstand scrutiny by peers, then it logically follows that we adopt a procedure that enables scrutiny. Of course, as Laudan’s figure above illustrates, the methods we choose should ‘exhibit the realizability’ of our aims. If we believe it is important to scrutinize claims, but the only way to achieve it would be to have every scientist in the world wear a body-cam, and we all watch all footage related to a study before believing the claim, the aim of scrutiny might not be ‘realizable’. But preregistration can be implemented in practice, so the method and aim can be aligned in practice.

 

I would hope that if scientists embrace my view that there is a distinction between logical justifications for preregistration and empirical justifications for preregistration, they will actually gain a very strong argument to push back against the universal implementation of preregistration. All you need to do is pursue different aims than error-controlled claims, or develop a different coherent approach to scientific knowledge generation than the dominant approach we now see in psychology based on Mayo’s error-statistical framework, and any rational editor should accept your arguments.

 

Now, I did not want to completely dismiss empirical research on the consequences of interventions to improve science. I said it could be useful for implementing policies. I wrote: “I think [empirical] work can be valuable, and it might convince some people, and it might even lead to a sufficient evidence base to warrant policy change by some organizations. After all, policies need to be set anyway, and the evidence base for most of the policies in science are based on weak evidence, at best.” But this short section led to confusion.

 

Let me make some things clearer. First, in this example, I am not talking about the specific intervention to adopt preregistration. A policy about preregistration can be implemented based on logical arguments, and if it is implemented, it should be implemented as I stated above: “If you aim to do X, and you believe a principle in science is Y, you need to preregister”. But there are many policies that need to be set for everyone, regardless of their philosophy of science. An example would be the implementation of badges, which, as I mentioned in my blog, cannot be justified logically. Furthermore, badges apply to every article in a journal. You get a preregistration badge, or not. Although in principle we could have a badge for preregistration, a badge for a logically coherent argument why you do not need preregistration, and no badge, this would go beyond the simple nudge idea behind badges. Empirical data can be useful if researchers want to convince editors to implement badges. Prof Hullman writes “It strikes me as contradictory to say that it is a flaw that “Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments” while at the same time saying it’s ok to produce weak empirical evidence to convince some people.” She does not summarize what I wrote correctly. I do not say it is ‘ok to produce weak empirical evidence to convince some people’. I am simply saying this is how some people choose to go about things, and in the absence of strong empirical evidence, and given the political interests that some scientists have, they will use empirical arguments to convince others, and that can work. I much prefer a logical basis for policies, and I prefer not to engage in policies that do not have a logical basis (for that reason, I also do not like open science badges). But often, such a logical basis is not available, strong evidence is not available, and there are people who want to change the status quo.

 

My blog had the goal of making scientists aware of the possibility of developing logical arguments – given some premises – for preregistration. I think convincing logical arguments exist, and I have developed them for one (arguably the dominant) error-statistical philosophy in my own discipline. A lack of evidence for preregistration is not problematic, and if you ask me, realistically we should not expect it to emerge. Anyone who carefully reads my blog will see it provides ammunition for scientists to fight back against exactly the overgeneralized policies Prof Hullman is worried about (i.e., that you need to preregister to get published). The ‘free pass’ we should be worried about in science is not the absence of empirical data, but the absence of a logical argument.

Wednesday, September 4, 2024

Why I don’t expect to be convinced by evidence that scientific reform is improving science (and why that is not a problem)

For roughly a decade there has been sufficient momentum in science to not just complain about things scientists do wrong, but to actually do something about it. When social psychologists declared a replication crisis in the 1960s and 1970s, not much changed (Lakens, 2023). They also complained about bad methodology, flexibility in the data analysis, and a lack of generalizability and applicability, but no concrete actions to improve things emerged from this crisis.

 

After the 2010 crisis in psychology, scientists did make changes to how they work. Some of these changes were principled, others less so. For example, badges were introduced for certain open science practices, and researchers implementing these open science practices would get a badge presented alongside their article. This was not a principled change, but a nudge to change behavior. There were also more principled changes. For example, if researchers say they make error-controlled claims at a 5% alpha level, they should make error controlled claims at a 5% alpha level, and they should not engage in research practices that untransparently inflate the Type 1 error rate. The introduction of a practice such as preregistration had the goal to prevent untransparently inflating Type 1 error rates, by making any possible inflation transparent. This is a principled change because it increases the coherence of research practices.
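
To make concrete what ‘untransparently inflating the Type 1 error rate’ means, here is a minimal simulation sketch in R with made-up numbers (two groups of 30, no true effect, three independent outcome measures, and only the smallest p value is reported). The nominal 5% error rate roughly triples:

# Sketch: flexibly picking the 'best' of three outcomes inflates the
# Type 1 error rate (assumed: two groups of n = 30, no true effect).
set.seed(1)
false_positive <- replicate(10000, {
  p <- replicate(3, t.test(rnorm(30), rnorm(30))$p.value)
  min(p) < 0.05
})
mean(false_positive)  # approximately 0.14 instead of the nominal 0.05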

 

As these changes in practices became more widely adopted, a large group of researchers was confronted with requirements such as having to justify their sample size, indicate whether they deserved an open science badge, or make explicit that a claim was exploratory (i.e., not error controlled). As more people were confronted with these changes, the absolute number of people critical of these changes increased. A very reasonable question to ask as a scientist is ‘Why?’, and so people asked: ‘Why should I do this new thing?’.

 

There are two ways to respond to the question why scientific practices need to change. The first justification is ‘because science will improve’. This is an empirical justification. The world is currently in a certain observable state, and if we change things about our world, it will be in a different, but better, observable state. The second justification is ‘because it logically follows’. This is, not surprisingly, a logical argument. There is a certain way of working that is internally inconsistent, and there is a way of working that is consistent.

 

An empirical justification requires evidence. A logical justification requires agreement with a principle. If we want to justify preregistration empirically, we need to provide evidence that it improved science. If you want to disagree with the claim that preregistration is a good idea, you need to disagree with the evidence. If we want to justify preregistration logically, we need people to agree with the principle that researchers should be able to transparently evaluate how coherently their peers are acting (e.g., that they are not saying they are making an error-controlled claim, when in actuality they did not control their error rate).

 

Why evidence for better science is practically impossible.

Although it is always difficult to provide strong evidence for a claim, some things are more difficult to study than others. Providing evidence that a change in practice improves science is so difficult, it might be practically impossible. Paul Meehl, one of the first meta-scientists, developed the idea of cliometric meta-theory, or the empirical investigation of which theories are doing better than others. He proposes to follow different theories for something like 50 years, and see which one leads to greater scientific progress. If we want to provide evidence that a change in practice improves science, we need something similar. So, the time scale we are talking about makes the empirical study of what makes science ‘better’ difficult.

But we also need to collect evidence for a causal claim, which requires excluding confounders. A good start would be to randomly assign half of the scientists to preregister all their research for the next fifty years, and order the other half not to. This is the second difficulty: such an experiment is practically impossible, so we cannot go beyond observational data, which will always have confounds. But even if we were able to manipulate something, the assumption that the control condition is not affected by the manipulation is too likely to be violated. The people who preregister will – if they preregister well – have no flexibility in the data analysis, and their alpha levels are controlled. But the people in the control condition know about preregistration as well. After p-hacking their way to a p = 0.03 in Study 1, p = 0.02 in Study 2, and p = 0.06 (marginally significant) in Study 3, they will look at their studies and wonder if their peers will take this set of studies seriously. Probably not. So, they develop new techniques to publish evidence for what they want to be true – for example by performing large studies with unreliable measures and a tiny sprinkle of confounds, which consistently yield low p-values.

So after running such studies for 50 years each, we end up with evidence that is not particularly difficult to poke holes in. We would have invested a huge amount of effort in something we should know from the outset will yield very little gain.

 

As we wrote in our recent paper “The benefits of preregistration and Registered Reports” (Lakens et al., 2024):

 

It is difficult to provide empirical support for the hypothesis that preregistration and Registered Reports will lead to studies of higher quality. To test such a hypothesis, scientists should be randomly assigned to a control condition where studies are not preregistered, a condition where researchers are instructed to preregister all their research, and a condition where researchers have to publish all their work as a Registered Report. We would then follow the success of theories examined in each of these three conditions in an approach Meehl (2004) calls cliometric metatheory by empirically examining which theories become ensconced, or sufficiently established that most scientists consider the theory as no longer in doubt. Because such a study is not feasible, causal claims about the effects of preregistration and Registered Reports on the quality of research are practically out of reach.

 

At this time, I do not believe there will ever be sufficiently conclusive empirical evidence for causal claims that a change in scientific practice makes science better. You might argue that my bar for evidence is too high. That conclusive empirical evidence in science is rarely possible, but that we can provide evidence from observational studies – perhaps by attempting to control for the most important confounds, measuring decent proxies of ‘better science’ on a shorter time scale. I think this work can be valuable, and it might convince some people, and it might even lead to a sufficient evidence base to warrant policy change by some organizations. After all, policies need to be set anyway, and the evidence base for most of the policies in science are based on weak evidence, at best.

 

A little bit of logic is worth more than two centuries of cliometric metatheory.

 

Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments. We published the nine studies on precognition by Daryl Bem because they followed standard empirical methods and yielded significant p values, even when one of the reviewers pointed out that the paper should be rejected because it logically violated the laws of physics. Psychologists too often assign more weight to a p value than to logical consistency.

And yet, a little bit of logic will often yield much greater returns, with much less effort. A logical justification of preregistration does not require empirical evidence. It just needs to point out that it is logically coherent to preregister. Logical propositions have premises and a conclusion: If X, then Y.

In meta-science, logical arguments are of the form ‘if we have the goal to generate knowledge following a certain philosophy of science, then we need to follow certain methodological procedures’. For example, if you think it is a fun idea to take Feyerabend seriously and believe that science progresses in a system that cannot be captured by any rules, then anything goes. Now let’s try a premise that is not as stupid as the one proposed by Feyerabend, and entertain the idea that some ways of doing science are better than others. For example, you might believe that scientists generate knowledge by making statistical claims (e.g., ‘we reject the presence of a correlation larger than r = 0.1’) that are not too often wrong. If this aligns with your philosophy of science, you might think the following proposition is valid: ‘If a scientist wants to generate knowledge by making statistical claims that are not too often wrong, then they need to control their statistical error rates’. This puts us in Mayo’s error-statistical philosophy. We can change the previous proposition, which was written at the level of the individual scientist, if we believe that science is not an individual process, but a social one. A proposition that is more in line with a social epistemological perspective would be: “If the scientific community wants to generate knowledge by making statistical claims that are not too often wrong, then they need to have procedures in place to evaluate which claims were made by statistically controlling error rates”.

 

This in itself is not a sufficient argument for preregistration, because there are many procedures that we could rely on. For example, we can trust scientists. If they do not say anything about flexibly analyzing their data, we can trust that they did not flexibly analyze their data. You can also believe that science should not be based on trust. Instead, you might believe that scientists should be able to scrutinize claims by peers, and that they should not have to take their word for it: Nullius in Verba. If so, then science should be transparent. You do not need to agree with this, of course, just as you did not have to agree with the premise that the goal of science is to generate claims that are not too often wrong. If we include this premise, we get the following proposition: “If the scientific community wants to generate knowledge by making statistical claims that are not too often wrong, and if scientists should be able to scrutinize claims by peers, then they need to have procedures in place for peers to transparently evaluate which claims were made by statistically controlling error rates”.

Now we have a logical argument for preregistration as one change in the way scientists work, because it makes it more coherent. Preregistration is not the only possible change to make science coherent. For example, we could also test all hypotheses in the presence of the entire scientific community, for example by live-streaming and recording all research that is being done. This would also be a coherent improvement to how scientists work, but it would also be more cumbersome. The hope is that preregistration, when implemented well, is a more efficient change to make science more coherent.

 

Should logic or evidence be the basis of change in science?

 

Which of the two justifications for changes in scientific practice is more desirable? A benefit of evidence is that it can convince all rational individuals, as long as it is strong enough. But evidence can be challenged, especially when it is weak. This is an important feature of science, but when disagreements about the evidence base cannot be resolved, it quickly leads to ‘even the experts do not agree about what the data show’. A benefit of logic is also that it should convince rational individuals, as long as they agree with the premise. But not everyone will agree with the premise. Again, this is an important feature of science. It might be a personal preference, but I actually like disagreements about the premises of what the goals of science are. Where disagreements about evidence are temporarily acceptable, but in the long run undesirable, disagreements about the goals of science are good for the diversity in science. Or at least that is a premise I accept.

 

As I see it, the goal should not be to convince people to implement certain changes to scientific practice per se, but to get scientists to behave in a coherent manner, and to implement changes to their practice if this makes their practice more coherent. Whether practices are coherent or not is unrelated to whether you believe practices are good, or desirable. Those value judgments are part of your decision to accept or reject a premise. You might think it is undesirable that scientists make claims, as this will introduce all sorts of undesirable consequences, such as confirmation bias. Then, you would choose a different philosophy of science. That is fine, as long as you then implement research practices that logically follow from the premises. Empirical research can guide you towards or away from accepting certain premises. For example, meta-scientists might describe facts that make you believe scientists are extremely trustworthy, and transparency is not needed. Meta-scientists might also point out ways in which research practices are not coherent with certain premises. For example, if we believe transparency is important, but most researchers selectively publish results, then we have identified an incoherency that we might need to educate people about, or we need to develop ways for researchers to resolve this incoherency (such as developing preprint servers that allow researchers to share all results with peers). And for some changes to science, such as the introduction of Open Science Badges, there might not be any logical justifications (or if they exist, I have not seen them). For those changes, empirical justifications are the only possibility.

 

Conclusion

 

As changes to scientific practice become more institutionalized, it is only fair that researchers ask why these changes are needed. There are two possible justifications: One based on empirical evidence, and one on logically coherent procedures that follow from a premise. Psychologists might intuitively believe that empirical evidence is the better justification for a practice. I personally doubt it. I think logical arguments will often provide a stronger foundation, especially when scientific evidence is practically difficult to collect.

Tuesday, July 23, 2024

New paper: The benefits of preregistration and Registered Reports.

With my PhD students Cristian Mesquida and Sajedeh Rasti, and former lab visitor Max Ditroilo, we published a new paper on preregistration and Registered Reports. We aim to provide a state-of-the-art overview of the ideas behind, and the metascience on, preregistration and Registered Reports. https://www.tandfonline.com/doi/full/10.1080/2833373X.2024.2376046

We explain the link between preregistration and severe testing, and how systematic bias might reduce the severity of tests. Preregistration is a tool to allow others to evaluate the severity of tests.

We provide and defend a narrower use case of preregistration. In essence, we argue you can only preregister level 6 and level 5 studies from the table in the Peer Community In guide for authors: https://rr.peercommunityin.org/help/guide_for_authors



We deviate from the current consensus, but in the conviction that our use of the term preregistration is more principled, and will become the default in the future (just as the Preregistration+ badge would be seen as the only valid form of preregistration today). As our understanding changes, so do our definitions.


We summarize 18 surveys on research practices that reduce the severity of tests. You might have seen a previous version of this figure – this is the final published version, in case you want to re-use or cite it. More details on the studies in this figure are available from https://osf.io/sxg7q.



We carefully point out: “It is important to point out that the percentages presented here do not directly translate into the percentage of researchers who are engaging in these practices.” We wish we knew, but we just don’t know. 

We discuss cost-benefit analyses of preregistration, and conclude there are too many unknowns to determine if preregistration is beneficial. We also say it does not really matter, because the main reason to preregister is based on a normative argument.

We say: “researchers who test hypotheses from a methodological falsificationist approach to science should preregister their studies if they want a science that has intersubjectively established severely tested claims.” As always, we believe it is essential to be clear about your philosophy on scientific knowledge generation - not being clear about it can lead to a lot of discussion that will go nowhere (see Lakens, 2019).  

That means we also do not expect people who have different epistemological philosophies to preregister – nor is it a logical solution for exploratory research, or certain types of secondary data analysis. We feel it is important to point this out, because there are alternative approaches to argue a test is severe that are better suited for those studies: open lab notebooks, sensitivity analyses, robustness checks, independent replication. It is always important to use the right tool for the job - we do not want preregistration to be mindlessly overused. 

We discuss meta-scientific evidence that shows preregistration makes it possible to evaluate the severity of tests (and we cite some anecdotal examples). Of course, not all preregistrations are equally good yet – people need more training. 

We also engage with the most important criticisms of preregistration. Beyond the valid concern that the mere presence of a preregistration may be mindlessly used as a proxy for high quality, we identify conflicting viewpoints, several misunderstandings, and a general lack of empirical support for the criticisms that have been raised. I personally feel critics need to raise the bar if they want to be taken seriously. They should at the very least resolve the contradictory criticisms among themselves. They should also collect empirical data to test their claims.

I strongly expect this fourth paper (following Nosek & Lakens, 2014, Lakens, 2019, and Lakens, 2024) to be my last contribution to this topic. I have said all I want to say, and contributed all I can with this final paper.

 

Friday, February 9, 2024

Why Effect Sizes Selected for Significance are Inflated

Estimates based on samples from the population will show variability. The larger the sample, the closer our estimates will be to the true population values. Sometimes we will observe larger estimates than the population value, and sometimes we will observe smaller values. As long as we have an unbiased collection of effect size estimates, combining effect size estimates through a meta-analysis can increase the accuracy of the estimate. Regrettably, the scientific literature is often biased. Specifically, it is common that statistically significant studies (e.g., studies with p values smaller than 0.05) are published, while studies with p values larger than 0.05 remain unpublished (Ensinck & Lakens, 2023; Franco et al., 2014; Sterling, 1959). Instead of having access to all effect sizes, anyone reading the literature only has access to effects that passed a significance filter. This will introduce systematic bias in our effect size estimates.

To explain how selection for significance introduces bias, it is useful to understand the concept of a truncated or censored distribution. If we want to measure the average height of people in The Netherlands, we would collect a representative sample of individuals, measure how tall they are, and compute the average score. If we collect sufficient data, the estimate will be close to the true value in the population. However, if we collect data from participants who are on a theme park ride where people need to be at least 150 centimeters tall to enter, the mean we compute is based on a truncated distribution where only individuals taller than 150 cm are included. Shorter individuals are missing. Imagine we have measured the height of two individuals on the theme park ride, and they are 164 and 184 cm tall. Their average height is (164+184)/2 = 174 cm. Outside the entrance of the theme park ride is one individual who is 144 cm tall. Had we measured this individual as well, our estimate of the average height would be (144+164+184)/3 = 164 cm. Removing low values from a distribution leads to overestimation of the true value; removing high values leads to underestimation.
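
The same point can be shown with a short simulation. The numbers below are made up purely for illustration (a true mean height of 160 cm with a standard deviation of 15 cm, truncated at 150 cm):

# Sketch: removing all values below a cutoff inflates the estimated mean.
# Illustrative values: true mean 160 cm, SD 15 cm, cutoff 150 cm.
set.seed(1)
height <- rnorm(100000, mean = 160, sd = 15)
mean(height)                 # close to the true value of 160
mean(height[height >= 150])  # clearly higher: the truncated mean is inflated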

The scientific literature suffers from publication bias. Non-significant test results – based on whether a p value is smaller than 0.05 or not – are often less likely to be published. When an effect size estimate is 0 the p value is 1. The further removed effect sizes are from 0, the smaller the p value. All else being equal (e.g., studies have the same sample size, and measures have the same distribution and variability), if results are selected for statistical significance (e.g., p < .05) they are also selected for larger effect sizes. As small effect sizes will be observed with their corresponding probabilities, their absence will inflate effect size estimates. Every study in the scientific literature provides its own estimate of the true effect size, just as every individual provides their own estimate of the average height of people in a country. When these estimates are combined – as happens in meta-analyses in the scientific literature – the meta-analytic effect size estimate will be biased (or systematically different from the true population value) whenever the distribution is truncated. To achieve unbiased estimates of population values when combining individual studies in meta-analyses, researchers need access to the complete distribution of values – that is, all studies that are performed, regardless of whether they yielded a p value above or below 0.05.
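
A small simulation sketch shows the same mechanism for effect sizes. It assumes the scenario used in the figure discussed below (a true effect of d = 0.5, a two-sided t-test, and 50 observations per group), and compares the average observed effect size across all simulated studies with the average across only the statistically significant studies:

# Sketch: selecting studies with p < .05 inflates the average effect size.
# Assumed scenario: true d = 0.5, two-sided t-test, n = 50 per group.
set.seed(1)
sim <- replicate(10000, {
  x <- rnorm(50, mean = 0.5); y <- rnorm(50, mean = 0)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d
  c(d = d, sig = t.test(x, y)$p.value < 0.05)
})
mean(sim["d", ])                   # close to the true value of 0.5
mean(sim["d", sim["sig", ] == 1])  # noticeably larger: only significant studies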

In the figure below we see a distribution centered at an effect size of Cohen’s d = 0.5 for a two-sided t-test with 50 observations in each independent condition. Given an alpha level of 0.05, in this test only effect sizes larger than d = 0.4 will be statistically significant (i.e., all observed effect sizes in the grey area). The threshold at which observed effect sizes become statistically significant is determined by the sample size and the alpha level (and is not influenced by the true effect size). The white area under the curve illustrates Type 2 errors – non-significant results that will be observed if the alternative hypothesis is true. If researchers only have access to the effect size estimates in the grey area – a truncated distribution where non-significant results are removed – a weighted average effect size from only these studies will be upwardly biased.


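The d = 0.4 threshold in this example can be reconstructed from the critical t value; a minimal sketch for this design (two-sided test, alpha = 0.05, n = 50 per group):

# Sketch: the smallest observed Cohen's d that reaches p < .05 depends only
# on the alpha level and the sample size, not on the true effect size.
n <- 50
t_crit <- qt(0.975, df = 2 * n - 2)  # critical t value, two-sided alpha = .05
d_crit <- t_crit * sqrt(2 / n)       # convert the critical t to Cohen's d
d_crit                               # approximately 0.40
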
We can see this in the two forest plots visualizing meta-analyses below. In the top meta-analysis all 5 studies are included, even though studies C and D yield non-significant results (as can be seen from the fact that their 95% CIs overlap with 0). The estimated effect size based on all 5 studies is d = 0.4. In the bottom meta-analysis the two non-significant studies are removed, as would happen when there is publication bias. Without these two studies the estimated effect size in the meta-analysis, d = 0.5, is inflated. The extent to which meta-analytic estimates are inflated depends on the true effect size and the sample size of the studies.

 


The inflation will be greater the larger the truncated part of the distribution is, and the closer the true population effect size is to 0. In our example about the height of individuals, the inflation would be greater had we truncated the distribution by removing everyone shorter than 170 cm instead of 150 cm. If the true average height of individuals was 194 cm, removing the few people that are expected to be shorter than 150 cm (based on the assumption of normally distributed data) would have less of an effect on how much our estimate is inflated than when the true average height was 150 cm, in which case we would remove 50% of individuals. In statistical tests where results are selected for significance at a 5% alpha level, more data will be removed if the true effect size is smaller, but also when the sample size is smaller. If the sample size is smaller, statistical power is lower, and more of the values in the distribution (those closest to 0) will be non-significant.
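
A simulation sketch can make this dependence concrete by computing the average statistically significant effect size for a few illustrative combinations of true effect size and per-group sample size (the grid values are made up):

# Sketch: the average significant effect size for several true effect sizes
# and per-group sample sizes (illustrative grid; two-sided alpha = .05).
set.seed(1)
mean_sig_d <- function(true_d, n, nsim = 5000) {
  d_sig <- replicate(nsim, {
    x <- rnorm(n, mean = true_d); y <- rnorm(n, mean = 0)
    d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
    if (t.test(x, y)$p.value < 0.05) d else NA
  })
  mean(d_sig, na.rm = TRUE)
}
mean_sig_d(0.2, 20); mean_sig_d(0.2, 100)  # small true effect: large inflation
mean_sig_d(0.5, 20); mean_sig_d(0.5, 100)  # inflation shrinks as power grows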

Any single estimate of a population value will vary around the true population value. The effect size estimate from a single study can be smaller than the true effect size, even if studies have been selected for significance. For example, it is possible that the true effect size is 0.5, you have observed an effect size of 0.45, and only effect sizes smaller than 0.4 are truncated when selecting studies based on statistical significance (as in the figure above). At the same time, this single effect size estimate of 0.45 is inflated. What inflates the effect size is the long-run procedure used to generate the value. In the long run, effect size estimates based on a procedure where estimates are selected for significance will be upwardly biased. This means that a single observed effect size of d = 0.45 will be inflated if it is generated by a procedure where all non-significant effects are truncated, but it will be unbiased if it is generated from a distribution where all observed effect sizes are reported, regardless of whether they are significant or not. This also means that a single researcher cannot guarantee that the effect sizes they contribute to a literature will contribute to an unbiased effect size estimate: there needs to be a system in place where all researchers report all observed effect sizes to prevent bias. An alternative is to not have to rely on other researchers, and collect sufficient data in a single study to have a highly accurate effect size estimate. Multi-lab replication studies are an example of such an approach, where dozens of researchers together collect a large number (up to thousands) of observations.

The most extreme consequence of the inflation of effect size estimates occurs when the true effect size in the population is 0, but due to the selection of statistically significant results, only significant effects in the expected direction are published. Note that if all significant results are published (and not only effect sizes in the expected direction), 2.5% of studies will yield a Type 1 error in the positive direction, 2.5% will yield a Type 1 error in the negative direction, and the average effect size would actually be 0. Thus, as long as the true effect size is exactly 0, and all Type 1 errors are published, the effect size estimate would be unbiased. In practice, we see that scientists often do not simply publish all results, but only statistically significant results in the desired direction. An example of this is the literature on ego depletion, where hundreds of studies were published, most showing statistically significant effects, but unbiased large-scale replication studies revealed effect sizes of 0 (Hagger et al., 2015; Vohs et al., 2021).
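
A sketch of this scenario: with a true effect of zero and n = 50 per group (assumed values), averaging all significant results gives an estimate near zero, whereas averaging only the significant results in the positive direction yields a substantial effect where none exists:

# Sketch: true effect of 0. Publishing all significant results averages out;
# publishing only positive significant results creates a spurious effect.
set.seed(1)
sim <- replicate(20000, {
  x <- rnorm(50, mean = 0); y <- rnorm(50, mean = 0)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  c(d = d, sig = t.test(x, y)$p.value < 0.05)
})
d <- sim["d", ]; sig <- sim["sig", ] == 1
mean(d[sig])          # close to 0: positive and negative Type 1 errors cancel
mean(d[sig & d > 0])  # roughly 0.45 to 0.5: a sizeable effect that is not there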

What can be done about the problem of biased effect size estimates if we mainly have access to the studies that passed a significance filter? Statisticians have developed approaches to adjust biased effect size estimates by taking the truncated distribution into account (Taylor & Muller, 1996). This approach has recently been implemented in R (Anderson et al., 2017). Implementing this approach in practice is difficult, because we never know for sure if an effect size estimate is biased, and if it is biased, how much bias there is. Furthermore, selection based on significance is only one source of bias; researchers who selectively report significant results may also engage in additional problematic research practices, such as selectively reporting outcomes and analyses, which are not accounted for in the adjustment. Other researchers have referred to this problem as a Type M error (Gelman & Carlin, 2014; Gelman & Tuerlinckx, 2000) and have suggested that researchers always report the average inflation factor of effect sizes. I do not believe this approach is useful. The Type M error is not an error, but a bias in estimation, and it is more informative to compute the adjusted estimate based on a truncated distribution, as proposed by Taylor and Muller in 1996, than to compute the average inflation for a specific study design. If effects are on average inflated by a factor of 1.3 (the Type M error), it does not mean that the observed effect size is inflated by this factor, and the truncated effect size estimator by Taylor and Muller will provide researchers with an actual estimate based on their observed effect size. Type M errors might have a function in education, but they are not useful for scientists (I will publish a paper on Type S and M errors later this year, explaining in more detail why I think neither is a useful concept).
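
For completeness, the average inflation factor for a given design can be approximated by simulation. The sketch below uses illustrative values (a true d = 0.3 and n = 50 per group) and computes the ratio between the average significant effect size and the true effect size; it is meant to illustrate the concept, not to reproduce the exact procedures proposed by Gelman and Carlin or by Taylor and Muller:

# Sketch: average inflation of significant effect sizes for one design
# (illustrative values: true d = 0.3, n = 50 per group, two-sided alpha = .05).
set.seed(1)
d_sig <- replicate(20000, {
  x <- rnorm(50, mean = 0.3); y <- rnorm(50, mean = 0)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  if (t.test(x, y)$p.value < 0.05) abs(d) else NA
})
mean(d_sig, na.rm = TRUE) / 0.3  # average inflation factor, roughly 1.7 here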

Of course, the real solution to bias in effect size estimates due to significance filters that lead to truncated or censored distributions is to stop selectively reporting results. Designing highly informative studies that have high power both to reject the null hypothesis and to reject the smallest effect size of interest in an equivalence test is a good starting point. Publishing research as Registered Reports is even better. Eventually, if we do not solve this problem ourselves, it is likely that we will face external regulatory actions that force us to add all studies that have received ethical review board approval to a public registry, and to update the registration with the effect size estimate, as is done for clinical trials.


References:

Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Ensinck, E., & Lakens, D. (2023). An Inception Cohort Study Quantifying How Many Registered Studies are Published. PsyArXiv. https://doi.org/10.31234/osf.io/5hkjz

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. https://doi.org/10.1126/SCIENCE.1255484

Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651.

Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15(3), 373–390. https://doi.org/10.1007/s001800000040

Hagger, M. S., Chatzisarantis, N. L., Alberts, H., Anggono, C. O., Batailler, C., Birt, A., & Zwienenberg, M. (2015). A multi-lab pre-registered replication of the ego-depletion effect. Perspectives on Psychological Science, 2.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—Or vice versa. Journal of the American Statistical Association, 54(285), 30–34. JSTOR. https://doi.org/10.2307/2282137

Taylor, D. J., & Muller, K. E. (1996). Bias in linear model power and sample size calculation due to estimating noncentrality. Communications in Statistics-Theory and Methods, 25(7), 1595–1610. https://doi.org/10.1080/03610929608831787

Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J. L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi, J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay, S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733