Wednesday, September 4, 2024

Why I don’t expect to be convinced by evidence that scientific reform is improving science (and why that is not a problem)

For roughly a decade now there has been sufficient momentum in science to not just complain about things scientists do wrong, but to actually do something about it. When social psychologists declared a replication crisis in the 1960s and 1970s, nothing much changed (Lakens, 2023). They also complained about bad methodology, flexibility in the data analysis, and a lack of generalizability and applicability, but no concrete actions to improve things emerged from this crisis.


After the 2010 crisis in psychology, scientists did make changes to how they work. Some of these changes were principled, others less so. For example, badges were introduced for certain open science practices, and researchers implementing these practices would get a badge presented alongside their article. This was not a principled change, but a nudge to change behavior. There were also more principled changes. For example, if researchers say they make error-controlled claims at a 5% alpha level, they should actually make error-controlled claims at a 5% alpha level, and they should not engage in research practices that untransparently inflate the Type 1 error rate. The introduction of a practice such as preregistration aimed to prevent untransparent inflation of the Type 1 error rate by making any possible inflation transparent. This is a principled change because it increases the coherence of research practices.
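The inflation that preregistration is meant to make transparent is easy to see in a quick simulation. Below is a minimal sketch (my own illustration, not from the post): five outcome measures are collected while the null hypothesis is true for all of them, and a 'flexible' analyst reports whichever test gives the smallest p-value.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n_sims, n, n_outcomes = 5_000, 30, 5  # hypothetical study: 30 per group, 5 outcomes

def two_sided_p(a, b):
    # z-test for a difference in means, valid here because the simulated sd is 1
    z = (a.mean() - b.mean()) / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

single = 0    # only the one preregistered outcome is tested
flexible = 0  # all outcomes are tested, and the smallest p is reported

for _ in range(n_sims):
    # The null hypothesis is true for every outcome: no difference exists.
    a = rng.normal(size=(n, n_outcomes))
    b = rng.normal(size=(n, n_outcomes))
    ps = [two_sided_p(a[:, j], b[:, j]) for j in range(n_outcomes)]
    single += ps[0] < 0.05
    flexible += min(ps) < 0.05

print(f"single preregistered test: {single / n_sims:.3f}")
print(f"best of {n_outcomes} undisclosed tests: {flexible / n_sims:.3f}")
```

The first rate stays near the nominal 5%, while picking the best of five independent tests pushes the Type 1 error rate toward 1 − 0.95⁵ ≈ 23%, even though every individual test is perfectly valid.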


As these changes in practice became more widely adopted, a large group of researchers was confronted with requirements such as having to justify their sample size, indicate whether they deserved an open science badge, or make explicit that a claim was exploratory (i.e., not error-controlled). As more people were confronted with these changes, the absolute number of people critical of these changes increased. A very reasonable question to ask as a scientist is 'Why?', and so people asked: 'Why should I do this new thing?'


There are two ways to respond to the question of why scientific practices need to change. The first justification is 'because science will improve'. This is an empirical justification. The world is currently in a certain observable state, and if we change things about our world, it will be in a different, but better, observable state. The second justification is 'because it logically follows'. This is, not surprisingly, a logical argument. There is a certain way of working that is internally inconsistent, and there is a way of working that is consistent.


An empirical justification requires evidence. A logical justification requires agreement with a principle. If we want to justify preregistration empirically, we need to provide evidence that it improved science. If you want to disagree with the claim that preregistration is a good idea, you need to disagree with the evidence. If we want to justify preregistration logically, we need people to agree with the principle that researchers should be able to transparently evaluate how coherently their peers are acting (e.g., that peers are not saying they are making an error-controlled claim when in actuality they did not control their error rate).


Why evidence for better science is practically impossible.

Although it is always difficult to provide strong evidence for a claim, some things are more difficult to study than others. Providing evidence that a change in practice improves science is so difficult that it might be practically impossible. Paul Meehl, one of the first meta-scientists, developed the idea of cliometric metatheory: the empirical investigation of which theories are doing better than others. He proposed following different theories for something like 50 years and seeing which ones lead to greater scientific progress. If we want to provide evidence that a change in practice improves science, we need something similar. So the time scale we are talking about is the first thing that makes the empirical study of what makes science 'better' difficult.

But we also need to collect evidence for a causal claim, which requires excluding confounders. A good start would be to randomly assign half of the scientists to preregister all their research for the next fifty years, and order the other half not to. This illustrates the second difficulty: it is practically impossible to go beyond observational data, and observational data will always have confounds. But even if we could manipulate something, the assumption that the control condition is not affected by the manipulation is too likely to be violated. The people who preregister will – if they preregister well – have no flexibility in the data analysis, and their alpha levels are controlled. But the people in the control condition know about preregistration as well. After p-hacking their way to p = 0.03 in Study 1, p = 0.02 in Study 2, and p = 0.06 ('marginally significant') in Study 3, they will look at their set of studies and wonder whether their peers will take it seriously. Probably not. So they develop new techniques to publish evidence for what they want to be true – for example, by performing large studies with unreliable measures and a tiny sprinkle of confounds, which consistently yield low p-values.

So after running several such studies for 50 years each, we end up with evidence that is not particularly difficult to poke holes in. We will have invested a huge amount of effort for what we should know from the outset will yield very little gain.


As we wrote in our recent paper "The benefits of preregistration and Registered Reports" (Lakens et al., 2024):


It is difficult to provide empirical support for the hypothesis that preregistration and Registered Reports will lead to studies of higher quality. To test such a hypothesis, scientists should be randomly assigned to a control condition where studies are not preregistered, a condition where researchers are instructed to preregister all their research, and a condition where researchers have to publish all their work as a Registered Report. We would then follow the success of theories examined in each of these three conditions in an approach Meehl (2004) calls cliometric metatheory by empirically examining which theories become ensconced, or sufficiently established that most scientists consider the theory as no longer in doubt. Because such a study is not feasible, causal claims about the effects of preregistration and Registered Reports on the quality of research are practically out of reach.


At this time, I do not believe there will ever be sufficiently conclusive empirical evidence for the causal claim that a change in scientific practice makes science better. You might argue that my bar for evidence is too high: conclusive empirical evidence in science is rarely possible, but we can provide evidence from observational studies, perhaps by attempting to control for the most important confounds and measuring decent proxies of 'better science' on a shorter time scale. I think this work can be valuable, it might convince some people, and it might even lead to a sufficient evidence base to warrant policy change by some organizations. After all, policies need to be set anyway, and most policies in science are based on weak evidence, at best.


A little bit of logic is worth more than two centuries of cliometric metatheory.


Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments. We published the nine studies on precognition by Daryl Bem because they followed standard empirical methods and yielded significant p-values, even though one of the reviewers pointed out that the paper should be rejected because it violated the laws of physics. Psychologists too often assign more weight to a p-value than to logical consistency.

And yet, a little bit of logic will often yield much greater returns, with much less effort. A logical justification of preregistration does not require empirical evidence. It just needs to point out that it is logically coherent to preregister. Logical propositions have premises and a conclusion: If X, then Y.

In meta-science, logical arguments are of the form 'if we have the goal to generate knowledge following a certain philosophy of science, then we need to follow certain methodological procedures'. For example, if you think it is a fun idea to take Feyerabend seriously and believe that science progresses in a system that cannot be captured by any rules, then anything goes. Now let's try a premise that is not as stupid as the one proposed by Feyerabend, and entertain the idea that some ways of doing science are better than others. For example, you might believe that scientists generate knowledge by making statistical claims (e.g., 'we reject the presence of a correlation larger than r = 0.1') that are not too often wrong. If this aligns with your philosophy of science, you might think the following proposition is valid: 'If a scientist wants to generate knowledge by making statistical claims that are not too often wrong, then they need to control their statistical error rates'. This puts us in Mayo's error-statistical philosophy. We can change the previous proposition, which was formulated at the level of the individual scientist, if we believe that science is not an individual process, but a social one. A proposition more in line with a social epistemological perspective would be: 'If the scientific community wants to generate knowledge by making statistical claims that are not too often wrong, then they need to have procedures in place to evaluate which claims were made by statistically controlling error rates'.
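The kind of error-controlled claim in this proposition can be made concrete with a small simulation (my own sketch, not from the post): a one-sided Fisher z-test of the null hypothesis that rho is at least 0.1, evaluated at the boundary where that null is exactly true, makes the claim 'the correlation is smaller than 0.1' wrongly in about 5% of studies, matching the alpha level.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho0, n_sims = 200, 0.1, 5_000
crit = -1.6449  # one-sided standard-normal critical value for alpha = 0.05

false_claims = 0
cov = [[1.0, rho0], [rho0, 1.0]]
for _ in range(n_sims):
    # Worst case for the claim: the true correlation sits exactly at rho = 0.1.
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    # Fisher z-test of H0: rho >= 0.1 against H1: rho < 0.1
    z = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)
    if z < crit:  # claim 'we reject the presence of a correlation larger than r = 0.1'
        false_claims += 1

print(f"rate of wrong claims at the boundary: {false_claims / n_sims:.3f}")
```

The printed rate hovers around 0.05: as long as the procedure is followed as specified, the community's claims of this type are wrong at most about 5% of the time, which is exactly what 'controlling the statistical error rate' buys you.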


This in itself is not a sufficient argument for preregistration, because there are many procedures that we could rely on. For example, we can trust scientists. If they do not say anything about flexibly analyzing their data, we can trust that they did not flexibly analyze their data. You can also believe that science should not be based on trust. Instead, you might believe that scientists should be able to scrutinize claims by peers, and that they should not have to take their word for it: Nullius in Verba. If so, then science should be transparent. You do not need to agree with this, of course, just as you did not have to agree with the premise that the goal of science is to generate claims that are not too often wrong. If we include this premise, we get the following proposition: “If the scientific community wants to generate knowledge by making statistical claims that are not too often wrong, and if scientists should be able to scrutinize claims by peers, then they need to have procedures in place for peers to transparently evaluate which claims were made by statistically controlling error rates”.

Now we have a logical argument for preregistration as one change in the way scientists work, because it makes that way of working more coherent. Preregistration is not the only possible change that would make science coherent. For example, we could also test all hypotheses in the presence of the entire scientific community, say by live-streaming and recording all research that is being done. This would also be a coherent improvement to how scientists work, but it would be much more cumbersome. The hope is that preregistration, when implemented well, is a more efficient change to make science more coherent.


Should logic or evidence be the basis of change in science?


Which of the two justifications for changes in scientific practice is more desirable? A benefit of evidence is that it can convince all rational individuals, as long as it is strong enough. But evidence can be challenged, especially when it is weak. This is an important feature of science, but when disagreements about the evidence base cannot be resolved, it quickly leads to 'even the experts do not agree about what the data show'. A benefit of logic is that it, too, should convince rational individuals, as long as they agree with the premises. But not everyone will agree with the premises. Again, this is an important feature of science. It might be a personal preference, but I actually like disagreements about the premises of what the goals of science are. Where disagreements about evidence are temporarily acceptable but in the long run undesirable, disagreements about the goals of science are good for the diversity in science. Or at least that is a premise I accept.


As I see it, the goal should not be to convince people to implement certain changes to scientific practice per se, but to get scientists to behave in a coherent manner, and to implement changes to their practice if this makes their practice more coherent. Whether practices are coherent is unrelated to whether you believe practices are good or desirable. Those value judgments are part of your decision to accept or reject a premise. You might think it is undesirable that scientists make claims, as this will introduce all sorts of undesirable consequences, such as confirmation bias. Then you would choose a different philosophy of science. That is fine, as long as you then implement research practices that logically follow from your premises.

Empirical research can guide you towards or away from accepting certain premises. For example, meta-scientists might describe facts that make you believe scientists are extremely trustworthy, and that transparency is therefore not needed. Meta-scientists might also point out ways in which research practices are not coherent with certain premises. For example, if we believe transparency is important, but most researchers selectively publish results, then we have identified an incoherency that we might need to educate people about, or that we need to develop ways for researchers to resolve (such as preprint servers that allow researchers to share all results with peers). And for some changes to science, such as the introduction of Open Science Badges, there might not be any logical justification (or if one exists, I have not seen it). For those changes, empirical justifications are the only possibility.


Conclusion


As changes to scientific practice become more institutionalized, it is only fair that researchers ask why these changes are needed. There are two possible justifications: one based on empirical evidence, and one based on logically coherent procedures that follow from a premise. Psychologists might intuitively believe that empirical evidence is the better justification for a practice. I personally doubt it. I think logical arguments will often provide a stronger foundation, especially when scientific evidence is practically impossible to collect.
