The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, July 4, 2025

Are meta-scientists ignoring philosophy of science?

Are meta-scientists ignoring philosophy of science (PoS)? Are they re-inventing the wheel? A recent panel at the Metascience conference engaged with this question, and the first sentence of the abstract states “Critics argue that metascience merely reinvents the wheel of other academic fields.” It’s a topic I have been thinking about for a while, so I will share my thoughts on this question. In this blog post I speak only for myself, and not for any other metascientists. I studied philosophy for a year, have read quite a lot about philosophy of science, regularly review for philosophy journals, have co-organized a conference that brings together philosophers and metascientists (and am co-organizing the next meeting), and currently have three ongoing collaborations with philosophers of science. I would say it seems a bit far-fetched to claim that I am ignoring philosophy of science and just reinventing what that field has already done. But I am ignoring a lot of it. That is not something PoS should take personally. I also ignore a lot of the metascientific work that is done. That seems perfectly normal to me – there is only so much work I need to engage with to do my work better.

 

I read a lot of the work philosophers of science have written that is relevant for metascientists, and a lot of what they are currently writing. Too often, I find work in philosophy of science on the replication crisis and related topics to be of quite low quality, and mostly it turns out to be rather irrelevant for my work. Very often, philosophers seem to have thought about a topic very little, and the limited time they did spend on it was spent without any engagement with actual scientists. This is especially true of the philosophical work on the replication crisis. Having lived through it, and having thought about it every day for the last 15 years, I find most of the work by philosophers quite superficial. If you spent only 3 years full-time on your paper (and I know people who spent no more than a year full-time on an entire book!), it is just not going to be insightful enough for me to learn anything I didn’t know. Instead, I will notice a lot of mistakes, faulty assumptions, and incorrect conclusions.

 

I recently read a book by a philosopher of science that I was looking forward to, hoping I would learn new things. Instead, for 200 pages I just thought ‘but psychologists themselves have done so much work on this that you are not discussing!’ I found that quite frustrating. Maybe we should also talk about philosophers ignoring the work by psychologists.

 

As I sat with this frustration, a thought popped up. There are very few philosophers of science compared to the number of psychologists. Let’s say I read the literature on metascientific topics in psychology and philosophy, and only the top X% of papers make my work better. What is the probability that a psychologist has performed work that is relevant for me as a metascientist, compared to work by a philosopher? Because there are so many more psychologists than philosophers, all else equal, there will be many more important papers by psychologists than by philosophers that I should read.

 

Of course, all else is not equal. Psychologists have thought about all their crises for more than 60 years, ever since the first crisis in the 1970s (Lakens, 2025a, 2025b). Psychologists have a much better understanding of how psychological research is done than philosophers. Philosophers are on average smarter than psychologists (but again, there are many more psychologists), and have better training in conceptual analysis. Psychologists are more motivated to work on challenges in their field than philosophers are. There are many other differences. So, we need to weigh all these factors in a model that predicts how many papers by philosophers of science I find useful to read, compared to the number of papers by psychologists I find useful to read. I don’t have those weights, but I have the outcome of the model for my own research: most often, papers by psychologists on metascience are better and more relevant for my work. Remember: I still ignore a lot of the papers on metascience by psychologists, and I find a lot of those papers low quality as well! But psychologists write a lot more on the topic, and I also think that, combining both fields, the best papers on metascience are more often written by psychologists.

 

I think this alternative explanation for why we engage very little with philosophers of science is worth taking into account. I personally consider it a strong contender, compared to an explanation that posits that we intentionally do not engage with the literature in philosophy of science.

 

There are additional reasons why I end up reading less work by philosophers of science. One is that I often do not agree with certain assumptions they make. The ideas that guide my work have been out of fashion in philosophy of science for half a century. Most of the ideas that are in fashion turn me off. I just do not enjoy reading papers that say ‘Let’s assume scientists are rational Bayesian updaters’ or ‘There is not one way to do science’. My working model of science is something very different, best summarized by this cartoon:

[Cartoon]

 

Although I ignore most of the work by philosophers of science on metascience, I still engage with quite a lot of philosophy of science. If you browse through the reference list of my online textbook – which is about statistics! – I am confident that more philosophers are cited there than metascientists are cited in a book by a philosopher of science. If you browse through the reading notes of the podcast ‘Nullius in Verba’ that I record with Smriti Mehta, I am confident that there are more papers by philosophers in them than there are papers by metascientists in podcasts by philosophers of science.

 

I just wanted to share these thoughts to add some diversity to the ideas that were shared in the panel at the Metascience conference. When “this panel asks what is new about metascience, and why it may have captured greater attention than previous research and reform initiatives”, maybe one reason is that, on average, this literature is better and more relevant for researchers interested in how science works, and how it could work. I know that is not going to be a very popular viewpoint for philosophers to read, but it is my viewpoint, and I think no one can criticize me for not engaging with philosophy of science enough. I have at least a moderately informed opinion on this matter.




P.S. The Metascience symposium also discussed why work in the field of Science and Technology Studies is not receiving much love from metascientists. I also have thoughts about this topic, but those thoughts are a bit too provocative to share on a blog.


References

Lakens, D. (2025a). Concerns About Replicability Across Two Crises in Social Psychology. International Review of Social Psychology, 38(1). https://doi.org/10.5334/irsp.1036

Lakens, D. (2025b). Concerns About Theorizing, Relevance, Generalizability, and Methodology Across Two Crises in Social Psychology. International Review of Social Psychology, 38(1). https://doi.org/10.5334/irsp.1038


Monday, June 23, 2025

Retrieving Planned Sample Sizes from AsPredicted Preregistrations

It is increasingly common for researchers to preregister their studies (Spitzer and Mueller 2023; Imai et al. 2025). As preregistration is a relatively new practice, it is not surprising that it is not always implemented well. One challenge is that researchers do not always indicate where they deviated from their preregistration when reporting results from preregistered studies (Akker et al. 2024). Reviewers could check whether researchers adhere to the preregistration, but this requires some effort. Automation can provide a partial solution by making information more easily available, and perhaps even performing automated checks of some parts of the preregistration.

Here we demonstrate how Papercheck, our software to perform automated checks on scientific manuscripts, can automatically retrieve the content of a preregistration. The preregistration can then be presented alongside the relevant information in the manuscript. This makes it easier for peer reviewers to compare the information.

We focus on AsPredicted preregistrations as their structured format makes it especially easy to retrieve information (but we also have code to do the same for structured OSF preregistration templates). We can easily search for AsPredicted links in all 250 open access papers from Psychological Science. The Papercheck package conveniently includes these papers in XML format in the psychsci object.
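
As a minimal sketch of that search (assuming aspredicted_links() is applied to one paper at a time, as it is later in this post, and returns an empty result when a paper contains no AsPredicted link), one could loop over the psychsci object:

library(papercheck)

# loop over the 250 open access papers and look for AsPredicted links
all_links <- lapply(psychsci, aspredicted_links)

# keep only the papers for which at least one link was found
# (NROW() is used because the exact return structure is assumed here)
papers_with_prereg <- Filter(function(x) NROW(x) > 0, all_links)
length(papers_with_prereg)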

Sample Size

Recent metascientific research on preregistrations (Akker et al. 2024) has shown that the most common deviation from a preregistration in practice is that researchers do not collect the sample size that they preregistered. This is not necessarily problematic, as a difference might be only a single datapoint. Nevertheless, researchers should discuss the deviation, and evaluate whether the deviation impacts the severity of the test (Lakens 2024).

Checking the sample size of a study against the preregistration takes some effort, as the preregistration document needs to be opened, the correct entry located, and the corresponding text in the manuscript identified. Recently, a fully automatic comparison tool (RegCheck) was created by Jamie Cummins from the University of Bern. It relies on large language models: users upload the manuscript and the preregistration, and receive an automated comparison. We take a slightly different approach. We retrieve the preregistration from AsPredicted automatically, and present users with the information about the preregistered sample size (which is straightforward given the structured approach of the AsPredicted template). We then recommend that users compare this information against the method section in the manuscript.

Preregistration Sample Size Plan

You can access the sample size plan from the results of aspredicted_retrieve() under the column name AP_sample_size.
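
The retrieval step itself is not shown for this first example; a minimal sketch, assuming paper already holds this lemur study from the psychsci object (its identifier is omitted here), mirrors the calls shown in full for the second example below:

# retrieve the AsPredicted preregistration linked in this paper
links <- aspredicted_links(paper)
prereg <- aspredicted_retrieve(links)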

# get the sample size section from AsPredicted
prereg_sample_size <- unique(prereg$AP_sample_size)

# use cat("> ", x = _) with #| results: 'asis' in the code chunk
# to print out results with markdown quotes
prereg_sample_size |> cat("> ", x = _)

The study will compare four lemur species: ruffed lemur, Coquerel’s sifakas, ring-tailed lemur and mongoose lemur at the Duke Lemur Center. We will test a minimum of 10 and a maximum of 15 individuals for each species based on availability and individual’s willingness to participate at the time of testing.

Paper Sample Size

Now we need to check what the achieved sample size in the paper is.

To facilitate this comparison, we can retrieve all paragraphs that contain words such as ‘sample’ or ‘participants’ from the manuscript, in the hope that this contains the relevant text. A more advanced version of this tool could attempt to identify the relevant information in the manuscript with a search for specific words used in the preregistration. Below, we also show how AI can be used to identify the related text in the manuscript. We first use Papercheck’s inbuilt search_text() function to find sentences discussing the sample or participants. For the current paper, we see this simple approach works.

# match "sample" or "# particip..."
regex_sample <- "\\bsample\\b|\\d+\\s+particip\\w+"

# get full paragraphs only from the method section
sample <- search_text(paper, regex_sample, 
                      section = "method", 
                      return= "paragraph")

sample$text |> cat("> ", x = _)

We tested 39 lemurs living at the Duke Lemur Center (for subject information, see Table S1 in the Supplemental Material available online). We assessed four taxonomic groups: ruffed lemurs (Varecia species, n = 10), Coquerel’s sifakas (Propithecus coquereli, n = 10), ringtailed lemurs (Lemur catta, n = 10), and mongoose lemurs (Eulemur mongoz, n = 9). Ruffed lemurs consisted of both red-ruffed and black-and-white-ruffed lemurs, but we collapsed analyses across both groups given their socioecological similarity and classification as subspecies until recently (Mittermeier et al., 2008). Our sample included all the individuals available for testing who completed the battery; two additional subjects (one sifaka and one mongoose lemur) initiated the battery but failed to reach the predetermined criterion for inclusion in several tasks or stopped participating over several days. All tests were voluntary: Lemurs were never deprived of food, had ad libitum access to water, and could stop participating at any time. The lemurs had little or no prior experience in relevant cognitive tasks such as those used here (see Table S1). All behavioral tests were approved by Duke University’s Institutional Animal Care and Use Committee .

The authors planned to test at least 10 mongoose lemurs, but one didn’t feel like participating. This can happen, and it does not really impact the severity of the test, but the statistical power is slightly lower than desired, and it is a deviation from the original plan - both deserve to be discussed. This papercheck module can remind researchers that they deviated from a preregistration and should discuss the deviation, or it can help peer reviewers notice that a deviation is not discussed.
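
The ‘more advanced version’ mentioned above could, for example, build the search pattern from distinctive words in the preregistration text itself. The sketch below is not part of Papercheck; the word-length cut-off is an arbitrary choice:

# rough sketch: derive search keywords from the preregistration text itself
prereg_words <- unlist(strsplit(tolower(prereg_sample_size), "[^a-z]+"))
prereg_words <- unique(prereg_words[nchar(prereg_words) > 5])

# turn the keywords into a single regular expression
regex_prereg <- paste0("\\b(", paste(prereg_words, collapse = "|"), ")\\b")

# search the method section for paragraphs that use these words
sample_adv <- search_text(paper, regex_prereg, 
                          section = "method", 
                          return = "paragraph")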

[Image: mongoose lemurs]

Asking a Large Language Model to Compare the Paper and the Preregistration

Although Papercheck’s philosophy is that users should evaluate the information from automated checks, and that AI should be optional and never the default, it can be efficient to send the preregistered sample size and the text reported in the manuscript to a large language model, and have it compare the preregistration with the text in the method section. This is more costly (both financially and ecologically), but it can work better, as the researchers might not use words like ‘sample’ or ‘participants’, and an LLM provides more flexibility to match text across two documents.

Papercheck makes it easy to extract the method section in a paper:


method_section <- search_text(paper, pattern = "*", section = c("method"), return = "section")

We can send the method section to an LLM, and ask which paragraph is most closely related to the text in the preregistration. Papercheck has a custom function to send text and a query to Groq. We use Groq because of its privacy policy: it will not retain data or train on data, which is important when sending text from scientific manuscripts that may be unpublished to an LLM. Furthermore, we use an open-source model (llama-3.3-70b-versatile).

query_template <- "The following text is part of a scientific article. It describes a performed study. Part of this text should correspond to what researchers planned to do. Before data collection, the researchers stated they would:

%s

Your task is to retrieve the sentence(s) in the article that correspond to this plan, and evaluate based on the text in the manuscript whether researchers followed their plan with respect to the sample size. Start your answer with a 'The authors deviated from their preregistration' if there is any deviation."

# insert prereg text into template
query <- sprintf(query_template, prereg_sample_size)

# combine all relevant paragraphs
text <- paste(method_section$text, collapse = "\n\n")

# run query
llm_response <- llm(text, query, model = "llama-3.3-70b-versatile")
#> You have 499999 of 500000 requests left (reset in 172.799999ms) and 296612 of 300000 tokens left (reset in 677.6ms).

llm_response$answer |> cat("> ", x = _)

The authors deviated from their preregistration. The preregistered plan stated that they would test a minimum of 10 and a maximum of 15 individuals for each species. However, according to the text, they tested 10 ruffed lemurs, 10 Coquerel’s sifakas, 10 ring-tailed lemurs, and 9 mongoose lemurs. The number of mongoose lemurs (9) is below the minimum of 10 individuals planned for each species, indicating a deviation from the preregistered plan.

As we see, the LLM does a very good job evaluating whether the authors adhered to their preregistration in terms of the sample size. The long-run performance of this automated evaluation needs to be validated in future research - this is just a proof of principle - but it has potential for editors who want to automatically check if authors followed their preregistration, and for meta-scientists who want to examine preregistration adherence across a large number of papers. For such meta-scientific use-cases, however, the code needs to be extensively validated and error rates should be acceptably low (i.e., comparable to human coders).
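
For such large-scale use, a rough sketch of a batch check (reusing query_template from above, and assuming aspredicted_links() returns an empty result for papers without a preregistration) could look like this:

# rough sketch of a batch check; error handling and rate limiting are omitted,
# and the structure of some return values is assumed
check_prereg_sample <- function(paper) {
  links <- aspredicted_links(paper)
  if (NROW(links) == 0) return(NA_character_)  # no AsPredicted link found

  prereg <- aspredicted_retrieve(links)
  prereg_sample_size <- unique(prereg$AP_sample_size)

  method_section <- search_text(paper, pattern = "*", 
                                section = "method", return = "section")
  text <- paste(method_section$text, collapse = "\n\n")
  query <- sprintf(query_template, prereg_sample_size)

  llm(text, query, model = "llama-3.3-70b-versatile")$answer
}

# apply to a small, arbitrary subset of the open access papers
results <- lapply(psychsci[1:10], check_prereg_sample)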

Automated Checks Can Be Wrong!

The use of AI to interpret deviations is convenient, but it cannot replace human judgment. The following article, Exploring the Facets of Emotional Episodic Memory: Remembering “What,” “When,” and “Which”, also has a preregistration. A large language model will incorrectly state that the authors deviated from their preregistration. It misses that the authors explicitly say that Cohort B was not preregistered, and that falling short of the planned sample size of 60 in that cohort should therefore not be seen as a deviation from the preregistration. All flagged deviations from a preregistration should be manually checked. Papercheck is only intended to make checks of a preregistration more efficient; in the end, people need to make the final judgment. The preregistered sample size statement is as follows:

paper <- psychsci$`0956797621991548`
links <- aspredicted_links(paper)
prereg <- aspredicted_retrieve(links)
#> Starting AsPredicted retrieval for 1 files...
#> * Retrieving info from https://aspredicted.org/p4ci6.pdf...
#> ...AsPredicted retrieval complete!

#sample size
prereg_sample_size <- unique(prereg$AP_sample_size)
prereg_sample_size |> cat("> ", x = _)

N = 60. Participants will be recruited from the undergraduate student population of the University of British Columbia, and will be compensated with course credit through the Human Subject Pool Sona system. All participants aged 18-35 will be eligible for participation, and must be fluent in English (to ensure instruction comprehension).

If we send the method section to an LLM and ask it to identify any deviations from the preregistration, we get the following response:

# LLM workflow - send potentially relevant paragraphs

method_section <- search_text(paper, pattern = "*", section = c("method"), return = "section")

# combine all relevant paragraphs
text <- paste(method_section$text, collapse = "\n\n")

query <- sprintf(query_template, prereg_sample_size)
llm_response <- llm(text, query, model = "llama-3.3-70b-versatile")
#> You have 499999 of 500000 requests left (reset in 172.799999ms) and 297698 of 300000 tokens left (reset in 460.4ms).

llm_response$answer |> cat("> ", x = _)

The authors deviated from their preregistration in terms of the sample size for cohort B. According to the preregistration, the researchers planned to collect data from 60 participants in each cohort. However, for cohort B, they were only able to collect data from 56 participants due to the interruption of data collection caused by the COVID-19 pandemic. The sentence that corresponds to the plan is: “Here, we sought to collect data from 60 participants in each cohort.”

Future Research

We believe automatically retrieving information about preregistrations has the potential to reduce the workload of peer reviewers, and might function as a reminder to authors that they should discuss deviations from the preregistration. The extent to which this works in practice should be investigated.

We have only focused on an automated check for the preregistered sample size. Other components of a preregistration, such as exclusion criteria or the planned analysis, are also important to check. It might be more difficult to create automated checks for these components, given the great flexibility in how statistical analyses in particular are reported. In an earlier paper we discussed the benefits of creating machine-readable hypothesis tests, and argued that this should be considered the gold standard for a preregistration (Lakens and DeBruine 2021). Machine-readable hypothesis tests would allow researchers to automatically check whether preregistered hypotheses are corroborated or falsified. But we realize it will be some years before this becomes common practice.

There are a range of other improvements and extensions that should be developed, such as support for multi-study papers that contain multiple preregistrations, and extending this code to preregistrations on other platforms, such as the OSF. If you are interested in developing this papercheck module further, or performing such a validation study, do reach out to us.

References

Akker, Olmo R. van den, Marjan Bakker, Marcel A. L. M. van Assen, Charlotte R. Pennington, Leone Verweij, Mahmoud M. Elsherif, Aline Claesen, et al. 2024. “The Potential of Preregistration in Psychology: Assessing Preregistration Producibility and Preregistration-Study Consistency.” Psychological Methods, October. https://doi.org/10.1037/met0000687.
Imai, Taisuke, Séverine Toussaert, Aurélien Baillon, Anna Dreber, Seda Ertaç, Magnus Johannesson, Levent Neyse, and Marie Claire Villeval. 2025. Pre-Registration and Pre-Analysis Plans in Experimental Economics. 220. I4R Discussion Paper Series. https://www.econstor.eu/handle/10419/315047.
Lakens, Daniël. 2024. “When and How to Deviate from a Preregistration.” Collabra: Psychology 10 (1): 117094. https://doi.org/10.1525/collabra.117094.
Lakens, Daniël, and Lisa M. DeBruine. 2021. “Improving Transparency, Falsifiability, and Rigor by Making Hypothesis Tests Machine-Readable.” Advances in Methods and Practices in Psychological Science 4 (2): 2515245920970949. https://doi.org/10.1177/2515245920970949.
Spitzer, Lisa, and Stefanie Mueller. 2023. “Registered Report: Survey on Attitudes and Experiences Regarding Preregistration in Psychological Research.” PLOS ONE 18 (3): e0281086. https://doi.org/10.1371/journal.pone.0281086.