It is increasingly common for researchers to preregister their studies (Spitzer and Mueller 2023; Imai et al. 2025). As preregistration is a relatively new practice, it is not surprising that it is not always implemented well. One challenge is that researchers do not always indicate where they deviated from their preregistration when reporting results from preregistered studies (Akker et al. 2024). Reviewers could check whether researchers adhere to the preregistration, but this requires some effort. Automation can provide a partial solution by making information more easily available, and perhaps even by performing automated checks of some parts of the preregistration.
Here we demonstrate how Papercheck, our software to perform automated checks on scientific manuscripts, can automatically retrieve the content of a preregistration. The preregistration can then be presented alongside the relevant information in the manuscript. This makes it easier for peer reviewers to compare the information.
We focus on AsPredicted preregistrations as their structured format makes it especially easy to retrieve information (but we also have code to do the same for structured OSF preregistration templates). We can easily search for AsPredicted links in all 250 open access papers from Psychological Science. The Papercheck package conveniently includes these in XML format in the psychsci object.
Find AsPredicted Links
Extracting links is a bit trickier than just searching for “aspredicted.org”, so we provide a convenient function for extracting them from papers: aspredicted_links(). It also cleans up incorrectly formatted links, for example when “https://aspredicted.org/blind.php?x=nq4xa3” is split across two sentences at the question mark, which is common in scientific PDFs.
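As a minimal illustration of the kind of cleanup involved (this is not the actual aspredicted_links() implementation), a link split by a PDF extractor can be rejoined like this:
# illustrative only: rejoin a URL that a PDF extractor split at the question mark
fragments <- c("https://aspredicted.org/blind.php?", "x=nq4xa3")
paste0(fragments, collapse = "")
#> [1] "https://aspredicted.org/blind.php?x=nq4xa3"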
This function returns a table with a row for each link found, indicating the location of the link in each paper. Many papers include the same link in multiple places, so we only show the unique links here, returned in the “text” column.
links <- aspredicted_links(psychsci)
unique(links$text)
#> [1] "https://aspredicted.org/ve2qn.pdf"
#> [2] "https://aspredicted.org/mq97g.pdf"
#> [3] "https://aspredicted.org/4gf64.pdf"
#> [4] "https://aspredicted.org/8a6ta.pdf"
#> [5] "https://aspredicted.org/vp4rg.pdf"
#> [6] "https://aspredicted.org/3kq9y.pdf"
#> [7] "https://aspredicted.org/rz98j.pdf"
#> [8] "https://aspredicted.org/z97us.pdf"
#> [9] "https://aspredicted.org/h9xm3.pdf"
#> [10] "https://aspredicted.org/bj9er.pdf"
#> [11] "https://aspredicted.org/my5jk.pdf"
#> [12] "https://aspredicted.org/5xe8i.pdf"
#> [13] "https://aspredicted.org/ak97v.pdf"
#> [14] "https://aspredicted.org/p4ci6.pdf"
#> [15] "https://aspredicted.org/iv9tb.pdf"
#> [16] "https://aspredicted.org/dp8r5.pdf"
#> [17] "https://aspredicted.org/Y2F_6B7"
#> [18] "https://aspredicted.org/2YK_D6R"
#> [19] "https://aspredicted.org/MPG_T3C"
#> [20] "https://aspredicted.org/G68_GBZ"
#> [21] "https://aspredicted.org/9SR_7BC"
#> [22] "https://aspredicted.org/6D7_FVX"
#> [23] "https://aspredicted.org/KD5_7LF"
#> [24] "https://aspredicted.org/6tv5v.pdf"
#> [25] "https://aspredicted.org/qs7zz.pdf"
#> [26] "https://aspredicted.org/4mk6i.pdf"
#> [27] "https://aspredicted.org/z5k26.pdf"
#> [28] "https://aspredicted.org/wd3pm.pdf"
#> [29] "https://aspredicted.org/LVH_7KX"
#> [30] "https://aspredicted.org/LZQ_DXY"
#> [31] "https://aspredicted.org/6X6_XZW"
#> [32] "https://aspredicted.org/blind.php?x=nq4xa3"
#> [33] "https://aspredicted.org/blind.php?"
#> [34] "https://aspredicted.org/blind.php?x=772w3a"
#> [35] "https://aspredicted.org/blind.php?x=55km72"
#> [36] "https://aspredicted.org/blind.php?x=yv9c2a"
#> [37] "https://aspredicted.org/blind.php?x=4xe5ih"
#> [38] "https://aspredicted.org/blind.php?x=pk8ff3"
#> [39] "https://aspredicted.org/sn9xs.pdf"
#> [40] "https://aspredicted.org/vh8kg.pdf"
#> [41] "https://aspredicted.org/ay3yk.pdf"
#> [42] "https://aspredicted.org/jz2nc.pdf"
#> [43] "https://aspredicted.org/a2wc9.pdf"
#> [44] "https://aspredicted.org/PD5_KKS"
#> [45] "https://aspredicted.org/9PG_LTT"
#> [46] "https://aspredicted.org/M3P_X3P"
#> [47] "https://aspredicted.org/H53_M3P"
#> [48] "https://aspredicted.org/CQW_DTT"
#> [49] "https://aspredicted.org/PW5_5VT"
#> [50] "https://aspredicted.org/sq22k.pdf"
#> [51] "https://aspredicted.org/u53e3.pdf"
One thing we notice is that published versions of articles often still include ‘blind’ links, which are created to allow for anonymous peer review. Reviewers can access the preregistration, but it is anonymized. This is useful during peer review, but when a paper is accepted, the link should be replaced by the normal link to the AsPredicted preregistration. Editors using Papercheck could easily create a module to check for this, for example with the following code:
if (any(grepl("blind", links$text, ignore.case = TRUE))) {
  message("Warning: Blinded link(s) detected.")
}
#> Warning: Blinded link(s) detected.
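Such a module could also report which papers the blinded links come from, so editors know where to follow up. A minimal sketch, assuming the links table includes an id column identifying the paper each link was found in:
# illustrative sketch: list the papers that still contain blinded AsPredicted links
blinded <- links[grepl("blind", links$text, ignore.case = TRUE), ]
unique(blinded$id)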
Retrieve Link Info
The function aspredicted_retrieve() can be used to get structured information from AsPredicted. In principle, Papercheck can be used to extract information from all 51 links in Psychological Science articles, but it is not polite to overburden a free platform by retrieving data you do not need. So let’s just retrieve information for one paper, titled “The Evolution of Cognitive Control in Lemurs”, which contains a single preregistration on AsPredicted.
paper <- psychsci$`09567976221082938`
links <- aspredicted_links(paper)
prereg <- aspredicted_retrieve(links)
#> Starting AsPredicted retrieval for 1 files...
#> * Retrieving info from https://aspredicted.org/iv9tb.pdf...
#> ...AsPredicted retrieval complete!

# get just the AsPredicted columns
cols <- names(prereg)
ap_cols <- cols[grepl("^AP_", cols)]

# transpose for easier reading
prereg[1, ap_cols] |> t()
#>                  [,1]
#> AP_title "Lemur executive function"
#> AP_authors "Francesca De Petrillo (Institute for Advanced Study of Toulouse, France) - francesca.de-petrillo@iast.frAlexandra Rosati (University of Michigan, USA) - rosati@umich.edu"
#> AP_created "2019/06/06 - 11:19 AM (PT)"
#> AP_data "No, no data have been collected for this study yet."
#> AP_hypotheses "Do differences in species’ socio-ecology predict variation across different aspects of executive function? How are different components of executive function related across individuals in these species?"
#> AP_key_dv "For each individual, we will take measures of their cognitive ability using a cognitive battery comprising 7 experimental tasks. \r\n1. Novel object task. Lemurs will be presented with a series of novel items (baseline, person, stationary object, moving object). For each item, we will measure lemurs’ latency to approach and their interest in each item.\r\n2. Quantity discrimination task. Lemurs will make choices between smaller and larger pieces of food. We will measure choices for the larger piece.\r\n3. Persistence task. Lemurs will be presented with a piece of food in a box that is impossible to access. We will measure how many times and for how long subjects interact with the box.\r\n4. Temporal discounting task. Lemurs will be presented with a series of choices between a smaller piece of food immediately available and a larger piece of the same food available after a delay. We will measure their choices for the larger delayed option.\r\n5. A-not-B error task. Lemurs will be first familiarized with finding food at a container in one location (A), and then in the test trial visibly see the food moved from A to a new container (B). We will measure lemurs’ choices for the correct location (B).\r\n6. Working memory task. Lemurs will see the experimenter hide food under one of 3 identical containers; after a 5s delay with occlusion, lemurs can choose one of the cups. We will measure their choices for the correct location.\r\n7. Reversal learning task. Lemurs will be presented with two containers (different colors and locations). They will first learn that one container provides a food reward (whereas the other is always empty). Once they learn this, the reward contingencies will be switched in the test trials. We will measure responses for the correct option in the learning and test trials."
#> AP_conditions "This study does not have conditions: all individuals will complete the same tasks in the same order to assess species-level and individual-level variation in these cognitive abilities."
#> AP_analyses "First, we will examine whether differences in species’ socio-ecology predict variation in subjects’ performance in a battery of cognitive tests that tap into several component executive functions. To do this, we will use generalized linear mixed models to analyze species’ performance in each separate task. For each task, the dependent variable will be correct choices in that task, and the test predictor will be species. We will include subject ID as a random factor, and control for age, sex and trial number in models as relevant. Across analyses, we will compare the fit of different models using likelihood-ratio tests. In addition, we will also examine each individual’s performance across the cognitive tasks, to examine whether performance on a given cognitive task predicts performance in other tasks. To do so we will first use pairwise bivariate correlations across all individuals as well as partitioning by species. If there were significant age-related variation in cognitive performance in tasks in the first phase of analyses, we will also use linear regressions accounting for age. Second, we will use a factor analysis to detect whether performance in different task co-vary across individuals (overall, and within each species)."
#> AP_outliers "For each task, individuals will be excluded from relevant analyses if they fail to complete the task for three consecutive attempts. Any individual who fails to complete 1 or more cognitive task, will be excluded from the relevant analyses requiring that data, and we will check if including them affects other results."
#> AP_sample_size "The study will compare four lemur species: ruffed lemur, Coquerel’s sifakas, ring-tailed lemur and mongoose lemur at the Duke Lemur Center. We will test a minimum of 10 and a maximum of 15 individuals for each species based on availability and individual’s willingness to participate at the time of testing."
#> AP_anything_else "Nothing else to pre-register."
#> AP_version "2.00"
Sample Size
Recent metascientific research on preregistrations (Akker et al. 2024) has shown that the most common deviation from a preregistration in practice is that researchers do not collect the sample size they preregistered. This is not necessarily problematic, as the difference might be only a single data point. Nevertheless, researchers should discuss the deviation and evaluate whether it impacts the severity of the test (Lakens 2024).
Checking the sample size of a study against the preregistration takes some effort: the preregistration document needs to be opened, the correct entry located, and the corresponding text in the manuscript identified. Recently, a fully automatic comparison tool (RegCheck) was created by Jamie Cummins from the University of Bern. It relies on large language models: users upload the manuscript and the preregistration, and receive an automated comparison. We take a slightly different approach. We retrieve the preregistration from AsPredicted automatically and present users with the information about the preregistered sample size (which is straightforward given the structured approach of the AsPredicted template). We then recommend that users compare this information against the method section in the manuscript.
Preregistration Sample Size Plan
You can access the sample size plan from the results of aspredicted_retrieve() under the column name AP_sample_size.
# get the sample size section from AsPredicted
prereg_sample_size <- unique(prereg$AP_sample_size)

# use cat("> ", x = _) with #| results: 'asis' in the code chunk
# to print out results with markdown quotes
prereg_sample_size |> cat("> ", x = _)
The study will compare four lemur species: ruffed lemur, Coquerel’s sifakas, ring-tailed lemur and mongoose lemur at the Duke Lemur Center. We will test a minimum of 10 and a maximum of 15 individuals for each species based on availability and individual’s willingness to participate at the time of testing.
Paper Sample Size
Now we need to check the achieved sample size in the paper.
To facilitate this comparison, we can retrieve all paragraphs from the manuscript that contain words such as ‘sample’ or ‘participants’, in the hope that these contain the relevant text. A more advanced version of this tool could attempt to identify the relevant information in the manuscript by searching for specific words used in the preregistration. Below, we also show how AI can be used to identify the related text in the manuscript. We first use Papercheck’s built-in search_text() function to find sentences discussing the sample or participants. For the current paper, this simple approach works.
# match "sample" or "# particip..."
<- "\\bsample\\b|\\d+\\s+particip\\w+"
regex_sample
# get full paragraphs only from the method section
<- search_text(paper, regex_sample,
sample section = "method",
return= "paragraph")
$text |> cat("> ", x = _) sample
We tested 39 lemurs living at the Duke Lemur Center (for subject information, see Table S1 in the Supplemental Material available online). We assessed four taxonomic groups: ruffed lemurs (Varecia species, n = 10), Coquerel’s sifakas (Propithecus coquereli, n = 10), ringtailed lemurs (Lemur catta, n = 10), and mongoose lemurs (Eulemur mongoz, n = 9). Ruffed lemurs consisted of both red-ruffed and black-and-white-ruffed lemurs, but we collapsed analyses across both groups given their socioecological similarity and classification as subspecies until recently (Mittermeier et al., 2008). Our sample included all the individuals available for testing who completed the battery; two additional subjects (one sifaka and one mongoose lemur) initiated the battery but failed to reach the predetermined criterion for inclusion in several tasks or stopped participating over several days. All tests were voluntary: Lemurs were never deprived of food, had ad libitum access to water, and could stop participating at any time. The lemurs had little or no prior experience in relevant cognitive tasks such as those used here (see Table S1). All behavioral tests were approved by Duke University’s Institutional Animal Care and Use Committee .
The authors planned to test 10 mongoose lemurs, but one did not feel like participating. This can happen, and it does not really impact the severity of the test, but the statistical power is slightly lower than desired, and it is a deviation from the original plan; both deserve to be discussed. This Papercheck module can remind researchers that they deviated from a preregistration so they can discuss the deviation, or it can help peer reviewers notice that a deviation is not discussed.
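For simple cases like this one, a numeric check could even be sketched in a few lines. The following is an illustrative sketch, not a Papercheck module: it extracts group sizes reported as “n = X” from the paragraph retrieved above and flags whether any group falls below the preregistered minimum of 10 individuals per species.
# illustrative sketch: extract "n = X" group sizes from the retrieved paragraph
reported_n <- regmatches(sample$text, gregexpr("n = \\d+", sample$text)) |>
  unlist() |>
  gsub(pattern = "n = ", replacement = "") |>
  as.integer()

# flag whether any group falls below the preregistered minimum of 10
any(reported_n < 10)
#> [1] TRUE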
Asking a Large Language Model to Compare the Paper and the Preregistration
Although Papercheck’s philosophy is that users should evaluate the information from automated checks, and that AI should be optional and never the default, it can be efficient to send the preregistered sample size and the text reported in the manuscript to a large language model and have it compare the preregistration with the text in the method section. This is more costly (both financially and ecologically), but it can work better: the researchers might not use words like ‘sample’ or ‘participants’, and an LLM provides more flexibility to match text across two documents.
Papercheck makes it easy to extract the method section in a paper:
method_section <- search_text(paper, pattern = "*", section = c("method"), return = "section")
We can send the method section to an LLM and ask which paragraph is most closely related to the text in the preregistration. Papercheck has a custom function to send text and a query to Groq. We use Groq because of its privacy policy: it will not retain data or train on data, which is important when sending text from scientific manuscripts that may be unpublished to an LLM. Furthermore, we use an open-source model (llama-3.3-70b-versatile).
<- "The following text is part of a scientific article. It describes a performed study. Part of this text should correspond to what researchers planned to do. Before data collection, the researchers stated they would:
query_template
%s
Your task is to retrieve the sentence(s) in the article that correspond to this plan, and evaluate based on the text in the manuscript whether researchers followed their plan with respect to the sample size. Start your answer with a 'The authors deviated from their preregistration' if there is any deviation."
# insert prereg text into template
<- sprintf(query_template, prereg_sample_size)
query
# combine all relevant paragraphs
<- paste(method_section$text, collapse = "\n\n")
text
# run query
<- llm(text, query, model = "llama-3.3-70b-versatile")
llm_response #> You have 499999 of 500000 requests left (reset in 172.799999ms) and 296612 of 300000 tokens left (reset in 677.6ms).
$answer |> cat("> ", x = _) llm_response
The authors deviated from their preregistration. The preregistered plan stated that they would test a minimum of 10 and a maximum of 15 individuals for each species. However, according to the text, they tested 10 ruffed lemurs, 10 Coquerel’s sifakas, 10 ring-tailed lemurs, and 9 mongoose lemurs. The number of mongoose lemurs (9) is below the minimum of 10 individuals planned for each species, indicating a deviation from the preregistered plan.
As we see, the LLM does a very good job of evaluating whether the authors adhered to their preregistration in terms of the sample size. The long-run performance of this automated evaluation needs to be validated in future research - this is just a proof of principle - but it has potential for editors who want to automatically check whether authors followed their preregistration, and for meta-scientists who want to examine preregistration adherence across a large number of papers. For such meta-scientific use cases, however, the code needs to be extensively validated and error rates should be acceptably low (i.e., comparable to human coders).
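Such a validation could start by comparing automated deviation flags against human coding for a set of papers. A purely hypothetical sketch with made-up data, just to illustrate the idea:
# hypothetical example data: deviation flags from the LLM and from human coders
automated_flag <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
human_flag     <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# proportion of papers where the automated check agrees with human coding
mean(automated_flag == human_flag)
#> [1] 0.8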
Automated Checks Can Be Wrong!
The use of AI to interpret deviations is convenient, but it cannot replace human judgment. The following article, Exploring the Facets of Emotional Episodic Memory: Remembering “What,” “When,” and “Which”, also has a preregistration. A large language model will incorrectly state that the authors deviated from their preregistration. It misses that the authors explicitly say that Cohort B was not preregistered, and that falling short of the planned sample size of 60 in that cohort should therefore not be seen as a deviation from the preregistration. All flagged deviations from a preregistration should be manually checked. Papercheck is only intended to make checks of a preregistration more efficient; in the end, people need to make the final judgment. The preregistered sample size statement is as follows:
paper <- psychsci$`0956797621991548`
links <- aspredicted_links(paper)
prereg <- aspredicted_retrieve(links)
#> Starting AsPredicted retrieval for 1 files...
#> * Retrieving info from https://aspredicted.org/p4ci6.pdf...
#> ...AsPredicted retrieval complete!

# sample size
prereg_sample_size <- unique(prereg$AP_sample_size)
prereg_sample_size |> cat("> ", x = _)
N = 60. Participants will be recruited from the undergraduate student population of the University of British Columbia, and will be compensated with course credit through the Human Subject Pool Sona system. All participants aged 18-35 will be eligible for participation, and must be fluent in English (to ensure instruction comprehension).
If we send the method section to an LLM and ask it to identify any deviations from the preregistration, we get the following response:
# LLM workflow - send potentially relevant paragraphs
method_section <- search_text(paper, pattern = "*", section = c("method"), return = "section")

# combine all relevant paragraphs
text <- paste(method_section$text, collapse = "\n\n")

query <- sprintf(query_template, prereg_sample_size)
llm_response <- llm(text, query, model = "llama-3.3-70b-versatile")
#> You have 499999 of 500000 requests left (reset in 172.799999ms) and 297698 of 300000 tokens left (reset in 460.4ms).

llm_response$answer |> cat("> ", x = _)
The authors deviated from their preregistration in terms of the sample size for cohort B. According to the preregistration, the researchers planned to collect data from 60 participants in each cohort. However, for cohort B, they were only able to collect data from 56 participants due to the interruption of data collection caused by the COVID-19 pandemic. The sentence that corresponds to the plan is: “Here, we sought to collect data from 60 participants in each cohort.”
Future Research
We believe automatically retrieving information about preregistrations has the potential to reduce the workload of peer reviewers, and it might function as a reminder to authors that they should discuss deviations from the preregistration. The extent to which this works out in practice should be investigated.
We have only focused on an automated check for the preregistered sample size. Other components of a preregistration, such as exclusion criteria or the planned analysis, are also important to check. It might be more difficult to create automated checks for these components, given the great flexibility in how statistical analyses in particular are reported. In an earlier paper we discussed the benefits of creating machine-readable hypothesis tests, and we argued that this should be considered the gold standard for a preregistration (Lakens and DeBruine 2021). Machine-readable hypothesis tests would allow researchers to automatically check whether preregistered analyses are corroborated or falsified. But we realize it will be some years before this becomes common practice.
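As a purely illustrative example of what such a machine-readable hypothesis test could look like (this format is hypothetical, not the specification from Lakens and DeBruine 2021), a preregistered test could be stored as structured data that software can later evaluate against the reported results:
# hypothetical structure for a machine-readable hypothesis test
hypothesis <- list(
  id        = "H1",
  test      = "t.test",
  variables = c("condition", "accuracy"),
  direction = "two.sided",
  criterion = "p < .05"
)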
There is a range of other improvements and extensions that should be developed, such as support for multi-study papers that contain multiple preregistrations, and extending this code to preregistrations on other platforms, such as the OSF. If you are interested in developing this Papercheck module further, or in performing such a validation study, do reach out to us.