
Friday, June 20, 2025

Checking Data and Code in Repositories with Papercheck

In this blog post we will explain how Papercheck can automatically check the content of data repositories that are linked to in a scientific manuscript.

There is widespread support among scientists for open science practices, such as data and code sharing (Ferguson et al. 2023). There is increasing awareness of the importance of making data and code available alongside a publication, and sometimes this is even required by funders or journals. Because data and code sharing is a relatively new practice, and many scientists lack training in open science, it is common to see badly documented data repositories. Best practices exist, such as the TIER protocol, but not all researchers are aware of them. By automatically checking data repositories, Papercheck can increase awareness of best practices and make specific suggestions for improvements.

At a minimum, a data repository should contain a README file with instructions for how to reproduce the results. If data is shared, it should be stored in a ‘data’ folder, or at least have the word ‘data’ in the filename. Code or scripts should similarly be shared in a folder with that name, or at least with that word in the filename. Finally, if data is shared, there should be a codebook or data dictionary that explains which variables are in the dataset, so that others can re-use the data. Ideally, peer reviewers or editors would check the contents of a data repository. In practice, time constraints mean that data repositories are often not checked by any human. Automation can perform some of the checks that peers might otherwise perform manually. Here we demonstrate how Papercheck can check whether a README is present, whether data and/or code are shared, and whether there is a codebook.
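For example, a repository that follows these recommendations might be organized like this (a hypothetical layout, loosely inspired by the TIER protocol; the file and folder names are made up):

README.txt
data/
    study1_raw.csv
    study1_clean.csv
codebook/
    study1_codebook.csv
code/
    01_process_data.R
    02_analysis.R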

Checking an OSF repository with Papercheck

We will illustrate the process of checking a data repository by focusing on projects on the Open Science Framework. For this illustration we use an open access paper published in Psychological Science that has already been converted to a papercheck object using GROBID. There are 250 open access papers in the Papercheck object psychsci; we will choose one for this example.

# paper to use in this example
paper <- psychsci[[250]]

Set up OSF functions

You can only make 100 API requests per hour to the OSF, unless you authorise your requests, which allows you to make 10K requests per day. The OSF functions in papercheck often make several requests per URL to get all of the info, so it’s worthwhile setting up a personal access token (PAT) at https://osf.io/settings/tokens and including the following line in your .Renviron file (which you can open using usethis::edit_r_environ()):

OSF_PAT="replace-with-your-token-string"
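After restarting R, you can verify with base R that the token is picked up in your session:

# returns TRUE if the OSF_PAT environment variable is set and non-empty
nzchar(Sys.getenv("OSF_PAT"))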

The OSF API server is down regularly, so it is often good to check it before you run a batch of OSF functions; we provide the function osf_api_check() for this. When the server is down, each request can take several seconds to return an error, so if you are checking many URLs it can take a while to realise that the OSF is down.

osf_api_check()
#> [1] "ok"

Summarize Contents

The OSF allows you to assign a category to each component, and we can also determine file types from file extensions.
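To get this information for a specific paper, we first retrieve the OSF links it contains; below is a minimal sketch using the same papercheck functions that appear in the complete example at the end of this post.

# find OSF links in the paper and recursively retrieve information about
# all components and files they point to
links <- osf_links(paper)
info <- osf_retrieve(links, recursive = TRUE)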

osf_id name filetype
hv29w Kim,Doeller_Prereg_OSF.pdf text
2es6n suppleVideo1_learnSph.mp4 video
jpm5a suppleVideo2_learnPlane.mp4 video
aux7s suppleVideo3_objlocSph.mp4 video
nw3mc suppleVideo4_objlocPlane.mp4 video
ks639 suppleVideo5_triangleSph.mp4 video
y75nu suppleVideo6_trianglePlane.mp4 video
662145f3716cb7048fa45a58 virtualizerStudy1-main.zip archive
662146fa8df04804d3177e59 ReadMe.txt text
64f0aaf4d9f2c905a0d04821 main_analyseTriangleComple_20230423.m code
64f0aaf7f3dcd105dbddd396 main_simulate_objlocTraj.m code
64f0aaf9989de605c3dd152a supple_learningTrajectory.m code
6293d2d5b59d5f1df7720e41 poweranalysis_sph.R code
6614017dc053943058b4d41c supple_sphWithVariousRadius_clean.m code
6293d2d5b59d5f1df0720c66 main_analyseObjLocTest.m code
6293d26c86324127ca5b5862 suppleVideo2_learnPlane.mp4 video
6293d29eb7e8c726edc2dc38 suppleVideo3_objlocSph.mp4 video
6293d2a1bbdcde278b42696a suppleVideo6_trianglePlane.mp4 video
6293d271ddbe49279ba215f6 suppleVideo1_learnSph.mp4 video
6293d275bbdcde278f426b25 suppleVideo4_objlocPlane.mp4 video
6293d282ddbe49279aa21548 suppleVideo5_triangleSph.mp4 video
6293d1e8b59d5f1df7720d4c suppleMovie_legend.txt text
6293d310b7e8c726ecc2db7d sumDemograph.csv data
6293d30fb7e8c726ecc2db79 rawdata_plane_triangle.csv data
6293d30ebbdcde278f426c49 rawdata_sph_objlocTest.csv data
6293d310b59d5f1df3720cf4 rawdata_sph_triangle.csv data
661402d9e65c603b737d9c10 cleanData_combine.mat code
661402b4943bee32eadfebdd pilotData_triangle_combine_clean.csv data
6293d30d86324127d25b5c27 rawdata_plane_objlocIdentity.csv data
6293d30d86324127ce5b5a91 rawdata_sph_objlocIdentity.csv data
6293d30db7e8c726edc2dcb7 rawdata_plane_objlocTest.csv data
6339c121ec7f3f0704f5fbf0 Kim,Doeller_Prereg_OSF.pdf text

We can then use this information to determine, for each file, whether the file name contains text that makes it easy to tell what is being shared. A simple regular expression search for ‘README’, ‘codebook’, ‘script’, and ‘data’ (allowing for the various ways these words can be written) is used to automatically detect what is shared.
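As an illustration, the sketch below shows this kind of filename matching. The patterns are ours and purely illustrative; the actual regular expressions used by summarize_contents() may differ.

# illustrative filename classifier (hypothetical patterns)
guess_type <- function(filename) {
  dplyr::case_when(
    grepl("read[-_ ]?me", filename, ignore.case = TRUE) ~ "readme",
    grepl("code[-_ ]?book|data[-_ ]?dictionary", filename, ignore.case = TRUE) ~ "codebook",
    grepl("script|code", filename, ignore.case = TRUE) ~ "code",
    grepl("data", filename, ignore.case = TRUE) ~ "data",
    .default = NA_character_
  )
}
guess_type(c("ReadMe.txt", "rawdata_sph_triangle.csv"))
#> [1] "readme" "data"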

osf_files_summary <- summarize_contents(info)
name filetype best_guess
Kim,Doeller_Prereg_OSF.pdf text
suppleVideo1_learnSph.mp4 video
suppleVideo2_learnPlane.mp4 video
suppleVideo3_objlocSph.mp4 video
suppleVideo4_objlocPlane.mp4 video
suppleVideo5_triangleSph.mp4 video
suppleVideo6_trianglePlane.mp4 video
virtualizerStudy1-main.zip archive
ReadMe.txt text readme
main_analyseTriangleComple_20230423.m code code
main_simulate_objlocTraj.m code code
supple_learningTrajectory.m code code
poweranalysis_sph.R code code
supple_sphWithVariousRadius_clean.m code code
main_analyseObjLocTest.m code code
suppleVideo2_learnPlane.mp4 video
suppleVideo3_objlocSph.mp4 video
suppleVideo6_trianglePlane.mp4 video
suppleVideo1_learnSph.mp4 video
suppleVideo4_objlocPlane.mp4 video
suppleVideo5_triangleSph.mp4 video
suppleMovie_legend.txt text
sumDemograph.csv data data
rawdata_plane_triangle.csv data data
rawdata_sph_objlocTest.csv data data
rawdata_sph_triangle.csv data data
cleanData_combine.mat code code
pilotData_triangle_combine_clean.csv data data
rawdata_plane_objlocIdentity.csv data data
rawdata_sph_objlocIdentity.csv data data
rawdata_plane_objlocTest.csv data data
Kim,Doeller_Prereg_OSF.pdf text

Report Text

Finally, we print a report that communicates to the user - for example, a researcher preparing their manuscript for submission - whether there are suggestions to improve their data repository. We provide feedback about whether any of the four categories could be automatically detected, and if not, provide additional information about what would have made the automated tool recognize the files of interest. The output gives a detailed overview of the information it could not find, alongside a suggestion for how to learn more about best practices in this domain. If researchers use this Papercheck module before submission, they can improve the quality of their data repository in case any information is missing. Papercheck might miss data and code that are shared but not clearly named. This might be considered a Type 2 error by Papercheck, but our philosophy is that the report still correctly indicates that the data repository can be improved by naming folders and files more clearly; this would not only benefit automatic recognition of what is in a repository, but also help people find files in it.

osf_report <- function(summary) {
  files <- dplyr::filter(summary, osf_type == "files")
  data <- dplyr::filter(files, best_guess == "data") |> nrow()
  code <- dplyr::filter(files, best_guess == "code") |> nrow()
  codebook <- dplyr::filter(files, best_guess == "codebook") |> nrow()
  readme <- dplyr::filter(files, best_guess == "readme") |> nrow()
  
  traffic_light <- dplyr::case_when(
    data == 0 & code == 0 & readme == 0 ~ "red",
    data == 0 | code == 0 | readme == 0 ~ "yellow",
    data > 0 & code > 0 & readme > 0 ~ "green"
  )
  
  data_report <- dplyr::case_when(
    data == 0 ~ "\u26A0\uFE0F There was no data detected. Are you sure you cannot share any of the underlying data? If you did share the data, consider naming the file(s) or file folder with 'data'.",
    data > 0 ~ "\u2705 Data file(s) were detected. Great job making your research more transparent and reproducible!"
  )
  
  codebook_report <- dplyr::case_when(
    codebook == 0 ~ "\u26A0\uFE0F️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/",
    codebook > 0 ~ "\u2705 Codebook(s) were detected. Well done!"
  )
  
  code_report <- dplyr::case_when(
    code == 0 ~ "\u26A0\uFE0F️ No code files were found. Are you sure there is no code related to this manuscript? If you shared code, consider naming the file or file folder with 'code' or 'script'.",
    code > 0 ~ "\u2705 Code file(s) were detected. Great job making it easier to  reproduce your results!"
  )
  
  readme_report <- dplyr::case_when(
    readme == 0 ~ "\u26A0\uFE0F No README files were identified. A read me is best practice to facilitate re-use. If you have a README, please name it explicitly (e.g., README.txt or _readme.pdf).",
    readme > 0 ~ "\u2705 README detected. Great job making it easier to understand how to re-use files in your repository!"
  )
  
  report_message <- paste(
    readme_report,
    data_report, 
    codebook_report,
    code_report,
    "Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/",
    sep = "\n\n"
  )

  return(list(
    traffic_light = traffic_light,
    report = report_message
  ))
}
report <- osf_report(osf_files_summary) 

# print the report
module_report(report) |> cat()

✅ README detected. Great job making it easier to understand how to re-use files in your repository!

✅ Data file(s) were detected. Great job making your research more transparent and reproducible!

⚠️️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/

✅ Code file(s) were detected. Great job making it easier to reproduce your results!

Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/

Checking the Contents of Files

So far we have used Papercheck to automatically check whether certain types of files exist. But it is also possible to automatically download files, examine their contents, and provide feedback to users. This can be useful to examine datasets (e.g., do files contain IP addresses or other personal information), or code files. We will illustrate the latter by automatically checking the content of R scripts stored on the OSF, in repositories that are linked to in a scientific manuscript.
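For example, a deliberately simple scan for IPv4-like strings in a downloaded data file could look like the sketch below. This is not a papercheck function, and the pattern is naive (it will also flag things like version numbers), but it illustrates the kind of content check that is possible.

# flag lines of a text/CSV file that contain something resembling an IPv4 address
find_ip_like <- function(path) {
  lines <- readLines(path, warn = FALSE)
  grep("\\b(\\d{1,3}\\.){3}\\d{1,3}\\b", lines, perl = TRUE)
}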

We can check R files for good coding practices that improve reproducibility. We have created a check that examines 1) whether all libraries are loaded in one block, instead of throughout the R script, 2) whether relative paths (e.g., data <- read.csv(file='../data/data_study_1.csv')), which also work when someone runs the code on a different computer, are used instead of absolute paths (e.g., data <- read.csv(file='C:/data/data_study_1.csv')), which do not, and 3) whether information is provided about the software used (i.e., the R version), the versions of the packages that were used, and properties of the computer that the analyses were performed on. In R, this last piece of information can be obtained by running:

sessionInfo()
#> R version 4.4.2 (2024-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Amsterdam
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1        
#>  [4] dplyr_1.1.4           purrr_1.0.4           readr_2.1.5          
#>  [7] tidyr_1.3.1           tibble_3.3.0          ggplot2_3.5.2        
#> [10] tidyverse_2.0.0       papercheck_0.0.0.9045
#> 
#> loaded via a namespace (and not attached):
#>  [1] generics_0.1.4     stringi_1.8.7      httpcode_0.3.0     hms_1.1.3         
#>  [5] digest_0.6.37      magrittr_2.0.3     evaluate_1.0.3     grid_4.4.2        
#>  [9] timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
#> [13] crul_1.5.0         urltools_1.7.3.1   httr_1.4.7         scales_1.4.0      
#> [17] cli_3.6.5          rlang_1.1.6        triebeard_0.4.1    withr_3.0.2       
#> [21] cachem_1.1.0       yaml_2.3.10        tools_4.4.2        tzdb_0.5.0        
#> [25] memoise_2.0.1      curl_6.3.0         vctrs_0.6.5        R6_2.6.1          
#> [29] lifecycle_1.0.4    fs_1.6.6           htmlwidgets_1.6.4  pkgconfig_2.0.3   
#> [33] osfr_0.2.9         pillar_1.10.2      gtable_0.3.6       glue_1.8.0        
#> [37] Rcpp_1.0.14        xfun_0.52          tidyselect_1.2.1   rstudioapi_0.17.1 
#> [41] knitr_1.50         farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.29    
#> [45] compiler_4.4.2

As most scientists have never been explicitly taught how to code, it is common to see scripts that do not adhere to best coding practices. We are no exception ourselves (e.g., you will not find a sessionInfo.txt file in our repositories). Although code might be reproducible even if it takes time to figure out which package versions were used and which R version was used, and to change absolute paths, reproducibility is facilitated when best practices are followed. The whole point of automated checks is to let algorithms that capture expertise make recommendations that improve how we currently work.
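Saving this information alongside the code takes a single line at the end of an analysis script, for example:

# write the session info to a text file in the repository
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")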

check_r_files <- function(summary) {
  r_files <- summary |>
    dplyr::filter(osf_type == "files",
                  grepl("\\.R(md)?", name, ignore.case = TRUE)) |>
    dplyr::mutate(abs_report = NA, 
                  pkg_report = NA,
                  session_report = NA)
  
  report <- lapply(r_files$osf_id, \(id) {
    report <- dplyr::filter(r_files, osf_id == !!id)
    # Try downloading the R file
    file_url <- paste0("https://osf.io/download/", id)
    r_code <- tryCatch(
      readLines(url(file_url), warn = FALSE),
      error = function(e) return(NULL)
    )
    
    # skip files that could not be downloaded; NULL elements are dropped by bind_rows()
    if (is.null(r_code)) return(NULL)
    
    # absolute paths
    abs_path <- grep("[\"\']([A-Z]:|\\/|~)", r_code)
    report$abs <- dplyr::case_when(
      length(abs_path) == 0 ~ "\u2705 No absolute paths were detected",
      length(abs_path) > 0 ~ paste("\u274C Absolute paths found at lines: ",
                                   paste(abs_path, collapse = ", "))
    )
    
    # package loading
    pkg <- grep("\\b(library|require)\\(", r_code)
    report$pkg <- dplyr::case_when(
      length(pkg) == 0 ~ "\u26A0\uFE0F️ No packages are specified in this script.",
      length(pkg) == 1 ~ "\u2705 Packages are loaded in a single block.",
      all(diff(pkg) < 5) ~ "\u2705 Packages are loaded in a single block.",
      .default = paste(
        "\u274C Packages are loaded in multiple places: lines " ,
        paste(pkg, collapse = ", ")
      )
    )
    
    # session info 
    session <- grep("\\bsession_?[Ii]nfo\\(", r_code)
    report$session <- dplyr::case_when(
      length(session) == 0 ~ "\u274C️ No session info was found in this script.",
      length(session) > 0 ~ paste(
        "\u2705 Session info was found on line", 
        paste(session, collapse = ", "))
    )
    
    return(report)
  }) |>
    do.call(dplyr::bind_rows, args = _)
  
  return(report)
}
r_file_results <- check_r_files(osf_files_summary)
name report feedback
poweranalysis_sph.R abs ✅ No absolute paths were detected
poweranalysis_sph.R pkg ✅ Packages are loaded in a single block.
poweranalysis_sph.R session ❌️ No session info was found in this script.

Put it All Together

Let’s put everything together in one block of code, and perform all automated checks for another open access paper in Psychological Science.


osf_file_check <- function(paper) {
  links <- osf_links(paper)
  info <- osf_retrieve(links, recursive = TRUE)
  osf_files_summary <- summarize_contents(info)
  report <- osf_report(osf_files_summary)
  r_file_results <- check_r_files(osf_files_summary)  
  
  list(
    traffic_light = report$traffic_light,
    table = r_file_results,
    report = report$report,
    summary = osf_files_summary
  )
}
osf_file_check(psychsci$`0956797620955209`)
#> Starting OSF retrieval for 1 files...
#> * Retrieving info from k2dbf...
#> ...Main retrieval complete
#> Starting retrieval of children...
#> * Retrieving children for k2dbf...
#> * Retrieving files for k2dbf...
#> * Retrieving files for 5e344fb4f6631d013e5a48c9...
#> ...OSF retrieval complete!
#> $traffic_light
#> [1] "yellow"
#> 
#> $table
#> # A tibble: 2 × 31
#>   text  section header   div     p     s id    osf_id name  description osf_type
#>   <chr> <chr>   <chr>  <dbl> <dbl> <int> <chr> <chr>  <chr> <chr>       <chr>   
#> 1 <NA>  <NA>    <NA>      NA    NA    NA <NA>  5ecbe… SBVM… <NA>        files   
#> 2 <NA>  <NA>    <NA>      NA    NA    NA <NA>  5e344… SBVM… <NA>        files   
#> # ℹ 20 more variables: public <lgl>, category <chr>, registration <lgl>,
#> #   preprint <lgl>, parent <chr>, kind <chr>, size <int>, downloads <int>,
#> #   filetype <chr>, is_readme <lgl>, is_data <lgl>, is_code <lgl>,
#> #   is_codebook <lgl>, best_guess <chr>, abs_report <lgl>, pkg_report <lgl>,
#> #   session_report <lgl>, abs <chr>, pkg <chr>, session <chr>
#> 
#> $report
#> [1] "⚠️ No README files were identified. A read me is best practice to facilitate re-use. If you have a README, please name it explicitly (e.g., README.txt or _readme.pdf).\n\n✅ Data file(s) were detected. Great job making your research more transparent and reproducible!\n\n⚠️️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/\n\n✅ Code file(s) were detected. Great job making it easier to  reproduce your results!\n\nLearn more about reproducible data practices: https://www.projecttier.org/tier-protocol/"
#> 
#> $summary
#> # A tibble: 13 × 25
#>    text         section header    div     p     s id    osf_id name  description
#>    <chr>        <chr>   <chr>   <dbl> <dbl> <int> <chr> <chr>  <chr> <chr>      
#>  1 osf.io/k2dbf method  Analys…     8     1     1 0956… k2dbf  Prer… ""         
#>  2 osf.io/k2dbf funding Open P…    14     1     1 0956… k2dbf  Prer… ""         
#>  3 osf.io/k2dbf funding Open P…    14     2     1 0956… k2dbf  Prer… ""         
#>  4 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5f212… rawd… <NA>       
#>  5 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5ecbe… SBVM… <NA>       
#>  6 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5ecbe… SBVM… <NA>       
#>  7 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5f212… SBVM… <NA>       
#>  8 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5e344… SBVM… <NA>       
#>  9 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5e344… SBVM… <NA>       
#> 10 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5e344… IPA … <NA>       
#> 11 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5ecbe… SBVM… <NA>       
#> 12 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5b880… R Co… <NA>       
#> 13 <NA>         <NA>    <NA>       NA    NA    NA <NA>  5d760… Illn… <NA>       
#> # ℹ 15 more variables: osf_type <chr>, public <lgl>, category <chr>,
#> #   registration <lgl>, preprint <lgl>, parent <chr>, kind <chr>, size <int>,
#> #   downloads <int>, filetype <chr>, is_readme <lgl>, is_data <lgl>,
#> #   is_code <lgl>, is_codebook <lgl>, best_guess <chr>

Future Developments

We have demonstrated a workflow that can automatically check files stored on the Open Science Framework. Although this alpha version of Papercheck works, all the checks can be made more accurate and complete. At the same time, even the current version of Papercheck might already facilitate re-use by reminding researchers to include README files and to improve how files are named. There are many obvious ways to expand these automated checks. First, the example can be extended to other commonly used data repositories, such as GitHub and Dataverse. Second, the checks can be expanded beyond the properties that are automatically checked now. If you are an expert on code reproducibility or data re-use and would like to add checks, do reach out to us. Third, we can check for other types of files. For example, we are collaborating with Attila Simko, who is interested in identifying the files required to reproduce deep learning models in the medical imaging literature. We believe there will be many such field-dependent checks that can be automated, as the ability to automatically examine and/or retrieve files that are linked to in a paper should be useful for a wide range of use cases.

These examples were created using papercheck version 0.0.0.9045.

References

Ferguson, Joel, Rebecca Littman, Garret Christensen, Elizabeth Levy Paluck, Nicholas Swanson, Zenan Wang, Edward Miguel, David Birke, and John-Henry Pezzuto. 2023. “Survey of Open Science Practices and Attitudes in the Social Sciences.” Nature Communications 14 (11): 5401. https://doi.org/10.1038/s41467-023-41111-1.
