Automatically Checking Journal Article Reporting Standards
The journal article reporting standards (JARS) by the American Psychological Association offer guidance on the information that should be reported in scientific articles to enhance their scientific rigour (Appelbaum et al. 2018). The guidelines for quantitative research are a set of excellent recommendations, and almost every published scientific article would be improved if researchers actually followed JARS.
However, as the guidelines are not well known, authors usually do not implement them, and reviewers do not check whether journal article reporting standards are followed. Furthermore, there are so many guidelines that it would take a lot of time to check them all manually. Automation can help increase awareness of JARS by systematically checking whether recommendations are followed and, if they are not, pointing out where improvements can be made. Below we will illustrate how two JARS guidelines can be automatically checked. There are dozens of other potential guidelines for which dedicated Papercheck modules could be created. Anyone who has created an R package will have the experience of running R CMD check, which automatically checks dozens of requirements that an R package must adhere to before it is allowed on CRAN. It should be possible to automatically check many of the JARS guidelines in a similar manner.
Exact p-values
The first reporting guideline we will illustrate is to report exact p-values. The APA Manual states:
Report exact p values (e.g., p = .031) to two or three decimal places. However, report p values less than .001 as p < .001.
Reporting p values precisely allows readers to include the test results in p-value meta-analytic tests, such as p-curve or z-curve (Simonsohn, Nelson, and Simmons 2014; Bartoš and Schimmack 2020), and makes it possible to check the internal coherence of the reported results with tools such as Statcheck (Nuijten et al. 2015). Papercheck has a dedicated module, “exact_p”, to identify the presence of imprecise p-values. We can run it on a single paper:
library(papercheck)
res_imprecise <- module_run(psychsci$`0956797617744542`, "exact_p")
res_imprecise
You may have reported some imprecise p-values
text | p_comp | p_value | section | div | p | s |
---|---|---|---|---|---|---|
p > .01 | > | 0.01 | method | 3 | 5 | 9 |
p < .01 | < | 0.01 | results | 13 | 1 | 3 |
p < .01 | < | 0.01 | results | 13 | 1 | 3 |
p < .01 | < | 0.01 | results | 13 | 1 | 4 |
p < .01 | < | 0.01 | results | 13 | 1 | 4 |
p < .01 | < | 0.01 | results | 14 | 1 | 7 |
p < .01 | < | 0.01 | results | 14 | 1 | 7 |
p < .05 | < | 0.05 | results | 14 | 1 | 7 |
p < .01 | < | 0.01 | results | 14 | 1 | 7 |
p < .01 | < | 0.01 | results | 15 | 1 | 4 |
p < .01 | < | 0.01 | results | 15 | 1 | 4 |
p < .01 | < | 0.01 | results | 16 | 1 | 2 |
p < .01 | < | 0.01 | results | 16 | 1 | 2 |
p < .01 | < | 0.01 | results | 16 | 1 | 2 |
p < .01 | < | 0.01 | results | 17 | 1 | 2 |
p < .05 | < | 0.05 | results | 17 | 2 | 1 |
Showing 16 of 16 rows
The module only returns the matched p values themselves, but the Papercheck software has several convenient functions to interact with the information in scientific manuscripts. The expand_text function can be used to retrieve the text around matches (for example, the previous and next sentence as well). Here we retrieve the full sentence, so that users can easily examine whether the reported p values should have been exact:
res_imprecise_expanded <- expand_text(
  results_table = res_imprecise,
  paper = psychsci$`0956797617744542`,
  expand_to = "sentence"
)
unique(res_imprecise_expanded$expanded)
[1] "We analyzed SNPs in Hardy-Weinberg equilibrium (p > .01)."
[2] "The increase in risk was modest: a standard-deviation decrease in the polygenic score was associated with a 20% to 30% greater risk of having been cautioned or convicted (E-Risk cohort: incidencerate ratio, or IRR = 1.33, 95% CI = [1.13, 1.55], p < .01; Dunedin cohort: IRR = 1.21, 95% CI = [1.09, 1.34], p < .01; Table 1)."
[3] "Effect sizes were similar across the two cohorts (Fig. 2) and across sex (males: IRR = 1.25, 95% CI = [1.13, 1.39], p < .01; females: IRR = 1.30, 95% CI = [1.08, 1.57], p < .01, with participants pooled across cohorts)."
[4] "Participants with lower polygenic scores for education were more likely to have grown up in socioeconomically deprived households (E-Risk cohort: r = .23, 95% CI = [.17, .29], p < .01; Dunedin cohort: r = .16, 95% CI = [.09, .22], p < .01) and with parents who displayed antisocial behavior (E-Risk cohort: r = .06, 95% CI = [.01, .11], p < .05; Dunedin cohort: r = .13, 95% CI = [.06, .19], p < .01)."
[5] "Participants with lower polygenic scores were more likely to leave school with poor educational qualifications in both the E-Risk cohort (polychoric r = .21, 95% CI = [.13, .28], p < .01) and the Dunedin cohort (polychoric r = .19, 95% CI = [.09, .29], p < .01)."
[6] "As children, participants with lower polygenic scores for educational attainment exhibited lower cognitive ability (E-Risk cohort: r = . in primary school (E-Risk cohort: r = .14, 95% CI = [.08, .20], p < .01; Dunedin cohort: r = .19, 95% CI = [.12, .25], p < .01), and in the Dunedin cohort, more truancy (E-Risk cohort: r = .08, 95% CI = [-.05, .20], p = .19; Dunedin cohort: r = .15, 95% CI = [.03, .28], p < .01)."
[7] "Survival analyses indicated that participants with lower education polygenic scores tended to get convicted earlier in life (hazard ratio = 1.25, 95% CI = [1.10, 1.42], p < .01; Fig. 4a)."
[8] "Results from multinomial regression models supported our hypothesis that participants with lower polygenic scores would be significantly more likely to belong to the life-course-persistent subtype than to the alwayslow-antisocial subtype (relative-risk ratio, or RRR = 1.36, 95% CI = [1.07, 1.73], p < .05)."
Luckily, there are also many papers that follow the JARS guideline and report all p values correctly, for example:
module_run(psychsci$`0956797616665351`, "exact_p")
All p-values were reported with standard precision
Reporting standardized effect sizes
A second JARS guideline that can be automatically checked is whether researchers report effect sizes alongside their test results. Each test (e.g., a t-test or F-test) should include the corresponding effect size (e.g., Cohen’s d or partial eta-squared). Based on a text search that uses regular expressions (regex), we can identify t-tests and F-tests that are not followed by an effect size and warn researchers accordingly.
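To make the idea concrete, here is a minimal, hypothetical sketch of such a regex search in base R. The patterns and example sentences below are simplified illustrations of the approach, not the patterns the effect_size module actually uses:

# Simplified, hypothetical sketch of the regex approach (not the actual
# patterns used by the effect_size module)
sentences <- c(
  "For nouns, higher accuracies, t(25) = 2.40, p = .024, were found.",
  "Accuracy was higher for words, t(25) = 4.99, p < .001, d = 0.98."
)

# Does a sentence contain a t-test, and does it also contain a
# standardized effect size (here only Cohen's d or g, or an eta-squared)?
has_ttest <- grepl("\\bt\\(\\d+(\\.\\d+)?\\)\\s*=", sentences, perl = TRUE)
has_es    <- grepl("\\b[dg]\\s*=|eta", sentences, perl = TRUE)

# Sentences that report a t-test without a standardized effect size
sentences[has_ttest & !has_es]

The effect_size module implements this kind of check. We can run it on a single paper: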
module_run(
  paper = psychsci$`0956797616657319`,
  module = "effect_size"
)
No effect sizes were detected for any t-tests or F-tests. The Journal Article Reporting Standards state effect sizes should be reported.
text | div | p | s | test | test_text |
---|---|---|---|---|---|
Participants were more accurate in correctly rejecting pseudowords (M = 98%) than correctly accep… | 7 | 1 | 2 | t-test | t(25) = 4.99; t(25) = 8.38 |
For nouns, higher accuracies, t(25) = 2.40, p = .024, and shorter RTs, t(25) = 9.57, p < .001, we… | 7 | 2 | 2 | t-test | t(25) = 2.40; t(25) = 9.57 |
For nonnouns, higher accuracies, t(25) = 2.13, p = .043, and shorter RTs, t(25) = 4.25, p < .001,… | 7 | 2 | 3 | t-test | t(25) = 2.13; t(25) = 4.25 |
Paired-samples t tests for the pseudoword items showed no significant effect of the case format o… | 7 | 2 | 4 | t-test | t(25) = 1.33 |
Repeated measures 2 (word type: nouns vs. nonnouns) × 2 (case format: uppercase vs. lowercase) an… | 7 | 2 | 1 | F-test | F(1, 25) = 7.28; F(1, 25) = 85.67 |
Additional ANOVA findings were that nonnouns elicited higher activation than nouns in a cluster l… | 8 | 1 | 10 | F-test | F(1, 75) = 20.94; F(1, 75) = 18.99 |
As in the analysis of the word items, we identified higher activation for uppercase compared with… | 8 | 2 | 2 | F-test | F(1, 25) = 20.24 |
Showing 7 of 7 rows
Checking Multiple Papers
You can also run modules for multiple papers at once and get a summary table.
module_run(psychsci[1:10], "effect_size")
Effect sizes were detected for some, but not all t-tests or F-tests. The Journal Article Reporting Standards state effect sizes should be reported.
id | ttests_n | ttests_with_es | ttests_without_es | Ftests_n | Ftests_with_es | Ftests_without_es |
---|---|---|---|---|---|---|
0956797613520608 | 0 | 0 | 0 | 5 | 5 | 0 |
0956797614522816 | 5 | 0 | 5 | 20 | 20 | 0 |
0956797614527830 | 0 | 0 | 0 | 0 | 0 | 0 |
0956797614557697 | 1 | 0 | 1 | 5 | 5 | 0 |
0956797614560771 | 2 | 2 | 0 | 0 | 0 | 0 |
0956797614566469 | 0 | 0 | 0 | 0 | 0 | 0 |
0956797615569001 | 2 | 1 | 1 | 0 | 0 | 0 |
0956797615569889 | 0 | 0 | 0 | 12 | 12 | 0 |
0956797615583071 | 10 | 6 | 4 | 4 | 2 | 2 |
0956797615588467 | 7 | 4 | 3 | 1 | 0 | 1 |
Showing 10 of 10 rows
This can be useful for meta-scientific research questions, such as whether the practice of reporting effect sizes alongside t-tests has increased over time. For the plot below, we ran the module on 1838 articles published in the journal Psychological Science between 2014 and 2024. Where a decade ago close to half of the reported t-tests were not followed by an effect size, this now holds for only around 25% of tests. Perhaps the introduction of a tool like Papercheck can reduce this percentage even further (although it does not need to be 0, as we discuss below).
Our main point is to demonstrate that it is relatively easy to answer some meta-scientific questions with Papercheck. Editors could easily replicate the plot below for articles in their own journal and see which practices they should improve. As we highlighted in the introductory blog post, when modules are used for metascience, they need to be validated and have low error rates. We have manually checked the t-tests and F-tests in 250 papers in Psychological Science, and our effect_size module detected 100% of t-tests with effect sizes, 99% of t-tests without effect sizes, 99% of F-tests with effect sizes, and 100% of F-tests without effect sizes. This is accurate enough for meta-scientific research.
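For readers who want to try a similar trend analysis, here is a minimal sketch. It assumes the per-paper summary returned by module_run is available as a data frame called es_summary with the columns shown above, and that a hypothetical lookup table paper_years maps each paper id to its publication year; neither object is created by this sketch:

library(dplyr)
library(ggplot2)

# Assumptions: `es_summary` has the per-paper columns shown above
# (id, ttests_n, ttests_without_es, ...), and `paper_years` is a
# hypothetical lookup table with columns id and year, e.g. built
# from the journal's metadata.
es_summary |>
  left_join(paper_years, by = "id") |>
  group_by(year) |>
  summarise(
    prop_t_without_es = sum(ttests_without_es) / pmax(sum(ttests_n), 1)
  ) |>
  ggplot(aes(x = year, y = prop_t_without_es)) +
  geom_line() +
  labs(
    x = "Publication year",
    y = "Proportion of t-tests without an effect size"
  )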
Improving the Modules
These two modules are relatively straightforward text searches that can identify places where researchers do not follow reporting guidelines. Still, these algorithms can be improved.
For example, the code to detect p values would not match p =< 0.01 due to the = symbol. This might look like an odd way to report a test result, but we have encountered it in practice. The module to detect effect sizes following t-tests only matches standardized effect sizes, but it is not always necessary to compute a standardized effect size. For example, if a future meta-analysis would be based on raw scores, and means and standard deviations are reported, a standardized effect size might not be needed. Alternatively, we might just accept a tool that has a relatively high Type 1 error rate when checking our manuscript. After all, a spellchecker has a high Type 1 error rate, underlining many names and abbreviations that are correct but that it does not recognize, yet most people use spellcheckers all the time, because the errors they successfully catch make it worthwhile to read over the false alarms and dismiss them. Despite the room for improvement, even these simple text searches can already identify places where published articles could have been improved by adding effect sizes.
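As an illustration of the first point, a simplified, hypothetical p-value pattern (not the regex Papercheck actually uses) could be extended to also accept the unusual =< comparator:

# Simplified, hypothetical p-value pattern that also accepts "=<"
p_pattern <- "p\\s*(=<|<=|>=|[<>=])\\s*\\.?\\d+"

grepl(p_pattern, c("p = .031", "p < .001", "p =< 0.01"), perl = TRUE)
#> [1] TRUE TRUE TRUE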
There are many more algorithms that can be added to detect other information that should be reported according to the JARS guidelines. If you would like to create and/or validate such a module, do reach out. We are happy to collaborate.