A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, September 13, 2025

A Modular Approach to Research Quality

Rene Bekkers, 4 September 2025[*]

 

A dashboard of transparency indicators signaling trustworthiness

Our Research Transparency Check (Bekkers et al., 2025) rests on two pillars. The first pillar, which we blogged about previously, is the development of Papercheck, a collection of software applications that assess the transparency and methodological quality of research (DeBruine & Lakens, 2025). Our approach is modular: for each defined aspect of transparency and methodological quality we develop a dedicated module and integrate it into the Papercheck package. Each module assesses the presence, the level of detail and – if possible – the accuracy of information. Complete and accurate information for a large number of transparency indicators signals the trustworthiness of a research report (Jamieson et al., 2019; Nosek et al., 2024). On the dashboard, a transparency indicator lights up in bright green when a research report passes a specific check. An orange light indicates that some information is provided, but more detail is needed. Papercheck gives actionable feedback, suggesting ways to provide more detailed information and correct inaccurate reporting.
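To make the modular idea concrete, the snippet below is a minimal sketch of what a single module could look like: it screens the text of a paper for one indicator and returns a traffic-light status plus actionable feedback. The function name, the keyword patterns and the return structure are illustrative assumptions for this post, not the actual Papercheck API.

```r
# Minimal sketch of a hypothetical transparency module (illustrative only;
# not the actual Papercheck API). A module inspects the text of a paper and
# returns a traffic-light status plus actionable feedback.
check_sample_size_reporting <- function(paper_text) {
  # Crude presence check: does the paper report a sample size at all?
  mentions_n  <- grepl("\\b[Nn]\\s*=\\s*[0-9]+", paper_text)
  # Does it also justify that sample size (e.g., a power analysis)?
  justifies_n <- grepl("power analysis|sample size justification",
                       paper_text, ignore.case = TRUE)

  status <- if (mentions_n && justifies_n) {
    "green"   # complete: reported and justified
  } else if (mentions_n) {
    "orange"  # partial: reported, but more detail is needed
  } else {
    "red"     # missing
  }

  list(
    module   = "sample_size_reporting",
    status   = status,
    feedback = switch(status,
      green  = "Sample size is reported and justified.",
      orange = "Sample size is reported; please add a justification (e.g., a power analysis).",
      red    = "No sample size found; please report N and how it was determined."
    )
  )
}

# A dashboard is then simply the collection of module results for one paper:
# lapply(list_of_modules, function(m) m(paper_text))
```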

 

Respecting epistemic diversity

How do we decide for which indicators we will develop a module that assesses research reports? Choosing these indicators is not easy. This is where the second pillar comes in: a series of deliberative conversations with researchers from different disciplines. In the social and behavioral sciences, there is much epistemic diversity (Leonelli, 2022). Researchers working with different data and methods in different disciplines have very different ideas about what constitutes good research. They may even disagree about which aspects of research should count in the evaluation of quality. We designed Research Transparency Check to respect these differences. This means that we do not impose a set of good practices on researchers; it is not up to us to determine standards for good scientific practice. Instead, the trustworthiness of research should be evaluated with respect to “the prevailing methodological standards in their field” (De Ridder, 2022, p. 18). Therefore we start with a series of conversations among researchers who all work with the same types of data. In the social and behavioral sciences, researchers regularly use data collected from seven different sources: self-reports in surveys, personal interviews of individuals and (focus) groups, observations of behavior by researchers through equipment, participant observation by researchers, official registers, news and social media, and synthetic data. We expect that structured conversations with researchers using the same category of data will produce consensus about a core set of indicators that should be transparent.

 

The quality of surveys as an example

Think about surveys, for example. Surveys are a ubiquitous source of data in the social and behavioral sciences: researchers in almost all disciplines use them. Regardless of their discipline, survey researchers have agreed for decades that it is important to know how the sample of participants was determined, what the researchers did to take selectivity in response rates and dropout into account, and how researchers made sure that the reliability and validity of the survey questions posed to respondents were high (Deming, 1944; Groves & Lyberg, 2010). Without information about the sampling frame, the sampling method, the response rate, and the reliability and validity of measures in the questionnaire, it is impossible to evaluate the quality of data from a survey. Still, a large proportion of survey-based research reports published in ‘top journals’ in the social sciences do not provide information on these transparency indicators (Stefkovics et al., 2024).
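As a rough illustration of how such a check could be automated, the sketch below screens a methods section for each of the four indicators just mentioned. The keyword lists and the function are illustrative assumptions made for this example, not a validated instrument and not part of Papercheck.

```r
# Illustrative presence check for the four survey transparency indicators
# discussed above (keyword lists are assumptions for illustration only).
survey_indicators <- list(
  sampling_frame  = c("sampling frame", "population register", "panel"),
  sampling_method = c("probability sample", "random sample",
                      "convenience sample", "quota sample"),
  response_rate   = c("response rate", "AAPOR"),
  reliability     = c("Cronbach", "alpha", "omega", "test-retest", "validity")
)

check_survey_reporting <- function(methods_text, indicators = survey_indicators) {
  # For each indicator, report whether any of its keywords appears in the text
  sapply(indicators, function(keywords) {
    any(sapply(keywords, grepl, x = methods_text, ignore.case = TRUE))
  })
}

# Example:
# methods <- "We drew a probability sample from a population register; the
#             response rate was 54% and Cronbach's alpha was .81."
# check_survey_reporting(methods)
# #> sampling_frame sampling_method   response_rate     reliability
# #>           TRUE            TRUE            TRUE            TRUE
```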

Despite consensus about these indicators, there may still be differences of opinion about the importance of other indicators. Political scientists, for instance, tend to care a lot about weighting the data, for example with respect to voter registration or voting behavior in the previous election. Personality psychologists do not value weights as much, because there are no objective benchmarks for the true distribution of, say, intelligence or neuroticism in the population.

When researchers agree on the importance of a certain indicator, there may still be discipline-specific standards of good practice: researchers in different disciplines regard different practices as good practice for the same methodological quality indicator. For instance, standards for the number of items needed to compose reliable measures in surveys vary between disciplines. Surveys about intergenerational mobility in sociology typically ask just one or a few questions about educational attainment (Connelly et al., 2016); measuring implicit attitudes in social psychology requires dozens of repeated measures (Nosek et al., 2007). These differences are understandable, given that researchers in different fields study phenomena that are inherently more variable and more difficult to measure with high precision in some fields than in others. Another example is the norm for p-values, which is .05 in most fields but much lower in others, such as 0.00000005 (5 × 10⁻⁸) in behavioral genetics (Benjamin et al., 2018). The point is that different fields set different standards for the same quality indicators, even when they are working with similar data sources. Thus, it is important to use field-specific norms when evaluating the methodological quality of a study.
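The practical consequence for automated checks is that norms need to be parameterized by field. A minimal sketch, assuming a simple lookup table (the structure and function below are hypothetical; only the two thresholds are taken from the text):

```r
# Hypothetical field-specific norms for the significance threshold; the two
# values are the examples given in the text, the lookup structure is an
# assumption made for this sketch.
field_alpha <- c(
  "default"             = 0.05,
  "behavioral_genetics" = 5e-8   # genome-wide significance threshold
)

evaluate_p_value <- function(p, field = "default", norms = field_alpha) {
  threshold <- norms[[field]]
  list(field = field, threshold = threshold, below_threshold = p < threshold)
}

# evaluate_p_value(1e-4, field = "default")$below_threshold              # TRUE
# evaluate_p_value(1e-4, field = "behavioral_genetics")$below_threshold  # FALSE
```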

 

Toward reporting standards

Transparency is a necessary condition for evaluating research quality (Vazire, 2017; Hardwicke & Vazire, 2023): “transparency doesn’t guarantee credibility, transparency and scrutiny together guarantee that research gets the credibility it deserves” (Vazire, 2019). Only when research reports include information about indicators of methodological quality in sufficient detail and clear language can the quality of the research be evaluated. In some fields, scholars, publishers, associations and funders have come together to define reporting standards. Authors who wish to publish a paper in a journal of the American Psychological Association are requested to conform to the APA Journal Article Reporting Standards. Funders and regulators in biomedicine impose reporting standards, for example the CONSORT guidelines on the reporting of randomized controlled trials, or the SPIRIT guidelines for their protocols. Automated checks such as those in Papercheck should not replace peer review, but help relieve the burden on human reviewers to determine the degree of compliance with such reporting guidelines (Schulz et al., 2022).

In other fields, however, it is almost as if anything goes. In most areas of the social sciences, journals do not impose reporting standards (Malički, Aalbersberg, Bouter & Ter Riet, 2019). They may have rules on the cosmetics of submitted journal articles, such as on language, style, and the formatting of tables and figures, which editorial assistants enforce. But the reporting of sampling frames, sampling methods, response rates, and the reliability and validity of study measures is typically not subject to such standards. That should change if we want a more reliable and valid evaluation of research quality, and it is feasible, since journals can simply mandate reporting standards (Malički & Mehmani, 2024).

The identification of transparency indicators and the collection of examples of good and poor practices in data communities will guide researchers in the social and behavioral sciences toward precise and valid reporting standards. The field of biomedicine is ahead of the social sciences, with more than 675 reporting guidelines developed for very specific study types (Equator Network, 2025). As we develop modules to automate checks of methodological quality in research reports, we benefit from the experiences of toolmakers in biomedicine (Eckmann et al., 2025).

A multidimensional measure of research quality

With multiple modules assessing the methodological quality of research reports for various transparency indicators, we obtain a multidimensional and more refined measure of research quality. The modular approach helps solve a difficult problem in the Recognition and Rewards movement: the lack of consensus about valid and reliable measurement of the quality of science. In the absence of such a measure, universities have used “one size fits all” metrics of the volume and prestige of science publications.

Universities incentivized researchers to produce as many publications in peer-reviewed journals as possible, generally regarding them as proxy measures of ‘high quality’ science. Furthermore, the number of citations to the work of scholars became the standard measure of scholarly ‘impact’. Universities and science funders rewarded scholars who published prolifically and were cited more frequently in international peer-reviewed journals by promoting them, giving them more research time, and awarding them research grants. Institutions ranked journals into tiers, and rewarded employees more for publishing in ‘top journals’ than in ‘B-journals’.

As a result, these incentives reshaped scholarly behavior. Scholars created networks of co-authors, each producing an article in turn and inviting colleagues to read along and pretend they helped produce the paper. In practice, the contributions were typically uneven, but the advantage was large: the number of co-authors on publications increased (Henriksen, 2016; Chapman et al., 2019), as did overall publication and citation counts. Scholars also sought to publish in journals that on average receive higher numbers of citations, so-called ‘high-impact journals’. However, neither journal rank nor citation counts are correlated with higher methodological quality of research; in some cases the reverse is true, with worse science in higher-ranked journals (Brembs, Button & Munafò, 2013; Dougherty & Horne, 2022).

Scholars behaved according to Campbell’s Law (Campbell, 1979): “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor”, with a result in line with Goodhart’s Law (Goodhart, 1975): “when a measure becomes a target, it ceases to be a good measure” (Varela, Benedetto, & Sanchez-Santos, 2014; see also Fire & Guestrin, 2019). Over time, citations have become a less informative indicator of research quality (Brembs, Button & Munafò, 2013; Koltun & Hafner, 2021).

What we have learned from the hyperfocus on peer-reviewed journal articles is that one size does not fit all. Evaluating the quality of research by the number of articles published, the number of citations, or derivatives such as the h-index or the journal impact factor is not only misguided; it also creates perverse incentives and encourages questionable research practices (Higginson & Munafò, 2016; Smaldino & McElreath, 2016; Edwards & Roy, 2017).

 

Recognition and rewards for transparent and good science

Once the perverse effects of these incentives became clear, resistance against quantitative, output-driven rewards grew. More than 3,500 organizations, including the Association of Universities in the Netherlands (VSNU), the Netherlands Federation of University Medical Centers (NFU), the Netherlands Organisation for Scientific Research (NWO), the Netherlands Organisation for Health Research and Development (ZonMW), and the Royal Netherlands Academy of Arts and Sciences (KNAW), signed the San Francisco Declaration on Research Assessment (DORA, 2025), promising not to evaluate research performance with such quantitative indicators. In the effort to recognize and reward good science rather than a high volume of publications in peer-reviewed journals, universities around the world – and particularly those in the Netherlands – have diversified the criteria in their tenure and promotion guidelines, in line with the Agreement on Reforming Research Assessment of the Coalition for Advancing Research Assessment (COARA, 2022).

The problem that has remained unsolved is the measurement of research quality. In due course, Research Transparency Check may help to address this problem. For transparency indicators that data communities agree on as relevant, we will have an automated screening tool that provides examples of best practices. Because the assessments can be updated with every revision, institutions can measure not only the eventual quality of a publication, but also the quality of an initial preprint and the change from the first draft to the published version. The added value of going through peer review can also be measured, incentivizing journals to provide better value for money. Journals could use Papercheck to ensure that authors adhere to journal reporting guidelines. On their end, authors can use Papercheck before they submit their manuscript to ensure that it passes the checks. At the same time, they are directed to the best practices in their field.
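As a hypothetical illustration of measuring change across versions, the same set of checks could be run on the preprint and on the published version, and the two dashboards compared. The helper below assumes the toy module interface sketched earlier in this post; it is not part of the Papercheck package.

```r
# Hypothetical sketch: run the same modules on two versions of a manuscript
# and summarize what changed between them (assumes the toy module interface
# sketched above; not part of the Papercheck package).
compare_versions <- function(modules, preprint_text, published_text) {
  before <- sapply(modules, function(m) m(preprint_text)$status)
  after  <- sapply(modules, function(m) m(published_text)$status)
  data.frame(
    module    = names(modules),
    preprint  = before,
    published = after,
    improved  = (before != "green") & (after == "green"),
    row.names = NULL
  )
}

# Example, reusing the toy module from above:
# modules <- list(sample_size = check_sample_size_reporting)
# compare_versions(modules, preprint_text, published_text)
```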

 

References

Bekkers, R., Lakens, D., DeBruine, L., Mesquida Caldenty, C. & Littel, M. (2025). Research Transparency Check. TDCC-SSH Challenge grant. Proposal: https://osf.io/cpv4d. Project: https://osf.io/z3tr9.

Benjamin, D. J., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z

Brembs, B., Button, K., & Munafò, M. (2013). Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience, 7, 291. https://doi.org/10.3389/fnhum.2013.00291

Campbell, D.T. (1979). Assessing the impact of planned social change. Evaluation and Program Planning, 2 (1): 67–90. https://doi.org/10.1016/0149-7189(79)90048-X.

Chapman, C. A., Bicca-Marques, J. C., Calvignac-Spencer, S., Fan, P., Fashing, P. J., Gogarten, J., ... & Chr. Stenseth, N. (2019). Games academics play and their consequences: how authorship, h-index and journal impact factors are shaping the future of academia. Proceedings of the Royal Society B, 286(1916), 20192047. http://dx.doi.org/10.1098/rspb.2019.2047

COARA (2022). Agreement on Reforming Research Assessment. https://coara.org/wp-content/uploads/2022/09/2022_07_19_rra_agreement_final.pdf

Connelly, R., Gayle, V., & Lambert, P. S. (2016). A review of educational attainment measures for social survey research. Methodological Innovations, 9, https://doi.org/10.1177/2059799116638001

Deming, W. E. (1944). On Errors in Surveys. American Sociological Review, 9(4): 359-369. https://doi.org/10.2307/2085979

DeBruine, L., Lakens, D. (2025). papercheck: Check Scientific Papers for Best Practices. R package version 0.0.0.9056, https://github.com/scienceverse/papercheck.

De Ridder, J. (2022). How to trust a scientist. Studies in the History and Philosophy of Science, 93: 11-20. https://doi.org/10.1016/j.shpsa.2022.02.003

DORA (2025). 3,488 individuals and organizations in 166 countries have signed DORA to date. https://sfdora.org/signers/?_signer_type=organisation

Dougherty, M. R., & Horne, Z. (2022). Citation counts and journal impact factors do not capture some indicators of research quality in the behavioural and brain sciences. Royal Society Open Science, 9(8), 220334. https://doi.org/10.1098/rsos.220334

Eckmann, P. et al. (2025). Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work. https://arxiv.org/pdf/2507.17991

Edwards, M. A., & Roy, S. (2017). Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1), 51-61. https://doi.org/10.1089/ees.2016.0223

Equator Network (2025). Reporting Guidelines. https://www.equator-network.org/reporting-guidelines/

Fire, M., & Guestrin, C. (2019). Over-optimization of academic publishing metrics: observing Goodhart's Law in action. GigaScience, 8(6), giz053. https://doi.org/10.1093/gigascience/giz053

Goodhart, C. (1975). Problems of Monetary Management: The UK Experience. Papers in Monetary Economics, Vol. 1, p. 1-20. Sydney: Reserve Bank of Australia. https://doi.org/10.1007/978-1-349-17295-5_4

Groves, R.M. & Lyberg, L. (2010). Total Survey Error: Past, Present, And Future. Public Opinion Quarterly, 74 (5): 849–879. https://doi.org/10.1093/poq/nfq065  

Hardwicke, T. E., & Vazire, S. (2023). Transparency Is Now the Default at Psychological Science. Psychological Science, 35(7), 708-711. https://doi.org/10.1177/09567976231221573

Henriksen, D. (2016). The rise in co-authorship in the social sciences (1980–2013). Scientometrics 107, 455–476. https://doi.org/10.1007/s11192-016-1849-x

Higginson, A. D., & Munafò, M. R. (2016). Current incentives for scientists lead to underpowered studies with erroneous conclusions. PLoS Biology, 14(11), e2000995. https://doi.org/10.1371/journal.pbio.2000995

Jamieson, K. H., McNutt, M., Kiermer, V., & Sever, R. (2019). Signaling the trustworthiness of science. Proceedings of the National Academy of Sciences, 116(39), 19231-19236. https://doi.org/10.1073/pnas.1913039116

Koltun, V., & Hafner, D. (2021). The h-index is no longer an effective correlate of scientific reputation. PLoS ONE 16(6): e0253397. https://doi.org/10.1371/journal.pone.0253397

Leonelli, S. (2022). Open science and epistemic diversity: friends or foes? Philosophy of Science, 89(5), 991-1001. https://doi.org/10.1017/psa.2022.45

Malički, M., Aalbersberg, I. J., Bouter, L., & Ter Riet, G. (2019). Journals’ instructions to authors: A cross-sectional study across scientific disciplines. PLoS One, 14(9), e0222157. https://doi.org/10.1371/journal.pone.0222157

Malički, M., & Mehmani, B. (2024). Structured peer review: pilot results from 23 Elsevier journals. PeerJ, 12, e17514. https://doi.org/10.7717/peerj.17514

Nosek, B. A., Smyth, F. L., Hansen, J. J., Devos, T., Lindner, N. M., Ranganath, K. A., ... & Banaji, M. R. (2007). Pervasiveness and correlates of implicit attitudes and stereotypes. European Review of Social Psychology, 18(1), 36-88. https://doi.org/10.1080/10463280701489053

Nosek, B. A., Allison, D., Jamieson, K. H., McNutt, M., Nielsen, A. B., & Wolf, S. M. (2024, December 23). A Framework for Assessing the Trustworthiness of Research Findings. https://doi.org/10.31222/osf.io/jw6fz

Schulz, R., Barnett, A., Bernard, R. et al. (2022). Is the future of peer review automated? BMC Research Notes, 15, 203. https://doi.org/10.1186/s13104-022-06080-6

Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 160384. https://doi.org/10.1098/rsos.160384

Stefkovics, A., Eichhorst, A., Skinnion, D. & Harrison, C.H. (2024). Are We Becoming More Transparent? Survey Reporting Trends in Top Journals of Social Sciences. International Journal of Public Opinion Research, 36, edae013. https://doi.org/10.1093/ijpor/edae013

Varela, D., Benedetto, G., Sanchez-Santos, J.M. (2014). Editorial statement: Lessons from Goodhart's law for the management of the journal. European Journal of Government and Economics, 3 (2): 100–103. https://doi.org/10.17979/ejge.2014.3.2.4299

Vazire, S. (2017). Quality Uncertainty Erodes Trust in Science. Collabra: Psychology, 3(1), 1. https://doi.org/10.1525/collabra.74

Vazire, S. (2019). Do We Want to Be Credible or Incredible? Psychological Science website, December 23, 2019. https://www.psychologicalscience.org/observer/do-we-want-to-be-credible-or-incredible



[*] Thanks to Mario Malički, Gerben ter Riet, Lex Bouter, IJsbrand Jan Aalbersberg, Cristian Mesquida Caldenty, Daniël Lakens and Jakub Werner for suggestions.
