Rene Bekkers, 4 September 2025[*]
A dashboard of transparency indicators signaling trustworthiness
Our Research Transparency Check (Bekkers et al., 2025) rests on two pillars. The first pillar, which we blogged about previously, is the development of Papercheck, a collection of software applications that assess the transparency and methodological quality of research (DeBruine & Lakens, 2025). Our approach is modular: for each defined aspect of transparency and methodological quality we develop a dedicated module and integrate it into the Papercheck package. Each module assesses the presence, the level of detail and – if possible – the accuracy of the relevant information. Complete and accurate information for a large number of transparency indicators signals the trustworthiness of a research report (Jamieson et al., 2019; Nosek et al., 2024). On the dashboard, a transparency indicator lights up in bright green when a research report passes a specific check. An orange light indicates that some information is provided, but more detail is needed. Papercheck gives actionable feedback, suggesting ways to provide more detailed information and to correct inaccurate reporting.
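To make the dashboard logic concrete, here is a minimal sketch in R (the language Papercheck is written in) of how a module’s findings might be translated into a traffic-light status. The function and argument names are my own illustrations, not the actual Papercheck interface, and the colour for entirely missing information is an assumption.

```r
# Hypothetical mapping from a module's findings to a dashboard light.
# The arguments (found, detailed, accurate) are illustrative placeholders,
# not the actual Papercheck return format.
traffic_light <- function(found, detailed, accurate = NA) {
  if (!found) return("red")        # no information at all (colour is an assumption)
  if (detailed && (is.na(accurate) || accurate)) {
    return("green")                # complete and, where checked, accurate
  }
  "orange"                         # some information, but more detail needed
}

traffic_light(found = TRUE, detailed = FALSE)  # "orange"
traffic_light(found = TRUE, detailed = TRUE)   # "green"
```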
Respecting epistemic diversity
How do we decide which indicators we will develop an assessment module for? Choosing these indicators is not easy. This is where the second pillar comes in: a series of deliberative conversations with researchers from different disciplines. In the social and behavioral sciences, there is much epistemic diversity (Leonelli, 2022). Researchers working with different data and methods in different disciplines have very different ideas about what constitutes good research. They may even disagree about which aspects of research should count in the evaluation of quality. We designed Research Transparency Check to respect these differences. This means that we do not impose a set of good practices on researchers. It is not up to us to determine standards for good scientific practice. Instead, the trustworthiness of research should be evaluated with respect to “the prevailing methodological standards in their field” (De Ridder, 2022, p. 18). Therefore we start with a series of conversations among researchers who all work with the same types of data. In the social and behavioral sciences, researchers regularly use data collected from seven different sources: self-reports in surveys; personal interviews with individuals and (focus) groups; observations of behavior by researchers using equipment; participant observation by researchers; official registers; news and social media; and synthetic data. We expect that structured conversations with researchers using the same category of data will produce consensus about a core set of indicators that should be reported transparently.
The quality of surveys as an example
Take surveys, for example. Surveys are a ubiquitous source of data in the social and behavioral sciences: researchers in almost all disciplines use them. Regardless of their discipline, survey researchers have agreed for decades that it is important to know how the sample of participants was determined, what the researchers did to take selectivity in response rates and dropout into account, and how researchers made sure that the reliability and validity of the survey questions posed to respondents were high (Deming, 1944; Groves & Lyberg, 2010). Without information about the sampling frame, the sampling method, the response rate, and the reliability and validity of measures in the questionnaire, it is impossible to evaluate the quality of data from a survey. Still, a large proportion of survey-based research reports published in ‘top journals’ in the social sciences do not provide information on these transparency indicators (Stefkovics et al., 2024).
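As an illustration of what an automated screen for these four indicators could look like, here is a deliberately crude keyword check in R. The indicator names and regular expressions are placeholders of my own; the actual Papercheck modules rely on more sophisticated text extraction than this sketch.

```r
# Illustrative keyword screen for the four survey indicators named above.
# The patterns are placeholders, not the ones used by Papercheck.
survey_indicators <- c(
  sampling_frame  = "sampling frame",
  sampling_method = "probability sample|convenience sample|quota sample",
  response_rate   = "response rate",
  measurement     = "reliability|validity|Cronbach"
)

check_survey_reporting <- function(methods_text) {
  sapply(survey_indicators, grepl, x = methods_text, ignore.case = TRUE)
}

check_survey_reporting("We drew a probability sample; the response rate was 62%.")
# returns a named logical vector: sampling_frame FALSE, sampling_method TRUE,
# response_rate TRUE, measurement FALSE
```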
Despite consensus about these indicators, there may still be differences of opinion about the importance of other indicators. Political scientists, for instance, tend to care a lot about weighting the data, for example with respect to voter registration or voting behavior in the previous election. Personality psychologists do not value weights as much, because there are no objective standards for the true population distribution of, say, intelligence or neuroticism.
Even when researchers agree on the importance of a certain indicator, there may still be discipline-specific standards of good practice: researchers in different disciplines consider different practices acceptable for the same methodological quality indicator. For instance, standards for the number of items needed to compose reliable survey measures vary between disciplines. Surveys about intergenerational mobility in sociology typically ask just one or a few questions about educational attainment (Connelly et al., 2016); measuring implicit attitudes in social psychology requires dozens of repeated measures (Nosek et al., 2007). These differences may be understandable given that researchers in different fields study different phenomena, some of which are inherently more variable and more difficult to measure with high precision than others. Another example is the norm for p-values, which is .05 in most fields but much lower in others, such as 5 × 10⁻⁸ (0.00000005) in behavioral genetics (Benjamin et al., 2018). The point is that different fields set different standards for the same quality indicators, even when they work with similar data sources. Thus, it is important to use field-specific norms when evaluating the methodological quality of a study.
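A minimal sketch of what such field-specific norms could look like in an automated check, assuming a simple lookup table of significance thresholds (the two threshold values come from the examples above; the names and structure are hypothetical):

```r
# Hypothetical lookup table of field-specific significance thresholds.
# The two values come from the text; everything else is illustrative.
alpha_norms <- c(
  most_fields         = 0.05,
  behavioral_genetics = 5e-8
)

meets_field_norm <- function(p_value, field) {
  p_value < alpha_norms[[field]]
}

meets_field_norm(1e-4, "most_fields")          # TRUE under the .05 norm
meets_field_norm(1e-4, "behavioral_genetics")  # FALSE under the 5e-8 norm
```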
Toward reporting standards
Transparency is a necessary condition for evaluating research quality (Vazire, 2017; Hardwicke & Vazire, 2023): “transparency doesn’t guarantee credibility, transparency and scrutiny together guarantee that research gets the credibility it deserves” (Vazire, 2019). Only when research reports include information about indicators of methodological quality in sufficient detail and in clear language can the quality of the research be evaluated. In some fields, scholars, publishers, associations and funders have come together to define reporting standards. Authors who wish to publish a paper in a journal of the American Psychological Association are requested to conform to the APA Journal Article Reporting Standards. Funders and regulators in biomedicine impose reporting standards, for example the CONSORT guidelines for the reporting of randomized controlled trials, or the SPIRIT guidelines for their protocols. Automated checks such as those in Papercheck should not replace peer review, but they can relieve part of the burden on human reviewers of determining the degree of compliance with such reporting guidelines (Schulz et al., 2022).
In other fields, however, it’s almost as if anything goes. In most areas of the social sciences, journals do not impose reporting standards (Malički, Aalbersberg, Bouter & Ter Riet, 2019). They may have rules on the cosmetics of submitted articles, such as language, style, and the formatting of tables and figures, which editorial assistants enforce. But the way sampling frames, sampling methods, response rates, and the reliability and validity of study measures are reported is typically not subject to reporting standards. That should change if we want a more reliable and valid evaluation of research quality. It is also possible, since journals can simply mandate reporting standards (Malički & Mehmani, 2024).
The identification of transparency
indicators and the collection of examples of good and poor practices in data
communities will guide researchers in the social and behavioral sciences toward
precise and valid reporting standards. The field of biomedicine is ahead of the
social sciences, with more than 675 reporting guidelines developed for very
specific study types (Equator Network, 2025). As we develop modules to automate
checks of methodological quality in research reports, we benefit from the experiences
of toolmakers in biomedicine (Eckmann et al., 2025).
A multidimensional measure of research quality
With multiple modules assessing the methodological quality of research reports on various transparency indicators, we obtain a multidimensional and more refined measure of research quality. The modular approach helps solve a difficult problem in the Recognition and Rewards movement: the lack of consensus about valid and reliable measurement of the quality of science. In the absence of such a measure, universities have used “one size fits all” metrics of the volume and prestige of scientific publications.
Universities incentivized researchers to produce as many publications in peer-reviewed journals as possible, generally regarding these as proxy measures of ‘high quality’ science. Furthermore, the number of citations to the work of scholars became the standard measure of scholarly ‘impact’. Universities and science funders rewarded scholars who published prolifically and were cited frequently in international peer-reviewed journals by promoting them and giving them more research time and research grants. Institutions ranked journals into tiers, and rewarded employees more for publishing in ‘top journals’ than in ‘B-journals’.
As a result, these incentives reshaped scholarly behavior. Scholars created networks of co-authors, each producing an article in turn and inviting colleagues to read along and pretend they had helped produce the paper. In practice, the contributions were typically uneven, but the advantage was large: the number of co-authors on publications increased (Henriksen, 2016; Chapman et al., 2019), as did overall publication and citation counts. Scholars also sought to publish in journals that on average receive higher numbers of citations, the so-called ‘high-impact journals’. However, neither journal rank nor citation counts are correlated with higher methodological quality of research; in some cases the reverse is true, with worse science in the higher-ranked journals (Brembs, Button & Munafò, 2013; Dougherty & Horne, 2022).
Scholars behaved according to Campbell’s Law (Campbell, 1979): “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” The result is in line with Goodhart’s Law (Goodhart, 1975): “when a measure becomes a target, it ceases to be a good measure” (Varela, Benedetto, & Sanchez-Santos, 2014). Over time, citations have become a less informative indicator of research quality (Brembs, Button & Munafò, 2013; Koltun & Hafner, 2021).
What we’ve learned from the hyperfocus on peer-reviewed journal articles is that one size doesn’t fit all. Evaluating the quality of research by the number of articles published or the number of citations, and by derivatives such as the h-index or the journal impact factor, is not only misguided: it also creates perverse incentives and encourages questionable research practices (Higginson & Munafò, 2016; Smaldino & McElreath, 2016; Edwards & Roy, 2017).
Recognition and rewards for transparent and good science
Once the perverse effects of these incentives became clear, resistance to quantitative, output-driven rewards grew. More than 3,500 organizations, including the Association of Universities in the Netherlands (VSNU), the Netherlands Federation of University Medical Centers (NFU), the Netherlands Organisation for Scientific Research (NWO), the Netherlands Organisation for Health Research and Development (ZonMW), and the Royal Netherlands Academy of Arts and Sciences (KNAW), signed the San Francisco Declaration on Research Assessment (DORA, 2025), promising not to measure performance with quantitative indicators. In an effort to recognize and reward good science rather than a high volume of publications in peer-reviewed journals, universities around the world – and particularly those in the Netherlands – have diversified the criteria in their tenure and promotion guidelines, in line with the Agreement on Reforming Research Assessment of the Coalition for Advancing Research Assessment (COARA, 2022).
The problem that has remained unsolved is the measurement of research quality. In due course, Research Transparency Check may help to address this problem. For transparency indicators that data communities agree are relevant, we will have an automated screening tool that provides examples of best practices. Because the assessments can be updated with every revision, institutions can measure not only the eventual quality of a publication, but also the quality of the initial preprint and the change from the first draft to the published version. The added value of going through peer review can also be measured, incentivizing journals to provide better value for money. Journals could use Papercheck to ensure that authors adhere to journal reporting guidelines. On their end, authors can use Papercheck before they submit their manuscript to ensure that it passes the checks. At the same time, they are directed to the best practices in their field.
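A sketch of how such a comparison between versions could work, assuming each version of a paper yields a vector of passed and failed checks (the check names and results below are invented for illustration):

```r
# Invented check results for two versions of the same paper; in practice
# these would come from running the Papercheck modules on each version.
preprint  <- c(sampling_frame = FALSE, response_rate = FALSE, reliability = TRUE)
published <- c(sampling_frame = TRUE,  response_rate = FALSE, reliability = TRUE)

transparency_score <- function(checks) mean(checks)  # share of checks passed

transparency_score(published) - transparency_score(preprint)  # 0.33: gain between versions
```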
References
Bekkers, R., Lakens, D., DeBruine, L., Mesquida
Caldenty, C. & Littel, M. (2025). Research Transparency Check. TDCC-SSH
Challenge grant. Proposal: https://osf.io/cpv4d. Project: https://osf.io/z3tr9.
Benjamin, D.J., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
Brembs, B., Button, K., & Munafò, M. (2013). Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience, 7, 291. https://doi.org/10.3389/fnhum.2013.00291
Campbell, D.T. (1979). Assessing the impact
of planned social change. Evaluation and Program Planning, 2 (1): 67–90.
https://doi.org/10.1016/0149-7189(79)90048-X.
Chapman, C. A., Bicca-Marques, J. C.,
Calvignac-Spencer, S., Fan, P., Fashing, P. J., Gogarten, J., ... & Chr.
Stenseth, N. (2019). Games academics play and their consequences: how
authorship, h-index and journal impact factors are shaping the future of
academia. Proceedings of the Royal Society B, 286(1916), 20192047. http://dx.doi.org/10.1098/rspb.2019.2047
COARA (2022). Agreement on Reforming
Research Assessment. https://coara.org/wp-content/uploads/2022/09/2022_07_19_rra_agreement_final.pdf
Connelly, R., Gayle, V., & Lambert, P.
S. (2016). A review of educational attainment measures for social survey
research. Methodological Innovations, 9, https://doi.org/10.1177/2059799116638001
Deming, W.E. (1944). On Errors in Surveys. American Sociological Review, 9(4): 359–369. https://doi.org/10.2307/2085979
DeBruine, L., Lakens, D. (2025).
papercheck: Check Scientific Papers for Best Practices. R package version
0.0.0.9056, https://github.com/scienceverse/papercheck.
De Ridder, J. (2022). How to trust a
scientist. Studies in the History and Philosophy of Science, 93: 11-20. https://doi.org/10.1016/j.shpsa.2022.02.003
DORA (2025). 3,488 individuals and
organizations in 166 countries have signed DORA to date. https://sfdora.org/signers/?_signer_type=organisation
Dougherty, M. R., & Horne, Z. (2022).
Citation counts and journal impact factors do not capture some indicators of
research quality in the behavioural and brain sciences. Royal Society Open
Science, 9(8), 220334. https://doi.org/10.1098/rsos.220334
Eckmann, P. et al. (2025). Use as Directed?
A Comparison of Software Tools Intended to Check Rigor and Transparency of
Published Work. https://arxiv.org/pdf/2507.17991
Edwards, M. A., & Roy, S. (2017).
Academic research in the 21st century: Maintaining scientific integrity in a
climate of perverse incentives and hypercompetition. Environmental Engineering
Science, 34(1), 51-61. https://doi.org/10.1089/ees.2016.0223
Equator Network (2025). Reporting
Guidelines. https://www.equator-network.org/reporting-guidelines/
Fire, M., & Guestrin, C. (2019).
Over-optimization of academic publishing metrics: observing Goodhart's Law in
action. GigaScience, 8(6), giz053. https://doi.org/10.1093/gigascience/giz053
Goodhart, C. (1975). Problems of Monetary Management: The UK Experience. Papers in Monetary Economics, Vol. 1. Sydney: Reserve Bank of Australia. https://doi.org/10.1007/978-1-349-17295-5_4
Groves, R.M. & Lyberg, L. (2010). Total
Survey Error: Past, Present, And Future. Public Opinion Quarterly, 74
(5): 849–879. https://doi.org/10.1093/poq/nfq065
Hardwicke, T. E., & Vazire, S. (2023).
Transparency Is Now the Default at Psychological Science. Psychological
Science, 35(7), 708-711. https://doi.org/10.1177/09567976231221573
Henriksen, D. (2016). The rise in
co-authorship in the social sciences (1980–2013). Scientometrics 107,
455–476. https://doi.org/10.1007/s11192-016-1849-x
Higginson, A. D., & Munafò, M. R.
(2016). Current incentives for scientists lead to underpowered studies with
erroneous conclusions. PLoS Biology, 14(11), e2000995. https://doi.org/10.1371/journal.pbio.2000995
Jamieson, K. H., McNutt, M., Kiermer, V.,
& Sever, R. (2019). Signaling the trustworthiness of science. Proceedings
of the National Academy of Sciences, 116(39), 19231-19236. https://doi.org/10.1073/pnas.1913039116
Koltun, V., & Hafner, D. (2021). The
h-index is no longer an effective correlate of scientific reputation. PLoS ONE 16(6): e0253397. https://doi.org/10.1371/journal.pone.0253397
Leonelli, S. (2022). Open science and
epistemic diversity: friends or foes? Philosophy of Science, 89(5),
991-1001. https://doi.org/10.1017/psa.2022.45
Malički, M., Aalbersberg, I. J., Bouter,
L., & Ter Riet, G. (2019). Journals’ instructions to authors: A
cross-sectional study across scientific disciplines. PLoS One, 14(9),
e0222157. https://doi.org/10.1371/journal.pone.0222157
Malički, M., & Mehmani, B. (2024).
Structured peer review: pilot results from 23 Elsevier journals. PeerJ, 12, e17514. https://doi.org/10.7717/peerj.17514
Nosek, B. A.,
Smyth, F. L., Hansen, J. J., Devos, T., Lindner, N. M., Ranganath, K. A., ... & Banaji, M. R. (2007). Pervasiveness and correlates of implicit
attitudes and stereotypes. European Review of Social Psychology, 18(1),
36-88. https://doi.org/10.1080/10463280701489053
Nosek, B. A., Allison, D., Jamieson, K. H.,
McNutt, M., Nielsen, A. B., & Wolf, S. M. (2024, December 23). A Framework
for Assessing the Trustworthiness of Research Findings. https://doi.org/10.31222/osf.io/jw6fz
Schulz, R.,
Barnett, A., Bernard, R. et al. (2022). Is the
future of peer review automated? BMC Research Notes, 15, 203. https://doi.org/10.1186/s13104-022-06080-6
Smaldino, P. E., & McElreath, R.
(2016). The natural selection of bad science. Royal Society Open Science,
3(9), 160384. https://doi.org/10.1098/rsos.160384
Stefkovics, A., Eichhorst, A., Skinnion, D.
& Harrison, C.H. (2024). Are We Becoming More Transparent? Survey Reporting
Trends in Top Journals of Social Sciences. International Journal of Public
Opinion Research, 36, edae013. https://doi.org/10.1093/ijpor/edae013
Varela, D., Benedetto, G., Sanchez-Santos,
J.M. (2014). Editorial statement: Lessons from Goodhart's law for the
management of the journal. European Journal of Government and Economics,
3 (2): 100–103. https://doi.org/10.17979/ejge.2014.3.2.4299
Vazire, S. (2017). Quality Uncertainty
Erodes Trust in Science. Collabra:
Psychology, 3(1),
1. https://doi.org/10.1525/collabra.74
Vazire, S. (2019). Do We Want to Be
Credible or Incredible? Psychological Science website, December 23, 2019. https://www.psychologicalscience.org/observer/do-we-want-to-be-credible-or-incredible
[*] Thanks to Mario Malički, Gerben ter Riet, Lex Bouter, IJsbrand Jan
Aalbersberg, Cristian Mesquida Caldenty, Daniël Lakens and Jakub Werner for
suggestions.