With my collaborators, I am increasingly performing qualitative research. I find qualitative research projects a useful way to improve my understanding of behaviors that I want to explore with other methods in the future. For example, some years ago I performed qualitative interviews with researchers who believed their own research had no value whatsoever. Although I never intended to publish these interviews, they provided important insights for other projects I am engaged in now. I was involved in qualitative research on the assessment process of interdisciplinary research (Schölvinck et al., 2024), and we performed interviews to understand how researchers interpret a questionnaire we were developing that measures personal values in science (Kis et al., 2025). Together with Anna van 't Veer I supervised Julia Weschenfelder, who interviewed scientists about what they believed the value of their research was, and I have since hired Julia as a PhD student as part of a large project on the meaningful interpretation of effect sizes. She is planning interviews with researchers about what determines the maximum sample size they are willing to collect (if you want to be interviewed about this, reach out!). Sajedeh Rasti, who is completing her PhD in my lab, and I have spent the last two years interviewing people who played important roles in the creation of large-scale coordinated research projects in science (Rasti et al., 2025).
As a supervisor, I am always very actively involved in research projects: I joined as many of the (extremely interesting) interviews Sajedeh performed as I could, and I listened to the audio recordings of all interviews Julia performed to give my own interpretation of what the scientists discussed. Yet it never occurred to me to independently perform the thematic analysis for these interviews and compare the themes we derived. I became aware of this peculiarity after reading a great qualitative paper analyzing open questions in a study on questionable research practices (Makel et al., 2025). In this paper, two teams independently analyze themes in the same set of open-ended responses. They largely find the same themes, and conclude: “our two independent analysis teams reported themes that were generally similar or overlapping, suggesting a robustness of the findings. We believe this suggests that independent qualitative analyst teams with similar positionality can use unique analytic paths and reach largely similar destinations. This contributes to the ongoing conversation within the qualitative research community about whether reproducibility and replicability are relevant or possible in qualitative research”.
Research on inter-rater reliability of thematic coding
I looked into the literature for papers similar to Makel et al. (2025), where the same qualitative data is analyzed by multiple coders to examine how reproducible the identified themes are. There are many more papers than the few I will list here, and someone should write a paper summarizing this literature. But this is a Sunday morning blog post, not a systematic review, so I will just present some papers that I found interesting.
Armstrong and colleagues had six researchers independently analyze the same single focus‑group transcript, and found “close agreement on the basic themes,” but substantial divergence in how those themes were interpreted and organized, with each analyst having “‘packaged’ the themes differently” (Armstrong et al., 1997). The authors then argue that these differences demonstrate the inherent subjectivity of qualitative research. But this is not the message I take away from this paper at all. Coders of any type of data will differ slightly in the details they highlight. What matters most in this project is not that researchers highlight different details in their verbal summaries of the themes – that is to be expected, but also largely irrelevant – but that there is such clear agreement on the themes identified. Reading the examples in the paper, the differences are mainly a matter of granularity: some coders summarize the themes at a higher level, others at a more detailed level. Those who summarize the themes at a detailed level will of course pick out specific details others did not mention, but this can easily be improved by instructing coders about the level of detail at which results should be reported, and how details should be chosen in verbal summaries.
Campbell and colleagues (2013) provide a great example of the work needed to reach high inter-rater reliability. They say that “Reliability is just as important for qualitative research as it is for quantitative research” and argue that replicability problems stem less from disagreement over themes per se than from “the unitization problem—that is, identifying appropriate blocks of text for a particular code or codes,” which can “wreak havoc when researchers try to establish intercoder reliability for in-depth semistructured interviews.” Using their own empirical coding exercise, they show that even with a detailed codebook, intercoder reliability remained modest (“54 percent reliability on average”), reflecting the interpretive complexity of semistructured interviews where “more than one theme and therefore more than one code may be applicable at once.” However, when disagreements were resolved through discussion, intercoder agreement rose dramatically to 96%, and 91% of initial disagreements were resolved. In my view, qualitative research is nothing special in this respect, as it is often difficult to achieve high inter-rater reliability in any coding project. For example, we have the same difficulty in reaching agreement when we code what the main hypothesis in a scientific paper is in metascientific research projects (Mesquida et al., 2025).
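Since Campbell and colleagues report agreement percentages, it may help to see how such numbers are computed. Below is a minimal Python sketch (the segments and code labels are hypothetical, and the use of scikit-learn is my own choice, not theirs) contrasting raw percent agreement with Cohen’s kappa, which corrects for chance agreement:

```python
# A minimal sketch: two coders assign one code per text segment,
# and we quantify how often they agree. All data are hypothetical.
from sklearn.metrics import cohen_kappa_score

coder_a = ["funding", "time", "values", "funding", "career", "values"]
coder_b = ["funding", "values", "values", "funding", "career", "time"]

# Raw percent agreement: the proportion of segments with identical codes.
agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

# Cohen's kappa corrects percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(coder_a, coder_b)

print(f"Percent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```

On these made-up data, raw agreement (67%) looks more flattering than kappa (0.54), which is one reason chance-corrected statistics are usually preferred when reporting intercoder reliability.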
The importance and challenges of clear coding instructions related to text segmentation are also discussed in a classic paper on the reliability of qualitative research (MacQueen et al., 1998). Based on extensive experience with qualitative research at the Centers for Disease Control and Prevention, MacQueen and colleagues offer useful suggestions on creating a codebook that will lead to high reliability. Through an iterative process in which multiple coders independently code the same text, compare results, and revise definitions, they show how disagreement over themes is an indication of ambiguous codes, and not an inherent limitation of qualitative research. Reproducibility can be achieved by repeatedly checking whether coders can “apply the codes in a consistent manner” and refining the codebook until agreement is acceptable. The authors argue that we should not expect coders to naturally “see” the same themes, but that coders can code the same themes reliably if researchers use a disciplined, transparent, and collective codebook development process that supports reproducible qualitative analysis without denying its interpretive character.
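One step of the iterative process MacQueen and colleagues describe can be made concrete in a few lines of code. The sketch below, with hypothetical segments and codes, flags the codes on which two coders disagree most often; in their process, these are the code definitions to clarify before the next coding round:

```python
# A hedged sketch of one step in an iterative codebook process:
# identify the codes with the lowest agreement as candidates for
# clearer definitions. Segments and codes are hypothetical.
from collections import defaultdict

# Each entry: (segment_id, code assigned by coder A, code assigned by coder B)
double_coded = [
    (1, "barriers", "barriers"),
    (2, "incentives", "barriers"),
    (3, "incentives", "incentives"),
    (4, "norms", "incentives"),
    (5, "norms", "norms"),
]

# Track, per code, how often the two coders agreed whenever either applied it.
stats = defaultdict(lambda: {"used": 0, "agreed": 0})
for _, code_a, code_b in double_coded:
    for code in {code_a, code_b}:
        stats[code]["used"] += 1
        stats[code]["agreed"] += code_a == code_b

# Codes with low agreement are ambiguous: revise their definitions,
# recode the same text, and repeat until agreement is acceptable.
for code, s in sorted(stats.items(), key=lambda kv: kv[1]["agreed"] / kv[1]["used"]):
    print(f"{code}: agreement {s['agreed']}/{s['used']}")
```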
A realist ontology in qualitative research
The idea of reliability, or reproducibility (the two concepts become intertwined in a lovely way in qualitative research, as the coder is the measurement device, so to speak), emerges most naturally from a scientific realist perspective on knowledge generation. There are qualitative researchers who adopt philosophical perspectives from which they argue that reliability and reproducibility are not relevant for qualitative research. I have engaged with these ideas a lot over the years, and find them unconvincing. Seale and Silverman (1997) push back strongly against the idea that reliability does not apply in qualitative research, and write: “We believe that such a position can amount to methodological anarchy and resist this on 2 grounds. First, it simply makes no sense to argue that all knowledge and feelings are of equal weight and value. Even in everyday life, we readily sort fact from fancy. Why, therefore, should science be any different? Second, methodological anarchy offers a clearly negative message to the audiences of qualitative health research, suggesting that its proponents have given up claims to validity.”
I am personally more sympathetic to the view expressed by Popay and colleagues (1998): “On one side, there are those who argue that there is nothing unique about qualitative research and that traditional definitions of reliability, validity, objectivity, and generalizability apply across both qualitative and quantitative approaches. On the other side, there are those postmodernists who contend that there can be no criteria for judging qualitative research outcomes (Fuchs, 1993). In this radical relativist position, all criteria are doubtful and none can be privileged. However, both of these positions are unsatisfactory. The second is nihilistic and precludes any distinction based on systematic or other criteria. If the first is adopted, then, at best, qualitative research will always be seen as inferior to quantitative research. At worst, there is a danger that poor-quality qualitative research, which meets criteria inappropriate for the assessment of such evidence, will be privileged.” There are unique aspects of reliability in qualitative research, but qualitative research will not be taken seriously by the majority of scientists if researchers do not engage with reliability at all.
O’Connor and Joffe provide a useful guide on how to achieve intercoder reliability in qualitative research in psychology, based on their own extensive experience (O’Connor & Joffe, 2020). They argue that “ICR helps qualitative research achieve this communicative function by showing the basic analytic structure has meaning that extends beyond an individual researcher. The logic is that if separate individuals converge on the same interpretation of the data, it implies ‘that the patterns in the latent content must be fairly robust and that if the readers themselves were to code the same content, they too would make the same judgments’ (Potter & Levine-Donnerstein, 1999, p. 266).” They highlight that there are external incentives to care about reliability (it can function as a signal of quality), but also that it has direct benefits for the researchers performing the qualitative research.
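In practice, reliability checks often involve more than two coders, and some segments may be coded by only a subset of them. One statistic that handles both situations is Krippendorff’s alpha. Below is a minimal sketch, assuming the third-party krippendorff Python package (pip install krippendorff) and hypothetical integer-coded data; the coders, segments, and codes are all made up for illustration:

```python
# A minimal sketch of Krippendorff's alpha for three coders, assuming
# the third-party 'krippendorff' package. Rows are coders, columns are
# text segments; codes are mapped to integers, and np.nan marks segments
# a coder did not code. All data are hypothetical.
import numpy as np
import krippendorff

reliability_data = [
    [1, 2, 2, 1, np.nan, 3],  # coder 1
    [1, 2, 2, 2, 3, 3],       # coder 2
    [np.nan, 2, 2, 1, 3, 3],  # coder 3
]

# Thematic codes are unordered categories, so use the nominal level.
alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    level_of_measurement="nominal",
)
print(f"Krippendorff's alpha: {alpha:.2f}")
```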
If you feel similarly, and want to educate your students about qualitative methods in which reliability and reproducibility are important, there is a nice paper by Lourie and McPhail (2024) that can be used to introduce students to realist ontologies in qualitative research. They note how the methodology literature in qualitative research often engages with interpretivist-constructivist approaches. Among my own collaborators, that perspective is not seen as appealing, and we prefer to build our qualitative research on a realist ontology, in which inter-rater reliability and reproducibility are important aspects of knowledge generation (Seale, 1999). A good textbook from this perspective is Applied Thematic Analysis (Guest et al., 2012).
I am very grateful to Makel and colleagues for making me realize I had overlooked the importance of independently coding themes, and of establishing inter-rater reliability or reproducibility, in qualitative research.
Personal take-home messages
After reflecting on this topic, these are the points I am taking away from the work on the reliability and reproducibility of thematic coding.
First, independent thematic analysis, and comparing how we code qualitative data, should become standard practice in the qualitative studies in my lab. The suggestions by MacQueen and colleagues provide useful guidance that we should follow. There is nothing special about qualitative data sources in this respect.
Second, we already ask all participants whether we can share full transcripts, and many agree. As we primarily interview scientists, we are in the lucky position that they value transparency and data sharing, and that the content of our interviews is not particularly sensitive. Data sharing is of course not always possible. For example, in my interviews on why researchers felt their own research lacked any value whatsoever, many interviewees continued to receive funding for this research, and they would not want their actual thoughts about their work to become public. But where possible, we should share transcripts for independent re-analysis and re-use.
Third, we should use the same techniques to increase the reliability of our claims in qualitative research as we do in quantitative research. I have recently been annoyed by some extremely biased qualitative studies in metascience, where the researchers who performed the work clearly wanted it to lead to a very specific outcome. It is easy to tell the story you want in qualitative research if you reject the idea of reliability. But in my lab, we use methods that prevent us from concluding what we want to be true when we are wrong. In my current research project, I reserved 8000 euro out of the 1.5 million euro budget to hire ‘red teams’ (Lakens, 2020) to criticize the studies before we perform them. However, I had planned to only use red teams for large quantitative studies. I now think that I should also use the red teams for the qualitative studies I planned in the proposal, to make sure the coding of themes is reliable.
References
Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The Place of Inter-Rater Reliability in Qualitative Research: An Empirical Study. Sociology, 31(3), 597–606. https://doi.org/10.1177/0038038597031003015
Campbell, J. L., Quincy, C., Osserman, J., & Pedersen, O. K. (2013). Coding In-depth Semistructured Interviews: Problems of Unitization and Intercoder Reliability and Agreement. Sociological Methods & Research, 42(3), 294–320. https://doi.org/10.1177/0049124113500475
Guest, G., MacQueen, K. M., & Namey, E. (2012). Applied Thematic Analysis. SAGE Publications.
Kis, A., Tur, E. M., Vaesen, K., Houkes, W., & Lakens, D. (2025). Academic research values: Conceptualization and initial steps of scale development. PLOS ONE, 20(3), e0318086. https://doi.org/10.1371/journal.pone.0318086
Lakens, D. (2020). Pandemic researchers—Recruit your own best critics. Nature, 581(7807). https://doi.org/10.1038/d41586-020-01392-8
Lourie, M., & McPhail, G. (2024). A Realist Conceptual Methodology for Qualitative Educational Research: A Modest Proposal. New Zealand Journal of Educational Studies, 59(2), 393–407. https://doi.org/10.1007/s40841-024-00344-4
MacQueen, K. M., McLellan, E., Kay, K., & Milstein, B. (1998). Codebook Development for Team-Based Qualitative Analysis. CAM Journal, 10(2), 31–36. https://doi.org/10.1177/1525822X980100020301
Makel, M. C., Caroleo, S. A., Meyer, M., Pei, M. A., Fleming, J. I., Hodges, J., Cook, B., & Plucker, J. (2025). Qualitative Analysis of Open-ended Responses from Education Researchers on Questionable and Open Research Practices. OSF. https://doi.org/10.35542/osf.io/n2gby
Mesquida, C., Murphy, J., Warne, J., & Lakens, D. (2025). On the replicability of sports and exercise science research: Assessing the prevalence of publication bias and studies with underpowered designs by a z-curve analysis. SportRxiv. https://doi.org/10.51224/SRXIV.534
O’Connor, C., & Joffe, H. (2020). Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines. International Journal of Qualitative Methods, 19, 1609406919899220. https://doi.org/10.1177/1609406919899220
Popay, J., Rogers, A., & Williams, G. (1998). Rationale and standards for the systematic review of qualitative literature in health services research. Qualitative Health Research, 8(3), 341–351. https://doi.org/10.1177/104973239800800305
Rasti, S., Vaesen, K., & Lakens, D. (2025). A Framework for Describing the Levels of Scientific Coordination. OSF. https://doi.org/10.31234/osf.io/eq269_v1
Schölvinck, A.-F., Uygun-Tunç, D., Lakens, D., Vaesen, K., & Hessels, L. K. (2024). How qualitative criteria can improve the assessment process of interdisciplinary research proposals. Research Evaluation, 33, rvae049. https://doi.org/10.1093/reseval/rvae049
Seale, C. (1999). Quality in Qualitative Research. Qualitative Inquiry, 5(4), 465–478. https://doi.org/10.1177/107780049900500402