The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, May 25, 2026

Evaluating Dr. Cuddy’s Claim that the Debunking of Power Posing is a Myth

In this blog post I will analyse the arguments that Dr. Amy Cuddy provided in a LinkedIn post “The "Power Posing Was Debunked" Myth: What the Research Actually Shows — and Why Scientific Discourse Matters” on February 26. You can find the LinkedIn post here:

https://www.linkedin.com/pulse/power-posing-debunked-myth-what-research-actually-shows-amy-cuddy-t6lnc

In the post, Cuddy says she was “effectively silenced” by an “attempt to shut down this line” of research. She credits “the courage of the individual scientists who kept going despite enormous pressure not to” for the fact that she can still summarize “what the evidence now shows”.

Power posing has two categories of claimed effects. The first effect is on self-reported feelings. For example, if we instruct people to stand in a constricted versus an expanded posture, they will self-report feeling more powerful. There is an ongoing debate about whether, or how much, this effect is caused by a demand effect (i.e., people report what they think the investigator wants them to say, not what they actually feel). A meta-analysis has shown this self-report effect is larger in within-subject designs, and in studies without a cover story (Körner et al., 2022). The second effect is on physiological or behavioral outcomes. This is the contested area, and the research outcome that Cuddy is mainly trying to defend in her blog post. If you want to explore a meta-analysis on these two categories of effects, you can do so at https://metaanalyses.shinyapps.io/bodypositions/ (made by Körner et al., 2022). I would especially recommend exploring the QRP/Publication bias tab for the physiological and behavioral outcomes.

At the end of the post, Cuddy writes that she is thankful that not everyone stopped doing research on power poses, because then: “We would not know what we now know — which is that these effects are real, that they matter, and that the story people were told was wrong.”  She concludes with: “The evidence is there. It has been there for years. All I am asking is that people look at it.”

I am happy to do so. Let’s go.

Trying to find the references

I tried to look up the references cited by Cuddy in her post. However, this reference:

Andolfi, V. R., & Antonietti, A. (2020). Contractive vs. expansive body posture effects on convergent-integrative thinking tasks. Journal of Creative Behavior, 54(4), 871–880.

does not exist in literature databases, and the authors (who do exist) do not list this paper on their own websites. An inspection of the journal’s website shows that a different article was published in volume 54, issue 4 on these pages. This raises questions about how this reference was generated, with generation by AI being a plausible candidate (also in view of the 4 malformed references I will point out below). The reference appears in the following sentence in Cuddy’s LinkedIn post:

Andolfi and Antonietti (2020, Journal of Creative Behavior) provided further evidence that contractive postures specifically benefited convergent-integrative thinking tasks. That level of specificity — where the direction of the effect depends on the type of cognitive task — is exactly the kind of finding that emerges when a field matures.

When Cuddy says ‘The evidence is there’, this is not correct for the Andolfi and Antonietti article, which does not seem to exist in the scholarly record.

There are an additional 4 references that suggest that the literature review may have in part been generated by automated tools, but for these 4 references, there are papers that match the content discussed in the literature review in the LinkedIn post.

 

Reference in LinkedIn post: Michinov, E., & Michinov, N. (2020). Creativity connected with body posture: The effects of expansive and contractive postures on creative performance. Psychology of Aesthetics, Creativity, and the Arts, 14(1), 116–127
Actual reference: Michinov, N., & Michinov, E. (2022). Do open or closed postures boost creative performance? The effects of postural feedback on divergent and convergent thinking. Psychology of Aesthetics, Creativity, and the Arts, 16(3), 504–518. https://doi.org/10.1037/aca0000306

Reference in LinkedIn post: Wainio-Theberge, S., Bhatt, M., Bhattacharyya, K., et al. (2025). Neural correlates of power-related postures and their behavioural consequences: A preliminary electrophysiological investigation. Social Cognitive and Affective Neuroscience, 20(1), nsaf03
Actual reference: Wainio-Theberge, S., & Armony, J. L. (2025). Neural correlates of power-related postures and their behavioural consequences: A preliminary electrophysiological investigation. Social Cognitive and Affective Neuroscience, 20(1), nsaf036. https://doi.org/10.1093/scan/nsaf036

Reference in LinkedIn post: Elkjær, E., Mikkelsen, M. B., Michalak, J., Mennin, D. S., & O'Toole, M. S. (2023). Using bodily displays to facilitate approach action outcomes within the context of a personally relevant task. Frontiers in Psychology, 14, 1147printing
Actual reference: Elkjær, E., Mikkelsen, M. B., Tramm, G., Michalak, J., Mennin, D. S., & O’Toole, M. S. (2022). Using bodily displays to facilitating approach action outcomes within the context of a personally relevant task. Brain and Behavior, 13(1), e2855. https://doi.org/10.1002/brb3.2855

Reference in LinkedIn post: Körner, R., Köhler, H., & Schütz, A. (2020). Powerful and confident children through expansive body postures? A preregistered test of the effects of power posing on children. School Psychology International, 41(4), 315–330.
Actual reference: Körner, R., Köhler, H., & Schütz, A. (2020). Powerful and confident children through expansive body postures? A preregistered study of fourth graders. School Psychology International, 41(4), 315–330. https://doi.org/10.1177/0143034320912306

 

 

We see that all of these incorrect references refer to the later literature and summarize the research of the people who ‘kept going’. These references are at the core of the argument Cuddy is making.

Evaluating the evidence: Three examples

Cuddy wrote a narrative review, which requires that the validity of the conclusions and the strength of the evidence be evaluated for every study. Let’s carefully examine some of the papers she cited and evaluate the evidence. Cuddy writes about a first study:

Wainio-Theberge and colleagues (2025, Social Cognitive and Affective Neuroscience) published the first EEG study of power posing, finding significant effects on arousal and valence, with suggestive differences in frontal brain activity between expansive and contractive postures. A new neural methodology for a question people said was already answered.

From the description in Cuddy’s blog, you might assume the “significant effects on arousal and valence, with suggestive differences in frontal brain activity between expansive and contractive postures” would support the hypothesis. But this is not the case. The significant effects were actually in the opposite direction of the hypothesis. This is not mentioned in the abstract of the Wainio-Theberge et al. article, and one would need to read the paper to get this information:

We found no significant posture differences in the EEG spectral exponent (t(101) = 1.01, P = .32). In contrast, a significant posture effect was observed for frontal asymmetry (t(101) = −2.63, P = .01); however, post hoc t-tests in each group separately (‘Models 1c and 1e’) revealed that the effect was in the opposite direction as hypothesized (see Discussion). Namely, we observed a significant right-lateralized frontal alpha asymmetry (FAA) in the contractive group (t(45) = 2.17, P = .04) and a left-lateralized FAA in the expansive one which failed to reach significance (t(55) = −1.63, P = .11).

Cuddy writes in the blog that she responded to journalists skeptical about power posing: “I spent more than ten hours responding — reviewing the literature, pulling citations, writing carefully, anticipating distortions”. In this case, her review of the literature presented a finding as providing support for power posing, when in fact the effect was in the opposite direction of the hypothesis.

As a second paper, let’s take Barel and colleagues (2024). First, I want to thank the authors for sharing their data, after I tried to access it by clicking the Google Drive link in the article. All numbers were reproducible. Cuddy cites the paper as follows:

“As other researchers began testing that broader construct, using different measures in different populations, they found effects consistently: action orientation (Huang et al., 2011, Psychological Science), […], and risk-taking itself, partially (Barel et al., 2024, BMC Psychology).”

It is unclear what is meant by 'partially', as the authors are clear that they found that power posing did not affect risk-taking: "There was no statistically significant distribution in risk-taking between high and low power conditions [χ2 = 0.00, p > 0.99]." The risk-taking outcome that Cuddy cites the study for is a clear null result.

The basis for "partially" is presumably a separate analysis reported in the paper: in a logistic regression predicting risk-taking, the authors found a significant interaction between power condition and cortisol change and write that they "did partially replicate an effect of changes in cortisol levels on risk-taking." But note that they claim an effect of cortisol changes on risk, not of power posing on risk. For power posing to affect risk through cortisol, power posing would first have to change cortisol, and it did not: the authors report no main effects of time or power on cortisol. With that first link missing, the high-power participants whose cortisol fell are not a subgroup of people for whom the power pose worked, as their cortisol would have moved the same way without any pose. The significant effect is a within-group association between two measures, which can't be attributed to the power posing manipulation.

To their credit, the authors themselves never claim power posing affected risk-taking. This framing comes from Cuddy, who presents the paper as a partial replication of a risk-taking effect after a power posing manipulation, which the study did not support.

When discussing a third paper, Cuddy writes: “Körner, Köhler, and Schütz (2020, School Psychology International) conducted a preregistered study of 108 German fourth graders — children — and found that expansive postures increased self-esteem, positive feelings, feelings of power, and even children's perceptions of their relationship with their teacher. The strongest effects were on school-related self-esteem. This is exactly the kind of applied, developmentally informed research that matters — taking findings from the lab and asking whether they help real children in real classrooms.”

The Körner et al. study was preregistered (https://aspredicted.org/blind.php?x=sn4su9) with 4 t-tests to examine 4 dependent variables of interest. Of the 4 tests, 2 are significant (p = 0.04 and p = 0.013), but neither survives a correction for multiple comparisons (0.05/4 = 0.0125), which was necessary in this analysis.
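The arithmetic behind this correction can be checked in a few lines. The sketch below is only an illustration: it applies a Bonferroni threshold to the two significant p-values reported above, and the function name is made up for this example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test alpha level that keeps the familywise error rate at alpha."""
    return alpha / n_tests

alpha = 0.05                 # familywise alpha level
n_tests = 4                  # four preregistered t-tests
reported_p = [0.04, 0.013]   # the two significant p-values reported above

threshold = bonferroni_threshold(alpha, n_tests)
for p in reported_p:
    verdict = "survives" if p <= threshold else "does not survive"
    print(f"p = {p}: {verdict} the corrected threshold of {threshold}")
```

Both p-values are above the corrected threshold, although p = 0.013 misses it only narrowly.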

The blog by Cuddy states “The strongest effects were on school-related self-esteem.” But the biggest effect is actually on the student-teacher relationship:

Finally, there was a significant difference between the two groups regarding the pictures related to the student–teacher relationship: high power posers more frequently chose the picture showing a good student–teacher relationship than low power posers, Χ²(1) = 11.181, p = .001, φ = –.322.

But there is a problem with this finding. Students spent months building a relationship with their teacher. Then, as part of the experiment, the students posed for 60 seconds and self-reported on that relationship, without any further interaction with the teacher.  There is no possible causal mechanism for the power pose to impact the relationship with teachers. Although unintended, this question is an excellent probe for demand effects. As the power pose can’t change history and impact the actual relationship between students and teachers, the observed effect can only be caused by a demand effect. Neither Cuddy nor the original authors realized this. Cuddy instead concludes: “This is exactly the kind of applied, developmentally informed research that matters — taking findings from the lab and asking whether they help real children in real classrooms.”

Evaluating the Research Line

Evaluating evidence is effortful and messy. Single studies always have weaknesses, and the reader might reasonably wonder whether I’m cherry-picking a few bad apples from an otherwise strong set. I don’t think I am, and I will explain the more general pattern I observed when reading all the cited papers.

Exploratory claims

The Körner et al. (2020) study above was preregistered, and therefore we were able to establish that the claims were not severely tested, as they would not survive the required correction for multiple comparisons (Lakens, 2019). But most claims in the papers that Cuddy cites are based on exploratory analyses. The studies all have many dependent variables, and a large number of tests can be performed. These studies observe a mix of significant and non-significant results, but the significant results have a high probability of being Type 1 errors and can’t be presented as evidence. If researchers in this field performed more direct replication studies and preregistered more of their studies, they could address this problem. Some researchers preregister their studies, which is excellent, but others do not, even though they work in a highly contested research area, and the significant results primarily come from exploratory analyses.

Researchers in the field are often honest about this, but especially in a narrative summary, it is easy to lose track of the fact that most of the authors of studies cited by Cuddy do not consider their own findings to be strong evidence. For example, Metzler et al (2023) write “Finally, it is important to transparently report on the level of evidence this study provides for power pose effects on low-level social behavior. This requires mentioning its exploratory nature [...] we are convinced that the medium effect sizes, given our sample size, would require replication before strong conclusions can be drawn”. I would say this is especially important given that the main result was a 3-way interaction with a p-value of 0.03: “the predicted three-fold interaction suggested that this effect of emotion on action choices (more avoidance for anger than fear) changed between sessions as a function of adopted pose (OR = 1.19, 95% CI[1.02, 1.38], z = 2.18, p = .029)”.

Another example comes from Elkjær et al (2022). The main finding is: “Concerning approach tendencies, the 2 × 3 interaction analysis on DAT “approach threat 1” was significant (F(1, 87) = 3.27, p = .043, ηp2 = .07). Regarding DAT avoid threat (1 + 2), the overall 2 × 3 interaction analysis was significant (F(1, 87) = 6.39, p = .003, ηp2 = .13).” The study was preregistered (https://aspredicted.org/blind.php?x=9j3b38) which allows us to see that the preregistered predictions are not supported. The authors predicted significant effects for the expansive condition compared to both the constricted condition and the control condition. However, they did not find effects compared to the control condition. Such patterns of mixed results are present in many studies in the literature. On the one hand, this is part of normal research, especially early on in research lines, when researchers have not figured out how to reliably produce the effect they are examining. On the other hand, power posing has been studied since 2010, and a research line can never get a strong basis if it does not move beyond a literature where all significant results are based on exploratory partial confirmations.

If you want to see the exploration of data in action, I would recommend looking at the OSF repository related to the paper by Michinov and Michinov (2024): www.osf.io/c9mzh, and see which variables and ways of computing variables are reported in the final paper, and which are not.

Underpowered studies and selection for significance

The sample sizes in the studies cited by Cuddy are often small – especially for key sub-group analyses, when the total sample size might be distributed across cells in a 2x3 design. This would not be problematic if the effects of power posing were known to be large. But even the self-report effect where participants indicate they feel more or less powerful has a rather small effect size of only g = 0.37 (see https://metaanalyses.shinyapps.io/bodypositions/). Less direct effects, for example on behavior, are likely to be much smaller (unless researchers can propose strong theoretical arguments why more indirect effects would be larger, see Anvari et al., 2023). In one-tailed independent t-tests, 80% power would require 184 participants (92 per condition), but none of the studies are close to achieving such sample sizes.
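To get a feeling for where a number like this comes from, here is a minimal sketch of the sample size calculation. It uses the normal approximation rather than the exact noncentral t distribution used by dedicated power software, so it lands slightly below the 92 per condition mentioned above. The function name and defaults are chosen for this example.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80, one_tailed=True):
    """Approximate n per group for an independent two-group comparison.

    Normal approximation: n = 2 * (z_alpha + z_power)^2 / d^2.
    An exact t-based calculation gives a slightly larger n.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_tailed else z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

n = n_per_group(0.37)   # meta-analytic self-report effect size from the text
print(f"approximately {n} per condition, {2 * n} in total")
```

The approximation gives about 91 per condition. For a two-tailed test, or for the smaller effects one should expect on behavioral outcomes, the required sample size increases quickly.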

The research area of power posing is also characterized by the selective reporting of significant results. This combination of underpowered studies and selection for significance leads to highly inflated effect sizes. We can see these effects in Andolfi et al (2017):

The effect sizes of an open or closed posture simply can’t be in the range of d = 1.22, or even d = 0.69 (for examples of realistic effect sizes to expect based on group differences, see DataColada 18). The effects are inflated, and there is no way of knowing what the true effect sizes are. They might be zero, as has turned out to be the case in many replication studies of exactly such implausibly large effects from studies with tiny samples.

The study by Michinov and Michinov similarly shows effects for significant tests that are too large. Adopting a posture for a few minutes can’t plausibly influence creative tasks with effects such as d = 0.634. When you evaluate evidence, thinking about selective reporting and inflated effects should be part of the evaluation.
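The inflation caused by combining small samples with selection for significance can be illustrated with a small simulation. This is a sketch under assumed values: the true effect of d = 0.2 and the 20 participants per group are hypothetical numbers chosen for illustration, not estimates from any of the cited studies. To stay within the standard library, it uses a one-tailed z-test with a known standard deviation of 1 instead of a t-test.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def simulate_selected_effects(true_d=0.2, n_per_group=20, n_studies=5000,
                              alpha=0.05, seed=1):
    """Simulate two-group studies and keep only the 'significant' ones.

    With a known SD of 1, the mean difference is the standardized effect.
    true_d and n_per_group are hypothetical values for illustration.
    """
    rng = random.Random(seed)
    se = sqrt(2 / n_per_group)                        # SE of the mean difference
    critical_diff = NormalDist().inv_cdf(1 - alpha) * se
    all_d, significant_d = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treatment = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        observed_d = mean(treatment) - mean(control)
        all_d.append(observed_d)
        if observed_d > critical_diff:                # one-tailed z-test
            significant_d.append(observed_d)
    return mean(all_d), mean(significant_d), len(significant_d) / n_studies

mean_all, mean_significant, share_significant = simulate_selected_effects()
print(f"mean effect, all studies:         {mean_all:.2f}")
print(f"mean effect, significant studies: {mean_significant:.2f}")
print(f"share of significant studies:     {share_significant:.2f}")
```

Across all simulated studies, the average effect is close to the true value. But with 20 participants per group, a study can only be significant if the observed effect exceeds roughly d = 0.52, so every study that passes the significance filter overestimates the true effect by a wide margin.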

 

Quality of the design and analysis

I could not help noticing that there is a lot of room to improve the quality of the study design and analysis, as reported in papers in this literature. This in itself does not mean that the evidence is unreliable, but it does not make it easier for a research field to generate high quality evidence. For example, Elkjær et al (2022) report the following power analysis:

“Based on a priori power calculations, using a repeated-measures ANOVA interaction analysis, 2 (time; before vs. after the manipulation) × 3 (condition; EXP, CON, N), 90 participants were required to detect a small effect size (d = 0.34), with an alpha of .05 and a beta of .20.”

At first sight, this looks like best practice. They acknowledge power posing effects are small (d = 0.34 is very much in line with the meta-analysis they published in the same year). Regrettably, what the authors actually did was enter an f = .34, not a d, as you can see in the screenshot below, which leads to a sample size that is much lower than what they would actually have needed to achieve high power, according to their own meta-analytic effect size estimate:

This means that despite the power analysis, the study was still massively underpowered. The sample size justifications in all studies cited by Cuddy are problematic. This is probably true for many research lines, but it is especially problematic for a research line where researchers are still trying to establish if the basic effect exists or not.
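A rough sketch of why confusing d and f matters so much is given below. It relies on the textbook relation f = d / 2, which holds for two equal-sized groups. The actual 2 × 3 repeated-measures interaction depends on additional inputs (such as the correlation among the repeated measures), so the numbers only indicate the direction and approximate size of the problem. The function names are made up for this example.

```python
def d_to_f(d):
    """Cohen's f for two equal-sized groups with standardized difference d."""
    return d / 2

def f_to_d(f):
    """Inverse of d_to_f, under the same two-group assumption."""
    return 2 * f

intended_d = 0.34                      # effect size named in the power analysis
correct_f = d_to_f(intended_d)         # the f that corresponds to d = 0.34
implied_d = f_to_d(0.34)               # the d implied by entering 0.34 as f

# Required sample size scales approximately with 1 / f^2.
sample_size_ratio = (0.34 / correct_f) ** 2

print(f"f corresponding to d = 0.34: {correct_f:.2f}")
print(f"d implied by entering f = 0.34: {implied_d:.2f}")
print(f"approximate factor by which the sample is too small: {sample_size_ratio:.0f}")
```

Under these assumptions, entering 0.34 as f amounts to planning for an effect twice as large as intended, which yields roughly a quarter of the sample size that would actually be needed.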

While reading the articles, I also noticed many of the issues that we often see in other literatures when research teams lack statistical expertise. There are often small inconsistencies in the degrees of freedom, incorrectly performed statistical tests, an overreliance on p-values despite underpowered studies, and misinterpretations of non-significant results. I don’t want to single out more examples, but it would probably be good for the field if researchers enlisted some methodological and statistical expertise if they want to generate reliable evidence.

Tools to evaluate claims

Cuddy writes: “When people are told that research is fake — without being given the tools to evaluate that claim — it doesn't just affect one researcher or one line of work. It feeds a broader cynicism: that science can't be trusted, that findings are arbitrary, that expertise is performance.” I strongly agree. This is why I have created a free textbook, Improving Your Statistical Inferences, to learn how to evaluate the actual evidence in scientific papers. Here are three decent heuristics to follow when you evaluate the evidence in a research line:

  1. If a finding shows what you want to be true, be extra skeptical.
  2. If you have a strong conflict of interest, be extra skeptical.
  3. Studies with low power due to too small sample sizes, lack of preregistration, no direct replications, strong indications of selective reporting, low methodological quality, repeating limitations in discussion sections without addressing them, implausibly large effect sizes, a lack of impact on other research areas, significant claims that mainly come from exploratory analyses, continued uncertainty about the basic effect after more than a decade and dozens of studies, and the research community disengaging with a literature are all signs of a lack of evidence.

According to Cuddy, she “live[s] inside a false narrative” where power posing is incorrectly believed to be a ‘myth’, and she believes that “none of this would have happened if the methods guys, and the journalists who trusted them without doing proper research, hadn't created the conditions that made it happen.”

 

Scientific criticism is a cornerstone of a healthy science

When I read Cuddy’s LinkedIn post, I was highly skeptical of the claim that there was evidence for effects of power posing on measures other than self-report, and that the debunking was a 'myth'. But my first response was to ignore the post. I did not want to examine the evidence behind the claims Cuddy made, because I am clearly one of the “methods guys” who, according to Cuddy, “manufactured the "debunked" narrative and aimed it, with great precision, at a single researcher”. If I criticized her post, would I be seen as contributing to “the bullying I was subjected to”, as Cuddy writes?

But I care about criticism in science. And I think it is important that we can criticize scientific claims. My decision to not follow up on examining the claims in the blog post kept nagging me. Cuddy has 900,000 followers on LinkedIn who have read the very strong statement that it is a “myth” that power posing was debunked. If the evidence she presented was overstated – as I feared – scientific criticism would be needed to correct the record. I think it is essential to increase social safety in academia, while being able to criticize each other. I do not want bullying and scientific criticism to become conflated. Scientific criticism is too important for a healthy science to shy away from it, for fear of being called a bully. 

I think scientific criticism is a cornerstone of a reliable science. We have a responsibility to criticize public claims that we believe to be incorrect (because they are AI-generated, contain miscitations, or overstate the evidence). When I asked whether criticism like this should be voiced publicly (here, here, here, and here), most of the people in my network remarked that such criticisms should be voiced publicly. Others thought I should share these issues privately. In a way, I have always found it comforting to do things that I know will upset some scientists either way. It makes it easier to act on my own principles. And I believe it is essential for a science that aims to contribute to society to maintain a healthy culture of public scientific criticism.

 

 

Thanks to Nina, Sajedeh, Nick and Lisa for feedback on this blog post.

 

 

References

Andolfi, V. R., Di Nuzzo, C., & Antonietti, A. (2017). Opening the mind through the body: The effects of posture on creative processes. Thinking Skills and Creativity, 24, 20–28. https://doi.org/10.1016/j.tsc.2017.02.012

Anvari, F., Kievit, R., Lakens, D., Pennington, C. R., Przybylski, A. K., Tiokhin, L., Wiernik, B. M., & Orben, A. (2023). Not All Effects Are Indispensable: Psychological Science Requires Verifiable Lines of Reasoning for Whether an Effect Matters. Perspectives on Psychological Science, 18(2), 503–507. https://doi.org/10.1177/17456916221091565

Barel, E., Shahrabani, S., Mahagna, L., Massalha, R., Colodner, R., & Tzischinsky, O. (2024). The effects of power posing on neuroendocrine levels and risk-taking. BMC Psychology, 12(1), 726. https://doi.org/10.1186/s40359-024-02194-7

Elkjær, E., Mikkelsen, M. B., Tramm, G., Michalak, J., Mennin, D. S., & O’Toole, M. S. (2022). Using bodily displays to facilitating approach action outcomes within the context of a personally relevant task. Brain and Behavior, 13(1), e2855. https://doi.org/10.1002/brb3.2855

Körner, R., Röseler, L., Schütz, A., & Bushman, B. J. (2022). Dominance and prestige: Meta-analytic review of experimentally induced body position effects on behavioral, self-report, and physiological dependent variables. Psychological Bulletin, 148(1–2), 67–85. https://doi.org/10.1037/bul0000356

Lakens, D. (2019). The value of preregistration for psychological science: A conceptual analysis. Japanese Psychological Review, 62(3), 221–230. https://doi.org/10.24602/sjpr.62.3_221

Metzler, H., Vilarem, E., Petschen, A., & Grèzes, J. (2023). Power pose effects on approach and avoidance decisions in response to social threat. PLOS ONE, 18(8), e0286904. https://doi.org/10.1371/journal.pone.0286904

Michinov, N., & Michinov, E. (2024). Can Sitting Postures Influence the Creative Mind? Positive Effect of Contractive Posture on Convergent-Integrative Thinking. Creativity Research Journal, 36(1), 58–69. https://doi.org/10.1080/10400419.2022.2072557

Sunday, May 3, 2026

There are 13 ways to analyse a replication study, but only one of them is coherent.

Paul Simon says there are 50 ways to leave your lover, and a similar abundance characterizes the contemporary literature on how to analyse a replication study. The recent SCORE replication project in Nature makes this explicit by reporting “13 replication success metrics along with the number of papers to which each metric could be applied” (Tyner et al., 2026). The reason for this, as the authors note, is that “No singular success metric has been accepted as optimal and universally applicable in the literature.” With so many approaches to choose from, it should not be a surprise that the inferences based on the studies vary substantially. As the authors report: “Thirteen methods for evaluating replication success provided estimates ranging from 28.6% to 74.8% (median of 49.3%).”

Fig. 1 from the SCORE project, with 13 ways to analyze a replication study.

Although it is understandable for such a massive project to refrain from strongly opinionated choices, I think we can all agree that we do not want to compute a median for different approaches to analyse replication success. Just like we don’t want 13 ways to analyse an original study, we don’t want 13 ways to analyse a replication study. The current state of how to analyse replication studies is not so much an embarrassment of riches as an embarrassment of a lack of a coherent epistemology. In science, we don’t compute statistics just because we can. We compute statistics because they help us to generate knowledge. In my experience, in science no “singular” anything is ever “accepted” – and especially not an approach to statistical inferences (as the ever-ongoing disagreement between Bayesians and Frequentists illustrates). Instead of waiting for agreement, we should think about which approach to the analysis of a replication study coherently follows from our epistemology.

Two metrics do the bulk of the epistemic work in the replication project. The authors are explicit that “Primary reporting emphasizes the two most reported replication metrics: statistical significance and effect size comparisons.” The authors are right to focus on these two measures, as these two inferences – whether the p-value is significant, and whether the effect sizes are similar – reflect what researchers treat as substantive claims in studies. In our recent paper ‘How to analyse a replication study’ we point out that these two inferences are directly related to basic statements that scientists make: whether the same ordinal decision can be made in the replication, and whether the magnitude of the effect meaningfully corresponds to the original estimate (Lakens et al., 2026).

While the authors understandably do not make too strong a point about their focus on 2 of the 13 metrics, their decision is not just based on which measures are commonly reported but also based on an epistemically coherent approach to the analysis of a replication study. Replication projects like SCORE are unmistakably designed within a methodological falsificationist framework, even when this philosophical background is not mentioned explicitly. In a falsificationist framework, which is implemented through the Neyman-Pearson approach to statistical inferences, error rates are tightly controlled, and claims are made based on the results of a hypothesis test. It is clear the SCORE authors followed a methodological falsificationist framework and the Neyman-Pearson approach to statistics as they preregistered (allowing others to evaluate how severely they controlled their error rates), designed studies with high statistical power, and set a prespecified alpha level.

From our perspective, the central goal of a replication study is not to ask a new statistical question, but to revisit the same question as in the original study, after collecting new data. As we argue, “the goal of a replication study is to examine whether the same basic statement that was made in an original study can also be made when new data are collected in a faithful empirical realization of the original test.” Within a methodological falsificationist framework, original findings enter the literature as tentative basic statements, justified by a statistical decision rule with known error rates. Replication studies function as a critical check on whether such statements were merely the result of a Type I error. As we note, “one of the main functions of a replication study is to determine the stability of a finding, which should not be present if the original finding was a Type I error.” Even though replications inevitably change many auxiliary conditions—because, as Popper already observed, experiments can never be repeated exactly—this does not undermine their epistemic role. Instead, replications test whether the original claim survives variation in factors that were implicitly relegated to the ceteris paribus clause, thereby probing whether the original inference was robust or not.

This view has direct implications for how replication studies should be analysed. If original claims were made by rejecting a null hypothesis at a prespecified alpha level, then “the replication study should evaluate whether the same claim can be made using the same inferential criterion.” And if researchers also treated the original effect size estimate as part of the claim (which they seem to do in practice) then replication studies must additionally examine whether effect sizes differ meaningfully. While the other eleven approaches to assessing replication success are statistically legitimate and can certainly be reported, they answer different questions and rest on different epistemic commitments. The coherent approach is the one that aligns epistemological aims, philosophy of science, and statistical procedure. Statistics follow from the questions we ask, the questions we ask follow from our philosophy of science, and without such coherence we cannot have a principled approach to the analysis of replication studies. Maybe it is time to admit that there are 13 statistical quantities that can be computed for a replication study, but there is really only one way to do so coherently from a specific philosophy of scientific knowledge generation.
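As a concrete illustration of the two inferences discussed above, the sketch below shows one simple way they could be computed from summary statistics. This is an assumption-laden example and not necessarily the procedure proposed in Lakens et al. (2026): it uses z-tests on effect estimates with known standard errors, operationalizes "differ meaningfully" as a significant difference between the two estimates, and all numbers in the usage example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def analyse_replication(orig_est, orig_se, rep_est, rep_se, alpha=0.05):
    """Two inferences about a replication, based on z-tests.

    1. Can the same claim (significant, same direction) be made again
       at the same alpha level?
    2. Do the original and replication estimates differ?
    """
    z = NormalDist()
    # Inference 1: same test, same alpha, applied to the replication data.
    p_rep = 2 * (1 - z.cdf(abs(rep_est / rep_se)))
    same_direction = (rep_est > 0) == (orig_est > 0)
    same_claim = p_rep < alpha and same_direction
    # Inference 2: test the difference between the two independent estimates.
    se_diff = sqrt(orig_se ** 2 + rep_se ** 2)
    p_diff = 2 * (1 - z.cdf(abs((orig_est - rep_est) / se_diff)))
    return {"p_replication": p_rep, "same_claim": same_claim,
            "p_difference": p_diff, "effects_differ": p_diff < alpha}

# Hypothetical numbers: a small original study and a larger replication.
result = analyse_replication(orig_est=0.60, orig_se=0.25,
                             rep_est=0.10, rep_se=0.08)
for name, value in result.items():
    print(f"{name}: {value}")
```

The two inferences answer different questions and can come apart: in this hypothetical example the original claim cannot be made again in the replication, while the test of the difference between the two estimates is itself not significant at the same alpha level.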

 

References

Lakens, D., Leach, N. M., Haans, A., Uygun Tunç, D., & Tunç, M. N. (2026). How to analyze a replication study. PsyArXiv. https://osf.io/preprints/psyarxiv/7ydgu_v1

Tyner, A. H., Abatayo, A. L., Daley, M., Field, S., Fox, N., Haber, N. A., Hahn, K. M., Struhl, M. K., Mawhinney, B., Miske, O., Silverstein, P., Soderberg, C. K., Stankov, T., Abbasi, A., Aberson, C. L., Aczel, B., Adamkovič, M., Albayrak, N., Allen, P. J., … Errington, T. M. (2026). Investigating the replicability of the social and behavioural sciences. Nature, 652(8108), 143–150. https://doi.org/10.1038/s41586-025-10078-y

Sunday, February 8, 2026

On the reliability and reproducibility of qualitative research

With my collaborators, I am increasingly performing qualitative research. I find qualitative research projects a useful way to improve my understanding of behaviors that I want to explore with other methods in the future. For example, some years ago I performed qualitative interviews with researchers who believed their own research had no value whatsoever. Although I did not intend to publish these interviews, they provided important insights for other projects that I am engaged in now. I was involved in qualitative research on the assessment process of interdisciplinary research (Schölvinck et al., 2024), and we performed interviews to understand how researchers interpret a questionnaire we were developing that measures personal values in science (Kis et al., 2025). Together with Anna van ’t Veer I supervised Julia Weschenfelder, who interviewed scientists on what they believed the value of their research was, and I have hired Julia as a PhD candidate as part of a large project on the meaningful interpretation of effect sizes. She is planning interviews with researchers about what determines the maximum sample size they are willing to collect (if you want to be interviewed about this, reach out!). With Sajedeh Rasti, who is completing her PhD in my lab, we have spent the last 2 years interviewing people who played important roles in the creation of large-scale coordinated research projects in science.

As a supervisor, I am always very actively involved in research projects. I joined as many of the (extremely interesting) interviews Sajedeh performed as I could, and I listened to the audio recordings of all interviews that Julia performed to give my interpretation of what the scientists discussed when interviewed. Yet it never occurred to me to independently perform the thematic analysis for these interviews, and compare the themes we derived. I became aware of this peculiarity after reading a great qualitative paper analyzing open questions in a study on questionable research practices (Makel et al., 2025). In this paper, two teams independently analyze themes in the same set of open questions. They largely find the same themes, and conclude: “our two independent analysis teams reported themes that were generally similar or overlapping, suggesting a robustness of the findings. We believe this suggests that independent qualitative analyst teams with similar positionality can use unique analytic paths and reach largely similar destinations. This contributes to the ongoing conversation within the qualitative research community about whether reproducibility and replicability are relevant or possible in qualitative research”.


Sometimes you read a paper that makes perfect sense – of course two independent teams should reach the same conclusions when they qualitatively analyze the same data – and yet, it was not part of your workflow. This is especially peculiar, because we use this exact workflow when we code other data sources. For example, Sajedeh Rasti will soon share a preprint on papers written by large teams of scientists. In this paper, we classify these papers into different categories, depending on the interdependencies that require coordination (e.g., epistemic, logistical, financial, etc., see Rasti et al., 2025). Sajedeh and I double-coded papers, and we discussed our levels of agreement. This is the normal thing to do. It is strange that I never considered the same approach when the data comes from interviews.

 

Research on inter-rater reliability of thematic coding

I looked into the literature to search for papers similar to Makel et al. (2025), where the same qualitative data is analyzed by multiple coders to examine how reproducible the identified themes are. There are many more papers than the few I will list here, and someone should write a paper summarizing this literature. But this is a Sunday morning blog post, and not a systematic review, so I will just present some papers that I found interesting.

Armstrong and colleagues had six researchers independently analyze the same single focus group transcript, and found close agreement on the basic themes, but substantial divergence in how those themes were interpreted and organized, with each analyst having “packaged the themes differently” (Armstrong et al., 1997). The authors then go on to say that these differences demonstrate the inherent subjectivity in qualitative research. But this is not the message I take away from this paper at all. All coders of any type of data will differ slightly in the details they highlight. What matters most in this project is not that researchers highlight different details in their verbal summaries of the themes (that is to be expected, but also largely irrelevant), but that there is such clear agreement on the themes identified. If I read the examples in the paper, the differences are mainly in the level of detail, where some summarize the themes at a higher level, and others at a more detailed level. Those who summarize the themes at a detailed level will of course pick out specific details others did not mention, but this can easily be improved upon by instructing coders better about the level of detail at which results should be provided, and how details should be chosen in verbal summaries.

Campbell and colleagues (2013) provide a great example of the work needed to reach high inter-rater reliability. They say that “Reliability is just as important for qualitative research as it is for quantitative research” and argue that replicability problems stem less from disagreement over themes per se than from “the unitization problem—that is, identifying appropriate blocks of text for a particular code or codes,” which can “wreak havoc when researchers try to establish intercoder reliability for in-depth semistructured interviews.” Using their own empirical coding exercise, they show that even with a detailed codebook, intercoder reliability remained modest (“54 percent reliability on average”), reflecting the interpretive complexity of semistructured interviews where “more than one theme and therefore more than one code may be applicable at once.” However, when disagreements were resolved through discussion, intercoder agreement rose dramatically to 96%, and 91% of initial disagreements were resolved. In my view, qualitative research is nothing special in this respect, as it is often difficult to achieve high inter-rater reliability in any coding project. For example, we have the same difficulty in reaching agreement when we code what the main hypothesis in a scientific paper is in metascientific research projects (Mesquida et al., 2025).
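
To make concrete what such agreement figures measure, here is a minimal Python sketch that computes simple percent agreement and Cohen's kappa for two coders. Note that Cohen's kappa is a chance-corrected measure I am adding for illustration; it is not necessarily the statistic Campbell and colleagues report. The sketch assumes the unitization problem has already been solved: both coders assigned exactly one code to each of the same pre-defined text segments. The codes "A", "B", and "C" and the data are made up.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Proportion of segments to which both coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must code the same non-empty set of segments.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of the two coders' marginal proportions per code.
    p_expected = sum(
        counts_a[code] * counts_b[code] for code in set(counts_a) | set(counts_b)
    ) / n ** 2
    if p_expected == 1:
        return 1.0  # both coders used one identical code throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codes assigned by two coders to the same ten segments
coder_a = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
coder_b = ["A", "A", "B", "B", "B", "C", "C", "C", "C", "A"]

print(percent_agreement(coder_a, coder_b))  # 0.7
print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.545
```

The gap between 70% raw agreement and a kappa of about 0.55 shows why it matters which agreement statistic is reported.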

The importance and challenges of clear coding instructions related to text segmentation are also discussed in a classic paper on the reliability of qualitative research (MacQueen et al., 1998). Based on extensive experience with qualitative research at the Centers for Disease Control and Prevention, MacQueen and colleagues offer useful suggestions on creating a codebook that will lead to high reliability. Through an iterative process in which multiple coders independently code the same text, compare results, and revise definitions, they show how disagreement over themes is an indication of ambiguous codes – and not an inherent limitation of qualitative research. Reproducibility can be achieved by repeatedly checking whether coders can “apply the codes in a consistent manner” and refining the codebook until agreement is acceptable. The authors argue that we should not expect that coders will naturally “see” the same themes, but that they can code the same themes reliably if researchers use a disciplined, transparent, and collective codebook development process that supports reproducible qualitative analysis without denying its interpretive character.
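
One step in such an iterative process, finding the codes that coders apply inconsistently, could be sketched as follows. This is my own illustration and not a procedure taken from MacQueen and colleagues. It assumes two coders have each assigned a set of codes to the same segments, it measures agreement per code as the share of segments where both coders applied the code among segments where at least one did, and the 0.8 threshold is an arbitrary, hypothetical cut-off.

```python
def per_code_agreement(coder_a, coder_b):
    """Per-code agreement for two coders.

    coder_a and coder_b are lists of sets, one set of codes per segment.
    For each code: segments where both applied it / segments where either did.
    """
    all_codes = set().union(*coder_a, *coder_b)
    agreement = {}
    for code in sorted(all_codes):
        pairs = list(zip(coder_a, coder_b))
        either = sum((code in a) or (code in b) for a, b in pairs)
        both = sum((code in a) and (code in b) for a, b in pairs)
        agreement[code] = both / either  # either >= 1 for every code in all_codes
    return agreement


def codes_to_revise(agreement, threshold=0.8):
    """Codes whose definitions may be ambiguous (threshold is an assumption)."""
    return [code for code, value in agreement.items() if value < threshold]


# Hypothetical coding of five interview segments by two coders
coder_a = [{"funding"}, {"funding", "career"}, {"career"}, {"values"}, {"values"}]
coder_b = [{"funding"}, {"funding"}, {"career"}, {"values", "career"}, {"values"}]

agreement = per_code_agreement(coder_a, coder_b)
print(agreement)
print(codes_to_revise(agreement))  # ['career']
```

In this made-up example the "career" code would be the one to discuss and redefine before the next round of independent coding.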

A realist ontology in qualitative research

The idea of reliability, or reproducibility (the two concepts become intertwined in a lovely way in qualitative research, as the coder is the measurement device, so to say) in qualitative research emerges most naturally from a scientific realist perspective on knowledge generation. There are qualitative researchers who adopt philosophical perspectives that attempt to argue reliability and reproducibility are not relevant for qualitative research. I have engaged with these ideas a lot over the years, and find them unconvincing. Seale and Silverman (1997) push back strongly against the idea that reliability would not apply in qualitative research, and write “We believe that such a position can amount to methodological anarchy and resist this on 2 grounds. First, it simply makes no sense to argue that all knowledge and feelings are of equal weight and value. Even in everyday life, we readily sort fact from fancy. Why, therefore, should science be any different? Second, methodological anarchy offers a clearly negative message to the audiences of qualitative health research, suggesting that its proponents have given up claims to validity.”

I am personally more sympathetic to the view expressed by Popay and colleagues (1998): “On one side, there are those who argue that there is nothing unique about qualitative research and that traditional definitions of reliability, validity, objectivity, and generalizability apply across both qualitative and quantitative approaches. On the other side, there are those postmodernists who contend that there can be no criteria for judging qualitative research outcomes (Fuchs, 1993). In this radical relativist position, all criteria are doubtful and none can be privileged. However, both of these positions are unsatisfactory. The second is nihilistic and precludes any distinction based on systematic or other criteria. If the first is adopted, then, at best, qualitative research will always be seen as inferior to quantitative research. At worst, there is a danger that poor-quality qualitative research, which meets criteria inappropriate for the assessment of such evidence, will be privileged.” There are unique aspects of reliability in qualitative research, but qualitative research will not be taken seriously by the majority of scientists if researchers do not engage with reliability at all.

O’Connor and Joffe provide a useful guide on how to achieve inter-coder reliability in qualitative research in psychology, based on their own extensive experience (O’Connor & Joffe, 2020). They argue that “ICR helps qualitative research achieve this communicative function by showing the basic analytic structure has meaning that extends beyond an individual researcher. The logic is that if separate individuals converge on the same interpretation of the data, it implies “that the patterns in the latent content must be fairly robust and that if the readers themselves were to code the same content, they too would make the same judgments” (Potter & Levine-Donnerstein, 1999, p. 266).” They highlight that there are external incentives to care about reliability (as it can function as a signal of quality), but that it also has direct benefits for the researchers performing the qualitative research.

If you feel similarly, and want to educate your students about qualitative methods where reliability and reproducibility are important, there is a nice paper that can be used to introduce your students to realist ontologies in qualitative research by Lourie and McPhail (2024). They note how the methodology literature in qualitative research often engages with interpretivist-constructivist approaches. Among my own collaborators, this perspective is not seen as appealing, and we prefer to build our qualitative research from a realist ontology. In this philosophy, inter-rater reliability, and reproducibility, are important aspects in knowledge generation (Seale, 1999). A good textbook on applied thematic analysis from this perspective is Applied Thematic Analysis (Guest et al., 2012).

I am very grateful to Makel and colleagues for making me realize I had overlooked the importance of independently coding themes in qualitative research, and of establishing inter-rater reliability or reproducibility.

 

Personal take-home messages

After reflecting on this topic, there are some points that I am taking away from the work on reliability or reproducibility of thematic coding.

First, independent thematic analysis, and comparing how we code qualitative data, should be a standard practice in the qualitative studies in my lab. The suggestions by MacQueen and colleagues provide useful guidance that we should follow. There is nothing special about qualitative data sources in this respect.

Second, we already ask all participants if we can share full transcripts, and many agree. In our case, as we primarily interview scientists, we are in the lucky position that they value transparency and data sharing, and that the content of our interviews is not particularly sensitive. Data sharing is of course not always possible. For example, in my interviews on why researchers felt their own research lacked any value whatsoever, many did continue to receive funding for this research, and they would not want their actual thoughts about their research to become public. But where possible, we should share transcripts for independent re-analysis and re-use.

Third, we should use the same techniques to increase the reliability of our claims in qualitative research as we do in quantitative research. I recently have been annoyed by some extremely biased qualitative studies in metascience, where the researchers who performed the research clearly wanted their work to lead to a very specific outcome. It is easy to tell the story you want in qualitative research, if you reject the idea of reliability. But in my lab, we use methods that prevent us from saying what we want to be true, if we are wrong. In my current research project, I reserved 8000 euro out of the 1.5 million euro budget to hire ‘red teams’ (Lakens, 2020) to criticize the studies before we perform them. However, I had planned to only use red teams for large quantitative studies. I now think that I should also use the red teams for the qualitative studies I planned in the proposal, to make sure the coding of themes is reliable.

 

References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The Place of Inter-Rater Reliability in Qualitative Research: An Empirical Study. Sociology, 31(3), 597–606. https://doi.org/10.1177/0038038597031003015

Campbell, J. L., Quincy, C., Osserman, J., & Pedersen, O. K. (2013). Coding In-depth Semistructured Interviews: Problems of Unitization and Intercoder Reliability and Agreement. Sociological Methods & Research, 42(3), 294–320. https://doi.org/10.1177/0049124113500475

Guest, G., MacQueen, K. M., & Namey, E. (2012). Applied Thematic Analysis. SAGE Publications.

Kis, A., Tur, E. M., Vaesen, K., Houkes, W., & Lakens, D. (2025). Academic research values: Conceptualization and initial steps of scale development. PLOS ONE, 20(3), e0318086. https://doi.org/10.1371/journal.pone.0318086

Lakens, D. (2020). Pandemic researchers—Recruit your own best critics. Nature, 581(7807), 121. https://doi.org/10.1038/d41586-020-01392-8

Lourie, M., & McPhail, G. (2024). A Realist Conceptual Methodology for Qualitative Educational Research: A Modest Proposal. New Zealand Journal of Educational Studies, 59(2), 393–407. https://doi.org/10.1007/s40841-024-00344-4

MacQueen, K. M., McLellan, E., Kay, K., & Milstein, B. (1998). Codebook Development for Team-Based Qualitative Analysis. CAM Journal, 10(2), 31–36. https://doi.org/10.1177/1525822X980100020301

Makel, M. C., Caroleo, S. A., Meyer, M., Pei, M. A., Fleming, J. I., Hodges, J., Cook, B., & Plucker, J. (2025). Qualitative Analysis of Open-ended Responses from Education Researchers on Questionable and Open Research Practices. OSF. https://doi.org/10.35542/osf.io/n2gby

Mesquida, C., Murphy, J., Warne, J., & Lakens, D. (2025). On the replicability of sports and exercise science research: Assessing the prevalence of publication bias and studies with underpowered designs by a z-curve analysis. SportRxiv. https://doi.org/10.51224/SRXIV.534

O’Connor, C., & Joffe, H. (2020). Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines. International Journal of Qualitative Methods, 19, 1609406919899220. https://doi.org/10.1177/1609406919899220

Popay, J., Rogers, A., & Williams, G. (1998). Rationale and standards for the systematic review of qualitative literature in health services research. Qualitative Health Research, 8(3), 341–351. https://doi.org/10.1177/104973239800800305

Rasti, S., Vaesen, K., & Lakens, D. (2025). A Framework for Describing the Levels of Scientific Coordination. OSF. https://doi.org/10.31234/osf.io/eq269_v1

Schölvinck, A.-F., Uygun-Tunç, D., Lakens, D., Vaesen, K., & Hessels, L. K. (2024). How qualitative criteria can improve the assessment process of interdisciplinary research proposals. Research Evaluation, 33, rvae049. https://doi.org/10.1093/reseval/rvae049

Seale, C. (1999). Quality in Qualitative Research. Qualitative Inquiry, 5(4), 465–478. https://doi.org/10.1177/107780049900500402