The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, February 8, 2026

On the reliability and reproducibility of qualitative research

With my collaborators, I am increasingly performing qualitative research. I find qualitative research projects a useful way to improve my understanding of behaviors that I want to explore with other methods in the future. For example, some years ago I performed qualitative interviews with researchers who believed their own research had no value whatsoever. Although I did not intend to publish these interviews, they provided important insights for other projects that I am engaged in now. I was involved in qualitative research on the assessment process of interdisciplinary research (Schölvinck et al., 2024), and we performed interviews to understand how researchers interpret a questionnaire we were developing that measures personal values in science (Kis et al., 2025). Together with Anna van ‘t Veer I supervised Julia Weschenfelder, who interviewed scientists on what they believed the value of their research was, and I have hired Julia as a PhD student as part of a large project on the meaningful interpretation of effect sizes. She is planning interviews with researchers about what determines the maximum sample size they are willing to collect (if you want to be interviewed about this, reach out!). Together with Sajedeh Rasti, who is completing her PhD in my lab, I have spent the last 2 years interviewing people who played important roles in the creation of large-scale coordinated research projects in science.

As a supervisor, I am always very actively involved in research projects, and I joined as many of the (extremely interesting) interviews Sajedeh performed as I could, and I listened to the audio recordings of all interviews that Julia performed to give my interpretation of what the scientists discussed when interviewed. Yet it never occurred to me to independently perform the thematic analysis for these interviews and compare the themes we derived. I became aware of this peculiarity after reading a great qualitative paper analyzing open questions in a study on questionable research practices (Makel et al., 2025). In this paper, two teams independently analyze themes in the same set of open questions. They largely find the same themes, and conclude: “our two independent analysis teams reported themes that were generally similar or overlapping, suggesting a robustness of the findings. We believe this suggests that independent qualitative analyst teams with similar positionality can use unique analytic paths and reach largely similar destinations. This contributes to the ongoing conversation within the qualitative research community about whether reproducibility and replicability are relevant or possible in qualitative research”.


Sometimes you read a paper that makes perfect sense – of course two independent teams should reach the same conclusions when they qualitatively analyze the same data – and yet, it was not part of your workflow. This is especially peculiar, because we use this exact workflow when we code other data sources. For example, Sajedeh Rasti will soon share a preprint on papers written by large teams of scientists. In this paper, we classify these papers into different categories, depending on the interdependencies that require coordination (e.g., epistemic, logistical, financial; see Rasti et al., 2025). Sajedeh and I double-coded papers, and we discussed our levels of agreement. This is the normal thing to do. It is strange that I never considered the same approach when the data comes from interviews.

 

Research on inter-rater reliability of thematic coding

I looked into the literature to search for papers similar to Makel et al. (2025), where the same qualitative data is analyzed by multiple coders to examine how reproducible the identified themes are. There are many more papers than the few I will list here, and someone should write a paper summarizing this literature. But this is a Sunday morning blog post, and not a systematic review, so I will just present some papers that I found interesting.

Armstrong and colleagues had six researchers independently analyze the same single focus group transcript, and found close agreement on the basic themes, but substantial divergence in how those themes were interpreted and organized, with each analyst having “packaged the themes differently” (Armstrong et al., 1997). The authors then go on to say that these differences demonstrate the inherent subjectivity in qualitative research. But this is not the message I take away from this paper at all. All coders of any type of data will differ slightly in the details they highlight. What matters most in this project is not that researchers highlight different details in their verbal summaries of the themes – that is to be expected, but also largely irrelevant – but that there is such clear agreement on the themes identified. If I read the examples in the paper, the differences are mainly a matter of detail, where some summarize the themes at a higher level, and others at a more detailed level. Those who summarize the themes at a detailed level will of course pick out specific details others did not mention – but this can easily be improved upon by instructing coders better about the level of detail at which results should be provided, and how details should be chosen in verbal summaries.

Campbell and colleagues (2013) provide a great example of the work needed to reach high inter-rater reliability. They say that “Reliability is just as important for qualitative research as it is for quantitative research” and argue that replicability problems stem less from disagreement over themes per se than from “the unitization problem—that is, identifying appropriate blocks of text for a particular code or codes,” which can “wreak havoc when researchers try to establish intercoder reliability for in-depth semistructured interviews.” Using their own empirical coding exercise, they show that even with a detailed codebook, intercoder reliability remained modest (“54 percent reliability on average”), reflecting the interpretive complexity of semistructured interviews where “more than one theme and therefore more than one code may be applicable at once.” However, when disagreements were resolved through discussion, intercoder agreement rose dramatically to 96%, and 91% of initial disagreements were resolved. In my view, qualitative research is nothing special in this respect, as it is often difficult to achieve high inter-rater reliability in any coding project. For example, we have the same difficulty in reaching agreement when we code what the main hypothesis in a scientific paper is in metascientific research projects (Mesquida et al., 2025).
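
For readers who want to quantify such agreement: the two statistics that are most often reported are raw percent agreement and Cohen’s kappa, which corrects for the agreement expected by chance. Below is a minimal sketch in Python; the coded segments are invented purely for illustration, and I assume scikit-learn is available for the kappa computation.

```python
# Toy example: two coders assign one code per text segment.
# The codes and segments below are invented for illustration only.
from sklearn.metrics import cohen_kappa_score

coder_a = ["barrier", "barrier", "incentive", "workload",
           "incentive", "barrier", "workload", "incentive"]
coder_b = ["barrier", "incentive", "incentive", "workload",
           "incentive", "barrier", "barrier", "incentive"]

# Raw percent agreement: proportion of segments with identical codes.
agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

# Cohen's kappa corrects this proportion for chance agreement.
kappa = cohen_kappa_score(coder_a, coder_b)

print(f"Percent agreement: {agreement:.2f}")  # 0.75 in this toy example
print(f"Cohen's kappa:     {kappa:.2f}")      # roughly 0.61 here
```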

The importance and challenges of clear coding instructions related to text segmentation are also discussed in a classic paper on the reliability of qualitative research (MacQueen et al., 1998). Based on extensive experience with qualitative research at the Centers for Disease Control and Prevention, MacQueen and colleagues offer useful suggestions on creating a codebook that will lead to high reliability. Through an iterative process in which multiple coders independently code the same text, compare results, and revise definitions, they show how disagreement over themes is an indication of ambiguous codes – and not an inherent limitation of qualitative research. Reproducibility can be achieved by repeatedly checking whether coders can “apply the codes in a consistent manner” and refining the codebook until agreement is acceptable. The authors argue that we should not expect that coders will naturally “see” the same themes, but that they can code the same themes reliably if researchers use a disciplined, transparent, and collective codebook development process that supports reproducible qualitative analysis without denying its interpretive character.

A realist ontology in qualitative research

The idea of reliability, or reproducibility (the two concepts become intertwined in a lovely way in qualitative research, as the coder is the measurement device, so to speak), in qualitative research emerges most naturally from a scientific realist perspective on knowledge generation. There are qualitative researchers who adopt philosophical perspectives that attempt to argue that reliability and reproducibility are not relevant for qualitative research. I have engaged with these ideas a lot over the years, and find them unconvincing. Seale and Silverman (1997) push back strongly against the idea that reliability would not apply in qualitative research, and write: “We believe that such a position can amount to methodological anarchy and resist this on 2 grounds. First, it simply makes no sense to argue that all knowledge and feelings are of equal weight and value. Even in everyday life, we readily sort fact from fancy. Why, therefore, should science be any different? Second, methodological anarchy offers a clearly negative message to the audiences of qualitative health research, suggesting that its proponents have given up claims to validity.”

I am personally more sympathetic to the view expressed by Popay and colleagues (1998): “On one side, there are those who argue that there is nothing unique about qualitative research and that traditional definitions of reliability, validity, objectivity, and generalizability apply across both qualitative and quantitative approaches. On the other side, there are those postmodernists who contend that there can be no criteria for judging qualitative research outcomes (Fuchs, 1993). In this radical relativist position, all criteria are doubtful and none can be privileged. However, both of these positions are unsatisfactory. The second is nihilistic and precludes any distinction based on systematic or other criteria. If the first is adopted, then, at best, qualitative research will always be seen as inferior to quantitative research. At worst, there is a danger that poor-quality qualitative research, which meets criteria inappropriate for the assessment of such evidence, will be privileged.” There are unique aspects of reliability in qualitative research, but qualitative research will not be taken seriously by the majority of scientists if researchers do not engage with reliability at all.

O’Connor and Joffe provide a useful guide on how to achieve intercoder reliability in qualitative research in psychology, based on their own extensive experience (O’Connor & Joffe, 2020). They argue that “ICR helps qualitative research achieve this communicative function by showing the basic analytic structure has meaning that extends beyond an individual researcher. The logic is that if separate individuals converge on the same interpretation of the data, it implies “that the patterns in the latent content must be fairly robust and that if the readers themselves were to code the same content, they too would make the same judgments” (Potter & Levine-Donnerstein, 1999, p. 266).” They highlight that there are external incentives to care about reliability (as it can function as a signal of quality), but also that it has direct benefits for the researchers performing the qualitative research.

If you feel similarly, and want to educate your students about qualitative methods where reliability and reproducibility are important, there is a nice paper by Lourie and McPhail (2024) that can be used to introduce your students to realist ontologies in qualitative research. They note how the methodology literature in qualitative research often engages with interpretivist-constructivist approaches. Among my own collaborators, this perspective is not seen as appealing, and we prefer to build our qualitative research from a realist ontology. In this philosophy, inter-rater reliability and reproducibility are important aspects of knowledge generation (Seale, 1999). A good textbook from this perspective is Applied Thematic Analysis (Guest et al., 2012).

I am very grateful to Makel and colleagues for making me realize that I had overlooked the importance of independently coding themes, and of establishing inter-rater reliability or reproducibility, in qualitative research.

 

Personal take-home messages

After reflecting on this topic, there are some points that I am taking away from the work on reliability or reproducibility of thematic coding.

First, independent thematic analysis, and comparing how we code qualitative data, should be a standard practice in the qualitative studies in my lab. The suggestions by MacQueen and colleagues provide useful guidance that we should follow. There is nothing special about qualitative data sources in this respect.

Second, we already ask all participants if we can share full transcripts, and many agree. In our case, as we primarily interview scientists, we are in the lucky position that they value transparency and data sharing, and that the content of our interviews is not particularly sensitive. Data sharing is of course not always possible. For example, in my interviews on why researchers felt their own research lacked any value whatsoever, many continued to receive funding for this research, and they would not want their actual thoughts about their research to become public. But where possible, we should share transcripts for independent re-analysis and re-use.

Third, we should use the same techniques to increase the reliability of our claims in qualitative research as we do in quantitative research. I have recently been annoyed by some extremely biased qualitative studies in metascience, where the researchers who performed the research clearly wanted their work to lead to a very specific outcome. It is easy to tell the story you want in qualitative research, if you reject the idea of reliability. But in my lab, we use methods that prevent us from concluding what we want to be true, if we are wrong. In my current research project, I reserved 8000 euro out of the 1.5 million euro budget to hire ‘red teams’ (Lakens, 2020) to criticize the studies before we perform them. However, I had planned to use red teams only for large quantitative studies. I now think that I should also use the red teams for the qualitative studies I planned in the proposal, to make sure the coding of themes is reliable.

 

References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The Place of Inter-Rater Reliability in Qualitative Research: An Empirical Study. Sociology, 31(3), 597–606. https://doi.org/10.1177/0038038597031003015

Campbell, J. L., Quincy, C., Osserman, J., & Pedersen, O. K. (2013). Coding In-depth Semistructured Interviews: Problems of Unitization and Intercoder Reliability and Agreement. Sociological Methods & Research, 42(3), 294–320. https://doi.org/10.1177/0049124113500475

Guest, G., MacQueen, K. M., & Namey, E. (2012). Applied Thematic Analysis. SAGE Publications.

Kis, A., Tur, E. M., Vaesen, K., Houkes, W., & Lakens, D. (2025). Academic research values: Conceptualization and initial steps of scale development. PLOS ONE, 20(3), e0318086. https://doi.org/10.1371/journal.pone.0318086

Lakens, D. (2020). Pandemic researchers—Recruit your own best critics. Nature, 581(7807), Article 7807. https://doi.org/10.1038/d41586-020-01392-8

Lourie, M., & McPhail, G. (2024). A Realist Conceptual Methodology for Qualitative Educational Research: A Modest Proposal. New Zealand Journal of Educational Studies, 59(2), 393–407. https://doi.org/10.1007/s40841-024-00344-4

MacQueen, K. M., McLellan, E., Kay, K., & Milstein, B. (1998). Codebook Development for Team-Based Qualitative Analysis. CAM Journal, 10(2), 31–36. https://doi.org/10.1177/1525822X980100020301

Makel, M. C., Caroleo, S. A., Meyer, M., Pei, M. A., Fleming, J. I., Hodges, J., Cook, B., & Plucker, J. (2025). Qualitative Analysis of Open-ended Responses from Education Researchers on Questionable and Open Research Practices. OSF. https://doi.org/10.35542/osf.io/n2gby

Mesquida, C., Murphy, J., Warne, J., & Lakens, D. (2025). On the replicability of sports and exercise science research: Assessing the prevalence of publication bias and studies with underpowered designs by a z-curve analysis. SportRxiv. https://doi.org/10.51224/SRXIV.534

O’Connor, C., & Joffe, H. (2020). Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines. International Journal of Qualitative Methods, 19, 1609406919899220. https://doi.org/10.1177/1609406919899220

Popay, J., Rogers, A., & Williams, G. (1998). Rationale and standards for the systematic review of qualitative literature in health services research. Qualitative Health Research, 8(3), 341–351. https://doi.org/10.1177/104973239800800305

Rasti, S., Vaesen, K., & Lakens, D. (2025). A Framework for Describing the Levels of Scientific Coordination. OSF. https://doi.org/10.31234/osf.io/eq269_v1

Schölvinck, A.-F., Uygun-Tunç, D., Lakens, D., Vaesen, K., & Hessels, L. K. (2024). How qualitative criteria can improve the assessment process of interdisciplinary research proposals. Research Evaluation, 33, rvae049. https://doi.org/10.1093/reseval/rvae049

Seale, C. (1999). Quality in Qualitative Research. Qualitative Inquiry, 5(4), 465–478. https://doi.org/10.1177/107780049900500402

  

Saturday, December 6, 2025

Dogmatic Bayesianism Disorder

Note: This sarcastic blog post, intended humorously, directly mimics the classification of psychological disorders in the Diagnostic and Statistical Manual of Mental Disorders of the American Psychiatric Association, but Dogmatic Bayesianism is not in the DSM-5 as an actual psychological disorder - for now.

 

Dogmatic Bayesianism Disorder

Diagnostic Criteria                                              F61.1

A. Individuals are convinced that they know what other people want to know.

 

B. Individuals believe that everyone who is not a Bayesian is wrong. They almost continuously try to convince others, and there is often a strong feeling others need to be ‘saved’.

 

C. Strong tendency to ridicule non-Bayesians, including historical figures. Karl Popper (who opposed Bayesianism, as it opens the door to dogmatism in science) is an especially common target.

 

D. A pervasive obsession with Bayesian statistics, and the use of p-values or hypothesis testing by others, at the expense of flexibility, openness, and acknowledging that other coherent statistical approaches exist to deal with uncertainty.

 

E. Symptoms cause clinically significant impairment in social, occupational, or other important areas of current functioning. The forceful insertion of Bayesian statistics into conversations can drive away conversation partners, to the point where almost all social contact is restricted to other dogmatic Bayesians.

 

F. Individuals believe everyone else will become a Bayesian. Older dogmatic Bayesians often lose bets that everyone will have become a Bayesian by a specific year. If this bet is lost, they continue to claim everyone will become a Bayesian in the future and consider this a rationally updated belief.

 

Specify if:

Persistent: The disorder has been present for more than 12 months.

Specify current severity:

Dogmatic Bayesianism Disorder is specified as severe when an academic exhibits all symptoms of the disorder, at which point individuals are classified as having Bayesian Evangelical Syndrome (BES).

 

Prevalence

Less than 0.01% of the general population is affected, but prevalence is estimated to be 1% to 2% in statistically inclined academics. Due to diagnostic criterion E, prevalence can be up to 80% on certain internet forums. Men are more frequently affected than women, at a ratio estimated at 100:1 or greater.

Development and course

Development of the condition often starts in the early thirties, but onset can occur at an older age, in which case the symptoms are often much more severe. There is currently no treatment, and dogmatic Bayesianism typically increases with age. In rare cases (less than 1%) spontaneous recovery seems to occur, after which individuals often experience prolonged feelings of shame and embarrassment.

Risk and Prognostic Factors

The disorder is more common in individuals with obsessive-compulsive disorder (OCD).

Wednesday, October 29, 2025

Why we should stop using statistical techniques that have not been adequately vetted by experts in psychology

In a recent post on Bluesky, Richard Morey reflects on a paper he published with Clintin Davis-Stober that points out concerns with the p-curve method (Morey & Davis-Stober, 2025). He writes:

 


Also, I think people should stop using forensic meta-analytic techniques that have not been adequately vetted by experts in statistics. The p-curve papers have very little statistical detail, and were published in psych journals. They did not get the scrutiny appropriate to their popularity.

 

Although I understand this post as an affective response, I also think this kind of thinking is extremely dangerous and undermines science. In this blog post I want to unpack some of the consequences of thoughts like this, and discuss how we should deal with quality control instead.

 

Adequately vetted by experts

 

I am a big fan of better vetting of scientific work by experts. I would like expert statisticians to vet the power analysis and statistical analyses in all your papers. But there are some problems. The first is in identifying expert statisticians. There are many statisticians, but some get things wrong. Of course, those are not the experts that we want to do the vetting. So how do we identify expert statisticians?

Let’s see if we can identify expert statisticians by looking at Sue Duval and Richard Tweedie. A look at their CV might convince you they are experts in statistics. But wait! They developed the ‘trim-and-fill’ method. The abstract of their classic 2000 paper is below:

[Screenshot of the abstract of Duval and Tweedie’s 2000 paper in the Journal of the American Statistical Association, introducing the trim-and-fill method.]

It turns out that, contrary to what they write in their abstract, the point estimate for the meta-analytic effect size after adjusting for missing studies is not approximately correct at all (Peters et al., 2007; Terrin et al., 2003). So clearly, Duval and Tweedie are statisticians, but not the expert statisticians that we want to do the vetting. They got things wrong, and more problematically, they got things wrong in the Journal of the American Statistical Association.

 

In some cases, the problems in the work by statisticians are so easy to spot that even a lowly psychologist like myself can point them out. When a team of biostatisticians proposed a ‘second generation p-value’ without mentioning equivalence tests anywhere in their paper, two psychologists (myself and Marie Delacre) had to point out that the statistic they had invented was very similar to an equivalence test, except that it had a number of undesirable properties (Lakens & Delacre, 2020). I guess that, based on this anecdotal experience, there is nothing left to do but create the rule that we should stop using statistical tests that have not been adequately vetted by experts in psychology.

 

Although it greatly helps to have expertise in the topics that you want to scrutinize, sometimes the most fatal criticism comes from elsewhere. Experts make mistakes – overconfidence is a thing. I recently made a very confident statement in a (signed) peer review that I might have been wrong about (I am still examining the topic). I don’t want to be the expert who ‘vets’ a method and allows it to be used based on my authority. More importantly, I think no one should want a science where authorities tell us which methods are vetted, and which are not. It would undermine the very core of what science is to me – a fallible system of knowledge generation which relies on open mutual criticism.

 

Scrutiny appropriate to their popularity

 

I am a big fan of increasing our scrutiny based on how popular something is. Indeed, this is exactly what Peder Isager, myself, and our collaborators propose in our work on the Replication Value: The more popular a finding is, and the less certain, the more deserving of an independent direct replication the study is (Isager et al., 2023, 2024).

 

There are two challenges. The first is that at the moment a method is first published we do not know how popular it will become. So there is a period in which methods exist, and are used, without being criticized, as their popularity takes some time to become clear. The first paper on p-curve analysis was published in 2014 (Simonsohn et al., 2014), with an update in 2015 (Simonsohn et al., 2015). A very compelling criticism of p-curve that pointed out strong limitations was published in a preprint in 2017, and appeared in print 2 years later (Carter et al., 2019). It convincingly showed that p-curve does not work well under heterogeneity, and there often is heterogeneity. Other methods, such as z-curve analysis, were developed and showed better performance under heterogeneity (Brunner & Schimmack, 2020).
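
To make the intuition behind these criticisms concrete, here is a minimal simulation sketch in Python (assuming numpy and scipy are available). It is not the published p-curve procedure, just an illustration of its underlying logic: if studies examine a true effect, the significant p-values pile up near zero (right-skew), and heterogeneity in true effect sizes changes the shape of this distribution. All parameter values are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)

def significant_p_values(effect_sizes, n_per_group=50, alpha=0.05):
    """Simulate two-group t-tests and return only the significant p-values,
    mimicking a literature that selectively reports significant results."""
    p_values = []
    for d in effect_sizes:
        g1 = rng.normal(0, 1, n_per_group)
        g2 = rng.normal(d, 1, n_per_group)
        p = stats.ttest_ind(g1, g2).pvalue
        if p < alpha:
            p_values.append(p)
    return np.array(p_values)

n_studies = 5000
# Homogeneous literature: every study examines the same true effect (d = 0.4).
p_homog = significant_p_values(np.full(n_studies, 0.4))
# Heterogeneous literature: true effects vary across studies (mean 0.4, sd 0.3).
p_hetero = significant_p_values(rng.normal(0.4, 0.3, n_studies))

for label, p in [("homogeneous", p_homog), ("heterogeneous", p_hetero)]:
    # The share of significant p-values below .025 is a crude index of right-skew.
    print(f"{label}: {len(p)} significant results, "
          f"{np.mean(p < 0.025):.2f} below .025")
```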

 

It seems a bit of a stretch to say the p-curve method did not get scrutiny appropriate to its popularity when there were many papers that criticized it, relatively quickly (Aert et al., 2016; Bishop & Thompson, 2016; Ulrich & Miller, 2018). What is fair to say is that statisticians failed to engage with an incredibly important topic (tests for publication bias) that addressed a clear need in many scientific communities, as most of the criticism was by psychological methodologists. I fully agree that statisticians should have engaged more with this technique. I believe the reason that they didn’t is that there is a real problem in the reward structure in statistics, where statisticians get greater rewards inside their field by proposing a 12th approach to compute confidence intervals around a non-parametric effect size estimate for a test that no one uses than by helping psychologists solve a problem they really need a solution for. Indeed, for a statistician, publication bias is a very messy business, and there will never be a sufficiently rigorous test for publication bias to get credit from fellow statisticians. There are no beautiful mathematical solutions, no creative insights, there is only the messy reality of a literature that is biased by human actions that we can never adequately capture in a model. The fact that empirical researchers often don’t know where to begin to evaluate the reliability of claims in their publication-bias-ridden field is not something statisticians care about. But they should care about it.

 

I hope statisticians will start to scrutinize things appropriate to their popularity. If a statistical technique is cited 500 times, 3 statisticians need to drop whatever they are doing, and scrutinize the hell out of this technique. We can randomly select them for ‘statistician duty’.

 

Quality control in science

 

It might come as a surprise, but I don’t actually think we should stop using methods that are not adequately vetted by psychologists, or statisticians for that matter, because I don’t want a science where authorities tell others which methods they can use. Scrutiny is important, but we can’t know how extensively methods should be vetted, we don’t know how to identify experts, and everyone – including ‘experts’ – is fallible. It is naïve to think ‘expert vetting’ will lead to clear answers about which methods we should use, and which we should not. If we can’t even reach agreement about the use of p-values, no one should believe we will ever reach agreement about the use of methods to detect publication bias, which will always be messy at best.

 

I like my science free from authority arguments. Everyone should do their best to criticize everyone else, and if they are able, themselves. Some people will be in a better position to criticize some papers than others, but it is difficult to predict where the most fatal criticism will come from. Treating statistics papers published in a statistics journal as superior to papers in a psychology journal is too messy to be a good idea, and boils down to a form of elitism that I can’t condone. Sometimes even a lowly 20% statistician can point out flaws in methods proposed by card-carrying statisticians.

 

What we can do better is implementing actual quality control. Journal peer review will not suffice, because it is only as good as the two or three peers that happen to be willing and available to review a paper. But it is a start. We should enable researchers to see how well papers are peer reviewed by journals. Without transparency, we can’t calibrate our trust (Vazire, 2017). Peer reviews should be open, for all papers, including papers proposing new statistical methods.

 

If we want our statistical methods to be of high quality, we need to specify quality standards. Morey and Davis-Stober point out the limitations of simulation-based tests of a method and convincingly argue for the value of evaluating the mathematical properties of a testing procedure. If as a field we agree that an evaluation of the mathematical properties of a test is desirable, we should track whether this evaluation has been performed, or not. We could have a long checklist of desirable quality control standards – e.g., a method has been tested on real datasets, it has been compared to similar methods, those comparisons have been performed objectively based on a well-justified set of criteria, etc.

 

One could create a database that lists, for each method, which quality standards have and have not been met. If considered useful, the database could also track how often a method is used, by tracking citations, and list papers that have implemented the method (as opposed to those merely discussing the method). When statistical methods become widely used, the database would point researchers to which methods deserve more scrutiny. The case of magnitude-based inference in sport science reveals the importance of a public call for scrutiny when a method becomes widely popular, especially when this popularity is limited to a single field.
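
To illustrate what I have in mind, here is a hypothetical sketch in Python of what a single entry in such a database could look like. Every field name, threshold, and example value below is an assumption invented for illustration, not a description of an existing system.

```python
from dataclasses import dataclass, field

@dataclass
class MethodRecord:
    name: str
    proposing_paper: str                        # paper that introduced the method
    citation_count: int = 0                     # crude proxy for popularity
    implementing_papers: list = field(default_factory=list)  # papers that apply the method
    standards_met: dict = field(default_factory=dict)        # quality-control checklist

    def needs_scrutiny(self, citation_threshold: int = 500) -> bool:
        """Flag popular methods for which some quality standards are still unmet."""
        return (self.citation_count >= citation_threshold
                and not all(self.standards_met.values()))

# Example entry; the checklist items echo the ones mentioned above,
# and the citation count is purely illustrative.
p_curve = MethodRecord(
    name="p-curve",
    proposing_paper="Simonsohn et al. (2014)",
    citation_count=2000,
    standards_met={
        "mathematical properties evaluated": False,
        "tested on real datasets": True,
        "compared to similar methods": True,
    },
)
print(p_curve.needs_scrutiny())  # True: widely cited, but not all standards met
```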

 

The more complex methods are, the more limitations they have. This will be true for all methods that aim to deal with publication bias, because the way scientists bias the literature is difficult to quantify. Maybe as a field we will come to agree that tests for bias are never accurate enough, and we will recommend that people just look at the distribution of p-values without performing a test. Alternatively, we might believe that it is useful to have a testing procedure that too often suggests a literature contains at least some non-zero effects, because we feel we need some way to intersubjectively point out that there is bias in a literature, even if this is based on an imperfect test. Such discussions require a wide range of stakeholders, and the opinion of statisticians about the statistical properties of a test is only one source of input in this discussion. Imperfect procedures are implemented all the time, if they are the best we have, and doing nothing also does not work.

 

Statistical methods are rarely perfect from their inception, and all have limitations. Although I understand the feeling, banning all tests that have not been adequately vetted by an expert is inherently unscientific. Such a suggestion would destroy the very core of science – an institution that promotes mutual criticism, while accepting our fallibility. As Popper (1962) reminds us: “if we respect truth, we must search for it by persistently searching for our errors: by indefatigable rational criticism, and self-criticism.”

 




References

Aert, R. C. M. van, Wicherts, J. M., & Assen, M. A. L. M. van. (2016). Conducting Meta-Analyses Based on p Values Reservations and Recommendations for Applying p-Uniform and p-Curve. Perspectives on Psychological Science, 11(5), 713–729. https://doi.org/10.1177/1745691616650874

Bishop, D. V., & Thompson, P. A. (2016). Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value. PeerJ, 4, e1715.

Brunner, J., & Schimmack, U. (2020). Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance. Meta-Psychology, 4. https://doi.org/10.15626/MP.2018.874

Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for Bias in Psychology: A Comparison of Meta-Analytic Methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196

Isager, P. M., Lakens, D., van Leeuwen, T., & van ’t Veer, A. E. (2024). Exploring a formal approach to selecting studies for replication: A feasibility study in social neuroscience. Cortex, 171, 330–346. https://doi.org/10.1016/j.cortex.2023.10.012

Isager, P. M., van Aert, R. C. M., Bahník, Š., Brandt, M. J., DeSoto, K. A., Giner-Sorolla, R., Krueger, J. I., Perugini, M., Ropovik, I., van ’t Veer, A. E., Vranka, M., & Lakens, D. (2023). Deciding what to replicate: A decision model for replication study selection under resource and knowledge constraints. Psychological Methods, 28(2), 438–451. https://doi.org/10.1037/met0000438

Lakens, D., & Delacre, M. (2020). Equivalence Testing and the Second Generation P-Value. Meta-Psychology, 4, 1–11. https://doi.org/10.15626/MP.2018.933

Morey, R. D., & Davis-Stober, C. P. (2025). On the poor statistical properties of the P-curve meta-analytic procedure. Journal of the American Statistical Association. Advance online publication. https://doi.org/10.1080/01621459.2025.2544397

Peters, J. L., Sutton, A. J., Jones, D. R., Abrams, K. R., & Rushton, L. (2007). Performance of the trim and fill method in the presence of publication bias and between-study heterogeneity. Statistics in Medicine, 26(25), 4544–4562. https://doi.org/10.1002/sim.2889

Popper, K. R. (1962). Conjectures and refutations: The growth of scientific knowledge. Routledge.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-Curve and Effect Size Correcting for Publication Bias Using Only Significant Results. Perspectives on Psychological Science, 9(6), 666–681.

Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015). Journal of Experimental Psychology. General, 144(6), 1146–1152. https://doi.org/10.1037/xge0000104

Terrin, N., Schmid, C. H., Lau, J., & Olkin, I. (2003). Adjusting for publication bias in the presence of heterogeneity. Statistics in Medicine, 22(13), 2113–2126. https://doi.org/10.1002/sim.1461

Ulrich, R., & Miller, J. (2018). Some properties of p-curves, with an application to gradual publication bias. Psychological Methods, 23(3), 546–560. https://doi.org/10.1037/met0000125

Vazire, S. (2017). Quality Uncertainty Erodes Trust in Science. Collabra: Psychology, 3(1), 1. https://doi.org/10.1525/collabra.74