The 20% Statistician

A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, November 29, 2020

Why I care about replication studies

In 2009 I attended a European Social Cognition Network meeting in Poland. I only remember one talk from that meeting: A short presentation in a nearly empty room. The presenter was a young PhD student - Stephane Doyen. He discussed two studies where he tried to replicate a well-known finding in social cognition research related to elderly priming, which had shown that people walked more slowly after being subliminally primed with elderly related words, compared to a control condition.

His presentation blew my mind. But it wasn’t because the studies failed to replicate – it was widely known in 2009 that these studies couldn’t be replicated. Indeed, around 2007, I had overheard two professors in a corridor discussing the problem that there were studies in the literature everyone knew would not replicate. And they used this exact study on elderly priming as one example. The best solution the two professors came up with to correct the scientific record was to establish an independent committee of experts that would have the explicit task of replicating studies and sharing their conclusions with the rest of the world. To me, this sounded like a great idea.

And yet, in this small conference room in Poland, there was this young PhD student, acting as if we didn’t need specially convened institutions of experts to inform the scientific community that a study could not be replicated. He just got up, told us about how he wasn’t able to replicate this study, and sat down.

It was heroic.

If you're struggling to understand why on earth I thought this was heroic, then this post is for you. You might have entered science in a different time. The results of replication studies are no longer communicated only face to face when running into a colleague in the corridor, or at a conference. But I was impressed in 2009. I had never seen anyone give a talk in which the only message was that an original effect didn’t stand up to scrutiny. People sometimes presented successful replications. They presented null effects in lines of research where the absence of an effect was predicted in some (but not all) tests. But I’d never seen a talk where the main conclusion was just: “This doesn’t seem to be a thing”.

On 12 September 2011 I sent Stephane Doyen an email. “Did you ever manage to publish some of that work? I wondered what has happened to it.” Honestly, I didn’t really expect that he would manage to publish these studies. After all, I couldn’t remember ever having seen a paper in the literature that was just a replication. So I asked, even though I did not expect he would have been able to publish his findings.

Surprisingly enough, he responded that the study would soon appear in press. I wasn’t fully aware of new developments in the publication landscape, where Open Access journals such as PlosOne published articles as long as the work was methodologically solid, and the conclusions followed from the data. I shared this news with colleagues, and many people couldn’t wait to read the paper: An article, in print, reporting the failed replication of a study many people knew to be not replicable. The excitement was not about learning something new. The excitement was about seeing replication studies with a null effect appear in print.

Regrettably, not everyone was equally excited. The publication also led to extremely harsh online comments from the original researcher about the expertise of the authors (e.g., suggesting that findings can fail to replicate due to “Incompetent or ill-informed researchers”), and the quality of PlosOne (“which quite obviously does not receive the usual high scientific journal standards of peer-review scrutiny”). This type of response happened again, and again, and again. Another failed replication led to a letter by the original authors that circulated over email among eminent researchers in the area, was addressed to the replication authors, and ended with “do yourself, your junior co-authors, and the rest of the scientific community a favor. Retract your paper.”

Some of the historical record on discussions between researchers between 2012 and 2015 survives online, in Twitter and Facebook discussions, and blogs. But recently, I started to realize that most early career researchers don’t read about the replication crisis through these original materials, but through summaries, which don’t give the same impression as having lived through these times. It was weird to see established researchers argue that people performing replications lacked expertise. That null results were never informative. That thanks to dozens of conceptual replications, the original theoretical point would still hold up even if direct replications failed. As time went by, it became even weirder to see that none of the researchers whose work was not corroborated in replication studies ever published a preregistered replication study to silence the critics. And why were there even two sides to this debate? Although most people agreed there was room for improvement and that replications should play some role in improving psychological science, there was no agreement on how this should work. I remember being surprised that a field was only now thinking about how to perform and interpret replication studies even though we had been doing psychological research for more than a century.

I wanted to share this autobiographical memory, not just because I am getting old and nostalgic, but also because young researchers are most likely to learn about the replication crisis through summaries and high-level overviews. Summaries of history aren’t very good at communicating how confusing this time was when we lived through it. There was a lot of uncertainty, diversity in opinions, and lack of knowledge. And there were a lot of feelings involved. Most of those things don't make it into written histories. This can make historical developments look cleaner and simpler than they actually were.

It might be difficult to understand why people got so upset about replication studies. After all, we live in a time where it is possible to publish a null result (e.g., in journals that only evaluate methodological rigor, but not novelty, journals that explicitly invite replication studies, and in Registered Reports). Don't get me wrong: We still have a long way to go when it comes to funding, performing, and publishing replication studies, given their important role in establishing regularities, especially in fields that desire a reliable knowledge base. But perceptions about replication studies have changed in the last decade. Today, it is difficult to feel how unimaginable it used to be that researchers in psychology would share their results at a conference or in a scientific journal when they were not able to replicate the work by another researcher. I am sure it sometimes happened. But there was clearly a reason those professors I overheard in 2007 were suggesting to establish an independent committee to perform and publish studies of effects that were widely known to be not replicable.

As people started to talk about their experiences trying to replicate the work of others, the floodgates opened, and the scales fell from people's eyes. Let me tell you that, from my personal experience, we didn't call it a replication crisis for nothing. All of a sudden, many researchers who thought it was their own fault when they couldn't replicate a finding started to realize this problem was systemic. It didn't help that in those days it was difficult to communicate with people you didn't already know. Twitter (which is most likely the medium through which you learned about this blog post) launched in 2006, but up to 2010 hardly any academics used this platform. Back then, it wasn't easy to get information outside of the published literature. It's difficult to express how it feels when you realize 'it's not me - it's all of us'. Our environment influences which phenotypic traits express themselves. These experiences made me care about replication studies.

If you started in science when replications were at least somewhat more rewarded, it might be difficult to understand what people were making a fuss about in the past. It's difficult to go back in time, but you can listen to the stories by people who lived through those times. Some highly relevant stories were shared after the recent multi-lab failed replication of ego-depletion (see tweets by Tom Carpenter and Dan Quintana). You can ask any older researcher at your department for similar stories, but do remember that it will be a lot more difficult to hear the stories of the people who left academia because most of their PhD consisted of failures to build on existing work.

If you want to try to feel what living through those times must have been like, consider this thought experiment. You attend a conference organized by a scientific society where all society members get to vote on who will be a board member next year. Before the votes are cast, the president of the society informs you that one of the candidates has been disqualified. The reason is that it has come to the society’s attention that this candidate selectively reported results from their research lines: The candidate submitted only those studies for publication that confirmed their predictions, and did not share studies with null results, even though these null results were well designed studies that tested sensible predictions. Most people in the audience, including yourself, were already aware of the fact that this person selectively reported their results. You knew publication bias was problematic from the moment you started to work in science, and the field knew it was problematic for centuries. Yet here you are, in a room at a conference, where this status quo is not accepted. All of a sudden, it feels like it is possible to actually do something about a problem that has made you feel uneasy ever since you started to work in academia.

You might live through a time where publication bias is no longer silently accepted as an unavoidable aspect of how scientists work, and if this happens, the field will likely have a very similar discussion as it did when it started to publish failed replication studies. And ten years later, a new generation will have been raised under different scientific norms and practices, where extreme publication bias is a thing of the past. It will be difficult to explain to them why this topic was a big deal a decade ago. But since you’re getting old and nostalgic yourself, you think that it’s useful to remind them, and you just might try to explain it to them in a 2 minute TikTok video.

History merely repeats itself. It has all been done before. Nothing under the sun is truly new.
Ecclesiastes 1:9

Thanks to Farid Anvari, Ruben Arslan, Noah van Dongen, Patrick Forscher, Peder Isager, Andrea Kis, Max Maier, Anne Scheel, Leonid Tiokhin, and Duygu Uygun for discussing this blog post with me (and in general for providing such a stimulating social and academic environment in times of a pandemic).

Saturday, October 17, 2020

The p-value misconception eradication challenge

If you have educational material that you think will do a better job at preventing p-value misconceptions than the material in my MOOC, join the p-value misconception eradication challenge by proposing an improvement to my current material in a new A/B test in my MOOC.

I launched a massive open online course “Improving your statistical inferences” in October 2016. So far around 47k students have enrolled, and the evaluations suggest it has been a useful resource for many researchers. The first week focusses on p-values, what they are, what they aren’t, and how to interpret them.

Arianne Herrera-Bennet was interested in whether an understanding of p-values was indeed “impervious to correction” as some statisticians believe (Haller & Krauss, 2002, p. 1) and collected data on accuracy rates on ‘pop quizzes’ between August 2017 and 2018 to examine if there was any improvement in p-value misconceptions that are commonly examined in the literature. The questions were asked at the beginning of the course, after relevant content was taught, and at the end of the course. As the figure below from the preprint shows, there was clear improvement, and accuracy rates were quite high for 5 items, and reasonable for 3 items.


We decided to perform a follow-up from September 2018 where we added an assignment to week one for half the students in an ongoing A/B test in the MOOC. In this new assignment, we didn’t just explain what p-values are (as in the first assignment in the module all students do) but we also tried to specifically explain common misconceptions, to explain what p-values are not. The manuscript is still in preparation, but there was additional improvement for at least some misconceptions. It seems we can develop educational material that prevents p-value misconceptions – but I am sure more can be done. 

In my paper to appear in Perspectives on Psychological Science on “The practical alternative to the p-value is the correctly used p-value” I write:

“Looking at the deluge of papers published in the last half century that point out how researchers have consistently misunderstood p-values, I am left to wonder: Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? It is nowadays relatively straightforward to create online apps where people can simulate studies and see the behavior of p values across studies, which can easily be combined with exercises that fit the knowledge level of bachelor and master students. The second point I want to make in this article is that a dedicated attempt to develop evidence based educational material in a cross-disciplinary team of statisticians, educational scientists, cognitive psychologists, and designers seems worth the effort if we really believe young scholars should understand p values. I do not think that the effort statisticians have made to complain about p-values is matched with a similar effort to improve the way researchers use p values and hypothesis tests. We really have not tried hard enough.”

If we honestly feel that misconceptions of p-values are a problem, and there are early indications that good education material can help, let’s try to do all we can to eradicate p-value misconceptions from this world.

We have collected enough data in the current A/B test. I am convinced the experimental condition adds some value to people’s understanding of p-values, so I think it would be best educational practice to stop presenting students with the control condition.

However, there might be educational material out there that does a much better job than the educational material I made, to train away misconceptions. So instead of giving all students my own new assignment, I want to give anyone who thinks they can do an even better job the opportunity to demonstrate this. If you have educational material that you think will work even better than my current material, I will create a new experimental condition that contains your teaching material. Over time, we can see which material performs better, and work towards creating the best educational material to prevent misunderstandings of p-values we can.

If you are interested in working on improving p-value education material, take a look at the first assignment in the module that all students do, and look at the new second assignment I have created to train away misconceptions (and the answers). Then, create (or adapt) educational material such that the assignment is similar in length and content. The learning goal should be to train away common p-value misconceptions; you can focus on any and all you want. If there are multiple people who are interested, we collectively vote on which material we should test first (but people are free to combine their efforts, and work together on one assignment). What I can offer is getting your material in front of between 300 and 900 students who enroll each week. Not all of them will start, not all of them will do the assignments, but your material should reach at least several hundred learners a year, of whom around 40% have a master's degree, and 20% have a PhD, so you will be teaching fellow scientists (and beyond) to improve how they work.

I will incorporate this new assignment, and make it publicly available on my blog, as soon as it is done and decided on by all people who expressed interest in creating high quality teaching material. We can evaluate the performance by looking at the accuracy rates on test items. I look forward to seeing your material, and hope this can be a small step towards an increased effort in improving statistics education. We might have a long way to go to completely eradicate p-value misconceptions, but we can start.

Saturday, September 19, 2020

P-hacking and optional stopping have been judged violations of scientific integrity

On July 28, 2020, the first Dutch academic was judged to have violated the code of conduct for research integrity for p-hacking and optional stopping with the aim of improving the chances of obtaining a statistically significant result. I think this is a noteworthy event that marks a turning point in the way the scientific research community interprets research practices that up to a decade ago were widely practiced. The researcher in question violated scientific integrity in several other important ways, including withdrawing blood without ethical consent, and writing grant proposals in which studies and data were presented that had not been performed and collected. But here, I want to focus on the judgment about p-hacking and optional stopping.

When I studied at Leiden University from 1998 to 2002 and commuted by train from my hometown of Rotterdam I would regularly smoke a cigar in the smoking compartment of the train during my commute. If I would enter a train today and light a cigar, the responses I would get from my fellow commuters would be markedly different than 20 years ago. They would probably display moral indignation or call the train conductor who would give me a fine. Times change.

When the report on the fraud case of Diederik Stapel came out, the three committees were surprised by a research culture that accepted “sloppy science”. But they did not directly refer to these practices as violations of the code of conduct for research integrity. For example, on page 57 they wrote:

 “In the recommendations, the Committees not only wish to focus on preventing or reducing fraud, but also on improving the research culture. The European Code refers to ‘minor misdemeanours’: some data massage, the omission of some unwelcome observations, ‘favourable’ rounding off or summarizing, etc. This kind of misconduct is not categorized under the ‘big three’ (fabrication, falsification, plagiarism) but it is, in principle, equally unacceptable and may, if not identified or corrected, easily lead to more serious breaches of standards of integrity.”

Compare this to the report by LOWI, the Dutch National Body for Scientific Integrity, for a researcher at Leiden University who was judged to violate the code of conduct for research integrity for p-hacking and optional stopping (note this is my translation from Dutch of the advice on page 17 point IV, and point V on page 4):

“The Board has rightly ruled that Petitioner has violated standards of academic integrity with regard to points 2 to 5 of the complaint.”

With this, LOWI has judged that the Scientific Integrity Committee of Leiden University (abbreviated as CWI in Dutch) ruled correctly with respect to the following:

“According to the CWI, the applicant also acted in violation of scientific integrity by incorrectly using statistical methods (p-hacking) by continuously conducting statistical tests during the course of an experiment and by supplementing the research population with the aim of improving the chances of obtaining a statistically significant result.”

As norms change, what we deemed a misdemeanor before is now simply classified as a violation of academic integrity. I am sure this is very upsetting for this researcher. We’ve seen similar responses in the past years, where single individuals suffered more than the average researcher for behaviors that many others performed as well. They might feel unfairly singled out. The only difference between this researcher at Leiden University, and several others who performed identical behaviors, was that someone in their environment took the 2018 Netherlands Code of Conduct for Research Integrity seriously when they read section 3.7, point 56:

Call attention to other researchers’ non-compliance with the standards as well as inadequate institutional responses to non-compliance, if there is sufficient reason for doing so.

When it comes to smoking, rules in The Netherlands are regulated through laws. You’d think this would mitigate confusion, insecurity, and negative emotions during a transition – but that would be wishful thinking. In The Netherlands the whole transition has been ongoing for close to two decades, from an initial law allowing a smoke-free working environment in 2004, to a completely smoke-free university campus in August 2020.

The code of conduct for research integrity is not governed by laws, and enforcement of the code of conduct for research integrity is typically not anyone’s full time job. We can therefore expect the change to be even slower than the changes in what we feel is acceptable behavior when it comes to smoking. But there is notable change over time.

We see a shift from the “big three” types of misconduct (fabrication, falsification, plagiarism), and somewhat vague language of misdemeanors, that is “in principle” equally unacceptable, and might lead to more serious breaches of integrity, to a clear classification of p-hacking and optional stopping as violations of scientific integrity. Indeed, if you ask me, the ‘bigness’ of plagiarism pales compared to how publication bias and selective reporting distort scientific evidence.

Compare this to smoking laws in The Netherlands, where early on it was still allowed to create separate smoking rooms in buildings, while from August 2020 onwards all school and university terrain (i.e., the entire campus, inside and outside of the buildings) needs to be a smoke-free environment. Slowly but surely, what is seen as acceptable changes.

I do not consider myself to be an exceptionally big idiot – I would say I am pretty average on that dimension – but it did not occur to me how stupid it was to enter a smoke-filled train compartment and light up a cigar during my 30 minute commute around the year 2000. At home, I regularly smoked a pipe (a gift from my father). I still have it. Just looking at the tar stains now makes me doubt my own sanity.


This is despite the fact that the relation between smoking and cancer was pretty well established since the 1960s. Similarly, when I did my PhD between 2005 and 2009 I was pretty oblivious to the error rate inflation due to optional stopping, despite the fact that one of the more important papers on this topic was published by Armitage, McPherson, and Rowe in 1969. I did realize that flexibility in analyzing data could not be good for the reliability of the findings we reported, but just like when I lit a cigar in the smoking compartment in the train, I failed to adequately understand how bad it was.
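To get a feeling for how large this inflation can be, here is a minimal simulation sketch. It is an illustration under assumptions chosen for this example, not a reproduction of the analyses by Armitage, McPherson, and Rowe: a z-test with a known standard deviation of 1, a true effect of exactly zero, and a researcher who tests after every batch of observations and stops at the first p < .05. The function name `false_positive_rate`, the batch sizes, and the number of simulations are all hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, batch_size, n_sims=4000, alpha=0.05, seed=1):
    """Share of null simulations that reach |z| > critical value at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(looks):
            # Collect one more batch; the true mean is 0, so any "effect" is noise.
            total += sum(rng.gauss(0, 1) for _ in range(batch_size))
            n += batch_size
            z = total / n ** 0.5  # z-test with an assumed known SD of 1
            if abs(z) > z_crit:
                hits += 1
                break  # optional stopping: stop as soon as the test is significant
    return hits / n_sims


single_look = false_positive_rate(looks=1, batch_size=100)
five_looks = false_positive_rate(looks=5, batch_size=20)
print(f"One planned analysis at N = 100: {single_look:.3f}")
print(f"Testing after every 20 observations: {five_looks:.3f}")
```

With one planned analysis the simulated false positive rate should stay close to the nominal 5%, while testing at five points and stopping at the first significant result pushes it well above that, even though the maximum sample size is the same.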

When smoking laws became stricter, there was a lot of discussion in society. One might even say there was a strong polarization, where on the one hand newspaper articles appeared that claimed how outlawing smoking in the train was ‘totalitarian’, while we also had family members who would no longer allow people to smoke inside their house, which led my parents (both smokers) to stop visiting these family members. Changing norms leads to conflict. People feel personally attacked, they become uncertain, and in the discussions that follow we will see all opinions ranging from how people should be free to do what they want, to how people who smoke should pay more for healthcare.

We’ve seen the same in scientific reform, although the discussion is more often along the lines of how smoking can’t be that bad if my 95 year old grandmother has been smoking a packet a day for 70 years and feels perfectly fine, to how alcohol use or lack of exercise are much bigger problems and why isn’t anyone talking about those.

But throughout all this discussion, norms just change. Even my parents stopped smoking inside their own home around a decade ago. The Dutch National Body for Scientific Integrity has classified p-hacking and optional stopping as violations of research integrity. Science is continuously improving, but change is slow. Someone once explained to me that correcting the course of science is like steering an oil tanker - any change in direction takes a while to be noticed. But when change happens, it’s worth standing still to reflect on it, and look at how far we’ve come.

Wednesday, August 5, 2020

Feasibility Sample Size Justification

This blog post is now included in the paper "Sample size justification" available at PsyArXiv.
When you perform an experiment, you want it to provide an answer to your research question that is as informative as possible. However, since all scientists are faced with resource limitations, you need to balance the cost of collecting each additional datapoint against the increase in information that datapoint provides. In economics, this is known as the Value of Information (Eckermann et al., 2010). Calculating the value of information is notoriously difficult. You need to specify the costs and benefits of possible outcomes of the study, and quantifying a utility function for scientific research is not easy.

Because of the difficulty of quantifying the value of information, scientists use less formal approaches to justify the amount of data they set out to collect. That is, if they provide a justification for the number of observations to begin with. Even though in some fields a justification for the number of observations is required when submitting a grant proposal to a science funder, a research proposal to an ethical review board, or a manuscript to a journal, in other research fields the number of observations is stated, but not justified. This makes it difficult to evaluate how informative the study was. Referees can’t just assume the number of observations is sufficient to provide an informative answer to your research question, so leaving out a justification for the number of observations is not best practice, and a reason reviewers can criticize your submitted article.

A common reason why a specific number of observations is collected is because collecting more data was not feasible. Note that all decisions for the sample size we collect in a study are based on the resources we have available in some way. A feasibility justification makes these resource limitations the primary reason for the sample size that is collected. Because we always have resource limitations in science, even when feasibility is not our primary justification for the number of observations we plan to collect, feasibility is always at least a secondary reason for any sample size justification. Despite the omnipresence of resource limitations, the topic often receives very little attention in texts on experimental design. This might make it feel like a feasibility justification is not appropriate, and you should perform an a-priori power analysis or plan for a desired precision instead. But feasibility limitations play a role in every sample size justification, and therefore regardless of which justification for the sample size you provide, you will almost always need to include a feasibility justification as well.

Time and money are the two main resource limitations a scientist faces. Our master students write their thesis in 6 months, and therefore their data collection is necessarily limited to whatever can be collected in 6 months, minus the time needed to formulate a research question, design an experiment, analyze the data, and write up the thesis. A PhD student at our department would have 4 years to complete their thesis, but is also expected to complete multiple research lines in this time. In addition to limitations on time, we have limited financial resources. Although nowadays it is possible to collect data online quickly, if you offer participants a decent pay (as you should) most researchers do not have the financial means to collect thousands of datapoints.

A feasibility justification puts the limited resources at the center of the justification for the sample size that will be collected. For example, one might argue that 120 observations is the most that can be collected in the three weeks a master student has available to collect data, when each observation takes an hour to collect. A PhD student might collect data until the end of the academic year, and then needs to write up the results over the summer to stay on track to complete the thesis in time.

A feasibility justification thus starts with the number of observations (N) that a researcher expects to be able to collect. The challenge is to evaluate whether collecting N observations is worthwhile. The answer should sometimes be that data collection is not worthwhile. For example, assume I plan to manipulate the mood of participants using funny cartoons and then measure the effect of mood on some dependent variable - say the amount of money people donate to charity. I should expect an effect size around d = 0.31 for the mood manipulation (Joseph et al., 2020), and it seems unlikely that the effect on donations will be larger than the effect size of the manipulation. If I can only collect mood data from 30 participants in total, how do I decide if this study will be informative?

How informative is the data that is feasible to collect?

If we want to evaluate whether the feasibility limitations make data collection uninformative, we need to think about what the goal of data collection is. First of all, having data always provides more knowledge than not having data, so in an absolute sense, all additional data that is collected is better than not collecting data. However, in line with the idea that we need to take into account costs and benefits, it is possible that the cost of data collection outweighs the benefits. To determine this, one needs to think about what the benefits of having the data are. The benefits are clearest when we know for certain that someone is going to make a decision, with or without data. If this is the case, then any data you collect will reduce the error rates of a well-calibrated decision process, even if only ever so slightly. In these cases, the value of information might be positive, as long as the reduction in error rates is more beneficial than the costs of data collection. If your sample size is limited and you know you will make a decision anyway, perform a compromise power analysis, where you balance the error rates, given a specified effect size and sample size.
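As a rough sketch of what such a compromise power analysis involves, the code below searches for the alpha level at which the Type 1 and Type 2 error rates are equal, using the example of d = 0.31 with 15 participants per group. It relies on a normal approximation to a two-sided, two-group comparison rather than the exact t-distribution, and the function names `type2_error` and `compromise_alpha` as well as the `beta_alpha_ratio` parameter are hypothetical choices for this illustration. Dedicated power analysis software will give somewhat different numbers.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type2_error(alpha, d, n_per_group):
    """Approximate Type 2 error rate of a two-sided, two-group comparison."""
    ncp = d * (n_per_group / 2) ** 0.5  # expected z-statistic if the effect is d
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the z-statistic lands between the two critical values.
    return STD_NORMAL.cdf(z_crit - ncp) - STD_NORMAL.cdf(-z_crit - ncp)


def compromise_alpha(d, n_per_group, beta_alpha_ratio=1.0):
    """Find alpha such that beta / alpha equals the requested ratio (bisection)."""
    low, high = 1e-9, 1 - 1e-9
    for _ in range(100):
        mid = (low + high) / 2
        # beta falls as alpha rises, so this difference decreases in alpha.
        if type2_error(mid, d, n_per_group) - beta_alpha_ratio * mid > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


alpha = compromise_alpha(d=0.31, n_per_group=15)
beta = type2_error(alpha, d=0.31, n_per_group=15)
print(f"balanced error rates: alpha = {alpha:.3f}, beta = {beta:.3f}")
```

With so few observations and such a small effect, both balanced error rates end up large, which is itself informative about how much a decision based on these data can be trusted. Setting `beta_alpha_ratio` to a value other than 1 expresses that one type of error is considered more costly than the other.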

Another way in which a small dataset can be valuable is if its existence makes it possible to combine several small datasets into a meta-analysis. This argument in favor of collecting a small dataset requires 1) that you share the results in a way that a future meta-analyst can find them regardless of the outcome of the study, and 2) that there is a decent probability that someone will perform a meta-analysis in the future whose inclusion criteria would contain your study, because a sufficient number of small studies exist. The uncertainty about whether there will ever be such a meta-analysis should be weighed against the costs of data collection. Will anyone else collect more data on cognitive performance during bungee jumps, to complement the 12 data points you can collect?

One way to increase the probability of a future meta-analysis is to commit to performing this meta-analysis yourself in the future. For example, you might plan to repeat a study for the next 12 years in a class you teach, with the expectation that a meta-analysis of 360 participants would be sufficient to achieve around 90% power for d = 0.31. If it is not plausible you will collect all the required data by yourself, you can attempt to set up a collaboration, where fellow researchers in your field commit to collecting similar data, with identical measures, over the coming years. If it is not likely that sufficient data will emerge over time, we will not be able to draw informative conclusions from the data, and it might be more beneficial to not collect the data to begin with, and to examine an alternative research question with a larger effect size instead.

Even if you believe that sufficient data will emerge over time, you will most likely compute statistics after collecting a small sample. Before embarking on a study where your main justification for the sample size is based on feasibility, it is worth reflecting on what you can expect from these statistics. I propose that a feasibility justification for the sample size, in addition to a reflection on the plausibility that a future meta-analysis will be performed, and/or the need to make a decision, even with limited data, should always be accompanied by three statistics, detailed in the following three sections.

The smallest effect size that can be statistically significant

In the figure below, the distribution of Cohen’s d given 15 participants per group is plotted when the true effect size is 0 (or the null-hypothesis is true), and when the true effect size is d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α, when there is a true effect, and α = 0.05). One minus the Type 2 error rate is the statistical power of the test, given an assumption about a true effect size in the population. Statistical power is the probability of a test to yield a statistically significant result if the alternative hypothesis is true. Power depends on the Type 1 error rate (α), the true effect size in the population, and the number of observations.

Null and alternative distribution, assuming d = 0.5, alpha = 0.05, and N = 15 per group.
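As a small illustration of the quantities in this figure, the power and the Type 2 error rate can be computed with `power.t.test()` from base R. This is a sketch using the values from the figure (d = 0.5, alpha = 0.05, 15 participants per group). Setting `sd = 1` so that `delta` can be read as Cohen's d is my own choice for this illustration.

```r
# Power and Type 2 error rate for a two-sided independent t-test,
# using the values from the figure: d = 0.5, alpha = 0.05, n = 15 per group.
# With sd = 1, the raw mean difference (delta) equals Cohen's d.
res_power <- power.t.test(n = 15, delta = 0.5, sd = 1, sig.level = 0.05,
                          type = "two.sample", alternative = "two.sided")
power <- res_power$power
type2_error <- 1 - power
round(c(power = power, type2_error = type2_error), 2)
```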

You might have seen such graphs before. The only thing I have done is to transform the t-value distribution that is commonly used in these graphs, and calculate the distribution for Cohen’s d. This is a straightforward transformation, but instead of presenting the critical t-value the figure provides the critical d-value. For a two-sided independent t-test, this is calculated as:

qt(1-(a / 2), (n1 + n2) - 2) * sqrt(1/n1 + 1/n2)

where ‘a’ is the alpha level (e.g., 0.05) and n1 and n2 are the sample sizes in each independent group. For the example above, where alpha is 0.05 and n = 15 in each group:

qt(1-(0.05 / 2), (15 * 2) - 2) * sqrt(1/15 + 1/15)

## [1] 0.7479725

The critical t-value (2.0484071) is also provided in commonly used power analysis software such as G*Power. We can compute the critical Cohen’s d from the t-value and sample size using $d_{crit} = t_{crit}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.

The critical t-value is provided by G*Power software.

When you test an association between variables with a correlation, G*Power will directly provide you with the critical effect size. When you compute a correlation based on a two-sided test, your alpha level is 0.05, and you have 30 observations, only effects larger than r = 0.361 will be statistically significant. In other words, the effect needs to be quite large to even have the mathematical possibility of becoming statistically significant.

The critical r is provided by G*Power software.
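If you do not have G*Power at hand, the critical correlation can also be derived from the critical t-value. The sketch below uses the standard relation between a correlation and its t-statistic, r = t / sqrt(t² + df), with df = n - 2. The variable names are my own.

```r
# Critical correlation for a two-sided test with alpha = 0.05 and n = 30
n <- 30
alpha <- 0.05
df <- n - 2
t_crit <- qt(1 - alpha / 2, df)          # critical t-value
r_crit <- t_crit / sqrt(t_crit^2 + df)   # convert t to r
round(r_crit, 3)
## [1] 0.361
```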

The critical effect size gives you information about the smallest effect size that, if observed, would be statistically significant. If you observe a smaller effect size, the p-value will be larger than your significance threshold. You always have some probability of observing effects larger than the critical effect size. After all, even if the null hypothesis is true, 5% of your tests will yield a significant effect. But what you should ask yourself is whether the effect sizes that could be statistically significant are realistically what you would expect to find. If this is not the case, it should be clear that there is little (if any) use in performing a significance test. Mathematically, when the critical effect size is larger than effects you expect, your statistical power will be less than 50%. If you perform a statistical test with less than 50% power, your single study is not very informative. Reporting the critical effect size in a feasibility justification should make you reflect on whether a hypothesis test will yield an informative answer to your research question.

Compute the width of the confidence interval around the effect size

The second statistic to report alongside a feasibility justification is the width of the 95% confidence interval around the effect size. 95% confidence intervals will capture the true population parameter 95% of the time in repeated identical experiments. The more uncertain we are about the true effect size, the wider a confidence interval will be. Cumming (2013) calls the difference between the observed effect size and the upper (or lower) bound of its 95% confidence interval the margin of error (MOE).

# Compute the effect size d and 95% CI
res <- MOTE::d.ind.t(m1 = 0, m2 = 0, sd1 = 1, sd2 = 1, n1 = 15, n2 = 15, a = .05)

# Print the result
res$estimate

## [1] "$d_s$ = 0.00, 95\\% CI [-0.72, 0.72]"

If we compute the 95% CI for an effect size of 0, we see that with 15 observations in each condition of an independent t-test the 95% CI ranges from -0.72 to 0.72. The MOE is half the width of the 95% CI, 0.72. This clearly shows we have a very imprecise estimate. A Bayesian estimator who uses an uninformative prior would compute a credible interval with the same upper and lower bound, and might conclude they personally believe there is a 95% chance the true effect size lies in this interval. A frequentist would reason more hypothetically: If the observed effect size in the data I plan to collect is 0, I could only reject effects more extreme than d = 0.72 in an equivalence test with a 5% alpha level (even though, if such a test were performed, power might be low, depending on the true effect size). Regardless of the statistical philosophy you plan to rely on when analyzing the data, our evaluation of what we can conclude based on the width of our interval tells us we will not learn a lot. Effect sizes in the range of d = 0.7 are findings such as “People become aggressive when they are provoked”, “People prefer their own group to other groups”, and “Romantic partners resemble one another in physical attractiveness” (Richard et al., 2003). The width of the confidence interval tells you that you can only reject the presence of effects that are so large that, if they existed, you would probably already have noticed them. It might still be important to establish these large effects in a well-controlled experiment. But since most effect sizes we should realistically expect are much smaller, we do not learn something we didn’t already know from the data that we plan to collect. Even without data, we would exclude effects larger than d = 0.7 in most research lines.

We see that the MOE is almost, but not exactly, the same as the critical effect size d we observed above (d = 0.7479725). The reason for this is that the 95% confidence interval is calculated based on the t-distribution. If the true effect size is not zero, the confidence interval is calculated based on the non-central t-distribution, and the 95% CI is asymmetric. The figure below visualizes three t-distributions, one symmetric at 0, and two asymmetric distributions with a noncentrality parameter of 2 and 3. The asymmetry is most clearly visible in very small samples (the distributions in the plot have 5 degrees of freedom) but remains noticeable when calculating confidence intervals and statistical power. For example, for a true effect size of d = 0.5 the 95% CI is [-0.23, 1.22]. The MOE based on the lower bound is 0.7317584 and based on the upper bound is 0.7231479. If we compute the 95% CI around the critical effect size (d = 0.7479725) we see the 95% CI ranges from exactly 0.00 to 1.48. If the 95% CI excludes zero, the test is statistically significant. In this case the lower bound of the confidence interval exactly touches 0, which means we would observe p = 0.05 if we exactly observed the critical effect size.

Central (black) and 2 non-central (red and blue) t-distributions.
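To see where the asymmetry comes from, the confidence interval around Cohen's d can be sketched in base R by searching for the noncentrality parameters that place the observed t-value at the 97.5th and 2.5th percentile of a non-central t-distribution. The function `ci_d()` is a hypothetical helper written for this illustration, not a function from MOTE, and the search interval of 6 units around the observed t-value is an assumption that works for effects of this size. For 15 participants per group it should come close to the intervals reported above.

```r
# 95% CI around Cohen's d for two independent groups via the non-central t.
# ci_d() is a hypothetical helper for illustration, not part of any package.
ci_d <- function(d, n1, n2, conf_level = 0.95) {
  alpha <- 1 - conf_level
  df <- n1 + n2 - 2
  d_to_t <- sqrt(1 / n1 + 1 / n2)   # d = t * d_to_t
  t_obs <- d / d_to_t
  # Find the noncentrality parameter that puts t_obs at cumulative probability p
  find_ncp <- function(p) {
    uniroot(function(ncp) pt(t_obs, df, ncp) - p,
            interval = t_obs + c(-6, 6), tol = 1e-10)$root
  }
  c(lower = find_ncp(1 - alpha / 2) * d_to_t,
    upper = find_ncp(alpha / 2) * d_to_t)
}

ci_zero <- ci_d(0, 15, 15)      # symmetric around 0
ci_half <- ci_d(0.5, 15, 15)    # asymmetric around 0.5
d_crit <- qt(1 - 0.05 / 2, 28) * sqrt(1 / 15 + 1 / 15)
ci_crit <- ci_d(d_crit, 15, 15) # lower bound touches 0
round(rbind(ci_zero, ci_half, ci_crit), 2)
```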

Where computing the critical effect size can make it clear that a p-value is of little interest, computing the 95% CI around the effect size can make it clear that the effect size estimate is of little value. It will often be so uncertain, and the range of effect sizes you will not be able to reject if there is no effect will be so large, that the effect size estimate is not very useful. This is also the reason why performing a pilot study to estimate an effect size for an a-priori power analysis is not a sensible strategy (Albers & Lakens, 2018; Leon et al., 2011). Your effect size estimate will be so uncertain that it is not a good guide in an a-priori power analysis.

However, it is possible that the sample size is large enough to exclude some effect sizes that are still a-priori plausible. For example, with 50 observations in each independent group, you have 82% power for an equivalence test with bounds of -0.6 and 0.6. If the literature includes claims of effect size estimates larger than 0.6, and if effects larger than 0.6 can be rejected based on your data, this might be sufficient to tentatively start to question claims in the literature, and the data you collect might fulfill that very specific goal.
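The 82% can be approximated by hand. The sketch below assumes that the true effect size is exactly 0, that the bounds are symmetric, and that a normal approximation to the two one-sided tests (TOST) procedure is good enough. Dedicated software might use a slightly different calculation.

```r
# Approximate power of a TOST equivalence test for two independent groups.
# Assumptions: true effect size of 0, symmetric bounds, normal approximation.
n <- 50             # observations per group
alpha <- 0.05
bound <- 0.6        # equivalence bounds of -0.6 and 0.6 (Cohen's d)
se <- sqrt(2 / n)   # approximate standard error of d when the true d is 0
power_tost <- 2 * pnorm(bound / se - qnorm(1 - alpha)) - 1
round(power_tost, 2)
## [1] 0.82
```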

Plot a sensitivity power analysis

In a sensitivity power analysis the sample size and the alpha level are fixed, and you compute the effect size you have the desired statistical power to detect. For example, in the Figure below the sample size in each group is set to 15, the alpha level is 0.05, and the desired power is set to 90%. The sensitivity power analysis shows we have 90% power to detect an effect of d = 1.23.

Sensitivity power analysis in G*Power software.
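The same sensitivity analysis can be sketched in base R: if you leave out `delta` in `power.t.test()` and provide the sample size, alpha level, and desired power, the function solves for the effect size. Setting `sd = 1` so that `delta` can be read as Cohen's d is my own choice, and small numerical differences from G*Power are possible.

```r
# Sensitivity power analysis: which effect size can we detect with 90% power,
# given n = 15 per group and alpha = 0.05 in a two-sided independent t-test?
# With sd = 1, the returned delta can be read as Cohen's d.
sens_90 <- power.t.test(n = 15, power = 0.90, sd = 1, sig.level = 0.05,
                        type = "two.sample", alternative = "two.sided")
d_90 <- sens_90$delta
round(d_90, 2)
```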

Perhaps you feel a power of 90% is a bit high, and you would be happy with 80% power. We can plot a sensitivity curve across all possible levels of statistical power. In the figure below we see that if we desire 80% power, the effect size should be d = 1.06. The smaller the true effect size, the lower the power we have. This plot should again remind us not to put too much faith in a significance test when our sample size is small, since for 15 observations in each condition, statistical power is very low for anything but extremely large effect sizes.

Plot of the effect size against the desired power when n = 15 per group and alpha = 0.05.
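A curve like this one can be sketched in base R by repeating the sensitivity analysis across a range of desired power levels. The grid from 10% to 95% power in steps of 5% is an arbitrary choice for this illustration, and the plot will not look identical to the G*Power output.

```r
# Sensitivity curve: detectable effect size across desired power levels,
# for n = 15 per group and alpha = 0.05 (two-sided independent t-test)
power_levels <- seq(0.10, 0.95, by = 0.05)
d_detectable <- sapply(power_levels, function(p) {
  power.t.test(n = 15, power = p, sd = 1, sig.level = 0.05)$delta
})

# Table of desired power against the effect size we could detect
sensitivity <- data.frame(power = power_levels, d = round(d_detectable, 2))
print(sensitivity)

# Plot the effect size against the desired power
plot(d_detectable, power_levels, type = "l",
     xlab = "Effect size (Cohen's d)", ylab = "Desired power")
```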

If we look at the effect size that we would have 50% power for, we see it is d = 0.7411272. This is very close to our critical effect size of d = 0.7479725 (the smallest effect size that, if observed, would be significant). The difference is due to the non-central t-distribution.
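To check the relation between the critical effect size and 50% power yourself, you can compute the power of the test when the true effect size is exactly equal to the critical effect size. The sketch below uses the non-central t-distribution directly. The variable names are my own.

```r
# Power when the true effect size equals the critical effect size
n <- 15
alpha <- 0.05
df <- 2 * n - 2
t_crit <- qt(1 - alpha / 2, df)
d_crit <- t_crit * sqrt(1 / n + 1 / n)

# If the true d equals d_crit, the noncentrality parameter equals t_crit
ncp <- d_crit / sqrt(1 / n + 1 / n)
power_at_crit <- pt(t_crit, df, ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp)

# Effect size for which we have exactly 50% power
d_50 <- power.t.test(n = n, power = 0.5, sd = 1, sig.level = alpha)$delta

round(c(d_crit = d_crit, d_50 = d_50, power_at_crit = power_at_crit), 4)
```

The power at the critical effect size should come out slightly above 50%, which is why the effect size needed for exactly 50% power is slightly smaller than the critical effect size.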

Reporting a feasibility justification

To summarize, I recommend addressing the following components in a feasibility sample size justification. Addressing these points explicitly will allow you to evaluate for yourself if collecting the data will have scientific value. If not, there might be other reasons to collect the data. For example, at our department, students often collect data as part of their education. However, if the primary goal of data collection is educational, the sample size that is collected can be very small. It is often educational to collect data from a small number of participants to experience what data collection looks like in practice, but there is often no educational value in collecting data from more than 10 participants. Despite the small sample size, we often require students to report statistical analyses as part of their education, which is fine as long as it is clear the numbers that are calculated cannot meaningfully be interpreted. The table below should help to evaluate if the interpretation of statistical tests has any value, or not.

Overview of recommendations when reporting a sample size justification based on feasibility.

| What to address? | How to address it? |
|---|---|
| Will a future meta-analysis be performed? | Consider the plausibility that sufficient highly similar studies will be performed in the future to, eventually, make a meta-analysis possible. |
| Will a decision be made, regardless of the amount of data that is available? | If it is known that a decision will be made, with or without data, then any data you collect will reduce error rates. |
| What is the critical effect size? | Report and interpret the critical effect size, with a focus on whether a hypothesis test would even be significant for expected effect sizes. If not, indicate you will not interpret the data based on p-values. |
| What is the width of the confidence interval? | Report and interpret the width of the confidence interval. What will an estimate with this much uncertainty be useful for? If the null hypothesis is true, would rejecting effects outside of the confidence interval be worthwhile (ignoring you might have low power to actually test against these values)? |
| Which effect sizes would you have decent power to detect? | Report a sensitivity power analysis, and report the effect sizes you could detect across a range of desired power levels (e.g., 80%, 90%, and 95%), or plot a sensitivity curve of effect sizes against desired power. |

If the study is not performed for educational purposes, but the goal is to answer a research question, the feasibility justification might indicate that there is no value in collecting the data. If it were never possible to conclude that one should not proceed with the data collection, there would be no use in justifying the sample size. There should be cases where it is unlikely there will ever be enough data to perform a meta-analysis (for example because of a lack of general interest in the topic), the information will not be used to make any decisions, and the statistical tests do not allow you to test a hypothesis or estimate an effect size with any useful accuracy. It should be a feasibility justification - not a feasibility excuse. If there is no good justification to collect the maximum number of observations that is feasible, performing the study nevertheless is a waste of participants’ time, and/or a waste of money if data collection has associated costs. Collecting data without a good justification why the planned sample size will yield worthwhile information has an ethical component. As Button and colleagues (2013) write:

Low power therefore has an ethical dimension — unreliable research is inefficient and wasteful. This applies to both human and animal research.

Think carefully if you can defend data collection based on a feasibility justification. Sometimes data collection is just not feasible, and we should accept this.