This blog post is now included in the paper "Sample size justification" available at PsyArXiv.
Underpowered studies make it very difficult
to learn something useful from the research you perform. Low power means you
have a high probability of finding non-significant results, even when there is
a true effect. Hypothesis tests with high rates of false negatives (concluding
there is nothing when there is something) become a malfunctioning tool. Low
power is even more problematic when combined with publication
bias (shiny app). After repeated
warnings over at least half a century, high quality journals are starting to
ask authors who rely on hypothesis tests to provide a sample size justification
based on statistical power.
The first time researchers use power
analysis software, they typically think they are making a mistake, because the
sample sizes required to achieve high power for hypothesized effects are much
larger than the sample sizes they collected in the past. After double checking
their calculations, and realizing the numbers are correct, a common response is
that there is no way they are able to collect this number of observations.
Published articles on power analysis rarely
tell researchers what they should do if they are hired on a 4-year PhD project
where the norm is to perform between 4 and 10 studies that can cost at most 1000
euro each, learn about power analysis, and realize there is absolutely no way
they will have the time and resources to perform high-powered studies, given
that an effect size estimate from an unbiased registered report suggests the effect
they are examining is half as large as they were led to believe based on a published
meta-analysis from 2010. Facing a job market that under the best circumstances
is a nontransparent marathon for uncertainty-fetishists, the prospect of
high quality journals rejecting your work due to a lack of a solid sample size
justification is not pleasant.
The reason that published articles do not
guide you towards practical solutions for a lack of resources is that there are
no solutions for a lack of resources. Regrettably, the mathematics do not care
how small your participant payment budget is. This
is not to say that you cannot improve your current practices by reading up on best
practices to increase the efficiency of data collection. Let me give you an
overview of some things that you should immediately implement if you use
hypothesis tests and data collection is costly.
1) Use
directional tests where relevant. Simply following up statements such as ‘we
predict X is larger than Y’ with a logically consistent test of that claim (e.g.,
a one-sided t-test) will easily give you an increase of 10% power in any
well-designed study. If you feel you need to give effects in both directions a
non-zero probability, then at least use lopsided
tests.
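As a rough illustration of the gain (the effect size, sample size, and use of statsmodels' power functions below are hypothetical choices of mine, not from any specific study):

```python
# A minimal sketch: power of a two-sided vs. a one-sided independent t-test
# for the same (hypothetical) effect size and sample size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
two_sided = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05,
                           alternative='two-sided')
one_sided = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05,
                           alternative='larger')
print(f"Two-sided power: {two_sided:.2f}")  # roughly 0.80
print(f"One-sided power: {one_sided:.2f}")  # roughly 0.88
```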
2) Use sequential analysis whenever possible.
It’s like optional stopping, but without the questionable inflation of the
false positive rate. The efficiency gains are so great that, if you complain
about the recent push towards larger sample sizes without already having incorporated
sequential analyses, I will have a hard time taking you seriously.
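If you want to convince yourself that the false positive rate stays controlled, here is a small simulation sketch (hypothetical numbers; two looks at the data, each tested at a Pocock-style corrected alpha of .0294). The efficiency gain shows up when there is a true effect, because many studies can then stop at the interim look:

```python
# A minimal sketch: two looks at the data (halfway and at the end), each tested
# at a Pocock-style alpha of .0294, keep the overall false positive rate near
# .05 when the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_half, alpha_look = 10_000, 50, 0.0294  # 50 per group per look

false_positives = stopped_early = 0
for _ in range(n_sims):
    x = rng.normal(0, 1, 2 * n_half)  # control group, true effect = 0
    y = rng.normal(0, 1, 2 * n_half)  # 'treatment' group, true effect = 0
    if stats.ttest_ind(x[:n_half], y[:n_half]).pvalue < alpha_look:  # interim look
        false_positives += 1
        stopped_early += 1
        continue
    if stats.ttest_ind(x, y).pvalue < alpha_look:  # final look
        false_positives += 1

print(f"Overall false positive rate: {false_positives / n_sims:.3f}")
print(f"Proportion stopping early:   {stopped_early / n_sims:.3f}")
```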
3) Increase your alpha level. Oh yes, I am
serious. Contrary to what you might believe, the recommendation to use an alpha
level of 0.05 was not the sixth of the ten commandments – it is nothing more
than, as Fisher called it, a ‘convenient convention’. As we wrote in our Justify
Your Alpha paper as an argument to not require an alpha level of 0.005: “without
(1) increased funding, (2) a reward system that values large-scale
collaboration and (3) clear recommendations for how to evaluate research with
sample size constraints, lowering the significance threshold could adversely
affect the breadth of research questions examined.” If you *have* to make a
decision, and the data you can feasibly collect is limited, take a moment to
think about how problematic Type 1 and Type 2 error rates are, and maybe minimize
combined error rates instead of rigidly using a 5% alpha level.
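A sketch of what minimizing the combined error rate could look like (the sample size, smallest effect size of interest, and equal weighting of the two error types below are all hypothetical choices to be adapted to your own situation):

```python
# A minimal sketch: for a fixed (hypothetical) sample size and effect size of
# interest, find the alpha level that minimizes the weighted sum of the Type 1
# and Type 2 error rates, instead of fixing alpha at .05.
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group, effect_size = 50, 0.4  # what you can afford / what you care about

alphas = np.linspace(0.001, 0.30, 300)
betas = np.array([1 - analysis.power(effect_size=effect_size, nobs1=n_per_group,
                                     alpha=a, alternative='two-sided')
                  for a in alphas])
combined = 0.5 * alphas + 0.5 * betas  # equal weights; adjust to your priorities

best = np.argmin(combined)
print(f"alpha minimizing the combined error rate: {alphas[best]:.3f}")
print(f"power at that alpha: {1 - betas[best]:.2f}")
```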
4) Use within-subject designs where possible.
Especially when measurements are strongly correlated, this can lead to a substantial
increase in power.
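To see how much this buys you, a small sketch (hypothetical raw effect size and correlation): the effect size for a paired comparison is d_z = d / sqrt(2(1 - r)), so the required sample shrinks quickly as the correlation r between measurements increases.

```python
# A minimal sketch: required sample size for 90% power, comparing a between-
# subjects design to a within-subjects design with correlated measurements.
import numpy as np
from statsmodels.stats.power import TTestIndPower, TTestPower

d, r = 0.5, 0.7                      # hypothetical raw effect and correlation
d_z = d / np.sqrt(2 * (1 - r))       # effect size for the paired t-test

n_between = TTestIndPower().solve_power(effect_size=d, power=0.9, alpha=0.05)
n_within = TTestPower().solve_power(effect_size=d_z, power=0.9, alpha=0.05)
print(f"Between design: {np.ceil(n_between):.0f} participants per group")
print(f"Within design:  {np.ceil(n_within):.0f} participants in total")
```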
5) If you read this blog or follow me on
Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible
paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and
money: Designing statistically powerful studies while minimizing financial
costs (link). They
discuss I) better ways to screen participants for studies where participants
need to be screened before participation, II) assigning participants unequally
to conditions (if the control condition is much cheaper than the experimental
condition, for example), III) using multiple measurements to increase
measurement reliability (or use well-validated measures, if I may add), and IV)
smart use of (preregistered, I’d recommend) covariates.
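As an illustration of point II (all costs and numbers below are hypothetical): under a fixed budget, allocating roughly sqrt(cost_experimental / cost_control) control participants per experimental participant tends to give more power than a 1:1 split.

```python
# A minimal sketch: unequal allocation under a fixed budget when the control
# condition is much cheaper than the experimental condition.
import numpy as np
from statsmodels.stats.power import TTestIndPower

cost_control, cost_exp, budget = 2, 10, 2000  # euros per observation
ratio = np.sqrt(cost_exp / cost_control)      # controls per experimental participant

n_exp_unequal = budget / (cost_exp + ratio * cost_control)
n_exp_equal = budget / (cost_exp + cost_control)

power = TTestIndPower()
p_unequal = power.power(effect_size=0.4, nobs1=n_exp_unequal, alpha=0.05, ratio=ratio)
p_equal = power.power(effect_size=0.4, nobs1=n_exp_equal, alpha=0.05, ratio=1)
print(f"Equal allocation:   power = {p_equal:.2f}")
print(f"Unequal allocation: power = {p_unequal:.2f}")
```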
6) If you are really brave, you might want
to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably,
almost all approaches to statistical inference become very limited when the
number of observations is small. If you are very confident in your predictions
(and your peers agree), incorporating prior information will give you a
benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and
colleagues.
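A minimal conjugate-normal sketch of what an informed prior buys you with few observations (all numbers are hypothetical, and the data SD is assumed known for simplicity):

```python
# With only 20 observations, an informed prior narrows the posterior for the
# mean considerably compared to a nearly flat prior.
import numpy as np

rng = np.random.default_rng(42)
n, sigma = 20, 1.0
data = rng.normal(0.3, sigma, n)  # simulated data, true mean 0.3

def posterior(prior_mean, prior_sd, data, sigma):
    """Posterior mean and SD for a normal mean with known data SD."""
    precision = 1 / prior_sd**2 + len(data) / sigma**2
    mean = (prior_mean / prior_sd**2 + data.sum() / sigma**2) / precision
    return mean, np.sqrt(1 / precision)

print(posterior(0.0, 100.0, data, sigma))  # vague prior: wide posterior
print(posterior(0.3, 0.2, data, sigma))    # informed prior: narrower posterior
```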
Now if you care about efficiency, you might
already have incorporated all these things. There is no way to further improve
the statistical power of your tests, and given all plausible estimates of the effect
sizes you can expect, or the smallest effect size you would be interested in,
statistical power is low. Now what should you do?
What to do if best practices in study design won’t save you?
The first thing to realize is that you
should not look to statistics to save you. There are no secret tricks or
magical solutions. Highly informative experiments require a large number of
observations. So what should we do then? The solutions below are, regrettably,
a lot more work than making a small change to the design of your study. But it is
about time we start to take them seriously. This is a list of solutions I see –
but there are no doubt more things we can and should do, so by all means, let me know your
suggestions on Twitter or in the comments.
1) Ask for a lot more money in your
grant proposals.
Some grant organizations distribute funds
to be awarded as a function of how much money is requested. If you need more
money to collect informative data, ask for it. Obviously grants are incredibly
difficult to get, but if you ask for money, include a budget that acknowledges
that data collection is not as cheap as you hoped some years ago. In my
experience, psychologists often ask for much less money to collect data
than other scientists. Increasing the requested funds for participant payment
by a factor of 10 is often reasonable, given the requirements of journals to provide
a solid sample size justification, and the more realistic effect size estimates
that are emerging from preregistered studies.
2) Improve management.
If the implicit or explicit goals that you
should meet are still the same now as they were 5 years ago, and you did not
receive a miraculous increase in money and time to do research, then an update
of the evaluation criteria is long overdue. I sincerely hope your manager is
capable of this, but some ‘upward management’ might be needed. In the coda of Lakens &
Evers (2014) we wrote “All else being equal, a researcher running properly
powered studies will clearly contribute more to cumulative science than a
researcher running underpowered studies, and if researchers take their science
seriously, it should be the former who is rewarded in tenure systems and reward
procedures, not the latter.” and “We believe reliable research should be
facilitated above all else, and doing so clearly requires an immediate and irrevocable
change from current evaluation practices in academia that mainly focus on
quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure
track, I thought it would be at least principled if I sent this coda to the
head of my own department. He replied that the things we wrote made perfect sense,
instituted a recommendation to aim for 90% power in studies our department
intends to publish, and has since then tried to make sure quality, and not
quantity, is used in evaluations within the faculty (as you might have guessed,
I am not on the job market, nor do I ever hope to be).
3) Change what is expected from PhD
students.
When I did my PhD, there was the assumption
that you performed enough research in the 4 years you are employed as a full-time
researcher to write a thesis with 3 to 5 empirical chapters (with some chapters
having multiple studies). These studies were ideally published, but at least
publishable. If we consider it important for PhD students to produce multiple publishable
scientific articles during their PhDs, this will greatly limit the types of
research they can do. Instead of evaluating PhD students based on their
publications, we can see the PhD as a time where researchers learn skills to
become an independent researcher, and evaluate them not based on publishable
units, but in terms of clearly identifiable skills. I personally doubt data collection
is particularly educational after the 20th participant, and I would
probably prefer to hire a post-doc who
had well-developed skills in programming and statistics, and who had read the
literature broadly, than someone who used that time to collect participants 21 to 200. If
we make it easier for PhD students to demonstrate their skill level (which
would include at least one well-written article, I personally think), we can
evaluate what they have learned in a more sensible manner than we do now. Currently, differences
in the resources PhD students have at their disposal are a huge confound when we
try to judge their skills based on their resumes. Researchers at rich
universities obviously have more resources – it should not be difficult to develop
tools that allow us to judge the skills of people where resources are much less
of a confound.
4) Think about the questions we collectively
want answered, instead of the questions we can individually answer.
Our society has some serious issues that
psychologists can help address. These questions are incredibly complex. I have long
lost faith in the idea that a bottom-up organized scientific discipline that rewards
individual scientists will manage to generate reliable and useful knowledge
that can help to solve these societal issues. For some of these questions we
need well-coordinated research lines where hundreds of scholars work together,
pool their resources and skills, and collectively pursue answers to these
important questions. And if we are going to limit ourselves in our research to
the questions we can answer in our own small labs, these big societal
challenges are not going to be solved. Call me a pessimist. There is a reason
we resort to forming unions and organizations that have the goal of collectively
coordinating what we do. If you greatly dislike team science, don’t worry –
there will always be options to make scientific contributions by yourself. But right
now, there are almost no ways for scientists to pursue huge challenges
in large, well-organized collectives of hundreds or thousands of scholars (for a recent exception
that proves my rule by remaining unfunded, see the Psychological Science Accelerator).
If you honestly believe your research question is important enough to be
answered, then get together with everyone who also thinks so, and pursue
answers collectively. Doing so should, eventually (I know science funders are
slow), also be more convincing when you ask for more resources to do the research
(as in point 1).
If you are upset that, as a science, we have lost
the blissful ignorance surrounding statistical power and now require researchers
to design informative studies, which hits substantially harder in some research
fields than in others: I feel your pain. I have argued against universally
lower alpha levels for you, and have tried to write accessible statistics
papers that make you more efficient without increasing sample sizes. But if you
are in a research field where even best practices in designing studies will not
allow you to perform informative studies, then you need to accept the
statistical reality you are in. I have already written too long a blog post,
even though I could keep going on about this. My main suggestions are to ask
for more money, get better management, change what we expect from PhD students,
and self-organize – but there is much more we can do, so do let me know your top
suggestions. This will be one of the many challenges our generation faces, but if
we manage to address it, it will lead to a much better science.
Power = people?
Thomas Schmidt, University of Kaiserslautern, Germany
There is yet another way to improve your power: Use more trials from the participants you have. Actually power depends on two things: the number of participants in the sample and the reliability of the measurements. However, reliability directly depends on the number of trials. There are several simulation studies that show that both levels (people and trials) are about equally important in determining statistical power. There are many areas of psychology that successfully work with small groups of subjects but massive repetition of measurement -- psychophysics is a good example. In my research, I almost invariably use eight participants and control power entirely by the number of sessions. In my experience, well-trained subjects perform so much more reliably than untrained ones that they can give you high data quality even with limited resources. There is also a convenient, citable name for this approach: Smith & Little (2018) call it "small-N design".
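A rough simulation sketch of this trade-off (the variance components below are made up): holding the number of participants at 8 and increasing the number of trials per condition can raise power substantially.

```python
# A minimal sketch: power of a paired/one-sample t-test with 8 participants,
# as a function of the number of trials per condition (hypothetical variances).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_subj, true_effect, sd_subj, sd_trial = 8, 10.0, 8.0, 60.0  # in ms

def simulated_power(n_trials, n_sims=5000, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        subj_effect = rng.normal(true_effect, sd_subj, n_subj)
        # each subject's condition difference, averaged over n_trials per condition
        noise_sd = sd_trial * np.sqrt(2 / n_trials)
        observed = subj_effect + rng.normal(0, noise_sd, n_subj)
        hits += stats.ttest_1samp(observed, 0).pvalue < alpha
    return hits / n_sims

for n_trials in (20, 120, 480):
    print(f"{n_trials:4d} trials per condition: power ~ {simulated_power(n_trials):.2f}")
```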
Apart from statistical power, there is yet another time-honoured concept that is used in engineering: measurement precision. Precision can simply be defined by setting an upper limit to the standard error of the dependent variable -- all you need is a rough idea about the standard deviation. In a recent paper, we included the following passage to justify our sample sizes (Biafora & Schmidt, 2019):
"In multi-factor repeated-measures designs, statistical power is difficult to predict because too many terms are unknown. Instead, we control measurement precision at the level of individual participants in single conditions. We calculate precision as s/√r (Eisenhart, 1969), where s is a single participant's standard deviation in a given cell of the design and r is the number of repeated measures per cell and subject. With r = 120 and 240 in the priming and prime identification task, respectively, we expect a precision of about 5.5 ms in response times (assuming individual SDs around 60 ms), at most 4.6 percentage points in error rates, and at most 3.2 percentage points in prime identification accuracy (assuming the theoretical maximum SD of .5)."