A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, May 11, 2017

How a power analysis implicitly reveals the smallest effect size you care about

When designing a study, you need to justify the sample size you aim to collect. If one of your goals is to observe a p-value lower than the alpha level you decided upon (e.g., 0.05), one justification for the sample size can be a power analysis. A power analysis tells you the probability of observing a statistically significant effect, based on a specific sample size, alpha level, and true effect size. At our department, people who use power as a sample size justification need to aim for 90% power if they want to get money from the department to collect data.

A power analysis is performed based on the effect size you expect to observe. When you expect an effect with a Cohen’s d of 0.5 in an independent two-tailed t-test, and you use an alpha level of 0.05, you will have 90% power with 86 participants in each group. This means that only 10% of the distribution of effect sizes you can expect when d = 0.5 and n = 86 falls below the critical value required to get p < 0.05 in an independent t-test.
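As a rough check of these numbers, the power of a two-sided independent t-test can be approximated with the normal distribution. This is only a sketch: the function name approx_power is hypothetical, and the normal approximation slightly overstates power compared with the exact noncentral t calculation that power analysis software performs.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided independent t-test.

    Hypothetical helper for illustration only. Exact power is based on
    the noncentral t distribution and is slightly lower.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # two-sided critical value
    shift = d * math.sqrt(n_per_group / 2)   # expected test statistic when the true effect is d
    # Probability that the test statistic falls beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# d = 0.5 with 86 participants per group: close to the 90% mentioned above
print(approx_power(0.5, 86))
```

With d = 0 the function returns the alpha level, which is the Type 1 error rate.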

In the figure below, the power analysis is visualized by plotting the distribution of Cohen’s d given 86 participants per group when the true effect size is 0 (or the null-hypothesis is true), and when d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α, when there is a true effect).

You’ve probably seen such graphs before (indeed, G*power, widely used power analysis software, provides these graphs as output). The only thing I have done is to transform the t-value distribution that is commonly used in these graphs into the distribution of Cohen’s d. This is a straightforward transformation, but instead of presenting the critical t-value the figure provides the critical d-value. I think people find it easier to interpret d than t. Only t-tests which yield a t ≥ 1.974, or a d ≥ 0.30, will be statistically significant. Observed effects smaller than d = 0.30 will never be statistically significant with 86 participants in each condition.

If you design a study where results will be analyzed with an independent two-tailed t-test with α = 0.05, the smallest observed effect that can be statistically significant is determined exclusively by the sample size. The (unknown) true effect size only determines how far to the right the distribution of d-values lies, and thus which percentage of observed effect sizes will be larger than the smallest effect size of interest (and will be statistically significant; this percentage is the statistical power).
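The critical d-value follows from the critical t-value: for two groups of equal size n, d = t × √(2/n). The sketch below computes it with only the Python standard library, using a simple numerical integration of the t-distribution in place of a statistics package. All function names are hypothetical and chosen for illustration; the Shiny app mentioned in this post may compute the value differently.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """Cumulative probability via Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """Quantile by bisection; this sketch assumes 0.5 <= p < 1."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def critical_d(n_per_group, alpha=0.05):
    """Smallest observed Cohen's d that is significant in a two-sided
    independent t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    t_crit = t_quantile(1 - alpha / 2, df)
    return t_crit * math.sqrt(2 / n_per_group)


print(round(critical_d(86), 2))  # 0.3
```

Calling critical_d with a different group size shows how quickly the critical d-value grows as samples get smaller.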

I think it is reasonable to assume that if you decide to collect data for a study where you plan to perform a null-hypothesis significance test, you are not interested in effect sizes that will never be statistically significant. If you design a study that has 90% power for a medium effect of d = 0.5, the sample size you decide to use means effects smaller than d = 0.3 will never be statistically significant. We can use this fact to infer what your smallest effect size of interest, or SESOI (Lakens, 2014), will be. Unless you state otherwise, we can assume your SESOI is d = 0.3, and any effects smaller than this effect size are considered too small to be interesting. Obviously, you are free to explicitly state any effect smaller than d = 0.5 or d = 0.4 is already too small to matter for theoretical or practical purposes. But without such an explicit statement about what your SESOI is, we can infer it from your power analysis.

This is useful. Researchers who use null-hypothesis significance testing often only specify the effect they expect when the null is true (d = 0), but not the smallest effect size that should still be considered support for their theory when there is a true effect. This leads to a psychological science that is unfalsifiable (Morey & Lakens, under review). Alternative approaches to determining what the smallest effect size of interest is have recently been suggested. For example, Simonsohn (2015) suggested setting the smallest effect size of interest to the effect size the original study had 33% power to detect. If an original study used 20 participants per group, the smallest effect size of interest would be d = 0.49 (which is the effect size they had 33% power to detect with n = 20).

Let’s assume the original study used a sample size of n = 20 per group. The figure below shows that an observed effect size of d = 0.8 would be statistically significant (d = 0.8 lies to the right of the critical d-value), but that the critical d-value is d = 0.64. That means that effects smaller than d = 0.64 would never be statistically significant in a study with 20 participants per group in a between-subjects design. I think it makes more sense to assume the smallest effect size of interest for researchers who design a study with n = 20 is d = 0.64, rather than d = 0.49. 

The figures can be produced by a new Shiny app I created (the Shiny app also plots power curves and the p-value distribution [they are not all visible on Shinyapps.org, but you can try HERE as long as bandwidth lasts, or just grab the code and app from GitHub] – I might discuss these figures in a future blog post). If you have designed your next study, check the critical d-value to make sure that the smallest effect size you care about isn’t smaller than the critical effect size you can actually detect. If you think smaller effects are interesting, but you don’t have the resources, specify your SESOI explicitly in your article. You can also use this specified smallest effect size of interest in an equivalence test to statistically reject any effect large enough that you deem it worthwhile (Lakens, 2017), which will help you interpret t-tests where p > α. In short, we really need to start specifying the effects we expect under the alternative model, and if you don’t know where to start, your power analysis might have been implicitly telling you what your smallest effect size of interest is.

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. https://doi.org/10.1002/ejsp.2023

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. https://doi.org/10.1177/1948550617697177

Morey, R. D., & Lakens, D. (under review). Why most of psychology is statistically unfalsifiable.

Simonsohn, U. (2015). Small Telescopes: Detectability and the Evaluation of Replication Results. Psychological Science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341

Friday, April 14, 2017

Five reasons blog posts are of higher scientific quality than journal articles

The Dutch toilet cleaner ‘WC-EEND’ (literally: 'Toilet Duck') aired a famous commercial in 1989 that had the slogan ‘We from WC-EEND advise… WC-EEND’. It is now a common saying in The Netherlands whenever someone gives an opinion that is clearly aligned with their self-interest. In this blog, I will examine the hypothesis that blogs are, on average, of higher quality than journal articles. Below, I present 5 arguments in favor of this hypothesis.  [EDIT: I'm an experimental psychologist. Mileage of what you'll read below may vary in other disciplines].

1. Blogs have Open Data, Code, and Materials

When you want to evaluate scientific claims, you need access to the raw data, the code, and the materials. Most journals do not (yet) require authors to make their data publicly available (whenever possible). The worst case example when it comes to data sharing is the American Psychological Association. In the ‘Ethical Principles of Psychologists and Code of Conduct’ of this professional organization that supported torture, point 8.14 says that psychologists only have to share data when asked to by ‘competent professionals’ for the goal to ‘verify claims’, and that these researchers can charge money to compensate any costs that are made when they have to respond to a request for data. Despite empirical proof that most scientists do not share their data when asked, the APA considers this ‘ethical conduct’. It is not. It’s an insult to science. But it’s the standard that many relatively low quality scientific journals, such as the Journal of Experimental Psychology: General, hide behind to practice closed science.

On blogs, the norm is to provide access to the underlying data, code, and materials. For example, here is Hanne Watkins, who uses data she collected to answer some questions about the attitudes of early career researchers and researchers with tenure towards replications. She links to the data and materials, which are all available on the OSF. Most blogs on statistics will link to the underlying code, such as this blog by Will Gervais on whether you should run well-powered studies or many small-powered studies. On average, it seems to me almost all blogs practice open science to a much higher extent than scientific journals.

2. Blogs have Open Peer Review

Scientific journal articles use peer review as quality control. The quality of the peer review process is as high as the quality of the peers that were involved in the review process. The peer review process is as biased as the peers that were involved in the review process. For most scientific journal articles, I cannot see who reviewed a paper, or check the quality, or the presence of bias, because the reviews are not open. Some of the highest quality journals in science, such as PeerJ and Royal Society Open Science, have Open Peer Review, and journals like Frontiers at least specify the names of the reviewers of a publication. Most low quality journals (e.g., Science, Nature) have 100% closed peer review, and we don’t even know the name of the handling editor of a publication. It is often impossible to know whether articles were peer reviewed to begin with, and what the quality of the peer review process was.

Some blogs have Open pre-publication Peer Review. If you read the latest DataColada blog post, you can see the two reviews of the post by experts in the field (Tom Stanley and Joe Hilgard) and several other people who shared thoughts before the post went online. On my blog, I sometimes ask people for feedback before I put a blog post online (and these people are thanked in the blog if they provided feedback), but I also have a comment section. This allows people to point out errors and add comments, and you can see how much support or criticism a blog has received. For example, in this blog on why omega squared is a better effect size to use than eta-squared, you can see why Casper Albers disagreed by following a link to a blog post he wrote in response. Overall, the peer review process in blog posts is much more transparent. If you see no comments on a blog post, you have the same information about the quality of the peer review process as you’d have for the average Science article. Sure, you may have subjective priors about the quality of the review process at Science (ranging from ‘you get in if your friend is an editor’ to ‘it’s very rigorous’) but you don’t have any data. But if a blog has comments, at least you can see what peers thought about a blog post, giving you some data, and often very important insights and alternative viewpoints.

3. Blogs have no Eminence Filter

Everyone can say anything they want on a blog, as long as it does not violate laws regarding freedom of speech. It is an egalitarian and democratic medium. This aligns with the norms in science. As Merton (1942) writes: “The acceptance or rejection of claims entering the lists of science is not to depend on the personal or social attributes of their protagonist; his race, nationality, religion, class, and personal qualities are as such irrelevant.” We see even Merton was a child of his times – he of course meant that his *or her* race, etcetera, is irrelevant.

Everyone can write a blog, but not everyone is allowed to publish in a scientific journal. As one example, criticism recently arose about a special section in Perspectives on Psychological Science about ‘eminence’ in which the only contribution from a woman was about gender and eminence. It was then pointed out that this special section only included the perspectives on eminence by old American men, and that there might be an issue with diversity in viewpoints in this outlet.

I was personally not very impressed by the published articles in this special section, probably because the views on how to do science as expressed by this generation of old American men do not align with my views on science. I have nothing against old (or dead) American men in general (Meehl be praised), but I was glad to hear some of the most important voices in my scientific life submitted responses to this special issue. Regrettably, all these responses were rejected. Editors can make those choices, but I am worried about the presence of an Eminence Filter in science, especially one that in this specific case filters out some of the voices that have been most important in shaping me as a scientist. Blogs allow these voices to be heard, which I think is closer to the desired scientific norms discussed by Merton.

4. Blogs have Better Error Correction

In a 2014 article, we published a Table 1 of sample sizes required to design informative studies for different statistical approaches. We stated these are sample sizes per condition, but for 2 columns, these are actually the total sample sizes you need. We corrected this in an erratum. I know this erratum was published, and I would love to link to it, but honest to Meehl, I cannot find it. I just spent 15 minutes searching for it in any way I can think of, but there is no link to it on the journal website, and I can’t find it in Google Scholar. I don’t see how anyone will become aware of this error when they download our article.

When I make an error in a blog post, I can go in and update it. I am pretty confident that I make approximately as many errors in my published articles as I make in my blog posts, but the latter are much easier to fix, and thus, I would consider my blogs more error-free, and of higher quality. There are some reasons why you can not just update scientific articles (we need a stable scientific record), and there might be arguments for better and more transparent version control of blog posts, but for the consumer, it’s just very convenient that mistakes can easily be fixed in blogs, and that you will always read the best version.

5. Blogs are Open Access (and might be read more).

It’s obvious that blogs are open access. This is a desirable property of high quality science. It makes the content more widely available, and I would not be surprised (but I have no data) if blog posts are *on average* read more than scientific articles because they are more accessible. Getting page views is not, per se, an indication of scientific quality. A video on Pen Pineapple Apple Pen gets close to 8 million views, and we don’t consider that high quality music (I hope). But views are one way to measure how much impact blogs have on what scientists think.

I only have data for page views from my own blog. I’ve made a .csv file with the page views of all my blog posts publicly available (so you can check my claims below about page views of specific blog posts, cf. point 1 above). There is very little research on the impact of blogs on science. They are not cited a lot (even though you can formally cite them) but they can have clear impact, and it would be interesting to study how big their impact is. I think it would be a fun project to compare the impact of blogs with the impact of scientific articles more formally. Should be a fun thesis project for someone studying scientometrics.

Some blog posts that I wrote get more views than the articles I comment on. One example is a commentary blog post I wrote on a paper which suggested there was ‘A surge of p-values between 0.041 and 0.049 in recent decades’. The paper received 7147 views at the time of writing. My blog post received 11285 views so far. But it is not universally true that my blogs get more page views than the articles I comment on. A commentary I wrote on a horribly flawed paper by Gilbert and colleagues in Science, where they misunderstood how confidence intervals work, has only received 12190 hits so far, but the article info of their Science article tells me their article received three times as many views for the abstract, 36334, and also more views for the full text (19124). On the other hand, I do have blog posts that have gotten more views than this specific Science article (e.g., this post on Welch’s t-test which has 38127 hits so far). I guess the main point of these anecdotes is not surprising, but nevertheless worthwhile to point out: Blogs are read, sometimes a lot.


I’ve tried to measure blogs and journal articles on some dimensions that, I think, determine their scientific quality. It is my opinion that blogs, on average, score better on some core scientific values, such as open data and code, transparency of the peer review process, egalitarianism, error correction, and open access. It is clear blogs impact the way we think and how science works. For example, Sanjay Srivastava’s pottery barn rule, proposed in a 2012 blog post, will be implemented in the journal Royal Society Open Science. This shows blogs can be an important source of scientific communication. If the field agrees with me, we might want to more seriously consider the curation of blogs, to make sure they won’t disappear in the future, and maybe even facilitate assigning DOIs to blogs, and the citation of blog posts.

Before this turns into a ‘we who write blogs recommend blogs’ post, I want to make clear that there is no intrinsic reason why blogs should have higher scientific quality than journal articles. It’s just that the authors of most blogs I read put some core scientific values into practice to a greater extent than editorial boards at journals. I am not recommending we stop publishing in journals, but I want to challenge the idea that journal publications are the gold standard of scientific output. They fall short on some important dimensions of scientific quality, where they are outperformed by blog posts. Pointing this out might inspire some journals to improve their current standards.

Tuesday, March 14, 2017

Equivalence testing in jamovi

One of the challenges of trying to get people to improve their statistical inferences is access to good software. After 32 years, SPSS still does not give a Cohen’s d effect size when researchers perform a t-test. I’m a big fan of R nowadays, but I still remember when I thought R looked so complex that I was convinced I was not smart enough to learn how to use it. And therefore, I’ve always tried to make statistics accessible to a larger, non-R using audience. I know there is a need for this – my paper on effect sizes from 2013 will reach 600 citations this week, and the spreadsheet that comes with the article is a big part of its success.

So when I wrote an article about the benefits of equivalence testing for psychologists, I also made a spreadsheet. But really, what we want is easy to use software that combines all the ways in which you can improve your inferences. And in recent years, we see some great SPSS alternatives that try to do just that, such as PSPP, JASP, and more recently, jamovi.

Jamovi is made by developers who used to work on JASP, and you’ll see JASP and jamovi look and feel very similar. I’d recommend downloading and installing both these excellent free software packages. Where JASP aims to provide Bayesian statistical methods in an accessible and user-friendly way (and you can do all sorts of Bayesian analyses in JASP), the core aim of jamovi is to make software that is ‘“community driven”, where anyone can develop and publish analyses, and make them available to a wide audience’. This means that if I develop statistical analyses, such as equivalence tests, I can make these available through jamovi for anyone who wants to use these tests. I think that’s really cool, and I’m super excited my equivalence testing package TOSTER is now available as a jamovi module.

You can download the latest version of jamovi here. Install and open the software. Then, install the TOSTER module. Click the + module button:

Install the TOSTER module:

And you should see a new menu option in the task bar, called TOSTER:

To play around with some real data, let’s download the data from Study 7 from Yap et al, in press, from the Open Science Framework: https://osf.io/pzqj2/. This study examines the effect of weather (good vs bad days) on mood and life satisfaction. Like any researcher who takes science seriously, Yap, Wortman, Anusic, Baker, Scherer, Donnellan, and Lucas made their data available with the publication. After downloading the data, we need to replace the missing values indicated with NA with “” in a text editor (CTRL H, find and replace), and then we can read in the data in jamovi. If you want to follow along, you can also directly download the jamovi file here.

Then, we can just click the TOSTER menu, select a TOST independent samples t-test, select ‘condition’ as condition, and analyze for example the ‘lifeSat2’ variable, or life satisfaction. Then we need to select an equivalence bound. For this DV we have data from approximately 117 people on good days, and 167 people on bad days. We need 136 participants in each condition to have 90% power to reject effects of d = 0.4 or larger, so let’s select d = 0.4 as an equivalence bound. I’m not saying smaller effects are not practically relevant – they might very well be. But if the authors were interested in smaller effects, they would have collected more data. So I’m assuming here the authors thought an effect of d = 0.4 would be small enough to make them reconsider the original effect by Schwarz & Clore (1983), which was quite a bit larger with a d = 1.38.
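The 136 participants per condition can be reproduced with a normal-approximation formula for the power of two one-sided tests with symmetric bounds, assuming the true effect is exactly zero. This is a sketch under those assumptions, not necessarily the exact routine TOSTER runs, and the function name tost_n_per_group is hypothetical.

```python
import math
from statistics import NormalDist


def tost_n_per_group(bound_d, alpha=0.05, power=0.90):
    """Approximate participants per group needed for an equivalence test
    (two one-sided tests) with symmetric bounds of -bound_d and +bound_d.

    Assumes a true effect of zero and uses the normal approximation.
    """
    z = NormalDist()
    beta = 1 - power
    # Both one-sided tests must reject, so beta is split across the two bounds
    z_sum = z.inv_cdf(1 - alpha) + z.inv_cdf(1 - beta / 2)
    return math.ceil(2 * z_sum ** 2 / bound_d ** 2)


print(tost_n_per_group(0.4))  # 136
```

Smaller equivalence bounds require many more participants, because the required sample size scales with 1 / bound_d².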

In the screenshot above you see the analysis and the results. By default, TOSTER uses Welch’s t-test, which is preferable over Student’s t-test (as we explain in this recent article), but if you want to reproduce the results in the original article, you can check the ‘Assume equal variances’ checkbox. To conclude equivalence in a two-sided test, we need to be able to reject both equivalence bounds, and with p-values of 0.002 and < 0.001, we do. Thus, we can reject an effect larger than d = 0.4 or smaller than d = -0.4, and given these equivalence bounds, conclude the effect is too small to be considered support for the presence of an effect that is large enough, for our current purposes, to matter.

Jamovi runs on R, and it’s a great way to start to explore R itself, because you can easily reproduce the analysis we just did in R. To use equivalence tests with R, we can download the original datafile (R will have no problems with NA as missing values), and read it into R.  Then, in the top right corner of jamovi, click the … options window, and check the box ‘syntax mode’.

You’ll see the output window changing to the input and output style of R. You can simply right-click the syntax at the top, choose Syntax > Copy, and then go to R and paste the syntax:

Running this code gives you exactly the same results as jamovi.

I collaborated a lot with Jonathon Love on getting the TOSTER package ready for jamovi. The team is incredibly helpful, so if you have a nice statistics package that you want to make available to a huge ‘not-yet-R-using’ community, I would totally recommend checking out the developers hub and getting started! We are seeing all sorts of cool power-analyses Shiny apps, meta-analysis spreadsheets, and meta-science tools like p-checker that now live on websites all over the internet, but that could all find a good home in jamovi. If you already have the R code, all you need to do is make it available as a module!

If you use it, you can cite it as: Lakens, D. (in press). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. DOI: 10.1177/1948550617697177