A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, April 14, 2017

Five reasons blog posts are of higher scientific quality than journal articles



The Dutch toilet cleaner ‘WC-EEND’ (literally: 'Toilet Duck') aired a famous commercial in 1989 with the slogan ‘We from WC-EEND advise… WC-EEND’. It is now a common saying in The Netherlands whenever someone gives an opinion that is clearly aligned with their self-interest. In this blog, I will examine the hypothesis that blogs are, on average, of higher scientific quality than journal articles. Below, I present 5 arguments in favor of this hypothesis. [EDIT: I'm an experimental psychologist. Your mileage may vary in other disciplines.]

1. Blogs have Open Data, Code, and Materials


When you want to evaluate scientific claims, you need access to the raw data, the code, and the materials. Most journals do not (yet) require authors to make their data publicly available (whenever possible). The worst case example when it comes to data sharing is the American Psychological Association. In the ‘Ethical Principles of Psychologists and Code of Conduct’ of this professional organization that supported torture, point 8.14 says that psychologists only have to share data when asked to by ‘competent professionals’ who want to ‘verify claims’, and that these researchers can charge money to compensate any costs incurred in responding to a request for data. Despite empirical evidence that most scientists do not share their data when asked, the APA considers this ‘ethical conduct’. It is not. It’s an insult to science. But it’s the standard that many relatively low quality scientific journals, such as the Journal of Experimental Psychology: General, hide behind to practice closed science.

On blogs, the norm is to provide access to the underlying data, code, and materials. For example, here is Hanne Watkins, who uses data she collected to answer some questions about the attitudes of early career researchers and researchers with tenure towards replications. She links to the data and materials, which are all available on the OSF. Most blogs on statistics will link to the underlying code, such as this blog by Will Gervais on whether you should run well-powered studies or many underpowered studies. Overall, it seems to me that almost all blogs practice open science to a much greater extent than scientific journals.


2. Blogs have Open Peer Review


Scientific journal articles use peer review as quality control. The quality of the peer review process is only as high as the quality of the peers involved in it, and the process is as biased as the peers involved in it are. For most scientific journal articles, I cannot see who reviewed a paper, or check the quality of the reviews, or the presence of bias, because the reviews are not open. Some of the highest quality journals in science, such as PeerJ and Royal Society Open Science, have Open Peer Review, and journals like Frontiers at least specify the names of the reviewers of a publication. Most low quality journals (e.g., Science, Nature) have 100% closed peer review, and we don’t even know the name of the handling editor of a publication. It is often impossible to know whether articles were peer reviewed to begin with, and what the quality of the peer review process was.

Some blogs have Open pre-publication Peer Review. If you read the latest DataColada blog post, you can see reviews of the post by two experts in the field (Tom Stanley and Joe Hilgard), as well as thoughts several other people shared before the post went online. On my blog, I sometimes ask people for feedback before I put a blog post online (and these people are thanked in the post), but I also have a comment section. This allows people to point out errors and add comments, and you can see how much support or criticism a blog post has received. For example, in this blog on why omega squared is a better effect size to use than eta-squared, you can see why Casper Albers disagreed by following a link to a blog post he wrote in response. Overall, the peer review process for blog posts is much more transparent. If you see no comments on a blog post, you have the same information about the quality of the peer review process as you’d have for the average Science article. Sure, you may have subjective priors about the quality of the review process at Science (ranging from ‘you get in if your friend is an editor’ to ‘it’s very rigorous’), but you don’t have any data. But if a blog has comments, at least you can see what peers thought about the post, giving you some data, and often very important insights and alternative viewpoints.

3. Blogs have no Eminence Filter


Everyone can say anything they want on a blog, as long as it stays within the limits of the law. It is an egalitarian and democratic medium. This aligns with the norms in science. As Merton (1942) writes: “The acceptance or rejection of claims entering the lists of science is not to depend on the personal or social attributes of their protagonist; his race, nationality, religion, class, and personal qualities are as such irrelevant.” We see even Merton was a child of his times – he of course meant that his *or her* race, etcetera, is irrelevant.

Everyone can write a blog, but not everyone is allowed to publish in a scientific journal. As one example, criticism recently arose about a special section in Perspectives on Psychological Science about ‘eminence’, in which the only contribution from a woman was about gender and eminence. It was then pointed out that this special section otherwise only included perspectives on eminence from old American men, and that there might be an issue with the diversity of viewpoints in this outlet.

I was personally not very impressed by the published articles in this special section, probably because the views on how to do science expressed by this generation of old American men do not align with my views on science. I have nothing against old (or dead) American men in general (Meehl be praised), but I was glad to hear some of the most important voices in my scientific life submitted responses to this special issue. Regrettably, all these responses were rejected. Editors can make those choices, but I am worried about the presence of an Eminence Filter in science, especially one that in this specific case filters out some of the voices that have been most important in shaping me as a scientist. Blogs allow these voices to be heard, which I think is closer to the desired scientific norms discussed by Merton.

4. Blogs have Better Error Correction


In a 2014 article, we published a Table 1 with the sample sizes required to design informative studies for different statistical approaches. We stated these are sample sizes per condition, but for 2 of the columns, they are actually the total sample sizes you need. We corrected this in an erratum. I know this erratum was published, and I would love to link to it, but honest to Meehl, I cannot find it. I just spent 15 minutes searching for it in every way I can think of, but there is no link to it on the journal website, and I can’t find it in Google Scholar. I don’t see how anyone will become aware of this error when they download our article.

When I make an error in a blog post, I can go in and update it. I am pretty confident that I make approximately as many errors in my published articles as I make in my blog posts, but the latter are much easier to fix, and thus, I would consider my blogs more error-free, and of higher quality. There are good reasons why you cannot just update scientific articles (we need a stable scientific record), and there are arguments for better and more transparent version control of blog posts, but for the consumer, it’s just very convenient that mistakes can easily be fixed in blogs, and that you will always read the best version.

5. Blogs are Open Access (and might be read more)


It’s obvious that blogs are open access. This is a desirable property of high quality science. It makes the content more widely available, and I would not be surprised (though I have no data) if blog posts are *on average* read more than scientific articles, because they are more accessible. Getting page views is not, per se, an indication of scientific quality. A video of Pen Pineapple Apple Pen gets close to 8 million views, and we don’t consider that high quality music (I hope). But views are one way to measure how much impact blogs have on what scientists think.

I only have page view data for my own blog. I’ve made a .csv file with the page views of all my blog posts publicly available (so you can check my claims about page views of specific blog posts below, cf. point 1 above). There is very little research on the impact of blogs on science. They are not cited a lot (even though you can formally cite them), but they can have a clear impact, and it would be interesting to study how big that impact is. I think it would be a fun project to compare the impact of blogs with the impact of scientific articles more formally. It should be a fun thesis project for someone studying scientometrics.

Some blog posts that I wrote get more views than the articles I comment on. Take a commentary blog post I wrote on a paper which suggested there was ‘A surge of p-values between 0.041 and 0.049 in recent decades’. The paper received 7147 views at the time of writing. My blog post has received 11285 views so far. But it is not universally true that my blogs get more page views than the articles I comment on. A commentary I wrote on a horribly flawed paper by Gilbert and colleagues in Science, where they misunderstood how confidence intervals work, has received only 12190 hits so far, but the article info of their Science article tells me their article received three times as many views for the abstract (36334), and also more views for the full text (19124). On the other hand, I do have blog posts that have received more views than this specific Science article (e.g., this post on Welch’s t-test, which has 38127 hits so far). I guess the main point of these anecdotes is not surprising, but nevertheless worthwhile to point out: blogs are read, sometimes a lot.

Conclusion


I’ve tried to measure blogs and journal articles on some dimensions that, I think, determine their scientific quality. It is my opinion that blogs, on average, score better on some core scientific values, such as open data and code, transparency of the peer review process, egalitarianism, error correction, and open access. It is clear blogs impact the way we think and how science works. For example, Sanjay Srivastava’s pottery barn rule, proposed in a 2012 blog post, will be implemented in the journal Royal Society Open Science. This shows blogs can be an important source of scientific communication. If the field agrees with me, we might want to consider the curation of blogs more seriously, to make sure they won’t disappear in the future, and maybe even facilitate assigning DOIs to blog posts and citing them.

Before this turns into a ‘we who write blogs recommend blogs’ post, I want to make clear that there is no intrinsic reason why blogs should have higher scientific quality than journal articles. It’s just that the authors of most blogs I read put some core scientific values into practice to a greater extent than editorial boards at journals. I am not recommending we stop publishing in journals, but I want to challenge the idea that journal publications are the gold standard of scientific output. They fall short on some important dimensions of scientific quality, where they are outperformed by blog posts. Pointing this out might inspire some journals to improve their current standards.

Tuesday, March 14, 2017

Equivalence testing in jamovi


One of the challenges of trying to get people to improve their statistical inferences is access to good software. After 32 years, SPSS still does not give a Cohen’s d effect size when researchers perform a t-test. I’m a big fan of R nowadays, but I still remember when I thought R looked so complex that I was convinced I was not smart enough to learn how to use it. And therefore, I’ve always tried to make statistics accessible to a larger, non-R-using audience. I know there is a need for this – my paper on effect sizes from 2013 will reach 600 citations this week, and the spreadsheet that comes with the article is a big part of its success.

So when I wrote an article about the benefits of equivalence testing for psychologists, I also made a spreadsheet. But really, what we want is easy-to-use software that combines all the ways in which you can improve your inferences. And in recent years, we have seen some great SPSS alternatives that try to do just that, such as PSPP, JASP, and, more recently, jamovi.

Jamovi is made by developers who used to work on JASP, and you’ll see JASP and jamovi look and feel very similar. I’d recommend downloading and installing both of these excellent free software packages. Where JASP aims to provide Bayesian statistical methods in an accessible and user-friendly way (and you can do all sorts of Bayesian analyses in JASP), the core aim of jamovi is to make software that is ‘“community driven”, where anyone can develop and publish analyses, and make them available to a wide audience’. This means that if I develop statistical analyses, such as equivalence tests, I can make these available through jamovi for anyone who wants to use them. I think that’s really cool, and I’m super excited my equivalence testing package TOSTER is now available as a jamovi module.

You can download the latest version of jamovi here. The latest version at the time of writing is 0.7.0.2. Install, and open the software. Then, install the TOSTER module. Click the + module button:


Install the TOSTER module:


And you should see a new menu option in the task bar, called TOSTER:



To play around with some real data, let’s download the data from Study 7 of Yap et al. (in press) from the Open Science Framework: https://osf.io/pzqj2/. This study examines the effect of weather (good vs. bad days) on mood and life satisfaction. Like any researcher who takes science seriously, Yap, Wortman, Anusic, Baker, Scherer, Donnellan, and Lucas made their data available with the publication. After downloading the data, we need to replace the missing values, indicated with NA, with “” in a text editor (CTRL+H, find and replace), and then we can read the data into jamovi. If you want to follow along, you can also directly download the jamovi file here.

Then, we can just click the TOSTER menu, select a TOST independent samples t-test, select ‘condition’ as the grouping variable, and analyze, for example, the ‘lifeSat2’ variable, which measures life satisfaction. Then we need to select an equivalence bound. For this DV, we have data from approximately 117 people on good days, and 167 people on bad days. We need 136 participants in each condition to have 90% power to reject effects of d = 0.4 or larger, so let’s select d = 0.4 as the equivalence bound. I’m not saying smaller effects are not practically relevant – they might very well be. But if the authors were interested in smaller effects, they would have collected more data. So I’m assuming here the authors thought an effect of d = 0.4 would be small enough to make them reconsider the original effect by Schwarz & Clore (1983), which was quite a bit larger, with a d = 1.38.
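If you want to check where that 136 comes from, the TOSTER R package includes a power analysis function for the two-sample TOST. A minimal sketch, using powerTOSTtwo() from the TOSTER version current at the time of writing:

```r
# Sample size for 90% power to reject effects outside d = -0.4 to 0.4,
# using the TOST procedure with alpha = 0.05.
library(TOSTER)

powerTOSTtwo(alpha = 0.05, statistical_power = 0.90,
             low_eqbound_d = -0.4, high_eqbound_d = 0.4)
# should return approximately 136 participants per condition
```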


In the screenshot above you see the analysis and the results. By default, TOSTER uses Welch’s t-test, which is preferable to Student’s t-test (as we explain in this recent article), but if you want to reproduce the results in the original article, you can check the ‘Assume equal variances’ checkbox. To conclude equivalence in a two-sided test, we need to be able to reject both equivalence bounds, and with p-values of 0.002 and < 0.001, we do. Thus, we can reject effects larger than d = 0.4 or smaller than d = -0.4, and, given these equivalence bounds, conclude the observed effect is too small to matter for our current purposes.

Jamovi runs on R, and it’s a great way to start exploring R itself, because you can easily reproduce the analysis we just did in R. To use equivalence tests in R, we can download the original datafile (R will have no problems with NA as missing values) and read it into R. Then, in the top right corner of jamovi, click the … options window, and check the ‘syntax mode’ box.


You’ll see the output window change to the input and output style of R. You can simply right-click the syntax at the top, choose Syntax > Copy, then go to R, and paste the syntax there:


Running this code gives you exactly the same results as jamovi.
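For completeness, here is a minimal sketch of what the analysis looks like if you call TOSTER directly, instead of copying the jamovi-generated syntax (which uses the module's own function names). I'm using the TOSTtwo() summary-statistics interface from the TOSTER version current at the time of writing; the file name and the 'good'/'bad' condition labels are placeholders, so check the codebook on the OSF for the actual coding:

```r
library(TOSTER)

# R treats NA as missing by default, so the find-and-replace step used
# for jamovi is not needed here. "yap_study7.csv" is a placeholder name.
dat <- read.csv("yap_study7.csv")

good <- dat$lifeSat2[dat$condition == "good"]  # assumed condition labels
bad  <- dat$lifeSat2[dat$condition == "bad"]

TOSTtwo(m1  = mean(good, na.rm = TRUE), m2  = mean(bad, na.rm = TRUE),
        sd1 = sd(good, na.rm = TRUE),   sd2 = sd(bad, na.rm = TRUE),
        n1  = sum(!is.na(good)),        n2  = sum(!is.na(bad)),
        low_eqbound_d = -0.4, high_eqbound_d = 0.4,
        alpha = 0.05, var.equal = FALSE)  # FALSE = Welch's t-test, the default
```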

I collaborated a lot with Jonathon Love on getting the TOSTER package ready for jamovi. The team is incredibly helpful, so if you have a nice statistics package that you want to make available to a huge ‘not-yet-R-using’ community, I would totally recommend checking out the developer hub and getting started! We are seeing all sorts of cool power-analysis Shiny apps, meta-analysis spreadsheets, and meta-science tools like p-checker that now live on websites all over the internet, but that could all find a good home in jamovi. If you already have the R code, all you need to do is make it available as a module!

If you use it, you can cite it as: Lakens, D. (in press). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. DOI: 10.1177/1948550617697177

Friday, March 10, 2017

No, the p-values are not to blame: Part 53

In the latest exuberant celebration of how Bayes Factors will save science, van Ravenzwaaij and Ioannidis write: “our study offers through simulations yet another demonstration of the unfortunate effect of p-values on statistical inferences.” Uh oh – what have these evil p-values been up to this time?

Because the Food and Drug Administration thinks two significant studies are a good threshold before they'll allow you to put stuff in your mouth, van Ravenzwaaij and Ioannidis look, in a simple simulation, at what Bayes factors have to say when researchers find exactly two studies with p < 0.05.

If you find two significant effects in 2 studies, and there is a true effect of d = 0.5, the data is super-duper convincing. The blue bars below indicate Bayes Factors > 20; the tiny green parts are BFs > 3 but < 20 (still fine).


 Even when you study a small effect with d = 0.2, after observing two significant results in two studies, everything is hunky-dory.


So p-values work like a charm, and there is no problem. THE END.

What's that you say? This simple message does not fit your agenda? And it's unlikely to get published? Oh dear! Let's see what we can do!

Let's define 'support for the null-hypothesis' as a BF < 1. After all, just as a 49.999% rate of heads in a coin flip is support for a coin biased towards tails, any BF < 1 is stronger support for the null than for the alternative. Yes, normally researchers consider 1/3 < BF < 3 'inconclusive', but let's ignore that for now.

The problem is we don't even have BF < 1 in our simulations so far. So let's think of something else. Let's introduce our good old friend lack of power!

Now we simulate a bunch of studies, until we find exactly 2 significant results. Let's say we do 20 studies where the true effect is d = 0.2, and only find an effect in 2 studies. We have 15% power (because we do a tiny study examining a tiny effect). This also means that the effect size estimates in the 18 other studies have to be small enough not to be significant. Then, we calculate Bayes Factors "for the combined data from the total number of trials conducted." Now what do we find?


Look! Black stuff! That's bad. The 'statistical evidence actually favors the null hypothesis', at least based on a BF < 1 cut-off. If we include the possibility of 'inconclusive evidence' (applying the widely used 1/3 < BF < 3 thresholds), we see that when you find only 2 out of 20 significant studies with 15% power, the overall data is actually sometimes inconclusive (but not support for H0).

That's not surprising. When we have 20 people per cell and d = 0.2, and we combine all the data to calculate the Bayes factor (so we have N = 400 per cell), the data is sometimes inconclusive. After all, we only have 88% power! That's not bad, but the data you collect will sometimes still be inconclusive!
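To make this concrete, here is a minimal sketch of the scenario in R. This is my own reconstruction, not the authors' simulation script: it generates batches of 20 studies with n = 20 per cell and a true d = 0.2, keeps only the batches with exactly 2 significant results, and computes a default two-sample Bayes factor (from the BayesFactor package) on the pooled data.

```r
# A sketch of the scenario above (my reconstruction, not the authors' script).
library(BayesFactor)
set.seed(123)

n_studies  <- 20   # studies per batch
n_per_cell <- 20   # participants per cell; with one-sided tests this gives
d          <- 0.2  # roughly the 15% power per study mentioned above

bfs <- c()
while (length(bfs) < 100) {  # collect 100 qualifying batches
  x <- matrix(rnorm(n_studies * n_per_cell, mean = d), ncol = n_studies)
  y <- matrix(rnorm(n_studies * n_per_cell, mean = 0), ncol = n_studies)
  p <- sapply(seq_len(n_studies), function(i)
    t.test(x[, i], y[, i], alternative = "greater")$p.value)
  if (sum(p < .05) == 2) {                     # exactly 2 significant results
    bf <- ttestBF(as.vector(x), as.vector(y))  # pooled data: N = 400 per cell
    bfs <- c(bfs, extractBF(bf)$bf)
  }
}

# Classify using the common 1/3 < BF < 3 'inconclusive' band:
mean(bfs < 1/3)               # 'support for H0'
mean(bfs >= 1/3 & bfs <= 3)   # inconclusive
mean(bfs > 3)                 # support for H1
```

Run it and you should see many Bayes factors in the inconclusive band, in line with the figure above: the selection step, not the p-values, is what drags the evidence down.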

Let's see if we can make it even worse, by introducing our other friend, publication bias. The authors show another example of when p-values supposedly lead to bad inferences: there is no true effect, we do 20 studies, and we find 2 significant results (which are Type 1 errors).


Wowzerds, what a darkness! Aren't you surprised? No, I didn't think so.

To conclude: inconclusive results happen. With small samples and small effects, there is huge variability in the data. This is not only true for p-values; it is just as true of Bayes Factors (see my post on the Dance of the Bayes Factors here).

I can understand the authors might be disappointed by the FDA's lack of enthusiasm to embrace Bayes Factors (the FDA cares greatly about controlling error rates, given that it deals with life and death). But the problems the authors simulate are not going to be fixed by replacing p-values with Bayes Factors. It's not that "Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications." Publication bias and lack of power lead to spurious decision-making - regardless of the statistic you throw at the data.

I'm gonna bet that a little less Bayesian propaganda, a little less p-value bashing for no good reason, and a little more acknowledgement of the universal problems of publication bias and sample sizes that are too small for any statistical inference we try to make, is what will really improve science in the long run.




P.S. The authors shared their simulation script with the publication, which was extremely helpful in understanding what they actually did, and which allowed me to make the figure above that includes an 'inconclusive' category (I also used a slightly more realistic prior for when you expect small effects; I don't think it matters, but I'm too impatient to redo the simulation with the same prior and only different cut-offs).