The 20% Statistician: Checking your Stats, and Some Errors we Make

Nuijten et al. (2015) created statcheck, a free R package that you can set to work on a PDF or HTML file, or a folder of files, to check the reported t-tests, F-tests, correlations, and some other tests. Just as you run a spellchecker, you will want to run statcheck when working as an editor, reviewer, author, supervisor, or teacher on any empirical article that contains t-tests, F-tests, correlations, or chi-square tests.

Here’s how it works. First, you need to install open source software that will allow R to convert PDF files to text. The steps are a bit long and tricky, but I made a step-by-step summary which should help you to get this to work.

Then, to check a single article, run the following R code (changing the path to the PDF you want to check):

# install statcheck if it is missing, then load it
if (!require(statcheck)) {install.packages('statcheck')}
library(statcheck)

# check the statistics reported in a single PDF
checkPDF("C:/Users/Daniel/statcheck/Zhang2015.pdf")

That’s all.

You will get output where you can see a column for the reported p-values, the re-computed p-values, a summary of each test, and then a column called Error which will say FALSE if there is no error, and TRUE if there is an error. I analyzed a recent paper my PhD student Chao Zhang wrote, and I was happy to see the way we worked on this article (Chao doing the analyses, me double-checking them) prevented us from making errors. I also looked at earlier papers, and I regrettably did make a few rounding errors and copy-paste errors in my publications. Even though none of them changed the conclusions (indicated by the column ‘DecisionError’), using Statcheck would have easily prevented these errors. Statcheck can make some errors itself, so be sure to check whether each test is identified correctly, especially when it flags something as an error.

Some Errors We Make

Nuijten and colleagues applied Statcheck to a huge number of articles and, in a new paper, report how often people make errors when reporting statistical tests. When reading the paper, I immediately saw how useful Statcheck was. But I also felt some annoyance that there was no clear analysis of the things we did wrong. I felt someone told me I was doing things wrong, without telling me what it was I did wrong. But then a wise man said I should not blame the authors for not writing the paper I would have written. Which is especially true given that Nuijten et al. have shared all their data, and their beautiful and reproducible analysis script.

So I took a look at what we did wrong (R script), and below I will give a recommendation on how to fix a large majority of the problems.

Of the 258105 tests, there were 24961 errors, of which 3581 were decision errors (changing the conclusion of p > 0.05 to p < 0.05 or vice versa), but they are mainly caused by the same few types of errors. First, people make copy-paste errors. Second, people reported p = 0.000 a total of 1279 times, when they should have reported p < 0.001. Three other errors are worth looking into in some more detail.
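To put these counts in perspective, the error rates can be recomputed from the three numbers above. This is only a small sketch in base R; the variable names are my own and nothing here comes from the original analysis script.

```r
# Counts as reported in this post
n_tests           <- 258105  # test results extracted by statcheck
n_errors          <- 24961   # reported p-value does not match the test statistic
n_decision_errors <- 3581    # mismatch that changes the conclusion around p = 0.05

# Express both error counts as a percentage of all tests
pct_errors          <- 100 * n_errors / n_tests
pct_decision_errors <- 100 * n_decision_errors / n_tests

round(pct_errors, 1)           # close to 10% of all tests
round(pct_decision_errors, 1)  # the '1.4%' figure quoted later in this post
```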

Incorrect use of < instead of =

By far the largest number of errors is the use of < instead of =. For example, F(1, 68) = 4.88, p < .03 is incorrect, because the p-value is actually 0.0305, which is not < 0.03. It happens thousands and thousands of times. Indeed, if we look at the difference between the reported and re-computed p-values for all the errors, we see the difference in p-values is mostly tiny (smaller than 0.01), and the use of < instead of = is the main reason. When you read the headline ‘One in eight articles contain data-reporting mistakes that affect their conclusions’ you might not think the solution is simply to replace ‘<’ by ‘=’. I believe it largely is (but this deserves a closer look).
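You can verify the example yourself in base R, without statcheck. The sketch below recomputes the p-value for F(1, 68) = 4.88 with pf() and checks the reported claim p < .03; the variable names are mine.

```r
# Re-compute the p-value for the reported test F(1, 68) = 4.88
f_value <- 4.88
df1 <- 1
df2 <- 68

# Upper-tail probability of the F distribution
p_computed <- pf(f_value, df1, df2, lower.tail = FALSE)
p_computed

# The article reported p < .03; does the re-computed value agree?
claim_holds <- p_computed < 0.03
claim_holds  # FALSE: the p-value is just above 0.03
```

Reporting the exact value (p = 0.031 after rounding to three decimals) would have avoided the inconsistency.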

Use of one-sided tests

Using one-sided tests without saying so (or at least without Statcheck recognizing the words ‘one-sided’, ‘one-tailed’, or ‘directional’ in the text) is another source of errors. The frequency of one-tailed tests (performed, I assume, without pre-registration of the analysis plan) is rather high. One-tailed tests are fine, and perhaps even more in line with your prediction than a two-tailed test, but I’d feel more comfortable if people pre-registered one-sided predictions if they have them, and reported one-tailed tests whenever they are performed. Statcheck is great for finding non-disclosed one-tailed tests.
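To see why an undisclosed one-tailed test shows up as an error, consider a made-up result, t(48) = 1.80 (hypothetical numbers, not taken from any paper). A checker that assumes a two-sided test will re-compute a p-value twice as large as the one-sided value the authors reported. A minimal sketch in base R:

```r
# Hypothetical result for illustration only: t(48) = 1.80
t_value <- 1.80
df <- 48

# Two-sided p-value, which a checker assumes by default
p_two_sided <- 2 * pt(abs(t_value), df, lower.tail = FALSE)

# One-sided p-value, for a directional prediction of a positive effect
p_one_sided <- pt(t_value, df, lower.tail = FALSE)

c(two_sided = p_two_sided, one_sided = p_one_sided)
```

Here the one-sided p-value falls below 0.05 while the two-sided p-value does not, so reporting 'p < .05' without mentioning the one-tailed test would be flagged as a decision error.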

Incorrect Rounding and Reporting

963 times, people reported a p-value between 0.05 and 0.06 as p < 0.05. This is clearly wrong (but remember people make the same rounding error far removed from the magical p = 0.05 threshold as well, so this is just the incorrect use of < instead of = as noted above). 241 times, researchers reported a p >= 0.055 as p < 0.05, and 128 times, people rounded a p-value between 0.055 and 0.06 to p = 0.05 (really using the = sign). This is just pathetic. When you hear ‘1.4% of p-values are grossly inconsistent’, this is the kind of behavior you think about. It makes up approximately 10% of the 3581 decision errors, and even though it is just 0.14% of all reported p-values, I think it is depressingly high. Statcheck can help reduce these errors.
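The two percentages can be checked with a few lines of base R. This sketch assumes that the '10%' and '0.14%' refer to the 241 and 128 cases taken together, which is my reading of the counts above; the variable names are mine.

```r
# Counts as reported in this post
n_tests           <- 258105
n_decision_errors <- 3581

# Assumption: the two kinds of gross rounding errors are counted together
n_gross_rounding <- 241 + 128

share_of_decision_errors <- 100 * n_gross_rounding / n_decision_errors
share_of_all_tests       <- 100 * n_gross_rounding / n_tests

round(share_of_decision_errors, 1)  # roughly 10% of the decision errors
round(share_of_all_tests, 2)        # roughly 0.14% of all reported p-values
```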

Altogether, the 3581 decision errors are made up mostly of incorrect rounding, the use of one-sided tests without explicitly stating this through the words ‘one-tailed’, ‘one-sided’ or ‘directional’, the use of < instead of =, and the approximately 350 (give or take a hundred) false positives (note there might also be false negatives, which would increase the number of errors).

These errors are visible in the plot below. On the left of the graph, we see differences of -1, where Statcheck often computes a p-value of 1 because it misunderstands the test. The large bar in the center is mainly due to the use of < instead of =, and the slightly larger slope on the left of this large bar is due to the use of one-sided tests and incorrect rounding.

My main goal in looking at the data in detail was to be able to provide practical recommendations to prevent the specific errors we make (even though Nuijten et al. suggest co-authors double-check their analyses and share all data). The recommendation is surprisingly straightforward, and fits nicely with the theme of this blog on how 20% of the effort will fix 80% of the problems:

Report exact p-values, rounded to three decimals (e.g., p = 0.016), or use p < 0.001. Mention the use of one-tailed tests. Double-check all numbers (for example by using Statcheck!).
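If you want to apply this reporting rule consistently, a small helper function can do the formatting for you. The function below is a hypothetical sketch (format_p is my own name, not part of statcheck or any other package): it reports exact p-values rounded to three decimals and switches to 'p < 0.001' for anything smaller.

```r
# Hypothetical helper: format a single p-value following the rule above
format_p <- function(p) {
  stopifnot(is.numeric(p), length(p) == 1, p >= 0, p <= 1)
  if (p < 0.001) {
    "p < 0.001"               # never report p = 0.000
  } else {
    sprintf("p = %.3f", p)    # exact value, three decimals
  }
}

format_p(0.0162)   # "p = 0.016"
format_p(0.00004)  # "p < 0.001"
```

Because the function always uses = for exact values, it avoids both the 'p = 0.000' problem and the incorrect use of <.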

I'd like to thank Michele Nuijten for her help in correcting some of my assumptions and analyses, and for feedback on an earlier draft of this blog post.

12 comments:

Unknown, October 29, 2015 at 2:19 PM
Nice post! I agree - I also discovered errors in my own manuscripts when checking them (at least, before I started to use knitr).

For people who want to check their own manuscript and are less amazed by the idea of installing several command line tools (or even never started R): They can type the relevant test statistics into the p-checker app: http://shinyapps.org/apps/p-checker/

This is a bit more manual work, but probably easier for many. (And you get additional indices of evidential value as a free add-on!).
Unknown, October 29, 2015 at 3:03 PM
> Felix, I think the idea of fully automized statistics checks using p-checker is a worthwhile blog to write!

Absolutely! Actually, it's remarkably easy, extending your code snippet:

------
library(statcheck)

# This is a retracted paper
download.file(url="http://www.communicationcache.com/uploads/1/0/8/8/10887248/money_and_mimicry-when_being_mimicked_makes_people_feel_threatened.pdf", destfile="check.pdf", method="curl")

report <- checkPDF("check.pdf")

# Transfer report to p-checker
browseURL(paste("http://shinyapps.org/apps/p-checker/?syntax=", paste(levels(report$Raw), collapse="\n")))
------

Users should be aware, however, that R-index, p-curve, etc. only need the focal hypothesis tests, while statcheck extracts all test statistics.

Furthermore, I realized that statcheck failed on many PDFs I tried. At the end, we probably cannot avoid some hand-coding.
Unknown, November 2, 2015 at 4:03 PM
This is fantastic! I have shared it with my professor, Jay Van Bavel, and we've shared it with the whole lab. We are making it a policy to run this program before submitting any manuscript. I suspect this may become common practice in our field in short order.

One question. I am on a Mac, and I have noticed that the instructions for adding xpdf to the path are geared for Windows users. I'm therefore at a loss as to how to install the script on my machine! Would you mind providing a little guidance about how to get the script set up for Mac users? (I suspect this will be useful to more than me, given the prevalence of Mac users in the field!)

Thanks!

Daniel Yudkin
Advanced Doctoral Candidate in Social Psychology, New York University
Anonymous, November 7, 2015 at 2:40 PM
This is great, thanks!

You wrote that it works with correlations, but that seems not to be the case, the few PDFs I've tried.

It worked with most papers I've tried but APA papers like the JEPs and Emotion did not work for me. It seems the equal signs are coded as underscores or blankspaces in those papers. It can to some extent be fixed manually by "search and replace-function" in the text editor, I guess.

Great help for checking errors in one's own manuscripts, nevertheless!
Anonymous, November 22, 2015 at 7:00 AM
Hi Daniel, does this work with non-parametric analyses and post-hoc corrected data?
Andy, December 30, 2016 at 11:03 AM
just for future readers, you can check your papers online now too here http://statcheck.io/

Thursday, October 29, 2015