A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Sunday, December 13, 2015

Plotting Scopus article level citation data in R

The Royal Society has decided to publish journal citation distributions. This makes sense. The journal impact factor is a single number trying to summarize a distribution, but it’s almost always better to plot your data. Some are hopeful that visualizing such distributions will make it clear what a troublesome statistic the journal impact factor is, and that other journals will also be open with their data.

I want to point out that all this data is readily available to anyone who has access to Scopus, and at the bottom of this post I’ll share the R code to create these plots yourself.

Go to Scopus, and search for any journal you’d like. Here, I’ll illustrate this process with a search for the journal Psychological Science, which has ISSN 0956-7976. You can search for any range of years, but Scopus will only allow you to export 2000 cases at once. I limited this search to issues from 2010-2015. For copyright reasons, I cannot share the Scopus data I downloaded.

Then, select all results, and export ‘all available information’ as a .csv file, as illustrated in the animation below.

Now that you have the data, plotting the citations is straightforward, and can be done with the code below (the plots in this blog post look a bit different than the output of the code, but the data is the same). For example, here is the distribution of citations for Psychological Science for the years 2010-2015. The tail is so long that I cut off the x-axis at 200 citations. Three papers (most notably Simmons, Nelson, & Simonsohn, 2011, with 662 citations) are cited more than 200 times.

The data is clearly skewed, and papers obviously accumulate more citations as the years go by. The differences between the means:

        Year    Mean
1       2010    34.551724
2       2011    25.329167
3       2012    18.460465
4       2013    12.055016
5       2014    6.176471
6       2015    1.814815

and medians:

        Year    Median
1       2010    25
2       2011    17
3       2012    15
4       2013    8
5       2014    4
6       2015    1

are obvious. You would probably exclude extreme outliers when analyzing your own data, but journals obviously like to keep them in because they boost the impact factor, even though they are not representative.
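The per-year means and medians above are easy to compute from the exported file. Here is a minimal sketch in Python (not the R script this post refers to), assuming the export has columns named `Year` and `Cited by` and that an empty citation field means zero citations. Both are assumptions about the export format, and the toy data are invented.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median


def citations_by_year(csv_file, year_col="Year", cites_col="Cited by"):
    """Return {year: (n_papers, mean_citations, median_citations)}."""
    counts = defaultdict(list)
    for row in csv.DictReader(csv_file):
        cites = row[cites_col].strip()
        # Assumption: an empty citation field means the paper has 0 citations.
        counts[int(row[year_col])].append(int(cites) if cites else 0)
    return {
        year: (len(values), mean(values), median(values))
        for year, values in sorted(counts.items())
    }


# Invented toy data standing in for a real export.
toy_export = io.StringIO(
    "Title,Year,Cited by\n"
    "Paper A,2010,120\n"
    "Paper B,2010,10\n"
    "Paper C,2010,20\n"
    "Paper D,2011,4\n"
    "Paper E,2011,\n"
)

summary = citations_by_year(toy_export)
for year, (n, avg, med) in summary.items():
    print(year, n, round(avg, 2), med)
```

In the toy data, a single highly cited paper pulls the 2010 mean (50) well above the median (20), which is the same skew visible in the real numbers. For a real file you would pass an open file handle instead, for example `open("scopus.csv", newline="", encoding="utf-8-sig")`, where the file name is hypothetical.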

Feel free to play around with the script, and link to your plots in the comments below, or tweet them to me at @Lakens.

Friday, December 11, 2015

Can you explain why you did not share data and materials when publishing your article?

I recently signed the Peer Reviewers’ Openness Initiative. At its core, it boils down to one very simple thing: As a reviewer, I will from 2017 onwards ask authors to explain why they cannot share their data and materials. Without an explanation, I will choose not to review this specific article.

In Peter Singer’s ‘The Life You Can Save’ (2009) he describes a simple situation. You walk past a shallow pond where you see a small child who is in danger of drowning. No one else is around, but you can easily save the child if you act immediately. You won’t have time to take off your shoes, and the shoes you are wearing will be ruined, and no one will refund them. Will you save the child at the expense of your shoes?

The answer many people give is: “Yes, sure”. Peter Singer goes on to argue that the same amount of money you would be willing to spend in this situation, could be used right now to save the life of children somewhere else in the world.

This story stuck with me because it forces you to explain your behavior. Why don’t I give more to charity?

I personally think it is important to be able to rationalize some important behaviors I perform. When it comes to my work, which is paid for by taxpayers, I feel I need to give them optimal value for their money. When I share my data, stimuli, and materials, science will become more transparent and efficient. If I don’t adhere to these open science principles, I think I need to give an explanation. That’s why from 2013, most of the data, materials, and scripts of papers on which I was first author or co-author are publicly available.

As in Peter Singer’s scenario, the rationalization not to do something is sometimes difficult, and sometimes easy. If you don’t have enough money as it is, you don’t have any money to donate to others. Similarly, if you cannot share materials, such as the IAPS pictures I used in Lakens, Fockenberg, Lemmens, Ham, & Midden, 2013, the justification is easy. At other times, such as when you are considering spending money on gadgets you don’t really need, or when the materials and data have no copyright or privacy issues, you might be affectively inclined to come up with excuses, only to realize they don’t hold up after careful deliberation.

It’s this latter category we aim to address with the Peer Reviewers’ Openness Initiative. It is so easy to just ignore this rational justification process when you are a little busy. The goal is to make people ask themselves: Could I share the data, materials, and stimuli? Would doing so make science more transparent and efficient?

I’ve started sending out tweets to let you know how many of the papers I review share all data and materials, or explain why this was not possible. So far, I’m at 3/3. After all, journals like PLOS already ask authors to specify the reasons for restrictions on public data deposition in line with the PRO initiative (they just don’t ask authors to include stimuli or materials whenever possible). I have a strong conviction that researchers want to do what is best for science. Every now and then, we just need someone who asks us to reflect upon, and explain, our behavior.

If you want to help remind researchers they need to rationalize why they are not sharing data, materials, and stimuli, you can sign the PRO initiative here.

For other views related to the Initiative, see blog posts by Richard Morey, Candice Morey, and Rolf Zwaan.

Thursday, December 3, 2015

Zotero – Finally a Good Reference Manager

If you are from my generation, you know what UP UP DOWN DOWN LEFT RIGHT LEFT RIGHT B A START means, your first e-mail account ended with hotmail.com, and the first five times you tried to use a reference manager, it sucked up so much time you were better off setting type by hand.

The fifth reference manager I tried, in October 2010, was Zotero. It wasn’t user friendly, had limited options, and I soon dismissed it like all the others.

But recently, thankfully, some people drew my attention to Zotero again, and said it worked great. At first, I categorized them as people who will even say GitHub is user-friendly. You know, because they are tech-savvy youngsters who programmed in Minecraft on their iPad when they were 5.

But Zotero is great, and I’m so excited I just need to tell you some of its great features before you, like me, go on without it because you still think nothing can beat copy-pasting references by clicking Google Scholar’s ‘cite’ button.

Zotero has a standalone app. You can download it here, but be sure to also install the extension for the browser you use. If you load a webpage that has a scientific article on it, a symbol (either a folder, or a page) will appear in the browser bar. 
If you click it, Zotero will automatically add the reference to your database. If you thought that was cool, wait until you see that Zotero also automatically downloads the PDF file (if you have access to it). If you use it on Google Scholar, and there’s a link to a PDF file there, Zotero will also download the PDF file (see below).

That’s right. In one click, you have the reference, and the PDF file stored in your database. See the .gif below that illustrates this process.

Talking about the database: wouldn’t it be nice if the database could be synced across multiple computers? It surely would, and it surely can!

Set up a Box account. It will give you 10 GB to sync (which should be enough) for free. The Zotero servers will only allow you to sync 300 MB for free, which is enough for the database, but not for the attachments. Create the account, and use dav.box.com/dav and your account name and password to sync (see below). Wait for everything to sync (I had 3 GB, which took a while), and then you can download the files to a second PC.

If you already have a large number of PDF files on your computer, just drag them into Zotero, select them, right-click, and choose ‘Retrieve Meta-data for PDF’ (see the .gif below). Zotero will recognize most (but not all) PDF files. If you have a few thousand articles, Google Scholar (which it uses) will block you for a day. Be patient, and spread out the automatic recognition over a few days.

Obviously Zotero comes with an easy-to-use add-in for Word, and adding references and the bibliography is really easy (in any citation style you want – it has APA 6th edition). After enabling PDF indexing in the options, you can use Zotero for a super fast search through the content of all PDF files in your database. You can also create groups – I created two, one for each PhD student I supervise, so I can easily share papers I come across with them when I think they should read them, and vice versa.

In short, I’m completely sold. I was missing out on a great tool. Thanks to Maarten Derxen and Mark Dingemanse for convincing me to try Zotero again. 

Are there some cool features of Zotero I missed? Let me know in the comments!

Sunday, November 22, 2015

The relation between p-values and the probability H0 is true is not weak enough to ban p-values

The journal Basic and Applied Social Psychology banned the p-value in 2015, after Trafimow (2014) had explained in an editorial a year earlier that inferential statistics were no longer required. In the 2014 editorial, Trafimow notes how: “The null hypothesis significance procedure has been shown to be logically invalid and to provide little information about the actual likelihood of either the null or experimental hypothesis (see Trafimow, 2003; Trafimow & Rice, 2009)”. The goal of this blog post is to explain why the arguments put forward in Trafimow & Rice (2009) are incorrect. Their simulations illustrate how meaningless questions provide meaningless answers, but they do not reveal a problem with p-values. Editors can do with their journal as they like - even ban p-values. But if the simulations upon which such a ban is based are meaningless, the ban itself becomes meaningless.

To calculate the probability that the null-hypothesis is true, given some data we have collected, we need to use Bayes’ formula. Cohen (1994) shows how the posterior probability of the null-hypothesis, given a statistically significant result (the data), can be calculated based on a formula that is a poor man’s Bayesian updating function. Instead of creating distributions around parameters, his approach simply uses the p-value of a test (which is related to the observed data), the power of the study, and the prior probability the null-hypothesis is true, to calculate the posterior probability H0 is true, given the observed data. Before we look at the formula, some definitions:

P(H0) is the prior probability (P) the null hypothesis (H0) is true.
P(H1) is the probability (P) the alternative hypothesis (H1) is true. Since I’ll be considering only a single alternative hypothesis here, either the null hypothesis or the alternative hypothesis is true, and thus P(H1) = 1- P(H0). We will use 1-P(H0) in the formula below.
P(D|H0) is the probability (P) of the data (D), or more extreme data, given that the null hypothesis (H0) is true. In Cohen’s approach, this is the p-value of a study.
P(D|-H0) is the probability of the data (a significant result), given that H0 is not true, or when the alternative hypothesis is true. This is the statistical power of a study.
P(H0|D) is the probability of the null-hypothesis, given the data. This is our posterior belief in the null-hypothesis, after the data has been collected. According to Cohen (1994), it’s what we really want to know. People often mistake the p-value as the probability the null-hypothesis is true.

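If we ignore the prior probability for a moment, the formula in Cohen (1994) is simply:

P(H0|D) = P(D|H0) / (P(D|H0) + P(D|-H0))

More formally, and including the prior probabilities, the formula is:

P(H0|D) = P(H0) × P(D|H0) / (P(H0) × P(D|H0) + (1 - P(H0)) × P(D|-H0))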

In the numerator, we calculate the probability that we observed a significant p-value when the null hypothesis is true, and divide it by the total probability of finding a significant p-value when either the null-hypothesis is true or the alternative hypothesis is true. The formula shows that the lower the p-value in the numerator, and the higher the power, the lower the probability of the null-hypothesis, given the significant result you have observed. Both depend on the same thing: the sample size, and the formula gives an indication why larger sample sizes mean more informative studies.
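To make the formula concrete, here is a minimal sketch in Python (not the R code this post refers to). It assumes the formula is Bayes' rule with the p-value standing in for P(D|H0), the power for P(D|-H0), and 1 - P(H0) for the prior of the alternative, as in the definitions above. The function name `posterior_h0` is made up for illustration.

```python
def posterior_h0(p_value, power, prior_h0):
    """P(H0|D) under the poor man's Bayesian updating formula.

    p_value  stands in for P(D|H0)
    power    stands in for P(D|-H0)
    prior_h0 is P(H0); the prior of the alternative is 1 - P(H0)
    """
    numerator = prior_h0 * p_value
    denominator = numerator + (1 - prior_h0) * power
    return numerator / denominator


# With equal priors, a lower p-value or a higher power lowers P(H0|D).
print(round(posterior_h0(0.05, 0.50, 0.5), 3))  # 0.091
print(round(posterior_h0(0.01, 0.50, 0.5), 3))  # 0.02
print(round(posterior_h0(0.05, 0.90, 0.5), 3))  # 0.053
```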

How are p and P(H0|D) related?

Trafimow and Rice (2009) used the same formula mentioned in Cohen (1994) to calculate P(H0|D) to examine whether p-values drawn from a uniform distribution between 0 and 1 were linearly correlated with P(H0|D). In their simulations, the value for the power of the study is also drawn from a uniform distribution, as is the prior P(H0). Thus, all three variables in the formula are randomly drawn from a uniform distribution. Trafimow & Rice (2009) provide an example (I stick to the more commonly used D for the data, where they use F):

“For example, the first data set contained the random values .540 [p(F|H0)], .712 [p(H0)], and .185 [p(F|–H0)]. These values were used in Bayes’ formula to derive p(H0|F) = .880.”

The first part of the R code below reproduces this example. The remainder of the R script reproduces the simulations reported by Trafimow and Rice (2009). In the simulation, Trafimow and Rice (2009) draw p-values from a uniform distribution. I simulate data when the true effect size is 0, which also implies p-values are uniformly distributed.
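For readers without the R script at hand, the same two steps can be sketched in Python. This is a rough stand-in, not the script itself. It assumes the updating formula is Bayes' rule with the p-value as P(D|H0) and the power as P(D|-H0), and it draws p-values directly from a uniform distribution, as Trafimow and Rice describe, rather than simulating studies. The seed and the number of simulated data sets are arbitrary choices.

```python
import random


def posterior_h0(p_value, power, prior_h0):
    """P(H0|D) = P(H0)*p / (P(H0)*p + (1 - P(H0))*power)."""
    return prior_h0 * p_value / (prior_h0 * p_value + (1 - prior_h0) * power)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Part 1: the quoted example. The three inputs are rounded to three
# decimals in the quote, so the result need not match .880 exactly.
example = posterior_h0(p_value=0.540, power=0.185, prior_h0=0.712)
print(round(example, 3))

# Part 2: draw all three inputs from a uniform distribution.
rng = random.Random(2015)  # arbitrary seed
n_sims = 50_000            # arbitrary number of simulated data sets
p_values, posteriors = [], []
for _ in range(n_sims):
    # 1 - random() lies in (0, 1], which avoids a zero denominator.
    p = 1.0 - rng.random()
    power = 1.0 - rng.random()
    prior = 1.0 - rng.random()
    p_values.append(p)
    posteriors.append(posterior_h0(p, power, prior))

r = pearson_r(p_values, posteriors)
print(round(r, 3))
```

With the rounded inputs the example comes out at about .878, close to the .880 in the quote.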

The correlation between p-values and the probability the null-hypothesis is true, given the data (P(H0|D)), is 0.37. This is a bit lower than the correlation of 0.396 reported by Trafimow and Rice (2009). The most likely reason for this is that they used Excel, which has a faulty random number generator that should not be used for simulations. Although Trafimow and Rice (2009) say that “The present authors’ main goal was to test the size of the alleged correlation. To date, no other researchers have done so” we already find in Krueger (2001): “Second, P(D|H0) and P(H0|D) are correlated (r = .38)” which was also based on a simulation in Excel (Krueger, personal communication). So, it seems Krueger was the first to examine this correlation, and the estimate of r = 0.37 is most likely correct. Figure 1 presents a plot of the simulated data.

It is difficult to discern the pattern in Figure 1. Based on the low correlation of 0.37, Trafimow & Rice (2009, p. 266) remark that this result “fails to provide a compelling justification for computing p values”, and it “does not constitute a compelling justification for their routine use in social science research”. But they are wrong. They also note the correlation only accounts for 16% of the variance, which is what you get when calculating a linear correlation coefficient for values that are logarithmically related, as we will see below. The only conclusion we can draw based on this simulation is that the authors asked a meaningless question (calculating a linear correlation coefficient), which they tried to answer with a simulation in which it is impossible to see the pattern they are actually interested in.

Using a fixed P(H0) and power

The low correlation is not due to the ‘poorness’ (Trafimow & Rice, 2009, p. 264) of the relation between the p-value and P(H0|D), which is, as I will show below, perfectly predictable, but to their choice to randomly draw values for P(D|-H0) and P(H0). If we fix these values (to any value you deem reasonable) we can see the p-value and P(H0|D) are directly related. In Figure 2, the prior probability of H0 is fixed to 0.1, 0.5, or 0.9, and the power (P(D|-H0)) is also fixed to 0.1, 0.5, or 0.9. These plots show that the p-value and P(H0|D) are directly related and fall on a logarithmic scale.

Lower p-values always mean P(H0|D) is lower, compared to higher p-values. It’s important to remember that significant p-values (left of the vertical red line) don’t necessarily mean that H0 is less likely to be true than H1 (see the bottom-left plot, where P(H0|D) is larger than 0.5 after a significant effect of p = 0.049). The horizontal red lines indicate the prior probability that the null hypothesis is true. We see that high p-values make the probability that H0 is true more likely (but sometimes the change in probability is rather unimpressive), and low p-values make this probability less likely.
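The same point can be checked without a plot. Below is a minimal Python sketch (not the R code behind Figure 2), assuming the same updating formula as before and the nine prior/power combinations described for the figure. The grid of p-values is an arbitrary choice.

```python
def posterior_h0(p_value, power, prior_h0):
    """P(H0|D) = P(H0)*p / (P(H0)*p + (1 - P(H0))*power)."""
    return prior_h0 * p_value / (prior_h0 * p_value + (1 - prior_h0) * power)


levels = (0.1, 0.5, 0.9)
p_grid = [i / 100 for i in range(1, 100)]  # p = 0.01, 0.02, ..., 0.99

still_likely = []  # combinations where H0 stays more likely than H1
for prior in levels:
    for power in levels:
        curve = [posterior_h0(p, power, prior) for p in p_grid]
        # With prior and power fixed, P(H0|D) rises strictly with p.
        assert all(a < b for a, b in zip(curve, curve[1:]))
        at_049 = posterior_h0(0.049, power, prior)
        if at_049 > 0.5:
            still_likely.append((prior, power))
        print(f"prior={prior}, power={power}: P(H0|p=.049) = {at_049:.3f}")

print(still_likely)  # [(0.9, 0.1)]
```

Among these nine combinations, only a high prior for H0 (0.9) combined with low power (0.1) leaves H0 more likely than not after p = 0.049, at roughly 0.815.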

The problem Trafimow and Rice have identified is not that p-values are meaningless, but that large simulations where random values are chosen for the prior probability and the power do not clearly reveal a relationship between p-values and P(H0|D), and that quantifying the relation between two variables with the improper linear term does not explain a lot of variation. Figure 1 consists of many single points from randomly chosen curves as shown in Figure 2.

No need to ban p-values

This poor man’s Bayesian updating function lacks some of the finesse a real Bayesian updating procedure has. For example, it treats a p = 0.049 as an outcome that has a 5% probability of being observed when there is a true effect. This dichotomous thinking keeps things simple, but it’s also incorrect, because in a high-powered experiment, a p = 0.049 will rarely be observed (p’s < 0.01 are massively more likely) and more sensitive updating functions, such as true Bayesian statistics, will allow you to evaluate outcomes more precisely. Obviously, any weakness in the poor man’s Bayesian updating formula also applies to its use to criticize the relation between p-values and the posterior probability the null-hypothesis is true, as Trafimow and Rice (2009) have done (also see the P.S.).

Significant p-values generally make the null-hypothesis less likely, as long as the alpha level is chosen sensibly. When the sample size is very large, and the statistical power is very high, choosing an alpha level of 0.05 can lead to situations where a p-value smaller than 0.05 is actually more likely to be observed when the null-hypothesis is true, than when the alternative hypothesis is true (Lakens & Evers, 2014). Researchers who have compared true Bayesian statistics with p-values acknowledge they will often lead to similar conclusions, but recommend decreasing the alpha level as a function of the sample size (e.g., Cameron & Trivedi, 2005, p. 279; Good, 1982; Zellner, 1971, p. 304). Some recommendations have been put forward, but these have not yet been evaluated extensively. For now, researchers are simply advised to use their own judgment when setting their alpha level for analyses where sample sizes are large, or statistical power is very high. Alternatively, researchers might opt to never draw conclusions about the evidence for or against the null-hypothesis, and simply aim to control their error rates in lines of research that test theoretical predictions, following a Neyman-Pearson perspective on statistical inferences.

If you really want to make statements about the probability the null-hypothesis is true, given the data, p-values are not the tool of choice (Bayesian statistics is). But p-values are related to evidence (Good, 1992), and in exploratory research where priors are uncertain and the power of the test is unknown, p-values might be something to fall back on. There is absolutely no reason to dismiss or even ban them because of the ‘poorness’ of the relation between p-values and the posterior probability that the null-hypothesis is true. What is needed is a better understanding of the relationship between p-values and the probability the null-hypothesis is true by educating all researchers how to correctly interpret p-values.
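The claim that a just-significant p-value can be more likely under the null than under the alternative when power is very high can be checked numerically. Here is a minimal Python sketch, assuming a two-sided z-test. That is a simplification chosen here for illustration and is not taken from Lakens & Evers (2014). The function names and the 0.04 to 0.05 window are also illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentrality_for_power(power, alpha=0.05):
    """Shift of the z statistic that gives (approximately) this power."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def prob_p_below(x, delta):
    """P(two-sided p-value < x) when the z statistic is N(delta, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - x / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - delta)
    lower_tail = STD_NORMAL.cdf(-z_crit - delta)
    return upper_tail + lower_tail


def prob_p_between(lo, hi, delta):
    """P(lo < p < hi) when the z statistic is N(delta, 1)."""
    return prob_p_below(hi, delta) - prob_p_below(lo, delta)


# Under H0 (delta = 0) p-values are uniform, so this is 0.01.
under_h0 = prob_p_between(0.04, 0.05, 0.0)

for power in (0.50, 0.80, 0.95, 0.99):
    delta = noncentrality_for_power(power)
    under_h1 = prob_p_between(0.04, 0.05, delta)
    print(f"power={power}: P(.04<p<.05|H1)={under_h1:.4f}, "
          f"P(.04<p<.05|H0)={under_h0:.4f}, "
          f"P(p<.01|H1)={prob_p_below(0.01, delta):.3f}")
```

Under these assumptions, a p-value between 0.04 and 0.05 is clearly more likely under H1 than under H0 at 50% or 80% power, about equally likely at 95% power, and less likely under H1 than under H0 at 99% power, while p < 0.01 becomes by far the most common outcome.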


P.S.

This might be a good moment to note that Trafimow and Rice (2009) calculate posterior probabilities of p-values assuming the alternative hypothesis is true, but simulate studies with uniform p-values, meaning that the null-hypothesis is true. This is somewhat peculiar. Hagen (1997) explains the correct updating formula under the assumption that the null-hypothesis is true. He proposes to use exact p-values, when the null hypothesis is true, but I disagree. When the null-hypothesis is true, every p-value is equally likely, and thus I would calculate P(H0|D) using:

Assuming H0 and H1 are a-priori equally likely, the formula simplifies to:

This formula shows how the null-hypothesis can become more likely when a non-significant result is observed, contrary to the popular belief that non-significant findings don't tell you anything about the likelihood that the null-hypothesis is true. It does so not as a function of the p-value you observe (after all, p-values are uniformly distributed under the null, so every p-value is equally uninformative), but through Bayes’ formula. The higher the power of a study, the more likely the null-hypothesis becomes after a non-significant result.
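The last point, that higher power makes the null more likely after a non-significant result, can be illustrated with a small Python sketch. It assumes the data D is simply "a non-significant result", with probability 1 - alpha under H0 and 1 - power under H1. That is plain Bayes' rule and one plausible reading of the argument, not necessarily the exact formula intended above. The function name and the default alpha of 0.05 are illustrative.

```python
def posterior_h0_nonsignificant(power, prior_h0=0.5, alpha=0.05):
    """P(H0 | non-significant result) via Bayes' rule.

    Assumption: P(non-significant | H0) = 1 - alpha and
                P(non-significant | H1) = 1 - power.
    """
    true_negative = prior_h0 * (1 - alpha)
    false_negative = (1 - prior_h0) * (1 - power)
    return true_negative / (true_negative + false_negative)


# Equal priors: the more power, the more a null result favors H0.
for power in (0.50, 0.80, 0.95):
    print(power, round(posterior_h0_nonsignificant(power), 3))
# 0.5 0.655
# 0.8 0.826
# 0.95 0.95
```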

References

Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. New York: Cambridge University Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Good, I. J. (1982). Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16, 65-66.
Good, I. J. (1992). The Bayes/non-Bayes compromise: A brief review. Journal of the American Statistical Association, 87, 597-606.
Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.
Lakens, D., & Evers, E. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9, 278-292. DOI: 10.1177/1745691614528520.
Lew, M. J. (2013). To P or not to P: on the evidential nature of P-values and their place in scientific inference. arXiv:1311.0081.
Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36, 1–2.
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37, 1–2.
Trafimow, D., & Rice, S. (2009). A test of the null hypothesis significance testing procedure correlation argument. The Journal of General Psychology, 136, 261-270.
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: John Wiley.