A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Tuesday, May 26, 2015

After how many p-values between 0.025-0.05 should one start getting concerned about robustness?

Like insects sensitive to the infrared spectrum, researchers have evolved a special sensitivity to p-values between 0.025 and 0.05. Sure, you’ve got your p < .001’s and n.s.’s, but your life as a scientist might depend on your special sensitivity to 0.025<p<0.05.

 Our increased sensitivity to these p-values might make us forget what a small part of the p-value spectrum we are talking about here – just 2.5%. At the same time, we are slowly realizing that too many high p-values are rather unlikely. This lead Michael Inzlicht to wonder:

Now if someone who is so serious about improving the way he works as Michael Inzlicht wants to know something, I’m more than happy to help.

I wanted to give the best possible probability. Using this handy interactive visualization it is easy to move some sliders around and see which power has the highest percentage of p-values between 0.025 and 0.05 (give it a try = it’s around 56% power, when approximately 11% of p-values will fall within this small section). If we increase or decrease the power, p-values are either spread out more uniformly, or most of them will be very small. Assuming we are examining a true effect, the probability of finding two p-values in a row within 0.025 and 0.05 is simply 11% times 11%, or 0.11*0.11=0.012. At the very best, published papers that simply report what they find will contain two p-values between 0.025 and 0.05 1.2% of the time.

NOTE 1: Richard Morey noted on Twitter this calculation ignores how researchers will typically not run two studies in a row, regardless of the outcome of the first study. They will typically run Study 2 only if Study 1 was statistically significant. If so, we need to calculate the conditional probability that Study 2 found a significant effect between 0.025-0.05, conditional on the probability that Study 1 found a significant effect between 0.025-0.05 (with 56% power). Thus: p(0.025<p<0.05|p<0.05, assuming 56% power). This probability is 21%, which makes the probability across two studies 0,21*0,11=0.023, or 2.3%. 

We can also simulate independent t-tests with 56% power, and count how often we find two p-values between 0.025 and 0.05 in a row. The R script below gives us the same answer to our question.

Note 2: Frederik Aust and Rogier Kievit remarked on Twitter that if you multiply enough probabilities, the number will always be small in the end. I agree. We can compare observing 2 p-values between 0.025-0.05 when we have 56% power with observing these two p-values when the null-hypothesis is true. If we again take the conditional probability, this is 0.5*0.025 - 0.0125. The conditional probability with 56% power was 0.023. This means the observed pattern across two studies is 1.84 times more likely under the alternative hypothesis than under the null-hypothesis. Even though we have 2 significant studies, the evidence for the alternative hypothesis is purely anecdotal. 

If you want to focus on the probability of p-values between 0.01 and 0.05, the interactive visualization shows the optimal power is around 62%, when approximately 24% of the p-values will fall between 0.01 and 0.05. Finding two studies within these p-values is not improbable (it happens around 0.24*0.24 = 6% of the time), but a third study within this interval occurs only 1.3% of the time. Again, it can happen, but not very often. 

If you want to use the frequency of p-values as an indication of the robustness of results, don't decide when to use the rule or which boundaries you will use after looking at the data. But if you always use the rule to doubt the robustness of two p-values between 0.025 and 0.05 in two study papers, and three p-values between 0.01 and 0.05 in three study papers, you won't make too many errors in the long run. An obvious exception is when authors pre-registered all their studies. 

What should you do as an editor when you encounter a set of studies with p-values that are relatively unlikely to occur? First, you can ask the authors to discuss the situation. For example, when you explicitly mention the set of studies becomes more probable when a non-significant finding is added, the authors might be happy to oblige. Second, one of my favorite solutions is to decide upon an in principle acceptance (assuming the article is fit for publication), but ask the authors to add one replication. The authors are guaranteed of publication irrespective of the outcome of the replication, but we have a better knowledge of what is likely to be true.

Monday, May 18, 2015

The perfect t-test

I've created an easy to use R script that will import your data, and performs and writes up a state-of-the-art dependent or independent t-test. The goal of this script is to examine whether more researcher-centered statistical tools (i.e., a one-click analysis script that checks normality assumptions, calculates effect sizes and their confidence intervals, creates good figures, calculates Bayesian and robust statistics, and writes the results section) increases the use of novel statistical procedures. Download the script here: https://github.com/Lakens/Perfect-t-test. For comments, suggestions, or errors, e-mail me at D.Lakens@tue.nl. The script will likely be updated - check back for updates or follow me @Lakens to be notified of updates.

Correctly comparing two groups is remarkably challenging. When performing a t-test researchers rarely manage to follow all recommendations that statisticians have made over the years. Where statisticians update their recommendations, statistical textbooks often do not. Even though reporting effect sizes and their confidence intervals has been recommended for decades (e.g., Cohen, 1990), statistical software (e.g., SPSS 22) often does not provide these statistics. Progress is slow, and Sharpe (2013) points to a lack of awareness, a lack of time, a lack of easily usable software, and a lack of education as some of the main reasons for the resistance to adopting statistical innovations.

Here, I propose a way to speed up the widespread adoption of the state-of-the-art statistical techniques by providing researchers with an easy to use script in free statistical software (R) that will perform and report all statistical analyses, practically with a single button press. The script (Lakens, 2015, available at https://github.com/Lakens/Perfect-t-test) follows state-of-the-art recommendations (see below), creates plots of the data, and writes the results section, including a minimally required interpretation of the statistical results.

Automated analyses might strike readers as a bad idea because it facilitates mindless statistics. Having performed statistics mindlessly for most of my professional career, I sincerely doubt access to this script would have reduced my level of understanding. If anything, reading an automatically generated results section of your own data that includes statistics you are not accustomed to calculate or report is likely to make you think more about the usefulness of these statistics, not less. However, the goal of this script is not to educate people. The main goal is to get researchers to perform and report the analyses they should, and make this as efficient as possible.

Comparing two groups

Keselman, Othman, Wilcox, and Fradette (2004) proposed the a more robust two-sample t-test that provides better Type 1 error control in situations of variance heterogeneity and nonnormality, but their recommendations have not been widely implemented. Researchers might in general be unsure whether it is necessary to change the statistical tests they use to analyze and report comparisons between groups. As Wilcox, Granger, and Clark (2013, p. 29) remark: “All indications are that generally, the safest way of knowing whether a more modern method makes a practical difference is to actually try it.” Making sure conclusions based on multiple statistical approaches converge is an excellent way to gain confidence in your statistical inferences. This R script calculates traditional Frequentist statistics, Bayesian statistics, and robust statistics, using both a hypothesis testing as an estimation approach, to invite researchers to examine their data from different perspectives.

Since Frequentist and Bayesian statistics are based on assumptions of equal variances and normally distributed data, the R script provides boxplots and histograms with kernel density plots overlaid with a normal distribution curve to check for outliers and normality. Kernel density plots are a non-parametric technique to visualize the distribution of a continuous variable. They are similar to a histogram, but less dependent on the specific choice of bins used when creating a histogram. The graphs plot both the normal distribution, as the kernel density function, making it easier to visually check whether the data is normally distributed or not. Q-Q plots are provided as an additional check for normality.

Yap and Sim (2011) show that no single test for normality will perform optimally for all possible distributions. They conclude (p. 2153): “If the distribution is symmetric with low kurtosis values (i.e. symmetric short-tailed distribution), then the D'Agostino-Pearson and Shapiro-Wilkes tests have good power. For symmetric distribution with high sample kurtosis (symmetric long-tailed), the researcher can use the JB, Shapiro-Wilkes, or Anderson-Darling test." All four normality tests are provided in the R script. Levene’s test for the equality of variances is provided, although for independent t-tests, Welch’s t-test (which does not require equal variances) is provided by default, following recommendations by Ruxton (2006). A short explanation accompanies all plots and assumption checks to help researchers to interpret the results. 

The script also creates graphs that, for example, visualize the distribution of the datapoints, and provide both within as between confidence intervals:

The script provides interpretations for effect sizes based on the classifications ‘small’, ‘medium’, and ‘large’. Default interpretations of the size of an effect based on these three categories should only be used as a last resort, and it is preferable to interpret the size of the effect in relation to other effects in the literature, or in terms of its practical significance. However, since researchers often do not interpret effect sizes (if they are reported to begin with), the default interpretation (and the suggestion to interpret effect sizes in relation to other effects in the literature) should at least function as a reminder that researchers are expected to interpret effect sizes. The common language effect size  (McGraw & Wong, 1992) is provided as an additional way to communicate the effect size.

Similarly, the Bayes Factor is classified into anecdotal, moderate, strong, very strong, and decisive evidence for the alternative or null hypothesis, following Jeffreys (1961), even though researchers are reminded that default interpretations of the strength of the evidence should not distract from the fact that strength of evidence is a continuous function of the Bayes Factor. We can expect researchers will rely less on default interpretations, the more acquainted they become with these statistics, but for novices some help in interpreting effect sizes and Bayes Factors will guide their interpretation.

Running the Markdown script

R Markdown scripts provide a way to create fully reproducible reports from data files. The script combines the commands to perform all statistical analyses with the written sections of the final output. Calculated statistics and graphs are inserted into the written report at specified locations. After installing the required packages, preparing the data, and specifying some variables in the Markdown document, the report can be generated (and thus, the analysis procedure can be performed) with a single mouse-click (scroll down for an example of the output).

The R Markdown script and the ReadMe file contain detailed instructions on how to run the script, and how to install required packages, including the PoweR package (Micheaux & Tran, 2014) to perform the normality tests, HLMdiag to create the Q-Q plots (Loy & Hofmann, 2014). ggplot2 for all plots (Wickham, 2009), car (Fox & Weisberg, 2011) to perform Levene's test, MBESS (Kelley, 2007) to calculate effect sizes and their confidence intervals, WRS for the robust statistics (Wilcox & Schönbrodt, 2015), BootsES to calculate a robust effect size for the independent t-test (Kirby & Gerlanc, 2013), BayesFactor for the bayes factor (Morey & Rouder, 2015), and BEST (Kruschke & Meredith, 2014) to calculate the Bayesian highest density interval.

The data file (which should be stored in the same folder that contains the R markdown script) needs to be tab delimited with a header at the top of the file (which can easily be created from SPSS by saving data through the 'save as' menu and selecting 'save as type: Tab delimited (*.dat)', or in Excel by saving the data as ‘Text (Tab delimited) (.txt)’. For the independent t-test the data file needs to contain at least two columns (one specifying the independent variable, and one specifying the dependent variable, and for the dependent t-test the data file needs to contain three columns, one subject identifier column, and two columns for the two dependent variables. The script for dependent t-tests allows you to select a subgroup for the analysis, as long as the data file contains an additional grouping variable (see the demo data). The data files can contain irrelevant data, which will be ignored by the script. Finally, researchers need to specify the names (or headers) of the independent and dependent variables, as well as grouping variables. Finally, there are some default settings researchers can change, such as the sidedness of the test, the alpha level, the percentage for the confidence intervals, and the scalar on the prior for the Bayes Factor.

The script can be used to create either a word document or a html document. The researchers can easily interpret all the assumption checks, look at the data for possible outliers, and (after minor adaptations) copy-paste the result sections into their article. 

The statistical results the script generates has been compared against the results provided by SPSS, JASP, ESCI, online Bayes Factor calculators, and BEST online. Minor variations in the HDI calculation between BEST online and this script are possible depending on the burn-in samples and number of samples, and for huge t-values there are minor variations between JASP and the latest version of the Bayes Factor package used in this script. This program is distributed in the hope that it will be useful, but without any warranty. If you find an error, please contact me at D.Lakens@tue.nl.

Promoting Statistical Innovations

Statistical software is built around individual statistical tests, while researchers perform a set of procedures. Although it is not possible to create standardized procedures for all statistical analyses, most, if not all, of the steps researchers have to go through when they want to report correlations, regression analyses, ANOVA’s, and meta-analyses are sufficiently structured. These tests make up a large portion of analyses reported in journal articles. Demonstrating this, David Kenny has created R scripts that will perform and report mediation and moderator analyses. Felix Schönbrodt has created a Shiny app that performs several meta-analytic techniques. Making statistical innovations more accessible has a high potential to substantially improve the quality of the statistical tests researchers perform and report. Statisticians who take the application of generated knowledge seriously should try to experiment with the best way to get researchers to use state-of-the-art techniques. R markdown scripts are an excellent method to combine statistical analyses and a written report in free software. Shiny apps might make these analyses even more accessible, because they no longer require users to install R and R packages.

Despite the name of this script, there is probably not such a thing as a ‘perfect’ report of a statistical test. Researchers might prefer to report standard errors instead of standard deviations, perform additional checks for normality, different Bayesian or robust statistics, or change the figures. The benefit of markdown scripts with a GNU license stored on GitHub is that they can be forked (copied to a new repository) where researchers are free to remove, add, or change sections of the script to create their own ideal test. After some time, a number of such scripts may be created, allowing researchers to choose an analysis procedure that most closely matches their desires. Alternatively, researchers can post feature requests or errors that can be incorporated in future versions of this script.

It is important that researchers attempt to draw the best possible statistical inferences from their data. As a science, we need to seriously consider the most efficient way to accomplish this. Time is scarce, and scientists need to master many skills in addition to statistics. I believe that some of the problems in adopting new statistical procedures discussed by Sharpe (2013) such as lack of time, lack of awareness, lack of education, and lack of easy to use software can be overcome by scripts that combine traditional and more novel statistics, are easy to use, and provide a brief explanation of what is calculated while linking to the relevant literature. This approach might be a small step towards a better understanding of statistics for individual researchers, but a large step towards better reporting practices.


Baguley, T. (2012). Calculating and graphing within-subject confidence intervals for ANOVA. Behavior research methods, 44, 158-175.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Fox, J. & Weisberg, S. (2011). An R Companion to Applied Regression, Second edition. Sage, Thousand Oaks CA.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press, Clarendon Press.
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20, 1-24.
Kirby, K. N., & Gerlanc, D. (2013). BootES: An R package for bootstrap confidence intervals on effect sizes. Behavior Research Methods, 45, 905-927.
Kruschke, J. K., & Meredith, M. (2014). BEST: Bayesian Estimation Supersedes the t-test. R package version 0.2.2, URL: http://CRAN.R-project.org/package=BEST.
Lakens, D. (2015). The perfect t-test (version 0.1.0). Retrieved from https://github.com/Lakens/perfect-t-test. doi:10.5281/zenodo.17603
Loy, A., & Hofmann, H. (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models. R. Journal of Statistical Software, 56, pp. 1-28. URL: http://www.jstatsoft.org/v56/i05/.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361-365.
Micheaux, P., & Tran, V. (2012). PoweR. URL: http://www.biostatisticien.eu/PoweR/.
Morey R and Rouder J (2015). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.11-1, URL: http://CRAN.R-project.org/package=BayesFactor
Sharpe, D. (2013). Why the resistance to statistical innovations? Bridging the communication gap. Psychological Methods, 18, 572-582.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer New York. ISBN 978-0-387-98140-6, URL: http://had.co.nz/ggplot2/book.
Wilcox, R. R., Granger, D. A., Clark, F. (2013). Modern robust statistical methods: Basics with illustrations using psychobiological data. Universal Journal of Psychology, 1, 21-31.
Wilcox, R. R., & Schönbrodt, F. D. (2015). The WRS package for robust statistics in R (version 0.27.5). URL: https://github.com/nicebread/WRS.

Yap, B. W., & Sim, C. H. (2011). Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141-2155.