A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, May 18, 2015

The perfect t-test

I've created an easy to use R script that will import your data, and performs and writes up a state-of-the-art dependent or independent t-test. The goal of this script is to examine whether more researcher-centered statistical tools (i.e., a one-click analysis script that checks normality assumptions, calculates effect sizes and their confidence intervals, creates good figures, calculates Bayesian and robust statistics, and writes the results section) increases the use of novel statistical procedures. Download the script here: https://github.com/Lakens/Perfect-t-test. For comments, suggestions, or errors, e-mail me at D.Lakens@tue.nl. The script will likely be updated - check back for updates or follow me @Lakens to be notified of updates.

Correctly comparing two groups is remarkably challenging. When performing a t-test researchers rarely manage to follow all recommendations that statisticians have made over the years. Where statisticians update their recommendations, statistical textbooks often do not. Even though reporting effect sizes and their confidence intervals has been recommended for decades (e.g., Cohen, 1990), statistical software (e.g., SPSS 22) often does not provide these statistics. Progress is slow, and Sharpe (2013) points to a lack of awareness, a lack of time, a lack of easily usable software, and a lack of education as some of the main reasons for the resistance to adopting statistical innovations.

Here, I propose a way to speed up the widespread adoption of the state-of-the-art statistical techniques by providing researchers with an easy to use script in free statistical software (R) that will perform and report all statistical analyses, practically with a single button press. The script (Lakens, 2015, available at https://github.com/Lakens/Perfect-t-test) follows state-of-the-art recommendations (see below), creates plots of the data, and writes the results section, including a minimally required interpretation of the statistical results.

Automated analyses might strike readers as a bad idea because it facilitates mindless statistics. Having performed statistics mindlessly for most of my professional career, I sincerely doubt access to this script would have reduced my level of understanding. If anything, reading an automatically generated results section of your own data that includes statistics you are not accustomed to calculate or report is likely to make you think more about the usefulness of these statistics, not less. However, the goal of this script is not to educate people. The main goal is to get researchers to perform and report the analyses they should, and make this as efficient as possible.

Comparing two groups

Keselman, Othman, Wilcox, and Fradette (2004) proposed the a more robust two-sample t-test that provides better Type 1 error control in situations of variance heterogeneity and nonnormality, but their recommendations have not been widely implemented. Researchers might in general be unsure whether it is necessary to change the statistical tests they use to analyze and report comparisons between groups. As Wilcox, Granger, and Clark (2013, p. 29) remark: “All indications are that generally, the safest way of knowing whether a more modern method makes a practical difference is to actually try it.” Making sure conclusions based on multiple statistical approaches converge is an excellent way to gain confidence in your statistical inferences. This R script calculates traditional Frequentist statistics, Bayesian statistics, and robust statistics, using both a hypothesis testing as an estimation approach, to invite researchers to examine their data from different perspectives.

Since Frequentist and Bayesian statistics are based on assumptions of equal variances and normally distributed data, the R script provides boxplots and histograms with kernel density plots overlaid with a normal distribution curve to check for outliers and normality. Kernel density plots are a non-parametric technique to visualize the distribution of a continuous variable. They are similar to a histogram, but less dependent on the specific choice of bins used when creating a histogram. The graphs plot both the normal distribution, as the kernel density function, making it easier to visually check whether the data is normally distributed or not. Q-Q plots are provided as an additional check for normality.

Yap and Sim (2011) show that no single test for normality will perform optimally for all possible distributions. They conclude (p. 2153): “If the distribution is symmetric with low kurtosis values (i.e. symmetric short-tailed distribution), then the D'Agostino-Pearson and Shapiro-Wilkes tests have good power. For symmetric distribution with high sample kurtosis (symmetric long-tailed), the researcher can use the JB, Shapiro-Wilkes, or Anderson-Darling test." All four normality tests are provided in the R script. Levene’s test for the equality of variances is provided, although for independent t-tests, Welch’s t-test (which does not require equal variances) is provided by default, following recommendations by Ruxton (2006). A short explanation accompanies all plots and assumption checks to help researchers to interpret the results. 

The script also creates graphs that, for example, visualize the distribution of the datapoints, and provide both within as between confidence intervals:

The script provides interpretations for effect sizes based on the classifications ‘small’, ‘medium’, and ‘large’. Default interpretations of the size of an effect based on these three categories should only be used as a last resort, and it is preferable to interpret the size of the effect in relation to other effects in the literature, or in terms of its practical significance. However, since researchers often do not interpret effect sizes (if they are reported to begin with), the default interpretation (and the suggestion to interpret effect sizes in relation to other effects in the literature) should at least function as a reminder that researchers are expected to interpret effect sizes. The common language effect size  (McGraw & Wong, 1992) is provided as an additional way to communicate the effect size.

Similarly, the Bayes Factor is classified into anecdotal, moderate, strong, very strong, and decisive evidence for the alternative or null hypothesis, following Jeffreys (1961), even though researchers are reminded that default interpretations of the strength of the evidence should not distract from the fact that strength of evidence is a continuous function of the Bayes Factor. We can expect researchers will rely less on default interpretations, the more acquainted they become with these statistics, but for novices some help in interpreting effect sizes and Bayes Factors will guide their interpretation.

Running the Markdown script

R Markdown scripts provide a way to create fully reproducible reports from data files. The script combines the commands to perform all statistical analyses with the written sections of the final output. Calculated statistics and graphs are inserted into the written report at specified locations. After installing the required packages, preparing the data, and specifying some variables in the Markdown document, the report can be generated (and thus, the analysis procedure can be performed) with a single mouse-click (scroll down for an example of the output).

The R Markdown script and the ReadMe file contain detailed instructions on how to run the script, and how to install required packages, including the PoweR package (Micheaux & Tran, 2014) to perform the normality tests, HLMdiag to create the Q-Q plots (Loy & Hofmann, 2014). ggplot2 for all plots (Wickham, 2009), car (Fox & Weisberg, 2011) to perform Levene's test, MBESS (Kelley, 2007) to calculate effect sizes and their confidence intervals, WRS for the robust statistics (Wilcox & Schönbrodt, 2015), BootsES to calculate a robust effect size for the independent t-test (Kirby & Gerlanc, 2013), BayesFactor for the bayes factor (Morey & Rouder, 2015), and BEST (Kruschke & Meredith, 2014) to calculate the Bayesian highest density interval.

The data file (which should be stored in the same folder that contains the R markdown script) needs to be tab delimited with a header at the top of the file (which can easily be created from SPSS by saving data through the 'save as' menu and selecting 'save as type: Tab delimited (*.dat)', or in Excel by saving the data as ‘Text (Tab delimited) (.txt)’. For the independent t-test the data file needs to contain at least two columns (one specifying the independent variable, and one specifying the dependent variable, and for the dependent t-test the data file needs to contain three columns, one subject identifier column, and two columns for the two dependent variables. The script for dependent t-tests allows you to select a subgroup for the analysis, as long as the data file contains an additional grouping variable (see the demo data). The data files can contain irrelevant data, which will be ignored by the script. Finally, researchers need to specify the names (or headers) of the independent and dependent variables, as well as grouping variables. Finally, there are some default settings researchers can change, such as the sidedness of the test, the alpha level, the percentage for the confidence intervals, and the scalar on the prior for the Bayes Factor.

The script can be used to create either a word document or a html document. The researchers can easily interpret all the assumption checks, look at the data for possible outliers, and (after minor adaptations) copy-paste the result sections into their article. 

The statistical results the script generates has been compared against the results provided by SPSS, JASP, ESCI, online Bayes Factor calculators, and BEST online. Minor variations in the HDI calculation between BEST online and this script are possible depending on the burn-in samples and number of samples, and for huge t-values there are minor variations between JASP and the latest version of the Bayes Factor package used in this script. This program is distributed in the hope that it will be useful, but without any warranty. If you find an error, please contact me at D.Lakens@tue.nl.

Promoting Statistical Innovations

Statistical software is built around individual statistical tests, while researchers perform a set of procedures. Although it is not possible to create standardized procedures for all statistical analyses, most, if not all, of the steps researchers have to go through when they want to report correlations, regression analyses, ANOVA’s, and meta-analyses are sufficiently structured. These tests make up a large portion of analyses reported in journal articles. Demonstrating this, David Kenny has created R scripts that will perform and report mediation and moderator analyses. Felix Schönbrodt has created a Shiny app that performs several meta-analytic techniques. Making statistical innovations more accessible has a high potential to substantially improve the quality of the statistical tests researchers perform and report. Statisticians who take the application of generated knowledge seriously should try to experiment with the best way to get researchers to use state-of-the-art techniques. R markdown scripts are an excellent method to combine statistical analyses and a written report in free software. Shiny apps might make these analyses even more accessible, because they no longer require users to install R and R packages.

Despite the name of this script, there is probably not such a thing as a ‘perfect’ report of a statistical test. Researchers might prefer to report standard errors instead of standard deviations, perform additional checks for normality, different Bayesian or robust statistics, or change the figures. The benefit of markdown scripts with a GNU license stored on GitHub is that they can be forked (copied to a new repository) where researchers are free to remove, add, or change sections of the script to create their own ideal test. After some time, a number of such scripts may be created, allowing researchers to choose an analysis procedure that most closely matches their desires. Alternatively, researchers can post feature requests or errors that can be incorporated in future versions of this script.

It is important that researchers attempt to draw the best possible statistical inferences from their data. As a science, we need to seriously consider the most efficient way to accomplish this. Time is scarce, and scientists need to master many skills in addition to statistics. I believe that some of the problems in adopting new statistical procedures discussed by Sharpe (2013) such as lack of time, lack of awareness, lack of education, and lack of easy to use software can be overcome by scripts that combine traditional and more novel statistics, are easy to use, and provide a brief explanation of what is calculated while linking to the relevant literature. This approach might be a small step towards a better understanding of statistics for individual researchers, but a large step towards better reporting practices.


Baguley, T. (2012). Calculating and graphing within-subject confidence intervals for ANOVA. Behavior research methods, 44, 158-175.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Fox, J. & Weisberg, S. (2011). An R Companion to Applied Regression, Second edition. Sage, Thousand Oaks CA.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press, Clarendon Press.
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20, 1-24.
Kirby, K. N., & Gerlanc, D. (2013). BootES: An R package for bootstrap confidence intervals on effect sizes. Behavior Research Methods, 45, 905-927.
Kruschke, J. K., & Meredith, M. (2014). BEST: Bayesian Estimation Supersedes the t-test. R package version 0.2.2, URL: http://CRAN.R-project.org/package=BEST.
Lakens, D. (2015). The perfect t-test (version 0.1.0). Retrieved from https://github.com/Lakens/perfect-t-test. doi:10.5281/zenodo.17603
Loy, A., & Hofmann, H. (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models. R. Journal of Statistical Software, 56, pp. 1-28. URL: http://www.jstatsoft.org/v56/i05/.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361-365.
Micheaux, P., & Tran, V. (2012). PoweR. URL: http://www.biostatisticien.eu/PoweR/.
Morey R and Rouder J (2015). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.11-1, URL: http://CRAN.R-project.org/package=BayesFactor
Sharpe, D. (2013). Why the resistance to statistical innovations? Bridging the communication gap. Psychological Methods, 18, 572-582.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer New York. ISBN 978-0-387-98140-6, URL: http://had.co.nz/ggplot2/book.
Wilcox, R. R., Granger, D. A., Clark, F. (2013). Modern robust statistical methods: Basics with illustrations using psychobiological data. Universal Journal of Psychology, 1, 21-31.
Wilcox, R. R., & Schönbrodt, F. D. (2015). The WRS package for robust statistics in R (version 0.27.5). URL: https://github.com/nicebread/WRS.

Yap, B. W., & Sim, C. H. (2011). Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141-2155.


  1. significant p-value
    confidence interval
    credible interval
    normal distribution
    and now
    perfect t-test

    Why is it so important to make up names that include judgemental qualifiers? Are all other t-tests (including every future proposal) "imperfect"? You write that there is no "perfect" test. So whats the point of calling it that way?

    I think "a bank of two-sample difference tests" describes much better what you do. The qualifier "difference" is crucial. If the relevant ES is a quotient or some more complex model is needed then your perfect test will provide misleading results.

    As you probably know by now, I'm opposed to one-click solutions. One-click solutions usually don't mean less time spent with data analysis = more free time for other stuff. Rather the time for data analysis is held constant and the number of one-click analyses of the same dataset increases = data dredging.

    1. Hi Matus,

      thanks for your comment - I was affraid no one would be critical about this! The 'perfect t-test' name is pure PR. It draws attention. Remember I'm trying to convince people to use it, because I think it will improve the way they report statistics, so everything is allowed to draw them away from SPSS to R (surely, you must agree?).

      If it leads to more data dredging - I don't know. It's not something this is meant to solve - there is pre-registration for that, and I invested some time in a special issue on pre-registered replications to combat results due to data dredging. One thing at a time!


    2. > I was affraid no one would be critical about this!
      Haha, you can always count on me in this regard.

      PR: You could go into a (sarcastic) clickbait-overdrive with something like: "BLAST your way to SUPER-$$$ignificant results with the Lakens TVRBO-T!!!" :D

  2. Hi Daniel,

    In a previous post, you wrote "It shows that when interpreting p-values, it is important to take the power of the study into consideration.". I completely agree, that's why I regret the absence of power calculation.

    In any case your perfect t-test is terrific!


    1. Thanks! The power of a study is not a statistical inference - it requires knowledge about the true effect size, which requires a theoretical model. So I don't see how my script could do that. Instead, after you get the p-value, a human needs to take power into account. Pos-hoc power is not useful for that, as I explained on this blog.

  3. Isn't it possible to add a functionality to calculate power based on the assumed true effect size à la pwr.t.test?

    1. Yes, but it seems more useful to integrate that into the p-checker app: http://shinyapps.org/apps/p-checker/

    2. Oops... I didn't click 'reply', sorry.
      I thought it would have been better all together, but it's just my opinion ;-)

  4. Professor Lakens,

    Excellent post! I will study it meticulously.
    I have some questions, though:
    1) Why you, EJ Wagenmakers and others seem to prefer Bayes Factor as the representant of Bayesian inference, if there is some advocacy against it (e.g., http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2606016)?
    2) The Markdown has exact the same look of Kruschke's Doing Bayesian Data Analysis. Is it the default or are you sharing scripts? :)
    3) I also believe that we (psychologists) should be more well-informed of new statistical methods that surpass the traditional ones (I use only Welch's test since I read an older post of yours). But, being our field slow in updating, do you think that a paper reporting your "perfect t test" would have the same publication's probability than the same paper reporting a common t test? If not, how to overcome this barrier?


  5. Thanks for the script - such kind of tools made by professional statistician can really motivate us (backwoods-biologists) to learning R. Special thanks for clear and comprehensive text accompanying results - i think script really earn this "perfect" in the title. Any plans/possibility to write similar script and supplementary text for few groups comparison (OW-ANOVA, K-W, some modern randomization/permutations stuff)?

    PS. I tried independent t-test version. There is typo "whih" in the text. In Robust statistics section "Figure 1. Means and 95% CI, and violin plot" only first part of fig. is plotted (no violin plot - and no warning message in R output). Just a feedback to let your improve the code/text - thank you very much for you work!

    1. Still playing with script, now on my own data - so a bit more feedback. The statement "This script uses the reshape2 package to convert data from wide to long format" seems to be wrong - user need to use melt() by their own. The "ppnr" column (for stable tab-delimited input?) may be omitted. The color highlighting of user variables and main results in the output text is much desired.

  6. You forgot the Keselman article in the references. Nice work though!

  7. FYI: package ‘yuenbt’ is not available (for R version 3.2.3), this breaks the script.

    1. Thanks for informing me - I'm still thinking about how this script will be shared, finally, and whether I will maintain it, and if so, how.

    2. I'm sure it took you a lot of time to prepare this and I'm sure it would be very useful for a number of people. However, I think you might be running the risk of creating something, you'd have to keep on updating constantly (library/R updates; not to mention answering numerous questions and requests), so perhaps a tutorial for doing all those things you script aims to achieve would be less time consuming and serve an educational purpose at the same time. After all, those who would be able to work around potential bugs, won't really use it; R newbies will be unable to use a broken version, so I guess R's life-cycle renders the task of maintaining the perfect t-test script almost impossible without crowd sourcing its maintenance.

    3. As an R newbie just discovering this script, I am getting the 'yuenbt' error— any easy workarounds?

      Thanks for all your work Daniel!

    4. I've fixed the yuenbt error (I think - the script now runs on my computer).

    5. So far only for the independent t-test. If I have more time, will fix it for the dependent t-test as well.

  8. Hi Daniel, many thanks for this resource. Would there be any chance of converting them to a Shiny app, to really drop the barriers to use? A really nice example of similar resources that have been implemented in a Web interface is https://www.estimationstats.com/#/