Friday, February 27, 2015

Which statistics should you report?

Throughout the history of psychological science, there has been a continuing debate about which statistics should be used and how they should be reported. I distinguish between reporting statistics and interpreting statistics. This distinction matters, because much of the criticism of the statistics researchers use concerns how statistics are interpreted, not how they are reported.

When it comes to reporting statistics, my approach is simple: the more, the merrier. At the very minimum, descriptive statistics (e.g., means and standard deviations) are required to understand the reported data, preferably complemented with visualizations of the data (for example, in online supplementary material). This should include the sample sizes (per condition in between-subject designs) and the correlations between dependent variables in within-subject designs. The number of participants per condition, and especially the correlation between dependent variables, are often not reported, but they are necessary for future meta-analyses.
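
As a concrete illustration, reporting these descriptives for a two-condition within-subject design could look something like the minimal Python sketch below (the simulated reaction times and variable names are purely hypothetical):

import numpy as np

rng = np.random.default_rng(1)
n = 95
congruent = rng.normal(15, 4.5, n)             # simulated RTs, condition 1
incongruent = congruent + rng.normal(9, 5, n)  # simulated RTs, condition 2

for label, x in [("Congruent", congruent), ("Incongruent", incongruent)]:
    print(f"{label}: n = {x.size}, M = {x.mean():.2f}, SD = {x.std(ddof=1):.2f}")

# The correlation between the dependent measures is rarely reported, but it is
# needed to compute the standard error of the paired difference in meta-analyses.
r = np.corrcoef(congruent, incongruent)[0, 1]
print(f"Correlation between conditions: r = {r:.2f}")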

If you want to communicate the probability that the alternative hypothesis is true, given the data, you should report Bayesian statistics, such as Bayes Factors. But in novel lines of research, you might simply want to know whether your data are surprising, assuming there is no true effect, and choose to report p-values for this purpose.
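
To make the Bayesian option concrete, here is a minimal sketch (not a reference implementation) of a default Bayes factor for a one-sample or paired t-test, assuming a Cauchy prior with scale r = √2/2 on the standardized effect size, computed from just the t-value and the sample size:

import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """BF10 for H1: delta ~ Cauchy(0, r) versus H0: delta = 0."""
    df = n - 1
    # Marginal likelihood under H0 (up to a constant shared with H1).
    m0 = (1 + t**2 / df) ** (-(df + 1) / 2)

    # Marginal likelihood under H1, integrating over the mixing parameter g of
    # the Cauchy prior written as a scale mixture of normals.
    def integrand(g):
        a = 1 + n * g * r**2
        return (a**-0.5
                * (1 + t**2 / (a * df)) ** (-(df + 1) / 2)
                * (2 * np.pi) ** -0.5 * g**-1.5 * np.exp(-1 / (2 * g)))

    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0

# The t value from the Stroop example later in this post gives a log(BF10) of
# roughly 52, in line with the logBF10 reported there.
bf10 = jzs_bf10(t=-14.54, n=95)
print(f"BF10 = {bf10:.3g}, log(BF10) = {np.log(bf10):.2f}")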

But there is a much more important aspect to consider when reporting statistics. Given that every study is merely a data point in a future meta-analysis, all meta-analytic data should be reported, so that the study can be included in future meta-analyses.

What is meta-analytic data?

What meta-analytic data are depends on the meta-analytic technique that is used. The most widely known technique is the meta-analysis of effect sizes (often simply abbreviated to meta-analysis). In a meta-analysis of effect sizes, researchers typically combine standardized effect sizes across studies and provide an estimate of, and a confidence interval around, the meta-analytic effect size. A number of standardized effect sizes exist to combine effects across studies that use different measures, such as Cohen’s d, correlations, and odds ratios (note that there are often many different ways to calculate each of these effect sizes).
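
As an illustration of the simplest, fixed-effect (inverse-variance weighted) version of such a meta-analysis, the sketch below pools a set of made-up standardized effect sizes and their sampling variances:

import numpy as np
from scipy import stats

d = np.array([0.42, 0.31, 0.55, 0.18])  # made-up standardized effect sizes
v = np.array([0.04, 0.02, 0.06, 0.03])  # their made-up sampling variances

w = 1 / v                               # inverse-variance weights
d_meta = np.sum(w * d) / np.sum(w)      # pooled effect size estimate
se_meta = np.sqrt(1 / np.sum(w))        # standard error of the pooled estimate
z = stats.norm.ppf(0.975)
print(f"Meta-analytic d = {d_meta:.2f}, "
      f"95% CI [{d_meta - z * se_meta:.2f}, {d_meta + z * se_meta:.2f}]")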

Recently, novel meta-analytic techniques have been developed. For example, p-curve analysis uses test statistics (e.g., t-values, F-values, and their degrees of freedom) as input and analyses the distribution of p-values. This analysis can indicate that the distribution of significant p-values is uniform (as expected when the null hypothesis is true), or that it is right-skewed (as expected when the alternative hypothesis is true). P-curve analysis has a number of benefits, the most noteworthy being that it is performed only on p-values below 0.05. Due to publication bias, non-significant effects are often not shared between researchers, which is a challenge for meta-analyses of effect sizes. Because p-curve analysis does not require access to non-significant results to evaluate the evidential value of a set of studies, it is an important complement to the meta-analysis of effect sizes. Similarly, Bayesian meta-analytic techniques often rely on test statistics, not on standardized effect sizes.
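
To illustrate the logic behind the right-skew test in p-curve analysis (this is a simplified sketch with made-up test statistics, not the full p-curve procedure): conditional on being significant, p-values are uniform between 0 and 0.05 when the null hypothesis is true, so the rescaled pp-values can be combined, for example with Fisher's method, to test for right skew:

import numpy as np
from scipy import stats

t_values = np.array([2.31, 3.05, 2.62, 2.88])  # hypothetical significant t tests
dfs = np.array([28, 44, 36, 52])               # their degrees of freedom

p = 2 * stats.t.sf(np.abs(t_values), dfs)      # two-sided p-values
pp = p[p < 0.05] / 0.05                        # rescale, conditioning on significance

# Fisher's method: under the uniform null, -2 * sum(log(pp)) follows a
# chi-square distribution with 2k degrees of freedom; a small p-value here
# indicates right skew, i.e. evidential value.
chi2 = -2 * np.sum(np.log(pp))
p_right_skew = stats.chi2.sf(chi2, df=2 * pp.size)
print(f"chi2({2 * pp.size}) = {chi2:.2f}, right-skew p = {p_right_skew:.3f}")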

If researchers want to facilitate future meta-analytic efforts, they should report effect sizes and statistical tests for the comparisons they are making. Furthermore, since you should not report point estimates without indicating the uncertainty in those estimates, researchers need to provide confidence intervals around effect size estimates. Finally, when the unstandardized data clearly communicate the practical relevance of the effect (for example, when the dependent variable is measured on a scale we can easily interpret, such as time or money), researchers might simply choose to report the mean difference (and the accompanying confidence interval).

To conclude, the best recommendation I can currently think of when reporting statistics is to provide means, standard deviations, sample sizes (per condition in between-subject designs), correlations between dependent measures in within-subject designs, statistical tests (such as a t-test), p-values, effect sizes and their confidence intervals, and Bayes Factors. For example, when reporting a Stroop effect, we might write:

The mean reaction time of 95 participants in the Congruent condition (M = 14.88, SD = 4.50) was shorter than the mean in the Incongruent condition (M = 23.69, SD = 5.41; the dependent measures correlate r = 0.30). The average difference between conditions is -8.81 seconds (SD = 5.91), 95% CI [-10.01; -7.61], t(94) = -14.54, p < 0.001, Hedges' g = -1.76, 95% CI [-2.12; -1.42]. The data are overwhelmingly more probable under the alternative hypothesis than under the null hypothesis, logBF10 = 52.12.
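
As a sketch of how such a write-up could be produced from raw paired data (the simulated reaction times below merely stand in for real data, and the standardized effect size and its confidence interval are computed with one of several possible approximate formulas, so the numbers need not match the example exactly):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 95
congruent = rng.normal(15, 4.5, n)             # placeholder for the real RTs
incongruent = congruent + rng.normal(9, 5, n)  # placeholder for the real RTs

diff = congruent - incongruent
m_diff, sd_diff = diff.mean(), diff.std(ddof=1)
se = sd_diff / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_diff = (m_diff - t_crit * se, m_diff + t_crit * se)
res = stats.ttest_rel(congruent, incongruent)

# One of several possible standardized effect sizes for a paired design:
# d for repeated measures, with Hedges' small-sample correction and an
# approximate (large-sample) confidence interval.
r = np.corrcoef(congruent, incongruent)[0, 1]
d_rm = (m_diff / sd_diff) * np.sqrt(2 * (1 - r))
var_d = (1 / n + d_rm**2 / (2 * n)) * 2 * (1 - r)  # approximate sampling variance
j = 1 - 3 / (4 * (n - 1) - 1)                      # Hedges' correction factor
g = j * d_rm
se_g = j * np.sqrt(var_d)
ci_g = (g - 1.96 * se_g, g + 1.96 * se_g)

print(f"M_diff = {m_diff:.2f} (SD = {sd_diff:.2f}), "
      f"95% CI [{ci_diff[0]:.2f}, {ci_diff[1]:.2f}]")
print(f"t({n - 1}) = {res.statistic:.2f}, p = {res.pvalue:.2g}")
print(f"Hedges' g = {g:.2f}, 95% CI [{ci_g[0]:.2f}, {ci_g[1]:.2f}]")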

Obviously, the best way to prevent discussion about which statistics you report, and to facilitate future meta-analyses, is to share your raw data – online repositories have made this so easy that there is no longer a good reason not to share your data (except for datasets where ethical and privacy-related aspects need to be considered).

Which statistics you interpret is a very different question, and one I personally find much less interesting, given that the interpretation of a single study is just an interim summary while the field waits for the meta-analysis. A good approach is to interpret all the statistics you report, and to trust your conclusions most when all statistical inferences provide converging support for them.

2 comments:

  1. I agree that reporting a range of descriptives is useful and should be encouraged. I'm not sure that in-text reporting is as helpful as graphs (especially if the raw data are available).

    "Finally, when the unstandardized data can clearly communicate the practical relevance of the effect (for example, when you measured your dependent variable in scales we can easily interpret, such as time or money) researchers might simply choose to report the mean difference (and accompanying confidence interval)."

    I slightly take issue with this. The unstandardised effect is nearly always more useful and interpretable. The example you give has a time difference in seconds. The standardized effect size is not particularly useful, as you can't compare it to other studies without also adjusting the estimates for the type of sample (which affects the variability of RTs), the type of stimuli, and crucially, how many items were used and how time was measured. As the SD is idiosyncratically related to the study characteristics, it doesn't really tell us anything beyond what the t statistic or BF does. For example, if I replicated this study and found a time difference of 9.2 seconds but g = 0.9, this would not imply the effect was 50% smaller, merely that my data are noisier.

    1. Dear Thom, thank you again for your response on my blog. My recommendations are rather general, and I want to facilitate meta-analyses (and assume these are often done across types of DVs, so reaction times, scales, etc.). What would you suggest as an alternative write-up? I am happy to concede my write-up is not the best, but I would also like to see a better alternative before giving up my suggestion. Obviously, when data are shared, it all hardly matters, but what should be reported when data are not shared?
