Throughout the history of psychological science, there has been a continuing debate about which statistics are used and how these statistics are reported. I distinguish between reporting statistics and interpreting statistics. This distinction is important, because much of the criticism of the statistics researchers use concerns how statistics are interpreted, not how they are reported.
When it comes to reporting statistics, my approach is simple: the more, the merrier. At the very minimum, descriptive statistics (e.g., means and standard deviations) are required to understand the reported data, preferably complemented with visualizations of the data (for example, in online supplementary material). This should include the sample sizes (per condition in between-subjects designs) and the correlations between dependent variables in within-subjects designs. The number of participants per condition, and especially the correlation between dependent variables, are often not reported, but are necessary for future meta-analyses.
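As a minimal illustration of these descriptives, here is a sketch in Python for a within-subjects design with two conditions; the data and variable names are hypothetical, not taken from any real study.

```python
import numpy as np

# Hypothetical paired data: one score per participant in each condition
# of a within-subjects design (names and numbers are illustrative only).
rng = np.random.default_rng(1)
congruent = rng.normal(15, 4.5, size=95)
incongruent = congruent + rng.normal(9, 5, size=95)

# Descriptives a future meta-analyst will need.
print(f"N = {len(congruent)}")
print(f"Congruent:   M = {congruent.mean():.2f}, SD = {congruent.std(ddof=1):.2f}")
print(f"Incongruent: M = {incongruent.mean():.2f}, SD = {incongruent.std(ddof=1):.2f}")

# The correlation between the dependent measures determines the standard
# error of the paired difference, so it should be reported as well.
r = np.corrcoef(congruent, incongruent)[0, 1]
print(f"r(congruent, incongruent) = {r:.2f}")
```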
If you want to communicate the probability that the alternative hypothesis is true, given the data, you should report Bayesian statistics, such as Bayes Factors. But in novel lines of research, you might simply want to know whether your data are surprising, assuming there is no true effect, and choose to report p-values for this purpose.
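To make the difference concrete, here is a minimal sketch for hypothetical paired data. The p-value comes from SciPy; the Bayes factor line assumes the pingouin package with its default JZS prior, so treat that part as an illustration and check the package documentation.

```python
import numpy as np
from scipy import stats
import pingouin as pg  # assumed to be installed; provides a default JZS Bayes factor

# Hypothetical paired data with a modest effect (illustrative only).
rng = np.random.default_rng(1)
congruent = rng.normal(15, 4.5, size=95)
incongruent = congruent + rng.normal(1.5, 5, size=95)

# A p-value answers: how surprising are these data if the true difference is zero?
res = stats.ttest_rel(congruent, incongruent)
print(f"t({len(congruent) - 1}) = {res.statistic:.2f}, p = {res.pvalue:.3g}")

# A Bayes factor answers: how much more probable are the data under the
# alternative hypothesis than under the null hypothesis?
bf10 = float(pg.ttest(congruent, incongruent, paired=True)["BF10"].iloc[0])
print(f"BF10 = {bf10:.3g}")
```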
But there is a much more important aspect to consider when reporting statistics. Given that every study is merely a data point in a future meta-analysis, all meta-analytic data should be reported, so that the study can be included in future meta-analyses.
What is meta-analytic data?
What meta-analytic data are depends on the meta-analytic technique that is used. The most widely known technique is the meta-analysis of effect sizes (often simply abbreviated as meta-analysis). In a meta-analysis of effect sizes, researchers typically combine standardized effect sizes across studies, and provide an estimate of, and a confidence interval around, the meta-analytic effect size. A number of standardized effect sizes exist to combine effects across studies that use different measures, such as Cohen's d, correlations, and odds ratios (note that there are often several different ways to calculate these effect sizes).
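To illustrate the basic logic, here is a minimal fixed-effect, inverse-variance-weighted meta-analysis of made-up standardized effect sizes; a real meta-analysis would typically use a dedicated package and consider a random-effects model.

```python
import numpy as np
from scipy import stats

# Hypothetical Hedges' g values and their sampling variances from five studies.
g = np.array([0.42, 0.31, 0.55, 0.18, 0.47])
v = np.array([0.031, 0.022, 0.050, 0.015, 0.040])

# Fixed-effect model: weight each study by the inverse of its variance.
w = 1 / v
g_meta = np.sum(w * g) / np.sum(w)
se_meta = np.sqrt(1 / np.sum(w))
z = stats.norm.ppf(0.975)
print(f"Meta-analytic g = {g_meta:.2f}, "
      f"95% CI [{g_meta - z * se_meta:.2f}, {g_meta + z * se_meta:.2f}]")
```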
Recently, novel meta-analytic techniques have been developed. For example, p-curve analysis uses test statistics (e.g., t-values, F-values, and their degrees of freedom) as input, and analyses the distribution of p-values. This analysis can indicate that the p-value distribution is uniform (as expected when the null hypothesis is true), or that it is right-skewed (as expected when the alternative hypothesis is true). P-curve analysis has a number of benefits, the most noteworthy of which is that it is performed only on p-values below 0.05. Due to publication bias, non-significant results are often not shared between researchers, which is a challenge for meta-analyses of effect sizes. P-curve analysis does not require access to non-significant results to evaluate the evidential value of a set of studies, which makes it an important complement to the meta-analysis of effect sizes. Similarly, Bayesian meta-analytic techniques often rely on test statistics, and not on standardized effect sizes.
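The core idea can be sketched in a few lines (this is a simplified illustration of the right-skew test, not the full p-curve app): significant p-values are rescaled to 'pp-values' that are uniform under the null hypothesis, and then combined; a right-skewed set of p-values yields a small combined p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical set of significant p-values from a literature (made up).
p_values = np.array([0.001, 0.012, 0.003, 0.041, 0.008])

# Under the null hypothesis, p-values that happen to fall below .05 are
# uniform on (0, .05), so dividing by .05 gives pp-values uniform on (0, 1).
pp = p_values / 0.05

# Stouffer-style combination: many very small p-values (a right-skewed
# p-curve) produce a large negative Z and a small combined p-value.
z = stats.norm.ppf(pp)
z_stouffer = z.sum() / np.sqrt(len(z))
p_right_skew = stats.norm.cdf(z_stouffer)
print(f"Z = {z_stouffer:.2f}, p (right skew) = {p_right_skew:.4f}")
```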
If researchers want to facilitate future meta-analytic efforts, they should report effect sizes and statistical tests for the comparisons they are making. Furthermore, since you should not report point estimates without indicating the uncertainty in those estimates, researchers need to provide confidence intervals around effect size estimates. Finally, when the unstandardized data clearly communicate the practical relevance of the effect (for example, when the dependent variable is measured on a scale we can easily interpret, such as time or money), researchers might simply choose to report the mean difference (and the accompanying confidence interval).
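A minimal sketch of this recommendation for a paired design is given below, using hypothetical data. It standardizes by the standard deviation of the difference scores (Cohen's d_z with Hedges' small-sample correction) and uses a common normal-theory approximation for the confidence interval of the effect size; other standardizers and exact noncentral-t intervals are of course possible.

```python
import numpy as np
from scipy import stats

# Hypothetical paired data (illustrative only).
rng = np.random.default_rng(1)
congruent = rng.normal(15, 4.5, size=95)
incongruent = congruent + rng.normal(9, 5, size=95)

diff = congruent - incongruent
n = len(diff)

# Unstandardized mean difference in the original units, with its 95% CI.
md = diff.mean()
se = diff.std(ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(0.975, df=n - 1)
print(f"Mean difference = {md:.2f}, 95% CI [{md - tcrit * se:.2f}, {md + tcrit * se:.2f}]")

# Standardized effect size with Hedges' correction and an approximate CI.
dz = md / diff.std(ddof=1)
J = 1 - 3 / (4 * (n - 1) - 1)            # Hedges' small-sample correction
g = J * dz
se_g = np.sqrt(1 / n + g**2 / (2 * n))   # common large-sample approximation
print(f"Hedges' g_z = {g:.2f}, approx. 95% CI [{g - 1.96 * se_g:.2f}, {g + 1.96 * se_g:.2f}]")
```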
To conclude, the best recommendation I can currently think of when reporting statistics is to provide means, standard deviations, sample sizes (per condition in between-subjects designs), correlations between dependent measures in within-subjects designs, statistical tests (such as a t-test), p-values, effect sizes with their confidence intervals, and Bayes Factors. For example, when reporting a Stroop effect, we might write:
The mean reaction time of 95 participants in the Congruent condition (M = 14.88, SD = 4.50) was smaller than in the Incongruent condition (M = 23.69, SD = 5.41); the dependent measures correlated r = 0.30. The average difference between conditions was -8.81 seconds (SD = 5.91), 95% CI [-10.01; -7.61], t(94) = -14.54, p < 0.001, Hedges' g = -1.76, 95% CI [-2.12; -1.42]. The Bayes factor was log(BF10) = 52.12, indicating the data are far more probable under the alternative hypothesis than under the null hypothesis.
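One reason to report the correlation between the dependent measures is that it allows a meta-analyst to reconstruct the paired comparison from the summary statistics alone. As a sketch, using only the numbers reported in the example above:

```python
import numpy as np
from scipy import stats

# Summary statistics from the example write-up above.
n, r = 95, 0.30
m1, sd1 = 14.88, 4.50   # Congruent
m2, sd2 = 23.69, 5.41   # Incongruent

# The reported correlation makes it possible to recover the standard
# deviation of the difference scores, and from that the paired t-test.
sd_diff = np.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)   # ~5.91
t = (m1 - m2) / (sd_diff / np.sqrt(n))                   # ~-14.5
p = 2 * stats.t.sf(abs(t), df=n - 1)
print(f"SD of difference = {sd_diff:.2f}, t({n - 1}) = {t:.2f}, p = {p:.2g}")
```

Without the correlation, none of these quantities could be recovered from the means and standard deviations alone.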
Obviously, the best way to prevent discussion about which statistics you report, and to facilitate future meta-analyses, is to share your raw data. Online repositories have made this so easy that there is no longer a good reason not to share your data (except for some datasets where ethical and privacy-related aspects need to be considered).
Which statistics you interpret is a very different question, which I personally find much less interesting, given that the interpretation of a single study is just an interim summary while the field waits for the meta-analysis. A good approach is to interpret all the statistics you report, and to trust your conclusions most when all statistical inferences provide converging support for your conclusion.
I agree that reporting a range of descriptives is useful and should be encouraged. I'm not sure that in-text reporting is as helpful as graphs (especially if the raw data are available).
"Finally, when the unstandardized data clearly communicate the practical relevance of the effect (for example, when the dependent variable is measured on a scale we can easily interpret, such as time or money), researchers might simply choose to report the mean difference (and the accompanying confidence interval)."
I slightly take issue with this. The unstandardised effect is nearly always more useful and interpretable. The example you give has a time difference in seconds. The standardised effect size is not particularly useful, as you can't compare it to other studies without also adjusting the estimates for the type of sample (which affects the variability of RTs), the type of stimuli, and, crucially, how many items there were and how time was measured. As the SD is idiosyncratically related to the study characteristics, it doesn't really tell us anything beyond what the t statistic or BF does. For example, if I replicated this study and found a time difference of 9.2 seconds but g = 0.9, this would not imply the effect was 50% smaller, merely that my data are noisier.
Dear Thom, thank you again for your response on my blog. My recommendations are rather general, and I want to facilitate meta-analyses (and I assume these are often done across types of DVs, so reaction times, scales, etc.). What would you suggest as an alternative write-up? I am happy to concede my write-up is not the best, but I would also like to see a better alternative before giving up my suggestion. Obviously, when data are shared, it all hardly matters, but what should be reported when data are not shared?