In a recent FiveThirtyEight article, a statistical approach known as magnitude based inferences, popular in sports sciences, was severely criticized. Critically evaluating approaches to statistical inferences is important. In my own work on statistical inferences, I try to ask myself whenever a problem is identified: "So how can I improve?". In this blog post, I'll highlight 3 ways to move beyond magnitude based inferences, achieving the same goals, but with more established procedures. I hope sport scientists in doubt about how to analyze their data will learn some other approaches that are less contested, but equally useful.

The key goal in magnitude based inferences is to improve upon limitations of null-hypothesis significance tests. As Batterham & Hopkins (2006) write: “A confidence interval alone or in conjunction with a P value still does not overtly address the question of the clinical, practical, or mechanistic importance of an outcome.” To complement p-values and confidence intervals, they propose to evaluate confidence intervals in relation to two thresholds, which I prefer to call the ‘smallest effect size of interest’.

Magnitude Based Inference

Although I’m not a particularly sporty person, I recently participated in the Rotterdam City Run, where we ran through our beautiful city, but also through the dressing room of a theater, around an indoor swimming pool, a lap through the public library, and the fourth story of a department store. The day before the run, due to train problems I couldn’t get to work, and I wasn’t able to bring my running shoes (which I use at the university sport center) home. I thought it would be OK to run on sneakers, under the assumption ‘how bad can it be’? So let’s assume we examine the amount of physical discomfort people experience when running on running shoes, or on normal sneakers. As we discuss in Lakens, McLatchie, Isager, Scheel, & Dienes (under review), Kelly (2001) reports that the smallest effect size that leads an individual to report feeling “a little better” or “a little worse” is 12 mm (95% CI [9; 15]) on a 100 mm visual analogue scale of pain intensity. So let’s say I would have been fine with running on sneakers instead of real running shoes if after the run I would be within 12 mm of the pain rating I would have given on good running shoes. In other words, I consider a 12 mm difference trivial: sure, I didn’t have my running shoes, but that’s a trivial thing when going on a city run. I also leave open the possibility that my running shoes aren’t very good, and that I might actually feel better after running on my normal sneakers (unlikely, but who knows).

In formal terms, I have set my equivalence bounds to a difference of 12 mm between running on sneakers and running on decent running shoes. All differences within this equivalence range (the light middle section in the figure below, from Batterham and Hopkins, 2006) are considered trivial. We see the inferences we can draw from the confidence interval depending on whether the CI overlaps with the equivalence bounds. Batterham and Hopkins refer to effects as ‘beneficial’ as long as they are not harmful. This is a bit peculiar, since from the third confidence interval from the top, we can see that this implies calling a finding ‘beneficial’ when it is not statistically significant (the CI overlaps with 0), a conclusion we would not normally draw based on a non-significant result.

Batterham and Hopkins suggest using verbal labels to qualify the different forms of ‘beneficial’ outcomes, to get around the problem of simply calling a non-significant result ‘beneficial’. Instead of just saying an effect is beneficial, they suggest labeling it as ‘possibly beneficial’.

Problems with Magnitude Based Inference

In a recent commentary, Sainani (2018) points out that even though the idea to move beyond p-values and confidence intervals is a good one, magnitude based inference has problems in terms of error control. Her commentary was not the first criticism raised about problems with magnitude based inference, but it seems it will have the greatest impact. The (I think somewhat overly critical) article on FiveThirtyEight details the rise and fall of magnitude based inference. As Sainani summarizes: “Where Hopkins and Batterham’s method breaks down is when they go beyond simply making qualitative judgments like this and advocate translating confidence intervals into probabilistic statements such as: the effect of the supplement is ‘very likely trivial’ or ‘likely beneficial.’”

Even though Hopkins and Batterham (2016) had published an article stating that magnitude based inference outperforms null-hypothesis significance tests in terms of error rates, Sainani shows conclusively that this is not correct. The conclusions by Hopkins and Batterham (2016) were based on an incorrect definition of Type 1 and Type 2 error rates. When defined correctly, the Type 1 error rate turns out to be substantially higher for magnitude based inferences (MBI) depending on the smallest effect size of interest that is used to define the equivalence bounds (or the ‘trivial’ range) and the sample size (see Figure 1G below from Sainani, in press). Note that the main problem is not that error rates are always higher (as the graph shows) - just that they will often be, when following the recommendations by Batterham and Hopkins.

How to Move Forward?

The idea behind magnitude based inference is a good one, and not surprisingly, statisticians had thought about exactly the limitations of null-hypothesis tests and confidence intervals that are raised by Batterham and Hopkins. The idea to use confidence intervals to draw inferences about whether effects are trivially small, or large enough to matter, has been fully developed before, and sport scientists can use these more established methods. This is good news for people working in sports and exercise science who do not want to simply fall back on null-hypothesis tests now that magnitude based inference has been shown to be a problematic approach.

Indeed, in a way it is surprising Batterham and Hopkins never reference the extensive literature on approaches that are on a much better statistical footing than magnitude based inference, but that are extremely similar in their goal.

The ROPE procedure

The first approach former users of magnitude-based inference could switch to is the ROPE procedure as suggested by John Kruschke (for an accessible introduction, see https://osf.io/s5vdy/). As pointed out by Sainani, the use of confidence intervals by Batterham and Hopkins to make judgments about the probability of true values “requires interpreting confidence intervals incorrectly, as if they were Bayesian credible intervals.” Not surprisingly, one solution moving forward for exercise and sports science is thus to switch to using Bayesian credible (or highest density) intervals.

As Kruschke (2018) clearly explains, the Bayesian posterior can be used to draw conclusions about the probability that the effect is trivial, or large enough to be deemed beneficial. The similarity to magnitude based inference should be obvious, with the added benefit that the ROPE procedure rests on a strong formal footing.
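To make this concrete, here is a minimal sketch, not Kruschke's own implementation. It assumes a flat prior and a normal likelihood, so that the posterior for the mean difference is approximately normal around the observed difference. The observed difference of 3 mm and the standard error of 4 mm are made up for illustration, the ROPE of -12 mm to 12 mm comes from the running example above, and the function name `rope_summary` is hypothetical. The decision rule compares the 95% interval to the ROPE; for a symmetric normal posterior, the central interval computed here coincides with the highest density interval.

```python
from statistics import NormalDist


def rope_summary(observed_diff, standard_error, rope_low, rope_high, mass=0.95):
    """Summarise a normal posterior relative to a region of practical equivalence.

    Assumption: flat prior + normal likelihood, so the posterior is
    Normal(observed_diff, standard_error). Real analyses would usually
    sample the posterior with MCMC instead.
    """
    posterior = NormalDist(mu=observed_diff, sigma=standard_error)

    # Posterior probability that the true effect falls in each region.
    p_below = posterior.cdf(rope_low)
    p_inside = posterior.cdf(rope_high) - posterior.cdf(rope_low)
    p_above = 1.0 - posterior.cdf(rope_high)

    # Central credible interval; equals the HDI for a symmetric posterior.
    tail = (1.0 - mass) / 2.0
    interval = (posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail))

    # Compare the whole interval to the ROPE.
    if rope_low <= interval[0] and interval[1] <= rope_high:
        decision = "accept ROPE (effect practically equivalent to zero)"
    elif interval[1] < rope_low or interval[0] > rope_high:
        decision = "reject ROPE (effect large enough to matter)"
    else:
        decision = "undecided"

    return {"p_below": p_below, "p_inside": p_inside, "p_above": p_above,
            "interval": interval, "decision": decision}


# Hypothetical data: sneakers hurt 3 mm more on average, SE = 4 mm.
rope_result = rope_summary(observed_diff=3.0, standard_error=4.0,
                           rope_low=-12.0, rope_high=12.0)
print(rope_result)
```

Unlike the probabilistic labels in magnitude based inference, the three probabilities here are genuine posterior probabilities, but only under the stated prior and likelihood.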

Equivalence Testing

One of the main points of criticism of magnitude based inference, demonstrated conclusively by Sainani (2018), is that of poor error control. Error control is a useful property of a tool to draw statistical inferences, since it will guarantee that (under certain assumptions) you will not draw erroneous conclusions more often than some threshold you desire.

Error control is the domain of Frequentist inferences, and especially the Neyman-Pearson approach to statistical inferences. The procedure that strongly mirrors magnitude based inferences from a Frequentist approach to statistical inferences is equivalence testing. It happens to be a topic I’ve worked on myself in the last year, among other things creating an R package (TOSTER) and writing tutorial papers to help psychologists to start using equivalence tests (e.g., Lakens, 2017; Lakens, Scheel, & Isager, 2017).

As the figure below (from an excellent article by Rogers, Howard, & Vessey, 1993) shows, equivalence tests also show great similarity with magnitude based inference. They similarly build on 90% confidence intervals, and allow researchers to draw conclusions similar to those magnitude based inference aimed for, while carefully controlling error rates.
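The logic of an equivalence test can be sketched in a few lines. This is a simplified illustration, not the TOSTER implementation: it assumes a large-sample normal (z) approximation where a real analysis would typically use the t distribution, the function name `tost_z` is hypothetical, and the observed difference (3 mm) and standard error (4 mm) are invented numbers for the running example with bounds of -12 mm and 12 mm.

```python
from statistics import NormalDist


def tost_z(observed_diff, standard_error, low_bound, high_bound, alpha=0.05):
    """Two one-sided tests (TOST) using a large-sample normal approximation.

    Assumption: the z approximation stands in for the t distribution that
    would normally be used with small samples.
    """
    z = NormalDist()

    # Test 1: H0 diff <= low_bound, rejected when diff is clearly above it.
    p_lower = 1.0 - z.cdf((observed_diff - low_bound) / standard_error)
    # Test 2: H0 diff >= high_bound, rejected when diff is clearly below it.
    p_upper = z.cdf((observed_diff - high_bound) / standard_error)

    # Equivalence is claimed only if BOTH one-sided tests are significant,
    # so the overall p-value is the larger of the two.
    p_tost = max(p_lower, p_upper)

    # Same decision via the (1 - 2 * alpha) CI, i.e. a 90% CI for alpha = .05.
    crit = z.inv_cdf(1.0 - alpha)
    ci = (observed_diff - crit * standard_error,
          observed_diff + crit * standard_error)
    equivalent = low_bound < ci[0] and ci[1] < high_bound

    return {"p_lower": p_lower, "p_upper": p_upper, "p_tost": p_tost,
            "ci": ci, "equivalent": equivalent}


# Hypothetical data: sneakers hurt 3 mm more on average, SE = 4 mm.
tost_result = tost_z(observed_diff=3.0, standard_error=4.0,
                     low_bound=-12.0, high_bound=12.0)
print(tost_result)
```

The sketch shows why the 90% interval appears: checking that a 90% confidence interval falls entirely inside the bounds is the same decision as performing two one-sided tests, each at a 5% alpha level.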

Minimal Effect Tests

Another idea in magnitude based inference is to not test against the null, but to test against the smallest effect size of interest, when concluding an effect is beneficial. In such cases, we do not want to simply reject an effect size of 0; we want to be able to reject all effects that are small enough to be considered trivial. Luckily, this also already exists, and it is known as minimal effect testing. Instead of a point null hypothesis, a minimal effects test aims to reject effects within the equivalence range (for a discussion, see Murphy & Myors, 1999).
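A minimal effect test can be sketched in the same way. As before, this is an illustration under assumptions rather than a reference implementation: it uses a large-sample normal approximation, the function name `minimal_effect_test_z` is hypothetical, and the observed difference of 20 mm with a standard error of 4 mm is invented. Each direction is tested at the full alpha level here; when both directions are of interest and the trivial range is narrow relative to the standard error, a smaller alpha per test would be the cautious choice.

```python
from statistics import NormalDist


def minimal_effect_test_z(observed_diff, standard_error,
                          low_bound, high_bound, alpha=0.05):
    """One-sided tests against the equivalence bounds instead of against 0.

    Assumption: large-sample normal approximation; with small samples a
    t distribution would be used instead.
    """
    z = NormalDist()

    # H0: diff <= high_bound. A small p-value means the effect is larger
    # than the upper bound, not merely larger than zero.
    p_above = 1.0 - z.cdf((observed_diff - high_bound) / standard_error)
    # H0: diff >= low_bound. A small p-value means the effect is more
    # negative than the lower bound.
    p_below = z.cdf((observed_diff - low_bound) / standard_error)

    return {"p_above": p_above, "p_below": p_below,
            "exceeds_upper": p_above < alpha,
            "exceeds_lower": p_below < alpha}


# Hypothetical data: sneakers hurt 20 mm more on average, SE = 4 mm.
met_result = minimal_effect_test_z(20.0, 4.0, -12.0, 12.0)
# For comparison, the ordinary two-sided test against a difference of 0.
p_null = 2.0 * (1.0 - NormalDist().cdf(abs(20.0) / 4.0))
print(met_result, p_null)
```

With these made-up numbers the test against zero gives a far smaller p-value than the test against the 12 mm bound, which illustrates that rejecting the trivial range is a more demanding claim than rejecting zero.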

Conclusion

There are some good suggestions underlying the idea of magnitude based inferences. And a lot of the work by Batterham and Hopkins has been to convince their field to move beyond null-hypothesis tests and confidence intervals, and to interpret the results in a more meaningful manner. This is a huge accomplishment, even if the approach they have suggested lacks a formal footing and good error control. Many of their recommendations about how to think about which effects in their field are trivial are extremely worthwhile. As someone who has worked on trying to get people to improve their statistical inferences, I know how much work goes into trying to move your discipline forward, and the work by Batterham and Hopkins on this front has been extremely worthwhile.

At this moment, I think the biggest risk is that the field falls back to only performing null-hypothesis tests. The ideas underlying magnitude based inferences are strong, and luckily, we have the ROPE procedure, equivalence testing, and minimal effect tests. These procedures are well vetted (equivalence testing is recommended by the Food and Drug Administration) and will allow sports and exercise scientists to achieve the same goals. I hope they will take all they have learned from Batterham and Hopkins about drawing inferences by taking into account the effect sizes predicted by a theory, or that are deemed practically relevant, and apply these insights using the ROPE procedure, equivalence tests, or minimal effect tests.

P.S. Don't try to run 10k through a city on sneakers.

References

Batterham, A. M., & Hopkins, W. G. (2006). Making Meaningful Inferences About Magnitudes. International Journal of Sports Physiology and Performance, 1(1), 50–57. https://doi.org/10.1123/ijspp.1.1.50

Hopkins, W. G., & Batterham, A. M. (2016). Error Rates, Decisive Outcomes and Publication Bias with Several Inferential Methods. Sports Medicine, 46(10), 1563–1573. https://doi.org/10.1007/s40279-016-0517-x

Kruschke, J. K. (2018). Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 2515245918771304. https://doi.org/10.1177/2515245918771304

Lakens, D., Scheel, A. M., & Isager, P. M. (2017). Equivalence Testing for Psychological Research: A Tutorial. PsyArXiv. https://doi.org/10.17605/OSF.IO/V3ZKT

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2018). Improving Inferences about Null Effects with Bayes Factors and Equivalence Tests. PsyArXiv. https://doi.org/10.17605/OSF.IO/QTZWR

Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84(2), 234.

Sainani, K. L. (2018). The Problem with “Magnitude-Based Inference.” Medicine & Science in Sports & Exercise, Publish Ahead of Print. https://doi.org/10.1249/MSS.0000000000001645

When the scientific discipline started to emerge, Bacon (1620) remarked: “And it is nothing strange if a thing not held in honour does not prosper”. His idea was that we need to give honour to scientists who do good work, otherwise no one would want to become a scientist. In the current academic climate, prestige works as a double-edged sword: It can be useful as a source of extrinsic motivation, but it can also tempt researchers to violate scientific norms that, as long as the norm violations go unnoticed, increase prestige (McPherson, 1994). One attractive feature of using prestige as a reward mechanism in science is that researchers who aim to gain prestige value their reputation. A good reputation is maintained by not violating scientific norms, and a good reputation is lost when norm violations are discovered. Therefore, for researchers interested in gaining prestige, the goal to maintain a good reputation provides a selfish reason not to violate scientific norms (Milinski, Semmann, & Krambeck, 2002).

As Partha and David (1994) discuss in their work on the economics of science, the loss of reputation can be seen as a form of punishment for people who violated scientific norms, with the goal to maintain long term cooperation within the scientific enterprise (and prevent ‘defection’, or norm violations). In their view, science can be seen as a social dilemma, where the trade-off is to do what is good for the collective, or what is good for yourself. These two goals are not always aligned. Partha and David note how punishment in science often consists of ostracism, or “exclusion from the circle of cooperators in the future”, after norm violations are made public. For example, after a norm violation, colleagues might no longer want to work with you.

There has been a lot of discussion about what constitutes a norm violation in science, how researchers should act when they realize they have violated a norm, and the desirability of pointing out norm violations in public. To me, it seems that if we accept a system that rewards individuals through prestige, we also need to accept a system that leads to suffering and distress when individuals lose prestige. We will inevitably see differences in which (if any) violations people think deserve to be punished by loss of reputation. A reward system based on prestige does not, by definition, lend itself to exact quantification. People do not receive prestige proportional to their contributions to science, and there is no process that guarantees that the loss of prestige after a norm violation is proportional to the severity of the norm violation. This discussion is complicated even more by the fact that when norm violations are followed by ostracism, we should not only expect a loss of prestige, but also strong personal distress (the meta-analytic effect size of negative effects due to being ostracized is d = -1.4, which is one of the biggest effects in social psychology).

Using punishment to prevent the violation of scientific norms is an inherently messy mechanism. It is difficult, if not impossible, to detach prestige from subjective feelings. It seems impossible to contain punishment of perceived norm violations purely to a reduction in prestige, even if one wanted to. I personally believe that as long as we have a system that confers individual prestige on the basis of scientific accomplishments, we also opt in to a system that requires punishing researchers who have gained their prestige by violating scientific norms. If we choose to prevent scientific norm violations through punishment and ostracism, and information about norm violations can now be much more widely shared than before through social media, the field needs to come together to more clearly define norm violations, and reasonable sanctions for specific norm violations.

A recent case illustrates this point well. Robert Sternberg recently resigned as editor of Perspectives on Psychological Science. He self-plagiarized, and excessively cited his own work. In a system that rewards scientists with prestige based on their performance, it seems necessary to incorporate information about self-plagiarism and self-citation into our judgment of how much prestige Sternberg should get. Whether his behavior is a scientific norm violation is a matter of debate. In the Netherlands, cases of perceived scientific norm violations are transparently dealt with by the LOWI (unlike countries such as the USA where perceived scientific norm violations are intentionally hidden from public view and rarely, if ever, dealt with in a transparent manner). In a very similar case in economics, the excessive self-plagiarism of Peter Nijkamp (like Sternberg, ranked in the Top 100 of his field) was seen as a questionable research practice, and careless, but not as a breach of scientific integrity. Note that even if excessive self-plagiarism is not officially a breach of scientific integrity, fellow scientists can still perceive this as a norm violation, and ostracize researchers who act in this manner.

The email below (which despite being written so badly it reads like a spoof email, seems to be a real email by a real lawyer working for Sternberg) highlights the negative affective consequences of this punishment process for the people who lose prestige because of perceived scientific norm violations. This is just one example, but many high-profile cases where researchers have lost prestige due to perceived norm violations will lead to experiences of “intentional efforts to inflict emotional harm” on behalf of the researchers who have received criticism.

I am merely observing this weird situation we have gotten ourselves into. Because we collectively accept a system that rewards individual scientists through prestige, I can feel both sympathy for people who experience negative affect when their reputation suffers, and indignation when they send lawyers after people who publicly share perceived norm violations. I don’t see a solution as long as we have a scientific system that rewards individuals through prestige. Allow me to self-plagiarize: If we accept a system that rewards individuals through prestige, we also need to accept a system that leads to suffering and distress when these individuals lose prestige.

If we care enough about this problem to try to solve it, we might have to seriously reconsider the role prestige plays in science.

References

McPherson, M. S. (1994). Part three. How should liberal education be financed. Public purpose and public accountability in liberal education. New Directions for Higher Education, 1994(85), 81–92. https://doi.org/10.1002/he.36919948512

Milinski, M., Semmann, D., & Krambeck, H.-J. (2002). Reputation helps solve the ‘tragedy of the commons.’ Nature, 415(6870), 424. https://doi.org/10.1038/415424a

Partha, D., & David, P. A. (1994). Toward a new economics of science. Research Policy, 23(5), 487–521. https://doi.org/10.1016/0048-7333(94)01002-1

The 20% Statistician

Thursday, May 17, 2018

Moving Beyond Magnitude Based Inferences