Comments on The 20% Statistician: Dance of the Bayes factors

Thanks Daniel for your inspiring work. I recently ...

2016-11-13T11:17:09.485+01:00

Thanks Daniel for your inspiring work. I recently gave a talk where I show many statistical dances in graphical form, and I mentioned your blog post: http://www.aviz.fr/badstats#sec0

As you probably know I have done such simulations ...

2016-09-19T10:39:56.513+02:00

As you probably know, I have done such simulations in the past as well. Your argument here actually seems to repeat some of the debate I had about the Boekel replication study, summarized in my (now ancient) blog post: https://neuroneurotic.net/2015/03/26/failed-replication-or-flawed-reasoning/

What these simulations are essentially doing is to restate what the Bayes Factor means. You get a dance of the BFs with little conclusive evidence either way because most results simply do not provide conclusive evidence for either model. Is this really a problem? It seems to tell us what we want to know.

What I think is a problem is when results are misinterpreted (which was the reason for my blog post then although my thinking has also evolved since then). I think while Bayesians keep reminding us that we should condition on the data and not the truth, the way we report and discuss results is still largely focused on the latter. Even studies employing Bayesian hypothesis tests seem guilty of this.

We recently published a paper (my first with default BFs) where I tried to counteract this by emphasizing that a BF quantifies relative evidence. In it we were mostly trying to infer whether participants are guessing, so the BFs are all tests of whether accuracy at the group level is at chance. Most of the BF10 values are > 1/3 even though they're below 1. What this means is that you should update your belief that people can do the task better than chance towards the null, but not by very much. If you are really convinced (Bem style) that people can actually do the task, then this won't reduce your belief by much, but it's up to you to decide if your belief is justified in the first place.

What you say about the March of p-values not being...

2016-09-19T10:19:05.065+02:00

What you say about the March of p-values not being true also applies to the default Bayes Factors. I had similar simulations of this in my BSE preprint: http://biorxiv.org/content/early/2015/04/02/017327

You can see that even the weight of evidence (log BF) varies quite dramatically when the true effect size is large.

The more I learn about this, though, the more I realize it comes down to philosophy only. As a Bayesian you shouldn't care about the long run. It's all about the data, and seen from this perspective this is a feature, not a bug. I still haven't entirely decided to what extent the long run actually matters.

This is the long and ever ongoing discussion withi...

2016-08-02T14:00:33.594+02:00

This is the long and ever-ongoing discussion within Bayesian statistics between the objective Bayesians and the subjective Bayesians. There are arguments for both sides, but I don't think there is one side that is right and another side that is wrong. They both have benefits and downsides, so the duality will continue, I'm afraid.

Hi Daniel, very interesting post. However, one cri...

2016-08-02T12:48:58.937+02:00

Hi Daniel,
very interesting post. However, one criticism of Bayesian statistics is that it is (too) subjective. You advise staying away from default priors. I have the feeling that by staying away from the default priors I would fuel the criticism that it's subjective.

> For me, there is one important difference bet...

2016-07-20T15:55:22.469+02:00

> For me, there is one important difference between the dance of the p-values and the dance of the Bayes factors: When people draw dichotomous conclusions, p-values allow you to control your error rate in the long run, while error rates are ignored when people use Bayes factors.

You can choose the BF threshold such that the BF tests p(H_0|D) under the 0-1 loss function used by the Neyman-Pearson framework. For instance, for p(H_0)=p(H_1) and alpha=0.05, H_0 can be rejected if B_10 > 19, and in general the threshold is given by (1-alpha)/alpha*p(H_0)/p(H_1). Of course this is rarely done in the literature, so your point stands; but in principle BFs can be made to address your complaint.
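A minimal sketch of where that threshold comes from, assuming exactly two hypotheses whose prior probabilities sum to 1. The function names are made up for illustration. Rejecting H_0 when p(H_0|D) < alpha is the same as requiring the posterior odds for H_1 to exceed (1-alpha)/alpha, and the posterior odds are B_10 times the prior odds p(H_1)/p(H_0):

```python
def bf10_threshold(alpha: float, prior_h0: float = 0.5) -> float:
    """Smallest BF10 at which p(H0 | D) drops to alpha.

    posterior odds for H1 = BF10 * p(H1) / p(H0)
    p(H0 | D) < alpha  <=>  posterior odds for H1 > (1 - alpha) / alpha
    """
    if not 0.0 < alpha < 1.0 or not 0.0 < prior_h0 < 1.0:
        raise ValueError("alpha and prior_h0 must lie strictly between 0 and 1")
    prior_h1 = 1.0 - prior_h0  # assumption: only two hypotheses
    return (1.0 - alpha) / alpha * prior_h0 / prior_h1


def posterior_h0(bf10: float, prior_h0: float = 0.5) -> float:
    """p(H0 | D) implied by a Bayes factor and a prior probability for H0."""
    posterior_odds_h1 = bf10 * (1.0 - prior_h0) / prior_h0
    return 1.0 / (1.0 + posterior_odds_h1)


if __name__ == "__main__":
    print(bf10_threshold(0.05))        # equal priors
    print(bf10_threshold(0.05, 0.8))   # a sceptic who gives H0 a prior of 0.8
```

With equal priors the threshold is 19, as above. A prior that favours H_0 raises the bar, which is why the prior odds enter as p(H_0)/p(H_1).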

An even better strategy is to just forget about Bayes factors and derive a test from first principles based on the loss function. Christian Robert's book The Bayesian Choice covers this topic. See also Robert's opinion on Bayes factors: http://stats.stackexchange.com/a/204334

As for my opinion, I'm with Geoff Cumming on this. Just escape the false trichotomy of "p-values, confidence intervals, or Bayes factors" by going with Bayesian estimation (Kruschke's Bayesian new statistics).

The dance of the p-values or BF is so shockingly l...

2016-07-19T04:53:08.475+02:00

The dance of the p-values or BFs is so shockingly large because we do not have a proper understanding of p-values or BFs. BFs are ratios, and we are not very used to ratios. Using log(BF) or log(p) might help, but I prefer converting everything into z-scores. Say you got a z-score of 1.96 (p = .05) and you need an 80% interval around that. You take the standard normal and you get z = qnorm(.80) = .84, so you have an interval from 1.12 to 2.80 (corresponding p-values are p = .26 to .005).

If you got p = .001, you get z = 3.29, 80% CI = 2.45 to 4.13. The corresponding p-values are .01 to .0000359.

I find 2.45 to 4.13 easier to interpret than .01 to .0000359, but it expresses the same amount of uncertainty about the probability of obtaining a test statistic when the null hypothesis is true.
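For anyone who wants to check the arithmetic, here is a small sketch using only Python's standard library (the helper names are made up for illustration). One caveat worth flagging: adding and subtracting qnorm(.80) = .84 leaves 20% in each tail, so it spans the central 60% of a standard normal. A central 80% interval would use qnorm(.90), about 1.28. The sketch exposes both half-widths so either reading can be tried.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1


def z_from_p(p_two_tailed: float) -> float:
    """z-score whose two-tailed p-value equals p_two_tailed."""
    return STD_NORMAL.inv_cdf(1.0 - p_two_tailed / 2.0)


def p_from_z(z: float) -> float:
    """Two-tailed p-value for a z-score."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def z_interval(p_two_tailed: float, half_width: float) -> tuple[float, float]:
    """Interval z +/- half_width around the z-score implied by a p-value."""
    z = z_from_p(p_two_tailed)
    return z - half_width, z + half_width


HALF_WIDTH_USED_ABOVE = STD_NORMAL.inv_cdf(0.80)  # about 0.84, i.e. qnorm(.80)
HALF_WIDTH_CENTRAL_80 = STD_NORMAL.inv_cdf(0.90)  # about 1.28, i.e. qnorm(.90)

if __name__ == "__main__":
    for p in (0.05, 0.001):
        for half_width in (HALF_WIDTH_USED_ABOVE, HALF_WIDTH_CENTRAL_80):
            low, high = z_interval(p, half_width)
            print(p, round(low, 2), round(high, 2), p_from_z(low), p_from_z(high))
```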

Very nice Daniel! For me, take home message 1 is t...

2016-07-19T01:38:56.774+02:00

Very nice Daniel!
For me, take home message 1 is that BFs are subject to sampling variability. Of course! But folks don't seem to recognise this sufficiently. I hope now they will. On repeated sampling they dance, as do p values (and means and CIs and values of sample SD and…).

Take home message 2 is that, for yet one more reason, we should move on from dichotomous decision making and use estimation. DDM tempts with seductive but illusory certainty. It does so even when given the better-sounding label ‘Hypothesis Testing’. Estimation, by contrast (1) focuses on effect sizes, which are the outcomes of most research interest and what we should interpret; (2) is necessary for meta-analysis, which is essential in a world of Open Science and replication; and (3) is likely to encourage quantitative modelling and the development of a more quantitative discipline. For me it is much more important that folks move on from DDM to estimation than whether they choose to do estimation using confidence intervals or credible intervals.

Indeed Bayesian estimation centred on credible intervals may be a major way of the future. Please someone develop some great materials for teaching beginners using this approach!

You mention ‘March of the p values’ if power is high. With high power, yes, we more often get small p values. Of course. But the weird thing is that the extent of sampling variability of p is still very high. I quantify that with the p interval: the 80% prediction interval for p, given just the p value found in an initial study. For initial p = .001, the p interval (all p values two-tail) is (.0000003, .14). So there’s an 80% chance a replication gives p in that interval, a 10% chance of p < .0000003, and fully a 10% chance of p > .14! Hardly a march! Possibly more surprisingly, that is all true whatever the value of N! (Provided N is not very small.) Yes, that’s hard to believe, but note that the interval is conditional on that initial p value, not on a particular true effect size. There’s more in the paper:
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300.
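The p interval quoted above can be approximated with a short sketch. It assumes a z-test, so that the replication z-score is normally distributed around the initial z-score with variance 2 (sampling error in both the initial study and the replication). The function name is made up for illustration, and this is a normal-theory approximation, not necessarily the exact procedure in the paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_interval(p_initial: float, level: float = 0.80) -> tuple[float, float]:
    """Approximate prediction interval for a replication's two-tailed p-value."""
    z_initial = STD_NORMAL.inv_cdf(1.0 - p_initial / 2.0)
    # z_replication - z_initial has SD sqrt(2) when both studies have the same N.
    half_width = STD_NORMAL.inv_cdf(0.5 + level / 2.0) * sqrt(2.0)
    z_low, z_high = z_initial - half_width, z_initial + half_width
    if z_low < 0.0:
        # The two-tailed p-value is not monotone in z once the interval
        # crosses zero; this sketch does not handle that case.
        raise ValueError("initial p-value too large for this simple sketch")
    return 2.0 * STD_NORMAL.cdf(-z_high), 2.0 * STD_NORMAL.cdf(-z_low)


if __name__ == "__main__":
    print(p_interval(0.001))
```

N does not appear anywhere in the calculation, which is the point made above: the interval is conditional on the initial p-value only.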

Geoff Cumming
g.cumming@latrobe.edu.au

Hi Tim, indeed, we could say the dataset contain l...

2016-07-18T12:12:07.100+02:00

Hi Tim, indeed, we could say the dataset contains little evidence. P-values would vary as much (I explicitly say this example is chosen to show the variability, just as when Cumming uses 50% power).

If you sample until you get a BF < 0.1, whether or not you will get misleading evidence depends on your max sample size, and when and where you look. It's simply not easy to know, because there are no formal error control mechanisms. But if 75 people in each cell is your max, and the effect size is small, you can never get strong evidence favoring HA. The real problem here is the bad prior. You should use a better prior to solve this problem.

Interesting post, thanks Daniel! As you know my kn...

2016-07-18T12:03:11.825+02:00

Interesting post, thanks Daniel! As you know my knowledge of Bayesian stats is very rough, as I'm still in the early stages of learning it. Nevertheless, some comments:

You say that:
25% of BF10 < 1/3 (which can be interpreted as support for the null)
25% of BF10 > 3 (which can be interpreted as support for the alternative)
50% of BF10 are between 1/3 and 3 (inconclusive)

Eyeballing the graph, it seems the mean and median BF10 are between 2 and 3.

So basically, what you've shown is that for a dataset with little evidence, you'll never get substantive evidence (BF10 > 10), and what you will find is fairly random and often inconclusive. While this is good to know, it does not seem surprising, nor an argument for or against BF?
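A rough way to play with this is sketched below. To stay within the Python standard library it does NOT use the default Bayes factor from the post. It assumes a one-sample z-test with known SD of 1 and a N(0, prior_sd^2) prior on the effect under H1, so the proportions will not reproduce the 25/50/25 split above. All names and parameter values are illustrative.

```python
import random
from math import exp, sqrt


def bf10_normal_prior(sample_mean: float, n: int, prior_sd: float = 1.0) -> float:
    """BF10 for H0: effect = 0 vs H1: effect ~ N(0, prior_sd^2), known SD of 1.

    Ratio of the marginal densities of the sample mean under H1 and H0.
    """
    var_h0 = 1.0 / n
    var_h1 = prior_sd ** 2 + 1.0 / n
    return sqrt(var_h0 / var_h1) * exp(
        0.5 * sample_mean ** 2 * (1.0 / var_h0 - 1.0 / var_h1)
    )


def dance_of_bayes_factors(true_effect: float, n: int,
                           n_studies: int = 10_000, seed: int = 1) -> dict:
    """Share of simulated studies with BF10 < 1/3, between 1/3 and 3, and > 3."""
    rng = random.Random(seed)
    counts = {"null": 0, "inconclusive": 0, "alternative": 0}
    for _ in range(n_studies):
        # The mean of n unit-variance observations is N(true_effect, 1/n).
        sample_mean = rng.gauss(true_effect, 1.0 / sqrt(n))
        bf10 = bf10_normal_prior(sample_mean, n)
        if bf10 < 1 / 3:
            counts["null"] += 1
        elif bf10 > 3:
            counts["alternative"] += 1
        else:
            counts["inconclusive"] += 1
    return {label: count / n_studies for label, count in counts.items()}


if __name__ == "__main__":
    print(dance_of_bayes_factors(true_effect=0.3, n=75))
```

Changing `true_effect`, `n`, or `prior_sd` shows how strongly the split depends on the amount of evidence in the data and on the prior.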

On the latter point: how would p-values do in this case?

Like you said: don't interpret BF as absolute evidence, or rely exclusively on dichotomous interpretations of BF.

As a researcher, I would more typically sample until I get a specific BF, say BF10 > 10. I will try to simulate that myself, but I'm fairly certain that with such a design errors are extremely rare, although you might run into practical issues with the required sample size.

All in all, I guess that given your focus on controlling error rates, setting alpha very low and setting the BF threshold high are both perfectly fine. The alpha has the advantage of an explicit (max) level of long-run error, while the BF has the advantage of quantifying evidence.
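A sketch of such a simulation, under simplifying assumptions: a one-sample z-test with known SD of 1, a N(0, 1) prior on the effect under H1 (not the default Bayes factor from the post), a check after every observation from `min_n` onwards, and a symmetric stopping rule at BF10 > 10 or BF10 < 1/10. All names and settings are illustrative.

```python
import random
from math import exp, sqrt


def bf10_normal_prior(sample_mean: float, n: int, prior_sd: float = 1.0) -> float:
    """BF10 for H0: effect = 0 vs H1: effect ~ N(0, prior_sd^2), known SD of 1."""
    var_h0 = 1.0 / n
    var_h1 = prior_sd ** 2 + 1.0 / n
    return sqrt(var_h0 / var_h1) * exp(
        0.5 * sample_mean ** 2 * (1.0 / var_h0 - 1.0 / var_h1)
    )


def sample_until_decisive(true_effect: float, max_n: int, rng: random.Random,
                          threshold: float = 10.0, min_n: int = 10):
    """Add one observation at a time; stop at BF10 > threshold or < 1/threshold."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_effect, 1.0)
        if n < min_n:
            continue
        bf10 = bf10_normal_prior(total / n, n)
        if bf10 > threshold:
            return "alternative", n
        if bf10 < 1.0 / threshold:
            return "null", n
    return "undecided", max_n


def stopping_outcomes(true_effect: float, max_n: int,
                      n_studies: int = 1000, seed: int = 1) -> dict:
    """Share of simulated studies ending in each outcome."""
    rng = random.Random(seed)
    counts = {"alternative": 0, "null": 0, "undecided": 0}
    for _ in range(n_studies):
        outcome, _stopped_at = sample_until_decisive(true_effect, max_n, rng)
        counts[outcome] += 1
    return {label: count / n_studies for label, count in counts.items()}


if __name__ == "__main__":
    print("H0 true:     ", stopping_outcomes(true_effect=0.0, max_n=500))
    print("small effect:", stopping_outcomes(true_effect=0.2, max_n=500))
```

Varying `max_n` and `true_effect` shows how the share of undecided and misleading outcomes depends on the maximum sample size.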