Comments on The 20% Statistician: The Null Is Always False (Except When It Is True)

It's not super-clear that Cohen wasn't. Me...

2019-04-10T00:22:45.369+02:00

It's not super-clear that Cohen wasn't. Meehl, after all, didn't talk much about experimental randomized interventions, and he was called on it by Oakes (https://www.gwern.net/docs/statistics/1975-oakes.pdf) who gave as a counter-example the now-forgotten OEO 'performance contracting' school reform experiment (https://www.gwern.net/docs/sociology/1972-page.pdf) where despite randomization of dozens of schools with ~33k students, not a single null could be rejected.

The whole testing part of the blog seems pointless...

2017-01-16T00:42:22.952+01:00

The whole testing part of the blog seems pointless to me. You just randomly assigned participants to two conditions and didn't find an effect. Big deal. Pick some difference that you could use to assign people and find there is no difference in the population and then you'd be able to contradict the argument. Cohen certainly wasn't foolish enough to not recognize that randomly assigned individuals will show no effect.

Actually, to my mind what's really going on is...

2015-07-22T19:55:35.251+02:00

Actually, to my mind what's really going on is that "The Null is Always False (Even When It is True)."

What do I mean by this? What I mean - and what I think Cohen means by the null always being false in the real world - is that the real world is too complex for the null hypothesis to be true. By which I mean - the null hypothesis can be really, truly TRUE - but if you take a large enough sample size you will find an effect "showing otherwise" to whichever p-value you want. This is not because the null hypothesis is false - but rather because your experiment is imperfect.

We use controls to minimize confounds, and good study design does a good job of making sure that any confounds remaining are very small. But it is practically impossible to ELIMINATE confounds. This is what Cohen means by "a stray electron." Cohen is basically saying, you generate two distributions with the same computer code and if you add enough zeroes to your n eventually a significant difference pops out. And he does not mean the p=.05, you-got-unlucky-1-time-in-20 kind of difference. He means a real difference will exist. You ran the same code, but at different times, and some minute physical difference in the computer run conditions caused the ever so slightly imperfect random number generation to produce ever so slight (but real!) differences between the two runs.

What Cohen is basically saying is that you can't control conditions perfectly. There's no such thing as a perfect experiment, even in simulation studies. You can't control everything, and when it comes to chaotic real world systems everything has SOME effect. Just the fact that you test subjects at different dates - it's a week later now, some world event happened, it changed the subject's thoughts... there's going to be a real effect. Maybe it's .00001, but there's going to be an effect, and if you have enough n, and your power actually increases with n, then you'll eventually detect it.
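
A rough way to see the "enough n" point is to compute the power of a two-sided, two-sample z-test for a tiny standardized difference. This is only a sketch under assumptions that are not in the comment above: equal group sizes, unit variance in both groups, a normal approximation, and a hypothetical confound of d = 0.01. The function name power_two_sample_z is made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test (assumes unit variance, equal n).

    d is the true standardized mean difference, here standing in for a
    tiny uncontrolled confound (an illustrative assumption).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic: d divided by the standard error sqrt(2/n).
    shift = d * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Power for a hypothetical confound of d = 0.01 at growing sample sizes.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(power_two_sample_z(0.01, n), 3))
```

With d exactly 0 the rejection rate stays at alpha no matter how large n gets; with any nonzero d, however small, it climbs towards 1 as n grows, which is the behaviour the comment describes.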

If you model the level of confounds in your experiment as a continuous random variable, what is the probability that you just happen to hit exactly 0? It doesn't even matter which continuous distribution it is, the chance of hitting EXACTLY 0 to perfect precision is, in fact, EXACTLY 0. The only thing you're sure about is that your experiment isn't perfect.

The point being... if you get p=.000000001, on a difference of .5%, and then you say you reject the null hypothesis because it's just so UNLIKELY... you're in for some pain. Because what you've detected isn't that the null isn't true, what you've detected is the imperfection in your ability to create an experimental setup that actually tests the theoretical null.

The experimental null you're testing is an APPROXIMATION of the theoretical null. You can not reasonably expect to ever create an experiment with NO confounds of any arbitrarily small magnitude.

The theoretical null may or may not be true. The experimental null is ALWAYS false, in the limit of large n. You can not control for every confound - you can not even conceive of every confound!

But the problem is when people ignore the fact that experimental or systematic error can only be reduced, not eliminated, and then go on to think that p=.000000001 at a minuscule effect size is strong evidence against the null. But what a Bayesian says is, "I expect (have a prior) that even if the theoretical null is true, there's going to be some tiny confound I couldn't control, so if I see a very small effect, it's most likely a confound." Unless you SPECIFICALLY hypothesized (had a prior for!) a very small effect size, finding a small effect is strong evidence FOR the null regardless of the p value!
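
One way to put numbers on that Bayesian reading is a toy Bayes factor comparing "theoretical null plus a small confound" against "a real effect of unknown size". Everything specific here is an assumption made for illustration, not something stated in the comment: the normal priors, the prior scales tau_confound = 0.01 and tau_effect = 0.5, the observed difference of 0.005, the sample size, and the function names.

```python
from math import erfc, sqrt
from statistics import NormalDist


def two_sided_p(d_hat, n_per_group):
    """Two-sided p-value for the observed difference under an exact point null."""
    z = d_hat / sqrt(2 / n_per_group)
    return erfc(abs(z) / sqrt(2))


def bayes_factor_confound_vs_effect(d_hat, n_per_group,
                                    tau_confound=0.01, tau_effect=0.5):
    """Bayes factor for 'null + small confound' over 'real effect'.

    Both models put a zero-centred normal prior on the true difference;
    only the prior scale differs (both scales are illustrative assumptions).
    """
    se = sqrt(2 / n_per_group)
    # Marginal distribution of d_hat under each model: Normal(0, se^2 + tau^2).
    m_null = NormalDist(0.0, sqrt(se**2 + tau_confound**2))
    m_effect = NormalDist(0.0, sqrt(se**2 + tau_effect**2))
    return m_null.pdf(d_hat) / m_effect.pdf(d_hat)


d_hat, n = 0.005, 10_000_000
print("p-value:", two_sided_p(d_hat, n))
print("BF (null + confound vs effect):", bayes_factor_confound_vs_effect(d_hat, n))
```

Under these made-up priors the p-value is vanishingly small while the Bayes factor still favours the "null plus confound" model. The result depends heavily on the chosen prior scales, so treat it as an illustration of the argument rather than a general rule.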

Because our experiments AREN'T perfect. Because the null is always false (even when it's true).

Hi Felix, I think you are completely right. If NHS...

2014-06-14T08:05:22.799+02:00

Hi Felix, I think you are completely right. If NHST is to blame for something, it's that people do not generate well-thought-through alternative models to test. Knowing that the null hypothesis is rejected is at best a first step, and in some cases (such as when you examine gender effects) not even a very interesting first step. Although I think I point out in the blog that NHST is limited, I also feel that people have too easily dismissed it as completely useless because they've heard 'the null is always false'. As a first step (perhaps to decide whether it's even necessary to start to create alternative models) it can play a role.

Hi guys, my 2 cents: every model is an abstract...

2014-06-13T08:56:18.498+02:00

Hi guys, my 2 cents:

*every* model is an abstraction and simplification of reality, so of course the point nil (as any other point hypothesis) is "false". But, just as a map can be a useful summary of the landscape (although it is wrong in most points), a point hypothesis can be a useful summary of my belief, such as "There is no effect". So for me it can (sometimes) make sense to assume a point hypothesis as a wrong, but useful, simplification of reality.

More important: As soon as you explicitly commit yourself to an alternative hypothesis H1 (either a point H1, or a spread-out H1 as in Bayes factors), you can compare the predictive success of H0 and H1 against each other. Assume, for example, the true ES is 0.02, and your sample has an estimated ES of .025.

Which hypothesis predicts data better: delta = 0, or delta = 0.4? Certainly the former.

When your CI shrinks, it will certainly exclude the null value with large enough samples. Then you conclude that it is improbable that the population has delta=0. But, and here's the point, it is even *more* improbable that the population has delta=0.4! So if you compare the likelihoods of both hypotheses, both are improbable (on an absolute scale), but the H0 still is more probable than the H1 (although you would reject H0 using the CI approach). So the conclusions at large samples are different: The CI approach says "There is a non-zero effect (although very small)". The Bayes factor approach says "Data fit better to the hypothesis 'There is no effect' than to the hypothesis 'There is an effect'".
And even with Bayes factors that allow any possible H1 (even very small ES), in the example case the BF will keep pointing towards the H0 for much longer than the CI approach does - which is a good property IMHO.
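
The comparison of the two point hypotheses can be sketched numerically. The observed ES of .025 and the values delta = 0 and delta = 0.4 come from the example above; the normal approximation with standard error sqrt(2/n), the per-group sample size of 100,000, and the function name are assumptions added for the illustration.

```python
from math import sqrt
from statistics import NormalDist


def compare_point_hypotheses(d_hat, n_per_group, delta0=0.0, delta1=0.4):
    """Return the 95% CI for d_hat and the log likelihood ratio of delta0 vs delta1.

    Assumes d_hat is approximately Normal(delta, sqrt(2 / n_per_group)).
    """
    se = sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(0.975)
    ci = (d_hat - z_crit * se, d_hat + z_crit * se)
    # log L(delta0) - log L(delta1); the normalising constants cancel.
    log_lr = ((d_hat - delta1) ** 2 - (d_hat - delta0) ** 2) / (2 * se ** 2)
    return ci, log_lr


ci, log_lr = compare_point_hypotheses(d_hat=0.025, n_per_group=100_000)
print("95% CI:", ci)                            # excludes 0 at this sample size
print("log LR, delta=0 vs delta=0.4:", log_lr)  # positive means delta=0 fits better
```

At this sample size the interval excludes zero, so the CI approach rejects H0, yet the likelihood ratio favours delta = 0 over delta = 0.4 by a wide margin.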

https://dl.dropboxusercontent.com/u/4472780/blog-pic/p_vs_BF.jpg

So it's less about "Which hypothesis is *true*?" (in the end virtually all of them are false), but rather "Which hypothesis is the best available description of the phenomenon?".

You are right that estimation can be very useful. ...

2014-06-13T07:45:22.767+02:00

You are right that estimation can be very useful. It becomes increasingly useful, the better the model you have. Without a good model, it becomes difficult to interpret data in light of hypotheses. So NHST can be seen as a model (instead of just reporting the effect size estimate by itself) even though it is the most minimally sufficient model you can use.

Hi Daniel - I suppose your argument that a true re...

2014-06-12T22:46:07.189+02:00

Hi Daniel - I suppose your argument that a true relationship will vary randomly (assuming that the population changes each second) is tenable. (I could imagine someone else arguing that each time the population changes then the population effect size changes, but that doesn't really help us much as that is like shooting at a moving target.) However, this is why Cohen's argument still makes sense to me: Why assume the effect size is anything (nil or null)? Why not just try to estimate the thing and be happy with that?

Hi Ryne - I agree we don't need significance t...

2014-06-12T18:04:49.162+02:00

Hi Ryne - I agree we don't need significance tests if we can measure the entire population. But you were not convinced by my argument that the 'true' relationship varies continuously around a value (either 0 or an effect size) and that we therefore should not worry about it being exactly 0 at any moment of the day or 0.00002 - but that we can just assume it is 0 and test against that assumption? Why not?

Sorry Daniel, I fail to see how this successful re...

2014-06-12T17:07:54.929+02:00

Sorry Daniel, I fail to see how this successfully refutes Cohen's point. If there is a population that exists, there is a true relationship between any two variables (whether both are measured or one is manipulated). If that true relationship is r = .00 (to the last decimal point with no rounding) then the null -- actually more properly the nil -- hypothesis is true. Otherwise it is false.

All this business with significance testing has to do with samples. When one has the population, one can put significance tests away.