Thursday, June 12, 2014

The Null Is Always False (Except When It Is True)

An often-heard criticism of null-hypothesis significance testing is that the null is always false. The idea is that the average difference between two samples will never be exactly zero (there will practically always be a tiny difference, even if it is only 0.001). Furthermore, if the sample size is large enough, even tiny differences can be statistically significant. Both statements are correct, but they do not mean the null is never true.

The null-hypothesis assumes the difference between the means of the two populations is exactly zero. However, the means of samples drawn from these two populations vary from sample to sample (and the less data you have, the greater this variance). The difference between the two sample means only gets really, really close to zero as the sample size approaches infinity. This long-run behavior is at the core of frequentist approaches to statistics. It is therefore not a problem that the observed difference in your sample isn’t exactly zero, as long as the difference in the population is zero.
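You can see this in a quick simulation (a minimal sketch; the means, standard deviation, and sample sizes below are arbitrary choices, not values from any real dataset). Two samples are drawn from identical populations, so the true difference is exactly zero, and the observed difference fluctuates around zero, shrinking as the sample size grows:

```r
# Two samples from identical populations: the true difference is exactly zero,
# but the observed difference only approaches zero as the sample size grows.
set.seed(42)
for (n in c(20, 200, 2000, 200000)) {
  x <- rnorm(n, mean = 100, sd = 15)
  y <- rnorm(n, mean = 100, sd = 15)
  cat("n per group =", n, "; observed difference =", round(mean(x) - mean(y), 4), "\n")
}
```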

Some researchers, such as Cohen (1990), have expressed doubt that the difference in the population is ever exactly zero. As Cohen says:

The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is always false in the real world. It can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null is always false, what’s the big deal about rejecting it? (p. 1308).

One ‘big deal’ about rejecting it is that to reject a very small difference (e.g., a Cohen’s d of 0.001) you need a sample size of at least 31 million participants to have a decent chance of detecting the difference as statistically significant in a t-test. With such sample sizes, almost all statistics we use (e.g., checks for normality) break down and start to return meaningless results.
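You can check this number with a standard power analysis. A sketch using R’s built-in power.t.test, assuming a two-sided independent-samples t-test, an alpha of .05, and 80% power (conventional values; a slightly different set of assumptions gives a slightly different number):

```r
# Sample size needed to detect d = 0.001 with 80% power in a two-sided
# independent-samples t-test at alpha = .05.
power.t.test(delta = 0.001, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# n is roughly 15.7 million per group, i.e. over 31 million participants in total.
```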

Another ‘big deal’ is that we don’t know whether the observed difference will remain equally large as the sample size increases (as should happen when it is an accurately measured true effect), or whether it will become smaller and smaller, without ever becoming statistically significant, the more measurements are added (as should happen when there is actually no effect). Hagen (1997) explains this latter situation in his article ‘In Praise of the Null Hypothesis Statistical Test’ to prevent people from mistakenly assuming that every observed difference will become significant if you simply add participants. He writes:

‘Thus, although it may appear that larger and larger Ns are chasing smaller and smaller differences, when the null is true, the variance of the test statistic, which is doing the chasing, is a function of the variance of the differences it is chasing. Thus, the "chaser" never gets any closer to the "chasee."’
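A small simulation (my own sketch, with arbitrary sample sizes) makes this concrete: under a true null the observed difference (the "chasee") shrinks as n grows, but the standard error (the "chaser") shrinks at the same rate, so the t-statistic never creeps towards significance.

```r
# Under a true null, both the mean difference (the "chasee") and the standard
# error (the "chaser") shrink as n grows, so the t-statistic stays small.
set.seed(123)
for (n in c(100, 10000, 1000000)) {
  x <- rnorm(n)
  y <- rnorm(n)
  d  <- mean(x) - mean(y)
  se <- sqrt(var(x) / n + var(y) / n)
  cat(sprintf("n = %7d  difference = %9.6f  SE = %9.6f  t = %5.2f\n", n, d, se, d / se))
}
```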
 

What’s a ‘real’ effect?

The more important question is whether it is true that there are always real differences in the real world, and what the ‘real world’ is. Let’s consider the population of people in the real world. While you read this sentence, some individuals in this population have died, and some were born. For most questions in psychology, the population is surprisingly similar to an eternally running Monte Carlo simulation. Even if you could measure all people in the world in a millisecond, and the test-retest correlation was perfect, the answer you would get now would be different from the answer you would get in an hour. Frequentists (the people who use NHST) are not specifically interested in the exact value now, or in one hour, or next Thursday, but in the average value in the long run. The value in the real world today might never be exactly zero, but it is also never any other fixed value, because it is continuously changing. If we want to make generalizable statements about the world, I think the fact that the null-hypothesis is never precisely true at any specific moment is not a problem. I’ll ignore more complex questions for now, such as how we can establish whether effects vary over time.

When perfect randomization to conditions is possible, and the null-hypothesis is true, every p-value is just as likely as any other. There is a great blog post by Jim Grange explaining, with simulations in R, that p-values are uniformly distributed when the null is true. Take the script from his blog and change the sample size (e.g., to 100000 in each group), or change the variances, and as long as the means of the two groups remain identical, p-values will be uniformly distributed. Although it is theoretically possible that differences randomly fluctuate around zero in the long term, some researchers have argued this is often not true. Especially in correlational research, or in any situation where participants are not randomly assigned to conditions, this is a real problem.
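If you don’t want to download the script, a minimal version of such a simulation (my own sketch, with arbitrary settings) looks like this; the histogram is roughly flat, and roughly 5% of the p-values fall below .05, exactly as expected when the null is true:

```r
# Simulate many two-sample t-tests when the null is true (identical population means):
# the resulting p-values are uniformly distributed between 0 and 1.
set.seed(1)
p <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)
hist(p, breaks = 20, main = "p-values when the null is true", xlab = "p-value")
mean(p < .05)  # close to 0.05: 5% of the tests are 'significant' by chance alone
```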

Meehl talks about how in psychology every individual-difference variable (e.g., a trait, status, or demographic variable) correlates with every other variable, which means the null is practically never true in such research. In these situations, testing against the null-hypothesis is not so much meaningless as uninformative. If everything correlates with everything else, you need to create good models, and test those. A simple null-hypothesis significance test will not get you very far. I agree.



Random Assignment vs. Crud

To illustrate when NHST can be used as a source of information in large samples, and when it is not informative in large samples, I’ll analyze data from a large dataset with 6344 participants from the Many Labs project. I’ve analyzed 10 dependent variables to see whether they were influenced by A) gender, and B) assignment to the high or low anchoring condition in the first study. Gender is a measured individual-difference variable, not a manipulated variable, and might thus be affected by what Meehl calls the crud factor. Here, I want to illustrate that this is A) probably often true for individual-difference variables, but perhaps not always, and B) probably never true when analyzing differences between groups individuals were randomly assigned to.

You can download the CleanedData.sav Many Labs Data here, and my analysis syntax here. I perform 8 t-tests and 2 chi-square tests on 10 dependent variables, where the factor is either gender or the random assignment to the high or low condition for the first question in the anchoring paradigm. You can download the output here. When we analyze the 10 dependent variables as a function of the anchoring condition, none of the differences are statistically significant (even though there are more than 6000 participants). You can play around with the script, repeating the analysis for the conditions related to the other three anchoring questions (remember to correct for multiple comparisons if you perform many tests), and see how randomization does a pretty good job at returning non-significant results even in very large samples. If the null is always false, it is remarkably difficult to reject. Obviously, when we analyze the answer people gave to the first anchoring question, we find a huge effect of the high vs. low anchoring condition they were randomly assigned to. Here, NHST works. There is probably something going on. If the anchoring effect were a completely novel phenomenon, this would be an important first finding, to be followed by replications and extensions, and finally model building and testing.
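My analyses were run in SPSS, but the general approach looks something like the sketch below in R (the column names are placeholders for illustration; the actual variable names in CleanedData.sav differ):

```r
# Sketch of the analysis approach (column names are placeholders, not the real ones).
library(haven)
manylabs <- read_sav("CleanedData.sav")

# A dependent variable as a function of the randomly assigned anchoring condition:
t.test(sysjust ~ anchoring1_condition, data = manylabs)

# The same dependent variable as a function of gender (a measured variable):
t.test(sysjust ~ gender, data = manylabs)

# For a dichotomous dependent variable, a chi-square test:
chisq.test(table(manylabs$gender, manylabs$gamblers_fallacy))
```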

The results change dramatically if we use gender as a factor. There are gender effects on the dependent variables related to quote attribution, system justification, the gambler’s fallacy, imagined contact, the explicit evaluation of arts and math, and the norm of reciprocity. There are no convincing differences in political identification (as conservative or liberal), on the response scale manipulation, or on gain vs. loss framing (the latter yields p = .025, but with around 5500 participants such a relatively high p-value is arguably stronger support for the null-hypothesis than for the alternative hypothesis). It’s surprising that the null-hypothesis (gender does not influence the responses participants give) is rejected for seven out of ten effects. Personally (perhaps because I’ve got very little expertise in gender effects) I was actually extremely surprised, even though the effects are small (with Cohen’s ds of around 0.09). This, ironically, shows that NHST works: I’ve learned gender effects are much more widespread than I would have thought before I wrote this blog post.


It also shows we have learned very little, because NHST applied to gender differences does not really tell us anything about WHY gender influences all these different dependent variables. We need better models to really know what’s going on. For the variables where there was no significant effect (such as political orientation), it is risky to conclude gender is irrelevant: perhaps there are moderators, and gender and political identification are related after all.


Conclusion

We can reject the hypothesis that the null is always false. Sweeping statements about how the null-hypothesis is always false, and thus how null-hypothesis significance testing is a meaningless endeavor, are only partially accurate. The null hypothesis is always false when it is false, but it’s true when it’s true. It is difficult to know whether a non-significant difference reflects a Type 2 error (there is an effect, but it will only become significant if the statistical power is increased, for example by collecting more data) or whether the null is actually true. Null-hypothesis significance testing cannot answer this question. NHST can only reject the null-hypothesis, and when observed differences are not statistically significant, the outcome of a significance test necessarily remains inconclusive. But assuming the null-hypothesis is true, and testing against it, is a useful statistical tool in exploratory research, at least in experiments where random assignment to conditions is possible.

9 comments:

  1. Sorry Daniel, I fail to see how this successfully refutes Cohen's point. If there is a population that exists, there is a true relationship between any two variables (whether both are measured or one is manipulated). If that true relationship is r = .00 (to the last decimal point with no rounding) then the null -- actually more properly the nil -- hypothesis is true. Otherwise it is false.

    All this business with significance testing has to do with samples. When one has the population, one can put significance tests away.

    1. Hi Ryne - I agree we don't need significance tests if we can measure the entire population. But you were not convinced by my argument that the 'true' relationship varies continuously around a value (either 0 or an effect size), and that we therefore should not worry about whether it is exactly 0 or 0.00002 at any moment of the day - but that we can just assume it is 0 and test against that assumption? Why not?

    2. Hi Daniel - I suppose your argument that a true relationship will vary randomly (assuming that population changes each second) is tenable. (I could imagine someone else arguing that each time the population changes then the population effect size changes, but that doesn't really help us much as that is like shooting at a moving target.) However, this is why Cohen's argument still makes sense to me: Why assume the effect size is anything (nil or null)? Why not just try to estimate the thing and be happy with that?

    3. You are right that estimation can be very useful. It becomes increasingly useful, the better the model you have. Without a good model, it becomes difficult to interpret data in light of hypotheses. So NHST can be seen as a model (instead of just reporting the effect size estimate by itself), even though it is about the most minimal model you can use.

  2. Hi guys, my 2 cents:

    *every* model is an abstraction and simplification of reality, so of course the point nil (as any other point hypothesis) is "false". But, just as a map can be a useful summary of the landscape (although it is wrong in most points), a point hypothesis can be a useful summary of my belief, such as "There is no effect". So for me it can (sometimes) make sense to assume a point hypothesis as a wrong, but useful, simplification of reality.

    More important: As soon as you explicitly commit yourself to an alternative hypothesis H1 (either a point H1, or a spread-out H1 as in Bayes factors), you can compare the predictive success of H0 and H1 against each other. Assume, for example, the true ES is 0.02, and your sample has an estimated ES of .025.

    Which hypothesis predicts data better: delta = 0, or delta = 0.4? Certainly the former.
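    To make this concrete, here is a quick numerical check (the sample size of 500 per group is an arbitrary choice for illustration; the likelihoods are evaluated with the noncentral t density):

    ```r
    # Which hypothesis predicts an observed d of .025 better: delta = 0 or delta = 0.4?
    n     <- 500                     # per group (arbitrary illustration)
    d_obs <- 0.025
    t_obs <- d_obs * sqrt(n / 2)     # observed t-statistic for a two-sample t-test
    df    <- 2 * n - 2

    lik_H0 <- dt(t_obs, df, ncp = 0)                 # likelihood under delta = 0
    lik_H1 <- dt(t_obs, df, ncp = 0.4 * sqrt(n / 2)) # likelihood under delta = 0.4

    lik_H0 / lik_H1  # enormously larger: delta = 0 predicts these data far better than delta = 0.4
    ```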

    When your CI shrinks, it will certainly exclude the null value with large enough samples. Then you conclude that it is improbable that the population has delta=0. But, and here's the point, it is even *more* improbable that the population has delta=0.4! So if you compare the likelihoods of both hypotheses, both are improbable (on an absolute scale), but the H0 still is more probable than the H1 (although you would reject H0 using the CI approach). So the conclusions at large samples are different: The CI approach says "There is a non-zero effect (although very small)". The Bayes factor approach says "Data fit better to the hypothesis 'There is no effect' than to the hypothesis 'There is an effect'".
    And even with Bayes factors that allow any possible H1 (even very small effect sizes), in the example case the BF will keep pointing towards the H0 much longer than the CI approach - which is a good property IMHO.

    https://dl.dropboxusercontent.com/u/4472780/blog-pic/p_vs_BF.jpg

    So it's less about "Which hypothesis is *true*?" (at the end virtually all of them are false), but rather "Which hypothesis is the best available description of the phenomenon?".

    1. Hi Felix, I think you are completely right. If NHST is to blame for something, it's that people do not generate well-thought-through alternative models to test. Knowing that the null-hypothesis is rejected is at best a first step, and in some cases (such as when you examine gender effects) not even a very interesting first step. Although I think I point out in the blog that NHST is limited, I also feel that people have too easily dismissed it as completely useless because they've heard 'the null is always false'. As a first step (perhaps to decide whether it's even necessary to start creating alternative models) it can play a role.

  3. Actually, to my mind what's really going on is that "The Null is Always False (Even When It is True)."

    What do I mean by this? What I mean - and what I think Cohen means by the null always being false in the real world - is that the real world is too complex for the null hypothesis to be true. By which I mean - the null hypothesis can be really, truly TRUE - but if you take a large enough sample size you will find an effect "showing otherwise" to whichever p-value you want. This is not because the null hypothesis is false - but rather because your experiment is imperfect.

    We use controls to minimize confounds, and good study design does a good job of making sure that any confounds remaining are very small. But it is practically impossible to ELIMINATE confounds. This is what Cohen means by "a stray electron." Cohen is basically saying, you generate two distributions with the same computer code, and if you add enough zeroes to your n eventually a significant difference pops out. And he does not mean a p = .05, "oh, you got 1-in-20 unlucky" kind of difference. He means a real difference will exist. You ran the same code, but at different times, and some minute physical difference in the run conditions caused the ever-so-slightly imperfect random number generation to produce ever so slight (but real!) differences between the two runs.

    What Cohen is basically saying is that you can't control conditions perfectly. There's no such thing as a perfect experiment, even in simulation studies. You can't control everything, and when it comes to chaotic real world systems everything has SOME effect. Just the fact that you test subjects at different dates - it's a week later now, some world event happened, it changed the subject's thoughts... there's going to be a real effect. Maybe it's .00001, but there's going to be an effect, and if you have enough n, and your power actually increases with n, then you'll eventually detect it.

    If you model the level of confounds in your experiment as a random variable, what is the probability that you just happen to hit exactly 0? It doesn't even matter what the probability distribution is, the chance of hitting EXACTLY 0 to perfect precision is, in fact, EXACTLY 0. The only thing you're sure about is that your experiment isn't perfect.

    The point being... if you get p=.000000001, on a difference of .5%, and then you say you reject the null hypothesis because it's just so UNLIKELY... you're in for some pain. Because what you've detected isn't that the null isn't true, what you've detected is the imperfection in your ability to create an experimental setup that actually tests the theoretical null.

    The experimental null you're testing is an APPROXIMATION of the theoretical null. You can not reasonably expect to ever create an experiment with NO confounds of any arbitrarily small magnitude.

    The theoretical null may or may not be true. The experimental null is ALWAYS false, in the limit of large n. You can not control for every confound - you can not even conceive of every confound!

    But the problem is when people ignore the fact that experimental or systematic error can only be reduced, not eliminated, and then go on to think that p=.000000001 at a minuscule effect size is strong evidence against the null. But what a Bayesian says is, "I expect (have a prior) that even if the theoretical null is true, there's going to be some tiny confound I couldn't control, so if I see a very small effect, it's most likely a confound." Unless you SPECIFICALLY hypothesized (had a prior for!) a very small effect size, finding a small effect is strong evidence FOR the null regardless of the p value!

    Because our experiments AREN'T perfect. Because the null is always false (even when it's true).

  4. The whole testing part of the blog seems pointless to me. You just randomly assigned participants to two conditions and didn't find an effect. Big deal. Pick some pre-existing difference that you could use to assign people, show there is no difference in the population, and then you'd be able to contradict the argument. Cohen certainly wasn't so foolish as to fail to recognize that randomly assigned individuals will show no effect.

    1. It's not super-clear that Cohen wasn't. Meehl, after all, didn't talk much about experimental randomized interventions, and he was called on it by Oakes (https://www.gwern.net/docs/statistics/1975-oakes.pdf) who gave as a counter-example the now-forgotten OEO 'performance contracting' school reform experiment (https://www.gwern.net/docs/sociology/1972-page.pdf) where despite randomization of dozens of schools with ~33k students, not a single null could be rejected.
