When testing predictions in empirical research, researchers
often report whether their results are statistically different from 0. For
example, a researcher might be interested in whether the use of a cell phone
increases the likelihood of getting into a car accident compared to not using a
cell phone. A researcher might ask one group of individuals to use cell phones
while driving in a driving simulator, while a control group does not use cell
phones. The researcher might predict cell phone users get into more accidents,
and test this prediction by comparing whether the difference between the two
groups in the experiment is statistically different from zero. This is typically referred to as
null-hypothesis significance testing (NHST). The ‘significance’ part of
this name is a misnomer: what we understand as the ‘significance’ of a
finding in everyday English depends on its theoretical or practical
importance and has very little to do with statistics. We will therefore
refer to such tests as ‘null-hypothesis testing’, and use ‘statistical
difference’ for what is sometimes referred to in the literature as a
‘significant finding’ or a ‘statistically significant finding’.
Assume we ask two groups of 10 people how much they liked the extended director’s cut version of the Lord of the Rings (LOTR) trilogy. The first group consists of my friends, and the second group consists of friends of my wife. Our friends rate the trilogy on a score from 1 to 10. We can calculate the average rating by my friends, which is 8.7, and the average rating by my wife’s friends, which is 7.7. The difference between the groups is 1 point.
            My Friends   My Wife’s Friends
Friend 1       9.00           9.00
Friend 2       7.00           6.00
Friend 3       8.00           7.00
Friend 4       9.00           8.00
Friend 5       8.00           7.00
Friend 6       9.00           9.00
Friend 7       9.00           8.00
Friend 8      10.00           8.00
Friend 9       9.00           8.00
Friend 10      9.00           7.00
M              8.70           7.70
SD             0.82           0.95
To get these data in R, we use:
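A minimal way to enter the ratings from the table as vectors (the names `myfriends` and `mywifesfriends` match the `t.test()` call later in the post):

```r
# Ratings from the table above, entered as numeric vectors
myfriends <- c(9, 7, 8, 9, 8, 9, 9, 10, 9, 9)
mywifesfriends <- c(9, 6, 7, 8, 7, 9, 8, 8, 8, 7)

mean(myfriends)               # 8.7
mean(mywifesfriends)          # 7.7
round(sd(myfriends), 2)       # 0.82
round(sd(mywifesfriends), 2)  # 0.95
```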
We can see in the table that the ratings vary, not just between the groups, but also within the groups. The average difference between groups is 1 scale point, but the ratings also differ between friends within each group. To take a look at the distribution of these scores, let’s plot them in a density plot (closely related to a histogram).
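One way to draw this density plot in base R (a minimal sketch, reusing the plotting approach of the simulation script near the end of the post; the vectors repeat the ratings from the table so the snippet runs on its own):

```r
# Ratings from the table (repeated so this snippet runs on its own)
myfriends <- c(9, 7, 8, 9, 8, 9, 9, 10, 9, 9)
mywifesfriends <- c(9, 6, 7, 8, 7, 9, 8, 8, 8, 7)

# Kernel density estimates for each group, drawn in one frame
d1 <- density(myfriends)
d2 <- density(mywifesfriends)
plot(range(d1$x, d2$x), range(d1$y, d2$y), type = "n",
     xlab = "LOTR Rating", ylab = "Density")
polygon(d1, col = rgb(0, 0, 1, 1/4), border = "black")
polygon(d2, col = rgb(1, 0, 0, 1/4), border = "black")
```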
In null-hypothesis testing we try to answer
this question by calculating the probability of observing a specific, or more
extreme, outcome of a test (i.e., a difference in movie ratings of 1 point or
more) assuming that the null hypothesis is true (i.e., there is no real
difference between how much my friends and my wife’s friends like the extended
director’s cut of LOTR). This probability is called the p-value.
The null-hypothesis assumes that if we asked an infinite number of my friends and an infinite number of my wife’s friends how much they like LOTR, the difference between these huge groups would be exactly 0. However, in a small sample of my friends and my wife’s friends (say 10 friends in each group), random variation is very likely to lead to a difference larger or smaller than 0. How do we know whether an observed difference is due to random variation around a difference of 0, or whether an observed difference is due to random variation around a real difference between our friends?
To answer this question, we need to know what constitutes
a reasonable expectation about how much the differences between groups can
vary, if we would repeatedly ask samples of groups of friends to rate LOTR. When
we compare two groups, we use the means, standard deviations, and number of
observations in each group to calculate the t-statistic.
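In general terms, the t-statistic for two independent groups (assuming equal variances) is the difference between the means divided by the pooled standard error. A quick sketch of that calculation with the summary statistics from the table (the SDs are rounded to two decimals, so the result differs slightly from the exact t-test reported further on):

```r
# Summary statistics from the table (SDs rounded to two decimals)
m1 <- 8.7; sd1 <- 0.82; n1 <- 10  # my friends
m2 <- 7.7; sd2 <- 0.95; n2 <- 10  # my wife's friends

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))

# t-statistic: mean difference divided by its standard error
t_stat <- (m1 - m2) / (sd_pooled * sqrt(1/n1 + 1/n2))
round(t_stat, 2)  # 2.52 (the raw data give t = 2.5175)
```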
With the means (M), standard deviations (SD)
and sample size (N) for each of the
two groups from the Table above we can examine the probability that the difference between the two
groups is larger than 0, given the data we have available. This is done by
calculating the t-statistic, which is
related to a probability (a p-value) of getting the observed or a more extreme t-statistic in the t-distribution. The t-distribution describes the distribution of t-statistics computed from samples drawn from a population. It is similar to the normal distribution (which describes the entire population), but because the standard deviation has to be estimated from the sample, the shape of the t-distribution depends on the sample size (through the degrees of freedom). The larger the sample, the more similar the t-distribution becomes to the normal distribution. Below, you can see the normal distribution (with M = 0 and SD = 1) in black, and the t-distribution for 10 and 5 degrees of freedom. We can see the t-distribution has somewhat thicker tails than the normal distribution.
x <- seq(-4.5, 4.5, length = 1000)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l", col = "black", lwd = 3)
lines(x, dt(x, df = 10), col = "red", type = "l")
lines(x, dt(x, df = 5), col = "blue", type = "l")
legend(-4, 0.4,                       # places a legend at the appropriate place
       c("Normal", "t(10)", "t(5)"),  # puts text in the legend
       lty = c(1, 1, 1),              # gives the legend appropriate symbols (lines)
       lwd = c(2.5, 2.5, 2.5), col = c("black", "red", "blue"))
Now that we have a probability density function to compare our t-statistic against, all we need to do is perform the t-test. Strictly speaking, you should first check whether both variables appear to come from a normal distribution and have equal variances, but we’re skipping the assumption checks here. In R, you use the command below (here, we assume equal variances), which returns the output (in grey):
t.test(myfriends,mywifesfriends, var.equal = TRUE)
Two Sample t-test
data: myfriends and
mywifesfriends
t = 2.5175, df = 18, p-value = 0.02151
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
0.1654875 1.8345125
sample estimates:
mean of x mean of y
8.7 7.7
The t-value is 2.5175. The t-test returns the probability of finding a difference as extreme as, or more extreme than, the observed difference. We can graph the t-distribution (for df = 18) with a vertical line at t = 2.5175 in R using:
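A sketch of such a plot in base R, shading the area above the observed t-value in blue (in the same style as the two-tailed plotting code further on):

```r
# t-distribution with df = 18; vertical line at the observed t-value,
# upper (one-tailed) area shaded in blue
x <- seq(-5, 5, length = 1000)
plot(x, dt(x, df = 18), col = "black", type = "l",
     xlab = "t", ylab = "Density")
abline(v = 2.5175, lwd = 2)
x_tail <- seq(2.5175, 5, length = 100)
polygon(c(2.5175, x_tail, 5), c(0, dt(x_tail, df = 18), 0), col = "blue")
```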
Let’s recall the definition of a p-value: a p-value is the probability of the observed or more extreme data, assuming the null hypothesis is true. The assumption that the null-hypothesis is true is represented by the t-distribution being centered around 0, which is the t-value if the difference between the two groups is exactly 0. The probability of an outcome as high as, or higher than, the one observed is indicated by the blue area. This is known as a one-tailed probability. It is equivalent to the probability that my friends like the extended director’s cut of LOTR more than my wife’s friends.
However, that wasn’t my question. My question was whether my
friends and my wife’s friends would differ in how much they liked the LOTR
movies. It would have been just as interesting to find my friends liked the
movies less than my wife’s friends. Since many of my friends have read the book
more than 20 times, and it’s often the case that people like the book better
than the movie, an opposite outcome would have been just as likely.
Furthermore, there has only been one person in my family who has been to a marathon
movie night in the cinema when the third LOTR movie came out, and it wasn’t me.
So we want the probability of finding a positive or a negative difference as extreme as, or more extreme than, the one observed (the two-tailed probability). Let’s plot this:
I’ve plotted the two critical values as vertical grey dotted lines, which indicate the critical t-values for a difference to be statistically different from 0 (p < .05) for a t-distribution with 18 degrees of freedom. We can see that the observed difference is more extreme than these critical values. However, we want to know the exact probability of the two blue areas under the curve. The t-test already provided us with this information, but let’s calculate it ourselves. For the left tail, the probability of a value smaller than t = -2.5175 can be found by:
x <- seq(-5, 5, length = 1000)
plot(x, dt(x, df = 18), col = "black", type = "l")
abline(v = 2.5175, lwd = 2)
abline(v = -2.5175, lwd = 2)
abline(v = -2.100922, col = "grey", lty = 3)
abline(v = 2.100922, col = "grey", lty = 3)
x <- seq(2.5175, 5, length = 100)
z <- dt(x, df = 18)
polygon(c(2.5175, x, 8), c(0, z, 0), col = "blue")
x <- seq(-5, -2.5175, length = 100)
z <- dt(x, df = 18)
polygon(c(-8, x, -2.5175), c(0, z, 0), col = "blue")
pt(-2.5175,df=18)
0.01075515
If we simply changed the sign on -2.5175, we would get
the probability of a value smaller than 2.5175. What we want is the probability
of a value larger than that, which equals:
1-pt(2.5175,df=18)
0.01075515
This value is identical to the probability in the other tail
because the t-distribution is
symmetric. If we add these two probabilities together, we get p = 0.02151. This is the same p-value the t-test we performed earlier returned.
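Because of this symmetry, the two-tailed p-value can also be computed in one step by doubling the lower-tail probability:

```r
# Two-tailed p-value: twice the area in one tail of t(18)
2 * pt(-2.5175, df = 18)  # 0.0215103
```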
What does a p < .05 mean?
If we find a difference which is statistically different from 0 with p < .05, this means that the probability of observing this difference, assuming there is no difference between all my friends and all my wife’s friends, is relatively small. That is, if both groups of friends would enjoy LOTR equally (on average), then the probability of getting the data that we have (or data more extreme than the data we have) is small (0.02151 in our example). The data we have observed should therefore be considered surprising. That’s all a p-value smaller than 0.05 means.
People, even seasoned researchers, often think p-values
mean a lot more. Remember that p-values
are a statement about the probability of observing the data, given that the null-hypothesis
is true. It is not the probability that the null-hypothesis is true, given the
data (we need Bayesian statistics if we want to make such statements). Results can be
surprising when the null-hypothesis is true, but even more surprising under the
alternative hypothesis (in which case the null-hypothesis is perhaps the one to
go for after all). The moral: since a p-value only takes the null-hypothesis
into account, it says nothing about the probability of the data under the
alternative hypothesis.
If a p-value is
larger than 0.05, the data we have observed are not surprising. This doesn’t mean the
null-hypothesis is true. The p-value
can only be used to reject the null-hypothesis; it can never be used to
accept the null-hypothesis as true.
Finally, a p-value
is not the probability that a finding will replicate. It’s not 97.849% likely
the observed difference in LOTR evaluations will be replicated. We can easily show this in R by
simulating some friends. We will use the observed means and standard deviations
of the 10 friends of my wife and me, to draw 10 new friends from a normal
distribution with the same means and standard deviations. Let’s also perform a
t-test, and plot the results.
myfriends <- rnorm(n = 10, mean = 8.7, sd = 0.82)
mywifesfriends <- rnorm(n = 10, mean = 7.7, sd = 0.95)
m1 <- mean(myfriends)
m2 <- mean(mywifesfriends)
sd1 <- sd(myfriends)
sd2 <- sd(mywifesfriends)
t.test(myfriends, mywifesfriends, var.equal = TRUE)  # perform the t-test
d1 <- density(myfriends)
d2 <- density(mywifesfriends)
plot(range(d1$x, d2$x), range(d1$y, d2$y), type = "n",
     xlab = "LOTR Rating", ylab = "Density")
legend(5, 0.4, legend = paste("mean=", round(m2, 2), "; sd=", round(sd2, 2), sep = ""))
legend(9.1, 0.4, legend = paste("mean=", round(m1, 2), "; sd=", round(sd1, 2), sep = ""))
polygon(d1, col = rgb(0, 0, 1, 1/4), border = "black")
polygon(d2, col = rgb(1, 0, 0, 1/4), border = "black")
Run this script 20 times, and note whether the p-value in
the outcome of the t-test is smaller
than 0.05. You will probably see some variation in the p-values (with p > .05 several times), and quite some variation in the means. This happens because there is a lot of variation in
small samples. Let’s change the number of simulated friends of my wife and me
(n = 10 in the first two lines of the script) to 100 simulated friends. Run the
adapted script 20 times. You will see that the p-values are very low
practically every time you run the script.
Look at the means and standard deviations. What do you notice with respect to the means and standard deviations? How is it possible that the t-value becomes more extreme, while the means and standard deviations stay the same (or actually become closer to the true means and standard deviations with increasing sample size)? The answer can be found in the formula for the t-statistic, where (described in general terms) the difference between means is divided by the pooled standard deviation, divided by the square root of the sample size. This √n aspect of the function explains why, all else remaining equal and when there is a real difference to be observed, t-values increase, and p-values decrease, the larger the sample size.
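The √n effect described above can be seen directly by computing the t-statistic from fixed summary statistics at different sample sizes (a sketch, holding the means and SDs constant; with real samples they would of course vary somewhat):

```r
# t-statistic from fixed summary statistics, varying only
# the sample size per group (equal n and equal-variance pooling)
t_from_summary <- function(m1, m2, sd1, sd2, n) {
  sd_pooled <- sqrt(((n - 1) * (sd1^2 + sd2^2)) / (2 * n - 2))
  (m1 - m2) / (sd_pooled * sqrt(2 / n))
}
round(t_from_summary(8.7, 7.7, 0.82, 0.95, 10), 2)   # 2.52
round(t_from_summary(8.7, 7.7, 0.82, 0.95, 100), 2)  # 7.97
```

Same means, same standard deviations, ten times the sample size: the t-value roughly triples (√10 ≈ 3.16), and the p-value drops accordingly.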
Great post as usual. I'm excited to see you're moving to R. However, you need to immediately switch over to ggplot2 for plotting. Leave base graphics (the Klingon of graphics) for 1992.
http://ggplot2.org/
http://docs.ggplot2.org/current/
I think it is a basic principle of good programming that you should avoid needless, limiting abstractions unless they give you a sufficiently large payback in terms of usability. And frankly I don't think ggplot2 has enough bang for the buck to warrant the widespread adoption that it enjoys today. Learning the ins and outs of ggplot2's plotting system is not exactly trivial, and that is all time that could be spent just learning how to use the base graphics system well, which is of course what ggplot2 uses "under the hood." As soon as you need to make some absolutely simple tweaks to your plots that ggplot2 wasn't specifically designed to make easy for you, you're going to have a bad time: take a look at the questions tagged "ggplot2" at StackOverflow if you don't believe me. Daniel, it looks like you can already do some fairly sophisticated things in the base graphics system, so it seems to me that you're better off just staying the course.
+1
Hi Jake, that is a very good point about avoiding unnecessary abstractions. The difficulty of learning ggplot2 I believe to be a subjective quality, and the amount (and type) of questions about ggplot2 on StackOverflow could indicate its success and adaptability to a variety of problems and skill levels, and not the contrary as you suggest.
You are absolutely right in that as Dr. Lakens is quite obviously making great plots with base graphics already, my call for switching to ggplot2 was not reasonable.
I kid, but the fact that R is built with C (or Fortran?) shouldn't convince us to use C instead.
Hi, thanks for the suggestion. I think visualization is very important, especially when teaching statistics to students. That's why I tried to make some good figures. For now, I think they suffice - learning ggplot would be nice, but I have so much to learn, I don't think it will happen any time soon. But it's on the list!
As you know, I write my blog in IPython notebook. There is a similar tool for R, check it out: http://ramnathv.github.io/rNotebook/
ReplyDeleteInitially it may take some time to figure out how translate notebooks into blogposts but in the long the notebooks give unconstrained flexibility and of course they look good :)
You have used p values here to determine if the differences in means between groups is statistically significant based on the assumptions of the null hypothesis. But this does not mean that this difference is significant in real life. Nor does it contribute to your hypothesis in any way. I mean, the p value you got tells you that the result is unusual if the null hypothesis is true (zero difference between groups), but since your samples are small and your null is never really going to be exactly zero, this isn't saying much. More importantly, it doesn't tell you anything about your original hypothesis given your data (personally I don't think that means of 7 and 8 are very significantly different). My question is, why use p values at all if they are not informative and do not tell you anything about your hypothesis?
Hi Daniel, the answer is extremely simple: You should use them, because they answer the question you are interested in. If you cannot think of a question a p-value answers, don't use them. Please read around on my blog for some answers to your other questions, such as your statement about the null never being true, which I hear a lot *yawn* but I've addressed here: http://daniellakens.blogspot.nl/2014/06/the-null-is-always-false-except-when-it.html
But if your question was 'is there is a difference between the groups in real life?', can't you just get this by comparing means and SDs, and deciding for yourself if the difference is meaningful? Why bother using p values at all? Seems unnecessarily complex to me.
Is it even logically correct to reject the null hypothesis just b/c your p value is so low it is unusual? In your example, the p value tells you that the diff between the two groups is statistically significant i.e it is unusual assuming that no difference exists. But you don't know if you are right to make this assumption in the first place, since you don't know anything about either group. So how then can you use the p value to answer a question?
I guess p values would make sense if you had reason to believe that there was no difference between groups to begin with. Then you could use p values as a model to fit your data to tell you how close your data is to the null model. But you would still need to know what a suitable sig. level would be. You can't just arbitrarily pick 0.05 because it's convention, right? Why not 0.001? (This link on p values and error rates - http://www.dcscience.net/Sellke-Bayarri-Berger-calibration-of-P-2001.pdf)