Thursday, August 21, 2014

What is a p-value?

I'll be teaching my first statistics lectures this September together with +Chris Snijders. While preparing some assignments, I gained a new appreciation for clearly explaining the basics of statistics. I'm also becoming an R fanboy, so I thought I'd adapt one assignment into this blog post to explain what a p-value is, and how to plot data, make figures, and do a t-test in R, for people who'd like to brush up on or learn some of the basics.

When testing predictions in empirical research, researchers often report whether their results are statistically different from 0. For example, a researcher might be interested in whether the use of a cell phone increases the likelihood of getting into a car accident compared to not using a cell phone. A researcher might ask one group of individuals to use cell phones while driving in a driving simulator, while a control group does not use cell phones. The researcher might predict that cell phone users get into more accidents, and test this prediction by checking whether the difference between the two groups in the experiment is statistically different from zero. This is typically referred to as null-hypothesis significance testing (NHST). The ‘significance’ part of this name is a misnomer: what we understand as the 'significance' of a finding in normal English depends on its theoretical or practical importance and has very little to do with statistics. We will therefore refer to such tests as ‘null-hypothesis testing’, and use ‘statistical difference’ for what is sometimes referred to in the literature as a ‘significant finding’ or a ‘statistically significant finding’.

Assume we ask two groups of 10 people how much they liked the extended director's cut version of the Lord of the Rings (LOTR) trilogy. The first group consists of my friends, and the second group consists of friends of my wife. Our friends rate the trilogy on a score from 1 to 10. We can calculate the average rating by my friends, which is 8.7, and the average rating by my wife’s friends, which is 7.7. The difference is 1.



             My Friends   My Wife’s Friends
Friend 1        9.00            9.00
Friend 2        7.00            6.00
Friend 3        8.00            7.00
Friend 4        9.00            8.00
Friend 5        8.00            7.00
Friend 6        9.00            9.00
Friend 7        9.00            8.00
Friend 8       10.00            8.00
Friend 9        9.00            8.00
Friend 10       9.00            7.00
M               8.70            7.70
SD              0.82            0.95

To get these data in R, we use:

myfriends <- c(9, 7, 8, 9, 8, 9, 9, 10, 9, 9)
mywifesfriends <- c(9, 6, 7, 8, 7, 9, 8, 8, 8, 7)
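As a quick check that the data were entered correctly, we can reproduce the means and standard deviations from the table:

```r
# Same data as above
myfriends <- c(9, 7, 8, 9, 8, 9, 9, 10, 9, 9)
mywifesfriends <- c(9, 6, 7, 8, 7, 9, 8, 8, 8, 7)

mean(myfriends)               # 8.7
mean(mywifesfriends)          # 7.7
round(sd(myfriends), 2)       # 0.82
round(sd(mywifesfriends), 2)  # 0.95
```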

We can see in the table that the ratings vary, not just between the groups, but also within the groups. The average difference between the groups is 1 scale point, but the ratings also differ between friends within each group. To take a look at the distribution of these scores, let’s plot them in a density plot (closely related to a histogram).
 
d1 <- density(myfriends)
d2 <- density(mywifesfriends)
plot(range(d1$x, d2$x), range(d1$y, d2$y), type = "n", xlab = "LOTR Rating", ylab = "Density")
polygon(d1, col = rgb(0, 0, 1, 1/4), border = rgb(0, 0, 1))
polygon(d2, col = rgb(1, 0, 0, 1/4), border = rgb(1, 0, 0))

We can see the groups overlap to a certain extent, but there is also some non-overlap. So is the difference between the two groups just random variation, or can we conclude that my friends really like the extended director's cut of the Lord of the Rings trilogy more than my wife’s friends do?

In null-hypothesis testing we try to answer this question by calculating the probability of observing a specific, or more extreme, outcome of a test (i.e., a difference in movie ratings of 1 point or more) assuming that the null hypothesis is true (i.e., there is no real difference between how much my friends and my wife’s friends like the extended director's cut of LOTR). This probability is called the p-value.

The null-hypothesis assumes that if we asked an infinite number of my friends and an infinite number of my wife’s friends how much they like LOTR, the difference between these huge groups would be exactly 0. However, in a small sample of my friends and my wife’s friends (say 10 friends in each group), random variation is very likely to lead to a difference larger or smaller than 0. How do we know whether an observed difference is due to random variation around a difference of 0, or due to random variation around a real difference between our friends?
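We can get a feel for this random variation by simulation. The sketch below assumes, purely for illustration, that both groups are drawn from one and the same population with a mean of 8 and a standard deviation of 0.9, so the true difference is exactly 0:

```r
set.seed(42)  # make the simulation reproducible

# Repeatedly draw two samples of 10 'friends' from the same population
# and record the difference between the two sample means.
differences <- replicate(10000, {
  groupA <- rnorm(n = 10, mean = 8, sd = 0.9)
  groupB <- rnorm(n = 10, mean = 8, sd = 0.9)
  mean(groupA) - mean(groupB)
})

mean(differences)                       # very close to 0 on average
quantile(differences, c(0.025, 0.975))  # but single samples deviate easily
```

Even though the true difference is 0 here, individual pairs of samples routinely show differences of half a point or more. This is why we need a model of this variation before we can interpret an observed difference of 1.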

To answer this question we need to know what constitutes a reasonable expectation about how much differences between groups can vary, if we repeatedly asked samples of groups of friends to rate LOTR. When we compare two groups, we use the means, standard deviations, and number of observations in each group to calculate the t-statistic.

With the means (M), standard deviations (SD), and sample size (N) for each of the two groups from the table above, we can examine the probability that the difference between the two groups is larger than 0, given the data we have available. This is done by calculating the t-statistic, which is related to the probability (a p-value) of getting the observed or a more extreme t-statistic in the t-distribution. The t-distribution describes the variation we expect in samples drawn from a population when the population standard deviation has to be estimated from the sample itself. It is similar to the normal distribution, but because it describes estimates based on a limited sample, its shape depends on the sample size: the larger the sample, the more similar the t-distribution becomes to the normal distribution. Below, you can see the normal distribution (with M = 0 and SD = 1) in black, and the t-distributions for 10 and 5 degrees of freedom. We can see the t-distribution has somewhat thicker tails than the normal distribution.

x <- seq(-4.5, 4.5, length = 1000)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l", col = "black", lwd = 3)
lines(x, dt(x, df = 10), col = "red")
lines(x, dt(x, df = 5), col = "blue")
legend(-4, 0.4, # places the legend at the top left
c("Normal", "t(10)", "t(5)"), # legend labels
lty = c(1, 1, 1), # line type for each of the three entries
lwd = c(2.5, 2.5, 2.5), col = c("black", "red", "blue")) # line widths and colors

Now that we have a probability density function to compare our t-statistic against, all we need to do is calculate the t-test. Strictly, you should first check whether both variables appear to come from a normal distribution and have equal variances, but we're skipping the assumption checks here. In R, you use the command below (here, we assume equal variances), which returns the output (in grey):
t.test(myfriends,mywifesfriends, var.equal = TRUE)

     Two Sample t-test

data:  myfriends and mywifesfriends
t = 2.5175, df = 18, p-value = 0.02151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1654875 1.8345125
sample estimates:
mean of x mean of y
      8.7       7.7

The t-value is 2.5175. The t-test returns the probability of finding a difference as extreme as, or more extreme than, the observed difference. We can graph the t-distribution (for df = 18) with a vertical line at t = 2.5175 in R using:
 
x <- seq(-5, 5, length = 1000)
plot(x, dt(x, df = 18), col = "black", type = "l")
abline(v = 2.5175, lwd = 2)
x <- seq(2.5175, 5, length = 100)
z <- dt(x, df = 18)
polygon(c(2.5175, x, 5), c(0, z, 0), col = "blue")


Let’s recall the definition of a p-value: a p-value is the probability of the observed or more extreme data, assuming the null hypothesis is true. The assumption that the null-hypothesis is true is represented by the t-distribution being centered on 0, which is the t-value we would expect if the difference between the two groups were exactly 0. The probability of an outcome as high as, or higher than, the one observed is indicated by the blue area. This is known as a one-tailed probability. It corresponds to a one-sided test of the hypothesis that my friends like the extended director's cut of LOTR more than my wife’s friends.


However, that wasn’t my question. My question was whether my friends and my wife’s friends differ in how much they liked the LOTR movies. It would have been just as interesting to find that my friends liked the movies less than my wife’s friends. Since many of my friends have read the book more than 20 times, and it’s often the case that people like the book better than the movie, an opposite outcome would have been just as likely. Furthermore, only one person in my family went to a marathon movie night in the cinema when the third LOTR movie came out, and it wasn’t me. So let’s plot the probability of finding a positive or negative difference as extreme as, or more extreme than, the one observed (the two-tailed probability):

x <- seq(-5, 5, length = 1000)
plot(x, dt(x, df = 18), col = "black", type = "l")
abline(v = 2.5175, lwd = 2)
abline(v = -2.5175, lwd = 2)
abline(v = -2.100922, col = "grey", lty = 3)
abline(v = 2.100922, col = "grey", lty = 3)
x <- seq(2.5175, 5, length = 100)
z <- dt(x, df = 18)
polygon(c(2.5175, x, 5), c(0, z, 0), col = "blue")
x <- seq(-5, -2.5175, length = 100)
z <- dt(x, df = 18)
polygon(c(-5, x, -2.5175), c(0, z, 0), col = "blue")
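As an aside, the grey critical values (±2.100922) in the script above need not be typed in by hand; they can be obtained from qt(), the quantile function of the t-distribution:

```r
# A two-tailed test at alpha = .05 puts 2.5% in each tail
qt(0.975, df = 18)  # upper critical value: 2.100922
qt(0.025, df = 18)  # lower critical value: -2.100922
```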

I’ve plotted two critical values as vertical grey dotted lines, which indicate the critical t-values for a difference to be statistically different from 0 (p < .05) in a t-distribution with 18 degrees of freedom. We can see that the observed difference is more extreme than these critical values. However, we want to know the exact probability of the two blue areas under the curve. The t-test already provided us with this information, but let’s calculate it ourselves. For the left tail, the probability of a value smaller than t = -2.5175 can be found by:

pt(-2.5175,df=18)
0.01075515

If we simply changed the sign on -2.5175, we would get the probability of a value smaller than 2.5175. What we want is the probability of a value higher than that, which equals:

1-pt(2.5175,df=18)
0.01075515

This value is identical to the probability in the other tail because the t-distribution is symmetric. If we add these two probabilities together, we get p = 0.02151, the same p-value the t-test we performed earlier returned.
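Because of this symmetry, the two-tailed p-value is usually computed in a single line, by doubling the area in one tail:

```r
t_obs <- 2.5175  # the observed t-value
2 * pt(-abs(t_obs), df = 18)  # 0.02151, the two-tailed p-value
```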

 

What does a p<.05 mean?


If we find a difference which is statistically different from 0 with p < .05, this means that the probability of observing this difference, assuming there is no difference between all my friends and all my wife’s friends, is relatively small. That is, if both groups of friends would enjoy LOTR equally (on average), then the probability of getting the data that we have (or data more extreme than the data we have) is small (0.02151 in our example). The data we have observed should therefore be considered surprising. That’s all a p-value smaller than 0.05 means.

People, even seasoned researchers, often think p-values mean a lot more. Remember that p-values are a statement about the probability of observing the data, given that the null-hypothesis is true. They are not the probability that the null-hypothesis is true, given the data (we need Bayesian statistics if we want to make such statements). Results can be surprising when the null-hypothesis is true, but even more surprising when the alternative hypothesis is true (which would suggest the null-hypothesis is perhaps the one to go for after all). The moral: since a p-value only takes the null-hypothesis into account, it says nothing about the probability of the data under the alternative hypothesis.
If a p-value is larger than 0.05, the data we have observed are not surprising. This doesn’t mean the null-hypothesis is true. The p-value can only be used to reject the null-hypothesis. It can never be used to accept the null-hypothesis as true.

Finally, a p-value is not the probability that a finding will replicate. It’s not 97.849% likely that the observed difference in LOTR evaluations will be replicated. We can easily show this in R by simulating some friends. We will use the observed means and standard deviations of the 10 friends of my wife and me to draw 10 new friends from a normal distribution with the same means and standard deviations. Let’s also perform a t-test, and plot the results.

myfriends <- rnorm(n = 10, mean = 8.7, sd = 0.82)
mywifesfriends <- rnorm(n = 10, mean = 7.7, sd = 0.95)
m1 <- mean(myfriends)
m2 <- mean(mywifesfriends)
sd1 <- sd(myfriends)
sd2 <- sd(mywifesfriends)
t.test(myfriends, mywifesfriends, var.equal = TRUE) # perform the t-test
d1 <- density(myfriends)
d2 <- density(mywifesfriends)
plot(range(d1$x, d2$x), range(d1$y, d2$y), type = "n", xlab = "LOTR Rating", ylab = "Density")
legend(5, 0.4, legend = paste("mean=", round(m2, 2), "; sd=", round(sd2, 2), sep = ""))
legend(9.1, 0.4, legend = paste("mean=", round(m1, 2), "; sd=", round(sd1, 2), sep = ""))
polygon(d1, col = rgb(0, 0, 1, 1/4), border = "black")
polygon(d2, col = rgb(1, 0, 0, 1/4), border = "black")

Run this script 20 times, and note whether the p-value in the t-test output is smaller than 0.05. You will probably see some variation in the p-values (with p > .05 several times), and quite some variation in the means. This happens because there is a lot of variation in small samples. Now change the number of simulated friends of my wife and me (n = 10 in the first two lines of the script) to 100 simulated friends, and run the adapted script 20 times. You will see that the p-values are very low practically every time you run the script.
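Instead of running the script by hand and counting, we can let R repeat the simulation many times and report how often the t-test comes out below 0.05. This is a minimal sketch (the name simulate_p is just an illustration); the proportion it returns estimates the statistical power of the test under these assumed means and standard deviations:

```r
set.seed(123)  # make the simulation reproducible

# Simulate one study with n friends per group and return its p-value
simulate_p <- function(n) {
  g1 <- rnorm(n, mean = 8.7, sd = 0.82)
  g2 <- rnorm(n, mean = 7.7, sd = 0.95)
  t.test(g1, g2, var.equal = TRUE)$p.value
}

# Proportion of 1000 simulated studies with p < .05
power10 <- mean(replicate(1000, simulate_p(10)) < 0.05)
power100 <- mean(replicate(1000, simulate_p(100)) < 0.05)
power10   # clearly below 1 with 10 friends per group
power100  # very close to 1 with 100 friends per group
```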

Look at the means and standard deviations. What do you notice? How is it possible that the t-value becomes more extreme, while the means and standard deviations stay the same (or actually get closer to the true means and standard deviations as the sample size increases)? The answer can be found in the formula for the t-statistic: described in general terms, the difference between the means is divided by the standard error of that difference, and this standard error shrinks in proportion to the square root of the sample size. This √n aspect of the formula explains why, all else remaining equal and when there is a real difference to be observed, t-values increase, and p-values decrease, the larger the sample size.
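We can verify this formula by computing the t-value by hand from the summary statistics in the table (assuming, as before, equal group sizes and equal variances):

```r
myfriends <- c(9, 7, 8, 9, 8, 9, 9, 10, 9, 9)
mywifesfriends <- c(9, 6, 7, 8, 7, 9, 8, 8, 8, 7)
n <- 10  # observations per group

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n - 1) * var(myfriends) + (n - 1) * var(mywifesfriends)) / (2 * n - 2)

# Standard error of the difference between the two means
se <- sqrt(sp2 * (1/n + 1/n))

t_value <- (mean(myfriends) - mean(mywifesfriends)) / se
t_value  # 2.5175, matching the t.test output earlier
```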

10 comments:

  1. Great post as usual. I'm excited to see you're moving to R. However, you need to immediately switch over to ggplot2 for plotting. Leave base graphics (the Klingon of graphics) for 1992.

    http://ggplot2.org/
    http://docs.ggplot2.org/current/

  2. I think it is a basic principle of good programming that you should avoid needless, limiting abstractions unless they give you a sufficiently large payback in terms of usability. And frankly I don't think ggplot2 has enough bang for the buck to warrant the widespread adoption that it enjoys today. Learning the ins and outs of ggplot2's plotting system is not exactly trivial, and that is all time that could be spent just learning how to use the base graphics system well, which is of course what ggplot2 uses "under the hood." As soon as you need to make some absolutely simple tweaks to your plots that ggplot2 wasn't specifically designed to make easy for you, you're going to have a bad time: take a look at the questions tagged "ggplot2" at StackOverflow if you don't believe me. Daniel, it looks like you can already do some fairly sophisticated things in the base graphics system, so it seems to me that you're better off just staying the course.

    1. Hi Jake, that is a very good point about avoiding unnecessary abstractions. The difficulty of learning ggplot2 I believe to be a subjective quality, and the amount (and type) of questions about ggplot2 on StackOverflow could indicate its success and adaptability to a variety of problems and skill levels, and not the contrary as you suggest.

      You are absolutely right in that as Dr. Lakens is quite obviously making great plots with base graphics already, my call for switching to ggplot2 was not reasonable.

      I kid, but the fact that R is built with C (or Fortran?) shouldn't convince us to use C instead.

    2. Hi, thanks for the suggestion. I think visualization is very important, especially when teaching statistics to students. That's why I tried to make some good figures. For now, I think they suffice - learning ggplot would be nice, but I have so much to learn, I don't think it will happen any time soon. But it's on the list!

  3. As you know, I write my blog in IPython notebook. There is a similar tool for R, check it out: http://ramnathv.github.io/rNotebook/

    Initially it may take some time to figure out how to translate notebooks into blog posts, but in the long run the notebooks give unconstrained flexibility and of course they look good :)

  5. You have used p values here to determine whether the difference in means between groups is statistically significant based on the assumptions of the null hypothesis. But this does not mean that this difference is significant in real life. Nor does it contribute to your hypothesis in any way. I mean, the p value you got tells you that the result is unusual if the null hypothesis is true (zero difference between groups), but since your samples are small and your null is never really going to be exactly zero, this isn't saying much. More importantly, it doesn't tell you anything about your original hypothesis given your data (personally I don't think that means of 7 and 8 are very significantly different). My question is, why use p values at all if they are not informative and do not tell you anything about your hypothesis?

    1. Hi Daniel, the answer is extremely simple: You should use them, because they answer the question you are interested in. If you cannot think of a question a p-value answers, don't use them. Please read around on my blog for some answers to your other questions, such as your statement about the null never being true, which I hear a lot *yawn* but I've addressed here: http://daniellakens.blogspot.nl/2014/06/the-null-is-always-false-except-when-it.html

    2. But if your question was 'is there a difference between the groups in real life?', can't you just get this by comparing means and SDs, and deciding for yourself if the difference is meaningful? Why bother using p values at all? Seems unnecessarily complex to me.

      Is it even logically correct to reject the null hypothesis just b/c your p value is so low it is unusual? In your example, the p value tells you that the diff between the two groups is statistically significant i.e it is unusual assuming that no difference exists. But you don't know if you are right to make this assumption in the first place, since you don't know anything about either group. So how then can you use the p value to answer a question?

      I guess p values would make sense if you had reason to believe that there was no difference between groups to begin with. Then you could use p values as a model to fit your data to tell you how close your data is to the null model. But you would still need to know what a suitable sig. level would be. You can't just arbitrarily pick 0.05 because it's convention, right? Why not 0.001? (This link on p values and error rates - http://www.dcscience.net/Sellke-Bayarri-Berger-calibration-of-P-2001.pdf)
