Wednesday, June 8, 2016

What’s so bad about teeny tiny p-values?

Every so often, I’ll see someone make fun of a really small p-value, usually along with some line like “If your p-value is smaller than 1/(number of molecules in the universe), you must be doing something wrong”. At first, this sounds like a good burn, but thinking about it a bit more, I just don’t get this criticism.

First, the number itself. Is the objection that the number of molecules in the universe is so large? Perhaps it conjures up an image of “well, this result is saying this effect could never happen anywhere, ever, in the whole universe by chance; that seems crazy!”, making it seem like there must be some flaw in the computation or deduction. The flaw in that logic is pretty easy to spot, of course: configurational space can be vastly larger than the raw number of constituent parts. For example, let’s say I mix some red dye into a cup of water and then pour half of the dyed water into another cup. Now there is some probability that, randomly, all the red dye stays in one cup and no dye goes into the other. That probability is 1/(2^numberOfDyeMolecules), which is clearly going to be a pretty teeny-tiny number.

Here’s another example that may hit a bit closer to home: during cell division, the nuclear envelope breaks down, and so many nuclear molecules (say, lincRNAs) get scattered throughout the cell (and yes, we have observed this to be the case for e.g. Xist and a few others). Then, once the nucleus reforms, those lincRNAs seem to be right back in the nucleus. What is the chance that the lincRNAs just happened to end up back in the nucleus by chance? Well, again, 1/(2^numberOfRNAMolecules), assuming a 50/50 nucleus/cytoplasm split. For many lincRNAs, present at maybe 10 molecules per cell, that’s around 1/1024 or so; for something as abundant as MALAT1, it would be more like 1/(2^2000). I think we can pretty safely reject the hypothesis that there is no active trafficking of MALAT1 back into the nucleus… :)
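Both the dye and the lincRNA numbers come from the same back-of-the-envelope calculation, which is easy to sketch (the molecule counts below are illustrative assumptions, not measurements):

```python
import math

def log10_all_one_side(n_molecules):
    """log10 of the chance that all n molecules independently land on
    one specified side of a 50/50 split: log10((1/2)^n) = -n*log10(2)."""
    return -n_molecules * math.log10(2)

# ~10 RNA molecules per cell (a low-abundance lincRNA): p = 1/1024
print(10 ** log10_all_one_side(10))
# ~2000 molecules (MALAT1-ish): log10(p) ~ -602, far past float underflow,
# which is why we work on the log scale in the first place
print(log10_all_one_side(2000))
# ~1e18 dye molecules in the cup example: even the exponent is astronomical
print(log10_all_one_side(1e18))
```

Working in log10 sidesteps underflow entirely; the probabilities themselves are unrepresentable as ordinary floating-point numbers.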

I think the more substantial concern people raise with these p-values is that when you get something so small, it probably means you’re not taking into account some sort of systematic error; in other words, the null model isn’t right. For instance, let’s say I measured a slight difference in the concentration of dye molecules in the second cup above. Even a pretty small difference will come with an infinitesimal p-value, but the most likely explanation is that some systematic error is responsible (like dye getting absorbed by the material of the second glass, or the glasses having slightly different transparencies, or whatever). In genomics—or basically any study where you are doing a lot of comparisons—the same sort of thing can happen if the null/background model is slightly off for each of a large number of comparisons.
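To make that failure mode concrete, here is a toy simulation (the bias size and sample size are made up): a tiny systematic offset, tested against a textbook zero-mean null, manufactures an absurdly small p-value even though nothing interesting is going on.

```python
import math
import random

random.seed(1)

# Pretend each measurement carries a small systematic offset (e.g. dye
# absorbed by the glass) of 0.05 standard deviations.
n = 100_000
bias = 0.05
data = [random.gauss(bias, 1.0) for _ in range(n)]

# Test against the (wrong) null hypothesis that the true mean is zero.
mean = sum(data) / n
z = mean * math.sqrt(n)               # z-statistic, sd known to be 1
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
print(f"z = {z:.1f}, p = {p:.1e}")    # z lands around 16; p is absurdly small
```

The p-value is “correct” given the stated null; the null is just not a correct description of the measurement process, which is the real lesson.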

All that said, I still don’t see why people make fun of small p-values. If you have a really strong effect, then it’s entirely possible to get such a tiny p-value. In that case, the response is typically “well, if it’s that obvious, then why do any statistics?” Okay, fine, I’m totally down with that! But then we’re basically saying that there are no really strong effects out there: if you’re doing enough comparisons that you might get one of these tiny p-values, then any strong, real effect must generate one of these p-values, no? In fact, if you don’t get a tiny p-value in one of these many-comparison settings, then you must be looking at something that is only a minor effect at best, like something that only explains a small amount of the variance. Whether that matters or not is a scientific question, not a statistical one, but one thing I can say is that I don’t know many examples (at least in our neck of the molecular biology woods) in which something that was statistically significant but explained only a small amount of the variance was really scientifically meaningful. Perhaps GWAS is a counterexample to my point? Dunno. Regardless, I just don’t see much justification in mocking the teeny tiny p-value.

Oh, and here’s a teeny tiny p-value from our work. It came from comparing some bars in a graph that were so obviously different that only a reviewer would have the good sense to ask for the p-value… ;)

Update, 6/9/2016:
I think there are a couple of examples that illustrate some of these points better. First, not all tiny p-values are the result of obvious differences or of faulty null models applied across large numbers of comparisons. Take Uri Alon's network motifs work. In his book, he shows that the transcriptional network of E. coli (424 nodes, 519 edges) contains 40 examples of autoregulation. Is this higher than, lower than, or equal to what you would expect? Well, maybe you have a good intuitive handle on random network theory, but for me, the fact that this is very far from the null expectation of around 1.2±1.1 autoregulatory motifs (p-value something like 10^-30) is not immediately obvious. One can (and people do) quibble about the particular choice of random network model, but in the end, the p-values were always teeny tiny, and I don't think that is either obvious or unimportant.
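As a rough sanity check on those numbers, here is a crude version of the calculation. I'm assuming a Poisson null for the number of self-loops, which is approximately what an Erdős–Rényi-style random network gives you; Alon's actual randomization scheme differs in detail, so the exact exponent differs too, but the tail is tiny either way.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): start from the log of the k-th
    pmf term (to avoid factorial overflow), then sum the tail."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + 200):   # terms shrink geometrically when lam << k
        total += term
        term *= lam / (i + 1)
    return total

# Expected self-loops if each of 519 edges picks its target uniformly from
# 424 nodes: 519/424 ~ 1.22 (matching the ~1.2 +/- 1.1 quoted above).
lam = 519 / 424
p = poisson_tail(40, lam)
print(lam, p)   # under this crude null, p comes out around 1e-45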

Second, the fact that a large number of small effects can combine to give a tiny p-value doesn't automatically discount their significance. My impression from genetics is that many phenotypes are composed of large numbers of small effects. Moreover, a perturbation such as a gene knockout can lead to a large number of small effects. Whether those effects are meaningful is a scientific question (an open one, to my mind), but whether the p-value is small or not is not really relevant to it.

This is not to say that every tiny p-value means there's some science worth looking into. Some of the worst offenders are the examples of "binning", where, e.g., the half-lives of individual genes correlate with some DNA sequence element, R^2 = 0.21, p = 10^-17 (totally made-up example, no offense to anyone in this field!). No strong rule comes from this, so who knows whether we actually learned something. I suppose an argument can be made either way, but the bottom line is that those are scientific questions, and the size of the p-value is irrelevant. If the p-value were bigger, would that really change anything?
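For a sense of scale on that made-up example: using the Fisher z-transform (a standard approximation for Pearson correlation p-values), an R^2 of 0.21 needs only a few hundred data points to reach p around 10^-17. The n below is my guess, chosen purely to reproduce the number in the text.

```python
import math

def pearson_log10_p(r, n):
    """Approximate two-sided p-value (log10 scale) for a Pearson
    correlation r over n points, via Fisher z and a normal tail."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.log10(math.erfc(abs(z) / math.sqrt(2)))

# R^2 = 0.21 -> |r| ~ 0.46; with n = 300, log10(p) lands near -17
print(pearson_log10_p(math.sqrt(0.21), 300))
```

So a modest-looking correlation over one genome's worth of genes buys you a headline-grade exponent almost for free, which is exactly why the exponent alone shouldn't impress anyone.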


  1. I have no rigorous argument here, but I do find tiny p-values unsettling when they are based on more observational/less mathematical data. In your example, if your observations had been slightly different, then the p-value could have changed; even if it had been lower by a factor of a trillion trillion trillion, I expect you'd still be happy to report a p-value of 1/(10^60). In other words, we are using, for the exact same purpose, numbers that could potentially vary by tens of orders of magnitude, which strikes me as strange. That's why I think vanishingly small p-values might just as well be reported as being less than some arbitrary threshold of convincingness.

    1. I see this point, and I think reporting "p < 10^-20" has some merit. That said, this is related to another criticism I often see that I think makes no sense: complaining about reporting a lot of decimal places on a value. Like saying the impact factor of the Annals of the Antarctic Snow Society is 4.5132. Yes, it is likely not particularly meaningful that it's 4.5132 instead of 4.5133. But whatever, that's the value you computed. Signifying the error or precision of a method is best done with an error range, like 4.4583-4.6981. Conflating numeric precision with error reporting is a bad idea, in my opinion. Reporting p < 10^-20 is similar: report a range if you want to convey uncertainty, but the point value you report should be the value you computed, whatever that may be.

  2. "If you have a really large effect you will have a very small p-value"
    Well, it's the effect size we should be interested in; that is what's biologically relevant, not the p-value. If you have a very large n, you'll find it quite easy to generate small p-values. What information is really missing if you report p < 0.0001?

    1. This is the subject of another blog post I have on deck, but I'm not so sure why effect size is what we should be interested in, necessarily. Clinically, sure, but biologically, not so clear, it depends.

    2. Arjun, if the point of the p-value is to reject the null in favor of an alternative theory, then would you agree that the *practical value* of the alternative theory is the effect size?

      On the other hand, clearly general relativity has super small differences from Newtonian mechanics in almost all conditions we experience as humans, and so would have a super small "effect size" while still being deeply significant to science.

    3. Well, I dunno, I think "practical value" is sort of hard to define. I think that effect size is a surrogate that is useful in cases where you essentially have no theory, like in the clinic. Sadly, this is also true in most of biology. I think one of the goals of systems biology is to predict *exactly* what a value should be upon a perturbation, not just that there is an effect (which is basically a 50/50 coin flip "prediction" as to the sign). We are clearly pretty far from that goal, but I still have hope (sometimes). :)