
P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures

Everyone knows by now that you have to correct for multiple testing when you calculate many p-values; otherwise this can happen:

http://xkcd.com/882/

 

One of the most popular ways to correct for multiple testing is to estimate or control the false discovery rate. The false discovery rate attempts to quantify the fraction of discoveries that are false. If we call all p-values less than some threshold t significant, then, borrowing notation from this great introduction to false discovery rates:

$$F(t) = \#\{\text{null } p_i \le t;\ i = 1, \dots, m\}, \qquad S(t) = \#\{p_i \le t;\ i = 1, \dots, m\}$$

 

So F(t) is the (unknown) total number of null hypotheses called significant and S(t) is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities which, under certain assumptions, can be approximated by the ratio of the expectations:

$$FDR(t) = E\left[\frac{F(t)}{S(t)}\right] \approx \frac{E[F(t)]}{E[S(t)]}$$

To get an estimate of the FDR we just need estimates for E[F(t)] and E[S(t)]. The latter is easy to estimate: it is just the total number of rejections, the number of p-values with p ≤ t. If the null p-values follow their expected uniform distribution, then E[F(t)] can be approximated by the fraction of null hypotheses, times the total number of hypotheses, times t, since a uniform p-value falls below t with probability t. To do this, we need an estimate of \pi_0, the proportion of null hypotheses. There are a large number of ways to estimate this quantity, but it is almost always estimated using the full distribution of the p-values computed in an experiment. The most popular estimator compares the fraction of p-values greater than some cutoff \lambda to the fraction you would expect if every single hypothesis were null:

$$\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)}$$
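As a concrete illustration, here is a minimal Python sketch of that estimator (the helper name `estimate_pi0` and the default cutoff \lambda = 0.5 are illustrative choices of mine, not something from the post or its gist):

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Estimate the proportion of null hypotheses, pi_0.

    Null p-values are uniform on (0, 1), so the fraction of p-values
    above `lam` should be roughly pi_0 * (1 - lam); dividing by
    (1 - lam) recovers an estimate of pi_0 (capped at 1).
    """
    pvals = np.asarray(pvals)
    return min(np.mean(pvals > lam) / (1 - lam), 1.0)
```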

Combining the above equation with our estimates for E[F(t)] and E[S(t)] we get:

$$\widehat{FDR}(t) = \frac{\hat{\pi}_0 \, m \, t}{\#\{p_i \le t\}}$$

 

The q-value is a multiple testing analog of the p-value and is defined as:

$$\hat{q}(p_i) = \min_{t \ge p_i} \widehat{FDR}(t)$$
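To make the definition concrete, here is a minimal Python sketch that turns a vector of p-values into q-values using the two formulas above; the minimum over thresholds t ≥ p_i is computed with a cumulative minimum from the largest p-value down. This is an illustrative simplification, not the post's own code:

```python
import numpy as np

def qvalues(pvals, pi0):
    """q(p_i) = min over t >= p_i of FDRhat(t), where
    FDRhat(t) = pi0 * m * t / #{p_j <= t}."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Evaluate FDRhat at t = each sorted p-value;
    # #{p_j <= t} is then just the rank 1, ..., m.
    fdr = pi0 * m * ranked / np.arange(1, m + 1)
    # Minimum over all t >= p_i: cumulative min from the right.
    q = np.minimum.accumulate(fdr[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```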

 

This is of course a very loose version of this and you can get a more technical description here. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its “testing partners”. Here is a quick example. Suppose that we have done a test and obtained a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of null hypotheses, generating a p-value distribution like this:

[Figure: histogram of the p-values, roughly uniform on (0, 1)]

Then you get a q-value greater than 0.99, as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values, in a distribution that looks like this:

[Figure: histogram of the p-values, piled up near zero]

 

Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value: if you make its testing partners have sufficiently low p-values, then its q-value will be as small as you like.
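To see the effect numerically, here is a small simulation in the spirit of the example, reusing the `estimate_pi0` and `qvalues` sketches above (the Beta(1, 1000) draw for the tiny "testing partner" p-values is my illustrative choice; the post's actual code is in the gist referenced below):

```python
import numpy as np

rng = np.random.default_rng(42)
p_target = 0.8  # the p-value we will "rescue"

# Scenario 1: 999 null partners -> roughly uniform p-values.
p1 = np.append(rng.uniform(size=999), p_target)
print(qvalues(p1, estimate_pi0(p1))[-1])  # close to 1

# Scenario 2: 999 strongly non-null partners -> p-values near 0.
p2 = np.append(rng.beta(1, 1000, size=999), p_target)
print(qvalues(p2, estimate_pi0(p2))[-1])  # tiny: pi_0 estimate ~ 0
```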

A couple of things to note:

  • Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn’t be done.
  • For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.
  • This is the reason that sometimes a “multiple testing adjusted” p-value (or q-value) is smaller than the p-value itself.
  • This doesn’t affect non-adaptive FDR procedures - but those procedures still depend on the “testing partners” of any p-value through the total number of tests performed. This is why people talk about the so-called “multiple testing burden”. But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct. (A minimal sketch of a non-adaptive adjustment follows this list.)
  • I’ve appended the code to generate the histograms and calculate the q-values in this post in the following gist.
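For contrast with the adaptive estimate above, here is a minimal Python sketch of the non-adaptive Benjamini-Hochberg adjustment mentioned in the notes; it never estimates \pi_0, so a p-value's testing partners enter only through the total number of tests m and the ranks:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (non-adaptive).

    Adjusted p_(i) = min over j >= i of m * p_(j) / j. The other
    tests matter only through the total count m and the ranks, so
    no p-value can be dragged down by an estimate of pi_0.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    adj = m * pvals[order] / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out
```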