Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

If you were going to write a paper about the false discovery rate you should have done it in 2002

People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, if you look at the first 10 papers you get for searching for false discovery rate you get.

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:

 

citations-boxplot

You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The medians for the other topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765

I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations.  It also suggests another reason for success in science than individual brilliance. Given the potentially negative consequences the expectation of brilliance has on certain subgroups, it is important to recognize the importance of timing and luck. The median most cited “false discovery rate” paper was 2002, but almost none of the 30 top hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.