Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

If you were going to write a paper about the false discovery rate you should have done it in 2002

People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics
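A rough sketch of what such a scrape could look like in R (not the author's actual script, which is linked at the end of the post): it assumes the rvest package and that Google Scholar still marks each result's author/venue/year line with the .gs_a CSS class, and Scholar rate-limits automated requests, so treat it purely as an illustration.

```r
library(rvest)

get_years <- function(term, n_pages = 3) {
  years <- integer(0)
  for (start in seq(0, by = 10, length.out = n_pages)) {
    url <- paste0("https://scholar.google.com/scholar?q=",
                  URLencode(term, reserved = TRUE), "&start=", start)
    page <- read_html(url)
    # .gs_a is (assumed to be) the author/venue/year line of each result
    meta <- html_text(html_nodes(page, ".gs_a"))
    # pull the last 4-digit year on each line; lines with no year become NA
    yr <- suppressWarnings(
      as.integer(sub(".*\\b((19|20)[0-9]{2})\\b.*", "\\1", meta)))
    years <- c(years, yr)
    Sys.sleep(5)  # be polite; Scholar blocks rapid automated requests
  }
  years
}

terms <- c("empirical processes", "proportional hazards model",
           "generalized linear model", "semiparametric",
           "generalized estimating equation", "false discovery rate",
           "microarray statistics", "lasso shrinkage", "rna-seq statistics")

# 30 hits per term = 3 pages of 10 results each
year_list <- lapply(terms, get_years)
```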

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, these are the first 10 papers you get when you search for false discovery rate:

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:

 

[Boxplot: publication years of the top 30 Google Scholar hits for each search term]

You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The medians for all the topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765
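Given a list of publication years per term (for example, from the hypothetical scrape sketched above), the per-term medians and the boxplot take only a few lines of base R:

```r
# Continuing the hypothetical sketch above: summarize and plot the years
names(year_list) <- c("Emp. Proc.", "Prop. Haz.", "GLM", "Semi-param.",
                      "GEE", "FDR", "microarray", "lasso", "rna-seq")
sort(sapply(year_list, median, na.rm = TRUE))   # median publication year per term
boxplot(year_list, las = 2, ylab = "Publication year",
        main = "Publication years of top 30 Google Scholar hits")
```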

I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations. It also suggests that there are reasons for success in science other than individual brilliance. Given the potentially negative consequences the expectation of brilliance has for certain subgroups, it is important to recognize the role of timing and luck. The median publication year of the most cited “false discovery rate” papers was 2002, and almost none of the top 30 hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.

How to find the science paper behind a headline when the link is missing

I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.

 

Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. The approach for a press release is very similar, but you skip the first step I describe below.

Here is the news article (link):

 

[Screenshot of the news article]

 

 

Step 1: Look for a link to the article

Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. This is not the original research article. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.

 

Step 2: Look for names of the authors, scientific key words and journal name if available

You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:

[Screenshot: the journal name in the press release]

 

And some key words:

 

[Screenshot: key words in the press release]

 

Step 3: Use Google Scholar

You could just google those words, and sometimes you get the real paper, but often you just end up back at the press release or news article. So instead, the best way to find the article is to go to Google Scholar, then click on the little triangle next to the search box.

 

 

 

[Screenshot: the advanced search menu next to the Google Scholar search box]

Fill in whatever information you can: the same year as the press release, the journal name, the university, and key words.

 

[Screenshot: the advanced search form filled in]
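If you do this kind of sleuthing a lot, you can also skip the form and build the advanced-search URL directly. Here is a hypothetical helper in R; the parameter names (as_sauthors, as_publication, as_ylo/as_yhi) are my guess at what Scholar’s advanced-search form puts in the URL, so double-check them against a search you run by hand.

```r
# Hypothetical helper: build a Google Scholar advanced-search URL directly.
# The parameter names are assumptions based on what the advanced-search form
# appears to generate; verify against a manual search before relying on them.
scholar_url <- function(keywords, journal = NULL, author = NULL, year = NULL) {
  params <- c(q = URLencode(keywords, reserved = TRUE),
              if (!is.null(journal)) c(as_publication = URLencode(journal, reserved = TRUE)),
              if (!is.null(author))  c(as_sauthors = URLencode(author, reserved = TRUE)),
              if (!is.null(year))    c(as_ylo = year, as_yhi = year))
  paste0("https://scholar.google.com/scholar?",
         paste(names(params), params, sep = "=", collapse = "&"))
}

# Placeholder values; substitute the journal, year, and key words found in Step 2
scholar_url("key words from the press release",
            journal = "Journal Name", year = 2015)
```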

 

Step 4: Victory

Often this will come up with the article you are looking for:

[Screenshot: the original article in the Google Scholar search results]

 

Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag [#icanhazpdf](https://twitter.com/hashtag/icanhazpdf) and your contact info. Then you just have to hope that someone will send it to you (they often do).

 

 

Statistics and R for the Life Sciences: New HarvardX course starts January 19

The first course of our Biomedical Data Science online curriculum starts next week. You can sign up here. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals and power calculations, for example). By doing this, students will learn Statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each video students will have the opportunity to assess their understanding through homeworks involving coding in R. A big improvement over the first version is that we have added dozens of assessments.

Note that this course is the first in an eight-part series on Data Analysis for Genomics. Updates will be provided via Twitter @rafalab.

 

[Screenshot: a lecture module in the edX course]
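To give a flavor of the simulation-based approach described above (this is a sketch in that spirit, not actual course material), here is the kind of R code students write along with the lectures: a Monte Carlo p-value and a simulation check of confidence-interval coverage, with no formulas to memorize.

```r
set.seed(1)
n <- 25
nsim <- 10000

# Null distribution of the difference in sample means when the two groups
# really come from the same population
null_diffs <- replicate(nsim, mean(rnorm(n)) - mean(rnorm(n)))

# One "observed" difference from data where the true difference is 0.5
obs_diff <- mean(rnorm(n, mean = 0.5)) - mean(rnorm(n, mean = 0))

# Two-sided p-value: how often is the null simulation as extreme as what we saw?
mean(abs(null_diffs) >= abs(obs_diff))

# Coverage of the usual 95% confidence interval for a mean, checked by simulation
covers <- replicate(nsim, {
  x <- rnorm(n, mean = 0.5)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= 0.5 && 0.5 <= ci[2]
})
mean(covers)  # should land near 0.95
```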

Beast mode parenting as shown by my Fitbit data

This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.

Here is Saturday:

[Fitbit sleep plot for Saturday]

 

 

There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:

 

[Fitbit sleep plot for Sunday]

Check that out. I was totally asleep from like 4am-6am there. Nice.

Stay tuned for much more from my Fitbit data over the next few weeks.

 

 

Sunday data/statistics link roundup (1/4/15)

  1. I am digging this visualization of your life in weeks. I might have to go so far as to actually make one for myself.
  2. I’m very excited about the new podcast TalkingMachines (and what an awesome name!). I wish someone would do the same thing for applied statistics (Roger?)
  3. I love that they call Ben Goldacre the anti-Dr. Oz in this piece, especially given how often Dr. Oz is telling the truth.
  4. If you haven’t read it yet, this piece in the Economist on statisticians during the war is really good.
  5. The arXiv celebrated its one millionth paper upload. It costs less to run than the top 2 executives at PLoS make. It is too darn expensive to publish open access right now.