Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

If you were going to write a paper about the false discovery rate you should have done it in 2002

People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics
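A rough sketch of what such a scrape could look like in R (not the author's actual script, which is linked at the end of the post): it assumes the rvest package and that Google Scholar still marks each result's author/venue/year line with the .gs_a CSS class, and Scholar rate-limits automated requests, so treat it purely as an illustration.

```r
library(rvest)

get_years <- function(term, n_pages = 3) {
  years <- integer(0)
  for (start in seq(0, by = 10, length.out = n_pages)) {
    url <- paste0("https://scholar.google.com/scholar?q=",
                  URLencode(term, reserved = TRUE), "&start=", start)
    page <- read_html(url)
    # .gs_a is (assumed to be) the author/venue/year line of each result
    meta <- html_text(html_nodes(page, ".gs_a"))
    # pull the last 4-digit year on each line; lines with no year become NA
    yr <- suppressWarnings(
      as.integer(sub(".*\\b((19|20)[0-9]{2})\\b.*", "\\1", meta)))
    years <- c(years, yr)
    Sys.sleep(5)  # be polite; Scholar blocks rapid automated requests
  }
  years
}

terms <- c("empirical processes", "proportional hazards model",
           "generalized linear model", "semiparametric",
           "generalized estimating equation", "false discovery rate",
           "microarray statistics", "lasso shrinkage", "rna-seq statistics")

# 30 hits per term = 3 pages of 10 results each
year_list <- lapply(terms, get_years)
```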

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, these are the first 10 papers you get when you search for false discovery rate:

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:

 

[Boxplot: publication years of the top 30 Google Scholar hits for each search term]

You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The medians for all the topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765
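Given a list of publication years per term (for example, from the hypothetical scrape sketched above), the per-term medians and the boxplot take only a few lines of base R:

```r
# Continuing the hypothetical sketch above: summarize and plot the years
names(year_list) <- c("Emp. Proc.", "Prop. Haz.", "GLM", "Semi-param.",
                      "GEE", "FDR", "microarray", "lasso", "rna-seq")
sort(sapply(year_list, median, na.rm = TRUE))   # median publication year per term
boxplot(year_list, las = 2, ylab = "Publication year",
        main = "Publication years of top 30 Google Scholar hits")
```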

I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations. It also suggests that there are reasons for success in science other than individual brilliance. Given the potentially negative consequences the expectation of brilliance has for certain subgroups, it is important to recognize the role of timing and luck. The median publication year of the most cited “false discovery rate” papers was 2002, and almost none of the top 30 hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.

How to find the science paper behind a headline when the link is missing

I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.

 

Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. The approach for a press release is very similar, but you skip the first step I describe below.

Here is the news article (link):

 

[Screenshot of the news article]

 

 

Step 1: Look for a link to the article

Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. This is not the original research article. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.

 

Step 2: Look for names of the authors, scientific key words and journal name if available

You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:

[Screenshot: the journal name in the press release]

 

And some key words:

 

[Screenshot: key words in the press release]

 

Step 3: Use Google Scholar

You could just google those words, and sometimes you get the real paper, but often you just end up back at the press release or news article. So instead, the best way to find the article is to go to Google Scholar, then click on the little triangle next to the search box.

 

 

 

[Screenshot: the advanced search menu next to the Google Scholar search box]

Fill in whatever information you can: the same year as the press release, the journal name, the university, and key words.

 

[Screenshot: the advanced search form filled in]
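If you do this kind of sleuthing a lot, you can also skip the form and build the advanced-search URL directly. Here is a hypothetical helper in R; the parameter names (as_sauthors, as_publication, as_ylo/as_yhi) are my guess at what Scholar’s advanced-search form puts in the URL, so double-check them against a search you run by hand.

```r
# Hypothetical helper: build a Google Scholar advanced-search URL directly.
# The parameter names are assumptions based on what the advanced-search form
# appears to generate; verify against a manual search before relying on them.
scholar_url <- function(keywords, journal = NULL, author = NULL, year = NULL) {
  params <- c(q = URLencode(keywords, reserved = TRUE),
              if (!is.null(journal)) c(as_publication = URLencode(journal, reserved = TRUE)),
              if (!is.null(author))  c(as_sauthors = URLencode(author, reserved = TRUE)),
              if (!is.null(year))    c(as_ylo = year, as_yhi = year))
  paste0("https://scholar.google.com/scholar?",
         paste(names(params), params, sep = "=", collapse = "&"))
}

# Placeholder values; substitute the journal, year, and key words found in Step 2
scholar_url("key words from the press release",
            journal = "Journal Name", year = 2015)
```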

 

Step 4: Victory

Often this will come up with the article you are looking for:

[Screenshot: the original article in the Google Scholar search results]

 

Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag [#icanhazpdf](https://twitter.com/hashtag/icanhazpdf) and your contact info. Then you just have to hope that someone will send it to you (they often do).

 

 

Statistics and R for the Life Sciences: New HarvardX course starts January 19

The first course of our Biomedical Data Science online curriculum starts next week. You can sign up here. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals and power calculations, for example). By doing this, students will learn Statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each video students will have the opportunity to assess their understanding through homeworks involving coding in R. A big improvement over the first version is that we have added dozens of assessments.

Note that this course is the first in an eight-part series on Data Analysis for Genomics. Updates will be provided via Twitter @rafalab.

 

[Screenshot: a lecture module in the edX course]
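To give a flavor of the simulation-based approach described above (this is a sketch in that spirit, not actual course material), here is the kind of R code students write along with the lectures: a Monte Carlo p-value and a simulation check of confidence-interval coverage, with no formulas to memorize.

```r
set.seed(1)
n <- 25
nsim <- 10000

# Null distribution of the difference in sample means when the two groups
# really come from the same population
null_diffs <- replicate(nsim, mean(rnorm(n)) - mean(rnorm(n)))

# One "observed" difference from data where the true difference is 0.5
obs_diff <- mean(rnorm(n, mean = 0.5)) - mean(rnorm(n, mean = 0))

# Two-sided p-value: how often is the null simulation as extreme as what we saw?
mean(abs(null_diffs) >= abs(obs_diff))

# Coverage of the usual 95% confidence interval for a mean, checked by simulation
covers <- replicate(nsim, {
  x <- rnorm(n, mean = 0.5)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= 0.5 && 0.5 <= ci[2]
})
mean(covers)  # should land near 0.95
```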

Beast mode parenting as shown by my Fitbit data

This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.

Here is Saturday:

[Fitbit sleep plot for Saturday]

 

 

There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:

 

[Fitbit sleep plot for Sunday]

Check that out. I was totally asleep from like 4am-6am there. Nice.

Stay tuned for much more from my Fitbit data over the next few weeks.

 

 

Sunday data/statistics link roundup (1/4/15)

  1. I am digging this visualization of your life in weeks. I might have to go so far as to actually make one for myself.
  2. I’m very excited about the new podcast TalkingMachines (and what an awesome name!). I wish someone would do the same thing for applied statistics (Roger?)
  3. I love that they call Ben Goldacre the anti-Dr. Oz in this piece, especially given how often Dr. Oz is telling the truth.
  4. If you haven’t read it yet, this piece in the Economist on statisticians during the war is really good.
  5. The arXiv celebrated its one millionth paper upload. It costs less to run than the top 2 executives at PLoS make. It is too darn expensive to publish open access right now.