16 Jan 2015
People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:
- empirical processes
- proportional hazards model
- generalized linear model
- semiparametric
- generalized estimating equation
- false discovery rate
- microarray statistics
- lasso shrinkage
- rna-seq statistics
Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, here are the first 10 papers you get when you search for “false discovery rate”:
- Controlling the false discovery rate: a practical and powerful approach to multiple testing
- Thresholding of statistical maps in functional neuroimaging using the false discovery rate
- The control of the false discovery rate in multiple testing under dependency
- Controlling the false discovery rate in behavior genetics research
- Identifying differentially expressed genes using false discovery rate controlling procedures
- The positive false discovery rate: A Bayesian interpretation and the q-value
- On the adaptive control of the false discovery rate in multiple testing with independent statistics
- Implementing false discovery rate control: increasing your power
- Operating characteristics and extensions of the false discovery rate procedure
- Adaptive linear step-up procedures that control the false discovery rate
People who work in this area will recognize that many of these papers are the most important/most cited in the field.
Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:
You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The medians for all of the topics were:
- Emp. Proc. 1990.241
- Prop. Haz. 1990.929
- GLM 1994.433
- Semi-param. 1994.433
- GEE 2000.379
- FDR 2002.760
- microarray 2003.600
- lasso 2004.900
- rna-seq 2010.765
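A summary like the one above can be sketched in a few lines. This is not the script I used, just a minimal Python illustration under some assumptions: each scraped hit is assumed to come with a byline of the form "authors - venue, year - site", the helper names `extract_year` and `median_year_by_term` are made up for this example, and the toy bylines are placeholders rather than real search results. Hits whose byline has no recognizable year are dropped, which is one way the missing values mentioned above can arise.

```python
import re
from collections import defaultdict
from statistics import median

# Four-digit years from 1800 to 2099 (an assumed, deliberately loose range).
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def extract_year(byline):
    """Return the last four-digit year in a byline, or None if there is none."""
    matches = YEAR_RE.findall(byline)
    return int(matches[-1]) if matches else None


def median_year_by_term(hits):
    """Compute the median publication year per search term.

    hits is an iterable of (search_term, byline) pairs; hits without a
    recognizable year are skipped rather than guessed.
    """
    years = defaultdict(list)
    for term, byline in hits:
        year = extract_year(byline)
        if year is not None:
            years[term].append(year)
    return {term: median(values) for term, values in years.items()}


# Placeholder data for illustration only, not real Google Scholar output.
toy_hits = [
    ("term a", "A Author - Journal of Examples, 1995 - example.org"),
    ("term a", "B Author - Example Letters, 2001 - example.org"),
    ("term a", "C Author - example.org"),  # no year: treated as missing
    ("term a", "D Author - Example Review, 2003 - example.org"),
    ("term b", "E Author - Examples, 2010 - example.org"),
]

print(median_year_by_term(toy_hits))  # {'term a': 2001, 'term b': 2010}
```

Taking the last year in the byline is a design choice: if a venue name happens to contain a year, the publication year usually comes after it.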
I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations. It also suggests a reason for success in science other than individual brilliance. Given the potentially negative consequences the expectation of brilliance has on certain subgroups, it is important to recognize the importance of timing and luck. The median publication year of the most cited “false discovery rate” papers was 2002, but almost none of the 30 top hits were published after about 2008.
The code for my analysis is here. It is super hacky so have mercy.
15 Jan 2015
I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.
Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.
Here is the news article (link):
Step 1: Look for a link to the article
Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. This is not the original research article. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.
Step 2: Look for names of the authors, scientific key words and journal name if available
You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:
And some key words:
Step 3: Use Google Scholar
You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to Google Scholar then click on the little triangle next to the search box.
Fill in as much information as you can: the same year as the press release, the journal, the university and key words.
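If you do this a lot, you can build the same kind of search as a URL instead of clicking through the form. The sketch below is only a rough illustration with some loud assumptions: the query parameter names `as_ylo`, `as_yhi` and `as_publication` are what the advanced search form appears to put in the address bar, they are not a documented API and may change, and the function name `scholar_search_url` and the example key words are made up for this post.

```python
from urllib.parse import urlencode


def scholar_search_url(keywords, year=None, journal=None):
    """Build a Google Scholar search URL from key words, a year and a journal.

    The parameter names below are assumptions based on the URLs the
    advanced search form seems to produce; treat them as unofficial.
    """
    params = {"q": keywords}
    if year is not None:
        # Restrict results to a single publication year.
        params["as_ylo"] = year
        params["as_yhi"] = year
    if journal:
        params["as_publication"] = journal
    return "https://scholar.google.com/scholar?" + urlencode(params)


# Hypothetical key words and journal, not the ones from the article above.
print(scholar_search_url("sleep memory", year=2015, journal="Nature"))
```

Paste the printed URL into a browser and then check the hits by hand, exactly as you would with the form.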
Step 4: Victory
Often this will come up with the article you are looking for:
Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag [#icanhazpdf](https://twitter.com/hashtag/icanhazpdf) and your contact info. Then you just have to hope that someone will send it to you (they often do).
12 Jan 2015
The first course of our Biomedical Data Science online curriculum
starts next week. You can sign up here. Instead of relying on
mathematical formulas to teach statistical concepts, students can
program along as we show computer code for simulations that illustrate
the main ideas of exploratory data analysis and statistical inference
(p-values, confidence intervals and power calculations for example).
By doing this, students will learn Statistics and R simultaneously and
will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each
video students will have the opportunity to assess their understanding
through homeworks involving coding in R. A big improvement over the
first version is that we have added dozens of assessments.
Note that this course is the first in an eight-part series on Data Analysis for Genomics. Updates will be provided via Twitter @rafalab.
07 Jan 2015
This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.
Here is Saturday:
There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:
Check that out. I was totally asleep from like 4am-6am there. Nice.
Stay tuned for much more from my Fitbit data over the next few weeks.