Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sunday data/statistics link roundup (5/27)

  1. Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine: “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is that you get to see how statistics works under a deadline. That is typically the analyst’s role, since they come in late and there is usually a deadline looming…
  2. An interview with Steve Blank about Silicon Valley and how venture capitalists (VCs) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one: “If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google, who are funding interesting “real science” problems. This probably deserves its own post later in the week: the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience, and the data take much longer to collect.
  3. An interesting post on how Obama’s analytics department ran an A/B test that improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million; there may be confounding factors, since the individuals who sign up under the best combination of image and button don’t necessarily donate as much. But still, an interesting look into why Obama needs statisticians (see the sketch after this list).
  4. A cute statistics cartoon from @kristin_linn via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-).
  5. Rafa’s post inspired some interesting conversation, both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion of how we evaluate the contribution/effort behind making good, usable academic software. I think the strength of the Bioconductor community and the rise of GitHub among academics are a good start. For example, it is really useful that Bioconductor now tracks the number of package downloads.
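On the A/B test in item 3: here is a minimal sketch of the two-proportion comparison that underlies this kind of test. The counts are made up for illustration; they are not the campaign’s numbers.

    # Hypothetical A/B test on a signup page. The counts are invented
    # for illustration; they are not from the Obama campaign.
    signups  <- c(1150, 1310)    # signups under page variants A and B
    visitors <- c(10000, 10000)  # visitors randomly assigned to each variant

    # Two-sample test for equality of signup proportions
    prop.test(signups, visitors)

A decisive difference here justifies picking the winning variant for signups, but tying that choice to $60 million in donations would require following the same individuals through to what they actually donate, which is exactly the confounding worry above.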

"How do we evaluate statisticians working in genomics? Why don't they publish in stats journals?" Here is my answer

During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS, and Biometrics. The evaluators feel safe counting these papers because they trust the fellow-statistician editors of these journals. However, statisticians working in genomics tend to publish in journals like Nature Genetics, Genome Research, PNAS, Nature Methods, Nucleic Acids Research, Genome Biology, and Bioinformatics. In general, these journals do not recruit statistical referees, and a considerable number of papers with questionable statistics do get published in them. However, when a paper’s main topic is a statistical method, or when it relies heavily on statistical methods, statistical referees are used. So if the statistician is the corresponding or last author and it’s a stats paper, it is OK to assume the statistics are fine, and you should go ahead and be impressed by the impact factor of the journal… it’s not easy getting statistics papers into these journals.

But we really should not be counting papers blindly. Instead, we should be reading at least some of them. Here again the evaluators get stuck, as we tend to publish papers full of application- and technology-specific jargon, and we show off by presenting results that are of interest to our potential users (biologists) and not necessarily to our fellow statisticians. Here all I can recommend is that you seek help. There are now a handful of us who are full professors, and most of us are more than willing to help out with, for example, promotion letters.

So why don’t we publish in statistical journals? The fear of getting scooped due to the slow turnaround of stats journals is only one reason. New technologies that quickly became widely used (microarrays in 2000 and next-generation sequencing today) created a need for data analysis methods among large groups of biologists. Journals with large readerships and high impact factors, typically not interested in straight statistical methodology work, suddenly became amenable to publishing our papers, especially if they solved a data analytic problem faced by many biologists. The possibility of publishing in widely read journals is certainly seductive.

While in several other fields data analysis methodology development is largely confined to the statistics discipline, in genomics we compete with other quantitative scientists capable of developing useful solutions: computer scientists, physicists, and engineers were also seduced by the possibility of gaining recognition through publications in high-impact journals. Thus, in genomics, the competition for funding, citations, and publication in the top scientific journals is fierce.

Then there is funding. Note that while most biostatistics methodology NIH proposals go to the Biostatistical Methods and Research Design (BMRD) study section, many of the genomics-related grants get sent to other study sections, such as Genomics, Computational Biology and Technology (GCAT) and Biodata Management and Analysis (BDMA). BDMA and GCAT are much more impressed by Nature Genetics and Genome Research than by JASA and Biometrics. They also look for citations and software downloads.

To be considered successful by our peers in genomics, those who referee our papers and review our grant applications, our statistical methods need to be delivered as software and garner a user base. Publications in statistical journals, especially those not appearing in PubMed, are not rewarded. This lack of incentive, combined with how time-consuming it is to produce and maintain usable software, has led many statisticians working in genomics to focus solely on the development of practical methods rather than generalizable mathematical theory. As a result, statisticians working in genomics do not publish much in the traditional statistical journals. You should not hold this against them, especially if they are developers and maintainers of widely used software.
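On the software-downloads point: Bioconductor publishes per-package download counts as plain tab-delimited files, so an evaluator can check a package’s user base directly. A minimal sketch in R; the URL pattern is my assumption based on Bioconductor’s public stats pages and may change.

    # Fetch monthly download statistics for a Bioconductor package.
    # The URL pattern is an assumption based on Bioconductor's public
    # stats pages and may change over time.
    pkg <- "limma"
    url <- sprintf("http://bioconductor.org/packages/stats/bioc/%s/%s_stats.tab",
                   pkg, pkg)
    stats <- read.table(url, header = TRUE, stringsAsFactors = FALSE)
    head(stats)  # one row per month, with distinct-IP and download counts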

Sunday data/statistics link roundup (5/20)

It’s grant season around here, so I’ll be brief:
  1. I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at high-level analyses and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
  2. On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea that you could “figure it out” with a set of simple equations is laughable, and if you check out their model you will see it is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J.
  3. Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret and visualize (a toy example follows this list).
  4. Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis. 
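The toy example promised in item 3: build a small raster, fill it with simulated values, and apply a simple focal (moving-window) mean. This is a minimal sketch of the package’s basics, not one of the applications from the linked post.

    library(raster)  # install.packages("raster") if needed

    # A 100 x 100 raster over the unit square, filled with uniform noise
    r <- raster(nrows = 100, ncols = 100, xmn = 0, xmx = 1, ymn = 0, ymx = 1)
    values(r) <- runif(ncell(r))
    plot(r, main = "Uniform noise")

    # Smooth with a 3 x 3 moving-average filter via focal()
    r_smooth <- focal(r, w = matrix(1/9, nrow = 3, ncol = 3))
    plot(r_smooth, main = "3 x 3 mean filter")

Even this little exercise makes the interpret/visualize point: the before-and-after plots show exactly what the filter did to the image.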

The West Wing Was Always A Favorite Show Of Mine

[Video: http://www.youtube.com/watch?v=t7FJFuuvxpI]

The West Wing was always a favorite show of mine (at least seasons 1-4, the Sorkin years), and I think this is a great scene about the difference between evidence and interpretation. The topic is a 5-day waiting period for gun purchases, and they’ve just received a poll from a few specific congressional districts showing weak support for this proposed policy.

Health by numbers: A statistician’s challenge