Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

R and the little data scientist's predicament

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, says that computer programming languages have grown too complex, so children and newcomers can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a line or two of simple code they could make the computer do things like play some music or make a video. 

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax. 

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets with one line of uncomplicated code. You could write an interface to RCurl/read.table/download.file for a defined set of APIs/data sets, so the command would be something like load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background. 
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures. 
  3. It would be awesome if the functions would include some sort of dynamic graphics (with svgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn. 

That alone would be a huge start. With just a couple of lines, kids could load and visualize cool data in a pretty way they could show their parents and friends. 
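To make the idea concrete, here is a minimal sketch of what those two lines, and the wrappers hiding behind them, might look like. The function names and the data source URL are hypothetical placeholders for illustration only, not an existing package:

```r
# A rough sketch; all names and URLs below are hypothetical placeholders.
load_data <- function(name) {
  # the package would ship a curated lookup table of clean data set URLs
  sources <- c("education-data" = "https://example.org/education.csv")
  read.csv(sources[[name]], stringsAsFactors = FALSE)
}

quick_plot <- function(df) {
  # pick the first two numeric columns and plot them with friendly defaults,
  # so the user never has to touch pch, cex, col, etc.
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  plot(num[[1]], num[[2]],
       xlab = names(num)[1], ylab = names(num)[2],
       pch = 19, col = "steelblue", cex = 1.2)
}

# The "two lines" a kid would actually type:
edu <- load_data("education-data")
quick_plot(edu)
```

A real version would need many curated sources and smarter defaults, but the point is that all the RCurl/read.table plumbing and plotting syntax stays hidden behind two friendly calls.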

Sunday data/statistics link roundup (3/25)

  1. The psychologist whose experiment didn’t replicate, and who then went off on the scientists who ran the replication, is at it again. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. There is more on “conceptual replication” in there too. 
  2. Berkeley is running a data science course with instructors Jeff Hammerbacher and Mike Franklin. I looked through the notes and it looks pretty amazing. Stay tuned for more info about my applied statistics class, which starts this week. 
  3. A cool article about Factual, one of the companies whose sole mission in life is to collect and distribute data. We’ve linked to them before. We are so out ahead of the Times on this one…
  4. This isn’t statistics related, but I love this post about Jeff Bezos. If we all indulged our inner 11-year-old a little more, it wouldn’t be a bad thing. 
  5. If you haven’t had a chance to read Reeves’ guest post on the Mayo Supreme Court decision yet, you should; it is really interesting. A fascinating intersection of law and statistics is going on in the personalized medicine world right now. 

Some thoughts from Keith Baggerly on the recently released IOM report on translational omics

Shortly after the Duke trial scandal broke, the Institute of Medicine convened a committee to write a report on translational omics. Several statisticians (including one of our interviewees) either served on the committee or provided key testimony. The report came out yesterday.  Nature, Nature Medicine, and Science had posts about the release. Keith Baggerly sent an email with his thoughts and he gave me permission to post it here. He starts by pointing out that the Science piece has a key new observation:

The NCI’s Lisa McShane, who spent months herself trying to validate Duke results, says the IOM committee “did a really fine job” in laying out the issues. NCI now plans to require that its cooperative groups who want to use omics tests follow a checklist similar to that in the IOM report. NCI has not yet decided whether it should add new requirements for omics tests to its peer review process for investigator-initiated grants. But “our hope is that this report will heighten everyone’s awareness,” McShane says. 

Some further thoughts from Keith:

First, the report helps clarify the regulatory landscape: if omics-based tests (which the FDA views as medical devices) will direct patient therapy, FDA approval in the form of an Investigational Device Exemption (IDE) is required. This is in keeping with increased guidance FDA has been providing over the past year and a half dealing with companion diagnostics. It seems likely that several of the problems identified with the Duke trials would have been caught by an FDA review, particularly if the agency already had cause for concern, such as a letter to the editor identifying analytical shortcomings. 

Second, the report recommends the publication of the full data, code, and metadata used to construct the omics assays prior to their use to guide patient therapy. Had such data and code been available earlier, this would have greatly reduced the amount of effort required for others (including us) to check and potentially extend on the underlying results.

Third, the report emphasizes, repeatedly, that the test must be fully specified (“locked down”) before it is validated, let alone used to guide patient therapy. Quite a bit of effort is given to providing an explicit definition of locked down, in part (we suspect) because both Lisa McShane (NCI) and Robert Becker (FDA) provided testimony that incomplete specification was a problem their agencies encountered frequently. Such specification would have prevented problems such as that identified by the NCI for the Lung Metagene Score (LMS) in 2010, which led the NCI to remove the LMS evaluation as a goal of the Phase III cooperative group trial CALGB-30506.

Finally, the very existence of the report is recognition that reproducibility is an important problem for the omics-test community. This is a necessary step towards fixing the problem.

This graph shows that President Obama's proposed budget treats the NIH even worse than G.W. Bush - Sign the petition to increase NIH funding!

The NIH provides financial support for a large percentage of biological and medical research in the United States. This funding supports a large number of US jobs, creates new knowledge, and improves healthcare for everyone. So I am signing this petition:


NIH funding is essential to our national research enterprise, to our local economies, to the retention and careers of talented and well-educated people, to the survival of our medical educational system, to our rapidly fading worldwide dominance in biomedical research, to job creation and preservation, to national economic viability, and to our national academic infrastructure.


The current administration is proposing a flat $30.7 billion NIH budget for FY 2013. The graph below (left) shows how small the NIH budget is in comparison to the Defense and Medicare budgets in absolute terms. The difference between the administration’s proposal and the petition’s proposal ($33 billion) is barely noticeable. 

The graph on the right shows how, in 2003, growth in the NIH budget fell dramatically while Medicare and military spending kept growing. However, despite the decrease in rate, the NIH budget did continue to increase under Bush. If we follow Bush’s post-2003 rate (dashed line), the 2013 budget would be about what the petition asks for: $33 billion.  
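For readers who want to check the dashed-line arithmetic, here is a back-of-the-envelope version in R. The starting figure and growth rate are rounded approximations used only for illustration; the graph itself is based on the official budget series:

```r
# Back-of-the-envelope extrapolation (figures in billions of dollars;
# both numbers are rounded approximations, not official budget values)
nih_2003 <- 27      # roughly the FY 2003 NIH budget
growth   <- 0.02    # roughly the post-2003 annual growth rate under Bush
nih_2013 <- nih_2003 * (1 + growth)^(2013 - 2003)
round(nih_2013, 1)  # about 33, i.e. close to the $33 billion the petition asks for
```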


If you agree that the relatively modest increase in the NIH budget is worth the incredibly valuable biological, medical, and economic benefits this funding will provide, please consider signing the petition before April 15. 

Big Data for the Rest of Us, in One Start-Up