Make a Christmas Tree in R with random ornaments/presents

24 Dec 2012

Happy holidays!

Sunday data/statistics link roundup 12/23/12

23 Dec 2012

A cool data visualization for blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where/when major health issues arise for chronic diseases. This was a class project by Jeff Heer’s Stanford CS448B students Ben Rudolph and Reno Bowen (twitter @RenoBowen). Speaking of interactive visualizations, I also got this link from Patrick M. It looks like a way to build interactive graphics and my understanding is it is compatible with R data frames, worth checking out (plus, Dex is a good name).
Here is an interesting review of Nate Silver’s book. The interesting thing about the review is that it doesn’t criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we’ve seen before. Gelman also reviews the review.
It’s a little late now, but this tool seems useful for folks who want to know whatdoineedonmyfinal?
A list of the best open data releases of 2012. I particularly like the rat sightings in New York and think the Baltimore fixed speed cameras (which I have a habit of running afoul of).
A map of data scientists on Twitter. Unfortunately, since we don’t have “data scientist” in our Twitter description, Simply Statistics does not appear. I’m sure we would have been central….
Here is an interesting paper where some investigators developed a technology that directly reads out a bar chart of the relevant quantities. They mention this means there is no need for statistical analysis. I wonder if the technology also reads out error bars.

The NIH peer review system is still the best at identifying innovative biomedical investigators

20 Dec 2012

This recent Nature paper makes the controversial claim that the most innovative (interpreted as best) scientists are not being funded by NIH. Not surprisingly, it is getting a lot of attention in the popular media. The title and introduction make it sound like there is a pervasive problem biasing the funding enterprise against innovative scientists. To me this appears counterintuitive given how much innovation, relative to other funding agencies around the world, comes out of NIH funded researchers (here is a recent example) and how many of the best biomedical investigators in the world elect to work for NIH funded institutions. The authors use data to justify their conclusions but I do not find it very convincing.

First, the paper defines innovative/non-conformist scientists as those with a first/last/single author paper with 1000+ citations in the years 2002-2012. Obvious problems with this definition are already pointed out in the comments of the original paper but for argument’s sake I will accept it as useful quantification The key data point the authors use is that only 2/5 of people with a first/last single author 1000+ citation paper are principal investigators on NIH grants. I would need to see the complete 2x2 table for people that actually applied for grants (1000+ citations or not x got NIH grant or not) to be convinced. The reported ratio is meaningful only if most people with 1000+ papers are applying for grants but the authors doen’t report how many are retired, or are still postdocs, or went into industry, or are one-hit-wonders. Given that the payline is about 8%-15%, the 40% number may actually imply that NIH is in fact funding innovative people at a high rate.

The paper also implies that many of the undeserving funding recipients are connected individuals that serve on study sections. The evidence for this is that they are funded at a much higher rate than individuals with 1000+ citation papers. But as the authors themselves point out, study section members are often recruited from the subset of individuals who have NIH grants (it’s a way to give back to NIH). This does not suggest bias in the process, it just suggests that if you recruit funded people to be on a panel, that panel will have a higher rate of funded people.

NIH’s peer review system is far from perfect but it somehow manages to produce the best biomedical research in the world. How does this happen? Well, I think it’s because NIH is currently funding some of the most innovative biomedical researchers in the world. The current system can certainly improve, but perhaps we should focus on concrete proposals with hard evidence that they will actually make things better.

Disclaimers: I am a regular member of an NIH study section. I am PI on NIH grants. I am on several papers with more than 1000 citations.

Rafa interviewed about statistical genomics

19 Dec 2012

He talks about the problems created by the speed of increase in data sizes in molecular biology, the way that genomics is hugely driven by data analysis/statistics, how Bioconductor is an example of bottom up science, Simply Statistics gets a shout out, how new data are going to lead to new modeling/statistical challenges, and gives an ode to boxplots. It’s worth watching the whole thing…

The value of re-analysis

18 Dec 2012

I just saw this really nice post over on John Cook’s blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisors code back in my youth).

In a more statistical version of this idea, Gary King has proposed reproducing the analysis in a published paper as a way to get a paper of your own. You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication involves actually two levels of thinking:

Can you actually reproduce the code used to perform the analysis?
Can you solve the “paper as puzzle” exercise proposed by Ethan Perlstein over at his site. Given the results in the paper, can you come up with the story?

Both of these things require a bit more “higher level thinking” than just re-running the analysis if you have the code. But I think even the seemingly “low-level” task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out Rpubs or RunMyCode or even the right parts of Figshare you can find data analyses you can run through and reproduce.

The only downside is there is currently no measure of quality on these published analyses. It would be great if people could focus their time re-typing only good data analyses, rather than one at random. Or, as a guy once (almost) said, “Data analysis practice doesn’t make perfect, perfect data analysis practice makes perfect.”

Older Newer

Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Make a Christmas Tree in R with random ornaments/presents

Sunday data/statistics link roundup 12/23/12

The NIH peer review system is still the best at identifying innovative biomedical investigators

Rafa interviewed about statistical genomics

The value of re-analysis