Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

This Is Not About Statistics But It's About

[youtube http://www.youtube.com/watch?v=p3Te_a-AGqM?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]

This is not about statistics, but it’s about Emacs, which I’ve been using for a long time. This guy is an Emacs virtuoso, and the crazy thing is that he’s only been using it for 8 months!

Best line: “Should I wait for the next version of Emacs? Hell no!”

(Thanks to Brian C. and Kasper H. for the pointer.)

What is the most important code you write?

These days, as with most people, the research I do involves writing a lot of code. A lot of it. Usually, you need some code to

  1. Process the data to take it from its original format to the format that’s convenient for you
  2. Run exploratory data analyses creating plots, calculating summary statistics, etc.
  3. Try statistical model 1
  4. Try statistical model 2
  5. Try statistical model 3
  6. Fit final statistical model; if this involves MCMC then there’s usually a ton of code to do this
  7. Make some more plots of results, make tables, more summary statistics of output
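To make the shape of that workflow concrete, here is a minimal sketch of the steps above in Python. Everything in it is hypothetical and for illustration only: the function names (`process`, `explore`, `fit_mean_only`, `fit_line`, `run_analysis`), the comma-separated input format, and the two toy candidate models are assumptions, not code from any real analysis. The in-sample residual sum of squares used to pick the "final" model is a deliberately crude stand-in for the judgment calls a real analysis involves.

```python
import statistics


def process(raw_rows):
    """Step 1: turn raw "x,y" text records into (x, y) float pairs."""
    cleaned = []
    for row in raw_rows:
        try:
            x_text, y_text = row.split(",")
            cleaned.append((float(x_text), float(y_text)))
        except ValueError:
            continue  # skip malformed records (wrong field count or non-numeric)
    return cleaned


def explore(data):
    """Step 2: a few exploratory summary statistics."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    return {"n": len(data), "mean_x": statistics.mean(xs), "mean_y": statistics.mean(ys)}


def fit_mean_only(data):
    """Candidate model 1 (toy): predict the mean of y for every x."""
    mean_y = statistics.mean([y for _, y in data])
    return lambda x: mean_y


def fit_line(data):
    """Candidate model 2 (toy): ordinary least squares straight line."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rss(model, data):
    """Residual sum of squares of a fitted model on the data."""
    return sum((y - model(x)) ** 2 for x, y in data)


def run_analysis(raw_rows):
    """Run every step and keep a record of all models tried, not just the last."""
    data = process(raw_rows)
    summary = explore(data)
    candidates = {"mean_only": fit_mean_only(data), "line": fit_line(data)}
    scores = {name: rss(model, data) for name, model in candidates.items()}
    # Crude, assumed selection rule: smallest in-sample RSS wins.
    final_name = min(scores, key=scores.get)
    return {"summary": summary, "scores": scores, "final_model": final_name}


if __name__ == "__main__":
    example_rows = ["1,2", "2,4", "3,6", "bad row", "4,8"]
    print(run_analysis(example_rows))
```

Note that `run_analysis` returns the scores for every candidate model alongside the final choice. Even in a toy like this, the code for the final model is only one of several functions.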

My question is, of all this code, which is the most important? The code that fits the final model? The code that summarizes the results? Often you just see the code that fit the final statistical model and maybe some of the code that summarizes the results. The code for fitting all of the previous models and doing the exploratory analysis is lost in the ether (or at least the version control ether). Now, I’m not saying I always want to see all that other code. Usually, I am just interested in the final model.

The point is that the code for the final model only represents a small fraction of the work that was done to get there. This work is the bread and butter of applied statistics and it is essentially thrown out. Of course, life would be much easier if someone would just tell me what the final model would be every time. Then I would just fit it! But nooooo, hundreds or thousands of lines of code and numerous judgment calls go into figuring out what that last model is going to be. 

Yet when you read a paper, it more or less looks like the final model appeared out of thin air because there’s no space/time to tell the story about everything that came before. I would say the same is true for theoretical statistics too. It’s not as if theorems/proofs appear out of nowhere. Hard work goes into figuring out both the right theorem to prove and the right way to prove it.

But I would argue that there’s one key difference between theoretical and applied statistics in this regard: Everyone seems to accept that theoretical statistics is hard. So when you see a theorem/proof in a paper you consciously or unconsciously realize that it must have been hard work to arrive at that point. But in a great applied statistics paper, all you get is an interesting scientific question and some graphs/tables that provide an answer. Who cares about that?

Seriously though, even for a seasoned applied statistician, it’s sometimes easy to forget that everything looks easy once someone else has done all the work. It’s not clear to me whether we just need to change expectations or if we need a different method for communicating the effort involved (or both). Making research reproducible would be one approach, as it would require that all the code for the work be available. But that’s mostly just “final model” stuff plus some data processing code. Going one step further might require that a git repository be made available. That way you could see all the history in addition to the final stuff. I’m guessing there would be some resistance to universally adopting that approach!

Another approach might be to allow applied stat papers to go into more of the details about the process. With strict space limitations these days, it’s often hard enough to talk about the final model. But in some cases I think I would enjoy reading the story behind the story. Some of that “backstory” would make for good instructional material for applied stat classes.

Statistical Reasoning on iTunes U

Our colleague, the legendary John McGready, has just put his Statistical Reasoning I and Statistical Reasoning II courses on iTunes U. This course sequence is extremely popular here at Johns Hopkins and now the entire world can experience the joy.

My worst (recent) experience with peer review

My colleagues and I just published a paper on validation of genomic results in BMC Bioinformatics. It is “highly accessed” and we are really happy with how it turned out. 

But it was brutal getting it published. Here is the line-up of places I sent the paper. 

  • Science: Submitted 10/6/10, rejected 10/18/10 without review. I know this seems like a long shot, but this paper on validation was published in Science not too long after. 
  • Nature Methods: Submitted 10/20/10, rejected 10/28/10 without review. Not much to say here, moving on…
  • Genome Biology: Submitted 11/1/10, rejected 1/5/11. 2/3 referees thought the paper was interesting, few specific concerns raised. I felt they could be addressed so appealed on 1/10/11, appeal accepted 1/20/11, paper resubmitted 1/21/11. Paper rejected 2/25/11. 2/3 referees were happy with the revisions. One still didn’t like it. 
  • Bioinformatics: Submitted 3/3/11, rejected 3/13/11 without review. I appealed again, and the response was: “I have checked with the editors about this for you and their opinion was that there was already substantial work in validating gene lists based on random sampling.” If anyone knows about one of those papers let me know :-). 
  • Nucleic Acids Research: Submitted 3/18/11, rejected with invitation for revision 3/22/11. Resubmitted 12/15/11 (got delayed by a few projects here), rejected 1/25/12. The reason for rejection seemed to be that one referee had major “philosophical issues” with the paper.
  • BMC Bioinformatics: Submitted 1/31/12, first review 3/23/12, resubmitted 4/27/12, second revision requested 5/23/12, revised version submitted 5/25/12, accepted 6/14/12. 

An interesting side note is that the really brief reviews from the Genome Biology submission inspired me to do this paper. I had time to conceive the study, get IRB approval, build a web game for peer review, recruit subjects, collect the data, analyze the data, write the paper, submit the paper to 3 journals and have it come out 6 months before the paper that inspired it was published!

Ok, glad I got that off my chest.

What is your worst peer-review story?

How Does the Film Industry Actually Make Money?
