Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sunday data/statistics link roundup (10/27/13)

  1. PubMed Commons is a new post-publication commenting system. I think this is a great idea and I hope it succeeds. Right now it is in “private beta,” so only people with PubMed Commons accounts can post/view comments, but you can follow along with who is making comments via this neat Twitter bot. I think the main feature it lacks to become a hugely successful experiment is a way to give real, tangible academic credit to commenters. One very obvious way would be to assign DOIs to every comment and make the comments themselves PubMed searchable. Then comments could be listed as contributions on CVs - a major incentive.
  2. A post on the practice of asking potential hires tricky math problems - even if they are going to be hired to do something else (like software engineering). This happens all the time in academia as well - often the exams we give and the questions we ask aren’t neatly aligned with the ultimate goals of a program (producing innovative, determined researchers).
  3. This is going to be a short Sunday Links because my Coursera class is starting again tomorrow.
  4. Don’t forget that next week is the [Unconference on the Future of Statistics](https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358) on Wednesday, October 30th at noon Baltimore time!

(Back to) The Future of Statistical Software #futureofstats

In anticipation of the upcoming Unconference on the Future of Statistics next Wednesday at 12-1pm EDT, I thought I’d dig up what people in the past had said about the future so we can see how things turned out. In doing this I came across an old National Academy of Sciences report from 1991 on the Future of Statistical Software. This was a panel discussion hosted by the National Research Council and summarized in this volume. I believe you can download the entire volume as a PDF for free from the NAS web site.

The entire volume is a delight to read, but I was particularly struck by Daryl Pregibon’s presentation on “Incorporating Statistical Expertise into Data Analysis Software” (starting on p. 51). Pregibon describes his (unfortunate) experience trying to develop statistical software that could incorporate expert knowledge into data analysis. From his description of his goals, it’s clear in retrospect how ambitious the attempt to build a kind of general-purpose statistical analysis machine was. In particular, it was not clear how to incorporate subject matter information.

[T]he major factor limiting the number of people using these tools was the recognition that (subject matter) context was hard to ignore and even harder to incorporate into software than the statistical methodology itself. Just how much context is required in an analysis? When is it used? How is it used? The problems in thoughtfully integrating context into software seemed overwhelming.

Pregibon skirted the problem of integrating subject matter context into statistical software.

I am not talking about integrating context into software. That is ultimately going to be important, but it cannot be done yet. The expertise of concern here is that of carrying out the plan, the sequence of steps used once the decision has been made to do, say, a regression analysis or a one-way analysis of variance. Probably the most interesting things statisticians do take place before that.

Statisticians (and many others) tend to focus on the application of the “real” statistical method - the regression model, lasso shrinkage, or support vector machine. But as much painful experience in a variety of fields has demonstrated, much of what happens before the application of the key model is as important, or even more important.
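To make that point concrete, here is a minimal sketch (my own toy example, not anything from Pregibon's presentation) in which the “real” method - ordinary least squares - is identical in two analyses, and only a preprocessing decision made before the model is applied differs. The data-generating model and the choice of a log transform are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data where the response is generated on the log scale:
# log(y) = 0.5 * x + noise.  The "real" method (least squares) is the
# same in both analyses; only the preprocessing step differs.
n = 200
x = rng.uniform(0, 10, n)
y = np.exp(0.5 * x + rng.normal(0, 1, n))

X = np.column_stack([np.ones(n), x])

# Analysis 1: skip preprocessing, regress y on x directly.
beta_raw, *_ = np.linalg.lstsq(X, y, rcond=None)

# Analysis 2: log-transform y first (the step "before the key model").
beta_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

print("slope, no preprocessing   :", round(beta_raw[1], 2))
print("slope, after log transform:", round(beta_log[1], 2), "(truth = 0.5)")
```

The raw-scale fit is dominated by a handful of huge values, while the log-scale fit recovers the slope used to generate the data - the model was never the issue, the step before it was.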

Pregibon makes an important point that although statisticians are generally resistant to incorporating their own expertise into software, they have no problem writing textbooks about the same topic. I’ve observed the same attitude when I talk about evidence-based data analysis. If I were to guess, the problem is that textbooks are still to a certain extent abstract, while software is 100% concrete.

Initial efforts to incorporate statistical expertise into software were aimed at helping inexperienced users navigate through the statistical software jungle that had been created…. Not surprisingly, such ideas were not enthusiastically embraced by the statistics community. Few of the criticisms were legitimate, as most were concerned with the impossibility of automating the “art” of data analysis. Statisticians seemed to be making a distinction between providing statistical expertise in textbooks as opposed to via software. [emphasis added]

In short, Pregibon wanted to move data analysis from an art to a science, more than 20 years ago! He stressed that data analysis, at that point in time, was not considered a process worth studying. I found the following paragraph interesting and worth considering now, over 20 years later. In it he talks about the reasons for incorporating statistical expertise into software.

The third [reason] is to study the data analysis process itself, and that is my motivating interest. Throughout American or even global industry, there is much advocacy of statistical process control and of understanding processes. Statisticians have a process they espouse but do not know anything about. It is the process of putting together many tiny pieces, the process called data analysis, and is not really understood. Encoding these pieces provides a platform from which to study this process that was invented to tell people what to do, and about which little is known. [emphasis added]

I believe we have come quite far since 1991, but I don’t think we know much more about the process of data analysis, especially in newer areas involving newer kinds of data. The reason is that the field has not put much effort into studying the whole data analysis process. I think there is still a resistance to studying this process, in part because it involves “stooping” to analyze data and in part because it is difficult to model with mathematics. In his presentation, Pregibon suggests that resampling methods like the bootstrap might allow us to skirt the mathematical difficulties in studying data analysis processes.
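Pregibon's suggestion is easy to make concrete today: instead of deriving the behavior of an entire analysis process mathematically, you can resample the data and rerun every step. The sketch below is my own toy illustration of that idea, not anything from the 1991 volume; the two-step pipeline (trim the most extreme residuals, then refit a regression) and all parameter values are assumptions chosen only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "data analysis process": a cleaning rule followed by a model fit.
def pipeline(x, y, trim=0.05):
    # Step 1 (cleaning): drop the observations with the largest residuals
    # from a preliminary fit -- a common, rarely-studied analyst decision.
    slope0, intercept0 = np.polyfit(x, y, 1)
    resid = np.abs(y - (intercept0 + slope0 * x))
    keep = resid <= np.quantile(resid, 1 - trim)
    # Step 2 (the "real" method): refit on the cleaned data.
    slope, _ = np.polyfit(x[keep], y[keep], 1)
    return slope

# Simulated data with a few gross outliers (true slope = 2).
n = 150
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)
y[:5] += 25  # contamination

# Bootstrap the entire process, not just the final estimator.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    boot.append(pipeline(x[idx], y[idx]))

print("pipeline estimate :", round(pipeline(x, y), 3))
print("bootstrap std err :", round(np.std(boot), 3))
```

The point is that the resampling treats the cleaning rule and the model fit as one object of study, which is exactly the "process of putting together many tiny pieces" Pregibon describes.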

One interesting lesson Pregibon relates from the development of REX, an early system of his that failed, involves the difference between the end goals of statisticians and non-statisticians:

Several things were learned from the work on REX. The first was that statisticians wanted more control. There were no users, rather merely statisticians looking over my shoulder to see how it was working. Automatically, people reacted negatively. They would not have done it that way. In contrast, non-statisticians to whom it was shown loved it. They wanted less control. In fact they did not want the system--they wanted answers.

The Leek group guide to reviewing scientific papers

There has been a lot of discussion of peer review on this blog and elsewhere. One thing I realized is that no one ever formally taught me the point of peer review or how to write a review.

Like a lot of other people, I have been frustrated by the peer review process. I also now frequently turn to my students to perform supervised peer review of papers, both for their education and because I can’t handle the large number of peer review requests I get on my own.

So I wrote this guide on how to write a review of a scientific paper and posted it on GitHub. The last time I did this, with R packages, a bunch of people contributed to make the guide better. I hope the same thing will happen this time.

Blog posts that impact real science - software review and GTEX

There was a flurry of activity on social media yesterday surrounding a blog post by Lior Pachter about the GTEX project - a large NIH-funded project with the goal of understanding expression variation within and among human beings. The project has measured gene expression in multiple tissues from over 900 individuals.

In the post, the author claims that the GTEX project is “throwing away” 90% of its data. The basis for this claim is a simulation study using the parameters from one of the author’s papers: in the simulation, increasing the number of mRNA fragments leads to increasing correlation in the abundance measurements, and to reach the same Spearman correlation that other methodologies achieve at 10M fragments, the software being used by GTEX needs 100M fragments.
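To give a sense of what that kind of simulation looks like, here is a stripped-down sketch: generate “true” transcript abundances, sample mRNA fragments at several depths, estimate abundances by simple counting, and compute the Spearman correlation between the estimates and the truth at each depth. This is my own simplified illustration - the naive counting estimator, the lognormal abundance distribution, and every parameter are made up, and none of it corresponds to Flux Capacitor, Cufflinks, or the actual settings in the post.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# "True" relative abundances for 5,000 transcripts (heavy-tailed, loosely
# mimicking expression data).  All parameters are invented for illustration.
n_transcripts = 5000
true_abundance = rng.lognormal(mean=0, sigma=2, size=n_transcripts)
true_abundance /= true_abundance.sum()

def simulated_correlation(n_fragments):
    # Sample fragments proportional to abundance and estimate abundance
    # by simple counting (a stand-in for a real quantification method).
    counts = rng.multinomial(n_fragments, true_abundance)
    est = counts / n_fragments
    rho, _ = spearmanr(true_abundance, est)
    return rho

for depth in [1_000_000, 10_000_000, 100_000_000]:
    print(f"{depth:>11,} fragments: Spearman rho = {simulated_correlation(depth):.3f}")
```

The comparison in the post is between methods rather than against a naive counter, but the basic logic - how many fragments a method needs to reach a given Spearman correlation - is the same.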

This post and the associated furor raise three issues:

  1. The power and advantage of blog posts and social media as a form of academic communication.
  2. The importance of using published software.
  3. Extreme critiques deserve as much scrutiny as extreme claims.

The first point is obvious: the post was rapidly disseminated and elicited responses from the leaders of the GTEX project. Interestingly, I think the blog post gave the authors an early view of the criticisms they will face from reviewers. The short-term criticism is probably not fun to deal with, but it might save them time later.

I think the criticism about using software that has not been fully vetted through the publication/peer review process is an important one. For such a large-scale project, you’d like to see the primary analysis done with “community approved” software. The reason is that we just don’t know whether the unvetted software is better or worse, because no one has published a study of it. It would be interesting to see how the bottom-up approach would have fared here. The good news for GTEX is that for future papers they will either put out a more comprehensive comparison or switch software - either of which will improve their work.

Regarding the third point, Pachter did a “back of the envelope” calculation that suggested the Flux software wasn’t performing well. These back-of-the-envelope calculations are very important - if you can’t solve the easy case, how can you expect to solve the hard case? Lost in all of the publicity about the 90% number, though, is that Pachter’s blog post hasn’t been vetted, either. Here are a few questions that immediately jumped to my mind when reading it:

  1. Why use Spearman correlation as the important measure of agreement?
  2. What is the correlation between replicates?
  3. What parameters did he use for the Flux calculation?
  4. Where is his code so we can see if there were any bugs (I’m sure he is willing to share, but I don’t see a link)?
  5. That 90% number seems very high; I wonder whether varying the simulation approach, parameter settings, etc. would show it isn’t quite that bad.
  6. Throwing away 90% of your data might not matter if you get the right answer to the question you care about at the end. Can we evaluate something closer to what we care about - a list of DE (differentially expressed) genes/transcripts, for example? (A rough sketch of that kind of check follows this list.)
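On that last question, here is one way to make “something closer to what we care about” operational: thin the counts down to 10% of the fragments, redo a differential-expression-style ranking, and ask how much of the top list survives. The sketch below is purely illustrative - the two-group design, the Poisson counts, and the naive ranking by mean log-count difference are made-up stand-ins, not the GTEX pipeline or any published method.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up experiment: 2,000 genes, two groups of 5 samples, 10% of genes
# truly differentially expressed.  Counts are Poisson for simplicity.
n_genes, n_per_group = 2000, 5
base = rng.lognormal(3, 1, n_genes)
effect = np.ones(n_genes)
effect[:200] = 2.0  # true DE genes get a 2-fold change
group1 = rng.poisson(base, size=(n_per_group, n_genes))
group2 = rng.poisson(base * effect, size=(n_per_group, n_genes))

def top_genes(g1, g2, k=200):
    # Naive DE ranking: absolute difference of mean log counts.
    score = np.abs(np.log1p(g1).mean(axis=0) - np.log1p(g2).mean(axis=0))
    return set(np.argsort(score)[-k:])

# "Throw away" 90% of the fragments by binomial thinning and compare
# the top-200 lists before and after.
keep = 0.10
g1_thin = rng.binomial(group1, keep)
g2_thin = rng.binomial(group2, keep)

full = top_genes(group1, group2)
thin = top_genes(g1_thin, g2_thin)
print("overlap of top-200 DE lists:", len(full & thin) / 200)
```

If a check like this showed the top list is largely unchanged, the 90% number would matter much less for the question GTEX actually cares about; if the list changed substantially, the critique would bite harder.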

Whenever a scientist sees a claim as huge as “throwing away 90% of the data,” they should be skeptical. This is particularly true in genomics, where huge effects are often due to bugs or artifacts. So in general, it is important that we apply the same level of scrutiny to extreme critiques as we do to extreme claims.

My guess is that, ultimately, the 90% number will turn out to be an overestimate of how bad the problem is. On the other hand, I think it was hugely useful for Pachter to point out the potential issue and give GTEX the chance to respond. If nothing else, the episode points out (1) the danger of using unpublished methods when good published alternatives exist and (2) that science moves faster in the era of blog posts and social media.

Disclaimers: I work on RNA-seq analysis although I’m not an author on any of the methods being considered. I have spoken at a GTEX meeting, but am not involved in the analysis of the data. Most importantly, I have not analyzed any data and am in no position to make claims about any of the software in question. I’m just making observations about the sociology of this interaction.

PubMed Commons is launching

PubMed, the main database of life sciences and biomedical literature, now allows comments and upvotes. Here is more information, and the Twitter handle is @PubMedCommons.