Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Implementing Evidence-based Data Analysis: Treading a New Path for Reproducible Research (Part 3)

Last week I talked about how we might be able to improve data analyses by moving towards “evidence-based” data analysis and using data analytic techniques that are shown by statistical research to be useful. My feeling was that this approach attacks the most “upstream” aspect of data analysis, before problems have the chance to filter down into things like publications, or even worse, clinical decision-making.

In this third (and final!) post on this topic I wanted to describe a little how we could implement evidence-based data analytic pipelines. Depending on your favorite software system, you could imagine a number of ways to do this. If the pipeline were implemented in R, you could imagine it as an R package. The precise platform is not critical at this point; I would imagine most complex pipelines would involve multiple different software systems tied together.

Below is a rough diagram of how I think the various pieces of an evidence-based data analysis pipeline would fit together.

There are a few key elements of this diagram that I’d like to stress:

  1. Inputs are minimal. You don’t want to allow for a lot of inputs or arguments that can be fiddled with. This reduces the number of degrees of freedom and hopefully reduces the amount of hacking. Basically, you want to be able to input the data and perhaps some metadata.
  2. Analysis comes in stages. There are multiple stages in any analysis, not just the part where you fit a model. Everything is important and every stage should use the best available method.
  3. The stuff in the red box does not involve manual intervention. The point is to not allow tweaking, fudging, and fiddling. Once the data goes in, we just wait for something to come out the other end.
  4. Methods should be benchmarked. For each stage of the analysis, there is a set of methods that are applied. These methods should, at a minimum, be benchmarked via a standard group of datasets. That way, if another method comes along, we have an objective way to evaluate whether the new method is better than the older methods. New methods that improve on the benchmarks can replace the existing methods in the pipeline.
  5. Output includes a human-readable report. This report summarizes what the analysis was and what the results were (including results of any sensitivity analysis). The material in this report could be included in the “Methods” section of a paper and perhaps in the “Results” or “Supplementary Materials”. The goal would be to allow someone who was not intimately familiar with the all of the methods used in the pipeline to be able to walk away with a report that he/she could understand and interpret. At a minimum, this person could take the report and share it with their local statistician for help with interpretation.
  6. There is a defined set of output parameters. Each analysis pipeline should, in a sense, have an “API” so that we know what outputs to expect (not the exact values, of course, but what kinds of values). For example, if a pipeline fits a regression model at the end and the regression parameters are the key objects of interest, then the output could be defined as a vector of regression parameters. There are two reasons to have this: (1) the outputs, if the pipeline is deterministic, could be used for regression testing in case the pipeline is modified; and (2) the outputs could serve as inputs into another pipeline or algorithm. (A rough sketch of such an interface follows this list.)
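
To make the interface idea a bit more concrete, here is a minimal sketch in R (since the post imagines a pipeline packaged as an R package). Everything in it is a hypothetical illustration: the run_pipeline() name, the use of complete-case filtering and ordinary least squares as stand-ins for benchmarked methods, and the particular output fields. The point is only to show minimal inputs, staged analysis with no manual intervention, and a defined set of outputs.

```r
# A minimal sketch -- not an existing package -- of the kind of fixed
# interface an evidence-based pipeline could expose in R. The function
# name, the stand-in methods, and the output fields are hypothetical.

run_pipeline <- function(data, metadata = list(outcome = "y")) {
  # Stage 1: preprocessing. Complete-case filtering stands in for
  # whatever benchmarked preprocessing method the pipeline adopts.
  cleaned <- data[stats::complete.cases(data), , drop = FALSE]

  # Stage 2: model fitting. Ordinary least squares stands in for the
  # benchmarked modeling method; note there are no tuning arguments
  # for the analyst to fiddle with.
  f   <- stats::reformulate(".", response = metadata$outcome)
  fit <- stats::lm(f, data = cleaned)

  # Stage 3: a crude sensitivity analysis -- refit without the most
  # influential observation and record how the coefficients move.
  drop_i <- which.max(stats::cooks.distance(fit))
  refit  <- stats::lm(f, data = cleaned[-drop_i, , drop = FALSE])

  # Defined output "API": every run returns these named elements, so
  # results can feed regression tests or downstream pipelines.
  list(
    parameters  = stats::coef(fit),
    sensitivity = stats::coef(fit) - stats::coef(refit),
    report      = utils::capture.output(summary(fit))  # human-readable
  )
}

# Example use with simulated data
set.seed(1)
d   <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
out <- run_pipeline(d)
out$parameters
```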

Clearly, one pipeline is not enough. We need many of them for different problems. So what do we do with all of them?

I think we could organize them in a central location (kind of a specialized GitHub) where people could search for, download, create, and contribute to existing data analysis pipelines. An analogy (but not exactly a model) is the Cochrane Collaboration which serves as a repository for evidence-based medicine. There are already a number of initiatives along these lines, such as the Galaxy Project for bioinformatics. I don’t know whether it’d be ideal to have everything in one place or have a number of sub-projects for specialized areas.

Each pipeline would have a leader (or “friendly dictator”) who would organize the contributions and determine which components would go where. This could obviously be contentious, more so in some areas than in others, but I don’t think any more contentious than your average open source project (check the archives of the Linux kernel or Git mailing lists and you’ll see what I mean).

So, to summarize, I think we need to organize lots of evidence-based data analysis pipelines and make them widely available. If I were writing this 5 or 6 years ago, I’d be complaining about a lack of infrastructure out there to support this. But nowadays, I think we have pretty much everything we need in terms of infrastructure. So what are we waiting for?

Repost: A proposal for a really fast statistics journal

Editor’s note: This is a repost of a previous Simply Statistics column that seems to be relevant again in light of Marie Davidian’s really important column on the peer review process. You should also check out Yihui’s thoughts on this, which verge on the creation of a very fast/dynamic stats journal.  

I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by Paul Krugman and others) and the seriously misguided Research Works Act, which aimed to make it illegal to deposit government-funded published papers in PubMed Central or other open access databases.

I also realized that I spend a huge amount of time and effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work, such as adding citations to referees’ papers or running additional simulations, and (3) resubmitting rejected papers to new journals, which is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open access, I have to pay at minimum $1,000 per paper.

So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. (I have found that my most interesting ideas are the hardest ones to get published.) This journal would:
  • Be open-access and free to publish your papers there. You own the copyright on your work.
  • The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, and (2) the work is technically correct.
  • We would accept manuals, reports of new statistical software, and full length research articles.
  • There would be no page limits/figure limits.
  • The journal would be published exclusively online.
  • We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied
  • Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles
  • All articles would be published with a tweet/like button so they can be easily distributed
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
  • Review of: Jeff’s Paper
  • Technically Correct: Yes
  • About statistics/computation/data analysis: Yes
  • Number of Stars: 3 stars
  • 3 Strengths of Paper (1 required):
  • This paper revolutionizes statistics
  • 3 Weaknesses of Paper (1 required):
  • The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published.
So now here are my questions:
  1. Would you ever consider submitting a paper to such a journal?
  2. Would you be willing to be one of the AEs for such a journal?
  3. Is there anything you would change?

Sunday data/statistics link roundup (9/1/13)

  1. There has been a lot of discussion of the importance of open access on Twitter. I am 100% in favor of open access (I do wish it was less expensive), but I also think that sometimes people lose sight of other important issues for junior scientists that go beyond open access. Dr. Isis has a great example of this on her blog. 
  2. Sherri R. has a great list of resources for stats minded folks at the undergrad, grad, and faculty levels.
  3. There he goes again. Another awesome piece by Rafa on someone else’s blog. It is in Spanish but Google Translate does OK. Be sure to check out questions 3 and 4.
  4. A really nice summary of Nate Silver’s talk at JSM and a post-talk interview (in video format) are available here. Pair with this awesome Onion piece (both links via Marie D.)
  5. A really nice post that made the rounds in the economics blogosphere talking about the use of mathematics in econ. This seems like a pretty relevant quote, “Instead, it was just some random thing that someone made up and wrote down because A) it was tractable to work with, and B) it sounded plausible enough so that most other economists who looked at it tended not to make too much of a fuss.”
  6. More on hiring technical people. This is related to Google saying their brainteaser interview questions don’t work. Check out the list here of easily identifiable things that this person found useful when hiring technical people. I like how typos and grammatical errors were one of the best predictors.

AAAS S&T Fellows for Big Data and Analytics

Thanks to Steve Pierson of the ASA for letting us know that the AAAS Science and Technology Fellowship program has a new category for “Big Data and Analytics”. For those not familiar, AAAS organizes the S&T Fellowship program to get scientists involved in the policy-making process in Washington and at the federal agencies. In general, the requirements for the program are

Applicants must have a PhD or an equivalent doctoral-level degree at the time of application. Individuals with a master’s degree in engineering and at least three years of post-degree professional experience also may apply. Some programs require additional experience. Applicants must be U.S. citizens. Federal employees are not eligible for the fellowships.

Further details are on the AAAS web site.

I’ve met a number of current and former AAAS fellows working on Capitol Hill and at the various agencies and I have to say I’ve been universally impressed. I personally think getting more scientists into the federal government and involved with the policy-making process is a Good Thing. If you’re a statistician looking to have a different kind of impact, this might be for you.

The return of the stat - Computing for Data Analysis & Data Analysis back on Coursera!

It’s the return of the stat. Roger and I are going to be re-offering our Coursera courses:

Computing for Data Analysis (starts Sept 23)

Sign up here.

Data Analysis (starts Oct 28)

Sign up here.