Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure.  I think it helps if you build it for yourself like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don’t, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  •  The knitr R package (or more recently rmarkdown) for creating literate webpages and documents in R.
  • iPython notebooks  for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible work flows (among other things) combining known tools.

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people’s data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is because you just write code like you normally would, but embed it in a simple to use document. The workflow doesn’t change much for the analyst because they were going to write that code anyway. The document just allows it to be built into a more shareable document.

Galaxy has increased reproducibility for many folks, but my impression is the primary user base are folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn’t before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines  they were’t able to before using already standard tools and give them things like reproducibility for free.

Of course I wouldn’t put me in charge anyway, I’ve never won the lottery with any infrastructure I’ve tried to build.