Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Reproducible Research in Computational Science

First of all, thanks to Rafa for scooping me with my own article. Not sure if that’s reverse scooping or recursive scooping or….

The latest issue of Science has a special section on Data Replication and Reproducibility. As part of the section I wrote a brief commentary on the need for reproducible research in computational science. Science has a pretty tight word limit for it’s commentaries and so it was unfortunately necessary to omit a number of relevant topics.

The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because Science has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), reproducibility is not needed soley to prevent fraud, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not Science).

One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. We can start modestly and make useful improvements

A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries long history whereas the importance of the other idea has only recently become relevant. I’m going to stick to my guns here but we’ll have to see how the language evolves.