Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Top-down versus bottom-up science: data analysis edition

In our most recent video, Steven Salzberg discusses the ENCODE project, along with some of the advantages and disadvantages of top-down science. Here, top-down refers to big coordinated projects like the Human Genome Project (HGP). In contrast, the approach of funding many small independent projects, via the R01 mechanism, is referred to as bottom-up. Note that for the cost of the HGP we could have funded thousands of R01s. However, it is not clear that without the HGP we would have had public sequence data as early as we did. As Steven points out, when it comes to data generation, economies of scale make big projects more efficient. But the same is not necessarily true for data analysis.

Big projects like ENCODE and 1000 Genomes include data analysis teams that work in coordination with the data producers. It is true that very good teams are assembled and very good tools are developed. But what if, instead of holding the data under embargo until the first analysis is done and a paper (or 30) is published, the data were made publicly available with no restrictions and the scientific community were challenged to compete for data analysis and biological discovery R01s? I have no evidence that this would produce better science, but my intuition is that, at least in the case of data analysis, better methods would be developed. Here is my reasoning. Think of the best 100 data analysts in academia and consider the following two approaches:

1- Pick the best few among the 100 and have this small group carefully coordinate with the data producers to develop data analysis methods.

2- Let all 100 take a whack at it and see what falls out.

In scenario 1 the selected group has artificial protection from competing approaches and there are fewer brains generating novel ideas. In scenario 2 the competition would be fierce and, after several rounds of sharing ideas (via publications and conferences), groups would borrow from one another and generate even better methods.

Note that the big projects do make the data available, and R01s are awarded to develop analysis tools for these data. But this happens only after the consortium's own analysis group has been given a substantial head start.

I have not participated in any of these consortia, and perhaps I am being naive, so I am very interested to hear the opinions of others.