Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Big Data in Your Blood

The Weatherman Is Not a Moron

Top-down versus bottom-up science: data analysis edition

In our most recent video, Steven Salzberg discusses the ENCODE project and describes some of the advantages and disadvantages of top-down science. Here, top-down refers to big coordinated projects like the Human Genome Project (HGP). In contrast, the approach of funding many small independent projects, via the R01 mechanism, is referred to as bottom-up. Note that for the cost of the HGP we could have funded thousands of R01s. However, it is not clear that without the HGP we would have had public sequence data as early as we did. As Steven points out, when it comes to data generation, the economies of scale make big projects more efficient. But the same is not necessarily true for data analysis.

Big projects like ENCODE and 1000 Genomes include data analysis teams that work in coordination with the data producers. It is true that very good teams are assembled and very good tools developed. But what if, instead of holding the data under embargo until the first analysis is done and a paper (or 30) is published, the data were made publicly available with no restrictions and the scientific community were challenged to compete for data analysis and biological discovery R01s? I have no evidence that this would produce better science, but my intuition is that, at least in the case of data analysis, better methods would be developed. Here is my reasoning. Think of the best 100 data analysts in academia and consider the following two approaches:

1- Pick a small group of the best among the 100 and have them carefully coordinate with the data producers to develop data analysis methods.

2- Let all 100 take a whack at it and see what falls out.

In scenario 1, the selected group has artificial protection from competing approaches, and there are fewer brains generating novel ideas. In scenario 2, the competition would be fierce, and after several rounds of sharing ideas (via publications and conferences), groups would borrow from each other and generate even better methods.

Note that the big projects do make the data available, and R01s are awarded to develop analysis tools for these data. But this happens only after the consortium's groups get a substantial head start.

I have not participated in any of these consortia and perhaps I am being naive. So I am very interested to hear the opinions of others.

Simply Statistics Podcast #3: Interview with Steven Salzberg

Interview with Steven Salzberg about the ENCODE Project.

In this episode, Jeff and I have a discussion with Steven Salzberg, Professor of Medicine and Biostatistics at Johns Hopkins University, about the recent findings from the ENCODE Project, and he helps us separate fact from fiction. You're going to want to watch this one to the end.

Here are some excerpts from the interview.

Regarding why the data should have been released immediately without restriction:

If this [ENCODE] were funded by a regular investigator-initiated grant, then I would say you have your own grant, you’ve got some hypotheses you’re pursuing, you’re collecting data, you’ve already demonstrated that…you have some special ability to do this work and you should get some time to look at your data that you just generated to publish it. This was not that kind of a project. These are not hypothesis-driven projects. They are data collection projects. The whole model is…they’re creating a resource and it’s more efficient to create the resource in one place…. So we all get this data that’s being made available for less money…. I think if you’re going to be funded that way, you should release the data right away, no restrictions, because you’re funded because you’re good at generating this data cheaply….But you may not be the best person to do the analysis.

Regarding the problem with large-scale top-down funding approaches versus the individual investigator approach:

Well, it’s inefficient because it’s anti-competitive. They have a huge amount of money going to a few centers, they’ll do tons of experiments of the same type—may not be the best place to do that. They could instead give that money to 20 times as many investigators who would be refining the techniques and developing better ones. And a few years from now, instead of having another set of ENCODE papers—which we’re probably going to have—we might have much better methods and I think we’d have just as much in terms of discovery, probably more.

Regarding the best way to make discoveries:

I think a problem I have with it…is that the top-down approach to science isn’t the way you make discoveries. And NIH has sort of said we’re going to fund these data generation and data analysis groups—they’re doing both…and by golly we’re going to discover some things. Well, it doesn’t always work if you do that. You can’t just say…so the Human Genome [Project], even though, of course there were lots of promises about curing cancer, we didn’t say we were going to discover how a particular gene works, we said we’re going to discover what the sequence is. And we did! Really well. With these [ENCODE] projects they said we’re going to figure out the function of all the elements, and they haven’t figured that out, at all.

[HD video RSS feed]

[Audio-only RSS feed]

[NOTE: Due to a clumsy camera operator (who forgot to turn the camera on), we lost one of our three camera angles, so there's no front-facing view. Sorry!]

Simply Statistics Podcast #2

In this episode of the Simply Statistics podcast, Jeff and I discuss the deterministic statistical machine and increasing the cost of data analysis. We decided to eschew the studio setup this time and attempt a more guerrilla style of podcasting. Also, Rafa was nowhere to be found when we recorded, so you'll have to catch his melodious singing voice in the next episode.

And in case you’re wondering, Jeff’s office is in fact that clean.

As always, we welcome your feedback!

[HD video RSS feed]

[Audio-only RSS feed]