
The limiting reagent for big data is often small, well-curated data

I’ve been working on “big” data in genomics since I was a first-year student in graduate school (longer than I’d like to admit). At the time, “big” meant microarray studies with a couple of hundred patients. Of course, that is now a drop in the bucket compared to the huge sequencing data sets, like the trauma study published recently in Nature.

Despite the exploding size of these genomic data sets, the discovery process is almost always limited by the quality and quantity of useful metadata that go along with them. In the trauma study I referenced above, the genomic data was both costly and hard to collect. But the bigger, more impressive feat was collecting those data from trauma patients at relatively precise time points after they had been injured. Along with the genomic data, a host of clinical measurements was also collected and aligned with the genomic profiles.

The key insights derived from the data were the relationships between the low-dimensional clinical measurements and the high-dimensional genomic measurements.
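
To make that concrete, here is a minimal Python sketch of the kind of join that produces those insights. The data, shapes, and variable names (e.g. `hours_since_injury`) are all invented for illustration; this is not the analysis from the Nature paper, just the general pattern of relating a big genomic matrix to a small, carefully collected clinical variable.

```python
import numpy as np

# Hypothetical data: expression for 20,000 genes on 200 trauma patients,
# aligned with one low-dimensional clinical variable (hours since injury)
# that someone had to collect carefully at the bedside.
rng = np.random.default_rng(0)
n_patients, n_genes = 200, 20_000
expression = rng.normal(size=(n_patients, n_genes))        # big data
hours_since_injury = rng.uniform(0, 72, size=n_patients)   # small data

# Screen for genes whose expression tracks the clinical variable by
# computing the Pearson correlation of each gene with hours_since_injury.
centered_expr = expression - expression.mean(axis=0)
centered_time = hours_since_injury - hours_since_injury.mean()
corr = centered_expr.T @ centered_time / (
    np.linalg.norm(centered_expr, axis=0) * np.linalg.norm(centered_time)
)

# The insight lives in the alignment: without the curated clinical time
# points, the 200 x 20,000 matrix has nothing to be related to.
top_genes = np.argsort(np.abs(corr))[::-1][:10]
print(top_genes)
```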

This is actually relatively common:

  • In computer vision you need high-quality labeled images to use as a training set (this type of manual labeling is so common it forms the basis for major citizen science projects like Zooniverse).
  • In genome-wide association studies you need accurate phenotypes.
  • In the analysis of social networks, like the one built from the Framingham Heart Study, you need to collect data on obesity levels, etc.

One common feature of these studies is that they are examples of what computer scientists call _supervised learning_. Most hypothesis-driven research falls into this type of study. It is important to recognize that these studies can only work with painstaking and careful collection of small data. So in many cases, the limits to the insights we can obtain from big data are imposed by how much schlep we are willing to put in to get small data.
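
To make that last point concrete, here is a hedged sketch, again with synthetic data and invented names, of why the small labeled data is the limiting reagent in a supervised analysis: the high-dimensional features are plentiful, but the learning curve is governed by how many carefully labeled samples you managed to collect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a supervised genomics problem: 500 samples,
# 2,000 features, and a binary phenotype driven by a handful of them.
rng = np.random.default_rng(1)
n_samples, n_features = 500, 2_000
X = rng.normal(size=(n_samples, n_features))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=2.0, size=n_samples) > 0).astype(int)

# How does held-out accuracy change as we label more samples? The features
# come "for free"; the labels are the small, painstakingly curated data.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:4d} labeled samples -> cross-validated accuracy {score:.2f}")
```

Swap in real phenotypes and real expression data and the shape of that curve is exactly the argument above: more well-curated small data, better big-data insights.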