Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Where do you get your data?

Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”

My contention is that if someone asks you these questions, start looking for the exits.

There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.

Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s opening line is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:

  • The data are just numbers. My method works on numbers, and these data are numbers, so my method should work here. If it doesn’t work, then I’ll find some other numbers where it does work.
  • The data are all that are important. I’m not that interested in working with an actual scientist on an important problem that people care about, because that would be an awful lot of work and time (see here). I just care about getting the data from whomever will give it to me. I don’t care about the substantive context.
  • Once I have the data, I’m good, thank you. In other words, the scientific process is modular. Scientists generate the data and once I have it I’ll apply my method until I get something that I think makes sense. There’s no need for us to communicate. That is unless I need you to help make the data pretty and nice for me.

The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.

Niels Keiding wrote a provocative commentary about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.

I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general. 

But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions.