Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Statistics and the Science Club

One of my favorite movies is Woody Allen’s Annie Hall. For people my age who haven’t seen it, I usually say it’s like When Harry Met Sally, except really good. The movie opens with Woody Allen’s character Alvy Singer explaining that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud).

Last week I posted a link to ASA President Robert Rodriguez’s column in Amstat News about big data. In the post I asked what was wrong with the column and there were a few good comments from readers. In particular, Alex wrote:

When discussing what statisticians need to learn, he focuses on technological changes (distributed computing, Hadoop, etc.) and the use of unstructured text data. However, Big Data requires a change in perspective for many statisticians. Models must expand to address the levels of complexity that massive datasets can reveal, and many standard techniques are limited in utility.

I agree with this, but I don’t think it goes nearly far enough. 

The key element missing from the column was the notion that statistics should take a leadership role in this area. I was disappointed by the lack of a more expansive vision displayed by the ASA President and the ASA’s unwillingness to claim a leadership position for the field. Despite the name “big data”, big data is really about statistics, and statisticians should be out in front of the field. We should not be observing what is going on and adapting to it by learning some new technologies or computing techniques. If we do that, then as a field we are just leading from behind. Rather, we should be defining what is important and should be driving the field from both an educational and research standpoint.

However, the new era of big data poses a serious dilemma for the statistics community that needs to be addressed before real progress can be made, and that’s what brings me to Alvy Singer’s conundrum.

There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many cases, we are the outsiders to scientific investigation. Even if we are neck deep in collaborating with scientists and being involved in scientific work, we still maintain our ability to criticize and judge scientists because we are “outsiders” trained in a different set of (important) skills. In many ways, this is a Good Thing. The outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things. It’s our job to keep people honest. However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.

Big data poses a challenge to this long-standing tradition because all of a sudden statistics and science are more intertwined than ever before, and statistical methodology is absolutely critical to making inferences or gaining insight from data. Because there are now data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply the statistical methods to their data, or we can just do it ourselves!

This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least for the moment). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters. I think as an individual it’s very difficult to be both simply because there are only 24 hours in the day. It takes an enormous amount of time to learn the scientific background required to lead scientific investigations and this is piled on top of whatever statistical training you receive.

However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science. They may not be publishing papers in the Annals of Statistics or in JASA, but they are statisticians. If we do not move more in this direction, we risk missing out on one of the most exciting developments of our lifetime.