22 Jun 2012
One of my favorite movies is Woody Allen’s Annie Hall. If you’re my age and you haven’t seen it, I usually tell people it’s like When Harry Met Sally, except really good. The movie opens with Woody Allen’s character Alvy Singer explaining that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud).
Last week I posted a link to ASA President Robert Rodriguez’s column in Amstat News about big data. In the post I asked what was wrong with the column and there were a few good comments from readers. In particular, Alex wrote:
When discussing what statisticians need to learn, he focuses on technological changes (distributed computing, Hadoop, etc.) and the use of unstructured text data. However, Big Data requires a change in perspective for many statisticians. Models must expand to address the levels of complexity that massive datasets can reveal, and many standard techniques are limited in utility.
I agree with this, but I don’t think it goes nearly far enough.
The key element missing from the column was the notion that statistics should take a leadership role in this area. I was disappointed by the lack of a more expansive vision displayed by the ASA President and the ASA’s unwillingness to claim a leadership position for the field. Despite the name “big data”, big data is really about statistics and statisticians should really be out in front of the field. We should not be observing what is going on and adapting to it by learning some new technologies or computing techniques. If we do that, then as a field we are just leading from behind. Rather, we should be defining what is important and should be driving the field from both an educational and research standpoint.
However, the new era of big data poses a serious dilemma for the statistics community that needs to be addressed before real progress can be made, and that’s what brings me to Alvy Singer’s conundrum.
There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many cases, we are the outsiders to scientific investigation. Even if we are neck deep in collaborating with scientists and being involved in scientific work, we still maintain our ability to criticize and judge scientists because we are “outsiders” trained in a different set of (important) skills. In many ways, this is a Good Thing. The outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things. It’s our job to keep people honest. However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.
Big data poses a challenge to this long-standing tradition because all of a sudden statistics and science are more intertwined than ever before and statistical methodology is absolutely critical to making inferences or gaining insight from data. Because now there are data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply the statistical methods to their data, or we can just do it ourselves!
This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least for the moment). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters. I think as an individual it’s very difficult to be both simply because there are only 24 hours in the day. It takes an enormous amount of time to learn the scientific background required to lead scientific investigations and this is piled on top of whatever statistical training you receive.
However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science. They may not be publishing papers in the Annals of Statistics or in JASA, but they are statisticians. If we do not move more in this direction, we risk missing out on one of the most exciting developments of our lifetime.
20 Jun 2012
This is the second in my series on pro tips for graduate students in statistics/biostatistics. For more tips, see part 1.
- Meet with seminar speakers. When you go on the job market, face recognition is priceless. I met Scott Zeger at UW when I was a student. When I came for an interview I already knew him (and Ingo, and Rafa, and ….). An even better idea…ask a question during the seminar.
- Be a finisher. The key to getting a Ph.D. (other than passing your quals) is the ability to sit down and just power through and get it done. This means sometimes you will have to work late or on a weekend. The people who are the most successful in grad school are the people that just find a way to get it done. If it was easy…anyone would do it.
- Work on problems you genuinely enjoy thinking about/are passionate about. A lot of statistics (and science) is long periods of concentrated effort with no guarantee of success at the end. To be a really good statistician requires a lot of patience and effort. It is a lot easier to work hard on something you like or feel strongly about.
More to come soon.
18 Jun 2012
I just finished teaching a Ph.D. level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day: something I wish I had learned in graduate school that has helped me in becoming a practicing applied statistician. Here are the first three; more to come soon.
- A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed with journal articles. Google Reader is a good one, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…
- Reproducible research is a hot topic, in part because of a couple of high-profile papers that were disastrously non-reproducible (see “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”). When you write code for statistical analysis, try to make sure that: (a) it is neat and well-commented - liberal and specific comments are your friend; and (b) it can be run by someone other than you to produce the same results that you report.
- In data analysis - particularly for complex high-dimensional data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher profile results).
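To make the last two tips concrete, here is a toy sketch in Python. It is my own illustration, not something from the course: the function names (`simulate`, `fit_mean`, `fit_nearest_neighbor`, `run`), the seed, and the sample size are all made up for the example. The data are pure noise, so a "complicated" model that memorizes the training set (one-nearest-neighbor) should look perfect in-sample and worse than the plain mean out-of-sample. Fixing the seed means anyone who runs the script should get the same numbers.

```python
import random


def simulate(n, rng):
    """Draw n (x, y) pairs where y is pure noise: x carries no signal."""
    return [(rng.random(), rng.gauss(0.0, 1.0)) for _ in range(n)]


def fit_mean(train):
    """Simple model: predict the training mean of y for every x."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_nearest_neighbor(train):
    """Flexible model: predict the y of the closest training x (memorizes the data)."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def mse(model, data):
    """Mean squared prediction error of a fitted model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def run(seed=2012, n=500):
    """Fit both models on one simulated training set; score on train and fresh test data."""
    rng = random.Random(seed)  # fixed seed so the results can be reproduced
    train, test = simulate(n, rng), simulate(n, rng)
    results = {}
    for name, fit in [("mean", fit_mean), ("nearest_neighbor", fit_nearest_neighbor)]:
        model = fit(train)
        results[name] = {"train_mse": mse(model, train), "test_mse": mse(model, test)}
    return results


if __name__ == "__main__":
    for name, scores in run().items():
        print(f"{name:>16}: train MSE = {scores['train_mse']:.3f}, "
              f"test MSE = {scores['test_mse']:.3f}")
```

The nearest-neighbor fit reproduces every training point exactly, which is precisely why its training error says nothing about how it will do on new data. Real analyses are rarely this clean, but the pattern is the one to watch for.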