Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

How do you know if someone is great at data analysis?

Consider this exercise. Come up with a list of the top 5 people that you think are really good at data analysis.

There’s one catch: they have to be people you’ve never met and have never had any sort of personal interaction with (e.g., email or chat). In other words, people whose papers or books you’ve read, whose talks you’ve seen, or whom you know only through other publicly available information. Who comes to mind? It’s okay to include people who are no longer living.

The other day I was thinking about the people who I think are really good at data analysis, and it occurred to me that they were all people I knew. So I started thinking about people I don’t know (and there are many) who are equally good at data analysis. This turned out to be much harder than I thought. I’m sure it’s not because they don’t exist; I think it’s because good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people.

I think there are a few reasons. First, people who are great at data analysis are likely not publishing papers or otherwise being productive in a manner that I, an outsider, would be able to observe. If they’re at a pharmaceutical company working on a new drug, or at some fancy new startup, there’s no way I’m ever going to know about it unless I’m directly involved.

Another reason is that, even for well-known scientists or statisticians, the work they publish doesn’t really highlight the difficulties overcome in data analysis. For example, many good papers in the statistics literature will describe a new method with only brief reference to the data that inspired the method’s development. In those cases, the data analysis usually appears obvious, as most things do after they’ve been done. Furthermore, papers usually exclude all the painful details about merging, cleaning, and inspecting the data, as well as all the other things you tried that didn’t work. Papers in the substantive literature have a similar problem: they focus on a scientific problem of interest, and the analysis of the data is secondary.

As skills in data analysis become more important, it seems odd to me that we don’t have a great way to evaluate a person’s ability to do it, as we do for skills in other areas.