Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Computing for Data Analysis starts today!

Today marks the first Simply Statistics course offering over at Coursera. I’ll be teaching Computing for Data Analysis over the next four weeks. There’s still plenty of time to register if you are interested in learning about R, and the activity on the discussion forums is already quite vibrant.

Also starting today is my colleague Brian Caffo’s Mathematical Biostatistics Bootcamp, which I hear has also had an energetic start. With any luck, the students in that class may get to see Brian dressed in military fatigues.

This is my first MOOC so I have no idea how it will go. But I’m excited to start and am looking forward to the next four weeks.

Sunday Data/Statistics Link Roundup (9/23/12)

  1. Harvard Business School is getting in on the fun, calling the data scientist the sexy profession for the 21st century. I am a little worried, though, that by the time it gets into a Harvard Business document, the hype may be outstripping the real promise of the discipline. Still, good news for statisticians! (via Rafa via Francesca D.’s Facebook feed).
  2. The counterpoint is this article, which suggests that data scientists could be replaced by tools/software. I think this is also a bit too much hype for my tastes. Certain things will definitely be automated and we may even end up with a deterministic statistical machine or two. But there will continually be new problems to solve which require the expertise of people with data analysis skills and good intuition (link via Samara K.).
  3. A bunch of websites are popping up where you can sign up and have people take your online courses for you. I’m not going to give them the benefit of a link, but they aren’t hard to find these days. The thing I don’t understand is, if it is a free online course, why have someone else take it for you? It’s free, it’s in your spare time, and the bar for passing is pretty low (links via Sherri R. redacted)….
  4. Maybe mostly useful for me, but for other people with Tumblr blogs, here is a way to insert LaTeX.
  5. Brian Caffo shares his impressions of the SAMSI massive data workshop. He raises an important issue which definitely deserves more discussion: should we be focusing on specific or general problems? Worth a read.
  6. For the people into self-tracking, Chris V. points to an app created by Indiana University that lets people track their sexual activity. The most interesting thing about that app is how it highlights a key and, I suppose, often overlooked issue with analyzing self-tracking data. Despite the size of these data sets, they are still definitely biased samples. It’s only a brave few who will tell Indiana University all about their sex life.
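
To make the biased-sample point in item 6 concrete, here is a minimal sketch in Python. Every number in it is made up for illustration: the population counts, the opt-in rates, and the names `population` and `opt_in` are assumptions, not anything from the app or its data. It only shows that when the chance of volunteering depends on the quantity being tracked, the volunteers' average drifts away from the population average, and collecting more volunteers does not fix it.

```python
# Hypothetical population: a tracked weekly count -> number of people with it.
population = {0: 400, 1: 300, 2: 200, 3: 100}

# Assumed probability that someone with that count opts in to self-tracking.
# People with higher counts are assumed to be more willing to share.
opt_in = {0: 0.01, 1: 0.02, 2: 0.05, 3: 0.10}


def weighted_mean(weights):
    """Mean of the tracked value, weighting each value by its head count."""
    total = sum(weights.values())
    return sum(value * count for value, count in weights.items()) / total


# Average over everyone, which is what we would like to estimate.
true_mean = weighted_mean(population)

# Expected number of volunteers at each value, and their average.
expected_users = {v: n * opt_in[v] for v, n in population.items()}
sample_mean = weighted_mean(expected_users)

# A population 1000x larger gives 1000x more volunteers but the same average.
big_users = {v: 1000 * n * opt_in[v] for v, n in population.items()}
big_sample_mean = weighted_mean(big_users)

print(f"population mean:           {true_mean:.2f}")
print(f"volunteer mean:            {sample_mean:.2f}")
print(f"volunteer mean, 1000x data: {big_sample_mean:.2f}")
```

Under these assumed rates the volunteers over-represent the high end, so their average sits well above the population's no matter how many of them sign up.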

Prediction contest

I have been seeing this paper all over Twitter/the blogosphere. It’s a sexy idea: can you predict how “high-impact” a scientist will be in the future? It is also a pretty flawed data analysis…so this week’s prediction contest is to identify why the statistics in this paper are so flawed. In my first-pass read I noticed about five major flaws.

Editor’s note: I posted the criticisms, and the authors responded here: http://disq.us/8bmrhl

In data science - it's the problem, stupid!

I just saw this article talking about how in the biotech world, you can’t get caught chasing the latest technology. You have to start with a problem you are solving for people and then work your way back. This reminds me a lot of Type B problems in data science/statistics, where the thinking goes: we have a pile of data, so we don’t need to have a problem to solve; it will come to us later. I think the answer to the question, “Did you start with a scientific/business problem that needs solving regardless of whether the data was in place?” will end up being a near perfect classifier for separating the “Big Data” projects that are just hype from the ones that will pan out long term.

Every professor is a startup

There has been a lot of discussion lately about whether to be in academia or industry. Some of it I think is a bit unfair to academia. Then I saw this post on Quora asking what Hilary Mason’s contributions were to machine learning, as if she hadn’t done anything. It struck me as a bit of academia hating on industry*. I don’t see why one has to be better/worse than the other. As Roger points out, there is no perfect job and it just depends on what you want to do.

One thing that I think gets lost in all of this is how similar being an academic researcher is to running a small startup. To be a successful professor at a research institution, you have to create a product (papers/software), network (sit on editorial boards/review panels), raise funds (by writing grants), advertise (by giving talks/presentations), identify and recruit talent (students and postdocs), manage people and personalities (students, postdocs, collaborators), and scale (you start as just yourself, and eventually grow to a group with lots of people).

The goals are somewhat different. In a startup company, your goal is ultimately to become a profitable business. In academia, the goal is to create an enterprise that produces scientific knowledge. But in either enterprise it takes a huge amount of entrepreneurial spirit, passion, and hustle. It just depends on how you are spending your hustle. 

*Sidenote: One reason I think she is so famous is that she helps people, even people who can’t necessarily do anything for her. One time I wrote her out of the blue to see if we could get some Bitly data to analyze for a class. She cheerfully helped us get it, even though the immediate payout for her was not obvious. But I tell you what, when people ask me about her, I’ll tell them she is awesome.