Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Dropping the Stick in Data Analysis

When I was a kid growing up in rough-and-tumble suburban New York, one of the major summer activities was roller hockey, the kind with roller blades (remember roller blades?). My friends and I would be playing in some random parking lot and undoubtedly one of us would be just blowing it the whole game. This would usually lead to an impromptu intervention where the person screwing up (often me) would be told by everyone else on the team to “drop the stick”. The idea was you should stop playing, clear your head, skate around for a bit, and not try to do 20 things at once.

I don’t play much hockey now, but I do a bit more data analysis. Strangely, little has changed.

People come to me at various stages of data analysis. Close collaborators usually come to me with no data because they are planning a study and need some help. In those cases, I’m involved in the beginning and know how the data are generated. Usually, in those cases I analyze the data in the end so there’s less confusion.

Others usually come to me with data in hand wanting to know what they should do now that they’ve got all this data. Often there’s confusion about where to start, what method to use, what program, what procedure, what function, what test, Bayesian or frequentist, mean or median, R or Stata, random effects or fixed effects, cat or dog, mice or men, etc. That’s usually the point where I tell them to “drop the stick”, or the data analysis version of that, which is “What question are you trying to answer?”

Usually, people know what question they’re trying to answer–they just forgot to tell me. But I’m always amazed at how this question can often be the subject of the entire discussion. We might end up answering a question the investigator hadn’t thought of yet, maybe a question that’s better suited to the data.

So, job #1 if you’re a statistician: Get more people to drop the stick.  You’ll make everyone play better in the end.

Email is a to-do list made by other people - can someone make it more efficient?!

This is a follow-up to one of our most popular posts: getting email responses from busy people. This post had been in the drafts for a few weeks, then this morning I saw this quote in our Twitter feed:

Your email inbox is a to-do list created by other people (via)

This is 100% true of my work email and I have to say, because of the way those emails are organized - as conversations rather than a prioritized, organized to-do list - I end up missing really important things or getting to them too late. This is happening to me with enough frequency that I feel like I’m starting to cause serious problems for people.

So I am begging someone with way better skills than me to produce software that replaces gmail in the following ways. It is a to-do list that I can allow people to add tasks to. The software shows me the following types of messages.

  1. We have an appointment at x time on y date to discuss z. Next to this message is a checkbox. If I click “ok” it gets added to my calendar, if I click “no” then a message gets sent to the person who scheduled the meeting saying I’m unavailable.
  2. A multiple choice question where they input the categories of answer I can give; I just pick one and it sends them the response.
  3. A request to be added as a person who can assign me tasks with a yes/no answer.
  4. A longer request email - this has three entry fields: (1) what do you want, (2) when do you want it by? and (3) a yes/no checkbox asking if I’m willing to perform the task.  If I say yes, it gets added to my calendar with automated reminders.
  5. It should interface with all the systems that send me reminder emails to organize the reminders.
  6. You can assign quotas to people, where they can only submit a certain number of tasks per month.
  7. It allows you to re-assign tasks to other people so when I am not the right person to ask, I can quickly move the task on to the right person.
  8. It would collect data and generate automated reports for me about what kind of tasks I’m usually forgetting/being late on and what times of day I’m bad about responding so that I could improve my response times.

The software would automatically reorganize events/to-dos to reflect changing deadlines/priorities, etc. This piece of software would revolutionize my life. Any takers?
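To make the wish list a bit more concrete, here is a minimal Python sketch of the core data model, covering only the approved-sender list (item 3), the three-field request (item 4), monthly quotas (item 6), and re-assignment (item 7). Every name in it (`Task`, `TaskInbox`, `approve_sender`, and so on) is hypothetical and invented for illustration; it is one possible shape under those assumptions, not a description of any existing product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Task:
    """A request with the three fields from item 4 (hypothetical model)."""
    sender: str
    what: str                         # (1) what do you want
    due: date                         # (2) when do you want it by
    submitted: date                   # used to count the sender's monthly quota
    accepted: Optional[bool] = None   # (3) yes/no; None until answered
    assignee: str = "me"


@dataclass
class TaskInbox:
    quotas: Dict[str, int] = field(default_factory=dict)  # sender -> tasks/month
    tasks: List[Task] = field(default_factory=list)

    def approve_sender(self, sender: str, monthly_quota: int) -> None:
        # Item 3: only approved people may assign tasks.
        self.quotas[sender] = monthly_quota

    def submit(self, task: Task) -> bool:
        # Item 6: reject unapproved senders and senders over their quota.
        quota = self.quotas.get(task.sender)
        if quota is None:
            return False
        month = (task.submitted.year, task.submitted.month)
        used = sum(
            1 for t in self.tasks
            if t.sender == task.sender
            and (t.submitted.year, t.submitted.month) == month
        )
        if used >= quota:
            return False
        self.tasks.append(task)
        return True

    def reassign(self, task: Task, new_assignee: str) -> None:
        # Item 7: move the task on to the right person.
        task.assignee = new_assignee

    def todo(self) -> List[Task]:
        # Accepted tasks still assigned to me, earliest deadline first.
        mine = [t for t in self.tasks if t.accepted and t.assignee == "me"]
        return sorted(mine, key=lambda t: t.due)
```

Sorting by due date is the simplest possible stand-in for "reorganize to reflect changing deadlines"; the calendar integration, reminders, and reporting in the other items are left out entirely.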

Advice for students on the academic job market (2013 edition)

Job hunting season is upon us. Those on the job market should be sending in applications already. Here I provide links to some of the related posts we published last year.

Data analysis acquisition "worst deal ever"?

A little over a year ago I mentioned that data analysis companies were getting gobbled up by larger technology companies. In particular, HP bought Autonomy, a British data analysis company, for about $11 billion. (By the way, can anyone tell me if it’s still called Hewlett-Packard, or is it just “HP”, like “AT&T”?) From an article a year ago:

Autonomy, with headquarters in Cambridge, England, helps companies and governments store, process, search and analyze large electronic data sets. Its specialty lies in its sophisticated algorithms, which can make sense of unstructured information.

At the time, the thinking was HP had overpaid (especially given HP’s recent high price for 3Par) but the deal went through anyway. Now, HP has discovered accounting problems at Autonomy and is writing down $8.8 billion.

Whoops.

James Stewart of the New York Times claims this is worse than the failed AOL-Time Warner merger (although the absolute numbers involved here are smaller). With 3 CEOs in 2 years, it seems HP just can’t get anything right these days. But what intrigues me most is the question of what companies like Autonomy are worth and the possibility that HP made a huge mistake in the valuation of this company. Of course, if there was fraud at Autonomy (as it seems to be alleged), then all bets are off. But if not, then perhaps this is the first bubble popping in the realm of data analysis companies more generally?

Sunday data/statistics link roundup (12/2/12)

  1. An interview with Anthony Goldbloom, CEO of Kaggle. I’m not sure I’d agree with the characterization that all data scientists are creative, curious, and competitive, and certainly those characteristics aren’t unique to data scientists. And I didn’t know this: “We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000.”
  2. Check it out, art with R! It’s actually pretty interesting to see how they use statistical algorithms to generate different artistic styles. Here are some more. 
  3. Now that Ethan Perlstein’s crowdfunding experiment was successful, other people are getting on the bandwagon. If you want to find out what kind of bacteria you have in your gut, for example, you could check out this
  4. I thought I had it rough, but apparently some data analysts spend all their time developing algorithms to detect penis drawings!
  5. Roger was on Anderson Cooper 360 as part of the Building America segment. We can’t find the video, but here is the transcript. 
  6. An interesting article on the half-life of facts. I think the analogy is an interesting one and certainly there is research to be done there. But I think it jumps the shark a bit when they start talking about how the moon landing was predictable, etc. I completely believe in the retrospective analysis of knowledge, but predicting things is pretty hard, especially when it is the future.