Simply Statistics: a statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sunday data/statistics link roundup (11/9/14)

So I’m a day late, but you know, I got a new kid and stuff…

  1. The New Yorker hating on MOOCs; they mention all the usual stuff, including the really poorly designed San Jose State experiment. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the wrong part of the hype curve. MOOCs won’t solve all possible education problems, but they are hugely helpful to many people and writing them off is a little silly (via Rafa).
  2. My colleague Dan S. is teaching a missing data workshop here at Hopkins next week (via Dan S.).
  3. A couple of cool YouTube videos explaining how the normal distribution sounds and the Pareto principle with paperclips (via Presh T.; pair with the 80/20 rule of statistical methods development).
  4. If you aren’t following Research Wahlberg, you aren’t on academic twitter.
  5. I followed #biodata14 closely. I think having a meeting on Biological Big Data is a great idea and many of the discussion leaders are people I admire a ton. I am also a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren’t invited (we like to party too!).
  6. Our Data Science Specialization generates almost 1,000 new R GitHub repos a month! Roger and I are in a neck-and-neck race to be the person who has taught the most people statistics/data science in the history of the world.
  7. The RStudio folks have also put together what looks like a great course similar in spirit to our Data Science Specialization. They have been *super* supportive of the DSS and we assume anything they make will be awesome.
  8. Congrats to Data Carpentry and [Tracy Teal](https://twitter.com/tracykteal) on their funding from the Moore Foundation!

Time-varying causality in n=1 experiments with applications to newborn care

We just had our second son about a week ago and I’ve been hanging out at home with him and the rest of my family. It has reminded me of a few things from when we had our first son. First, newborns are tiny and super-duper adorable. Second, daylight saving time means gaining an extra hour of sleep for many people, but for people with young children it is more like this (via Reddit):

[image]

Third, taking care of a newborn is like performing a series of n=1 experiments where the causal structure of the problem changes every time you perform an experiment.

Suppose, hypothetically, that your newborn has just had something to eat and it is 2 a.m. (again, just hypothetically). You are hoping he’ll go back down to sleep so you can catch some shut-eye yourself. But your baby just can’t sleep and seems uncomfortable. Here is a partial list of causes for this: (1) dirty diaper, (2) needs to burp, (3) still hungry, (4) not tired, (5) overtired, (6) has gas, (7) just chillin’. So you start going down the list and trying to address each of the potential causes of late-night sleeplessness: (1) check diaper, (2) try burping, (3) feed him again, etc. Then, miraculously, one works and the little guy falls asleep.

It is interesting how the natural human reaction to this is to reorder the potential causes of sleeplessness and start with the thing that worked the next time. Then you often get frustrated when the same thing doesn’t work. You can’t help it: you did an experiment, you have some data, you want to use it. But the reality is that the next time may have nothing to do with the first.
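For fun, here is a minimal simulation of this point (the seven causes, the uniform redraw each night, and all the numbers are my inventions): if the true cause is independently redrawn every night, a checklist reordered to start with last night’s winner does no better than a fixed one.

```r
# Each night the true cause of sleeplessness is drawn fresh, so a checklist
# reordered by last night's success performs no better than a fixed order.
set.seed(42)
causes   <- c("diaper", "burp", "hungry", "not tired",
              "overtired", "gas", "just chillin")
n_nights <- 10000

avg_checks <- function(order_fun) {
  checks <- numeric(n_nights)
  last   <- causes[1]
  for (night in seq_len(n_nights)) {
    truth         <- sample(causes, 1)       # causal structure changes nightly
    tonight       <- order_fun(last)
    checks[night] <- match(truth, tonight)   # remedies tried before success
    last          <- truth
  }
  mean(checks)
}

fixed    <- function(last) causes                          # same order every night
adaptive <- function(last) c(last, setdiff(causes, last))  # last winner first

avg_checks(fixed)     # about 4 checks per night on average
avg_checks(adaptive)  # also about 4: last night's data carries no information
```

Of course, if some causes really were more common than others, the adaptive strategy would start to pay off; the point is that a single success is very weak evidence.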

I’m in the process of collecting some very poorly annotated data collected exclusively at night if anyone wants to write a dissertation on this problem.

538 election forecasts made simple

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I’ve always wanted to know more details. After preparing a “predict the midterm elections” homework for my data science class, I have a better idea of what is going on.

Here is my best attempt at explaining the ideas of 538 using formulas and data. And here is the R markdown.
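To give a flavor of what is in the post (this is my own toy sketch, not Silver’s actual model; the polls, the weighting scheme, and the error terms are all invented assumptions), the core idea is a weighted average of polls plus simulation of the remaining uncertainty:

```r
# Toy 538-style forecast for a single race (all numbers invented).
set.seed(538)
polls <- data.frame(margin = c(2.1, -0.5, 3.0, 1.2),   # R minus D, in points
                    n      = c(800, 1200, 600, 1000))  # sample sizes

# Weight polls by sample size; the real model also weights by recency
# and pollster quality.
w  <- polls$n / sum(polls$n)
mu <- sum(w * polls$margin)

# Uncertainty: sampling error of the weighted average plus an assumed
# systematic error shared by all polls (pollsters can all miss together).
poll_se       <- 100 / sqrt(polls$n)         # approx sd of each poll's margin
sampling_se   <- sqrt(sum(w^2 * poll_se^2))
systematic_se <- 3                           # assumed, in percentage points
total_se      <- sqrt(sampling_se^2 + systematic_se^2)

# Monte Carlo: probability the Republican candidate wins this race.
sims <- rnorm(1e5, mean = mu, sd = total_se)
mean(sims > 0)
```

Repeat this across races, add up the simulated seat wins, and you have the skeleton of a Senate forecast.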


Sunday data/statistics link roundup (11/2/14)

Better late than never! If you have something cool to share, please continue to email it to me with subject line “Sunday links”.

  1. DrivenData is a Kaggle-like site but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).
  2. This article claiming academic science isn’t sexist has been widely panned; Emily Willingham pretty much destroys it here (via Sherri R.). The thing that is interesting about this article is the way that it tries to use data to give the appearance of empiricism, while using language to try to skew the results. Is it just me or is this totally bizarre in light of the NYT also publishing this piece about academic sexual harassment at Yale?
  3. Noah Smith, an economist, tries to summarize the problem with “most research being wrong”. It is an interesting take; I wonder if he read Roger’s piece saying almost exactly the same thing just the week before? He also mentions it is hard to quantify the rate of false discoveries in science; maybe he should read our paper?
  4. Nature now requests that code sharing occur “where possible” (via Steven S.).
  5. Great [cartoons](http://imgur.com/gallery/ZpgQz); I particularly like the one about replication (via Steven S.).

Why I support statisticians and their resistance to hype

Despite Statistics being the most mature data-related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. To give just one example (Jeff and Terry Speed give many more), the White House Big Data Partners Workshop had 19 members, of which zero were statisticians. The statistical community is clearly worried about this predicament and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan’s Reddit Ask Me Anything, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand and consequently appreciate the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet you won’t find a hyped-up press release coming out of his lab. In fact, when a journalist tried to hype up Jordan’s critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives, it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of false discovery rate (FDR) control, which helps limit false discoveries in, for example, high-throughput experiments. Second, many of us continue to work at the interface with real-world applications, placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today’s hyped-up press releases are long forgotten.
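As an aside, for readers who have not seen it, the FDR idea mentioned above fits in a few lines of R (the mix of null and signal tests below is made up; p.adjust is base R):

```r
# Toy high-throughput experiment: 9,000 null tests and 1,000 real signals.
set.seed(2014)
z <- c(rnorm(9000), rnorm(1000, mean = 3))
p <- pnorm(z, lower.tail = FALSE)                  # one-sided p-values

naive <- which(p < 0.05)                           # unadjusted threshold
bh    <- which(p.adjust(p, method = "BH") < 0.05)  # Benjamini-Hochberg

mean(naive <= 9000)  # roughly a third of the naive "discoveries" are false
mean(bh <= 9000)     # roughly 5% of the BH discoveries are false, as advertised
```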