Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The scientific reasons it is not helpful to study the Newtown shooter's DNA

The Connecticut Medical Examiner has asked to sequence and study the DNA of the recent Newtown shooter. I’ve been seeing this pop up over the last few days on a lot of popular media sites, where they mention some objections scientists (or geneticists) may have to this “scientific” study. But I haven’t seen the objections explicitly laid out anywhere. So here are mine.

Ignoring the fundamentals of the genetics of complex disease: If the violent behavior of the shooter has any genetic underpinning, it is complex. If you only look at one person's DNA, without a clear behavioral definition (violent? mental disorder? etc.), it is impossible to assess important complications such as penetrance, epistasis, and gene-environment interactions, to name a few. These make statistical analysis incredibly complicated even in huge, well-designed studies.

Small sample size: One person hit on the issue that is maybe the biggest reason this study is a waste of time and likely to lead to incorrect results: _you can't draw a reasonable conclusion about any population by looking at only one individual._ This is a fundamental component of statistical inference. The goal of statistical inference is to take a small, representative sample and use data from that sample to say something about the bigger population. In this case, the usual practice of statistical inference can't be applied for two reasons: (1) only one individual is being considered, so we can't measure anything about how variable (or accurate) the data are, and (2) we've picked one incredibly high-profile, and almost certainly not representative, individual to study.
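A tiny R illustration of point (1), using made-up numbers rather than any real data: with a sample of one you cannot even estimate variability, and a single draw can sit far from the population value you actually care about.

```r
# With n = 1 the sample variance is undefined, so R returns NA
sd(c(98))
#> [1] NA

# A single observation can also land far from the population mean.
set.seed(42)
population <- rnorm(100000, mean = 100, sd = 15)  # a made-up trait
one_person <- sample(population, 1)
one_person        # one draw; no way to know how typical it is
mean(population)  # the quantity inference is actually after (about 100)
```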

Multiple testing/data dredging: The small sample size problem is compounded by the fact that we aren't looking at just one or two of the shooter's genes, but rather the whole genome. To see why making statements about violent individuals based on only one person's DNA is a bad idea, think about the roughly 20,000 genes in the human genome. Let's suppose that only one of those genes causes violent behavior (it is definitely more complicated than that) and that there is no environmental cause of the violent behavior (clearly false). Furthermore, suppose that if you have the bad version of the violent gene you will do something violent in your life (almost certainly not a sure thing).

Now, even with all these simplifying (and incorrect) assumptions, think of each gene as a coin flip, each with a different chance of coming up heads. The violent gene came up tails, but so did a large number of other genes. If we compare the shooter's set of "tails" genes to those of another individual, the two will have a huge number of genes in common in addition to the violent gene. So based on this information alone, you would have no idea which gene causes violence, even in this hugely simplified scenario.
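A small simulation makes the scale of the problem concrete; the probabilities below are invented purely for illustration, not estimates of real variant frequencies.

```r
# Toy version of the coin-flip analogy: each gene gets its own chance of
# carrying a "suspicious" variant, and we look at one simulated genome.
set.seed(2013)
n_genes <- 20000
p_variant <- runif(n_genes, min = 0.01, max = 0.10)  # invented probabilities

genome_A <- rbinom(n_genes, size = 1, prob = p_variant) == 1
sum(genome_A)   # roughly a thousand genes look "suspicious" in one person

# Even comparing against a second, unrelated simulated genome leaves
# dozens of genes flagged in both, any one of which could be "the" gene.
genome_B <- rbinom(n_genes, size = 1, prob = p_variant) == 1
sum(genome_A & genome_B)
```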

Heavy reliance on prior information/intuition: This is a supposedly scientific study, but the small sample size and multiple testing problems mean any conclusions from the data will be very weak. The only thing you could do is take the set of genes you found and then rely on previous studies to try to determine which one is the "violence gene". But now you are being guided by intuition, guesswork, and a bunch of studies that may or may not be relevant. More than likely you'd end up focusing on the wrong gene.

The bottom line is that it is highly likely that no solid statistical information will come out of this experiment. Just because the technology exists to run an experiment doesn't mean that experiment will teach us anything.

Fitbit, why can't I have my data?

I have a Fitbit. I got it because I wanted to collect some data about myself and I liked the simplicity of the set-up. I also asked around and Fitbit seemed like the most “open” platform for collecting one’s own data. You have to pay $50 for a premium account, but after that, they allow you to download your data.

Or do they?

I looked into the details, asked a buddy or two, and found out that you actually can’t get the really interesting minute-by-minute data even with a premium account. You only get the daily summarized totals for steps/calories/stairs climbed. While this data is of some value, the minute-by-minute data are oh so much more interesting. I’d like to use it for personal interest, for teaching, for research, and for sharing interesting new ideas back to other Fitbit developers.

Since I'm not easily dissuaded, I tried another route. I created an application that accessed the Fitbit API. After fiddling around a bit with a few R packages, I was able to download my daily totals. But again, no minute-by-minute data. I looked into it and only [partner applications](https://wiki.fitbit.com/display/API/Fitbit+Partner+API) have access to the intraday data. So I emailed Fitbit to ask if I could be a partner app. So far no word.
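For anyone who wants to try the same thing, here is a rough sketch of the kind of API call involved, using httr from R. The token handling, endpoint path, and JSON field names are assumptions based on Fitbit's public documentation and may have changed, so treat this as illustration rather than a recipe.

```r
# Sketch only: pull one day's activity summary from the Fitbit API.
# Assumes you already registered an app and obtained an OAuth token;
# the endpoint path and response fields follow Fitbit's docs and may
# differ for your account or API version.
library(httr)

token <- "YOUR_OAUTH_TOKEN"   # hypothetical placeholder
day   <- "2013-01-01"

resp <- GET(
  paste0("https://api.fitbit.com/1/user/-/activities/date/", day, ".json"),
  add_headers(Authorization = paste("Bearer", token))
)
stop_for_status(resp)
daily <- content(resp, as = "parsed")
daily$summary$steps   # a daily total only; no minute-by-minute breakdown

# The intraday (1-minute) resource is documented, e.g.
# /1/user/-/activities/steps/date/2013-01-01/1d/1min.json,
# but requests fail unless your app has partner-level access.
```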

I guess it is true: if you aren't paying for it, you are the product. But honestly, I'm just not that interested in being a product for Fitbit. So I think I'm bailing until I can download intraday data; I'm even happy to pay for it. If anybody has a suggestion of a more open self-monitoring device, I'd love to hear about it.

Happy 2013: The International Year of Statistics

The ASA has declared 2013 to be the International Year of Statistics and I am ready to celebrate it in full force. It is a great time to be a statistician and I am hoping more people will join the fun. In fact, as we like to point out on this blog, statistics has already been at the center of many exciting accomplishments of the 21st century. Sabermetrics has become a standard approach and inspired the Hollywood movie Moneyball. Friend of the blog Chris Volinsky, a PhD statistician, led the team that won the Netflix million dollar prize. Nate Silver et al. proved the pundits wrong by, once again, using statistical models to predict election results almost perfectly. R has become one of the most widely used programming languages in the world. Meanwhile, in academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. It is no surprise that stats majors at Harvard have more than quadrupled since 2000 and that statistics MOOCs are among the most popular.

The unprecedented advances in digital technology during the second half of the 20th century have produced a measurement revolution that is transforming the world. Many areas of science are now being driven by new measurement technologies, and many insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Empiricism is back with a vengeance. The current scientific era is defined by its dependence on data, and the statistical methods and concepts developed during the 20th century provide an incomparable toolbox to help tackle current challenges. That toolbox, along with computer science, will also serve as a base for the methods of tomorrow. So I will gladly join the Year of Statistics festivities during 2013 and beyond, throughout the era of data-driven science.

What makes a good data scientist?

Apparently, New Year’s Eve is not a popular day to come to the office as it seems I’m the only one here. No matter, it just means I can blast Mahler 3 (Bernstein, NY Phil, 1980s recording) louder than I normally would.

Today’s post is inspired by this latest article in the NYT about big data. The article for the most part describes a conference that happened at MIT recently on the topic of big data. Towards the end of the article, it is noted that one of the participants (Rachel Schutt) was asked what makes a good data scientist.

Obviously, she replied, the requirements include computer science and math skills, but you also want someone who has a deep, wide-ranging curiosity, is innovative and is guided by experience as well as data.

“I don’t worship the machine,” she said.

I think I agree, but I would have put it a different way. Mostly, I think what makes a good data scientist is the same thing that makes you a good [insert field here] scientist. In other words, a good data scientist is a good scientist.

Sunday data/statistics link roundup (12/30/12)

  1. An interesting new app called 100plus, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to better or worse health. Here's a post describing it on the healthdata.gov blog. As far as I can tell, the app is still in beta, so only the folks who have a code can download it.
  2. Data on mass shootings from the Mother Jones investigation.
  3. A post by Hilary M. on “Getting Started with Data Science”. I really like the suggestion of just picking a project and doing something, getting it out there. One thing I’d add to the list is that I would spend a little time learning about an area you are interested in. With all the free data out there, it is easy to just “do something”, without putting in the requisite work to know why what you are doing is good/bad. So when you are doing something, make sure you take the time to “know something”.
  4. An analysis of various measures of citation impact (also via Hilary M.). I'm not sure I follow the reasoning behind all of the analyses performed (seems a little like throwing everything at the problem and hoping something sticks), but one interesting point is how citation and usage measures are far apart from each other on the PCA plot. This is likely just because the measures cluster into two big categories, but it makes me wonder: is it better to have a lot of people read your paper (broad impact?) or cite your paper (deep impact?). A toy simulation of that kind of clustering is sketched after the list.
  5. An [exchange](https://twitter.com/hmason/status/285163907360899072) on Twitter about how big data does not mean you can ignore the scientific method. We have talked a little bit about this before, in terms of how one should motivate statistical projects.
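As promised in item 4, here is a toy R simulation (entirely made-up data and variable names, not the metrics from the linked analysis) showing how two blocks of correlated measures, one "citation-like" and one "usage-like", end up far apart in a PCA.

```r
# Simulated illustration: two blocks of correlated metrics separate in a PCA.
set.seed(1)
n <- 500
citation_signal <- rnorm(n)
usage_signal    <- rnorm(n)

metrics <- data.frame(
  cites_total  = citation_signal + rnorm(n, sd = 0.3),
  cites_recent = citation_signal + rnorm(n, sd = 0.3),
  downloads    = usage_signal + rnorm(n, sd = 0.3),
  html_views   = usage_signal + rnorm(n, sd = 0.3),
  pdf_views    = usage_signal + rnorm(n, sd = 0.3)
)

pca <- prcomp(metrics, scale. = TRUE)
round(pca$rotation[, 1:2], 2)  # loadings: citation-type and usage-type
                               # variables point in different directions
biplot(pca)                    # the two groups of arrows form separate clusters
```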