Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Beijing Air

If you’re interested in knowing what the air quality looks like in Beijing, China, the US Embassy there has a particulate matter monitor on its roof that tweets the level of fine particulate matter (PM2.5) every hour (see @BeijingAir). In case you’re not used to staring at PM2.5 values all the time, let me provide some context.

The US National Ambient Air Quality Standard for the 24-hour average PM2.5 level is 35 mcg/m^3. The Twitter feed shows hourly values, so you can’t compare them directly to the US NAAQS (you’d have to average 24 hourly values), but the levels are nevertheless pretty high.

For example, here’s the hourly time series plot of one 24-hour period in March of 2010:

The red and blue lines show the average and maximum 24-hour values for Wake County, NC, for the period 2000-2006 (I made this plot when I was giving a talk in Raleigh).

So things could be worse here in the US, but remember that there’s no real evidence of a threshold for PM2.5, so even the levels here are potentially harmful. If you’re traveling to China anytime soon, you might want to bring a respirator.
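If you do want to put the tweeted hourly values on the same footing as the 24-hour standard, the calculation is just a daily average. Here’s a minimal R sketch using made-up hourly values (I’m not pulling the actual feed here), just to show the comparison:

set.seed(1)
pm25_hourly <- rgamma(48, shape = 4, scale = 30)  # two days of fake hourly PM2.5 (mcg/m^3)
day <- rep(1:2, each = 24)                        # which calendar day each hour belongs to

pm25_daily <- tapply(pm25_hourly, day, mean)      # 24-hour averages, one per day
pm25_daily > 35                                   # does either day exceed the US 24-hour standard?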

DNA Sequencing Caught in Deluge of Data

Selling the Power of Statistics

A few weeks ago we learned that Warren Buffett is a big IBM fan (a $10 billion fan, that is). Having heard that, I went over to the IBM web site to see what they’re doing these days. For starters, they’re not selling computers anymore! At least not the kind that I would use. One of the big things they do now is “Business Analytics and Optimization” (i.e. statistics), which is one of the reasons they bought SPSS and, later, Algorithmics.

Roaming around the IBM web site, I found this little video on how IBM is involved with tennis matches like the US Open. It’s the usual promo video: a bit cheesy, but pretty interesting too. For example, they provide all the players an automatically generated post-game “match analysis DVD” that has summaries of all the data from their match with corresponding video.

It occurred to me that one of the challenges that a company like IBM faces is selling the “power of analytics” to other companies. They need to make these promo videos because, I guess, some companies are not convinced they need this whole analytics thing (or at least not from IBM). They probably need to do methods and software development too, but getting the deal in the first place is at least as important.

In contrast, here at Johns Hopkins, my experience has been that we don’t really need to sell the “power of statistics” to anyone. For the most part, researchers around here seem to be already “sold”. They understand that they are collecting a ton of data and they’re going to need statisticians to help them understand it. Maybe Hopkins is the exception, but I doubt it.

Good for us, I suppose, for now. But there is a danger that we take this kind of monopoly position for granted. Companies like IBM hire the same people we do (including one of my grad school classmates), and there’s no reason why they couldn’t become direct competitors. We need to continuously show that we can make sense of data in novel ways.

Contributions to the R source

One of the nice things about tracking the R subversion repository using git instead of subversion is that you can do

git shortlog -s -n

which gives you

 19855  ripley
  6302  maechler
  5299  hornik
  2263  pd
  1153  murdoch
   813  iacus
   716  luke
   661  jmc
   614  leisch
   472  ihaka
   403  murrell
   286  urbaneks
   284  rgentlem
   269  apache
   253  bates
   249  tlumley
   164  duncan
    92  r
    43  root
    40  paul
    40  falcon
    39  lyndon
    34  thomas
    33  deepayan
    26  martyn
    18  plummer
    15  (no author)
    14  guido
     3  ligges
     1  mike

These data go back to 1997, so for Brian Ripley that’s about 3.6 commits per day for the last 15 years.
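Here’s the back-of-the-envelope check in R, treating the history as a round 15 years:

19855 / (15 * 365)   # about 3.6 commits per day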

I think that number 1 position will be out of reach for a while. 

By the way, I highly recommend that anyone tracking a subversion repository use git to do it. You get all the advantages of git, and there are essentially no downsides.
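If you want to set this up yourself, the one-time mirror and later updates look something like the following (the URL is, I believe, the R trunk on svn.r-project.org; the initial clone takes a while because it replays the whole history):

git svn clone https://svn.r-project.org/R/trunk r-source   # one-time: mirror the SVN history as a git repo
cd r-source
git svn rebase       # later: pull in any new SVN commits
git shortlog -s -n   # then the usual git tools work, e.g. the commit counts above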

Reproducible Research and Turkey

Over the recent Thanksgiving break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related.

I sometimes get the sense that many people see reproducibility as essentially binary. A published paper is either reproducible, as in you can compute every single last numerical result to within epsilon precision, or it’s not. My feeling is that there is a spectrum of reproducibility when it comes to published scientific findings. Some papers are more reproducible than others. And that’s where cooking comes in.

I do a bit of cooking and I am a shameless consumer of food blogs/web sites. There seems to be pretty solid agreement (and my own experience essentially confirms) that the more you can make yourself, rather than relying on other people to do the cooking, the better. For example, for Thanksgiving, you could theoretically buy yourself a pre-roasted turkey that’s ready to eat. My brother tells me this is what homesick Americans do in China because so few people have an oven (I suppose you could steam a turkey?). Or you could buy an un-cooked turkey that is “flavor injected”. Or you could buy a normal turkey and brine/salt it yourself. Or you could get yourself one of those heritage turkeys. Or you could raise your own turkeys…. I think in all of these cases, the turkey would definitely be edible and maybe even tasty. But some would probably be more tasty than others.

And that’s the point. There’s a spectrum when it comes to cooking, and some methods result in better food than others. Similarly, when it comes to published research, there is a spectrum of what authors can make available to reproduce their work. On one end, you have just the paper itself, which reveals quite a bit of information (e.g. the scientific question, the general approach) but usually too few details to actually reproduce (or even replicate) anything. Some authors might release the code, which allows you to study the algorithms and maybe apply them to your own work. Some might release the code and the data so that you can actually reproduce the published findings. Some might make a nice R package/vignette so that you barely have to lift a finger. Each case is better than the previous one, but that’s not to say I would only accept the last/best case. Some reproducibility is better than none.
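As a concrete (and entirely hypothetical) illustration of the “code and data” end of the spectrum, a released analysis might be as simple as a short R script shipped alongside the data file it reads; the file names and model below are made up:

# reproduce.R -- hypothetical script distributed with a paper, next to its data
dat <- read.csv("analysis_data.csv")       # the data as released by the authors

fit <- lm(outcome ~ exposure, data = dat)  # the model reported in the paper
summary(fit)                               # reproduces the published table

plot(dat$exposure, dat$outcome, xlab = "Exposure", ylab = "Outcome")
abline(fit)                                # reproduces the published figure

An R package with a vignette is essentially the same idea, plus the bookkeeping that lets someone install it and re-run everything with a single command.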

That said, I don’t think we should shoot low. Ideally, we would have the best case, which would allow for full reproducibility and rapid dissemination of ideas. But while we wait for that best case scenario, it couldn’t hurt to have a few steps in between.