Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Apple this is ridiculous - you gotta upgrade to upgrade!?

So along with a few folks here around Hopkins we have been kicking around the idea of developing an app for the iPhone/Android. I’ll leave the details out for now (other than to say stay tuned!).

But to start developing an app for the iPhone, you need a version of Xcode, Apple’s development environment. The latest version of Xcode is version 4, which can only be installed on the latest version of Mac OS X, Lion (10.7, I think), or above. So I dutifully went off to download Lion. Except, whoops! You can only download Lion from the Mac App Store.

Now this wouldn’t be a problem, if you didn’t need OS X Snow Leopard (10.6 and above) to access the App Store. Turns out I only have version 10.5 (must be OS X Housecat or something). I did a little searching and it looks like the only way I can get Lion is if I buy Snow Leopard first and upgrade to upgrade!

It isn’t the money so much (although it does suck to pay $60 for $30 worth of software), but the time and inconvenience this causes. Apple has done this to me a couple of times in the past, with operating systems needing to be upgraded so I can buy things from iTunes. But this is getting out of hand... maybe I need to consider the alternatives.

An R function to analyze your Google Scholar Citations page

Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here.

I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.

So how does it work? Here is the code. You can source the functions like so:

source("http://biostat.jhsph.edu/~jleek/code/googleCite.r")

This will install the following packages if you don’t have them: wordcloud, tm, sendmailR, RColorBrewer. Then you need to find the URL of a Google Scholar Citations page. Here is Rafa Irizarry’s:

http://scholar.google.com/citations?user=nFW-2Q8AAAAJ

You can then call the googleCite function like this:

out = googleCite("http://scholar.google.com/citations?user=nFW-2Q8AAAAJ", pdfname="rafa_wordcloud.pdf")

or search by name like this:

out = searchCite("Rafa Irizarry", pdfname="rafa_wordcloud.pdf")

The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the PDF file specified (there is an option to turn off plotting if you want). Here is what Rafa’s clouds look like:

We have also written a little function to calculate many of the popular citation indices. You can call it on the output like so:

gcSummary(out)
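If you are curious what a citation index actually computes, here is a minimal sketch of two well-known ones, the h-index and the g-index. This is not the code inside gcSummary, and I am not claiming these are the indices it reports. The function names h_index and g_index and the citation counts below are made up for illustration.

```r
# Hypothetical citation counts, one entry per paper (illustration only, not real data).
citations <- c(25, 8, 5, 3, 3, 1, 0)

# h-index: the largest h such that h papers have at least h citations each.
h_index <- function(cites) {
  sorted <- sort(cites, decreasing = TRUE)
  # The k-th most cited paper counts if it has at least k citations.
  sum(sorted >= seq_along(sorted))
}

# g-index: the largest g such that the top g papers together have
# at least g^2 citations (capped here at the number of papers).
g_index <- function(cites) {
  sorted <- sort(cites, decreasing = TRUE)
  sum(cumsum(sorted) >= seq_along(sorted)^2)
}

h_index(citations)
g_index(citations)
```

Both functions only need a numeric vector of per-paper citation counts, so in principle you could feed them whichever column of your own citation table holds those counts.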

When you download citation data, an email with the data table will also be sent to Simply Statistics so we can collect information on who is using the function and perform population-level analyses.

If you liked this function you might also be interested in our R function to determine if you are a data scientist, or in some of the other stuff going on over at Simply Statistics.

Enjoy!

Data Scientist vs. Statistician

There’s an interesting discussion over at reddit on the difference between a data scientist and a statistician. My crude summary of the discussion is that by and large they are the same, but the phrase “data scientist” is just the hip new name for statistician that will probably sound stupid 5 years from now.

My question is why isn’t “statistician” hip? The comments don’t seem to address that much (although a few go in that direction). There are a few interesting comments about computing. For example from ByteMining:

Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.

Another more down-to-earth comment comes from marshallp:

There is a real distinction between data scientist and statistician

  • the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job

  • the data scientist gets s—loads of cash after having learnt a scripting language and an api

More people should be encouraged into data science and not pointless years of stats classes

Not sure I fully agree but I see where he’s coming from!

[Note: See also our post on how to determine whether you are a data scientist.]

Ozone rules

A recent article in the New York Times describes the backstory behind the decision to not revise the ozone national ambient air quality standard. This article highlights the reality of balancing the need to set air pollution regulation to protect public health and the desire to get re-elected. Not having ever served in politics (does being elected to the faculty senate count?) I can’t comment on the political aspect. But I wanted to highlight some of the scientific evidence that goes into developing these standards. 

A bit of background: the Clean Air Act of 1970 and its subsequent amendments require that national ambient air quality standards be set to protect public health with “an adequate margin of safety”. Ozone (usually referred to as smog in the press) is one of the pollutants for which standards are set, along with particulate matter, nitrogen oxides, sulfur dioxide, carbon monoxide, and airborne lead. Importantly, the Clean Air Act requires the EPA to set standards based on the best available scientific evidence.

The ozone standard was re-evaluated years ago under the (second) Bush administration. At the time, the EPA staff recommended a daily standard of between 60 and 70 ppb as providing an adequate margin of safety. Roughly speaking, if the standard is 70 ppb, this means that states cannot have levels of ozone higher than 70 ppb on any given day (that’s not exactly true but the real standard is a mouthful). Stephen Johnson, EPA administrator at the time, set the standard at 75 ppb, citing in part the lack of evidence showing a link between ozone and health at low levels.

We’ve conducted epidemiological analyses that show that ozone is associated with mortality even at levels far below 60 ppb (See Figure 2). Note, this paper was not published in time to make it into the previous EPA review. The study suggests that if a threshold exists below which ozone has no health effect, it is probably at a level lower than the current standard, possibly nearing natural background levels. Detecting thresholds at very low levels is challenging because you start running out of data quickly. But other studies that have attempted to do this have found results similar to ours.

The bottom line is pollution levels below current air quality standards should not be misinterpreted as safe for human health.

Show 'em the data!

In a previous post I argued that students entering college should be shown data on job prospects by major. This week I found out the American Bar Association might make it a requirement for law school accreditation.

Hat tip to Willmai Rivera.