10 Feb 2013
- An article about how NBA teams have installed cameras that allow their analysts to collect information on every movement/pass/play that is performed in a game. The most interesting part for me is how you would define the features. They talk about, for example, how many times a player drives. I wonder if they have an intern in the basement manually annotating those features or if they are using automatic detection algorithms (via Marginal Revolution).
- Our friend Florian jumps into the MIC debate. I haven’t followed the debate very closely, but I agree with Florian that if a theory paper is published in a top journal, later falling back on heuristics and hand waving seems somewhat unsatisfying.
- An opinion piece pushing the Journal of Negative Results in Biomedicine. If you can’t get your negative result in there, think about our P > 0.05 journal :-).
- This has nothing to do with statistics/data but is a bit of nerd greatness. Run this command from a terminal: `traceroute 216.81.59.173`.
- A data visualization describing the effectiveness of each state’s election administrations. I think that it is a really cool idea, although I’m not sure I understand the index. A couple of related plots are this one that shows distance to polling place versus election day turnout and this one that shows the same thing for early voting. It’s pretty interesting how dramatically different the plots are.
- Postdoc Sherri Rose writes about big data and junior statisticians at Stattrak. My favorite quote: “We need to take the time to understand the science behind our projects before applying and developing new methods. The importance of defining our research questions will not change as methods progress and technology advances”.
06 Feb 2013
As you know, we are [big fans of reproducible research](http://simplystatistics.org/?s=reproducible+research) here at Simply Statistics. The controversy around the lack of reproducibility in the analyses performed by Anil Potti and the subsequent fallout drove the importance of this topic home.
So when I started teaching a course on Data Analysis for Coursera, of course I wanted to focus on reproducible research. The students in the class will be performing two data analyses during the course. They will be peer evaluated using a rubric specifically designed for evaluating data analyses at scale. One of the components of the rubric was to evaluate whether the code people submitted with their assignments reproduced all the numbers in the assignment.
Unfortunately, I just had to cancel the reproducibility component of the first data analysis assignment. Here are the things I realized while trying to set up the process that may seem obvious but weren’t to me when I was designing the rubric:
- Security I realized (thanks to a very smart subset of the students in the class who posted on the message boards) that there is a major security issue with exchanging R code and data files with each other. Even if they use only the data downloaded from the official course website, it is possible that people could use the code to try to hack/do nefarious things to each other. The students in the class are great and the probability of this happening is small, but with a class this size, it isn’t worth the risk.
- Compatibility I’m requiring that people use R for the course. Even so, people are working on every possible operating system, with many different versions of R. In this scenario, it is entirely conceivable for a person to write totally reproducible code that works on their machine but won’t work on a random peer reviewer’s machine.
- Computing Resources The range of computing resources used by people in the class is huge - everyone from people using modern clusters to people running on a single old beat-up laptop. Inefficient code on a fast computer is fine, but on a slow computer with little memory it could mean the difference between reproducibility and a crashed computer.
Overall, I think the solution is to run some kind of EC2 instance with a standardized set of software. That is the only thing I can think of that would be scalable to a class this size. On the other hand, that would be expensive, a pain to maintain, and would require everyone to run code on EC2.
Regardless, it is a super interesting question. How do you do reproducibility at scale?
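For what it’s worth, here is a rough sketch of what one automated check might look like. The file names (`analysis.R`, `results.csv`, `reported.csv`) are assumptions for illustration, and sourcing arbitrary student code is exactly the security problem described above, so something like this would have to run inside an isolated, standardized environment.

```r
# Rough sketch of an automated reproducibility check (file names are
# illustrative assumptions). Each submission is assumed to contain an
# analysis.R script that writes its key numbers to results.csv, plus a
# reported.csv with the numbers claimed in the write-up.
check_submission <- function(submission_dir, tol = 1e-6) {
  old_wd <- setwd(submission_dir)
  on.exit(setwd(old_wd))

  # re-run the student's analysis in a clean environment
  source("analysis.R", local = new.env())

  produced <- read.csv("results.csv")
  reported <- read.csv("reported.csv")
  isTRUE(all.equal(produced, reported, tolerance = tol))
}
```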
03 Feb 2013
- My student, Hilary, wrote a post about how her name is the most poisoned in history. A poisoned name is a name that quickly loses popularity year over year. The post is awesome for the following reasons: (1) she is a good/funny writer and has lots of great links in the post, (2) she very clearly explains concepts that are widely used in biostatistics like relative risk, and (3) she took the time to try to really figure out all the trends she saw in the name popularity. I’m not the only one who thinks it is a good post; it was reprinted in New York Magazine and went viral this last week.
- In honor of it being Super Bowl Sunday (go Ravens!) here is a post about the reasons why it often doesn’t make sense to consider the odds of an event retrospectively due to the Wyatt Earp effect. Another way to think about it: if you have a big tournament with tons of teams, someone will win. But at the very beginning, any given team had a pretty small chance of winning all the games and taking the championship. If we wait until some team wins and then calculate their pre-tournament odds of winning, those odds will look improbably small (see the short simulation sketch after this list). (via David S.)
- A new article by Ben Goldacre in the NYT about unreported clinical trials. This is a major issue and Ben is all over it with his All Trials project. This is another reason we need a deterministic statistical machine. Don’t worry, we are working on building it.
- Even though it is Super Bowl Sunday, I’m still eagerly looking forward to spring and the real sport of baseball. Rafa sends along this link analyzing the effectiveness of patient hitters when they swing at a first strike. It looks like it is only a big advantage if you are an elite hitter.
- An article in Wired on the importance of long data. The article talks about how in addition to cross-sectional big data, we might also want to be looking at data over time - possibly large amounts of time. I think the title is maybe a little over the top, but the point is well taken. It turns out this is something a bunch of my colleagues in imaging and environmental health have been working on/talking about for a while. Longitudinal/time series big data seems like an important and wide-open field (via Nick R.).
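To make the retrospective-odds point in the Wyatt Earp bullet concrete, here is a tiny simulation sketch. The 64-team bracket and coin-flip games are my own illustrative assumptions, not anything from the linked post.

```r
# Tiny illustrative simulation: 64 evenly matched teams in a single-elimination
# bracket. Every team's pre-tournament chance of winning it all is 1/64 (~1.6%),
# yet some team always ends up as champion.
set.seed(1)

simulate_tournament <- function(n_teams = 64) {
  teams <- seq_len(n_teams)
  while (length(teams) > 1) {
    # pair the remaining teams off and decide each game with a coin flip
    winners <- sapply(seq(1, length(teams), by = 2), function(i) {
      sample(teams[c(i, i + 1)], 1)
    })
    teams <- winners
  }
  teams
}

champs <- replicate(10000, simulate_tournament())
mean(champs == 1)  # team 1 wins about 1/64 (~1.6%) of the tournaments
```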
31 Jan 2013
The day I discovered `paste0` I literally cried. No more `paste(bla, bla, sep="")`. While looking through code written by a student who did not know about `paste0` I started pondering how many person-hours it has saved humanity. So typing `sep=""` takes about 1 second. We R users use `paste` about 100 times a day and there are about 1,000,000 R users in the world. That’s over 3 person-years a day! Next up `read.table0` (who doesn’t want `as.is` to be TRUE?).
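For anyone who hasn’t made the switch, a quick illustration of the difference, plus a back-of-the-envelope check of the person-years arithmetic:

```r
# paste() defaults to sep = " ", so concatenating without a separator
# means typing sep = "" every time:
paste("chr", 1:3, sep = "")   # "chr1" "chr2" "chr3"

# paste0() is shorthand for paste(..., sep = ""):
paste0("chr", 1:3)            # "chr1" "chr2" "chr3"

# Back-of-the-envelope check of the claim above:
# 1 second per call x 100 calls per day x 1,000,000 R users
seconds_per_day <- 1 * 100 * 1e6
seconds_per_day / (60 * 60 * 24 * 365)   # about 3.2 person-years every day
```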
28 Jan 2013
The Lakers recently snapped a four game losing streak. In that game Kobe, the league leader in field goal attempts and missed shots, had a season low of 14 points but a season high of 14 assists. This makes sense to me since Kobe shooting less means more efficient players are shooting more. Kobe has a lower career true shooting % than Gasol, Howard and Nash (ranked 17, 3 and 2 respectively). Despite this he takes more than 1/4 of the shots. Commentators usually praise top scorers no matter what, but recently they have noticed that the Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.

This graph shows score differential versus % of shots taken by Kobe*. Linear regression suggests that an increase of 1% in the % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential. It also suggests that when Kobe takes 15% of the shots, the Lakers win by an average of about 10 points, and when he takes 30% (not a rare occurrence) they lose by an average of about 5. Of course we should not take this regression analysis too seriously, but it’s hard to ignore the fact that when Kobe takes less than 23.25% of the shots the Lakers are 13-1.
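For anyone who wants to reproduce this from the data file linked below, here is a minimal sketch of the fit. The file name and column names are assumptions for illustration, not the actual headers in the posted data.

```r
# Minimal sketch of the regression above, assuming per-game columns
# kobe_fga, kobe_fta, team_fga, team_fta, and score_diff (Lakers points
# minus opponent points). These names are assumptions, not the file's
# actual headers.
lakers <- read.csv("lakers.csv")   # hypothetical file name for the linked data

# Shots taken, approximated as in the footnote below: FGA + floor(0.5 * FTA)
kobe_shots <- lakers$kobe_fga + floor(0.5 * lakers$kobe_fta)
team_shots <- lakers$team_fga + floor(0.5 * lakers$team_fta)
lakers$kobe_shot_pct <- 100 * kobe_shots / team_shots

# Score differential vs Kobe's share of the shots
fit <- lm(score_diff ~ kobe_shot_pct, data = lakers)
summary(fit)   # slope of roughly -1.16 points per 1% of shots, per the post

# Implied average margins when Kobe takes 15% vs 30% of the shots
predict(fit, newdata = data.frame(kobe_shot_pct = c(15, 30)))
```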
I suspect that this relationship is not unique to Kobe and the Lakers. In general, teams with a more balanced attack probably do better. Testing this could be a good project for Jeff’s class.
* I approximated shots taken as field goal attempts + floor(0.5 x Free Throw Attempts).
Data is here.
Update: Commentator Sidney fixed some entries in the data file. Data and plot updated.