22 Aug 2013
Chris Lane was tragically killed (link via Leah J.) in a shooting in Duncan, Oklahoma. According to the reports, it was apparently a random and completely senseless act of violence. It is horrifying to think that those kids were just looking around for someone to kill because they were bored.
Gun violence in the U.S. is way too common, and I’m happy about efforts to reduce the chance of this type of event. But I noticed this quote in the above linked CNN article from the former deputy prime minister of Australia, Tim Fischer:
People thinking of going to the USA for business or tourist trips should think carefully about it given the statistical fact you are 15 times more likely to be shot dead in the USA than in Australia per capita per million people.
The CNN article suggests he is calling for a boycott of U.S. tourism. I’m guessing he got his data from a table like this. According to the table, total firearm-related deaths per one million people are 10.6 in Australia and 103 in the U.S., so the ratio is something like 10 times. If you restrict to homicides, the rates are 1.3 per million for Australia and 36 per million for the U.S., a ratio of almost 28 times.
So the question is, should you boycott the U.S. if you are an Australian tourist? Well, the percentage of people killed in a firearm homicide is 0.0036% in the U.S. and 0.00013% in Australia, so it is incredibly unlikely that you will be killed by a firearm in either country. The issue here is that with small probabilities you can get huge relative risks, even when both outcomes are very unlikely in an absolute sense. The Chris Lane killing is tragic and horrifying, but I’m not sure a tourism boycott for the purposes of safety is justified.
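To make the relative-versus-absolute distinction concrete, here is a quick back-of-the-envelope calculation (a sketch in Python, using the homicide figures quoted above and treating them as simple per-person probabilities):

```python
# Firearm homicide rates per one million people (figures quoted above)
rate_us_per_million = 36.0
rate_aus_per_million = 1.3

# Relative risk: how many times more likely in the U.S. than in Australia
relative_risk = rate_us_per_million / rate_aus_per_million

# Absolute risk: probability for a single person, expressed as a percentage
abs_risk_us_pct = rate_us_per_million / 1_000_000 * 100
abs_risk_aus_pct = rate_aus_per_million / 1_000_000 * 100

print(f"Relative risk (U.S. vs. Australia): {relative_risk:.1f}x")  # ~27.7x
print(f"Absolute risk, U.S.: {abs_risk_us_pct:.5f}%")               # 0.00360%
print(f"Absolute risk, Australia: {abs_risk_aus_pct:.5f}%")         # 0.00013%
```

The relative risk is enormous, but both absolute risks are tiny, which is the whole point.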
21 Aug 2013
Discussions about reproducibility in scientific research have been on the rise lately, including on this blog. There are many underlying trends that have produced this increased interest in reproducibility: larger and larger studies being harder to replicate independently, cheaper data collection technologies/methods producing larger datasets, cheaper computing power allowing for more sophisticated analyses (even for small datasets), and the rise of general computational science (for every “X” we now have “Computational X”).
For those who haven’t been following, here’s a brief review of what I mean when I say “reproducibility”. For the most part in science, we focus on what I and some others call “replication”. The purpose of replication is to address the validity of a scientific claim. If I conduct a study and conclude that “X is related to Y”, then others may be encouraged to replicate my study–with independent investigators, data collection, instruments, methods, and analysis–in order to determine whether my claim of “X is related to Y” is in fact true. If many scientists replicate the study and come to the same conclusion, then there’s evidence in favor of the claim’s validity. If other scientists cannot replicate the same finding, then one might conclude that the original claim was false. In either case, this is how science has always worked and how it will continue to work.
Reproducibility, on the other hand, focuses on the validity of the data analysis. In the past, when datasets were small and the analyses were fairly straightforward, the idea of being able to reproduce a data analysis was perhaps not that interesting. But now, with computational science, where data analyses can be extraordinarily complicated, there’s great interest in whether certain data analyses can in fact be reproduced. By this I mean: is it possible to take someone’s dataset and arrive at the same numerical/graphical/whatever output that they did? While this seems theoretically trivial, in practice it’s very complicated, because a given data analysis, which typically will involve a long pipeline of analytic operations, may be difficult to keep track of without proper organization, training, or software.
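As a toy illustration (this is a hypothetical sketch, not anyone’s actual analysis), a fully scripted analysis like the one below is reproducible in this sense: anyone with the same hypothetical input file `data.csv` can rerun the script and get exactly the same numbers, because every step, including the random seed, is written down in code rather than done by hand.

```python
"""Toy reproducible analysis: same input data -> same output on every run."""
import csv
import random

random.seed(20130821)  # fix the seed so the bootstrap below is repeatable

# Read the (hypothetical) dataset: rows with columns "x" and "y"
with open("data.csv") as f:
    rows = [(float(r["x"]), float(r["y"])) for r in csv.DictReader(f)]

# A simple analytic pipeline: filter, summarize, quantify uncertainty
xs = [x for x, y in rows if y > 0]      # keep rows with positive y
mean_x = sum(xs) / len(xs)              # summary statistic

boot_means = []                         # bootstrap the mean
for _ in range(1000):
    sample = [random.choice(xs) for _ in xs]
    boot_means.append(sum(sample) / len(sample))
boot_center = sum(boot_means) / len(boot_means)
boot_se = (sum((m - boot_center) ** 2 for m in boot_means) / (len(boot_means) - 1)) ** 0.5

# Write the result to a file, so the output itself is an artifact others can check
with open("results.txt", "w") as out:
    out.write(f"mean x (y > 0): {mean_x:.4f} (bootstrap SE {boot_se:.4f})\n")
```

The hard part in real projects is that the “pipeline” is rarely this short, which is why organization, training, and tooling matter so much.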
What Problem Does Reproducibility Solve?
In my opinion, reproducibility cannot really address the validity of a scientific claim as well as replication. Of course, if a given analysis is not reproducible, that may call into question any conclusions drawn from the analysis. However, if an analysis is reproducible, that says practically nothing about the validity of the conclusion or of the analysis itself.
In fact, there are numerous examples in the literature of analyses that were reproducible but just wrong. Perhaps the most nefarious recent example is the Potti scandal at Duke. Given the amount of effort (somewhere close to 2000 hours) Keith Baggerly and his colleagues had to put into figuring out what Potti and others did, I think it’s reasonable to say that their work was not reproducible. But in the end, Baggerly was able to reproduce some of the results--this was how he was able to figure out that the analyses were incorrect. If the Potti analysis had not been reproducible from the start, it would have been impossible for Baggerly to come up with the laundry list of errors that they made.
The Reinhart-Rogoff kerfuffle is another example of an analysis that ultimately was reproducible but nevertheless questionable. While Herndon did have to do a little reverse engineering to figure out the original analysis, it was nowhere near the years-long effort of Baggerly and colleagues. However, it was Reinhart-Rogoff’s unconventional weighting scheme (fully reproducible, mind you) that drew all of the attention and strongly influenced the results.
I think the key question we want to answer when seeing the results of any data analysis is “Can I trust this analysis?” It’s not possible to go into every data analysis and check everything, even if all the data and code were available. In most cases, we want to have a sense that the analysis was done appropriately (if not optimally). I would argue that requiring that analyses be reproducible does not address this key question.
With reproducibility you get a number of important benefits: transparency, data and code for others to analyze, and an increased rate of transfer of knowledge. These are all very important things. Data sharing in particular may be important independent of the need to reproduce a study if others want to aggregate datasets or do meta-analyses. But reproducibility does not guarantee validity or correctness of the analysis.
Prevention vs. Medication
One key problem with the notion of reproducibility is the point in the research process at which we can apply it as an intervention. Reproducibility plays a role only in the most downstream aspect of the research process--post-publication. Only after a paper is published (and after any questionable analyses have been conducted) can we check to see if an analysis was reproducible or conducted in error.
At this point it may be difficult to correct any mistakes if they are identified. Grad students have graduated, postdocs have left, people have moved on. In the Potti case, letters to the journal editors were ignored. While it may be better to check the research process at the end rather than to never check it, intervening at the post-publication phase is arguably the most expensive place to do it. At this phase of the research process, you are merely “medicating” the problem, to draw an analogy with chronic diseases. But fundamental data analytic damage may have already been done.
This medication aspect of reproducibility reminds me of a famous quotation from R. A. Fisher:
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Reproducibility allows for the statistician to conduct the post mortem of a data analysis. But wouldn’t it have been better to have prevented the analysis from dying in the first place?
Moving Upstream
There has already been much discussion of changing the role of reproducibility in the publication/dissemination process. What if a paper had to be deemed reproducible before it was published? The question here is: who will reproduce the analysis? We can’t trust the authors to do it, so we have to get an independent third party. What about peer reviewers? I would argue that this is a pretty big burden to place on a peer reviewer who is already working for free. How about one of the Editors? Well, at the journal Biostatistics, that’s exactly what we do. However, our policy is voluntary and only plays a role after a paper has been accepted through the usual peer review process. At any rate, from a business perspective, most journal owners will be reluctant to implement any policy that might reduce the number of submissions to the journal.
What Then?
To summarize, I believe reproducibility of computational research is very important, primarily to increase transparency and to improve knowledge sharing. However, I don’t think reproducibility in and of itself addresses the fundamental question of “Can I trust this analysis?”. Furthermore, reproducibility plays a role at the most downstream part of the research process (post-publication) where it is costliest to fix any mistakes that may be discovered. Ultimately, we need to think beyond reproducibility and to consider developing ways to ensure the quality of data analysis from the start.
How can we address the key problem concerning the validity of a data analysis? I’ll talk about what I think we should do in Part 2 of this post.
20 Aug 2013
Statistics 2013 is hosting a workshop on the future of statistics. Given the timing and the increasing popularity of our discipline, I think it’s a great idea to showcase the future of our field.
I just have two requests:
1. Please invite more junior people to speak who are doing cutting edge work that will define the future of our field.
2. Please focus the discussion on some of the real and very urgent issues facing our field.
Regarding #1, the list of speakers appears to include only very senior people. I wish there were more junior speakers because: (1) the future of statistics will be defined by people who are just starting their careers now and (2) there are some awesome super stars who are making huge advances in, among other things, the theory of machine learning, high-throughput data analysis, data visualization, and software creation. I think including at least one person under 40 on the speaking list would bring some fresh energy.*
Regarding #2, I think there are a few issues that are incredibly important for our field as we move forward. I hope that the discussion will cover some of these:
- Problem first, not solution backward. It would be awesome if there were a whole panel filled with people from industry/applied statistics talking about the major problems where statisticians are needed and how we can train statisticians to tackle those problems. In particular it would be cool to see discussion of: (1) should we remove some math and add some software development to our curriculum?, (2) should we rebalance our curriculum to include more machine learning?, (3) should we require all students to do rotations in scientific or business internships?, (4) should we make presentation skills a high priority alongside the required courses in math stats/applied stats?
- Straight up embracing online education. We are teaching MOOCs here at Simply Stats. But that is only one way to embrace online education. What about online tutorials on GitHub? Or how about making educational videos for software packages?
- Good software is now the most important contribution of statisticians. The most glaring absence from the list of speakers and panels is that there is no discussion of software! I have gone so far as to say if you (or someone else) aren’t writing software for your methods, they don’t really exist. We need to have a serious discussion as a field about how the future of version control, reproducibility, data sharing, etc. are going to work. This seems like the perfect venue.
- How can we forge better partnerships with industry and other data generators? Facebook, Google, Bitly, Twitter, Fitbit, etc. are all collecting huge amounts of data. But there is no data sharing protocol like there was for genomics. Similarly, much of the imaging data in the world is tied up in academic and medical institutes. Fresh statistical eyes can’t be placed on these problems until the data are available in easily accessible, analyzable formats. How can we forge partnerships that make the data more valuable to the companies/institutes creating them and add immense value to young statisticians?
These next two are primarily targeted at academics:
- How can we speed up our publication process? For academic statisticians this is a major problem. I regularly wait 3-5 months for papers to be reviewed for the first time at the fastest stat journals. Some people still wait years. By then, the highest impact applied problems have moved on with better technology, newer methodology, etc.
- How can we make our promotion and awards processes more balanced between theoretical and applied contributions? I think both are very important, but right now, on balance, papers in JASA are much more highly rated than Bioconductor packages with 10,000+ users. Both are hard work, both represent important contributions, and both should be given strong weight (for example, in selecting ASA Fellows).
Anyway, I hope the conference is a huge success. I was pumped to see all the chatter on Twitter when Nate Silver spoke at JSM. That was a huge win for the organizers of the event. I am really hopeful that, with the important efforts of the organizers of these big events, we will see a continued trend toward a bigger and bigger impact of statistics.
* Rafa is invited, but he’s over 40 :-).**
** Rafa told me to mention he’s barely over 40.