12 Apr 2013
E.O. Wilson is a famous evolutionary biologist. He is currently an emeritus professor at Harvard and just this last week dropped this little gem in the Wall Street Journal. In the piece, he suggests that knowing mathematics is not important for becoming a great scientist. Wilson goes even further, suggesting that you can be mathematically semi-literate and still be an amazing scientist. There are two key quotes in the piece that I think deserve special attention:
Fortunately, exceptional mathematical fluency is required in only a few disciplines, such as particle physics, astrophysics and information theory. Far more important throughout the rest of science is the ability to form concepts, during which the researcher conjures images and processes by intuition.
I agree with this quote in general, as does Paul Krugman. Many scientific areas don’t require advanced measure theory, differential geometry, or number theory to make big advances. This seems to be the kind of mathematics to which E.O. Wilson is referring, and on that point I think there is probably universal agreement that you can have a hugely successful scientific career without knowing about measurable spaces.
Wilson doesn’t stop there, however. He goes on to paint a much broader picture about how one can pursue science without the aid of even basic mathematics or statistics, and this is where I think he goes off the rails a bit:
Ideas in science emerge most readily when some part of the world is studied for its own sake. They follow from thorough, well-organized knowledge of all that is known or can be imagined of real entities and processes within that fragment of existence. When something new is encountered, the follow-up steps usually require mathematical and statistical methods to move the analysis forward. If that step proves too technically difficult for the person who made the discovery, a mathematician or statistician can be added as a collaborator.
I see two huge problems with this statement:
- Poor design of experiments is one of the most common reasons, if not the most common reason, that experiments fail. It is so important that Fisher said, “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” Wilson is suggesting that with careful conceptual thought and some hard work you can do good science, but without a fundamental understanding of basic math, statistics, and study design, even the best-conceived experiments are likely to fail. (A short R sketch after this list illustrates the kind of design calculation at stake.)
- While armchair science was likely the norm when Wilson was in his prime, huge advances have been made in both science and technology. Scientifically, it is difficult to synthesize and understand everything that has been done without some basic understanding of the statistical quality of previous experiments. Similarly, as data collection has evolved, statistics and computation are playing an increasingly central role. As Rafa has pointed out, people in positions of power who don’t understand statistics are a big problem for science.
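To make the design point concrete, here is a minimal sketch in R of the kind of calculation a statistician runs before any data are collected (the half-standard-deviation effect size and 80% power target are made-up numbers for illustration):
# How many subjects per group do we need to detect a difference of
# half a standard deviation with 80% power at the 5% significance level?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# n comes out to roughly 64 per group. No post hoc analysis can rescue
# an experiment that was run with 10 subjects per group.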
More importantly, as we live in an increasingly data-rich environment, both in the sciences and in the broader community, basic statistical and numerical literacy are becoming more and more important. While I agree with Wilson that we should try not to discourage people who have a difficult first encounter with math from pursuing careers in science, I think it is both disingenuous and potentially disastrous to downplay the importance of quantitative skill at the exact moment in history when those skills are most desperately needed.
As a counterproposal to Wilson’s suggestion that we encourage people to disregard the quantitative sciences, I propose that we build better infrastructure to ensure that everyone interested in the sciences can improve their quantitative skills and literacy. Here at Simply Stats we are all about putting our money where our mouth is, and we have already started by creating free, online versions of our quantitative courses. Maybe Wilson should take one….
10 Apr 2013
A few weeks ago I participated in the fourth annual Climate Science Day organized by the ASA and a host of other professional and scientific societies. There’s a nice write-up of the event by Steve Pierson over at Amstat News. There were a number of statisticians there besides me, but the vast majority of people were climate modelers, atmospheric scientists, agronomists, and the like. Below is our crack team of scientists outside the office of (Dr.) Andy Harris. Might be the only time you see me wearing a suit.

The basic idea behind the day is to get scientists who do climate-related research into the halls of Congress to introduce themselves to members of Congress and make themselves available for scientific consultations. I was there (with Brooke Anderson, the other JHU rep) because of some of my work on the health effects of heat. I was paired up with Tony Broccoli, a climate modeler at Rutgers, as we visited the various offices of New Jersey and Maryland legislators. We also talked to staff from the Senate Health, Education, Labor, and Pensions (HELP) committee.
Here are a few things I learned:
- It was fun. I’d never been to Congress before so it was interesting for me to walk around and see how people work. Everyone (regardless of party) was super friendly and happy to talk to us.
- The legislature appears to be run by women. Seriously, I think every staffer we met with (but one) was a woman. Might have been a coincidence, but I was not expecting that. We only met with one actual member of Congress, and that was (Dr.) Andy Harris from Maryland’s first district.
- Climate change is not really on anyone’s radar. Oh well, we were there 3 days before the sequester hit so there were understandably other things on their minds. Waxman-Markey was the most recent legislation taken up by the House and it went nowhere in the Senate.
- The Senate HELP committee has PhDs working on its staff. Didn’t know that.
- Staffers are working on like 90 things at once, probably none of which are related to each other. That’s got to be a tough job.
- I used more business cards on this one day than in my entire life.
- Senate offices are way nicer than House offices.
- The people who write our laws are around 22 years old. Maybe 25 if they went to law school. I’m cool with that, I think.
08 Apr 2013
NIH understands the importance of data, and several months ago they announced this new position. Here is an excerpt from the ad:
The ADDS will focus on the urgent need and increased opportunities for capitalizing on the expanding collections of biomedical data to advance NIH’s mission. In doing so, the incumbent will provide programmatic NIH-wide leadership for areas of data science that relate to data emanating from many areas of study (e.g., genomics, imaging, and electronic health records). This will require knowledge about multiple domains of study as well as familiarity with approaches for integrating data from these various domains.
In my opinion, the person holding this job should have hands-on experience with data analysis and programming. The nuances of what a data analyst needs to do his/her job successfully can’t be overstated. This knowledge will help the director make the right decisions when it comes to choosing what data to make available and how to make it available. When it comes to creating data resources, good intentions don’t always translate into usable products.
In this new era of data-driven science, this position will be highly influential, making the job quite attractive. If you know a statistician who you think would be interested, please pass along the information.
02 Apr 2013
We have been a little slow on the posting for the last couple of months here at Simply Stats. That’s bad news for the blog, but good news for our research programs!
Today I’m announcing the new healthvis R package that is being developed by my student Prasad Patil (who needs a website like yesterday), Hector Corrada Bravo, and myself*. The basic idea is that I have loved D3 interactive graphics for a while, but they are hard to create from scratch since they require knowledge of both JavaScript and the D3 library.
Even with those skills, it can take a while to develop a new graphic. On the other hand, I know a lot about R and am often analyzing biomedical data where interactive graphics could be hugely useful. There are a couple of really useful tools for creating interactive graphics in R, most notably Shiny, which is awesome. But these tools still require a bit of development to get right and are designed as “stand-alone” tools.
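For context, here is a minimal sketch of what even a trivial stand-alone Shiny app involves (this uses shiny’s standard fluidPage/renderPlot API; the mtcars dataset and variable choices are arbitrary illustrations, not part of healthvis):
library(shiny)
# UI: a dropdown plus a placeholder for the plot
ui <- fluidPage(
  selectInput("var", "Variable:", choices = names(mtcars)),
  plotOutput("hist")
)
# Server: redraw the histogram whenever the dropdown selection changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(mtcars[[input$var]], main = input$var, xlab = input$var)
  })
}
shinyApp(ui, server)
Writing the UI and server pair is quick here, but for each new graphic you start from scratch - which is the gap healthvis is meant to fill.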
So we created an R package that builds specific graphs that come up commonly in the analysis of health data like survival curves, heatmaps, and icon arrays. For example, here is how you make an interactive survival plot comparing treated to untreated individuals with healthvis:
# Load libraries: survival provides coxph() and the veteran dataset
library(healthvis)
library(survival)

# Fit a Cox proportional hazards model
cobj <- coxph(Surv(time, status) ~ trt + age + celltype + prior, data = veteran)

# Plot using healthvis - one line!
survivalVis(cobj, data = veteran, plot.title = "Veteran Survival Data",
            group = "trt", group.names = c("Treatment", "No Treatment"),
            line.col = c("#E495A5", "#39BEB1"))
The “survivalVis” command above produces an interactive graphic. Here it is embedded (you may have to scroll to see the dropdowns on the right - we are working on resizing).
The advantage of this approach is that you can make common graphics interactive without a lot of development time. Here are some other unique features:
- The graphics are hosted on Google App Engine. With one click you can get a permanent link and share it with collaborators.
- With another click you can get the code to embed the graphics in your website.
- If you have already created D3 graphics, it only takes a few minutes to develop a healthvis version that lets R users create their own - email us and we will make it part of the healthvis package!
- healthvis is totally general - you can develop graphics that don’t have anything to do with health with our framework. Just get in touch at healthvis@gmail.com if you want to be a developer.
We have started a blog over at healthvis.org where we will talk about the tricks we learn while developing D3 graphics, post updates to the healthvis package, and discuss visualization for new technologies like those developed by the CCNE and individualized health. If you are interested in getting involved as a developer, user, or tester, drop us a line and let us know. In the meantime, happy visualizing!
* This project is supported by the JHU CCNE (U54CA151838) and the Johns Hopkins inHealth initiative.
26 Mar 2013
I used peer review for the data analysis course I just finished. As I mentioned in the post-mortem podcast, I knew in advance that it was likely to be the most controversial component of the class. So it wasn’t surprising that, based on feedback in the discussion boards and on this blog, the peer review process was by far the thing students were most concerned about.
But for evaluating complete data analysis projects at scale, there is no other economically feasible alternative. To give you an idea: I have our local students perform 3 data analyses in an 8-week term here at Johns Hopkins. There are generally 10-15 students in that class, and I estimate that I spend around an hour reading each analysis, digesting what was done, and writing up comments. That means I usually spend almost an entire weekend grading just 10-15 data analyses. If you extrapolate that out to the 5,000 or so people who turned in data analysis assignments, it is clearly not possible for me to do all the grading.
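To put rough numbers on that extrapolation (a back-of-the-envelope sketch using the one-hour-per-analysis estimate from above):
# Back-of-the-envelope grading-time extrapolation
analyses    <- 5000  # data analysis assignments turned in
hours_each  <- 1     # estimated grading time per analysis
total_hours <- analyses * hours_each
total_hours / 40     # about 125 full 40-hour work weeks of grading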
Another alternative would be to pay trained data analysts to grade all the assignments. Of course that would be expensive - you couldn’t farm it out to Mechanical Turk. If you want a better/more consistent grading scheme than peer review, you’d need to hire highly trained data analysts, and that would be very expensive. While Johns Hopkins has been incredibly supportive in terms of technical support and giving me the flexibility to pursue the class, it is definitely something I did on my own time and with a lot of my own resources. It isn’t clear that it makes sense for Hopkins to pour huge resources into really high-quality grading. At the same time, I’m not sure Coursera could afford to do this for all of the classes where peer review is needed, as they are just a startup.
So I think that, at least for the moment, peer review is the best option for grading. This has big implications for the value of the Coursera statements of accomplishment in classes where peer review is necessary. I think it would benefit Coursera hugely to do some research on how to ensure/maintain quality in peer review (Coursera - if you are reading this and have some $$ you want to send my way to support some students/postdocs, I have some ideas on how to do that). The good news is that the amazing Coursera platform collects so much data that it is possible to do that kind of research.