Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A Northwest Pipeline to Silicon Valley

Skepticism+Ideas+Grit

A number of people seem to have objected to my post quoting Carl Sagan about skepticism (hi Paramita!), and I appreciate the comments. However, I wanted to clarify why I liked the quotation. I think that to be successful in science, three things are necessary:

  1. A healthy skepticism
  2. An original idea
  3. Quite a bit of grit and moxie

I find that too often, people consciously or unconsciously stop at (1). In fact, some people make an entire career out of (1), but it's not one that I can personally appreciate.

What we need more of is skepticism coupled with new ideas, not pure skepticism. 

The power of power

Those of you living in the mid-Atlantic region are probably not reading this right now because you don’t have power. I’ve been out of power in my house since last Friday and projections are it won’t come back until the end of the week. I am lucky because my family and I have some backup options, but not everyone has those options.

So that leads me to this question: do power outages affect health? There have been a number of papers examining this question, mostly looking at one-off episodes, as you might expect. One paper, written by Brooke Anderson (a postdoctoral fellow here) and Michelle Bell of Yale University, examined the effect of the massive 2003 power outage in New York City on all-cause mortality. This was the first city-wide blackout since 1977, and the data from that period are striking.

A key point in this paper is that mortality is often under-estimated in these kinds of situations because deaths are only counted if they are identified as “disaster-related” (there may be other reasons, but I won’t get into that here). The NYC Department of Health and Mental Hygiene reported a total of 6 deaths over the 2-day period of the blackout, mostly from carbon monoxide poisoning. However, the paper estimated a 28% increase in all-cause mortality which, in New York, translates to an excess mortality of about 90 deaths, an order of magnitude higher than the official count.

The power outage in the mid-Atlantic is ongoing, but things appear to be improving by the day. According to BGE, the primary electricity provider in Baltimore City, over half of its customers in the city were without power. On top of that, the region is in the middle of a heat wave that has been going on for roughly as long as the power outage. If you figure the worst of it was in the first 3 days, and if New York’s relative risk could be applied here in Baltimore (a BIG if), then given a typical daily mortality of 17 deaths in the summer months, we would expect an excess mortality for the 3-day period of about 14 deaths from all causes.
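To make that back-of-the-envelope calculation explicit, here is a minimal sketch in Python using only the figures quoted above; the function name and structure are mine, not from the original analysis:

    # Back-of-the-envelope excess mortality:
    # excess = baseline daily deaths x relative increase in mortality x number of days

    def excess_deaths(baseline_per_day, relative_increase, n_days):
        """Expected excess deaths if daily mortality rises by `relative_increase`."""
        return baseline_per_day * relative_increase * n_days

    # Baltimore: ~17 deaths/day in summer, NYC's 28% increase, worst 3 days of the outage
    print(excess_deaths(baseline_per_day=17, relative_increase=0.28, n_days=3))  # ~14.3

So about 14 excess deaths, under the (big) assumption that New York’s relative risk carries over to Baltimore.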

Unfortunately, it seems power outages are likely to become more frequent because of increasing stress on an aging power grid and climate change causing more extreme weather (this outage was caused by a severe thunderstorm). It seems to me that the contribution of such infrastructure failures to health problems will be an interesting problem to study in the future.

Replication and validation in -omics studies - just as important as reproducibility

The psychology/social psychology community has made replication a huge focus over the last year. One reason is the recent, public blow-up over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond a simple lack of replication. In genomics, a similar scandal occurred due to what amounted to “data fudging”, although in the genomics case much of the blame and focus has been on the lack of reproducibility and data availability.

I think one of the reasons that the field of genomics has focused more on reproducibility is that replication is already performed more consistently in genomics. There are two forms of this replication: validation and independent replication. Validation generally refers to a replication experiment performed by the same research lab or group, using a different technology or a different data set. Independent replication, on the other hand, is usually performed by an outside laboratory.

Validation is by far the more common form of replication in genomics. In this article in Science, Ioannidis and Khoury point out that validation has a different meaning depending on the subfield of genomics. In genome-wide association studies (GWAS), it is now expected that every significant result will be validated in a second large cohort, with genome-wide significance for the identified variants.

In gene expression/protein expression/systems biology analyses, there has been no similar definition of the “criteria for validation”. Generally the experiments are performed, and if a few/a majority/most of the results are confirmed, the approach is considered validated. My colleagues and I just published a paper in which we define a new statistical sampling approach for validating lists of features in genomics studies that is somewhat less ambiguous. But I think this is only a starting point. Just like in psychology, we need to focus not just on reproducibility but also on replicability of our results, and we need new statistical approaches for evaluating whether validation/replication has actually occurred.
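To make concrete why “a few/a majority/most of the results are confirmed” is an ambiguous criterion, here is one generic way a validation rate might be quantified. This is only an illustrative sketch, not the sampling approach from our paper; the significance threshold and variable names are assumptions:

    # A generic concordance check (illustrative only, not the paper's method):
    # of the features called significant in the discovery experiment, what
    # fraction are also significant, with the same direction of effect, in the
    # validation experiment? The alpha threshold is an arbitrary assumption.

    import numpy as np

    def validation_rate(disc_p, disc_effect, val_p, val_effect, alpha=0.05):
        """Fraction of discovery hits that replicate in the validation data."""
        disc_p, disc_effect = np.asarray(disc_p), np.asarray(disc_effect)
        val_p, val_effect = np.asarray(val_p), np.asarray(val_effect)
        hits = disc_p < alpha                    # features "discovered" in experiment 1
        if hits.sum() == 0:
            return float("nan")
        same_sign = np.sign(val_effect[hits]) == np.sign(disc_effect[hits])
        replicated = (val_p[hits] < alpha) & same_sign
        return replicated.mean()

Even this simple check forces you to state what counts as “confirmed” (same direction of effect? nominal or corrected significance?), which is exactly the ambiguity that current informal practice leaves open.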

Computing and Sustainability: What Can Be Done?

Last Friday, the National Research Council released a report titled Computing Research for Sustainability, written by the NRC’s Committee on Computing Research for Environmental and Societal Sustainability, on which I served (press release). This was a novel experience for me given that I was the only non-computer scientist on the committee. That said, I think the report is quite interesting for a number of reasons. As a statistician, I took away a few lessons.

  • Sustainability presents many opportunities for CS. One of the first things the committee did was hold a workshop where researchers from all over presented their work on CS and sustainability, and it was impressive. Everything from Shwetak Patel’s clever use of data analysis to monitor home power usage to Bill Tomlinson’s work in human-computer interaction. Very educational for me. One thing I remember is that towards the end of the workshop John Doyle made some comment about IPv6 and everyone laughed and…I didn’t get it. I still don’t get it.
  • CS faces a number of statistical challenges. Many of the interesting areas posed by sustainability research come across, in my mind, as statistical problems. In particular, there is a need to develop better statistical models for understanding uncertainty in a variety of systems (e.g., electrical power grids, climate models, ecological dynamics). These are CS problems because they are “big data” systems, but the underlying issues are largely statistical. Overall, it seems a lot of money has been put into collecting data, but relatively little investment has been made (so far) in figuring out what to do with it.
  • Statistics and CS will be crashing into each other at a theater near you. In many discussions the Committee had, I couldn’t help thinking that a lot of the challenges in CS are exactly the same as in statistics. Specifically, how integrated should computer scientists be with the other sciences? As an outsider to that area, my sense is that there is a debate going on between those who do “pure” computer science, like compilers and programming languages, and those who do “applied” computer science, like computational biology. This debate sounds eerily familiar.

It was fun to hang out with the computer scientists for a while, and this group was really exceptional. But now, back to my day job.