Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The bright future of applied statistics

In 2013, the Committee of Presidents of Statistical Societies (COPSS) celebrates its 50th Anniversary. As part of its celebration, COPSS will publish a book, with contributions from past recipients of its awards, titled “Past, Present and Future of Statistical Science”. Below is my contribution titled The bright future of applied statistics.

When I was asked to contribute to this issue, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the past and present, and explain why this has led me to believe that the future is bright for applied statisticians.

I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was “I don’t know. Music?”, to which he responded “That’s what we will work on”. Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work
involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals [5]. The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.

Despite having expertise only in music, and a thesis that required a CD player to hear the data, fitted models, and residuals, I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the School's leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications [1, 2, 8]. The public health and biomedical challenges surrounding me were simply too hard to resist, and my new department knew this. It was inevitable that I would quickly turn into an applied biostatistician.

Since the day that I arrived at Hopkins 15 years ago, Scott Zeger, the department chair, has fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. Being located in a School of Public Health and next to a medical school, we were surrounded by collaborators working in such fields, including environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists who, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.

A specific example came from the measurement of gene expression. Gene expression is the process by which DNA, the blueprint for life, is copied into RNA, the template for the synthesis of proteins, the building blocks for life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure 1A below). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (see Figure 1B). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The biologists who used to say, “if I need statistics, the experiment went wrong” were now seeking out our help. The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays, making it possible to identify patients at risk of distant recurrence following surgery [9].

Figure 1: Illustration of gene expression data before and after microarrays.

When biologists at Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what were then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians to have an impact in this nascent field was clear, and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial, as successfully working in this field made it harder to publish in the mainstream statistical journals, an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.

As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed's group in Melbourne, where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones or newly developed techniques. Learning that devising clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later this continues to be my approach to applied statistics, and it has been instrumental in some of my current collaborative work. In particular, it led to important new biological discoveries made together with Andy Feinberg's lab [7].

During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases for microarray data [6]. I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and were searching for solutions. Therefore I was also thinking hard about ways in which I could share whatever solutions I developed with others. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This new collaboration eventually became the Bioconductor project, which to this day continues to grow its user and developer base [4]. Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others we wrote an R package that has been downloaded tens of thousands of times [3]. Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.
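The normalization method described in [6] is implemented in the affy Bioconductor package [3]. Here is a minimal sketch of the analyst-facing workflow, assuming a directory of Affymetrix CEL files; the directory name is a hypothetical placeholder.

```r
# Minimal sketch: probe-level preprocessing with the affy package [3].
# "cel_files/" is a hypothetical directory of Affymetrix CEL files.
library(affy)

raw  <- ReadAffy(celfile.path = "cel_files/")  # read probe-level data
eset <- rma(raw)    # RMA [6]: background correction, quantile
                    # normalization, and median-polish summarization
expr <- exprs(eset) # genes-by-samples matrix of log2 expression values
```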

The impact statisticians have had in genomics is just one example of our field's accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD statistician led the team that won the million-dollar Netflix Prize. Nate Silver proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that the number of statistics majors at Harvard has more than quadrupled since 2000 and that statistics MOOCs are among the most popular.

The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that have traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Furthermore, advances such as these have brought about a shift from hypothesis-driven to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills, as one can easily be fooled by patterns arising by chance. This has greatly elevated the importance of our discipline in biomedical research.
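To see how easily chance patterns can fool an analyst, here is a small illustrative simulation (mine, not from the essay): with ten thousand pure-noise features compared across two groups, hundreds cross the conventional 0.05 significance cutoff.

```r
# Illustrative simulation: "discoveries" from pure noise.
set.seed(1)
n_genes <- 10000; n_samples <- 20
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)  # no real signal
group <- rep(c(0, 1), each = n_samples / 2)              # two fake groups

pvals <- apply(x, 1, function(g)
  t.test(g[group == 0], g[group == 1])$p.value)

sum(pvals < 0.05)  # ~500 features look "significant" by chance alone
min(pvals)         # the single best-looking feature can seem compelling
```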

I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.

Bibliography

[1]   NE Crone, L Hao, J Hart, D Boatman, RP Lesser, R Irizarry, and
B Gordon. Electrocorticographic gamma activity during word production
in spoken and sign language. Neurology, 57(11):2045–2053, 2001.

[2]   Janet A DiPietro, Rafael A Irizarry, Melissa Hawkins, Kathleen A Costigan, and Eva K Pressman. Cross-correlation of fetal cardiac and somatic activity as an indicator of antenatal neural development. American Journal of Obstetrics and Gynecology, 185(6):1421–1428, 2001.

[3]   Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. affy - analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307–315, 2004.

[4]   Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, 2004.

[5]   Rafael A Irizarry. Local harmonic estimation in musical sound signals.
Journal of the American Statistical Association, 96(454):357–367, 2001.

[6]   Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D
Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed.
Exploration, normalization, and summaries of high density oligonucleotide
array probe level data. Biostatistics, 4(2):249–264, 2003.

[7]   Rafael A Irizarry, Christine Ladd-Acosta, Bo Wen, Zhijin Wu, Carolina Montano, Patrick Onyango, Hengmi Cui, Kevin Gabo, Michael Rongione, Maree Webster, et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature Genetics, 41(2):178–186, 2009.

[8]   Rafael A Irizarry, Clarke Tankersley, Robert Frank, and Susan
Flanders. Assessing homeostasis through circadian patterns. Biometrics,
57(4):1228–1237, 2001.

[9]   Laura J van’t Veer, Hongyue Dai, Marc J van de Vijver, Yudong D He, Augustinus AM Hart, Mao Mao, Hans L Peterse, Karin van der Kooy, Matthew J Marton, Anke T Witteveen, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.

Sunday data/statistics link roundup (5/12/2013, Mother's Day!)

  1. A tutorial on deep learning. I really enjoyed reading it, but I’m still trying to figure out how this is different from non-linear logistic regression to estimate features, followed by supervised prediction using those features. Or maybe I’m just naive….
  2. Rafa on political autonomy for science for a blog in PR called 80 grados.  He writes about Rep. Lamar Smith and then focuses more closely on issues related to the University of Puerto Rico. A very nice read. (via Rafa)
  3. Highest paid employees by state. I should have coached football…
  4. Newton took the mean. It warms my empirical heart to hear about how the theoretical result was backed up by averaging (via David S.)
  5. Reinhart and Rogoff publish a correction but stand by their original claims. I’m not sure whether this is a good or a bad thing. But it definitely is an overall win for reproducibility.
  6. Stats folks are getting some much-deserved attention. Terry Speed is a Fellow of the Royal Society, Peter Hall is a foreign associate of the NAS, and Gareth Roberts is also a Fellow of the Royal Society (via Peter H.)
  7. Statisticians go to the movies and the hot hand analysis makes the NY Times (via Dan S.)

Bonus Link!  Karl B.’s GitHub tutorial is awesome and every statistician should be required to read it. I only ask why he gives all the love to Nacho’s admittedly awesome Clickme package and no love to healthvis; we are on GitHub too!

A Shiny web app to find out how much medical procedures cost in your state.

Today the [front page of the Huffington Post featured](http://www.huffingtonpost.com/2013/05/08/hospital-prices-cost-differences_n_3232678.html) a [newly released dataset from the Centers for Medicare and Medicaid Services](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html) that shows the cost of many popular procedures broken down by hospital. We here at Simply Statistics think you should be able to explore these data more easily. So, with some help, we built a Shiny app that allows you to interact with these data. You can choose your state and your procedure and see how much the procedure costs at hospitals in your state. It takes a second to load because it is a lot of data….
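For readers curious what such an app looks like under the hood, here is a minimal sketch in the spirit of the app described above. It is not the actual app code (that is linked at the end of this post); the charges data frame and its column names are hypothetical stand-ins for the CMS data.

```r
# Minimal sketch of a state/procedure cost explorer in Shiny.
# `charges` is a hypothetical stand-in for the CMS dataset.
library(shiny)

charges <- data.frame(
  state      = c("ID", "ID", "MD"),
  procedure  = c("Intracranial hemorrhage", "Heart failure",
                 "Intracranial hemorrhage"),
  hospital   = c("Hospital A", "Hospital B", "Hospital C"),
  avg_charge = c(21000, 18000, 25000)
)

ui <- fluidPage(
  selectInput("state", "State", sort(unique(charges$state))),
  selectInput("procedure", "Procedure", sort(unique(charges$procedure))),
  tableOutput("costs")
)

server <- function(input, output) {
  output$costs <- renderTable({
    subset(charges, state == input$state & procedure == input$procedure)
  })
}

shinyApp(ui, server)
```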

Here is the link to the app.

Here are some screenshots for intracranial hemorrhage for the US and for Idaho.


The R code is here if you want to tweak/modify.

Why the current over-pessimism about science is the perfect confirmation bias vehicle and we should proceed rationally

Recently there have been some high profile flameouts in scientific research. A couple of examples include the Duke saga, the replication issues in social sciences, p-value hacking, fabricated data, not enough open-access publication, and on and on.

Some of these results have had major non-scientific consequences, which is the reason they have drawn so much attention both inside and outside of the academic community. For example, the [Duke saga resulted in lawsuits](http://www.nytimes.com/2011/07/08/health/research/08genes.html?_r=0), the lack of replication has led to high-profile arguments between scientists in Discover and Nature among other outlets, and the [Reinhart-Rogoff debacle has been all over the news](http://www.businessinsider.com/why-the-reinhart-rogoff-excel-debacle-could-be-devastating-for-the-austerity-movement-2013-4) (sometimes comically) because of a lack of reproducibility.

The result of this high-profile attention is that there is a movement underway to “clean up science”. As has been pointed out, there is a group of scientists who are making names for themselves primarily as critics of what is wrong with the scientific process. The good news is that these key players are calling attention to issues, such as reproducibility, replicability, and open access, that are critically important for the scientific enterprise.

I too am concerned about these issues and have altered the research process in my own group to try to address them. I also think that the solutions others have proposed on a larger scale, like alltrials.net or PLoS, are great advances for the scientific community.

I am also very worried that people are using a few high-profile cases to hyperventilate about the real, solvable, and recognized problems in the scientific process. These people get credit and a lot of attention for pointing out how science is “failing”. But they aren’t giving proportional time to all of the incredible success stories we have had, both in performing research and in reforming research with reproducibility, open access, and replication initiatives.

We should recognize that science is hard and that even dedicated, diligent, and honest scientists will make mistakes, perform irreproducible or irreplicable studies, or publish in closed-access journals. Sometimes this is because of ignorance of good research principles, sometimes it is because people are new to working in a world where data and computation are major players, and sometimes it is because it is legitimately, really hard to make real advances in science. I think people who participate in real science recognize these problems and are eager to solve them. I have also noticed that real scientists generally try to propose a solution when they complain about these issues.

But it seems like sometimes people use these high-profile mistakes out of context to push their own scientific pet peeves. For example:

  1. I don’t like p-values and there are lots of results that fail to replicate, so it must be the fault of p-values. Many studies fail to replicate not because the researchers used p-values, but because they performed studies that were either weak or had poorly understood scientific mechanisms.
  2. I don’t like not being able to access people’s code, so lack of reproducibility is causing science to fail. Even in the two most infamous cases (Potti and Reinhart-Rogoff), the problem with the science wasn’t reproducibility; it was that the analysis was incorrect or flawed. Reproducibility compounded the problem but wasn’t its root cause.
  3. I don’t like not being able to access scientific papers, so closed-access journals are evil. For whatever reason (I don’t know if I fully understand why) it is expensive to publish journals, so someone has to pay: the author under open access or the reader under closed access. If I’m a junior researcher, I’ll definitely post my preprints online, but I also want papers in “good” journals and don’t have a ton of grant money, so sometimes I’ll choose closed access.
  4. I don’t like these crazy headlines from social psychology (substitute another field here) and there have been some that haven’t replicated, so none must replicate. Of course some papers won’t replicate, including even high-profile papers. If you are doing statistics, then by definition some papers won’t replicate, since you have to make decisions on noisy data; see the small simulation after this list.
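Here is the small simulation promised in item 4 (my own illustration, not from the original post): every study below measures a real effect and is analyzed honestly, yet a sizable fraction of the original “discoveries” fail to replicate simply because the data are noisy and the studies are modestly powered.

```r
# Illustrative simulation: honest studies of a real effect still
# fail to replicate at a predictable rate when power is modest.
set.seed(2)
n_studies <- 1000
effect <- 0.5  # a real effect of half a standard deviation, n = 20/arm

p_orig <- replicate(n_studies,
  t.test(rnorm(20, effect), rnorm(20, 0))$p.value)
p_rep  <- replicate(n_studies,
  t.test(rnorm(20, effect), rnorm(20, 0))$p.value)

found      <- p_orig < 0.05
replicated <- p_rep  < 0.05
mean(replicated[found])  # well under 100% despite every effect being real
```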

These are just a few examples where I feel like a basic, fixable flaw in science has been used to justify a hugely pessimistic view of science in general. I’m not saying it is all rainbows and unicorns. Of course we want to improve the process. But I’m worried that the real, solvable problems we have, described with enough hyperbole, will make it look like the sky is falling on the scientific process and will leave the door open for individuals like Rep. Lamar Smith to come in and turn the scientific process into a political one.

P.S. Andrew Gelman posted on a similar topic yesterday as well. He argues the case for less optimism and for making sure we don’t become complacent. He added a P.S. and mentioned two points on which we can agree: (1) science is hard and is a human system, and we are working to fix the flaws inherent in such systems, and (2) it is still easier to publish a splashy claim than to publish a correction. I definitely agree with both. I think Gelman would also likely agree that we need to be careful about reciprocity on these issues: if earnest scientists work hard to address reproducibility, replicability, open access, etc., then people who criticize them should have to work just as hard to justify their critiques. Just because it is a critique doesn’t mean it should automatically be believed; it deserves the same scrutiny as the original paper.

Talking about MOOCs on MPT Direct Connection

[Video: Direct Connection, Monday, April 29, 2013, on PBS.]

I appeared on Maryland Public Television’s Direct Connection with Jeff Salkin last Monday to talk about MOOCs (along with our Dean Mike Klag).