Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Sunday data/statistics link roundup (8/4/13)

  1. The $4 million teacher. I love the idea that teaching is becoming a competitive industry where the best will get the kind of pay they really, really deserve. I can’t think of another profession where the ratio of (the influence you can have on the world if you are good)/(salary) is so incredibly large. MOOCs may contribute to this, that is, if they aren’t felled by the ecological fallacy (via Alex N.).
  2. The NIH is considering requiring replication of results (via Rafa). Interestingly, the article talks about reproducibility, as opposed to replication, throughout most of the text.
  3. R jobs on the rise! Pair that with this rather intense critique of Marie Davidian’s interview about big data because she didn’t mention R. I think R/software development is definitely coming into its own as a critical part of any statistician’s toolbox. As that happens we need to take more and more care to include relevant training in version control, software development, and documentation for our students.
  4. Not technically statistics, but holy crap a 600,000 megapixel picture?
  5. A short history of data science. Not many card-carrying statisticians make the history, which is a shame, given all the good they have contributed to the development of the foundations of this exciting discipline (via Rafa).
  6. For those of you at JSM 2013, make sure you wear out that hashtag (#JSM2013) for those of us on the outside looking in. Watch out for the Lumley 12 and make sure you check out Shirley’s talk, Lumley and Hadley together, this interesting looking ethics session, and Martin doing his fMRI thang, among others…

That causal inference came out of nowhere

This is a study of breastfeeding and its impact on IQ that has been making the rounds on a number of different media outlets. I first saw it in the Wall Street Journal, where I was immediately drawn to this quote:

They then subtracted those factors using a statistical model. Dr. Belfort said she hopes that “what we have left is the true connection” with nursing and IQ.

As the father of a young child this was of course pretty interesting to me so I thought I’d go and check out the paper itself. I was pretty stunned to see this line right there in the conclusions:

Our results support a causal relationship of breastfeeding duration with receptive language and verbal and nonverbal intelligence later in life.

I immediately thought: “man, how did they run a clinical trial of breastfeeding?” It seems like it would be a huge challenge to get past the IRB. So then I read a little bit more carefully how they performed the analysis. It was a prospective study, where they followed the children over time, then performed a linear regression analysis to adjust for a number of other factors that might influence childhood intelligence. Some examples include mother’s IQ, socio-demographic information, and questionnaires about delivery.

They then fit a number of regression models with different combinations of covariates and outcomes. They did not attempt to perform any sort of causal inference to make up for the fact that the study was not randomized. Moreover, they did not perform multiple hypothesis testing correction for all of the combinations of effects they observed. The actual reported connections represent just a small fraction of all the possible connections they tested.
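To make the multiple comparisons point concrete, here is a minimal R sketch with simulated data and invented variable names (not the paper’s actual data, models, or results) of fitting several covariate-adjusted regressions across outcomes and then correcting the resulting p-values:

```r
# Simulated sketch only: invented variables, not the paper's data or models.
# Fit covariate-adjusted regressions of several outcomes on breastfeeding
# duration, then correct the collection of p-values for multiple testing.
set.seed(1)
n <- 500
dat <- data.frame(
  bf_months = rpois(n, 6),        # breastfeeding duration (hypothetical)
  mother_iq = rnorm(n, 100, 15),  # example confounders
  ses       = rnorm(n),
  lang3     = rnorm(n, 100, 15),  # e.g., receptive language at age 3
  viq       = rnorm(n, 100, 15),  # e.g., verbal IQ at school age
  nviq      = rnorm(n, 100, 15)   # e.g., nonverbal IQ at school age
)

outcomes <- c("lang3", "viq", "nviq")
pvals <- sapply(outcomes, function(y) {
  fit <- lm(reformulate(c("bf_months", "mother_iq", "ses"), response = y),
            data = dat)
  summary(fit)$coefficients["bf_months", "Pr(>|t|)"]
})

# Benjamini-Hochberg adjustment across all of the outcome models tested
p.adjust(pvals, method = "BH")
```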

So I was pretty surprised when they said:

In summary, our results support a causal relationship of breastfeeding in infancy with receptive language at age 3 and with verbal and nonverbal IQ at school age.

I'm as optimistic about science as they come. But where did that causal inference come from?

The ROC curves of science

Andrew Gelman’s recent post on what he calls the “scientific mass production of spurious statistical significance” reminded me of a thought I had back when I read John Ioannidis’ paper claiming that most published research findings are false. Many authors, whom I will refer to as _the pessimists_, have joined Ioannidis in making similar claims and repeatedly blaming the current state of affairs on the mindless use of frequentist inference. The gist of my thought is that, for some scientific fields, the pessimists’ criticism misses a critical point: in practice there is an inverse relationship between increasing rates of true discoveries and decreasing rates of false discoveries, and true discoveries from fields such as the biomedical sciences provide an enormous benefit to society. Before I explain this in more detail, I want to be very clear that I do think reducing false discoveries is an important endeavor and that some of these false discoveries are completely avoidable. But, as I describe below, a general solution that improves the current situation is much more complicated than simply abandoning the frequentist inference that currently dominates.

Few will deny that our current system, with all its flaws, still produces important discoveries. Many of the pessimists’ proposals for reducing false positives seem to be, in one way or another, a call for being more conservative in reporting findings. Examples of recommendations include requiring larger effect sizes or smaller p-values, correcting for the “researcher degrees of freedom”, and using Bayesian analyses with pessimistic priors. I tend to agree with many of these recommendations, but I have yet to see a specific proposal on exactly how conservative we should be. Note that we could easily bring the false positives all the way down to 0 by simply taking this recommendation to its extreme and stopping the publication of biomedical research results altogether. This absurd proposal brings me to receiver operating characteristic (ROC) curves.

[Figure: imagined ROC curves for physics and biomedical research]

ROC curves plot true positive rates (TPR) versus false positive rates (FPR) for a given classification procedure. For example, suppose a regulatory agency that runs randomized trials on drugs (e.g., the FDA) classifies a drug as effective when a pre-determined statistical test produces a p-value < 0.05 or a posterior probability > 0.95. This procedure will have a historical false positive rate and true positive rate pair: one point on an ROC curve. We can change the 0.05 to, say, 0.2 (or the 0.95 to 0.80) and we would move up the ROC curve: higher FPR and TPR. Not doing research would put us at the useless bottom left corner. It is important to keep in mind that biomedical science is done by imperfect humans on imperfect and stochastic measurements, so to make discoveries the field has to tolerate some false discoveries (ROC curves don’t shoot straight up from 0% to 100%). Also note that it can take years to figure out which publications report important true discoveries.
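As a toy illustration (the effect sizes and sample sizes below are invented, not based on any agency’s data), here is how one might compute the (FPR, TPR) point that a given p-value cutoff corresponds to, and how loosening the cutoff moves you up the curve:

```r
# Toy simulation: p-values for 1000 truly ineffective and 1000 truly
# effective "drugs" (one-sided tests; the effect size is made up).
set.seed(1)
z_null <- rnorm(1000, mean = 0)    # ineffective: no signal
z_alt  <- rnorm(1000, mean = 2.5)  # effective: some signal
p_null <- 1 - pnorm(z_null)
p_alt  <- 1 - pnorm(z_alt)

# Each p-value cutoff defines one (FPR, TPR) point on the ROC curve
roc_point <- function(alpha) {
  c(FPR = mean(p_null < alpha), TPR = mean(p_alt < alpha))
}

roc_point(0.05)  # the conventional cutoff
roc_point(0.20)  # a more lenient cutoff: higher FPR and higher TPR
```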

I am going to use the concept of an ROC curve to distinguish between reducing the FPR by being statistically more conservative and reducing the FPR via more general improvements. In my ROC curve the y-axis represents the number of important discoveries per decade and the x-axis the number of false positives per decade (to avoid confusion I will continue to use the acronyms TPR and FPR). The current state of biomedical research is represented by one point on the red curve: one (TPR, FPR) pair. The pessimists argue that the FPR is close to 100% of all results, but they rarely comment on the TPR. Being more conservative lowers our FPR, which saves us time and money, but it also lowers our TPR, which could reduce the number of important discoveries that improve human health. So what is the optimal balance, and how far are we from it? I don’t think this is an easy question to answer.

Now, one thing we can all agree on is that moving the ROC curve up is a good thing, since it means that we get a higher TPR for any given FPR. Examples of ways we can achieve this are developing better measurement technologies, statistically improving the quality of these measurements, augmenting the statistical training of researchers, thinking harder about the hypotheses we test, and making fewer coding and experimental mistakes. However, applying a more conservative procedure does not move the ROC curve up; it moves our point left along the existing curve: we reduce our FPR but reduce our TPR as well.

In the plot above I draw two imagined ROC curves: one for physics and one for biomedical research. The physicists’ curve looks great. Note that it shoots up really fast which means they can make most available discoveries with very few false positives. Perhaps due to the maturity of the field, physicists can afford and tend to use very stringent criteria. The biomedical research curve does not look as good. This is mainly due to the fact that biology is way more complex and harder to model mathematically than physics. However, because there is a larger uncharted territory and more research funding, I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher TPR, biomedical research has to tolerate a higher FPR. According to my imaginary ROC curves, if we become as stringent as physicists our TPR would be five times smaller. It is not obvious to me that this would result in a better situation than the current one. At the same time, note that the red ROC suggests that increasing the FPR, with the hopes of increasing our TPR, is not a good idea because the curve is quite flat beyond our current location on the curve.
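For what it’s worth, here is a rough R sketch of how one might draw two imagined curves like the ones in the figure above; the functional forms and numbers are invented purely for illustration:

```r
# Two invented curves mimicking the figure: "physics" saturates quickly at a
# lower number of discoveries; "biomedical research" rises more slowly but
# reaches a higher number, at the cost of more false positives.
fp <- seq(0, 1, length.out = 200)          # false positives per decade (scaled)
disc_physics <- 20 * (1 - exp(-50 * fp))
disc_biomed  <- 100 * (1 - exp(-3 * fp))

plot(fp, disc_biomed, type = "l", col = "red", lwd = 2, ylim = c(0, 100),
     xlab = "False positives per decade (scaled)",
     ylab = "Important discoveries per decade")
lines(fp, disc_physics, col = "blue", lwd = 2)
legend("bottomright", lwd = 2, col = c("red", "blue"),
       legend = c("Biomedical research (imagined)", "Physics (imagined)"))
```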

Clearly I am oversimplifying a very complicated issue, but I think it is important to point out that there are two discussions to be had: 1) where should we be on the ROC curve (keeping in mind the relationship between FPR and TPR)? and 2) what can we do to improve the ROC curve? My own view is that we can probably move down the ROC curve some and reduce the FPR without much loss in TPR (for example, by raising awareness of the researcher degrees of freedom). But I also think that most of our efforts should go to reducing the FPR by improving the ROC curve. In general, I think statisticians can add to the conversation about 1) while continuing to collaborate on moving the red ROC curve up.

The researcher degrees of freedom - recipe tradeoff in data analysis

An important concept that is only recently gaining the attention it deserves is researcher degrees of freedom. From Simmons et al.:

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

So far, researcher degrees of freedom has primarily been used with negative connotations. This probably stems from the original definition of the idea which focused on how analysts could “manufacture” statistical significance by changing the way the data was processed without disclosing those changes. Reproducible research and distributed code would of course address these issues to some extent. But it is still relatively easy to obfuscate dubious analysis by dressing it up in technical language.

One interesting point that I think sometimes gets lost in all of this is the researcher degrees of freedom - recipe tradeoff. You could think of this as the bias-variance tradeoff for big data.

At one end of the scale you can allow the data analyst full freedom, in which case researcher degrees of freedom may lead to overfitting and open the door to manufactured statistical results (overly optimistic significance levels, point estimates, or confidence intervals). At the other end, you can require a recipe for every data analysis, which means it isn’t possible to adapt to the unanticipated quirks (missing data mechanisms, outliers, etc.) that may be present in an individual data set.

As with the bias-variance tradeoff, the optimal approach probably depends on your optimality criterion. You could imagine minimizing the mean squared error of a linear model without constraining the analyst’s degrees of freedom in any way (that might represent an analysis where the researcher tries all possible models, including all types of data munging, choices of which observations to drop, how to handle outliers, etc.) to get the absolute best fit. Of course, this would likely be a strongly overfit/biased model. Alternatively, you could penalize the flexibility allowed to the analyst. For example, you could minimize a weighted criterion like:

\sum_{i=1}^n (y_i - b_0 x_{i1} - b_1 x_{i2})^2 + \text{Researcher Penalty}(\vec{y},\vec{x})

Some examples of the penalties could be:

  • \lambda \times \sum_{i=1}^n 1_{\{\text{researcher dropped } (y_i, x_i) \text{ from analysis}\}}
  • \lambda \times \#\{\text{transforms tried}\}
  • \lambda \times \#\{\text{outliers removed ad hoc}\}

You could also combine all of the penalties together into an “elastic researcher net” type approach. Then as the collective penalty \lambda \rightarrow \infty you get the DSM, like you have in a clinical trial for example. As \lambda \rightarrow 0 you get fully flexible data analysis, which you might want for discovery.
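As a toy numerical sketch of this idea (the loss, the penalty terms, and the weights below are all made up for illustration, and I use an ordinary intercept model rather than the exact form above):

```r
# Toy "researcher penalty": residual sum of squares plus lambda times a count
# of discretionary analyst choices. Entirely illustrative.
researcher_penalized_rss <- function(fit, n_dropped, n_transforms, n_outliers,
                                     lambda = 1) {
  rss <- sum(residuals(fit)^2)
  rss + lambda * (n_dropped + n_transforms + n_outliers)
}

set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 2 * d$x1 - d$x2 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = d)

# Suppose the analyst reported dropping 3 observations, trying 2 transforms,
# and removing 1 outlier ad hoc, with a moderate penalty weight.
researcher_penalized_rss(fit, n_dropped = 3, n_transforms = 2, n_outliers = 1,
                         lambda = 5)
# As lambda -> Inf only a fully pre-specified analysis is tolerated;
# as lambda -> 0 the criterion reduces to the usual unpenalized fit.
```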

Of course if you allow researchers to choose the penalty you are right back to a scenario where you have degrees of freedom in the analysis (the problem you always get with any penalized approach). On the other hand it would make it easier to disclose how those degrees of freedom were applied.

Sunday data/statistics link roundup (7/28/13)

  1. An article in the Huffpo about a report claiming there is no gender bias in the hiring of physics faculty. I didn’t read the paper carefully, but I definitely agree with the quote from Prof. Dame Athene Donald that the comparison should be made to the number of faculty candidates on the market. I’d also be a little careful about touting my record of gender equality if only 13% of faculty in my discipline were women (via Alex N.).
  2. If you are the only person who hasn’t seen the upwardly mobile by geography article yet, here it is (via Rafa). Also covered over at the great “charts n things” blog.
  3. Finally, some good news on the science funding front: a Senate panel raises NSF’s budget by 8% (the link worked for me earlier but I was having a little trouble today). This is of course a positive development, and I think the article pairs very well with this provocative piece suggesting Detroit might have done better if it had a private research school.
  4. I’m going to probably talk about this more later in the week because it gets my blood pressure up, but I thought I’d just say again that hyperbolic takedowns of the statistical methods in specific papers in the popular press lead in only one direction.