Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

New ways to follow Simply Statistics

In case you prefer to follow Simply Statistics using some other platforms, we’ve added two new features. First, we have an official Twitter feed that you can follow. We also have a new Facebook page that you can like. Please follow us and join the discussion!

Interview with Victoria Stodden

Victoria Stodden

Victoria Stodden is an assistant professor of statistics at Columbia University in New York City. She moved to Columbia after getting her Ph.D. at Stanford University. Victoria has made major contributions to the area of reproducible research and has been appointed to the NSF’s Advisory Committee for Infrastructure. She is the recent recipient of an NSF grant for “Policy Design for Reproducibility and Data Sharing in Computational Science”.

Which term applies to you: data scientist/statistician/analyst (or something else)?

Definitely statistician. My PhD is from the stats department at Stanford University.

How did you get into statistics/data science (e.g. your history)?

Since my undergrad days I’ve been motivated by problems in what’s called ‘social welfare economics.’ I interpret that as studying how people can best reach their potential, particularly how the policy environment affects outcomes. This includes the study of regulatory design, economic growth, access to knowledge, development, and empowerment. My undergraduate degree was in economics, and I thought I would carry on with a PhD in economics as well. I realized that folks with my interests were mostly doing empirical work so I thought I should prepare myself with the best training I could in statistics. Hence I chose to do a PhD in statistics to augment my data analysis capabilities as much as I could since I envisioned myself immersed in empirical research in the future.

What is the problem currently driving you?

Right now I’m working on the problem of reproducibility in our body of published computational science. This ties into my interests because of the critical role of knowledge and reasoning in advancing social welfare. Scientific research is becoming heavily computational and as a result the empirical work scientists do is becoming more complex and yet less tacit: the myriad decisions made in data filtering, analysis, and modeling are all recordable in code. In computational research there are so many details in the scientific process it is nearly impossible to communicate them effectively in the traditional scientific paper – rendering our published computational results unverifiable, if there isn’t access to the code and data that generated them.

Access to the code and data permits readers to check whether the descriptions in the paper correspond to the published results, and allows people to understand why independent implementations of the methods in the paper might produce differing results. It also puts the tools of scientific reasoning into people’s hands – this is new. For much of scientific research today all you need is an internet connection to download the reasoning associated with a particular result. Wide availability of the data and code is still largely a dream, but one the scientific community is moving towards.

Who were really good mentors to you? What were the qualities that really helped you?

My advisor, David Donoho, is an enormous influence. He is the clearest scientific thinker I have ever been exposed to. I’ve been so very lucky with the people who have come into my life. Through his example, Dave is the one who has had the most impact on how I think about and prioritize problems and how I understand our role as statisticians and scientific thinkers. He’s given me an example of how to do this and it’s hard to overstate his influence in my life.

What do you think are the barriers to reproducible research?

At this point, incentives. There are many concrete barriers, which I talk about in my papers and talks (available on my website http://stodden.net), but they all stem from misaligned incentives. If you think about it, scientists do lots of things they don’t particularly like in the interest of research communication and scientific integrity. I don’t know any computational scientist who really loves writing up their findings into publishable articles, for example, but they do. This is because the right incentives exist. A big part of the work I am doing concerns the scientific reward structure. For example, my work on the Reproducible Research Standard is an effort to realign the intellectual property rules scientists are subject to, to be closer to our scientific norms. Scientific norms create the incentive structure for the production of scientific research, providing rewards for doing things people might not do otherwise. For example, scientists have a long established norm of giving up all intellectual property rights over their work in exchange for attribution, which is the currency of success. It’s the same for sharing the code and data that underlie published results – not part of the scientific incentive and reward structure today, but becoming so through adjusting a variety of other factors like funding agency policy, journal publication policy, and expectations at the institutional level.

What have been some success stories in reproducible research?

I can’t help but point to my advisor, David Donoho. An example he gives is his release of http://www-stat.stanford.edu/~wavelab - the first implementation of wavelet routines in MATLAB, before MATLAB included its own wavelet toolbox. The release of the Wavelab code was a factor that he believes made him one of the top 5 most highly cited authors in Mathematics in 2000.

Hiring and promotion committees seem to be starting to distinguish candidates who recognize the importance of reproducibility and clear scientific communication from those who seem to be wholly innocent of these issues.

There is a nascent community of scientific software developers that is achieving remarkable success.  I co-organized a workshop this summer bringing some of these folks together, see http://www.stodden.net/AMP2011. There are some wonderful projects underway to assist in reproducibility, from workflow tracking to project portability to unique identifiers for results reproducible in the cloud. Fascinating stuff.

Can you tell us a little about the legal ramifications of distributing code/data?

Sure. Many aspects of our current intellectual property laws are quite detrimental to the sharing of code and data. I’ll discuss the two most impactful ones. Copyright creates exclusive rights vested in the author for original expressions of ideas – and it’s a default. What this means is that your expression of your idea – your code, your writing, figures you create – is by default copyrighted to you. So for your lifetime and 70+ years after that, you (or your estate) need to give permission for the reproduction and re-use of the work – this is exactly counter to the scientific norms of independent verification and building on others’ findings. The Reproducible Research Standard is a suite of licenses that permit scientists to set the terms of use of their code, data, and paper according to scientific norms: use freely but attribute. I have written more about this here: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4720221

In 1980 Congress passed the Bayh-Dole Act, which was designed to create incentives for access to federally funded scientific discoveries by securing ownership rights for universities with regard to inventions by their researchers. The idea was that these inventions could then be patented and licensed by the university, making otherwise unavailable technology available for commercial development. Notice that Bayh-Dole was passed on the eve of the computer revolution, and Congress could not have foreseen the future importance of code to scientific investigation or its subsequent susceptibility to patentability. The patentability of scientific code now creates incentives to keep the code hidden: to avoid creating prior art in order to maximize the chance of obtaining the patent, and to keep hidden from potential competitors any information that might be involved in commercialization. Bayh-Dole has created new incentives for computational scientists – those of startups and commercialization – that must be reconciled with traditional scientific norms of openness.

Related Posts: Jeff’s interviews with Daniela Witten and Chris Barr. Roger’s talk on reproducibility 

Free access publishing is awesome...but expensive. How do we pay for it?

I am a huge fan of open access journals. I think open access is good both for moral reasons (science should be freely available) and for more selfish ones (I want people to be able to read my work). If given the choice, I would publish all of my work in journals that distribute results freely.

But it turns out that for most open/free access systems, the publishing charges are paid by the scientists publishing in the journals. I did a quick scan and compiled this little table of how much it costs to publish a paper in different journals (here is a bigger table):

  • PLoS One: $1,350.00
  • PLoS Biology: $2,900.00
  • BMJ Open: $1,937.28
  • Bioinformatics (Open Access Option): $3,000.00
  • Genome Biology (Open Access Option): $2,500.00
  • Biostatistics (Open Access Option): $3,000.00

The first thing I noticed is that it costs at minimum about $1,500 to get a paper published open access. That may not seem like a lot of money, and most journals offer discounts to people who can’t pay. But it still adds up: this last year my group published 7 papers. If I paid for all of them to be published open access, that would be at minimum $10,500! That is half the salary of a graduate student researcher for an entire year. For a senior scientist that may be no problem, but for early career scientists, or scientists with limited access to resources, it is a big challenge.
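If you want to plug in your own numbers, here is a quick back-of-the-envelope sketch in Python. The fees are just the ones from the table above, and the paper count of 7 is my group’s output from last year; everything else is arithmetic.

```python
# Back-of-the-envelope open access cost estimate (fees taken from the table above).
fees = {
    "PLoS One": 1350.00,
    "PLoS Biology": 2900.00,
    "BMJ Open": 1937.28,
    "Bioinformatics (Open Access Option)": 3000.00,
    "Genome Biology (Open Access Option)": 2500.00,
    "Biostatistics (Open Access Option)": 3000.00,
}

papers_per_year = 7  # my group's papers last year

print(f"Cheapest listed fee: ${min(fees.values()):,.2f}")
print(f"{papers_per_year} papers at ~$1,500 each: ${papers_per_year * 1500:,.2f}")
print(f"{papers_per_year} papers at the priciest listed fee: ${papers_per_year * max(fees.values()):,.2f}")
```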

Publishers who are solely dedicated to open access (PLoS, BMJ Open, etc.) seem to have, on average, lower publication charges than journals that only offer open access as an option. I think part of this is that the journals that aren’t open access in general have to make up some of the profits they lose by making the articles free. I certainly don’t begrudge the journals these costs. They have to maintain the websites, format the articles, and run the peer review process. That all costs money.

A modest proposal

What I wonder is whether there is a better place for that money to come from. Here is one proposal (hat tip to Rafa): academic and other libraries pay a ton of money for subscriptions to journals like Nature and Science. They are also required to pay for journals in a large range of disciplines. What if, instead of investing this money in subscriptions for their university, academic libraries pitched in and subsidized the publication costs of open/free access journals?

If all university libraries pitched in, the cost for any individual library would be relatively small. It would probably be less than paying for subscriptions to hundreds of journals. At the same time, it would be an investment that would benefit not only the researchers at their school, but also the broader scientific community by keeping research open. Then neither the people publishing the work, nor the people reading it would be on the hook for the bill.

This approach is the route taken by ArXiv, a free database of unpublished papers. These papers haven’t been peer reviewed, so they don’t always carry the same weight as papers published in peer-reviewed journals. But there are a lot of really good and important papers in the database - it is an almost universally accepted pre-print server.

The other nice thing about ArXiv is that you don’t pay for article processing; the papers are published as is. The papers don’t look quite as pretty as they do in Nature/Science or even PLoS, but the process is also much cheaper. The only costs associated with making this a full-fledged peer-reviewed journal would be refereeing (which scientists do for free anyway) and editorial responsibilities (again, mostly volunteered by scientists).

I Gave A Talk On Reproducible Research Back In July

[youtube http://www.youtube.com/watch?v=aH8dpcirW1U]

I gave a talk on reproducible research back in July at the Applied Mathematics Perspectives workshop in Vancouver, BC.

In addition to the YouTube version, there’s also a Silverlight version where you can actually see the slides while I’m talking.

Guest Post: SMART thoughts on the ADHD 200 Data Analysis Competition

Note: This is a guest post by our colleagues Brian Caffo, Ani Eloyan, Fang Han, Han Liu, John Muschelli, Mary Beth Nebel, Tuo Zhao and Ciprian Crainiceanu. They won the ADHD 200 imaging data analysis competition. There has been some controversy around the results because one team obtained a higher score without using any of the imaging data. Our colleagues have put together a very clear discussion of the issues raised by the competition, so we are publishing it here to contribute to the discussion. Questions about this post should be directed to the Hopkins team leader Brian Caffo.
 

Background

Below we share some thoughts about the ADHD 200 competition, a landmark competition using functional and structural brain imaging data to predict ADHD status.

 

Note, we’re calling these “SMART thoughts” to draw attention to our working group, “Statistical Methods and Applications for Research in Technology” (www.smart-stats.org), though hopefully the acronym applies in the non-intended sense as well.

 
Our team was declared the official winner of the competition. However, a team from the University of Alberta scored a higher number of competition points but was disqualified for not having used imaging data. We have been in email contact with a representative of that team and have enjoyed the discussion. We found those team members to be gracious and to embody an energy and scientific spirit that are refreshing to encounter.
 
We expressed our sympathy to them, in that the process seemed unfair, especially given the vagueness of what qualifies as use of the imaging data. More on this thought below.
 
This brings us to the point of this note: concern over the narrative surrounding the competition, based on our reading of web pages, social media, and water cooler discussions.
 
We are foremost concerned with the unwarranted conclusion that because the team with the highest competition point total did not use imaging data, the overall scientific validity of using (f)MRI imaging data to study ADHD is now in greater doubt.
 
We stipulate that, like many others, we are skeptical of the utility of MRI data for tasks such as ADHD diagnoses. We are not arguing against such skepticism.
 
Instead we are arguing against using the competition results as if they were strong evidence for such skepticism.
 
We raise four points to argue against overreacting to the competition outcome with respect to the use of structural and functional MRI in the study of ADHD.

Point 1. The competition points are not an accurate measure of performance and scientific value.

Because the majority of the training set, and hence presumably the test set, consisted of typically developing (TD) subjects, the competition points favored specificity.
 
In addition, a correct label of TD yielded 1 point, while a correct ADHD diagnosis with incorrect subtype yielded .5 points.

These facts suggest a classifier that declares everyone as TD as a starting point. For example, if 60% of the 197 test subjects are controls, this algorithm would yield 118 competition points, better than all but a few entrants. In fact, if 64.5% or higher of the test set is TD, this algorithm wins over Alberta (and hence everyone else).
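Here is a quick sketch of that calculation in Python. The 60% and 64.5% figures are the hypothetical test-set compositions discussed above, and the scoring is simplified to the one-point-per-correct-TD-label rule described earlier; the Alberta team’s exact point total is not reproduced here.

```python
# Points earned by a trivial "label everyone TD" classifier, as a function of the
# (unknown) fraction of the 197 test subjects who are typically developing.
# Scoring assumed from the discussion above: 1 point for each correct TD label;
# ADHD subjects labeled TD earn nothing.
n_test = 197

def all_td_points(td_fraction, n=n_test):
    return td_fraction * n  # one point per correctly labeled TD subject

for frac in (0.60, 0.645):
    print(f"TD fraction {frac:.1%}: about {all_td_points(frac):.0f} competition points")
# 60.0% -> about 118 points; 64.5% -> about 127 points (the threshold at which,
# per the argument above, this baseline would outscore every entrant).
```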

In addition, competition points are random variables. It is human nature to interpret the anecdotal rankings of competitions as definitive evidence of superiority. This works fine as long as rankings are reasonably deterministic, but it is riddled with logical flaws when rankings are stochastic. Variability in rankings has a huge effect on the result of competitions, especially when highly tuned prediction methods from expert teams are compared. Indeed, in such cases the confidence intervals of the AUCs (or other competition criteria) overlap. The 5th or 10th place team may actually have had the most scientifically informative algorithm.
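To illustrate how noisy rankings of this size can be, here is a small simulation sketch. These are purely hypothetical classifiers with nearly identical true accuracy, not the competition entries; the point is only that the apparent “winner” on a 197-subject test set changes from draw to draw.

```python
import numpy as np

# Simulation sketch (hypothetical, not the competition data): ten classifiers with
# nearly identical true accuracy, each evaluated on an independent 197-subject
# test set. The observed "winner" varies substantially across draws.
rng = np.random.default_rng(0)
n_test = 197
true_acc = np.linspace(0.60, 0.62, 10)  # classifier 9 is truly (barely) the best

wins = np.zeros(10, dtype=int)
n_draws = 2000
for _ in range(n_draws):
    correct = rng.binomial(n_test, true_acc)  # number correct for each classifier
    wins[np.argmax(correct)] += 1

for i, w in enumerate(wins):
    print(f"classifier {i} (true accuracy {true_acc[i]:.3f}) ranks first in {w / n_draws:.1%} of draws")
```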

Point 2. Biologically valueless predictors were important.

Most importantly, contributing location (aka site) was a key determinant of prediction performance. Site is a proxy for many things: the demographics of the ADHD population in the site’s PI’s studies, the policies by which a PI chose to include data, scanner type, IQ measure, missing data patterns, data quality, and so on.

In addition to site, the presence of missing data and overall data quality also held potentially important information for prediction, despite being (biologically) unrelated to ADHD. The likely causality, if it exists, would point in the reverse direction (i.e. the presence of ADHD would result in a greater propensity for missing data and lower data quality, perhaps due to movement in the scanner).

This is a general fact regarding prediction algorithms, which do not intrinsically account for causal directions or biological significance.

Point 3. The majority of the imaging data is not prognostic.

Likely every entrant, and the competition organizers, were aware that the majority of the imaging data is not useful for predicting ADHD. (Here we use the term “imaging data” loosely, meaning raw and/or processed data.)   In addition, the imaging data are noisy. Therefore, use of these data introduced tens of billions of unnecessary numbers to predict 197 diagnoses.

As such, even if extremely important variables are embedded in the imaging data, (non-trivial) use of all of the imaging data could degrade performance, regardless of the ultimate value of the data.

To put this in other words, suppose all entrants were offered an additional 10 billion numbers, say genomic data, known to be noisy and, in aggregate, not predictive of disease. However, suppose that some unknown function of a small collection of variables was very meaningful for prediction, as is presumably the case with genomic data. If the competition did not require its use, a reasonable strategy would be to avoid using these data altogether.

Thus, in a scientific sense, we are sympathetic to the organizers’ choice to eliminate the Alberta team, since a primary motivation of the competition was to encourage a large set of eyes to sift through a large collection of very noisy imaging data.

Of course, as stated above, we believe that what constitutes a sufficient use of the imaging data is too vague to be an adequate rule to eliminate a team in a competition.

Thus our scientifically motivated support of the organizers conflicts with our procedural dispute of the decision made to eliminate the Alberta team.

Point 4. Accurate prediction of a response is neither necessary nor sufficient for a covariate to be biologically meaningful.

Accurate prediction of a response is an extremely high bar for a variable of interest. Consider drug development for ADHD. A drug does not have to demonstrate that its application to a collection of symptomatic individuals would predict, with high accuracy, a later abatement of symptoms. Instead, a successful drug would have to demonstrate a mild average improvement over a placebo or standard therapy when randomized.

As an example, consider randomly administering such a drug to 50 of 100 subjects who have ADHD at baseline.  Suppose data are collected at 6 and 12 months. Further suppose that 8 out of 50 of those receiving the drug had no ADHD symptoms at 12 months, while 1 out of 50 of those receiving placebo had no ADHD symptoms at 12 months. The Fisher’s exact test P-value is .03, by the way.  
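For the curious, that p-value is easy to check; here is a minimal sketch using scipy’s fisher_exact on the 2x2 table described above.

```python
from scipy.stats import fisher_exact

# Rows: drug, placebo. Columns: symptom-free at 12 months, still symptomatic.
table = [[8, 42],
         [1, 49]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, two-sided p = {p_value:.3f}")  # p is about .03
```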

The statistical evidence points to the drug being effective. Knowledge of drug status, however, would do little to improve prediction accuracy. That is, given a new data set of subjects with ADHD at baseline and knowledge of drug status, the most accurate classification for every subject would be to guess that they will continue to have ADHD symptoms at 12 months.  Of course, our confidence in that prediction would be slightly lower for those having received the drug.

However, consider using ADHD status at 6 months as a predictor. This would be enormously effective at locating those subjects who have an abatement of symptoms whether they received the drug or not. In this thought experiment, one predictor (symptoms at 6 months) is highly predictive, but not meaningful (it simply suggests that Y is a good predictor of Y), while the other (presence of drug at baseline) is only mildly predictive, but is statistically and biologically significant.
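A tiny sketch of that thought experiment, using the hypothetical counts above: the accuracy-maximizing 0-1 rule predicts the majority outcome within each group, and since the majority outcome is “still has ADHD” in both arms, knowing drug status does not change a single prediction.

```python
# Hypothetical trial from the example above: (symptom-free at 12 months, group size).
groups = {"drug": (8, 50), "placebo": (1, 50)}

total = sum(n for _, n in groups.values())

# Ignoring drug status, the best 0-1 prediction is "still has ADHD" for everyone.
correct_ignoring_drug = sum(n - improved for improved, n in groups.values())

# Using drug status, the best rule predicts the majority outcome within each group,
# which is still "still has ADHD" in both arms, so no prediction changes.
correct_using_drug = sum(max(improved, n - improved) for improved, n in groups.values())

print(f"accuracy ignoring drug status: {correct_ignoring_drug / total:.0%}")  # 91%
print(f"accuracy using drug status:    {correct_using_drug / total:.0%}")     # still 91%
```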

As another example, consider the ADHD200 data set. Suppose that a small structural region is highly impacted in an unknown subclass of ADHD. Some kind of investigation of morphometry or volumetrics might detect an association with disease status. The association would likely be weak, given absence of a-priori knowledge of this region or the subclass. This weak association would not be useful in a prediction algorithm. However, digging into this association could potentially inform the biological basis of the disease and further refine the ADHD phenotype.

Thus, we argue that it is important to differentiate the ultimate goal of obtaining high prediction accuracy from that of biological discovery of complex mechanisms in the presence of high-dimensional data.

Conclusions

We urge caution in over-interpretation of the scientific impact of the University of Alberta’s strongest performance in the competition.  

Ultimately, what Alberta’s having the highest point total established is that they are fantastic people to talk to if you want to achieve high prediction accuracy. (Looking over their work, this appears to have already been established prior to the competition :-).

It was not established that brain structure or resting state function, as measured by MRI, is a blind alley in the scientific exploration of ADHD.

Related Posts: Roger on “Caffo + Ninjas = Awesome”, Rafa on the “Self Assessment Trap”, Roger on “Private health insurers to release data”