Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Reproducibility at Nature

Nature has jumped onto the reproducibility bandwagon and has announced a new approach to improving the reproducibility of submitted papers. The new effort is focused primarily on methodology, including statistics, and on making sure that it is clear what an author has done.

To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.

To this end they have created a checklist for highlighting key aspects that need to be clear in the manuscript. A number of these points are statistical, and two specifically highlight data deposition and computer code availability. I think an important change is the following:

To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including Nature, will abolish space restrictions on the methods section.

I think this is particularly important because of the message it sends. Most journals have overall space limitations and some journals even have specific limits on the Methods section. This sends a clear message that “methods aren’t important, results are”. Removing space limits on the Methods section will allow people to just say what they actually did, rather than figure out some tortured way to summarize years of work into a smattering of key words.

I think this is a great step forward by a leading journal. The next step will be for Nature to stick to it and make sure that authors live up to their end of the bargain.

Reproducibility and reciprocity

One element of the entire discussion about reproducible research that I haven’t seen talked about very much is the potential for a lack of reciprocity. Even if scientists were not concerned about the possibility of getting scooped by others after making their data/code available, I think this issue alone would be sufficient to give people pause about making their work reproducible.

What do I mean by reciprocity? Consider the following (made up) scenario:

  1. I conduct a study (say, a randomized controlled trial, for concreteness) that I register at clinicaltrials.gov beforehand and specify details about the study like the design, purpose, and primary and secondary outcomes.
  2. I rigorously conduct the study, ensuring safety and privacy of subjects, collect the data, and analyze the data.
  3. I publish the results for the primary and secondary outcomes in the peer-reviewed literature where I describe how the study was conducted and the statistical methods that were used. For the sake of concreteness, let’s say the results were “significant” by whatever definition of significant you care to use and that the paper was highly influential.
  4. Along with publishing the paper I make the analytic dataset and computer code available so that others can look at what I did and, if they want, reproduce the result.

So far so good right? It seems this would be a great result for any study. Now consider the following possible scenarios:

  1. Someone obtains the data and the code from the web site where it is hosted, analyzes it, and then publishes a note claiming that the intervention negatively affected a different outcome not described in the original study (i.e. not one of the primary or secondary outcomes).
  2. A second person obtains the data, analyzes it, and then publishes a note on the web claiming that the intervention was ineffective for the primary outcome in the subset of participants who were male.
  3. A third person obtains the data, analyzes the data, and then publishes a note on the web saying that the study is flawed and that the original results of the paper are incorrect. No code, data, or details of their methods are given.

Now, how should one react to the follow-up note claiming the study was flawed? It’s easy to imagine a spectrum of possible responses ranging from accusations of fraud to staunch defenses of the original study. Because the original study was influential, there is likely to be a kerfuffle either way.

But what’s the problem with the three follow-up scenarios described? The one thing they have in common is that none of the three responding people were held to the same standards to which the original investigator (me) was subjected. I was required to register my trial and state the outcomes in advance. In an ideal world, you might argue, I should have stated my hypotheses in advance too. That’s fine, but the point is that the people analyzing the data subsequently were not required to do any of this. Why should they be held to a lower standard of scrutiny?

The first person analyzed a different outcome that was not a primary or secondary outcome. How many outcomes did they test before they came to that one negatively significant one? The second person examined a subset of the participants. Was the study designed (or powered) to look at this subset? Probably not. The third person claims fraud, but does not provide any details of what they did.

I think it’s easy to take care of the third person: just require that they make their work reproducible too. That way we can all see what they did and verify whether the study was in fact flawed. But the first two people are a little more difficult. If there are no barriers to obtaining the data, then they can just get the data and run a bunch of analyses. If the results don’t go their way, they can just move on and no one will be the wiser. If the results do go their way, they can try to publish something.

What I think a good reproducibility policy should have is a type of “viral” clause. For example, the GNU General Public License (GPL) is an open source software license that requires, among other things, that anyone who writes their own software, but links to or integrates software covered under the GPL, must publish their software under the GPL too. This “viral” requirement ensures that people cannot make use of the efforts of the open source community without also giving back to that community. There have been numerous heated discussions in the software community regarding the pros and cons of such a clause, with (large) commercial software developers often coming down against it. Open source developers have largely been skeptical of the arguments of large commercial developers, claiming that those companies simply want to “steal” open source software and/or maintain their dominance.

I think it is important, if we are going to make reproducibility the norm in science, that we have analogous “viral” clauses to ensure that everyone is held to the same standard. This is particularly important in policy-relevant or politically sensitive subject areas, where there are often parties involved who have essentially no interest (and are in fact paid to have no interest) in holding themselves to the same standard of scientific conduct.

Richard Stallman was right to assume that, without the copyleft clause in the GPL, large commercial interests would simply usurp the work of the free software community and essentially crush it before it got started. Reproducibility needs its own version of copyleft, or else scientists will be left to defend themselves against unscrupulous individuals who are not held to the same standard.

Sunday data/statistics link roundup (4/28/2013)

  1. What it feels like to be bad at math. My personal experience like this culminated in some difficulties with Green’s functions back in my early days at USU. I think almost everybody who does enough math eventually runs into a situation where they don’t understand what is going on and it stresses them out.
  2. An article about companies that are using data to try to identify people for jobs (via Rafa).
  3. Google trends for predicting the market. I’m not sure that “predicting” is the right word here. I think a better word might be “explaining/associating”. I also wonder if this could go off the rails.
  4. [This article](http://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/?utm_source=feedly&utm_medium=feed&utm_campaign=Feed:+RBloggers+(R+bloggers)) is a good one in terms of describing the ways that you can speed up R code. My favorite part of it is that it starts with the “why”. Exactly. Premature optimization is the root of all evil.
  5. A discussion of data science at Tumblr. The author/speaker also has a great blog.

Mindlessly normalizing genomics data is bad - but ignoring unwanted variability can be worse

Yesterday, and bleeding over into today, [quantile normalization](http://www.ncbi.nlm.nih.gov/pubmed/12538238) (QN) was being discussed on Twitter. This is the tweet that started the whole thing off. The conversation went a bunch of different directions and then this happened:

well, this happens all over bio-statistics - ie, naive use in seemingly undirected ways until you get a “good” pvalue. And then end

So Jeff and I felt it was important to respond, since we are biostatisticians who work in genomics. We felt a few points were worth making:

  1. Most statisticians we know, including us, know QN’s limitations and are always nervous about using QN. But with most datasets we see, unwanted variability is overwhelming and we are left with no choice but to normalize in order to extract anything useful from the data. In fact, many times QN is not enough and we have to apply further transformations, e.g., to remove batch effects.

  2. We would be curious to know which biostatisticians were being referred to. We would like some examples, because most of the genomic statisticians we know work very closely with biologists to aid them in cleaning dirty data to help them find real sources of signal. Furthermore, we encourage biologists to validate their results. In many cases, quantile normalization (or other transforms) are critical to finding results that validate, and there is a long literature (both biological and statistical) supporting the importance of appropriate normalization.

  3. Assuming that the data you get from high-throughput technologies (sequences, probe intensities, etc.) are a direct measurement of abundance is incorrect. Before worrying about QN (or other normalization) being an arbitrary transformation that distorts the data, keep in mind that what you want to measure has already been distorted by PCR, the imperfections of the microarray, scanner measurement error, image bleeding, cross-hybridization or alignment artifacts, ozone effects, etc.

To go into a little more detail about the reasons that normalization is important in many cases, I have written more below, with data, if you are interested.

Most, if not all, of the high-throughput data we have analyzed need some kind of normalization. This applies to both microarrays and next-gen sequencing. To demonstrate why, below I include boxplots of log intensities from 5 microarrays that were hybridized to the same RNA (technical replicates).

[Figure: boxplots of log intensities for the five technical replicate arrays]

See the problem? If we took the data at face value we would conclude that there is a large (almost two-fold) global change in expression when comparing, say, samples C and E. But they are technical replicates, so the observed difference is not biologically driven. Discrepancies like these are the rule rather than the exception. Biologists seem to underestimate the amount of unwanted variability present in the data they produce. Look at enough data and you will quickly learn that, in most cases, unwanted experimental variability dwarfs the biological differences we are interested in discovering. Normalization is the statistical technique that saves biologists millions of dollars a year by fixing this problem in silico rather than redoing the experiment.
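
To make this concrete, here is a little simulated sketch in R. These are not the arrays from the figure; the sample names and shift sizes are made up for illustration. Five “technical replicates” share the same underlying signal but carry array-specific offsets, and that alone produces boxplots that do not line up.

```r
## Simulated technical replicates: same underlying signal, array-specific shifts
set.seed(1)
true_log_expr <- rnorm(20000, mean = 8, sd = 2)                  # shared underlying log2 signal
array_shift   <- c(A = 0, B = 0.3, C = -0.5, D = 0.2, E = 0.5)   # unwanted per-array offsets (made up)
logs <- sapply(array_shift, function(s) true_log_expr + s + rnorm(20000, sd = 0.25))

boxplot(logs, ylab = "log2 intensity",
        main = "Technical replicates with array-specific shifts")
## Taken at face value, C vs E looks like a global ~2-fold change (1 unit on the
## log2 scale), even though all five arrays measured the same RNA.
```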

For the data above you might be tempted to simply standardize the data by subtracting the median. But the problem is more complicated than that, as shown in the plot below. This plot shows the log ratio (M) versus the average of the log intensities (A) for two technical replicates in which 16 probes (red dots) have been “spiked-in” to have true fold changes of 2. The other ~20,000 probesets (blue streak) are supposed to be unchanged (M=0). See the curvature of the genes that are supposed to be at 0? Taken at face value, thousands of the lowly expressed probes exhibit larger differential expression than the only 16 that are actually different. That’s a problem. And standardizing by subtracting the median won’t fix it. Non-linear biases such as this one are also quite common.

[Figure: MA plots (M vs. A) for two technical replicates, before and after normalization; spike-in probes in red]
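
Here is a similarly hedged sketch of the curvature, again with simulated data rather than the spike-in experiment from the figure. An additive background on one of two replicate arrays is enough to bend the streak away from M = 0 at low intensities, which is exactly the kind of bias that subtracting the median cannot remove.

```r
## Simulated MA plot for two technical replicates; an additive background on one
## array produces the non-linear bias at low intensities described above.
set.seed(2)
a1 <- 2^rnorm(20000, mean = 8, sd = 2)            # array 1 intensities
a2 <- (a1 + 50) * 2^rnorm(20000, sd = 0.1)        # array 2: same signal + background + noise

M <- log2(a2) - log2(a1)                          # log ratio
A <- (log2(a2) + log2(a1)) / 2                    # average log intensity
plot(A, M, pch = ".", main = "MA plot of two technical replicates")
abline(h = 0, col = "red")
## Subtracting median(M) re-centers the cloud but cannot straighten the hook at
## low A; that requires a non-linear correction such as QN or a loess fit.
```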

QN offers one solution to this problem if you can assume that the true distribution of what you are measuring is roughly the same across samples. Briefly, QN forces each sample to have the same distribution. The “after” picture above is the result of QN: it removes the curvature but preserves most of the real differences.
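
For the curious, here is a minimal from-scratch sketch of the QN idea in R. The production implementations live elsewhere (e.g. `normalize.quantiles` in the Bioconductor package preprocessCore); this toy version is only meant to show the mechanics: every sample is mapped onto a common reference distribution, the average of the sorted columns.

```r
## Toy quantile normalization: force every column (sample) to have the same distribution
quantile_normalize <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")  # rank of each value within its sample
  sorted <- apply(mat, 2, sort)                         # each sample sorted
  ref    <- rowMeans(sorted)                            # reference: mean of the order statistics
  apply(ranks, 2, function(r) ref[r])                   # map each value to the reference quantile
}

## Re-simulate the shifted replicates from the first sketch so this block is self-contained
set.seed(1)
true_log_expr <- rnorm(20000, mean = 8, sd = 2)
array_shift   <- c(A = 0, B = 0.3, C = -0.5, D = 0.2, E = 0.5)
logs <- sapply(array_shift, function(s) true_log_expr + s + rnorm(20000, sd = 0.25))

qn_logs <- quantile_normalize(logs)
boxplot(qn_logs, ylab = "log2 intensity", main = "After quantile normalization")
## The five boxplots now line up; the spurious "2-fold" difference between C and E is gone.
```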

So why should we be nervous? QN and other normalization techniques risk throwing the baby out with the bath water. What if there is a real global difference? If there is, and you use QN, you will miss it and you may introduce artifacts. But the assumptions are no secret, and it’s up to the biologists to decide if they are reasonable. At the same time, we have to be very careful about interpreting large-scale changes given that we see large-scale changes when we know there are none. Other than cases where global differences are forced or simulated, I have yet to see a good example in which QN causes more harm than good. I’m sure there are some real data examples out there, so if you have one please share, as I would love to use it as an example in class.

Also note that statisticians (including me) are working hard at devising ways to normalize without the need for such strong assumptions. Although in their first incarnation they were useless, current control-probe/transcript techniques are promising. We have used them in the past to normalize methylation data (a similar approach was used here for gene expression data). And then there is subset quantile normalization. I am sure there are others, and more to come. So, biologists, don’t worry: we have your backs and serve at your pleasure. In the meantime don’t be so afraid of QN: at least give it a try before you knock it.

Interview at Yale Center for Environmental Law & Policy

Interview with Roger Peng from YCELP on Vimeo.

A few weeks ago I sat down with Angel Hsu of the Yale Center for Environmental Law and Policy to talk about some of their work on air pollution indicators.

(Note: I haven’t moved; I still work at the John*s* Hopkins School of Public Health.)