Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

A summary of the evidence that most published research is false

One of the hottest topics in science right now boils down to two main claims:

  • Most published research is false
  • There is a reproducibility crisis in science

The first claim is often stated in a slightly different way: that most results of scientific experiments do not replicate. I recently got caught up in this debate and I frequently get asked about it.

So I thought I’d do a very brief review of the reported evidence for the two perceived crises. An important point is that all of the scientists below have made their best effort to tackle a fairly complicated problem, and these are early days in the study of science-wise false discovery rates. But the take home message is that there is currently no definitive evidence one way or the other about whether most results are false.

  1. Paper: Why most published research findings are false. Main idea: People use hypothesis testing to determine if specific scientific discoveries are significant. This significance calculation is used as a screening mechanism in the scientific literature. Under assumptions about the way people perform these tests and report them, it is possible to construct a universe where most published findings are false positive results (a small numerical sketch of this argument appears after this list). Important drawback: The paper contains no real data; it is purely based on conjecture and simulation.
  2. Paper: Drug development: Raise standards for preclinical research. Main idea: Many drugs fail when they move through the development process. Amgen scientists tried to replicate 53 high-profile basic research findings in cancer and could only replicate 6. Important drawback: This is not a scientific paper. The study design, replication attempts, selected studies, and the statistical methods to define “replicate” are not defined. No data is available or provided.
  3. Paper: An estimate of the science-wise false discovery rate and application to the top medical literature. Main idea: The paper collects P-values from published abstracts of papers in the medical literature and uses a statistical method to estimate the false discovery rate proposed in paper 1 above. Important drawback: The paper only collected data from major medical journals, and only from the abstracts. P-values can be manipulated in many ways that could call into question the statistical results in the paper.
  4. Paper: Revised standards for statistical evidence. Main idea: The P-value cutoff of 0.05 is used by many journals to determine statistical significance. This paper proposes an alternative method for screening hypotheses based on Bayes factors. Important drawback: The paper is a theoretical and philosophical argument for simple hypothesis tests. The data analysis recalculates Bayes factors for reported t-statistics and plots the Bayes factor versus the t-statistic, then makes an argument for why one is better than the other.
  5. Paper: Contradicted and initially stronger effects in highly cited research. Main idea: This paper looks at studies that attempted to answer the same scientific question, where the second study had a larger sample size or a more robust (e.g. randomized trial) study design. Some effects reported in the second study do not exactly match the results from the first. Important drawback: The title does not match the results. 16% of studies were contradicted (meaning an effect in a different direction), 16% reported a smaller effect size, 44% were replicated, and 24% were unchallenged. So 16% + 44% + 24% = 84% were not contradicted. Lack of replication is also not proof of error.
  6. Paper: Modeling the effects of subjective and objective decision making in scientific peer review. Main idea: This paper considers a theoretical model for how referees of scientific papers may behave socially. They use simulations to point out how an effect called “herding” (basically peer-mimicking) may lead to biases in the review process. Important drawback: The model makes major simplifying assumptions about human behavior and supports these conclusions entirely with simulation. No data is presented.
  7. Paper: Repeatability of published microarray gene expression analyses. Main idea: This paper attempts to collect the data used in published papers and to repeat one randomly selected analysis from each paper. For many of the papers the data were either not available or were in a format that made it difficult/impossible to repeat the analysis performed in the original paper. The types of software used were also often not clear. Important drawback: This paper was written about 18 data sets from 2005-2006. This is both early in the era of reproducibility and not comprehensive in any way. It says nothing about the rate of false discoveries in the medical literature, but it does speak to the reproducibility of genomics experiments 10 years ago.
  8. Paper: Investigating variation in replicability: The “Many Labs” replication project (not yet published). Main idea: The idea is to take a bunch of published high-profile results and try to get multiple labs to replicate them. They successfully replicated 10 out of 13 results, and the distribution of results is about what you’d expect (see the embedded figure below). Important drawback: The paper isn’t published yet and it only covers 13 experiments. That being said, this is by far the strongest, most comprehensive, and most reproducible analysis of replication among all the papers surveyed here.
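
To make the screening argument behind papers 1 and 3 concrete, here is a minimal numerical sketch. None of the numbers come from the papers above: the fraction of truly non-null hypotheses, the power, and the 0.05 cutoff are illustrative assumptions, and the only point is how strongly the science-wise false discovery rate depends on them.

```python
# Minimal sketch of the screening argument (papers 1 and 3 above).
# All numbers are illustrative assumptions, not estimates from the papers:
#   prior_true = assumed fraction of tested hypotheses that are truly non-null
#   power      = assumed probability that a true effect reaches significance
#   alpha      = the usual 0.05 significance cutoff

def science_wise_fdr(prior_true, power, alpha=0.05):
    """Expected fraction of 'significant' findings that are false positives."""
    true_hits = prior_true * power           # true effects that pass the screen
    false_hits = (1 - prior_true) * alpha    # null effects that pass the screen
    return false_hits / (true_hits + false_hits)

for prior_true in (0.5, 0.1, 0.01):
    for power in (0.8, 0.2):
        fdr = science_wise_fdr(prior_true, power)
        print(f"P(true) = {prior_true:>4}, power = {power}: "
              f"FDR among significant results ~ {fdr:.2f}")
```

Under the optimistic assumptions (half of tested hypotheses true, high power) most significant findings are true; under the pessimistic ones (few true hypotheses, low power) most are false, which is exactly the kind of universe paper 1 constructs.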

I do think that the reviewed papers are important contributions because they draw attention to real concerns about the modern scientific process. Namely:

  • We need more statistical literacy
  • We need more computational literacy
  • We need to require code be published
  • We need mechanisms of peer review that deal with code
  • We need a culture that doesn’t use reproducibility as a weapon
  • We need increased transparency in review and evaluation of papers

Some of these have simple fixes (more statistics courses, publishing code); some are much, much harder (changing publication/review culture).

The Many Labs project (Paper 8) points out that statistical research is proceeding in a fairly reasonable fashion. Some effects are overestimated in individual studies, some are underestimated, and some are just about right. Regardless, no single study should stand alone as the last word about an important scientific issue. It obviously won’t be possible to replicate every study as intensely as those in the Many Labs project, but this is a reassuring piece of evidence that things aren’t as bad as some paper titles and headlines may make it seem.

Many labs data. Blue x's are original effect sizes. Other dots are effect sizes from replication experiments (http://rolfzwaan.blogspot.com/2013/11/what-can-we-learn-from-many-labs.html)

The Many Labs results suggest that the hype about the failures of science is, at the very least, premature. I think an equally important idea is that science has pretty much always worked with some number of false positive and irreplicable studies. This was beautifully described by Jared Horvath in this blog post from the Economist. I think the take home message is that regardless of the rate of false discoveries, the scientific process has led to amazing and life-altering discoveries.

Sunday data/statistics link roundup (12/15/13)

  1. Rafa (in Spanish) clarifying some of the problems with the anti-GMO crowd.
  2. Joe Blitzstein, most recently of #futureofstats fame, talks up data science in the Harvard Crimson (via Rafa). As Rebecca Nugent pointed out when she stopped by to visit us, class sizes in undergrad stats programs are blowing up!
  3. If you missed it, Michael Eisen dropped by to chat about open access (part 1/part 2). We talked about Randy Schekman, a recent Nobel prize winner who says he isn’t publishing in Nature/Science/Cell anymore. Professor Schekman did a Reddit AMA where he got grilled pretty hard about pushing a glamour open access journal, eLife, while dissing N/S/C, where he published a lot of stuff before winning the Nobel.
  4. The article I’ve been sent most over the last couple of weeks is this one. In it, Peter Higgs says he wouldn’t have had the time to think deeply enough to do the research that led to the boson’s discovery in the modern publish-or-perish academic system. But he got the prize, at least in part, because of the people who conceived/built the Large Hadron Collider and tested the theory with it. I’m much more inclined to believe someone would have come up with the boson theory in our current system than that someone would have built the LHC in a system without competitive pressure.
  5. I think this post raises some interesting questions about the Obesity Paradox, which says overweight people with diabetes may have a lower risk of death than normal-weight people. The analysis is obviously tongue-in-cheek, but I’d be interested to hear what other people think about whether it is a serious issue or not.

Simply Statistics Interview with Michael Eisen, Co-Founder of the Public Library of Science (Part 2/2)

Here is Part 2 of Jeff’s and my interview with Michael Eisen, Co-Founder of the Public Library of Science.

The key word in "Data Science" is not Data, it is Science

One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. The speaker claimed the data were “big data,” and a question from the audience was: “Well, that isn’t really big data, is it? It is only X gigabytes.”

While that exact question would elicit groans from most people who work with data, I think it highlights one of the key problems with the thinking around data science. Most people hyping data science have focused on the first word: data. They care about volume and velocity and whatever other buzzwords describe data that is too big for you to analyze in Excel. This hype about the size (relative or absolute) of the data being collected fed into the second category of hype: hype about tools. People threw around EC2, Hadoop, and Pig, and had huge debates about Python versus R.

But the key word in data science is not “data”; it is “science”. Data science is only useful when the data are used to answer a question. That is the science part of the equation. The problem with this view of data science is that it is much harder than the view that focuses on data size or tools. It is much, much easier to calculate the size of a data set and say “My data are bigger than yours” or to say, “I can code in Hadoop, can you?” than to say, “I have this really hard question, can I answer it with my data?”.

A few reasons it is harder to focus on the science than the data/tools are:

  1. John Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” You may have 100 GB of data and only 3 KB of it is useful for answering the real question you care about.
  2. When you start with the question you often discover that you need to collect new data or design an experiment to confirm you are getting the right answer.
  3. It is easy to discover “structure” or “networks” in a data set. There will always be correlations, for a thousand reasons, if you collect enough data (see the short simulated sketch after this list). Understanding whether these correlations matter for specific, interesting questions is much harder.
  4. Often the structure you found on the first pass is due to a phenomenon (measurement error, artifacts, data processing) that doesn’t answer an interesting question.
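
As a toy illustration of point 3, here is a short simulated sketch (pure noise, unrelated to any real data set): measure enough unrelated variables on a small number of samples and some pair will look strongly correlated by chance.

```python
# Toy illustration: with enough unrelated variables, some pairs will look
# strongly correlated purely by chance. Everything below is simulated noise.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variables = 20, 200
data = rng.normal(size=(n_samples, n_variables))   # pure noise, no real structure

corr = np.corrcoef(data, rowvar=False)             # all pairwise correlations
np.fill_diagonal(corr, 0)                          # drop the trivial r = 1 diagonal
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"strongest 'discovered' correlation: r = {corr[i, j]:.2f} "
      f"between variable {i} and variable {j}")
```

With 200 noise variables and only 20 samples, the largest pairwise correlation will typically be big enough to look like a finding, even though there is nothing there.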

The issue is that the hype around big data/data science will flame out (it is already starting to) if data science is only about “data” and not about “science”. The long-term impact of data science will be measured by the scientific questions we can answer with the data.

Simply Statistics Interview with Michael Eisen, Co-Founder of the Public Library of Science (Part 1/2)

Jeff and I had a chance to interview Michael Eisen, a co-founder of the Public Library of Science, HHMI Investigator, and a Professor at UC Berkeley. We talked with him about publishing in open access and how young investigators might publish in open access journals under the current system. Watch part 1 of the interview above.