Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

We need a statistically rigorous and scientifically meaningful definition of replication

Replication and confirmation are indispensable concepts that help define scientific facts. However, the way in which we reach scientific consensus on a given finding is rather complex. Although some press releases try to convince us otherwise, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made. These are not debunked either; they simply get relegated to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question. This process eventually leads to a result either being accepted by consensus or not. For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow-up studies. In none of these four cases was a simple definition of replication used to confirm or falsify the result. The same is true for most results for which there is consensus. Yet science moves on and continues to be an incomparable force for improving our quality of life.

Regulatory agencies, such as the FDA, are an exception since they clearly spell out a definition of replication. For example, to approve a drug they may require two independent, adequately powered clinical trials showing statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but these agencies do provide a clear-cut definition of replication.
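To make “adequately powered” concrete, here is a minimal sketch using power.t.test from base R’s stats package; the effect size, standard deviation, and power target are hypothetical numbers chosen only for illustration.

    # Hypothetical trial design: detect a mean treatment effect of 5 units
    # (outcome SD = 15) with 90% power at the 5% significance level.
    # power.t.test solves for whichever argument is left unspecified.
    power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.90)  # solves for n per arm

    # Conversely, the power achieved by a fixed sample size of 20 per arm
    power.t.test(n = 20, delta = 5, sd = 15, sig.level = 0.05)        # solves for power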

In response to a growing concern over a reproducibility crisis, projects such as the Open Science Collaboration have begun to systematically replicate published results. In a recent post, Jeff described one of their recent papers on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as “only 36% of psychology experiments replicate”. Note that the 36% figure comes from a definition of replication that mimics the definition used by regulatory agencies: results are considered replicated if a p-value < 0.05 was reached in both the original study and the replication. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. In general, defining replication for published studies is not as easy as it is for highly controlled clinical trials. However, one clear improvement over what is currently being done is to consider statistical power and effect sizes.
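To see how badly a power-blind, p-value-only definition of replication can behave, here is a minimal simulation sketch in R. The numbers are made up for illustration: every simulated replication studies a real effect (a one standard deviation difference between groups), yet with only 5 subjects per group most replications fail to reach p < 0.05.

    set.seed(1)
    # One simulated replication of a two-group study with a true effect
    replicate_once <- function(n, effect = 1) {
      x <- rnorm(n)                  # control group
      y <- rnorm(n, mean = effect)   # treated group, true effect present
      t.test(x, y)$p.value < 0.05    # did this replication reach "significance"?
    }

    # Proportion of 10,000 replications declared successful at each sample size
    mean(replicate(10000, replicate_once(n = 5)))    # underpowered: frequently fails to "replicate"
    mean(replicate(10000, replicate_once(n = 100)))  # well powered: almost always "replicates"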

To further illustrate this, let’s consider a very concrete example with real-life consequences. Imagine a loved one has a disease with a high mortality rate and asks for your help in evaluating the scientific evidence on treatments. Four experimental drugs are available, all with promising clinical trials that reported p-values < 0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p < 0.05). So which drug do you take? Let’s add a bit more information to help you decide. Here are the p-values for the original and replication trials:

Drug   Original p-value   Replication p-value   Replicated (p < 0.05)?
A      0.0001             0.001                 Yes
B      <0.000001          0.03                  Yes
C      0.03               0.06                  No
D      <0.000001          0.10                  No

Which drug would you take now? The information I have provided is based on p-values alone and is therefore missing a key piece of information: the effect sizes. Below I show the confidence intervals for the four original studies (left) and the four replication studies (right). Note that except for drug B, all confidence intervals intersect. In light of the figure below, which one would you choose?

[Figure: 95% confidence intervals for the four original trials (left) and the four replication trials (right).]

I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment’s effect estimate fell inside the original 95% confidence interval. I would definitely not go with A, since it provides only a marginal benefit, even though the trial found a statistically significant effect that was replicated. So the p-value-based definition of replication is worthless from a practical standpoint.
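As a rough sketch of the kind of check I have in mind, one can compare effect estimates directly and ask whether the replication estimate falls inside the original study’s 95% confidence interval, rather than asking whether both p-values cleared 0.05. The estimates and standard errors below are hypothetical and not taken from the figure.

    # Hypothetical effect estimates and standard errors
    original    <- list(estimate = 2.5, se = 0.6)
    replication <- list(estimate = 2.1, se = 0.9)

    # 95% confidence interval from the original study
    ci <- original$estimate + c(-1, 1) * qnorm(0.975) * original$se

    # Is the replication estimate consistent with the original study?
    replication$estimate >= ci[1] & replication$estimate <= ci[2]

    # And is the estimated effect large enough to matter, relative to its uncertainty?
    replication$estimate / replication$se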

It seems that before continuing the debate over replication, and certainly before declaring that we are in a reproducibility crisis, we need a statistically rigorous and scientifically meaningful definition of replication.  This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.


Footnote on reproducibility versus replication: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate an experiment, it has to be performed again, with a different random sample and a new set of measurement errors.

Theranos runs head first into the realities of diagnostic testing

The Wall Street Journal has published a lengthy investigation into the diagnostic testing company Theranos.

The company offers more than 240 tests, ranging from cholesterol to cancer. It claims its technology can work with just a finger prick. Investors have poured more than $400 million into Theranos, valuing it at $9 billion and her majority stake at more than half that. The 31-year-old Ms. Holmes’s bold talk and black turtlenecks draw comparisons to Apple Inc. cofounder Steve Jobs.

If ever there were a warning sign, the comparison to Steve Jobs has got to be it.

But Theranos has struggled behind the scenes to turn the excitement over its technology into reality. At the end of 2014, the lab instrument developed as the linchpin of its strategy handled just a small fraction of the tests then sold to consumers, according to four former employees.

One former senior employee says Theranos was routinely using the device, named Edison after the prolific inventor, for only 15 tests in December 2014. Some employees were leery about the machine’s accuracy, according to the former employees and emails reviewed by The Wall Street Journal.
In a complaint to regulators, one Theranos employee accused the company of failing to report test results that raised questions about the precision of the Edison system. Such a failure could be a violation of federal rules for laboratories, the former employee said.
With these kinds of stories, it's always hard to tell whether there's a real problem here or it's just a bunch of axe grinding. But one thing that's for sure is that people are talking, and probably not for good reasons.

Minimal R Package Check List

A little while back I had the pleasure of flying in a small Cessna with a friend and for the first time I got to see what happens in the cockpit with a real pilot. One thing I noticed was that basically you don’t lift a finger without going through some sort of check list. This starts before you even roll the airplane out of the hangar. It makes sense because flying is a pretty dangerous hobby and you want to prevent problems from occurring when you’re in the air.

That experience got me thinking about what might be the minimal check list for building an R package, a somewhat less dangerous hobby. First off, much has changed (for the better) since I started making R packages, and I wanted to have some clean documentation of the process, particularly with using RStudio’s tools. So I wiped my installations of both R and RStudio and started from scratch to see what it would take to get someone to build their first R package.

The list is basically a “pre-flight” list: the presumption here is that you actually know the important details of building packages, but need to make sure that your environment is set up correctly so that you don’t run into errors or problems. I find this is often a problem for me when teaching students to build packages because I focus on the details of actually making the packages (i.e. DESCRIPTION files, Roxygen, etc.) and forget that, way back when, I actually configured my environment to do this.

Pre-flight Procedures for R Packages

  1. Install most recent version of R
  2. Install most recent version of RStudio
  3. Open RStudio
  4. Install devtools package
  5. Click on Project -> New Project… -> New Directory -> R package
  6. Enter package name
  7. Delete boilerplate code and “hello.R” file
  8. Go to the “man” directory and delete the “hello.Rd” file
  9. In File browser, click on package name to go to the top level directory
  10. Click “Build” tab in environment browser
  11. Click “Configure Build Tools…”
  12. Check “Generate documentation with Roxygen”
  13. Check “Build & Reload” when Roxygen Options window opens -> Click OK
  14. Click OK in Project Options window

At this point, you’re clear to build your package, which obviously involves writing R code and Roxygen documentation, filling in the package metadata, and building/checking your package.
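For reference, here is a rough sketch of what that part of the workflow can look like from the R console using devtools (which calls roxygen2 under the hood). The file name and function below are made up for illustration, and the commands assume your working directory is the package directory that RStudio created.

    library(devtools)

    # Suppose R/hello_world.R contains a function documented with Roxygen comments:
    #   #' Say hello
    #   #'
    #   #' @param name A character string.
    #   #' @export
    #   hello_world <- function(name) paste("Hello,", name)

    document()   # generate man/*.Rd files and the NAMESPACE from the Roxygen comments
    load_all()   # load the package into the current session to try it out
    hello_world("world")
    check()      # run R CMD check to catch problems before sharing the package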

If I’m missing a step or have too many steps, I’d like to hear about it. But I think this is the minimum number of steps you need to configure your environment for building R packages in RStudio.

UPDATE: I’ve made some changes to the check list and will be posting future updates/modifications to my GitHub repository.

Profile of Data Scientist Shannon Cebron

The “This is Statistics” campaign has a nice profile of Shannon Cebron, a data scientist working at the Baltimore-based Pegged Software.

What advice would you give to someone thinking of a career in data science?

Take some advanced statistics courses if you want to see what it’s like to be a statistician or data scientist. By that point, you’ll be familiar with enough statistical methods to begin solving real-world problems and understanding the power of statistical science.  I didn’t realize I wanted to be a data scientist until I took more advanced statistics courses, around my third year as an undergraduate math major.

Not So Standard Deviations: Episode 2 - We Got it Under 40 Minutes

Episode 2 of my podcast with Hilary Parker, Not So Standard Deviations, is out! In this episode, we talk about user testing for statistical methods, navigating the Hadleyverse, the crucial significance of rename(), and the secret reason for creating the podcast (hint: it rhymes with “bee”). Also, I erroneously claim that Bill Cleveland is way older than he actually is. Sorry Bill.

In other news, we are finally on iTunes, so you can subscribe from there directly if you want (just search for “Not So Standard Deviations” or paste the link directly into your podcatcher).

Download the audio file for this episode.

Notes: