Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Correlation does not imply causation (parental involvement edition)

The New York Times recently published an article on education titled “Parental Involvement Is Overrated”. Most research in this area supports the opposite view, but the authors claim that “evidence from our research suggests otherwise”. Before you stop helping your children understand long division or correcting their grammar, you should learn about one of the most basic statistical concepts: correlation does not imply causation. The first two chapters of this very popular textbook describe the problem, and even Khan Academy has a class on it. As several of the commenters on the NYT article point out, the authors fail to make this distinction.

To illustrate the problem, imagine you want to know how effective tutoring is for students in a math class you are teaching. So you compare the test scores of students who received tutoring to those of students who did not. You find that receiving tutoring is correlated with lower test scores. So do you conclude that tutoring causes lower grades? Of course not! In this particular case we are confusing cause and effect: students who have trouble with math are much more likely to seek out tutoring, and this is what drives the observed correlation. With that example in mind, consider this quote from the New York Times article:

When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades…. Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.

A first question we would ask here is: how do we know that the children’s performance would not have been even worse had they not received help? I imagine the authors made use of _controls_: we compare the group that received the treatment (regular help with homework) to a control group that did not. But this brings up a more difficult question: how do we know that the treatment and control groups are comparable?

In a randomized controlled experiment, we would take a group of kids and randomly assign each one to the treatment group (will be helped with their homework) or control group (no help with homework). By doing this we can use probability calculations to determine the range of differences we expect to see by chance when the treatment has no effect.  Note that by chance one group may end up with a few more “better testers” than the other. However, if we see a big enough difference that can’t be explained by chance, then the alternative that the treatment is responsible for the observed differences becomes more believable.
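To make the "range of differences we expect to see by chance" concrete, here is a minimal sketch of a randomization (permutation) test in Python. The scores are made up for illustration, and the function name `randomization_test` is my own; it simply reshuffles the group labels many times to see how often chance alone produces a difference as large as the one observed.

```python
import random
import statistics


def randomization_test(treatment, control, n_shuffles=2000, seed=0):
    """Return the observed mean difference and the share of random
    reassignments whose absolute difference is at least as large."""
    rng = random.Random(seed)
    observed = statistics.mean(treatment) - statistics.mean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Reassign kids to groups at random, as if the treatment had no effect.
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:n_treatment])
                - statistics.mean(pooled[n_treatment:]))
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles


# Made-up test scores, for illustration only.
helped = [78, 85, 90, 72, 88, 95, 81, 79, 91, 86]
not_helped = [70, 65, 80, 74, 68, 77, 72, 69, 75, 71]

observed_diff, p_value = randomization_test(helped, not_helped)
print(f"observed difference: {observed_diff:.1f}")
print(f"share of shuffles at least as extreme: {p_value:.4f}")
```

If only a tiny share of the shuffles matches the observed difference, chance becomes a poor explanation. Note that this logic relies on the groups having been formed by random assignment in the first place.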

Given all the prior research (and common sense) suggesting that parent involvement, in its many manifestations, is in fact helpful to students, many would consider it unethical to run a randomized controlled trial on this issue (you would knowingly hurt the control group). Therefore, the authors are left with no choice but to use an _observational study_ to reach their conclusions. In this case, we have no control over who receives help and who doesn’t. Kids who require regular help with their homework are different in many ways from kids who don’t, even after correcting for all the factors mentioned. For example, one can envision how kids who have a mediocre teacher or have trouble with tests are more likely to be in the treatment group, while kids who naturally test well or go to schools that offer in-school tutoring are more likely to be in the control group.
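A small simulation shows how this kind of self-selection can flip the sign of a comparison. The sketch below is not the authors' data or analysis; every number in it (the ability distribution, the 80% and 10% chances of receiving help, the +5 point benefit) is an assumption chosen for illustration. Help genuinely raises every score, yet the helped group still looks worse because struggling kids are the ones who get help.

```python
import random
import statistics


def simulate_scores(n_students=2000, true_gain=5.0, seed=0):
    """Simulate scores when weaker students are more likely to receive help."""
    rng = random.Random(seed)
    helped, not_helped = [], []
    for _ in range(n_students):
        # Score the student would earn without any help (assumed distribution).
        baseline = rng.gauss(70, 10)
        # Assumption: struggling students get help 80% of the time, others 10%.
        p_help = 0.8 if baseline < 65 else 0.1
        noise = rng.gauss(0, 3)
        if rng.random() < p_help:
            helped.append(baseline + true_gain + noise)
        else:
            not_helped.append(baseline + noise)
    return statistics.mean(helped), statistics.mean(not_helped)


mean_helped, mean_not_helped = simulate_scores()
naive_difference = mean_helped - mean_not_helped
print(f"helped: {mean_helped:.1f}, not helped: {mean_not_helped:.1f}")
print(f"naive difference: {naive_difference:.1f} (simulated true gain: +5.0)")
```

The naive comparison comes out negative even though the simulated effect of help is positive for every student, which is the same trap as in the tutoring example.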

I am not an expert on education, but as a statistician I am skeptical of the conclusions of this data-driven article. In fact, I would recommend parents actually do get involved early on by, for example, teaching children that correlation does not imply causation.

Note that I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.

The #rOpenSci hackathon #ropenhack

Editor’s note: This is a guest post by Alyssa Frazee, a graduate student in the Biostatistics department at Johns Hopkins and a participant in the recent rOpenSci hackathon. 

Last week, I took a break from my normal PhD student schedule to participate in a hackathon in San Francisco. The two-day event was hosted by rOpenSci, an organization committed to developing R tools for open science. Working with several wonderful people from the R community was inspiring, humbling, and incredibly fun. So many great things happened in a two-day whirlwind: it would be impossible now to capture the whole thing in a narrative that would do it justice. So instead of a play-by-play, here are some of the quotes from the event that I’ve recently been reflecting on:

“The enemy isn’t R, Python, or Julia. The enemy is closed-source science.”

There have been some lively internet debates recently about mathematical and scientific computing languages. While conversations about these languages are interesting and necessary, the forest often gets lost for the trees: in the end, we are here to do good science, and we should use whatever makes that easiest. We should build strong, collaborative communities, both within languages and across them. A closed-source science mentality hinders this kind of collaboration. I thought one of the hackathon projects, an R kernel for the IPython notebook, especially exemplified a commitment to open science and to cross-language collaboration. It was so awesome to spend two days with R folks like this who genuinely enjoy working together, in any language, to make scientific computing better.

“Pair debugging is fun!”

This quote perfectly captures one of my favorite things about hackathons: genuine group work! During my time in graduate school, I've done most of my programming solo. I think this is the nature of getting a PhD: the projects have to be yours, and all the other PhD students are working on their solo projects. So I really enjoyed the hackathon because it facilitated true pair/group work: two or more peers working on the same project, in the same room, at the same time. I like this work strategy for many reasons:

• The rate at which I learn new things is high, since it's so easy to ask a question. Lots of time is saved by not having to sift through internet search results.

• Sometimes I find solo debugging to be pretty painful. But I think pair debugging is fun and satisfying: it's like an inspirational sports movie. It's you and me, the ragtag underdogs, against the computer, the evil bully from across town. Relatedly, the sweet sweet taste of victory is also shared.

• It's easier to stay focused on the task at hand. I'm not as easily distracted by email/Twitter/Facebook/blogs/the rest of the internet when I'm not coding alone.

My academic sister, Hilary, and I did a good amount of pair debugging during the hackathon, and I kept finding myself thinking "I wish this would have been possible while we were both grad students!" I think we both had lots of fun working together. For a short discussion of more fun aspects of pairing, here's a blog post I like. At the rOpenSci hackathon in particular, group work was especially awesome because we could ask questions in person to people who have written the libraries our projects depend on, or to RStudio developers, or to GitHub employees, or to potential users of the projects. Just some of the many joys of having lots of talented, friendly R programmers all in the same space!

“Want me to write some unit tests for your unit tests?”

During the hackathon, I primarily worked on a unit-testing package called testdat. Testdat provides functions that check for and fix common problems with tabular data, like UTF-8 characters and inconsistent missing data codes, with the overall goal of making data processing/cleaning more reproducible. The project was really good for a two-day hackathon, since it was small enough to almost finish in two days, and it was very modular: one person worked on the missing data checking functions, another worked on UTF-8 checking, a third wrote the tests for the finished functions (unit tests for unit tests!), etc. Also, it didn't require a lot of background knowledge in a specific subject area or a deep dive into an existing codebase: all it required were some coding skills and perhaps a frustrating experience with messy data in the past (for motivation).
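To give a flavor of the kinds of checks described above, here is a rough sketch in Python. It is not testdat's code or API (testdat is an R package). The function names, the set of missing-data codes, and the reading of "UTF-8 characters" as "non-ASCII characters" are all my own assumptions for illustration.

```python
# Hypothetical helpers for illustration; these are NOT testdat functions.
# Assumed set of strings that people commonly use to mean "missing".
MISSING_CODES = {"", "NA", "N/A", "na", "NULL", "null", "-999", "."}


def find_non_ascii(rows):
    """Return (row_index, column, value) for each cell with non-ASCII text."""
    problems = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            if isinstance(value, str) and not value.isascii():
                problems.append((i, column, value))
    return problems


def missing_codes_used(rows):
    """Map each column to the distinct missing-data codes that appear in it."""
    used = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and value.strip() in MISSING_CODES:
                used.setdefault(column, set()).add(value.strip())
    return used


def standardize_missing(rows):
    """Return a copy of the rows with every recognized code replaced by None."""
    return [
        {column: (None if isinstance(value, str)
                  and value.strip() in MISSING_CODES else value)
         for column, value in row.items()}
        for row in rows
    ]


# Tiny made-up table: one non-ASCII name and two different missing codes.
rows = [
    {"name": "Ana", "age": "34"},
    {"name": "Zoë", "age": "NA"},
    {"name": "Bob", "age": "-999"},
]

non_ascii = find_non_ascii(rows)
codes = missing_codes_used(rows)
cleaned = standardize_missing(rows)
print(non_ascii)
print(codes)
print(cleaned)
```

A column that reports more than one code (here `age` uses both `NA` and `-999`) is a sign that missing values were recorded inconsistently. Writing checks like these as functions is what makes them easy to unit test and rerun.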

Finding an appropriate project to work on was probably my biggest challenge at this hackathon. I spent the summer at Hacker School, where the days were structured similarly to how they were at the rOpenSci hackathon: there wasn't really any structure. In both scenarios, the minimal structure was intentional. Lots of great collaborative work can happen with a few free days of hacking. But with two free days at the hackathon (versus Hacker School's 50), it was much more important to choose a good project quickly and get coding. One way to do this would have been to arrive at the hackathon with a small project in hand (many people did this). My strategy, however, was to chat with a few different project groups for the first hour or two on day 1, and then stick with one of those groups for the rest of the time. It worked well -- as I mentioned above, testdat was a great project -- but I did feel some time pressure (internally!) to choose a small project quickly.

For a look at some of the other hackathon projects, check out rOpenSci's GitHub page, the hackathon GitHub page, project-specific posts on the rOpenSci blog, or the hackathon's live-tweet hashtag, #ropenhack.

“Why are there so many Minnesotans here?”

There were at least four hackathon attendees (out of 35-40 total) who either currently live in or hail from Minnesota. Talk about overrepresentation! We are everywhere.

“I love my job.”

I'm a late-stage PhD student, so the job market is looming closer with every passing day. When I meet new people working in statistics, genomics, data science, or another related field, I like to ask them whether they like their current work, how it compares to other jobs they've had, etc. Hackathon attendees had all kinds of jobs: academic researcher, industry scientist, freelancer, student, etc. Most of the responses to my inquiries about how they liked their work were "I love it." The situation made the job market seem exciting, rather than intimidating: among the hackathon attendees and folks from the SF data science community who hung out with us for a dinner, the jobs themselves were pretty heterogeneous, but the general enjoyment of the work seemed consistently high.

“What’s the future of R?”

I suppose we should have known that existential questions like this would come up when 40 passionate R people spend two straight days together. Our discussion of the future of R didn't really yield any definitive answers or predictions, but I think we have big dreams for what R's future will look like: vibrant, open, collaborative, and scientifically driven. If the hackathon atmosphere was any indication of R's future, I'm feeling pretty optimistic about where things are going.

In closing: we’re really grateful to the people and organizations that made the hackathon possible: rOpenSci, Karthik Ram, GitHub, the Sloan Foundation, and F1000 Research. Thanks for strengthening the R community, giving us the chance to meet each other outside of the internet, and helping us have a great time doing R, for science, together!

Writing good software can have more impact than publishing in high impact journals for genomic statisticians

Every once in a while we see computational papers published in science journals with high impact factors.  Genomics related methods appear quite often in these journals. Several of my junior colleagues express frustration that all their papers get rejected from these journals. I tell them that the same is true for most of my papers and remind them of these examples:

| Method | Journal | Year | #Citations |
|---|---|---|---|
| PLINK | AJHG | 2007 | 6481 |
| Bioconductor | Genome Biology | 2004 | 5973 |
| RMA | Biostatistics | 2003 | 5674 |
| limma | SAGMB | 2004 | 5637 |
| quantile normalization | Bioinformatics | 2003 | 4646 |
| Bowtie | Genome Biology | 2009 | 3849 |
| BWA | Bioinformatics | 2009 | 3327 |
| Loess normalization | NAR | 2002 | 3313 |
| qvalues | JRSS-B | 2002 | 2758 |
| tophat | Bioinformatics | 2008 | 1868 |
| vsn | Bioinformatics | 2002 | 1398 |
| GCRMA | JASA | 2004 | 1397 |
| MACS | Genome Biology | 2008 | 1277 |
| deseq | Genome Biology | 2010 | 1264 |
| CBS | Biostatistics | 2004 | 1051 |
| R/qtl | Bioinformatics | 2003 | 1027 |

Let me know of other examples in the comments.

update: I added one more to the list.

This is how an important scientific debate is being used to stop EPA regulation

Environmental regulation in the United States has protected human health for over 40 years. Since the Clean Air Act was enacted in 1970, levels of outdoor air pollution have dropped dramatically, changing the landscape of once heavily-polluted cities like Los Angeles and Pittsburgh. A 2011 cost-benefit analysis conducted by the U.S. Environmental Protection Agency estimated that the 1990 amendments to the CAA prevented 160,000 deaths and 13 million lost work days in the year 2010 alone. They estimated that the monetary benefits of the CAA were 30 times greater than the costs of implementing the regulations.

The benefits of environmental regulations like the CAA significantly outweigh their costs. But there are still costs, and those costs must be borne by someone. The burden is usually put on the polluters, such as the automobile and power generation industries, which have long fought any notion of air pollution regulation as a threat to their existence. Initially, as air pollution and health studies were still emerging, opponents of regulation often challenged the science itself, claiming flaws in the methodology, the measurements, or the interpretation. But when study after study demonstrated a connection between outdoor air pollution and a variety of health problems, it became increasingly difficult for critics to mount a credible challenge. Lawsuits are another tactic used by industry, with one case brought by the American Trucking Association going all the way to the U.S. Supreme Court.

The latest attack comes from the House of Representatives in the form of the Secret Science Reform Act, or H.R. 4102. In summary, the proposed bill requires that every scientific paper cited by the EPA to justify a new rule or regulation needs to be reproducible. What exactly does this mean? To answer that question we need to take a brief diversion into some recent important developments in statistical science.

The idea behind reproducibility is simple. All the data used in a scientific paper and all the computer code used to analyze that data should be made available to other researchers and the public. It may be surprising that much of this data actually isn’t already available. The primary reason most data isn’t available is that, until recently, most people didn’t ask scientists for their data. The data was often small and collected for a specific purpose, so other scientists and the general public just weren’t that interested. If a scientist were interested in checking the truth of a claim, she could simply repeat the experiment in her lab to see if the claim could be replicated.

The nature of science has changed quickly over the last three decades. There has been an explosion of data, fueled by the decreasing cost of data collection technologies and computing power. At the same time, increased access to sophisticated computing power has let scientists conduct more sophisticated analyses on their data. The massive growth in data and the increasing sophistication of the analyses has made communicating what was done in a scientific study more complicated.

The traditional medium of journal publications has proven to be inadequate for describing the important details of a data analysis. As a result, it has been said that scientific articles are merely the “advertising” for the research that was conducted. The real research is buried in the data and the computer code actually used to compute the results. Journals have traditionally not required that data or computer code be published along with papers. As a result, many important details may be lost and prevent key studies from being fully reproducible.

The explosion of data has also made completely replicating a large study by an independent scientist much more difficult and costly. A large study is expensive to conduct in the first place; there is usually little appetite or funding to repeat it.  The result is that much of published scientific research cannot be reproduced by other scientists because the necessary data and analytic details are not available to others.

The scientific community is currently engaged in a debate over how to improve reproducibility across all of science. You might be tempted to ask, why not just share the data? Even if we could get everyone to agree with that in principle, it’s not clear how to do it.

Imagine if everyone in the U.S. decided we were all going to share our movie collections, and suppose for the sake of this example that the movie industry did not object. How would it work? Numerous questions immediately arise. Where would all these movies be stored? How would they be transferred from one person to another? How would I know what movies everyone else had? If my movies are all on the old DVD format, do I need to convert them to some other format before I can share? My Internet connection is very slow, how can I download a 3 hour HD movie? My mother doesn’t use computers much, but she has a great movie collection that I think others should have access to. What should she do? And who is going to pay for all of this? While each question may have a reasonable answer, it’s not clear what is the optimal combination and how you might scale it to the entire country.

Some of you may recall that the music industry had a brilliant sharing service that essentially allowed everyone to share their music collections. It was called Napster. Napster solved many of the problems raised above except for one – they failed to survive. So even when a decent solution is found, there’s no guarantee that it will always be there.

As outlandish as this example may seem, minor variations on these exact questions come up when we discuss how to share scientific data. The volume of data being produced today is enormous and making all of it available to everyone is not an easy task. That’s not to say it is impossible. If smart people get together and work constructively, it is entirely possible that a reasonable approach could be found. But at this point, a credible long-term solution has yet to emerge.

This brings us back to the Secret Science Reform Act. The latest tactic by opponents of air quality regulation is to force the EPA to ensure that all of the studies that it cites to support new regulations are reproducible. A cursory reading of the bill gives the impression that the sponsors are genuinely concerned about making science more transparent to the public. But when one reads the language of the bill in the context of ongoing discussions about reproducibility, it becomes clear that the sponsors of the bill have no such goal in mind. The purpose of H.R. 4102 is to prevent the Environmental Protection Agency from proposing new regulations.

The EPA develops rules and regulations on the basis of scientific evidence. For example, the Clean Air Act requires EPA to periodically review the scientific literature for the latest evidence on the health effects of air pollution. The science the EPA considers needs to be published in peer-reviewed journals. This makes the EPA a key consumer of scientific knowledge and it uses this knowledge to make informed decisions about protecting public health. What the EPA is not is a large funder of scientific studies. The entire budget for the Office of Research and Development at EPA is roughly $550 million (fiscal 2014), or less than 2 percent of the budget for the National Institutes of Health (about $30 billion for fiscal 2014). This means EPA has essentially no influence over the scientists behind many of the studies it cites because it funds very few of those studies. The best the EPA can do is politely ask scientists to make their data available. If a scientist refuses, there’s not much the EPA can use as leverage.

The latest controversy to come up involves the Harvard Six Cities study published in 1993. This landmark study found a large difference in mortality rates comparing cities with high and low air pollution, even after adjusting for smoking and other factors. The House committee has been trying to make the data for this study publicly available so that it can ensure that regulations are “backed by good science”. However, the Committee has either forgotten or never knew that this particular study has been fully reproduced by independent investigators. In 2005, independent investigators found that they were “...able to reproduce virtually all of the original numerical results, including the 26 percent increase in all-cause mortality in the most polluted city (Steubenville, OH) as compared to the least polluted city (Portage, WI). The audit and validation of the Harvard Six Cities Study conducted by the reanalysis team generally confirmed the quality of the data and the numerical results reported by the original investigators.”

It would be hard to find an air pollution study that has been subject to more scrutiny than the Six Cities study. Even if you believed the Six Cities study was totally wrong, its original findings have been replicated numerous times since its publication, with different investigators, in different populations, using different analysis techniques, and in different countries. If you’re looking for an example where the science was either not reproducible or not replicable, sorry, but this is not your case study.

Ultimately, it is clear that the sponsors of this bill are cynically taking advantage of a genuine (but difficult) scientific debate over reproducibility to push a political agenda. Scientists are in agreement that reproducibility is important, but there is no consensus yet on how to make it happen for everyone. By forcing the EPA to ensure reproducibility of the science on which it bases regulation, lawmakers are asking the EPA to solve a problem that the entire scientific community has yet to figure out. The end result of passing a bill like H.R. 4102 is that the EPA will be forced to stop proposing any new regulation, handing a major victory to opponents of air quality standards and dealing a major blow to public health in the U.S.

Data Analysis for Genomics edX Course

Mike Love (@mikelove) and I have been working hard the past couple of months preparing a free online edX course on data analysis for genomics. Our target audience is the postdocs, graduate students and research scientists who are tasked with analyzing genomics data but don’t have any formal training. The eight-week course will start with the very basics, but will ramp up rather quickly and end with real-life workflows for genome variation, RNA-seq, DNA methylation, and ChIP-seq.

Throughout the course students will learn skills and concepts that provide a foundation for analyzing genomics data. Specifically, we will cover exploratory data analysis, basic statistical inference, linear regression, modeling with parametric distributions, empirical Bayes, multiple comparison corrections and smoothing techniques.

In the class we will make heavy use of computer labs. Almost every lecture is accompanied by an R markdown document that students can use to recreate the plots shown in the lectures. The HTML documents generated from these R markdown files will serve as a textbook for the class.

Questions will be discussed on online forums led by Stephanie Hicks (@stephaniehicks) and Jim MacDonald.

If you want to sign up, here is the link.