04 Jun 2015
Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.
See the full interview at the Class Central site. Below is short excerpt.
01 Jun 2015
Editor’s note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you’d like to ask, you can tweet them @simplystats and I’ll see if I can work them in. After the livestream we’ll leave the video on Youtube so you can check out the interview if you can’t watch the live stream. I’m embedding the Youtube video here but if you can’t see the live stream when it is running go check out the event page: https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o.
28 May 2015
Editor’s note: This post was inspired by a really awesome career planning guide that Ben Langmead Editor’s note: This post was inspired by a really awesome career planning guide that Ben Langmead which you should go check out right now. You can also find the slightly adapted Leek group career planning guide here.
The most common reason that people go into science is altruistic. They loved dinosaurs and spaceships when they were a kid and that never wore off. On some level this is one of the reasons I love this field so much, it is an area where if you can get past all the hard parts can really keep introducing wonder into what you work on every day.
Sometimes I feel like this altruism has negative consequences. For example, I think that there is less emphasis on the career planning and development side in the academic community. I don’t think this is malicious, but I do think that sometimes people think of the career part of science as unseemly. But if you have any job that you want people to pay you to do, then there will be parts of that job that will be career oriented. So if you want to be a professional scientist, being brilliant and good at science is not enough. You also need to pay attention to and plan carefully your career trajectory.
A colleague of mine, Ben Langmead, created a really nice guide for his postdocs to thinking about and planning the career side of a postdoc which he has over on Github. I thought it was such a good idea that I immediately modified it and asked all of my graduate students and postdocs to fill it out. It is kind of long so there was no penalty if they didn’t finish it, but I think it is an incredibly useful tool for thinking about how to strategize a career in the sciences. I think that the more we are concrete about the career side of graduate school and postdocs, including being honest about all the realistic options available, the better prepared our students will be to succeed on the market.
You can find the Leek Group Guide to Career Planning here and make sure you also go check out Ben’s since it was his idea and his is great.
20 May 2015
In a 2005 OMICS paper, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that “any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue”. Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc… Two follow-up papers (here and here) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different and thus measurement platform and species were completely confounded. In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than the those expressed in common between two species across the different tissues (see Figure 2 here).
So what is confounding and why is it a problem? This topic has been discussed broadly. We wrote a review some time ago. But based on recent discussions I’ve participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use
- Xi to represent the gene expression measurements for human tissue i,
- aX to represent the level of expression that is specific to humans and
- bX to represent the batch effect introduced by the use of the human microarray platform.
- Therefore Xi =aX + bX + ei, with ei the tissue i effect and other uninteresting sources of variability.
Similarly, we will use:
- Yi to represent the measurements for mouse tissue i
- aY to represent the mouse specific level and
- bY the batch effect introduced by the use of the mouse microarray platform.
- Therefore Yi = aY+ bY + fi, with fi tissue i effect and other uninteresting sources of variability.
If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:
aY - aX
Naively, we would think that we can estimate this quantity using the observed differences between the species that cancel out the tissue effect. We observe a difference for each tissue: Y1 - X1 , Y2 - X2 , etc… The problem is that aX and bX are always together as are aY and bY. We say that the batch effect bX is confounded with the species effect aX. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:
Yi - Xi = (aY - aX) + (bY - bX) + other sources of variability
and then estimate the unknown quantities of interest: (aY - aX) and (bY - bX) from the observed data Y1 - X1, Y2 - X2, etc... The problem is that, we can estimate the aggregate effect (aY - aX) + (bY - bX), but, mathematically, we can't tease apart the two differences. To see this note that if we are using least squares, the estimates (aY - aX) = 7, (bY - bX)=3 will fit the data exactly as well as (aY - aX)=3,(bY - bX)=7 since
{(Y-X) -(7+3))^2 = {(Y-X)- (3+7)}^2.
In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is to try to find a unique solution to the equations m+n = 0. If batch and species are not confounded then we are able to tease apart differences just as if we were given another equation: m+n=0; m-n=2. You can learn more about this in this linear models course.
Note that the above derivation apply to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species we may see large differences even where there is no species effect. This is because the bY - bX differences for each gene are squared and added up.
In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a 2010 Nature Genetics Review about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (Combat, SVA, RUV, etc...) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach.
Almost a decade later after the OMICS paper was published, the same surprising conclusion was reached in this PNAS paper: "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNAseq was used for both species and therefore the different platform issue was not considered*. Therefore, the authors implicitly assumed that (bY - bX)=0. However, in a recent F1000 Research publication Gilad and Mizrahi-Man describe describe an exercise in forensic bioinformatics that led them to discover that mice and human samples were run in different lanes or different instruments. The confounding was near perfect (see Figure 1). As pointed out by these authors, with this experimental design we can't simply accept that (bY - bX)=0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a linear model (ComBat) to account for the batch/species effect and find that samples cluster almost perfectly by tissue. However, Gilad and Mizrahi-Man correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, due to the experimental design it will be hard or impossible to determine if it's batch or if it's species. More data and more analyses are needed.
Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiments, learn about randomization.
* The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes, with different sequences and thus can lead to different biases during experimental protocols.
Update: Shin Lin has repeated a small version of the experiment described in the PNAS paper. The new experimental design does not confound lane/instrument with species. The new data confirms their original results pointing to the fact that lane/instrument do not explain the clustering by species. You can see his response in the comments here.
18 May 2015
Editor’s note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I’ve adapted part of one of the chapters into this blogpost. It is pretty raw but hopefully gets the idea across.
An episode of_ The Daily Show with Jon Stewart_ featured physicist Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.
Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, that we have been unable to move away from fossil fuel-based engines. The question led to the exchange:
Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”
**
** Stewart: ”Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”
Professor Randall is a world leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences.2 But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect - it shows that even brilliant scientists can just be interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.
Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. Professor Randall is a scientist - so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true, Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist if the scientist wasn’t already an expert in the area. Sure a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, economic problems in addition to the physics involved.
This is an example of “residual expertise” - when people without deep scientific training are willing to attribute expertise to scientists even if it is outside their primary area of focus. It is closely related to the logical fallacy behind the argument from authority:
A is an authority on a particular topic
A says something about that topic
A is probably correct
the difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important, it is how quacks make their living. The logical leap of faith from “that person is a doctor” to “that person is a doctor so of course they understand epidemiology, or vaccination, or risk communication” is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports key ideas in the field and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.
What does this have to do with you?
Most people aren’t necessarily experts in scientific disciplines they care about. But becoming a successful amateur requires a much smaller time commitment than becoming an expert, but can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don’t fully understand - you can still learn a ton and have a pretty fun time doing it.