Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Interview with Cole Trapnell of UW Genome Sciences

Cole Trapnell is an Assistant Professor of Genome Sciences at the University of Washington. He is the developer of multiple incredibly widely used tools for genomics including Tophat, Cufflinks, and Monocle. His lab at UW studies cell differentiation, reprogramming, and other transitions between stable or metastable cellular states using a combination of computational and experimental techniques. We talked to Cole as part of our ongoing interview series with exciting junior data scientists. 
SS: Do you consider yourself a computer scientist, a statistician, a computational biologist, or something else?

CT: The questions that get me up and out of bed in the morning the fastest are biology questions. I work on cell differentiation - I want to know how to define the state of a cell and how to predict transitions between states. That said, my approach to these questions so far has been to use new technologies to look at previously hard to access aspects of gene regulation.  For example, I’ve used RNA-Seq to look beyond gene expression into finer layers of regulation like splicing. Analyzing sequencing experiments often involves some pretty non-trivial math, computer science, and statistics.  These data sets are huge, so you need fast algorithms to even look at them. They all involve transforming reads into a useful readout of biology, and the technical and biological variability in that transformation needs to be understood and controlled for, so you see cool mathematical and statistical problems all the time. So I guess you could say that I’m a biologist, both experimental and computational. I have to do some computer science and statistics in order to do biology.

SS: You got a Ph.D. in computer science but have spent the last several years in a wet lab learning to be a bench biologist - why did you make that choice?

CT: Three reasons, mainly:

1) I thought learning to do bench work would make me a better overall scientist.  It has, in many ways, I think. It’s fundamentally changed the way I approach the questions I work on, but it’s also made me more effective in lots of tiny ways. I remember when I first got to John Rinn’s lab, we needed some way to track lots of libraries and other material.  I came up with some scheme where each library would get an 8-digit alphanumeric code generated by a hash function or something like that (we’d never have to worry about collisions!). My lab mate handed me a marker and said, “OK, write that on the side of these 12 micro centrifuge tubes”.  I threw out my scheme and came up with something like “JR_1”, “JR_2”, etc.  That’s a silly example, but I mention it because it reminds me of how completely clueless I was about where biological data really comes from.

2) I wanted to establish an independent, long-term research program investigating differentiation, and I didn’t want to have to rely on collaborators to generate data. I knew at the end of grad school that I wanted to have my own wet lab, and I doubted that anyone would trust me with that kind of investment without doing some formal training. Despite the now-common recognition by experimental biologists that analysis is incredibly important, there’s still a perception out there that computational biologists aren’t “real biologists”, and that computational folks are useful tools, but not the drivers of the intellectual agenda. That's of course not true, but I didn’t want to fight the stigma.

3) It sounded fun. I had one or two friends who had followed the “dry to wet” training trajectory, and they were having a blast. Seeing a result live under the microscope is satisfying in a way that I’ve rarely experienced looking at a computer screen.

SS: Do you plan to have both a wet lab and a dry lab when you start your new group? 

CT: Yes. I’m going to be starting my lab at the University of Washington in the department of Genome Sciences this summer, and it’s going to be a roughly 50/50 operation, I hope. Many of the labs there are set up that way, and there’s a real culture of valuing both sides. As a postdoc, I’ve been extremely fortunate to collaborate with grad students and postdocs who were trained as cell or molecular biologists but wanted to learn sequencing analysis. We’d train each other, often at great cost in terms of time spent solving “somebody else’s problem”.  I’m going to do my best to create an environment like that, the way John did for me and my lab mates.

SS: You are frequently on the forefront of new genomic technologies. As data sets get larger and more complicated how do we ensure reproducibility and replicability of computational results? 

CT: That’s a good question, and I don’t really have a good answer. You’ve talked a lot on this blog about the importance of making science more reproducible and how journals could change to make it so. I agree wholeheartedly with a lot of what you’ve said. I like the idea of “papers as packages”, but I don’t see it happening soon, because it’s a huge amount of extra work and there’s not a big incentive to do so. Doing so might make it easier to be attacked, so there could even be a disincentive! Scientists do well when they publish papers and those papers are cited widely. We have lots of ways to quantify “impact” - h-index, total citation count, how many times your paper is shared via twitter on a given day, etc. (Say what you want about whether these are meaningful measures).

We don’t have a good way to track who’s right and who’s wrong, or whose results are reproducible and whose aren’t, short of full blown paper retraction. Most papers aren’t even checked in a serious way. Worse, the papers that are checked are the ones that a lot of people see - few people spend precious time following up on tangential observations in low circulation journals. So there’s actually an incentive to publish “controversial” results in highly visible journals because at least you’re getting attention.

Maybe we need a Yelp for papers and data sets?  One where in order to dispute the reproducibility of the analysis, you’d have to provide the code *you* ran to generate a contradictory result?  There needs to be a genuine and tangible *reward* (read: funding and career advancement) for putting up an analysis that others can dive into, verify, extend, and learn from.

In any case, I think it’s worth noting that reproducibility is not a problem unique to computation - experimentalists have a hard time reproducing results they got last week, much less results that came from some other lab!  There’s all kinds of harmless reasons for that.  Experiments are hard.  Reagents come in bad lots. You had too much coffee that morning and can’t steady your pipet hand to save your life. But I worry a bit that we could spend a lot of effort making our analysis totally automated and perfectly reproducible and still be faced with the same problem.

SS: What are the interesting statistical challenges in single-cell RNA-sequencing? 

CT: Oh man, there are many.  Here’s a few:

1) There are some very interesting questions about variability in expression across cells, or within one cell across time. There’s clearly a lot of variability in the expression level of a given gene across cells. But there’s really no way right now to take “replicate” measurements of a single cell. What would that mean? With current technology, to make an RNA-Seq library from a cell, you have to lyse it. So that’s it for that cell. Even if you had a non-invasive way to measure the whole transcriptome, the cell is a living machine that’s always changing in ways large and small, even in culture. Would you consider repeated measurements “replicates”? Furthermore, how can you say that two different cells are “replicate” measurements of a single, defined cell state? Do such states even really exist?

For that matter, we don’t have a good way of assessing how much variability stems from technical sources as opposed to biological sources. One common way of assessing technical variability is to spike some alien transcripts at known concentrations into purified RNA before making the library, so you can see how variable your endpoint measurements are for those alien transcripts. But to do that for single-cell RNA-Seq, we’d have to actually spike transcripts *into* the nucleus of a cell before we lyse it and put it through the library prep process. Just doping it into the lysate’s not good enough, because the lysis itself might (and likely does) destroy a substantial fraction of the endogenous RNA in the cell. So there are some real barriers to overcome in order to get a handle on how much variability is really biological.

2) A second challenge is writing down what a biological process looks like at single cell resolution. I mean we want to write down a model that predicts the expression levels of each gene in a cell as it goes through some biological process. We want to be able to say this gene comes on first, then this one, then these genes, and so on. In genomics up until now, we’ve been in the situation where we are measuring many variables (P) from few measurements (N).  That is, N << P, typically, which has made this problem extremely difficult.  With single cell RNA-Seq, that may no longer be the case.  We can already easily capture hundreds of cells, and thousands of cells per capture is just around the corner, so soon, N will be close to P, and maybe someday greater.

Assume for the moment that we are capturing cells that are either resting at or transiting between well defined states. You can think of each cell as a point in a high-dimensional geometric space, where each gene is a different dimension.  We’d like to find those equilibrium states and figure out which genes are correlated with which other genes.  Even better, we’d like to study the transitions between states and identify the genes that drive them.  The curse of dimensionality is always going to be a problem (we’re not likely to capture millions or billions of cells anytime soon), but maybe we have enough data to make some progress. There’s interesting literature out there for tackling problems at this scale, but to my knowledge these methods haven’t yet been widely applied in biology.  I guess you can think of cell differentiation viewed at whole-transcriptome, single-cell resolution as one giant manifold learning problem.  Same goes for oncogenesis, tissue homeostasis, reprogramming, and on and on. It’s going to be very exciting to see the convergence of large scale statistical machine learning and cell biology.
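As a rough illustration of the "each cell is a point, each gene is a dimension" picture, here is a minimal sketch in Python. It is not Cole's method and not code from Monocle or any other tool mentioned here; the expression values, gene names, and function names are all made up for the example. It only shows the simplest version of "figure out which genes are correlated with which other genes", using a Pearson correlation between gene columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def gene_correlations(cells):
    """Return the gene-by-gene correlation matrix.

    `cells` is a list of rows; each row is one cell (a point in
    gene space) and each column is one gene (a dimension).
    """
    genes = list(zip(*cells))  # transpose: one tuple per gene
    p = len(genes)
    return [[pearson(genes[i], genes[j]) for j in range(p)] for i in range(p)]


# Hypothetical expression values: N = 4 cells, P = 3 genes.
# Real data would have hundreds of cells and thousands of genes.
gene_names = ["gene_a", "gene_b", "gene_c"]
cells = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 6.2, 5.5],
    [4.0, 7.9, 3.0],
]

corr = gene_correlations(cells)
for name, row in zip(gene_names, corr):
    print(name, [round(r, 2) for r in row])
```

In this toy data gene_a and gene_b rise together while gene_c falls, so the first two correlate positively and both correlate negatively with the third. With N much smaller than P, most entries of such a matrix would be poorly estimated, which is the difficulty described above.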

SS: If you could do it again would you do computational training then wet lab training or the other way around? 

CT: I’m happy with how I did things, but I’ve seen folks go the other direction very successfully.  My labmates Loyal Goff and Dave Hendrickson started out as molecular biologists, but they’re wizards at the command line now.

SS: What is your programming language of choice? 

CT: Oh, I’d say I hate them all equally 😉

Just kidding. I’ll always love C++. I work in R a lot these days, as my work has veered away from developing tools for other people towards analyzing data I’ve generated.  I still find lots of things about R to be very painful, but ggplot2, plyr, and a handful of other godsend packages make the juice worth the squeeze.