Simply Statistics: a statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The tyranny of the idea in science

There are a lot of analogies between startups and academic science labs. One thing that is definitely different is the relative value of ideas in the startup world versus the academic world. For example, Paul Graham has said:

Actually, startup ideas are not million dollar ideas, and here’s an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there’s no market for startup ideas suggests there’s no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.

In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven’t gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:

  1. Higgs boson - Peter Higgs postulated the boson in 1964 and won the Nobel Prize in 2013 for that prediction. In between, tons of people did follow-on work: someone convinced Europe to build one of the most expensive pieces of scientific equipment ever constructed, and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.
  2. Human genome - Watson and Crick postulated the structure of DNA in 1953 and won the Nobel Prize in Physiology or Medicine in 1962 for this work. But the real value of the human genome was realized when the largest biological collaboration in history sequenced it, along with all of the subsequent work of the genomics revolution.

These are two large-scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea rather than the hard work of realizing it. I call this “the tyranny of the idea.” I notice a similar issue on a much smaller scale, for example when people don’t recognize software as a primary product of science. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure, the ideas are good, but it isn’t clear that they wouldn’t eventually have been discovered by someone else; we certainly aren’t going to build another Large Hadron Collider. I’d like to see the scales tip back the other way a bit, so that we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.

Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously

Joe Pickrell has an interesting new paper out about Mendelian randomization. He discusses some of the issues that come up with these studies and performs a mini-review of previously published studies that use the technique.

The basic idea behind Mendelian randomization is the following. In a simple, randomly mating population, Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele you get (the genetic material you inherit) is assigned at random. At the chromosome level this is very close to true, due to the properties of meiosis (here is an example of how this looks, in very cartoonish form, in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak’s group, in which they took two strains of yeast and repeatedly mated them, then measured genotype and gene expression data. The experimental design looked like this:

[Figure: experimental design of the yeast cross]

If you look at the allele inherited from the two parental strains (BY, RM) at two separate genes on different chromosomes in each of the 112 segregants (yeast offspring), the alleles do appear to be random and independent:

[Figure: alleles at two loci on different chromosomes across the 112 segregants show no dependence]

So this is a randomized trial in yeast, where each yeast strain was randomized to many, many genetic “treatments” simultaneously. Now this isn’t strictly true, since nearby genes on the same chromosome aren’t inherited independently, and in humans it is definitely not true, since there is population structure, non-random mating, and a host of other issues. But you can still do cool things to try to infer causality from the genetic “treatments” to downstream things like gene expression (and even do a reasonable job in the model organism case).
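The idealized version of this claim is easy to check in simulation. Here is a minimal sketch (made-up genotypes, not the real BY/RM data): it draws 112 segregants, assigns the BY or RM allele independently at two unlinked loci, and checks the resulting 2x2 table for independence.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_segregants = 112

# At each unlinked locus, a segregant inherits the BY or RM allele
# with probability 1/2, independently of the other locus.
locus1 = rng.choice(["BY", "RM"], size=n_segregants)
locus2 = rng.choice(["BY", "RM"], size=n_segregants)

# Cross-tabulate the alleles at the two loci.
table = np.array([[np.sum((locus1 == a) & (locus2 == b))
                   for b in ("BY", "RM")] for a in ("BY", "RM")])
print(table)

# A chi-square test of independence should (usually) not reject.
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```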

In my mind this suggests a potentially interesting study design for clinical trials. Suppose there are 10 treatments for a disease that we know about. We design a study where each patient in the trial is randomized to receive treatment or placebo for each of the 10 treatments, so on average each person would get 5 treatments. Then you could try to tease apart the effects using methods developed for the Mendelian randomization case; a small simulation of the idea appears below. Of course, this ignores potential interactions, side effects of taking multiple drugs simultaneously, etc. But I’m seeing lots of interesting proposals for new trial designs (which may or may not work), so I thought I’d contribute my own.
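Here is a minimal simulation sketch of the design under those same simplifying assumptions (hypothetical effect sizes, no interactions or side effects): each patient is independently randomized to each of the 10 drugs, and a single least squares regression teases the effects apart.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_treatments = 1000, 10

# Each patient is independently randomized to treatment (1) or placebo (0)
# for each of the 10 drugs, so each patient gets 5 treatments on average.
X = rng.integers(0, 2, size=(n_patients, n_treatments))

# Hypothetical true effects: only treatments 0 and 3 do anything.
beta = np.zeros(n_treatments)
beta[0], beta[3] = 2.0, -1.5
y = X @ beta + rng.normal(0, 1, n_patients)

# With independent randomization, ordinary least squares recovers each
# treatment's effect even though everyone takes several drugs at once.
X1 = np.column_stack([np.ones(n_patients), X])
est = np.linalg.lstsq(X1, y, rcond=None)[0]
print(np.round(est[1:], 2))  # close to beta
```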

Rafa's citations above replacement in statistics journals is crazy high.

Editor’s note: I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously.

I downloaded data from Web of Science on the 400 most cited papers published between 2000 and 2010 in several statistical journals. Here is a boxplot of the average number of citations per year (from publication date to 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.

[Figure: boxplots of average citations per year for the 400 most cited papers in each journal]

Several interesting things jump out from this graph right away. One is that JASA has the highest median number of citations but fewer “big hits” (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, average around 3 citations/year, but a few lucky winners get 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren’t huge hits. I suspect these are the papers that solve one problem well but don’t solve the most general problem ever.
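For anyone who wants to reproduce something like this plot, here is a rough sketch of the calculation. This is not the code behind the figure; it assumes a hypothetical CSV export from Web of Science (wos_top400.csv) with one row per paper and columns journal, year, and citations.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical Web of Science export, one row per paper:
# columns "journal", "year" (publication year), "citations" (total count).
papers = pd.read_csv("wos_top400.csv")

# Average citations per year, from publication date to 2015.
papers["cites_per_year"] = papers["citations"] / (2015 - papers["year"])

papers.boxplot(column="cites_per_year", by="journal", rot=90)
plt.ylabel("Average citations per year")
plt.show()
```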

Something that jumps out from that plot is the outlier for the journal Biostatistics. One of its papers is cited 367.85 times per year. The nearest competitor is at 67.75, and the outlier is 19 standard deviations above the mean! The paper in question is “Exploration, normalization, and summaries of high density oligonucleotide array probe level data,” the paper that introduced RMA, one of the most popular methods ever created for pre-processing microarrays. It was written by Rafa and colleagues. It made me think of the baseball statistic “wins above replacement,” which quantifies how many extra wins a team gets by playing a specific player in place of a league-average replacement.

What about a “citations per year above replacement” (CAR/Y) statistic, where for each journal you calculate:

(median citations/year for Author X’s papers in that journal) - (median citations/year for a paper in that journal)

Then you average this number across journals. This attempts to quantify how many extra citations per year a person’s papers generate compared to the “average” paper in that journal. For Rafa the numbers look like this:

  • Biostatistics: Rafa = 15.475, Journal = 1.855, CAR/Y = 13.62
  • JASA: Rafa = 74.5, Journal = 5.2, CAR/Y = 69.3
  • Biometrics: Rafa = 4.33, Journal = 3.38, CAR/Y = 0.95

So Rafa’s citations above replacement is (13.62 + 69.3 + 0.95)/3 = 27.96! There are a couple of reasons why this isn’t a completely accurate picture. One is the low sample size, and the second is that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn’t make the top 400 for journals like JASA, which would bring down his CAR/Y.
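For concreteness, here is the CAR/Y arithmetic written out as a few lines of code, using the same numbers as above.

```python
# Median citations/year for Rafa's papers vs. the journal median,
# taken from the numbers above.
medians = {
    "Biostatistics": (15.475, 1.855),
    "JASA": (74.5, 5.2),
    "Biometrics": (4.33, 3.38),
}

# CAR/Y per journal is the author median minus the journal median.
car_y = {j: rafa - journal for j, (rafa, journal) in medians.items()}
print(car_y)  # approximately 13.62, 69.3, 0.95

# Average across journals.
print(sum(car_y.values()) / len(car_y))  # about 27.96
```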

 

Figuring Out Learning Objectives the Hard Way

When building the Genomic Data Science Specialization (which starts in June!) we had to figure out the learning objectives for each course. We initially set our ambitions high, but as you can see in the video below, Steven Salzberg brought us back to Earth.

Data analysis subcultures

Roger and I responded today in Nature to the controversy around the journal that banned p-values. A piece like this requires a lot of information packed into very little space, but I thought one idea deserved to be talked about more: the idea of data analysis subcultures. From the paper:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.

I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture, with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see methods like [randomized trials](http://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en) across multiple disciplines.

But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set, they’d give you back very different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:

  • Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from the paper Roger and I wrote; see the sketch after this list).
  • In genome-wide association studies the family-wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.
  • This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian, and if you learned at Berkeley you are probably a frequentist.
  • Psychology has a history of using parametric statistics, genomics is big into empirical Bayes, and you see a lot of Bayesian statistics in climate studies.
  • You see tests for heteroskedasticity like the [White test](http://en.wikipedia.org/wiki/White_test) used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.
  • Training sets and test sets are used in machine learning for prediction, but rarely used for inference.
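To make the first example in the list concrete, here is a minimal sketch (simulated data, hypothetical parameter values) fitting the same longitudinal/panel data both ways with statsmodels: a linear mixed effects model, as an economist might, and GEE with an exchangeable working correlation, as a biostatistician might.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, n_times = 100, 4

# Simulated longitudinal/panel data: repeated measures per subject,
# a subject-level random intercept, and a common time trend of 0.5.
subject = np.repeat(np.arange(n_subjects), n_times)
time = np.tile(np.arange(n_times), n_subjects)
y = (1.0 + 0.5 * time
     + rng.normal(0, 1, n_subjects)[subject]      # subject random intercept
     + rng.normal(0, 1, n_subjects * n_times))    # observation noise
df = pd.DataFrame({"y": y, "time": time, "subject": subject})

# The "panel data" habit: a linear mixed effects model.
mixed = smf.mixedlm("y ~ time", df, groups="subject").fit()

# The "longitudinal data" habit: GEE with an exchangeable
# working correlation structure.
gee = smf.gee("y ~ time", groups="subject", data=df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()

# Both recover a time effect near 0.5; the modeling cultures differ
# more than the answers do here.
print(mixed.params["time"], gee.params["time"])
```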

This is just a partial list off the top of my head; there are a ton more. These decisions matter a lot in a data analysis. The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we’d like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is “I saw it in a widely-cited paper in journal XX from my field,” little progress is likely to be made on resolving the statistical problems in science.