21 Feb 2014
There’s been much discussion recently about how the scientific publishing system is “broken”. Just the latest one that I saw was a tweet from Princeton biophysicist Josh Shaevitz:
On this blog, we’ve talked quite a bit about the publishing system, including in this interview with Michael Eisen. Jeff recently posted about changing the reviewing system (again). We have a few other posts on this topic. Yes, we like to complain like the best of them.
But there’s a simple fact: The scientific publishing system, as broken as you may find it to be, can never truly be fixed.
Here’s the tl;dr
- The collection of scientific publications out there make up a marketplace of ideas, hypotheses, theorems, conjectures, and comments about nature.
- Each member of society has an algorithm for placing a value on each of those publications. Valuation methodologies vary, but they often include factors like the reputation of the author(s), the journal in which the paper was published, the source of funding, as well as one’s own personal beliefs about the quality of the work described in the publication.
- Given a valuation methodology, each scientist can rank order the publications from “most valuable” to “least valuable”.
- Fixing the scientific publication system would require forcing everyone to agree on the same valuation methodology for all publications.
The Marketplace of Publications
The first point is that the collection of scientific publications make up a kind of market of ideas. Although we don’t really “trade” publications in this market, we do estimate the value of each publication and label some as “important” and some as not important. I think this is important because it allows us to draw analogies with other types of markets. In particular, consider the following question: Can you think of a market in any item where each item was priced perfectly, so that every (rational) person agreed on its value? I can’t.
Consider the stock market, which might be the most analyzed market in the world. Professional investors make their entire living analyzing the companies that are listed on stock exchanges and buying and selling their shares based on what they believe is the value of those companies. And yet, there can be huge disagreements over the valuation of these companies. Consider the current Herbalife drama, where investors William Ackman and Carl Icahn (and Daniel Loeb) are taking complete opposite sides of the trade (Ackman is short and Icahn is long). They can’t both be right about the valuation; they must have different valuation strategies. Everyday, the market’s collective valuation of different companies changes, reacting to new information and perhaps to irrational behavior. In the long run, good companies survive while others do not. In the meantime, everyone will argue about the appropriate price.
Journals are in some ways like the stock exchanges of yore. There are very prestigious ones (e.g. NYSE, the “big board”) and there are less prestigious ones (e.g. NASDAQ) and everyone tries to get their publication into the prestigious journals. Journals have listing requirements–you can’t just put any publication in the journal. It has to meet certain standards set by the journal. The importance of being listed on a prestigious stock exchange has diminished somewhat over the years. The most valuable company in the world trades on the NASDAQ. Similarly, although Science, Nature, and the New England Journal of Medicine are still quite sought after by scientists, competition is increasing from journals (such as those from the Public Library of Science) who are willing to publish papers that are technically correct and let readers determine their importance.
What’s the “Fix”?
Now let’s consider a world where we obliterate journals like Nature and Science and that there’s only the “one true journal”. Suppose this journal accepts any publication that satisfies some basic technical requirements (i.e. not content-based) and then has a sophisticated rating system that allows readers to comment on, rate, and otherwise evaluate each publication. There is no pre-publication peer review. Everything is immediately published. Problem solved? Not really, in my opinion. Here’s what I think would end up happening:
- People would have to (slightly) alter their methodology for ranking individual scientists. They would not be able to say “so-and-so has 10 Nature papers, so he must be good”. But most likely, another proxy for actually reading the appears would arise. For example, “My buddy from University of Whatever put this paper in his top-ten list, so it must be good”. As Michael Eisen said in our interview, the ranking system induced by journals like Science and Nature is just an abstract hierarchy; we can still reproduce the hierarchy even if Science/Nature don’t exist.
- In the current system, certain publications often “get stuck” with overly inflated valuations and it is often difficult to effectively criticize such publications because there does not exist an equivalent venue for informed criticism on par with Science and Nature. These publications achieve such high valuations partly because they are published in high-end journals like Nature and Science, but partly it is because some people actually believe they are valuable. In other words, it is possible to create a “bubble” where people irrationally believe a publication is valuable, just because everyone believes it’s valuable. If you destroy the current publication system, there will still be publications that are “over-valued”, just like in every other market. And furthermore, it will continue to be difficult to criticize such publications. Think of all the analysts that were yelling about how the housing market was dangerously inflated back in 2007. Did anyone listen? Not until it was too late.
What Can be Done?
I don’t mean for this post to be depressing, but I think there’s a basic reality about publication that perhaps is not fully appreciated. That said, I believe there are things that can be done to improve science itself, as well as the publication system.
- Raise the ROC curves of science. Efforts in this direction make everyone better and improve our ability to make more important discoveries.
- Increase the reproducibility of science. This is kind of the “Sarbanes-Oxley” of science. For the most part, I think the debate about whether science should be made more reproducible is coming to a close (or it is for me). The real question is how do we do it, for all scientists? I don’t think there are enough people thinking about this question. It will likely be a mix of different strategies, policies, incentives, and tools.
- Develop more sophisticated evaluation technologies for publications. Again, to paraphrase Michael Eisen, we are better able to judge the value of a pencil on Amazon than we are able to judge a scientific publication. The technology exists for improving the system, but someone has to implement it. I think a useful system along these lines would go a long way towards de-emphasizing the importance of “vanity journals” like Nature and Science.
- Make open access more accessible. Open access journals have been an important addition to the publication universe, but they are still very expensive (the cost has just been shifted). We need to think more about lowering the overall cost of publication so that it is truly open access.
Ultimately, in a universe where there are finite resources, a system has to be developed to determine how those resources should be distributed. Any system that we can come up with will be flawed as there will by necessity have to be winners and losers. I think there are serious efforts that need to be made to make the system more fair and more transparent, but the problem will never truly be “fixed” to everyone’s satisfaction.
17 Feb 2014
Editor’s Note: Ronald This is a repost of the post “R.A. Fisher is the most influential scientist ever” with a picture of my pilgrimage to his gravesite in Adelaide, Australia.
You can now see profiles of famous scientists on Google Scholar citations. Here are links to a few of them (via Ben L.). Von Neumann, Einstein, Newton, Feynman
But their impact on science pales in comparison (with the possible exception of Newton) to the impact of one statistician: R.A. Fisher. Many of the concepts he developed are so common and are considered so standard, that he is never cited/credited. Here are some examples of things he invented along with a conservative number of citations they would have received calculated via Google Scholar*.
- P-values - 3 million citations
- Analysis of variance (ANOVA) - 1.57 million citations
- Maximum likelihood estimation - 1.54 million citations
- Fisher’s linear discriminant 62,400 citations
- Randomization/permutation tests 37,940 citations
- Genetic linkage analysis 298,000 citations
- Fisher information 57,000 citations
- Fisher’s exact test 237,000 citations
A couple of notes:
- These are seriously conservative estimates, since I only searched for a few variants on some key words
- These numbers are BIG, there isn’t another scientist in the ballpark. The guy who wrote the “most highly cited paper” got 228,441 citations on GS. His next most cited paper? 3,000 citations. Fisher has at least 5 papers with more citations than his best one.
- This page says Bert Vogelstein has the most citations of any person over the last 30 years. If you add up the number of citations to his top 8 papers on GS, you get 57,418. About as many as to the Fisher information matrix.
I think this really speaks to a couple of things. One is that Fisher invented some of the most critical concepts in statistics. The other is the breadth of impact of statistical ideas across a range of disciplines. In any case, I would be hard pressed to think of another scientist who has influenced a greater range or depth of scientists with their work.
Update: I recently when to Adelaide to give a couple of talks on Bioinformatics, Statistics and MOOCs. My host Gary informed me that Fisher was buried in Adelaide. I went to the cathedral to see the memorial and took this picture. I couldn’t get my face in the picture because the plaque was on the ground. You’ll have to trust me that these are my shoes.
14 Feb 2014
Executive Summary
- The problem is not p-values it is a fundamental shortage of data analytic skill.
- In general it makes sense to reduce researcher degrees of freedom for non-experts, but any choice of statistic, when used by many untrained people, will be flawed.
- The long term solution is to require training in both statistics and data analysis for anyone who uses data but particularly journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.
- The Johns Hopkins Specialization in Data Science runs every month and can be easily integrated into any program. Other, more specialized, online courses and short courses make it possible to round this training out in ways that are appropriate for each discipline.
Scalability of Statistical Procedures
The P-value is in the news again. Nature came out with a piece talking about how scientists are naive about the use of P-values among other things. P-values have known flaws which have been regularly discussed. If you want to see some criticisms just Google “NHST”. Despite their flaws, from a practical perspective it is and oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis.
Data are now abundant in nearly every discipline from astrophysics, to biology, to the social sciences, and even in qualitative disciplines like literature. By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from a scenario where data was measured in bytes to terabytes in almost every discipline. Training programs haven’t adapted to this new era. This is particularly true in genomics where within one generation we went from a data poor environment to a data rich environment. [Executive Summary
- The problem is not p-values it is a fundamental shortage of data analytic skill.
- In general it makes sense to reduce researcher degrees of freedom for non-experts, but any choice of statistic, when used by many untrained people, will be flawed.
- The long term solution is to require training in both statistics and data analysis for anyone who uses data but particularly journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.
- The Johns Hopkins Specialization in Data Science runs every month and can be easily integrated into any program. Other, more specialized, online courses and short courses make it possible to round this training out in ways that are appropriate for each discipline.
Scalability of Statistical Procedures
The P-value is in the news again. Nature came out with a piece talking about how scientists are naive about the use of P-values among other things. P-values have known flaws which have been regularly discussed. If you want to see some criticisms just Google “NHST”. Despite their flaws, from a practical perspective it is and oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis.
Data are now abundant in nearly every discipline from astrophysics, to biology, to the social sciences, and even in qualitative disciplines like literature. By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from a scenario where data was measured in bytes to terabytes in almost every discipline. Training programs haven’t adapted to this new era. This is particularly true in genomics where within one generation we went from a data poor environment to a data rich environment.](http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/) were trained before data were widely available and used.
The result is that the vast majority of people performing statistical and data analysis are people with only one or two statistics classes and little formal data analytic training under their belt. Many of these scientists would happily work with a statistician, but as any applied statistician at a research university will tell you, it is impossible to keep up with the demand from our scientific colleagues. Everyone is collecting major data sets or analyzing public data sets; there just aren’t enough hours in the day.
Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many “researcher degrees of freedom”. If a naive analyst can pick any of a range of methods and does not understand how they work, they will generally pick the one that gives them maximum benefit.
The short-term solution is to find a balance between researcher degrees of freedom and “recipe book” style approaches that require a specific method to be applied. In general, for naive analysts, it makes sense to lean toward less flexible methods that have been shown to work across a range of settings. The key idea here is to evaluate methods in the hands of naive users and see which ones work best most frequently, an idea we have previously called “evidence based data analysis”.
An incredible success story of evidence based data analysis in genomics is the use of the limma package for differential expression analysis of microarray data. Limma can be beat in certain specific scenarios, but it is robust to such a wide number of study designs, sample sizes, and data types that the choice to use something other than limma should only be exercised by experts.
The trouble with criticizing p-values without an alternative
P-values are an obvious target of wrath by people who don’t do day to day statistical analysis because the P-value is the most successful statistical procedure ever invented. If every person who used a P-value cited the inventor, P-values would have, very conservatively, 3 million citations. That’s an insane amount of use for one statistic.
Why would such a terrible statistic be used by so many people? The reason is that it is critical that we have some measure of uncertainty we can assign to data analytic results. Without such a measure, the only way to determine if results are real or not is to rely on people’s intuition, which is a notoriously unreliable metric when uncertainty is involved. It is pretty clear science would be much worse off if we decided if results were reliable based on peoples’ gut feeling about the data.
P-values can and are misinterpreted, misused, and abused both by naive analysts and by statisticians. Sometimes these problems are due to statistical naiveté, sometimes they are due to wishful thinking and career pressure, and sometimes they are malicious. The reason is that P-values are complicated and require training to understand.
Critics of the P-value argue in favor of a large number of the procedures to be used in place of P-values. But when considering the scale at which the methods must be used to address the demands of the current data rich world, many alternatives would result in similar flaws. This is in no way proves the use of P-values is a good idea, but it does prove that coming up with an alternative is hard. Here are a few potential alternatives.
- Methods should only be chosen and applied by true data analytic experts. Pros: This is the best case scenario. Cons: Impossible to implement broadly given the level of statistical and data analytic expertise in the community
- The full prior, likelihood and posterior should be detailed and complete sensitivity analysis should be performed. Pros: In cases where this can be done this provides much more information about the model and uncertainty being considered. Cons: The model requires more advanced statistical expertise, is computationally much more demanding, and can not be applied in problems where model based approaches have not been developed. Yes/no decisions about credibility of results still come down to picking a threshold or allowing more researcher degrees of freedom.
- A direct Bayesian approach should be used reporting credible intervals and Bayes estimators. Pros: In cases where the model can be fit, can be used by non-experts, provides scientific measures of uncertainty like confidence intervals. Cons: The prior allows a large number of degrees of freedom when not used by an expert, sensitivity analysis is required to determine the effect of the prior, many more complex models can not be implemented, results are still sample size dependent.
- Replace P-values with likelihood ratios. Pros: In cases where it is available may reduce some of the conceptual difficulty with the null hypothesis. Cons: Likelihood ratios can usually only be computed exactly for cases with few or no nuisance parameters, likelihood ratios run into trouble for complex alternatives, they are still sample size dependent, the a likelihood ratio threshold is equivalent to a p-value threshold in many cases.
- We should use Confidence Intervals exclusively in the place of p-values. Pros: A measure and variability on the scale of interest will be reported. We can evaluate effect sizes on a scientific scale. Cons: Confidence intervals are still sample size dependent and can be misleading for large samples, significance levels can be chosen to make intervals artificially wide/small, if used as a decision making tool there is a one-to-one mapping between a confidence interval and a p-value threshold.
- We should use Bayes Factors instead of p-values. Pros: They can compare the evidence (loosely defined) for both the null and alternative. They can incorporate prior information. Cons: Priors provide researcher degrees of freedom, cutoffs may still lead to false/true positives, BF’s still depend on sample size.
This is not to say that many of these methods have advantages over P-values. But at scale any of these methods will be prone to abuse, misinterpretation and error. For example, none of them by default deals with multiple testing. Reducing researcher degrees of freedom is good when dealing with a lack of training, but the consequence is potential for mistakes and all of these methods would be ferociously criticized if used as frequently as p-values.
The difference between data analysis and statistics
Many disciplines including medicine and molecular biology usually require an introductory statistics or machine learning class during their program. This is a great start, but is not sufficient for the modern data saturated era. The introductory statistics or machine learning class is enough to teach someone the language of data analysis, but not how to use it. For example, you learn about the t-statistic and how to calculate it. You may also learn the asymptotic properties of the statistic. But you rarely learn about what happens to the t-statistic when there is an unmeasured confounder. You also don’t learn how to handle non iid data, sample mixups, reproducibility, most of scripting, etc.
It is therefore critical that if you plan to use or understand data analysis you take both the introductory course and at least one data analysis course. The data analysis course should cover study design, more general data analytic reasoning, non-iid data, biased sampling, basics of non-parametrics, training vs test sets, prediction error, sources of likely problems in data sets (like sample mixups), and reproducibility. These are the concepts that appear regularly when analyzing real data that don’t usually appear in the first course in statistics that most medical and molecular biology professionals see. There are awesome statistical educators who are trying hard to bring more of this into the introductory stats world, but it is just too much to cram into one class.
What should we do
The thing that is the most frustrating about the frequent and loud criticisms of P-values is that they usually point out what is wrong with P-values, but don’t suggest what we should do about it. When they do make suggestions, they frequently ignore the fundamental problems:
- Statistics are complicated and require careful training to understand properly. This is true regardless of the choice of statistic, philosophy, or algorithm.
- Data is incredibly abundant in all disciplines and shows no sign of slowing down.
- There is a fundamental shortage of training in statistics and data analysis
- Giving untrained analysts extra researcher degrees of freedom is dangerous.
The most direct solution to this problem is increased training in statistics and data analysis. Every major or program in a discipline that regularly analyzes data (molecular biology, medicine, finance, economics, astrophysics, etc.) should require at minimum an introductory statistics class and a data analysis class. If the expertise doesn’t exist to create these sorts of courses there are options. For example, we have introduced a series of 9 courses that run every month that cover most of the basic topics that are common across disciplines.
http://jhudatascience.org/
https://www.coursera.org/specialization/jhudatascience/1
I think of particular interest given the NIH Director’s recent comments on reproducibility is our course on Reproducible Research. There are also many more specialized resources that are very good and widely available that will build on the base we created with the data science specialization.
- For scientific software engineering/reproducibility: Software Carpentry.
- For data analysis in genomics: Rafa’s Data Analysis for Genomics Class.
- For Python and computing: The Fundamentals of Computing Specialization
Enforcing education and practice in data analysis is the only way to resolve the problems that people usually attribute to P-values. In the short term, we should at minimum require all the editors of journals who regularly handle data analysis to show competency in statistics and data analysis.
_Correction: _After seeing Katie K.’s comment on Facebook I concur that P-values were not directly referred to as “worse than useless”, so to more fairly represent the article, I have deleted that sentence.