Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

NSF should understand that Statistics is not Mathematics

NSF has realized that the role of Statistics is growing in all areas of science and engineering and has charged a committee to examine the current structure of support of the statistical sciences. As Roger explained in August, the NSF is divided into directorates composed of divisions. Statistics is in the Division of Mathematical Sciences (DMS) within the Directorate for Mathematical and Physical Sciences. Within this division it is a Disciplinary Research Program along with Topology, Geometric Analysis, etc. To statisticians this does not make much sense, and my first thought when asked for recommendations was that we need a proper division. But the committee is seeking out recommendations that

[do] not include renaming of the Division of Mathematical Sciences. Particularly desired are recommendations that can be implemented within the current divisional and directorate structure of NSF.

This clarification is there because a former director [proposed changing the division’s](http://simplystatistics.org/2012/08/21/nsf-recognizes-math-and-statistics-are-not-the-same/) name to “Division of Mathematical and Statistical Sciences”. The NSF shot down this idea and listed this as one of the reasons:

If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields

So NSF does not want to take away from the other math programs, and this is understandable given the current levels of research funding for Mathematics. But this being the case, I can’t really think of a recommendation other than giving Statistics its own division or giving data-related sciences their own directorate. Increasing support for the statistical sciences means increasing funding. You secure the necessary funding either by asking congress for a bigger budget (good luck with that) or by cutting from other divisions, not just Mathematics. A new division makes sense not only in practice but also in principle, because Statistics is not Mathematics.

Statistics is analogous to other disciplines that use mathematics as a fundamental language, like Physics, Engineering, and Computer Science. But like those disciplines, Statistics contributes separate and fundamental scientific knowledge. While the field of applied mathematics tries to explain the world with deterministic equations, Statistics takes a dramatically different approach. In highly complex systems, such as the weather, Mathematicians battle Laplace’s demon and struggle to explain nature using mathematics derived from first principles. Statisticians accept that deterministic approaches are not always useful and instead develop and rely on random models. These two approaches are both important, as demonstrated by the improvements in meteorological predictions achieved once data-driven statistical models were used to complement deterministic mathematical models.

Although Statisticians rely heavily on theoretical/mathematical thinking, another important distinction from Mathematics is that advances in our field are almost exclusively driven by empirical work. Statistics always starts with a specific, concrete real world problem: we thrive in Pasteur’s quadrant. Important theoretical work that generalizes our solutions always follows. This approach, built mostly by basic researchers, has yielded some of the most useful concepts relied upon by modern science: the p-value, randomization, analysis of variance, regression, the proportional hazards model, causal inference, Bayesian methods, and the bootstrap, just to name a few examples. These ideas were instrumental in the most important genetic discoveries, improving agriculture, the inception of the empirical social sciences, and revolutionizing medicine via randomized clinical trials. They have also fundamentally changed the way we abstract quantitative problems from real data.

The 21st century brings the era of big data, and distinguishing Statistics from Mathematics becomes more important than ever. Many areas of science are now being driven by new measurement technologies. Insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Although testing hypotheses developed theoretically will of course remain important to science, it is easy to imagine that, just as Leeuwenhoek became the father of microbiology by looking through the microscope without theoretical predictions, the era of big data will enable discoveries that we have not yet even imagined. However, it is naive to think that these new datasets will be free of noise and unwanted variability. Deterministic models alone will almost certainly fail at extracting useful information from these data, just like they have failed at predicting complex systems like the weather. The advancement in science during the era of big data that the NSF wants to see will only happen if the field that specializes in making sense of data is properly defined as a separate field from Mathematics and appropriately supported.

Addendum: On a related note, NIH just announced that they plan to recruit a new senior scientific position: the Associate Director for Data Science.

The landscape of data analysis

I have been getting some questions via email, LinkedIn, and Twitter about the content of the Data Analysis class I will be teaching for Coursera. Data Analysis and Data Science mean different things to different people. So I made a video describing how Data Analysis fits into the landscape of other quantitative classes here:

Here is the corresponding presentation. I also made a tentative list of topics we will cover, subject to change at the instructor’s whim. Here it is:

  • The structure of a data analysis  (steps in the process, knowing when to quit, etc.)
  • Types of data (census, designed studies, randomized trials)
  • Types of data analysis questions (exploratory, inferential, predictive, etc.)
  • How to write up a data analysis (compositional style, reproducibility, etc.)
  • Obtaining data from the web (through downloads mostly)
  • Loading data into R from different file types
  • Plotting data for exploratory purposes (boxplots, scatterplots, etc.)
  • Exploratory statistical models (clustering)
  • Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)
  • Basic model checking (primarily visually)
  • The prediction process
  • Study design for prediction
  • Cross-validation
  • A couple of simple prediction models
  • Basics of simulation for evaluating models
  • Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)
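Several of the topics above, cross-validation in particular, lend themselves to a short illustration. Here is a minimal sketch of k-fold cross-validation (written in Python purely for illustration; the course itself uses R). The mean-of-the-training-folds "model" is a hypothetical stand-in for whichever simple prediction model is being evaluated.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(y, k=5):
    """Hold out each fold in turn, predict its points with the mean of
    the remaining folds, and return the average squared error."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for test in folds:
        held_out = set(test)
        train = [y[j] for j in range(len(y)) if j not in held_out]
        mu = sum(train) / len(train)  # "fit" the stand-in model
        errors.extend((y[j] - mu) ** 2 for j in test)
    return sum(errors) / len(errors)
```

The same pattern generalizes to any model: fit on the training folds, score on the held-out fold, and average the held-out errors.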

Of course that is a ton of material for 8 weeks, and so obviously we will be covering just the very basics. I think it is really important to remember that being a good Data Analyst is like being a good surgeon or writer. There is no such thing as a prodigy in surgery or writing, because both require long experience, trying lots of things out, and learning from mistakes. I hope to give people the basic information they need to get started and point to resources where they can learn more. I also hope to give them a chance to practice some of the basics a couple of times and to learn that in data analysis the first goal is to “do no harm”.

By introducing competition open online education will improve teaching at top universities

It is no secret that faculty evaluations at top universities weigh research much more than teaching. This is not surprising given that, among other reasons, global visibility comes from academic innovation (think Nobel Prizes), not classroom instruction. Come promotion time, the peer review system carefully examines your publication record and ability to raise research funds. External experts within your research area are asked if you are a leader in the field. Top universities maintain their status by imposing standards that lead to a highly competitive environment in which only the most talented researchers survive.

However, the assessment of teaching excellence is much less stringent. Unless they reveal utter incompetence, teaching evaluations are practically ignored, especially if you have graduated numerous PhD students. Certainly, outside experts are not asked about your teaching. This imbalance in incentives explains why faculty use research funding to buy out of teaching and why highly recruited candidates negotiate low teaching loads.

Top researchers end up at top universities, but being good at research does not necessarily mean you are a good teacher. Furthermore, the effort required to be a competitive researcher leaves limited time for class preparation. To make matters worse, within a university, faculty have a monopoly on the classes they teach. With few incentives and practically no competition, it is hard to believe that top universities are doing the best they can when it comes to classroom instruction. By introducing competition, MOOCs might change this.

To illustrate, say you are the chair of a soft money department in 2015. Four of your faculty receive 25% funding to teach the big Stat 101 class and your graduate program’s three main classes. But despite being great researchers, these four are mediocre teachers. So why are they teaching if 1) a MOOC exists for each of these classes and 2) these professors can easily cover 100% of their salary with research funds? As chair, not only do you wonder why not let these four profs focus on what they do best, but also why your department is not creating MOOCs and getting global recognition for it. So instead of hiring 4 great researchers who are mediocre teachers, why not hire (for the same cost) 4 great researchers (fully funded by grants) and 1 great teacher (funded with tuition $)? I think in the future tenure track positions will be divided into top researchers doing mostly research and top teachers doing mostly classroom teaching and MOOC development. Because top universities will feel the pressure to compete and develop the courses that educate the world, there will be no room for mediocre teaching.

 

Sunday data/statistics link roundup (1/6/2013)

  1. Not really statistics, but this is an interesting article about how rational optimization by individual actors does not always lead to an optimal solution. Related, here is the coolest street sign I think I’ve ever seen, with a heatmap of traffic density to try to influence commuters.
  2. An interesting paper that talks about how clustering is only a really hard problem when there aren’t obvious clusters. I was a little disappointed in the paper, because it defines the “obviousness” of clusters only theoretically by a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendrograms, etc.
  3. A post about the two cultures of statistical learning and a related post on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that the only good data science is good science - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won’t change to tackle the realistic challenges posed in solving modern scientific problems.
  4. A graph that shows the incredibly strong correlation ( > 0.99!) between the growth of autism diagnoses and organic food sales. Another example where even really strong correlation does not imply causation.
  5. The Buffalo Bills are going to start an advanced analytics department (via Rafa and Chris V.), maybe they can take advantage of all this free play-by-play data from years of NFL games.
  6. A prescient interview with Isaac Asimov on learning, predicting the Khan Academy, MOOCs and other developments in online learning (via Rafa and Marginal Revolution).
  7. The statistical software signal - what your choice of software says about you. Just another reason we need a deterministic statistical machine.
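The point in item 4 is easy to demonstrate with invented numbers (purely illustrative, not the actual autism or organic food figures): any two series that merely trend upward over the same period will show a near-perfect Pearson correlation, even when they are generated with no connection to each other.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two made-up series over ten "years": one grows quadratically, the
# other linearly. Neither has anything to do with the other.
quadratic_trend = [10 * t ** 2 for t in range(1, 11)]
linear_trend = [100 + 50 * t for t in range(1, 11)]

r = pearson(quadratic_trend, linear_trend)  # about 0.97
```

With only ten points and a shared time trend, r comes out near 0.97 even though the series were constructed independently; the shared trend, not any causal link, drives the correlation.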

 

Does NIH fund innovative work? Does Nature care about publishing accurate articles?

Editor’s Note: In a recent post we disagreed with a Nature article claiming that NIH doesn’t support innovation. Our colleague Steven Salzberg actually looked at the data and wrote the guest post below. 

Nature published an article last month with the provocative title “Research grants: Conform and be funded.”  The authors looked at papers with over 1000 citations to find out whether scientists “who do the most influential scientific work get funded by the NIH.”  Their dramatic conclusion, widely reported, was that only 40% of such influential scientists get funding.

Dramatic, but wrong.  I re-analyzed the authors’ data and wrote a letter to Nature, which was published today along with the authors’ response, which more or less ignored my points.  Unfortunately, Nature cut my already-short letter in half, so what readers see in the journal omits half my argument.  My entire letter is published here, thanks to my colleagues at Simply Statistics.  I titled it “NIH funds the overwhelming majority of highly influential original science results,” because that’s what the original study should have concluded from their very own data.  Here goes:

To the Editors:

In their recent commentary, "Conform and be funded," Joshua Nicholson and John Ioannidis claim that "too many US authors of the most innovative and influential papers in the life sciences do not receive NIH funding."  They support their thesis with an analysis of 200 papers sampled from 700 life science papers with over 1,000 citations.  Their main finding was that only 40% of "primary authors" on these papers are PIs on NIH grants, from which they argue that the peer review system "encourage[s] conformity if not mediocrity."

While this makes for an appealing headline, the authors' own data does not support their conclusion.  I downloaded the full text for a random sample of 125 of the 700 highly cited papers [data available upon request].  A majority of these papers were either reviews (63), which do not report original findings, or not in the life sciences (17) despite being included in the authors' database.  For the remaining 45 papers, I looked at each paper to see if the work was supported by NIH.  In a few cases where the paper did not include this information, I used the NIH grants database to determine if the corresponding author has current NIH support.  34 out of 45 (75%) of these highly-cited papers were supported by NIH.  The 11 papers not supported included papers published by other branches of the U.S. government, including the CDC and the U.S. Army, for which NIH support would not be appropriate.  Thus, using the authors' own data, one would have to conclude that NIH has supported a large majority of highly influential life sciences discoveries in the past twelve years.

The authors – and the editors at Nature, who contributed to the article – suffer from the same biases that Ioannidis himself has often criticized.  Their inclusion of inappropriate articles and especially the choice to require that both the first and last author be PIs on an NIH grant, even when the first author was a student, produced an artificially low number that misrepresents the degree to which NIH supports innovative original research.

It seems pretty clear that Nature wanted a headline about how NIH doesn’t support innovation, and Ioannidis was happy to give it to them.  Now, I’d love it if NIH had the funds to support more scientists, and I’d also be in favor of funding at least some work retrospectively - based on recent major achievements, for example, rather than proposed future work.  But the evidence doesn’t support the “Conform and be funded” headline, however much Nature might want it to be true.