Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

On the relative importance of mathematical abstraction in graduate statistical education

Editor’s Note: This is the counterpoint in our series of posts on the value of abstraction in graduate education. See Brian’s defense of abstraction on Monday and the comments on his post, as well as the comments on our original teaser post for more. See below for a full description of the T-bone inside joke.*

Brian did a good job of defining abstraction. In a cagey debater’s move, he provided an incredibly broad definition of abstraction that includes the reason we call a :-) a smiley face, the reason we can apply least squares to a variety of data types, and the reason we write functions when programming. At this very broad level, it is clear that abstract thinking is necessary for graduate students or any other data professional.

But our debate was inspired by a discussion of whether measure-theoretic probability was a key component of our graduate program. There was some agreement that for many biostatistics Ph.D. students, this exact topic may not be necessary for their research or careers. Brian suggested that measure-theoretic probability was a surrogate marker for something more important - abstract thinking and the ability to generalize ideas. This is a very specific form of generalization and abstraction that is used most commonly by statisticians: the ability that permits one to prove theorems and develop statistical models that can be applied to a variety of data types. I will therefore refocus the debate on the original topic. I have three main points:

  1. There is an overemphasis in statistical graduate programs on abstraction, defined as the ability to prove mathematical theorems and develop general statistical methods.
  2. It is possible to create incredible statistical value without developing generalizable statistical methods.
  3. While abstraction in the general sense is good, overemphasis on this specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.


There is an overemphasis in statistical graduate programs on abstraction, defined as the ability to prove mathematical theorems and develop general statistical methods.

At a top program, you can expect to take courses in theoretical statistics, measure-theoretic probability, and an applied (or methods) sequence. The first two courses are exclusively mathematical. The third (at the programs I have visited, graduated from, or taught in), despite its name, focuses mostly on the mathematical details underlying statistical methods. The result is that most Ph.D. students are heavily trained in the mathematical theory behind statistics.

At the same time, there is a long list of skills necessary to develop a successful Ph.D. statistician. These include creativity in applications, statistical programming skills, grit to power through the boring/hard parts of research, interpretation of statistical results on real data, the ability to identify the most important scientific problems, and a deep understanding of the scientific problems you are working on. Abstraction is on that list, but it is just one of many skills. Graduate education is a zero-sum game over a finite period of time. Our strong focus on mathematical abstraction means there is less time for everything else.

Any hard quantitative course will measure a student’s ability to abstract in the general sense Brian defined. One of these courses would be very useful for our students. But it is not clear that we should focus on mathematical abstraction to the exclusion of other important characteristics of graduate students.

It is possible to create incredible statistical value without developing generalizable statistical methods.

A major standard for success in academia is the ability to generate solutions to problems that are widely read, cited, and used. A graduate student who produces these types of solutions is likely to have a high-impact and well-respected career. In general, it is not necessary to be able to prove theorems, understand measure theory, or develop generalizable statistical models to have this type of success.

One example is one of the co-authors of our blog, best known for his work in genomics. In this field, data is noisy and full of systematic errors, and for several technologies he invented methods to correct them. For example, he developed the most popular methods for making measurements from different experiments comparable, for removing the dependence of measurements on the letters in a gene, and for reducing variability due to the operators who run the machines or the ozone levels. Each of these discoveries involved: (1) a deep understanding of the specific technology used, (2) good intuition about which signals were due to biology and which were due to technology, (3) the application/development of specific, somewhat ad hoc statistical procedures to correct the mistakes, and (4) the development and distribution of good software. His work has been hugely influential in genomics, has been cited thousands of times, and has substantially improved the quality of both biological and statistical results.
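To give a concrete flavor of this kind of correction, here is a minimal sketch of quantile normalization, one widely used idea for making measurements from different experiments comparable. The post does not name the specific methods, so treat this as an illustration of the general idea rather than the exact procedure alluded to; the data are simulated:

```python
import numpy as np

def quantile_normalize(X):
    """Force every column (experiment) of X to share the same distribution.

    Each value is replaced by the mean, across experiments, of the values
    occupying the same rank. Ties are handled naively in this sketch.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each entry within its column
    reference = np.sort(X, axis=0).mean(axis=1)        # average of the sorted columns
    return reference[ranks]

# Three hypothetical experiments measuring the same features, each with a
# different multiplicative shift (the kind of systematic artifact being corrected).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 3)) * np.array([1.0, 1.5, 2.2])
Xn = quantile_normalize(X)
print(X.mean(axis=0))   # column means disagree before normalization
print(Xn.mean(axis=0))  # and agree exactly afterwards
```

Note how this matches the list above: the statistical idea is somewhat ad hoc, but once abstracted into a function it can be distributed as software and applied to any technology with the same artifact.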

But the work did not result in knowledge that was generalizable to other areas of application; it deals with problems that are highly specialized to genomics. If these were his only contributions (they are not), he’d be a hugely successful Ph.D. statistician. But had he focused on general solutions, he would never have solved the problems at hand, since the problems were highly specific to a single application. And this is just one example I know well because I work in the area. There are a ton more just like it.

While abstraction in the general sense is good, overemphasis on a specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.

One could argue that the choice of statistical techniques during data analysis is abstraction, or that one needs to abstract to develop efficient software. But the ability to abstract needed for these tasks can be measured by a wide range of classes, not just measure-theoretic probability. Some of these classes might teach practically applicable skills, like writing fast and efficient algorithms. Many results of high statistical value do not require mathematical proofs, abstract inductive reasoning, or asymptotic theory. It is a good idea to have some people who can abstract away the science behind statistical methods to the core mathematical philosophy. But our current curriculum is too heavily weighted in this direction. In some cases, statisticians are even being left behind because they do not have sufficient time in their curriculum to develop the computational skills and amass the subject-matter knowledge needed to compete with the increasingly diverse set of engineers, computer scientists, data scientists, and computational biologists tackling the same scientific problems.

We need to reserve a larger portion of graduate education for diving deeply into specific scientific problems, even if it means students spend less time developing generalizable/abstract statistical ideas.

* Inside joke explanation: Two years ago at JSM I ran a footrace with this guy for the rights to the name “Jeff” in the Department of Biostatistics at Hopkins for the rest of 2011. Unfortunately, we did not pro-rate for age, and he nipped me by about half a yard. True to my word, I went by Tullis (my middle name) for a few months, including on the title slide of my JSM talk. This was, of course, immediately subjected to all sorts of nicknaming, and B-Caffo loves to use “T-bone”. I apologize on behalf of those who brought it up.

My worst nightmare...

I don’t know if you have seen this story about a person whose iCloud account was hacked. But man, does it freak me out. As a person who relies pretty heavily on cloud-based storage services and does some cloud-computing-based research as well, this is a pretty freaky scenario. Time to back everything up again…

In which Brian debates abstraction with T-Bone

Editor’s Note: This is the first in a set of point-counterpoint posts related to the value of abstract thinking in graduate education that we teased a few days ago. Brian Caffo, recently installed Graduate Program Director at the best Biostat department in the country, has kindly agreed to lead off with the case for abstraction. We’ll follow up later in the week with my counterpoint. In the meantime, there have already been a number of really interesting and insightful comments inspired by our teaser post that are well worth reading. See the comments here.

The impetus for writing this blog post came out of a particularly heady lunchroom discussion on the role of measure-theoretic probability in our curriculum. We have a very mathematically rigorous program at Hopkins Biostatistics that includes a full academic year of measure-theoretic probability. As elsewhere, many faculty dispute the necessity of this course. I am in favor of it, my principal reason being that I believe it is useful for building up and evaluating a student’s abilities in abstraction and generalization.

In our discussion, abstraction was the real point of contention. Emphasizing abstraction versus more immediately practical tools is an age-old argument of ivory-tower stereotypes (the philosopher archetype) versus equally stereotypical scientific pragmatists (the engineering archetype).

So, let’s begin picking this scab. For your sake and mine, I’ll try to be brief.

My definitions:

Abstraction - reducing a technique, idea or concept to its essence or core.

Generalization - extending a technique, idea or concept to areas for which it was not originally intended.

PhD - a post-baccalaureate degree that requires substantial new contributions to knowledge.


The term “substantial new contributions” in my definition of a PhD is admittedly fuzzy. To tie it down, examples that I think do create new knowledge in the field of statistics include:

  1. applying existing techniques to data where they have not been used before (generalization of the application of the techniques),
  2. developing statistical software (abstraction of statistical and mathematical thoughts into code; see the sketch below),
  3. developing new statistical methods from existing ones (generalization),
  4. proving new theory (both abstraction and generalization) and
  5. creating new data analysis pipelines (both abstraction and generalization).

In every one of these examples, generalization or abstraction is what differentiates it from a purely technical accomplishment.
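As a toy illustration of item 2 (not from the original post), here is a minimal sketch that abstracts several classical location estimators to their shared essence: the value minimizing an aggregate loss. The function `m_estimate` and its inputs are invented for this example:

```python
import numpy as np
from scipy.optimize import minimize

def m_estimate(x, loss):
    """Abstract 'a location estimate' to its core: the value minimizing a total loss.

    Different losses recover different classical estimators:
    squared error gives the mean, absolute error gives the median.
    """
    objective = lambda theta: np.sum(loss(x - theta))
    # Nelder-Mead handles non-smooth losses like the absolute value.
    return minimize(objective, x0=[np.mean(x)], method="Nelder-Mead").x[0]

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # one gross outlier
print(m_estimate(x, lambda r: r**2))       # the mean, dragged toward the outlier
print(m_estimate(x, np.abs))               # the median, robust to it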

To give a contrary activity, consider statistical technical specialization: that is, the application of an existing method to data where the method is already known to be effective and no new statistical thought is required. Regardless of how necessary, difficult, or important applying that method is, such activity does not constitute the creation of new statistical knowledge, even if it is a necessary schlep in the creation of new knowledge of another sort.

Though many graduate-level activities in statistics require substantial technical specialization, to be doctoral statistical research in a way that satisfies my definition, generalization and abstraction are necessary components.

I further contend that abstraction is a key tool for obtaining meaningful generalization. A method, theory, analysis, et cetera cannot be retooled to a non-intended use without stripping away some of its specialization and abstracting it to its core utility.

Abstraction is constantly necessary when applying statistical methods, for example whenever a statistician says: “Method A really was designed for a different kind of data than mine. But at its core it’s really useful for finding out B, which I need to know. So I’ll use it anyway until (if ever) I come up with something better.”

As examples: A = the CLT, B = the distribution of normalized means; A = principal components, B = directions of variation; A = the bootstrap, B = sampling distributions; A = linear models, B = mean relationships with covariates.
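To make one of these pairings concrete, here is a minimal sketch (with simulated data invented for this example) of using the bootstrap to approximate a sampling distribution. Nothing in the code depends on what the measurements mean, which is exactly the abstraction at work:

```python
import numpy as np

def bootstrap_distribution(x, stat=np.mean, n_boot=5000, seed=0):
    """Approximate the sampling distribution of `stat` by resampling x with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    return np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(n_boot)])

# Hypothetical skewed data; the bootstrap does not care where it came from.
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100)

boot = bootstrap_distribution(x)
print(np.percentile(boot, [2.5, 97.5]))  # approximate 95% interval for the mean
```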

Abstraction and generalization facilitate learning new areas. Knowledge of the abstract core of a discipline makes that knowledge much more portable. This is seen across every discipline: musicians who know music theory can use their knowledge for any instrument; computer scientists who understand data structures and algorithms can switch languages easily; electrical engineers who understand signal processing can switch between technologies easily. Abstraction is what allows them to see past the concrete (instrument, syntax, technology) to the essence (music, algorithm, signal).

And statisticians learn statistical and probability theory. However, in statistics, abstraction is not represented only by mathematics and theory. As pointed out by the absolutely unimpeachable source, Simply Statistics, software is exactly an abstraction:

I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that software represents an important form of abstraction, if not the most important form …

(A QED is in order, I believe.)

Samuel Kou wins COPSS Award

At JSM this year we learned that Samuel Kou of Harvard’s Department of Statistics won the Committee of Presidents of Statistical Societies (COPSS) Presidents’ Award. The award is given annually to

a young member of the statistical community in recognition of an outstanding contribution to the profession of statistics. The recipient of the Presidents’ Award must be a member of at least one of the participating societies. The candidate may be chosen for a single contribution of extraordinary merit, or an outstanding aggregate of contributions, to the profession of statistics. 

Samuel’s work spans a wide range of areas from biophysics to MCMC to model selection with contributions in the top journals in statistics and elsewhere. He is also a member of a highly selective group of people who have been promoted to full Professor at Harvard’s Department of Statistics. (Bonus points to those who can name the last person to achieve such a distinction.)

This is a well-deserved honor to an exemplary member of our field.

NYC and Columbia to Create Institute for Data Sciences & Engineering
