
In which Brian debates abstraction with T-Bone

Editor’s Note: This is the first in a set of point-counterpoint posts related to the value of abstract thinking in graduate education that we teased a few days ago. Brian Caffo, recently installed Graduate Program Director at the best Biostat department in the country, has kindly agreed to lead off with the case for abstraction. We’ll follow up later in the week with my counterpoint. In the meantime, there have already been a number of really interesting and insightful comments inspired by our teaser post that are well worth reading. See the comments here.

The impetus for writing this blog post came out of a particularly heady lunchroom discussion on the role of measure theoretic probability in our curriculum. We have a very mathematically rigorous program at Hopkins Biostatistics that includes a full academic year of measure theoretic probability. As elsewhere, many faculty dispute the necessity of this course. I am in favor of it. My principal reason is that I believe it is useful for building up and evaluating a student’s abilities in abstraction and generalization.

In our discussion, abstraction was the real point of contention. Emphasizing abstraction versus more immediately practical tools is an age-old argument: ivory tower stereotypes (the philosopher archetype) versus the equally stereotyped scientific pragmatists (the engineering archetype).

So, let’s begin picking this scab. For your sake and mine, I’ll try to be brief.

My definitions:

Abstraction - reducing a technique, idea or concept to its essence or core.

Generalization - extending a technique, idea or concept to areas for which it was not originally intended.

PhD - a post-baccalaureate degree that requires substantial new contributions to knowledge.


The term “substantial new contributions” in my definition of a PhD is admittedly fuzzy. To tie it down, examples that I think do create new knowledge in the field of statistics include:

  1. applying existing techniques to data where they have not been used before (generalization of the application of the techniques),
  2. developing statistical software (abstraction of statistical and mathematical thoughts into code),
  3. developing new statistical methods from existing ones (generalization),
  4. proving new theory (both abstraction and generalization) and
  5. creating new data analysis pipelines (both abstraction and generalization).

In every one of these examples, generalization or abstraction is what differentiates the work from a purely technical accomplishment.

To give a contrary activity, consider statistical technical specialization: the application of an existing method to data where the method is already known to be effective and no new statistical thought is required. Regardless of how necessary, difficult or important applying that method is, such activity does not constitute the creation of new statistical knowledge, even if it is a necessary schlep in the creation of new knowledge of another sort.

Though many graduate-level statistical activities require substantial technical specialization, for work to count as doctoral statistical research under my definition, generalization and abstraction are necessary components.

I further contend that abstraction is a key tool for obtaining meaningful generalization. A method, theory, or analysis cannot be retooled for an unintended use without stripping away some of its specialization and abstracting it to its core utility.

Abstraction is constantly necessary when applying statistical methods. It is at work whenever a statistician says, "Method A was really designed for a different kind of data than mine. But at its core it’s really useful for finding out B, which I need to know. So I’ll use it anyway until (if ever) I come up with something better."

As examples:

  * A = the CLT, B = the distribution of normalized means;
  * A = principal components, B = directions of variation;
  * A = the bootstrap, B = sampling distributions;
  * A = linear models, B = mean relationships with covariates.
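To make the bootstrap case concrete, here is a minimal sketch in Python (the function name `bootstrap_distribution` is mine, purely for illustration) of the method stripped to its abstract core: resample, recompute, collect. Seen at that level of abstraction, the bootstrap approximates a sampling distribution for essentially any statistic, not only the ones it was originally studied for.

```python
import random
import statistics

def bootstrap_distribution(data, stat, n_boot=2000, seed=1):
    """Approximate the sampling distribution of `stat` by resampling
    the data with replacement and recomputing the statistic each time."""
    rng = random.Random(seed)
    n = len(data)
    return [
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]

# The abstract core (resample, recompute, collect) is identical no
# matter which statistic we plug in: the median here, but a mean,
# a trimmed mean, or a regression coefficient would work the same way.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
boot_medians = bootstrap_distribution(sample, statistics.median)
print(statistics.mean(boot_medians), statistics.stdev(boot_medians))
```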

Abstraction and generalization facilitate learning new areas. Knowledge of the abstract core of a discipline makes that knowledge much more portable. This is seen across every discipline: musicians who know music theory can use their knowledge for any instrument; computer scientists who understand data structures and algorithms can switch languages easily; electrical engineers who understand signal processing can switch between technologies easily. Abstraction is what allows them to see past the concrete (instrument, syntax, technology) to the essence (music, algorithm, signal).

Statisticians, for their part, learn statistical and probability theory. In statistics, however, abstraction is not represented only by mathematics and theory. As that absolutely unimpeachable source, Simply Statistics, has pointed out, software is exactly an abstraction:

"I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that software represents an important form of abstraction, if not the most important form …"
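In that spirit, here is a minimal sketch of software-as-abstraction (the names `k_fold_cv`, `fit`, and `loss` are hypothetical, chosen for illustration): the code fixes the statistical idea behind k-fold cross-validation (partition, hold out, fit on the rest, evaluate) while leaving the model and the error measure abstract as parameters.

```python
import random

def k_fold_cv(data, fit, loss, k=5, seed=0):
    """Generic k-fold cross-validation. The statistical idea
    (partition, hold out, fit on the rest, evaluate) is fixed in
    code; the model (fit) and error measure (loss) stay abstract."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = [data[i] for i in fold]
        training = [data[i] for i in indices if i not in fold]
        model = fit(training)
        scores.append(sum(loss(model, x) for x in held_out) / len(held_out))
    return scores

# Any model/loss pair plugs in unchanged. Here the "model" is just
# the training mean and the loss is squared error.
data = [1.2, 0.8, 1.5, 2.1, 1.9, 0.7, 1.1, 1.8, 2.4, 1.3]
mean_fit = lambda train: sum(train) / len(train)
sq_loss = lambda model, x: (x - model) ** 2
print(k_fold_cv(data, mean_fit, sq_loss, k=5))
```

The reusability comes from exactly what the argument predicts: the specialization (which model, which loss) has been stripped away, leaving only the core of the idea.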

(A QED is in order, I believe.)