Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Why are the best relievers not used when they are most needed?

During Saturday’s ALCS Game 6, the Red Sox’s manager John Farrell took out his starter in the 6th inning. They were leading by 1, but had runners on first and second with no outs. This is a hard situation to get out of without giving up a run: the chances of the other team scoring with an average pitcher on the mound are about 64%. I am sure that with a top-of-the-line pitcher, like Koji Uehara, this number goes down substantially. So what does a typical manager do in this situation? Because managers like to save their better relievers for the end, and it’s only the 6th inning, they bring in a mediocre one instead. This is what Farrell did, and two batters later the score was 2-1 Tigers. To see why this is a bad move, note that the chances of a mediocre pitcher giving up a run when starting a clean inning are only about 28%. So why not bring in your best reliever when the game is actually on the line? Here is an article by John Dewan with a good in-depth discussion. Note that the Red Sox won the game 5-2 and Koji Uehara was brought in for the ninth inning to get 3 outs with the bases empty and a 3-run lead.
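As a back-of-the-envelope illustration, here is a small R snippet that simply compares the two probabilities cited above (the numbers are the post’s rough figures, not values computed from play-by-play data):

    # Rough probabilities cited in the post (not computed from play-by-play data)
    p_score_jam   <- 0.64  # runners on 1st and 2nd, no outs, average pitcher on the mound
    p_score_clean <- 0.28  # mediocre reliever starting a fresh inning

    # How much more likely is the one-run lead to be erased in the jam
    # than at the start of a clean inning?
    p_score_jam - p_score_clean   # absolute difference: 0.36
    p_score_jam / p_score_clean   # ratio: roughly 2.3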

Platforms and Integration in Statistical Research (Part 2/2)

In my last post, I talked about two general approaches to conducting statistical research: platforms and integration. In this follow-up I thought I’d describe the characteristics of certain fields that suggest taking one approach over the other.

I think in practice, most statisticians will dedicate some time to both the platform and integrative approaches to doing statistical research because different approaches work better in different situations. The question then is not “Which approach is better?” but rather “What characteristics of a field suggest one should take a platform / integrative approach in order to have the greatest impact?” I think one way to answer this question is to make an analogy with transaction costs a la the theory of the firm. (This kind of analogy also plays a role in determining who best to collaborate with but that’s a different post).

In the context of an academic community, I think if it’s easy to exchange information, for example, about data, then building platforms that are widely used makes sense. For example, if everyone uses a standardized technology for collecting a certain kind of data, then it’s easy to develop a platform that applies some method to that data. Regression methodology works in any field that can organize its data into a rectangular table. On the other hand, if information exchange is limited, then building platforms is more difficult and closer collaboration with individual investigators may be required. For example, if there is no standard data collection method or if everyone uses a different proprietary format, then it’s difficult to build a platform that generalizes to many different areas.

There are two case studies with which I am somewhat familiar that I think are useful for demonstrating these characteristics.

  • Genomics. I think genomics is an area where you can see statisticians definitely taking both approaches. However, I’m struck by the intense focus on the development of methods and data analysis pipelines, particularly in order to adapt to the ever-changing ‘omics technologies that are being developed. Part of the reason is that for a given type of data, there are relatively few companies developing the technology for collecting the data. Here, it is possible to develop a method or pipeline to deal with a new kind of data generated by a new technology in the early stages, when those data are first being produced. If your method works well relative to others, then it’s possible for your method to become essentially a standard approach that everyone uses for that technology. So there’s a pretty big incentive to be the person who develops a platform for a data collection technology on which everyone builds their research. It is helpful if you can get early access to new technologies so you can get a peek at the data before everyone else and get a head start on developing the methods. Another aspect of genomics is that the field is quite open relative to others, in that there is quite a bit of information sharing. With the enormous amount of publicly available data out there, there’s a very large population of potential users of your method/software. Those people who don’t collect their own primary data can still take your method and apply it to data that’s already out there. Therefore, I think from a statistician’s point of view, genomics is a field that presents many opportunities to build platforms that will be used by many people addressing many different types of questions.
  • Environmental Health. The area of environmental health, where I generally operate, is a much smaller field than genomics. You can see this by looking at things like journal impact factors and h-indices. It does not have the same culture as genomics: relatively little data is shared openly, and there are typically no requirements from journals to make data available upon publication. Data are often very expensive and time-consuming to collect, particularly if you are running large cohorts and are monitoring things like personal exposure. There are no real standardized methods for data collection and many formats are proprietary. Statisticians in this area tend to be attached to larger groups who run studies or collect human health and exposure data. It’s relatively hard to be an independent statistician here because you need access to a collaborator who has relevant expertise, resources, and data. The lack of publicly available health data severely limits the participation of statisticians outside the biomedical research institutions where the data are primarily collected. I would argue that in environmental health, the integrative approach is more fruitful because (1) in order to do the work in the first place you already need to be working closely with the people collecting the health data; (2) there is a general lack of information sharing and standardization with respect to data collection; (3) if you develop a new tool, there is not a particularly large audience available to adopt it; (4) because studies are not unified by shared technologies, as in genomics, it’s often difficult to usefully generalize methodology from one study to the next. While I think it’s possible to develop general methodology for a certain type of study in this field, the impact is inherently limited due to the small size of the field.

In the end I think areas that are ripe for the platform approach to statistical research are those that are very open and have a culture of information sharing, have a large community of active methodologists, and have a lot of useful publicly available data. Areas that do not have these qualities might be better served by an integrative approach where statisticians work more directly with scientific collaborators and focus on the specific questions and problems of a given study.

The @fivethirtyeight effect - watching @walthickey gain Twitter followers in real time

Last night Nate Silver announced in a couple of tweets that he had hired Walt Hickey away from Business Insider to be an editor for the new http://www.fivethirtyeight.com/ website.

I knew about Walt because he syndicated one of my posts about Fox News Graphics on Business Insider. But he clearly wasn’t as well known as Nate S. who is probably the face of statistical analysis to most people in the world. So I figured the announcement might increase Walt’s following on Twitter.

After goofing around a bit with the Twitter API and the twitteR R package, I managed to start sampling the number of followers for Walt H. This started about an hour or so (I think) after the announcement was made. Here is a plot of Walt’s followers over about two hours.
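For the curious, here is a minimal sketch of how that kind of polling could be done with the twitteR package (the credentials and the one-minute interval are placeholders; this is an illustration, not the exact script used to make the plots below):

    library(twitteR)

    # Authenticate with the Twitter API (placeholder credentials)
    setup_twitter_oauth(consumer_key = "KEY", consumer_secret = "SECRET",
                        access_token = "TOKEN", access_secret = "TOKEN_SECRET")

    n_samples <- 120                          # roughly two hours at one sample per minute
    followers <- numeric(n_samples)
    times     <- rep(Sys.time(), n_samples)   # placeholder timestamps, overwritten below

    for (i in seq_len(n_samples)) {
      user         <- getUser("walthickey")
      followers[i] <- user$followersCount     # current follower count
      times[i]     <- Sys.time()
      Sys.sleep(60)                           # wait a minute between samples
    }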

[Plot: @walthickey’s follower count over roughly two hours]

Over the two hours he gained almost 1,000 followers! We can also take a look at the rate at which he was gaining followers.
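The rate can be approximated by differencing the sampled counts; a quick sketch, assuming the followers and times vectors from the polling loop above:

    # Followers gained per minute, from successive differences of the samples
    rate <- diff(followers) / as.numeric(diff(times), units = "mins")
    plot(times[-1], rate, type = "l",
         xlab = "Time", ylab = "New followers per minute")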

[Plot: rate of new followers per minute over the same two hours]

So he was gaining followers at around 10-15 per minute on average at 7:30 yesterday. It cooled off over those two hours, but he was still getting a few followers a minute. To put those numbers in perspective, our Twitter account, @simplystats, gets on average about 10 new followers a day.

So there you have it, the real time (albeit two hours too late) 538 bump in Twitter followers.

Platforms and Integration in Statistical Research (Part 1/2)

In the technology world today, one of the common topics of interest is the so-called “war” between Apple and Google (or Android). This war is ostensibly over dominance of the mobile phone industry, where Apple sells the most popular phone but Google/Android (as an operating system) controls over half the market. (Android phones themselves are manufactured by a variety of companies, and none of those companies sells more phones than Apple.)

Apple vs. Google (vs. Microsoft)

Apple’s model is to own the entire (or most of the) development of the phone. They build the hardware and the software and create the design. They also control the App Store for selling their own and third-party software, distribute music through their iTunes Store, and distribute e-books through their iBookstore. They even have their own proprietary messaging platform. This “walled-garden” approach is a hallmark of Apple and its famously controlling founder Steve Jobs. Rather than “walled garden”, I would call it more of an “integrative” approach, where Apple essentially has its fingers in all the relevant pies, controlling every aspect.

The Google/Android approach is more modular and involves controlling the platform on which pretty much every phone could theoretically be built. Until recently, Google did not build their own phones, but rather let other companies build the phones and use Google’s operating system as the software for the phone. The model here is similar to the Unix philosophy, which is to “do one thing well”. Google is really good at developing Android and handset manufacturers are really good at building phones. There’s no point in one company doing two things moderately well when you could have two companies each do one thing really well. Here, Google focuses on the platform, the Android operating system, and tries to spread it as far and wide as possible to cover the most possible phones, tablets, watches, whatever mobile device is relevant.

For us older people, the more relevant “war” is between Microsoft and everyone else. Microsoft built one of the most legendary platforms in computer history: the Windows operating system. For decades this platform was (and continues to be) the dominant operating system for personal desktop computers. Although Microsoft never really focused on building its own hardware, Microsoft’s impact on the PC world through its control of Windows is undeniable. Unfortunately, an asterisk must be put next to all of this history because we now know that much of this dominance was achieved through criminal activity.

Theory and Methods vs. Applications

There’s much debate in the technology world over which approach is better, the Apple integrative model or the Google/Microsoft modular platform model. I think this “debate” exists because it’s fun to argue about Apple vs. Google and it gives technology reporters something to write about. When the dust settles (if ever) I think the answer will be “it depends”.

In the statistical community I find there’s often an analogous debate that goes on regarding which is the more important form of statistical activity, theory/methods or applications. In a nutshell (perhaps even a cartoon nutshell) there’s a sense that theoretical or abstract methodological development has a greater impact because it is broadly generalizable to many different areas. Applications work is less impactful because it is focused on a specific area and any lessons learned that might be applicable to other areas would only be realized much later.

We could spend a lot of time debating the specific arguments here (and I have already spent that time!) but I think a better way to frame this debate is to use the analogy of Apple and Google, that is, between integrative statistical research and platform research. In particular, I think the “theory vs. applications” moniker is a bit outdated and does not cover many of the recent developments in the field of statistics.

Platforms in Statistics

When I was in graduate school and learning about being a statistician, it was pretty much hammered into my brain that the ultimate goal of a statistician is to build a platform. It was not described to me in those words, but that was the essential message. The basic idea was that you would develop a new method that was as general as possible so that it could be applied to a wide variety of fields, from agriculture to zoology. Ideally, you would demonstrate that this method was better than any other method through some sort of theorem.

The ultimate platform in statistics might be the t-test, or perhaps the p-value. Those two statistical methods are used in some form in almost any scientific context you could possibly imagine. I’d argue that the p-value is the Microsoft Windows of science. (Much like with Windows, you could argue this is for better or for worse.) Other essential platforms in statistics might be linear regression, generalized linear models, the bootstrap, the EM algorithm, etc. If you could be the developer of one of these platforms your impact would be tremendous because everyone in every discipline would use it. That’s why Ronald Fisher is arguably the most influential scientist ever.

I think the notion of a platform, rather than theory/methods, is a much more useful context here because it more accurately describes why these things are so important. Generalized linear models may be interesting because they represent an abstract concept of linear relationships, but they are useful because they are a platform on which a ton of other research in many other fields can be built. If we accept the idea that something is important because it serves as a platform on which many other things can be built, then I think this idea encompasses more than what might be published in the pages of the Journal of the American Statistical Association or the Annals of Statistics.

In particular, I think one of the greatest statistical platforms developed in the last 10 to 15 years is R. If you consider what R really is, yes, it’s a software package that does statistical calculations, but primarily it’s a platform on which an enormous community of people can build things. The Comprehensive R Archive Network is the “App Store” through which statisticians can develop and distribute their tools. R itself has been extended (through packages) and applied to almost every imaginable scientific discipline. Take one look at the Task Views section to get a sense of the diversity of areas to which R has been applied. Entire sub-projects (e.g., Bioconductor) have been developed around using R in specific fields of research. At this point the impact of R on both the sciences and statistics is as undeniable as the t-test.
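To make the platform point concrete, the same package machinery extends base R with tools from CRAN and from domain-specific repositories such as Bioconductor (the package names below are just examples):

    # Install a general-purpose package from CRAN
    install.packages("ggplot2")

    # Install a genomics package from the Bioconductor project
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("limma")

    library(ggplot2)   # grammar-of-graphics plotting
    library(limma)     # linear models for genomic data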

Integrative Statistical Research

Integrative research in statistics is something that I think harks back to a much earlier era in the history of statistics, the era in which the field of statistics didn’t really exist. Before the field really had solidified itself as a separate discipline, people “doing statistics” came from all areas of science as well as mathematics. Here, the statistician was involved in all aspects of research and not just walled-off in a separate area dreaming up abstract methods. Many methods were later abstracted and generalized, but this largely grew out of an initial need to solve a specific problem.

As the field matured and separate Departments of Statistics started to appear, the discipline moved more towards a platform approach by focusing more on abstraction and generalizable approaches. It’s easy to see why this move would occur. If you’re trying to distinguish your discipline as being separate from other disciplines (and therefore deserving of separate resources), you need to demonstrate a unique contribution that is separate from the other fields and, in a sense, wall yourself off a little from the others. Given that computers were not generally available at the time this move began, mathematics was the most useful and easily accessible tool for building new statistical platforms.

Today, I think the field of statistics is moving back towards the old model of integrating closer with scientists in other disciplines. In particular, we are seeing more and more people “invading” the field from other related areas like computer science, just like the old days. Personally, I think these “outsiders” should be welcomed under our tent as they bring unique insights to our field and provide a richness not otherwise obtainable.

With the integrative statistical research model we see more statisticians “embedded” in the sciences, in the thick of it, so to speak, with involvement in every aspect. They publish in discipline-specific journals and in some cases are flat-out leading large-scale scientific collaborations. The reasons for this are many, but I think they are centered around advances in computer technology that have allowed for the rapid collection of large and complex datasets and the easy implementation of sophisticated models. The heterogeneity and unique complexity of these different datasets have required statisticians to dig deeper into the field and learn more of the substantive details before a useful contribution can be made. This accumulation of deep knowledge of a field comes at the expense of being able to work in many different fields at once, or as John Tukey said, to “play in everyone’s backyard”.

The integrative approach to statistical research is exciting because it allows the statistician to have a direct impact on a scientific discipline rather than a more indirect one through developing platforms. However, the approach is resource intensive in that it requires an interdisciplinary research environment with good collaborators in the relevant disciplines. As such, it may only be possible to take the integrative approach in certain institutions and environments. I think a similar argument could be made with respect to conducting platform research, but there are many cases where that was not strictly necessary.

In the next post, I’ll talk a bit (and give examples) about where I think the platform and integrative approaches may be more or less fruitful.

Teaching least squares to a 5th grader by calibrating a programmable robot

The Lego Mindstorm kit provides software and hardware to create programmable robots. A very simple first task is figuring out how to make the robot move any given distance. You get to program the number of wheel rotations. The video below shows how one can use this to motivate and teach least squares. We assumed the formula was distance = K × rotations, collected data for 1, 2, …, 10 rotations, then used R to motivate (via plots) and calculate the least squares estimate of K.
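Here is a minimal R sketch of the calibration described above (the measured distances are made-up placeholders, not the data collected in the video):

    rotations <- 1:10
    distance  <- c(17.5, 35.2, 52.8, 70.1, 88.0,
                   105.4, 123.1, 140.6, 158.2, 175.9)   # cm, hypothetical measurements

    # Plot the data to motivate a straight line through the origin
    plot(rotations, distance, xlab = "Wheel rotations", ylab = "Distance traveled (cm)")

    # Least squares estimate of K in the model distance = K * rotations
    # (regression through the origin, so no intercept term)
    fit <- lm(distance ~ rotations - 1)
    K   <- coef(fit)[["rotations"]]
    abline(a = 0, b = K)
    K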

Not shown in the video is my explanation of how we could also use the formula circumference = π × diameter to figure out K, and a discussion of which approach is better. The next project will be to calibrate turns, which are achieved by rotating the wheels in opposite directions. This time I will use both the geometric approach (compute the wheel circumference and the circumference defined by robot turns) and the statistical approach.
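And a quick sketch of how the geometric and least squares estimates of K could be compared (the wheel diameter below is a placeholder; in practice it would be measured directly on the robot):

    diameter    <- 5.6              # cm, hypothetical wheel diameter
    K_geometric <- pi * diameter    # distance per rotation = wheel circumference
    c(least_squares = K, geometric = K_geometric)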