
Platforms and Integration in Statistical Research (Part 2/2)

In my last post, I talked about two general approaches to conducting statistical research: platforms and integration. In this follow-up I thought I’d describe the characteristics of a field that suggest taking one approach over the other.

I think in practice, most statisticians will dedicate some time to both the platform and integrative approaches to doing statistical research because different approaches work better in different situations. The question then is not “Which approach is better?” but rather “What characteristics of a field suggest one should take a platform or an integrative approach in order to have the greatest impact?” I think one way to answer this question is to make an analogy with transaction costs a la the theory of the firm. (This kind of analogy also plays a role in determining who it’s best to collaborate with, but that’s a different post.)

In the context of an academic community, I think that if it’s easy to exchange information (about data, for example), then building platforms that are widely used makes sense. For example, if everyone uses a standardized technology for collecting a certain kind of data, then it’s easy to develop a platform that applies some method to those data. Regression methodology works in any field that can organize its data into a rectangular table. On the other hand, if information exchange is limited, then building platforms is more difficult and closer collaboration with individual investigators may be required. For example, if there is no standard data collection method, or if everyone uses a different proprietary format, then it’s difficult to build a platform that generalizes to many different areas.
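To make the rectangular-table point concrete, here is a minimal sketch in Python of why a regression “platform” generalizes: the same fitting code runs unchanged whether the rows describe air monitors or genes, so long as the data arrive as a table. The function name (fit_ols) and both toy datasets are made up for illustration; they do not come from any real study.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares on any rectangular design matrix X.

    The routine knows nothing about the scientific domain: rows are
    observations, columns are predictors, whatever those happen to be.
    """
    X1 = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Hypothetical data from two unrelated fields, both rectangular:
# an environmental study (temperature, PM2.5 -> hospital admissions) ...
env_X = np.array([[20.1, 12.3], [25.4, 30.1], [18.9, 8.7], [30.2, 45.0]])
env_y = np.array([5.0, 9.0, 4.0, 14.0])

# ... and a genomics study (two expression levels -> phenotype score).
gen_X = np.array([[1.2, 0.3], [0.8, 1.1], [2.5, 0.9], [1.9, 2.2]])
gen_y = np.array([0.9, 1.4, 2.1, 3.0])

# The identical "platform" code applies to both tables.
print(fit_ols(env_X, env_y))
print(fit_ols(gen_X, gen_y))
```

The point of the sketch is that nothing in fit_ols needed to change between the two fields; that domain-independence is what makes a platform possible once data collection is standardized.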

There are two case studies with which I am somewhat familiar that I think are useful for demonstrating these characteristics.

  • Genomics. I think genomics is an area where you can see statisticians definitely taking both approaches. However, I’m struck by the intense focus on the development of methods and data analysis pipelines, particularly in order to adapt to the ever-changing ‘omics technologies that are being developed. Part of the reason is that for a given type of data, there are relatively few companies developing the technology for collecting the data. Here, it is possible to develop a method or pipeline to deal with a new kind of data generated by a new technology in the early stages, when those data are first being produced. If your method works well relative to others, then it’s possible for your method to become essentially the standard approach that everyone uses for that technology. So there’s a pretty big incentive to be the person who develops a platform for a data collection technology on which everyone builds their research. It helps if you can get early access to new technologies so you can get a peek at the data before everyone else and a head start on developing the methods. Another aspect of genomics is that the field is quite open relative to others, in that there is quite a bit of information sharing. With the enormous amount of publicly available data out there, there’s a very large population of potential users of your method and software. People who don’t collect their own primary data can still take your method and apply it to data that’s already out there. Therefore, I think from a statistician’s point of view, genomics is a field that presents many opportunities to build platforms that will be used by many people addressing many different types of questions.
  • Environmental Health. The area of environmental health, where I generally operate, is a much smaller field than genomics. You can see this by looking at things like journal impact factors and h-indices. It does not have genomics’ culture of openness: relatively little data is shared, and journals typically do not require that data be made available upon publication. Data are often very expensive and time-consuming to collect, particularly if you are running large cohorts and are monitoring things like personal exposure. There are no real standardized methods for data collection, and many formats are proprietary. Statisticians in this area tend to be attached to larger groups who run studies or collect human health and exposure data. It’s relatively hard to be an independent statistician here because you need access to a collaborator who has relevant expertise, resources, and data. The lack of publicly available health data severely limits the participation of statisticians outside the biomedical research institutions where the data are primarily collected. I would argue that in environmental health, the integrative approach is more fruitful because (1) in order to do the work in the first place you already need to be working closely with the people collecting the health data; (2) there is a general lack of information sharing and standardization with respect to data collection; (3) if you develop a new tool, there is not a particularly large audience available to adopt it; and (4) because studies are not unified by shared technologies, as in genomics, it’s often difficult to usefully generalize methodology from one study to the next. While I think it’s possible to develop general methodology for a certain type of study in this field, the impact is inherently limited by the small size of the field.

In the end, I think the areas ripe for the platform approach to statistical research are those that are very open, have a culture of information sharing, have a large community of active methodologists, and have a lot of useful publicly available data. Areas that lack these qualities might be better served by an integrative approach, where statisticians work more directly with scientific collaborators and focus on the specific questions and problems of a given study.