Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Feeling optimistic after the Future of the Statistical Sciences Workshop

Last week I participated in the Future of the Statistical Sciences Workshop. I arrived feeling somewhat pessimistic about the future of our discipline. My pessimism stemmed from the emergence of the term Data Science and the small role academic (bio)statistics departments are playing in the excitement and initiatives surrounding it. Data Science centers/departments/initiatives are popping up in universities without much interaction with (bio)statistics departments. Funding agencies, interested in supporting Data Science, are not always including academic statisticians in the decision-making process.

About 100 participants, including many of our discipline’s leaders, attended the workshop. It was organized into sessions with about a dozen talks; some were about the future, others featured collaborations between statisticians and subject matter experts. The collaborative talks provided great examples of the best our field has to offer, and the rest generated provocative discussions. In most of these discussions the disconnect between our discipline and Data Science was raised as a cause for concern.

Some participants thought Data Science is just another fad, like Data Mining was 10-20 years ago. I disagree: I view the recent increase in the number of fields that have suddenly become data-driven as a historical discontinuity. (We first posted about statistics versus data science back in 2011.)

At the workshop, Mike Jordan explained that the term was coined by industry for practical reasons: emerging companies needed a workforce that could solve problems with data, and statisticians were not fitting the bill. However, at the workshop there was consensus that our discipline needs a jolt to meet these new challenges. The takeaway messages were all in line with ideas we have been promoting here on Simply Statistics (here is a good summary post from Jeff):

  1. We need to engage in real present-day problems (problem first, not solution backward)

  2. Computing should be a big part of our PhD curriculum (here are some suggestions)

  3. We need to deliver solutions (and stop whining about not being listened to); be more like engineers than mathematicians. (Here is a related post by Roger; in statistical genomics this has been the de facto rule for a while.)

  4. We need to improve our communication skills (in talks or on Twitter)

The fact that there was consensus on these four points gave me reason to feel optimistic about our future.

What should statistics do about massive open online courses?

Marie Davidian, the President of the American Statistical Association, writes about the JHU Biostatistics effort to deliver massive open online courses. She interviewed Jeff, Brian Caffo, and me and summarized our thoughts.

All acknowledge that the future is unknown. How MOOCs will affect degree programs remains to be seen. Roger notes that the MOOCs he, Jeff, Brian, and others offer seem to attract many students who would likely not enter a degree program at Hopkins, regardless, so may be filling a niche that will not result in increased degree enrollments. But Brian notes that their MOOC involvement has brought extensive exposure to the Hopkins Department of Biostatistics—for many people the world over, Hopkins biostatistics is statistics.

What's the future of inference?

Rob Gould reports on what appears to have been an interesting panel discussion on the future of statistics hosted by the UCLA Statistics Department. The panelists were Songchun Zhu (UCLA Statistics), Susan Paddock (RAND Corp.), and Jan de Leeuw (UCLA Statistics).

He describes Jan’s thoughts on the future of inference in the field of statistics:

Jan said that inference as an activity belongs in the substantive field that raised the problem.  Statisticians should not do inference.  Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.

I found this comment to be thought-provoking. First of all, it sounds exactly like something Jan would say, which makes me smile. In principle, I agree with the premise. In order to make a reasonable (or intelligible) inference you have to have some knowledge of the substantive field. I don’t think that’s too controversial. However, I think it’s incredibly short-sighted to conclude therefore that statisticians should not be engaged in inference. To me, it seems more logical that statisticians should go learn some science. After all, we keep telling the scientists to learn some statistics.

In my experience, it’s not so easy to draw a clean line between the person analyzing the data and the person drawing the inferences. It’s generally not possible to say to someone, “Hey, I just analyze the data, I don’t care about your science.” For starters, that tends to make for bad collaborations. But more importantly, that kind of attitude assumes that you can effectively analyze the data without any substantive knowledge. That you can just “crunch the numbers” and produce a useful product.

Ultimately, I can see why statisticians would want to stay away from the inference business. That part is hard, it’s controversial, it involves messy details about sampling, and it opens one up to criticism. And statisticians love to criticize other people. Why would anyone want to get mixed up with that? This is why machine learning is so attractive: it’s all about prediction and in-sample learning.

However, I think I agree with Daniela Witten, who, at our recent Unconference, said that the future of statistics is inference. That’s where statisticians are going to earn their money.
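To make the prediction/inference distinction concrete, here is a minimal sketch on simulated data (an illustration only; the numbers and variable names are made up, not anything from the panel). The prediction view asks how well a fitted model does on held-out observations; the inference view asks what we can say, with quantified uncertainty, about a parameter of scientific interest.

```python
import numpy as np

# Simulate a simple linear relationship (purely illustrative).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
beta_true = 1.5
y = 2.0 + beta_true * x + rng.normal(scale=1.0, size=n)

# Hold out the last 50 observations for prediction.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Fit ordinary least squares: y = b0 + b1 * x.
X_train = np.column_stack([np.ones_like(x_train), x_train])
b0, b1 = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

# Prediction view: how well do we predict new observations?
y_hat = b0 + b1 * x_test
print("test RMSE:", np.sqrt(np.mean((y_test - y_hat) ** 2)))

# Inference view: what did we learn about the slope, and how sure are we?
resid = y_train - X_train @ np.array([b0, b1])
sigma2 = resid @ resid / (len(y_train) - 2)
se_b1 = np.sqrt(sigma2 / np.sum((x_train - x_train.mean()) ** 2))
print("slope estimate:", b1,
      "approx. 95% CI:", (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1))
```

The point of the contrast: the first print statement would look the same no matter what generated the data, while interpreting the second one as a statement about the world is exactly the step that requires substantive knowledge.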

The Leek group guide to sharing data with a data analyst to speed collaboration

My group collaborates with many different scientists and the number one determinant of how fast we can turn around results is the status of the data we receive from our collaborators. If the data are well organized and all the important documentation is there, it dramatically speeds up the analysis time.

A postdoc recently requested help with an analysis and provided an amazing summary of the data she wanted analyzed. It made me want to prioritize her analysis in my queue, and it inspired me to write a how-to guide that will help scientific/business collaborators get speedier results from their statistician colleagues.

Here is the Leek group guide to sharing data with statisticians/data analysts.

As usual I put it on GitHub because I’m sure this first draft will have mistakes or less-than-perfect ideas. I would love help making the guide more comprehensive and useful. If you issue a pull request, make sure you add yourself to the list of contributors at the end.
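To give a sense of what "well organized" looks like in practice, here is a minimal sketch of the kind of sanity check an analyst might run on shared data before starting. The file names and the code book column name are hypothetical, not anything prescribed by the guide; the idea is simply that every column in the tidy data set should be documented, and missing values should be coded consistently.

```python
import pandas as pd

# Hypothetical file names for a shared data set and its documentation.
data = pd.read_csv("tidy_data.csv")       # one row per observation, one column per variable
codebook = pd.read_csv("code_book.csv")   # assumed columns: variable, description, units

# Every column in the data should appear in the code book, and vice versa.
undocumented = set(data.columns) - set(codebook["variable"])
unused = set(codebook["variable"]) - set(data.columns)
print("columns missing from code book:", sorted(undocumented))
print("code book entries with no matching column:", sorted(unused))

# Missing values should show up as NA, not as blanks or sentinel codes like -999.
print(data.isna().sum())
```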

Original source code for Apple II DOS

Someone needs to put this on GitHub right now.

Thanks Paul Laughton for your donation of this superb collection of early to mid-1978 documents including the letters, agreements, specifications (including hand-written code and schematics), and two original source code listings for the creation of the Apple II “DOS” (Disk Operating System). This was, of course, Apple’s first operating system, written not by Steve Wozniak (“Woz”) but by an external contractor (Paul Laughton working for Shepardson Microsystems). Woz lacked the skills to write an OS (as did anyone then at Apple). Paul authored the actual Apple II DOS through its release in the fall of 1978.

Update: At this point I see some GitHub stub accounts, but no real code (yet).