Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The Mozilla Fellowship for Science

This looks like an interesting opportunity for grad students, postdocs, and early career researchers:

We’re looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.

Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing. They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.

Here’s what you get:

Fellows will receive:

  • A stipend of $60,000 USD, paid in 10 monthly installments.
  • One-time health insurance supplement for Fellows and their families, ranging from $3,500 for single Fellows to $7,000 for a couple with two or more children.
  • One-time childcare allotment for families with children of up to $6,000.
  • Allowance of up to $3,000 towards the purchase of a laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.
  • All approved fellowship trips – domestic and international – are covered in full.

Deadline is August 14.

JHU, UMD researchers are getting a really big Big Data center

From Technical.ly Baltimore:

A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The $30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.

The web site has a pretty cool time-lapse video of the construction of the computing center. There’s also a bit more detail at the JHU Hub site.

The Massive Future of Statistics Education

NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. 

Data are eating the world, but our collective ability to analyze data is going on a starvation diet.

Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example, this piece has 2,072 words.) You can’t avoid data; they’re coming from all directions.

So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we are interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone.

In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college.

Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight.

Limited Capacity

The McKinsey Global Institute, in a highly cited report, predicted a shortage of “data geeks”: by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, an estimated 1.5 million people in managerial positions will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, these positions will likely be filled by people regardless of whether they are properly trained. The consequences could be disastrous, as untrained analysts interpret complex big data coming from myriad sources of varying quality.

Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the 10 largest statistics master’s degree programs in the U.S. graduated a total of 730 people. At this rate, we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the training needed to analyze those data is essentially nonexistent.

On top of all this, I believe the McKinsey report grossly underestimates how many people will need to be trained in some data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for everyone to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even a bachelor’s degree in statistics. We need an alternative approach that is both high-quality and scalable to a large population over a short period of time.

Enter the MOOCs

In April of 2014, Jeff Leek, Brian Caffo, and I launched the Johns Hopkins Data Science Specialization on the Coursera platform. This is a sequence of nine courses intended to provide “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of nine courses follows what we believe is the essential “data science process”:

  1. Formulating a question that can be answered with data
  2. Assembling, cleaning, and tidying data relevant to the question
  3. Exploring the data, checking and eliminating hypotheses
  4. Developing a statistical model
  5. Making statistical inference
  6. Communicating findings
  7. Making the work reproducible

We took these basic steps and designed courses around each one of them.
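
To make the process concrete, here is a minimal end-to-end sketch in R. The question, the data set (R’s built-in airquality), and the model are illustrative choices for this post, not material taken from the courses:

    ## 1. Question: is ozone concentration associated with temperature?

    ## 2. Assemble and tidy: load the data and keep complete cases
    data(airquality)
    aq <- airquality[complete.cases(airquality[, c("Ozone", "Temp")]), ]

    ## 3. Explore: a quick plot to check for an obvious relationship
    plot(aq$Temp, aq$Ozone, xlab = "Temperature (F)", ylab = "Ozone (ppb)")

    ## 4. Model: a simple linear regression of ozone on temperature
    fit <- lm(Ozone ~ Temp, data = aq)

    ## 5. Inference: estimates, standard errors, confidence intervals
    summary(fit)
    confint(fit)

    ## 6. Communicate: report the key estimate in plain language
    cat("Estimated change in ozone per degree F:",
        round(coef(fit)["Temp"], 2), "\n")

    ## 7. Reproduce: the whole analysis is one script, so rerunning it
    ## regenerates every number and figure above

Real analyses are rarely this linear, but each course in the sequence expands one of these steps in much greater depth.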

Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month.

We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is what we do every single day. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work.

The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist.

For those who complete the nine courses, there is a capstone project at the end that involves taking all of the skills in the sequence and developing a data product. For our first capstone project we partnered with SwiftKey, a predictive text analytics company, to develop a project in which learners had to build a statistical model for predicting the next word in a sentence. The project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs between efficiency and accuracy, and creating a Shiny app to show the model off to the public.
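
To give a flavor of the modeling problem, here is a deliberately tiny sketch of the core idea: a bigram model that predicts the most likely next word from counts of adjacent word pairs. The three-sentence corpus and the predict_next helper are invented for this post; the actual capstone used a large SwiftKey text corpus, where the efficiency and accuracy tradeoffs become the hard part:

    ## A toy next-word predictor based on bigram counts (illustrative
    ## only; the real capstone corpus was far larger and messier)
    corpus <- c("the cat sat on the mat",
                "the cat ate the fish",
                "the dog sat on the rug")

    ## Process the raw text into an analyzable form: lowercase and
    ## tokenize each sentence, then count adjacent word pairs
    bigrams <- unlist(lapply(strsplit(tolower(corpus), "\\s+"),
                             function(w) paste(head(w, -1), tail(w, -1))))
    counts <- table(bigrams)

    ## Return the word most often observed right after `word`
    predict_next <- function(word) {
      follows <- counts[grep(paste0("^", word, " "), names(counts))]
      if (length(follows) == 0) return(NA_character_)
      sub(paste0("^", word, " "), "", names(which.max(follows)))
    }

    predict_next("the")  # "cat" -- it follows "the" twice in this corpus

A usable predictor also needs smoothing for unseen word pairs and longer contexts than a single previous word, which is exactly where those efficiency and accuracy tradeoffs show up.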

Degree Alternatives

The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis.

All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This gives outsiders an opportunity to evaluate the program directly rather than having to rely on the sterling reputation of the institution teaching the courses.

Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile showing that they completed the course. This can often be as valuable as a degree or other certification, because recruiters scouring LinkedIn for data scientist candidates can see our completers’ certifications in various data science courses.

Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, approximately 700,000 people had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, is still useful for forging connections, especially when people are searching for jobs.

Early Numbers

So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month in its first year of existence. The overall course completion rate was about 6%, while the completion rate amongst those in the “Signature Track” (i.e., paid enrollees) was 67%. In October of 2014, barely seven months after the specialization launched, 663 learners enrolled in the capstone project.

Some Early Lessons

From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned.

Data Science as Art and Science. Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower.

When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the grading rubrics was not trivial. The reason is that our understanding of what makes a “good” analysis different from a bad one is not well articulated. We could not identify any community-wide understanding of the components of a good analysis. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach?

Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was or how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his 1991 talk at the National Academy of Sciences, we had a process that we regularly espoused but barely understood.

Content vs. Curation. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in no way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing?

Searching on YouTube requires that you know what you are looking for, which is a problem for people who are just getting into an area. Effectively, what we provided was a curation of all the knowledge that’s out there on the topic of data science (with our own quirky spin added). Curation is hard, because you need to make definitive choices about what is and is not a core element of a field. But for the uninitiated, curation is essential to learning a field.

Skill sets vs. Certification. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner that let learners quickly see for themselves the value they were getting out of it. This led us to take a portfolio approach in which learners produce work that can be viewed publicly.

In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the courses than in an abstract (but official) certification such as might be obtained from a degree program. This is not unlike attending a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances, because it gives employers a concrete sense of what a person is capable of doing.

Conclusions

As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work.

Looks like this R thing might be for real

Not sure how I missed this, but the Linux Foundation just announced the R Consortium for supporting the “world’s most popular language for analytics and data science and support the rapid growth of the R user community”. From the Linux Foundation:

The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.

Founding companies and organizations of the R Consortium include The R Foundation; Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.

How Airbnb built a data science team

From Venturebeat:

Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.

But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.